DSpace at EWHA: Analysis of inconsistent mutation calls for cell lines in NGS databases

Title: Analysis of inconsistent mutation calls for cell lines in NGS databases

Authors: 송유라

Issue Date: 2018

Department/Major: Graduate School, Department of Medical Science

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 김형래

Abstract: Errors in next-generation sequencing (NGS) databases have received much attention following reports of inconsistent mutation calls between cell-line databases. To elucidate the reasons for this inconsistency, we analyzed the mutation calls for 592 cell lines and 897 cancer-driver genes in two databases, Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE). Annotation discrepancies, in which the same variant was named differently, were found in 7.2% of mutation calls. Even after these discrepancies were corrected, substantial additional discrepancies (about 33-42%) remained. Most of the mutation calls (98.8%, 162/164) were consistent between our two targeted sequencing runs for 8 cell lines; 7-8% of the mutation calls in CCLE and 11-13% of those in GDSC were not found by targeted sequencing, suggesting that these calls are false positives. Contrary to the generally held notion, however, most (85-86%) of the inconsistent calls might be true mutations, which suggests that the inconsistency between the two databases could be related to false-negative mutations (14% in GDSC, 20% in CCLE). Further analysis of allele frequencies in the targeted sequencing data revealed consistent mutant allelic loss (2%, 4/155) and inconsistent allelic loss (4%, 7/155), both of which can be major sources of inconsistency between the two databases. In conclusion, the mutation databases GDSC and CCLE might contain 7-13% false-positive mutation calls originating from polymerase proofreading errors. Important reasons for inconsistency and false-negative mutations might be uneven amplification and genetic drift. To resolve the problem of inconsistency in mutation databases constructed by NGS, allelic frequency data for all mutations are essential.

Abstract (translated from the Korean): Reports that the mutations recorded for the same cell lines do not agree across databases have raised concerns about the errors that databases built with next-generation sequencing may contain. To find the causes of this inconsistency, we analyzed the mutations in 897 genes across the 592 cell lines that were analyzed in common by two databases, GDSC and CCLE. In comparing the mutations in the two databases, we found that 7.2% of all mutations were cases in which the same variant had been named differently. Even after these naming inconsistencies were corrected, about 33-42% of all mutations still disagreed between the two databases. To analyze the cause of this disagreement, targeted sequencing was performed twice on 8 cell lines. Of the mutations reported in the two databases, 98.8% (162/164) gave consistent results across the two targeted sequencing runs, but 7-8% of the mutations found in CCLE and 11-13% of those found in GDSC were not detected by targeted sequencing. Mutations found only in CCLE or GDSC and not by targeted sequencing are likely to be false positives. However, contrary to the common assumption that mutations inconsistent between the two databases may all be false, the results suggest that 85-86% of the mutations that disagreed between the two databases are real. Such mutations, inconsistent between the databases but detected by targeted sequencing, accounted for 14% of the GDSC mutations and 20% of the CCLE mutations. This suggests that the inconsistency between the two databases is related to false negatives in the databases. To investigate further, we analyzed allele frequencies from the targeted sequencing results. Among the mutations common to both databases, the mutant allele was not detected at all by targeted sequencing in 2% (4/155) of cases, and inconsistent mutations appearing in only one database were not found by targeted sequencing in 4% (7/155) of cases. These results suggest that inconsistency between databases is related to failure to detect mutations by next-generation sequencing. Taken together, the comparison of the two databases indicates that databases such as GDSC and CCLE contain 7-13% false-positive mutations among all mutations, arising from polymerase proofreading errors. In addition, uneven amplification and genetic drift can be important causes of mutation inconsistency and false-negative mutations. These findings may provide an important key to understanding the causes of inconsistent mutation calls in NGS-based databases and to resolving them.
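
The database comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis pipeline: the `Call` record, the `compare_calls` function, and the idea of matching calls on an exact (cell line, gene, variant) key are all assumptions made for illustration, and the toy entries below are placeholders rather than real GDSC or CCLE data. The sketch only shows how shared and database-specific calls, and the fraction of each database's calls that go unmatched, could be tallied once annotations have been harmonized.

```python
from typing import Dict, NamedTuple, Set


class Call(NamedTuple):
    """One mutation call (hypothetical record layout, not the thesis format)."""
    cell_line: str
    gene: str
    variant: str  # harmonized variant label, e.g. a protein-level change


def compare_calls(db_a: Set[Call], db_b: Set[Call]) -> Dict[str, float]:
    """Tally shared and database-specific calls using exact-key matching."""
    shared = db_a & db_b   # calls present in both databases
    only_a = db_a - db_b   # calls reported only by database A
    only_b = db_b - db_a   # calls reported only by database B
    return {
        "shared": float(len(shared)),
        "only_a": float(len(only_a)),
        "only_b": float(len(only_b)),
        # Fraction of each database's calls with no match in the other one.
        "frac_a_unmatched": len(only_a) / len(db_a) if db_a else 0.0,
        "frac_b_unmatched": len(only_b) / len(db_b) if db_b else 0.0,
    }


if __name__ == "__main__":
    # Toy placeholder calls; names are invented for the example.
    toy_a = {
        Call("LINE1", "GENE_A", "var1"),
        Call("LINE1", "GENE_B", "var2"),
        Call("LINE2", "GENE_A", "var3"),
    }
    toy_b = {
        Call("LINE1", "GENE_A", "var1"),
        Call("LINE2", "GENE_C", "var4"),
    }
    for key, value in compare_calls(toy_a, toy_b).items():
        print(f"{key}: {value:.2f}")
```

Exact-key matching is the simplest possible choice and is sensitive to the annotation discrepancies the abstract reports, so any real comparison would need the variant labels normalized first. A call that appears in only one database is merely unmatched; as the abstract notes, deciding whether it is a false positive or reflects a false negative in the other database requires independent evidence such as targeted sequencing and allele frequencies.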