DSpace at EWHA: Prediction of Fusion Genes from Sequencing Data

Title: Prediction of Fusion Genes from Sequencing Data

Authors: 김보라

Issue Date: 2013

Department/Major: Graduate School, Division of Life and Pharmaceutical Sciences, Life Science Major

Publisher: Ewha Womans University Graduate School

Degree: Doctor

Advisors: 이상혁, 김완규

Abstract: Cancer is a complex disease. Environmental interactions as well as genetic factors play critical roles in tumor initiation, development, and progression. Immune responses and genetic instability are among the key molecular elements of tumor development. For example, malignant cells with genetic instability require multiple genetic hits to become metastatic. Thus, biomarkers based only on mutations and structural alterations often fail to identify malignant cells. Understanding the entire molecular mechanism underlying oncogenesis would be a better approach, but this usually requires a team-based, systematic effort involving genomic scientists, molecular imaging and laboratory medicine experts, engineers, epidemiologists, clinicians, industrial partners, and patients. Since the introduction of microarray techniques, genome-wide approaches have made biomarker discovery more efficient, and the resulting mechanistic insights provide a foundation for translational research toward clinical applications. Numerous reports have been published during the last decade. Chinnaiyan’s group has pioneered the discovery of novel biomarkers based on integrative analysis of molecular profiles in various types of cancer. From a co-outlier analysis of gene expression, they found novel gene fusion events between the male hormone-related gene TMPRSS2 and transcription factors of the erythroblast transformation-specific (ETS) family in prostate cancer. In 2007, Soda et al. identified a transforming gene fusion between the echinoderm microtubule-associated protein-like 4 (EML4) gene and the anaplastic lymphoma kinase (ALK) gene in lung cancer, using a molecular screening method based on isolating transformed cells followed by sequencing.
These results changed the generally accepted idea that blood cancers are caused by translocations, such as the BCR-ABL1 fusion gene in chronic myelogenous leukemia, whereas solid tumors are caused by mutations in oncogenes or tumor suppressor genes (88). Chromosomal aberrations, which comprise deletions, inversions, duplications, and translocations, have a bigger impact than point mutations. Deep sequencing based on next-generation sequencing (NGS) technology provides a powerful tool for identifying biomarkers of these structural variations (SVs). In principle, whole genome sequencing would be preferable for identifying any type of SV, but the sequencing cost and the complicated downstream analysis needed to remove false positives are often major difficulties. Transcriptome sequencing, on the other hand, is an attractive alternative because it provides not only digital gene expression but also SVs in the coding regions of genes, which would have direct functional roles. Many studies have tried to detect fusion genes arising from translocation with RNA-sequencing data. Fusion gene prediction generally follows three steps: (i) mapping and filtering, (ii) fusion junction detection, and (iii) fusion gene assembly and selection. The results can differ vastly depending on the choice of alignment programs and parameters, the filtering conditions, and the priority assignment used to select final candidates. Several hundred candidates are typical, and choosing adequate empirical parameters is often critical for reducing the numerous false positives. In this thesis, I report algorithms for predicting fusion genes from transcriptome sequencing data produced by Sanger sequencing as well as NGS technology. The thesis is composed of three chapters. Each chapter describes a method of predicting fusion transcripts from a different type of transcriptome sequencing, reflecting the advances in sequencing technology shown in Figure I.3.
Chapter II deals with the mRNA and EST sequencing data obtained by Sanger sequencing. These sequences are relatively long and error-free. We constructed a database of fusion genes, ChimerDB 2.0, which provides supporting evidence for the identified fusion transcripts from publicly available NGS data. We then concentrated on finding fusion genes in lung cancer from transcriptome sequencing data produced with NGS technology. Chapter III describes a method to identify fusion transcripts from 454 FLX sequencing data, the first generation of NGS methods. The read length is somewhat shorter than in Sanger sequencing, and the depth is much shallower than that of current second-generation machines. We identified and experimentally confirmed a novel gene fusion, ALK-PTPN3, in a non-small cell lung cancer cell line. Chapter IV describes an algorithm for second-generation sequencing data from Illumina Solexa or HiSeq machines. These reads are the shortest, but the high throughput allows genome-wide screening of fusion transcripts. Computational analysis is more challenging, however, because of the short read length and high depth. Removing false positives is key to selecting reliable candidates for experimental validation. Our program, FusionScan, consists of a pre-processing step and a main computational procedure that implements four elaborate filtering schemes. Compared with several other publicly available programs, it achieved the best performance, with high sensitivity and specificity. Using FusionScan, we were able to identify many novel fusion genes from public sequencing data and from our own sequencing data. A significant portion of our predictions was experimentally validated by Sanger sequencing. Several of those fusions were recurrent in lung cancer patients.
Abstract (translated from Korean): Cancer is a very complex disease. Malignant tumors arise through repeated responses to the cellular environment over a long period, and when some of them metastasize to other tissues, they interact with the immune responses inherent to that site and with the surrounding environment. Malignant tumors are genetically unstable, and several additional genetic factors are required for them to metastasize.
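Research that focuses only on mutations in gene sequence and structure within malignant cells is therefore limited, and targets obtained from such research often fail. To overcome this, we need an overall understanding of the process of oncogenesis. Team-based research must bring together genomic scientists, molecular imaging and laboratory medicine experts, engineers, epidemiologists, clinicians, industrial partners, and patients. Such efforts have grown along with advances in next-generation sequencing technology and have created the new fields of ‘translational genomics’ and ‘translational bioinformatics’ (101). Many papers based on these experiences have been reported. Chinnaiyan’s group analyzed molecular profiles to find novel diagnostic biomarkers in cancer. By searching gene expression profiles for genes that are simultaneously selected as outliers, they discovered novel fusion genes between the male hormone-related gene TMPRSS2 and the erythroblast transformation-specific (ETS) family of transcription factors. Similarly, a translocation between the EML4 (echinoderm microtubule-associated protein-like 4) gene and the ALK (anaplastic lymphoma kinase) gene was found in non-small cell lung cancer patients. These results overturned the conventional view that blood cancers are caused by gene translocations, as with the BCR-ABL1 fusion gene seen in acute leukemia, whereas solid tumors are caused by mutations in growth factors or tumor suppressor genes (102). Chromosomal aberrations, which comprise deletions, inversions, duplications, and translocations, have a greater impact than point mutations. These alterations can be found by looking for genes with large copy number variations or by studying regions that show structural variations in whole genome sequencing data. Transcriptome sequencing data are cost-efficient, provide digital gene expression information that allows accurate quantification, and are easy to analyze. Many studies have discovered fusion genes caused by translocation using transcriptome sequencing data. In general, fusion gene prediction follows three steps: (i) sequence mapping and filtering, (ii) fusion junction detection, and (iii) fusion gene assembly and selection. The prediction results can vary with the choice of program used for mapping to the genome and its command options, the filtering conditions, the method of constructing fusion junction sequences, and the way priorities are assigned when selecting final candidates. Empirical conditions that lead to the correct answer at each step are therefore very important. This thesis reports the development of algorithms for predicting fusion genes from large numbers of transcript sequences obtained with methods ranging from Sanger sequencing to next-generation sequencing. The thesis is divided into three parts, all of which deal with methods for predicting transcripts expressed from fusion genes. As shown in Figure I.3, however, the type of sequence differs according to the history of sequencing technology. Chapter II uses data from Sanger sequencing, a relatively long-read and stable method. These data are the mRNA and EST transcripts from NCBI. Fusion transcripts were extracted from them and combined with collected knowledge on fusion genes to build ChimerDB 2.0. Chapter III describes the analysis of lung cancer transcriptome data obtained with 454 FLX next-generation sequencing, which led to the first experimental confirmation. A novel ALK-PTPN3 fusion gene was confirmed in a representative non-small cell lung cancer cell line in which EML4-ALK had been found.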
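After this, I studied a method that finds fusion genes in large-scale transcriptome data more efficiently while still recovering true positives well. The final chapter, Chapter IV, describes finding fusion genes by analyzing Illumina Solexa sequencing data, currently the most widely used type. The FusionScan program was built with a focus on reducing the number of false positives while finding fusion genes that actually exist in transcriptome sequencing data. FusionScan consists of a pre-processing step and a main procedure made up of four filtering stages. The program has high sensitivity and accuracy. Using it, we found novel fusion genes in public sequencing data and in our own sequencing data and were also able to confirm them experimentally. I hope that applying FusionScan to sequencing data from more cancer tissues will uncover meaningful fusion genes and yield good drug targets.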
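The three-step prediction procedure named in the abstract (mapping and filtering, fusion junction detection, candidate selection) can be illustrated with a toy sketch. This is not FusionScan or any method from the thesis. It assumes exact substring matching in place of a real aligner, and the gene names, sequences, function names, and the `min_anchor` and `min_support` thresholds are all hypothetical choices made for illustration.

```python
from collections import Counter

# Toy transcript sequences; gene names and bases are made up for illustration.
TRANSCRIPTS = {
    "GENE_A": "ATGGCCTTGACCGTAGGCTAACGTTGCA",
    "GENE_B": "TTCAGGCATCGATCCGTAAGCTTGGACT",
}

# Hypothetical fusion transcript: first 16 bases of GENE_A joined to the tail of GENE_B.
FUSION = TRANSCRIPTS["GENE_A"][:16] + TRANSCRIPTS["GENE_B"][12:]

READS = [
    TRANSCRIPTS["GENE_A"][2:22],  # ordinary read from GENE_A
    TRANSCRIPTS["GENE_B"][5:25],  # ordinary read from GENE_B
    FUSION[4:28],                 # spans the fusion junction
    FUSION[6:30],                 # spans the fusion junction
    # A single chimeric read (GENE_B then GENE_A), standing in for an artifact.
    TRANSCRIPTS["GENE_B"][:10] + TRANSCRIPTS["GENE_A"][18:],
]


def map_whole(read, transcripts):
    """Step (i): genes whose transcript contains the whole read."""
    return [gene for gene, seq in transcripts.items() if read in seq]


def find_junction(read, transcripts, min_anchor=8):
    """Step (ii): find a split where the two parts map to different genes.

    Returns (five_prime_gene, three_prime_gene, cut) or None.
    """
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_genes = [g for g, s in transcripts.items() if left in s]
        right_genes = [g for g, s in transcripts.items() if right in s]
        for a in left_genes:
            for b in right_genes:
                if a != b:
                    return (a, b, cut)
    return None


def predict_fusions(reads, transcripts, min_anchor=8, min_support=2):
    """Step (iii): count junction reads per gene pair and keep supported pairs."""
    support = Counter()
    for read in reads:
        if map_whole(read, transcripts):
            continue  # fully explained by one gene, so filtered out
        hit = find_junction(read, transcripts, min_anchor)
        if hit is not None:
            support[hit[:2]] += 1
    return {pair: n for pair, n in support.items() if n >= min_support}


if __name__ == "__main__":
    print(predict_fusions(READS, TRANSCRIPTS))  # {('GENE_A', 'GENE_B'): 2}
```

In this toy run the single GENE_B/GENE_A chimeric read is discarded because it has only one supporting read. That mirrors, in a very simplified way, how the choice of filtering thresholds changes which candidates survive.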