DSpace at EWHA: Fusion gene and toxicity analyses based on NGS transcriptomic data

Title: Fusion gene and toxicity analyses based on NGS transcriptomic data

Authors: 장예은

Issue Date: 2020

Department/Major: Graduate School, Interdisciplinary Program in Bioinformatics

Publisher: Ewha Womans University Graduate School

Degree: Doctor

Advisors: 이상혁

Abstract: Microarray and next-generation sequencing (NGS) approaches are widely used in cancer genomics, bioinformatics, and drug toxicology. The methods used to analyze NGS data depend on the genetic features that researchers want to identify, and new analysis methods continue to emerge. This thesis focuses on fusion gene analysis and on a toxicity database. A fusion gene is a hybrid gene created from two previously separate genes as a result of a translocation, an interstitial deletion, or a chromosomal inversion. Because fusion genes can have oncogenic properties, they can play an important role in tumorigenesis and are often found in patients' tumors. Importantly, some fusions serve as therapeutic targets. For example, in lung cancer patients the EML4-ALK fusion gene is targeted with ALK inhibitors, leading to successful treatment of the tumor. With these properties in mind, we analyzed NGS samples with a focus on fusion genes. First, fusion analysis was performed on RNA-Seq data. Samples were collected from 114 patients with lung cancer, 80 patients with stomach cancer, and patients with breast cancer together with patient-derived xenograft (PDX) samples. From the lung cancer data, 997 fusions were selected, of which 67 were confirmed as having kinase activity and were therefore expected to play a role in cancer development and progression; 14 of these candidates were shown to be druggable. To check whether the selected fusion genes were significant, a subset of the list was subjected to PCR verification, which validated 133 fusions. From RNA-Seq data of 80 gastric cancer patients, 105 fusion candidates were obtained through the filtering process, and 11 of these were confirmed to have an active kinase domain. The TFG-ROS1 fusion was confirmed to be among them.
We also built a fusion protein database by translating fusion transcripts into protein sequences, and found 112 fusions that were detected both in the RNA-Seq data and at the protein level by LC-MS/MS. Of these, 12 fusion genes retained an active kinase domain, and 20 involved genes in the RhoA and RhoGDI pathways, which are related to gastric cancer. This led to the discovery of fusion candidates considered to be significantly associated with the mechanism of gastric cancer development and progression. In addition, PDX models were used for 17 breast cancer patients, and whole-exome sequencing and RNA-Seq data were obtained for analysis. First, we checked whether the genetic characteristics of each patient sample and its PDX sample matched, using mutations and copy number variation (CNV) plots. The genetic characteristics were consistent for all samples but one. The fusion genes were then analyzed: eight fusion genes present in both patient and PDX samples were identified, and among the fusion candidates obtained from the breast cancer patients, fusions believed to be useful for treatment were found. Fusion analyses were also performed on PDX samples from lung and stomach cancer patients, and significant fusion candidates were identified in each. We also updated the ChimerDB 4.0 fusion gene database. In this update, each of the three modules was improved by adding manually curated information to the database, and new tools for the prediction of fusion genes were introduced. The number of predicted fusion candidates increased greatly as the number of TCGA samples analyzed increased. Additionally, we introduced a more accurate deep-learning technique for searching articles related to fusion genes, and performed more reliable manual curation to identify significant fusion gene candidates. The database pages were also redesigned with a more interactive UI.
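These changes will help users obtain useful information about fusion genes. Finally, we collected liver toxicity information to construct a toxicity database. We planned a database focused on liver toxicity that collects toxicity data and supports a deep-learning model, built on toxicity 'omics' data such as LINCS, for predicting the liver toxicity of compounds. For toxicity assessment, LiverTox, DILIrank (FDA), and CompTox data were integrated. Each source provides a toxicity score for a chemical or drug. We used these scores to build positive and negative sets for liver toxicity according to hazard. We also combined these data with the LINCS data to create the database. We plan to identify and integrate more toxicity assessment data and expression profiles to provide a more reliable database. In summary, gene fusions were analyzed using RNA-Seq, and databases were established to provide useful information for cancer treatment. These databases are intended to help analyze and exchange information derived from RNA-Seq data, and will serve as useful resources for developing novel treatment strategies in cancers involving gene fusions.

Abstract (Korean version, translated): Microarrays and next-generation sequencing (NGS) are used broadly across genomic research and bioinformatics, including cancer genome research and drug toxicology. Methods for analyzing NGS data vary with the genetic features under study, and new analysis methods are continually being developed. In this study, we performed fusion analysis and built a toxicity database. A fusion gene is a hybrid gene formed from two genes and results from a chromosomal translocation, deletion, or inversion. Because this can induce tumors, fusions are also used as important tumor markers. Fusion genes, led by the EML4-ALK gene in lung cancer patients, have been found in many cancer patients, and some fusion genes characterized across multiple studies have become important targets for treating cancer cells. We analyzed NGS data to find such fusion genes in transcriptome data from several cancers. The samples collected before the fusion gene analysis comprised 114 lung cancer patients, 80 gastric cancer patients, and 24 breast cancer patient and PDX samples. In the lung cancer data we found 997 fusions, of which 67 fusion candidates were identified as kinases expected to affect cancer mechanisms, and 14 of these were shown to be druggable targets. To confirm whether the selected fusion genes were meaningful, a subset was chosen for PCR verification, and 133 of them were validated. Next, transcriptome data from 80 gastric cancer patients were analyzed; 105 fusion gene candidates were obtained after filtering, and 11 of these were confirmed to have an intact kinase domain. Among these, the TFG-ROS1 fusion gene was confirmed to be present. We also built a fusion protein database by translating transcript sequences into protein, and using LC-MS/MS found 112 fusions detected in both the transcriptome data and the proteome. This list contained 12 fusions with an intact kinase domain, and 20 fusion genes related to the RhoA and RhoGDI pathways, which are associated with gastric cancer. These are considered to be linked to the mechanism of gastric cancer development. In addition, PDX data were obtained for 17 breast cancer patients, and genomic and transcriptomic data were acquired for analysis. First, mutation and CNV data were used to determine whether the genetic characteristics of patient and PDX samples matched. Apart from one sample, the genetic characteristics of the samples were consistent. Next, fusion gene analysis identified eight fusion genes shared by patient and PDX samples, including fusions presumed to be drug treatment targets. Fusion gene analysis was also performed on PDX samples from lung cancer patients and from gastric cancer patients, yielding a list of fusion candidates for each. We also updated ChimerDB 4.0, a fusion gene database. In this update, information was added to the database through manual curation, each of the three modules was improved, and new fusion analysis programs were added for predicting fusion genes from TCGA data. The number of TCGA samples increased compared with before, and the list of predicted genes grew substantially as a result. In addition, a more accurate deep-learning technique was introduced for searching journal articles related to fusion genes, and manual curation was carried out to present reliable fusion gene candidates. These changes can help users find useful information about fusion genes. Finally, liver toxicity information was collected to build a toxicity database. Focusing on liver toxicity, we collected toxicity assessment data from several databases and designed a deep-learning model that uses toxicity mechanism data such as LINCS to predict the liver toxicity of compounds. For toxicity assessment data, LiverTox, DILIrank (FDA), and CompTox data were integrated. Each dataset has toxicity scores for chemicals or drugs, which we used to build positive and negative sets for liver toxicity according to hazard. We also used these data together with the LINCS data to build the database. In the future, we will add more toxicity assessment data and expression profiles to provide a database with improved reliability. In this study, we analyzed fusion genes using transcriptome data from various cancer types and built a database that provides useful fusion information for cancer treatment. We also built a database related to liver toxicity. These results can help in analyzing and exchanging genomic information using transcriptome data.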
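
The abstract describes a filtering cascade: fusion candidates are narrowed to those with an active kinase domain, and then to druggable ones. The record does not give the actual criteria, so the following is only a minimal Python sketch under stated assumptions. The gene sets, the `FusionCandidate` fields, the `min_reads` threshold, and the placeholder symbols `GENE_A`/`GENE_B` are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass

# Hypothetical gene sets for illustration only; a real analysis would load
# curated kinase and drug-target annotations.
KINASE_GENES = {"ALK", "ROS1", "RET"}
DRUGGABLE_GENES = {"ALK", "ROS1"}


@dataclass(frozen=True)
class FusionCandidate:
    five_prime: str             # 5' partner gene symbol
    three_prime: str            # 3' partner gene symbol
    supporting_reads: int       # reads supporting the fusion junction
    kinase_domain_intact: bool  # True if the kinase domain survives the breakpoint

    @property
    def name(self) -> str:
        return f"{self.five_prime}-{self.three_prime}"


def kinase_fusions(candidates, min_reads=3):
    """Keep well-supported fusions in which a kinase partner keeps its kinase domain."""
    return [
        c for c in candidates
        if c.supporting_reads >= min_reads            # assumed read-support cutoff
        and c.kinase_domain_intact
        and {c.five_prime, c.three_prime} & KINASE_GENES
    ]


def druggable_fusions(candidates):
    """Keep fusions with at least one partner in the (assumed) druggable gene set."""
    return [c for c in candidates if {c.five_prime, c.three_prime} & DRUGGABLE_GENES]


# Toy input, not data from the thesis.
candidates = [
    FusionCandidate("EML4", "ALK", 25, True),
    FusionCandidate("TFG", "ROS1", 12, True),
    FusionCandidate("GENE_A", "GENE_B", 40, False),  # placeholder non-kinase fusion
    FusionCandidate("KIF5B", "RET", 2, True),        # dropped: too few reads
]

kinase = kinase_fusions(candidates)
druggable = druggable_fusions(kinase)
print([c.name for c in druggable])
```

Keeping the two filters as separate functions mirrors the staged counts reported in the abstract (all candidates, kinase fusions, druggable fusions), so the size of each intermediate list can be inspected on its own.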