DSpace at EWHA: Next-Generation Transcriptome Sequencing Analysis in a Cloud Computing Environment

Title: Next-Generation Transcriptome Sequencing Analysis in a Cloud Computing Environment

Other Titles: Massively Parallel Transcriptome Sequencing Analysis on the Cloud

Authors: 이아랑

Issue Date: 2011

Department/Major: Graduate School, Department of Computer Science and Engineering

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 박현석

Abstract: As sequencing technology has advanced, RNA sequencing tailored to specific research purposes has become possible alongside whole genome sequencing. Unlike earlier microarray analysis, which could only confirm whether a gene was expressed, RNA sequencing yields clearer expression information free of background noise and also allows genomic variants to be analyzed at large scale. In this study, we implemented a pipeline for analyzing such high-volume transcriptome (RNA-Seq) sequencing data as FX, an engine based on the Hadoop MapReduce framework, tested it on a local cluster, and compared its performance while scaling it up in a cloud computing environment. The pipeline aligns reads to a reference genome, analyzes genetic variants such as SNPs and INDELs, and reports normalized expression levels per gene. For the alignment step in particular, we use GSNAP, which is well suited to transcriptome alignment but has seen little use in other studies because it demands high-performance computing resources, followed by careful base extraction that takes base quality into account, so that the pipeline runs both accurately and quickly. We ran FX on sequence data from eight unrelated Korean individuals and measured the elapsed time as computing resources were scaled up: a job that took more than 7 hours on 5 ordinary desktop-class computers completed in one and a half hours on 40. In addition, comparing the FX results with validated data for the same sample showed more than 98% concordance in variant analysis.

Transcriptome sequencing with massively parallel sequencing technology has become a popular method for gene expression profiling and for detecting genetic variation in biology and biomedicine, owing to falling sequencing costs. Because the data are so large, managing and storing the raw sequence files and the intermediate files derived from them is a challenge for researchers. Moreover, the long processing time is a bottleneck in sequencing data analysis. In this work, we propose a fast, accurate, and scalable transcriptome analysis pipeline, named FX, based on the Hadoop MapReduce framework. It was tested on a local Hadoop cluster and scaled up on Amazon Web Services (AWS) cloud computing facilities. By storing the raw sequence data on AWS S3 and creating a job flow with the desired number and size of EC2 instances, we could complete the entire transcriptome analysis pipeline in under one and a half hours, at a cost of less than $45 per sample. To improve performance, we identified the optimal number of instances in terms of elapsed time and cost.
FX includes reference alignment with GSNAP against a set of three gene databases (RefSeq, KnownGene, and ENSEMBL). SNP and INDEL calls and expression profiles are reported as results. Since building and maintaining a local storage and computing server from scratch is costly and time consuming, especially for small research groups, FX can take over the time-consuming work and let researchers concentrate on post-processing analysis.
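
The scaling result quoted in the abstract (more than 7 hours on 5 machines versus about one and a half hours on 40) can be read as a speedup and a parallel efficiency. The following is a minimal sketch of that arithmetic, not code from the thesis. The function name `scaling_summary` is hypothetical, and the 7.0 hour baseline is an assumption taken as a lower bound, since the abstract only says "more than 7 hours"; the resulting figures are therefore lower bounds as well.

```python
def scaling_summary(base_nodes, base_hours, scaled_nodes, scaled_hours):
    """Return (speedup, node_ratio, efficiency) for two cluster runs.

    speedup    = how many times faster the scaled run finished
    node_ratio = how many times more machines the scaled run used
    efficiency = speedup / node_ratio (1.0 would be perfect linear scaling)
    """
    speedup = base_hours / scaled_hours
    node_ratio = scaled_nodes / base_nodes
    efficiency = speedup / node_ratio
    return speedup, node_ratio, efficiency


# Figures from the abstract; 7.0 h is an assumed lower bound ("more than 7 hours").
speedup, node_ratio, efficiency = scaling_summary(
    base_nodes=5, base_hours=7.0, scaled_nodes=40, scaled_hours=1.5
)

print(f"speedup:    at least {speedup:.2f}x")
print(f"node ratio: {node_ratio:.0f}x")
print(f"efficiency: at least {efficiency:.2f}")
```

Under these assumptions, eight times as many machines gave somewhat more than a fourfold reduction in elapsed time. Scaling that falls short of linear in this way is consistent with the abstract's remark that an optimal instance count was sought in terms of both elapsed time and cost.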