DSpace at EWHA: Massively Parallel Transcriptome Sequencing Analysis on the Cloud

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	박현석	-
dc.contributor.author	이아랑	-
dc.creator	이아랑	-
dc.date.accessioned	2016-08-25T11:08:41Z	-
dc.date.available	2016-08-25T11:08:41Z	-
dc.date.issued	2011	-
dc.identifier.other	OAK-000000068077	-
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/186692	-
dc.identifier.uri	http://dcollection.ewha.ac.kr/jsp/common/DcLoOrgPer.jsp?sItemId=000000068077	-
dc.description.abstract	As sequencing technology has advanced, not only whole genome sequencing but also RNA sequencing tailored to specific research goals has become possible. Unlike earlier microarray-based analysis, which could only confirm whether a gene was expressed, RNA sequencing provides clearer expression information free of background noise and also allows genomic variants to be analyzed at large scale. In this study, we implemented a pipeline for analyzing such large-volume transcriptome (RNA-Seq) sequencing data as an engine named FX, built on the Hadoop MapReduce framework, tested it on a local cluster, and compared its performance while scaling it up in a cloud computing environment. The pipeline aligns reads to a reference genome, analyzes genetic variants such as SNPs and INDELs, and reports normalized expression levels per gene. In the alignment step in particular, reads are aligned with GSNAP, which is well suited to transcriptome alignment but has seen little use in other studies because it demands high-performance computing resources; this is followed by a careful base extraction step that takes base quality into account, so that the pipeline runs both accurately and quickly. Using FX, we ran experiments on eight sequences from unrelated Korean individuals, measuring the elapsed time as computing resources were scaled up. A job that took more than 7 hours on 5 ordinary desktop-class computers completed in an hour and a half on 40. In addition, comparing the FX output with validated data for the same samples showed a similarity of over 98% in variant analysis.;Transcriptome sequencing with massively parallel sequencing technology has become a popular method for gene expression profiling and for detecting genetic variation in biology and biomedicine, owing to the declining cost of sequencing. Because the data are so large, storing and managing the raw sequence files and the intermediate files derived from them is a challenge for researchers. Moreover, the long running time of the analysis is a bottleneck in sequencing data analysis. In this work, we propose a fast, accurate, and scalable transcriptome analysis pipeline, named FX, based on the Hadoop MapReduce framework. It was tested on a local Hadoop cluster and scaled up on Amazon Web Services cloud computing facilities. By storing the raw sequence data on AWS S3 and creating a job flow with the desired number and size of EC2 instances, we completed the entire transcriptome analysis pipeline in under an hour and a half, at a cost of less than $45 per sample. To improve performance, we determined the optimal number of instances in terms of elapsed time and cost.
FX includes reference alignment with GSNAP against a set of three gene databases (RefSeq, KnownGene, and ENSEMBL). SNP calls, INDEL calls, and expression profiles are reported as results. Since building and maintaining a local storage and computing server from scratch is costly and time-consuming, especially for small research groups, FX can take over this time-consuming work and let researchers concentrate on post-processing analysis.	-
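dc.description.tableofcontents	I. Introduction 1; A. Research Background 1; B. Problems in Handling Sequence Data 4; 1. Storage Space 4; 2. Execution Time 5; C. Research Objectives and Scope 7; II. Related Work 9; A. Underlying Technologies 9; 1. Local Cloud Computing Technologies 9; 2. Amazon Web Services (AWS) 11; 3. Transcriptome Analysis Techniques 16; B. Related Studies 29; III. Implementation 31; A. Architecture of FX 31; 1. Flexibility: Run-at-once vs. Step-by-Step 31; 2. Running on a Local Cluster 31; 3. Scalability to AWS 35; B. Processing Steps (Work Flow) 39; 1. Preparation (Preprocess) 39; 2. Alignment 39; 3. Base Extraction (Base Call) 44; 4. SNP Extraction (SNP Call) 47; 5. INDEL Extraction (INDEL Call) 48; 6. Expression Level Reporting (Expression Profiling) 49; IV. Results 50; A. Accuracy 50; B. Performance Efficiency 51; 1. Scalability, Total Elapsed Time, and Cost 51; 2. Comparison of Normalized Total Elapsed Time 55; V. Conclusion 57; References 59; ABSTRACT 63	-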
dc.format	application/pdf	-
dc.format.extent	3003656 bytes	-
dc.language	kor	-
dc.publisher	The Graduate School of Ewha Womans University	-
dc.title	클라우드 컴퓨팅 환경에서의 차세대 전사체 시퀀싱 분석	-
dc.type	Master's Thesis	-
dc.title.translated	Massively Parallel Transcriptome Sequencing Analysis on the Cloud	-
dc.format.page	viii, 63 p.	-
dc.identifier.thesisdegree	Master	-
dc.identifier.major	Graduate School, Department of Computer Science and Engineering	-
dc.date.awarded	2011. 8	-
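
The scaling result quoted in the abstract (more than 7 hours on 5 machines, about an hour and a half on 40) can be turned into a speedup and a parallel efficiency figure. The sketch below is only an illustration of that arithmetic and is not taken from the thesis. The function name scaling_summary is hypothetical, and because the abstract gives the baseline only as "more than 7 hours", the value 7.0 is an assumed lower bound, so both results are lower bounds as well.

```python
def scaling_summary(base_nodes, base_hours, scaled_nodes, scaled_hours):
    """Return (speedup, parallel efficiency) for a scale-out run.

    speedup    = baseline time / scaled time
    efficiency = speedup / (scaled nodes / baseline nodes)
    """
    speedup = base_hours / scaled_hours
    node_ratio = scaled_nodes / base_nodes
    efficiency = speedup / node_ratio
    return speedup, efficiency


# Figures from the abstract: over 7 hours on 5 machines, about 1.5 hours on 40.
# 7.0 is an assumed lower bound for the baseline run time.
speedup, efficiency = scaling_summary(5, 7.0, 40, 1.5)
print(f"speedup >= {speedup:.2f}x, efficiency >= {efficiency:.0%}")
# prints: speedup >= 4.67x, efficiency >= 58%
```

Under these assumptions, an eightfold increase in machines gives at least a 4.67-fold reduction in run time, which is consistent with the abstract's remark that the number of instances was tuned for both elapsed time and cost rather than simply maximized.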