DSpace at EWHA: Performance Analysis of Feature Selection and Classification Methods for Tumor Classification


Title: Performance Analysis of Feature Selection and Classification Methods for Tumor Classification (종양 분류를 위한 특징 추출과 분류기법의 성능분석)

Authors: 박윤정

Issue Date: 2005

Department/Major: Graduate School of Science and Technology, Department of Computer Science

Publisher: Ewha Womans University, Graduate School of Science and Technology

Degree: Master

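Abstract: Advances in biotechnology have played a major role in obtaining biological information on a large scale. Microarray technology provides large volumes of gene expression information, measured under different treatment conditions or environments, as quantitative values. The expression of each gene arises from complex interactions, and the expression of some genes can promote or suppress the expression of others. However, no matter how much DNA information is acquired, that information alone cannot tell us what a gene does, what role a cell plays, how cells form an organism, or how they age. A technology is therefore needed to make meaningful use of this vast amount of DNA sequence information, and the DNA microarray is such a technology.

In particular, using microarray data from tumor tissue to analyze how genes are differentially expressed across tumor types, and thereby identifying genes useful for tumor classification and building accurate classification tools, is an important research topic in the bio-industry. Such methods are expected to replace existing morphological and clinical tumor classification methods, which carry inherent uncertainty, and to distinguish fine tumor subtypes that are difficult to tell apart by eye. Two things are therefore very important: feature selection methods that extract, from a very large number of genes, those whose expression levels change markedly across tumor subtypes and are thus useful for sample classification, and more accurate tumor classification models built from those genes.

In this thesis, we used leukemia microarray data sets consisting of 2, 3, and 7 classes. After normalizing the data, we applied four feature selection methods (Information Gain, Gini Index, One-dimensional Support Vector Machine, and T-statistic) to select lists of genes that discriminate between disease classes. We then applied the Naive Bayes, KNN, Decision Tree, Support Vector Machine, and Neural Network algorithms to the expression data of those genes to build tumor classification models, and evaluated performance by comparing the results of each experiment.

The experiments let us estimate rough performance patterns as a function of the number of classes and samples. Analysis on the selected discriminative genes performed much better than analysis on the full data set, and Information Gain performed more efficiently than the other feature selection algorithms. On data sets with few classes, most classifiers performed similarly, but as the number of classes increased, SVM and Neural Network outperformed the other algorithms. When running time was weighed against performance, SVM was much more efficient. Among the combinations of feature selection method and classifier, Information Gain for feature selection together with SVM for classification was the most efficient.

Microarray technology provides large-scale gene expression profiles for various experimental conditions as quantitative values. The transcription of genes is triggered or inhibited by complex biological interactions and associations.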
To understand the function and role of a gene in a cell, as well as the mechanisms of cellular phenomena, we need to capture snapshots of cellular processes using microarray technology. By analyzing the informative expression patterns of genes associated with the development of a tumor, we can build a classifier that discriminates between tumor subtypes. In this paper, we performed a comparative performance analysis of feature selection and multiclass classification methods for tumor classification using leukemia microarray data. First, we selected informative gene sets whose expression profiles vary with tumor subtype, using four feature selection methods: Information Gain, Gini Index, One-dimensional Support Vector Machine, and T-statistic. Second, we built tumor subtype classifiers using state-of-the-art machine learning algorithms: Naïve Bayes, KNN, Decision Tree, Support Vector Machine, and Neural Network. Third, we compared the performance of the classifiers across combinations of feature selection and classification methods. The results show that classification accuracy improved markedly when feature selection was applied. Information Gain showed the best performance among the four feature selection methods. SVM and Neural Network showed higher accuracy in multiclass classification, and SVM was more efficient than Neural Network in terms of execution time. The combination of Information Gain and SVM showed the best performance in our multiclass tumor classification.
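
The gene-selection step described in the abstract can be illustrated with a minimal sketch of Information Gain ranking. This is not the thesis's implementation: the abstract does not say how expression values were discretized, so the median split used here is an assumption, and the function names (`entropy`, `information_gain`, `rank_genes`), the gene names, and the toy expression values are all hypothetical.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of splitting samples at the median of one gene.

    Assumption: a single median threshold stands in for whatever
    discretization the thesis actually used.
    """
    threshold = median(values)
    low = [y for v, y in zip(values, labels) if v <= threshold]
    high = [y for v, y in zip(values, labels) if v > threshold]
    if not low or not high:
        # A constant gene cannot separate the classes at all.
        return 0.0
    n = len(labels)
    conditional = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - conditional


def rank_genes(expression, labels, top_k):
    """Return the top_k gene names ordered by decreasing information gain."""
    scores = {gene: information_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes measured on six samples from two classes.
expression = {
    "gene_a": [0.1, 0.2, 0.15, 0.9, 1.1, 1.0],    # separates the classes cleanly
    "gene_b": [0.5, 0.9, 0.4, 0.6, 0.45, 0.95],   # weakly informative
    "gene_c": [0.3, 0.3, 0.3, 0.3, 0.3, 0.3],     # constant, uninformative
}
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

selected = rank_genes(expression, labels, top_k=2)
print(selected)
```

In a pipeline like the one the abstract describes, only the expression values of the selected genes would then be passed to a classifier such as an SVM, instead of the full gene set.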