DSpace at EWHA: Comparative Study of Feature Selection and Classification Techniques for High-throughput Experimental Data Analysis

Title: Comparative Study of Feature Selection and Classification Techniques for High-throughput Experimental Data Analysis

Other Titles: 대용량처리 실험 데이터 분석을 위한 특징 선택 및 분류 기법의 비교 연구

Authors: 이민수

Issue Date: 2007

Department/Major: Graduate School, Department of Computer Science

Publisher: The Graduate School of Ewha Womans University

Degree: Doctor

Abstract: With recent advances in biochemical technologies for studying life phenomena, various high-throughput experiments (HTE) have produced huge amounts of biological data.
HTE methods greatly improve the efficiency of biological experiments. However, because most HTE methods rely on brute-force approaches that process many experiments of the same kind at once, their data are prone to problems of reliability and relevance. Moreover, to interpret biological data more precisely and systematically, HTE data should be integrated and analyzed together with other relevant biological knowledge sources. Assessing and analyzing such large-scale data requires two main steps. First, an appropriate feature selection method must be applied to select the biological attributes in the integrated database that are relevant to the target task. Second, machine learning or data mining techniques should be used to construct computational models that analyze and interpret the high-throughput data. However, the appropriate feature selection methods and data mining techniques vary with the properties of the data and the target task, so a systematic approach to assessing and analyzing biological data from HTE is needed. In this dissertation, we present a comparative study of feature selection and classification techniques for high-throughput experimental data analysis, focusing on the analysis of microarray data and the verification of protein-protein interaction (PPI) data. Microarray experiments measure the expression levels of thousands of genes in parallel. Because microarray data usually comprise a small number of samples with thousands of attributes, models built on them easily overfit the few available samples. In addition, because computing over all genes is expensive, it is desirable to reduce the number of genes by discarding irrelevant ones. We therefore applied various feature selection methods to identify task-relevant gene subsets. For time-series microarray data, we proposed a feature selection method that identifies a task-specific gene list.
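The abstract does not name the specific feature selection methods that were compared. As a minimal sketch of the general idea, assuming a two-class problem and using Welch's t-statistic as one common filter criterion, genes can be ranked by how strongly their expression differs between classes and only the top few kept. The gene names, expression values, and the helper names `welch_t` and `top_k_genes` below are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_k_genes(expr, labels, k):
    """Rank genes by |t| between class 0 and class 1; return the k best names.

    expr maps a gene name to its expression values, one per sample.
    labels holds the class (0 or 1) of each sample, in the same order.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 6 samples (3 per class), 4 genes.
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # large shift between classes
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same mean in both classes
    "geneC": [5.0, 4.0, 6.0, 4.5, 5.5, 5.0],  # same mean, more spread
    "geneD": [0.5, 0.7, 0.6, 1.4, 1.1, 1.3],  # smaller shift between classes
}

selected = top_k_genes(expr, labels, k=2)
print(selected)
```

With so few samples relative to the number of genes, the ranking would normally be recomputed inside each cross-validation fold, so that held-out samples do not influence which genes are kept.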
Then, to understand the biological meaning of the selected gene subset, we integrated Gene Ontology information and identified the Gene Ontology terms that occur disproportionately often among the selected genes. Finally, we applied a classification algorithm to the expression profiles of the selected gene set to construct a model that predicts sample classes. By applying feature selection, we identified important biomarkers for each class and constructed a prediction model with the best performance. High-throughput PPI identification methods detect interacting protein pairs efficiently, but they have higher false-positive rates than small-scale, labor-intensive studies. A computational method for assessing the reliability of individual PPI data is therefore needed. We designed and implemented a protein interaction evaluation system (PIES) that uses various genomic data as evidence of PPIs. To choose the most informative genomic features, we applied various feature selection methods. Because the performance of a classification algorithm depends heavily on the characteristics of the data, we then compared combinations of feature selection methods and classification techniques to find the best-performing configuration of PIES. By applying feature selection and classification techniques, we established a protein interaction evaluation system that estimates the reliability of PPI data with outstanding performance.
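The kind of comparison described above, pairing each feature selection method with each classifier and scoring every combination, can be sketched as follows. This is an illustration under stated assumptions and not the PIES implementation: the two selectors (t-statistic and variance filters), the two classifiers (nearest centroid and 1-nearest-neighbor), the leave-one-out evaluation, and the toy evidence features are all hypothetical choices that the abstract does not specify.

```python
from math import dist, sqrt
from statistics import mean, variance


def t_score(values, labels):
    """Supervised filter: |Welch t| between class 0 and class 1."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def var_score(values, labels):
    """Unsupervised filter: overall variance (labels are ignored)."""
    return variance(values)


def select(X, y, score, k):
    """Return the indices of the k highest-scoring feature columns."""
    cols = range(len(X[0]))
    scores = [score([row[j] for row in X], y) for j in cols]
    return sorted(cols, key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid(X_train, y_train, x):
    """Predict the class whose mean vector is closest to x."""
    best = None
    for c in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        centroid = [mean(col) for col in zip(*rows)]
        d = dist(centroid, x)
        if best is None or d < best[0]:
            best = (d, c)
    return best[1]


def one_nn(X_train, y_train, x):
    """Predict the label of the single closest training example."""
    return min(zip(X_train, y_train), key=lambda p: dist(p[0], x))[1]


def loo_accuracy(X, y, score, classify, k):
    """Leave-one-out accuracy; features are selected on the training fold only."""
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        idx = select(X_train, y_train, score, k)
        project = lambda row: [row[j] for j in idx]
        pred = classify([project(r) for r in X_train], y_train, project(X[i]))
        hits += pred == y[i]
    return hits / len(X)


# Hypothetical toy data: each row is a candidate protein pair with three
# evidence features (two informative, one high-variance noise column).
X = [
    [0.90, 0.80, 10.0], [0.80, 0.90, -10.0], [0.85, 0.70, 9.0], [0.95, 0.75, -9.0],
    [0.10, 0.20, 10.5], [0.20, 0.10, -10.5], [0.15, 0.30, 9.5], [0.05, 0.25, -9.5],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = true interaction, 0 = false positive

selectors = {"t-statistic": t_score, "variance": var_score}
classifiers = {"nearest-centroid": nearest_centroid, "1-NN": one_nn}
results = {
    (s_name, c_name): loo_accuracy(X, y, s_fn, c_fn, k=1)
    for s_name, s_fn in selectors.items()
    for c_name, c_fn in classifiers.items()
}
for combo, acc in results.items():
    print(combo, acc)
```

On this contrived data the variance filter keeps the noisy column and does much worse than the t-statistic filter with either classifier, which illustrates why the choice of selector and classifier has to be evaluated jointly against the data at hand.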