DSpace at EWHA: Classification Ensemble Method for Multi-class Microarray Data

Title: 멀티 클래스 마이크로어레이 데이터에 대한 분류 앙상블 방법

Other Titles: Classification Ensemble Method for Multi-class Microarray Data

Authors: 김영은

Issue Date: 2007

Department/Major: Graduate School, Department of Computer Science

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Abstract: Microarrays make it possible to observe and analyze the expression levels of thousands of genes simultaneously, which has contributed greatly to research on disease, and on cancer in particular. Because the expression levels of the relevant genes differ between tissues that show different symptoms, these levels can be used to classify the type of cancer or the presence or absence of disease. This thesis proposes several new methods for applying an ensemble approach to the classification of multi-class microarray data comprising several types of cancer. An ensemble combines several classifiers to obtain better and more stable performance than a single classifier. The classification ensemble procedure consists of three steps: feature selection, classifier generation, and classifier combination. In the feature selection step, three methods originally applied to binary classification were extended to the multi-class setting: analysis of variance (ANOVA), correlation coefficient analysis, and signal-to-noise ratio analysis. For each method, genes are ranked by significance and the top-ranked genes are selected. Because the optimal number of genes to select for multi-class classification is not known, feature sets of various sizes are compared. To obtain good results when generating and combining classifiers, the classifiers being combined must be diverse, so the diversity among individual classifiers is measured before they are combined. This thesis uses decision trees as classifiers and proposes a diversity measure based on their constituent nodes, computed as the disjoint rate of the nodes of the decision trees. Existing diversity measures are computed from the classifiers' output patterns, so they do not consider the structural diversity within the classifiers, and their computation becomes too complex to apply to data with many classes. The proposed measure addresses both problems.
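The feature ranking and node-based diversity steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not give the formula for the disjoint rate, so the sketch assumes it is the fraction of split genes that two trees do not share (one minus the Jaccard similarity of their node sets). The function names, the toy expression matrix, and the representation of a tree as a set of gene indices are all hypothetical.

```python
from itertools import combinations


def anova_f(values, labels):
    """One-way ANOVA F-statistic for one gene across classes.

    Assumes at least two classes and more samples than classes.
    """
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = ss_within = 0.0
    for g in groups.values():
        mean = sum(g) / len(g)
        ss_between += len(g) * (mean - grand_mean) ** 2  # spread of class means
        ss_within += sum((v - mean) ** 2 for v in g)     # spread inside a class
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_genes(expression, labels, size):
    """Rank genes (columns of `expression`) by F-statistic, keep the top `size`."""
    n_genes = len(expression[0])
    scores = [anova_f([row[j] for row in expression], labels)
              for j in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda j: scores[j], reverse=True)
    return ranked[:size]


def disjoint_rate(nodes_a, nodes_b):
    """ASSUMED definition: share of split genes the two trees do not have in common."""
    a, b = set(nodes_a), set(nodes_b)
    union = a | b
    return len(a ^ b) / len(union) if union else 0.0


def mean_pairwise_diversity(trees):
    """Average disjoint rate over all pairs of trees (tree = set of split genes)."""
    pairs = list(combinations(trees, 2))
    if not pairs:
        return 0.0
    return sum(disjoint_rate(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical): 6 samples, 3 classes, 3 genes.
expression = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 9.0],
    [5.0, 5.0, 2.1],
    [5.1, 4.9, 8.8],
    [9.0, 5.2, 2.0],
    [9.2, 5.0, 9.1],
]
labels = ["A", "A", "B", "B", "C", "C"]

print(top_genes(expression, labels, 2))          # gene 0 separates the classes best
print(disjoint_rate({1, 2, 3}, {3, 4, 5}))       # 4 of 5 distinct genes are unshared
print(mean_pairwise_diversity([{1, 2, 3}, {3, 4, 5}, {1, 2, 3}]))
```

Unlike measures based on output patterns, a score of this kind depends only on which genes each tree splits on, so its cost does not grow with the number of classes. Calling `top_genes` with several values of `size` mirrors the comparison of differently sized feature sets.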