DSpace at EWHA: Optimal k-search and Its Application in k-medoid Clustering Algorithm based on Genetic Algorithm

Title: 유전자 알고리즘에 기반한 k-medoid 클러스터링 알고리즘에서의 최적의 k-탐색과 적용

Other Titles: Optimal k-search and Its Application in k-medoid Clustering Algorithm based on Genetic Algorithm

Authors: 안선영

Issue Date: 2007

Department/Major: Graduate School, Department of Computer Science

Publisher: The Graduate School of Ewha Womans University

Degree: Master

Abstract: The DNA microarray is an advanced biological experimental technique that makes it possible to observe and analyze differences in the expression of tens of thousands to hundreds of thousands of genes at once. Advances in bioinformatics built on this kind of analysis technology have produced vast and rapidly growing volumes of bio-data. To extract meaningful information quickly and accurately from such large data sets, a variety of data mining methods are being applied to manage and analyze the data effectively. For gene expression data in particular, clustering is the most widely used approach, because it is an efficient technique that contributes substantially to the analysis of gene function and of networks among genes. Partitioning methods are well suited to clustering large data sets, and the most representative of them are the k-means and k-medoid methods.
However, both methods run with a fixed number of clusters k, so it is hard to find the right k without prior knowledge of the data. The validity of the results must instead be checked by repeating the experiment while varying k, and the time cost of doing so grows with the size of the data. This thesis proposes a new method for predicting the number of clusters k. It uses the similarity among genes together with the betweenness centrality value from social network analysis, computed on a single gene network constructed from that similarity. The k found in this way is then applied to a k-medoid clustering method based on a genetic algorithm, yielding clustering results that are more efficient and more biologically meaningful than previously obtained results.
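The two ingredients named in the abstract, a gene network built from pairwise similarity and betweenness centrality computed on that network, can be illustrated with a minimal sketch. It is not the thesis's implementation. The use of Pearson correlation as the similarity measure, the 0.9 threshold, and the function names `pearson`, `similarity_graph`, and `betweenness` are all assumptions made for illustration. The abstract does not state how k is derived from the centrality values, so the sketch stops at computing them.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def similarity_graph(profiles, threshold=0.9):
    """Link two genes when their correlation reaches the threshold.

    Both the similarity measure and the threshold are assumptions;
    the abstract names neither.
    """
    genes = list(profiles)
    adj = {g: set() for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                adj[g].add(h)
                adj[h].add(g)
    return adj


def betweenness(adj):
    """Unnormalized node betweenness of an undirected, unweighted graph.

    Uses Brandes' algorithm: one breadth-first search per source node,
    followed by back-propagation of shortest-path dependencies.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in order of discovery
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted from both ends.
    return {v: c / 2 for v, c in score.items()}
```

Given a dictionary that maps gene identifiers to expression profiles, `betweenness(similarity_graph(profiles))` returns one score per gene. Genes that bridge otherwise separate groups of similar genes receive high scores, which is plausibly the kind of signal a method for estimating the number of clusters could draw on.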