DSpace at EWHA: Hierarchical Clustering Application on Extended Sufficient Dimension Reduction Methods

Title: Hierarchical Clustering Application on Extended Sufficient Dimension Reduction Methods

Authors: 유채연

Issue Date: 2020

Department/Major: Graduate School, Department of Statistics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 유재근

Abstract (translated from Korean): As the need for data handling grows in modern society, effective data reduction is also becoming important. Sufficient dimension reduction (SDR) is the best-known approach for reducing high-dimensional data to a lower dimension, and methods that extend it include CCM (Clustering Conditional Mean; Yoo, 2016) and PCM (Partial Informative Conditional Mean; Yoo, 2016). Both methods apply k-means clustering in the categorization step known as "slicing", and PCM first reduces the X variables to two dimensions using the principal Hessian residual and ordinary least squares. However, k-means has two deficiencies. First, the clustering result changes from run to run. Second, it lacks the nested property, so overlapping occurs. This thesis therefore uses hierarchical clustering as a replacement for k-means. Unlike k-means, hierarchical clustering always produces the same result and has the nested property. Numerical studies on nine models under three cases (CCM, PCM, and PCA) confirmed that in most cases hierarchical clustering performed better than or comparably to k-means. Among the variants, complete linkage gave the best results.

Abstract (English): As data handling has become one of the most important tasks in modern society, reducing the size or dimension of data is now indispensable. As extensions of sufficient dimension reduction methods, clustering conditional mean (CCM; Yoo, 2016) and partially informative conditional mean (PCM; Yoo, 2016) have been proposed. The former clusters X and then applies slicing. The latter uses the first directions of ordinary least squares and the principal Hessian direction residual to construct a new two-dimensional estimate, which is likewise clustered and then sliced. Both extended methods depend on the k-means algorithm at the clustering stage. However, k-means has two deficits. First, it is not reproducible: because it depends on the initial partition and finds only a local optimum, the clusters it forms are not identical across runs. Second, it does not have the nested property with respect to the slicing step; instead, overlapping occurs. This paper therefore proposes replacing k-means with hierarchical clustering, which is free of both deficits. Nine models are studied in simulations. K-means and hierarchical clustering are compared using three versions of X: original, PCM, and PCA. Three linkage types for hierarchical clustering are also considered: complete, Ward, and average. Among all clustering methods, hierarchical clustering with complete linkage performed best. Hierarchical clustering with Ward linkage appears to come second, and it gave the outcomes most similar to those of k-means. In other words, hierarchical clustering performs as well as or better than k-means. Replacing k-means with hierarchical clustering is therefore reasonable.
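The two properties the abstract attributes to hierarchical clustering (the same result on every run, and nested partitions) can be illustrated with a minimal pure-Python sketch of agglomerative clustering with complete linkage. This is not the thesis's code: the function names `complete_linkage_merges`, `cut`, and `is_nested`, as well as the toy points, are assumptions introduced only for illustration, and a real analysis would more likely rely on an established statistics library.

```python
from itertools import combinations
from math import dist


def complete_linkage_merges(points):
    """Agglomerative clustering with complete linkage (illustrative sketch).

    Returns every partition visited, from n singleton clusters down to one
    cluster. Each partition is a list of frozensets of point indices.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    history = [list(clusters)]
    while len(clusters) > 1:
        # Complete linkage: the distance between two clusters is the
        # largest distance between any pair of their members.
        def gap(pair):
            a, b = pair
            return max(dist(points[i], points[j]) for i in a for j in b)

        # No random initialisation is involved, so the merge order is
        # fully determined by the data.
        a, b = min(combinations(clusters, 2), key=gap)
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append(list(clusters))
    return history


def cut(history, k):
    """Return the partition in the merge history that has exactly k clusters."""
    return history[len(history) - k]


def is_nested(fine, coarse):
    """True if every cluster of `fine` lies inside some cluster of `coarse`."""
    return all(any(c <= d for d in coarse) for c in fine)


# Hypothetical toy data: three well-separated pairs of points.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0), (5.3, 5.0)]
history = complete_linkage_merges(points)
three = cut(history, 3)
two = cut(history, 2)
```

Because each partition is obtained by merging two clusters of the previous one, the 3-cluster solution is always a refinement of the 2-cluster solution, which is the nested behaviour the abstract contrasts with k-means. Average or Ward linkage would change only how `gap` is computed.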