DSpace at EWHA: Numerical Comparison of Hierarchical and K-means Clustering Algorithm in K-means Inverse Regression

Title: Numerical Comparison of Hierarchical and K-means Clustering Algorithm in K-means Inverse Regression

Other Titles: A Comparison of Hierarchical Clustering and K-means Clustering in K-means Clustering Inverse Regression

Authors: 유연주

Issue Date: 2020

Department/Major: Department of Statistics, Graduate School

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 유재근

Abstract: Reducing high-dimensional data to a lower dimension without loss of information has become a necessary step before the actual analysis. Sliced inverse regression (SIR), one of the representative methods of sufficient dimension reduction, is widely used for high-dimensional data. However, when SIR is extended to a multivariate response, it can run into the "curse of dimensionality". K-means inverse regression (KIR) was proposed to overcome this issue. KIR is similar to SIR, but the multivariate response Y is sliced by the K-means clustering algorithm. However, the clusters can change every time K-means clustering is run, and K-means clusters are neither overlapping nor nested. In this paper, we therefore propose replacing K-means clustering with hierarchical clustering. We compared the methods through simulations under several models. Hierarchical clustering with Ward's method gave results similar to or better than those of K-means clustering. In addition, hierarchical clustering returns the same result on every run and has the overlapping and nested properties. For these reasons, we suggest using hierarchical clustering instead of K-means clustering when slicing the response variable Y.

Korean abstract (translated): With the development of computers and the internet, data generation in modern society has accelerated, and the amount of data is growing exponentially. Effectively selecting the necessary variables from such vast data is becoming increasingly difficult, and a large number of explanatory variables leads to biased analysis results, so reducing high-dimensional data to a lower dimension without loss of information before analysis has become an essential step. Sufficient dimension reduction (SDR) is one approach to dimension reduction, and its best-known method is SIR (Sliced Inverse Regression; Li, 1991). SIR can be used for both univariate and multivariate analysis, but applying SIR directly to the multivariate case can run into the "curse of dimensionality". To resolve this, the KIR (K-means Inverse Regression; Setodji and Cook, 2004) method was proposed. KIR is similar to SIR but uses K-means clustering when slicing the response variable Y.
However, K-means clustering has the drawback that the clusters change with every run, and it lacks the overlapping and nested properties expected when slicing a variable. This thesis therefore proposes hierarchical clustering in place of K-means clustering. The methods were compared through simulations of several models, and hierarchical clustering with the Ward method gave results similar to or better than those of the K-means clustering algorithm. In addition, because hierarchical clustering has the overlapping and nested properties, we propose using hierarchical clustering instead of the K-means algorithm when slicing the variable Y.
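
The idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's code: it assumes NumPy and SciPy are available, the function names `ward_slices` and `inverse_regression_directions` are hypothetical, and the toy single-index model is invented purely for illustration. It slices a bivariate response Y by cutting a Ward dendrogram, applies a SIR-type inverse regression step to the resulting slices, and checks two properties mentioned above: repeated runs give identical slices, and cutting the same tree into more slices only refines the existing ones (nestedness).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def ward_slices(tree, n_slices):
    """Cut a precomputed Ward dendrogram into at most n_slices flat clusters."""
    return fcluster(tree, t=n_slices, criterion="maxclust")


def inverse_regression_directions(X, labels, d):
    """SIR-type estimate: top-d eigenvectors of the weighted covariance
    of slice means of the standardized predictors."""
    n, p = X.shape
    cov = np.cov(X, rowvar=False)
    # Inverse square root of the covariance, used to standardize X.
    evals, evecs = np.linalg.eigh(cov)
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ cov_inv_sqrt
    # Accumulate (slice proportion) * (slice mean)(slice mean)^T.
    M = np.zeros((p, p))
    for g in np.unique(labels):
        idx = labels == g
        m = Z[idx].mean(axis=0)
        M += idx.mean() * np.outer(m, m)
    w, v = np.linalg.eigh(M)
    top = v[:, np.argsort(w)[::-1][:d]]
    # Map back to the original predictor scale and normalize columns.
    B = cov_inv_sqrt @ top
    return B / np.linalg.norm(B, axis=0)


# Toy data (assumption): Y depends on X only through the first predictor.
rng = np.random.default_rng(0)
n, p = 400, 6
X = rng.normal(size=(n, p))
index = X[:, 0]
Y = np.column_stack([
    index + 0.2 * rng.normal(size=n),
    np.exp(0.5 * index) + 0.2 * rng.normal(size=n),
])

# Slice Y with Ward's method and estimate one direction.
tree = linkage(Y, method="ward")
labels = ward_slices(tree, 5)
B = inverse_regression_directions(X, labels, d=1)

# Determinism: rebuilding the tree on the same data gives the same slices.
labels_again = ward_slices(linkage(Y, method="ward"), 5)
same_every_run = np.array_equal(labels, labels_again)

# Nestedness: every slice of the 6-slice cut lies inside one 5-slice cluster.
fine = ward_slices(tree, 6)
nested = all(len(np.unique(labels[fine == g])) == 1 for g in np.unique(fine))

print("estimated direction:", np.round(B[:, 0], 3))
print("same slices on rerun:", same_every_run, "| nested cuts:", nested)
```

Swapping `ward_slices` for a K-means labelling with random initialization would leave `inverse_regression_directions` unchanged, which is why the choice of slicing scheme can be compared in isolation; the K-means labels, however, would generally depend on the random start.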