DSpace at EWHA: Comparison of Classification methods for imbalanced data

Title: Comparison of Classification methods for imbalanced data

Authors: 김동아

Issue Date: 2010

Department/Major: Graduate School, Department of Statistics

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 송종우

Abstract: Classification is one of the most useful methodologies in modern statistics. This thesis implements classification methods using Logistic Regression, Neural Networks, Support Vector Machines, Decision Tree, K-nearest neighbour, and Boosting. In particular, these methods are compared on imbalanced data. Imbalanced data, as the name suggests, are data in which the group proportions differ, which makes classification difficult. To classify imbalanced data, results are compared across four approaches: original data, down sampling, up sampling, and different loss. Chapter 1 defines the terminology and introduces imbalanced data; Chapter 2 introduces the classification methods; Chapter 3 compares the results of the methods on simple data; and Chapter 4 uses real data to examine which method performs best.

In this paper, I analyze the performance of six classification methods: Logistic Regression, Neural Networks, Support Vector Machines, Decision Tree, K-nearest neighbor, and Generalized Boosted Regression Modeling. I compare the methods on imbalanced data. Imbalanced data are inherently difficult to classify because of the difference in size between the major group and the minor group. For that reason, I consider four ways of dealing with the imbalanced data classification problem: 'original data', 'down sampling', 'up sampling', and 'different loss change'. My study, which uses simple data sets with different class ratios and one real data set, shows that classification methods using 'down sampling', 'up sampling', and 'different loss change' perform more consistently than 'original data' classification methods.
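The three alternatives to the 'original data' approach named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names (`down_sample`, `up_sample`, `class_weights`), the 90/10 toy labels, and the use of inverse-frequency weights as a stand-in for 'different loss' are all assumptions made for illustration, since the record does not give the actual procedures or loss values.

```python
import random
from collections import Counter


def group_indices(labels):
    """Map each class label to the list of row indices that carry it."""
    groups = {}
    for i, y in enumerate(labels):
        groups.setdefault(y, []).append(i)
    return groups


def down_sample(labels, rng):
    """Randomly shrink every class to the size of the smallest class."""
    groups = group_indices(labels)
    n_min = min(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(rng.sample(idx, n_min))  # sampling without replacement
    return sorted(keep)


def up_sample(labels, rng):
    """Randomly grow every class to the size of the largest class."""
    groups = group_indices(labels)
    n_max = max(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        extra = rng.choices(idx, k=n_max - len(idx))  # with replacement
        keep.extend(idx + extra)
    return sorted(keep)


def class_weights(labels):
    """Inverse-frequency weights n / (k * n_c); an assumed stand-in for
    'different loss', where errors on the minor group cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


# Hypothetical toy data: 90 major-group rows (0) and 10 minor-group rows (1).
labels = [0] * 90 + [1] * 10
rng = random.Random(0)

down = down_sample(labels, rng)
up = up_sample(labels, rng)
weights = class_weights(labels)

print(Counter(labels[i] for i in down))
print(Counter(labels[i] for i in up))
print(weights)
```

The returned index lists would be used to select training rows before fitting any of the classifiers, while the weights would be passed to a classifier that accepts per-class costs. Resampling should be applied to the training split only, so that the test set keeps the original class ratio.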