DSpace at EWHA: Comparison of Classification methods for imbalanced data

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	송종우	-
dc.contributor.author	김동아	-
dc.creator	김동아	-
dc.date.accessioned	2016-08-25T10:08:00Z	-
dc.date.available	2016-08-25T10:08:00Z	-
dc.date.issued	2010	-
dc.identifier.other	OAK-000000060692	-
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/185926	-
dc.identifier.uri	http://dcollection.ewha.ac.kr/jsp/common/DcLoOrgPer.jsp?sItemId=000000060692	-
dc.description.abstract	Classification is one of the most useful methodologies in modern statistics. This thesis implements classification methods using Logistic Regression, Neural Networks, Support Vector Machines, Decision Tree, K-nearest neighbour, and Boosting. In particular, it compares these methods on imbalanced data. Imbalanced data, as the name suggests, are data in which the proportions of the groups differ, which makes classification difficult. To classify imbalanced data, the results of four approaches are compared: original data, down sampling, up sampling, and different loss. Chapter 1 defines terms and introduces imbalanced data, Chapter 2 introduces the classification methods, Chapter 3 compares the results of the methods on simple data, and Chapter 4 uses real data to examine which method performs best.;In this paper, I analyze the performance of six classification methods: Logistic regression, Neural Networks, Support vector machines, Decision tree, K-nearest neighbor, and Generalized Boosted Regression Modeling. I compare the methods on imbalanced data. Imbalanced data are inherently difficult to classify because of the difference in proportion between the major group and the minor group. For that reason, I propose four approaches to the imbalanced data classification problem: 'original data', 'down sampling', 'up sampling', and 'different loss change'. My study, which uses simple data sets with different class ratios and one real data set, shows that classification methods using 'down sampling', 'up sampling', and 'different loss change' perform more consistently than classification methods using 'original data'.	-
dc.description.tableofcontents	Ⅰ. Introduction 1; A. What is classification? 1; B. Why is imbalanced data difficult? 1; C. Which approaches were used? 1; Ⅱ. Classification methods 3; A. Logistic regression 3; B. Neural Networks 4; C. Support vector machines 5; D. Decision Tree 7; E. K-nearest neighbour 8; F. GBM (Generalized Boosted Regression Modeling) 8; Ⅲ. Simulation of study 10; A. 9:1 data 10; 1. Overlapping, 2-dimensional 10; 2. Overlapping, 10-dimensional 17; 3. Perfect separate, 2-dimensional 21; B. 8:2 data 26; 1. Overlapping, 2-dimensional 26; 2. Overlapping, 10-dimensional 29; C. 9.5:0.5 data 31; 1. Overlapping, 2-dimensional 31; 2. Overlapping, 10-dimensional 34; Ⅳ. Real Data 36; A. Description of variables 36; B. Simulation 37; Ⅴ. Conclusion 42; References 43; ABSTRACT 44	-
dc.format	application/pdf	-
dc.format.extent	1151843 bytes	-
dc.language	kor	-
dc.publisher	이화여자대학교 대학원	-
dc.title	Comparison of Classification methods for imbalanced data	-
dc.type	Master's Thesis	-
dc.creator.othername	Kim, Dong Ah	-
dc.format.page	vii, 44 p.	-
dc.identifier.thesisdegree	Master	-
dc.identifier.major	Graduate School, Department of Statistics	-
dc.date.awarded	2010. 8	-
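
The abstract names down sampling, up sampling, and a different loss as ways of handling imbalanced data, but this record does not say how the thesis implemented them or which software it used. The following is only a minimal sketch of one common reading of each idea for a two-class problem, written in standard-library Python. The function names, the 9:1 toy labels, and the inverse-frequency weighting formula are all assumptions made for illustration and are not taken from the thesis.

```python
import random
from collections import Counter


def split_by_class(labels):
    """Return (majority_label, minority_label, row indices grouped by label)."""
    by_label = {}
    for i, y in enumerate(labels):
        by_label.setdefault(y, []).append(i)
    # Assumes exactly two classes; the most frequent one comes first.
    (major, _), (minor, _) = Counter(labels).most_common(2)
    return major, minor, by_label


def down_sample(labels, rng):
    """Keep every minority row plus an equal-sized random subset of majority rows."""
    major, minor, idx = split_by_class(labels)
    kept = rng.sample(idx[major], len(idx[minor]))
    return sorted(kept + idx[minor])


def up_sample(labels, rng):
    """Keep every row and redraw minority rows with replacement until classes match."""
    major, minor, idx = split_by_class(labels)
    extra = rng.choices(idx[minor], k=len(idx[major]) - len(idx[minor]))
    return sorted(idx[major] + idx[minor] + extra)


def class_weights(labels):
    """Weight each class inversely to its frequency (one possible unequal loss)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


rng = random.Random(0)            # fixed seed so the sketch is repeatable
labels = [0] * 90 + [1] * 10      # hypothetical 9:1 toy labels
down = down_sample(labels, rng)   # row indices of a balanced 10 + 10 subset
up = up_sample(labels, rng)       # row indices of a balanced 90 + 90 set
weights = class_weights(labels)   # per-class multipliers for the loss
print(len(down), len(up), weights[1])  # prints: 20 180 5.0
```

The functions return row indices rather than copies of the data, so the same index lists can be applied to a feature matrix and a label vector alike. The weights from `class_weights` would typically be passed to a classifier that accepts per-class or per-observation weights. Whether the thesis's 'different loss' corresponds to this weighting is an assumption.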