DSpace at EWHA: Data Analysis Using Multiclass Classification Methods

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	송종우	-
dc.contributor.author	김지연	-
dc.creator	김지연	-
dc.date.accessioned	2016-08-26T04:08:40Z	-
dc.date.available	2016-08-26T04:08:40Z	-
dc.date.issued	2014	-
dc.identifier.other	OAK-000000085158	-
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/210651	-
dc.identifier.uri	http://dcollection.ewha.ac.kr/jsp/common/DcLoOrgPer.jsp?sItemId=000000085158	-
dc.description.abstract	Many of the methods proposed for classification with a categorical response variable were developed mainly for the two-class case. Recently, however, the need for multi-class rather than two-class classification has been growing, particularly in image-based pattern recognition and in biotechnology data, and many packages for multi-class classification are being developed accordingly. This thesis goes beyond the two-class classification commonly used in data mining and introduces the algorithms behind multi-class classification models. It also briefly examines how the algorithms change when moving from two-class to multi-class classification. The two-class algorithms used are ADA (AdaBoost), one of the representative boosting methods, and SVM, an algorithm that has recently attracted much attention. For ADA there are two multi-class variants: AdaBoost.MH, which applies the One vs Rest approach of reducing a multi-class problem to two-class problems, and AdaBoost.SAMME, which attempts multi-class classification by the Bayes rule using a multiclass exponential loss. Multi-class SVM and Random Forest are also applied, and all methods are run on multi-class simulated data and on real data to compute misclassification rates and compare their strengths and weaknesses. Because no single method always performs well on every data set in multi-class problems, much research is being done to compare the predictive power of algorithms and find methods with good classification performance. This thesis applies the various methods used in multi-class classification to the data, compares their performance by misclassification rate, and repeats the comparison when the data contain noise, with the aim of proposing a way to find and apply a well-performing model when little is known about the data.;Nowadays, there has been much research on multi-class classification methods. In going from two-class to multi-class classification, most algorithms have been restricted to reducing the multi-class classification problem to multiple two-class problems. This paper introduces binary classification methods and multi-class classification methods, and reviews the procedure that extends binary classification to multi-class classification. It compares the performance of various algorithms on simulated datasets and a real dataset. The simulated datasets consist of explanatory variables and noise variables. First, the binary classification problem is classified using the binary classification algorithms AdaBoost and Support Vector Machines. Second, the multi-class classification problem is classified using Random Forest, pairwise Support Vector Machines, W&W Support Vector Machines, AdaBoost.MH and AdaBoost.SAMME. Performance is compared on simulated datasets without noise variables and with noise variables. Based on the comparison of model performance, the best-performing algorithm is proposed.	-
dc.description.tableofcontents	1. Introduction (p. 1); 2. 2-Class classification methods (p. 2); 2.1 Adaboost (p. 2); 2.2 SVM (p. 5); 3. Multiclass classification methods (p. 8); 3.1 Multiclass ADAboost (p. 8); 3.1.1 Adaboost.MH (p. 8); 3.1.2 Adaboost.SAMME (p. 8); 3.2 Multiclass SVM (p. 14); 3.3 Random Forest (p. 14); 4. Data Analysis (p. 16); 4.1 Simulation (p. 16); 4.1.1 Data Generation (p. 16); 4.1.2 Model Comparison (no noise) (p. 19); 4.1.3 Model Comparison (with noise) (p. 20); 4.2 Real Data (p. 24); 4.2.1 Data Description (p. 24); 4.2.2 Model Comparison (p. 24); 5. Summary (p. 26); References (p. 28); ABSTRACT (p. 29)	-
dc.format	application/pdf	-
dc.format.extent	1121721 bytes	-
dc.language	kor	-
dc.publisher	The Graduate School, Ewha Womans University	-
dc.subject.ddc	500	-
dc.title	Data Analysis Using Multiclass Classification Methods	-
dc.type	Master's Thesis	-
dc.title.subtitle	A Comparative Analysis of the ADA, SVM and RF Methods	-
dc.title.translated	A Study of Multiclass Classification Methods: Comparing SVM, Random Forest and ADA	-
dc.creator.othername	Kim, Ji-youn	-
dc.format.page	vii, 29 p.	-
dc.identifier.thesisdegree	Master	-
dc.identifier.major	Graduate School, Department of Statistics	-
dc.date.awarded	2014. 2	-
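
The abstract above mentions three ideas that are easy to state in code: the One vs Rest reduction of a multi-class problem to two-class problems, the misclassification rate used to compare methods, and the extra log(K - 1) term that distinguishes the SAMME weak-learner weight from the two-class AdaBoost weight. The following is a minimal Python sketch of those ideas only. The function names and the toy labels are hypothetical illustrations and are not taken from the thesis, whose actual code and data are not part of this record.

```python
import math


def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-class target as +1 (positive_class) versus -1 (all other classes)."""
    return [1 if y == positive_class else -1 for y in labels]


def misclassification_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def samme_alpha(weighted_error, n_classes):
    """Weak-learner weight in SAMME: log((1 - err) / err) + log(K - 1).

    With K = 2 the second term is log(1) = 0, so this reduces to the
    usual two-class AdaBoost weight.
    """
    return math.log((1 - weighted_error) / weighted_error) + math.log(n_classes - 1)


# Toy labels (hypothetical): one binary problem is built per class.
labels = ["a", "b", "c", "a"]
binary_problems = {k: one_vs_rest_labels(labels, k) for k in sorted(set(labels))}
```

In this sketch each class gets its own two-class target, so a binary learner such as AdaBoost or an SVM could be fitted once per class. The log(K - 1) term means a weak learner in SAMME only needs a weighted error below (K - 1) / K, rather than below 1/2, to receive a positive weight.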