DSpace at EWHA: A study on the effect of imputation for semi-supervised regression and classification

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	이동환	-
dc.contributor.author	서지은	-
dc.creator	서지은	-
dc.date.accessioned	2023-02-24T16:31:16Z	-
dc.date.available	2023-02-24T16:31:16Z	-
dc.date.issued	2023	-
dc.identifier.other	OAK-000000201932	-
dc.identifier.uri	https://dcollection.ewha.ac.kr/common/orgView/000000201932	en_US
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/264459	-
dc.description.abstract	Modeling with unlabeled data has become increasingly important: as the variety of available data has grown, so has the volume of data whose response variable is missing, that is, unlabeled data. When unlabeled data are present, the simplest solution is to remove them and use only the labeled data, but doing so can discard important information. Labeling the unlabeled observations directly has also been proposed, but it costs a great deal of time and money. Accordingly, various imputation algorithms and semi-supervised algorithms are being studied as ways to use unlabeled data efficiently. In this thesis, we measure how much model performance improves when the KNN algorithm, the MICE algorithm, a semi-supervised learning algorithm, and the CObc algorithm are used, compared with removing all unlabeled data. In particular, we examine how the improvement from each algorithm changes as the ratio of unlabeled data varies. To this end, we conduct experiments in simulation studies under various settings and in a real data analysis. The results confirm that model performance improves when the imputation and semi-supervised algorithms are used compared with using only the labeled data, and that the improvement grows as the ratio of missing responses increases.	-
dc.description.tableofcontents	Ⅰ. Introduction, 1; Ⅱ. Methodology, 3; A. Imputation Algorithms, 3; A.1 k-nearest neighbors (KNN), 3; A.2 Multivariate Imputation by Chained Equations (MICE), 4; B. Semi-Supervised Random Forests, 5; C. Co-Training by Committee (CObc), 6; Ⅲ. Simulation, 8; A. Scenario 1: When the responses are continuous, 8; B. Scenario 2: When the responses are binary, 10; Ⅳ. Real data example, 12; A. Data Description, 12; B. Results when the responses are continuous, 13; C. Results when the responses are binary, 14; Ⅴ. Conclusion, 16; Bibliography, 18; Abstract (in Korean), 19	-
dc.format	application/pdf	-
dc.format.extent	1593182 bytes	-
dc.language	eng	-
dc.publisher	이화여자대학교 대학원	-
dc.subject.ddc	500	-
dc.title	A study on the effect of imputation for semi-supervised regression and classification	-
dc.type	Master's Thesis	-
dc.creator.othername	Seo, Jieun	-
dc.format.page	iii, 19 p.	-
dc.identifier.thesisdegree	Master	-
dc.identifier.major	Department of Statistics, Graduate School	-
dc.date.awarded	2023. 2	-
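
The comparison described in the abstract (removing unlabeled observations versus imputing their missing responses) can be illustrated with a small sketch. This is not the thesis's code or experimental design: the data-generating model, the one-dimensional predictor, the choice of k, the simple linear fit, and every function name below are assumptions made only for illustration. The sketch reports both errors without claiming which approach comes out ahead.

```python
import random
import statistics


def knn_impute_responses(xs, ys, k=5):
    """Replace each missing response (None) with the mean response
    of the k labeled points whose predictor value is closest."""
    labeled = [(x, y) for x, y in zip(xs, ys) if y is not None]
    filled = []
    for x, y in zip(xs, ys):
        if y is None:
            nearest = sorted(labeled, key=lambda p: abs(p[0] - x))[:k]
            y = statistics.fmean([p[1] for p in nearest])
        filled.append(y)
    return filled


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rmse(model, xs, ys):
    """Root mean squared prediction error of a fitted line."""
    a, b = model
    return (sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


def simulate(n=200, missing_ratio=0.7, k=5, seed=0):
    """Return (rmse_labeled_only, rmse_knn_imputed) on fresh test data.

    The model y = 1 + 2x + noise is a hypothetical choice, not the
    scenario used in the thesis.
    """
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

    # Hide a fraction of the responses to create unlabeled observations.
    hidden = set(rng.sample(range(n), int(n * missing_ratio)))
    ys_obs = [None if i in hidden else y for i, y in enumerate(ys)]

    x_test = [rng.uniform(0, 10) for _ in range(500)]
    y_test = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in x_test]

    # Baseline: fit on the labeled observations only.
    lab_x = [x for x, y in zip(xs, ys_obs) if y is not None]
    lab_y = [y for y in ys_obs if y is not None]
    baseline = fit_line(lab_x, lab_y)

    # Alternative: impute the missing responses, then fit on all observations.
    imputed = fit_line(xs, knn_impute_responses(xs, ys_obs, k=k))

    return rmse(baseline, x_test, y_test), rmse(imputed, x_test, y_test)


if __name__ == "__main__":
    for ratio in (0.3, 0.5, 0.7, 0.9):
        base, imp = simulate(missing_ratio=ratio)
        print(f"missing={ratio:.1f}  labeled-only RMSE={base:.3f}  KNN-imputed RMSE={imp:.3f}")
```

Running the script prints both errors for several missing ratios, mirroring the abstract's question of how the effect changes as the share of unlabeled data varies. A fuller replication would also need MICE, semi-supervised random forests, and CObc, which this sketch does not attempt.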