DSpace at EWHA: New Sampling Approach to Zero-Inflated Count Data Analysis

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	이은경	-
dc.contributor.author	김연정	-
dc.creator	김연정	-
dc.date.accessioned	2020-02-03T16:33:06Z	-
dc.date.available	2020-02-03T16:33:06Z	-
dc.date.issued	2020	-
dc.identifier.other	OAK-000000163749	-
dc.identifier.uri	http://dcollection.ewha.ac.kr/common/orgView/000000163749	en_US
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/253242	-
dc.description.abstract	Count data with excess zeros are common in many fields. However, the resulting class imbalance makes it difficult to predict the response variable for new data. Conventional models, namely the Poisson and negative binomial models, tend to underestimate the probability of zero. Several modeling methods, such as the zero-inflated Poisson (ZIP) model and the Poisson hurdle model, have been proposed to address the imbalance in count data with excess zeros. This thesis proposes a sampling-based method for handling zero-inflated count data by extending the ROSE (Random Over-Sampling Examples) strategy, which was developed to mitigate class imbalance in binary data, to count data. The extended ROSE strategy generates zero-deflated data, which allows better prediction. Simulation results show that the new strategy performs better than the original modeling methods on some zero-inflated data sets. It also predicts fish counts better on a data set with a 56% zero proportion, and it takes less computing time than the zero-inflated Poisson model. ; Abstract (in Korean), translated: Zero-inflated count data are easy to find in many fields, such as insurance, manufacturing, and medicine. However, data imbalance makes it difficult to predict the response variable of new data. The Poisson and negative binomial distributions, the models generally used to fit count data, tend to predict a probability of zero that is smaller than the actual one. Accordingly, several modeling methodologies for zero-inflated count data have been proposed, notably the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models. This thesis proposes a sampling-based method for zero-inflated count data. The ROSE (Random Over-Sampling Examples) technique, proposed for imbalanced binary data, is suitably extended to count data. ROSE was proposed to improve predictive power on imbalanced binary data, and it aims to remedy the limitations of existing oversampling, undersampling, and SMOTE (Synthetic Minority Over-sampling Technique) approaches, namely simple duplication of data and loss of data. With the extended ROSE technique, the aim is to fit a model with better predictive power for zero-inflated count data. Simulation results show that modeling with this technique outperforms the modeling methods previously proposed for zero-inflated data on some zero-inflated data sets. The advantage of the newly proposed method is also demonstrated through an application to real data. For fish data with a 56% zero proportion, the newly proposed ROSE technique predicts the response variable better than the existing methods. In addition, a comparison of system times in R shows that the method requires less computation time than the ZIP model.	-
dc.description.tableofcontents	I. Introduction 1; II. The class imbalance problem 3; A. Imbalanced binary data 3; B. Imbalanced count data 4; III. Extend ROSE strategy to count data 7; IV. Simulation 9; V. Application 13; VI. Conclusion 17; Bibliography 18; Appendix 20; Abstract (in Korean) 22	-
dc.format	application/pdf	-
dc.format.extent	605663 bytes	-
dc.language	eng	-
dc.publisher	Graduate School of Ewha Womans University	-
dc.subject.ddc	500	-
dc.title	New Sampling Approach to Zero-Inflated Count Data Analysis	-
dc.type	Master's Thesis	-
dc.creator.othername	Kim, Yeon Jeong	-
dc.format.page	iv, 23 p.	-
dc.identifier.thesisdegree	Master	-
dc.identifier.major	Department of Statistics, Graduate School	-
dc.date.awarded	2020. 2	-
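
The abstract's remark that a plain Poisson model predicts too small a probability of zero can be illustrated with a small simulation. The following is only a minimal sketch and is not taken from the thesis: the mixing proportion `pi_zero`, the rate `lam`, the sample size, and the helper names are all hypothetical choices made for illustration, and the extended ROSE procedure itself is not reproduced here because the record does not describe its details.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_zero_inflated(n, pi_zero, lam, rng):
    """Draw n counts: a structural zero with probability pi_zero,
    otherwise an ordinary Poisson(lam) count."""
    return [
        0 if rng.random() < pi_zero else sample_poisson(lam, rng)
        for _ in range(n)
    ]


# Hypothetical settings, chosen only for illustration.
rng = random.Random(0)
counts = sample_zero_inflated(n=10_000, pi_zero=0.5, lam=3.0, rng=rng)

# Share of zeros actually present in the simulated data.
observed_zero = counts.count(0) / len(counts)

# Intercept-only Poisson fit: the maximum likelihood estimate of the
# rate is the sample mean, and the fitted P(Y = 0) is exp(-rate).
lam_hat = sum(counts) / len(counts)
poisson_zero = math.exp(-lam_hat)

print(f"observed zero proportion: {observed_zero:.3f}")
print(f"Poisson-fitted P(Y = 0):  {poisson_zero:.3f}")
```

Under these assumed settings, the fitted Poisson zero probability comes out well below the observed share of zeros, which is the gap that zero-inflated models, hurdle models, and the sampling-based approach described in the abstract are meant to address.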