DSpace at EWHA: New Sampling Approach to Zero-Inflated Count Data Analysis

Title: New Sampling Approach to Zero-Inflated Count Data Analysis

Authors: 김연정

Issue Date: 2020

Department/Major: Department of Statistics, Graduate School

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 이은경

Abstract: Count data with excess zeros are common in various fields. However, class imbalance in such data makes it difficult to predict the response variable for new data. Conventional models, namely the Poisson and negative binomial models, tend to underestimate the probability of zero. Several modeling methods, such as the zero-inflated Poisson model and the Poisson hurdle model, have been proposed to address the imbalance problem in count data with excess zeros. In this paper, a sampling-based method is proposed to handle zero-inflated count data. We extend the ROSE (Random Over-Sampling Examples) strategy, which was developed to mitigate class imbalance in binary data, to count data. With this extended ROSE strategy, we can generate zero-deflated data and make better predictions. Simulation results show that this new strategy performs better than the original modeling methods on some zero-inflated data sets. It also performs better in predicting fish counts in a data set in which 56% of the counts are zero, and it takes less computing time than the zero-inflated Poisson model.

Abstract (translated from the Korean): Count data with excess zeros are easy to find in many fields, including insurance, manufacturing, and medicine. However, data imbalance makes it difficult to predict the response variable for new data. The Poisson and negative binomial distributions, the models generally used to fit count data, tend to predict a smaller probability of zero than the actual one. Several modeling methods have therefore been proposed for count data with excess zeros, most notably the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models. This thesis proposes a sampling-based method for count data with excess zeros. The ROSE (Random Over-Sampling Examples) technique, proposed for imbalanced binary data, is extended to suit count data. ROSE was proposed to improve predictive performance on imbalanced binary data, and it aims to overcome the limitations of conventional oversampling, undersampling, and SMOTE (Synthetic Minority Over-sampling Technique), namely simple duplication of data and loss of data. With this extended ROSE technique, the aim is to fit a model with better predictive performance for count data with excess zeros. Simulation results show that, for some zero-inflated data sets, modeling with this technique performs better than the modeling methods previously proposed for zero-inflated data. The proposed method is also applied to real data: for the fish data, in which 56% of the counts are zero, the extended ROSE technique predicts the response variable better than the existing methods. In addition, a comparison of system times in R shows that the method takes less computation time than the ZIP model.
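The abstract's remark that a plain Poisson model predicts too few zeros can be illustrated with a small simulation. The sketch below is not the thesis's method and does not implement the extended ROSE strategy; the function name `zero_probability_gap` and the parameter values (`pi_zero`, `lam`, `n`, `seed`) are assumptions chosen only for illustration. It draws zero-inflated Poisson counts, fits an intercept-only Poisson model (whose maximum-likelihood rate is the sample mean), and compares the observed share of zeros with the share that the fitted Poisson implies.

```python
import numpy as np


def zero_probability_gap(n=10_000, pi_zero=0.5, lam=3.0, seed=0):
    """Return (observed zero share, zero share implied by a plain Poisson fit).

    All parameter values are illustrative assumptions, not taken from the thesis.
    """
    rng = np.random.default_rng(seed)
    # Zero-inflated Poisson draw: a structural zero with probability pi_zero,
    # otherwise an ordinary Poisson(lam) count (which may itself be zero).
    structural_zero = rng.random(n) < pi_zero
    y = np.where(structural_zero, 0, rng.poisson(lam, size=n))
    observed = float(np.mean(y == 0))
    # The intercept-only Poisson MLE of the rate is the sample mean,
    # so the fitted model assigns P(Y = 0) = exp(-mean).
    poisson_implied = float(np.exp(-y.mean()))
    return observed, poisson_implied


observed, implied = zero_probability_gap()
print(f"observed zero share:        {observed:.3f}")
print(f"Poisson-implied zero share: {implied:.3f}")
```

Under these assumed settings the observed zero share should be clearly larger than the Poisson-implied one, which is the gap that zero-inflated models and the sampling-based approach described in the abstract are meant to address.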