DSpace at EWHA: Applying Binary and Count Sampling Strategies for Zero-Inflated Models

Title: Applying Binary and Count Sampling Strategies for Zero-Inflated Models

Authors: 주선미

Issue Date: 2022

Department/Major: Graduate School, Department of Statistics

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 이은경

Abstract: Count data with excess zeros arise in many fields, including insurance, manufacturing, medicine, and commerce. Existing approaches to modeling such zero-inflated data include Poisson and Negative Binomial generalized linear models, zero-inflated models, and zero-altered models. In this paper, we propose four methods that aim to improve predictive performance over these existing models. All four combine sampling strategies with additional binary classification models: Decision Tree and Support Vector Machine, alongside Logistic Regression. Binary sampling and count sampling are applied to improve the predictive performance of both the zero part and the count part. We recast the problem of separating the zero part from the count part as a binary classification problem on imbalanced data and apply binary sampling with the Synthetic Minority Over-sampling TEchnique (SMOTE) or Random Over-Sampling Examples (ROSE). For the count part, the Synthetic Minority Over-sampling TEchnique for Regression (SMOTER), an existing sampling method for continuous data, is modified to suit count data. When applied to 199 music play count data sets, the sampling techniques improved model performance, although the degree of improvement differed across the four methods. In particular, method 4, which applies sampling to a modified zero-altered regression model, achieved a large performance gain. This paper thus contributes new methods that can help improve the performance of zero-inflated count models.
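The two sampling ideas named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how SMOTER was modified for count data, so the rounding rule below (interpolate the target, round to the nearest integer, keep it at 1 or above) is an assumption. The function names, the toy data, and the choice to treat the nonzero rows as the rarer class are likewise hypothetical.

```python
import math
import random


def _nearest(i, X, k):
    """Indices of the k rows closest to X[i] (Euclidean), excluding i itself."""
    others = [j for j in range(len(X)) if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def smote_rows(X, n_new, k=3, seed=0):
    """SMOTE-style binary sampling: synthesize feature rows for the rarer class
    by interpolating between a row and one of its k nearest neighbours.
    Requires at least two rows in X."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X))
        j = rng.choice(_nearest(i, X, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


def smoter_counts(X, y, n_new, k=3, seed=0):
    """SMOTER-style count sampling: interpolate features and target together.
    ASSUMPTION: the interpolated target is rounded to an integer and floored
    at 1 so that it stays a valid positive count."""
    rng = random.Random(seed)
    new_X, new_y = [], []
    for _ in range(n_new):
        i = rng.randrange(len(X))
        j = rng.choice(_nearest(i, X, k))
        gap = rng.random()
        new_X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        target = y[i] + gap * (y[j] - y[i])  # same weight as the features
        new_y.append(max(1, round(target)))
    return new_X, new_y


# Hypothetical toy data: feature rows with a positive play count.
X_pos = [[1.0, 2.0], [1.5, 2.5], [3.0, 1.0], [4.0, 4.0], [2.0, 3.5]]
y_pos = [1, 2, 7, 15, 3]

# Binary part: add synthetic nonzero rows to balance zero vs. nonzero.
X_syn = smote_rows(X_pos, n_new=4)

# Count part: add synthetic (features, count) pairs for the positive counts.
X_new, y_new = smoter_counts(X_pos, y_pos, n_new=6)
```

In a two-part setup such as a zero-altered model, rows like `X_syn` would be added to the training set of the zero-versus-nonzero classifier, and pairs like `(X_new, y_new)` to the training set of the count regression. ROSE, the other binary sampler mentioned, draws synthetic rows from a smoothed kernel density estimate rather than along segments between neighbours, and is not shown here.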