DSpace at EWHA: Comparison of oversampling methods for dealing with imbalanced data in the binary classification problem

Title: Comparison of oversampling methods for dealing with imbalanced data in the binary classification problem

Authors: 박정현

Issue Date: 2022

Department/Major: Graduate School, Department of Statistics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 이동환

Abstract: Severe class imbalance is common in real-world binary classification. When one class greatly outnumbers the other, a classifier may assign every observation to the majority class. Cost-sensitive learning and data preprocessing have been studied as remedies. This study examines oversampling, the most widely used preprocessing approach for rebalancing skewed data. We compare random oversampling (ROS), the synthetic minority over-sampling technique (SMOTE), adaptive synthetic sampling (ADASYN), and random over-sampling examples (ROSE) in terms of prediction, estimation, and interpretation performance. Logistic regression and random forest, two commonly used classifiers, serve as the classification models. We design four simulation scenarios to evaluate the oversampling methods under various conditions. ROSE tends to improve predictive performance in random forests. Logistic regression is largely unaffected by oversampling overall; however, when the number of events is smaller than the number of predictor dimensions, ROSE reduces the bias of the coefficient estimates and improves AUC. Even when oversampling improves predictive performance, model interpretation and estimation performance do not always improve as well. In an analysis of real data, we also confirm that variable importance changes with the type of oversampling method used.
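For readers unfamiliar with the methods named in the abstract, the difference between ROS and SMOTE can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `random_oversample` and `smote_like`, the toy data, and the choice of `k` are all assumptions made for illustration. ROS duplicates existing minority rows, whereas a SMOTE-style method creates new rows by interpolating between a minority row and one of its nearest minority neighbours.

```python
import random

def random_oversample(minority, n_new, rng):
    """ROS sketch: copy existing minority rows, drawn with replacement."""
    return [list(rng.choice(minority)) for _ in range(n_new)]

def smote_like(minority, n_new, k, rng):
    """SMOTE-style sketch: interpolate between a minority row and
    one of its k nearest minority neighbours (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by distance to the base row.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical toy minority class with two features.
rng = random.Random(0)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
ros_rows = random_oversample(minority, 6, rng)
smote_rows = smote_like(minority, 6, k=2, rng=rng)
```

Every row in `ros_rows` is an exact copy of an observed minority row, while rows in `smote_rows` lie on line segments between observed rows. ADASYN and ROSE generate synthetic rows in other ways (adaptive weighting toward hard-to-learn regions and a smoothed bootstrap, respectively) and are not sketched here.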