DSpace at EWHA: An Undersampling Approach based on Clustering for Imbalanced Data in Bankruptcy Prediction Model

Title: An Undersampling Approach based on Clustering for Imbalanced Data in Bankruptcy Prediction Model

Authors: 김예슬

Issue Date: 2018

Department/Major: Graduate School, Department of Business Administration

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 신경식

Abstract: Bankruptcy prediction is important for the decision making of creditors and investors, and it has been studied actively. Prior work has investigated three major methodologies for building bankruptcy prediction models: traditional statistical models, market-based models, and artificial intelligence models.
Although the performance of a bankruptcy prediction model depends on the selected features, the classifier, and the samples used to fit the model, most of the literature has concentrated on improving the classifier. In practice, bankruptcy cases are rare, while non-bankrupt cases are common. The ratio of bankrupt to non-bankrupt companies can range from 1:100 to 1:1000, so the majority (non-bankrupt) class is far larger than the minority (bankrupt) class, which creates a serious class imbalance problem. In the literature, two kinds of strategy have been studied for handling imbalanced data: data-level and algorithm-level strategies. The former preprocesses the data to balance the class distribution, using undersampling or oversampling to rebalance the imbalanced data set. Undersampling eliminates instances from the majority class to reduce the inconsistency between classes, while oversampling duplicates instances from the minority class. In the algorithm-level approach, conventional classification algorithms are fine-tuned to improve learning, especially for the minority class. Within the data-level approach, oversampling provides no new information about the class because it only duplicates existing instances, while undersampling can discard potentially important information about the class. Many strategies exist for undersampling the majority class, such as random undersampling, data cleaning, and cluster-based undersampling. This study applies several undersampling techniques to handle imbalanced data sets and proposes a clustering-based undersampling approach that explores the characteristics of non-bankrupt companies and clusters them according to financial structure. To select representative samples from each cluster, a centroid is calculated for each cluster.
In the literature, cluster-based undersampling techniques either select the centroid of each cluster, or the instance nearest to it, as the majority-class sample, or cluster the data and draw samples from each cluster in proportion to its size. To choose the samples that best represent the non-bankrupt class, this study proposes a cluster-based undersampling method that considers both the distance of samples from the cluster centroid and the proportion of each cluster. The result is a balanced training set of financially sound non-bankrupt companies and bankrupt companies, which is expected to be suitable for training a classifier and building a bankruptcy prediction model. The effect of the sampling techniques is examined on data from externally non-audited small and medium-sized Korean manufacturing companies, which exhibit severe class imbalance. Bankruptcy prediction models are built with four undersampling techniques and two different classifiers, and their performance is evaluated using multiple criteria.
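
The idea of combining centroid distance with cluster proportion can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not name the clustering algorithm, so plain k-means is assumed here, and the function names (`kmeans`, `proportional_quotas`, `cluster_undersample`), the largest-remainder rounding of the per-cluster quotas, and the toy two-feature data are all hypothetical choices made for illustration.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


def proportional_quotas(sizes, n_keep):
    """Split n_keep across clusters in proportion to their sizes."""
    total = sum(sizes)
    exact = [n_keep * s / total for s in sizes]
    quotas = [int(q) for q in exact]
    # Largest-remainder rounding so the quotas sum exactly to n_keep.
    order = sorted(range(len(sizes)), key=lambda c: exact[c] - quotas[c], reverse=True)
    for c in order[: n_keep - sum(quotas)]:
        quotas[c] += 1
    return quotas


def cluster_undersample(majority, n_keep, k, seed=0):
    """Return indices of n_keep majority samples: each cluster contributes
    its proportional quota, filled with the members nearest its centroid."""
    centroids, labels = kmeans(majority, k, seed=seed)
    quotas = proportional_quotas([labels.count(c) for c in range(k)], n_keep)
    kept = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        members.sort(key=lambda i: math.dist(majority[i], centroids[c]))
        kept.extend(members[: quotas[c]])
    return sorted(kept)


# Hypothetical toy data: 12 non-bankrupt firms in two groups, 3 bankrupt firms.
majority = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (0.5, 0.5), (0.2, 0.8), (0.8, 0.2), (0.5, 0.4),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (10.5, 10.5),
]
minority = [(5.0, 5.0), (5.5, 4.5), (4.5, 5.5)]

# Keep as many majority samples as there are minority samples.
kept = cluster_undersample(majority, n_keep=len(minority), k=2)
balanced = [(majority[i], 0) for i in kept] + [(x, 1) for x in minority]
print(kept, len(balanced))
```

On this toy data the larger group holds two thirds of the majority samples, so it contributes two of the three retained samples and the smaller group contributes one. The balanced set would then be passed to whichever classifier is being trained.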