DSpace at EWHA: A Study on the Imbalanced Data Problem for Bankruptcy Prediction Modeling

Title: A Study on the Imbalanced Data Problem for Bankruptcy Prediction Modeling

Other Titles: 부도예측모형 불균형 데이터 문제 개선에 관한 연구

Authors: YIN MENGQING

Issue Date: 2019

Department/Major: Graduate School, Department of Business Administration

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 신경식

Abstract: Numerous machine learning algorithms and data mining methodologies have been developed, and applying them to real-world problems has attracted considerable attention. Data play a crucial role in this field; however, in many cases they suffer from the imbalanced data problem, that is, a skewed distribution in which the classes have unequal numbers of data points. In general, conventional classification algorithms assume a training dataset that is balanced across classes. Learning from imbalanced data therefore degrades performance, which poses challenges for the research community and for data-driven applications. Moreover, it hinders a classifier from detecting the patterns of the minority class, which are usually regarded as the most important ones and attract the most attention in practice. Extensive research effort has been devoted to the imbalanced data problem over the past decades, yet it remains a valuable research issue. Before the problem can be addressed, its fundamental issues must be identified. The imbalanced data problem is comparatively simple to solve when the training dataset contains enough data to learn a classification model. In contrast, when the imbalanced training dataset lacks representative data for each class, the problem becomes more difficult; this is referred to as the absolute rarity problem among the data-level issues. One straightforward solution is to generate new synthetic data that enlarge the minority class to the same size as the majority class, which is referred to as oversampling.
Although many oversampling methods have been proposed to tackle the imbalanced data problem, the problem has yet to be solved efficiently, because it is tied to information extraction and fundamental data representation. In business research, bankruptcy prediction modeling has been an intensively studied topic since the late 1960s. It is a binary classification problem that bears on various decisions of financial institutions and on a country's sustainable economic growth. Many scholars have sought to build accurate bankruptcy prediction models by applying state-of-the-art methodologies. However, few studies have focused on the imbalanced data problem in bankruptcy prediction modeling, which frequently suffers from the absolute rarity problem. The main purpose of this dissertation is to propose two novel oversampling approaches that overcome the drawbacks of previous studies on the data-level fundamental issues of the imbalanced data problem, especially in bankruptcy prediction modeling. In particular, it addresses the underlying absolute rarity problem. The two approaches take into account the importance level of the data with respect to the learning process, the data distribution, and the degree to which the minority-class data contain effective and representative patterns. New synthetic data are generated based on these data characteristics, which are defined by distance information and density information. To verify the effectiveness of the proposed approaches, experiments are conducted on an imbalanced dataset of small and medium-sized Korean manufacturing companies that are not subject to external audit, and the results are compared with those of prevalent oversampling methods.
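The experimental results demonstrate that the proposed oversampling approaches, which take into account the importance level, the data distribution, and the degree to which the minority class contains precise patterns, provide an effective solution to the imbalanced data problem in bankruptcy prediction modeling. In addition, they improve the detection of bankrupt firms in the imbalanced dataset. Therefore, it is worthwhile to consider these data characteristics in order to solve the imbalanced data problem adequately. ; Numerous machine learning techniques and data mining methodologies have been developed to date. Applying them successfully to research and real-world problems has attracted much attention. Data play the most important role in this research field, yet the imbalanced data problem is present in a great many cases. The imbalanced data problem means that the classes differ in size. As a result, the performance of the resulting model deteriorates, and the patterns of greatest interest in real-world problems cannot be extracted successfully. Over the past decades, a variety of studies have sought to solve the imbalanced data problem, and it remains a valuable research topic. It is very important to identify the fundamental issues of the imbalanced data problem first and then resolve them with an appropriate solution. When enough data are available to build a classification model, the imbalanced data problem can be solved with relative ease, whereas a lack of sufficient data for learning each class can make the problem more severe. This falls under the absolute rarity problem among the data-level issues. The simplest and most suitable solution is to enlarge the minority class by generating synthetic data, that is, oversampling. Many studies have proposed oversampling methods for the imbalanced data problem, but they have not successfully resolved the underlying issues of data representation and pattern extraction. Corporate bankruptcy prediction has been one of the important decision-making problems in business administration since the 1960s. This is because it affects the lending and investment decisions of financial institutions, corporate credit rating, and a country's sustained economic growth and policy making. Research on bankruptcy prediction has focused mainly on improving model performance from the standpoint of modeling techniques. However, relatively few prior studies have addressed the imbalanced data problem in building bankruptcy prediction models. This study therefore proposes two new oversampling methods that compensate for the limitations of existing oversampling methods and address the imbalanced data problem in bankruptcy prediction models. The methods take into account the importance of the minority-class data, the data distribution, and the degree to which the data contain clear patterns that represent the class. These factors are defined through the distance information or density information of the data, and new synthetic data are generated on that basis. To compare the performance of the proposed methods, an imbalanced bankruptcy dataset of Korean small and medium-sized enterprises not subject to external audit was used, with widely used existing oversampling methods as benchmarks.
The analysis confirmed that the proposed methods alleviate the imbalanced data problem in bankruptcy prediction models better than the oversampling methods commonly used in previous studies, and in particular that they raise the prediction accuracy for bankrupt firms. These results suggest that oversampling methods that take data characteristics into account can help solve the imbalanced data problem more efficiently.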
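
The abstract's description of oversampling (generating synthetic minority-class points until the minority class matches the majority class in size) can be illustrated with a minimal Python sketch. It uses plain SMOTE-style interpolation between a minority point and one of its nearest minority neighbours. It is not the dissertation's proposed approach, which additionally accounts for data importance, distribution, and pattern quality through distance and density information. The function names `interpolate_oversample` and `balance`, the parameter `k`, and the toy data are assumptions made for illustration only.

```python
import math
import random


def interpolate_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority point and one of its k nearest minority neighbours.
    (Hypothetical helper; a generic baseline, not the thesis method.)"""
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # For each minority point, the indices of its k nearest minority neighbours.
    neighbours = []
    for i, p in enumerate(minority):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(p, minority[j]),
        )
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)           # base point
        j = rng.choice(neighbours[i])  # one of its nearest neighbours
        gap = rng.random()             # position along the segment, in [0, 1)
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(minority[i], minority[j])]
        )
    return synthetic


def balance(X, y, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_needed = (len(y) - len(minority)) - len(minority)
    if n_needed <= 0:
        return list(X), list(y)
    new_points = interpolate_oversample(minority, n_needed, k=k, seed=seed)
    return list(X) + new_points, list(y) + [minority_label] * n_needed


# Toy data: four "non-bankrupt" firms (label 0) and two "bankrupt" firms (label 1).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))  # prints "4 4"
```

Because every synthetic point lies on a segment between two existing minority points, this baseline cannot add information that the minority sample lacks. That limitation is what the absolute rarity problem discussed in the abstract describes.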