DSpace at EWHA: GAN-based Oversampling Technique for Imbalanced Bankruptcy Data Processing

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	신경식	-
dc.contributor.author	김혜린	-
dc.creator	김혜린	-
dc.date.accessioned	2020-02-03T16:32:17Z	-
dc.date.available	2020-02-03T16:32:17Z	-
dc.date.issued	2020	-
dc.identifier.other	OAK-000000163658	-
dc.identifier.uri	http://dcollection.ewha.ac.kr/common/orgView/000000163658	en_US
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/253025	-
dc.description.abstract	Corporate bankruptcy is a critical problem that carries very high costs for both individuals and society. Many researchers have therefore studied how to accurately predict whether a company will go bankrupt based on its economic indices. However, data-driven bankruptcy prediction is difficult because bankruptcy data are highly imbalanced. The ratio of companies actually undergoing bankruptcy to economically active ones is nearly 1:1,000, a severe imbalance that can lead to overfitting of the prediction model. To overcome this limitation, many studies have used oversampling methods to balance the dataset, leading to more accurate model training. Oversampling compensates for the imbalance of a dataset by increasing the number of samples in the minority class. Conventional methods include Random Oversampling (ROS) and the Synthetic Minority Oversampling Technique (SMOTE). ROS randomly selects samples from the minority class and duplicates them, which naturally leads to overfitting of the prediction model. SMOTE has been applied to overcome the overfitting problem of ROS, but it has limitations when the minority and majority classes overlap in the variable space. In such cases the SMOTE algorithm might reinforce the majority data rather than the minority data, which is a serious problem because the oversampled data could then act on the prediction model as unwanted noise rather than improving it. Recently, the Generative Adversarial Network (GAN), a machine learning model that trains a generative network through adversarial learning, has been proposed. 
These characteristics make the GAN readily applicable to oversampling, because a neural network trained adversarially can generate artificial data that resemble the original data. GAN-based oversampling overcomes limitations of conventional methods, such as overfitting, and enables a highly accurate prediction model for imbalanced data. In this study, a bankruptcy prediction model is developed using a GAN-based oversampling technique. Its prediction accuracy is compared with that of models based on conventional oversampling techniques, including ROS, SMOTE, and the adaptive synthetic sampling approach (ADASYN). The proposed model gives more accurate and robust results than all the other models, showing that GAN-based oversampling overcomes the limitations of the conventional methods and appropriately augments the minority data.;Corporate bankruptcy can inflict enormous losses on the national economy and can also adversely affect all of the stakeholders involved. Predicting corporate bankruptcy more accurately is therefore a very important problem, not only for individuals but for society as a whole, and related research has continued steadily. In practice, the bankruptcy rate of ordinary companies that deal with financial institutions is very low, and the data are imbalanced, with the ratio of bankrupt to healthy companies reaching as much as 1:1000. If a model is trained or predictions are produced on such data, the results can be distorted by a bias toward the healthy companies that make up the vast majority of cases. This is known as the data imbalance problem. Meanwhile, the GAN (Generative Adversarial Network) is a deep learning algorithm that learns the distribution of the data and generates data close to the real data, so it can produce useful data while preserving the distribution of the original data. This approach can therefore address the data imbalance problem more effectively than existing oversampling techniques, which are based on KNN and use only local information. Accordingly, this study proposes using a GAN, which is expected to overcome the limitations of existing oversampling techniques and yield improved classification performance, to resolve the imbalance in bankruptcy data. The traditional oversampling techniques ROS, SMOTE, Borderline-SMOTE, and ADASYN were used as baselines for comparison with the proposed method. The imbalance was then corrected with each of the five oversampling techniques, including GAN-based oversampling, and for the empirical analysis the resulting data were fitted to three classification models: GLM, ANN, and SVM. As a result, GAN-based oversampling achieved better results than the existing techniques ROS, SMOTE, Borderline-SMOTE, and ADASYN on evaluation measures such as prediction accuracy, recall, F1 score, and AUC. 
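In conclusion, using the proposed method rather than the oversampling methods commonly used in prior work was confirmed to mitigate the imbalanced-data problem in bankruptcy prediction models and, in particular, to raise the accuracy of predicting bankrupt companies. We therefore expect that an oversampling method that takes into account the overall distribution and characteristics of the minority class can help solve the imbalanced-data problem more efficiently.	-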
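dc.description.tableofcontents	Ⅰ. Introduction 1 A. Research Background and Purpose 1 B. Organization of the Study 3 Ⅱ. Related Work 5 A. Corporate Bankruptcy Prediction 5 B. GAN-based Oversampling 7 Ⅲ. Research Methods 9 A. GAN (Generative Adversarial Network) 9 B. Addressing the Imbalanced Data Problem 12 C. Classification Models 18 Ⅳ. Proposed Model 23 Ⅴ. Experimental Design 25 A. Experimental Data 25 B. Experimental Design 27 Ⅵ. Experimental Results and Discussion 32 A. Model Evaluation 32 B. Experimental Results 35 C. Discussion 41 Ⅶ. Conclusion and Future Work 43 A. Conclusion 43 B. Future Work 44 References 46 ABSTRACT 53	-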
dc.format	application/pdf	-
dc.format.extent	1627848 bytes	-
dc.language	kor	-
dc.publisher	이화여자대학교 대학원	-
dc.subject.ddc	005.7	-
dc.title	부도 데이터의 불균형 문제 해결을 위한 적대적 생성 신경망(GAN) 기반 오버샘플링 기법	-
dc.type	Master's Thesis	-
dc.title.translated	GAN-based Oversampling Technique for Imbalanced Bankruptcy Data Processing	-
dc.creator.othername	Kim, Hye Rin	-
dc.format.page	vi, 54 p.	-
dc.identifier.thesisdegree	Master	-
dc.identifier.major	대학원 빅데이터분석학협동과정	-
dc.date.awarded	2020. 2	-
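
The abstract in this record describes ROS as duplicating randomly selected minority samples and notes that the conventional alternatives rely on KNN-based local information. The following is a minimal Python sketch of those two ideas only, not the implementation used in the thesis. The function names, the toy data, and the choice of k are hypothetical, and the interpolation step follows the commonly described SMOTE formulation rather than anything stated in this record.

```python
import random


def random_oversample(minority, n_new, rng):
    """ROS: duplicate randomly chosen minority samples (with replacement)."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like(minority, n_new, k, rng):
    """SMOTE-style sketch: interpolate between a minority sample and one of
    its k nearest minority neighbours (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: dist2(base, m))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to nb
        synthetic.append([x + gap * (y - x) for x, y in zip(base, nb)])
    return synthetic


rng = random.Random(0)
# Hypothetical toy data: two financial ratios per bankrupt company
minority = [[0.10, 0.80], [0.15, 0.75], [0.20, 0.90], [0.12, 0.85]]
ros_samples = random_oversample(minority, 6, rng)
smote_samples = smote_like(minority, 6, 2, rng)
```

Every ROS sample is an exact copy of an existing point, which is why the abstract associates it with overfitting. Every SMOTE-style sample lies on a segment between two nearby minority points, so it reflects only local structure. A GAN-based oversampler, as proposed in the thesis, would instead draw new samples from a generator trained on the whole minority distribution.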