DSpace at EWHA: Korean News Category Classification Using Deep Learning-Based Language Models

Title: Korean News Category Classification Using Deep Learning-Based Language Models

Authors: 이연경

Issue Date: 2023

Department/Major: Graduate School, Department of Statistics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 이동환

Abstract: As the variety of available data has grown, modeling with data whose response variable has imbalanced classes has become an important research topic. Data with balanced classes is ideal; when the classes are not balanced, an imbalance problem arises. Imbalanced data degrades the classification performance of a model, and most samples end up classified into the majority class. To address this, various oversampling and undersampling methods are used to make efficient use of imbalanced data. In this thesis, we use random oversampling, one of the oversampling methods, to address class imbalance in text data. We then examine how much random oversampling improves the performance of three models: KoBERT, XLM-RoBERTa, and Multilingual-BERT. To this end, we set up several scenarios and ran experiments on real data, evaluated the performance of each model, and compared the category classification results. The results show that, when random oversampling is used, model performance improves in scenarios 1 and 3.
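The abstract names random oversampling but does not describe the procedure. As an illustration only, the following is a minimal sketch of the general technique, not the thesis's implementation: it duplicates randomly chosen minority-class samples until every class matches the size of the largest class. The function name `random_oversample`, the `seed` parameter, and the toy headlines and labels are all hypothetical and are not taken from the thesis or its dataset.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is grown to the size of the largest class.
    target = max(len(indices) for indices in by_class.values())

    resampled = []
    for label, indices in by_class.items():
        # Keep every original sample, then draw the extras with replacement.
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        resampled.extend(indices + extra)

    # Shuffle so that duplicated samples are not grouped by class.
    rng.shuffle(resampled)
    return [texts[i] for i in resampled], [labels[i] for i in resampled]


# Toy data: hypothetical headlines, not from the thesis dataset.
texts = [
    "stocks rally", "rates hold", "bond yields dip", "oil climbs",
    "team wins final", "new phone launch",
]
labels = ["economy", "economy", "economy", "economy", "sports", "tech"]

new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```

In a fine-tuning workflow, a step like this would normally be applied to the training split only, so that duplicated samples do not appear in the validation or test data. Whether the thesis follows that arrangement is not stated in the abstract.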