DSpace at EWHA: Sentiment Evaluation of Online Reviews Using GAN-Based Data Augmentation

Title: Sentiment Evaluation of Online Reviews Using GAN-Based Data Augmentation (GAN 기반 데이터 증강을 활용한 온라인 리뷰 감성 평가)

Other Titles: Sentimental Analysis of Online Review Datasets with GAN-Based Data Augmentation

Authors: 김현빈

Issue Date: 2022

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 신경식

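Abstract: COVID-19, which has persisted for two years, has expanded the scale of online shopping. According to Statistics Korea, domestic online shopping transactions exceeded 12 trillion won as of April 2022, up 12.6% from the previous year. This shift has entrenched consumption habits suited to online shopping, and the online market is expected to grow further. As products that consumers used to buy offline, such as health foods and electronics, move online, getting consumers to buy without seeing the product in person has become a challenge for companies. In this process, consumer preferences become an important resource for companies. The decisive means by which consumers intuitively learn about a product and settle their purchase decision is feedback from other buyers, that is, reviews. In fact, one company introduced an integrated online and offline review service that awards points to customers who write online reviews, even for products bought offline. In most cases, products with many reviews have mostly positive ratings. However, according to one study, buyers perceive products with only positive reviews as excessive promotion and react with aversion. In addition, 53% of customers reportedly look for negative reviews first before buying. That is, consumers increase their trust in a product by reading negative reviews. Sellers therefore need to pay attention to negative reviews.
This study aims to resolve the imbalance problem in review data so that online reviews can be analyzed for sentiment correctly. The data imbalance problem refers to a state in which classes contain different amounts of data in a classification task. Positive and negative reviews online are imbalanced, which can cause an overfitting problem in which sentiment analysis evaluates all opinions as positive. To solve the online review data imbalance problem, the GAN (Generative Adversarial Networks) methodology is introduced. GAN, proposed by Ian Goodfellow in 2014, consists of a model that generates fake data (the generator) and a model that distinguishes the fake data from real data (the discriminator). GAN research has mainly been carried out in image generation, where training is relatively easy, and it now also shows results in generating text in natural language processing.
In this study, negative reviews are augmented with this methodology to bring the imbalanced data to a 1:1 ratio, after which sentiment analysis is carried out. To evaluate the performance of GAN, a comparison with another methodology is conducted at the same time. Two approaches are used to address the data imbalance: EDA (Easy Data Augmentation), a simple data augmentation technique that increases the number of negative reviews, and GAN. Sentiment analysis for each technique is performed with a classification model called KoBERT. The main text explains each technique and presents the results of the experiments. The accuracy and F1-score of the sentiment analysis results demonstrate the superior performance of GAN and suggest a direction for solving the online review data imbalance problem. Future research is expected to develop GAN further as a way to resolve imbalance problems in natural language processing.
Abstract (author's English version): For more than two years, COVID-19 has expanded the scale of online shopping. According to Statistics Korea, domestic online shopping transactions amounted to more than 12 trillion won as of April 2022, up 12.6% from the previous year. These changes in the online market have entrenched consumers' online consumption habits, and the size of the market is expected to grow further.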
As various products that consumers usually purchased offline, such as health foods and electronics, move online, it has become a challenge for companies to get consumers to buy them without seeing them in person. In the process, consumer preferences become an important resource for companies. Reviews are a decisive factor in a consumer's purchase decision. In fact, one company introduced an integrated online and offline review service that awards points to customers who write online reviews of products purchased offline. Products with a large number of reviews tend to have mostly positive ones. However, according to one study, buyers feel that a product with only positive reviews is being over-promoted, and about 53 percent of customers look for negative reviews first before purchasing. In other words, reading negative reviews increases consumers' confidence in a product. Therefore, sellers need to pay attention to negative reviews. This study aims to solve the imbalance problem in review data so that online reviews can be analyzed for sentiment correctly. The data imbalance problem refers to a state in which the classes contain different amounts of data in a classification task. Positive and negative online reviews are imbalanced, which can lead to overfitting, where sentiment analysis evaluates all opinions as positive. To address the online review data imbalance, we introduce the Generative Adversarial Networks (GAN) methodology. GAN, proposed by Ian Goodfellow in 2014, is a machine learning approach in which one model generates fake data and another distinguishes it from real data. GAN research has mainly been conducted in image generation, which is relatively easier to train than text generation, and GANs now also show results in text generation in the field of natural language processing.
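The effect of class imbalance described above can be illustrated with a minimal sketch. It assumes a hypothetical toy set of 90 positive and 10 negative reviews (these numbers are not from the thesis) and a degenerate classifier that labels every review as positive. Such a classifier reaches 90% accuracy while its F1-score on the negative class is 0, which is why accuracy alone can hide the problem.

```python
# Hypothetical toy labels, not thesis data: 1 = positive review, 0 = negative review.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def f1_for_class(y_true, y_pred, target):
    """F1-score treating `target` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Assumed 9:1 imbalance and a classifier that always predicts "positive".
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)
neg_f1 = f1_for_class(y_true, y_pred, target=0)
pos_f1 = f1_for_class(y_true, y_pred, target=1)

print(f"accuracy={acc:.3f} negative-class F1={neg_f1:.3f} positive-class F1={pos_f1:.3f}")
```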
In this study, negative reviews are augmented through this methodology so that the imbalanced data reach a 1:1 ratio, and sentiment analysis is then conducted. To evaluate the performance of GANs, comparisons are made with other methodologies. We address the data imbalance with two comparison techniques: (1) undersampling, which reduces the number of positive reviews to match the number of negative reviews, and (2) Easy Data Augmentation (EDA), which increases the number of negative reviews. The data produced by each technique are analyzed for sentiment with a classification model called KoBERT. This paper explains each technique and presents the results of the experiments. We demonstrate the superior performance of GANs in terms of accuracy and F1-score and suggest a direction for solving the online review data imbalance problem. We expect future research to develop GANs further as a way to resolve imbalance problems in natural language processing.
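The two comparison techniques can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation. The review strings and all function names are hypothetical. Only two of the EDA operations (random swap and random deletion) are shown, because synonym replacement and random insertion need a synonym dictionary. The GAN-based augmentation and the KoBERT classifier are not reproduced here.

```python
import random

def undersample(majority, minority, rng):
    """Randomly keep as many majority items as there are minority items (1:1)."""
    return rng.sample(majority, len(minority)), list(minority)

def random_swap(tokens, rng):
    """EDA-style random swap: exchange two randomly chosen token positions."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, rng, p=0.1):
    """EDA-style random deletion: drop each token with probability p."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() > p]
    # Keep at least one token so the augmented review is never empty.
    return kept if kept else [rng.choice(tokens)]

def augment_to_match(minority, target_size, rng):
    """Add perturbed copies of minority reviews until target_size is reached."""
    if not minority:
        raise ValueError("minority class must contain at least one review")
    augmented = list(minority)
    while len(augmented) < target_size:
        tokens = rng.choice(minority).split()
        operation = rng.choice([random_swap, random_deletion])
        augmented.append(" ".join(operation(tokens, rng)))
    return augmented

# Hypothetical toy reviews with a 4:1 imbalance.
positives = [f"great product number {i}" for i in range(8)]
negatives = ["arrived broken and late", "battery died within a week"]

rng = random.Random(0)
pos_kept, neg_kept = undersample(positives, negatives, rng)
neg_augmented = augment_to_match(negatives, len(positives), rng)

print(len(pos_kept), len(neg_kept))            # sizes after undersampling
print(len(positives), len(neg_augmented))      # sizes after augmentation
```

Undersampling discards positive reviews, while augmentation keeps all of them and enlarges the negative class. That difference is one reason the two are usually compared.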