DSpace at EWHA: A Study on FastText Word Embedding in Korean Text

Title: 한국어 텍스트에 대한 FastText 워드 임베딩 성능 연구

Other Titles: A Study on FastText Word Embedding in Korean Text

Authors: 김소라

Issue Date: 2021

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 신경식

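Abstract: The spread of smart devices and social network services (SNS) has given many people a venue for sharing diverse information along with their own opinions. As a result, the volume of data, especially unstructured text data, has grown enormously. Along with this growth, text mining, a set of techniques for extracting the information and knowledge contained in text, has drawn attention. Sentiment analysis is one such technique: it determines the subjectivity (people's attitudes, opinions, and sentiments) and the polarity expressed in text data. Approaches to sentiment analysis fall into two groups, the lexicon-based approach and the machine learning approach, and most research has applied machine learning. Recently, sentiment analysis research based on deep learning has also been increasing. In text mining, including sentiment analysis, feature extraction is an important step. Features extracted from text data are used as the input to the analysis model and therefore strongly influence the results, which makes a suitable feature representation necessary. Some studies have used statistical methods such as TF-IDF (Term Frequency-Inverse Document Frequency) for this purpose, but as neural network models showed strong results across many domains, research applying them to natural language processing (NLP) also became active. One result of this line of work is word embedding, which includes Word2vec, GloVe (Global Vectors for Word Representation), and FastText. FastText was proposed to address a limitation of earlier word embedding methods such as Word2vec, namely that they do not reflect the morphological characteristics of words. It treats a word as a bag of character n-grams and represents the word as the sum of the vectors of those n-grams. In other words, by learning subword information it can take the morphological characteristics of a word into account, and it can produce representations even for words that were not used in training (out-of-vocabulary, OOV). Because of these properties, FastText has been applied in text mining studies, including sentiment analysis, which mainly demonstrated its performance by comparing it with other embedding models while adjusting the embedding parameters. This study aims to demonstrate the performance of FastText word embedding on Korean text. Word2vec word embedding was also performed for relative comparison. Both embedding models, Word2vec and FastText, were pretrained with identical training parameters. Sentiment analysis was then performed on each of two datasets using a convolutional neural network (CNN). Because the purpose is to compare performance across embedding models, the CNN hyperparameters were also kept identical. Performance was compared using accuracy, precision, recall, and F1-score. The results confirmed that FastText performs better than Word2vec overall. This appears to stem from the embedding characteristics of each model. Word2vec generates the vector for a word using only information from its surrounding words. Consequently, information about the inner structure of the word is not reflected.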
Moreover, it is difficult for Word2vec to represent word vectors for words that are absent from the training corpus. FastText, by contrast, learns at the n-gram level, so it not only reflects the morphological information of words but can also represent vectors for words that are absent from the training corpus. Therefore, even for words in the experimental data that were not seen during pretraining, it can form word vectors by drawing on the words in the training corpus, so the embedding works well. The significance of this study lies in examining why FastText outperforms Word2vec on Korean text. However, because embedding for Korean is performed at the syllable level, a limitation is that inflected forms with different syllables are not embedded well. Future research is therefore expected to propose a word embedding methodology suited to the Korean language.

With the development of smart devices and social network services (SNS), people can share a variety of information as well as their own opinions. This has led to a significant increase in the amount of data, especially text data. As a result, text mining, a technique for extracting information and knowledge from text data, has received attention. Sentiment analysis is a text mining technique that aims to identify the subjectivity and polarity of people's attitudes, opinions, and sentiments in text data. Sentiment analysis techniques are broadly classified into lexicon-based and machine learning approaches, and most studies have taken the latter. Recently, applying deep learning to sentiment analysis has gained popularity. The FastText model learns representations for character n-grams and represents a word as the sum of its n-gram vectors. In other words, it learns subword information and thereby takes the internal structure of words into account. It can also create vector representations for words that are not in the training corpus, known as out-of-vocabulary (OOV) words. Because of these characteristics, studies have applied FastText to text mining, including sentiment analysis, and have demonstrated its performance mainly by comparing it with other embedding models while adjusting the training parameters. This study aims to demonstrate the performance of the FastText word embedding model on Korean text. Word2vec word embedding is also performed for relative comparison.
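
The subword representation described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The boundary markers `<` and `>` and the default n-gram range of 3 to 6 follow the original FastText description, while the function names, the toy vector table, and the choice to skip unknown n-grams are assumptions made here for illustration (FastText itself hashes n-grams into a fixed number of buckets, so every n-gram maps to some vector). The Korean example assumes precomposed (NFC) Hangul, so that each syllable is a single character.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the marked word.
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim, min_n=3, max_n=6):
    """Sum the vectors of the known n-grams of a word.

    ngram_vectors is a hypothetical dict mapping n-gram -> list of floats.
    Unknown n-grams are skipped (a simplification of this sketch).
    """
    total = [0.0] * dim
    for gram in char_ngrams(word, min_n, max_n):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


# Toy table of n-gram vectors (made-up values, for illustration only).
toy_vectors = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0], "re>": [0.5, 0.5]}

print(char_ngrams("where", 3, 3))   # ['<wh', 'whe', 'her', 'ere', 're>']
print(word_vector("where", toy_vectors, dim=2, min_n=3, max_n=3))  # [1.5, 1.5]

# Two inflected forms of the same Korean verb, split at the syllable level.
shared = set(char_ngrams("먹었다", 2, 3)) & set(char_ngrams("먹는다", 2, 3))
print(sorted(shared))               # ['<먹', '다>']
```

Because a word's vector is assembled from its n-grams, a word that never appeared in training can still receive a vector as long as some of its n-grams were seen. The last lines show the other side of this: with syllables as characters, two inflected forms of one verb share only their boundary n-grams in this example, which is consistent with the limitation the abstract notes for Korean.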
We pretrained two embedding models, Word2vec and FastText, with identical training parameters. Sentiment analysis was then performed using a convolutional neural network (CNN) model. We used two types of datasets: movie review data and news articles. Because the purpose of the study is to compare the two embedding models, the CNN hyperparameters were also kept identical. Finally, we evaluated performance using accuracy, precision, recall, and F1-score. Overall, FastText performed better than Word2vec. This appears to be due to the distinct embedding characteristics of the two models. Word2vec builds the vector for a word using only information from the surrounding words, so it does not consider the inner structure of the word. It also has difficulty representing words that are not in the training corpus. FastText, in contrast, learns representations for character n-grams. It therefore reflects morphological features and can produce vectors for words that are not in the training corpus, generating them by drawing on what it learned from the training corpus. The contribution of this study is an examination of why FastText performs better than Word2vec on Korean text. Because embeddings for Korean are built at the syllable level, a limitation is that conjugated forms with different syllables are not embedded well. Future research is expected to propose a word embedding methodology better suited to Korean text data.
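
The four evaluation measures named in the abstract can be computed from a confusion matrix, as in the following minimal Python sketch. It assumes a binary (positive/negative) sentiment task, which the abstract does not state explicitly. The function name and the example labels are hypothetical and are not taken from the thesis.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration: 1 = positive review, 0 = negative review.
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(binary_metrics(y_true, y_pred))
```

In this toy example there are two true positives, one false positive, one false negative, and one true negative, so accuracy is 0.6 and precision, recall, and F1 are each 2/3. Comparing two embedding models then amounts to running the same classifier on both and comparing these four numbers.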