DSpace at EWHA: Aspect-Based Sentiment Analysis Using Deep Neural Networks and Embedding Learning

Title: Aspect-Based Sentiment Analysis Using Deep Neural Networks and Embedding Learning

Other Titles: 딥러닝과 임베딩 학습을 이용한 속성 단위 감성 분석

Authors: 송민채

Issue Date: 2019

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School of Ewha Womans University

Degree: Doctor

Advisors: 신경식

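Abstract: Since the 2000s, the spread of smartphones and computers, together with the growing use of mobile internet and social media, has turned the public from consumers of text into producers of it, and the volume of text they generate is increasing rapidly. As advances in analytical techniques have made it possible to analyze digitized text, research on sentiment analysis and attempts to apply it in practice have become active. Sentiment analysis, or opinion mining, is a text analysis technique that detects the subjectivity expressed in a text and processes the evaluation, judgement, emotion, sentiment, attitude, and stance of the person expressing it. The main purpose of sentiment analysis is not simply to split emotions or attitudes into positive and negative, but to identify the factors that shape human sentiment and people's decisions and to express them as quantified values, diagrams, or ratings. The unit of sentiment analysis is broadly divided into the document, the sentence, and the aspect. Unlike document- or sentence-level analysis, aspect-level sentiment analysis can extract detailed information about what the opinion target is and which feature or aspect of that target is liked or disliked. In return, the analysis procedure is complex and requires more sophisticated methods. This study performs aspect-level sentiment analysis by applying deep learning neural networks based on word embedding, which have recently shown strong performance in natural language processing. As evidence has accumulated that deep learning combined with word embedding also performs well in sentiment analysis, various modified deep learning models have been proposed, but they mainly target English text. A strength of neural network models is that they are relatively unconstrained by the characteristics of the data; however, given the particular nature of text data as language, whether these models perform comparably on Korean can only be verified through empirical analysis. The main difference between prior work and this study is that, to mitigate the limitations of existing word embedding, this study proposes a sentiment lexicon embedding technique that takes the characteristics of Korean into account. To check whether the proposed sentiment lexicon embedding is effective for sentiment analysis, the sentiment lexicon vectors extracted with it were used in the input layer of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models during training. The performance metrics of the models improved in all cases when sentiment lexicon embedding was used instead of existing word embedding, confirming that the proposed sentiment lexicon embedding can be an effective word representation method for sentiment analysis. Korean differs markedly from English: its particles and endings are highly developed compared with other languages, its word order is relatively free, and it has a high proportion of homonyms. Therefore, even an analysis method whose performance has been demonstrated on English text needs to reflect the characteristics of Korean text. This study proposed a sentiment lexicon embedding method that reflects the characteristics of Korean text in order to mitigate the limitations of existing word embedding and improve the performance of sentiment analysis. Whether sentiment lexicon embedding can effectively represent word characteristics was examined through a range of empirical analyses. Although this study examined the effect of sentiment lexicon embedding with a focus on sentiment analysis, applying it to other areas of natural language processing is also expected to be an interesting research topic.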
As user-generated content (UGC) has proliferated and come to be regarded as an invaluable information source for most organizations, there has been a great deal of interest in natural language processing (NLP) and text-mining techniques for accurately extracting information from text (Chen et al., 2017; Van de Kauter et al., 2015). In particular, researchers have made impressive progress in the sentiment analysis of subjective texts, including online product reviews and social media (Schumaker et al., 2017; Ghiassi et al., 2013; Ghiassi and Lee, 2018). Sentiment analysis, also called opinion mining, uses computational methods to analyze sentiments, opinions, attitudes, and appraisals toward topics or aspects expressed in natural language texts (Pang et al., 2002; Pang and Lee, 2008; Wang et al., 2015). In past decades, sentiment analysis relied on traditional classification models such as Naive Bayes (NB) or support vector machines (SVMs) with bag-of-words features (Ouyang et al., 2015). These machine learning approaches to NLP problems were based on shallow models using very high-dimensional, sparse features. However, feature engineering is labor intensive and has almost reached its performance bottleneck. Therefore, it is necessary to find explanatory factors from the data and build classifiers that are less dependent on feature engineering (Bengio et al., 2003). In this context, word vectors obtained from embedding learning techniques, such as Word2vec (Mikolov et al., 2013a; 2013b) or GloVe (Pennington et al., 2014), are used as inputs or extra word features to deep neural network models. Word vector representation has proven powerful in various NLP tasks because it can capture the semantic and syntactic relationships between words (Tang et al., 2016b). Meanwhile, recent advances in word representation models have often focused on widely used languages, such as English.
This raises the question of whether these methods will be equally effective when applied to languages with different linguistic characteristics, in particular morphologically rich languages (MRLs) such as Korean and Turkish (Amram et al., 2018; Berardi et al., 2015; Park et al., 2018b; Tsarfaty et al., 2010). Building word vector representations that take into account the distinct characteristics of individual languages remains challenging for generalized representation methods. Although the effectiveness of word embedding has been verified in recent studies, traditional embedding learning models have some limitations. Existing unsupervised embedding learning approaches are based on the distributional hypothesis (Harris, 1954), which states that words occurring in similar contexts tend to have similar meanings. For this reason, semantically opposite but syntactically similar words (e.g., good and bad) have similar word vectors because these words commonly share a small subset of similar surrounding words. Sentiment analysis aims to identify and classify the sentiment or opinion of a text (Tang et al., 2016b); hence this limitation is more problematic when word embedding is used for sentiment analysis than for other NLP applications (Tang et al., 2014). In this regard, this paper proposes a sentiment lexicon embedding method that represents the semantic relationships of sentiment words better than existing word embedding techniques. We obtained word vectors through the Word2vec model, but revised the input and output word formats by jointly encoding morphemes and their corresponding part-of-speech (POS) tags. Then, only morphemes with important POS tags are learned in the Word2vec model. To verify the effectiveness of the proposed sentiment lexicon embedding method, we conducted experiments comparing it with baseline models that used only general-word embedding or concatenated general-word and aspect embeddings.
Experimental results indicate that sentiment lexicon vectors obtained by the proposed sentiment lexicon embedding strengthen attributional similarities compared with the current word embedding method, and that these attributional similarities can serve as higher-quality word features for the sentiment analysis task. In addition, the revised embedding approach mitigated the problem of the conventional context-based word embedding method and, in turn, improved the performance of aspect detection and sentiment classification. Furthermore, the sentiment polarity of reviews is largely carried by the sentiment-bearing words associated with specific aspects. Therefore, it is worthwhile to model the connection between aspect and sentiment words for aspect-based sentiment analysis, and this effect was enhanced as the features of sentiment words were better represented.
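
The joint morpheme/POS encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the text has already been split into (morpheme, POS) pairs by some Korean morphological analyzer, uses Sejong-style tag names, and picks a purely hypothetical set of "important" tags (`KEEP_POS`), since the abstract does not say which POS categories were retained. The function name `encode_sentence` is likewise illustrative.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical choice of "important" POS tags (Sejong-style names):
# common/proper nouns, verbs, adjectives, adverbs, roots.
# The abstract does not specify the actual set used in the thesis.
KEEP_POS: Set[str] = {"NNG", "NNP", "VV", "VA", "MAG", "XR"}


def encode_sentence(
    tagged: Iterable[Tuple[str, str]],
    keep_pos: Set[str] = KEEP_POS,
) -> List[str]:
    """Turn (morpheme, POS) pairs into 'morpheme/POS' tokens,
    keeping only morphemes whose tag is in keep_pos."""
    return [f"{morph}/{pos}" for morph, pos in tagged if pos in keep_pos]


# A pre-tagged toy review ("delivery is fast and the screen is good").
# The tags are written by hand for illustration, not produced by a tagger.
tagged_review = [
    ("배송", "NNG"), ("이", "JKS"), ("빠르", "VA"), ("고", "EC"),
    ("화면", "NNG"), ("이", "JKS"), ("좋", "VA"), ("다", "EF"),
]

tokens = encode_sentence(tagged_review)
print(tokens)  # ['배송/NNG', '빠르/VA', '화면/NNG', '좋/VA']
```

Lists of such tokens, one per sentence, could then be passed to a Word2vec trainer in place of raw words. Because the tag is part of the token, two morphemes with the same surface form but different tags receive separate vectors, and particles and endings (here `JKS`, `EC`, `EF`) are excluded from training altogether.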