DSpace at EWHA: The comparative study on the performance of n-gram by change of the number of words based on characteristic of Korean language

Title: 한국어 특성 기반의 어절 수 변화에 따른 n-gram 성과 비교

Other Titles: The comparative study on the performance of n-gram by change of the number of words based on characteristic of Korean language

Authors: 송재연

Issue Date: 2017

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School of Ewha Womans University

Degree: Master

Advisors: 신경식

Abstract: As the volume of text on social media has grown explosively, demand for text mining has surged. Text mining came to be used to analyze the thoughts and subjective views people hold about a particular target, and sentiment analysis has drawn attention because the specific emotions it extracts can inform future direction. In sentiment analysis, as in text mining more broadly, results differ sharply depending on feature selection, and related research has been active both in Korea and abroad. Many international studies have built their sentiment analysis on the widely used n-gram and have continued to study how the results change with ‘n’. Domestic studies, however, have used n-gram only as a tool and have neglected to validate the method or to judge whether it suits the characteristics of Korean text data.
This study therefore set out to demonstrate the validity of n-gram in sentiment analysis of Korean text data and to find the optimal ‘n’. It examined how characteristics of the Korean language operate within n-grams and showed that n-gram works for Korean, with the aim of laying a foundation for future studies to choose ‘n’ appropriately. The study identified the optimal ‘n’ and focused on the Korean-language characteristics that strongly influence it. When a single n-gram was used, the bigram (n = 2) showed the highest performance. When several forms of n-gram were used together, the combination of unigram, bigram, and trigram showed a high accuracy rate. Because n-grams preserve sequence, the combinations of consecutive words extracted as the number of words (eojeol) grows are effective for classifying positive and negative sentiment. Combinations of more than one word proved more useful for sentiment extraction than single words for three reasons: first, certain combinations of consecutive words emphasize a particular sentiment; second, a negation word combined with an adjacent word can change the sentiment; third, certain words that appear consecutively form a phrase that carries sentiment.
Going further, the study examined the effect of n-gram through the patterns of Type I and Type II errors for each n-gram. In the actual data, positive patterns appeared frequently in actually negative data, whereas negative patterns were rare in actually positive data, so the misclassification rate for the latter was low. As the number of words increases, negative patterns in actually negative data are recognized more flexibly, so n-grams with a larger number of words, or combined n-grams, are more effective than a single n-gram for lowering Type I and Type II errors. When the number of words is 3 or more, however, it becomes difficult to obtain variables that are effective for polarity classification, which suggests that it is important to choose an appropriate ‘n’ for the research purpose in order to secure effective variables.
The significance of this study is that it established the optimal number of words ‘n’ reflecting characteristics of the Korean language and, through the effective variables, identified the word combinations that are actively used in sentiment extraction. It showed that setting the number of words ‘n’ to 1 or more, or using a combination of word counts, reduces previously ambiguous variables and secures variables in which a particular sentiment stands out, raising the sentiment detection rate, and it indicated the number of words that is effective for recognizing negative patterns. The study is limited in that it did not cover a number of words of 4 or more, did not secure more diverse data, and could not handle forms made up of the same words in a different order.
Future research is expected to obtain high sentiment detection rates by adopting as features the predicate parts that were identified as effective variable words raising the detection rate in particular sentences, and to verify the validity of n-gram in Korean text mining beyond sentiment analysis by focusing on combinations of particular consecutive words.
Text mining has recently gained prominence as a way to understand people’s thoughts and sentiment toward a specific target. Feature selection is the most important step in text mining, because performance varies with the features the researcher chooses. Several previous studies abroad have focused on how classifier performance in sentiment analysis changes with the number of words in the n-gram. Domestic studies, however, have paid little attention to n-gram, because many researchers have instead built on domain-specific lexicons. As a result, the use of n-gram for sentiment analysis of Korean remains underexplored, even though n-gram is a common and effective feature selection method in sentiment analysis. The purpose of this study is to demonstrate the effectiveness of n-gram and to examine why performance differs with the number of words in the n-gram, and above all to find the optimal number of words. Performance is defined as the accuracy of sentiment classification into positive and negative. The experimental results confirmed that the number of words in the n-gram matters. The optimal number is two when a single n-gram is used, and the best result overall comes from combining one, two, and three words. The main reason for the performance difference is the sequence of particular words: the strength of n-gram is that it extracts consecutive words. Using n-grams of two or more words as features yields more reliable variables and brings three benefits.
First, some word combinations emphasize a sentiment: “very good” is more positive than “good”. Second, some negation words change the sentiment when combined with a particular word: “not good” is negative. A unigram model, however, treats the phrase as neutral because it produces two separate variables, “not” and “good”, whereas a bigram model treats it as negative because it produces the single variable “not good”. Third, some words form a new sentiment in combination: “waste time” is negative, but a unigram model fails to recognize it as negative because it produces the two variables “waste” and “time”. Moreover, the list of effective variables consists of words that reflect the characteristics of the Korean language, which indicates that n-gram is a valuable method for sentiment analysis of Korean. The contributions of this research are to demonstrate the effectiveness of n-gram, to identify the optimal number of words, and to explain the performance differences on the basis of characteristics of the Korean language. Future research is expected to make active use of n-gram in sentiment analysis and in other fields, based on the characteristics of the Korean language.
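
The difference between unigram and bigram variables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the thesis: tokens are obtained by splitting on whitespace, the function names `ngrams` and `combined_ngrams` are hypothetical, and the three-word phrase "not very good" is an invented example that merges two phrases from the abstract.

```python
from typing import List, Sequence


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined by a space.

    Hypothetical helper: the abstract does not describe the thesis's
    own preprocessing or tokenization.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # A text shorter than n tokens yields no n-grams (empty list).
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_ngrams(tokens: List[str], ns: Sequence[int] = (1, 2, 3)) -> List[str]:
    """Concatenate the n-gram lists for several values of n."""
    features: List[str] = []
    for n in ns:
        features.extend(ngrams(tokens, n))
    return features


# Assumption: whitespace splitting stands in for word (eojeol) segmentation.
tokens = "not good".split()
print(ngrams(tokens, 1))  # ['not', 'good']  (two separate variables)
print(ngrams(tokens, 2))  # ['not good']     (one variable that keeps the negation)

# Combined unigram + bigram + trigram features for an invented phrase.
print(combined_ngrams("not very good".split()))
# ['not', 'very', 'good', 'not very', 'very good', 'not very good']
```

With unigrams, "not" and "good" reach the classifier as unrelated variables; with bigrams the pair stays together, which is the effect the abstract credits for the better results at n = 2 and for the combined setting.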