DSpace at EWHA: Optimization of Text Feature Selection Using Genetic Algorithm for Sentiment Classification

Title: Optimization of Text Feature Selection Using Genetic Algorithm for Sentiment Classification

Authors: 장미정

Issue Date: 2018

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: Ewha Womans University Graduate School

Degree: Master

Advisors: 신경식

Abstract: Sentiment analysis is an active area of text-mining research. It analyzes, classifies, and interprets the emotions, opinions, and evaluations that individuals express in public communication. A sentiment classification prediction model goes beyond simply describing emotions or attitudes: it interprets people's viewpoints as positive or negative and expresses them as quantified values and diagrams. Feature selection is a step in sentiment analysis that simplifies the model by selecting a subset of relevant features (variables, predictors) and mitigates the curse of dimensionality. Because it determines which terms of the numerically vectorized text reach the classifier, it can affect classification performance; it has also been studied in the field of data mining. This study starts from the observation that the results of a classification prediction model in sentiment analysis can change depending on the feature selection step. That step rests on the assumption that classification improves when irrelevant features are removed and an optimal feature set is found. The classifier is the SVM (Support Vector Machine), which is widely applied as a pattern classifier in machine-learning-based sentiment analysis; the accuracy of the SVM sentiment classification model was therefore expected to increase through feature selection. Among feature selection methods, the genetic algorithm (GA) was chosen as a global optimization search method for finding the ideal subset in the feature subspace for sentiment classification. To validate the effectiveness of the proposed model, we compared the SVM sentiment classification model built on the GA-selected feature set with baseline models built on fixed numbers of features chosen by highest TF-IDF (Term Frequency-Inverse Document Frequency) value.
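The results verified that the accuracy of the sentiment classification prediction model using GA-based feature selection was higher than that of the baseline models using only TF-IDF values. The validity of the research model was verified by testing the significance of the differences in accuracy between the models with a t-test. Applied to two different data sets, the research model found features optimized for the sentiment classification problem by using the genetic algorithm. The results suggest that finding relevant features is important for building a model with improved classification accuracy, both in sentiment classification and in data-mining classification problems in general.

Abstract (translated from Korean): Sentiment analysis is being actively studied in the field of text-mining research. In particular, the feature selection method, which selects the features that affect classification performance when text data are converted to numerical vectors, is important in the sentiment analysis process. This study is motivated by the observation that, in supervised-learning-based sentiment analysis, the results of the classification prediction model can differ depending on the feature selection step. The feature selection step includes the assumption that classification performs better when irrelevant features are reduced and the optimal feature set is found. The classifier used is the SVM (Support Vector Machine), a representative pattern classifier in supervised-learning-based sentiment analysis. Among the various feature selection methods, we applied the genetic algorithm (GA), a well-known optimization technique that can be used to find the ideal subset in the feature subspace for sentiment classification. To demonstrate the validity of the proposed model, we compared classification models built from 60 to 200 features selected by highest TF-IDF value with the classification model that applied the GA as the feature selection method. As a result, the sentiment classification prediction model that used the genetic algorithm for feature selection showed higher accuracy than the other models, which considered only TF-IDF values. To test the validity of the proposed model, the differences in prediction accuracy between the models were compared with a t-test, which confirmed its superiority. By using two kinds of review data, the study was able to find the optimal features (word groups) for a strong sentiment classification model. This shows that the feature selection method is important not only in basic data-mining classification problems but also in building supervised-learning-based sentiment classification prediction models. In particular, unlike previous studies, finding the optimal features for sentiment analysis and applying them to the classification model was shown to play an important role in improving accuracy by identifying the key features that directly affect sentiment classification.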
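
The comparison described in the abstract, GA-selected features against the top-k features by TF-IDF, can be illustrated with a minimal sketch. It is not the thesis implementation. The toy reviews, the function names (tfidf_matrix, loo_accuracy, top_k_mask, ga_select), and the GA settings (population size, tournament size, mutation rate) are all assumptions made for illustration. To keep the example dependency-free, a nearest-centroid classifier scored by leave-one-out accuracy stands in for the SVM used in the thesis, and the baseline ranks terms by summed TF-IDF, which is only one possible reading of "highest TF-IDF value".

```python
import math
import random

# Toy labeled reviews (1 = positive, 0 = negative); illustrative data only.
DOCS = [
    ("great phone love the screen", 1),
    ("love this great battery", 1),
    ("excellent camera great value", 1),
    ("good screen excellent battery", 1),
    ("terrible phone hate the screen", 0),
    ("hate this awful battery", 0),
    ("poor camera terrible value", 0),
    ("bad screen awful battery", 0),
]


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF of term j in document i."""
    tokens = [text.split() for text, _ in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    n = len(docs)
    df = {t: sum(t in doc for doc in tokens) for t in vocab}
    rows = [[doc.count(t) / len(doc) * math.log(n / df[t]) for t in vocab]
            for doc in tokens]
    return vocab, rows


def loo_accuracy(rows, labels, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier (SVM stand-in)
    restricted to the features whose mask bit is 1."""
    idx = [j for j, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature set cannot classify anything
    correct = 0
    for i in range(len(rows)):
        dists = {}
        for c in sorted(set(labels)):
            members = [rows[k] for k in range(len(rows))
                       if k != i and labels[k] == c]
            centroid = [sum(r[j] for r in members) / len(members) for j in idx]
            dists[c] = sum((rows[i][j] - m) ** 2 for j, m in zip(idx, centroid))
        correct += min(dists, key=dists.get) == labels[i]
    return correct / len(rows)


def top_k_mask(rows, k):
    """Baseline: keep the k terms with the highest summed TF-IDF (assumed ranking)."""
    scores = [sum(r[j] for r in rows) for j in range(len(rows[0]))]
    keep = set(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])
    return [1 if j in keep else 0 for j in range(len(scores))]


def ga_select(rows, labels, pop_size=20, generations=30, p_mut=0.05, seed=0):
    """Search binary feature masks with a simple generational GA."""
    rng = random.Random(seed)
    n_feat = len(rows[0])

    def fitness(mask):
        return loo_accuracy(rows, labels, mask)

    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_feat)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    vocab, rows = tfidf_matrix(DOCS)
    labels = [y for _, y in DOCS]
    mask, acc = ga_select(rows, labels)
    print("GA features:", [t for t, b in zip(vocab, mask) if b], "accuracy:", acc)
    for k in (4, 8, 12):
        print("top-%d TF-IDF accuracy:" % k,
              loo_accuracy(rows, labels, top_k_mask(rows, k)))
```

Each individual in the population is a binary mask over the vocabulary, and its fitness is the classifier's accuracy on the masked features, so the search optimizes the feature subset directly instead of ranking terms one at a time. On a corpus this small the accuracies carry no statistical meaning; a realistic run would use held-out data and, as in the abstract, a t-test over repeated accuracy measurements.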