DSpace at EWHA: 대용량 텍스트 문서 간 유사도와 SVM을 이용한 계층 분류의 관계에 관한 연구

Title: 대용량 텍스트 문서 간 유사도와 SVM을 이용한 계층 분류의 관계에 관한 연구

Other Titles: A Study on the Relationship between Similarity between Large Text Documents and Hierarchical Classification using SVM

Authors: 장수정

Issue Date: 2020

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School of Ewha Womans University

Degree: Master

Advisors: 민대기

Abstract: This thesis proposes a method for classifying large volumes of documents. Large document collections can be classified either hierarchically or flatly (non-hierarchically). We argue that hierarchical classification outperforms flat classification, and we also examine how hierarchical classification performance relates to the similarity between documents. Many studies have examined hierarchical document classification; unlike them, this study classifies documents using prior knowledge of the hierarchy. We propose a way to classify, store, and manage large numbers of documents using their existing hierarchy. Experiments were run on the open 20newsgroup articles, and the validity of the results was then checked directly on papers about water resource adaptation technology. For the water resource adaptation data, we built the entire dataset ourselves by searching "Science Direct" with the category names as queries and crawling the abstracts of papers published between 2015 and 2019. SVM (Support Vector Machine) is used as the classification model. Because the class distribution is uneven, Accuracy alone is not a sufficient measure, so performance is evaluated with Accuracy, Precision, Recall, and F1-Measure. The natural-language text is first preprocessed into a form a computer can handle. Words are then extracted from the documents to build a word matrix for the corresponding categories, and the resulting DTM (Document-Term Matrix) is used to compute the similarity between documents. In the hierarchical classification experiment, all data are first classified into the top-level categories of the hierarchy, and the results are then classified into the subcategories.
We then compared this performance with that of flat classification, which assigns documents directly to subcategories without first classifying them into parent categories, and analyzed the relationship between the computed document similarity and hierarchical classification performance. The analysis shows that hierarchical classification performs better than flat classification and that its performance relates to document similarity in various ways. The results also show how hierarchical and flat classification perform depending on the number of parent categories. The contribution of this study is a method for effectively storing and managing large volumes of unstructured documents. When an actual hierarchy exists, hierarchical classification is more effective than flat classification, and it is also more effective when the similarity between documents is high. However, because SVM was the only classification model tested, repeating the experiments with other classifiers would strengthen the validity of the findings. In addition, although hierarchical classification was more effective for most documents, flat classification performed slightly better in a few document categories. Further research on these cases could lead to more accurate results.
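
The pipeline described in the abstract (document-term matrix, document similarity, then hierarchical versus flat SVM classification) can be sketched roughly as follows. This is a minimal illustration, not the thesis's code. It assumes scikit-learn, a hypothetical eight-document toy corpus with invented labels styled after 20 Newsgroups category names, cosine similarity as the similarity measure (the abstract does not name one), a linear SVM via `LinearSVC`, and macro-averaged Precision, Recall, and F1. The helper names `fit_svm`, `predict_flat`, `predict_hierarchical`, and `evaluate` are introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: (text, parent category, subcategory).
DOCS = [
    ("the car engine needs new oil and tires", "rec", "rec.autos"),
    ("the dealer sold the car with a bigger engine", "rec", "rec.autos"),
    ("the team won the game in the final inning", "rec", "rec.sport"),
    ("the pitcher and the team lost the game", "rec", "rec.sport"),
    ("the rocket reached orbit around the moon", "sci", "sci.space"),
    ("the shuttle launch put the satellite in orbit", "sci", "sci.space"),
    ("the doctor gave the patient a new drug", "sci", "sci.med"),
    ("the patient saw the doctor about the disease", "sci", "sci.med"),
]
texts = [t for t, _, _ in DOCS]
parents = [p for _, p, _ in DOCS]
children = [c for _, _, c in DOCS]

# Document-term matrix (DTM) of raw word counts, English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(texts)

# Pairwise document similarity from the DTM (cosine is an assumption).
similarity = cosine_similarity(dtm)


def fit_svm(X, y):
    """Train a linear SVM; random_state fixed for repeatability."""
    return LinearSVC(random_state=0).fit(X, y)


# Flat classification: one SVM over all subcategories at once.
flat_clf = fit_svm(dtm, children)

# Hierarchical classification: one SVM for the parent categories,
# then one SVM per parent trained only on that parent's subcategories.
parent_clf = fit_svm(dtm, parents)
child_clfs = {}
for parent in sorted(set(parents)):
    rows = [i for i, p in enumerate(parents) if p == parent]
    child_clfs[parent] = fit_svm(dtm[rows], [children[i] for i in rows])


def predict_flat(new_texts):
    """Assign each text directly to a subcategory."""
    return [str(c) for c in flat_clf.predict(vectorizer.transform(new_texts))]


def predict_hierarchical(new_texts):
    """Predict the parent first, then a subcategory within that parent."""
    X = vectorizer.transform(new_texts)
    results = []
    for i, parent in enumerate(parent_clf.predict(X)):
        parent = str(parent)
        child = str(child_clfs[parent].predict(X[i:i + 1])[0])
        results.append((parent, child))
    return results


def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged Precision, Recall, and F1."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For a comparison in the spirit of the thesis, one would call `evaluate` on held-out documents with the output of `predict_flat` and with the subcategory part of `predict_hierarchical`, then set those scores against the within-parent entries of `similarity`. Scores on a corpus this small say nothing about the reported results.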