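DSpace at EWHA: A Method for Improving the Performance of Automatic Document Classification Systems Using an Extended Taxonomy and Reinforced Post-Processing Analysis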

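Title: A Method for Improving the Performance of Automatic Document Classification Systems Using an Extended Taxonomy and Reinforced Post-Processing Analysis (확장된 분류체계와 강화된 후처리분석을 이용한 자동문서분류시스템의 성능향상방법)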

Other Titles: Efficient Classification Method for Complex Literature using Reinforcement Training and Post Processing

Authors: 최윤정

Issue Date: 2007

Department/Major: Graduate School, Department of Computer Science

Publisher: Ewha Womans University Graduate School

Degree: Doctor

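Abstract: In everyday life we often have to sort problems according to predefined situations. Classification is the assignment of an object to predefined categories according to a fixed set of rules or classification criteria; it is an analysis task that seeks, among the given items, the one most closely related to the object. Automatic document classification is the research area concerned with techniques for automatically assigning documents to defined categories based on their content. Its aim has been to make large volumes of documents efficient to manage and search while reducing a vast amount of manual work. Recent advances in computer technology and the Internet-based information environment are bringing about a new paradigm in user demands as well, and users increasingly expect better and more varied service. Accordingly, there is strong demand for automated systems that guarantee high accuracy with minimal expert intervention, that is, systems in which the computer solves the problem on its own. At the same time, recent data are growing more complex in both format and content, so good analysis results are hard to obtain with general classification methods. In particular, data that have been processed or altered to reflect some intent, such as spam-like data, compound the difficulty of analysis. Existing research on improving document classification performance has mostly concentrated on the selection and composition of training documents and on improving classification algorithms, and its scope tends to be limited to applying traditional procedures and statistical methods.
The method proposed in this study aims to improve the performance of automatic document classification systems. To solve the problems that arise from existing training methods and simple category assignment methods, it proposes a training method based on an extended taxonomy (ETOM) and a reinforced post-processing method (RPost). The proposed method is outlined as follows. First, because complex and highly uncertain data mostly lie in the decision boundary region, documents on the decision boundary are recognized as new training categories. The extended categories subdivide the existing classification criteria, and the assignment method is varied according to the subdivided criteria so as to minimize errors. Second, classification methodology and performance evaluation mostly depend on the power of the classifier applied, and evaluation centered on the number of training documents used has become the norm. However, the results produced by a classifier's computation and algorithm differ depending on whether the data under analysis fit the model well. Moreover, an assignment scheme that simply relies on the classifier's output, without analyzing the ranking order information, can further increase the classification error rate. The post-processing stage therefore treats the assignment scheme, rather than classifier performance, as the problem, and proposes a method for analyzing the existing ranked information step by step. Third, the results estimated at each stage of the proposed system are used to analyze the causes of errors, which are fed back to the appropriate stage. To verify the proposed method and evaluate its validity, we experimented on a document collection with high uncertainty, and to compare the proposed method with existing methods we tested with erroneous documents included in the training set.
The volume of online text and textual information has recently been growing explosively, and automated classification has great potential for many applications that handle data such as complex reports, news materials, and biological literature. Most of these documents are highly complex in content and relatively similar in descriptive style, with multiple topics and features.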
There are many analytical documents in which each document has its own stylistic traits. Classification accuracy can be higher if the document style fits the model. Various algorithms based on machine learning or statistical approaches have been proposed for this problem and have shown improved results with some success. However, the results are not satisfactory because these approaches focus on enhancing the existing algorithms themselves, whose range is limited by feature-based statistical methodologies. They treat a document as a simple bag of words and assign it to a single category, even though the document contains words that could place it in two or more similar categories. Traditionally, classification techniques have been developed on the basis of information technologies such as information extraction, information retrieval, statistical Natural Language Processing (NLP), and machine learning. Classifiers have been built on these technologies, and each classifier has pros and cons. Generally, when we evaluate the performance of automated text classification, we simply consider what type of classifier was used and how many documents were used. However, most classification techniques are based on a few typical models, such as the rule-based model, the inductive learning model, or the information retrieval model. Each of these classifiers has many variations. In a rule-based model, classification rules are given by experts or obtained by training. In inductive learning models, classification rules are given by probability calculations using features extracted from documents. Classifiers such as Naïve Bayes and the Support Vector Machine are based on this model. An ensemble is an efficient method for handling such combined sets of classifiers. An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way.
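
As an illustration of "individual decisions combined in some way", the following minimal Python sketch combines classifiers by majority vote. It is not taken from the thesis: the names `keyword_rule` and `majority_vote`, the toy keyword classifiers, and the category labels are all hypothetical, and majority voting is only one of several possible combination rules.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier is modeled here as any function from document text to a label.
Classifier = Callable[[str], str]


def keyword_rule(keyword: str, label: str, default: str) -> Classifier:
    """Build a toy rule-based classifier (hypothetical, for illustration only)."""
    def classify(document: str) -> str:
        # Assign `label` when the keyword occurs, otherwise fall back to `default`.
        return label if keyword in document.lower() else default
    return classify


def majority_vote(classifiers: Sequence[Classifier], document: str) -> str:
    """Combine individual decisions by returning the most frequent label."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    votes = [clf(document) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


ensemble = [
    keyword_rule("protein", "biology", "news"),
    keyword_rule("binding", "biology", "news"),
    keyword_rule("election", "politics", "news"),
]
decision = majority_vote(ensemble, "Protein binding report")
```

Here two of the three toy classifiers vote for `biology`, so the ensemble assigns that label even though the third classifier disagrees.
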
In this paper, we present a new approach to improving text classification, based on a simple and efficient training and post-processing method. In particular, we focus on complex documents that are generally considered hard to classify. The proposed method differs in style from traditional classification and adopts a knowledge discovery strategy and a fault-tolerant system approach. It provides a comparatively cheap alternative to traditional statistical methods. In experiments, we applied our system to documents that usually receive low classification accuracy because they lie on a decision boundary. The results show that the system achieves high accuracy and stability under realistic conditions. They also show that the system does not need to change the classification algorithm itself to improve accuracy and flexibility. It does not depend on factors that have an important influence on classification power, including the number of training documents, the selection of sample data, and the performance of the classification algorithm.
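
The idea of handling decision-boundary documents in a post-processing step, without changing the classifier, can be sketched as follows. This is a minimal illustration under stated assumptions and not the thesis's ETOM or RPost procedure: the function names, the `boundary` label, and the `min_margin` threshold of 0.1 are all hypothetical, and the per-category scores are assumed to come from any existing classifier.

```python
from typing import Dict, List, Tuple

# Hypothetical label for documents too close to a decision boundary.
BOUNDARY = "boundary"


def rank_categories(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (category, score) pairs from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def assign_with_margin(scores: Dict[str, float], min_margin: float = 0.1) -> str:
    """Assign the top-ranked category only if it clearly beats the runner-up.

    When the gap between the first and second ranked scores is smaller than
    `min_margin` (an assumed threshold), the document is set aside under the
    BOUNDARY label instead of being forced into a single category.
    """
    if not scores:
        raise ValueError("scores must contain at least one category")
    ranked = rank_categories(scores)
    if len(ranked) == 1:
        return ranked[0][0]
    top_label, top_score = ranked[0]
    _, second_score = ranked[1]
    if top_score - second_score < min_margin:
        return BOUNDARY
    return top_label


clear_case = assign_with_margin({"sports": 0.90, "politics": 0.05, "economy": 0.05})
close_case = assign_with_margin({"sports": 0.48, "politics": 0.47, "economy": 0.05})
```

In this sketch the first document is assigned to `sports`, while the second is flagged as `boundary` because its two best scores are nearly tied; such flagged documents could then be examined further or used as additional training material.
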