DSpace at EWHA: Automatic Korean Document Categorization Based on the Classification Information Model


Title: 분류 정보 모형에 기반한 한글 자동 문서 범주화

Other Titles: Automatic Korean Document Categorization based on the Classification Information Model

Authors: 이화림

Issue Date: 1999

Department/Major: Graduate School of Information Science, Computer Information Science

Publisher: Ewha Womans University, Graduate School of Information Science

Degree: Master

Advisors: 박승수

Abstract: One way to improve the precision of an information retrieval system is to restrict the search to documents that belong to a specific category. Supporting this requires automatic document categorization, the task of automatically assigning appropriate categories to documents. Among the many automatic document categorization methods, the linear classifier has two desirable characteristics: it rests on a solid theoretical foundation, and it needs little time to determine the categories of a document. Applying a linear classifier to automatic document categorization raises three issues: the feature selection method, the document-feature weight metric, and the learning method. This thesis proposes a new learning method and a new feature selection method for automatic Korean document categorization. The proposed learning method is based on the Classification Information Model and has low time complexity in the learning phase. The proposed feature selection method uses the Discrimination Score, which is grounded in information theory, as its measure of feature importance.
In addition, the thesis investigates how these three issues relate to one another in a linear classifier. To that end, experiments measure performance under a variety of feature selection methods, document-feature weight metrics, and learning methods. The feature selection criteria are term frequency, document frequency, Mutual Information, Expected Mutual Information, Information Gain, the χ² statistic, and the Discrimination Score. The document-feature weight metrics are binary value, term frequency, inverse document frequency, inverse category frequency, and their combinations. The learning methods are the Robertson/Sparck Jones algorithm, the Rocchio algorithm, the Widrow-Hoff algorithm, the Kivinen & Warmuth algorithm, and the proposed learning method.
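
To make the three design choices in the abstract concrete, the following is a minimal sketch of a linear classifier for a single category. It is not the thesis's proposed method: the record gives no details of the Classification Information Model, so the sketch uses the Rocchio algorithm (one of the baselines listed above) with term frequency × inverse document frequency weights. The function names, the smoothed idf formula, the parameter values beta=16 and gamma=4, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Inverse document frequency per term (assumed form: log(N/df) + 1)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log(n / df[t]) + 1.0 for t in df}

def vectorize(doc, idf):
    """Document-feature weights: term frequency times idf; unseen terms are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(doc).items() if t in idf}

def rocchio_weights(vectors, labels, beta=16.0, gamma=4.0):
    """Category weight vector: beta * mean(positives) - gamma * mean(negatives).

    Negative components are clipped to zero, a common Rocchio variant.
    beta and gamma are illustrative values, not taken from the thesis.
    """
    pos = [v for v, y in zip(vectors, labels) if y]
    neg = [v for v, y in zip(vectors, labels) if not y]
    w = Counter()
    for v in pos:
        for t, x in v.items():
            w[t] += beta * x / len(pos)
    for v in neg:
        for t, x in v.items():
            w[t] -= gamma * x / len(neg)
    return {t: x for t, x in w.items() if x > 0}

def score(w, vec):
    """Linear classifier score: dot product of category weights and document weights."""
    return sum(w.get(t, 0.0) * x for t, x in vec.items())

# Toy training set (hypothetical): two documents in the category, two outside it.
train_docs = [
    ["stock", "market", "rises"],
    ["market", "falls", "stock"],
    ["team", "wins", "match"],
    ["match", "ends", "team", "draw"],
]
train_labels = [True, True, False, False]

idf = fit_idf(train_docs)
train_vectors = [vectorize(d, idf) for d in train_docs]
category_w = rocchio_weights(train_vectors, train_labels)

in_category = score(category_w, vectorize(["stock", "market"], idf))
out_of_category = score(category_w, vectorize(["team", "match"], idf))
```

Here `in_category` is larger than `out_of_category`, so thresholding the score assigns the first document to the category. Training is a single pass over the documents and categorizing is a sparse dot product, which illustrates why the abstract describes linear classifiers as fast at decision time. Feature selection would be applied before `vectorize`, by keeping only the terms that rank highest under a criterion such as document frequency or the χ² statistic.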