DSpace at EWHA: A Personalized Recommendation Agent System for Classifying E-Mail Documents

Title: 이메일 문서 분류를 위한 개인화된 추천 에이전트 시스템

Other Titles: A Personalized Recommendation Agent System for Classifying E-Mail Documents

Authors: 정옥란

Issue Date: 2005

Department/Major: Graduate School of Science and Technology, Department of Computer Science

Publisher: Ewha Womans University, Graduate School of Science and Technology

Degree: Doctor

Advisors: 조동섭

Abstract: This study proposes a recommendation agent system that helps users sort incoming e-mail messages into the most suitable categories, as an effective way to manage an ever-growing volume of e-mail documents. Text-based categorization is the most basic preprocessing step for recommendation. Received e-mails are classified on this basis, where classification means assigning each e-mail to one of a set of predefined categories. As the number of received e-mails grows, users must spend considerable time and effort to search, index, or summarize them. Therefore, three preprocessing algorithms are proposed for accurate categorization, and PRAS (Personalized Recommendation Agent System) is implemented as a Web-mail-based application that provides customized e-mail management by recommending the most suitable categories to each user.
The step-by-step preprocessing algorithms are as follows. First, user characteristics are derived by assigning a weight to each attribute, taking into account the skewed distribution of e-mail documents. The weighting scheme extends the Naive Bayesian Classifier, which assumes that all attributes are independent and equally influential on classification. Second, automatic categorization requires a feature extraction and learning stage. In this stage, the selection of training documents for feature extraction is very important for accuracy. A random training set can be used unmodified to generate rules, but restructuring it intelligently may yield more accurate rules, so an uncertainty-based sampling algorithm is applied. Third, the accuracy of document categorization is determined by the estimation algorithm that generates the rules. The role of this algorithm is to generate the final rules from the training set selected by the training-set construction method.
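The uncertainty-based selection of training documents mentioned above can be illustrated with a minimal sketch. It assumes that a classifier has already produced a category probability distribution for each unlabeled e-mail and that uncertainty is measured by Shannon entropy; the abstract does not state which uncertainty measure the thesis uses, and the names `entropy`, `select_uncertain`, and the sample data are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a category probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_uncertain(posteriors, k):
    """Return the ids of the k documents whose category is least certain.

    posteriors maps a document id to its list of category probabilities.
    Entropy as the uncertainty measure is an assumption of this sketch.
    """
    ranked = sorted(posteriors, key=lambda doc_id: entropy(posteriors[doc_id]),
                    reverse=True)
    return ranked[:k]


# Hypothetical classifier outputs for three unlabeled e-mails.
posteriors = {
    "mail-1": [0.98, 0.01, 0.01],  # confident prediction
    "mail-2": [0.40, 0.35, 0.25],  # very uncertain
    "mail-3": [0.70, 0.20, 0.10],  # moderately uncertain
}

# The most uncertain e-mails are the ones worth labeling and adding
# to the training set before the rules are regenerated.
chosen = select_uncertain(posteriors, 2)
```

Under these assumptions, the documents picked for labeling are the ones the current rules separate worst, which is the general idea behind restructuring a random training set instead of using it unmodified.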
Document categorization systems based on the Naive Bayes document classification algorithm are generally known to be relatively accurate compared with those based on other algorithms. Existing algorithms use a fixed threshold, whereas this study uses a dynamic threshold to increase classification accuracy. The three preprocessing algorithms described above are each designed as a module to implement PRAS, the application system for this study, and are packaged as components so that they can be readily adapted to other application systems and to distributed environments for large-scale e-mail management.
PRAS is characterized by generating personal rules through feature extraction from a person's e-mail contents or handling habits, taking the particular nature of e-mail into account. Categorization is performed according to the generated rules, and each role is designed as a separate functional module. The system uses a learning process for accurate categorization and a Bayesian algorithm for rule generation, and recommends the best category to the user when a new e-mail arrives.
This study thus proposes a recommendation system with which users can categorize new e-mails optimally in order to manage the growing number of e-mail documents efficiently. Three preprocessing algorithms are used to achieve the accurate categorization needed to sort and store e-mail documents by category. The performance evaluation measured precision and recall and tested whether the system assigns e-mails to the correct categories. The F1 measure is used to evaluate each category independently, and macro-averaging is used for the average performance over all categories: recall, precision, and F1 are computed for each category and then averaged to assess the overall performance of the system.
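The macro-averaged evaluation described above can be sketched as follows. The function names and the sample labels are illustrative assumptions, not data or code from the thesis.

```python
def per_category_scores(true_labels, predicted_labels):
    """Compute (precision, recall, F1) for every category."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for category in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == category and p == category)
        fp = sum(1 for t, p in pairs if t != category and p == category)
        fn = sum(1 for t, p in pairs if t == category and p != category)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[category] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Average precision, recall, and F1 over categories with equal weight."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical gold categories and system output for four e-mails.
true_labels = ["work", "work", "personal", "spam"]
predicted_labels = ["work", "personal", "personal", "spam"]

scores = per_category_scores(true_labels, predicted_labels)
macro_precision, macro_recall, macro_f1 = macro_average(scores)
```

Because every category contributes equally to the macro average, a small category that is classified poorly lowers the overall score as much as a large one would.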
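Applying the proposed preprocessing algorithms step by step yielded improved results.
Abstract (translated from Korean): To manage e-mail efficiently, this study first classifies received mail automatically and then assists the user through recommendations for convenient management. Text classification is the most basic preprocessing task for recommending mail. Mail is classified on this basis, where mail classification means assigning each message to one of the predefined categories. As the number of messages grows, searching and indexing them effectively and carrying out tasks such as summarization become time-consuming and difficult. To classify e-mail documents accurately for each user, three preprocessing algorithms are proposed, and PRAS (Personalized Recommendation Agent System) is implemented with them as a Web-mail-based application, yielding a customized e-mail management system that recommends the most suitable categories to the user.
The proposed step-by-step preprocessing algorithms are as follows. First, features are extracted by assigning a weight to each attribute, taking into account the skewed distribution of e-mail documents. The weighting method builds on and extends the Naive Bayesian Classifier, which assumes that all attribute values are independent and have equal influence on classification. Second, automatic classification requires a feature extraction and learning stage. To improve accuracy at this stage, the selection of training documents for feature extraction during rule generation is very important. Rules can be learned from an arbitrary training set used as is, but restructuring the set intelligently should yield more accurate rules, and an uncertainty-based sampling algorithm is applied for this purpose. Third, the accuracy of document classification is determined by the estimation algorithm that generates the rules. The role of this algorithm is to generate the final rules from the training set selected by the training-set construction method.
Document classification systems using the Naive Bayes document classification algorithm are generally known to be relatively more accurate than those using other algorithms. Existing algorithms used a fixed threshold; this study replaces it with a dynamic threshold to improve classification accuracy. The three main preprocessing algorithms above were designed as modules to implement PRAS, the application system for this study, and were packaged as components to improve extensibility to other application systems and applicability in distributed environments for managing large volumes of mail.
PRAS is characterized by forming personal rules from features extracted from an individual's mail contents or handling habits, taking the particular nature of mail into account. Classification is performed according to the formed rules, and each role is designed as a functional module. A Bayesian algorithm is used for learning and rule generation for accurate per-category classification, and when a new message arrives, suitable categories are recommended to the user in priority order using the agent concept.
As a way to manage an ever-growing volume of e-mail documents efficiently, this study investigated a recommendation system in which the user receives category recommendations for each new message and makes the final classification directly. Three preprocessing algorithms were used to achieve accurate classification, which is central to sorting e-mail documents by category and storing them in the corresponding folders. The performance evaluation first measured precision and recall and then tested whether mail contents are classified into the correct categories. The performance of each category is evaluated individually, and the average of these values is used to evaluate the overall system. Applying the proposed preprocessing algorithms step by step produced improved results.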
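As a rough illustration of how attribute weighting, Naive Bayes scoring, and prioritized recommendation might fit together, the sketch below weights words by the mail field they come from and recommends every category whose posterior reaches a threshold derived from the best score. Everything specific here is an assumption: the abstract gives neither the attributes, nor the weight values, nor the rule for the dynamic threshold, so `FIELD_WEIGHTS`, the `ratio` parameter, and the sample mails are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical per-field weights; the actual attributes and values are not given.
FIELD_WEIGHTS = {"subject": 2.0, "body": 1.0}


def train(mails):
    """Build weighted word counts from (category, {field: text}) pairs."""
    cat_counts = Counter()
    word_counts = defaultdict(Counter)  # category -> weighted word counts
    vocab = set()
    for category, fields in mails:
        cat_counts[category] += 1
        for field, text in fields.items():
            for word in text.lower().split():
                word_counts[category][word] += FIELD_WEIGHTS.get(field, 1.0)
                vocab.add(word)
    return cat_counts, word_counts, vocab


def posteriors(model, fields):
    """Return normalized category probabilities for a new mail."""
    cat_counts, word_counts, vocab = model
    total = sum(cat_counts.values())
    log_scores = {}
    for category, count in cat_counts.items():
        # Laplace-smoothed denominator over the weighted counts.
        denom = sum(word_counts[category].values()) + len(vocab)
        score = math.log(count / total)
        for field, text in fields.items():
            weight = FIELD_WEIGHTS.get(field, 1.0)
            for word in text.lower().split():
                if word in vocab:  # ignore words never seen in training
                    likelihood = (word_counts[category][word] + 1) / denom
                    score += weight * math.log(likelihood)
        log_scores[category] = score
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


def recommend(model, fields, ratio=0.5):
    """Rank categories, keeping those within `ratio` of the best posterior.

    Tying the threshold to the best score is an assumed stand-in for the
    thesis's dynamic threshold, whose definition is not in the abstract.
    """
    post = posteriors(model, fields)
    threshold = ratio * max(post.values())
    ranked = sorted(post.items(), key=lambda item: item[1], reverse=True)
    return [(category, p) for category, p in ranked if p >= threshold]


# Hypothetical training mails and a new incoming mail.
model = train([
    ("work", {"subject": "project meeting", "body": "agenda for the project review"}),
    ("personal", {"subject": "dinner plans", "body": "see you at dinner on friday"}),
])
new_mail = {"subject": "project agenda", "body": "review on friday"}
recommendations = recommend(model, new_mail)
```

With a threshold that moves with the best score, a clear-cut message yields a single recommendation, while an ambiguous one yields several categories in priority order for the user to choose from.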