DSpace at EWHA: Design and Implementation of a Mining System for Web Content Classification

Title: Design and Implementation of a Mining System for Web Content Classification

Authors: 최윤정

Issue Date: 2001

Department/Major: Graduate School of Science and Technology, Department of Computer Science

Publisher: Ewha Womans University, Graduate School of Science and Technology

Degree: Master

Abstract: Most enterprises and websites hold text data that, like the logs generated by users who enter and leave a site at random, has no database-like structure but considerable potential value. Customer-service bulletin boards and the data that a search site collects in its early stages are good examples of such unstructured data. Text mining applies text analysis techniques to unstructured text in order to uncover the information hidden in it. The goal of this research is to identify harmful and illegal sites in an initial data set of about 38,000 records held by a search engine company, and the thesis proposes a feedback method to improve that classification. In the first step, the documents are classified by text mining analysis. In the second step, data mining is applied to the first-step result to produce a second classification that takes patterns into account. Obtaining this second result reveals the cases that the text mining step could not resolve, and these are fed back into the learning process of the first step. I also designed and implemented a preprocessor for web content documents, a text mining system that supports repeated feedback-driven relearning and reclassification over an unstructured text corpus, and a conversion system that turns the results into a form suitable for data mining analysis. The proposed method was applied to the whole data set and checked against 400 verified records. The results indicate that ambiguous documents, whose per-category scores differ only narrowly because the classification criteria are fine-grained, could be classified, and that the accuracy and quality of the results could be improved. Applying the text analysis tool recursively to the web content corpus also minimized the time needed to access the large document collection and allowed the data to be cleansed.
Furthermore, the knowledge and features discovered through data mining make it possible to classify and reorganize documents under new categories and to improve the quality of search results. Semi-structured files can be generated from valuable unstructured data, and significant conclusions can then be drawn from them, for example through predictive modeling and rule discovery. More generally, the thesis shows that the text mining system designed and implemented here makes it possible to apply general data mining techniques to valuable but unstructured data and so to obtain new rules or meaningful new conclusions.
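
The two-step classification with feedback described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation, whose details the abstract does not give. Everything here is an assumption made for illustration: the word-count scoring, the `margin` threshold for calling a document ambiguous, the host-majority rule that stands in for the data mining step, the function names, and the tiny sample records with their `.example` hosts.

```python
from collections import Counter, defaultdict

def train(labeled):
    """Learn per-category word counts from (text, label) pairs."""
    model = defaultdict(Counter)
    for text, label in labeled:
        model[label].update(text.lower().split())
    return model

def scores(model, text):
    """Score a document per category by summing the learned word counts."""
    words = text.lower().split()
    return {label: sum(counts[w] for w in words) for label, counts in model.items()}

def first_pass(model, records, margin=1):
    """Step 1: label each (host, text) record; flag it as ambiguous when the
    top two category scores are close. Needs at least two categories."""
    confident, ambiguous = [], []
    for host, text in records:
        ranked = sorted(scores(model, text).items(), key=lambda kv: kv[1], reverse=True)
        (best, s1), (_, s2) = ranked[0], ranked[1]
        (confident if s1 - s2 > margin else ambiguous).append((host, text, best))
    return confident, ambiguous

def second_pass(confident, ambiguous):
    """Step 2 (assumed stand-in for the data mining step): give an ambiguous
    record the majority label of the confident records on the same host."""
    by_host = defaultdict(Counter)
    for host, _, label in confident:
        by_host[host][label] += 1
    resolved = []
    for host, text, _ in ambiguous:
        if by_host[host]:
            resolved.append((text, by_host[host].most_common(1)[0][0]))
    return resolved

def classify_with_feedback(seed, records, rounds=3):
    """Repeat step 1 and step 2, feeding resolved records back into training."""
    training = list(seed)
    for _ in range(rounds):
        confident, ambiguous = first_pass(train(training), records)
        new = [r for r in second_pass(confident, ambiguous) if r not in training]
        if not new:
            break  # nothing left to learn from
        training.extend(new)
    return first_pass(train(training), records)

# Hypothetical sample data, not taken from the thesis.
seed = [("casino bet jackpot", "harmful"), ("news weather sports", "safe")]
records = [
    ("a.example", "casino jackpot bet bet"),
    ("a.example", "poker bonus news"),
    ("b.example", "weather news sports today"),
]
confident, ambiguous = classify_with_feedback(seed, records)
```

In this toy run the second record scores almost equally under both categories in the first pass, so it is flagged as ambiguous. The stand-in pattern rule labels it from its host, and after relearning the first-step classifier assigns it with a clear margin.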