DSpace at EWHA: A Web Log Analysis Technique Using Fuzzy C-Means Clustering

Title: A Web Log Analysis Technique Using Fuzzy C-Means Clustering (Fuzzy C-Means 클러스터링을 이용한 웹 로그 분석 기법)

Authors: 김미라

Issue Date: 2003

Department/Major: Graduate School of Science and Technology, Department of Computer Science

Publisher: Ewha Womans University, Graduate School of Science and Technology

Degree: Master

Abstract: The popularization of the Internet has provided a convenient means of obtaining needed information free from constraints of time and place, and with the growth of the Internet the importance of information is increasingly emphasized. Data mining is the process of discovering valuable information in large volumes of stored data using statistical and mathematical analysis methods. Data clustering is an important data mining technique and is widely used to extract meaningful information from diverse user data. Clustering is the process of partitioning a given data set into several clusters of mutually similar items, so that the data points within one cluster share a similarity that distinguishes them from the points in other clusters. Clustering methods in data mining add database-oriented constraints to the methods traditionally used in statistics, machine learning, and pattern recognition, and are being studied as a way to make efficient use of mixed, diverse, multidimensional data such as recent multimedia data. Log analysis extracts information from the records, called logs, that users leave behind when they use a web site. It also uses log data to analyze the current state and trends of a site's page views, page views per user, access location and method, page views by time, number of visitors, and so on. This thesis studies a method for clustering web log data, which records the behavior of web users, using the Fuzzy C-Means algorithm. The Fuzzy C-Means clustering algorithm optimizes an objective function based on a similarity measure that considers the distance between each data point and each cluster center. Among the fields of the web log data, the user IP, time, and web page fields are processed into WLDF (Web Log Data for FCM), and multidimensional Fuzzy C-Means clustering is then performed. The result is used to analyze similar patterns between sample data and arbitrary data.

Cluster analysis is a primary method for database mining. It is used either as a stand-alone tool to gain insight into the distribution of a data set and to focus further analysis and data processing, or as a preprocessing step for other algorithms that operate on the detected clusters. The purpose of cluster analysis is to partition a data set into a number of disjoint groups, or clusters, such that the members of a cluster are more similar to each other than to members of other clusters. Clustering is the process of grouping feature vectors into classes in a self-organizing mode, and choosing the cluster centers is crucial to this process. Density-based approaches apply a local cluster criterion: clusters are regarded as regions of the data space in which the objects are dense, separated by regions of low object density.
The Fuzzy C-Means algorithm is an iterative partitioning method that produces optimal c-partitions. It computes the cluster centers and generates the class membership matrix, using the reciprocals of the distances to decide the new cluster centers. Every cluster center is updated to minimize the overall distance error. A web server should keep a record of the scripts it executed, the files it dispatched to the web browser, and the data generated by dynamic CGI programs. That is, it should record each user's requests and the corresponding responses every time the user visits the web site. In this kind of web logging system there are many fields containing web service information, and some effort is needed to collect the meaningful data and extract useful information about web usage. In this thesis, we propose clustering web log data sets with a multi-dimensional Fuzzy C-Means clustering algorithm and finding similar patterns between them using the moving distance values.
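
The iteration described in the abstract (update the cluster centers, then recompute the membership matrix from reciprocal distances) can be illustrated with a minimal sketch of the textbook Fuzzy C-Means procedure. This is not the thesis's implementation. The function name `fuzzy_c_means`, the fuzzifier `m = 2.0`, the stopping tolerance, and the two-dimensional sample vectors are all assumptions made for illustration; in particular, the abstract does not say how the IP, time, and page fields are encoded in WLDF, so the samples are simply hypothetical numeric features already scaled to [0, 1].

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Textbook Fuzzy C-Means on a list of equal-length numeric vectors.

    Returns (centers, memberships); memberships[i][j] is the degree to
    which points[i] belongs to cluster j, and each row sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial membership matrix, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(max_iter):
        # Center update: mean of the points weighted by u_ij ** m.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append([
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ])

        # Membership update from reciprocal distances to every center.
        new_u = []
        for p in points:
            dists = [math.dist(p, ctr) for ctr in centers]
            if any(d == 0.0 for d in dists):
                # The point sits exactly on one or more centers:
                # share full membership among those centers.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_u.append([h / sum(hits) for h in hits])
                continue
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            new_u.append([v / total for v in inv])

        # Stop once no membership value moves by more than tol.
        change = max(abs(a - b)
                     for old, new in zip(u, new_u)
                     for a, b in zip(old, new))
        u = new_u
        if change < tol:
            break

    return centers, u


# Hypothetical feature vectors (not from the thesis): two loose groups.
samples = [
    [0.10, 0.20], [0.15, 0.25], [0.12, 0.18],
    [0.80, 0.90], [0.85, 0.95], [0.78, 0.88],
]
centers, memberships = fuzzy_c_means(samples, c=2)
labels = [row.index(max(row)) for row in memberships]
```

Unlike a hard partition, each row of `memberships` gives graded degrees of belonging; `labels` only picks the strongest cluster per sample for convenience. Which cluster receives index 0 depends on the random initialization.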