DSpace at EWHA: Online Analysis of High-Dimensional Data Using Low-Dimensional Summary Tables


Title: Online Analysis of High-Dimensional Data Using Low-Dimensional Summary Tables (저차원 집계 테이블들을 사용한 고차원 데이터의 온라인 분석)

Other Titles: Analysis of High Dimensional Data using Low Dimensional Summary Tables

Authors: 최혜정

Issue Date: 2003

Department/Major: Graduate School, Department of Computer Science

Publisher: Ewha Womans University, Graduate School of Science and Technology

Degree: Master

Advisors: 김명

Abstract: Business data with many dimensions (or attributes) is used to plan company strategy and to analyze customer behavior reliably from multiple perspectives. Online analysis of such high-dimensional data raises problems that are not serious when processing low-dimensional data. To return analysis results quickly, OLAP systems pre-compute and store aggregated results, called summary tables. For high-dimensional data, however, the number and size of the summary tables are so large that pre-computing all of them is often infeasible in practice. Previous work on high-dimensional data aims to curb data explosion, in which the pre-computed aggregates grow to thousands of times the size of the data being analyzed. Some approaches reduce storage cost by compressing the data so that only valid cells are kept; others speed up aggregation by storing the fact table column by column. These methods do not fundamentally solve the data explosion that arises in online analysis of high-dimensional data.
In this study, we propose a new analysis method that reduces data explosion in high-dimensional data. It builds on the observation that, even when the data is high-dimensional, analysts mainly query low-dimensional summary tables, typically of 3 to 6 dimensions. We examine the effectiveness of this approach through examples, in terms of both query processing and the cost of cube generation, and we propose a new cube generation algorithm. During pre-computation, the proposed algorithm skips the very large high-dimensional summary tables and generates only the summary tables of 3 to 6 dimensions or fewer, quickly and simultaneously, from the fact table. By reusing memory efficiently, the algorithm eases the limitations of MOLAP on high-dimensional data and allows such data to be analyzed quickly. We show the efficiency of the approach through theoretical analysis and by implementing a cost estimator for low-dimensional summary tables, which measures the cost of the proposed algorithm.
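
The idea of pre-computing only low-dimensional summary tables can be illustrated with a small sketch. This is not the algorithm proposed in the thesis, which the abstract does not spell out; it is a naive single-scan version written under stated assumptions. The function names `count_group_bys` and `build_low_dim_tables`, the row format (one dictionary per fact-table row), and the choice of SUM as the aggregate are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def count_group_bys(num_dims, max_dims=None):
    """Count the group-bys (summary tables) over num_dims dimensions
    that use at most max_dims dimensions. With max_dims=None this is
    the full cube, i.e. 2 ** num_dims group-bys."""
    if max_dims is None:
        max_dims = num_dims
    top = min(max_dims, num_dims)
    return sum(comb(num_dims, k) for k in range(top + 1))


def build_low_dim_tables(rows, dim_names, measure, max_dims):
    """Build every summary table with at most max_dims dimensions in a
    single scan of the fact table (hypothetical, naive illustration).

    rows      : iterable of dicts, one per fact-table row
    dim_names : sequence of dimension column names
    measure   : name of the numeric column to aggregate with SUM
    Returns a dict mapping a tuple of dimension names to a dict that
    maps a tuple of dimension values to the summed measure.
    """
    # Enumerate only the low-dimensional group-bys; higher ones are skipped.
    subsets = [
        subset
        for k in range(min(max_dims, len(dim_names)) + 1)
        for subset in combinations(dim_names, k)
    ]
    tables = {subset: defaultdict(float) for subset in subsets}

    # One pass over the fact table updates all tables simultaneously.
    for row in rows:
        value = row[measure]
        for subset in subsets:
            key = tuple(row[name] for name in subset)
            tables[subset][key] += value
    return tables


# Tiny made-up fact table, for illustration only.
example_dims = ("region", "product", "year", "channel")
example_rows = [
    {"region": "east", "product": "a", "year": 2001, "channel": "web", "sales": 10.0},
    {"region": "east", "product": "b", "year": 2001, "channel": "store", "sales": 5.0},
    {"region": "west", "product": "a", "year": 2002, "channel": "web", "sales": 7.0},
]
example_tables = build_low_dim_tables(example_rows, example_dims, "sales", max_dims=2)
```

For a sense of scale, a 20-dimensional data set has 2^20 = 1,048,576 possible group-bys, but only 1,351 of them use three or fewer dimensions, which is what `count_group_bys(20, 3)` computes. The sketch keeps every low-dimensional table in memory at once and does not attempt the memory reuse that the abstract attributes to the proposed algorithm.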