DSpace at EWHA: 스트림 데이터의 효율적인 집계 연산

Title: 스트림 데이터의 효율적인 집계 연산

Other Titles: An Efficient Multiple Aggregation of Stream Data

Authors: 김지현

Issue Date: 2007

Department/Major: Graduate School, Department of Computer Science

Publisher: Ewha Womans University Graduate School

Degree: Master

Abstract: Researchers have recently shown great interest in deriving value-added information from stream data, a sequence of data that is produced and collected continuously. Examples include web click stream data, sensor data, and financial data. Most stream data are generated and collected rapidly and in very large volumes, so they are difficult to process and store in a timely manner. In this thesis, we propose an efficient multiple aggregation algorithm that sums stream data along various dimensions. Aggregation is one of the important analysis operators for stream data. Existing multiple aggregation algorithms are difficult to apply to stream data because they assume that the input data are already sorted and that aggregating business data carries no real-time requirement. We make two assumptions. First, the data stream is divided into windows along the time dimension, and aggregation is performed on a per-window basis. Second, as in other work, the aggregation tables to be generated are determined before generation begins. In summary, we process unsorted stream data efficiently. We use arrays together with AVL trees for rapid summation, and we place no restrictions on the selection of aggregation tables. Our algorithm completes the aggregation even when the full set of aggregation tables cannot be held in memory. We show through analysis and experiments that the proposed algorithm is efficient.

Stream data are data that are generated and collected continuously, and efforts to analyze such data to obtain added value have recently been active. Examples of stream data include network traffic monitoring data, web click stream data, and biosignal data for health care. Such data are generated at high speed and are too voluminous to store easily, and analysis results often must be produced while the data are still flowing past, so conventional business data analysis methods are difficult to apply unchanged. This study proposes a method for efficiently processing aggregation, one of the analysis operations on stream data. Aggregation is a blocking operation, that is, one that can produce its result only after seeing all of the data to be aggregated, and it is therefore one of the operations that are hard to process at high speed.

As in previous studies, this study divides the stream data into windows along the time dimension and produces an independent aggregation result for each window. It also assumes that the aggregation tables (or queries) to be generated are fixed before the data arrive. The difficulty in aggregating stream data online is that the data are not sorted. A simple way to aggregate unsorted data at high speed is to use multidimensional arrays as the storage structure for the aggregation tables. However, this storage structure is hard to load into memory when the aggregation tables that make up the cube are high-dimensional and the dimensions have many members and are sparsely populated. To solve this problem, this study proposes using a mix of arrays and AVL trees as the structure for the aggregation tables that are generated directly from the source data. Experiments show that high-speed aggregation is possible with arrays and AVL trees, and the method places no restrictions on the choice of aggregation tables to generate. The proposed method is scalable in that it can perform the aggregation even when the full set of aggregation tables to be generated is too large to reside in memory.
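
The abstract does not give the thesis's algorithm in detail, so the following is only a minimal Python sketch of the general idea it describes: per-window aggregation of records that are unsorted by dimension key, with a dense array for a small aggregation table and an AVL tree for a sparse one. All names (`Node`, `avl_add`, `avl_items`, `aggregate_windows`), the record layout `(timestamp, a, b, value)`, the choice of two tables, and the assumption that records arrive in non-decreasing window order are illustrative assumptions, not taken from the thesis. Spilling tables that exceed memory is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """AVL node holding a running sum for one group key (hypothetical layout)."""
    key: tuple
    total: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    height: int = 1


def _h(n):
    return n.height if n else 0


def _fix(n):
    n.height = 1 + max(_h(n.left), _h(n.right))
    return n


def _rot_right(y):
    x = y.left
    y.left, x.right = x.right, y
    _fix(y)
    return _fix(x)


def _rot_left(x):
    y = x.right
    x.right, y.left = y.left, x
    _fix(x)
    return _fix(y)


def avl_add(n, key, value):
    """Add value to the sum stored under key, inserting the key if absent."""
    if n is None:
        return Node(key, value)
    if key == n.key:
        n.total += value
        return n
    if key < n.key:
        n.left = avl_add(n.left, key, value)
    else:
        n.right = avl_add(n.right, key, value)
    _fix(n)
    balance = _h(n.left) - _h(n.right)
    if balance > 1:  # left-heavy
        if _h(n.left.left) < _h(n.left.right):
            n.left = _rot_left(n.left)
        return _rot_right(n)
    if balance < -1:  # right-heavy
        if _h(n.right.right) < _h(n.right.left):
            n.right = _rot_right(n.right)
        return _rot_left(n)
    return n


def avl_items(n):
    """Yield (key, total) pairs in key order."""
    if n:
        yield from avl_items(n.left)
        yield n.key, n.total
        yield from avl_items(n.right)


def aggregate_windows(records, window_len, n_a):
    """Yield (window_id, sums_by_a, sums_by_ab) for each time window.

    Assumes records arrive in non-decreasing window order and that
    dimension a takes integer values in range(n_a).
    """
    current, by_a, by_ab = None, None, None
    for ts, a, b, value in records:
        w = ts // window_len
        if w != current:
            if current is not None:
                yield current, by_a, list(avl_items(by_ab))
            # Start a fresh, independent set of tables for the new window.
            current, by_a, by_ab = w, [0] * n_a, None
        by_a[a] += value                       # dense table: direct array index
        by_ab = avl_add(by_ab, (a, b), value)  # sparse table: AVL lookup or insert
    if current is not None:
        yield current, by_a, list(avl_items(by_ab))
```

Calling `list(aggregate_windows(records, 10, 2))` on a list of `(timestamp, a, b, value)` tuples returns one entry per window. The array gives constant-time updates when a dimension has few members, while the tree stores only the key combinations that actually occur, which is the trade-off the abstract points to for high-dimensional, sparse tables.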