DSpace at EWHA: XML 데이터베이스에서 문서 구조 독립적인 질의 처리 기법

Title: XML 데이터베이스에서 문서 구조 독립적인 질의 처리 기법

Other Titles: Structure Independent Query Processing Technique on XML Databases

Authors: 이월영

Issue Date: 2004

Department/Major: Graduate School of Science and Technology, Department of Computer Science

Publisher: Graduate School of Science and Technology, Ewha Womans University (梨花女子大學校科學技術大學院)

Degree: Doctor

Advisors: 용환승

Abstract: XML provides simple yet flexible ways to represent the structure and contents of arbitrary documents. This simplicity and flexibility have led to its adoption as the basis of data interchange standards in a wide variety of application areas, including electronic business, financial services, chemistry, multimedia, scientific research, metadata management, web services, and data mining. Consequently, many query languages for XML have been proposed. These languages require users to know the structure of the XML documents they query, including all element and attribute names, the data types of the values, and the hierarchical structure of the elements. Because query expressions depend so heavily on document structure, querying is inconvenient for users. Users of XML databases would therefore benefit significantly from non-navigational, content-based queries in which they specify only the search conditions (and output elements), without having to know and specify the detailed hierarchical structure of the XML documents or give hints for processing the queries efficiently.

Freeing users from having to know and specify the structure of the XML documents means, however, that the burden of automatically navigating the hierarchical structures and matching the search conditions (element or attribute names and their values) against the elements, attributes, and values in the stored XML documents falls entirely on the query processor. Moreover, allowing users to query many documents of various structures by specifying only data names and values is likely to increase query processing time.
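The contrast between path-based and content-based querying described above can be illustrated with a small sketch. It is not CXquery itself, whose syntax this record does not show. It only emulates the idea with Python's standard `xml.etree.ElementTree` module, and the two sample documents and the function names `path_query` and `name_value_query` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents holding the same facts under different structures.
DOC_A = "<library><book><title>XML Basics</title><author>Kim</author></book></library>"
DOC_B = ("<catalog><item><info><author>Kim</author></info>"
         "<title>XML Basics</title></item></catalog>")


def path_query(xml_text, path):
    """Path-based lookup: the caller must spell out the exact hierarchy."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall(path)]


def name_value_query(xml_text, name, value):
    """Structure-agnostic lookup: match on element name and value only."""
    root = ET.fromstring(xml_text)
    return [elem for elem in root.iter(name) if (elem.text or "").strip() == value]


# The path written for DOC_A finds nothing in DOC_B ...
print(path_query(DOC_A, "book/title"))
print(path_query(DOC_B, "book/title"))
# ... while the name/value condition matches in both documents.
print(len(name_value_query(DOC_A, "title", "XML Basics")))
print(len(name_value_query(DOC_B, "title", "XML Basics")))
```

The name/value form works across both structures, but the processor now has to scan for matches and work out how they relate, which is the burden the abstract goes on to discuss.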
This thesis therefore describes solutions to the complications that arise in automatic query processing, in particular the “semantic uncertainty” problem. It also develops a technique for retrieving, with a single query expression, all document fragments that satisfy the search conditions across many documents of various structures. Further, it evaluates whether structure-agnostic queries are reasonable in terms of both query expressiveness and query speed. The results show that structure-agnostic queries can deliver results at a speed similar to the best cases of path-based queries.

The approach is divided into three steps: designing a query expression that can be used regardless of document structure, implementing a query processor for the CXquery language, and evaluating the performance of the CXquery processor. The details are as follows. First, this thesis designs a query language called CXquery (Chamois XML query language) for querying without knowledge of document structure. A CXquery expression uses only the data names known to users and their values, with no paths among them. In distributed environments or on web data, it lets users query without knowing the exact structure of the documents. Second, because CXquery expresses only data names and values, the query processor has to derive the paths among the names. To this end, the thesis classifies all possible paths among the names by type and identifies the factors the query processor has to resolve in order to process CXquery. To search all possible paths quickly, the query processor assigns a unique identifier to each node of the XML documents, and the thesis develops a query processing algorithm based on these identifiers. To address the “semantic uncertainty” problem, the query processor assigns confidences to the query results.
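Third, this thesis shows that the CXquery processing technique is a reasonable approach by comparing its performance with that of another XML database, X-Hive. The CXquery processor is up to five times as fast as a processor for path-based queries and up to about seven times as fast as the X-Hive XML server.

Abstract (Korean, translated): XML is a data model that makes it easy to represent many kinds of data in arbitrary forms. Because of this simplicity and flexibility, XML has been adopted as a standard for data exchange in many application areas, such as electronic commerce, financial services, chemistry, multimedia, scientific research, metadata management, web services, and data mining, and a large amount of data is now represented in XML. Many query languages have also been proposed so that users can retrieve what they want from XML documents. These XML query languages require path expressions that navigate the hierarchical structure among elements, in addition to the names and values of the elements or attributes being queried, so users can query only if they know the document structure. This style of query expression demands too much knowledge about the documents from users, which makes querying very inconvenient. This thesis aims to make querying more convenient by developing a content-based ad hoc query technique that does not use navigational path expressions, so that XML users can specify search conditions without knowing the hierarchical structure of the XML. To this end, the thesis designs a query expression that lets users query without considering document structure, implements a query processor for that expression, and evaluates the processor's performance, thereby supporting structure-independent query processing on XML databases. The details of each part are as follows.

First, the thesis designs a query language called CXquery (Chamois XML query language) that allows querying without knowledge of document structure. This query expression supports content-based ad hoc queries that specify only the names and values of the data to be retrieved and not the paths among the data. It is especially convenient in distributed or web environments, where document structure is hard to know, because users can query without considering the structure. Second, because a CXquery expression specifies only data names and values and no paths, the query processor must determine the paths before it can process a query. For this purpose, the thesis analyzes all the paths that can arise for a CXquery expression and classifies the path types that the query processor must consider. It also develops a technique for quickly searching all the analyzed paths, and a query processing algorithm that assigns confidences to query results to compensate for the “semantic uncertainty” weakness inherent in structure-independent query processing. Third, to evaluate whether CXquery is reasonable in terms of performance as well as query expression, the thesis implements a CXquery query processor and a path-based query processor and compares their query processing speeds. In addition, to demonstrate the fairness of the performance evaluation, it compares the implemented processor against the query processing speed of X-Hive, a publicly available XML server.

The CXquery query processor implemented in this thesis processed queries up to 5 times faster than the path-based query processor, and in the evaluation against X-Hive it was up to roughly 7 times faster.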
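The second step in the abstract (assigning unique node identifiers, deriving the paths among matched names, and attaching confidences to results) can be sketched as follows. The abstract does not say which identifier scheme or confidence measure the thesis uses, so everything here is an assumption: the sketch uses Dewey-style identifiers (tuples of child positions), connects two matched nodes through their lowest common ancestor, and uses a made-up confidence of 1 divided by the path length between them. The names `label_nodes`, `common_prefix`, and `query`, and the sample document, are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Hypothetical sample document.
DOC = ("<catalog>"
       "<item><info><author>Kim</author></info><title>XML Basics</title></item>"
       "<item><info><author>Lee</author></info><title>Query Engines</title></item>"
       "</catalog>")


def label_nodes(root):
    """Assign a Dewey-style identifier (tuple of child positions) to every element."""
    labels = {}

    def visit(elem, label):
        labels[label] = elem
        for position, child in enumerate(elem):
            visit(child, label + (position,))

    visit(root, ())
    return labels


def common_prefix(a, b):
    """Identifier of the lowest common ancestor of the nodes labelled a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def query(xml_text, cond_name, cond_value, out_name):
    """Return (output text, confidence) pairs, most confident first."""
    labels = label_nodes(ET.fromstring(xml_text))
    conds = [l for l, e in labels.items()
             if e.tag == cond_name and (e.text or "").strip() == cond_value]
    outs = [l for l, e in labels.items() if e.tag == out_name]
    results = []
    for c, o in product(conds, outs):
        lca = common_prefix(c, o)
        # Path length between the two nodes through their common ancestor.
        distance = (len(c) - len(lca)) + (len(o) - len(lca))
        # Assumed scoring: closer nodes are more likely to be related.
        confidence = 1.0 / distance if distance else 1.0
        results.append((labels[o].text, confidence))
    return sorted(results, key=lambda r: -r[1])


for text, confidence in query(DOC, "author", "Kim", "title"):
    print(text, round(confidence, 2))
```

With these identifiers, the relationship between any two nodes can be read off their labels without walking the tree again. The title in the same `item` as the matching author gets a higher score than the title in the sibling `item`, which shows why a structure-agnostic processor needs some way to rank candidate connections.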