DSpace at EWHA: 學位論文의 全文索引시스템 設計 (Design of a Full-Text Indexing System for Theses)

Title: 學位論文의 全文索引시스템 設計 (Design of a Full-Text Indexing System for Theses)

Other Titles: (A) Study on the Design of a Full-Text Indexing System for Thesis

Authors: 추윤미

Issue Date: 1996

Department/Major: Graduate School, Department of Library and Information Science

Keywords: thesis; full-text indexing system; design

Publisher: Graduate School, Ewha Womans University

Degree: Master

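Abstract: Full-text databases have developed rapidly in recent years because they provide the original text of documents and support full-text searching, and they have become a major information resource in libraries. However, a document is unstructured data consisting of a large volume of text, so organizing it effectively for retrieval is difficult. This study therefore structures documents with SGML (Standard Generalized Markup Language) and designs a full-text indexing system based on that structure. In doing so, it presents the advantages of document structuring and lays a foundation for effective retrieval from full-text databases.

Through a literature review, the characteristics and problems of full-text databases were analyzed and the usefulness of SGML-based document structuring was examined. Existing full-text database systems were also classified into types by the format of their full-text data, and the features, strengths, and weaknesses of each type were compared. Based on this analysis, the integrated model was selected so that the system can provide the complete text along with full-text searching.

The study divides the components of a full-text database system into electronic-version production, indexing, retrieval, and interface, and addresses the first two: production of the electronic version and design of the indexing system. The electronic version was produced in two formats: page images and SGML text. First, printed copies were scanned to produce page images, and ASCII text was produced either by OCR (Optical Character Recognition) of the stored page images or by converting word-processing files. The ASCII full text was then tagged according to a DTD (Document Type Definition) written to follow the structure of a thesis, producing SGML documents.

To use the SGML-encoded document structure in indexing, the document structure of the target data, theses, was analyzed, and the characteristics of each document element were identified and reflected in the design of the indexing system. Indexing is divided into primary and secondary indexing. Primary indexing parses the SGML document, separates document elements from text, and produces a document element table, which represents the structure of the document, and a content data file, which holds the body text. Secondary indexing uses these to index the main components of a thesis (abstract, table of contents, body, and references) according to the characteristics of each element, as follows.
1. A keyword index file, organized as an inverted file, was built from keywords extracted from the title, abstract, and table of contents, which serve as surrogates for the full text.
2. Because the table of contents expresses the structure of the body and helps users grasp the content of a document, it was stored as a separate contents table for use in various kinds of body-text retrieval and browsing.
3. A list-of-tables table, a list-of-figures table, and a list-of-appendices table were built for retrieving and browsing the tables, figures, and appendices in the body.
4. For the references at the end of the body, a reference table records the bibliographic details of each entry and the locations where it is cited in the body.

The designed full-text indexing system supports structure-based full-text searching and browsing as follows. First, document elements can be used selectively in retrieval, for example by searching for an element or by restricting the search scope to it. This can raise the precision of full-text searching and makes body-text searching efficient. Second, with the keyword index file, keyword searches can be restricted to the title, abstract, or table of contents, and methods such as ranking and weighted retrieval can be implemented using the importance of document elements and the frequency of keywords. Third, the contents table supports searching tables of contents, judging the relevance of retrieved documents, and browsing the body. Fourth, elements with specific attributes, such as tables, figures, and appendices, can be retrieved, and the corresponding list tables allow users to browse them in the body. Fifth, the reference table is linked to the citation markers in the body, so it can be used to trace where a reference is cited in the body or to search the references.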
The full-text indexing system designed in this thesis thus shows that indexing based on the structure and characteristics of documents enables varied full-text searching, and thereby demonstrates the advantages of a full-text database system built on structured documents. Because it analyzes the document elements of theses, which are a kind of monograph, and devises indexing methods suited to their characteristics, it can later be applied to other monograph-type documents.

Full-text databases have recently become widespread because they offer complete texts and full-text searching. A document, however, is a large body of unstructured text, which creates several problems for organizing and retrieving it. The purpose of this study is therefore to demonstrate the usefulness of a full-text database system based on SGML documents by analyzing the logical structure of a document and designing a full-text indexing system. To this end, a literature survey reviewed the features and problems of full-text databases and examined the usefulness of SGML-structured documents. The features, strengths, and weaknesses of existing full-text database systems were also analyzed, and the integrated model was selected to provide both complete text and full-text searching. Among the components of a full-text database system, this study addresses the production of electronic text and the design of the full-text indexing system. The target document type was the thesis. The electronic version of a thesis was produced in both page-image format and SGML text format. First, printed documents were converted into page images by scanning. Second, ASCII text was produced by OCR processing of the page images or by converting word-processing files. Third, SGML documents were produced by tagging the text according to a DTD that represents the structure of a thesis. The indexing process has two levels. Primary indexing produces an element table and a content text file from the SGML documents. Secondary indexing indexes the main elements of a thesis according to their features, as follows:
1. A keyword index file was constructed from keywords extracted from the title, abstract, and table of contents, which serve as surrogates.
2. A contents table was built from the table of contents for use in various kinds of full-text retrieval and browsing.
3. Tables listing the tables, figures, and appendices were constructed for retrieving and browsing the tables, figures, and appendices in the body.
4. A reference table was built from the references in the back matter and the citations that appear in the body.

The designed full-text indexing system can be used for full-text searching and browsing based on the logical structure of a document, as follows. First, document elements can be used in retrieval, for example by selecting a certain element to search, searching for elements, or constraining the search scope. Second, the scope of a keyword search can be constrained to the title, abstract, or table of contents. Various searching methods, such as ranking or term weighting based on term frequencies and the importance of elements, can also be implemented. Third, the table of contents can be used to retrieve a relevant document or to browse within a document. A table of contents that reflects the hierarchical structure can also be used for querying and browsing. Fourth, elements having a certain attribute, such as tables, figures, and appendices, can be retrieved. Fifth, the reference table can be used to trace citations in the body and to retrieve bibliographic entries. The full-text indexing system designed in this study therefore shows that indexing based on the structure and features of a document enables diverse ways of full-text searching. Moreover, it can be extended and adapted to other types of books.
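
The two indexing levels described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation and rests on several assumptions: well-formed XML-like markup parsed with `xml.etree.ElementTree` stands in for SGML (real SGML with tag minimization would need an SGML parser), the element names (`thesis`, `title`, `abstract`, `toc`, `entry`, `body`, `section`) are hypothetical because the record does not reproduce the DTD, and the `WEIGHTS` values and the scoring rule (frequency times element weight) are only one simple way to realize the weighted retrieval the abstract mentions.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tagged thesis; element names are illustrative, not the thesis's DTD.
SAMPLE = """<thesis>
<title>Design of a Full-Text Indexing System</title>
<abstract>Structured documents improve full-text retrieval.</abstract>
<toc><entry>Introduction</entry><entry>Indexing System Design</entry></toc>
<body><section>Retrieval over unstructured text is difficult.</section></body>
</thesis>"""

# Assumed importance of each surrogate element; the source gives no values.
WEIGHTS = {"title": 3.0, "toc": 2.0, "abstract": 1.0}


def primary_index(markup):
    """Primary indexing: separate structure from text.

    Returns an element table with rows (element_id, tag, parent_id,
    content_offset) and a content list standing in for the content data file.
    Text that follows a child element (ElementTree's `tail`) is ignored here.
    """
    element_table = []
    content = []

    def walk(node, parent_id):
        element_id = len(element_table)
        text = (node.text or "").strip()
        offset = None
        if text:
            offset = len(content)
            content.append(text)
        element_table.append((element_id, node.tag, parent_id, offset))
        for child in node:
            walk(child, element_id)

    walk(ET.fromstring(markup), None)
    return element_table, content


def scope_of(element_table, element_id):
    """Return the nearest enclosing surrogate element (title/abstract/toc), if any."""
    while element_id is not None:
        _, tag, parent_id, _ = element_table[element_id]
        if tag in WEIGHTS:
            return tag
        element_id = parent_id
    return None


def build_keyword_index(docs):
    """Secondary indexing: inverted file mapping term -> (doc_id, scope) -> frequency."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, markup in docs.items():
        table, content = primary_index(markup)
        for element_id, _, _, offset in table:
            scope = scope_of(table, element_id)
            if scope is None or offset is None:
                continue  # body text is not part of the keyword index
            for term in re.findall(r"\w+", content[offset].lower()):
                index[term][(doc_id, scope)] += 1
    return index


def search(index, term, scopes=None):
    """Rank documents for one term, optionally restricted to given elements."""
    scores = defaultdict(float)
    for (doc_id, scope), freq in index.get(term.lower(), {}).items():
        if scopes is None or scope in scopes:
            scores[doc_id] += freq * WEIGHTS[scope]
    return sorted(scores.items(), key=lambda item: -item[1])
```

Calling `build_keyword_index({"t1": SAMPLE})` and then `search(index, "retrieval", scopes={"title"})` restricts the lookup to title terms, which mirrors the scope restriction described above. Keeping the element table separate from the content list lets structural queries run without touching the text itself.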