DSpace at EWHA: Semantic-based Similarity Checking between XML Tags

Full metadata record

DC Field	Value	Language
dc.contributor.author	李蕙受	-
dc.creator	李蕙受	-
dc.date.accessioned	2016-08-25T02:08:15Z	-
dc.date.available	2016-08-25T02:08:15Z	-
dc.date.issued	2000	-
dc.identifier.other	OAK-000000029254	-
dc.identifier.uri	https://dspace.ewha.ac.kr/handle/2015.oak/175209	-
dc.identifier.uri	http://dcollection.ewha.ac.kr/jsp/common/DcLoOrgPer.jsp?sItemId=000000029254	-
dc.description.abstract	The success of XML (eXtensible Markup Language) rests primarily on its flexibility: anyone can define the structure of XML documents to represent information in the form he or she desires. This biggest advantage of XML is at the same time its biggest handicap. XML is so flexible that XML documents cannot be automatically provided with an underlying semantics. Different tag sets, different names for elements or attributes, or different document structures in general make it harder to classify and cluster XML documents precisely. In this thesis, we design and implement a system that checks the semantic similarity between XML tags. First, the system extracts the underlying semantics of tags and then expands the synonym set of each tag using the WordNet thesaurus and a user-defined word library that supports abbreviations and compound words in XML tags. Second, considering the relative importance of XML tags in XML documents, we extend the conventional vector space model, the most widely used document model in the information retrieval field. Using this method, we were able to check the similarity between XML documents that are represented with different tags.;The main reason XML (eXtensible Markup Language) has been able to establish itself as a standard for web documents is its flexibility, which lets users describe their own document types. However, this flexibility, XML's greatest strength, is at the same time its greatest weakness. The problem it causes is that each author of an XML document uses different tag names and structures to express the same meaning. That is, because of different tag sets, different names for elements and attributes, or different document structures, documents expressed with different tags are easily regarded as belonging to different classes. This thesis therefore extracts the semantic information inherent in XML tags, expands each tag into synonyms that are as semantically similar as possible on the basis of the WordNet thesaurus and a user-defined word library, and compares and analyzes the semantic similarity between the expanded tags of two XML documents. Using the semantic similarity as a weight, it then extends the Vector Space Model, an existing method for representing unstructured documents, to take the semantic similarity of tags into account. By designing and implementing a system that checks the semantic similarity of XML tags, this thesis checks the similarity between two XML tags using the semantic similarity between them. 
As a result, just as one can judge heuristically whether two XML documents are similar, the method proposed in this thesis was able to determine the similarity of XML documents. The thesis makes two main contributions. First, it applies the vector space model, the most widely used method for representing documents in information retrieval, to XML, a semi-structured document format, in a way that fully reflects the semantic information carried by tags. Second, in applications that classify XML documents or cluster them into groups of similar documents, the semantics-based tag comparison method proposed here could be used as a data preprocessing step for XML documents.	-
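dc.description.tableofcontents	Table of Contents = ⅰ Thesis Overview = ⅴ Ⅰ. Introduction = 1 1.1 Research Background = 1 1.2 Research Purpose and Scope = 2 Ⅱ. Related Technologies = 4 2.1 Characteristics of XML (eXtensible Markup Language) = 4 2.1.1 XML Design Principles = 4 2.1.2 Different Representations of the Same Content in XML Documents = 5 2.2 WordNet = 7 2.3 Information Retrieval Models = 9 2.2.1 Classification of Information Retrieval Models = 10 Ⅲ. Design of a System for Semantic-based Similarity Checking of XML Tags = 14 3.1 System Architecture = 14 3.2 Concept Knowledge Module = 15 3.2.1 WordNet = 15 3.2.2 User-defined Word Library = 17 3.3 Information Extractor = 20 3.4 Synonym Vector Generator = 23 3.5 Similarity Measurer = 26 3.5.1 Semantic Comparison of Tags = 26 3.5.2 XML Document Model Extending the Vector Space Model = 28 3.5.2.1 Term Weights and Vector Similarity = 26 3.5.2.2 Computing XML Tag Similarity with the Extended Vector Space Model = 31 Ⅳ. System Implementation = 34 4.1 Implementation Environment = 34 4.2 Information Extractor Module = 35 4.2.1 Algorithm = 35 4.2.2 Stopword List = 40 4.2.3 Stemming = 42 4.3 Concept Knowledge and Synonym Vector Generation Module = 44 4.3.1 User-defined Word Library = 44 4.3.2 Synonym Vector Generation Example = 46 4.4 Similarity Measurement Module = 47 4.5 System Verification through Experiments = 49 Ⅴ. Conclusion and Future Work = 58 References = 60 ABSTRACT = 63	-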
dc.format	application/pdf	-
dc.format.extent	1808200 bytes	-
dc.language	kor	-
dc.publisher	Graduate School of Science and Technology, Ewha Womans University	-
dc.subject	XML tags	-
dc.subject	Computer	-
dc.subject	Similarity	-
dc.subject	Checking	-
dc.title	XML태그의 의미적 유사성 검사	-
dc.type	Master's Thesis	-
dc.title.translated	Semantic-based Similarity Checking between XML Tags	-
dc.format.page	vi, 64 p.	-
dc.identifier.thesisdegree	Master	-
dc.identifier.major	Graduate School of Science and Technology, Department of Computer Science	-
dc.date.awarded	2001. 2	-
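
The approach outlined in the abstract (expand each tag into a synonym set, then compare documents with a vector space model weighted by tag similarity) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The small `SYNONYMS` table is a hand-written stand-in for WordNet, `ABBREVIATIONS` is a hypothetical user-defined word library, Jaccard overlap is an assumed tag similarity measure, and the document score is a soft cosine chosen only for illustration. All function names are hypothetical.

```python
import math
import re

# Hypothetical user-defined word library: abbreviation -> full word.
ABBREVIATIONS = {"addr": "address", "tel": "telephone", "no": "number"}

# Hypothetical stand-in for a thesaurus such as WordNet: word -> synonyms.
SYNONYMS = {
    "author": {"writer", "creator"},
    "writer": {"author", "creator"},
    "price": {"cost"},
    "cost": {"price"},
}


def split_tag(tag):
    """Split a tag name such as 'bookAuthor' or 'tel_no' into lowercase words,
    replacing known abbreviations with their full forms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", tag)
    return [ABBREVIATIONS.get(p.lower(), p.lower()) for p in parts]


def expand(tag):
    """Return the tag's words together with their synonyms."""
    words = set(split_tag(tag))
    for word in list(words):
        words |= SYNONYMS.get(word, set())
    return words


def tag_similarity(a, b):
    """Jaccard overlap of the two expanded synonym sets, in [0, 1] (an assumption)."""
    ea, eb = expand(a), expand(b)
    union = ea | eb
    return len(ea & eb) / len(union) if union else 0.0


def soft_dot(u, v):
    """Dot product of two {tag: weight} vectors, weighted by tag similarity."""
    return sum(
        wu * wv * tag_similarity(tu, tv)
        for tu, wu in u.items()
        for tv, wv in v.items()
    )


def document_similarity(u, v):
    """Soft cosine between two documents given as {tag: weight} dictionaries."""
    denom = math.sqrt(soft_dot(u, u) * soft_dot(v, v))
    return soft_dot(u, v) / denom if denom else 0.0


if __name__ == "__main__":
    doc_a = {"author": 1, "price": 1}   # weights could be tag frequencies
    doc_b = {"writer": 1, "cost": 1}
    print(tag_similarity("bookAuthor", "writer"))
    print(document_similarity(doc_a, doc_b))
```

With an ordinary cosine, `doc_a` and `doc_b` would share no dimensions and score zero. Weighting the dot product by tag similarity lets documents that use different but synonymous tags still be recognized as similar, which is the effect the abstract describes.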