DSpace at EWHA: A Study on the Measurement of Index Term Specificity

A Study on the Measurement of Index Term Specificity

Title: A Study on the Measurement of Index Term Specificity (색인어의 특정성 측정에 관한 연구)

Other Titles: (The) Measurement of the index term specificity

Authors: 이주은

Issue Date: 1993

Department/Major: Graduate School, Department of Library Science

Keywords: library; index term; library and information science

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Abstract: This study examines and compares term weighting functions that have been proposed for measuring the specificity of terms in documents. It seeks to identify the relationships among these weighting functions under two assumptions: (1) assigning term weights improves the performance of information retrieval systems, and (2) the optimal indexing strategy, which follows from the inverse relationship between recall ratio and precision ratio, is adopted as the system's indexing policy. The optimal indexing strategy is the one that achieves the maximum precision ratio at a fixed recall ratio. Systems that follow this strategy can provide information users with more relevant retrieved items. The strategy has been justified theoretically from the standpoint of cost analysis. In general, precision is served by selecting specific index terms. The research is prompted by the fact that the weights calculated from statistically derived weighting functions are quantitative representations of term specificity. The scope is limited to statistically derived weighting functions. Five weighting approaches are studied theoretically: Sparck Jones' inverse document frequency weight, Salton's term discrimination value, Robertson and Sparck Jones' relevance weighting, Harter's 2-Poisson distribution model, and Wong's information-theoretic value. Following the theoretical review, four weights, namely the inverse document frequency weight (IDF), the term discrimination value (TDV), Z in the 2-Poisson distribution, and the information-theoretic value (ITV), are assigned to the experimental term set, and correlation analysis is performed to identify the relationships among them. The experimental document set consists of 43 English abstracts on otolaryngology with an average length of 96 words. From a total of 4,196 words, 398 distinct words (collection frequency > 3) were selected for the experiment.
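As an illustration of the simplest of the four weights, the following is a minimal sketch of an inverse document frequency weight. It assumes the commonly cited form log2(N / n) + 1, where N is the number of documents and n is the number of documents containing the term; the abstract does not state the exact formula used in the thesis. The function name `idf_weights` and the toy corpus are hypothetical and are not the thesis data.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Return {term: log2(N / n_t) + 1} for a list of tokenized documents.

    Assumed formulation; the thesis may have used a different variant.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document (document frequency, not term frequency).
        doc_freq.update(set(tokens))
    return {term: math.log2(n_docs / df) + 1 for term, df in doc_freq.items()}


# Toy corpus for illustration only (hypothetical, not the 43 abstracts of the thesis).
docs = [
    ["otitis", "media", "in", "children"],
    ["sinusitis", "in", "adults"],
    ["otitis", "externa", "in", "adults"],
    ["laryngeal", "lesions", "in", "adults"],
]
weights = idf_weights(docs)
```

In this toy corpus a word that occurs in every document ("in") receives the lowest weight, while a word confined to a single document ("media") receives the highest, which is the sense in which the weight expresses specificity.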
Experimental findings were as follows: 1. The four weights clearly separated topic words from non-topic words in ranked order. 2. The correlations were significant (P<0.05), and three patterns of correlation appeared. The results of the correlation analysis are as follows: (1) IDF and ITV have a strong positive correlation. This is interpreted to mean that both functions directly reflect the document frequency factor as a measure of term specificity. (2) IDF/ITV, TDV, and Z have weak positive correlations. This is interpreted to mean that the theoretical factors unique to TDV and Z, added to the traditional term specificity factor (document frequency), come into play in calculating these weights. (3) Z and TDV show a weak negative correlation. This result could not be fully confirmed in the experiment because the correlation coefficient was too weak, but this negative relationship between TDV and Z was previously reported in Srinivasan's research (1990). It is tentatively interpreted that (i) the partition of the documents into two classes and (ii) the dissimilarity between the classes in the 2-Poisson approach act inversely to the average dissimilarity of the whole document set in the TDV approach, especially for topic words that have two clearly distinct document classes. Thus, if further research is performed with the experimental term set limited to terms that follow the 2-Poisson distribution, a stronger negative correlation is expected.

This thesis is a comparative study of existing weighting techniques that measure the specificity of the terms making up document information and express their potential value as index terms quantitatively. A weight is a numerical value assigned to an index term or search term in an information retrieval system to improve retrieval results, and its value varies with the specificity of the term within the document collection. Highly specific terms have great potential value as index terms and therefore raise retrieval precision. The weighting models covered are limited to those based on statistical techniques. (1) A theoretical review was first made of the inverse document frequency weight, the term discrimination value, the relevance weight, the 2-Poisson distribution model, and the information-theoretic weight. (2) The weighting algorithms identified in the review were then applied directly to a small experimental document collection. (3) Finally, on the premise that weighting improves retrieval performance, a correlation analysis of the weights was carried out to examine the relationships among the weighting techniques.
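The correlation analysis mentioned above can be sketched as follows. The abstract does not say which coefficient was used, so this example assumes the Pearson product-moment coefficient; the function name `pearson` and the two weight lists are hypothetical illustrations, not values from the thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weights for the same four terms under two weighting schemes.
weights_a = [3.0, 2.0, 2.0, 1.0]
weights_b = [1.5, 1.0, 1.1, 0.4]
r = pearson(weights_a, weights_b)
```

A coefficient near +1 would correspond to the strong positive pattern reported for IDF and ITV, a small positive value to the weak pattern, and a small negative value to the pattern reported for Z and TDV.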
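The theoretical review showed that the inverse document frequency weight and the information-theoretic weight incorporate document frequency directly into their weighting functions as the device for term specificity. The term discrimination value treats terms with high power to distinguish documents within the collection as valuable index terms, and interprets the value of an index term by the change in average document similarity between before and after the term is assigned as an index term. It regards a term whose assignment lowers the average similarity of the collection as a good index term. The relevance weight reflects, as its weighting factor, the occurrence characteristics of a term in the relevant and non-relevant documents produced by a search. The 2-Poisson distribution describes the occurrence of the specialty words that make up the subject vocabulary of a collection; it divides the collection into two document classes according to the degree to which the topic expressed by a given term is treated, and selects as good index terms those that separate the two classes well and are highly specific because the proportion of class I in the collection is small. An experiment was conducted to compute the inverse document frequency weight, the term discrimination value, the 2-Poisson Z weight, and the information-theoretic weight, excluding the relevance weight. The experimental collection consisted of 43 papers on lesions of the middle and inner ear, the pharynx and larynx, and the paranasal sinuses, which are subtopics of otolaryngology, selected in roughly equal proportions; a total of 398 words were extracted from the English abstracts of these papers. A limitation of this experimental term set is that, because the collection from which the terms were extracted is small, an ideal term frequency distribution did not form: in the document frequency distribution, the proportion of medium-frequency words was relatively low compared with low-frequency words. When the terms were sorted in descending order of weight and ranked for each technique, topic words and non-topic words were clearly separated at the top and bottom ranks. Topic words were judged by whether the word appears in a medical terminology dictionary and with the help of an expert; of the 398 words, 202 were selected as keyword candidates. The correlation analysis showed that the correlations were significant for all weights (α ≤ 0.05). Three patterns of correlation appeared. (1) The strong positive correlation between the inverse document frequency weight and the information-theoretic weight is interpreted as the result of both directly reflecting document frequency as the device for term specificity. (2) Weak correlations were found among the inverse document frequency weight, the term discrimination value, and the 2-Poisson Z weight. This is interpreted to mean that, in each technique, theoretical factors of its own besides document frequency play a part in measuring term specificity. (3) In this experiment the 2-Poisson Z and the term discrimination value showed a negative correlation pattern, although the correlation was statistically very weak. This is a special case that appears to arise from the way the 2-Poisson concepts of document class partition and inter-class dissimilarity, and the term discrimination value's concept of average inter-document similarity, are reflected theoretically in the weighting functions. A follow-up experiment restricted to terms that follow the 2-Poisson distribution is expected to make the correlation pattern clearer.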
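The term discrimination value described in this abstract, the change in average document similarity with and without a term, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the thesis: it assumes raw term-frequency vectors, cosine similarity, and an average over all document pairs (centroid-based approximations are often used instead for larger collections). The function names and the toy corpus are hypothetical, and at least two documents are assumed.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def average_similarity(vectors):
    """Mean cosine similarity over all document pairs (needs >= 2 documents)."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def discrimination_value(documents, term):
    """Average similarity without the term minus average similarity with it.

    A positive value means that assigning the term lowers the average
    similarity, i.e. the term helps to tell documents apart.
    """
    with_term = [Counter(doc) for doc in documents]
    without_term = [Counter(t for t in doc if t != term) for doc in documents]
    return average_similarity(without_term) - average_similarity(with_term)


# Toy corpus for illustration only (hypothetical, not the thesis data).
docs = [
    ["otitis", "media", "in", "children"],
    ["sinusitis", "in", "adults"],
    ["otitis", "externa", "in", "adults"],
    ["laryngeal", "lesions", "in", "adults"],
]
dv_specific = discrimination_value(docs, "media")  # occurs in one document
dv_common = discrimination_value(docs, "in")       # occurs in every document
```

In this toy corpus the specific word yields a positive value and the ubiquitous word a negative one, matching the description that a good index term is one whose assignment lowers the average similarity of the collection.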