DSpace at EWHA: A Study on BERT based medical Open Information Extraction and Relation Prediction

Title: A Study on BERT based medical Open Information Extraction and Relation Prediction

Authors: 우정연

Issue Date: 2023

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 강윤철

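Abstract: A tally of the papers registered each year from 2010 to 2021 in PubMed, a leading scholarly database service for biomedicine, shows that the number of registered biomedical papers grew by 8-9% per year on average, and that the rate of increase continues to rise (Landhuis E. (2021)). Because these rapidly multiplying papers are unrefined, unstructured text, biomedical researchers incur very high costs in grasping the concepts, theories, and trends they contain. Manually extracting the needed information from such a vast body of literature requires considerable labor. Research on information extraction systems that simplify searching and analyzing information is therefore needed. In addition, building a medical knowledge base efficiently requires identifying the relations among the information that can be extracted from raw text. In biomedicine in particular, it is very important to determine whether a semantic association exists between entities. This study proposes a BERT-based biomedical Open Information Extraction and event type inference method for efficiently building knowledge bases in the medical and healthcare domain. Relational triples in S-V-O (Subject-Verb-Object) form are extracted from the input sentence itself, and, using BioBERT, the decoded sequence is fed back as input to the encoder to generate similar but non-overlapping tuples, which yields more flexible extraction of relations. The extractions are then clustered by similarity to infer the event type of the input sentences and capture their central meaning. The proposed model departs from traditional information extraction techniques and performs Open Information Extraction with BioBERT. The technique extracts tuples from across the entire sequence, which allowed it to reflect far more complex relations. In addition, pairs were extracted separately from the extracted tuples to resolve ambiguity in predicates and objects, which expressed the meaning of the tuples much more clearly and raised extraction quality. It also made it possible to identify trends in the information contained in raw text. Moreover, tuple pair clustering derived a variety of relation types from pairs with similar meanings, and a comparison of topic coherence with other models using the F1 score showed relatively high scores, confirming that related words were grouped well within each topic. However, because this study extracts tuples based only on explicit parts of speech in raw text, it cannot extract relations that are not stated explicitly in a sentence. Because extraction targeted only the medical domain, methods for cross-field information extraction that combine it with other domains should be devised so that richer information can be extracted. Finally, because this study used only English training data, further research is needed on other languages. Future work will develop the model further to perform a variety of downstream tasks, including QA (Question Answering) systems, search systems, and knowledge base construction.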
We therefore intend to develop the model through further research so that a biomedical knowledge base system can be built.

Because the rapidly growing body of academic papers consists of unrefined, unstructured text, biomedical researchers spend a great deal of time and money grasping the concepts and theories within it. Extracting the needed information from vast amounts of literature requires considerable labor, so research on information extraction systems that simplify the search and analysis of information has become necessary. This study presents a BERT-based biomedical open information extraction system and event type prediction methods. The model extracts information automatically from a large unstructured corpus without human supervision. It aims to extract all possible relational tuples from the corpus without requiring pre-specified relation types. With the help of BioBERT, the generated decoded sequence is added to the input of the next encoding step. This approach extracts a variable number of diverse S-P-O (Subject-Predicate-Object) relational tuples from an unstructured sentence. In addition, P-O (predicate sense, object head) pairs are obtained from the extracted tuples, and the generated extractions are clustered to identify the key information of a sentence more effectively. Experimental results show that the Copy Mechanism with fine-tuned BioBERT improved the best F1 score by 1.4%. The clustering results for the pairs also improved in some clusters, with the best F1 score reaching 84.4%.
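As a rough illustration of two quantities the abstract mentions, the sketch below reduces S-P-O tuples to P-O pairs and computes an exact-match F1 score over extracted tuples. It is a minimal sketch under stated assumptions, not the thesis implementation: the `Triple` type, the `po_pair` and `f1_score` helpers, the last-token rule for choosing the object head, the exact-match scoring criterion, and the example sentences are all hypothetical, and no BioBERT model is involved.

```python
from typing import Iterable, NamedTuple, Tuple


class Triple(NamedTuple):
    """Hypothetical container for one S-P-O relational tuple."""
    subject: str
    predicate: str
    obj: str


def po_pair(triple: Triple) -> Tuple[str, str]:
    """Reduce a triple to a (predicate, object head) pair.

    Assumption: the object head is approximated by the last token of the
    object phrase. The abstract does not say how the head is selected.
    """
    tokens = triple.obj.split()
    head = tokens[-1].lower() if tokens else ""
    return (triple.predicate.lower(), head)


def f1_score(predicted: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """Exact-match F1 between predicted and gold triples.

    Assumption: a prediction counts as correct only if all three slots
    match a gold triple exactly; the thesis may use a looser criterion.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example data, for illustration only.
gold = [
    Triple("aspirin", "inhibits", "platelet aggregation"),
    Triple("aspirin", "reduces", "fever"),
]
predicted = [
    Triple("aspirin", "inhibits", "platelet aggregation"),
    Triple("aspirin", "treats", "headache"),
]

pairs = [po_pair(t) for t in predicted]
score = f1_score(predicted, gold)
print(pairs)
print(score)
```

In this toy case one of the two predicted tuples matches a gold tuple, so precision and recall are both 0.5. The P-O pairs would then be the units passed to a clustering step; a similarity measure over contextual embeddings would be one plausible choice there, but that is an assumption rather than something the record specifies.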