DSpace at EWHA: Fake News Classification Based on a Content-Based Feature Extraction Method

Title: 콘텐츠 기반 변수 추출 방법에 의거한 가짜 뉴스 분류 (Fake News Classification Based on a Content-Based Feature Extraction Method)

Other Titles: Fake News Detection Using Content-based Feature Extraction Method

Authors: 정호선

Issue Date: 2019

Department/Major: Graduate School, Interdisciplinary Program in Big Data Analytics

Publisher: The Graduate School of Ewha Womans University

Degree: Master

Advisors: 신경식

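Abstract: In recent years, a growing number of people have consumed news through social network services (SNS) such as Facebook and Twitter rather than through traditional media such as TV and newspapers. SNS makes news conveniently accessible anytime and anywhere, and it reduces cost. However, fake news is easily created on SNS, and once created it spreads quickly and reaches SNS users. Because the spread of fake news can harm both individual news consumers and society as a whole, politically and economically, fake news on SNS has come to be regarded as a serious problem. Specifically, fake news disseminated on SNS for political purposes can shape misguided public opinion and distort the intentions of the many voters who use SNS. In addition, when fake news about an individual or a company spreads, brand value can fall, creating a risk of severe economic loss.
Fake news is also recognized as a global problem. A study reported that fake news circulated through social media influenced Trump's victory in the 45th U.S. presidential election in 2016, and this brought the importance of fake news detection research to the fore abroad. As a result, research on the fake news problem is being pursued actively and from many angles overseas, with Facebook and Google declaring war on fake news. Fake news has become a serious social issue in Korea as well, yet research on fake news detection in Korea lags behind work abroad.
This study therefore conducts fake news detection research on Korean fake news on SNS, with the aim of contributing to a solution to the domestic fake news problem. Specifically, it aims to build a classification model that determines the veracity of Korean-language tweets. To this end, the model is built on the Content-based Feature Extraction Method, which extracts linguistic variables that can represent the stylistic characteristics of a text. Text classification research mainly uses Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF) to extract structured feature vectors from unstructured text, but these techniques have the limitation of computing input values from the lexical properties of the text alone. This study therefore uses the content-based feature extraction method to attend to a broader range of linguistic features, such as vocabulary, sentence structure, and grammar, with the goal of building a well-performing classifier of text veracity.
The proposed model achieved an accuracy of 70.53%, a recall of 85.76%, a precision of 71.18%, and an F-measure of 77.24%, generally outperforming models that apply the traditional text feature extraction methods TF and TF-IDF. This indicates that using linguistic variables from diverse categories is effective for fake news classification based on natural language processing. At a time when fake news has become a serious social problem, the study is meaningful in that it analyzes the stylistic characteristics of fake news on SNS and builds on them a classification model that can tell whether a news item is fake. It is also of academic value because it builds a fake news classification model for Korean at a time when domestic research on fake news is almost nonexistent.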
Furthermore, because a model applying the content-based feature extraction method performed well on Korean fake news as well as on English fake news, the results confirm the importance of extracting linguistic features from diverse categories as variables in the fake news domain. The study has the following limitations. First, because data were difficult to obtain, the dataset was not large. Future research therefore needs to collect additional data to strengthen the validity of the study. In addition, research based on semi-supervised learning, which can build a well-performing automated classifier even when labeled data are scarce, is expected to be a meaningful follow-up that addresses the cold-start problem that can arise in the fake news domain. Finally, subdividing the content-based linguistic variables so that a wider range of variable categories is considered would likely yield an even better classification model.
Since the advent of social media, more people have been consuming news through social network services (SNS) such as Facebook and Twitter rather than through traditional media such as television and newspapers. News consumption through SNS has the advantages of being accessible and less expensive. However, it also has serious disadvantages. First, because anyone can generate information on social media, false information is easily created. Second, because information spreads quickly on social media, SNS users are always at risk of exposure to fake news. Fake news is problematic because it can cause political and economic harm to news consumers and to the community as a whole. For example, if SNS is misused for political purposes, voters who use SNS may make wrong decisions under the influence of fake news. In addition, if fake news about particular individuals or companies spreads, their economic value may decrease. Concern over the seriousness of the fake news problem has spread around the world since the 45th U.S. presidential election in 2016. As a result, the importance of detecting fake news has come to the fore, and studies on preventing fake news on social media have been actively conducted overseas. In contrast, domestic research on fake news detection remains insufficient even though fake news is rampant in Korean society as well.
Therefore, this study aims to detect Korean fake news on SNS. Specifically, the goal is to build a classification model that identifies the authenticity of Korean-language tweets. To achieve this goal, the study extracts a variety of linguistic input variables that can represent the textual characteristics of news on Twitter, rather than extracting feature vectors with the Term Frequency (TF) or Term Frequency-Inverse Document Frequency (TF-IDF) techniques commonly used in feature extraction. In addition, a meaningful group of variables was selected through feature selection using stepwise regression. The proposed model achieved 70.53% accuracy, 85.76% recall, 71.18% precision, and a 77.24% F-measure, generally outperforming the TF-based and TF-IDF-based models. The main contribution of this research is a Korean fake news classifier, since there has been little prior research on Korean fake news detection. Moreover, the proposed model using content-based features outperformed the models using TF and TF-IDF, which indicates the importance of diverse types of linguistic features in fake news classification. A limitation of this research is the small amount of data. Future studies therefore need to collect more data to strengthen the validity of the results. Furthermore, semi-supervised learning is expected to be a meaningful direction for follow-up research, because it can build a high-performance automated classifier even when labeled data are limited.
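
The contrast the abstract draws between TF/TF-IDF vectors and content-based linguistic variables can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not list the thesis's actual variables, so the features in `content_features` are hypothetical stand-ins, the whitespace tokenizer is a placeholder for a proper Korean morphological analyzer, and the two example tweets are invented English sentences rather than data from the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (placeholder for a real Korean analyzer)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document, with raw TF and idf = log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


def content_features(text):
    """Hypothetical stylistic features; NOT the variable set used in the thesis."""
    tokens = text.split()
    n_tokens = len(tokens)
    return {
        "n_chars": len(text),
        "n_tokens": n_tokens,
        "avg_token_len": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
        "n_digits": sum(ch.isdigit() for ch in text),
        "has_url": int(bool(re.search(r"https?://", text))),
    }


# Invented example texts, for illustration only.
tweets = [
    "Breaking!!! Shocking news about the election",
    "Election results announced today",
]
tfidf_vectors = tf_idf(tweets)                        # lexical view: one weight per term
style_vectors = [content_features(t) for t in tweets]  # stylistic view: fixed-length features
```

In this sketch the TF-IDF vector only records which words occur and how distinctive they are across the corpus, whereas the stylistic vector has the same fixed set of entries for every text regardless of vocabulary. A feature-selection step such as the stepwise regression mentioned in the abstract would then operate on a table of such fixed-length features.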