DSpace at EWHA: An analysis of online news on ethics of artificial intelligence using topic modeling

Title: 토픽 모델링을 활용한 인공지능 윤리 관련 언론보도 분석

Other Titles: An analysis of online news on ethics of artificial intelligence using topic modeling

Authors: 이해린

Issue Date: 2023

Department/Major: Graduate School of Education, AI Convergence Education Major

Publisher: Graduate School of Education, Ewha Womans University

Degree: Master

Advisors: 이선복

Abstract: The purpose of this study is to discuss the direction of artificial intelligence (AI) ethics education based on an analysis of recent media reports related to AI ethics. The research was carried out in three steps using the R program. First, the collection period for press reports on AI ethics, the subject of analysis, was set from January 1, 2019 to December 31, 2021. The Naver news service was used as the search tool, and article headlines, which summarize and condense the content of articles, were collected by web crawling. Second, because the collected data is unstructured text, it was refined before analysis through pre-processing such as word-based tokenization and the handling of special characters, stopwords, and redundant words. Third, keyword frequency analysis and topic modeling were applied to the 11,113 data items prepared through collection and refinement. Keyword frequency analysis was based on TF, the simple sum of word occurrence counts, and on TF-IDF, which indicates word importance by considering whether a word is frequent across all documents or only in some specific documents. Prior to topic modeling, the optimal number of topics was selected using four metrics: Griffiths2004, Deveaud2014, CaoJuan2009, and Arun2010. Next, key words for each topic were calculated using Latent Dirichlet Allocation (LDA), a probabilistic generative model that assumes that every document consists of a distribution of topics and that every topic consists of a distribution of words. Finally, by analyzing and interpreting the key words by year, results meaningful for the purpose of this study were explored and topic names were assigned. The number of topics is 5 for each of the 3 years. The topic names assigned for each year as a result of the research are as follows.
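In 2019, 'AI introduction 1, AI introduction 2, AI introduction 3, AI utilization strategy 1, AI utilization strategy 2'. In 2020, 'AI utilization promotion, AI utilization expansion 1, AI utilization expansion 2, AI education, AI regulation discussion'. In 2021, 'AI ethics 1, AI ethics 2, AI ethics 3, AI utilization service, AI utilization industry'. From the changes in major keywords and topics by year, the following implications for AI ethics education were derived. The social importance of AI ethics education will increase further in the future, so the content of AI ethics education needs to be strengthened. As a solution, AI ethics education content should be subdivided by the main parties involved, to minimize ambiguity and make it more practicable. In addition, the effectiveness of ethics education should be maximized by keeping the cycle for checking content suitability flexible. This study is meaningful in that it presents a direction for AI ethics education based on text analysis of press reports. Finally, this study is differentiated in that it addresses two issues that are often pointed out as limitations of topic modeling studies. First, reliability was secured by collecting the data from raw data through web crawling, without going through a secondary search platform. Second, prior to topic modeling, the researcher's subjectivity was minimized by using four metrics for selecting the number of topics.

Abstract (translated from Korean): Artificial intelligence, which has developed rapidly on the back of big data and computing performance, has grown into a major force changing the social paradigm. The benefits of fast-advancing AI have been accompanied by side effects, and the importance of AI ethics and AI ethics education as a remedy has grown. However, research on AI ethics education that reflects this social interest and importance remains insufficient. This study therefore analyzes recent media reports related to AI ethics to extract topics and, on that basis, discusses the direction of AI ethics education. Article collection and analysis were both carried out with the R program, and the research proceeded in the following order. First, the collection period for media reports on AI ethics, the object of analysis, was set from January 1, 2019 to December 31, 2021, and headlines, which summarize and condense article content, were collected by web crawling.
Second, to refine the collected unstructured text data before analysis, pre-processing was performed, including word-based tokenization and the handling of special characters, stopwords, and synonyms. Third, keyword frequency analysis and topic modeling were applied to the 11,113 data items prepared through collection and refinement. Keyword frequency analysis was based on TF, the simple sum of word occurrence counts, and on TF-IDF, which gives higher importance to words that occur frequently within particular documents. Fourth, before topic modeling, the Griffiths2004, Deveaud2014, CaoJuan2009, and Arun2010 metrics and the Elbow Method were used together to select the optimal number of topics. Next, key words for each topic were derived with Latent Dirichlet Allocation (LDA), a probabilistic generative model that assumes that every document consists of a distribution of topics and that every topic consists of a distribution of words. Finally, topic names were assigned for each year by combining the key words of each topic with the related articles. The topic names assigned for each year are as follows. The 2019 data clustered into 5 topics, 'AI introduction 1, AI utilization strategy 1, AI utilization strategy 2, AI introduction 2, AI introduction 3'; the 2020 data into 5 topics, 'AI utilization promotion, AI utilization expansion 1, AI utilization expansion 2, AI education, AI regulation discussion'; and the 2021 data into 5 topics, 'AI ethics 1, AI utilization service, AI utilization industry, AI ethics 2, AI ethics 3'. From the changes in major keywords and topics by year, the following implications for AI ethics education were derived. First, the social importance of AI ethics education will grow further, and preparation for this is needed. Second, AI ethics education content should be subdivided by the parties involved, to minimize ambiguity and make it more practicable. However, to build sound judgment grounded in a comprehensive understanding of AI ethics, education should be integrated rather than fragmented by party. Third, given the pace at which AI is developing and spreading, the cycle for checking the suitability of AI ethics education content should be kept flexible to maximize the effectiveness of education. This study is differentiated in that it addresses two issues frequently pointed out as limitations of topic modeling studies. First, potential data bias was minimized by collecting the data from raw data through web crawling, without going through a secondary search platform. Second, prior to topic modeling, the researcher's subjectivity was minimized by using four metrics for selecting the number of topics. However, this study has the limitation that its results may vary with the researcher's judgment in the data refinement process, such as the handling of stopwords and synonyms.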
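As an illustration of the pre-processing and keyword frequency steps described in the abstract, the following is a minimal sketch in Python (the study itself used R, and its code is not part of this record). The toy English headlines, the stopword list, and all function names are hypothetical. The TF-IDF weighting shown, with tf as the within-document proportion and idf as ln(N / df), is one common variant and is an assumption here; the abstract does not state which weighting was used.

```python
import math
import re
from collections import Counter

# Hypothetical toy headlines; the thesis analysed 11,113 Korean news data items.
HEADLINES = [
    "AI ethics guidelines announced",
    "AI ethics education for schools",
    "Hospital adopts AI service",
]

# Hypothetical stopword list, chosen only for this sketch.
STOPWORDS = {"for", "the", "a"}


def tokenize(text):
    """Lowercase, replace special characters, split on whitespace, drop stopwords."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def term_frequency(docs):
    """TF as described in the abstract: total occurrences of each word over all documents."""
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    return totals


def tf_idf(docs):
    """Per-document TF-IDF, assuming tf = count / document length and idf = ln(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [tokenize(headline) for headline in HEADLINES]
tf = term_frequency(docs)
weights = tf_idf(docs)

print(tf.most_common(3))
print(weights[0])
```

In this toy corpus the word "ai" appears in every headline, so it has the highest TF but a TF-IDF weight of zero, while a word confined to a single headline, such as "guidelines", receives a higher weight. This is the contrast between the two measures that the abstract describes.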