DSpace at EWHA: A Study on the Appropriate Sample Size for LDA Topic Modeling (LDA 토픽모델링의 적정 표본크기 분석 연구)

Title: A Study on the Appropriate Sample Size for LDA Topic Modeling (LDA 토픽모델링의 적정 표본크기 분석 연구)

Other Titles: A Study on Appropriate Sample Size Analysis of LDA Topic Modeling: Focusing on High School Credit System News Articles

Authors: 전은정

Issue Date: 2023

Department/Major: Graduate School, Department of Education

Publisher: Ewha Womans University Graduate School

Degree: Doctor

Advisors: 최윤정

Abstract: As the importance and usefulness of generating information and predicting the future through big data analysis have become apparent, research using big data has been increasing. Within that body of work, research on unstructured data, which accounts for a large share of big data, is especially active. Most unstructured data consists of text, and mining such data is called text mining. Text mining offers various analysis methods, but topic modeling based on latent Dirichlet allocation (LDA) is widely used (Choi & Park, 2020; Blei, 2012). In LDA topic modeling, topic analysis becomes unstable when the documents used for the analysis are insufficient.
If a small number of documents makes the LDA model sensitive to small changes in the sampled words, the topic analysis results may vary greatly depending on the sampling procedure, which can lead to inappropriate results and interpretations (Hecking & Leydesdorff, 2018). In other words, the data under study affect the inference performance of the LDA model, so guidelines are needed, yet related studies are scarce (Tang et al., 2014). Blei et al. (2003) noted that when the number of documents is small, it is difficult to learn topics well enough to handle the documents. Hecking & Leydesdorff (2018) pointed out that a relatively small number of documents can lead to inappropriate research results. Naushan (2020) stated that when working with news articles, the minimum number of articles required for modeling is 600 but more than 1,000 are needed for better results, and that for Twitter 5,000 to 10,000 documents are needed. Amazon's AWS solution states that at least 1,000 documents are needed to obtain the best results in LDA topic modeling (AWS, 2023). Despite the importance of the number of documents, an examination of the top 45 LDA topic modeling studies by citation count, retrieved from the Korea Citation Index (KCI), showed that 31.8% analyzed fewer than 500 documents, and in the field of education the share reached 52.2%. The first purpose of this study was therefore to determine the number of documents suitable for topic modeling based on the LDA algorithm, one of the text mining methods in active use. The subjects of the study were 7,115 news articles about the high school credit system, published between the Ministry of Education's announcement of its introduction (November 27, 2017) and 2022.
The high school credit system is scheduled to be fully introduced in all high schools nationwide by 2025. Under this system, students design their studies according to their aptitude and career path, select and complete the courses they wish to take, and graduate when their accumulated credits meet the graduation standards. Because the system will bring major changes to school space, teacher supply and demand, curriculum operation, and even the college admissions system, it is a matter of general interest. News articles, meanwhile, often carry an interpretation of the purpose and direction of a given subject and tend to steer readers toward a particular goal by making them aware of it. Analyzing news articles therefore makes it possible to grasp socially contested issues and the views of the various stakeholders on them. The second purpose of this study was thus to analyze which topics related to the high school credit system are covered by news articles ahead of its full introduction, and to provide implications for the policy. The research questions are as follows.
Research question 1. What topics of the high school credit system are covered in news articles?
1-1. What topics of the high school credit system are identified through news articles?
1-2. Which of the identified topics are covered by many news articles?
1-3. Do these topics differ from the high school credit system topics covered by materials other than news articles?
1-4. How do the topics change over the years?
Research question 2. What number of documents is suitable for LDA topic modeling analysis?
2-1. Does the topic agreement between the full corpus and the sample data differ according to the number of documents?
2-2. Does the AUC of the ROC curve differ according to the number of documents?
2-3. How many documents are suitable for LDA topic modeling analysis?
The research procedure for addressing these questions is as follows.
First, the 7,115 news articles on the high school credit system published from its announcement through 2022 were analyzed for topics using the R statistical program (version 4.2.2), and the yearly change in topics was examined. Second, 120 sample datasets with different numbers of documents were generated by random sampling from the full corpus, and topic analysis was performed on every sample dataset. The SPSS statistical program was used to generate the random numbers for document sampling. The number of documents was set at 1% (71), 5% (356), 10% (712), 15% (1,067), 30% (2,135), and 60% (4,269) of the full corpus (7,115). Twenty sample datasets were generated for each ratio. For example, 20 sample datasets of 71 documents each, corresponding to 1% of the 7,115 documents, were created, and with 20 datasets per ratio a total of 120 datasets were generated and analyzed. The topics that matched between the full corpus and the sample data were then identified using an Excel VBA macro. Third, the number of documents required for LDA topic modeling analysis was investigated through the agreement between the topics of the full corpus and those of the sample datasets and through the AUC of the ROC curve. The results of the study are as follows.
First, topic analysis of the news articles on the high school credit system extracted a total of 16 topics: 1) regional efforts to strengthen capacity for operating the high school credit system; 2) guidance on the high school credit system, such as course selection, completion, credit acquisition, and graduation standards; 3) the policy of expanding regular admissions and the high school credit system; 4) expanded course offerings and elective classes according to career paths; 5) creation of innovative, future-oriented school spaces; 6) the Ministry of Education's reorganization of the college admissions system following the introduction of the high school credit system; 7) online joint curricula linking local universities and nearby high schools; 8) superintendent candidates' pledges on the high school credit system; 9) the Ministry of Education's push to convert foreign language high schools, international high schools, and autonomous private high schools into general high schools; 10) specialized career programs (vocational high schools); 11) the search for college admissions procedures suited to the high school credit system; 12) the declining school-age population and the training of secondary school teachers; 13) designation and support of high school credit system leading schools by offices of education; 14) the purpose of introducing the high school credit system and the current state of school adaptation; 15) curriculum revision to guarantee the right to choose subjects; 16) the court ruling that the Seoul Metropolitan Office of Education's cancellation of autonomous private high school designations was unlawful.
These results are meaningful in that, beyond the common themes consistent with preceding studies that used big data analysis methods, additional topics were identified: superintendent candidates' pledges on the high school credit system, the push to convert foreign language, international, and autonomous private high schools into general high schools, the ruling that the cancellation of designations was unlawful, and the declining school-age population and the training of secondary school teachers. This shows that introducing and establishing the high school credit system requires an approach that comprehensively considers the perspectives of the various stakeholders surrounding it. In addition, the yearly change in topic proportions revealed that topics on the actual operation and current status of the high school credit system are increasing, while topics such as the college admissions system and high school types are decreasing, confirming that the introduction and establishment of the system is actively under way and that it is taking hold as an independent system. Second, to determine the number of documents required for LDA topic modeling analysis, the topic agreement between the full corpus and the sample data was examined. For datasets of 71 documents, corresponding to 1% of the full corpus (7,115 documents), the agreement with the topics obtained from the full corpus did not reach 40%. Datasets of 356 documents, corresponding to 5%, showed an average agreement of 61.51%, somewhat higher than the 71-document datasets, but the effect of the small number of documents was still evident.
For datasets of 712 documents, corresponding to 10%, the topic agreement rose sharply to 71.97%, and for datasets of 1,067 documents, corresponding to 15%, it was 72.8%. For datasets of 2,135 documents, corresponding to 30%, the topic agreement was 83.26%, and for datasets of 4,269 documents, corresponding to 60%, it was 82.85%. To assess how accurately the topic analysis at each document count could recover the topics of the full corpus, the AUC of the ROC curve was analyzed. Datasets of 712 or more documents had an AUC of .8 or higher, indicating excellent discriminatory power, and datasets of 2,135 documents had an AUC of .939, indicating outstanding discriminatory power. Based on these results, it was concluded that at least about 700 documents should be secured for LDA topic modeling analysis, and that about 2,000 or more documents are sufficient for LDA topic modeling research. The significance of this study is as follows. Topic analysis was used to identify what the media mainly cover regarding the high school credit system, which is bringing major changes to high school education and the college admissions system. This made it possible to discover social issues that were not highlighted in previous studies, such as superintendent candidates' pledges on the high school credit system, the conversion of autonomous private high schools and similar schools into general high schools, and the declining school-age population together with the need to increase the number of secondary school teachers.
In addition, the yearly change in topic proportions revealed that topics on how the high school credit system is operated and on its current status are increasing, while topics such as the college admissions system and high school types, which used to be discussed together with the high school credit system, are decreasing. This made it possible to confirm the direction in which public interest in the system is shifting and the degree to which the system has taken hold. Furthermore, the number of documents required for LDA topic modeling analysis was analyzed using sample datasets generated from news articles on the high school credit system, showing that the minimum number of documents that should be secured is 700 and that 2,000 documents are sufficient for analysis. This provides a guide for researchers who intend to conduct LDA topic modeling analysis in the future. This study encourages researchers to think about securing documents and to consider the number of documents secured when conducting LDA topic modeling analysis. It will also contribute to improving the accuracy and quality of related research by helping researchers conduct LDA topic modeling analysis on an appropriate number of documents.
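
The sampling design described in the abstract (six ratios of 7,115 documents, 20 random samples per ratio, 120 datasets in total) and a topic agreement score can be sketched in Python as follows. This is a minimal illustration, not the author's procedure: the study itself used R, SPSS, and an Excel VBA macro, and the abstract does not say how matching topics were defined. Here a topic is assumed to be represented by a set of top terms, and a full-corpus topic is assumed to count as matched when its Jaccard overlap with some sample topic reaches a hypothetical threshold `min_overlap`. The function names and the seed are likewise illustrative.

```python
import random

TOTAL_DOCS = 7115                   # corpus size reported in the abstract
PERCENTS = [1, 5, 10, 15, 30, 60]   # sampling ratios reported in the abstract
SAMPLES_PER_RATIO = 20              # sample datasets drawn per ratio


def sample_size(total: int, percent: int) -> int:
    """Return total * percent / 100, rounded half up with integer arithmetic."""
    return (total * percent + 50) // 100


def draw_samples(doc_ids, percents, per_ratio, seed=0):
    """Draw per_ratio random samples (without replacement) for each ratio.

    The seed is a hypothetical choice made only for reproducibility.
    """
    rng = random.Random(seed)
    return {
        p: [rng.sample(doc_ids, sample_size(len(doc_ids), p))
            for _ in range(per_ratio)]
        for p in percents
    }


def concordance(full_topics, sample_topics, min_overlap=0.5):
    """Share of full-corpus topics matched by at least one sample topic.

    Assumption of this sketch: each topic is a set of top terms, and two
    topics match when their Jaccard overlap is at least min_overlap.
    """
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    matched = sum(
        1 for full in full_topics
        if any(jaccard(full, s) >= min_overlap for s in sample_topics)
    )
    return matched / len(full_topics)


# Document counts per ratio, then 20 random samples of document ids per ratio.
sizes = [sample_size(TOTAL_DOCS, p) for p in PERCENTS]
samples = draw_samples(list(range(TOTAL_DOCS)), PERCENTS, SAMPLES_PER_RATIO)
n_datasets = sum(len(v) for v in samples.values())
```

Rounding is done half up in integer arithmetic because Python's built-in `round` rounds halves to the nearest even number, which would give 2,134 rather than the reported 2,135 for the 30% ratio. In practice, `concordance` would be averaged over the 20 sample datasets of each ratio after fitting an LDA model to each one.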