DSpace at EWHA: 인지진단모형에 의한 차별기능문항 추출 방법 비교와 TIMSS 2015 수학 문항의 차별기능 원인 탐색

Title: 인지진단모형에 의한 차별기능문항 추출 방법 비교와 TIMSS 2015 수학 문항의 차별기능 원인 탐색

Other Titles: Comparing the Detection Method of Differential Item Functioning Based on the Cognitive Diagnosis Model and Exploring the Cause of Different Functioning in TIMSS 2015 Mathematics Items

Authors: 권승아

Issue Date: 2017

Department/Major: Graduate School, Department of Education

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 성태제

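Abstract: A test is an instrument for measuring an individual's characteristics or abilities, and if its items function in a biased way depending on the group to which an examinee belongs, the fairness and validity of the test are undermined. Differential item functioning (DIF) detection can be carried out to test statistically whether an item favors or disadvantages a particular group, as part of the effort to secure test fairness and validity. A DIF item is an item for which examinees of equal ability have different probabilities of answering correctly depending on the group to which they belong. Examinee ability is defined as the total test score in classical test theory (CTT) and as the examinee's latent trait (θ) given by the item characteristic curve (ICC) in item response theory (IRT). In cognitive diagnosis models (CDMs), which have recently drawn attention, examinee ability is the mastery or non-mastery of attributes, that is, the pieces of knowledge or skill needed to solve the items of a test, and this pattern is called the attribute profile. Accordingly, under a CDM, DIF is said to exist when examinees in the same latent class, with the same attribute profile, differ in their probability of a correct response depending on group characteristics such as gender, race, or country (Li, 2008; Li & Wang, 2015). Because CDMs provide multidimensional, dichotomous examinee characteristics, they allow multidimensional criterion-referenced interpretation and are useful in educational settings; they are therefore being actively applied not only to the analysis of test data but also from the test development stage onward. To increase the usefulness of CDMs in test data analysis and test development, efforts to secure test fairness and validity are needed. One such effort is verification through DIF detection, and CDM-based tests require methods that can analyze DIF at the attribute level. For this purpose, the CDM-Wald test (Hou, de la Torre & Nandakumar, 2014) and the LCDM-DIF method (Li & Wang, 2015) have recently been proposed as CDM-based DIF detection methods. Because both methods were illustrated with the DINA model, the first purpose of this study is to verify whether they also detect DIF items appropriately under other specific models. The second purpose is to detect and compare DIF items with the CDM-Wald test, the LCDM-DIF method, the Raju method, and the MH method on item response data based on the C-RUM model, and thereby to provide information for choosing a more useful DIF detection method. Finally, the study applies the methods to real data to explore the causes of DIF at the item and attribute levels and to draw implications for securing item fairness at the development stage of CDM-based tests. To compare how well the detection methods identify DIF items, a simulation study was conducted, and real data were analyzed to explore the causes of DIF. For the simulation, CDM-based data were generated under a total of 40 conditions that varied the number of items, the sample size, the proportion of DIF items, the attribute conformity of the items, and the DIF effect size. Each condition was replicated 30 times; for each simulated data set, DIF items were detected with the CDM-Wald test, the LCDM-DIF method, the Raju method, and the MH method, and the empirical Type I error, Type II error, power, agreement statistic, and Kappa coefficient were compared. Multiple regression analysis was also carried out to examine how the simulation conditions affected the empirical Type I error, Type II error, and power.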
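The evaluation criteria named at the end of this paragraph (empirical Type I error, Type II error, power, agreement, and the Kappa coefficient) can be illustrated with a minimal Python sketch. It assumes that, for one replication, the true DIF status of each item and the flags produced by a detection method are available as boolean lists. The function names `dif_detection_metrics` and `cohen_kappa` are hypothetical and are not taken from the thesis, whose exact computations the abstract does not describe.

```python
import math
from typing import Dict, Sequence


def dif_detection_metrics(true_dif: Sequence[bool], flagged: Sequence[bool]) -> Dict[str, float]:
    """Empirical error rates for one replication (hypothetical helper).

    true_dif[i] is True if item i was generated with DIF;
    flagged[i] is True if the detection method flagged item i.
    """
    if len(true_dif) != len(flagged):
        raise ValueError("true_dif and flagged must have the same length")
    tp = sum(1 for t, f in zip(true_dif, flagged) if t and f)
    fn = sum(1 for t, f in zip(true_dif, flagged) if t and not f)
    fp = sum(1 for t, f in zip(true_dif, flagged) if not t and f)
    tn = sum(1 for t, f in zip(true_dif, flagged) if not t and not f)
    # Type I error: share of non-DIF items that were flagged.
    type1 = fp / (fp + tn) if (fp + tn) else math.nan
    # Power: share of DIF items that were flagged; Type II error is its complement.
    power = tp / (tp + fn) if (tp + fn) else math.nan
    return {"type1": type1, "type2": 1.0 - power, "power": power}


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> Dict[str, float]:
    """Raw agreement and Cohen's kappa between two methods' DIF flags."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")
    p_obs = sum(1 for x, y in zip(a, b) if x == y) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Agreement expected by chance from the two marginal flag rates.
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else math.nan
    return {"agreement": p_obs, "kappa": kappa}
```

Averaging these quantities over the replications of a condition would give condition-level values. Raw agreement can be high simply because most items are non-DIF, which is why a chance-corrected index such as Kappa can tell a different story.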
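The real-data application used the item responses of 760 Korean students to 18 eighth-grade mathematics items from TIMSS (Trends in International Mathematics and Science Study) 2015, analyzed with the same DIF detection methods used in the simulation. For this analysis, the Q-matrix was constructed and validated using the RSS method and the opinions of content experts. To explore the causes of DIF in the real data, the study examined whether DIF arose by attribute, content domain, and item type, and collected content experts' opinions item by item on the basis of item content, item difficulty and discrimination, and the distribution of responses across options. The conclusions from the simulation are as follows. Among the CDM-based methods, the CDM-Wald test detected DIF items appropriately overall, whereas the LCDM-DIF method performed relatively poorly. The MH method, based on classical test theory, detected DIF items appropriately, but the Raju method, based on item response theory, did not. As for classification agreement between methods, agreement was high for all pairs of methods according to the agreement statistic. According to the Kappa coefficient, however, agreement was low between the LCDM-DIF and MH methods and between the LCDM-DIF and Raju methods, while the CDM-Wald test showed good overall agreement with the other three methods. For the CDM-Wald test, Type I error, Type II error, and power were sensitive to the attribute conformity condition: under otherwise identical conditions, higher attribute conformity raised Type II error by about 40 to 100% and lowered power by about 40 to 100%, and this pattern intensified as the DIF effect size became smaller. In a comparison with the Li & Wang (2015) simulation condition that differs only in sample size, and contrary to the earlier finding that the CDM-Wald test is almost unaffected by the proportion of DIF items, the CDM-Wald test under the C-RUM model was affected by the proportion of DIF items in both empirical Type I error and power. The LCDM-DIF method had an empirical Type I error about 0.07 higher than the other methods, so it is relatively more likely to flag non-DIF items as DIF items. However, it was less affected than the CDM-Wald test by attribute conformity, DIF effect size, and the proportion of DIF items. The MH method produced stable DIF classification overall, even though its measurement assumptions and theory differ. However, when attribute conformity was high and the DIF effect size was small, the CDM-Wald test and the LCDM-DIF method showed lower Type II error and higher power than the MH method. When the CDM and the DIF detection methods from the simulation were applied to the responses to the 18 TIMSS 2015 eighth-grade mathematics items to analyze gender differences, the CDM-Wald test, the LCDM-DIF method, and the MH method each flagged 5 items (27.8%) and the Raju method flagged 4 items (22.2%) as gender DIF items, for a total of 8 items (44.4%). Of these, the 5 items flagged by two or more methods were items 3 (M042019), 7 (M042066), 10 (M042229A), 14 (M042120), and 18 (M042224). At the attribute level, “calculation” favored male students, while “reasoning” and “data interpretation” favored female students; at the content-domain level, “numbers” favored male students, while “geometry” and “data and probability” favored female students. Item type was not found to be a cause of gender DIF.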
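The significance and implications of this study are as follows. First, until recently the CDM-Wald test and the LCDM-DIF method had been studied only with the DINA model; by using the C-RUM model, which can reflect testing situations flexibly, this study verified whether the two methods also detect DIF items appropriately under a CDM other than DINA. For the CDM-Wald test, no notable difference in Type I error or power was found relative to previous research, but for the LCDM-DIF method the empirical Type I error under the C-RUM model was 86% higher than in the previous study (Li & Wang, 2015). Second, the comparison of methods in the simulation showed that the CDM-based DIF detection methods were sensitive, in empirical Type I error and power, to the attribute conformity condition, and attribute conformity is closely related to item discrimination in CDMs. Therefore, when item discrimination under the CDM is generally good, the CDM-Wald test and the LCDM-DIF method are useful, and when it is generally low, it is advisable to use the MH method as a supplement. The Raju method showed no clear advantage in Type I error, Type II error, or power for CDM-based item response data, so it should not be used on its own. Third, what distinguishes CDM-based DIF detection from existing methods is that it can identify DIF at the attribute level. In particular, the LCDM-DIF method has the advantage of testing statistically whether the effect of mastering each attribute on solving an item differs between groups. For example, in the real data, item 10 (M042229A), which was flagged as a DIF item, showed DIF on the “calculation” attribute, and how that attribute operated in solving the item can be analyzed by gender. Thus, if LCDM-DIF is used with covariates reflecting various group factors, achievement, socioeconomic status, and so on, more varied item analyses become possible. Fourth, the majority view among the content experts was that, rather than one gender favoring particular mathematics domains or attributes, males and females differ in the procedures or approaches they use to solve the items flagged as DIF. Differences in solution procedures or approaches mean that the use of, or pathway through, the attributes needed to solve an item may not operate identically across genders, which suggests that the Q-matrix may not function identically across groups. Because CDM-based DIF analysis can identify DIF arising in each item's attributes by group, it offers part of the information needed to infer between-group differences in how attributes are expressed. If DIF occurs at the attribute level of an item, this can be used as information for revising the Q-matrix and improving the item. In addition, because it allows inferences about attributes that are weak in particular groups, or about attribute expression that departs from the test developer's intent, it can inform the revision of curriculum and of teaching and learning.
A test is a means of assessing individual characteristics or abilities; if its items function in a biased way depending on the examinee's group, the validity and fairness of the test may be undermined.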
Detecting differential item functioning (DIF) is an effort to test statistically whether an item functions to the advantage or disadvantage of a particular group, and thereby to secure the validity and fairness of the test. A DIF item is one for which examinees of equal ability have different probabilities of answering correctly depending on the group to which they belong. Examinee ability is defined as the total score in classical test theory (CTT) and as the latent trait (θ) given by the item characteristic curve (ICC) in item response theory (IRT). In cognitive diagnosis models (CDMs), which have drawn attention recently, examinee ability is the mastery of the attributes, that is, the knowledge or skills needed to solve the items of a test, and is called the attribute profile. Thus, under a CDM, DIF is said to exist when examinees in the same latent class, with an identical attribute profile, differ in their probability of a correct answer according to group characteristics such as sex, race, or country (Li, 2008; Li & Wang, 2015). CDMs provide multidimensional, dichotomous examinee characteristics that support multidimensional criterion-referenced interpretation; this educational value has led to their active use in test development and item analysis. DIF analysis is therefore essential in CDM-based tests for verifying validity and fairness, not only after testing but also during development, and a method that can analyze DIF at the attribute level is needed. Recently, the CDM-Wald test and the LCDM-DIF method were introduced as CDM-based DIF detection methods. Because both methods were proposed only for the DINA model, the first purpose of this study is to verify whether they classify DIF items appropriately under other models.
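The definition of DIF under a CDM, and the two models named in this abstract, can be written out as follows. The notation is an assumption based on the forms in which these models are commonly presented, not the thesis's own: X_j is the response to item j, α is the attribute profile, G is group membership (reference R or focal F), q_jk is the Q-matrix entry for item j and attribute k, s_j and g_j are the DINA slip and guessing parameters, and λ are the C-RUM intercept and main-effect parameters.

```latex
% DIF under a CDM: same attribute profile, different success probability by group
\[
P(X_j = 1 \mid \boldsymbol{\alpha}, G = \mathrm{R}) \neq
P(X_j = 1 \mid \boldsymbol{\alpha}, G = \mathrm{F})
\quad \text{for at least one attribute profile } \boldsymbol{\alpha}
\]

% DINA model: all required attributes must be mastered (conjunctive)
\[
P(X_j = 1 \mid \boldsymbol{\alpha}) = (1 - s_j)^{\eta_j}\, g_j^{\,1 - \eta_j},
\qquad
\eta_j = \prod_{k=1}^{K} \alpha_k^{\,q_{jk}}
\]

% C-RUM: each mastered required attribute adds to the log-odds (compensatory)
\[
\operatorname{logit} P(X_j = 1 \mid \boldsymbol{\alpha})
= \lambda_{j,0} + \sum_{k=1}^{K} \lambda_{j,k}\, q_{jk}\, \alpha_k
\]
```

Under DINA an item has only two success probabilities, one for examinees who have mastered every required attribute and one for everyone else. Under C-RUM each required attribute has its own effect, which is what makes attribute-level DIF comparisons possible.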
The study also aims to provide information for choosing a useful DIF detection method by detecting and comparing DIF items with the CDM-Wald test and the LCDM-DIF, Raju, and MH methods on item responses based on the C-RUM model, one of the specific CDMs. Lastly, the study uses real data to explore the causes of DIF at the item and attribute levels and to suggest implications for securing item fairness at the development stage of CDM-based tests. To these ends, a simulation study was conducted to compare the DIF detection methods, and real data were analyzed to explore the causes of DIF. The simulation varied the number of items, the sample size, the proportion of DIF items, the attribute conformity of the items, and the DIF effect size, so that CDM-based data were generated under 40 conditions in total. Each condition was replicated 30 times, and DIF items in each simulated data set were detected with the CDM-Wald test and the LCDM-DIF, Raju, and MH methods. The methods were then compared on empirical Type I and Type II error, power, the agreement statistic, and the Kappa coefficient. Multiple regression analysis was also conducted to examine the effect of the simulation conditions on empirical Type I error and power. The real data consist of the responses of 760 Korean students to 18 eighth-grade mathematics items from the 2015 Trends in International Mathematics and Science Study (TIMSS), analyzed with the same DIF detection methods used in the simulation. The Q-matrix was constructed and validated using the RSS method and the opinions of content experts.
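Of the four methods compared, the MH (Mantel-Haenszel) procedure is the simplest to show in code. The sketch below is a minimal illustration of the standard MH statistic for a single dichotomous item, matching examinees on a total score. It is not the thesis's implementation: the function name `mantel_haenszel_dif`, the group labels `"R"` and `"F"`, and the choice of matching variable are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def mantel_haenszel_dif(item: Sequence[int], group: Sequence[str],
                        total: Sequence[Hashable]) -> Dict[str, float]:
    """Mantel-Haenszel DIF statistics for one item (hypothetical helper).

    item:  0/1 responses to the studied item
    group: "R" (reference) or "F" (focal) for each examinee
    total: matching variable, e.g. the total test score, used as the stratum
    """
    # Per stratum: [A, B, C, D] = [R correct, R incorrect, F correct, F incorrect]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for x, g, s in zip(item, group, total):
        strata[s][(0 if g == "R" else 2) + (0 if x == 1 else 1)] += 1

    num = den = sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with one examinee carries no information
        n_r, n_f = a + b, c + d   # group sizes in the stratum
        m1, m0 = a + c, b + d     # correct / incorrect totals in the stratum
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_e += n_r * m1 / n                               # E(A) under no DIF
        sum_v += n_r * n_f * m1 * m0 / (n * n * (n - 1))    # Var(A) under no DIF

    odds_ratio = num / den if den > 0 else math.inf
    # Continuity-corrected MH chi-square (1 df); the max() guards small differences.
    chi2 = max(abs(sum_a - sum_e) - 0.5, 0.0) ** 2 / sum_v if sum_v > 0 else 0.0
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi-square, 1 df
    # ETS delta scale: negative values indicate the item favors the reference group.
    delta = -2.35 * math.log(odds_ratio) if 0 < odds_ratio < math.inf else math.nan
    return {"odds_ratio": odds_ratio, "chi2": chi2, "p_value": p_value, "delta": delta}
```

Because MH conditions on an observed score rather than on an attribute profile, it can flag an item as showing DIF but cannot say which attribute is responsible. That gap is what the CDM-based methods in this study are meant to fill.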
To explore the causes of DIF in the real data, the study examined DIF by attribute, content domain, and item type, and gathered content experts' opinions on each item based on information including item content, difficulty, discrimination, and response distribution. The overall conclusions of the simulation are as follows. The CDM-Wald test, a CDM-based method, classified DIF items appropriately by and large, and the MH method, which is grounded in classical test theory, showed stable power and empirical Type I error even though the response data were based on a CDM. In contrast, the LCDM-DIF and Raju methods detected DIF items relatively poorly. Classification agreement among the methods was generally high according to the agreement statistic. In terms of the Kappa coefficient, agreement between the LCDM-DIF and MH methods and between the LCDM-DIF and Raju methods was generally low, whereas the CDM-Wald test agreed satisfactorily overall with the other three methods. For the CDM-Wald test, empirical Type I and Type II error and power responded sensitively to the attribute conformity condition. Under otherwise identical conditions, higher attribute conformity increased Type II error by approximately 40-60% and decreased power by approximately 40-100%, and this pattern intensified as the DIF effect size became smaller. Unlike the finding of Li & Wang (2015) that the proportion of DIF items has little influence on the CDM-Wald test, under the C-RUM model both the empirical Type I error and the power of the CDM-Wald test were influenced by the proportion of DIF items.
The LCDM-DIF method had an empirical Type I error about 0.07 higher than the other methods, which means a higher chance of flagging non-DIF items as DIF items. It was, however, less affected than the CDM-Wald test by the item parameter, the DIF size, and the proportion of DIF items. The MH method, despite its different measurement assumptions and theory, showed stable DIF classification overall. However, both the MH and Raju methods tended to have lower power than the CDM-based DIF methods when the DIF size and the item parameter were both small. The conclusions on the DIF detection results for the 18 TIMSS 2015 eighth-grade mathematics items and on the causes of DIF are as follows. The CDM-Wald test, the LCDM-DIF method, and the MH method each flagged 5 items (27.8%) and the Raju method flagged 4 items (22.2%), for 8 DIF items (44.4%) in total. Of these, the five items flagged by two or more methods were items 3 (M042019), 7 (M042066), 10 (M042229A), 14 (M042120), and 18 (M042224). At the attribute level, male students had an advantage in “calculation” while female students had an advantage in “reasoning” and “data interpretation”; at the content-domain level, male students were favored in “numbers” while female students were favored in “geometry” and “data and probability.” Item type was not found to be a cause of gender DIF. The significance and implications of this study are as follows. First, until recently the CDM-Wald test and the LCDM-DIF method had been used to detect DIF items only under the DINA model; by using the C-RUM model, which can flexibly reflect testing conditions, this study verified whether the two methods classify DIF items appropriately under another CDM.
For the CDM-Wald test, no particular difference in Type I error or power was found relative to the previous study; for the LCDM-DIF method, however, the empirical Type I error was 86% higher than in the previous study (Li & Wang, 2015). Second, the comparison of methods in the simulation showed that the CDM-based DIF detection methods respond sensitively to the attribute conformity of the items, which is closely related to item discrimination in CDMs. Thus, the CDM-Wald and LCDM-DIF methods are more useful for DIF detection when item discrimination under the CDM is generally good, while the MH method is recommended as a supplement when discrimination is low. The Raju method showed no significant advantage in Type I error or power for CDM-based data, so it should not be used on its own. Third, CDM-based DIF detection differs from conventional DIF detection in that DIF can be identified at the attribute level. In particular, the LCDM-DIF method has the advantage of testing statistically whether the effect of attribute mastery on solving an item differs between groups. For example, in the real data, item 10 (M042229A), which was flagged as a DIF item, was found to function differentially on the “calculation” attribute, and how “calculation” operated in solving the item can be analyzed by gender. Therefore, more varied item analyses will be possible if covariates reflecting diverse group factors, socioeconomic status, and so on are added. Fourth, most content experts held that males and females differ in the procedures and methods used to solve the items identified as DIF, rather than in any particular preference for a mathematics domain or attribute.
Differences in the procedures and methods used to solve items mean that the use of, or pathway through, the attributes needed for an item may not function identically across genders, which implies that the Q-matrix may not operate identically across groups. CDM-based DIF analysis can capture DIF in each item's attributes by group and thus provides some of the information needed to infer between-group differences in how attributes are expressed. If a defect unrelated to the attributes is found in the item itself, this provides a basis for modifying or eliminating the item. If DIF occurs at the attribute level, it can be used as information for correcting the Q-matrix and improving the item. In addition, information can be obtained about attributes that are particularly weak in a given group, or about attribute expression that differs from the test developer's intention, which can be used to revise the curriculum as well as teaching and learning.