DSpace at EWHA: Proof of MMLE-EM solution with the four-parameter logistic IRT Model and Analysis of Item non-slip in TIMSS 2015 Mathematics


Title: 4모수 문항반응모형을 적용한 MMLE-EM방법의 증명과 TIMSS 2015 수학의 문항실수제외도 분석

Other Titles: Proof of MMLE-EM solution with the four-parameter logistic IRT Model and Analysis of Item non-slip in TIMSS 2015 Mathematics.

Authors: 안선영

Issue Date: 2021

Department/Major: Graduate School, Department of Education

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 성태제

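Abstract: Item response theory (IRT) has attracted attention in many fields as a test theory for securing the fairness and reliability of tests. The three-parameter model, currently the most widely used IRT model, estimates item discrimination, difficulty, and guessing; the four-parameter logistic model additionally estimates item non-slip. Item non-slip is the upper bound on the probability of a correct response in the four-parameter model, and it makes it possible to identify item slip, that is, incorrect responses made by mistake by examinees who are able to solve the item. Item slip can arise from individual differences among examinees, but also from structural flaws in the item. Because incorrect responses caused by item flaws lead directly to underestimating examinees' true ability, item slip should be reviewed during test development.

Barton and Lord (1981), who first studied the four-parameter model, sought to show the usefulness of accounting for item slip, but limitations such as the estimation method kept them from achieving that aim; the work retained only theoretical significance, and follow-up research was sparse. From the early 2000s, research on the four-parameter model in connection with computerized adaptive testing (CAT) drew renewed attention, the value of analyzing item non-slip came to be recognized, and related studies have increased steadily.

This study mathematically derives the main steps of the EM algorithm for marginalized maximum likelihood estimation (MMLE) under the four-parameter logistic IRT model, and of the Newton-Raphson procedure used within it. The motivation is that, as the theory developed, expositions of the MMLE-EM method for dichotomous logistic IRT models stopped at the three-parameter model and the four-parameter case was overlooked. The study also explores the usefulness of the four-parameter model by applying it to real data. The data are the responses of 1,876 students from the top five countries, including Korea, to eighth-grade mathematics items from the Trends in International Mathematics and Science Study (TIMSS), which satisfy the unidimensionality assumption of IRT. Of the model-by-model comparisons of the four item parameters of the four-parameter model (difficulty, discrimination, guessing, and non-slip), the comparisons of guessing and non-slip also included the DINA (Deterministic Inputs, Noisy "And" gate) model, a parsimonious non-compensatory cognitive diagnosis model. The Q-matrix, which is essential to DINA analysis, was developed by three qualified content experts and validated in both content and statistical terms through the residual sum of squares (RSS) method and expert meetings.

The results are as follows. First, the study derived the parameter estimation process of the MMLE and MMLE-EM methods for the dichotomous four-parameter logistic model. Because the final expressions contain the item non-slip variable d, substituting the three-parameter condition d = 1 and rearranging confirmed that they reduce to the form of the hierarchically nested three-parameter model. This result agrees with Baker and Kim (2004).

The results of the real-data analysis with the four-parameter model are as follows. Model fit was checked with the relative fit indices AIC, AICc, BIC, SABIC, and LL, and with a difference test. Taking AIC, AICc, and LL together with the difference test, which was run only among the IRT models, fit was best for the four-parameter model, followed by the two-parameter, three-parameter, and DINA models. Descriptive statistics of the item parameter estimates showed that the overall mean estimates tended to increase as the model became more complex. To compare only parameters that each model actually estimates, difficulty and discrimination were compared across the two-, three-, and four-parameter models, and guessing and non-slip across the three-parameter, four-parameter, and DINA models.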
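The relative fit indices named above can all be computed from a model's maximized log-likelihood (LL), its number of free parameters, and the sample size. The following is a minimal sketch using the commonly used definitions of AIC, AICc, BIC, and SABIC; the function name, the candidate models, and their log-likelihoods and parameter counts are hypothetical illustrations, not values from the thesis. Only the sample size of 1,876 comes from the abstract.

```python
import math


def fit_indices(log_lik, n_params, n_obs):
    """Relative fit indices from a maximized log-likelihood.

    Uses the commonly cited definitions; smaller values indicate better fit.
    """
    deviance = -2.0 * log_lik
    aic = deviance + 2.0 * n_params
    # Small-sample correction added to AIC
    aicc = aic + (2.0 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = deviance + n_params * math.log(n_obs)
    # Sample-size-adjusted BIC replaces n with (n + 2) / 24
    sabic = deviance + n_params * math.log((n_obs + 2) / 24.0)
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "SABIC": sabic}


# Hypothetical candidates: (log-likelihood, number of free parameters)
candidates = {"model_A": (-12050.0, 24), "model_B": (-12010.0, 48)}
n_obs = 1876  # number of respondents reported in the abstract

for name, (ll, k) in candidates.items():
    print(name, fit_indices(ll, k, n_obs))
```

Because BIC and SABIC penalize extra parameters more heavily than AIC, the indices need not agree on a ranking, which is one reason to report several of them.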
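First, for item difficulty, comparison of the estimates from the two-, three-, and four-parameter IRT models showed the three- and four-parameter estimates to be the most similar. The mean standard error estimates showed the same pattern, as did the Pearson correlations, the absolute differences between estimates, and the RMSE. The difficulty estimates of all three models were nonetheless close: even the lowest correlation, between the two- and four-parameter models, was very high at .944. Thus the three- and four-parameter models were the most similar, but all three models produced similar difficulty estimates.

Second, for item discrimination, the parameter and standard deviation estimates of the four-parameter model were larger than those of the nested two- and three-parameter models. The mean difference between the two- and three-parameter models was about .5 larger than that between the three- and four-parameter models. The mean standard error estimates behaved like the mean parameter estimates. The correlation analysis confirmed these results: the Pearson correlation was highest between the two- and three-parameter models (.865), compared with .713 for the three- and four-parameter models and .733 for the two- and four-parameter models, so the two- and three-parameter models were the most similar. Comparisons by absolute difference and RMSE likewise showed the two- and three-parameter estimates to be the closest.

Third, for item guessing, the mean estimate increased in the order three-parameter model, four-parameter model, DINA model, and the three- and four-parameter estimates were the most similar. Of these two, the four-parameter model differed less from the DINA model. The Pearson correlation between the three- and four-parameter models was very high at .834, whereas the correlations of the two IRT models with the DINA model were both negative: -.376 for the three-parameter model and -.362 for the four-parameter model. The guessing estimates of the three- and four-parameter IRT models therefore differ from those of the DINA model. The mean absolute difference between estimates was .068 between the IRT models, but .345 and .278 when the three- and four-parameter models, respectively, were compared with the DINA model. RMSE showed the same pattern, with the four-parameter model again closer to the DINA model than the three-parameter model was.

Fourth, for item non-slip, the mean estimate from the four-parameter model was about .1 larger than that from the DINA model; that is, the four-parameter model estimated item slip to be smaller. There is no absolute criterion for judging item non-slip, but relative to the other items, the four-parameter estimates ranged from .885 to 1, so no item stood out as markedly low. In the DINA model, by contrast, the estimates ranged from .743 to .994. The Pearson correlation between the four-parameter and DINA models was .101, so the two sets of estimates were almost uncorrelated. The absolute difference between the two models' estimates was .091 and the RMSE .118, smaller than the corresponding guessing comparisons between the three- and four-parameter models and the DINA model. Based on these item parameter estimates from the IRT and cognitive diagnosis models, item non-slip is an item parameter with the same name in both, but it showed different tendencies in each.

As for item information under each IRT model, the four-parameter model provided the most information on every item except item 12, that is, on the remaining 11 items, followed by the two-parameter and then the three-parameter model. The test information function showed the same order: four-parameter, two-parameter, three-parameter.

The limitations of the study are as follows. First, the empirical results of the real-data analysis do not rest on a variety of data sets, so this needs to be supplemented in future work.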
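In addition, the cognitive-domain mathematics test used as data was not developed on the basis of cognitive diagnosis theory, so the study shares the general limitation of assigning cognitive attributes after the fact.

The study has the following significance and implications. First, it presents a mathematical derivation and proof of the MMLE-EM method for the dichotomous four-parameter logistic model, improving on the limitations of the early research. In the Korean IRT research environment, where the research value of the four-parameter model has been neglected relative to other countries, this raises the research value of the usefulness of item non-slip. Second, by naming the upper asymptote d, the fourth item parameter that the model estimates, "item non-slip", it makes the correct-response probability of the four-parameter model easier to interpret and understand. Third, by examining item non-slip empirically in real data, it explored the parameter's potential as a means of improving the accuracy and fairness of tests. Fourth, by comparing the three- and four-parameter models with the DINA model, which estimates guessing and non-slip, it confirmed empirically, on real data, the differences in estimation between IRT models, which assume a single dimension, and cognitive diagnosis models, which assume multiple dimensions through several cognitive attributes. Finally, item non-slip analysis with the four-parameter model can be applied to item screening, the operation of testing systems, and the analysis of test results, both in the development of various tests (high-stakes tests, computerized tests, and psychological tests that measure negative psychological factors in the affective domain) and in educational practice. This can add accuracy and reliability to the measurement of items and of examinee ability.

Proof of MMLE-EM solution with the four-parameter logistic IRT Model and Analysis of Item non-slip in TIMSS 2015 Mathematics.

Item response theory (IRT) has been employed in many fields as a test theory that ensures the fairness and reliability of tests. The most widely used IRT model is the 3-parameter logistic model (3PLM), which estimates the discrimination, difficulty, and pseudo-guessing of items. Its extension, the 4-parameter logistic model (4PLM), additionally estimates item non-slip. The 4PLM was introduced to analyze mistakes made by able subjects through estimation of item non-slip; the probabilities of item slip and item non-slip therefore sum to 1. This is the key difference between the two models: in the 3PLM the upper asymptote of the correct-response probability is fixed at 1, whereas in the 4PLM the probability of a correct response cannot exceed the item non-slip parameter, which may be less than 1. Item slip can be caused not only by characteristics of the individual subject but also by structural flaws in the items. Ignoring item slip that stems from such flaws leads to underestimation of subjects' ability.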
Item non-slip therefore needs to be considered when developing good items. The early report on the 4PLM by Barton and Lord (1981) tried to investigate the usefulness of item non-slip in speed tests, but could not demonstrate it effectively because the estimation methods were unstable. The 4PLM thus remained a theoretical proposal for decades, without follow-up research. In the early 21st century, the rise of computerized adaptive testing (CAT) revived the 4PLM. In particular, computer-based tests (CBT) and CAT are strongly affected by subjects' experience with, and adaptation to, computer systems. Recently, the value of analyzing the 4PLM's item non-slip has been recognized abroad, and related research has increased continuously.

The main purpose of this study is to extend the expectation-maximization (EM) algorithm for marginalized maximum likelihood estimation in the 3PLM (Bock and Aitkin, 1981) to the 4PLM. Mathematical expositions of the method for dichotomous IRT models still stop at the 3PLM. This study therefore extends the MMLE-EM method to the 4PL IRT model, whose response function is monotonically increasing. It also gives a detailed mathematical description of the Newton-Raphson procedure for the 4PLM. The utility of the 4PLM is demonstrated through analysis of actual data from the 2015 TIMSS (the Trends in International Mathematics and Science Study). The study used 1,876 responses, from student groups in the top five countries, to 14 items in Booklet 6 of the 2015 TIMSS eighth-grade Mathematics Achievement Test, which satisfy unidimensionality, the basic assumption of IRT. The 4PLM was defined by four item parameters: difficulty, discrimination, guessing, and non-slip.
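The four item parameters listed above can be made concrete with the commonly cited form of the 4PL item response function, P(theta) = c + (d - c) / (1 + exp(-a(theta - b))). The abstract does not print the formula, so the sketch below assumes this standard parameterization without a scaling constant; the function names and the example parameter values are illustrative only. The symbol d for item non-slip follows the abstract, and item slip is taken as 1 - d, as the abstract states that the two sum to 1.

```python
import math


def p_4pl(theta, a, b, c, d):
    """Probability of a correct response under an assumed standard 4PL form.

    a: discrimination, b: difficulty, c: guessing (lower asymptote),
    d: non-slip (upper asymptote).
    """
    return c + (d - c) / (1.0 + math.exp(-a * (theta - b)))


def p_3pl(theta, a, b, c):
    """3PL response function: the 4PL with the upper asymptote fixed at 1."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative (made-up) item parameters
a, b, c, d = 1.2, 0.0, 0.2, 0.9

for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_4pl(theta, a, b, c, d), 4), round(p_3pl(theta, a, b, c), 4))

# Item slip implied by the non-slip parameter
print("slip:", 1.0 - d)
```

With d below 1, even a very able examinee's probability of a correct response levels off at d rather than at 1, which is how the model leaves room for slips. Setting d = 1 makes `p_4pl` coincide with `p_3pl`, mirroring the nesting that the study uses to check its derivation.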
The item guessing and non-slip parameters of the 4PLM were compared with the guessing and slip parameters of the cognitive diagnostic DINA (Deterministic Inputs, Noisy "And" gate) model, which is known as a typical non-compensatory model and the most parsimonious one. The Q-matrix is a critical requirement of DINA analysis. It was developed from the opinions of three qualified content experts on the second-year middle school mathematics curriculum, which corresponds to the curriculum of the TIMSS data. These experts also confirmed the content and the statistical validity of the Q-matrix.

The results of the study are as follows. First, the MMLE and MMLE-EM methods for the dichotomous 4PLM were derived in detail and verified by comparison with the 3PLM: setting d = 1 in the 4PLM reproduces exactly the 3PLM results of Baker and Kim (2004). To assess model fit, each model was evaluated with AIC, AICc, BIC, SABIC, LL, and a difference test. Fit was best for the 4PLM, followed by the 2PLM, 3PLM, and DINA models, in that order. The choice of the 4PLM was therefore valid and reasonable. Descriptive statistics of the parameter estimates showed that the mean difficulty increased as the model became more complex. Item difficulty and discrimination were compared across the 2PLM, 3PLM, and 4PLM; item guessing and non-slip were compared across the 3PLM, 4PLM, and DINA.

1. For item difficulty, the 3PLM and 4PLM results were the most similar in terms of Pearson correlation, standard errors, absolute differences, and RMSE. Although the 3PLM and 4PLM estimates were the closest, the differences among the three models were negligible, so the three models do not differ substantially in item difficulty estimation.
2. For item discrimination, the 4PLM estimates were larger than those of the other two models. The difference between the 2PLM and 3PLM was about .5 greater than that between the 3PLM and 4PLM. The correlation analysis reflects this: the Pearson correlation coefficients for (2PLM, 3PLM), (3PLM, 4PLM), and (2PLM, 4PLM) were .865, .713, and .733, respectively. The 2PLM and 3PLM thus gave the most similar discrimination estimates, as they also did in the absolute-difference and RMSE comparisons.
3. For the item guessing parameter, the mean estimate increased in the order 3PLM, 4PLM, DINA. The 3PLM and 4PLM estimates were similar to each other, with a Pearson correlation coefficient of .834. Both, however, correlated negatively with DINA. The guessing estimates of the IRT models (3PLM and 4PLM) therefore differ greatly from those of DINA.
4. For the item non-slip parameter, the mean 4PLM estimate was about .1 greater than the DINA estimate. Although there is no absolute criterion for item non-slip, the 4PLM estimates ranged from .885 to 1, whereas the DINA estimates ranged from .743 to .994. The Pearson correlation coefficient between the 4PLM and DINA was .101, which indicates almost no correlation. Consequently, unidimensional IRT estimation and multidimensional CDM-DINA estimation differ distinctly even though they use similarly defined parameters.
5. For item information, the 4PLM provided more information than the 2PLM and 3PLM. The test information function showed the same tendency.

This study suggests that the 4PLM is useful for accurately estimating subjects' abilities in TIMSS data. Applying the model to other data sets would allow this suggestion to be reinforced by practical comparison between models. In this study, item non-slip did not play a dominant role, because the TIMSS data come from a large-scale cognitive mathematics test. Further research is therefore needed on affective tests, such as psychological tests.

The conclusions of the study are as follows.

1. The study provides a detailed mathematical exposition of the MMLE-EM method for the 4PLM. This paves the way for using the item non-slip parameter in domestic IRT research.
2. The upper asymptote was defined as the fourth item parameter, item non-slip. Because slip and non-slip are mutually exclusive, this makes the 4PLM easier to study in terms of the probability of a correct response.
3. The application of the 4PLM to actual TIMSS data suggests that the model is valid and useful for accurate and fair testing.
4. The comparison of the 3PLM, 4PLM, and DINA on actual data shows that IRT and CDM-DINA, which differ in dimensionality, yield different estimates.

Finally, the study suggests that considering the item non-slip parameter of the 4PLM is useful in high-stakes tests, CBT, CAT, and psychological tests in the affective domain, among others. Analyzing item quality with the 4PLM is expected to be of value in the test development process. In addition, interpreting the item non-slip parameter can add accuracy and reliability to the measurement of test items and of subjects' ability.
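The model comparisons summarized in this abstract rest on three agreement measures between two sets of item parameter estimates: the Pearson correlation, the mean absolute difference, and the RMSE. The sketch below shows one plain way to compute them; the function names are hypothetical and the two lists of estimates are made-up numbers, not values from the thesis.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of estimates."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def mean_abs_diff(x, y):
    """Mean absolute difference between paired estimates."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean squared difference between paired estimates."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x))


# Made-up non-slip estimates for five items under two models
model_one = [0.98, 0.95, 1.00, 0.91, 0.97]
model_two = [0.90, 0.88, 0.85, 0.93, 0.80]

print("r    =", pearson(model_one, model_two))
print("MAD  =", mean_abs_diff(model_one, model_two))
print("RMSE =", rmse(model_one, model_two))
```

The three measures answer different questions: the correlation captures whether the two models rank the items alike, while the mean absolute difference and RMSE capture how far apart the estimates are in size. Two models can therefore show small differences and still be nearly uncorrelated, as reported above for item non-slip.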