DSpace at EWHA: 하부검사(Testlet)로 구성된 언어영역 검사의 측정학적 특성 탐색



Title: 하부검사(Testlet)로 구성된 언어영역 검사의 측정학적 특성 탐색

Other Titles: An Investigation of the Psychometric Properties in the Testlets of the Language Test

Authors: 이서영

Issue Date: 2005

Department/Major: Graduate School, Department of Education

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 성태제

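Abstract: As educational practice changes, student assessment is shifting from simple knowledge toward performance. Accordingly, the use of testlets, which can be scored objectively while still assessing student performance, is increasing. Along with the appropriate use of testlets, the question arises of whether assessment results obtained from testlets are measured accurately. Student assessment is generally based on responses to individual items. A testlet, however, is intended to capture student performance, so the response to one item in a testlet influences the responses to other items in the same testlet. The response to one item can therefore hardly be regarded as confined to that item, and this must be taken into account in assessment. Doing so is consistent with the purpose of assessing student performance through testlets, and it matters because measuring student ability accurately is important in educational settings. This study examined how strongly the responses to items within a testlet are related in the language domain, where testlets are used most actively in the Korean educational context, in order to confirm the local dependence of testlets and to explore its causes and effects. It also examined parameter estimation methods suited to language-domain testlets. To preserve local independence, a basic assumption of measurement, the testlet measurement model followed the common approach of summing the item scores within a testlet and treating the sum as a single polytomous item. The results are as follows. First, to examine local dependence in the language-domain testlets, the Q₃ statistic was computed for each item pair and averaged within each testlet. Local dependence was present in the language-domain testlets, but its degree was not large. The distribution of Q₃ for item pairs within testlets showed a few item pairs with fairly large dependence, but none reached the criterion for identifying dependence proposed by Chen & Thissen (1997). Second, to identify factors affecting local dependence, testlets that differed in local dependence were compared by item type and by behavioral-domain objective. For item type, using complex item types did not produce greater dependence, but testlets using simple item types retained independence. For behavioral-domain objective, local dependence among items appeared in testlets composed of items that require integrative thinking premised on comprehension of the passage. Third, because local dependence was found in the language-domain testlets, the testlet was made the basic unit of analysis to conform to measurement theory, which assumes local independence of the analysis unit, and the nominal response model and the generalized partial credit model were applied to find a suitable analysis model. The nominal response model was applied first to check whether the response categories were ordered; to examine change across categories, the item category discrimination and intercept parameters were reparameterized as centered polynomials. The ordering of the response categories was confirmed on the basis of increasing item category discrimination and increasing step difficulty. However, the change across categories was not uniform for either parameter, so the polynomial degree of the item category discrimination or intercept parameters could not be restricted, which indicates that the nominal response model without parameter constraints is suited to the language-domain testlets. Fourth, after the ordering of the response categories was confirmed, the generalized partial credit model, which assumes ordered response categories, was applied, and the effect of testlet local dependence on parameter estimates was explored empirically using the results of the two models. Local dependence had almost no effect on item category discrimination but did affect the intercept parameters and step difficulties.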
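Fifth, to examine whether the examinee ability estimates from the nominal response model and the generalized partial credit model differ, a correlation analysis was conducted; the resulting correlation coefficient was .918. Because this value comes from applying different models to the same data, it cannot be concluded that the correlation is high, so a scatter plot of the ability estimates was examined. The plot showed that the estimates from the generalized partial credit model were spread over a wider range than those from the nominal response model, and that the difference between the two models' estimates was large at low ability levels. Sixth, to examine the suitability of the nominal response model and the generalized partial credit model, their test information functions and model fit were compared. The ability level at which test information was maximal was similar in the two models, and the amount of information varied with ability level. However, the difference in information was not large, so the test information functions of the two models can be said to differ hardly at all. In the model fit comparison, the marginal reliability estimates barely differed; the statistic showed a significant difference, but because the sample size is large, the difference in the statistic is not judged to be decisive for model fit. The model fit results are therefore also similar for the two models. These results show that language-domain testlets have some degree of local dependence and that it is influenced by the purpose of the test and by item construction. They also show that the local dependence of testlets affects parameter estimates. This indicates that when students are assessed with testlets, measurement should take into account that responses to the items within a testlet are related. If the chosen approach is to sum the items of a testlet and apply a polytomous item response model, the categories of the summed score are ordered even when the items of the language-domain testlets were not constructed in steps, so applying the generalized partial credit model is more appropriate. This study showed that when testlets are used to assess students' performance in educational settings, the results should be measured in light of the characteristics of the test. As the use of testlets increases, if testlets are to be used to assess students' overall performance, they should not merely take the form of a testlet but should be constructed carefully, with the relations among items considered so that overall performance can be seen in the items, and the evaluation method should be designed to match the purpose for which the testlet was built. To this end, the study proposes research that reconstructs tests in light of the degree of testlet local dependence, research that uses testlet response models to make use of all the information in students' responses, and simulation studies to determine precisely the effect of testlet local dependence.

As educational practice changes, educational evaluation is moving away from assessing memorized knowledge and toward assessing students' performance. This has led to the use of testlets, which, unlike performance assessments, can measure students' performance with objective scoring. The evaluation of students is generally based on performance on individual items. However, because a testlet consists of a group of items related to a single content area, students' performance on one item influences their performance on other items in the same testlet.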
Performance on one item therefore cannot be regarded as confined to that item. To measure students' performance accurately, this testlet effect must be taken into account. Moreover, given the purpose of the evaluation, basing it on testlets is more appropriate than basing it on individual items. The purposes of this study are to measure the degree of local item dependence (LID) within the testlets of a language test, to investigate the causes and effects of LID empirically, and to identify an appropriate measurement model. Testlets were measured with testlet scoring, in which each testlet is scored as if it were a single polytomous item. The results are as follows. First, to measure LID, the Q₃ statistic was computed for all pairs of items and averaged within testlets. The results indicated that LID was present within testlets but that its degree was not large. Second, to investigate the causes of LID empirically, items within testlets were classified by item type and by performance objective, and testlets with low LID were compared with testlets with high LID. Simple item types were associated with local item independence (LII), and general performance objectives were associated with LID. Third, because of LID, each testlet was treated as a single polytomous item. The nominal response model (NRM) was used first to estimate item parameters, because it could not be assumed that the categories were monotonically ordered. The ordering of the categories was then verified by the ordered slope and step difficulty parameters. In addition, the unconstrained NRM was more suitable than the constrained NRM because the degree of parameter change across categories was not equal. Fourth, once the ordering of the categories was verified, the generalized partial credit model (GPCM) was used to estimate item parameters for ordered responses. Based on these results, the effects of LID were investigated: intercept parameters and step difficulties were affected by LID, but discrimination parameters were not. Fifth, the correlation between the ability parameter estimates from the NRM and the GPCM was .918.
This value cannot be considered high, because both sets of estimates were obtained from the same data. The plot of the NRM and GPCM ability estimates showed that the differences occurred at low ability levels. Sixth, test information functions and model fit were compared between the NRM and the GPCM to determine which was better suited to estimating item parameters for the testlets of the language test. The ability values at which test information was maximized were similar. The amount of test information varied across ability levels but differed little between the models. Model fit was examined using marginal reliability and the change in the statistic. Marginal reliability differed little. The change in the statistic was 157.4 with 38 degrees of freedom and was significant. Although significant, this value was not large given the sample size (5,000), so there was not much difference between the models. These results show that LID exists in the testlets of the language test but is not large, and that LID affects item parameter estimates and is itself affected by performance objective and item type. This implies that LID must be considered if students' performance is to be measured accurately. If testlet scoring is used to manage LID, the GPCM is more appropriate than the NRM for estimating item parameters, because testlet scores have ordered categories and the GPCM is designed for ordered responses. In conclusion, when students' performance is measured with a test, the properties of the test must be considered, in terms of both the measurement method and the objective of the test, for the measurement to be sound. In particular, if tests consisting of testlets are used to measure students' performance, the items within each testlet must be constructed to measure that performance and coordinated so that the performance as a whole can be evaluated. Moreover, measurement methods that correspond to the objective of the test should be formulated.
For future research, this study suggests studies that apply testlet scoring while taking the degree of LID into account, studies that apply testlet response models so that item parameters are estimated using all the response information from testlets, and simulation studies to establish the effect of LID more precisely.