DSpace at EWHA: A Comparison of Examinee Ability Parameters and Person-Fit by Test Medium and Test Administration Model

Title: A Comparison of Examinee Ability Parameters and Person-Fit by Test Medium and Test Administration Model

Authors: 시기자

Issue Date: 2003

Department/Major: Graduate School, Department of Education

Publisher: Graduate School of Ewha Womans University

Degree: Doctor

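Abstract: With recent advances in computer engineering, the spread of computers, and the diffusion of the Internet, new forms of computer-delivered tests are changing and developing rapidly in educational and psychological measurement. Computer-based tests (CBT) can be classified by test administration model into the computerized fixed test (CFT), which delivers the same items in the same order to every examinee, and the computerized adaptive test (CAT), which assembles and delivers a different test according to each individual's ability level. Computer-delivered testing is currently moving from the CFT stage toward CAT. However, because CBT is still an unfamiliar format for many students, introducing it requires verifying that examinees perform equivalently on paper-and-pencil tests (PPT) and CBT.

The purpose of this study is to explore the interchangeability of PPT and CBT and the validity of CBT by examining how test medium and test administration model affect examinees' test performance, based on a comparison of the examinee ability parameters and person-fit estimated from PPT and CBT. To this end, a mathematics test was administered to 325 second-year middle school students in Seoul, and data from 302 students were analyzed in the end. To examine how the restrictions of CAT, which does not allow item preview, omission, review, or answer changes, affect examinees' performance, the CFT was split into a version that allows them (CFT_1) and a version that does not (CFT_2). To isolate the effects of test medium and administration model, all three test formats (PPT, CFT, CAT) were administered to the same examinees. A counterbalanced design was applied to rule out order effects, and PPT and CFT were constructed as parallel forms (16 items each) to minimize memory effects from repeated administration. An item bank of 184 items was developed for the CAT. The CAT algorithm presented an item of medium difficulty first, estimated ability by maximum likelihood, and used maximum-information item selection, which presents next the item that provides the most information at the current ability estimate. The test ended when the 25-minute time limit expired or when the examinee had answered the maximum of 16 items. To compare changes across CAT test length, the maximum number of CAT items was set to 16, the same as PPT and CFT, and analyses started from the eighth item, which corresponds to half the length of PPT or CFT. Examinee ability parameters were estimated with the Rasch model. The residual-based fit indices INFIT and OUTFIT were used to assess the appropriateness of examinee responses, and for CAT a fit index based on the cumulative sum (CUSUM) procedure was applied as well.

The main findings are as follows. First, a comparison of marginal reliability, the counterpart of internal-consistency reliability, showed that the reliabilities of PPT and the unrestricted CFT_1 were similar, whereas the reliability of the restricted CFT_2 was lower. CAT showed higher reliability even at half the length of PPT or CFT. In the analysis of test information, which indicates the precision of ability estimation, PPT and CFT_1 yielded similar amounts of test information at most ability levels. CFT_2, by contrast, provided more test information than PPT or CFT_1 at middle ability levels but less at low and high ability levels.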
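CAT provided more information than PPT or CFT even at half their test length, showing that examinee ability is estimated more precisely in CAT.

Second, construct validity was examined by confirmatory factor analysis to determine whether PPT and CFT measure the same trait. In the unrestricted CFT_1 condition, domains involving figures or graphs were measured more reliably by CFT than by PPT, while domains that require calculation, such as "numbers and operations" and "letters and expressions", were measured more reliably by PPT than by CFT. In the restricted CFT_2 condition, CFT showed lower reliability than PPT at the beginning and end of the test.

Third, ability parameters were compared across test medium and administration model. In the PPT versus CFT comparison, ability estimates from CFT were lower than those from PPT in both the group that took CFT_1 and the group that took CFT_2, and they were lower still in the restricted CFT_2. In the CFT versus CAT comparison, ability estimates from CAT were lower than those from CFT in the CFT_1 group and higher in the CFT_2 group, while the CAT ability estimates of the two groups were similar. In the PPT versus CAT comparison, ability estimates from CAT were significantly lower from the tenth item onward.

Fourth, person-fit was compared across test medium and administration model. The distributions of the Rasch-based fit indices INFIT and OUTFIT did not differ significantly between PPT and CFT, but the correlations between the fit indices estimated under the two formats and the agreement in which examinees were classified as misfitting were very low. This indicates that the causes of misfit differ between PPT and CFT. Between CFT and CAT, and between PPT and CAT, the distributions of INFIT and OUTFIT differed highly significantly, with much lower fit index values in CAT. In addition, the correlations between fit indices and the agreement in examinees classified as misfitting were close to zero for both CFT versus CAT and PPT versus CAT. Thus the proportion of misfitting examinees is smaller in CAT than in the other formats, and INFIT and OUTFIT are not efficient at detecting misfitting examinees in CAT. When the CUSUM-based fit index was applied to CAT, more examinees were classified as misfitting than with INFIT and OUTFIT, but the proportion was still smaller than the proportion classified as misfitting in PPT or CFT. These results show that examinee ability can be estimated more precisely in CAT.

Fifth, the item response patterns of examinees classified as misfitting in each format, and their causes, were analyzed. In PPT, patterns arising from a variety of causes such as test anxiety, carelessness, cheating, and guessing were found, whereas in CFT many examinees showed unexpected incorrect answers early in the test. This tendency was especially pronounced in the group that took the restricted CFT_2, which suggests that the test medium and the administration restrictions contribute to inappropriate responses early in the test. In CAT, a large proportion of examinees gave a response to the first item that was inconsistent with their ability, and this was also found to be related to the effect of the test algorithm.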
Sixth, multiple regression analysis was used to explore the factors that affect examinees' performance in each format. Learning motivation and response time affected performance in all tests; computer-related characteristics had an effect in CFT and CAT; and in CAT, characteristics related to the test algorithm had an additional effect.

Taken together, the analyses of ability estimates, person-fit, misfitting response patterns and their causes, and factors affecting performance make it difficult to conclude that examinees perform equivalently on PPT and CBT. Factors stemming from both the test medium and the administration model affected performance, and the CAT restrictions on preview, omission, review, and answer changes had the larger effect. These results show that CBT is still an unfamiliar test format for many students. Therefore, until sufficient evidence of equivalence is available, administering PPT and CBT side by side and using their scores interchangeably is not desirable from the standpoint of test fairness. Nevertheless, the analyses of marginal reliability, test information, and person-fit confirmed that CAT estimates examinee ability more precisely with fewer items than the other formats, and they suggest that an algorithm that reduces estimation error early in the test would allow even more precise and efficient measurement. Incorporating person-fit analysis into CAT also makes it possible to evaluate not only individual examinee characteristics but also the item bank and the test algorithm, providing basic information for building better CATs. CAT is the most precise and efficient method for measuring examinee ability. Used in schools, CAT could provide every student with a fair test that meets at least a given level of precision, and the information obtained through continual assessment could serve as useful material for teaching and learning.

Recent improvements in computer technology and psychometrics have encouraged the delivery of educational and psychological tests through computers. Computer-based tests (CBTs) are classified into the computerized fixed test (CFT) and the computerized adaptive test (CAT) according to the testing model. CFT is the testing model that provides the most direct analogue to the paper-and-pencil test (PPT): it administers a fixed-length, fixed-form computerized test. CAT selects items individually for each examinee based on the examinee's responses to previous items to obtain a precise and accurate estimate of that examinee's latent ability. The advantages of CBT include immediate feedback of results, increased examinee interest, and reduced costs of test construction, administration, and scoring. Moreover, CAT can produce an equally reliable score with about half the items of a fixed-form, non-adaptive test.
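To make the adaptive model concrete, the following is a minimal sketch of a CAT loop along the lines this abstract describes: Rasch scoring, maximum likelihood ability estimation, maximum-information item selection, and a 16-item cap. It is not the thesis's implementation. The function names, the clamping of the estimate to [-4, 4], the choice of the first item as the one closest to the bank's mean difficulty, and the evenly spaced 184-item bank are all assumptions made for illustration, and the 25-minute time limit is left out.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta: P * (1 - P)."""
    p = rasch_p(theta, b)
    return p * (1.0 - p)


def mle_theta(responses, difficulties, lo=-4.0, hi=4.0, iters=50):
    """Newton-Raphson maximum likelihood estimate of theta, kept in [lo, hi].

    All-correct and all-incorrect patterns have no finite MLE, so the
    estimate is set to the matching bound (an assumption of this sketch).
    """
    score = sum(responses)
    if score == 0:
        return lo
    if score == len(responses):
        return hi
    theta = 0.0
    for _ in range(iters):
        ps = [rasch_p(theta, b) for b in difficulties]
        gradient = score - sum(ps)  # first derivative of the log-likelihood
        information = sum(p * (1.0 - p) for p in ps)
        step = gradient / information
        theta = max(lo, min(hi, theta + step))
        if abs(step) < 1e-6:
            break
    return theta


def run_cat(bank, answer, max_items=16):
    """Administer up to max_items items from bank (a list of difficulties).

    answer(b) is a hypothetical callback returning 1 (correct) or 0.
    Returns (administered item indices, responses, final theta estimate).
    """
    remaining = set(range(len(bank)))
    mean_b = sum(bank) / len(bank)
    # First item: the one closest to the mean difficulty of the bank.
    item = min(remaining, key=lambda i: (abs(bank[i] - mean_b), i))
    used, responses = [], []
    while True:
        remaining.discard(item)
        used.append(item)
        responses.append(answer(bank[item]))
        theta = mle_theta(responses, [bank[i] for i in used])
        if len(used) >= max_items or not remaining:
            return used, responses, theta
        # Next item: maximum information at the current ability estimate.
        item = max(remaining,
                   key=lambda i: (item_information(theta, bank[i]), -i))


# Hypothetical bank: 184 difficulties evenly spaced on [-3, 3].
bank = [-3.0 + 6.0 * i / 183 for i in range(184)]
# Deterministic stand-in examinee who solves every item easier than 0.5.
used, responses, theta = run_cat(bank, lambda b: 1 if b < 0.5 else 0)
print(len(used), round(theta, 2))
```

Because a Rasch item is most informative when its difficulty equals the current ability estimate, the selection step keeps choosing items near that estimate. The first response alone pushes the estimate to a bound in this sketch, which illustrates why an unexpected answer to the first item can have a large influence early in the test.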
It is, however, important to examine whether students perform as well on CBT as they would on a traditional PPT, and whether the shift from PPT to CBT produces consistent patterns of performance, because CBT is not a familiar format to many students. Two factors that may affect the comparability of PPT and CBT are the mode of delivery (computer or paper) and the testing model (adaptive or traditional/linear). Prior experience with computerized tests, computer familiarity, computer confidence, and the computer interface may influence examinees' test performance on CBT. Furthermore, aspects unique to CAT and its adaptive model may influence examinees' test performance. Because most adaptive tests do not allow items to be previewed, skipped, reviewed, or changed, examinees may be overly anxious about giving a response. Factors related to content imbalance, context effects, and pre-knowledge of the test may also cause aberrant test performance in CAT. Research on methods that provide information about the fit of an individual item score pattern to a test model is usually referred to as person-fit measurement. It is very important to investigate the fit of an item score pattern and the mechanisms underlying aberrant response behavior, because the cause of misfit may differ across testing formats (i.e., PPT, CFT, CAT). The purpose of this study is to examine the comparability of PPT and CBT by investigating the effect of test delivery mode and testing model on test performance. This study compared the ability estimates and person-fit statistics among PPT, CFT, and CAT. In particular, it compared them under two CFT conditions to investigate how examinees respond to test items when items may be previewed, skipped, reviewed, and changed (CFT_1) and when they may not (CFT_2).
This study also investigated aberrant response patterns and the causes of misfit in each testing format, as well as factors associated with test performance in each format, such as test anxiety, test-taking skills, computer confidence, study motivation, and response time. The participants were second-year middle school students in Seoul, and the test subject was mathematics. PPT and CFT were constructed as parallel forms to avoid carryover effects, and a single-group design with counterbalancing was used to avoid order effects. In CAT, the first item was selected at random from items of mean difficulty, and maximum-information item selection and maximum likelihood estimation were used. The CAT terminated when an examinee had responded to all 16 items or when the 25-minute time limit had elapsed since the start of the test.

The results of this study are summarized as follows. First, in the analysis of marginal reliability, the marginal reliabilities of PPT and CFT_1 were similar, while that of CFT_2 was lower. The reliability of CAT was higher than that of CFT at half the length of CFT. In the comparison of test information, PPT and CFT_1 showed similar test information at most ability levels. In CFT_2, however, test information was higher than in PPT or CFT_1 at middle ability levels and lower at low and high ability levels. At half the length of PPT or CFT, CAT showed higher test information.

Second, construct validity was examined by confirmatory factor analysis. In CFT_1, CFT was more reliable than PPT in domains that involved figures and graphs, and PPT was more reliable than CFT in domains that required calculation, such as "number and operations" or "letters and equations". In CFT_2, on the other hand, CFT showed lower reliability than PPT at the beginning and end of the test.
Third, the ability estimates were compared across test delivery mode and testing model. In the comparison of PPT and CFT, the ability estimates from CFT were lower than those from PPT in both groups, and the difference was larger in CFT_2. The ability estimates from CAT, however, were at a similar level in both groups. The ability estimates from PPT and CAT also differed, and the difference was significant after the tenth item.

Fourth, in the comparison of person-fit statistics across test delivery mode and testing model, the distributions of INFIT and OUTFIT in PPT and CFT did not differ significantly. However, the correlations between the fit indices, and the agreement in which examinees were classified as misfitting under each testing format, were low. This means that the causes of misfit in PPT and CFT are different. On the other hand, the distributions of INFIT and OUTFIT differed significantly between CFT and CAT and between PPT and CAT, and the fit indices in CAT were much lower. Moreover, both the correlations between fit indices and the agreement in examinees classified as misfitting were near zero for CFT versus CAT and for PPT versus CAT. It could therefore be verified that INFIT and OUTFIT were not efficient in CAT, and that the proportion of examinees responding in line with their ability level was much higher in CAT.

Fifth, the analysis of aberrant response patterns detected in each testing format showed a variety of problems that may contribute to an inaccurate estimate of an individual's standing on the construct assessed. Misfitting response patterns caused by various factors, including carelessness, test anxiety, guessing, and cheating, were found in PPT, and the proportion of incorrect answers at the beginning of the test was high in CFT.
In CAT, the proportion of examinees whose response to the first item was inconsistent with their true ability was high, and the test algorithm influenced their misfitting responses.

Sixth, multiple regression analysis was used to explore the factors that influenced test performance in each testing format. Learning motivation and response time influenced test performance in all testing formats, and computer-related characteristics influenced test performance in CFT and CAT. In CAT in particular, the test algorithm also influenced test performance.

This study suggests a lack of comparability between CAT and PPT. The lack of comparability was reflected in differences in test delivery mode, the restricted administration conditions of CAT, and the test algorithm. Among these, the administration conditions, which forbid previewing, omitting, reviewing, and changing answers, were the most important factor. Disadvantages may therefore arise for some examinees when CAT and PPT are administered at the same time. Judged by the analyses of marginal reliability, test information, and person-fit, however, CAT is the most appropriate and efficient testing format. Therefore, if CAT is applied in areas such as high-stakes tests, national examinations, and other assessments aimed at school-age children, it could provide every student with a fair test of the same precision. CAT could also provide useful information for instruction, because continual testing makes it possible to collect and manage a variety of information about individual students. Moreover, if person-fit measurement were applied in CAT, the poor cognitive structures and problem-solving processes of individual students could be corrected.
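The person-fit indices named in this abstract can be illustrated with a short sketch. It assumes the usual Rasch mean-square definitions of INFIT and OUTFIT and one common formulation of the CUSUM statistic for adaptive tests, in which each residual is scaled by the test length; the abstract does not state which variants or cut-off values the thesis used, so no flagging threshold is applied here, and the function names are hypothetical.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def infit_outfit(responses, difficulties, theta):
    """Return (INFIT, OUTFIT) mean squares for one examinee."""
    sq_resid = 0.0  # sum of squared raw residuals (x - P)^2
    var_sum = 0.0   # sum of model variances P * (1 - P)
    z2_sum = 0.0    # sum of squared standardized residuals
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        w = p * (1.0 - p)
        r2 = (x - p) ** 2
        sq_resid += r2
        var_sum += w
        z2_sum += r2 / w
    infit = sq_resid / var_sum        # information-weighted mean square
    outfit = z2_sum / len(responses)  # unweighted mean square
    return infit, outfit


def cusum_paths(responses, difficulties, theta):
    """Upper and lower CUSUM paths, following the order of administration."""
    n = len(responses)
    c_plus = c_minus = 0.0
    upper, lower = [], []
    for x, b in zip(responses, difficulties):
        t = (x - rasch_p(theta, b)) / n   # scaled residual for this item
        c_plus = max(0.0, c_plus + t)     # accumulates unexpected successes
        c_minus = min(0.0, c_minus + t)   # accumulates unexpected failures
        upper.append(c_plus)
        lower.append(c_minus)
    return upper, lower


# Hypothetical five-item test, examinee at theta = 0.
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(infit_outfit([1, 1, 1, 0, 0], difficulties, 0.0))  # consistent pattern
print(infit_outfit([0, 0, 1, 1, 1], difficulties, 0.0))  # reversed pattern
```

A response pattern that follows item difficulty yields mean squares below 1, while a reversed pattern yields values well above 1. Unlike these two summary indices, the CUSUM paths depend on the order in which items were administered, so a run of unexpected responses at one point in the test shows up as a drift in one of the paths.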