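DSpace at EWHA: A Comparative Analysis of Test Characteristics and Examinee Ability in the Foreign Language Section of the 2000-2002 College Scholastic Ability Test Using Equating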

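Title: A Comparative Analysis of Test Characteristics and Examinee Ability in the Foreign Language Section of the 2000-2002 College Scholastic Ability Test Using Equating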

Authors: 전경희

Issue Date: 2003

Department/Major: Graduate School, Department of Education

Publisher: Ewha Womans University Graduate School

Degree: Master

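Abstract: When tests measuring the same skill are built and administered in several forms, a central concern of educational measurement is ensuring that examinees who took one form are neither advantaged nor disadvantaged relative to those who took another. Where such relative comparison is needed, test equating, which converts the scores of the different forms to a common scale, makes the scores comparable. The purpose of this study is to use test equating to compare and analyze the test characteristics and examinee ability of the Foreign Language section of the College Scholastic Ability Test (CSAT) from the 2000 through 2002 academic years. To that end, the study first analyzed the test characteristics of the three tests, including difficulty and discrimination, to determine which test was harder and more discriminating. Second, it compared equated examinee abilities to gauge how far students' English ability had grown or declined. Finally, it analyzed how much test scores in the Foreign Language section changed over the three years.

To compare test characteristics and track changes in examinee ability, the study used the responses of examinees who took the odd-numbered form of the Foreign Language section in each of the three years from 2000 to 2002. The analysis covered 433,905 examinees for 2000, 425,257 for 2001, and 358,410 for 2002. The study used the test characteristic curve method of equating and proceeded through four sequential stages, from selecting anchor items to comparing the equated item and ability parameters. First, anchor items were selected as the preparatory step for equating. Second, item and ability parameters were estimated separately for each of the three tests with BILOG 3, and EQUATE 2.1 was then used to place the scales of the 2000 and 2001 tests on the scale of the 2002 test. Third, analysis of variance was run on the equated item and ability parameters to compare test characteristics and ability across the three tests. Finally, the PIE program was used to compute true scores by ability level and to compare test scores across the three years.

The 2002 test had the highest estimated discrimination of the three. The 2002 test was also the most difficult, followed by the 2000 test, and the 2001 test was the easiest. Mean guessing was highest for the 2000 test and lowest for the 2002 test. However, analysis of variance found no statistically significant difference among the three tests in the discrimination, difficulty, or guessing estimates. Comparing the equated ability estimates, examinees who took the 2001 test had the highest ability, followed by the 2000 examinees, and the 2002 examinees had the lowest English ability. Analysis of variance confirmed a statistically significant difference among the three groups' abilities at the 0.01 level. To gauge how much scores differ because of test difficulty, equated true scores on the three tests were compared by ability level: at every ability level the 2002 test yielded the lowest expected score and the 2001 test the highest. The size of the score variation across the three tests, however, differed between students at the extremes of the ability distribution and students in the middle. Students at either extreme were not very sensitive to how easy or hard the test was and received relatively consistent scores across the three tests.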
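The quantities named above can be written out as follows. This is a sketch under assumptions rather than the thesis's own notation: it assumes the three-parameter logistic (3PL) model, which the discrimination, difficulty, and guessing parameters suggest, with the conventional scaling constant D = 1.7, and it uses the symbols A and B for the slope and intercept of the scale transformation. The set V denotes the anchor items, and the superscript (02) marks estimates on the 2002 scale.

```latex
% 3PL item response function for item j (a_j: discrimination, b_j: difficulty, c_j: guessing)
\[
P_j(\theta) = c_j + \frac{1 - c_j}{1 + \exp\{-D\,a_j(\theta - b_j)\}}
\]

% Linear transformation from an earlier year's scale to the 2002 scale
\[
\theta^{*} = A\theta + B, \qquad a_j^{*} = \frac{a_j}{A}, \qquad b_j^{*} = A\,b_j + B, \qquad c_j^{*} = c_j
\]

% Test characteristic curve criterion (Stocking and Lord) minimised over A and B,
% summed over N ability points and the anchor set V
\[
F(A, B) = \frac{1}{N} \sum_{i=1}^{N}
\Bigl[ \sum_{j \in V} P_j\bigl(\theta_i;\, a_j^{(02)}, b_j^{(02)}, c_j^{(02)}\bigr)
     - \sum_{j \in V} P_j\bigl(\theta_i;\, a_j^{*}, b_j^{*}, c_j^{*}\bigr) \Bigr]^{2}
\]

% True score at ability theta on a test of n items
\[
\tau(\theta) = \sum_{j=1}^{n} P_j(\theta)
\]
```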
Although the CSAT was not designed from the test development stage with interchangeable scores in mind, a common scale across tests administered in different years, once secured through equating, is expected to provide information useful in a range of educational settings. Because this study covers only the Foreign Language section of the CSAT from 2000 through 2002, practical use will require extending the research to the other subject areas, along with continued work to confirm the validity of the results by comparing equating methods.

When a large-scale testing program develops more than one form of a test, the forms may be administered at different times and in different locations. In such situations, it is important that all of the forms measure the same skill, trait, or ability. All the forms should also be constructed according to the same content and statistical specifications. However, even when multiple forms are constructed carefully, the forms may differ to such a degree that their scores are not interchangeable without some type of equating. Equating procedures are intended to make scores comparable across alternate forms of a test. The primary purpose of this study is to compare the item characteristics and examinee ability in the 2000 to 2002 College Scholastic Ability Test (CSAT) Foreign Language (English) Section using equating. The CSAT English Section assesses four language skills (listening, speaking, reading, and writing) and consisted of 55 items in the 2000 test and 50 items in each of the 2001 and 2002 tests. This study is based on the response data of 433,905 examinees in the 2000 test, 425,257 examinees in the 2001 test, and 358,410 examinees in the 2002 test. The study proceeded in the following steps. First, to equate the three tests, sets of internal anchor items were chosen so that the content and statistical characteristics of the anchor items were representative of the total test; consequently, five items were taken from each test as anchor items. Second, BILOG 3 (Mislevy & Bock, 1990) was run separately on each data set to estimate item parameters and theta distributions.
Third, EQUATE 2.1 (Baker, 1995) was used to estimate the Stocking and Lord (1983) scale transformation coefficients from the anchor item parameter estimates and the estimated theta distributions obtained in the second step. Fourth, the PIE program (Hanson & Zeng, 1995) was used to obtain the 2000 test's and the 2001 test's true-score equivalents for the 2002 test from the estimated item parameters. Comparing the equated item parameters across the three tests, the 2002 test had higher discrimination than the other tests. The most difficult test was the 2002 test, while the 2001 test was the easiest of the three. The guessing parameter (the lower asymptote of the item response function) was also lowest in the 2002 test. However, ANOVA tests of the mean differences in the three parameters found no statistically significant difference among the three tests. A comparison of the equated ability parameters showed that the examinees taking the 2001 test had higher ability than the other groups, and that the 2000 examinees had higher ability than the 2002 examinees. These ability differences between groups were statistically significant. In addition to the comparison of equated parameters, the comparison of true-score equivalents showed that the 2002 test yielded the lowest expected score at every ability level, while the 2001 test yielded the highest. However, the range of score variation differed across the ability distribution. Students at high or low ability levels were not sensitive to the difficulty of the tests and received relatively constant scores across them, while middle-level students' scores varied across the three tests.
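The third and fourth steps can be illustrated with a minimal sketch. It is not the EQUATE 2.1 or PIE implementation: it assumes the 3PL model with D = 1.7, evaluates the Stocking and Lord criterion on an evenly spaced theta grid rather than on estimated theta distributions, and substitutes a simple compass search for the optimiser. All function names and the anchor item values are hypothetical and do not come from the thesis.

```python
import math

D = 1.7  # logistic scaling constant (assumed; conventional in 3PL work)

def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    z = D * a * (theta - b)
    # Numerically stable logistic function.
    logistic = 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))
    return c + (1 - c) * logistic

def tcc(theta, items):
    """Test characteristic curve: expected number-correct (true) score."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def rescale(items, A, B):
    """Move (a, b, c) items to the target scale via theta* = A*theta + B."""
    return [(a / A, A * b + B, c) for a, b, c in items]

def sl_loss(A, B, source_anchor, target_anchor, grid):
    """Stocking-Lord criterion: mean squared gap between the anchor TCCs."""
    if A <= 0:
        return math.inf
    moved = rescale(source_anchor, A, B)
    return sum((tcc(t, target_anchor) - tcc(t, moved)) ** 2 for t in grid) / len(grid)

def fit_stocking_lord(source_anchor, target_anchor, grid, tol=1e-7):
    """Minimise the criterion with a compass search (a stand-in optimiser)."""
    A, B, step = 1.0, 0.0, 0.5
    best = sl_loss(A, B, source_anchor, target_anchor, grid)
    while step > tol:
        improved = False
        for dA, dB in ((step, 0), (-step, 0), (0, step), (0, -step)):
            trial = sl_loss(A + dA, B + dB, source_anchor, target_anchor, grid)
            if trial < best:
                A, B, best, improved = A + dA, B + dB, trial, True
        if not improved:
            step /= 2  # no neighbour is better: refine the search
    return A, B

def true_score_equivalent(score, form_x, form_y, lo=-8.0, hi=8.0):
    """Map a Form X true score to Form Y by inverting Form X's TCC (bisection).

    Only meaningful for scores between sum(c) and the number of items on Form X.
    """
    for _ in range(60):
        mid = (lo + hi) / 2
        if tcc(mid, form_x) < score:
            lo = mid
        else:
            hi = mid
    return tcc((lo + hi) / 2, form_y)

# Hypothetical anchor items as (a, b, c) on the source scale; not from the thesis.
source_anchor = [(0.8, -1.0, 0.2), (1.0, -0.5, 0.2), (1.2, 0.0, 0.2),
                 (1.4, 0.5, 0.2), (1.0, 1.0, 0.2)]
# Pretend the same items were calibrated on a target scale with A = 1.1, B = 0.4.
target_anchor = rescale(source_anchor, 1.1, 0.4)
grid = [i / 5 for i in range(-20, 21)]  # theta from -4 to 4
A, B = fit_stocking_lord(source_anchor, target_anchor, grid)
print(round(A, 3), round(B, 3))
```

Because the target anchor parameters here are generated from a known transformation, the fitted coefficients should come out close to the values used to generate them. With real calibrations the two sets of anchor estimates contain estimation error, so the criterion does not reach zero and the theta points would normally come from the estimated ability distribution.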
In conclusion, although the CSAT was not developed with interchangeable scores in mind, equating made it possible to compare test characteristics and examinee ability between tests taken at different times. This study focused on providing information that is as accurate as possible about the recently administered tests. Further investigations that compare the results of other IRT models and other equating designs would be useful. Lastly, additional studies should extend to the other sections of the CSAT, such as verbal, mathematics, and natural science inquiry.