DSpace at EWHA: A Comparison of Scoring Methods and Inter-Rater Reliability Estimation Methods for Middle School English Performance Assessment Tasks

Title: A Comparison of Scoring Methods and Inter-Rater Reliability Estimation Methods for Middle School English Performance Assessment Tasks

Authors: 전현정

Issue Date: 2000

Department/Major: Graduate School, Department of Education

Publisher: The Graduate School, Ewha Womans University

Degree: Master

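Abstract: Since the late 1980s, criticism of standardized multiple-choice tests has been accompanied by a movement toward learner-centered education that emphasizes knowledge related to real life. With this change in educational outlook, performance assessment, which assesses students' performance directly and whose results can be used to improve individual students and actual teaching and learning, has come to be seen as important, and considerable effort has gone into applying and establishing it in schools. However, many problems have been pointed out in applying performance assessment in schools. The main problem concerns its reliability: because the rater's subjective judgment enters into performance assessment, results can vary with the fairness of the scoring. Applying performance assessment in practice therefore requires a discussion of the factors that affect intra-rater and inter-rater reliability, which are cited as the biggest problem.

The purpose of this study is to compare and analyze the inter-rater reliability obtained under each scoring method and each inter-rater reliability estimation method, in order to identify a scoring method suited to middle school English and to propose how inter-rater reliability estimation methods can be used according to how a performance assessment is administered and how its results are used. To this end, the study first reviewed the characteristics, strengths, and weaknesses of two scoring methods (holistic and analytic) and five methods of estimating inter-rater reliability (correlation coefficients, agreement statistics, the Kappa coefficient, generalizability theory, and the many-facet Rasch model), together with previous research on inter-rater reliability. It then analyzed scoring data from second-year middle school English performance assessment tasks under each scoring method and each estimation method, and compared the resulting inter-rater coefficients.

The results and implications of the study are as follows. First, for the middle school English performance assessment tasks, inter-rater reliability was estimated to be higher under the holistic method than under the analytic method. The same result held for every inter-rater reliability estimation method, with the exception of the generalizability coefficient for Lesson 11. This result can be attributed to the finer score divisions of the analytic method lowering agreement among raters; it also indicates that the scoring rubric written for holistic scoring clarified the scoring criteria and gave raters concrete scoring guidelines. Under the analytic method, inter-rater reliability differed by assessment domain: it was estimated to be highest for task completion and lowest for appropriateness of language use. This is probably because the scoring criteria for appropriateness of language use were worded vaguely, which made agreement among raters difficult. In addition, because holistic scoring was carried out after analytic scoring in this study, the order of scoring probably affected inter-rater reliability.

Second, methods of estimating inter-rater reliability have been evolving from methods that check the degree of agreement among raters in order to reconcile their judgments toward methods that minimize error caused by rater subjectivity and estimate examinees' scores more accurately. Depending on how a performance assessment is administered and how its results are used, an estimation method can be chosen selectively so as to exploit the strengths of each. That is, estimation by correlation coefficients, agreement statistics, and the Kappa coefficient gives information that is more useful to raters than to examinees, so these methods are best used in rater training or during actual assessment to clarify scoring criteria and reconcile raters' judgments. Estimation based on generalizability theory provides useful information when it is used to analyze the influence of error facets, including raters, in the complex situations in which performance assessment takes place, and to make decisions aimed at raising reliability.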
Estimation based on the many-facet Rasch model estimates examinee ability more fairly, using adjusted scores from which rater effects have been removed, and is therefore useful in high-stakes assessment. In conclusion, raising inter-rater reliability requires using scoring methods and inter-rater reliability estimation methods appropriately for the way the performance assessment is run and for the subject. Inter-rater reliability can also be raised when rater training is strengthened and scoring rubrics are specific. As a follow-up to this study, research on inter-rater reliability should continue to explore differences by school level, grade, subject, and task characteristics, as well as the generalizability of the findings.

Since the late 1980s, criticism of traditional standardized tests has been accompanied by a move toward learner-centered, process-oriented education related to real life. Students' performance and development in the teaching and learning process have become important, and performance assessment has been implemented in classrooms. This effort aims to assess student improvement that multiple-choice standardized tests cannot capture. However, concerns about performance assessment have been raised. In contrast to traditional standardized tests, which are scored objectively, performance assessments require raters' judgment of students' performance to produce a score. This introduces the possibility of subjectivity and of disagreement among raters. Inter-rater reliability is therefore essential to ensuring consistency and fairness in scoring. The purpose of this study is to identify, among a variety of methods, the most appropriate method of estimating inter-rater reliability for obtaining high inter-rater reliability when both holistic and analytic scoring methods are used. To this end, the study compares the inter-rater reliability estimates obtained with the holistic scoring method against those obtained with the analytic scoring method for the performance assessment that was developed, and then compares the results across the methods used to estimate inter-rater reliability. The methods used here are the correlation coefficient, agreement statistics, and Cohen's kappa, which apply classical test theory, together with generalizability theory and many-facet Rasch measurement.
For all of these analyses, data from an English performance assessment for second-year middle school students are analyzed. The main results and implications of this study are as follows. First, the estimated inter-rater reliability is higher with the holistic scoring method than with the analytic scoring method. Moreover, across the methods used to estimate inter-rater reliability, the reliabilities under the holistic scoring method are consistently higher than those under the analytic scoring method, except for the generalizability coefficient of the writing task of the performance assessment. Second, the methods used to estimate inter-rater reliability have been improved to obtain high inter-rater agreement by mediating among raters so that scoring is fairer and by controlling the effects of the rater facet on raw scores. Raters themselves should nonetheless select the most appropriate of these methods in light of the goals and features of the performance assessment. The correlation coefficient, agreement statistics, and Cohen's kappa, which apply classical test theory, can mediate between raters and clarify the scoring criteria shared among them, so they are best used in both rater training and actual assessment situations. Generalizability theory is useful for detecting various error components, including the rater. It estimates variance components and compares the relative influence of each facet, so it can clarify complex assessment situations. Many-facet Rasch measurement can enhance the objectivity of performance assessment by providing examinee logit scores after adjusting for the effects of the rater facet on raw scores, so it is well suited to high-stakes assessment, such as large-scale assessment.
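The three classical indices named in the abstract (correlation coefficient, agreement statistic, and Cohen's kappa) can be illustrated for two raters with a minimal Python sketch. The scores, function names, and the choice of exact agreement as the agreement statistic are assumptions made for illustration; they are not taken from the thesis, which may have computed these indices differently.

```python
from collections import Counter
from math import sqrt

# Hypothetical scores (1-4 scale) given by two raters to the same 8 students.
rater_a = [3, 2, 4, 4, 1, 3, 2, 4]
rater_b = [3, 2, 3, 4, 1, 2, 2, 4]


def pearson_r(x, y):
    """Pearson correlation between two raters' scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def exact_agreement(x, y):
    """Proportion of students who received identical scores from both raters."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def cohens_kappa(x, y):
    """Agreement corrected for the agreement expected by chance."""
    n = len(x)
    p_observed = exact_agreement(x, y)
    counts_x, counts_y = Counter(x), Counter(y)
    # Chance agreement: product of the two raters' marginal proportions,
    # summed over every score category either rater used.
    p_expected = sum(counts_x[c] * counts_y[c] for c in set(x) | set(y)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


print("correlation:", round(pearson_r(rater_a, rater_b), 3))
print("exact agreement:", exact_agreement(rater_a, rater_b))
print("Cohen's kappa:", round(cohens_kappa(rater_a, rater_b), 3))
```

In this made-up data the correlation is higher than kappa: correlation rewards consistent rank ordering, while kappa counts only identical scores and discounts agreement expected by chance.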
In conclusion, to maintain high inter-rater reliability in scoring, the two scoring methods and the various methods of estimating inter-rater reliability should be applied appropriately, depending on the purposes of the performance assessment and the subject area. In addition, if specific scoring criteria, a guide describing how to apply them, and systematic rater training are provided, scoring results can be reliable and consistent regardless of rater and scoring occasion. Further empirical research on inter-rater reliability is needed to investigate how the effects of scoring methods, combined with inter-rater reliability estimation methods, differ by school level, age, subject area, and performance assessment task.
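As a rough illustration of the variance-component estimation that generalizability theory relies on, the sketch below works through the simplest case: a fully crossed persons-by-raters design with one score per cell. The design, the data, and the function names are assumptions for illustration only; the thesis's actual design (for example, any task or domain facets) is not described in this record.

```python
# Hypothetical scores: rows are students (persons), columns are raters.
scores = [
    [4, 4, 3],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 1],
    [3, 3, 3],
]


def g_study(data):
    """Estimate variance components for a crossed persons x raters design."""
    n_p, n_r = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in data]
    rater_means = [sum(row[j] for row in data) / n_p for j in range(n_r)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_person = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_rater = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_resid = ss_total - ss_person - ss_rater

    ms_person = ss_person / (n_p - 1)
    ms_rater = ss_rater / (n_r - 1)
    ms_resid = ss_resid / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    return {
        "person": max((ms_person - ms_resid) / n_r, 0.0),
        "rater": max((ms_rater - ms_resid) / n_p, 0.0),
        "residual": ms_resid,  # person-by-rater interaction confounded with error
    }


def g_coefficient(var, n_raters):
    """Generalizability coefficient for relative decisions with n_raters raters."""
    return var["person"] / (var["person"] + var["residual"] / n_raters)


components = g_study(scores)
print(components)
print("G with 1 rater:", round(g_coefficient(components, 1), 3))
print("G with 3 raters:", round(g_coefficient(components, 3), 3))
```

Comparing the coefficient for one rater and for three raters shows the kind of decision the abstract refers to: how many raters are needed to reach a target level of reliability.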