DSpace at EWHA: 한국어 말하기 수행 평가의 발음 범주 채점에 대한 타당성 검증 (Study on Validity Testing of Pronunciation Rating for Speaking Performance Evaluation)

Title: 한국어 말하기 수행 평가의 발음 범주 채점에 대한 타당성 검증

Other Titles: Study on Validity Testing of Pronunciation Rating for Speaking Performance Evaluation

Authors: 이향

Issue Date: 2013

Department/Major: Graduate School, Department of Korean Language and Literature

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 박창원

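Abstract: This study aims to verify the validity of pronunciation-category scoring in Korean speaking performance evaluation. To this end, theory-based validation and scoring validation were carried out as part of the validation process for pronunciation-category scoring in Korean speaking evaluation. First, drawing on existing theory and research on speaking performance evaluation, the study defines Korean speaking performance evaluation as 'the series of processes in which a rater uses a rating scale to score an examinee's Korean speaking performance elicited through the tasks of an evaluation instrument'. A survey of the speaking performance evaluation instruments used to date led to the view that computer-based speaking performance evaluation, currently the most widely used instrument, is appropriate not only for small-scale but also for large-scale evaluation. Next, recent changes in the concept of validity were reviewed, with attention to the unified concept of (construct) validity. On that basis, verifying the validity of speaking performance evaluation was taken to mean 'validation', that is, 'the series of processes of collecting evidence of validity for interpreting evaluation results', and this view was applied to verifying the validity of pronunciation-category scoring. Specifically, based on the 'validation framework for speaking performance evaluation' proposed by Weir (2004), theory-based validation and scoring validation were carried out for pronunciation-category scoring in Korean speaking performance evaluation. First, as the theory-based validation process, theories and studies on speaking evaluation were reviewed to confirm the position and independence of the pronunciation category within speaking performance evaluation, and the constructs to be evaluated within the pronunciation category were then selected. As a result, segments (consonants and vowels, phonological changes), suprasegments (intonation), and speech speed and pauses were taken as the constructs of pronunciation ability. Next, a pre-scoring validation process was used to propose concrete rating methods for these constructs, and a post-scoring validation process was used to verify the proposed methods objectively through statistical analysis. As the pre-scoring validation process, the rating criteria, raters, evaluation tasks, rating method, and rating scale were validated theoretically. On that basis, the researcher argued that 'accuracy' and 'intelligibility' are rating criteria for pronunciation evaluation that can be used selectively according to the evaluation situation and needs, and proposed 'fluency' as the rating criterion for speech speed and pauses. For rater selection, raters with experience in teaching Korean and in scoring speaking performance evaluations should be chosen. For evaluation tasks, 'construct-based tasks' were judged appropriate for measuring pronunciation ability in general-purpose speaking performance evaluation, and it was proposed that tasks be at a level that elicits speech of sufficient quantity and quality to be scored. In addition, an analytic rating method should be used to score segments and suprasegments on accuracy, a holistic rating method to score segments and suprasegments on intelligibility, and a holistic rating method based on fluency to score speech speed and pauses. Finally, a 6-point Likert scale was proposed as the rating scale. The study also carried out post-scoring validation of the proposed rating methods. A computer-based speaking performance evaluation was administered to 44 examinees, and 7 raters scored the performances using a scoring rubric that reflected the proposed rating methods; the resulting scores were then subjected to post-scoring validation using the multi-facet Rasch model and generalizability theory. The first multi-facet Rasch analysis identified two raters as misfitting, so a second multi-facet Rasch analysis was run on the scores with these two raters excluded. 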
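As a result, it was confirmed that the proposed rating criteria were used independently and with discriminating power in scoring each construct, and that the raters' scoring showed both intra-rater and inter-rater consistency. It was also confirmed that difficulty increased in the order 'suprasegment rating based on accuracy → segment rating based on accuracy → speech speed and pause rating based on phonological fluency → segment and suprasegment rating based on intelligibility'. On the task facet, the read-aloud, picture-based storytelling, and narration tasks used in the evaluation were found to function independently and with discriminating power, in a statistically significant way, with no difference in difficulty. The 6-point Likert rating scale was also found to be used by the raters with relatively equal intervals and with a shared interpretation across raters. Finally, validation based on generalizability theory was carried out using the scores of the five raters. The generalizability study showed that the largest source of score variance in this evaluation was the examinee factor. This means that the scores mostly reflect differences in the examinees' pronunciation ability, which can be interpreted positively. Next, a decision study was conducted to explore the generalizability coefficient of the scores; it showed that a generalizability coefficient of .9 or higher can be secured when two raters score three or more tasks. This means that, when these conditions are met, consistent rating results can be obtained even if other raters score with the proposed rating methods. It was also found that increasing the number of raters raised the generalizability coefficient more efficiently than increasing the number of tasks. This study proposed rating methods limited to the pronunciation category of speaking performance evaluation and validated them. Although the study is limited to the pronunciation category, the validation process used here should be applicable to other categories as well. It is further expected that, with qualitative analysis carried out in parallel in the future, the theory-based verification methods and objective analysis methods used here can complement each other and serve as reference material for developing speaking performance evaluations and designing professional training for speaking raters. ;The primary goal of speaking performance evaluation is to predict an examinee's speaking ability in real-life communication from scores that measure that ability. To that end, the construct, the measuring process, and the measuring results (scores) must all be valid. What, then, is meant by a 'valid evaluation'? As the concepts of 'validity' have been unified into 'construct validity', the validity of educational evaluation has come to be defined as the extent to which an evaluation accurately indicates, through scores, the level of an examinee's linguistic knowledge or ability (the construct). This change in the concept of validity has drawn attention to how the validity of an evaluation should be verified and how the evaluation itself should be validated. 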
Moreover, as the types of evidence for establishing the validity of evaluation methods have diversified since Messick's model was applied to language evaluation in the 1990s, verification of validity (hereinafter 'validation') has come to be recognized as the process of reaching a final conclusion based on all of the evidence collected. According to Weir (2004), validation of speaking performance evaluation comprises five components: 'theory-based validity', 'context validity', 'scoring validity', 'consequential validity', and 'criterion-related validity'. Following the framework suggested by Weir, both theoretical and empirical validation were conducted in this study by carrying out 'theory-based validity' and 'scoring validity' procedures, focusing solely on the 'pronunciation' category of speaking performance evaluation. These procedures were aimed at recommending valid rating methods for measuring examinees' pronunciation ability. First, as the theory-based validation process, established communication models in the field of foreign language education were reviewed, and the status of the 'pronunciation' category in studies of speaking evaluation in Korean language education was examined. As a result, it was confirmed that pronunciation ability is a critical category that must be included as an independent part of speaking performance evaluation when assessing examinees' speaking ability. It was also found that the concepts of the 'pronunciation' category suggested so far in studies of Korean language education are vague. Moreover, there were inconsistencies in the concepts and evaluation methods for the constructs that should be rated within the pronunciation category. 
Against this backdrop, after examining the pronunciation-related constructs used in pronunciation education and evaluation in the foreign language field, this study selected 'segment', 'suprasegment', 'speech speed', and 'pause' as the constructs for pronunciation evaluation in Korean speaking evaluation. The 'segment' construct comprises 'phoneme', 'syllable', and 'phonological change', and the 'suprasegment' construct comprises 'intonation'. As a next step, this study suggests specific rating methods for these constructs through a pre-scoring validation process. The suggested rating methods were then objectively validated through a post-scoring validation process using quantitative analysis. First, as the pre-scoring validation process, the rating criteria, raters, tasks, rating method, and rating scale all underwent theory-based validation. Based on the results, this paper asserts that 'accuracy' and 'intelligibility' can be used selectively as rating criteria for evaluating pronunciation, depending on the circumstances and needs of the evaluation. It is also reaffirmed that, because the rater is one of the factors with the greatest impact on evaluation results, raters should be experienced in teaching Korean and in rating speaking performances. Given that this evaluation targets general rather than purpose-specific speaking performance, 'construct-based tasks' are suggested as adequate for rating pronunciation ability. Additionally, it is proposed that tasks should elicit speech of sufficient quantity and quality for reasonable evaluation. For rating 'segment' and 'suprasegment' based on 'accuracy', an 'analytic rating method' is suggested, while a 'holistic rating method' is suggested for rating the same two constructs based on 'intelligibility'. 
Also, for rating the 'speed' and 'pause' of speech based on 'phonological fluency', a 'holistic rating method' is recommended. Finally, a '6-point Likert scale' is proposed as the rating scale. In the following step, computer-based speaking performance evaluations were conducted, and seven raters graded them according to the rating methods suggested above. Post-rating validation using Multi-Facet Rasch Analysis and Generalizability Theory was then performed on the rating data. The first Multi-Facet Rasch Analysis identified two raters as misfitting, so the analysis was re-run without them. The re-analysis confirmed not only that the suggested rating criteria were used independently and effectively for rating each of the constructs, but also that the raters' scores showed both 'inter-rater consistency' and 'intra-rater consistency'. 'Severity', however, appeared to vary from one rater to another. Additionally, the level of 'difficulty' rose in the following order: 'suprasegment rating based on accuracy' → 'segment rating based on accuracy' → 'speech speed and pause rating based on phonological fluency' → 'segment and suprasegment rating based on intelligibility'. With regard to evaluation tasks, 'read aloud', 'describing picture', and 'narration' all appeared to work independently and effectively, in statistically significant ways, with no difference in difficulty. As for the 6-point Likert scale used for the rating, the raters used it with relatively equal intervals, indicating that they shared the same understanding of how the scale works. In the next step, the rating results of the five raters underwent validation based on Generalizability Theory. The Generalization Study showed that the largest source of variance in this evaluation was the 'examinee' factor. 
Because this means that the scores chiefly reflect differences in the examinees' pronunciation ability, the result can be interpreted positively. In the final stage, a Decision Study was conducted to examine the generalizability coefficients of the rated scores. It showed that a generalizability coefficient of .9 or higher could be obtained when two raters scored three or more tasks. This means that, once these conditions are met, consistent rating results can be achieved even when different raters grade using the suggested rating methods. Moreover, increasing the number of raters was found to raise the generalizability coefficient more efficiently than increasing the number of tasks. In conclusion, this study limits its scope to the pronunciation category of speaking performance evaluation, suggests rating methods for it, and validates them. Although the scope is limited, the validation processes used in the study are believed to be applicable to other evaluation categories. Moreover, if qualitative analysis is run in parallel in future work, the theory-based validation and quantitative analysis used in this study are expected to complement each other and to serve as reference material for developing speaking performance evaluations and designing rater training courses.
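
The Decision Study reasoning in the abstract can be illustrated with a small sketch. It assumes a fully crossed examinee × rater × task design with random facets and uses the standard relative generalizability coefficient for that design; the abstract does not state the actual design. The function name `g_coefficient` and the variance components below are hypothetical placeholders, not values from the thesis.

```python
from itertools import product


def g_coefficient(var_p, var_pr, var_pt, var_prt_e, n_raters, n_tasks):
    """Relative generalizability coefficient for a crossed p x r x t design.

    var_p     : examinee (object of measurement) variance
    var_pr    : examinee-by-rater interaction variance
    var_pt    : examinee-by-task interaction variance
    var_prt_e : residual (three-way interaction plus error) variance
    """
    if n_raters < 1 or n_tasks < 1:
        raise ValueError("n_raters and n_tasks must be at least 1")
    # Relative error variance shrinks as raters and tasks are added.
    relative_error = (
        var_pr / n_raters
        + var_pt / n_tasks
        + var_prt_e / (n_raters * n_tasks)
    )
    return var_p / (var_p + relative_error)


# Hypothetical variance components for illustration only (NOT from the thesis).
HYPOTHETICAL = dict(var_p=1.00, var_pr=0.10, var_pt=0.05, var_prt_e=0.30)

# Tabulate the coefficient for several rater/task combinations.
for n_r, n_t in product((1, 2, 3), (1, 2, 3, 4)):
    coef = g_coefficient(n_raters=n_r, n_tasks=n_t, **HYPOTHETICAL)
    print(f"raters={n_r} tasks={n_t} coefficient={coef:.3f}")
```

With variance components estimated in an actual generalizability study, the same loop would show which combinations of raters and tasks reach a target coefficient such as .9, and whether adding raters or adding tasks raises the coefficient faster.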