DSpace at EWHA: A Study on the Assessment of Korean Speech Act Competence

Title: 한국어 화행 능력 평가 연구 (A Study on the Assessment of Korean Speech Act Competence)

Authors: 조성해

Issue Date: 2023

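Department/Major: Graduate School of International Studies, Department of Korean Studies

Publisher: Ewha Womans University Graduate School of International Studies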

Degree: Doctor

Advisors: 박선희

Abstract: The purpose of this study was to conduct and analyze a Korean speech act performance assessment whose test quality is assured. It has long been pointed out that pragmatic competence should be explicitly taught and assessed, but even basic research has been scarce, and practical application has been still less satisfactory. The few existing studies on Korean pragmatic competence assessment concentrated on measuring knowledge rather than performance, and practicality was not discussed.
Only the reliability and construct validity of pragmatic knowledge tests were examined, and even then test quality was not systematically argued. This study therefore conducted a Korean speech act performance assessment following a modified version of Chapelle's (2008) validation model and examined its reliability, validity, and practicality. The study is organized as follows.

In Chapter I, the purpose and necessity of the research were stated, and previous literature was reviewed in two strands: research on pragmatic assessment constructs and research on pragmatic assessment tools. Reviewing these studies determined the scope of the construct and the evaluation tool suited to measuring it. In keeping with the research purpose of securing and verifying test quality, research questions and hypotheses were set up in terms of reliability, validity, and practicality.

In Chapter II, the theoretical background was described. The concept and characteristics of the speech act, which was set as the construct, were explained. In relation to the research questions and hypotheses, the concept and characteristics of test usefulness were examined, focusing on reliability, validity, and practicality. With regard to the research method, the argument-based approach and its representative model, the validation model of Chapelle (2008), were introduced. To suit the purpose of this study, a modified Chapelle (2008) model was presented that simplifies the argumentation process and incorporates the practicality concept of Bachman and Palmer (1996).

In Chapter III, the research method was described. The research process was outlined, and the development of a rubric framework through a series of validation steps was explained. The pre-pilot test procedure and participants were introduced. The design process of the speech act competence measurement tool provided backing for the domain description inference. Rater training, the analysis methods, and the analysis results provided backing for the evaluation inference and the practicality inference.
The pilot test procedure and participants were then introduced, and the revised speech act competence evaluation tool and the analysis methods were explained.

In Chapter IV, the results were described. The generalization inference was backed by inter-item consistency reliability on the item side and by inter-rater reliability and rater severity on the rater side. Both from the perspective of classical test theory, which assumes a single source of error, and from that of item response theory, which considers multiple sources of error, inter-item consistency reliability and inter-rater reliability were high, and rater severity showed adequate fit. These results were also statistically significant. The high inter-item consistency reliability was attributed to the characteristics of the evaluation tool and the systematic validation procedure. The high inter-rater reliability and appropriate rater severity were attributed to the characteristics of the evaluation tool, the systematic rater training, and the development of the evaluation tool. The explanation inference was backed by content validity and construct validity. For the former, the literature review, expert analysis, and simplified needs analysis all indicated that the items were appropriate and represented the target language use domain of the examinees; that is, the content was valid. For the latter, the group difference analysis examined differences by proficiency and by length of residence, both of which are related to the construct, at the overall test level and at the individual item level, and these differences were statistically significant. In contrast, cultural area, a variable external to the construct, did not reflect the level of speech act competence. Construct validity was thus demonstrated in that the test measured what it was intended to measure and was not affected by construct-irrelevant variance. In the correlation coefficient analysis, the individual items were correlated in that they were all speech acts at the macro level, yet each measured a distinct ability in terms of function or level. This result was statistically and logically meaningful.
The factor analysis results, however, did not support this; they reflected more the commonality of speech acts. Although individual speech acts did not emerge as independent factors in the factor analysis, the results of the correlation coefficient analysis were sufficiently persuasive, and the factor analysis results also left room for justification. The high content validity was attributed to the systematic validation procedure. In the group difference analysis, the high construct validity was attributed to the characteristics of the evaluation tool and the systematic and multifaceted validation procedure. In the internal structure analysis, the divergent accounts of construct validity were attributed to the nature of the speech act, the systematic validation procedure, the use of the same measurement tool, and the small samples of items and examinees. The practicality inference was backed by analyses of human resources, material resources, and time. As for human resources, the ratios of examinees to proctors and of examinees to raters were balanced, but the amount of supervision per proctor and the amount of scoring per rater were not reasonable. On the numbers alone, the test appeared impractical. However, setting aside the research-specific measures and limitations and considering field application, the required human resources did not exceed the available resources. As for material resources, the web-based test was relatively simpler than the classroom test, but by an absolute standard the required material resources did not exceed the available resources in either case, so the test was practical. As for time, both the average combined response and scoring time and the longest scoring time stayed within the prescribed time. The test was practical in that the required time did not exceed the available time. The divergent accounts of human resource practicality were attributed to the difference in context between research and the field.
The high practicality of material resources was attributed to the characteristics of the evaluation tool, and the high practicality of time to the characteristics of the evaluation tool and the systematic validation procedure.

In Chapter V, the findings were summarized, the significance and limitations of the study were stated, and suggestions for follow-up studies were offered. The significance of the research is as follows. First, speech act evaluation was conducted as a performance assessment, with a rubric as the scoring tool. With few precedents available, a Korean speech act performance assessment was attempted, and a pragmatic rating rubric was developed for the first time. Second, practicality was emphasized as one of the conditions that determine test quality. Previous pragmatic assessment studies had discussed practicality very little. By highlighting practicality alongside reliability and validity, the study made test usefulness more balanced. Third, construct validation was strengthened. Unlike prior studies, group differences were analyzed at both the overall test level and the individual item level. Abundant backing for the construct was collected by defining groups not only by proficiency and length of residence but also by cultural area, a variable external to the construct. Fourth, an argument-based validation model was introduced. Explanatory power was enhanced by systematically establishing the rationale from construct extraction and selection through to assessment administration and verification. The limitations of the research and the suggestions are as follows. First, the construct was limited to speech acts. Follow-up studies assessing a variety of pragmatic features, as well as prior studies investigating the acquisition of those individual features, were proposed. Second, the beginner group was included among the examinees. As a result, some target items for the beginning level were reported to be inappropriate for assessing the intermediate and advanced levels.
Separating the beginner group from the other groups, extending the construct to discursive pragmatics and the formal register, and setting a time limit per item were suggested. Third, the 4-point scale, chosen for ease of scoring, was coarse relative to the size of the group. To identify an ideal scale, comparing the results with scoring on a 6-point scale or conducting a decision study was suggested.