DSpace at EWHA: The Effects of Reverse-Scored Items on Detecting Insufficient Effort Responses and a Comparison between Detection Methods

Title: 불성실 응답 탐지에 대한 역채점 문항의 효과 및 탐지 방법 비교

Other Titles: The Effects of Reverse-Scored Items on Detecting Insufficient Effort Responses and a Comparison between Detection Methods

Authors: 곽예린

Issue Date: 2024

Department/Major: Graduate School, Department of Education

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 최윤정

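Abstract: The purpose of this study is to examine the impact of insufficient effort responses (IER) on test validity and reliability, the effectiveness of each IER detection method, the relationships among detection methods, and the function of reverse-scored items in handling IER. It also examines whether, in an educational context, the IER group and the non-IER group differ significantly in the cognitive domain (academic achievement), the affective domain (e.g., attitudes toward school), and other factors (e.g., response time). In selecting the data, the study considered whether the setting was a low-stakes testing environment in which IER are likely to occur, whether the instrument was a self-report survey whose data quality can be substantially degraded by IER, and whether it consisted of scales with a consistent unit of measurement that facilitates statistical analysis. Accordingly, the study used the responses of Korean eighth graders to the 26 items on ‘attitudes toward science’ (interest, confidence, and value recognition) in the TIMSS 2019 background questionnaire. TIMSS is an international comparative study of trends in mathematics and science achievement, administered in participating countries every four years. To inform educational quality improvement and policymaking, it surveys not only students' mathematics and science achievement but also student- and school-level educational context variables (박상욱 et al., 2019). In other words, the study examines the impact of IER on a testing environment and format widely used in education, and proposes IER detection and data preprocessing as a procedure for obtaining unbiased, accurate analysis results. Previous studies have proposed a variety of methods for detecting IER; considering the characteristics of the data used here, this study selected the long-string, intra-individual response variability, Mahalanobis distance, psychometric antonym, and person-fit methods. It examined how strongly the indices produced by each method correlate and how well the methods agree in classifying respondents into IER and non-IER groups. This showed that each detection method is more effective for particular types of IER. The long-string and intra-individual response variability methods were effective at screening out non-random, consecutive response patterns, whereas the Mahalanobis distance and person-fit methods were effective at screening out random patterns that deviate from normality. Next, to examine the function of reverse-scored items in accurately detecting IER, conditions that take the number and position of reverse-scored items into account were applied when setting the IER removal criteria. Reverse-scored items can be an effective tool for screening out careless or insufficient effort responses. In the TIMSS 2019 ‘attitudes toward science’ questionnaire, two items in the interest factor and four items in the confidence factor were reverse-scored. The study assumed that, even within a single detection method, effectiveness would differ depending on how the cut-off separating the IER group from the valid-sample (non-IER) group is set. Five conditions were therefore compared: condition 1, in which each method's index was computed and the removal criterion set on the basis of the reverse-scored items of the interest factor; condition 2, based on the reverse-scored items of the confidence factor; condition 3, which removed the intersection of the two preceding conditions; condition 4, which removed their union; and condition 5, based on the whole test. For the person-fit method, which is based on item response theory, only condition 5, based on the whole test (26 items), was applied, in view of the numbers of respondents and items required for accurate parameter estimation. Confirmatory factor analysis (CFA) was conducted to examine the impact of IER on test validity, and Cronbach's α was computed for each factor and for the whole test to compare test reliability. When IER were removed, model fit improved and factor loadings rose slightly.
For test reliability, however, no meaningful change was found because the original data were already sufficiently reliable. In addition, to compare the characteristics of the IER and valid-sample groups, between-group difference tests were conducted on attitudes toward science, respondent characteristic variables, and science achievement test scores. On attitudes toward science, the valid-sample group generally scored higher than the IER group, but on the confidence factor, where the proportion of reverse-scored items was relatively high at 50%, the IER group sometimes scored higher. This indicates that, whereas IER may be associated with lower scores when there is a fixed correct answer, as in an achievement test, they are not necessarily associated with lower scores on affective-domain questionnaires that have no correct answer. Next, from the variables not used for IER detection, those judged likely to be related to IER were selected as respondent characteristic variables: attitudes toward school, attitudes toward class, the proportion of science achievement items not reached, and response time. Significant between-group differences appeared across respondent characteristics overall: the valid-sample group was more positive about school and class than the IER group, reached a higher proportion of the science achievement items, and had significantly longer response times. The valid-sample group also scored significantly higher than the IER group on the science achievement test. The implications drawn from these results are as follows. First, the types of IER that can be found most effectively differed by detection method, and the proportion of respondents classified as IER varied widely, from 3.1% to 22.5%, depending on the detection method and removal criterion. It is therefore advisable to combine methods that target different types of IER, for example the long-string method with the Mahalanobis distance method, or the intra-individual response variability method with the person-fit method, or to compare the results of two or more methods and choose the more effective approach. Second, for test validity, model fit improved significantly, showing an effect of removing IER, but the effect on test reliability was negligible. This appears to reflect not a problem with the IER detection methods themselves but the fact that the reliability of the original data was already high, which made comparison difficult. It is therefore necessary to examine what changes occur when the reliability of the original data is sufficiently low. Third, the IER group held more negative perceptions of school and class than the valid-sample group, and also showed lower participation in and lower scores on the achievement test. This means that, to collect high-quality data and analyze test results accurately in research on school education, students' perceptions of and motivation toward learning and assessment need to be managed. Moreover, the association between insufficient effort and low achievement shows that low achievement scores do not necessarily reflect low ability and can be affected by careless or insufficient effort responding. Because this situation arises especially often in low-stakes testing environments, the results of low-stakes tests should be analyzed with caution. For data such as TIMSS 2019, which are low-stakes for the test takers but whose results are used widely, data quality needs to be managed through IER detection and removal in order to obtain more accurate analysis results.

The purpose of this study is to examine the effects of insufficient effort responses (IER) on test validity and reliability, the effectiveness of different IER detection methods, the relationships among detection methods, and the function of reverse-scored items in dealing with IER.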
Additionally, the study investigated whether the IER group and the non-IER group differed significantly in the cognitive domain (academic achievement), the affective domain (e.g., attitudes toward school), and other factors (e.g., response time) in an educational context. The data were the responses of South Korean eighth graders to 26 items on attitudes toward science (interest, confidence, and value recognition) from the TIMSS 2019 background survey. In selecting the data, I considered whether the setting was a low-stakes testing environment prone to IER due to lack of motivation, whether the instrument was a self-report survey whose data quality could be significantly degraded by IER, and whether it consisted of scales with a consistent unit of measurement that facilitates statistical analysis. In other words, the goal was to examine the impact of IER on testing environments and methods that are widely used in education, and to propose IER detection and data cleaning procedures for obtaining unbiased and accurate analysis results. Various methods for detecting IER have been proposed in previous studies. Considering the characteristics of the test data used in this study, I selected five methods: the long-string, intra-individual response variability, Mahalanobis distance, psychometric antonym, and person-fit methods. I examined the correlations among the indices calculated by each method and the agreement among the methods in classifying respondents into the IER and non-IER groups. This showed that certain types of IER are detected more effectively by certain methods. While the long-string and intra-individual response variability methods were effective in filtering out non-random continuous response patterns, the Mahalanobis distance and person-fit methods were effective in filtering out random patterns that deviate from normality.
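As a rough illustration of the two indices aimed at non-random continuous patterns, the long-string index and intra-individual response variability (IRV) can be sketched as follows. This is a minimal sketch, not the thesis's implementation: the function names, the toy answers, and the use of the sample standard deviation for IRV are assumptions, and cut-offs for flagging respondents are deliberately left out.

```python
from statistics import stdev


def longstring(responses):
    """Return the length of the longest run of identical consecutive answers.

    Assumes at least one answer; answers are taken as given (an assumption:
    before any reverse-scoring).
    """
    longest = current = 1
    for previous, answer in zip(responses, responses[1:]):
        current = current + 1 if answer == previous else 1
        longest = max(longest, current)
    return longest


def irv(responses):
    """Return the within-person sample standard deviation across items."""
    return stdev(responses)


# Hypothetical answers on a 4-point agreement scale (not TIMSS data).
straightliner = [3, 3, 3, 3, 3, 3, 3, 3]
varied = [1, 4, 2, 3, 4, 1, 2, 3]

for label, answers in [("straightliner", straightliner), ("varied", varied)]:
    print(label, longstring(answers), round(irv(answers), 3))
```

A long run combined with near-zero variability is the pattern these two indices target. The Mahalanobis distance, psychometric antonym, and person-fit indices are not sketched here.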
Next, I examined the role of reverse-scored items in accurately detecting IER by applying conditions that took the number and position of reverse-scored items into account when setting the criteria for removing IER. Reverse-scored items can be an effective tool for filtering out IER. In the TIMSS 2019 ‘Attitudes Toward Science’ survey, two items in the interest factor and four items in the confidence factor were reverse-scored. I assumed that, even within a single detection method, effectiveness would vary depending on how the cut-off between the IER group and the non-IER group was set. Therefore, I compared five conditions: condition 1, which computed each method's index and set the cut-off based on the reverse-scored items of the interest factor; condition 2, which did so based on the reverse-scored items of the confidence factor; condition 3, which removed the intersection of the respondents flagged under conditions 1 and 2; condition 4, which removed their union; and condition 5, which used the whole test (26 items). For the person-fit method, which is based on item response theory, only condition 5 was applied, because accurate parameter estimation requires a sufficient number of respondents and items. Confirmatory factor analysis (CFA) was conducted to examine the impact of IER on test validity, and Cronbach's alpha was calculated for each factor and for the whole test to compare test reliability. When IER were removed, model fit improved and factor loadings increased slightly. However, I did not find a meaningful change in test reliability, likely because the reliability of the raw data was already high. Furthermore, to compare the characteristics of the IER and non-IER groups, I conducted between-group difference tests on attitudes toward science, respondent characteristic variables, and science achievement test scores. In general, the non-IER group scored higher on attitudes toward science than the IER group.
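The set logic behind conditions 3 and 4 and the reliability coefficient can be sketched as follows. This is a hedged illustration only: the respondent IDs, the flagged sets, and the score matrix are invented, and `cronbach_alpha` applies the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), with sample variances, which may differ in detail from the software used in the thesis.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])
    item_vars = [variance(column) for column in zip(*rows)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical respondent IDs flagged by one detection method under
# condition 1 (interest reverse-scored items) and condition 2 (confidence).
flagged_c1 = {"s02", "s05", "s09"}
flagged_c2 = {"s05", "s07", "s09"}

flagged_c3 = flagged_c1 & flagged_c2  # condition 3: remove the intersection
flagged_c4 = flagged_c1 | flagged_c2  # condition 4: remove the union

# Toy item scores (assumed already reverse-scored where needed).
scores = {
    "s01": [4, 4, 3], "s02": [1, 4, 1], "s03": [2, 2, 2],
    "s04": [3, 3, 4], "s05": [4, 1, 4], "s06": [1, 1, 2],
}


def alpha_without(removed):
    """Alpha after dropping the respondents whose IDs are in `removed`."""
    kept = [row for sid, row in scores.items() if sid not in removed]
    return cronbach_alpha(kept)


print("all respondents:", round(cronbach_alpha(list(scores.values())), 3))
print("condition 4 removed:", round(alpha_without(flagged_c4), 3))
```

The toy numbers only show the mechanics of comparing reliability before and after removal; they say nothing about the size of the change observed in the TIMSS data.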
However, on the confidence factor, where the proportion of reverse-scored items was relatively high at 50%, there were cases where the IER group scored higher. This suggests that, while IER may be associated with lower scores in situations with a fixed correct answer, such as achievement tests, they are not necessarily linked to lower scores in affective-domain self-report surveys that have no correct answer. Next, from the variables not used to detect IER, I selected those that seemed likely to be related to IER as respondent characteristic variables: attitudes toward school, attitudes toward class, the proportion of science achievement items not reached, and response time. Overall, there were significant between-group differences in respondent characteristics: the non-IER group had more positive attitudes toward school and class, reached a higher proportion of science achievement items, and had significantly longer response times than the IER group. The non-IER group also had significantly higher science achievement test scores than the IER group. Several implications follow from these findings. First, different detection methods were more effective at detecting different types of IER, and the percentage of respondents classified as IER ranged from 3.1% to 22.5% depending on the detection method and removal criteria. Therefore, it is advisable either to combine methods that detect different types of IER, such as the long-string method with the Mahalanobis distance method or the intra-individual response variability method with the person-fit method, or to compare the results of two or more methods and select the more effective approach. Second, in terms of test validity, model fit improved significantly, showing the effect of removing IER, but the effect on test reliability was negligible.
However, this reflects not a problem with the IER detection methods themselves but the fact that the reliability of the raw data was already high, which made comparison difficult. Therefore, it is necessary to examine what changes appear when the reliability of the raw data is sufficiently low. Third, the IER group had more negative perceptions of school and class than the non-IER group, as well as lower participation in and lower scores on the achievement test. This suggests that, to collect high-quality data and accurately analyze test results in research on school education, it is necessary to manage students' perceptions of and motivation toward learning and assessment. Moreover, the association between insufficient effort and low achievement shows that low achievement scores do not necessarily reflect low ability and can be influenced by IER. This situation is especially common in low-stakes testing environments, so caution is needed when analyzing the results of low-stakes tests. For data such as TIMSS 2019, where the test is low-stakes for participants but the results are used in many ways, data quality needs to be managed by detecting and removing IER to obtain more accurate analysis results.
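Two steps described in the abstract, comparing the percentage of respondents flagged by different methods and testing for a difference between the IER and non-IER groups, can be sketched as follows. The flags, the scores, and the choice of Welch's t statistic are all assumptions made for illustration; the abstract does not say which test statistic was used.

```python
from math import sqrt
from statistics import mean, variance


def percent_flagged(flags):
    """Percentage of respondents flagged as IER (flags is a list of booleans)."""
    return 100 * sum(flags) / len(flags)


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two groups."""
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(group_a) - 1) + vb ** 2 / (len(group_b) - 1)
    )
    return t, df


# Hypothetical flags from two methods that target different IER types.
longstring_flag = [True, False, False, False, True, False, False, False]
mahalanobis_flag = [False, False, True, False, True, False, False, False]
either_flag = [a or b for a, b in zip(longstring_flag, mahalanobis_flag)]

# Hypothetical achievement scores for the same eight respondents.
achievement = [430, 560, 455, 590, 410, 540, 575, 530]
ier = [s for s, f in zip(achievement, either_flag) if f]
valid = [s for s, f in zip(achievement, either_flag) if not f]

print(percent_flagged(longstring_flag), percent_flagged(either_flag))
print(welch_t(valid, ier))
```

Combining flags with a logical OR is only one way to pool methods. Requiring agreement between methods (logical AND) would flag fewer respondents, which is one reason the flagged percentage depends so strongly on the removal criterion.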