DSpace at EWHA: 문항과 검사 수준에서의 차별기능 분석 및 원인 탐색을 위한 통합적 접근

Title: 문항과 검사 수준에서의 차별기능 분석 및 원인 탐색을 위한 통합적 접근

Other Titles: An Integrated Approach to Detecting and Assessing Potential Causes of Differential Functioning at the Item and Test Level

Authors: 전현정

Issue Date: 2010

Department/Major: Graduate School, Department of Education

Publisher: The Graduate School, Ewha Womans University

Degree: Doctor

Advisors: 성태제

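Abstract: Academic achievement tests are developed to assess how much students know of what they have learned, in light of educational objectives, and are used to identify students' achievement levels and to monitor the quality of school education. For the results of an achievement test to be trusted, students' achievement must first be measured accurately and fairly, and the instrument must be developed validly and fairly so that it does not function differentially according to characteristics of examinee groups rather than the trait or ability the test is intended to measure. In particular, for high-impact, large-scale achievement tests such as the National Assessment of Educational Achievement (NAEA), fairness should be verified at the test development stage so that the test is fair to all students. As a way of diagnosing fairness in large-scale achievement tests, this study analyzes differential functioning at the level of individual items, item bundles, and the whole test, and explores potential causes of differential functioning, in order to provide psychometric information for securing the fairness of the instrument and improving test quality. To this end, the DFIT framework, the LOR approach, and the SIBTEST method, which allow analysis of differential item functioning (DIF) as well as differential bundle functioning (DBF) and differential test functioning (DTF), were applied to examine whether gender-related differential functioning exists in the NAEA mathematics test for first-year high school students (10th grade). In addition, to determine which elements of the items give rise to differential functioning, the characteristics of the items flagged as DIF were analyzed by content area, behavior area, and item format; differential distractor functioning (DDF) analysis was conducted for multiple-choice items and differential step functioning (DSF) analysis for constructed-response items. Based on these analyses of item characteristics, content experts carried out a qualitative analysis of individual items, and, based on the findings, an integrated approach model for analyzing differential functioning and exploring its causes was proposed. The results of analyzing gender-related differential functioning in the NAEA 10th-grade mathematics test at the item, bundle, and test levels, and of exploring its causes, are as follows.
First, in the DIF analysis, DFIT flagged 4 of the 36 items (11.1%) based on the NCDIF index and 9 items (25.0%) based on the CDIF index. All items flagged by DFIT were multiple-choice items that favored male students. The LOR approach flagged 10 items (27.8%), of which 6 favored male students and 4 favored female students. By item format, all items favoring male students were multiple-choice, whereas the items favoring female students comprised 2 multiple-choice and 2 constructed-response items, so every constructed-response item flagged as DIF favored female students. SIBTEST flagged 10 items (27.8%), of which 5 favored male students and 5 favored female students. By item format, all items favoring male students were multiple-choice, whereas the items favoring female students comprised 3 multiple-choice and 2 constructed-response items, so again every constructed-response item flagged as DIF favored female students. Comparing the three methods, LOR and SIBTEST flagged a higher proportion of items than DFIT; 5 items (13.9% of the test) were flagged by all three methods, and 14 items (38.9%) were flagged by at least one method.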
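The direction of DIF, that is, which group a flagged item favored, was the same across all methods: 9 items favored male students and 5 favored female students, so more items favored male students. The agreement statistics computed to check how consistently DFIT, LOR, and SIBTEST flagged items were all fairly high, at .75 to .94, but the Kappa coefficients ranged from .36 to .85, so in some cases agreement between two methods was low once chance agreement was removed. Judged by Kappa, agreement was highest between LOR and SIBTEST and lowest between DFIT and SIBTEST. A correlation analysis of the DIF indices produced by the three methods gave high Pearson correlations of .846 to .947 and Spearman correlations of .935 to .975. The Pearson correlation was highest between DFIT and LOR and lowest between DFIT and SIBTEST. The Spearman rank correlations among the DIF indices were all above .9, with the highest, .975, between LOR and SIBTEST. Taken together, these results show that the magnitudes of the DIF indices were similar across methods, but because the criteria for flagging an item differ, the decisions about which items show DIF differed. Therefore, when analyzing DIF, the results from different methods should be used in a complementary way rather than relying on the results of any single method.
Second, in the DBF analysis, DFIT indicated that the content areas 'letters and formulas' and 'figures' favored male students, and that the behavior areas 'understanding' and 'reasoning' favored male students. Under the LOR approach, however, no subarea in either the content or the behavior areas was flagged as DBF, and under SIBTEST only 'reasoning' favored male students. Compared with the item-level results, LOR and SIBTEST flagged more items favoring female students than DFIT did, and within subareas the effects of items favoring male students and items favoring female students offset each other, so differential functioning did not appear at the bundle level. In other words, the analysis confirmed the phenomenon of DIF cancellation, in which differential functioning exists at the individual item level but its effect shrinks at the bundle level so that no differential functioning appears. Therefore, when analyzing differential functioning in item bundles, the item-level results must be checked to see how the effects of DIF items are reflected at the bundle level.
Third, in the DTF analysis, both DFIT and LOR indicated differential functioning at the test level. That is, the NAEA 10th-grade mathematics test shows gender-related differential functioning at the test level as well as at the item level, and the test favors male students. The presence of differential functioning at the test level means that test scores are affected by factors other than the ability the test is intended to measure, which can affect decisions based on those scores, so the differential functioning of items and of the test must be checked during test development.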
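The gap between the raw agreement statistics and the Kappa coefficients reported above can be illustrated with a minimal sketch. The function name and the counts below are hypothetical and are not taken from the thesis; they only show how two methods can agree on most of 36 items while Cohen's Kappa stays moderate, because most of that agreement comes from items neither method flags.

```python
def agreement_and_kappa(both, only_a, only_b, neither):
    """Return (raw agreement, Cohen's kappa) for two methods' DIF flags.

    Arguments are item counts from a 2x2 table (hypothetical helper):
    both    = flagged by method A and method B
    only_a  = flagged by A only
    only_b  = flagged by B only
    neither = flagged by neither method
    """
    n = both + only_a + only_b + neither
    p_observed = (both + neither) / n          # raw agreement rate
    p_a = (both + only_a) / n                  # A's flagging rate
    p_b = (both + only_b) / n                  # B's flagging rate
    # Agreement expected by chance if the two methods flagged independently
    p_chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# Made-up counts for a 36-item test: A flags 4 items, B flags 10, 4 in common.
p_o, kappa = agreement_and_kappa(both=4, only_a=0, only_b=6, neither=26)
print(f"agreement = {p_o:.2f}, kappa = {kappa:.2f}")  # agreement = 0.83, kappa = 0.49
```

With these invented counts the two methods agree on 30 of 36 items, yet Kappa is only about .49, since chance alone would already produce agreement on roughly two thirds of the items.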
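Finally, to explore the causes of differential functioning, the 10 items flagged as DIF by at least two of the three methods (DFIT, LOR, and SIBTEST) were analyzed in terms of the content area, behavior area, and item format they measure, with the following results. By content area, 4 items were flagged in 'figures' and 3 in 'letters and formulas'. In particular, 4 of the 9 items in 'figures' (44.4%) were flagged, and all of them favored male students. This is consistent with previous research finding that figures and geometry favor male students. By behavior area, 4 items each were flagged in 'problem-solving' and 'understanding', and 2 in 'reasoning'. In particular, 4 of the 8 'problem-solving' items (50.0%) were flagged, and 3 of them favored male students. Although 'problem-solving' was not flagged as DBF in this study, in every method it favored male students to a greater degree than the other areas. By item format, the 10 flagged items comprised 8 multiple-choice items (26.7% of the 30 multiple-choice items) and 2 constructed-response items (33.3% of the 6 constructed-response items). Among the multiple-choice items, 6 favored male students and 2 favored female students; both constructed-response items favored female students. This is similar to previous research finding that male high school students outperform female students on multiple-choice items. Taken together, the characteristics of the items flagged as DIF in the mathematics achievement test were similar to the findings of previous research on content areas, behavior areas, and item formats, which suggests that the commonly identified characteristics may be causes of gender-related differential functioning. However, because differential functioning in any given item is influenced by diverse and complex item characteristics and testing conditions, this study additionally analyzed the item parameters (difficulty, discrimination, and guessing) of the DIF items, and conducted DDF analysis for multiple-choice items based on the distribution of responses across options and DSF analysis for constructed-response items based on the distribution of partial-credit scores. In addition, content experts carried out a qualitative analysis of these characteristics and of the responses students may actually show while solving the items, in order to explore the causes of differential functioning. Therefore, to flag DIF items and use the results to improve test quality, psychometric analysis should be accompanied by qualitative analysis by content experts. Analyzing whether differential functioning exists at the item, bundle, and test levels of an achievement test, and what causes it, is thus a process that must be carried out to develop valid and fair tests. In particular, items flagged as DIF need item-level review and revision so that they function more neutrally, and care is needed in selecting assessment content and composing the overall test so as to minimize the influence of the content and behavior areas identified as causes of differential functioning. In addition, depending on the intended use of the test, item characteristics such as item format, presentation format, position of the item in the test, and the relevance of item content to real life, together with potential factors that can affect differential functioning, such as male and female students' learning styles, learning dispositions, and teaching and learning contexts, should be fully considered.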
Furthermore, because findings on the causes of differential functioning can provide information that helps improve curricula and teaching and learning processes, empirical studies that apply various conditions and methods to real test data to detect DIF items and analyze their causes should continue to be conducted.
The purpose of achievement tests is to measure the knowledge students have learned in class against educational objectives. This means that a test should have content validity so that it measures the level of their knowledge precisely. Thus, a test should be examined for fairness and validity. Through this process, the test can be kept from measuring abilities that achievement tests are not intended to measure. Furthermore, it is important to detect factors that can affect a particular group positively or negatively. Recently, the National Assessment of Educational Achievement (NAEA), a large-scale test, has been used to diagnose and improve educational conditions and to provide a basis for establishing educational policy. In particular, since 2008 the NAEA has been expanded to all students in the 6th, 9th, and 10th grades to evaluate individual achievement, especially whether students reach the basic level of the test. In addition, because the results were made public, various studies using the test data have been conducted in both academic and policy contexts. As the importance of the NAEA has been recognized, its effect on individual students as well as on educational policy makers has increased. Under these circumstances, the test and its results must be reliable. This reliability is obtained through precise and fair measurement of student achievement, which in turn requires developing tests that are as valid and fair as possible. Therefore, studies of test validity, and especially of test fairness, should be conducted for high-stakes tests such as the NAEA.
In the psychometric field, the study of differential item functioning (DIF) is one of the methods used to assess test fairness and validity; it examines whether items function differently across groups. DIF means that an item measures differently for one group than for another because of characteristics such as gender, race, or ethnicity, not because of ability. The purpose of this study is to provide psychometric information for improving test quality and validity by examining DIF, differential bundle functioning (DBF), and differential test functioning (DTF) by group characteristics in the NAEA. It also investigates the sources of DIF, DBF, and DTF. The differential functioning of items and tests framework (DFIT), the log-odds ratio approach (LOR), and the simultaneous item bias test (SIBTEST) were employed to examine DIF, DBF, and DTF by gender in the NAEA mathematics achievement test for 10th-grade students. In addition, the items were analyzed by content area, behavior area, and item format to explore the causes of DIF. Differential distractor functioning (DDF) analysis was applied to multiple-choice items, and differential step functioning (DSF) analysis was applied to constructed-response items. The focus of this study was thus to investigate DIF, DBF, and DTF by gender in the 10th-grade mathematics achievement test and to explore their causes. The results of this study were as follows.
First, in the DIF analysis by DFIT, four of the thirty-six items (11.1%) indicated DIF based on the NCDIF index, whereas nine items (25.0%) displayed DIF based on the CDIF index. All of them were multiple-choice items and favored male students. When LOR was applied, ten of the thirty-six items (27.8%) showed DIF. Six items favored male students, whereas four items favored female students. With respect to item format, all items identified as DIF in favor of males were multiple-choice items.
In contrast, of the items identified as DIF in favor of females, two were multiple-choice items and the other two were constructed-response items. Thus, all constructed-response items identified as DIF favored females. The results of SIBTEST showed that ten items (27.8%) displayed DIF, of which five favored females and the other five favored males. With respect to item format, all items identified as DIF favoring males were multiple-choice items, whereas of the items identified as DIF favoring females, three were multiple-choice items and the other two were constructed-response items. Thus, all constructed-response items identified as DIF favored females. The comparison of DFIT, LOR, and SIBTEST showed that the percentage of items flagged as DIF by LOR and SIBTEST was greater than by DFIT. Five items (13.9%) showed DIF in all three methods, and fourteen items (38.9%) displayed DIF in at least one of the three methods. The direction of DIF was consistent across methods: nine items favored males and five favored females, so more items favored males than females. The agreement statistics among DFIT, LOR, and SIBTEST were .75 to .94, while the Kappa values among them were .36 to .85. This indicates that agreement could be low once agreement by chance was eliminated. Based on the Kappa values, the agreement between LOR and SIBTEST was the highest, whereas the agreement between DFIT and SIBTEST was the lowest. The correlation analysis indicated that the DIF indices from the three methods were highly correlated: the Pearson correlation coefficients were .846 to .947, and the Spearman rank correlation coefficients were .935 to .975. The Pearson correlation coefficient between DFIT and LOR was the highest, and the coefficient between DFIT and SIBTEST was the lowest.
The Spearman rank correlation coefficients showed that the three methods were highly correlated with one another, all above .9. Among them, the coefficient between LOR and SIBTEST was the highest, at .975.
Second, the DBF analysis by DFIT indicated that the bundles in the content areas of letters and formulas and of figures favored males. The bundles in the behavior areas of understanding and reasoning also favored males. However, LOR detected no DBF in either the content or the behavior areas, and under SIBTEST only the area of reasoning functioned differently, in favor of males. Compared with DFIT, LOR and SIBTEST detected more DIF items in favor of females, and within subareas the effects of items favoring males and items favoring females cancelled out, so differential functioning did not appear at the bundle level. The analysis therefore confirmed DIF cancellation, in which significant DIF is identified at the item level but is cancelled at the bundle level.
Third, the DTF analysis by DFIT and LOR indicated that the test functioned differently. That is, the NAEA 10th-grade mathematics achievement test functioned differently for males and females at the test level as well as at the item level. According to the results, the test favored males. The presence of DTF means that test scores can be affected by factors other than the examinees' ability, and that these factors can affect decisions based on the test scores.
Finally, item characteristics were analyzed to explore the causes of DIF. The ten items identified as DIF by at least two of the three methods (DFIT, LOR, and SIBTEST) were selected for analysis and were examined by content area, behavior area, and item format. In the analysis of content areas, four items in the area of figures and three items in the area of letters and formulas displayed DIF. Four of the nine items (44.4%) in the content area of figures showed DIF, and all of them functioned in favor of males.
These findings were consistent with previous research showing that figures and geometry favor males. With respect to behavior areas, four items were in the area of problem-solving, four items were in the area of understanding, and the other two items were in the area of reasoning. Notably, half of the eight items in the area of problem-solving (50.0%) were identified as DIF, and three of these four items functioned in favor of males. Even though the problem-solving area did not display statistically significant DBF, it favored males to a greater degree than the other behavior areas in every method. In the item format analysis, eight of the ten items were multiple-choice items (26.7% of the thirty multiple-choice items), while two were constructed-response items (33.3% of the six constructed-response items). Of the eight multiple-choice items identified as DIF, six favored males and two favored females. Both constructed-response items favored females. This was consistent with previous research indicating that male high school students performed better than females on multiple-choice items. Therefore, the characteristics of the items identified as DIF in the mathematics achievement test were consistent with the results of previous research. This finding suggests that item characteristics such as content area, behavior area, and item format can be causes of gender DIF. However, the causes of DIF are complex. DIF is affected by various item characteristics and testing circumstances, so it is hard to conclude that DIF is caused by one specific reason. The effects of content areas, behavior areas, and item formats found in this study are intertwined. In addition, both item characteristics and examinee characteristics can be causes of DIF. This indicates that analyses at the item level are required to explore the causes of DIF. This study makes clear that studies of DIF, DBF, and DTF are required to establish test fairness and validity.
These studies should be performed to determine whether DIF is present and what its causes are. Items identified as DIF should be reviewed and revised to make them fair. Test content and structure should also be decided carefully, because according to the results of this study content areas, behavior areas, and item formats can be causes of DIF. Exploring the causes of DIF can help improve curricula, teaching, and learning. Therefore, empirical analyses of DIF and its causes under various conditions should continue to be performed.
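As an illustration of the log-odds ratio idea behind the LOR approach mentioned in the abstract, the following is a minimal sketch of one common estimator, the Mantel-Haenszel common log-odds ratio for a dichotomous item. The abstract does not state which estimator or software the thesis used, so the function name, the input layout, and the counts are assumptions made for illustration only.

```python
import math


def mh_log_odds_ratio(strata):
    """Mantel-Haenszel common log-odds ratio for one dichotomous item.

    strata: list of (ref_right, ref_wrong, foc_right, foc_wrong) counts,
    one tuple per matched ability level (for example, total-score band).
    A positive value means the item favors the reference group after
    matching on ability; a value near 0 suggests no DIF.
    (Hypothetical helper; the layout is an assumption of this sketch.)
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # skip empty score levels
        numerator += ref_right * foc_wrong / n
        denominator += ref_wrong * foc_right / n
    if numerator == 0 or denominator == 0:
        raise ValueError("log-odds ratio is undefined for these counts")
    return math.log(numerator / denominator)


# Made-up counts for two score bands (reference group listed first).
example = [(30, 20, 20, 30), (40, 10, 30, 20)]
print(round(mh_log_odds_ratio(example), 3))  # 0.887, i.e. log(17 / 7)
```

Matching examinees on ability before comparing groups is what separates DIF from a simple difference in average performance: in this sketch the reference group has higher odds of a correct answer within each score band, so the estimate is positive.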