A Study of the Reliability, Validity, and Vertical Equating of the Chinese Reading Test

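This study examines the reliability and validity of the Chinese reading test at its four levels, Basic (基礎級), Intermediate (進階級), Advanced (高階級), and Fluent (流利級), and links the item difficulties of all four levels onto a common scale. The sample consists of examinee response data from the operational administrations of May and November 2011 and from the 2012 pretest, analyzed with classical test theory and item response theory. The results show the following. (1) The reading test is highly reliable: the KR-20 reliability coefficient of each level approaches or exceeds 0.90, the reliability values converted from the IRT standard errors of estimation all reach 0.90 or above, and at the ability values of the passing thresholds each test provides relatively high test information and a relatively low standard error of estimation. (2) The reading test has construct validity: factor analysis at each level extracts a single reading-comprehension factor that explains at least 66.91% of the variance, and at each level at least 87.5% of the items fit the model. (3) Item difficulties at the four levels are well distributed. (4) When the Intermediate and Advanced tests are split in half and merged into a single test, the test information and standard error of estimation at the passing thresholds are comparable to those of the original Intermediate test and slightly worse than those of the original Advanced test. Merging these two levels into one test should therefore be feasible in practice, although the proportions of item difficulty will need further adjustment when test forms are assembled.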

Keywords

Test of Chinese as a Foreign Language; reliability; validity; item response theory; vertical equating
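The KR-20 coefficient reported in the abstract can be illustrated with a minimal sketch. It assumes dichotomously scored items (1 = correct, 0 = incorrect) and uses the population variance of the total scores; the function name `kr20` and the tiny response matrix are hypothetical and are not taken from the study's data.

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson 20 for a persons-by-items matrix of 0/1 scores."""
    x = np.asarray(responses, dtype=float)
    n_items = x.shape[1]
    p = x.mean(axis=0)                      # proportion correct per item
    item_var_sum = np.sum(p * (1.0 - p))    # sum of item variances p*q
    total_var = x.sum(axis=1).var()         # population variance of total scores
    return n_items / (n_items - 1) * (1.0 - item_var_sum / total_var)

# Hypothetical data: 4 examinees, 3 items (illustration only).
example = [[1, 1, 1],
           [1, 1, 0],
           [1, 0, 0],
           [0, 0, 0]]
print(kr20(example))
```

With real test data the matrix would have one row per examinee and one column per reading item; values near or above 0.90, as the abstract reports, indicate high internal consistency.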

Parallel Abstract

The purpose of this study is to investigate the reliability, validity, and vertical equating of the Reading subtest of the Test of Chinese as a Foreign Language. The reading section comprises four levels: Levels 2, 3, 4, and 5. The data were sampled from the operational administrations of the test in 2011 and from the pretest in 2012. The results showed, first, that the Kuder-Richardson 20 coefficients were close to or higher than .90. Moreover, test information was high at the cutoff values that determine whether an examinee passes or fails; in other words, the standard error of estimation was low for examinees near the cutoffs. Second, factor analysis extracted a single factor, which accounted for more than 66% of the variance. In addition, Rasch analysis revealed that more than 87.5% of the items fit the model well. Third, each level of the test covers a suitable range of item difficulties. Finally, when the items in Levels 3 and 4 were split in half and combined into composite tests (i.e., the even-numbered items of Levels 3 and 4, the odd-numbered items of Levels 3 and 4, and all items of Levels 3 and 4), the test information and standard error of estimation at the cutoff values were comparable to those of the original Level 3 test and slightly worse than those of the original Level 4 test. These two adjacent levels could therefore be combined into a single composite level in the future to reduce the burden on examinees and test developers. However, the item difficulty distribution of the composite test would need to be adjusted.

Parallel Keywords

Mandarin test; reliability; validity; item response theory; vertical equating
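The link between test information, the standard error of estimation at a cutoff, and an IRT-based reliability can be sketched as follows. This is a minimal illustration under the Rasch model, not the authors' procedure: the abstract does not state which conversion from standard errors to reliability was used, so the separation-style formula below is an assumption, and the function names, item difficulties, and cutoff are hypothetical.

```python
import numpy as np

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + np.exp(-(theta - np.asarray(b, dtype=float))))

def rasch_information(theta, b):
    """Test information at ability theta: sum of P * (1 - P) over items."""
    p = rasch_prob(theta, b)
    return float(np.sum(p * (1.0 - p)))

def standard_error(theta, b):
    """Standard error of estimation: 1 / sqrt(test information)."""
    return 1.0 / np.sqrt(rasch_information(theta, b))

def separation_reliability(theta_hat, se):
    """Assumed conversion: (observed variance - mean error variance) / observed variance."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    se = np.asarray(se, dtype=float)
    observed_var = theta_hat.var(ddof=1)
    error_var = np.mean(se ** 2)
    return (observed_var - error_var) / observed_var

# Hypothetical item difficulties (logits) and passing cutoff, for illustration only.
difficulties = np.linspace(-2.0, 2.0, 9)
cutoff = 0.5
print(rasch_information(cutoff, difficulties), standard_error(cutoff, difficulties))
```

Because information peaks where item difficulties sit close to the ability being measured, a composite of two adjacent levels keeps its precision at both cutoffs only if enough items are located near each of them, which is why the abstract notes that the difficulty proportions would need adjusting.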

