In practice, item response theory (IRT) typically treats item difficulty parameters as fixed effects and person ability parameters as random effects. In theory, however, item difficulties can also be modeled as random effects. Most previous research on differential item functioning (DIF) has treated items as fixed effects. The few studies that have treated items as random effects did not adequately reflect realistic conditions and are difficult to apply in practice: they manipulated only DIF patterns that uniformly favored the focal group, used a low proportion of DIF items (about 25%), and relied on DIF detection methods that require complex parameter-estimation procedures. The aim of this study was therefore to investigate the performance of commonly used DIF detection methods for dichotomous items under random item effects, using the Mantel-Haenszel method (MH; Holland & Thayer, 1988; Mantel & Haenszel, 1959) and the logistic regression method (LR; Swaminathan & Rogers, 1990), and to compare the results with those obtained under fixed item effects. The results showed that the statistical power of DIF detection was similar under both item-effect conditions in most situations. However, when item difficulties were treated as random effects, the Type I error rate deviated from 0.05 more often than under fixed item effects. When the mean item difficulty difference (MIDD) between the two groups was smaller than 0.04, the two item-effect conditions yielded similar Type I error rates and power for conventional one-stage DIF detection. When the MIDD exceeded 0.06, the Type I error rate of one-stage DIF detection was severely inflated under both conditions; however, a scale purification procedure brought the Type I error rate back under control, close to 0.05, in both cases. This study examined DIF detection under random and fixed item effects simultaneously, with a simulation design that reflects realistic conditions more closely than previous research. The findings are expected to facilitate the understanding and interpretation of DIF detection results when item parameters are drawn from a distribution.
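To illustrate the kind of detection statistic the study evaluates, the Mantel-Haenszel chi-square for a single dichotomous item can be sketched as follows. This is a minimal NumPy implementation, not the authors' code: examinees are stratified by a matching total score, a 2x2 (group by correct/incorrect) table is formed in each stratum, and the continuity-corrected MH chi-square is referred to a chi-square distribution with one degree of freedom. The function name and the simulated data are our own.

```python
import numpy as np
from scipy.stats import chi2


def mantel_haenszel_dif(item, total, group):
    """Continuity-corrected Mantel-Haenszel chi-square for one dichotomous item.

    item  : 0/1 responses to the studied item
    total : matching variable (e.g., total test score)
    group : 0 = reference group, 1 = focal group
    Returns (MH chi-square statistic, p-value).
    """
    A = E = V = 0.0
    for k in np.unique(total):
        idx = total == k
        ref = item[idx & (group == 0)]
        foc = item[idx & (group == 1)]
        n_R, n_F = len(ref), len(foc)
        T = n_R + n_F
        if T < 2 or n_R == 0 or n_F == 0:
            continue  # stratum carries no information
        m1 = ref.sum() + foc.sum()          # total correct in this stratum
        m0 = T - m1                         # total incorrect
        A += ref.sum()                      # observed correct in reference group
        E += n_R * m1 / T                   # expected correct under no DIF
        V += n_R * n_F * m1 * m0 / (T ** 2 * (T - 1))
    stat = (abs(A - E) - 0.5) ** 2 / V      # continuity-corrected MH chi-square
    return stat, chi2.sf(stat, df=1)
```

In the scale purification procedure mentioned above, the matching total score would be recomputed after removing items flagged as DIF in a first pass, and the test rerun; that iterative step is what controls the inflated Type I error when the groups differ in mean item difficulty.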