Differential item functioning (DIF) assessment is critical for ensuring test validity and fairness. Many DIF assessment methods have been proposed over the past several decades, including the Mantel-Haenszel (MH) method. These methods are applied to DIF assessment for discrete response items; however, no research has addressed DIF assessment for continuous response items. Given the popularity of the MH method in practical applications, its continuous counterpart proposed by Rayner and Best (2012), called the MHC method, is applied to assess DIF for continuous response items in this study. A scale purification (SP) procedure is further incorporated to improve the performance of the MHC method in DIF assessment. According to the simulation results, the MHC method with the SP procedure yields high power rates while controlling type I error rates well. Because the MHC method with the SP procedure can be implemented easily, test practitioners are encouraged to adopt it when assessing DIF in continuous response items to improve test quality.
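To make the pairing of a DIF statistic with scale purification concrete, the following Python sketch shows the general logic under stated assumptions: responses is a persons-by-items array of continuous item scores, group is a 0/1 array of group membership, and the stratified, pooled-z statistic is a generic stand-in rather than Rayner and Best's (2012) exact MHC formulation. All function names and the five-stratum split are illustrative.

import numpy as np
from scipy import stats

def stratified_dif_pvalue(item_scores, rest_scores, group, n_strata=5):
    # Compare reference (group == 0) and focal (group == 1) means on one
    # continuous item within strata of the matching (rest) score, then pool
    # strata with a weighted z combination. This is a generic stand-in for
    # the MHC statistic, not Rayner and Best's (2012) exact formulation.
    edges = np.quantile(rest_scores, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, rest_scores, side="right") - 1,
                     0, n_strata - 1)
    zs, ws = [], []
    for s in range(n_strata):
        ref = item_scores[(strata == s) & (group == 0)]
        foc = item_scores[(strata == s) & (group == 1)]
        if len(ref) > 1 and len(foc) > 1:
            t_stat, _ = stats.ttest_ind(ref, foc, equal_var=False)
            zs.append(t_stat)
            ws.append(len(ref) + len(foc))
    z = np.dot(ws, zs) / np.sqrt(np.dot(ws, ws))
    return 2 * stats.norm.sf(abs(z))

def assess_dif_with_purification(responses, group, alpha=.05, max_iter=10):
    # Scale purification: flag DIF items, drop them from the matching (rest)
    # score, and reassess, repeating until the flagged set stabilises.
    n_items = responses.shape[1]
    flagged = set()
    for _ in range(max_iter):
        new_flags = set()
        for j in range(n_items):
            anchors = [a for a in range(n_items) if a not in flagged and a != j]
            rest = responses[:, anchors].sum(axis=1)
            if stratified_dif_pvalue(responses[:, j], rest, group) < alpha:
                new_flags.add(j)
        if new_flags == flagged:
            break
        flagged = new_flags
    return sorted(flagged)

Purification matters here because items flagged as DIF would otherwise contaminate the matching score used to stratify examinees.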
Differential item functioning (DIF) assessment has been widely applied for decades to ensure test fairness in routine item analysis. However, few studies have investigated, or even noticed, omitted variable bias (OVB) when assessing DIF. As a result, estimates of DIF effects may be biased, leading to inflated type I error rates and/or deflated power rates in DIF assessment. In testing practice, test practitioners may therefore incorrectly identify inequality among grouping variables and revise flagged DIF items based on misleading information. To overcome these problems, two issues were addressed in detail in this study. The first issue is the robustness of the original method (i.e., assessing DIF without considering confounding variables) to OVB, which was examined by evaluating the impact of ignoring OVB in DIF assessment. The second issue is the trade-off between bias and inefficiency that the controlled method (i.e., assessing DIF with all grouping variables included) faces. To address this issue, the backward scale purification (BSP) procedure was applied to the controlled method to improve the performance of DIF assessment. Accordingly, three interrelated studies were conducted. In Study 1, type I error rates of the original and controlled methods in DIF assessment were investigated. The results indicated that the controlled method controlled type I error rates well under all conditions. In contrast, the original method lost control of type I error rates when confounding variables exhibited DIF and the correlation among grouping variables was high (i.e., greater than or equal to .2). In Study 2, type II error rates of the controlled method were investigated. Compared with the true model, the type II error rates of the controlled method increased as the number of confounding variables decreased and the correlation among grouping variables increased. This result reflects the trade-off between bias and inefficiency incurred when additional variables are added to the model. In Study 3, BSP was applied to the controlled method to reduce the type II error rates. The results indicated that BSP can effectively control type I error rates while maintaining acceptable power rates. In summary, the controlled method with BSP appears promising for helping test practitioners deal with OVB in DIF assessment, thereby ensuring fairness and validity in testing practice.
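The omitted variable bias at issue can be illustrated with a small simulated example. The Python sketch below is hypothetical: the grouping variables (gender and language), the linear regression DIF model, and all effect sizes are assumptions chosen only to show how omitting a correlated confounder that exhibits DIF biases the studied group effect; it does not reproduce the models used in the three studies.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
gender = rng.binomial(1, .5, n)                            # studied grouping variable
language = (rng.random(n) < .3 + .4 * gender).astype(int)  # confounder correlated with gender
theta = rng.normal(0, 1, n)                                # latent trait
item = theta + .5 * language + rng.normal(0, 1, n)         # DIF on the confounder only
rest = theta + rng.normal(0, .5, n)                        # proxy for the matching (rest) score

# "Original" method: the confounding grouping variable is omitted
original = sm.OLS(item, sm.add_constant(np.column_stack([rest, gender]))).fit()
# "Controlled" method: all grouping variables are included
controlled = sm.OLS(item, sm.add_constant(np.column_stack([rest, gender, language]))).fit()

# The gender effect is spuriously nonzero in the original model (a false DIF
# flag) but shrinks toward zero once language is controlled.
print("original   gender effect:", original.params[2], "p =", original.pvalues[2])
print("controlled gender effect:", controlled.params[2], "p =", controlled.pvalues[2])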
Conventional differential item functioning (DIF) assessment methods tend to yield inflated type I error rates and deflated power rates when a test contains many DIF items that favor the same group. To control type I error rates in DIF assessment under such conditions, the DIF-free-then-DIF (DFTD) strategy has been proposed. The DFTD strategy consists of two steps: (1) selecting a set of items that are most likely to be DIF-free, and (2) assessing DIF for the remaining items using the selected items as anchors. To explore the variables that influence the performance of the DFTD strategy in assessing DIF, a series of simulation studies was conducted. Three multiple indicators, multiple causes (MIMIC) methods, namely the standard MIMIC method (M-ST), the MIMIC method with scale purification (M-SP), and the iterative MIMIC method (M-IT), were used to select four items as an anchor set before implementing the DFTD strategy. The results of the analysis of variance showed significant differences among M-IT, M-SP, and M-ST in identifying DIF-free items, with M-IT performing better than M-SP, and M-SP performing better than M-ST. The analysis also found that the main effects of DIF patterns, DIF percentages, sample sizes, and item response theory (IRT) models, as well as their interactions, were significant with respect to accuracy in identifying DIF-free items. Based on these results, the M-SP and M-IT methods are recommended for identifying DIF-free items, especially when a test contains many DIF items. The same set of variables significantly influenced the power rates of these methods in assessing DIF. However, the type I error rates in DIF assessment were significantly influenced by the DIF patterns, DIF percentages, and sample sizes. Based on the results of this study, it is recommended that sample sizes of 500 per reference and focal group (R500/F500) be used and that the data fit the two-parameter logistic model (2PLM) when applying the DFTD strategy with the MIMIC method to assess DIF.
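The two-step logic of the DFTD strategy can be sketched in Python as follows. The snippet substitutes an ordinary logistic-regression DIF check for the MIMIC models (which would normally be fitted with SEM software), and the function names, the persons-by-items 0/1 data layout, and the four-item anchor set are illustrative assumptions rather than the study's actual implementation.

import numpy as np
import statsmodels.api as sm

def dif_pvalue(item, matching, group):
    # p-value for the group effect on a dichotomous item conditional on a
    # matching score (logistic-regression DIF as a stand-in for MIMIC).
    X = sm.add_constant(np.column_stack([matching, group]))
    return sm.Logit(item, X).fit(disp=0).pvalues[2]

def dftd(responses, group, n_anchors=4, alpha=.05):
    # Step 1: a preliminary scan matched on the rest score; the n_anchors
    # items with the least DIF evidence (largest p-values) become the anchor
    # set. Step 2: reassess every other item using only the anchor score.
    n_items = responses.shape[1]
    scan = [dif_pvalue(responses[:, j],
                       responses.sum(axis=1) - responses[:, j], group)
            for j in range(n_items)]
    anchors = sorted(np.argsort(scan)[-n_anchors:].tolist())
    anchor_score = responses[:, anchors].sum(axis=1)
    flagged = [j for j in range(n_items)
               if j not in anchors
               and dif_pvalue(responses[:, j], anchor_score, group) < alpha]
    return anchors, flagged

In the study itself, step 1 is carried out with the M-ST, M-SP, or M-IT MIMIC methods; the preliminary scan above merely stands in for that anchor-selection step.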