
The Hierarchical Structure Item Response Model and Its Application to Computerized Adaptive Testing

Advisors: Dr. Po-Hsi Chen (陳柏熹) and Dr. Wen-Chung Wang (王文中)

Abstract


This study develops an item response model with a hierarchical structure of latent variables, termed the "hierarchical structure item response model" (HSIRM), applies it to computerized adaptive testing (CAT), and examines its effectiveness. The dissertation comprises three simulation studies. The first study used the Markov chain Monte Carlo (MCMC) method of Bayesian statistics to estimate the model parameters and examine model-data fit; the results showed that the fit indices developed in this study, together with the Bayesian DIC, are suitable for diagnosing the fit between model and data, and that the Bayesian estimation method provides good model parameter recovery. The second study developed CAT algorithms for the hierarchical structure item response model; the item selection and ability estimation procedures derived by modifying the CAT algorithm of the testlet model showed the best ability estimation performance. The third study modified the traditional maximum information item selection method by adding a random component early in the test to control the error of early ability estimates; the results showed that the new item selection methods raised item pool usage and lowered item exposure rates and the mean test overlap rate, supporting the claim that the new selection methods can balance item pool security and measurement precision. Finally, the author offers several suggestions for future research and practical applications.
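
To make the hierarchical structure concrete, below is a minimal simulation sketch of the kind of data-generating model the abstract describes: each domain ability loads on a single overall ability, and items follow a 2PL model within their domain. All names, loadings, and dimensions here are hypothetical illustrations, not the dissertation's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_domains, items_per_domain = 1000, 4, 10

# Hypothetical higher-order structure: each domain ability is a loading on
# one overall ability plus a domain-specific error term.
theta_overall = rng.normal(0.0, 1.0, n_persons)
loadings = np.full(n_domains, 0.8)                        # assumed equal loadings
theta_domain = (loadings * theta_overall[:, None]
                + rng.normal(0.0, np.sqrt(1 - loadings**2), (n_persons, n_domains)))

# 2PL response model within each domain: P(X=1) = 1 / (1 + exp(-a(theta_d - b))).
a = rng.uniform(0.8, 2.0, (n_domains, items_per_domain))  # discriminations
b = rng.normal(0.0, 1.0, (n_domains, items_per_domain))   # difficulties
prob = 1 / (1 + np.exp(-a * (theta_domain[:, :, None] - b)))
responses = (rng.random(prob.shape) < prob).astype(int)   # persons x domains x items
```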

Parallel Abstract


This study aims to construct IRT models with a higher-order latent trait structure within a multidimensional IRT framework and to implement these models in a CAT context with various modern item selection rules, assessing their effectiveness through simulation studies. Sheng and Wikle (2008) proposed Bayesian multidimensional IRT models with a hierarchical structure (referred to here as the hierarchical structure item response model, HSIRM) and conducted several simulation studies to support their assumptions. However, certain questionable features of their simulation design made their findings unclear and left many questions unanswered. Unlike the original models of Sheng and Wikle (2008), the HSIRM builds its latent trait structure on factor analysis rather than principal components analysis. Because the original study is questionable, and because novel IRT models must be shown to be stable and reliable before they are implemented in a CAT environment, it is important to revise the proposed models and assess their estimation efficiency. Consequently, three separate studies on the HSIRM were conducted. The first study examined the Bayesian estimation method and Bayesian model checking techniques, and then checked model parameter recovery. The second study derived CAT algorithms for the HSIRM and evaluated the accuracy of overall and domain ability estimates under a variety of conditions. Finally, modern item selection methods were incorporated into the HSIRM-based CAT to better control item exposure and overlap rates.

In the first study, simulations assessed the effectiveness of Bayesian model checking techniques, including the posterior predictive model checking (PPMC) method, the pseudo-Bayes factor (PsBF) approach, and the Bayesian DIC, and then evaluated model parameter recovery through comparison with the true model. Data sets were generated from a UIRT model, a MIRT model with identical latent traits (MIRT-I), and HSIRMs with a high ability correlation (HSIRM-H) and a low ability correlation (HSIRM-L), each in 1PL and 2PL form. The analytic models were the 1PL- and 2PL-HSIRMs. Five indicators were incorporated into the PPMC procedure: the SD of the biserial correlations (Bis), the Bayesian chi-square test (BChi), the reproduced correlation matrix test (Rcor), the observed-score covariance between subtests (Cov), and the identical latent trait correlation test (Id).

The results suggest that, when implementing PPMC with the HSIRM, it is advisable to fit the data to the 1PL-HSIRM first and screen out the ill-fitting data sets generated from the UIRT, MIRT-I, and MIRT-S models according to the well-performing criteria, because the PPMC method works better under the 1PL-HSIRM fit. As for the relative model fit criteria, only the DIC consistently selected the correct model; the PsBF always preferred the simplest model regardless of which was the true model. With respect to model parameter recovery, most estimators were unbiased, suggesting that Bayesian methods such as the MCMC procedure can estimate the HSIRM's model parameters precisely.
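
As a reading aid, here is a minimal sketch of the PPMC logic described above: compute a discrepancy measure on the observed data, recompute it on data replicated from each posterior draw, and report the posterior predictive p-value. The function names and the subtest-covariance discrepancy are illustrative stand-ins, assuming posterior draws and a data simulator are already available.

```python
import numpy as np

def ppmc_pvalue(observed, posterior_draws, simulate, discrepancy):
    """Posterior predictive p-value: proportion of draws whose replicated
    data show a discrepancy at least as large as the observed data's.
    Values near 0 or 1 flag model-data misfit; values near .5 suggest fit."""
    d_obs = discrepancy(observed)
    exceed = sum(discrepancy(simulate(draw)) >= d_obs for draw in posterior_draws)
    return exceed / len(posterior_draws)

# Illustrative discrepancy: observed-score covariance between two subtests,
# in the spirit of the Cov indicator named in the abstract.
def subtest_cov(data, split=10):
    return np.cov(data[:, :split].sum(axis=1), data[:, split:].sum(axis=1))[0, 1]
```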
Finally, as a one-stage approach, the HSIRM was the most accurate for overall ability estimation in comparison with two-stage approaches such as the two-stage CFA and averaging procedures. A major advantage of the one-stage method over the two-stage methods is that the HSIRM provides a standard error of measurement for each examinee's overall ability directly from the standard deviation of the posterior distribution, whereas only the two-stage CFA approach offers even an approximate standard error, obtained through an indirect formula transformation. More importantly, neither the two-stage CFA nor the averaging approach matched the test design structure used as the standard in this study; that is, the way a test is designed should dictate the way the corresponding data are analyzed.

In the second study, three HSIRM-based CAT algorithms were proposed: a multidimensional CAT, a unidimensional CAT, and an HSIRM-CAT approach. The two-stage methods, UCAT with CFA and UCAT with averaging, served as baselines for comparison with the one-stage methods. The results showed that, except for the unidimensional CAT approach, the one-stage methods always produced more accurate estimates of both overall and domain abilities than the two-stage methods, suggesting that the multidimensional CAT and HSIRM-CAT approaches are reliable enough to administer in a CAT context. Neglecting the random effects of the subtests made it difficult for the unidimensional CAT approach to estimate overall and domain abilities precisely, especially under the diverse factor loading setting and the 2PL-HSIRM condition. Of the two, the HSIRM-CAT approach is recommended because its CAT is based on the same HSIRM used to generate the item responses. A further advantage of the HSIRM-CAT approach is that it yields standard errors of measurement for overall and domain ability estimates simultaneously after each adaptively administered item, so that a fixed-precision stopping rule can be implemented if necessary, whereas the multidimensional CAT approach does not.

In the last study, the progressive (PG; Barrada, Olea, Ponsoda, & Abad, 2008; Revuelta & Ponsoda, 1998) and proportional (PP; Barrada et al., 2008; Segall, 2004a) methods were incorporated into the HSIRM-based CAT procedures to improve item pool security and measurement precision simultaneously, in comparison with the point Fisher information (PFI) method. In addition, the Sympson and Hetter online freeze (SHOF; Chen, 2004, 2005) procedure and content balancing controls were implemented. The results showed that the PG and PP methods reduced the item exposure rate and improved item pool usage, and the effect grew as the acceleration parameter increased. However, the item exposure rate of each item could not be guaranteed to stay below a pre-specified level unless the SHOF was implemented. As the acceleration parameter increased, the item overlap rate decreased for both the PG and PP methods, but the overall RMSE did not always increase. When the PG method improved measurement precision through a smaller acceleration parameter, the difference in overall RMSE between the PFI and PG methods became much smaller.
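
The following sketch shows the shape of the progressive (PG) rule discussed above, under one common parameterization: the selection criterion blends a random component with Fisher information, with the information weight growing over the test and governed by an acceleration exponent. Details differ across Revuelta and Ponsoda (1998) and Barrada et al. (2008); this is an illustration, not the dissertation's exact algorithm.

```python
import numpy as np

def progressive_select(info, k, test_length, accel=1.0, rng=np.random.default_rng()):
    """Progressive item selection (sketch). `info` holds Fisher information at
    the current ability estimate for each still-available item; `k` is the
    1-based serial position of the item about to be selected. The weight on
    information grows from 0 (first item, purely random) to 1 (last item,
    purely maximum information); a larger `accel` keeps the random component
    dominant for longer, lowering exposure and overlap."""
    w = ((k - 1) / (test_length - 1)) ** accel
    random_part = rng.uniform(0.0, info.max(), size=info.shape)
    return int(np.argmax((1 - w) * random_part + w * info))
```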
In sum, the HSIRM-CAT approach with both the PG and SHOF procedures can improve item bank security with little or no loss in measurement precision and can provide test information throughout the CAT, as evidenced by overall RMSEs equivalent to those of the PFI method alongside a lower mean test overlap rate. Finally, study limitations are noted and suggestions for future investigations are proposed.
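
Finally, a minimal sketch of the bookkeeping that SHOF-style online exposure control implies: during the live CAT, items whose running exposure rate has reached the ceiling are withheld from selection. This compresses Chen's (2004, 2005) procedure to its core idea; the class and parameter names are hypothetical.

```python
import numpy as np

class OnlineFreeze:
    """Running exposure control in the spirit of the Sympson-Hetter online
    freeze (SHOF): an item is frozen while its observed exposure rate,
    administrations / examinees so far, is at or above the ceiling r_max."""

    def __init__(self, n_items, r_max=0.2):
        self.counts = np.zeros(n_items)
        self.n_examinees = 0
        self.r_max = r_max

    def available(self):
        """Boolean mask of items currently eligible for selection."""
        if self.n_examinees == 0:
            return np.ones(self.counts.shape, dtype=bool)
        return self.counts / self.n_examinees < self.r_max

    def record(self, administered):
        """Update exposure counts after one examinee finishes the CAT."""
        self.n_examinees += 1
        self.counts[np.asarray(administered)] += 1
```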


Cited By


尤淑菁 (2011). Learning effectiveness of technology-integrated instruction and computerized adaptive diagnostic testing: the case of "linear inequalities in one variable" [Master's thesis, Asia University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0118-1511201215465416
