
以試題反應理論為基礎的開展模式之延伸及其在電腦化適性測驗上之應用

Extensions of IRT-Based Unfolding Models and their Application to Computerized Adaptive Testing

Advisors: 陳淑英, 王文中

Abstract


In current social-science research, many studies that use graded-response, agree–disagree attitude tests still select items with factor analysis, item–total correlations, and internal-consistency reliability, and analyze the data with total scores or traditional cumulative IRT models. These practices cause neutral items to be discarded, make respondents located near the neutral point of the latent-trait continuum impossible to estimate accurately, underestimate extremely positive respondents, and overestimate extremely negative ones. The remedy is to analyze and estimate with unfolding models. This dissertation extends and applies IRT-based unfolding models in three studies: an extension to a multidimensional unfolding model, an extension to a random-threshold model, and an application to multidimensional computerized adaptive testing.

Study 1 develops the Confirmatory Multidimensional Generalized Graded Unfolding Model (CMGGUM). Besides representing both between-item and within-item multidimensionality, the new model is confirmatory and can analyze polytomous items. Simulations showed good parameter recovery for the CMGGUM, especially with large samples. An empirical analysis of a tattoo attitude scale showed that, relative to the traditional GGUM, the CMGGUM yielded a log PsBF of 27.2 and a Bayesian chi-square PPP value of .52, indicating good model–data fit. As expected, because the multidimensional approach incorporates the correlations among latent traits into the model, it raised test reliability: by the Spearman–Brown prophecy formula, the unidimensional approach would need the three subscales lengthened by 68%, 79%, and 3%, respectively, to match the multidimensional reliabilities, and this required lengthening quantifies the relative efficiency of the multidimensional approach over the unidimensional one.

When persons interact with item response categories, subjective judgment makes respondents interpret the categories inconsistently. To represent this variability, Study 2 develops the Random-Threshold Generalized Graded Unfolding Model (RTGGUM), in which threshold variance parameters quantify the magnitude of the random variation; when these variances equal 0, the RTGGUM reduces to the GGUM. Simulations showed good parameter recovery when the generating and analysis models matched. When data were generated from the GGUM but analyzed with the RTGGUM, recovery remained good and the estimated threshold variances were close to 0, so fitting the RTGGUM is harmless even when responses are unaffected by subjective judgment. Conversely, when data came from the RTGGUM but were analyzed with the GGUM, RMSE and absolute bias increased, showing that analyzing data with person–item interaction under an inappropriate model biases the estimates. In an empirical analysis of a censorship attitude questionnaire, the DICs of the two RTGGUM models (12179 and 12399) were smaller than those of the two GGUM models (13340 and 13571), and the PsBF index led to the same conclusion, so the RTGGUM fit the data better. The GGUM reliabilities (.906, .886) exceeded the RTGGUM reliabilities (.866, .850) because the GGUM mistakes random-threshold variance for true latent-trait variance. The five estimated random-threshold variances under the RTGGUM differed markedly, growing larger toward the extreme categories, confirming that person–item interaction exists.

Study 3 derives the test information function, item information function, first- and second-order derivatives of the CMGGUM, and the corresponding information and derivative functions of the posterior distribution, and combines the maximum priority index (MPI) item-selection method, a fixed-precision stopping rule, and MAP estimation into an MCAT procedure. Three factors were manipulated: item-selection strategy, stopping rule, and prior distribution. Under a semi-informative prior, the E-optimality criterion produced longer tests and longer testing times and is therefore not recommended; the other three strategies (D-, A-, and T-optimality) performed better. The PSER stopping rule shortened test length and testing time relative to the SE rule, avoiding unnecessary items at only a slight cost in precision; if a test's nature and purpose can tolerate that cost, the PSER rule is recommended. A correct (informative) prior reduced test length and testing time beyond the semi-informative prior while also improving precision and meeting the target standard error, so exploiting prior information in a CAT procedure improves test quality. The similar results across dimensions also show that the MPI method achieves content balance. When the precision target was set at .25, nearly all conditions reached or approached this standard error, indicating that the manipulated methods and settings are effective and that the CMGGUM can be successfully applied in MCAT settings.

Keywords: item response theory, attitude test, unfolding model, random effect, computerized adaptive testing
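For reference, the base model that all three studies extend is the GGUM of Roberts, Donoghue, and Laughlin (2000); the form below is the standard published one, not a formula taken from this dissertation. For item i with observable categories z = 0, …, C (M = 2C + 1, τ_{i0} = 0), discrimination α_i, location δ_i, and thresholds τ_{ik}:

```latex
P(Z_i = z \mid \theta_j) =
\frac{\exp\!\Big\{\alpha_i\Big[z(\theta_j-\delta_i)-\sum_{k=0}^{z}\tau_{ik}\Big]\Big\}
      +\exp\!\Big\{\alpha_i\Big[(M-z)(\theta_j-\delta_i)-\sum_{k=0}^{z}\tau_{ik}\Big]\Big\}}
     {\sum_{w=0}^{C}\Big(\exp\!\Big\{\alpha_i\Big[w(\theta_j-\delta_i)-\sum_{k=0}^{w}\tau_{ik}\Big]\Big\}
      +\exp\!\Big\{\alpha_i\Big[(M-w)(\theta_j-\delta_i)-\sum_{k=0}^{w}\tau_{ik}\Big]\Big\}\Big)}
```

The two exponentials in the numerator are what make the model "unfold": a graded agreement response can arise from either side of the item's location δ_i, so the response probability peaks when θ_j is near δ_i. The CMGGUM replaces the scalar θ_j with a vector of correlated latent traits, and the RTGGUM lets the τ_{ik} vary randomly over persons.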

Abstract (English)


To date, many studies in the social sciences use graded-response, agree–disagree attitude tests and adopt traditional item-analysis methods such as factor analysis, item–total correlations, and internal-consistency reliability to select items, or use total scores or cumulative IRT models for data analysis. These procedures may result in neutral items being deleted, underestimation of extremely positive persons' locations on the latent-trait continuum, and overestimation of extremely negative persons'. A better alternative is to perform data analysis with unfolding IRT models. However, the unfolding IRT models in the literature are restricted to unidimensional or exploratory models. This dissertation, comprising three studies, explores extensions and applications of unfolding IRT models. Study 1 extends the Generalized Graded Unfolding Model (GGUM) to a confirmatory, multidimensional version, the CMGGUM, which can be fitted to within-item and between-item multidimensional tests of polytomous items. Simulations showed that the parameters of the new model were recovered fairly well with the R package R2WinBUGS and the WinBUGS program. The Tattoo Attitude Questionnaire, with three subscales, was analyzed to demonstrate the advantages of the new model over the unidimensional model: better model–data fit, higher test reliability, and a stronger correlation between latent traits. Study 2 extends the GGUM to the Random Threshold Generalized Graded Unfolding Model (RTGGUM) by treating the threshold parameters as random effects rather than fixed effects. The RTGGUM accounts for the randomness in subjective judgment when subjects respond to graded-response items in an attitude test. Simulations were conducted to evaluate the parameter recovery of the new model and the consequences of ignoring the randomness in thresholds.
The results showed that the parameters of the RTGGUM were recovered fairly well and that ignoring the randomness in thresholds led to biased estimates. The censorship data were analyzed to demonstrate that the RTGGUM had better model–data fit than the GGUM according to DIC and PsBF values. The GGUM overestimated test reliability because individual differences in subjective judgment were ignored. The variances of the five random thresholds under the RTGGUM were 5.14, 1.82, 0.32, 1.80, and 3.87, respectively, suggesting that individual differences in preference for the rating labels were substantial and should not be ignored. Study 3 implemented computerized adaptive testing algorithms based on the CMGGUM. The Fisher information and the corresponding posterior distribution were derived. Simulations evaluated the performance of the algorithms, including maximum a posteriori (MAP) ability estimation, the maximum priority index (MPI) and D-, A-, T-, and E-optimality criteria for item selection, and a fixed-precision stopping rule. Three independent variables were manipulated: four selection strategies, two stopping rules, and two prior distributions. Results showed that all four criteria achieved the predetermined precision level of .25. Under the semi-informative prior with the SE stopping rule, the E-optimality criterion needed, on average, an additional 12 items and 14 seconds to reach the same precision as the other criteria, so it is not recommended because of its longer test length and testing time. The other three optimality criteria performed similarly. The mean test length was similar across dimensions, indicating that the MPI facilitates content balance. With the informative prior, all four selection criteria had similar bias and MSE. In conclusion, the CMGGUM can be successfully applied to MCAT situations.
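The fixed-precision stopping logic of Study 3 can be illustrated with a minimal, hypothetical sketch. This is not the dissertation's CMGGUM machinery: item information values are frozen rather than recomputed from an interim trait estimate, the function name and item pool are invented for illustration, and in one dimension the greedy maximum-information choice stands in for the D-/A-optimality criteria.

```python
import math

def run_fixed_precision_cat(item_infos, target_se=0.25, max_items=30):
    """Greedy maximum-information CAT loop with a fixed-precision stop.

    item_infos: dict mapping item id -> Fisher information at the interim
    trait estimate. In a real CAT the information would be recomputed after
    every response; it is held fixed here purely for illustration.
    """
    administered = []
    total_info = 0.0
    se = float("inf")
    remaining = dict(item_infos)
    while remaining and len(administered) < max_items:
        # In one dimension, D- and A-optimality both reduce to
        # "administer the most informative remaining item".
        best = max(remaining, key=remaining.get)
        total_info += remaining.pop(best)
        administered.append(best)
        se = 1.0 / math.sqrt(total_info)  # asymptotic SE of the estimate
        if se <= target_se:               # fixed-precision stopping rule
            break
    return administered, se
```

With a pool of items each contributing information 4.0, the loop stops after four items, when the standard error first reaches the .25 target used throughout Study 3.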
Keywords: item response theory, attitude test, unfolding model, random effect, multidimensional computerized adaptive testing
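The Spearman–Brown comparison reported for Study 1 can be reproduced mechanically: solving the prophecy formula for the lengthening factor k gives the relative efficiency of two scoring approaches. The function name and the reliability values in the example are hypothetical; the dissertation's own 68%, 79%, and 3% figures come from its estimated subscale reliabilities, which are not reproduced here.

```python
def lengthening_factor(rho_current, rho_target):
    """Spearman-Brown prophecy formula solved for the lengthening factor k,
    where rho_target = k * rho_current / (1 + (k - 1) * rho_current)."""
    return (rho_target * (1.0 - rho_current)) / (rho_current * (1.0 - rho_target))

# Hypothetical example: raising a subscale reliability from .80 to .85
# requires lengthening the test by a factor of 17/12, i.e. about 42%.
k = lengthening_factor(0.80, 0.85)
```

A factor of 1.0 means no lengthening is needed, which is why the third subscale's 3% figure in Study 1 indicates the unidimensional and multidimensional reliabilities were nearly equal there.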

