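Effects of Classification Targets and Item Selection Constraints on the Performance of Computerized Classification Testing under High-Order Item Response Theory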

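This thesis applies high-order item response theory (HIRT) to computerized classification testing (CCT) and examines how classification targets, Fisher information item selection methods, the number of cut points, and maximum test length affect the performance of HIRT-based CCT (HIRT-CCT), in order to offer recommendations for future implementations of such tests. The study used the three-parameter HIRT as the measurement model and classified examinees with the ability confidence interval (ACI) method combined with an estimate-based (EB) item selection strategy, which selects items at the provisional ability estimate. Under this setup it compared three classification targets (classifying on the second-order latent ability, classifying on the first-order latent abilities, and classifying on both the first- and second-order latent abilities simultaneously), three Fisher information (FI) item selection methods (maximizing information for the second-order latent ability, FI2; maximizing information for the first-order latent abilities, FI1; and maximizing information for both simultaneously, FI1+2), two numbers of cut points (one and two), and four maximum test lengths (15, 30, 60, and 90 items with one cut point; 30, 60, 90, and 120 items with two cut points). It further examined how adding item selection constraints (item exposure control and content balancing) affects the classification results. The dependent variables were classification accuracy, average test length, maximum item exposure rate, pool usage rate, and content balance (the percentage of items selected from each content area). The results show that, among the three classification targets, targeting the first-order latent abilities produced results similar to targeting both the first- and second-order latent abilities, whereas targeting the second-order latent ability produced results that differed from the other two. In addition, classification accuracy for both the first- and second-order abilities improved as maximum test length increased. Regarding the three FI item selection methods, in terms of classification accuracy, FI2 did not effectively raise classification accuracy for the second-order ability. FI1 raised classification accuracy for the first-order ability with the lowest factor loading (accuracy for the ability with the second-highest loading stayed the same or rose) but lowered it for the first-order ability with the highest factor loading. As maximum test length increased, however, the three FI methods converged to the same classification accuracy. In terms of the percentage of forced classifications and average test length, FI1 yielded the smallest values when the target was the first-order latent abilities or both the first- and second-order latent abilities, and FI2 yielded the smallest values when the target was the second-order latent ability. In terms of content balance, FI1+2 distributed the selected items evenly across content areas, FI2 tended to select more items from the item pool with the highest factor loading, and FI1 tended to select more items from the item pool with the lowest factor loading. These differences in the percentage of items selected per content area shrank as maximum test length increased. Furthermore, as maximum test length increased, the percentage of forced classifications decreased, while classification accuracy, average test length, and pool usage rate increased. As the number of cut points increased, classification accuracy decreased, while the percentage of forced classifications, average test length, and pool usage rate increased. Regarding item selection constraints, adding item exposure control effectively limited the maximum item exposure rate but slightly lowered classification accuracy, slightly raised the percentage of forced classifications and average test length, and substantially raised pool usage rate. Adding content balancing produced an even content balance but removed the differences among the three FI item selection methods. Overall, for all three classification targets, HIRT-CCT performed best with a maximum test length of 30 and the FI1+2 method when there was one cut point, and with a maximum test length of 60 and the FI1+2 method when there were two cut points. Moreover, adding item exposure control and content balancing kept item exposure rates within the specified range, distributed the selected items evenly across content areas, and effectively raised pool usage rate, with little effect on the performance of HIRT-CCT.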
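The combination of maximum-information item selection at the provisional estimate and the ACI stopping rule described above can be illustrated with a minimal sketch. It makes a deliberate simplification: it uses a single unidimensional 3PL trait rather than the hierarchical first- and second-order structure of HIRT, and it assumes one cut point and a 95% interval (z = 1.96). All function names and the example item parameters are hypothetical and are not taken from the thesis.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Fisher information of a single 3PL item at ability theta."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(theta_hat, pool, used):
    """Estimate-based selection: unused item with maximum information
    at the provisional ability estimate theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: info_3pl(theta_hat, *pool[i]))

def aci_decision(theta_hat, se, cut, z=1.96):
    """ACI rule for one cut point: classify once the confidence
    interval around theta_hat no longer contains the cut score."""
    lower, upper = theta_hat - z * se, theta_hat + z * se
    if lower > cut:
        return "above"
    if upper < cut:
        return "below"
    return "continue"  # keep testing, or force a decision at maximum length

# Hypothetical (a, b, c) item parameters, for illustration only.
pool = [(1.0, -2.0, 0.2), (1.5, 0.0, 0.2), (0.8, 2.0, 0.2)]
next_item = select_item(0.0, pool, used=set())
decision = aci_decision(theta_hat=0.2, se=0.3, cut=0.0)
```

When the rule still returns "continue" at the maximum test length, the examinee has to be classified anyway, which corresponds to the forced classifications reported in the results.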

Keywords

high-order item response theory; computerized classification test; classification target; item selection constraints

Parallel Abstract

This study implemented high-order item response theory (HIRT) in computerized classification testing (CCT) and investigated the influence of target classification traits, Fisher information (FI) item selection methods, number of cut points, maximum test length, and item selection constraints on the efficiency of HIRT-CCT. The three-parameter logistic HIRT model (3PLM-HIRT) was used as the test model, and the ability confidence interval with estimate-based item selection was used as the classification method. Five independent variables were manipulated: (a) target classification trait: the second-order latent trait, the first-order latent traits, or both the second-order and first-order latent traits; (b) FI item selection method: maximizing FI at the second-order latent trait (FI2), maximizing FI at the first-order latent traits (FI1), or maximizing FI at both the second-order and first-order latent traits (FI1+2); (c) number of cut points: 1 or 2; (d) maximum test length: 15, 30, 60, or 90 items with one cut point and 30, 60, 90, or 120 items with two cut points; (e) item selection constraints: no item exposure or content balancing control, item exposure control only, content balancing control only, or item exposure plus content balancing control. Five major dependent variables were included: (a) classification accuracy, (b) average test length, (c) maximum item exposure rate, (d) pool usage rate, and (e) content balancing (the percentage of items selected from each content area). The main results are summarized as follows:
1. For the three target classification traits, there was little difference between targeting the first-order latent traits and targeting both the second-order and first-order latent traits. In addition, classification accuracy increased as maximum test length increased.
2. For the three FI item selection methods, in terms of classification accuracy, FI2 had little effect on the classification accuracy of the second-order latent trait. FI1 increased the classification accuracy of the first-order latent trait with the lowest factor loading (accuracy for the trait with the second-highest factor loading was unchanged or increased) but decreased the accuracy of the trait with the highest factor loading. However, the three methods became similar as maximum test length increased. In terms of the percentage of forced classifications and average test length, FI1 yielded the lowest percentage of forced classifications and the shortest average test length when targeting the first-order latent traits or both the second-order and first-order latent traits, and FI2 yielded the lowest values when targeting the second-order latent trait. In terms of content balancing, item selection was close to even across content areas with FI1+2, whereas FI2 selected more items from the item pool with the highest factor loading and FI1 selected more from the pool with the lowest factor loading. However, the differences in the percentage of items selected across content areas decreased as maximum test length increased.
3. For the four maximum test lengths, the percentage of forced classifications decreased, while classification accuracy, average test length, and pool usage rate increased, as maximum test length increased.
4. For the two numbers of cut points, classification accuracy decreased, while the percentage of forced classifications, average test length, and pool usage rate increased, as the number of cut points increased.
5. For item selection constraints, item exposure control kept the item exposure rate under control, but it slightly decreased classification accuracy, slightly increased the percentage of forced classifications and average test length, and substantially increased pool usage rate. Content balancing control maintained an even content balance, but it removed the differences among the three FI methods.
In sum, for all three target classification traits, HIRT-CCT performed best with one cut point when the maximum test length was set to 30 and FI1+2 was used; with two cut points, HIRT-CCT performed best when the maximum test length was set to 60 and FI1+2 was used. Moreover, imposing item exposure and content balancing controls on HIRT-CCT not only controlled the item exposure rate and maintained an even content balance but also improved pool usage rate, with little effect on the efficiency of HIRT-CCT.
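Three of the dependent variables reported above (maximum item exposure rate, pool usage rate, and average test length) can be computed from simulated administrations roughly as follows. This is a sketch under assumed definitions: an item's exposure rate is taken to be the proportion of examinees who received it, and pool usage rate is taken to be the proportion of pool items administered at least once. The abstract does not state its exact formulas, and the function name and example data are hypothetical.

```python
from collections import Counter

def exposure_summary(administered, pool_size):
    """Summarize item usage across examinees.

    administered: one list of item indices per examinee.
    pool_size: total number of items in the pool.
    Returns (max_exposure_rate, pool_usage_rate, average_test_length).
    """
    n_examinees = len(administered)
    # Count each item at most once per examinee.
    counts = Counter(i for test in administered for i in set(test))
    rates = {i: n / n_examinees for i, n in counts.items()}
    max_exposure = max(rates.values()) if rates else 0.0
    pool_usage = len(counts) / pool_size
    avg_length = sum(len(test) for test in administered) / n_examinees
    return max_exposure, pool_usage, avg_length

# Hypothetical records for three examinees drawn from a 10-item pool.
records = [[0, 1, 2], [0, 3], [0, 1]]
max_exposure, pool_usage, avg_length = exposure_summary(records, pool_size=10)
```

In this toy example item 0 is given to every examinee, so the maximum exposure rate is 1.0 while only four of the ten items are ever used. That is the kind of imbalance item exposure control is meant to reduce.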