Examining the Validity of the Angoff Standard Setting Method Using the Many-Facet Rasch Model

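In recent years, standard setting methods have flourished in educational practice, and the modified Angoff method is the most widely used among them. The Angoff method assumes that trained judges can, based on item difficulty, accurately estimate the probability that a minimally competent examinee who just meets the preset standard will answer each item correctly. Because standard setting relies on subjective judgments, it is important to find suitable tools for assuring the quality of judges' ratings. The many-facet Rasch model (MFRM) has been widely used in subjective rating contexts, and in standard setting procedures in particular, to test whether adverse rater effects have arisen during rating and compromised rating quality. The MFRM, however, rests on the basic assumption that no group-level effects exist among raters. Because most studies have no relatively objective item difficulty data, beyond the ratings themselves, against which to check this assumption, it has rarely been examined. When the Angoff method is used, external item response data can be obtained in addition to the judges' assessments of item difficulty and of whether examinees can meet the preset standard. This study therefore uses the external item response data and the judge data obtained in an Angoff standard setting to cross-validate the basic assumption of the MFRM. It then uses the MFRM to examine the three assumptions of the Angoff method and the fit between the rating data and the model. In the Angoff procedure, 18 English as a foreign language (EFL) professionals served as judges, linking a 40-item English reading test and a 40-item English listening test to the B1 level of the Common European Framework of Reference. To detect adverse rater effects, the study used the indices provided by the MFRM to identify three rater effects that commonly arise in rating: leniency/severity, inaccuracy, and centrality/extremism. The probability estimates from the Angoff procedure served as an internal frame of reference, and the item difficulty estimates from the test administration served as an external frame of reference. First, MFRM indices were used to detect rater effects in both frames, and the standard setting results in the two frames were compared. Second, raw scores and MFRM indices were used to test the basic assumptions of the Angoff method. The main findings are as follows. (1) Comparing the standard settings in the two frames, the results for judges' leniency/severity, inaccuracy, and centrality/extremism were inconsistent. These discrepancies raise doubts about relying on the Angoff method alone to set cut scores. The test of the group-effect assumption also found a group-level centrality effect within the internal frame of reference, which indicates that the presence of group-level rater effects must be checked before the MFRM is used. (2) As for the assumptions of the Angoff method, the BPS and item functioning assumptions were violated. The more serious failure was that nearly all judges were unable to use probabilities to assess the ability of the minimally competent examinee.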

Keywords

standard setting; Angoff method; many-facet Rasch model; rater effects; Common European Framework of Reference; rating quality

Parallel Abstract

Introduction: The use of standards-based scores in education has grown in recent years and the modified Angoff standard setting method is perhaps the most widely used procedure for establishing these standards. In this method, trained judges imagine students who just meet the standard in question and estimate the likelihood of their responding correctly to each item on the test being aligned to the standard. The method assumes that trained judges can accurately represent students who just meet the standard, represent how test items function and quantify their estimation of the likelihood of student success for each item. All three assumptions have been called into question. More generally, the subjective nature of all standard setting methods has resulted in a focused search for tools to evaluate the quality of judges’ decisions. The many-facet Rasch model (MFRM) has been proposed for use in detecting rater effects generally and for evaluating standard setting results in particular. Use of the MFRM, however, relies on the further assumption that no group-level rater effects exist. Because only internal, judge-generated data is available in most cases, this assumption is usually not evaluated and little research exists on how plausible the assumption is in real settings or on how robust results are to violations of the assumption. As external item response information often is available when the Angoff method is used, an Angoff setting provides a rare opportunity to test this assumption of the MFRM. Thus, the two-fold purpose of this study is to first evaluate the suitability of the many-facet Rasch model using data from an Angoff standard setting, and then to evaluate the assumptions of the Angoff method using the MFRM. 
Method: The data consisted of the first-round estimates of a panel of 18 trained EFL professionals serving as judges in an operational Angoff standard setting linking two 40-item English exams (one reading, one listening) to the Common European Framework of Reference B1 proficiency level, and of the item response data from the original administration of the exams. MFRM indices were identified for the detection of three broad types of rater effects: leniency/severity, inaccuracy and centrality/extremism. These indices include estimated parameters and standard errors, residuals and residual-based indices, separation statistics and correlations between ratings and model indices. The probability estimates made by the Angoff judges were used to construct an ‘internal’ frame of reference, and the item difficulty estimates from the test administration were used to construct an ‘external’ frame of reference. Indices from the many-facet Rasch model were used to examine the subjective ratings of the Angoff judges for the presence of rater effects in both frames and the results were compared. In the second stage of the study, the assumptions of the modified Angoff method were assessed, using raw score and MFRM indices.
Results: In the first stage, results differed across frames for all three rater effects. The leniency/severity indicators suggested greater agreement between judges in the internal frame than in the external frame, although a similar number of judges were flagged (four in both the internal and external frames for reading; two in the internal and three in the external frame for listening). Inaccuracy effects were sharply underestimated within the internal frame of reference: six judges were flagged in the internal frame and nine in the external frame for reading; for the listening test, two and four judges were flagged in the internal and external frames respectively. Results for centrality/extremism differed even more markedly: for the reading test, four judges were flagged for centrality and five for extremism in the internal frame while 17 judges were flagged for centrality in the external frame; for the listening test, 10 judges were flagged for centrality and one judge for extremism in the internal frame while all 18 judges were flagged for centrality in the external frame. Group-level indicators did indicate the presence of group-level centrality and inaccuracy effects within the internal frame of reference, suggesting their possible use in evaluating the assumption of the model prior to use. In terms of the assumptions of the Angoff method, the BPS and item functioning assumptions appear to have been violated to some extent but the most striking failure was the inability of nearly all judges to accurately quantify their assessments using the probability scale. The ‘centrality’ or ‘central tendency’ bias, in particular, was displayed by nearly all judges, compressing the Angoff metric. This compression of the scale appears to have been largely responsible for the distorted results for the MFRM leniency/severity and centrality/extremism indices in the internal frame noted above. Further, this scale compression appears to have distorted the cut scores, leading to differences in pass/fail rates: for the reading test, the pass rates within the internal frame across the three rounds of the standard setting were 46.4%, 37.8% and 37.7%, while the corresponding pass rates in the external frame were 38.1%, 29.0% and 27.2%; for the listening test, the pass rates in the internal frame were 35.4%, 35.4% and 31.5%, compared to 31.0%, 31.0% and 27.1% in the external frame.
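The scale compression described above can be made concrete with a simple raw-score check. The sketch below is not one of the MFRM indices used in the study; it assumes that reference probabilities for a just-qualified examinee are already available for each item (for example, derived from external item difficulty estimates) and regresses one judge's estimates on them. The function name and all values are hypothetical.

```python
def compression_slope(judge_probs, reference_probs):
    """Least-squares slope of a judge's estimates on reference probabilities.

    A slope well below 1 means the judge's estimates vary less than the
    reference values do, i.e. they are pulled toward the middle of the scale.
    """
    n = len(judge_probs)
    if n != len(reference_probs) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(reference_probs) / n
    mean_y = sum(judge_probs) / n
    # Sum of squares of the reference values around their mean.
    sxx = sum((x - mean_x) ** 2 for x in reference_probs)
    if sxx == 0:
        raise ValueError("reference probabilities must vary across items")
    # Cross-product of reference and judge deviations.
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(reference_probs, judge_probs))
    return sxy / sxx


# Hypothetical values for five items (not taken from the study).
reference = [0.10, 0.30, 0.50, 0.70, 0.90]
central_judge = [0.40, 0.45, 0.50, 0.55, 0.60]  # compressed toward 0.5
slope = compression_slope(central_judge, reference)
```

For these toy values the slope is 0.25: the judge orders the items correctly but spreads them over only a quarter of the reference range, which is the pattern a centrality bias would produce.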
Discussion: The critical assumption underlying use of the MFRM for detecting rater effects was found not to hold in the present case, casting doubt on the use of the model in standard setting situations for which only internal data (from the judges’ estimates) is available. More positively, the group-level indicators within the internal frame were found to be sensitive to inaccuracy and centrality effects and thus may serve to help check the suitability of the model for use where no external data is available. The assumptions of the Angoff method were also found to be violated. In particular, a centrality or central tendency bias was shown to persist across all three rounds and to distort results. In view of previous research into central tendency, the present findings are consistent with the possibility that the Angoff method is inherently highly susceptible to the distorting effects of this bias. More generally, the centrality bias seems likely to pose a serious threat in many rating situations, both to the validity of ratings and to the accuracy of indicators used to evaluate these ratings. Future research should focus on refining our understanding of when the MFRM is likely to be appropriate for use; on solutions to problems with the Angoff method (perhaps in the form of procedural modifications or score adjustments); and on what rating situations are likely to be susceptible to the centrality bias and how it might be reduced or eliminated.

Parallel Keywords

standard setting; Angoff method; many-facet Rasch model; rater effects; Common European Framework of Reference; rating quality
