
Graduate Student: 吳俞瑩 (WU, YU-YING)
Thesis Title: 應用潛在特質模型以驗證試題競試之英語成就測驗
(An Application of the Latent Trait Model to Validating the English Achievement Tests of a Campaign)
Advisor: 曾文鐽 (Tseng, Wen-Ta)
Degree: Master's
Department: Department of English (英語學系)
Year of Publication: 2010
Graduation Academic Year: 98 (2009–2010)
Language: English
Number of Pages: 148
Keywords (Chinese): 語言測驗、潛在特質模型、成就測驗
Keywords (English): language testing, the Latent Trait Model, achievement test
Type of Document: Academic thesis
    In the research literature in Taiwan, studies on school-based testing are scarce, even though the Ministry of Education has long emphasized language testing and has subsidized county governments to hold English test-item contests for junior high school teachers. In these contests, the quality of the achievement tests is typically certified by experts' judgment, yet the literature indicates that the robustness of such expert-based content validity is not yet fully understood. This study therefore applied the Latent Trait Model to analyze two tests from the Kinmen English test-item contest (the winning test and a non-winning compared test), validating item quality by analyzing and comparing examinees' responses to the two papers. The poor items detected were then examined through content analysis to locate possible causes of degraded item quality. A total of 241 ninth graders at a junior high school in Taoyuan, with a male-to-female ratio of roughly one to one, participated in the study. The results showed that the subjectively-scored polytomous items in both tests exhibited good model fit. As for the objectively-scored dichotomous items, however, the winning test did not have a lower percentage of misfitting items and manifested more additional significant dimensions than the compared test. Moreover, the differential item functioning (DIF) analysis of gender bias and the local dependence analysis showed that, proportionally, the winning test did not contain fewer poor items than the compared test. Overall, and surprisingly, the winning test was not superior to the compared test. This study therefore supports collecting more types of validity evidence for a more comprehensive evaluation of test papers. Nevertheless, the content analysis revealed that the compared test contained several obvious linguistic errors and that the item types of the winning test were more appealing and innovative. The study thus argues that expert judgment and statistical analysis are both indispensable for a thorough evaluation of test quality. Finally, according to the content analysis, possible sources of poor quality include test designers' failure to recognize the characteristics of each item type, to make good use of two-way tables of test specifications, to choose topics prudently, and to use testlets deliberately. Based on the findings, suggestions are offered to the organizers of English test-item contests and to junior high school English teachers.
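
    For orientation, the Latent Trait Model named above reduces, in its simplest dichotomous (Rasch) form, to the standard textbook formulation below; this equation is supplied for reference and is not reproduced from the thesis:

    P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{e^{\theta_n - b_i}}{1 + e^{\theta_n - b_i}}

    where \theta_n is the ability of person n and b_i the difficulty of item i, both expressed on a common logit scale; an item becomes suspect when observed responses depart systematically from the probabilities this model predicts.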

    In Taiwan, little research has been done on school-based testing even though the Ministry of Education values language testing and has subsidized county-level test-item contests at the junior high school level. In these contests, the evidence for the quality of the achievement tests is often based on experts' judgment; however, the robustness of such content validity is not yet well established. This study therefore aimed to validate the tests of one such contest by analyzing and comparing one winning test and one competing test from the Kinmen Contest in light of the empirical evidence collected with the Latent Trait Model. A qualitative content analysis was then conducted to locate possible sources accounting for the poor items detected. Two hundred forty-one ninth graders, roughly half male and half female, at one junior high school in Taoyuan participated in the study. The results revealed that all the subjectively-scored polytomous items in both tests fit the model well. As for the objectively-scored dichotomous items, however, the winning test did not have a lower percentage of misfitting items and manifested more additional significant dimensions than the compared test; likewise, both the differential item functioning (DIF) analysis of gender bias and the local dependence analysis showed that the winning test did not have a lower percentage of poor items than the compared one. Overall, and surprisingly, the winning test did not outperform the compared test. This study therefore supports the position that more types of validity evidence are needed to evaluate tests comprehensively. The content analysis, however, indicated that the compared test contained several significant linguistic errors and that the items in the winning test were more intriguing and innovative. Consequently, the study contends that evaluating test quality well requires both expert judgment and statistical analysis. Finally, the possible sources of the poor items included test designers' failure to recognize the characteristics of item types, to make use of test specifications, to select topics prudently, and to remain aware of the effects of testlets. Suggestions based on the findings are provided for contest organizers and junior high school English teachers.
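
    The misfit detection described in the abstract is conventionally based on the infit and outfit mean-square statistics of Rasch analysis; the standard definitions below are given for orientation and are not quoted from the thesis:

    z_{ni} = \frac{x_{ni} - E[x_{ni}]}{\sqrt{W_{ni}}}, \qquad \mathrm{Outfit}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2}, \qquad \mathrm{Infit}_i = \frac{\sum_{n=1}^{N} W_{ni} z_{ni}^{2}}{\sum_{n=1}^{N} W_{ni}}

    Here x_{ni} is the observed response of person n to item i, E[x_{ni}] its expectation under the model, and W_{ni} its model variance; both statistics have an expected value of 1.0, and items falling well outside roughly 0.5–1.5 are commonly flagged as misfitting.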

    TABLE OF CONTENTS

    ABSTRACT (CHINESE)
    ABSTRACT (ENGLISH)
    ACKNOWLEDGEMENTS
    TABLE OF CONTENTS
    LIST OF TABLES
    CHAPTER ONE: INTRODUCTION
        Background and Motivation
        Research Questions
        Significance of the Study
        Organization of the Thesis
    CHAPTER TWO: LITERATURE REVIEW
        General Issues About Language Testing
        Teaching and Testing
        Important Criteria for a Good Language Test
        Validity
        Reliability
        Test Bias
        A Socio-cognitive Framework for Validating Tests
        Issues About Language Test Development
        Language Ability
        Test Types
        Common Test Item Types
        Binary Item
        Multiple-choice Question
        Matching
        Editing
        Gap-filling
        Cued Vocabulary Spelling
        Sentence Transformation
        Sentence Translation
        Measurement Theories in Language Assessment
        Item Analysis
        Classical Test Theory (CTT)
        Latent Trait Theory (LTT)
        Multi-facet Rasch Measurement (MFRM)
    CHAPTER THREE: METHOD
        Materials
        The Pilot Test
        Trial Results
        Pilot Test Two
        Participants and Procedures
        Trial Results
        Formal Test Administration
        Participants and Data Collection Procedures
        Scoring and Coding
        Subjectively-scored Polytomous Items
        Objectively-scored Dichotomous Items
        Data Analyses
    CHAPTER FOUR: RESULTS
        Objectively-scored Dichotomous Items
        The Revision of the Mark Schemes
        Person Reliability and Item Reliability
        Rasch Construct Validity
        Fit Analysis
        Local Dependence Analysis
        DIF Analysis
        Subjectively-scored Polytomous Items
        All Facet Vertical Ruler
        Person Reliability
        Rater Reliability
        Task Reliability
    CHAPTER FIVE: DISCUSSION
        Comparison of the Two Tests
        Sources of the Poor Items
        Items with Unacceptable Fit Statistics
        Local Dependent Items
        DIF Items
    CHAPTER SIX: CONCLUSION
        Summary of Major Findings
        Implications
        Limitations of the Study
        Directions for Future Research
    REFERENCES
    Appendix A: The Target Test (TT)
    Appendix B: Answers of the Target Test (TT)
    Appendix C: The Compared Test (CT)
    Appendix D: Answers of the Compared Test (CT)
    Appendix E: Table of TT Fit Statistics
    Appendix F: Table of CT Fit Statistics
    Appendix G: DIF Table of the Target Test (TT)
    Appendix H: DIF Table of the Compared Test (CT)

