
Graduate Student: 吳佳儒
Thesis Title (Chinese): 電腦化適性預試對試題難度估計精準度之影響
Thesis Title (English): Influences of Computerized Adaptive Pretest on Estimation Precision of Difficulty Parameters in Small Sample Size Pretest
Advisor: 陳柏熹
Degree: Master
Department: Department of Educational Psychology and Counseling (教育心理與輔導學系)
Year of Publication: 2010
Graduation Academic Year: 98 (ROC calendar)
Language: Chinese
Number of Pages: 113
Chinese Keywords: 小樣本、電腦化適性測驗、電腦化適性預試、試題參數估計
English Keywords: small sample size, computerized adaptive test, computerized adaptive pretest, item parameter estimation
Document Type: Academic thesis
    The purpose of this study is to investigate how an adaptive item-selection strategy affects parameter estimation in small-sample settings. The precision of item parameters is an important issue because many testing applications, such as computerized adaptive testing (CAT), are built on accurately estimated item parameters. In general, item parameters can only be estimated precisely from a large-sample pretest, a goal that is often difficult to achieve in practice. This study proposes a computerized adaptive pretest (CAPT) design that administers to each examinee items whose difficulty matches his or her ability, thereby improving the precision of item parameter estimation.
    The research consists of two sub-studies. Study 1 examines how examinee samples with different ability distributions affect item parameter estimation, in order to offer suggestions on examinee selection under the CAPT design. Study 2 has two parts. The first part proposes the CAPT design and compares parameter estimation across different pretest designs; the second part examines how the correlation between subjective difficulty and true difficulty affects parameter estimation, and manipulates conditions such as test length, the number of anchor items, and the difficulty distribution of the anchor items, to provide reference results for CAPT calibration under different settings.
    The results of Study 1 show that overall item parameter estimation is similar for examinees drawn from normal, uniform, and multi-group distributions, but that normally distributed examinees yield more precise estimates for items of medium difficulty and less precise estimates for easy and difficult items, whereas uniform and multi-group samples yield more consistent precision across items of different difficulty. The results of Study 2 show that parameters obtained under the CAPT design are more precise than those obtained under the NEAT design. The lower the correlation between subjective difficulty and true difficulty, the worse the parameter estimation; estimation is better when the test is longer or when there are fewer anchor items; different anchor-item difficulty distributions have little effect on parameter estimation.
    Overall, CAPT can improve the precision of item parameter estimation in small-sample situations. Provided that the correlation between subjective difficulty and true difficulty is kept at a moderate level or higher, this study offers useful guidance for test developers who are unable to conduct large-sample pretests.

    The goal of this research is to investigate the influence of an adaptive item-selection method on the accuracy of pretest item calibration. The success of applications of computerized adaptive testing (CAT) depends on the accuracy with which each individual item's parameters are estimated. Typically, a large calibration sample is recommended when item parameters are calibrated in a pretest, in order to reduce estimation error; however, this standard may be difficult to reach in reality. This paper proposes a computerized adaptive pretest (CAPT) method for determining the optimum items for each examinee to take in the pretest, thereby improving the accuracy of item calibration.
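    As a rough illustration of the adaptive selection idea behind CAPT, the Python sketch below assigns each examinee the unused pretest item whose subjective difficulty is closest to a provisional ability estimate under a Rasch model. The function names, the grid-search ability update, the starting value, and the "closest subjective difficulty" rule are illustrative assumptions, not the exact procedure used in the thesis.

```python
# Illustrative sketch only: a Rasch-based CAPT item-selection loop.
import numpy as np

rng = np.random.default_rng(0)

def rasch_p(theta, b):
    # Probability of a correct response under the Rasch (1PL) model.
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def grid_mle_theta(responses, difficulties, grid=np.linspace(-4, 4, 161)):
    # Crude grid-search maximum-likelihood estimate of ability.
    ll = np.zeros_like(grid)
    for u, b in zip(responses, difficulties):
        p = rasch_p(grid, b)
        ll += u * np.log(p) + (1 - u) * np.log(1.0 - p)
    return grid[np.argmax(ll)]

def capt_session(true_theta, subjective_b, true_b, test_length=20):
    # Adaptively administer pretest items: always pick the unused item whose
    # *subjective* difficulty is closest to the current ability estimate.
    unused = list(range(len(subjective_b)))
    administered, responses = [], []
    theta_hat = 0.0  # start at the assumed population mean
    for _ in range(test_length):
        k = min(unused, key=lambda j: abs(subjective_b[j] - theta_hat))
        unused.remove(k)
        responses.append(int(rng.random() < rasch_p(true_theta, true_b[k])))
        administered.append(k)
        theta_hat = grid_mle_theta(responses, subjective_b[administered])
    return administered, responses

# Toy usage: 60 pretest items whose subjective difficulty is a noisy version
# of the true difficulty (the noise level is an assumption).
true_b = rng.normal(0.0, 1.0, size=60)
subjective_b = true_b + rng.normal(0.0, 0.5, size=60)
items_given, answers = capt_session(0.8, subjective_b, true_b)
```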
    The research is composed of two studies. In Study 1, three kinds of ability distribution (normal, uniform, and multi-group) were generated to examine their influence on pretest item calibration. Study 2 is composed of two parts. Part one examines the difference in item estimation precision between the CAPT and NEAT designs. Part two mainly examines item estimation precision under three levels of correlation between subjective difficulty and true difficulty. In addition, part two examines variables that might influence item estimation precision under the CAPT design, such as test length, the number of anchor items, and the difficulty distribution of the anchor items.
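    For concreteness, the following minimal sketch shows one way the three examinee ability distributions compared in Study 1 could be generated for a simulation; the sample size, range, and group means are assumed values, not the settings reported in the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # assumed small calibration sample size

theta_normal = rng.normal(0.0, 1.0, n)              # normal: N(0, 1)
theta_uniform = rng.uniform(-3.0, 3.0, n)           # uniform over a wide ability range
group_means = rng.choice([-1.5, 0.0, 1.5], size=n)  # three latent ability groups
theta_multigroup = rng.normal(group_means, 0.5)     # multi-group: mixture of normals
```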
    The results of Study 1 suggest that the overall precision of item estimation is similar across normal, uniform, and multi-group distributions of examinees. The distributions differ in pattern: with normally distributed examinees, estimation is more precise for items of average difficulty but less accurate for easy and hard items, whereas uniform and multi-group samples yield similar accuracy across all items. The results of Study 2 suggest that the CAPT design performs better than the NEAT design in small-sample situations. With respect to the correlation between subjective difficulty and true difficulty, the higher the correlation, the more precise the item estimation. Furthermore, item estimation is more precise when the test is longer and when there are fewer anchor items. The difficulty distribution of the anchor items has little effect on the precision of item estimation.
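    How "precision of item estimation" is quantified is not spelled out in the abstract; in calibration-recovery simulations of this kind it is commonly summarized by the bias and root mean square error (RMSE) of the recovered difficulties over R replications, for example:

\[
\mathrm{Bias}(\hat{b}_j) = \frac{1}{R}\sum_{r=1}^{R}\left(\hat{b}_{jr} - b_j\right),
\qquad
\mathrm{RMSE}(\hat{b}_j) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{b}_{jr} - b_j\right)^{2}},
\]

    where \(b_j\) is the true difficulty of item \(j\) and \(\hat{b}_{jr}\) is its estimate in replication \(r\). These are standard indices in item-recovery studies; the thesis may report these or closely related criteria.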
    Generally speaking, this study sheds some light on future applications of pretest design for test users who cannot acquire a large sample to estimate item parameters, as long as the correlation between subjective difficulty and true difficulty is at a moderate level or higher.

    Abstract (Chinese) i
    Abstract (English) ii
    Table of Contents iv
    List of Tables v
    List of Figures vi
    Chapter 1: Introduction 1
        Section 1: Research Motivation 1
        Section 2: Research Purposes 4
        Section 3: Research Questions 4
        Section 4: Definition of Terms 5
    Chapter 2: Literature Review 6
        Section 1: Item Response Theory 6
        Section 2: Parameter Estimation Methods in IRT 10
        Section 3: Precision of IRT Parameter Estimation 19
    Chapter 3: Method 29
        Section 1: Research Framework 29
        Section 2: Research Design 30
        Section 3: Research Procedure 37
        Section 4: Research Instruments 44
    Chapter 4: Results and Conclusions 45
        Section 1: Results of Study 1 45
        Section 2: Results of Study 2 49
    Chapter 5: Conclusions and Suggestions 69
        Section 1: Conclusions and Suggestions 69
        Section 2: Limitations and Directions for Future Research 72
    References 75
    Appendix A: Simulated Data for Study 1 82
    Appendix B: Simulated Data for Study 2 86
    Appendix C: Response-Generation Program for CAPT 97
    Appendix D: Response-Generation Program for NEAT 110

    陳柏熹 (2006)。能力估計方法對多向度電腦化適性測驗測量精準度的影響 [The influence of ability estimation methods on the measurement precision of multidimensional computerized adaptive testing]。教育心理學報 [Bulletin of Educational Psychology], 38(2), 93-210。
    謝清麟 (2001)。台灣中風病人工具性日常生活活動量表之建構(2/3) [Construction of an instrumental activities of daily living scale for stroke patients in Taiwan (2/3)]。行政院國家科學委員會專題研究成果報告 [National Science Council research project report] (報告編號 NSC 89-2314-B-002-534),未出版 [unpublished]。
    Ackerman, T. A. (1989). Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items. Applied Psychological Measurement, 13, 113-127.
    Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.
    Anderson, L. W., Krathwohl, D. R., Airasian, P. W., Cruikshank, K. A., Mayer, R. E., Pintrich, P. R., Raths, J., & Wittrock, M. C. (2001). A taxonomy for learning, teaching, and assessing: a revision of Bloom’s taxonomy of educational objectives. New York: Longman.
    Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
    Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Marcel Dekker.
    Ban, J. C., Hanson, B. H., Wang, T., Yi, Q., & Harris, D. J. (2001). A comparative study of on-line pretest item calibration-scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38(3), 191-212.
    Ban, J. C., Hanson, B. H., Wang, T., Yi, Q., & Harris, D. J. (2002). Data sparseness and on-line pretest item calibration-scaling methods in CAT. Journal of Educational Measurement, 39(3), 207-218.
    Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
    Bloom, B. S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R. (Eds.). (1964). Taxonomy of educational objectives: The classification of educational goals. Handbook II: Affective domain. New York: David McKay.
    Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
    Bock, R. D., & Aitken, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
    Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
    Embretson, S. E. & Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum Publishers.
    Guion, R. M., & Ironson, G. H. (1983). Latent trait theory for organizational research. Organizational Behavior and Human Performance, 31, 54-87.
    Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: Macmillan.
    Hambleton, R. K., & Jones, R. W. (1994). Comparison of empirical and judgmental procedures for detecting differential item functioning. Educational Research Quarterly, 18, 21-36.
    Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.
    Hambleton, R. K., Jones, R. W., & Rogers, H. J. (1993). Influence of item parameter estimation errors in test development. Journal of Educational Measurement, 30, 143-155.
    Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
    Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577–601.
    Hulin, C.L., Lissak, R.I., & Drasgow, F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6, 249-260.
    Kendall, M., & Stuart, A. (1979). The advanced theory of statistics (Vol. 2, 4th ed.). London: Griffin; New York: Macmillan.
    Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22, 197-206.
    Klein, L. W., & Kolen, M. J. (1985). Effect of number of common items in common-item equating with non-random groups. Paper presented at the annual meeting of the American Educational Research Association, Chicago.
    Kolen, M. J., & Brennan, R. J. (1995). Test equating: Methods and practices. New York: Springer-Verlag.
    Kramlinger, K., & Mayo Clinic. (2001). Mayo Clinic on depression: Answers to help you understand, recognize and manage depression. Mayo Clinic Press.
    Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7:4, 328.
    Livingston, S. A. (2004). Equating Test Scores. Princeton, NJ: ETS.
    Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
    Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
    McKinley, R. L., & Reckase, M. D. (1982). The use of the general Rasch model with multidimensional item response data (Research Report ONR 82-1). Iowa City IA: American College Testing.
    Mislevy, R. J. & Bock, R. D. (1989). PC-BILOG 3: Item analysis and test scoring with binary logistic models. Mooresville, IN: Scientific Software.
    Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
    Orlando, M., & Marshall, G. N. (2002). Differential item functioning in a Spanish translation of the PTSD checklist: Detection and evaluation of impact. Psychological Assessment, 14(1), 50-59.
    Parshall, C. G., Spray, J.A., Kalohn, J. J., & Davey, T. (2002). Practical considerations in computer-based testing. New York: Springer-Verlag.
    Petersen, N. S., Kolen, M. J., & Hoover, H.D. (1993). Scaling, Norming, and Equating. In R.L. Linn (Ed.), Educational Measurement (3rd ed., pp. 221-262). New York: Macmillan.
    Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute for Educational Research.
    Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, No. 17.
    Seong, T. J. (1990). Sensitivity of marginal maximum likelihood estimation of item and ability parameters to the characteristics of the prior ability distributions. Applied Psychological Measurement, 14, 299-311.
    Smith, L. L., & Reise, S. P. (1998). Gender differences on negative affectivity: An IRT study of differential item functioning on the multidimensional personality questionnaire stress reaction scale. Journal of Personality and Social Psychology, 75(5), 1350–1362.
    Stocking, M. L. (1988). Some considerations in maintaining adaptive test item pools (Research Report RR-88-33). Princeton, NJ: Educational Testing Service.
    Stocking, M. L. (1990). Specifying optimum examinees for item parameter estimation in item response theory. Psychometrika, 55, 461-475.
    Swaminathan, H., & Gifford, J. A. (1983). Estimation of parameters in the three-parameter latent trait model. In D. J. Weiss (Ed.), New horizons in testing (pp. 9-30). New York: Academic Press.
    Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47, 397-412.
    Urry, V. W. (1977). Tailored testing: A successful application of item response theory. Journal of Educational Measurement, 14, 181-196.
    Vale, C. D. (1986). Linking item parameters onto a common scale. Applied Psychological Measurement, 10(4), 333-344.
    van der Linden, W. J. & Glas, C. (Eds.). (2000). Computer adaptive testing: Theory and practice. Boston, MA: Kluwer Academic Publishers.
    von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer.
    Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12, 339-368.
    Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., Mislevy, R. J., Steinberg, L., & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
    Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.
    Weiss, D. J., & McBride, J. R. (1984). Bias and information of Bayesian adaptive testing. Applied Psychological Measurement, 8(3), 273-285.
    Weiss, D.J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.
    Wingersky, M. S., & Lord, F. M. (1984). An investigation of methods for reducing sampling error in certain IRT procedures. Applied Psychological Measurement, 8, 347-364.
    Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.
    Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
    Wright B. D., & Tennant A. (1996). Sample Size Again. Rasch Measurement Transactions, 9:4, 468.
    Wu, M. L., Adams, R. J., & Wilson, M. (1998). ACER ConQuest user guide. Hawthorn, Australia: ACER Press.
    Yen, W. M., & Fitzpatrick, A. R. (2006). Item Response Theory. In R. L. Brennan (Ed.), Educational Measurement (4th Ed.). Westport, CT: American Council on Education and Praeger Publishers.
