
The development and validation of a rating scale for definition essays: A data-based approach

定義型文章評分量表的發展與其效度考驗研究

Advisor: 張寶玉

Abstract


To date, no rating scale has been created specifically for definition writing. The holistic rating scales used in large-scale standardized tests, such as those of the TOEFL iBT, are either designed for general use, as with the independent writing scale, or tied to a particular task type, as with the integrated writing scale. Brindley (1994) criticized such scales as too general to be applied to a specific task and context. This project therefore used a data-based approach to create a scale for definition writing. The development of the scale largely followed the procedures of Knoch (2007), whose study described in detail how a data-based scale can be constructed and whose scale was shown to be valid and reliable. In this study, the rating criteria for the definition-essay scale were first chosen on the basis of a number of models of writing performance. From these models, six traits were selected: accuracy, fluency, complexity (syntactic and lexical), coherence, cohesion, and content. Then, 268 samples were selected from a pool of 1,365 short definition essays and analyzed with discourse measures covering the six traits. The results of these discourse measures were subjected to statistical analyses, which identified the measures that could effectively distinguish essays at different performance levels. The scale was finally written on the basis of these discriminating measures and featured two characteristics. First, the scale descriptors were made specific to the definition writing task. Second, the scale criteria were prioritized to reflect their importance in the writing task and divided into three steps so that the criteria could be judged separately.

Once the scale was created, it was tested in a validation study. The purpose was to collect evidence for the validity of the scale by comparing it with the TOEFL scale, a scale often considered generalizable to many writing tasks, and to determine which of the two was more suitable for definition writing tasks like the one investigated here. Four raters used both scales on the same batch of 65 essays, then completed a questionnaire and took part in semi-structured interviews about their experience with the scales. The ratings from the two scales were statistically analyzed to estimate inter-rater reliability, and the raters' questionnaire responses and interview comments were used to investigate the validity of the scales.

The analysis of rating consistency indicated that both scales led to similar inter-rater reliability estimates, neither of which was desirably high. Further statistical analyses showed that the raters operated the rating scales with different levels of scoring severity. A follow-up interview with the raters revealed several factors behind the inconsistent ratings, largely attributable to insufficient rater training and an inappropriate design of the scoring procedures.
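The abstract does not name the exact estimators behind these consistency and severity analyses (studies in this tradition, including Knoch (2007), often use correlation-based reliability indices and multi-faceted Rasch measurement for rater severity). As a minimal sketch of the two ideas, assuming an invented 65-essay-by-4-rater score matrix, the following Python fragment uses the mean pairwise Pearson correlation as a rough inter-rater consistency index and per-rater mean scores as a crude severity comparison:

    # Hypothetical sketch: the thesis does not specify these estimators.
    import itertools
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # Invented data: 65 essays scored by 4 raters on a 1-5 scale,
    # with built-in per-rater harshness offsets for illustration.
    true_quality = rng.normal(3.0, 1.0, size=65)
    severity = np.array([0.0, -0.4, 0.3, 0.6])
    noise = rng.normal(0.0, 0.7, size=(65, 4))
    scores = np.clip(np.round(true_quality[:, None] - severity + noise), 1, 5)

    # Inter-rater consistency: mean pairwise Pearson correlation.
    pairs = itertools.combinations(range(4), 2)
    r_values = [stats.pearsonr(scores[:, i], scores[:, j])[0] for i, j in pairs]
    print(f"mean pairwise r = {np.mean(r_values):.2f}")

    # Crude severity check: a harsher rater shows a lower mean score.
    for rater in range(scores.shape[1]):
        print(f"rater {rater + 1}: mean score = {scores[:, rater].mean():.2f}")

A fuller analysis would instead fit a many-facet Rasch model (e.g., with FACETS), which places essay ability and rater severity on a common scale rather than comparing raw means.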
Even though the new scale did not achieve high rating consistency, the questionnaire and interview results showed that the raters perceived it positively because it could (1) bring benefits to the test users and the test takers, (2) generate ratings fair to the test takers, (3) adequately represent the writing ability involved in definition writing, (4) connect strongly to the definition writing task, and (5) provide enough information for the raters to discriminate test takers at different levels. The raters therefore considered the definition-writing scale more suitable for definition writing. Although the reliability of the new scale was not satisfactory, this study, following Knoch (2007), confirms that a scale developed from empirical analysis offers stronger evidence of scale validity.
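As a rough illustration of the data-based development stage summarized above, in which discourse measures were screened for their ability to separate performance levels, the sketch below runs a one-way ANOVA per measure over invented data. The thesis does not state which statistical test was used, and the measure names here are hypothetical stand-ins for the accuracy, complexity, and lexical measures it analyzed:

    # Hypothetical sketch: invented measures and data, ANOVA assumed.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # One value per essay for each discourse measure, plus a performance level.
    n = 268
    levels = rng.choice(["low", "mid", "high"], size=n)
    shift = np.where(levels == "high", 0.10, np.where(levels == "mid", 0.05, 0.0))
    measures = {
        "error_free_clause_ratio": rng.normal(0.60, 0.10, n) + shift,  # accuracy
        "mean_length_of_t_unit": rng.normal(14.0, 3.0, n),             # complexity
        "type_token_ratio": rng.normal(0.45, 0.05, n) + shift,         # lexis
    }

    # One-way ANOVA per measure: does it differ across level groups?
    for name, values in measures.items():
        groups = [values[levels == lv] for lv in ("low", "mid", "high")]
        f_stat, p_value = stats.f_oneway(*groups)
        verdict = "discriminates" if p_value < 0.05 else "does not discriminate"
        print(f"{name:24s} F = {f_stat:5.2f}, p = {p_value:.3f} -> {verdict}")

Measures flagged as discriminating would then be the candidates worth encoding into the scale's level descriptors.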

Keywords

Writing assessment; Rating scale

Parallel Abstract


To date, no rating scale has been designed specifically for scoring definition essays. The scales used in many large-scale standardized tests (such as the TOEFL iBT writing tests) are either too general or too tied to the particular task types those tests employ. Brindley (1994) pointed out that such scales are too broad to suit a specific writing task and testing context. This study therefore aimed to build a rating scale for definition essays from authentic language data.

Knoch (2007) documented in detail the steps needed to develop a data-based scale, and the scale she built showed good reliability and validity, so this study followed her research design. The rating criteria were drawn from several theoretical models of writing performance, from which six indicators were selected: fluency, accuracy, syntactic and lexical complexity, cohesion, coherence, and content. First, 268 essays were randomly selected from a pool of 1,365 definition essays and analyzed with discourse analysis measures covering the six indicators. Next, the quantified results were examined statistically, and the statistical results identified which measures could effectively discriminate the essays across levels. Finally, the new scale and its descriptors were written on the basis of these results. The scale has two features: first, its descriptors are written for the characteristics of definition essays; second, its criteria are ranked by importance and scored independently of one another.

Once the scale was built, a validation study compared its performance with that of the TOEFL independent writing scale and explored which scale is more suitable for scoring definition essays. Four raters used the two scales to score 65 essays and then, based on the scoring process and experience, completed a questionnaire and took part in interviews. The scores were statistically analyzed to estimate inter-rater reliability; in addition, the raters' questionnaire responses and interview comments were analyzed to examine the validity of the two scales.

The consistency analysis showed that the new scale and the TOEFL scale reached similar, but not very high, inter-rater reliability. Further statistical analysis found that the four raters applied the two scales with different degrees of severity. The follow-up interviews indeed revealed many factors behind the insufficient reliability, most of which could be attributed to inadequate rater training and poorly designed scoring procedures. Although the new scale did not reach high reliability, the raters gave it positive feedback in the questionnaire and interviews. In their view, the new scale could (1) benefit test users and test takers, (2) let raters score fairly, (3) adequately represent the ability involved in definition writing, (4) relate more closely to the definition writing task, and (5) provide enough information to help raters distinguish essays across levels. The raters therefore considered the new scale more suitable for scoring definition essays.

Although the reliability of this definition-writing scale was not ideal, the raters still endorsed it. Following Knoch (2007), this study demonstrates once again that a rating scale built from authentic data can show better validity for its target writing task.

Parallel Keywords

Writing assessment; Rating scale

References


Bachman, L., & Palmer, A. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford: Oxford University Press.
Bardovi-Harlig, K., & Bofman, T. (1989). Attainment of syntactic and morphological accuracy by advanced language learners. Studies in Second Language Acquisition, 11, 17-34.
Beers, S. F., & Nagy, W. E. (2009). Syntactic complexity as a predictor of adolescent writing quality: Which measures? Which genre? Reading and Writing, 22, 185-200.
Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25(4), 587-603.
Center for Advanced Research on Language Acquisition (CARLA). (n.d.). Types of rubrics: Primary trait and multiple trait. Retrieved August 28, 2010, from http://www.carla.umn.edu/assessment/VAC/Evaluation/rubrics/types/traitRubrics.html
