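A Study on Applying Readability Prediction to the Classification of Elementary and Junior High School Chinese Language Textbooks and Excellent Extracurricular Reading Materials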

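Readability is the degree to which reading material can be understood by its readers: the more readable a text, the more easily readers understand it. A text's readability depends on many factors, such as its length, the difficulty of its characters and words, its syntactic structure, and whether its content matches the reader's prior knowledge; shallow linguistic features, however, cannot capture these complex components. Building on previous research, this thesis examines several kinds of features in greater depth, including syntactic analysis, part-of-speech (POS) tagging, word embedding, semantic information, and well-written (writing quality) features, and analyzes how each type of feature relates to readability level. The experimental data consist of two parts. The first is Chinese language textbooks for elementary and junior high schools, drawn from the officially approved textbooks for grades 1 to 9 (18 volumes in total) published by Taiwan's three major publishers in academic year 98 of the ROC calendar (2009). The second is excellent extracurricular reading material, drawn from books selected over the years for the Ministry of Culture's "Excellent Extracurricular Reading Materials for Elementary and Junior High School Students" (中小學生優良課外讀物). This thesis builds readability models with two methods, stepwise regression and support vector machines, and compares their performance; finally, the two are combined to improve prediction accuracy. Experimental results show that the proposed readability features yield a significant performance improvement over the traditionally used shallow features in the task of assessing text difficulty.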

Keywords

Readability; Textual Features; Stepwise Regression; Support Vector Machine

Parallel Abstract

Readability concerns how well readers comprehend a given text: the higher the readability of a document, the more easily it can be understood. It is affected by various factors, such as document length, word difficulty, sentence structure, and whether the content of a document matches the reader's prior knowledge. However, simple surface linguistic features cannot always account for these factors appropriately. To address this, we explore in this study a variety of additional features, including syntactic analysis, parts of speech, word embedding, semantic role features, and well-written features. The experimental datasets are composed of two parts: one is Chinese language textbooks for elementary and junior high schools (grades 1 to 9) in Taiwan, compiled from three publishers in the 2009 academic year; the other is excellent extracurricular reading materials for elementary and junior high school students, selected by the Ministry of Culture in Taiwan. Two readability prediction models, stepwise regression and support vector machine, are evaluated and compared, and the combination of the two models is also investigated to further improve the accuracy of readability prediction. Experimental results show that the proposed approach consistently outperforms traditional approaches that rely only on simple surface linguistic features in evaluating text difficulty.

Parallel Keywords

Readability; Textual Features; Stepwise Regression; Support Vector Machine

