Title

探究使用基於類神經網路之特徵於文本可讀性分類

Translated Titles

Exploring the Use of Neural Network based Features for Text Readability Classification

Authors

曾厚強(Hou-Chiang Tseng);陳柏琳(Berlin Chen);宋曜廷(Yao-Ting Sung)

Key Words

Readability ; Word Vector ; Convolutional Neural Network ; Representation Learning ; fastText

Publication Name

International Journal of Computational Linguistics and Chinese Language Processing (中文計算語言學期刊)

Volume or Term/Year and Month of Publication

Vol. 22, No. 2 (2017/12/01)

Page #

31 - 45

Content Language

Traditional Chinese

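Chinese Abstract (English translation)

Readability generally refers to the degree to which reading material can be understood by its readers: the more readily readers can understand the material, the better the learning outcome. To help readers find documents that match their reading ability, researchers have long been developing models that estimate text readability automatically and accurately. Readability classification typically analyzes the information in a document and converts it into a set of readability features, which are then used to train a readability model that can predict the readability of unseen documents. However, the features used by traditional readability models must be selected on the basis of expert experience, which limits their practicality. With the rapid development of representation learning techniques in recent years, the features needed to train a readability model no longer have to rely on experts, which opens a new research direction for readability modeling. This paper therefore uses two techniques, convolutional neural networks and fastText, to extract text features automatically and to train a readability model that can analyze cross-domain documents and cope with the diverse topics of their content. A series of experimental comparisons with existing methods confirms the performance advantage of the proposed readability models.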

English Abstract

Text readability refers to the degree to which a text can be understood by its readers: the more readable a text is for its readers, the better the comprehension and learning retention they can achieve. To help readers digest and comprehend documents, researchers have long been developing readability models that can automatically and accurately estimate text readability. The conventional approach to readability classification is to infer a readability model from a set of handcrafted features, defined a priori and computed from the training documents, along with the readability levels of these documents. However, handcrafted features require special expertise, which limits their applicability. With recent advances in representation learning techniques, salient features can be extracted efficiently from documents without recourse to specialized expertise, which offers a promising avenue of research on readability classification. In view of this, this paper proposes two novel readability models, built on a convolutional neural network based representation and the so-called fastText representation, respectively, which can effectively analyze documents belonging to different domains and covering a wide variety of topics. A series of empirical experiments demonstrates the utility of the proposed models relative to several existing methods.
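To make the idea of learned (rather than handcrafted) features concrete, the following is a minimal sketch of a fastText-style classifier: a document is represented by the average of its word embeddings, and a linear softmax layer maps that vector to a readability level. It is an illustration only, not the authors' implementation. The toy documents, the two readability levels, the function names (`train`, `predict`), and all hyperparameters are assumptions introduced here, and the real fastText method additionally uses n-gram features and other efficiency tricks that this sketch omits.

```python
import math
import random

# Hypothetical toy corpus: label 0 = lower reading level, 1 = higher level.
DOCS = [
    ["cat", "dog", "run"],
    ["sun", "cat", "play"],
    ["photosynthesis", "chlorophyll", "molecule"],
    ["molecule", "enzyme", "photosynthesis"],
]
LABELS = [0, 0, 1, 1]


def average_embedding(E, ids):
    """Document vector: mean of the embedding rows selected by ids."""
    dim = len(E[0])
    return [sum(E[i][k] for i in ids) / len(ids) for k in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def train(docs, labels, num_classes, dim=8, epochs=300, lr=0.1, seed=0):
    """Jointly learn word embeddings E and classifier weights W with SGD."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    E = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]
    W = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            ids = [vocab[w] for w in doc]
            h = average_embedding(E, ids)
            probs = softmax([sum(wc[k] * h[k] for k in range(dim)) for wc in W])
            # Gradient of the cross-entropy loss with respect to the scores.
            g = [p - (1.0 if c == y else 0.0) for c, p in enumerate(probs)]
            # Gradient with respect to h, computed before W is updated.
            gh = [sum(g[c] * W[c][k] for c in range(num_classes))
                  for k in range(dim)]
            for c in range(num_classes):
                for k in range(dim):
                    W[c][k] -= lr * g[c] * h[k]
            # h is a mean, so each word receives gh scaled by 1 / len(ids).
            for i in ids:
                for k in range(dim):
                    E[i][k] -= lr * gh[k] / len(ids)
    return vocab, E, W


def predict(model, doc):
    """Return the highest-scoring readability level for a tokenized document."""
    vocab, E, W = model
    ids = [vocab[w] for w in doc if w in vocab]
    if not ids:
        raise ValueError("document has no in-vocabulary words")
    h = average_embedding(E, ids)
    scores = [sum(wc[k] * h[k] for k in range(len(h))) for wc in W]
    return max(range(len(W)), key=lambda c: scores[c])


model = train(DOCS, LABELS, num_classes=2)
predictions = [predict(model, d) for d in DOCS]
```

The point of the sketch is that no feature is chosen by hand: the embedding table and the classifier weights are both fitted from labeled documents, which is the property the abstract attributes to representation learning.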

Topic Category Humanities > Library and Information Science
Basic and Applied Sciences > Information Science
Engineering > Electrical Engineering