
Design and Development of a Bilingual Reading Comprehension Corpus


This paper describes our initial attempt to design and develop a bilingual reading comprehension corpus (BRCC). Reading comprehension (RC) is a task conventionally used to evaluate an individual's reading ability. An automatic RC system analyzes a passage of natural language text and generates an answer to each question based on information in the passage. The RC task can therefore drive advances in the natural language processing (NLP) technologies embedded in automatic RC systems. Furthermore, an RC system presents a novel paradigm of information search compared with the predominant paradigm of text retrieval in Web search engines. Previous work on automatic RC typically involved English-only language learning materials (Remedia and CBC4Kids) designed for children and students, which included stories, human-authored questions, and answer keys. These corpora are important for supporting empirical evaluation of RC performance. In the present work, we attempt to use RC as a driver for NLP techniques in both English and Chinese. We sought parallel English and Chinese learning materials and incorporated annotations deemed relevant to the RC task. We measured the comparative levels of difficulty of the three corpora (BRCC, Remedia, and CBC4Kids) using the baseline bag-of-words (BOW) approach. Our results show that the BOW approach achieves better RC performance on BRCC (67%) than on Remedia (29%) and CBC4Kids (63%). This reveals that BRCC has the highest degree of word overlap between questions and passages among the three corpora, which artificially simplifies the RC task. This result suggests that additional effort should be devoted to authoring questions of varying grades of difficulty so that BRCC can better support RC research across the English and Chinese languages.
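To make the word-overlap idea behind the BOW baseline concrete, the following is a minimal sketch, not the system evaluated in the paper. It assumes that the answer is taken to be the passage sentence sharing the most words with the question. The function names, the regular-expression tokenizer, the naive sentence splitter, and the sample passage are all hypothetical choices made for illustration; the abstract does not specify preprocessing such as stemming or stop-word removal.

```python
import re


def bag_of_words(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(passage):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    This is an illustrative assumption, not the segmentation used in the paper.
    """
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p for p in parts if p]


def answer_sentence(question, passage):
    """Return the passage sentence with the largest word overlap with the question."""
    q_words = bag_of_words(question)
    sentences = split_sentences(passage)
    # max() returns the first sentence among those tied for the highest overlap.
    return max(sentences, key=lambda s: len(q_words & bag_of_words(s)))


# Hypothetical passage and question, invented for this sketch.
passage = (
    "Tom went to the market on Monday. "
    "He bought three red apples. "
    "Then he walked home."
)
question = "When did Tom go to the market?"
print(answer_sentence(question, passage))
```

This sketch handles English only. Chinese text is not whitespace-delimited, so the tokenizer would have to be replaced by a word segmenter or character-level matching. A corpus whose questions reuse the passage wording closely will score high under this kind of matching, which is the effect the abstract reports for BRCC.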


bilingual reading comprehension corpus


