

Historical Corpora for Synchronic and Diachronic Linguistics Studies


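The Academia Sinica Ancient Chinese Corpus was built for linguistic research on ancient Chinese. It not only holds a large body of electronic texts suitable for the study of ancient Chinese grammar and vocabulary, but also provides a multi-function program for searching, counting, and finding collocations of the words in those texts. Following the course of grammatical development, the corpus is divided into three subcorpora, Old Chinese, Middle Chinese, and Early Mandarin Chinese, a division that should be convenient for both synchronic and diachronic studies of ancient Chinese. In the Old Chinese subcorpus, a considerable number of texts have now been classified and marked up according to source text, author, genre, and so on. Many of these texts have also been segmented into words, and among the segmented texts several classical works have been fully tagged for part of speech. The results of this segmentation and tagging now form the basis of our Old Chinese Lexical Database.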


The Academia Sinica Ancient Chinese Corpus is designed for linguistic research. The corpus contains ancient texts selected for their usefulness in grammatical and lexical studies, together with an inspection program that offers keyword searching, statistics, and collocation functions. The corpus is divided into three subcorpora according to stages of grammatical development, so that both synchronic and diachronic studies can be performed on them. Their current sizes are as follows:

a. Old Chinese subcorpus (from pre-Qin to pre-Han): 5,128,068 characters.
b. Middle Chinese subcorpus (from Late Han to the Six Dynasties): 8,101,662 characters.
c. Early Mandarin Chinese subcorpus (from Tang to Qing): 4,406,381 characters.

A large portion of the texts in the Old Chinese subcorpus (4,497,051 characters) has been textually classified and marked up according to source book, author, text genre, etc. A substantial part (520,794 characters) of the same subcorpus has also been segmented into words, which have in turn been tagged for part of speech. The results of these two tasks form the basis of our Old Chinese Lexical Database.
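To put the reported sizes in proportion, the figures above can be tallied in a few lines of Python. This is only an illustrative sketch: the variable names are hypothetical and not part of the corpus tools, and the only inputs are the character counts stated in the abstract.

```python
# Character counts as reported in the abstract (variable names are hypothetical).
subcorpus_sizes = {
    "Old Chinese": 5_128_068,
    "Middle Chinese": 8_101_662,
    "Early Mandarin Chinese": 4_406_381,
}

# Portions of the Old Chinese subcorpus that have been processed so far.
old_chinese_classified = 4_497_051  # classified and marked up
old_chinese_segmented = 520_794     # word-segmented and POS-tagged

total_characters = sum(subcorpus_sizes.values())
classified_share = old_chinese_classified / subcorpus_sizes["Old Chinese"]
segmented_share = old_chinese_segmented / subcorpus_sizes["Old Chinese"]

print(f"Total characters: {total_characters:,}")
print(f"Old Chinese classified and marked up: {classified_share:.1%}")
print(f"Old Chinese segmented and tagged: {segmented_share:.1%}")
```

On these figures, roughly seven eighths of the Old Chinese subcorpus has been classified and marked up, while about a tenth has been segmented and tagged.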


