
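Abstract


Classical Chinese texts usually carry no punctuation, which makes them very difficult for modern readers to read and understand. Adding modern punctuation to ancient texts is the foundation of their collation and study, and it is also quite laborious work. Using artificial intelligence (AI) to punctuate ancient texts automatically is therefore of practical value. We apply the latest deep learning (DL) tools from natural language processing (NLP) and train two models, long short-term memory (LSTM) and convolutional neural network (CNN), on a training set of more than 50 million Chinese characters and about 10 million punctuation marks. On a test set of Buddhist texts from six different dynasties, we achieve a punctuation accuracy of up to 94.3%, and the system can annotate classical texts with seven modern punctuation marks (comma, period, question mark, exclamation mark, enumeration comma, semicolon, colon).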

Parallel Abstract


Ancient Chinese scriptures usually have no punctuation marks, which makes them difficult for modern readers to read and understand. Adding modern punctuation to ancient scriptures is the basis for their collation and study, but it is a very tedious process. It is therefore of practical significance to punctuate ancient scriptures automatically by means of artificial intelligence (AI). We apply the latest deep learning (DL) tools from the field of natural language processing (NLP) to train two models, long short-term memory (LSTM) and convolutional neural network (CNN), on a training set of more than 50 million Chinese characters and approximately 10 million punctuation marks. On a test set of Buddhist texts from six different dynasties, the highest punctuation accuracy achieved was 94.3%. At present, the system can mark ancient texts with seven kinds of modern punctuation (comma, period, question mark, exclamation mark, dunhao, semicolon, colon).
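The abstract does not say how the training data is encoded. One common way to frame automatic punctuation is as per-character labeling, where each character is tagged with the mark that follows it. The sketch below illustrates that framing only, as an assumption rather than the authors' method. The names `NO_MARK`, `to_labeled_pairs`, `restore`, and `label_accuracy` are hypothetical, and the accuracy here is a simple per-character measure that may differ from the metric behind the reported 94.3%.

```python
# Illustrative sketch only: per-character labeling is an assumed framing,
# not the encoding confirmed by the abstract.

# The seven marks named in the abstract, in their full-width forms.
MARKS = set(",。?!、;:")
NO_MARK = "O"  # hypothetical label meaning "no punctuation follows"


def to_labeled_pairs(punctuated):
    """Turn punctuated text into (character, label) pairs.

    Each character is labeled with the mark that directly follows it,
    or NO_MARK. Leading or consecutive marks are dropped in this sketch.
    """
    pairs = []
    for ch in punctuated:
        if ch in MARKS:
            if pairs and pairs[-1][1] == NO_MARK:
                pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, NO_MARK))
    return pairs


def restore(chars, labels):
    """Rebuild punctuated text from characters and per-character labels."""
    out = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label != NO_MARK:
            out.append(label)
    return "".join(out)


def label_accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have equal length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


if __name__ == "__main__":
    pairs = to_labeled_pairs("學而時習之,不亦說乎?")
    chars = [c for c, _ in pairs]
    labels = [l for _, l in pairs]
    print(restore(chars, labels))
```

Under this framing, a model such as an LSTM or CNN would read the unpunctuated characters and predict one label per position; `restore` then reinserts the predicted marks into the text.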

