Methods and Practice of Ancient Book Digitization Based on Clustering Proofreading and Light Word-Forming

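This study first reviews the fidelity principle and the collation principle in the digitization of ancient books, and identifies and compares three approaches to text proofreading. It then introduces three core problems that ancient book digitization must solve: the missing-character problem, the character identification problem, and the variant-character problem. The missing-character problem is a technical problem whose fundamental constraint is the level of technology, whereas the identification and variant-character problems are problems of editorial convention whose fundamental constraint is the availability of specialists in Chinese characters. Building on this analysis and on advances in intelligent optical character recognition (OCR) for ancient books, the study proposes a text proofreading method based on clustering proofreading and light word-forming. The method handles the missing-character problem in a lightweight way, and it decomposes the identification and variant-character problems into separate stages so that the work can be divided among specialists, easing the shortage of character experts. Finally, the study describes the practical work of the Jingshan Tripitaka (《徑山藏》) digitization project, which preliminarily verifies the soundness and effectiveness of the method.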

Keywords

digitization of ancient books; text proofreading; clustering proofreading; missing characters; variant characters

Parallel Abstract

This study reviews the principles of fidelity and collation in the digitization of ancient books, and identifies and compares three approaches to text proofreading. It identifies three core problems in the process: missing characters (a technical issue), and character identification and variant characters (both issues of editorial convention that depend on the expertise of Chinese character specialists). The article analyzes these problems and their relationships, showing that the underlying constraints are the level of technology for the former and the availability of expert talent for the latter two. Leveraging advances in intelligent optical character recognition (OCR) technology, the study introduces a text proofreading method based on clustering proofreading and light word-forming. This method not only addresses the problem of missing characters but also breaks down character identification and variant characters into specialized tasks, easing the reliance on expert resources. The practical application of this method is demonstrated in the "Jingshan Tripitaka" digitization project by the Beijing Rushi Institute of Artificial Intelligence Technology, preliminarily validating the effectiveness of this approach.

Parallel Keywords

digitization of ancient books; text proofreading; clustering proofreading; missing characters; variant characters
