A Transformer-Based Manchu Recognition System with Zero-Shot In-Context Learning Correction by Large Language Models

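Manchu archives are an important source for studying the history of China's Qing dynasty, but many of these texts have not yet been fully digitized, so an automated technique that can effectively recognize the text of Manchu archives is needed. Recent studies have adopted sequence-to-sequence models based on Manchu letter tokens, allowing the model to learn the correspondence between images and letters on its own. However, because large amounts of annotated real Manchu data are lacking, such models struggle with the variation in written form that Manchu letters undergo depending on context. To address this problem, this study proposes a Manchu recognition system built on TrOCR, a Transformer-based optical character recognition model, and the large language model ChatGPT o1. The system trains the Manchu recognition model on a large volume of low-cost synthetic Manchu images and introduces novel Manchu syllable tokens to improve the model's adaptability to variation in letter forms. In addition, the study combines a Manchu dictionary with a monolingual Manchu corpus to guide a large language model that has no understanding of Manchu in correcting recognition output through zero-shot in-context learning. Evaluation shows that the Manchu recognition model, trained only on synthetic data, achieves character error rates (CER) of 7.33% and 3.84% on the Diamond Sutra and Qinzheng Pingding Shuomo Fanglüe, respectively, with accuracies of 79.88% and 87.28%. After further error correction with ChatGPT o1, the CER on Qinzheng Pingding Shuomo Fanglüe decreases by a further 1.2% and accuracy rises by 5.2%. This study offers a new low-cost approach to text recognition for low-resource languages, promotes the digitization and preservation of historical documents, and opens new directions for research and applications in related fields.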
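The two reported metrics can be made concrete with a short sketch. CER is computed here in the usual way, as the total character-level edit distance divided by the total number of reference characters. The abstract does not define "accuracy", so treating it as exact-match accuracy over words is an assumption of this example, and the romanized strings are illustrative inputs rather than data from the study.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Character error rate: total edits divided by total reference characters."""
    edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return edits / chars


def word_accuracy(references, hypotheses) -> float:
    """Share of words recognized exactly (assumed definition, not from the source)."""
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return correct / len(references)


# Illustrative romanized words; not taken from the evaluation sets.
refs = ["manju", "gisun", "bithe"]
hyps = ["manju", "gisnn", "bite"]
print(f"CER = {cer(refs, hyps):.4f}, accuracy = {word_accuracy(refs, hyps):.4f}")
```

Under these definitions a single wrong character makes the whole word count as an error, which is one way a low CER can coexist with a noticeably lower accuracy.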

Keywords

Manchu recognition; deep learning; low-resource languages; large language models; historical document digitization

Parallel Abstract

Manchu archival documents are essential resources for studying the history of China's Qing dynasty. However, many of these documents remain undigitized, underscoring the need for effective Manchu word recognition systems. Recent studies have employed sequence-to-sequence models built on Manchu character tokens, which let the model learn the mapping between visual features and their corresponding characters on its own. Nonetheless, the limited availability of annotated Manchu datasets poses a significant challenge, particularly in capturing contextual variations in Manchu character forms. To address this issue, this study proposes a Manchu recognition system based on TrOCR, a Transformer-based optical character recognition (OCR) model, combined with the large language model ChatGPT o1. The system trains the recognition model on a large volume of low-cost synthetic Manchu images and introduces novel Manchu syllable tokens to improve the model's adaptability to variations in character forms. Furthermore, the study integrates a Manchu dictionary and a monolingual Manchu corpus to guide recognition correction through zero-shot in-context learning, using a large language model that has no inherent understanding of Manchu. Evaluation results show that the Manchu word recognition model, trained solely on synthetic data, achieves character error rates (CER) of 7.33% and 3.84% on the Diamond Sutra and Qinzheng Pingding Shuomo Fanglüe datasets, respectively, with corresponding accuracies of 79.88% and 87.28%. Subsequent error correction using ChatGPT o1 reduces the CER on the Qinzheng Pingding Shuomo Fanglüe dataset by a further 1.2% and raises accuracy by 5.2%. This study presents a Manchu word recognition system that requires no annotated real-world data during training. It offers a new low-cost method for word recognition in low-resource languages, facilitating the digitization and preservation of historical documents and opening new directions for research and applications in related fields.
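The dictionary-guided correction step can be pictured with a minimal sketch. The abstract does not give the actual prompt or candidate-retrieval method, so everything here is an assumption made for illustration: the function names `candidate_words` and `build_correction_prompt` are hypothetical, the similarity measure is Python's `difflib` rather than whatever the study used, and the lexicon and corpus lines are toy romanized strings. The sketch only assembles the prompt text and does not call any language model.

```python
from difflib import get_close_matches


def candidate_words(word, lexicon, n=3, cutoff=0.6):
    """Return the word itself if the lexicon contains it, else its closest entries."""
    if word in lexicon:
        return [word]
    # difflib similarity is a stand-in for the study's (unspecified) retrieval method.
    return get_close_matches(word, sorted(lexicon), n=n, cutoff=cutoff)


def build_correction_prompt(ocr_words, lexicon, corpus_lines):
    """Assemble a zero-shot prompt: instructions, reference sentences, candidates."""
    parts = [
        "The following romanized Manchu words come from an OCR system and may contain errors.",
        "Using only the reference sentences and dictionary candidates below, "
        "output the corrected words.",
        "",
        "Reference sentences:",
    ]
    parts += [f"- {line}" for line in corpus_lines]
    parts += ["", "OCR words and dictionary candidates:"]
    for word in ocr_words:
        candidates = candidate_words(word, lexicon)
        shown = ", ".join(candidates) if candidates else "(no close entry)"
        parts.append(f"{word} -> {shown}")
    return "\n".join(parts)


# Toy inputs; a real system would load a full dictionary and corpus.
lexicon = {"manju", "gisun", "bithe"}
prompt = build_correction_prompt(["manju", "bite"], lexicon, ["manju gisun"])
print(prompt)
```

The prompt contains no worked correction examples, which is what makes the setting zero-shot: the model relies only on the supplied dictionary candidates and corpus context.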

Parallel Keywords

Manchu word recognition; Deep learning; Low-resource language; Large language model; Historical document digitization
