Language Model Assisted OCR Classification for Republican Chinese Newspaper Text

In this work, we present methods for building a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper. Our starting point is the small fraction of the image corpus for which text ground truth exists. We introduce a character segmentation method that produces over 90,000 labeled images of single characters, and we train a GoogLeNet classifier on them as the OCR model. In addition, we create synthetic training data from character images extracted from Song-Ti fonts. Randomly augmented on the fly and used for pre-training, these synthetic images increase OCR accuracy on our test set from 95.49% to 96.95%. Finally, we apply post-OCR correction based on a pre-trained masked language model and present heuristics for selecting the required hyperparameters. This step corrects 16% of the remaining classification errors, raising accuracy on the test set to 97.44%.

Keywords

optical character recognition; language model; ground truth; image augmentation; Republican Chinese newspapers

Parallel Abstract (Chinese)

本文為研發使用神經網絡的光學字元辨識（optical character recognition, OCR）工具提出了一些方法，以辨識民國時期中文報紙中的文章部分。這項工作的基礎為一小部分已存在基準真相（ground truth）的圖像語料。我們引入了一種字符分割方法，從而生成了超過90,000個有標籤的單一字符圖像，並且訓練了一個GoogLeNet分類器作為OCR模型。此外，我們從宋體字體中提取字符圖像，以此製作了訓練數據。這些圖像被隨機增強並被用於預訓練，測試集的OCR準確率由95.49%提高到96.95%。最後，我們採用了基於預訓練遮罩語言模型（Masked LM）的OCR後校正，並提出啟發式方法來選擇所需的超參數。通過這些方法，我們能夠校正16%的剩餘分類錯誤，將測試集的準確率提高到97.44%。

Parallel Keywords (Chinese)

光學字元辨識；語言模型；基準真相；圖像增強；民國時期報紙
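
The post-OCR correction step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract only states that a pre-trained masked language model is used and that hyperparameters are chosen heuristically. Everything specific here is an assumption made for illustration: the confidence threshold `tau`, the weight `alpha`, the log-linear combination of scores, and the `toy_lm` function, which stands in for a real masked language model.

```python
import math
from typing import Callable, Dict, List

# One OCR output position: candidate character -> classifier probability.
Candidates = Dict[str, float]
# Hypothetical stand-in for a masked LM: given the current text, the index to
# mask, and the candidate characters, return a probability per candidate.
MaskedLM = Callable[[List[str], int, List[str]], Dict[str, float]]

EPS = 1e-9  # guards against log(0)


def correct(ocr: List[Candidates], lm: MaskedLM,
            tau: float = 0.9, alpha: float = 1.0) -> List[str]:
    """Rescore low-confidence OCR positions with a language model.

    tau (confidence threshold) and alpha (LM weight) are illustrative
    hyperparameters, assumed for this sketch and not taken from the paper.
    """
    # Start from the classifier's top-1 reading at every position.
    text = [max(c, key=c.get) for c in ocr]
    for i, cands in enumerate(ocr):
        if cands[text[i]] >= tau:
            continue  # classifier is confident; keep its prediction
        lm_probs = lm(text, i, list(cands))

        def score(ch: str) -> float:
            # Log-linear combination of classifier and LM evidence.
            return (math.log(max(cands[ch], EPS))
                    + alpha * math.log(max(lm_probs.get(ch, 0.0), EPS)))

        text[i] = max(cands, key=score)
    return text


def toy_lm(text: List[str], index: int, candidates: List[str]) -> Dict[str, float]:
    # Toy context rule in place of a real pre-trained masked LM:
    # after "新", strongly prefer "聞" (as in 新聞, "news").
    if index > 0 and text[index - 1] == "新" and "聞" in candidates:
        rest = 0.05 / max(1, len(candidates) - 1)
        return {ch: (0.95 if ch == "聞" else rest) for ch in candidates}
    return {ch: 1.0 / len(candidates) for ch in candidates}


# The classifier slightly prefers the visually similar "間" over "聞".
ocr_output = [
    {"新": 0.99, "親": 0.01},
    {"間": 0.55, "聞": 0.45},
]
corrected = "".join(correct(ocr_output, toy_lm))
print(corrected)
```

In this toy input the second position falls below `tau`, so the language model's context preference overrides the classifier's top choice. Raising or lowering `tau` controls how many positions are reconsidered, which is the kind of trade-off a hyperparameter selection heuristic would need to address.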

