A Study of Digitization Techniques for Early Republican Chinese-Character Publications

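With the arrival of the computer era, printed historical documents are often scanned into digital image files for preservation. Documents in image form, however, cannot easily be searched by keyword, so the text in these images must be recognized and converted into machine-readable text. Optical character recognition (OCR) is the key step in this conversion, but document images must be properly preprocessed before OCR can run well. This study takes The Crystal (《晶報》, "Jing Bao"), a printed periodical from the early Republic of China, as its subject, and examines whether preprocessing tasks such as page segmentation and writing-direction-based component connection can be automated. Page segmentation has been studied for many years; however, the existing methods all rely on sufficient white space between article components to separate them, and so do not apply to The Crystal, whose articles are tightly packed. This study proposes a method that uses a convolutional neural network (CNN) to detect the dividing lines in The Crystal; the positions of the detected line segments separate the article components and thereby segment the page. Writing-direction-based component connection aims to link all components that belong to the same article: this study traverses the newspaper layout in reading order and treats each title encountered as the start of a new article. Finally, this study combines these two tasks with others into a five-step digitization workflow for The Crystal: page segmentation, component classification, punctuation removal, text recognition, and writing-direction-based component connection. The proposed page segmentation method achieves a mean IoU (intersection over union) of 83.98% over the components of a single page of The Crystal. On a single page, the component connection method connected only 9 of 13 articles successfully, but the error area in each case was quite small. These results show that the proposed method can effectively segment pages of tightly laid-out early Republican newspapers, and also indicate the effectiveness of the component connection method.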
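The mean IoU figure reported above can be illustrated with a minimal sketch. It assumes that each component is approximated by an axis-aligned bounding box and that predicted and ground-truth components are already paired one to one; neither assumption comes from the study itself, and the names `Box`, `iou`, and `mean_iou` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted: List[Box], truth: List[Box]) -> float:
    """Average IoU over components; assumes predicted[i] pairs with truth[i]."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not predicted:
        return 0.0
    return sum(iou(p, t) for p, t in zip(predicted, truth)) / len(predicted)
```

An IoU of 1 means a predicted component coincides exactly with its ground truth, and 0 means they do not overlap, so a page-level mean near 0.84 indicates that most components are recovered with only small boundary errors.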

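Keywords

Document image processing; Document image recognition; ECPO; Page segmentation; Optical character recognition; Convolutional neural network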

Parallel Abstract

Since the advent of computers, old documents have often been scanned into image files. However, these image files are not easy to search by keyword, so they must be converted into digital text. Optical character recognition (OCR) is the key to this process, but document images need to be preprocessed for OCR to perform well. This study attempts to automate preprocessing steps, such as page segmentation and component connection, on The Crystal, or "Jing Bao," published in the early 20th century, in the first years of the Republic of China. Page segmentation has been studied for many years, but the existing methods rely on sufficient blank space between components to separate them, which makes them inapplicable to The Crystal because of its compact layout. This study proposes a method for detecting boundary lines in The Crystal using a convolutional neural network (CNN). The positions of the boundaries separate the components and thereby achieve page segmentation. Writing-direction-based component connection links all components that belong to the same article. This study connects components by visiting each one along the direction of writing and starting a new article whenever a title is encountered. Finally, this study combines these two methods with three others and proposes a five-step digitization process for The Crystal: page segmentation, component classification, punctuation removal, text recognition, and component connection. The proposed page segmentation method has a mean IoU (intersection over union) of 83.98% on the components of a single page of The Crystal. With the component connection method, only 9 out of 13 articles are connected successfully, but the error area is small. These results confirm that the proposed method can effectively segment the pages of closely arranged publications, and also demonstrate the effectiveness of the component connection method.
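The title-triggered connection rule described above can be sketched as follows. This is only an illustration under stated assumptions: the `Component` fields, the `row` band index, and the reading order used here (bands from top to bottom, and right to left within a band, as is usual for vertically set Chinese text) are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    kind: str   # "title" or "text" (hypothetical labels)
    row: int    # hypothetical index of the horizontal band, top band = 0
    x: float    # left edge of the component in pixels


def reading_order(components: List[Component]) -> List[Component]:
    """Sort components along the assumed direction of writing."""
    # Assumption: visit bands top to bottom, and right to left inside a band.
    return sorted(components, key=lambda c: (c.row, -c.x))


def connect_components(components: List[Component]) -> List[List[Component]]:
    """Group components into articles, starting a new article at each title."""
    articles: List[List[Component]] = []
    for comp in reading_order(components):
        # A title opens a new article; so does the very first component,
        # in case the page begins with text carried over without a title.
        if comp.kind == "title" or not articles:
            articles.append([])
        articles[-1].append(comp)
    return articles
```

Because grouping depends entirely on titles being detected and on the traversal order being right, a misclassified title or a misordered component would merge or split articles, which is one plausible way a connection attempt could fail.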

Parallel Keywords

Document image processing; Document image recognition; ECPO; Page segmentation; Optical character recognition; Convolutional neural network
