  • Journal


Special Typeface Identification in Chinese Document Images


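Optical character recognition (OCR) has been widely studied over the past twenty years. Digitizing documents with OCR reduces the space needed to store paper and allows documents to be classified automatically for later retrieval. Commercial OCR products all claim recognition rates above 90%. However, these figures are mostly based on recognition results for printed text images in normal typefaces. For the special typefaces commonly found in printed documents (such as boldface, hollow, underlined, and italic characters), recognition performance differs markedly from that for normal typefaces. Running multiple typeface-specific recognition engines at the same time would slow recognition, given the very large Chinese character set. This study proposes a method that automatically detects the special typeface of each character in a printed text-block image. First, horizontal and vertical projection profiles of the text-block image are analyzed to extract the text lines and candidate character components. Then, characteristics such as component size, distance between components, stroke width within a component, and black run length are computed to determine the typeface of each character. In subsequent recognition, the engine trained on characters of the identified special typeface can be used for matching, which improves overall recognition performance on documents containing special typefaces while preserving recognition speed as far as possible.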


Optical character recognition (OCR) has been widely studied over the past twenty years. Digitizing paper documents with OCR techniques reduces their storage space, and the digitized documents can be classified and retrieved conveniently. Commercial OCR products claim character recognition accuracy above 90%. This accuracy, however, is generally measured on printed characters in normal typefaces. For special typefaces such as italic, underlined, hollow, and boldface, commercial systems achieve noticeably poorer recognition accuracy. Because the number of Chinese characters is large, recognition becomes slow when a multi-engine OCR system is used. This study proposes an approach for identifying the special typeface of each character in a text-block image. In the proposed approach, text lines and character components are extracted by analyzing the projection profiles of the image. Then, several characteristics, such as component size, gaps between pairs of components, stroke width, and black run length, are computed and analyzed to identify the special typeface of each character. Finally, the recognition engine trained for the identified typeface is applied to recognize each unknown character image.
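
The extraction step described above can be illustrated with a minimal sketch. It assumes a binarized text block stored as a list of rows in which 1 is a black pixel and 0 is background; the function names (projection_profile, nonzero_spans, black_run_lengths, segment) are hypothetical and are not taken from the paper. The sketch only shows how projection profiles separate text lines and candidate components and how black run lengths can be measured. The abstract does not say how these measurements are mapped to typefaces, so no decision rule is included.

```python
from itertools import groupby

def projection_profile(image, axis):
    """Count black pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]

def nonzero_spans(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ended at the previous index
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def black_run_lengths(row):
    """Lengths of consecutive runs of black pixels in a single row."""
    return [len(list(group)) for key, group in groupby(row) if key == 1]

def segment(image):
    """Split a text block into lines (row spans), then each line into
    candidate components (column spans within that line)."""
    result = []
    for top, bottom in nonzero_spans(projection_profile(image, 0)):
        line = image[top:bottom]
        columns = nonzero_spans(projection_profile(line, 1))
        result.append(((top, bottom), columns))
    return result

# Toy 6x7 block: one line with two components, then one long horizontal stroke.
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
lines = segment(image)
```

On the toy block, segment returns the row span of each line together with the column spans of its components, and black_run_lengths(image[1]) gives the horizontal run lengths in the second row. Component sizes and gaps follow directly from the spans; real scanned images would also need noise handling, which this sketch omits.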


