A Study on the Extraction of Handwritten Data Fields from Form Documents

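This study investigates the automated processing of form documents and proposes methods for classifying and extracting handwritten fields and for extracting handwritten data. In the handwritten field extraction stage, the size, aspect ratio, global structural properties, and directional structural features of the objects in a form serve as classification features. To make these structural features easier to obtain, the study uses image encoding to convert the blank form image into a simplified structure graph. To distinguish description fields from fill-in fields that contain descriptive text, it analyzes the horizontal and vertical pixel projections of each field region together with the distribution, size, and character spacing of the descriptive text. In the handwritten data extraction stage, the filled-in form image is matched against known blank form samples, and the handwritten field information of the blank form of the same class is then used to extract the handwritten field data from the filled-in form. Removing the frame lines breaks the handwritten strokes that cross them; to address this, the study proposes a method that identifies the segments where strokes intersect the frame lines and reconstructs the handwritten strokes in those segments, repairing the broken strokes. The test images fall into two classes: form images with ordinary simple layouts and composite form images with complex layouts. The experimental results show that the proposed method achieves good results on both types of form image.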
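The horizontal and vertical pixel projections mentioned above can be illustrated with a minimal sketch. It assumes a field region is available as a binary grid (1 = ink, 0 = background). The function names `projection_profiles` and `ink_runs` are hypothetical and are not taken from the thesis, which does not specify its implementation.

```python
from typing import List, Tuple

def projection_profiles(region: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (row_profile, col_profile): ink-pixel counts per row and per column."""
    row_profile = [sum(row) for row in region]
    col_profile = [sum(col) for col in zip(*region)]
    return row_profile, col_profile

def ink_runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of consecutive non-zero bins."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a new span of ink begins
        elif value == 0 and start is not None:
            runs.append((start, i))        # the span ended at the previous bin
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

# Toy region (assumed data): two character-like blobs separated by a gap.
region = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
rows, cols = projection_profiles(region)
print(ink_runs(rows))  # [(1, 3)]         one text line
print(ink_runs(cols))  # [(1, 3), (5, 7)] two blobs with a 2-pixel gap
```

The spans in the column profile give the width of each blob and the spacing between blobs, which is the kind of size and spacing evidence the abstract describes using to separate description fields from fill-in fields. How the thesis combines these measurements is not stated here.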

Keywords

Form document recognition; form handwritten field extraction; handwritten data extraction; broken character repair; Run-Based algorithm

Parallel Abstract

Form document analysis is one of the most essential tasks in document analysis and recognition, and the extraction of form fields and of filled-in data are two of its important parts. For form field extraction, the first major task is to classify the preprinted text, lines, check boxes, text boxes, and tables of a form. This thesis proposes a method based on direction-invariant global structural features and direction-dependent structural features to classify the form fields and then extract the fill-in spaces in a form document. Since tables can contain both name fields and data fields, the second task uses a method based on horizontal and vertical color histogram distribution features to segment the fields and extract the data fields. For filled-in data extraction, we propose a method based on the Run-based algorithm and the idea of interpolation to detect the character strokes overlapped by the printed form frame and to reconstruct the broken strokes after the frame lines are removed. Experiments on different types of form documents achieved a 99% recognition rate for form field extraction and a 91% success rate for filled-in data extraction.
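The idea of combining run-based encoding with interpolation to repair strokes can be sketched as follows. This is an illustration under stated assumptions, not the thesis's algorithm: it assumes the rows blanked by frame-line removal are known, pairs the i-th ink run above the gap with the i-th run below it, and linearly interpolates the run endpoints. The names `encode_row` and `bridge_gap` are hypothetical.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (start_col, end_col), end exclusive

def encode_row(row: List[int]) -> List[Run]:
    """Run-length encode one binary row into runs of ink pixels (1 = ink)."""
    runs: List[Run] = []
    start = None
    for x, value in enumerate(row):
        if value and start is None:
            start = x
        elif not value and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(row)))
    return runs

def bridge_gap(image: List[List[int]], top: int, bottom: int) -> List[List[int]]:
    """Return a copy of image with rows top..bottom (inclusive) refilled.

    Assumes rows top-1 and bottom+1 exist and hold the stroke ends, and that
    runs above and below correspond in order (a simplifying assumption).
    """
    above = encode_row(image[top - 1])
    below = encode_row(image[bottom + 1])
    repaired = [row[:] for row in image]
    steps = bottom - top + 2  # row steps from the row above to the row below
    for (a0, a1), (b0, b1) in zip(above, below):
        for k, y in enumerate(range(top, bottom + 1), start=1):
            # Integer linear interpolation, rounding halves up.
            x0 = a0 + ((b0 - a0) * k + steps // 2) // steps
            x1 = a1 + ((b1 - a1) * k + steps // 2) // steps
            for x in range(x0, x1):
                repaired[y][x] = 1
    return repaired

# Toy image (assumed data): a diagonal stroke cut by a two-row frame line.
image = [
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],  # removed frame line
    [0, 0, 0, 0, 0, 0, 0],  # removed frame line
    [0, 0, 0, 0, 1, 1, 0],
]
for row in bridge_gap(image, top=1, bottom=2):
    print(row)
```

Pairing runs by order is the weakest assumption in this sketch. A real form would need a rule for deciding which stroke ends belong together, for example by comparing position and stroke width, and the abstract does not say how the thesis makes that decision.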

Parallel Keywords

Form document analysis and recognition; Form field extraction; Filled-in data extraction; Broken stroke reconstruction; Run-based Algorithm

