Hand-Filled Form Recognition Using Tesseract with an LSTM Model

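Hand-filled forms are common in daily life, and converting them into electronic files usually requires someone to type the contents into a computer. To reduce the time spent on manual entry, this paper uses OpenCV to process the scanned form: it extracts the field borders and removes the data originally present in the fields, then passes the result to the optical character recognition software Tesseract to recognize the handwritten text. We use the AI.FREE Traditional Chinese handwritten character set, select 100 characters from it, and use two-thirds of the image files for LSTM training to improve Tesseract's recognition accuracy on handwritten text; the remaining one-third is used to validate the training results. The experiment aims to show that the LSTM-trained data can improve recognition accuracy for Traditional Chinese, so that hand-filled forms can be converted into electronic files more easily.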

Keywords

Optical character recognition, form processing, long short-term memory (LSTM) model

Parallel Abstract

In daily life we often encounter hand-filled forms, and converting them into electronic documents mostly requires manual input to a computer. To reduce the time spent on manual input, this paper applies OpenCV to the scanned form to detect the table borders and remove the original data in the table, and then hands the result to the optical character recognition software Tesseract to recognize the handwritten text. We use AI.FREE's Traditional Chinese handwritten character set, select 100 characters from it, and use two-thirds of the image files for LSTM training to strengthen Tesseract's recognition accuracy on handwritten text. The remaining one-third is used to verify the training results. This paper hopes that the data set trained with LSTM can improve the accuracy of Traditional Chinese recognition, so that hand-filled forms can be easily converted into electronic documents.
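The two-thirds / one-third division of the image files described above can be sketched as follows. This is only an illustrative sketch, not the procedure used in the paper: the function name `split_per_character`, the `(character, image_path)` sample format, the fixed random seed, and the choice to split within each character (so that every character appears in both sets) are all assumptions introduced here.

```python
import random
from collections import defaultdict


def split_per_character(samples, train_num=2, train_den=3, seed=0):
    """Split (character, image_path) pairs into training and validation lists.

    Hypothetical helper: for each character, train_num/train_den of its
    images go to training and the rest to validation. Splitting per
    character is an assumption; the paper only states the overall ratio.
    """
    by_char = defaultdict(list)
    for char, path in samples:
        by_char[char].append(path)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, validation = [], []
    for char in sorted(by_char):
        paths = sorted(by_char[char])
        rng.shuffle(paths)
        # Integer arithmetic avoids floating-point rounding surprises.
        cut = len(paths) * train_num // train_den
        train.extend((char, p) for p in paths[:cut])
        validation.extend((char, p) for p in paths[cut:])
    return train, validation


if __name__ == "__main__":
    # Made-up file names, purely for demonstration.
    demo = [(c, f"{c}_{i}.png") for c in "一二三" for i in range(6)]
    train_set, val_set = split_per_character(demo)
    print(len(train_set), len(val_set))
```

Splitting within each character keeps the validation set from missing any of the selected characters; a plain random split over all files would not guarantee that.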