Scene Text Recognition of Both Sideways and Upright Text in Arbitrary Orientation

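Prior work on scene text recognition has focused mainly on sideways (horizontally written) text in a single direction. In real environments, however, sideways and upright (vertically written) text do appear together in the same scene. In some Asian countries, such as China, upright text is almost as common as sideways text in street views. To recognize all the text in such a scene correctly, a text recognition system must handle sideways and upright text at the same time. A complete text recognition system generally consists of a detector and a recognizer, with the text images output by the detector serving as the input of the recognizer. Most existing work requires every input image of the recognizer to have the same text direction (for example, left to right). Once the input image of the system can contain sideways and upright text at arbitrary angles, however, it is hard to guarantee that all text images output by the detector share the same text direction, and this causes the recognizer to make wrong predictions. In this thesis, we design a novel scene text recognition system for sideways and upright text in arbitrary orientation. Its neural-network-based recognizer can be trained end to end and needs only word-level annotations. In addition, we design a text angle predictor that extracts the rotation angle of the text in an image and ensures that the text images fed to the recognizer have the required text direction. Because no public upright scene text dataset currently exists, we implemented an upright text image generator and produced an upright English dataset for training. We also collected and annotated a real-world upright English dataset for testing. On public sideways English datasets (SVT, IIIT-5k, and ICDAR), our method performs comparably to the current leading methods, while additionally being able to handle sideways and upright text in arbitrary orientation at the same time.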

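Keywords

Computer Vision; Pattern Recognition; Scene Text Recognition; Optical Character Recognition; Deep Learning; Neural Network; Convolutional Neural Network; Recurrent Neural Network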

Parallel Abstract

Research on scene text recognition to date has focused on sideways text. However, sideways and upright text commonly appear in the same scene. In some Asian countries, such as China, street views may contain as much upright text as sideways text. Under such circumstances, a scene text recognition system must recognize both types of text simultaneously. Generally, a scene text recognition system is composed of a detector and a recognizer, and the input of the recognizer is usually the output of the detector. Most existing scene text recognizers expect the text in all input images to be arranged in the same direction (e.g., from left to right). However, once the text lines in an image can be sideways or upright at an arbitrary orientation angle, it is hard to ensure that all detector output images have the same character direction, and this causes recognition errors. In this paper, we develop a system for scene text recognition of both sideways and upright text in arbitrary orientation. A text orientation estimation module is further proposed to capture the orientation angle information and make sure the character direction is correct for the recognizer. Since there is no public upright text dataset, we implemented an upright synthetic data engine to generate a synthetic upright English text dataset (Synth-ENGV) for training and collected a real-world upright English dataset (ENG) for testing. In experiments on benchmark sideways datasets, including Street View Text (SVT), IIIT-5k, and ICDAR, our model demonstrates performance competitive with state-of-the-art methods, with the additional ability to handle text in different directions and to recognize both sideways and upright text automatically at the same time.
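The detector, orientation estimator, and recognizer pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the orientation module works, so this sketch assumes the estimated angle is simply snapped to the nearest quarter turn and undone before recognition. The names `rotate_ccw`, `normalize_orientation`, and `read_scene_text`, and the three callables passed in, are hypothetical, and a text crop is modelled as a plain grid of pixel values.

```python
from typing import Callable, List

Grid = List[List[int]]  # a text crop modelled as rows of pixel values (assumption)


def rotate_ccw(grid: Grid, quarter_turns: int) -> Grid:
    """Rotate a grid counter-clockwise by the given number of 90-degree turns."""
    out = [list(row) for row in grid]
    for _ in range(quarter_turns % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        out = [list(row) for row in zip(*out)][::-1]
    return out


def quarter_turns_from_angle(angle_deg: float) -> int:
    """Snap a counter-clockwise angle in degrees to the nearest quarter turn (0..3)."""
    return int(round(angle_deg / 90.0)) % 4


def normalize_orientation(crop: Grid, angle_deg: float) -> Grid:
    """Undo the estimated rotation so the recognizer sees a canonical direction."""
    k = quarter_turns_from_angle(angle_deg)
    return rotate_ccw(crop, -k)  # a negative count is reduced modulo 4 inside


def read_scene_text(
    image: Grid,
    detector: Callable[[Grid], List[Grid]],   # hypothetical: image -> text crops
    angle_estimator: Callable[[Grid], float],  # hypothetical: crop -> angle in degrees
    recognizer: Callable[[Grid], str],         # hypothetical: crop -> transcription
) -> List[str]:
    """Detect text crops, normalize each crop's orientation, then recognize it."""
    results = []
    for crop in detector(image):
        angle = angle_estimator(crop)
        results.append(recognizer(normalize_orientation(crop, angle)))
    return results
```

Snapping to quarter turns keeps the sketch free of interpolation; a real system handling arbitrary angles would need a proper image rotation, which is outside what this example shows.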

Parallel Keywords

Computer Vision; Pattern Recognition; Scene Text Recognition; Optical Character Recognition; Deep Learning; Neural Network; Convolutional Neural Network; Recurrent Neural Network

