
基於生成對抗網路的雙語文字與圖像轉譯系統

A Chinese/English Text to Image to English/Chinese Text Translation System based on GAN

Advisor: 李維聰

Abstract


With the rapid development of technology, artificial intelligence products such as Google Home, robot vacuums, and self-driving cars have become part of everyday life. Among these, applications that generate text from images or images from text are plentiful, yet research on generating text from an image that was itself generated from text (Text-to-Image-to-Text) remains scarce. In addition, with globalization more and more people are learning foreign languages; having both an image and a translated sentence available while studying would make learning more efficient.

This thesis proposes ConvertGAN, which adopts a new Text-to-Image-to-Text architecture. The Chinese dataset is first preprocessed so that its linguistic structure resembles that of English, enabling Chinese-to-Image-to-English and English-to-Image-to-Chinese generation. The Text-to-Image part of ConvertGAN uses a GAN architecture with three generators and three discriminators: the generators produce images at 64x64, 128x128, and 256x256 resolution, and the discriminators are trained adversarially against them. InceptionResNetV2 serves as the image encoder, and a GRU combined with an attention mechanism serves as the text encoder, so that image generation can focus on the more important words. The Image-to-Text part of ConvertGAN uses a seq2seq architecture: the 256x256 image produced by the Text-to-Image stage is taken as input, an encoder extracts image features, and a decoder with an attention mechanism focuses on the words with higher weights. Finally, cycle consistency is used to ensure that the semantics remain consistent end to end.

This thesis uses the Inception Score to evaluate text-to-image generation. For English-to-Image-to-English (E2I2E) generation, prior work such as TITGAN scores 1.8 and MirrorGAN scores 4.11, while the proposed ConvertGAN scores 4.20. With the Chinese dataset preprocessing, ConvertGAN scores 4.29 on Chinese-to-Image-to-English (C2I2E) and 4.21 on English-to-Image-to-Chinese (E2I2C), showing that the Chinese dataset still performs well for image generation. For image-to-text generation, we compare decoding algorithms, namely the greedy search used in TITGAN and the beam search used in this thesis; the experiments show that the text generated with beam search matches the image content more closely. Finally, we use BLEU to measure, for several encoder network architectures, how close the generated sentence is to the original input sentence. On BLEU-4 for E2I2E, AlexNet scores 0.11390, VGGNet scores 0.20117, and the ResNet-152 used in this thesis scores 0.22993. With the Chinese dataset preprocessing, C2I2E and E2I2C score 0.22518 and 0.43926, respectively, showing that the Chinese dataset still performs well for sentence translation.

In summary, the contributions of this thesis are: (1) a new Text-to-Image-to-Text architecture; (2) Chinese dataset preprocessing that lets a Chinese dataset perform well in this architecture; (3) increased diversity and realism of the generated images; (4) a revised text-generation method that makes the generated text better match the image content; (5) language conversion that stays closer to the original sentence; and (6) reduced training time.
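As an illustration of the Chinese-dataset preprocessing mentioned above (making the structure of Chinese captions resemble whitespace-delimited English), the following is a minimal sketch. The thesis does not name a specific tokenizer; the jieba segmenter used here is an assumption for the example.

```python
# Minimal sketch of the Chinese-caption preprocessing idea described above.
# Assumption: the "jieba" word segmenter is used; the thesis does not specify a tokenizer.
import jieba

def segment_caption(caption: str) -> str:
    """Split a Chinese caption into space-separated word tokens,
    mimicking the whitespace-delimited structure of English captions."""
    tokens = jieba.lcut(caption)                 # e.g. ["一隻", "黃色", "的", "鳥", ...]
    tokens = [t for t in tokens if t.strip()]    # drop pure-whitespace tokens
    return " ".join(tokens)

if __name__ == "__main__":
    print(segment_caption("一隻有著黃色腹部和黑色翅膀的小鳥"))
    # -> something like: 一隻 有著 黃色 腹部 和 黑色 翅膀 的 小鳥
```

After this step, a Chinese caption can be fed to the same word-level text encoder pipeline as an English caption.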

Parallel Abstract (English)


With the vigorous development of science and technology, there are many artificial intelligence products in our lives, such as Google Home, smart sweeping robots, and self-driving cars. Among them, there is abundant research on and application of Image-to-Text and Text-to-Image generation, but research on Text-to-Image-to-Text generation is very rare. In addition, with globalization more and more people are studying foreign languages; if learners could see an image that corresponds to a foreign sentence, they could study more efficiently.

Therefore, this thesis proposes ConvertGAN to address the Native-Text-to-Image-to-Foreign-Text generation problem. The proposed ConvertGAN adopts a new Text-to-Image-to-Text architecture and preprocesses the Chinese dataset so that its structure becomes similar to that of English, enabling Chinese-to-Image-to-English (C2I2E) and English-to-Image-to-Chinese (E2I2C) generation. The Text-to-Image part of ConvertGAN uses a GAN architecture consisting of three generators and three discriminators. The three generators produce images of 64x64, 128x128, and 256x256 pixels, respectively, and the discriminators are responsible for distinguishing real images from generated ones. In addition, InceptionResNetV2 is used as the image encoder, and a GRU combined with an attention mechanism is used as the text encoder so that image generation attends to the more important words. The Image-to-Text part of ConvertGAN uses a seq2seq architecture: the 256x256 image generated by the Text-to-Image stage is taken as input, an encoder extracts features from the image, and a decoder with an attention mechanism focuses on words with higher weights. Finally, cycle consistency is used to ensure semantic consistency.

This thesis uses the Inception Score to evaluate Text-to-Image generation. The experimental results show that for English-to-Image-to-English (E2I2E) generation, the Inception Score of TITGAN is 1.8 and that of MirrorGAN is 4.11, while the proposed ConvertGAN reaches 4.20. Furthermore, the Inception Scores of ConvertGAN for C2I2E and E2I2C are 4.29 and 4.21, respectively, showing that the Chinese dataset can still perform well for image generation in the proposed model. For Image-to-Text generation, we compare decoding algorithms: the greedy search used in TITGAN versus the beam search used in this thesis. The experimental results show that the text generated by beam search is closer to the meaning of the image. Finally, we use BLEU to evaluate several encoder network architectures by measuring how similar the generated sentence is to the original input sentence. The experimental results show that the BLEU-4 score for E2I2E is 0.11390 with AlexNet, 0.20117 with VGGNet, and 0.22993 with the ResNet-152 used in this thesis. The BLEU-4 scores of ConvertGAN for C2I2E and E2I2C are 0.22518 and 0.43926, respectively, which shows that the Chinese dataset can still perform well in sentence translation.
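To make the BLEU-4 comparison above concrete, the following is a small sketch of a sentence-level BLEU-4 computation using NLTK. The tokenization (simple whitespace split) and the smoothing function are assumptions for this illustration, not necessarily the exact settings used in the thesis.

```python
# Illustrative BLEU-4 computation comparing an original caption with a regenerated one.
# Assumptions: whitespace tokenization and NLTK's method1 smoothing.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu4(reference: str, hypothesis: str) -> float:
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    smooth = SmoothingFunction().method1        # avoids zero scores on short sentences
    return sentence_bleu([ref_tokens], hyp_tokens,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)

if __name__ == "__main__":
    ref = "a small bird with a yellow belly and black wings"   # hypothetical caption
    hyp = "a small yellow bird with black wings"               # hypothetical regenerated caption
    print(f"BLEU-4: {bleu4(ref, hyp):.5f}")
```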
The contributions of this thesis are summarized as follows: (1) a new Text-to-Image-to-Text architecture is proposed; (2) the Chinese dataset is preprocessed so that it performs well in this architecture; (3) the diversity and realism of image generation are increased; (4) the text-generation method is changed so that the generated text better matches the image content; (5) the converted text stays closer to the original sentence description; and (6) training time is reduced.
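Contribution (4) refers to replacing greedy decoding with beam search when generating captions from image features. The sketch below is an illustrative, framework-free beam search over a `step` function that returns log-probabilities for the next token; the `step` interface, the toy vocabulary, and the special tokens are hypothetical stand-ins for the attention-based decoder, not the thesis implementation.

```python
# Illustrative beam search decoder (greedy decoding would keep only the single
# best token at each step; beam search keeps the `beam_width` best hypotheses).
import math
from typing import Callable, Dict, List, Tuple

def beam_search(step: Callable[[List[str]], Dict[str, float]],
                beam_width: int = 3,
                max_len: int = 20,
                eos: str = "<eos>") -> List[str]:
    beams: List[Tuple[List[str], float]] = [(["<sos>"], 0.0)]  # (tokens, total log-prob)
    for _ in range(max_len):
        candidates: List[Tuple[List[str], float]] = []
        for tokens, score in beams:
            if tokens[-1] == eos:                    # finished hypotheses are kept as-is
                candidates.append((tokens, score))
                continue
            for tok, logp in step(tokens).items():   # expand with every next-token option
                candidates.append((tokens + [tok], score + logp))
        # keep only the highest-scoring partial captions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams[0][0]

if __name__ == "__main__":
    # Toy next-token log-probabilities, just to exercise the function.
    toy = {"<sos>": {"a": math.log(0.6), "the": math.log(0.4)},
           "a": {"bird": math.log(0.7), "<eos>": math.log(0.3)},
           "the": {"bird": math.log(0.9), "<eos>": math.log(0.1)},
           "bird": {"<eos>": 0.0}}
    print(beam_search(lambda toks: toy.get(toks[-1], {"<eos>": 0.0})))
```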

References


[1] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning Text-to-Image Generation by Redescription," in Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[2] S. K. Gorti and J. Ma, "Text-to-Image-to-Text Translation Using Cycle Consistent Adversarial Networks," arXiv:1808.04538, 2018. [Online]. Available: http://arxiv.org/abs/1808.04538
[3] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Networks," in Advances in Neural Information Processing Systems, 2014.
[4] G. E. Hinton and R. S. Zemel, "Autoencoders, Minimum Description Length and Helmholtz Free Energy," in Advances in Neural Information Processing Systems, 1994.
[5] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," in Proc. 2nd International Conference on Learning Representations (ICLR), 2014.
