
Dual Discriminator GAN-based Visual Storytelling

Advisor: 林嘉文

Abstract


With the development of deep neural networks, image captioning has reached the point where it can generate good descriptions of a picture's content. Unlike image captioning, which describes a single picture, visual storytelling not only describes multiple pictures but also has to find the relations between them, so that several connected descriptions build up a complete story. In the visual storytelling dataset, the majority of the descriptions carry distinct styles and imaginary concepts; compared with image captioning, which only needs to describe the picture's subject accurately, this property makes visual storytelling a harder and more complex task. Moreover, previous methods that rely on maximum likelihood estimation, or on reinforcement learning to optimize hand-crafted scores, cannot generate good sentences effectively. Generative adversarial networks (GANs) excel at producing plausible but non-existent data, and recent developments let them generate text as well as images. Adversarial training has already been shown to improve the generated sentences in image captioning. However, among existing methods, the conventional discriminator architecture cannot effectively improve the results when it faces the highly varied sentences of visual storytelling. In this thesis, we propose a dual discriminator GAN-based method. First, to strengthen the coherence of the story, we adjust the generator architecture; second, to evaluate generated sentences from two aspects, namely whether they read like human-written sentences and whether they match the images, we use two discriminators with different architectures. Experimental results demonstrate the advantages of our method over previous approaches.

Parallel Abstract


With the development of deep neural networks, great performance on image captioning has been achieved. Different from image captioning, the newer task of visual storytelling produces descriptions of an image stream rather than a single description of a single image. The descriptions in visual storytelling must be not only relevant to the individual images but also related across the image stream so that they form a complete story. In the visual storytelling dataset (VIST), most descriptions have unique styles and imaginary concepts, which makes visual storytelling more complex than image captioning. Furthermore, past methods face the limitations of maximum likelihood estimation and the strong bias introduced by hand-crafted rewards (e.g., BLEU, METEOR, CIDEr) optimized through reinforcement learning, which makes it hard to improve the quality of the results. Generative adversarial networks (GANs) are good at generating plausible but non-existent data, including not only images but also captions. Reinforcement learning with an adversarial training structure has been shown to improve image captioning results. However, the existing discriminator structures cannot take full advantage of these techniques on the small and complex visual storytelling dataset. In this thesis, we propose Dual-RL, a dual discriminator GAN-based algorithm for visual storytelling. First, to make the story more coherent, we adjust the generator architecture. Second, we assess the story from two aspects: whether it is human-like and whether it is story-related. Experiments show the advantage of the proposed method over previous works.
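
The abstract describes the method only at a high level. As a rough illustration of the idea, not the thesis implementation, the sketch below shows in PyTorch how a text-only "human-like" discriminator and an image-conditioned "story-related" discriminator could jointly supply the reward for a REINFORCE-style update of the sentence generator. All class names, dimensions, the <BOS> token id, and the mixing weight alpha are hypothetical.

# Minimal sketch (hypothetical, not the thesis code) of a dual-discriminator
# reward driving a policy-gradient generator update for visual storytelling.
import torch
import torch.nn as nn

VOCAB, EMB, HID, IMG_FEAT = 5000, 256, 512, 2048  # assumed sizes

class StoryGenerator(nn.Module):
    """Generates one sentence per image feature; samples tokens for REINFORCE."""
    def __init__(self):
        super().__init__()
        self.img_proj = nn.Linear(IMG_FEAT, HID)
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRUCell(EMB, HID)
        self.out = nn.Linear(HID, VOCAB)

    def sample(self, img_feat, max_len=20):
        """Returns sampled token ids and their log-probabilities."""
        h = torch.tanh(self.img_proj(img_feat))                 # init hidden state from the image
        tok = torch.zeros(img_feat.size(0), dtype=torch.long)   # assume <BOS> id = 0
        ids, logps = [], []
        for _ in range(max_len):
            h = self.rnn(self.embed(tok), h)
            dist = torch.distributions.Categorical(logits=self.out(h))
            tok = dist.sample()
            ids.append(tok)
            logps.append(dist.log_prob(tok))
        return torch.stack(ids, 1), torch.stack(logps, 1)

class HumanLikeD(nn.Module):
    """Text-only discriminator: does the sentence read as human-written?"""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.score = nn.Linear(HID, 1)

    def forward(self, ids):
        _, h = self.rnn(self.embed(ids))
        return torch.sigmoid(self.score(h[-1])).squeeze(-1)

class StoryRelatedD(nn.Module):
    """Image-conditioned discriminator: does the sentence match the photo?"""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.GRU(EMB, HID, batch_first=True)
        self.img_proj = nn.Linear(IMG_FEAT, HID)
        self.score = nn.Linear(2 * HID, 1)

    def forward(self, ids, img_feat):
        _, h = self.rnn(self.embed(ids))
        joint = torch.cat([h[-1], self.img_proj(img_feat)], dim=-1)
        return torch.sigmoid(self.score(joint)).squeeze(-1)

def generator_step(gen, d_human, d_story, img_feat, optimizer, alpha=0.5):
    """One REINFORCE update: reward mixes the two discriminator scores."""
    ids, logps = gen.sample(img_feat)
    with torch.no_grad():  # reward is treated as a constant signal
        reward = alpha * d_human(ids) + (1 - alpha) * d_story(ids, img_feat)
    loss = -(reward.unsqueeze(1) * logps).mean()   # policy-gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), reward.mean().item()

if __name__ == "__main__":
    gen, dh, ds = StoryGenerator(), HumanLikeD(), StoryRelatedD()
    opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
    album_feats = torch.randn(4, IMG_FEAT)  # e.g., 4 images from one album
    print(generator_step(gen, dh, ds, album_feats, opt))

A full adversarial training loop would also update the two discriminators to separate ground-truth album/story pairs from generated ones; the sketch covers only the generator side, which is where the dual reward enters the policy gradient.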

