Generating Paragraph Descriptions of Images Using Scene Graphs

SG2P: Image Paragraphing with Scene Graph

Advisor: 許永真 (Jane Yung-jen Hsu)

Abstract


In recent years, image paragraphing has drawn growing attention in computer vision. However, because images and text differ fundamentally in structure, it is difficult to find a suitable way to map visual information to language, and the paragraphs produced by existing methods are still riddled with semantic errors. In this thesis, we propose SG2P, a two-stage method for generating image paragraphs, to address this problem. Rather than translating an image directly into text as in prior work, we first convert the image into an intermediate semantic representation, the scene graph, expecting that generating from the scene graph yields more semantically correct paragraphs. In addition, we use a hierarchical recurrent language model with skip connections to alleviate vanishing gradients when generating long text. To evaluate the results, we propose a new metric, c-SPICE, a graph-comparison-based measure of the semantic correctness of a paragraph. Experiments show that, compared with converting an image directly into a paragraph, first converting it into a scene graph and then generating the paragraph from that graph yields significantly better scores.
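
To make the two-stage design concrete, below is a minimal, self-contained Python sketch of the SG2P data flow. It is an illustration only: all names (`Triple`, `predict_scene_graph`, `paragraph_from_graph`) are hypothetical, stage 1 returns a hard-coded graph in place of a real scene-graph generator, and stage 2 substitutes a template realizer for the thesis's learned language model.

```python
# Minimal sketch of a two-stage image-paragraphing pipeline in the spirit
# of SG2P. Hypothetical names throughout; not the thesis implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str

def predict_scene_graph(image_path: str) -> list[Triple]:
    """Stage 1 (hypothetical): object detection + relationship prediction.

    A real system would run a learned scene-graph generator here; a
    hard-coded graph keeps the sketch self-contained and runnable.
    """
    return [
        Triple("man", "riding", "horse"),
        Triple("horse", "standing in", "field"),
        Triple("man", "wearing", "hat"),
    ]

def paragraph_from_graph(graph: list[Triple]) -> str:
    """Stage 2 (hypothetical): decode a paragraph from the graph.

    SG2P uses a learned hierarchical language model; a template
    realizer stands in for it so the data flow stays visible.
    """
    sentences = [f"A {t.subject} is {t.predicate} a {t.obj}." for t in graph]
    return " ".join(sentences)

if __name__ == "__main__":
    graph = predict_scene_graph("example.jpg")   # image -> scene graph
    print(paragraph_from_graph(graph))           # scene graph -> paragraph
```

A side effect of the intermediate representation is that every statement in the output paragraph is grounded in an explicit (subject, predicate, object) fact, which is exactly the kind of content a graph-based metric such as c-SPICE can later check.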

Abstract (English)


Automatically describing an image with a paragraph has gained popularity recently in the field of computer vision. However, the paragraphs produced by existing methods are full of semantic errors, because the features they extract directly from the raw image have difficulty bridging visual semantic information to language. In this thesis, we propose SG2P, a two-stage network, to address this issue. Instead of raw-image features, the proposed method leverages features encoded from the scene graph, an intermediate semantic structure of an image, aiming to generate more semantically correct paragraphs. Since the scene graph is an explicit semantic representation, we hypothesize that features encoded from it retain more semantic information than features extracted directly from the raw image. In addition, SG2P uses a hierarchical recurrent language model with skip connections to reduce the effect of vanishing gradients during the long generation process. To evaluate the results, we propose a new evaluation metric called c-SPICE, which automatically computes the semantic correctness of generated paragraphs through a graph-based comparison. Experiments show that methods utilizing scene-graph features outperform those using raw-image features on c-SPICE.
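
The following PyTorch sketch shows one plausible shape of a hierarchical recurrent language model with skip connections, assuming a sentence-level RNN emits one topic vector per sentence and a word-level RNN generates that sentence's words; all dimensions, module names, and the greedy decoding loop are illustrative, not taken from the thesis. The skip connection re-injects the sentence topic vector at every word step, giving gradients a short path back to the sentence level during long generation.

```python
# Hypothetical hierarchical decoder with a topic skip connection.
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    def __init__(self, vocab_size=1000, feat_dim=256, hid_dim=256, emb_dim=128):
        super().__init__()
        self.sent_rnn = nn.GRUCell(feat_dim, hid_dim)            # sentence level
        self.word_rnn = nn.GRUCell(emb_dim + hid_dim, hid_dim)   # word level
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, graph_feat, n_sents=3, n_words=8):
        """graph_feat: (batch, feat_dim) encoding of the scene graph."""
        batch = graph_feat.size(0)
        h_sent = graph_feat.new_zeros(batch, self.sent_rnn.hidden_size)
        logits = []
        for _ in range(n_sents):
            h_sent = self.sent_rnn(graph_feat, h_sent)  # topic for this sentence
            h_word = h_sent                             # init word RNN from topic
            tok = graph_feat.new_zeros(batch, dtype=torch.long)  # <bos> = id 0
            for _ in range(n_words):
                # Skip connection: the topic vector is concatenated to the
                # word embedding at *every* step, not only at t = 0.
                x = torch.cat([self.embed(tok), h_sent], dim=-1)
                h_word = self.word_rnn(x, h_word)
                step_logits = self.out(h_word)
                logits.append(step_logits)
                tok = step_logits.argmax(dim=-1)        # greedy decoding
        return torch.stack(logits, dim=1)  # (batch, n_sents * n_words, vocab)

if __name__ == "__main__":
    decoder = HierarchicalDecoder()
    out = decoder(torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 24, 1000])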
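
c-SPICE itself is defined in the thesis; what follows is only a toy Python illustration of the SPICE-style graph comparison it builds on: both the generated paragraph and the reference are represented as sets of scene-graph tuples (objects, attributes, relations), and the score is the F1 over the matched tuples. The tuple sets are assumed to be produced by a semantic parser upstream.

```python
# Toy SPICE-style comparison: F1 over matched scene-graph tuples.
def graph_f1(candidate: set[tuple], reference: set[tuple]) -> float:
    """F1 over tuples shared by the candidate and reference graphs."""
    if not candidate or not reference:
        return 0.0
    matched = len(candidate & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    reference = {("man",), ("horse",), ("man", "riding", "horse"), ("hat",)}
    generated = {("man",), ("horse",), ("man", "riding", "horse"),
                 ("dog",)}  # hallucinated object lowers precision
    print(f"graph F1 = {graph_f1(generated, reference):.2f}")  # 0.75
```

Hallucinated content surfaces as unmatched candidate tuples (hurting precision), while missing content surfaces as unmatched reference tuples (hurting recall), which is why a graph comparison can quantify semantic correctness directly rather than through n-gram overlap.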
