
從多標籤圖像學習之深層視覺語意轉換模型

Deep Visual Semantic Transform Model Learning from Multi-Label Images

Advisor: 葉梅珍

Abstract


Learning the relation between images and textual semantics has long been an important problem in machine learning and computer vision. This thesis studies the relation between images and text. First, words themselves carry semantic relations: for example, "sky" and "cloud" are semantically close, whereas "sky" and "car" are almost unrelated. But does the semantic relation a user perceives between two words change with the image? For instance, in an image containing both the sky and a car, the words "sky" and "car", originally almost unrelated in meaning, become connected because of that image. We therefore argue that the degree of semantic relatedness between words changes from image to image. We propose a Convolutional Neural Network model that links an image with the semantics of its multiple text labels. The model takes an image as input; unlike existing visual semantic embedding models, its output is a linear transformation function: each input image is mapped to a function that judges how relevant each word is to that image, which is then used to predict the image's likely labels.
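
As a concrete illustration of the word-level relatedness described above, the following is a minimal sketch that scores word pairs with off-the-shelf pretrained GloVe vectors loaded through gensim; the particular model name and the use of cosine similarity are assumptions made here for illustration, not the embedding setup used in the thesis.

    # Minimal sketch: word-word semantic relatedness from pretrained GloVe
    # vectors via gensim. The model name below is an illustrative assumption.
    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-100")   # pretrained 100-d GloVe vectors
    print(wv.similarity("sky", "cloud"))       # cosine similarity: relatively high
    print(wv.similarity("sky", "car"))         # cosine similarity: relatively low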

Abstract (English)


Learning the relation between images and text semantics has long been an important problem in machine learning and computer vision. This thesis addresses that problem. We observe that words carry semantic relations: for example, "sky" and "cloud" are closely related, while "sky" and "car" are only weakly related. We hypothesize that the semantic relation between words can change depending on the image. For example, consider an image that contains both the sky and a car: the words "sky" and "car" are initially almost unrelated, but become connected through the image that contains both concepts. We therefore propose a Convolutional Neural Network based model that links an image to the semantics of its text labels. The main difference between our work and existing visual semantic embedding models is that the output of our model is a linear transformation function. In other words, each input image is mapped to a function that determines the relation between each word and the image, which is then used to predict the image's likely labels. Finally, the model is validated on the NUS-WIDE dataset, and the experimental results show that it performs well at predicting labels for images.
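
The following is a minimal PyTorch sketch of the idea of mapping an image to a linear scoring function over word vectors, as described in the abstract. The backbone choice, layer sizes, and all names are illustrative assumptions rather than the architecture actually used in the thesis.

    # Minimal sketch (PyTorch): a CNN maps each image to the parameters of a
    # linear function that scores word embeddings, rather than to a fixed
    # embedding. Backbone, dimensions, and names are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ImageToTransform(nn.Module):
        def __init__(self, embed_dim=300, feat_dim=512):
            super().__init__()
            backbone = models.resnet18(weights=None)                    # assumed CNN backbone
            self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the final FC layer
            self.to_weight = nn.Linear(feat_dim, embed_dim)             # per-image transform weights
            self.to_bias = nn.Linear(feat_dim, 1)                       # per-image bias

        def forward(self, images, word_vectors):
            # images: (B, 3, H, W); word_vectors: (V, embed_dim), e.g. word2vec/GloVe
            feats = self.cnn(images).flatten(1)    # (B, feat_dim) pooled image features
            w = self.to_weight(feats)              # (B, embed_dim)
            b = self.to_bias(feats)                # (B, 1)
            # Image-conditioned linear score of every word for every image.
            return w @ word_vectors.t() + b        # (B, V) relevance scores

Under these assumptions, training could use a standard multi-label objective such as binary cross-entropy over the label vocabulary, and labels for a test image could be predicted by ranking or thresholding the per-word scores; this mirrors the description in the abstract rather than the exact model.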
