Multi-Modal Knowledge Image Caption System for Intelligent Service Robotics Applications

Advisor: Ren C. Luo (羅仁權)
Full text will be available for download on 2024/08/20.

Abstract


Service robots are the future trend of robotics and artificial intelligence. In the past, industrial robots equipped with artificial intelligence have helped factories automate many tasks. Before robots can truly enter people's daily lives, however, the level of intelligence a service robot possesses is a major test, and deep learning is an indispensable technique for reaching that level. Deep learning has flourished in recent years, including CNNs (convolutional neural networks) for vision and RNNs (recurrent neural networks) for language processing. Image captioning, which integrates the two, is a representative work that comes even closer to artificial intelligence: given an input image, it generates a sentence describing the image's content, just as a person would continually describe the scenes in front of them.

Although this capability has been applied in other areas, such as image retrieval and image indexing, it has not yet been truly realized on service robots, for two reasons. First, the scope of what current image caption models learn is too broad. Well-known image caption datasets such as MSCOCO, Flickr8k, and Flickr30k contain many natural landscapes, hand-drawn pictures, abstract paintings, and other images rarely seen in daily life, so a service robot may say the wrong thing because of the dataset it was trained on. Second, because these datasets cover such broad content, their labels can only use very common, generic terms, whereas a service robot must possess knowledge specific to its service environment, such as the people, events, and objects in it. A generic image caption model therefore cannot contain domain-specific knowledge.

The purpose of this thesis is to truly realize image captioning on service robots, with a focus on patrol applications: the robot describes the images it sees and reports them to the administrator, making management easier. To achieve this, we identify the information the administrator needs, such as the objects contained in an image; if an image contains a person, the administrator must also know that person's identity and state, where the state comprises emotion and behavior. This thesis therefore uses three methods to integrate an image caption model with object recognition for a specific environment, and further integrates the model with face recognition and emotion recognition, so that a patrol robot can report a person's identity and emotional state. The robot is also equipped with semantic localization so that the administrator obtains more comprehensive information. Experimental comparisons show that our informative image caption system achieves higher object recognition accuracy than an MSCOCO pre-trained image caption model, and higher face recognition and emotion recognition accuracy than a simply fine-tuned image caption model.
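The encoder-decoder pairing named above (a CNN for vision feeding an RNN for language) can be sketched as follows. This is a minimal illustration in PyTorch; the ResNet-18 backbone, the dimensions, and the practice of feeding the image feature as the first decoder input are assumptions chosen for clarity, not the exact configuration used in this thesis.

```python
# Illustrative CNN-encoder / RNN-decoder captioning model.
# Backbone and sizes are assumptions, not the thesis's configuration.
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: a pretrained ResNet with its classifier removed,
        # producing one feature vector per image.
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.project = nn.Linear(resnet.fc.in_features, embed_dim)
        # RNN decoder: an LSTM that emits one word logit vector per step.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Encode the image, then feed it as the first "word" to the LSTM.
        feats = self.encoder(images).flatten(1)       # (B, F)
        feats = self.project(feats).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions)                  # (B, T, E)
        seq = torch.cat([feats, words], dim=1)        # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # (B, T+1, vocab)
```

At inference time, decoding would start from the image feature alone and feed each predicted word back into the LSTM until an end-of-sentence token is produced.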

Parallel Abstract


Service robots are a trend in the fields of robotics and artificial intelligence. In the past, industrial robots equipped with artificial intelligence have helped automate factories. As robots gradually enter daily life, equipping them with sufficient intelligence is a major challenge, and deep learning is an essential technique for making this possible. Deep learning has become popular in recent years, including CNNs (convolutional neural networks) for image processing and RNNs (recurrent neural networks) for natural language processing. Image captioning, a more intelligent function, combines the two: given an image, it generates a sentence that describes the image, as a person would.

Although image captioning can be used for image retrieval, image indexing, and so on, it cannot be applied directly to a service robot for two main reasons. First, the image caption models proposed in recent works are trained on famous public datasets such as MSCOCO or Flickr. These datasets gather images from a broad variety of fields, such as hand-drawn pictures, natural scenes, and paintings, that are not usually seen in daily life; a robot equipped with such a model may therefore generate these unusual sentences even when it does not see related images. Second, a service robot usually serves in a specific environment, so it should be equipped with specific knowledge about the objects and people in that environment. Unfortunately, public, general-purpose datasets do not contain that knowledge.

The purpose of this work is to ground image captioning in a real service robot, focusing on a robot for patrol. In other words, the robot should produce a caption about what it sees and read that sentence to the guard in the remote control room. For this purpose, we need to know what information the guard wants. For example, if there is a person in an image, the guard may want to know the person's identity and state, where the state includes emotion and behavior. In this work, the author proposes three methodologies for combining an image caption model with specific object recognition so that the output sentence contains knowledge about the objects. This image caption model is then combined with a face recognition model and an emotion classification model so that the robot can also report the person's identity and emotion. Furthermore, the robot is equipped with semantic localization to give the guard more comprehensive information about the scene it sees. From the experiments, we conclude that our informative image caption system outperforms the MSCOCO pre-trained image caption model with higher object recognition accuracy; our model also achieves higher face recognition and emotion recognition rates than the fine-tuned model.
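As one concrete, purely illustrative reading of the identity-and-emotion integration described above: once a face recognizer and an emotion classifier have produced their labels, a generic caption could be specialized by substitution. The function below is a hypothetical sketch of that idea, not necessarily the integration method this thesis actually proposes.

```python
# Minimal sketch of injecting identity and emotion into a generic
# caption by template substitution. The substitution scheme and the
# example strings are illustrative assumptions, not the thesis's
# actual integration method.
def informative_caption(caption: str, identity: str, emotion: str) -> str:
    # Replace the first generic mention of a person with the labels
    # returned by separate face/emotion recognizers.
    if "a person" not in caption:
        return caption
    return caption.replace("a person", f"{identity}, who looks {emotion},", 1)

print(informative_caption("a person is sitting at the desk", "Alice", "happy"))
# -> "Alice, who looks happy, is sitting at the desk"
```

A substitution step like this leaves the underlying caption model untouched, which is why combining the recognizers with the caption model itself, as this work does, is needed for the sentence to carry environment-specific knowledge end to end.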

