
多模型生成對抗網路人機互動圖像描述系統於智慧服務機器人之應用

Multi-Modal Generative Adversarial Network Based Image Caption System for Intelligent Human-Robot Interactive Service Robotics Applications

Advisor: Ren C. Luo (羅仁權)
Full text will be available for download on 2025/08/18.

Abstract


Service robots are a major trend in the future market. With human labor expensive and scarce, introducing robots into daily life is an effective way to improve convenience. Many disadvantaged groups need assistance, and as technology advances, traditional guide equipment can no longer cope with rapidly changing environments, so guide robots have emerged and are attracting increasing attention from industry. The purpose of this thesis is to help people who are visually degraded, visually impaired, or unable to read text. We propose a comprehensive visual service system that enables these groups to live conveniently in a technologically advanced society; to make service robots more flexible and versatile, integrating artificial intelligence is an inevitable trend.

To serve as the eyes of the visually impaired, image captioning is the core concept of this thesis. Image captioning takes an image as input and, based on its content, outputs a sentence describing the image, just as a person would continuously describe the scene in front of them. The technique also applies to other fields, such as image retrieval and image indexing, but its application to service robots still needs improvement in several respects. Traditional image captioning has two shortcomings on service robots. First, after training with the conventional cross-entropy method, the model tends to answer with template sentences from the MSCOCO training set; its responses are rigid and fixed rather than as diverse and natural as human answers. Second, because the training set covers a wide range of content, its sentences are mostly rough and general, whereas in practical applications what people really want are meaningful and informative sentences.

Academically, unlike traditional training methods, this thesis adopts GAN and reinforcement learning techniques to improve the diversity and naturalness of the generated sentences, achieving strong scores on various evaluation metrics. In terms of application, this thesis aims to truly deploy image captioning on service robots and to help visually impaired groups. The target application is a service robot that assists blind users: the robot describes the images it sees and reports them to the user, and with a flexible voice interface the user can ask for any desired information; for information beyond the local database, a web crawler retrieves data online. In daily life, what users most want to know is fine-grained information about nearby people, such as their identity, action, gender, and hairstyle. We use object detection to locate regions of interest, and then collect the desired information through models such as face recognition, text recognition, age recognition, and object recognition. In our Multi-modal Informative Caption system, we integrate information from six models: identity recognition, facial expression recognition, age recognition, image captioning, dense captioning, and image segmentation.

Through comparative experiments, we find that our reinforcement learning and GAN approach achieves higher scores than the MSCOCO pre-trained image caption model, and that our GAN-based caption model produces more diverse and accurate captions than a caption model that is merely fine-tuned.

Parallel Abstract


Service robots are a major trend in the future market. To reduce the cost of hiring many workers to provide services, introducing robots into daily life is an effective way to improve quality of life. The purpose of this thesis is to develop the technologies needed to help people who are visually degraded, visually impaired, or unable to read text. We propose an assistive visual service system to enable vulnerable groups to live in this technologically advanced society. To serve as the eyes of the visually impaired, image captioning is one of the most important core assistive technologies. Image captioning takes an image as input and, based on its content, outputs a sentence describing the image, just as a person would continuously describe the scene being observed. This technology also applies to other fields, such as image retrieval and image indexing, but its application to service robots still has much room for improvement. Traditional image captioning has two shortcomings on service robots. First, after training with the conventional cross-entropy method, the model tends to answer with template sentences from the MSCOCO training data; its responses are rigid and fixed, not as diverse, vivid, and natural as human answers. Second, because the dataset covers a wide range of content, the sentences in the training data are mostly rough and general answers, whereas at the practical application level what people really want are meaningful and informative sentences. In our Multi-modal Informative Caption system, we integrate information from six models: identity recognition, facial expression recognition, age recognition, image captioning, dense captioning, and image segmentation.
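As a rough illustration of the six-model fusion idea (not the thesis implementation), the sketch below merges per-model outputs for one detected region into a single informative caption. All model names and outputs here are hard-coded stubs standing in for the real recognizers.

```python
# Illustrative fusion of several recognition models' outputs for one
# detected person region into one informative sentence. Models that
# contribute nothing for this region return None and are skipped.

def fuse_caption(region_facts):
    """Join the non-empty per-model facts into a single sentence."""
    parts = [fact for fact in region_facts.values() if fact]
    return ", ".join(parts) + "."

# Hypothetical outputs of the six models for one detected region.
facts = {
    "identity":      "this is John",
    "expression":    "he looks happy",
    "age":           "around 30 years old",
    "caption":       "a man sitting at a desk",
    "dense_caption": "wearing a blue shirt",
    "segmentation":  None,  # no extra fact from this model here
}

print(fuse_caption(facts))
# → this is John, he looks happy, around 30 years old,
#   a man sitting at a desk, wearing a blue shirt.
```

In a real pipeline the region would first come from an object detector, and each fact from the corresponding recognition model; simple string fusion is shown only to make the integration step concrete.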
Academically, unlike traditional training methods, this thesis uses Generative Adversarial Network (GAN) and reinforcement learning techniques to improve the diversity and naturalness of the generated sentences, achieving strong scores on various evaluation metrics. In terms of application, the purpose of this thesis is to truly deploy image captioning on service robots and to help visually impaired users. The target application emphasized in this thesis is a service robot that assists the visually impaired: the robot can describe the images it sees and report them to the user. With a flexible virtual assistant, users can obtain critical information such as the identity, movement, gender, and hairstyle of a nearby person. We use object detection to find regions of interest, and then collect useful information through models such as face recognition, text recognition, age recognition, and object recognition.
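The reinforcement-learning side of the training can be illustrated with a minimal self-critical policy-gradient rule, in which the reward of the greedily decoded caption serves as the baseline for a sampled caption. The reward values below are invented, and the reward metric is only assumed to be something like CIDEr.

```python
# Minimal sketch of the self-critical policy-gradient rule used to
# train caption models with a metric reward instead of cross-entropy.
# Reward numbers are illustrative, not thesis results.

def advantage(reward_sampled, reward_greedy):
    """Self-critical baseline: subtract the greedy caption's reward."""
    return reward_sampled - reward_greedy

def loss_gradient_wrt_logprob(reward_sampled, reward_greedy):
    """d(loss)/d(log p) for the sampled caption: negative when the
    sample beats the baseline, so gradient descent raises that
    caption's probability; positive when it falls short."""
    return -advantage(reward_sampled, reward_greedy)

# A sampled caption scoring above the greedy baseline is reinforced ...
assert loss_gradient_wrt_logprob(0.9, 0.6) < 0
# ... while one scoring below the baseline is suppressed.
assert loss_gradient_wrt_logprob(0.3, 0.6) > 0
```

Because the reward is computed on whole sentences, this objective can reward diverse phrasings that score well, rather than only the single template phrasing favored by cross-entropy training.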

