
Solving Language-Guided Domain Adaptation in Visual Grounding via Sim-to-Real Transfer Learning

Sim2real Transfer Visual Grounding Knowledge Through Language-Guided Patch-wise Domain Adaptation

Advisor: 徐宏民

Abstract


In human-robot interaction, we often expect an intelligent robot to adapt quickly to changes in its environment and to perform well on visual grounding tasks. Today's solutions, however, rely on collecting data from the new environment and retraining the robot, which is inefficient and costly in both labor and money. To address this problem, we propose a sim-to-real, transfer-learning-based domain adaptation method that lets the robot learn from zero-cost simulated data. To generate the required training data, we use a powerful graphics rendering engine to produce near-photorealistic synthetic images and combine them with annotations obtainable at no cost into a new visual grounding dataset, YCB-Ref, for training the robot on the visual grounding task. Using the generated data directly, however, runs into the sim-to-real gap, and our method offers two solutions to this problem. The first is Mixup Domain Randomization: we paste real-world backgrounds onto synthetic images whose backgrounds are blank, strengthening the robot's ability to separate objects from background noise. The second is Language-Guided Patch-wise Domain Adaptation: we reinforce the correspondence between the most important patches of the synthetic and real images, helping the model become more sensitive to, and better understand, the patches that deserve attention. Finally, the experimental results all show that our method effectively helps the robot learn the visual grounding task from simulated data.
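The Mixup Domain Randomization step described above amounts to compositing the rendered objects over a real-world photo. Below is a minimal sketch of that compositing idea, assuming the simulator exports an object mask alongside each render; the function name, array layout, and dtype choices are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def mixdr_composite(synthetic_rgb: np.ndarray,
                    object_mask: np.ndarray,
                    real_background: np.ndarray) -> np.ndarray:
    """Paste a real background into the blank regions of a synthetic render.

    synthetic_rgb:   (H, W, 3) float array, rendered image of the objects.
    object_mask:     (H, W) boolean array, True where rendered objects appear
                     (available for free from the simulator).
    real_background: (H, W, 3) float array, a real-world photo resized to match.
    """
    mask = object_mask[..., None].astype(np.float32)
    # Keep the rendered foreground objects and fill the empty background
    # with real pixels, so the model learns to ignore background noise.
    return synthetic_rgb * mask + real_background * (1.0 - mask)
```

Because the pasted backgrounds can come from any unlabeled real photos, this augmentation adds no annotation cost on top of the simulated data.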

Parallel Abstract


The capability to adapt rapidly to dynamic environments is vital for visual grounding in downstream tasks such as human-robot interaction. However, generalizing to a new environment usually requires collecting additional labeled data for fine-tuning, which demands redundant effort and expensive annotation. Therefore, we propose a new sim2real transfer method for visual grounding, SimVG, which utilizes a rendering engine to generate unlimited synthetic data from a simulator and explores the potential of leveraging the zero-cost, abundant reasoning knowledge in our generated dataset, YCB-Ref. Moreover, to bridge the typical reality gap between synthetic and real data, we adopt Mixup Domain Randomization (MixDR), which diminishes the influence of background noise by pasting real-world backgrounds into the blank regions of synthetic images, and a novel Language-Guided Patch-wise Domain Adaptation (LaPaDA), which mitigates the visual domain differences. Experiments on a real-world dataset, OCID-Ref, reveal that our method outperforms previous methods based on conventional domain classifiers in the unsupervised setting, and remains superior even when fine-tuned with small amounts of labeled data at different ratios.
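The abstract describes LaPaDA as aligning synthetic and real patch features while emphasizing the patches that matter to the referring expression. The sketch below illustrates one plausible reading of that idea, assuming a patch-level domain classifier trained through a gradient-reversal layer and weighted by language attention over visual patches; the module names, tensor shapes, and loss weighting are assumptions for illustration only, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips and scales gradients in backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class PatchDomainHead(nn.Module):
    """Patch-wise domain classifier weighted by language attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(dim, dim // 2),
                                        nn.ReLU(),
                                        nn.Linear(dim // 2, 1))

    def forward(self, patch_feats, lang_feat, domain_label, lamb=1.0):
        # patch_feats:  (B, N, D) visual patch features
        # lang_feat:    (B, D)    pooled referring-expression feature
        # domain_label: 1.0 for synthetic batches, 0.0 for real batches
        attn = torch.softmax(
            torch.einsum('bnd,bd->bn', patch_feats, lang_feat)
            / patch_feats.size(-1) ** 0.5, dim=-1)            # language attention
        reversed_feats = GradReverse.apply(patch_feats, lamb)
        logits = self.classifier(reversed_feats).squeeze(-1)   # (B, N)
        target = torch.full_like(logits, domain_label)
        per_patch = F.binary_cross_entropy_with_logits(
            logits, target, reduction='none')
        # Concentrate the adversarial alignment on the patches the
        # referring expression attends to.
        return (attn * per_patch).sum(dim=-1).mean()
```

In such a setup, the gradient reversal pushes the visual backbone to produce patch features the domain classifier cannot separate, and the language weighting focuses that alignment on the regions relevant to the query rather than on the whole image.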

References


Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T. L. (2014). ReferItGame: Referring to Objects in Photographs of Natural Scenes. EMNLP.
Tziafas, G., Kasaei, S. H. (2021). Few-Shot Visual Grounding for Natural Human-Robot Interaction. CoRR, abs/2103.09720. Retrieved from https://arxiv.org/abs/2103.09720
Shridhar, M., Mittal, D., Hsu, D. (2020, January). INGRESS: Interactive visual grounding of referring expressions. The International Journal of Robotics Research, 39. doi:10.1177/0278364919897133
Zhang, H., Lu, Y., Yu, C., Hsu, D., Lan, X., Zheng, N. (2021, July). INVIGORATE: Interactive Visual Grounding and Grasping in Clutter. Proceedings of Robotics: Science and Systems. doi:10.15607/RSS.2021.XVII.020
Wang, K.-J., Liu, Y.-H., Su, H.-T., Wang, J.-W., Wang, Y.-S., Hsu, W., Chen, W.-C. (2021, June). OCID-Ref: A 3D Robotic Dataset With Embodied Language For Clutter Scene Grounding. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 5333–5338. doi:10.18653/v1/2021.naacl-main.419
