In the field of human-robot interaction, we often expect intelligent robots to adapt quickly to changes in their environment and to perform well on visual grounding tasks. However, current solutions rely on collecting data from the new environment to retrain the robot, which is inefficient and costly in both labor and money. To address this problem, we propose a domain adaptation method based on sim-to-real transfer learning that enables the robot to learn from zero-cost simulated data. To generate the data needed for training, we use a powerful graphics rendering engine to produce photorealistic synthetic images and combine them with annotations obtained at no cost into a new visual grounding dataset, YCB-Ref, which we use to train the robot on visual grounding tasks. However, using this generated data directly runs into the sim-to-real gap, and our method offers two solutions to this problem. The first is a mixup domain randomization method, in which we paste real-world backgrounds onto synthetic images with blank backgrounds to strengthen the robot's ability to distinguish background noise. The second is a language-guided patch-wise domain adaptation method, in which we strengthen the correlation between the most important patches of the synthetic and real images, helping the robot become more sensitive to, and better understand, the patches that deserve attention. Finally, the experimental results show that our method effectively helps the robot learn visual grounding tasks from simulated data.
The capability to rapidly adapt to dynamic environments is vital for visual grounding in downstream tasks such as human-robot interaction. However, when generalizing to a new environment, existing approaches usually need to collect additional labeled data for fine-tuning, which demands redundant effort and expensive annotation. Therefore, we propose SimVG, a sim-to-real transfer method for visual grounding that utilizes a rendering engine to generate unlimited synthetic data from a simulator and explores the potential of leveraging the zero-cost, abundant reasoning knowledge in our generated dataset, YCB-Ref. Moreover, to bridge the typical reality gap between synthetic and real data, we adopt Mixup Domain Randomization (MixDR), which diminishes the influence of background noise by pasting real-world backgrounds into the blank regions of the synthetic images, together with a novel Language-Guided Patch-wise Domain Adaptation (LaPaDA) that mitigates the visual domain differences. Experiments on a real-world dataset, OCID-Ref, show that our method outperforms previous approaches based on conventional domain classifiers in the unsupervised setting, and even methods fine-tuned with small amounts of labeled data at different ratios.
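The MixDR step described above amounts to a simple mask-based compositing operation. Below is a minimal sketch of that idea, assuming the simulator exports a per-pixel foreground mask alongside each render; the function name and array layout are illustrative only and are not the thesis's actual implementation.

import numpy as np

def mixdr_composite(synthetic_rgb: np.ndarray,
                    foreground_mask: np.ndarray,
                    real_background: np.ndarray) -> np.ndarray:
    # Paste a real-world background into the blank regions of a synthetic render.
    #   synthetic_rgb:   H x W x 3 rendered image containing only the foreground objects
    #   foreground_mask: H x W boolean mask, True where a synthetic object is visible
    #   real_background: H x W x 3 photograph of a real scene, resized to H x W
    mask = foreground_mask[..., None]  # broadcast to H x W x 1
    # Keep rendered pixels on the foreground; fill the empty background
    # with pixels from the real photograph.
    return np.where(mask, synthetic_rgb, real_background)

In the spirit of domain randomization, one would presumably pair each synthetic render with a randomly chosen real background so that the model sees a wide variety of background clutter while the labeled foreground objects stay unchanged.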