  • Thesis

透過預訓練視覺-語言模型之文本知識增強即時語義分割:一種輕量級方法

Enhancing Real-Time Semantic Segmentation with Textual Knowledge of Pre-Trained Vision-Language Model: A Lightweight Approach

Advisor: 吳家麟

Abstract


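In this thesis, we propose a lightweight method that uses a pre-trained vision-language model to enhance real-time semantic segmentation. Our method combines the CLIP text encoder with a semantic segmentation model, effectively transferring textual knowledge to the segmentation model. Our framework integrates image and text embeddings so that visual and textual information can be combined. We also introduce learnable prompt embeddings to capture class-specific information and improve the model's semantic understanding. To ensure effective training, we design a two-stage training procedure in which the semantic segmentation model learns from fixed text embeddings in the first stage and the prompt embeddings are optimized in the second stage. Through experiments and ablation studies, we verify that this method significantly improves the performance of real-time semantic segmentation models.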

Parallel Abstract


In this paper, we present a lightweight method that enhances real-time semantic segmentation models by leveraging pre-trained vision-language models. Our approach incorporates the CLIP text encoder, which provides rich semantic embeddings for text labels, and transfers this textual knowledge to the segmentation model. The proposed framework integrates image and text embeddings, enabling the alignment of visual and textual information. In addition, we introduce learnable prompt embeddings to capture class-specific information and improve the model's semantic understanding. To ensure effective learning, we devise a two-stage training procedure in which the segmentation backbone learns from fixed text embeddings in the first stage and the prompt embeddings are optimized in the second stage. Extensive experiments and ablation studies show that our method significantly improves the performance of the real-time semantic segmentation model.
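
The abstract does not say how image and text embeddings are combined, so the following is only a minimal sketch under stated assumptions, not the thesis's implementation. It assumes each pixel embedding is scored against each class text embedding by cosine similarity, a common choice in CLIP-style dense prediction. It also assumes the segmentation model is frozen during the second stage, which the abstract leaves open. All function names, parameter-group names, and toy values are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def pixel_text_scores(pixel_embeddings, text_embeddings):
    """Cosine similarity between every pixel embedding and every class text embedding.

    pixel_embeddings: list of D-dim vectors, one per pixel (hypothetical layout).
    text_embeddings:  list of D-dim vectors, one per class label.
    Returns one list of per-class scores for each pixel.
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    scores = []
    for p in pixel_embeddings:
        p = l2_normalize(p)
        scores.append([sum(a * b for a, b in zip(p, t)) for t in texts])
    return scores


def predict_classes(scores):
    """Pick the highest-scoring class index for each pixel."""
    return [max(range(len(s)), key=s.__getitem__) for s in scores]


def trainable_groups(stage):
    """Parameter groups updated in each training stage (names are hypothetical)."""
    if stage == 1:
        # Text embeddings stay fixed; only the segmentation model learns.
        return {"segmentation_model"}
    if stage == 2:
        # Assumption: the segmentation model is frozen while prompts are tuned.
        return {"prompt_embeddings"}
    raise ValueError("stage must be 1 or 2")


# Toy data: two classes and three pixels with 2-D embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.1], [0.2, 3.0], [1.0, 0.9]]
scores = pixel_text_scores(pixels, text)
labels = predict_classes(scores)
print(labels)  # [0, 1, 0]
```

In this sketch the text embeddings act as a fixed classifier over pixel features in the first stage. In a real system they would come from the CLIP text encoder rather than hand-written vectors.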

