In this paper, we present a lightweight method to enhance real-time semantic segmentation models by leveraging the power of pre-trained vision-language models. Our approach incorporates the CLIP text encoder, which provides rich semantic embeddings for text labels, and effectively transfers this textual knowledge to the segmentation model. The proposed framework integrates image and text embeddings, enabling alignment between visual and textual information. In addition, we introduce learnable prompt embeddings to capture class-specific information and enhance the model's semantic understanding. To ensure effective learning, we devise a two-stage training procedure: the segmentation backbone first learns from fixed text embeddings, and the prompt embeddings are then optimized in the second stage. Extensive experiments and ablation studies demonstrate that our method significantly improves the performance of real-time semantic segmentation models.
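As a rough illustration of the ideas summarized above, the sketch below shows one way frozen CLIP text embeddings and learnable per-class prompt embeddings could be wired to a segmentation head in PyTorch: class text embeddings act as per-pixel classifier weights, and a learnable prompt vector refines them in the second training stage. This is a minimal sketch under stated assumptions, not the paper's actual implementation; the module name `TextGuidedSegHead`, the residual per-class prompt (a simplification of token-level prompt tuning), the prompt template, and the example label set are all hypothetical.

```python
# Minimal sketch (assumptions, not the paper's code): frozen CLIP text
# embeddings as segmentation classifier weights + learnable class prompts.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

CLASS_NAMES = ["road", "sidewalk", "building", "sky"]  # example label set

# Pre-compute frozen CLIP text embeddings for each class name.
clip_model, _ = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in CLASS_NAMES])
    text_emb = clip_model.encode_text(tokens).float()      # (C, 512)

class TextGuidedSegHead(nn.Module):
    """Classifies each pixel by cosine similarity to class text embeddings."""
    def __init__(self, feat_dim, text_dim=512, num_classes=len(CLASS_NAMES)):
        super().__init__()
        # Project backbone features into the CLIP text embedding space.
        self.proj = nn.Conv2d(feat_dim, text_dim, kernel_size=1)
        # Learnable prompt embedding: one residual vector per class,
        # optimized only in stage 2 (a simplification for illustration).
        self.prompt = nn.Parameter(torch.zeros(num_classes, text_dim))
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, feats, text_emb, use_prompt=False):
        x = F.normalize(self.proj(feats), dim=1)            # (B, D, H, W)
        t = text_emb + self.prompt if use_prompt else text_emb
        t = F.normalize(t, dim=-1)                          # (C, D)
        # Per-pixel logits: similarity between pixel and class embeddings.
        return self.logit_scale * torch.einsum("bdhw,cd->bchw", x, t)

# Two-stage training, schematically:
#   Stage 1: text embeddings fixed, train backbone + head (prompt frozen).
#   Stage 2: freeze the backbone, optimize the prompt embeddings.
head = TextGuidedSegHead(feat_dim=128)
feats = torch.randn(2, 128, 64, 64)                        # backbone output
logits_stage1 = head(feats, text_emb, use_prompt=False)
logits_stage2 = head(feats, text_emb, use_prompt=True)
print(logits_stage1.shape)                                  # (2, 4, 64, 64)
```

In this reading, the text encoder is used only once, offline, so the segmentation model pays no inference-time cost for the language supervision, which is consistent with the lightweight, real-time framing of the abstract.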