In this paper, we present a lightweight method to enhance real-time semantic segmentation models by leveraging the power of pre-trained vision-language models. Our approach incorporates the CLIP text encoder, which provides rich semantic embeddings for text labels, and effectively transfers this textual knowledge to the segmentation model. The proposed framework integrates image and text embeddings, enabling alignment between visual and textual information. In addition, we introduce learnable prompt embeddings to capture class-specific information and enhance the model's semantic understanding. To ensure effective learning, we devise a two-stage training procedure: the segmentation backbone first learns from fixed text embeddings, and the prompt embeddings are then optimized in the second stage. Extensive experiments and ablation studies demonstrate that our method significantly improves the performance of real-time semantic segmentation models.
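As a rough illustration of the ideas summarized above, the sketch below shows one way frozen CLIP text embeddings and learnable per-class prompt embeddings could be wired to a segmentation head in PyTorch: class text embeddings act as per-pixel classifier weights, and a learnable prompt vector refines them in the second training stage. This is a minimal sketch under stated assumptions, not the paper's actual implementation; the module name `TextGuidedSegHead`, the residual per-class prompt (a simplification of token-level prompt tuning), the prompt template, and the example label set are all hypothetical.

```python
# Minimal sketch (assumptions, not the paper's code): frozen CLIP text
# embeddings as segmentation classifier weights + learnable class prompts.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # https://github.com/openai/CLIP

CLASS_NAMES = ["road", "sidewalk", "building", "sky"]  # example label set

# Pre-compute frozen CLIP text embeddings for each class name.
clip_model, _ = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a {c}" for c in CLASS_NAMES])
    text_emb = clip_model.encode_text(tokens).float()      # (C, 512)

class TextGuidedSegHead(nn.Module):
    """Classifies each pixel by cosine similarity to class text embeddings."""
    def __init__(self, feat_dim, text_dim=512, num_classes=len(CLASS_NAMES)):
        super().__init__()
        # Project backbone features into the CLIP text embedding space.
        self.proj = nn.Conv2d(feat_dim, text_dim, kernel_size=1)
        # Learnable prompt embedding: one residual vector per class,
        # optimized only in stage 2 (a simplification for illustration).
        self.prompt = nn.Parameter(torch.zeros(num_classes, text_dim))
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, feats, text_emb, use_prompt=False):
        x = F.normalize(self.proj(feats), dim=1)            # (B, D, H, W)
        t = text_emb + self.prompt if use_prompt else text_emb
        t = F.normalize(t, dim=-1)                          # (C, D)
        # Per-pixel logits: similarity between pixel and class embeddings.
        return self.logit_scale * torch.einsum("bdhw,cd->bchw", x, t)

# Two-stage training, schematically:
#   Stage 1: text embeddings fixed, train backbone + head (prompt frozen).
#   Stage 2: freeze the backbone, optimize the prompt embeddings.
head = TextGuidedSegHead(feat_dim=128)
feats = torch.randn(2, 128, 64, 64)                        # backbone output
logits_stage1 = head(feats, text_emb, use_prompt=False)
logits_stage2 = head(feats, text_emb, use_prompt=True)
print(logits_stage1.shape)                                  # (2, 4, 64, 64)
```

In this reading, the text encoder is used only once, offline, so the segmentation model pays no inference-time cost for the language supervision, which is consistent with the lightweight, real-time framing of the abstract.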