
依據區域重要性的動態雙階層式稀疏化 Vision Transformer 方法

A Dynamic Bi-level Sparse Vision Transformer Method based on Regional Importance

Advisor: 李綱

Abstract


A growing number of autonomous driving systems use deep learning to perceive the environment. However, most methods ignore the fact that, while a vehicle is driving, different parts of the surroundings matter to different degrees: the model spends the same computational cost on every region, wasting computation on unimportant areas. This study therefore proposes a method that dynamically adjusts the model's degree of sparsification for different image regions according to importance information, thereby controlling recognition performance per region. Focusing on semantic segmentation, we propose IDBS-ViT (Importance-based Dynamic Bi-Level Sparse Vision Transformer), which sparsifies the multi-head self-attention in the Pyramid Vision Transformer encoder using a token pruning framework. We further propose a bi-level pruning decision architecture: an upper-level policy network decides the pruning rate of each region, and a lower-level pruning module decides which individual tokens to keep or drop. This bi-level design lets the upper and lower modules respectively govern coarse regional scope and fine token-level decisions, managing the range and details of sparsification hierarchically. As a result, IDBS-ViT can dynamically adjust its computation according to importance: when the model is given the highest importance level, the FPS (frames per second) of SegFormer-B0 improves from 9.05 to 13.62, roughly a 50% increase, while the mIoU on the Cityscapes validation set is 74.05, only 0.8 lower than the original SegFormer-B0. Compared with conventional static-pruning-rate methods, it achieves a better performance-speed trade-off. Inference on continuous-frame scenes in the CARLA simulator further shows that IDBS-ViT can shift its attention regions under different importance designs and reduce computation according to image conditions.
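The bi-level decision described above can be sketched as follows. This is a minimal illustrative simplification, not the thesis's actual implementation: the region partition, the importance scores, and the token-saliency score (here just the embedding L2 norm, standing in for a learned score) are all hypothetical. The upper level maps each region's importance to a keep ratio; the lower level keeps the top-scoring tokens within each region.

```python
import torch

def bilevel_token_prune(tokens, region_ids, importance, base_keep=1.0):
    """Illustrative bi-level token pruning (hypothetical simplification).

    Upper level: map each region's importance score to a token keep ratio.
    Lower level: within each region, keep the highest-scoring tokens.

    tokens:     (N, D) token embeddings for one image
    region_ids: (N,)   region index of each token
    importance: dict mapping region index -> importance in [0, 1]
    """
    # Lower-level token score: L2 norm as a stand-in for a learned saliency score.
    scores = tokens.norm(dim=-1)
    keep_mask = torch.zeros(len(tokens), dtype=torch.bool)
    for r, imp in importance.items():
        idx = (region_ids == r).nonzero(as_tuple=True)[0]
        # Upper level: more important regions keep a larger fraction of tokens.
        k = max(1, int(len(idx) * base_keep * imp))
        top = scores[idx].topk(k).indices
        keep_mask[idx[top]] = True
    return tokens[keep_mask], keep_mask

# Toy usage: 8 tokens in two regions, region 1 twice as important as region 0.
tokens = torch.randn(8, 4)
region_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
pruned, mask = bilevel_token_prune(tokens, region_ids, {0: 0.5, 1: 1.0})
print(pruned.shape)  # fewer tokens flow into self-attention
```

Because pruning happens before self-attention, whose cost is quadratic in the number of tokens, even a moderate keep-ratio reduction in low-importance regions translates into a noticeable speedup.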

Parallel Abstract (English)


In autonomous driving systems, deep learning techniques are widely used for environmental perception. However, most approaches overlook the varying importance of different regions, leading to uniform computational cost across all areas and wasting resources on less critical ones. This study introduces IDBS-ViT (Importance-based Dynamic Bi-Level Sparse Vision Transformer), which dynamically adjusts sparsity levels in different image regions based on importance, optimizing recognition performance and efficiency in semantic segmentation tasks. Focusing on the Pyramid Vision Transformer encoder, IDBS-ViT applies token pruning to multi-head self-attention. A bi-level pruning framework enables coarse and fine control over sparsity, adjusting computation per region. When the model is given the highest importance level, IDBS-ViT improves the FPS (frames per second) of the SegFormer-B0 model from 9.05 to 13.62, an approximately 50% increase. On the Cityscapes validation set, the model achieves an mIoU of 73.85, with only a 0.8-point drop compared to the original SegFormer-B0, and it demonstrates a better trade-off between performance and speed than traditional static-pruning-rate methods. Continuous-frame inference in the CARLA simulator further shows that IDBS-ViT can adjust its focus regions under different importance designs and reduce computational load according to image conditions.

References


Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340–4349, 2016.
William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
