

Swin Transformer for Pedestrian and Occluded Pedestrian Detection

Advisor: Jian-Jiun Ding (丁建均)

Abstract


This study proposes a high-precision pedestrian detection model. Among the road-object recognition functions required by autonomous vehicles, pedestrian recognition is the most critical: collisions with pedestrians cause the most severe casualties, so pedestrians are the objects a vehicle must avoid colliding with above all others. The proposed model is intended for later deployment in in-vehicle systems of autonomous cars for real-time pedestrian detection. Following the great success of Transformer-based models in natural language processing in recent years, many studies have applied Transformer architectures to computer vision tasks. Among them, the Vision Transformer achieves results on par with CNNs, but its high computational cost and large number of model parameters during training are major obstacles to deployment on edge devices. The Swin Transformer, introduced by Microsoft in 2021, offers strong performance, a leaner architecture than the Vision Transformer, and broad, flexible applicability to downstream tasks, making it well suited as the feature extractor of an object detection model. This study leverages its ability to capture multi-scale features and spatial relationships in images, which fits the challenging task of pedestrian detection. The backbone is paired with a two-stage detector based on the Faster R-CNN architecture, combining a cascaded RPN and ROI head, and the RPN is trained using all anchors together with Focal Loss. Experiments on the Euro City Persons and CityPersons datasets show encouraging results; in particular, the model excels at detecting heavily occluded pedestrians, demonstrating its ability to handle challenging scenarios that traditional methods may struggle with.

Parallel Abstract (English)


This study primarily proposes a high-precision pedestrian recognition model. In the context of road object recognition, which is crucial for autonomous vehicles, pedestrian recognition holds paramount importance. Pedestrian collisions result in the most severe casualties, making pedestrians the objects that vehicles should avoid colliding with on the road. The proposed model is expected to be used in self-driving in-car systems to achieve real-time detection of pedestrians. Given the significant success of Transformer architecture models in natural language processing in recent years, researchers have explored the application of Transformer-based models in computer vision-related tasks. While the Vision Transformer (ViT) has shown promise in achieving results on par with Convolutional Neural Networks (CNNs), overcoming its high computational requirements and extensive model parameters during training poses a significant challenge, especially when targeting deployment on edge devices. In 2021, Microsoft introduced the Swin Transformer, known for its powerful performance, its more streamlined model architecture compared to the Vision Transformer, and its versatility across various downstream tasks. The Swin Transformer is particularly suitable as a feature extractor for object detection models. This research harnesses its robust capabilities to capture multi-scale features and spatial relationships in images, making it well suited for the challenging task of pedestrian detection. Combined with a two-stage detector based on the Faster R-CNN framework, which includes a cascade Region Proposal Network (RPN) and Region of Interest (ROI) Head, and the use of all anchors together with Focal Loss during RPN training, this study showcases promising results in experiments conducted on the Euro City Persons and CityPersons datasets. In particular, the model demonstrates outstanding performance in detecting heavily occluded pedestrians, highlighting its ability to handle challenging scenarios that traditional methods may struggle with.
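The RPN training objective described above can be illustrated with a per-anchor binary Focal Loss. The sketch below is a minimal illustration, not the thesis implementation; the `alpha` and `gamma` defaults follow the original Focal Loss paper, and the function name is chosen here for clarity:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single anchor.

    p: predicted foreground probability; y: 1 (pedestrian) or 0 (background).
    alpha balances the two classes; gamma down-weights easy examples, so the
    many well-classified background anchors do not swamp the loss from hard
    (e.g. heavily occluded) pedestrian anchors when training on all anchors.
    """
    p_t = p if y == 1 else 1.0 - p          # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy, well-classified background anchor (p = 0.1, y = 0) contributes
# almost no loss, while a misclassified pedestrian anchor (p = 0.1, y = 1),
# typical of heavy occlusion, still produces a large gradient signal.
easy_bg = focal_loss(0.1, 0)
hard_fg = focal_loss(0.1, 1)
```

This property is what makes it feasible to train the RPN on all anchors rather than a sampled subset: the modulating factor `(1 - p_t) ** gamma` suppresses the contribution of the overwhelming number of easy negatives automatically.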

References


W. Liu et al., "SSD: Single shot multibox detector," in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, 2016: Springer, pp. 21-37.
S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117-2125.
K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961-2969.
K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
