Design and Implementation of Multi-Object Tracking System Based on Anchor-free Joint Detection and Embedded Network

Advisor: 蔡奇謚

Abstract

Multi-object tracking (MOT) is a long-standing and challenging research topic in computer vision. To achieve more robust tracking, recent MOT methods tend to use anchor-free object detectors, which mitigate the identity ambiguity that anchor-based methods encounter when learning appearance features. In practice, however, the detection accuracy of anchor-free detectors built on convolutional neural networks (CNNs) drops noticeably in crowded scenes. To improve detection and tracking in such scenes, this thesis proposes Swin-JDE, an anchor-free joint detection and embedding (JDE) MOT method based on the Transformer architecture. Swin-JDE includes a novel up-sampling module, PatchExpand, which enriches the spatial information of feature maps through neural network learning and an Einstein notation-based rearrangement, strengthening both the detection and the tracking performance of the model. For training, we propose a two-stage scheme that trains the detection branch and the appearance branch separately to make the anchor-free predictor more robust. During training, we also remove occluded targets from the training dataset to improve the identification accuracy of the appearance embedding layer. For data association, we propose a new post-processing method that jointly considers three cues when matching tracklets to detections: detection confidence, appearance embedding distance, and intersection-over-union (IoU) distance. This improves tracking robustness. Experimental results show that the proposed method achieves 70.38% MOTA and 69.53% IDF1 on MOT20, with the number of ID switches reduced to 2026. Compared with FairMOT, the proposed method improves MOTA and IDF1 by 8.58 and 2.23 percentage points, respectively.
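
To make the data association step more concrete, the following is a minimal sketch of how the three cues named in the abstract could be fused into a single matching cost. The abstract does not specify the fusion rule, so the weighted sum, the confidence scaling, the weight `w_emb`, the threshold `max_cost`, and the greedy matcher (used here in place of an optimal assignment solver) are all assumptions made for illustration, as are the function names and the dictionary layout of tracks and detections.

```python
import math


def iou_distance(a, b):
    """Return 1 - IoU for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 1.0 - (inter / union if union > 0 else 0.0)


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def fused_cost(track, det, w_emb=0.5):
    """Combine the three cues into one cost (hypothetical fusion rule).

    ASSUMPTION: appearance and IoU distances are mixed by a weighted sum,
    and the resulting similarity is scaled by the detection confidence,
    so low-confidence detections are more expensive to match.
    """
    d = (w_emb * cosine_distance(track["emb"], det["emb"])
         + (1.0 - w_emb) * iou_distance(track["box"], det["box"]))
    d = min(1.0, d)  # cosine distance can exceed 1; clamp the mix to [0, 1]
    return 1.0 - det["score"] * (1.0 - d)


def greedy_match(tracks, dets, max_cost=0.7):
    """Greedily pair tracks and detections in order of increasing cost.

    ASSUMPTION: a greedy matcher stands in for an optimal assignment
    solver to keep the sketch dependency-free.
    """
    pairs = sorted(
        (fused_cost(t, d), ti, di)
        for ti, t in enumerate(tracks)
        for di, d in enumerate(dets)
    )
    used_tracks, used_dets, matches = set(), set(), []
    for cost, ti, di in pairs:
        if cost > max_cost:
            break  # all remaining pairs are too expensive
        if ti in used_tracks or di in used_dets:
            continue
        matches.append((ti, di))
        used_tracks.add(ti)
        used_dets.add(di)
    return matches
```

Calling `greedy_match(tracks, dets)` with lists of dictionaries holding `box`, `emb`, and (for detections) `score` returns a list of `(track_index, detection_index)` pairs. Unmatched tracks and detections would then be handled by whatever track-management policy the tracker uses, which this sketch leaves out.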

