

Learning Efficient and Effective Person Re-identification in Multi-Camera Tracking System and Beyond

Advisor: Shao-Yi Chien (簡韶逸)

Abstract


Cameras are everywhere, from small systems such as smart homes, to medium-scale systems such as campuses, to city-scale deployments. With these cameras we can build an intelligent environment, and Multi-Target Multi-Camera Tracking (MTMCT) is one of its key enabling techniques: it tracks pedestrians as they move across different camera views, and from the resulting trajectories we can analyze the walking patterns of each person in the environment. Because MTMCT is a complex problem, this dissertation focuses on one of its sub-fields, called Person Re-identification (re-ID). Given two pedestrian bounding boxes cropped from different cameras, re-ID determines from appearance alone whether they depict the same person, so a strong re-ID technique directly improves the performance of the overall MTMCT system. In this dissertation we concentrate on conditions that arise in the real world, such as the trade-off between computation and accuracy, and learning re-ID models without manual labels.

In the first part, since practical deployments more often process video sequences than single images, we focus on video-based re-ID. We design a novel self-attention architecture that learns where to attend across space and time, and then propose a spatially and temporally optimized lightweight version that reduces hardware energy consumption and computation at similar accuracy. We also examine problems in existing datasets and propose a simple yet effective pre-processing method that reduces their noise and errors, so that researchers in this area are no longer prevented by dataset flaws from developing effective solutions.

In the second part, we address semi-supervised re-ID, where only a small portion of the data is labeled. We propose a novel clustering mechanism that uses the distribution of the labeled data to cluster the unlabeled data correctly, and then trains the model with the resulting pseudo-labels.

In the third part, we study unsupervised re-ID, learning a model in the target environment with no labels at all. We again assign pseudo-labels by clustering, but propose two novel rectification mechanisms that correct the erroneous pseudo-labels caused by clustering mistakes.

In addition, hardware constraints often make it impossible to run complex neural network models in practice, and filter pruning, which removes unimportant filters, is one solution. This dissertation proposes two pruning methods. The first is layer-wise pruning: we define each layer's sensitivity by its impact on the loss function and prune starting from the least sensitive layers. The second is global pruning, which estimates the importance of every filter globally. In particular, we combine the importance estimate with each filter's impact on the target hardware resource, so that the importance of each filter can be estimated more accurately under the final resource budget.

Finally, combining the proposed pruning and re-ID techniques, we build a real-time MTMCT system that uses a single machine to simulate distributed computing in a real environment, performing pedestrian detection, tracking, and re-identification. With the proposed computational optimizations, the system runs in real time, processing more than thirty frames per second. Extensive experiments show that all the proposed methods are both accurate and computationally efficient and can be effectively deployed in real-world settings.
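The idea of self-attention across space and time can be illustrated with a minimal sketch. The function below flattens a clip's frames and spatial positions into one set of tokens and applies scaled dot-product attention over all of them; the function name, shapes, and the absence of learned projections are illustrative assumptions, not the dissertation's actual architecture.

```python
import numpy as np

def spacetime_attention(x):
    """Minimal joint space-time self-attention over a video clip.

    x: array of shape (T, N, C) -- T frames, N spatial positions, C channels.
    Flattens space and time into a single token axis so every position can
    attend to every other position in the clip, then applies scaled
    dot-product attention (queries/keys/values share the input here).
    """
    T, N, C = x.shape
    tokens = x.reshape(T * N, C)                  # joint space-time tokens
    scores = tokens @ tokens.T / np.sqrt(C)       # (T*N, T*N) affinities
    # Numerically stable softmax over all space-time positions.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = weights @ tokens                        # aggregate features
    return out.reshape(T, N, C)
```

Because each output token is a convex combination of the input tokens, the output stays within the range of the input features; a learned version would add query/key/value projections and restrict attention spatially or temporally for efficiency.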

Parallel Abstract (English)


Surveillance cameras are seen everywhere in the world, embedded in systems from the small, such as a smart home, to the medium, such as a smart campus, to the large, such as a smart city. With these cameras we can enable intelligent environments. Multi-Target Multi-Camera Tracking (MTMCT) plays a critical role among the core techniques: it tracks multiple people captured under different camera views. With MTMCT, we can extract the walking trajectories of specific people and further analyze their movement patterns. Since MTMCT is a complicated problem, we focus on a sub-problem well suited to research, called Person Re-identification (re-ID). Re-ID aims to match two cropped pedestrian images from different cameras using only appearance cues, and its performance directly influences that of the MTMCT system. In this dissertation, we address multiple aspects of re-ID, focusing especially on real-world scenarios such as the trade-off between computation and performance, and learning from unlabeled data. The first part concerns video-based re-ID: in a deployed system, it is more common to match two pedestrians by their image sequences over time. We present a novel model architecture that learns self-attention across space and time, and propose a spatially and temporally efficient version that maintains performance with a more lightweight structure. We also examine problems in the existing benchmarks' data and evaluation metrics, and propose an easy pre-processing technique that reduces dataset noise and helps the community focus on extracting invariant visual appearance. The second part addresses semi-supervised re-ID, learning with only a few labeled samples. We adopt a novel clustering method that uses the labeled data to guide clustering of the unlabeled data, progressively producing pseudo-labels for training the re-ID model.
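The label-guided pseudo-labeling step can be sketched with a toy stand-in: nearest-centroid assignment from the labeled identities, with a rejection threshold for samples far from every identity. All names and the threshold rule are illustrative assumptions, not the dissertation's actual clustering mechanism.

```python
import numpy as np

def pseudo_label(unlabeled, labeled, labels, tau=1.0):
    """Assign pseudo-labels to unlabeled features using labeled centroids.

    unlabeled: (M, D) feature array with no identity labels.
    labeled:   (K, D) feature array with known identities.
    labels:    (K,) integer identity per labeled feature.
    Each unlabeled sample takes the identity of its nearest labeled-class
    centroid, or -1 (left unlabeled) if no centroid is closer than tau.
    """
    ids = np.unique(labels)
    centroids = np.stack([labeled[labels == i].mean(axis=0) for i in ids])
    # Pairwise distances between unlabeled samples and identity centroids.
    dists = np.linalg.norm(unlabeled[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    pseudo = ids[nearest]
    pseudo[dists.min(axis=1) > tau] = -1  # too far from every identity
    return pseudo
```

A re-ID model would then be trained on the confidently pseudo-labeled samples, and the loop repeated as the features improve.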
The third part learns the re-ID model without any annotated labels. This work formulates the problem as cross-domain re-ID: labeled data are available in a source domain, and the goal is to learn a model on entirely unlabeled data in the target domain. We propose two rectification mechanisms that clean the noisy pseudo-labels produced by a typical clustering algorithm. On the other hand, when hardware constraints prevent a model from running in real time in a practical system, network filter pruning, which removes unimportant filters from a complicated neural network, is one solution. This dissertation proposes two pruning techniques. The first is layer-wise pruning: we measure the sensitivity of each layer, meaning the impact on the loss of a unit of weight in that layer, and start pruning from the less sensitive layers. The second is global pruning, which estimates the importance of all filters at once and removes the less important ones. Specifically, we combine the importance estimation with the hardware constraints, making it more accurate by accounting for the hardware impact of each weight. Combining these pruning techniques with the proposed re-ID algorithms, we build a real-time MTMCT system on one machine that simulates distributed cameras in an environment, performing pedestrian detection, tracking, and re-identification simultaneously. With all the proposed techniques, we greatly reduce the neural networks' computation and make the whole system operate in real time, at more than 30 FPS. The proposed algorithms are evaluated quantitatively and qualitatively on various re-ID and image classification benchmarks, and the experimental results show that our techniques are both efficient and effective in these applications.
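The flavor of hardware-aware global pruning can be sketched as follows. This toy function ranks filters by loss impact per unit of freed hardware resource and removes the cheapest-to-lose filters until a budget is met; the greedy ratio ranking is an illustrative stand-in, since the dissertation's estimator couples importance with hardware impact inside the scoring itself.

```python
def hardware_aware_prune(scores, costs, budget):
    """Greedy global filter pruning under a hardware resource budget.

    scores[i]: estimated increase in loss if filter i is removed.
    costs[i]:  hardware resource (e.g. FLOPs) freed by removing filter i.
    budget:    target total resource after pruning.
    Filters are ranked by loss impact per unit of freed resource; the
    lowest-ranked filters are removed until the total cost fits the budget.
    Returns the sorted indices of pruned filters and the remaining cost.
    """
    total = sum(costs)
    order = sorted(range(len(scores)), key=lambda i: scores[i] / costs[i])
    pruned = []
    for i in order:
        if total <= budget:
            break
        pruned.append(i)
        total -= costs[i]
    return sorted(pruned), total
```

With uniform costs this reduces to plain importance ranking; the hardware term matters when filters in different layers free very different amounts of the target resource.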

