In recent action recognition research, a variety of novel methods have been used to extract, process, and analyze video features, departing substantially from traditional approaches. In this work, a convolutional network serves as the video feature extractor, processing the input video and extracting features from it, while a single Transformer encoder serves as the feature processor, responsible for processing and analyzing those features and producing the final classification. Knowledge distillation is then applied to simplify the structure of the Transformer encoder, significantly reducing the parameter count and the model's computational requirements while improving generalization and performance. Two convolutional models were evaluated as feature extractors: DenseNet and SlowFast Networks. As a feature extractor, SlowFast Networks extracts richer and more salient features from the video, enabling the Transformer encoder to receive and process information closely related to the action, which yields better action recognition performance.
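The pipeline described above (CNN feature extractor, a single Transformer encoder, classification head, plus a distillation loss) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny 2D convolutional backbone stands in for DenseNet or SlowFast, and all dimensions, layer counts, and the temperature/weighting hyperparameters are assumed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoActionClassifier(nn.Module):
    """Sketch: CNN backbone extracts per-frame features, a single
    Transformer encoder models the temporal sequence, then a linear
    head classifies. The real system would use DenseNet or SlowFast
    as the backbone instead of this stand-in conv layer."""
    def __init__(self, feat_dim=256, num_classes=10, num_layers=2, nhead=4):
        super().__init__()
        # Stand-in backbone; applied frame-by-frame for simplicity.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, feat_dim, 1, 1)
        )
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, video):  # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.backbone(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        encoded = self.encoder(feats)           # (B, T, feat_dim)
        return self.head(encoded.mean(dim=1))   # temporal average pooling

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard soft-target distillation: blend the KL divergence to the
    teacher's temperature-softened outputs with the usual cross-entropy
    on the ground-truth labels (T and alpha are assumed hyperparameters)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

In this setup the distilled (smaller) encoder plays the student, trained against the logits of a larger teacher encoder, which is how the parameter reduction described above can be obtained without sacrificing accuracy.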