Thesis

Temporal Pyramid Networks with Enhanced Relation Mechanism for Online Action Detection

Advisor: 傅立成 (Li-Chen Fu)


Abstract


Nowadays, video content analysis attracts wide attention from both industry and academia, and one of its major branches is action detection. Most existing works treat action detection as an offline problem. However, in real-world applications such as autonomous driving, assistive robots, and surveillance systems, actions must be detected at every moment in time, so an online setting is more practical. Online action detection aims to identify actions as soon as each frame of a streaming video arrives. An input video sequence contains not only frames of the action of interest but also background (non-action) and other irrelevant frames, which cause the network to learn less discriminative features. This thesis proposes an Enhanced Relation Layer embedded in a Temporal Convolutional Network (TCN), which updates features according to their relevance to the action of interest and their actionness. Features relevant to the action of interest should be treated as essential, while irrelevant features should be suppressed. The Enhanced Relation Layer assigns each timestep a relevance score, indicating its relevance to the action of interest, and an actionness score, indicating the probability that an action is occurring. These two scores guide the network to focus on the more essential features and to learn a more discriminative representation for identifying the action happening at the current timestep. The temporal structure of the input sequence is learned by the TCN, in which the output features of different layers have different receptive fields covering different temporal scales. However, lower-level features are semantically weak, so we design a Temporal Pyramid Network with a top-down architecture that propagates the strong semantics of higher levels down to lower levels, building semantically strong feature sequences at all levels. In this way, actions of different temporal lengths can be identified with multi-temporal-scale features. In the experiments, we apply our method to two benchmark datasets, THUMOS'14 and TVSeries; it outperforms several baseline networks and achieves promising results compared with state-of-the-art methods.
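The two mechanisms summarized above, per-timestep gating by relevance and actionness scores and top-down fusion across TCN feature levels, can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the thesis implementation: the module names, the dot-product form of the relevance score, and the FPN-style fusion details are all assumptions made only for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EnhancedRelationGate(nn.Module):
    # Sketch of the "Enhanced Relation Layer" idea: reweight each timestep by a
    # relevance score (here assumed to be similarity to the current-frame
    # feature) and an actionness score (probability that an action is occurring).
    def __init__(self, dim):
        super().__init__()
        self.actionness = nn.Conv1d(dim, 1, kernel_size=1)  # per-timestep action score

    def forward(self, x):                    # x: (batch, dim, time)
        query = x[:, :, -1:]                 # current timestep = action of interest
        rel = torch.sigmoid((x * query).sum(dim=1, keepdim=True) / x.size(1) ** 0.5)
        act = torch.sigmoid(self.actionness(x))
        return x * rel * act                 # attenuate irrelevant / non-action timesteps

class TemporalPyramid(nn.Module):
    # Sketch of the top-down pyramid: upsample coarser (higher-level) features
    # and add them to finer levels, so every level becomes semantically strong.
    def __init__(self, dims, out_dim):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv1d(d, out_dim, 1) for d in dims)

    def forward(self, feats):                # feats: low-to-high level, time length shrinking
        tops = [self.lateral[-1](feats[-1])]
        for lat, f in zip(reversed(self.lateral[:-1]), reversed(feats[:-1])):
            up = F.interpolate(tops[-1], size=f.size(-1), mode="linear",
                               align_corners=False)
            tops.append(lat(f) + up)
        return tops[::-1]                    # multi-scale, semantically strong features

For example, EnhancedRelationGate(256) applied to a (2, 256, 64) tensor returns a gated tensor of the same shape, and the pyramid then merges such features from several TCN levels before per-frame classification. The top-down pattern mirrors feature pyramid networks applied along the time axis, which matches how the abstract describes the design.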

