
Extend Attention Mechanism to Weakly-Supervised Action Detection

Advisor: 陳銘憲 (Ming-Syan Chen)

Abstract


Action detection aims to localize where and when actions occur in a video. Supervised-learning approaches perform well on this task, but they also raise the demands and cost of data annotation. In contrast, weakly-supervised learning methods require only video-level labels, greatly reducing labor cost. However, predicting frame-level locations from video-level labels alone is highly challenging, and little research has been done in this area. In this thesis, we extend the concept of the attention mechanism to weakly-supervised action detection trained with only video-level labels, and propose two frameworks that handle multi-class and multi-label videos respectively. For multi-class data, we combine a 3D convolutional model with the attention mechanism and propose the Inception-Attention 3D Convolutional Network (IA-C3D). Guided by our experiments, we then revise the architecture to address the weaknesses of IA-C3D and introduce a sparse confidence loss, yielding the Sparse-Confidence 3D Convolutional Network (SC-C3D) for multi-label data. Unlike previous attention-based models, we treat the generated attention maps as action location maps rather than as masks. Both frameworks learn action localization from videos automatically, and the experimental results demonstrate the effectiveness of our methods for weakly-supervised action detection.
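The abstract states the key design choice only at a high level: the attention subnet's output is read directly as an action location map instead of being applied as a mask. A minimal PyTorch sketch of that idea follows; the toy backbone, layer sizes, sigmoid normalization, and attention-weighted pooling are illustrative assumptions, not the IA-C3D architecture from the thesis.

```python
import torch
import torch.nn as nn

class AttentionActionDetector(nn.Module):
    """Illustrative sketch (not the thesis architecture): a 3D-conv trunk
    with an attention subnet whose output is kept as a spatio-temporal
    action location map rather than used as a mask."""

    def __init__(self, num_classes: int, feat_channels: int = 64):
        super().__init__()
        # Toy 3D convolutional backbone standing in for the C3D trunk.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Attention subnet: one confidence score per spatio-temporal location.
        self.attention = nn.Conv3d(feat_channels, 1, kernel_size=1)
        # Video-level classifier over attention-pooled features.
        self.classifier = nn.Linear(feat_channels, num_classes)

    def forward(self, clip: torch.Tensor):
        # clip: (batch, 3, frames, height, width)
        feats = self.backbone(clip)                  # (B, C, T, H, W)
        attn = torch.sigmoid(self.attention(feats))  # (B, 1, T, H, W)
        # Attention-weighted pooling yields a video-level representation,
        # so only video-level labels are needed for training.
        weighted = (feats * attn).sum(dim=(2, 3, 4))
        pooled = weighted / attn.sum(dim=(2, 3, 4)).clamp_min(1e-6)
        logits = self.classifier(pooled)             # (B, num_classes)
        # attn itself is returned as the frame-level action location map.
        return logits, attn
```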

Abstract (English)


Action detection aims to localize actions in the spatial and temporal dimensions of video clips. Supervised-learning approaches require numerous high-quality bounding-box annotations. In contrast, weakly-supervised methods need only video-level labels for training, which saves a tremendous amount of labor. However, learning frame-level localization from only video-level labels is very challenging, and little research has been conducted in this field. In this thesis, based on the attention mechanism and 3D convolutional kernels, we introduce two frameworks for weakly-supervised action detection, designed to handle multi-class and multi-label videos respectively. The first is the Inception-Attention 3D Convolutional Network (IA-C3D), in which an attention subnet predicts the score of each region containing an action. Unlike previous attention-based models, we treat the attention maps as action location maps rather than as masks, and learn the attention subnet directly. The second is the Sparse-Confidence 3D Convolutional Network (SC-C3D), proposed to handle videos in which multiple actions occur; in addition, we introduce a sparse confidence loss that enables SC-C3D to learn more precisely. Both frameworks learn action localization from videos automatically, and the experimental results demonstrate the capability and strength of our approaches.
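The abstract names a sparse confidence loss but does not define it here. One plausible reading, sketched below under our own assumptions, is a multi-label classification loss plus an L1 penalty that pushes most confidence scores toward zero so that only a few locations fire; the exact form used in the thesis may differ.

```python
import torch
import torch.nn.functional as F

def sparse_confidence_loss(logits: torch.Tensor,
                           attn: torch.Tensor,
                           targets: torch.Tensor,
                           sparsity_weight: float = 0.1) -> torch.Tensor:
    """Hypothetical formulation: multi-label BCE on the video-level logits
    plus an L1 sparsity term on the attention (confidence) maps."""
    cls_loss = F.binary_cross_entropy_with_logits(logits, targets)
    sparsity = attn.abs().mean()  # L1 penalty: most locations should stay quiet
    return cls_loss + sparsity_weight * sparsity

# Usage with the AttentionActionDetector sketched above; shapes are arbitrary.
model = AttentionActionDetector(num_classes=24)
clip = torch.randn(2, 3, 16, 112, 112)  # two 16-frame RGB clips
logits, attn = model(clip)
targets = torch.zeros(2, 24)
targets[0, 3] = 1.0                      # multi-hot targets for multi-label video
targets[0, 7] = 1.0
loss = sparse_confidence_loss(logits, attn, targets)
loss.backward()
```

Note that reading the attention map as the localization output means no frame-level supervision enters the loss; in this sketch, the sparsity term is the only pressure keeping the maps from spreading over the whole clip.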
