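Temporal action detection refers to the task of taking a video that contains multiple actions and, in addition to identifying which action classes it contains, precisely localizing when each action occurs, including its start and end times. With the development of deep learning, many studies have moved from conventional computer vision methods to deep learning approaches, and the field of temporal action detection has made great progress as a result. Temporal action detection has many applications, such as video surveillance and video retrieval. In this thesis, we argue that information about the objects appearing in each image is of great help in detecting actions. Therefore, instead of using a three-dimensional convolutional network to generate video features, we propose an architecture built from two object detection networks: the first network detects the objects that appear in each frame, and the second network performs action detection. As part of this architecture, we propose a data conversion method that stacks the detection results of the first network along the temporal axis to form a new type of six-channel data carrying both spatial and temporal information, which serves as the input to the second network. Experiments confirm that our method achieves good results.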
With the development of deep learning, temporal action detection has made great progress. Many approaches now rely on deep learning rather than conventional computer vision techniques. Temporal action detection has many applications, such as video surveillance and video retrieval. Considering that some actions can be recognized from the objects that appear and move in a video, this thesis proposes a hierarchical model consisting of two object detection networks. The first network detects objects in each frame, and the second performs temporal action detection. We also propose a method that converts the object detection results of the first network into a new type of data that can be fed to the second network. This data is a six-channel image carrying spatiotemporal information, which benefits temporal action detection. We conduct experiments on THUMOS14, a dataset for temporal action detection, and our approach achieves satisfactory performance.
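
To make the data conversion step more concrete, the following is a minimal sketch of how per-frame detections might be stacked along the temporal axis into a six-channel image. The abstract does not specify the layout, so everything here is an assumption made for illustration: the `Detection` type, the function name `stack_detections`, the choice of one row per object class and one column per frame, and the six channels themselves (presence, confidence, and normalized box center, width, and height). It should not be read as the actual encoding used in the thesis.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """One hypothetical object detection in a single frame (pixel coordinates)."""
    class_id: int
    score: float
    x1: float
    y1: float
    x2: float
    y2: float


NUM_CHANNELS = 6  # assumed layout: presence, score, cx, cy, w, h


def stack_detections(
    frames: List[List[Detection]],
    num_classes: int,
    frame_w: float,
    frame_h: float,
) -> List[List[List[float]]]:
    """Stack per-frame detections into image[class_id][frame_index][channel].

    Rows index object classes and columns index time, so the result mixes
    spatial information (box geometry) with temporal information (frame order).
    """
    num_frames = len(frames)
    image = [
        [[0.0] * NUM_CHANNELS for _ in range(num_frames)]
        for _ in range(num_classes)
    ]
    for t, detections in enumerate(frames):
        for d in detections:
            pixel = image[d.class_id][t]
            # Keep only the highest-scoring detection per class and frame.
            if d.score <= pixel[1]:
                continue
            pixel[:] = [
                1.0,                             # presence flag
                d.score,                         # detector confidence
                (d.x1 + d.x2) / 2.0 / frame_w,   # normalized box center x
                (d.y1 + d.y2) / 2.0 / frame_h,   # normalized box center y
                (d.x2 - d.x1) / frame_w,         # normalized box width
                (d.y2 - d.y1) / frame_h,         # normalized box height
            ]
    return image
```

Under these assumptions, an action spanning several frames appears as a horizontal band in the rows of the objects involved, which is the kind of pattern a second object detection network could localize along the time axis.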