
Learning Key Evidence for Detecting Complex Events in Videos

Advisor: 陳銘憲 (Ming-Syan Chen)


Abstract


Video event detection is one of the most important, yet very challenging, research topics in computer science. Recognizing complex events, e.g., “birthday party”, “wedding ceremony”, or “attempting a bike trick”, is even more difficult, since complex events consist of various human interactions with different objects, in diverse environments, over variable time intervals. Currently, the most common approach is to extract features from frames or video clips, and then quantize and pool these features into a single vector representation of the entire video. While this method is simple and efficient, the final pooling step may discard temporally local information and introduce many irrelevant features from the noisy background. Unlike previous methods, we observe that humans require only a small amount of evidence to recognize an event in a video. For example, a “birthday party” event can be identified by discovering a “birthday cake” and “blowing candles”. Inspired by this idea, we propose a novel way to detect complex events: first identify the key evidence that proves the existence of an event, and then utilize that evidence to recognize videos. Under our framework, each video is represented as multiple “instances”, defined as video segments of different temporal intervals. We then apply learning methods to identify the evidence (positive instances) and utilize it to recognize complex video events. In this thesis, we propose two learning methods. The first, called maximal evidence learning (MEL), is based on a large-margin formulation that treats instance labels as hidden latent variables and infers the instance labels and the instance-level classification model simultaneously. MEL infers optimal solutions by learning as many positive instances as possible from positive videos and as many negative instances as possible from negative videos. The second, called evidence selective ranking (ESR), is based on static-dynamic instance embedding and employs infinite push ranking to select the most distinctive evidence. Extensive experiments on large-scale video event datasets show significant performance gains from both methods. We also demonstrate that the selected key evidence is meaningful to humans and can be used to locate the video segments that signify an event.
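
The abstract states both objectives only informally; the LaTeX sketch below illustrates the kind of formulation each method implies. All notation here (videos indexed by i with labels Y_i, instances x_{ij} with latent labels y_{ij}, weight vector w, positive/negative instance sets \mathcal{P} and \mathcal{N}, trade-off constants C and \lambda) is assumed for illustration and is not taken from the thesis.

% Schematic MEL-style large-margin objective with latent instance labels
% (illustrative only). The last term rewards assigning as many positive
% labels as possible in positive videos (Y_i = +1) and as many negative
% labels as possible in negative videos (Y_i = -1).
\min_{w,\,\{y_{ij}\}} \; \tfrac{1}{2}\|w\|^2
  + C \sum_{i,j} \max\bigl(0,\; 1 - y_{ij}\, w^{\top} x_{ij}\bigr)
  - \lambda \sum_{i} Y_i \sum_{j} y_{ij}

% Schematic ranking criterion in the spirit of ESR's infinite push
% (illustrative only): a hinge surrogate of the l-infinity push, which
% penalizes the single worst negative instance ranked above positives,
% so that evidence instances are pushed to the top of the ranking.
\min_{w} \; \tfrac{1}{2}\|w\|^2
  + C \max_{x^{-} \in \mathcal{N}} \frac{1}{|\mathcal{P}|}
    \sum_{x^{+} \in \mathcal{P}} \max\bigl(0,\; 1 - w^{\top}(x^{+} - x^{-})\bigr)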

