
Egocentric Activity Recognition by Leveraging Multiple Mid-level Representations

Advisor: 徐宏民 (Winston H. Hsu)

Abstract


Existing approaches for egocentric activity recognition mainly rely on a single modality (e.g., detecting interacting objects) to infer the activity category. However, due to the inconsistency between the camera angle and the subject's visual field, important objects may be partially occluded or missing in the video frames. Moreover, where the objects are and how the subject interacts with them are usually ignored in prior work. To resolve these difficulties, we propose leveraging multiple mid-level representations to improve egocentric activity classification accuracy. Specifically, we utilize multimodal representations (e.g., background context, objects manipulated by the user, and motion patterns of the hands) to compensate for the insufficiency of a single modality, and jointly consider what a subject is interacting with, where the interaction takes place, and how it is performed. To evaluate the method, we introduce a new and challenging egocentric activity dataset (ADL+) that contains video and wrist-worn triaxial accelerometer data of people performing daily-life activities. Our approach significantly outperforms the state-of-the-art method in classification accuracy on the public ADL dataset (from 36.8% to 46.7%) and on our ADL+ dataset (from 32.5% to 60.0%). In addition, we conduct a series of analyses to explore the relative merits of each modality for egocentric activity recognition.
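
The abstract does not specify how the multiple mid-level representations are combined, but a common way to realize this kind of multimodal pipeline is score-level (late) fusion of per-modality classifiers. The following is a minimal sketch of that idea, not the thesis's actual implementation; the modality names, score vectors, and uniform weights are all illustrative assumptions.

    # Minimal sketch (assumed, not the thesis's method): weighted late fusion
    # of class scores from three hypothetical mid-level modalities.
    import numpy as np

    def fuse_modality_scores(scores_by_modality, weights=None):
        """Weighted score-level fusion of per-modality class-score vectors.

        scores_by_modality: dict mapping modality name -> (num_classes,) array
        weights: optional dict of modality weights; defaults to uniform.
        Returns the index of the predicted activity class.
        """
        names = sorted(scores_by_modality)
        if weights is None:
            weights = {name: 1.0 / len(names) for name in names}
        # Accumulate the weighted scores, then pick the highest-scoring class.
        fused = sum(weights[n] * np.asarray(scores_by_modality[n]) for n in names)
        return int(np.argmax(fused))

    # Toy usage with three 4-class score vectors (e.g., softmax outputs).
    scores = {
        "background":  np.array([0.1, 0.6, 0.2, 0.1]),  # where: scene context
        "object":      np.array([0.2, 0.5, 0.2, 0.1]),  # what: manipulated object
        "hand_motion": np.array([0.3, 0.4, 0.2, 0.1]),  # how: wrist accelerometer
    }
    print(fuse_modality_scores(scores))  # -> 1

Late fusion is only one plausible design; feature-level fusion or a jointly trained model would combine the what/where/how cues earlier in the pipeline, at the cost of requiring all modalities at training time.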
