
Attribute and Event Detection Using Color and Depth Information in Surveillance Systems

Full Body Human Attribute Detection and Interaction Event Recognition in Indoor Surveillance Environment Using Color-Depth Information

Advisor: 徐宏民

Abstract


The first half of this thesis proposes a framework for detecting human attributes and interaction events in indoor surveillance environments. Attribute recognition has become a popular research topic in computer vision and multimedia in recent years, yet traditional approaches to object detection and recognition still face many difficulties and are almost inevitably affected by factors such as lighting changes and the viewpoints and poses of objects or people. Taking advantage of the steady progress of depth sensors built on infrared and related hardware, and in contrast to traditional purely visual approaches, we propose a multi-view, part-based human attribute recognition system that uses both color and depth information. We target several attributes that are particularly important in surveillance settings and learn our recognition models from features computed from, or refined with the help of, 3D information. To validate the proposed method and evaluate the system, we compare against several state-of-the-art recognition methods; the results show that under low resolution and large variations in viewpoint and pose, our method outperforms these baselines. Building on this framework, many applications can be realized, such as filtering surveillance footage and locating suspects.

In the second half, in contrast to the low-level features traditionally extracted from images, mid-level, or more semantic, feature representations have achieved breakthrough results in much prior work; features in a semantic space not only provide a compact representation of the recognition target but also avoid some of the noise present in low-level feature spaces. We propose to represent human actions by decomposing them into combinations of body parts and poses. Whereas previous work on human actions has focused mainly on single-person action recognition, this study addresses person-to-person interaction events; in this setting, problems that traditional action recognition already struggles with, such as occlusion, become even harder. Our experiments for this part show that the proposed method outperforms the compared baselines, especially on several action categories that are difficult for traditional approaches.
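To make the attribute-detection half more concrete, the following is a minimal illustrative sketch of a per-part, per-attribute classification pipeline of the kind described above, assuming simple color and depth histograms as part descriptors and linear SVMs as classifiers; the part layout, feature choices, and all function names are assumptions made for this example, not the thesis's actual implementation.

    # Illustrative sketch only: per-part color-depth descriptors feeding one
    # binary classifier per attribute. All names and feature choices are assumed.
    import numpy as np
    from sklearn.svm import LinearSVC

    BODY_PARTS = ["head", "torso", "left_arm", "right_arm", "legs"]  # assumed part layout

    def part_feature(color_patch, depth_patch):
        """One body part -> concatenated color/depth histogram descriptor."""
        color_hist = np.histogram(color_patch, bins=32, range=(0, 255))[0]
        depth_hist = np.histogram(depth_patch, bins=32, range=(0, 4000))[0]  # depth in mm (assumed)
        feat = np.concatenate([color_hist, depth_hist]).astype(float)
        return feat / (np.linalg.norm(feat) + 1e-8)

    def person_feature(color_parts, depth_parts):
        """Stack per-part descriptors into one person-level feature vector."""
        return np.concatenate([part_feature(color_parts[p], depth_parts[p]) for p in BODY_PARTS])

    def train_attribute_classifiers(samples, attribute_labels):
        """Train one binary classifier per attribute (e.g. 'backpack', 'glasses').

        samples: list of (color_parts, depth_parts) dicts keyed by body part.
        attribute_labels: dict mapping attribute name -> array of 0/1 labels.
        """
        X = np.vstack([person_feature(c, d) for c, d in samples])
        return {attr: LinearSVC(C=1.0).fit(X, y) for attr, y in attribute_labels.items()}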

Keywords

attributes, depth, surveillance, action, semantics, pose, body parts

Abstract (English)


Attributes have gained much attention in computer vision and multimedia research in recent years. With the advent of depth-enabled sensors and increasing needs in surveillance systems, this thesis proposes a novel framework to detect fine-grained human attributes (e.g., having a backpack, talking on a cell phone, wearing glasses) in surveillance environments. Traditional detection and recognition methods often suffer from problems such as variations in lighting conditions, poses, and viewpoints of object instances. To tackle these problems, we propose a multi-view, part-based attribute detection system based on color-depth inputs rather than color images alone. We address several important attributes in surveillance environments and train multiple attribute classifiers based on features inferred from 3D information to construct our discriminative model. To justify our approach and evaluate the performance of our system, several state-of-the-art methods are compared, and the experimental results show that our method is more robust under large variations in surveillance conditions and human-related factors such as pose, orientation, and deformation of body parts. With the capabilities of our system, many applications can be built, such as pre-filtering surveillance video frames by specific attributes and finding suspects or missing people.

Mid-level feature representations, or semantic features, have shown discriminative power beyond low-level features in many recent works. The semantic feature space not only gives a compact representation but is also invariant to certain low-level feature noise. In this thesis, we propose to represent actions in video clips by sets of combinations of body parts and human poses, which differs from traditional image-based feature representations. While previous works mainly focus on single-person actions, we investigate the problem of human interaction events. Compared with single-person actions, interaction events are more complex since they are performed by more than one person, and traditional problems such as occlusion become much more challenging. Our experiments show that representing actions by parts and poses outperforms our baseline methods, especially on cases that are difficult for traditional methods.
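As an illustration of the parts-and-poses idea in the second half, the sketch below encodes an interaction clip as a normalized histogram over (body part, pose) combinations and trains an event classifier on top of it; the part set, pose vocabulary, input format, and helper names are assumptions made for this example rather than the method's actual design.

    # Illustrative sketch only: a clip becomes a histogram over (part, pose)
    # pairs; the vocabulary and input format are assumed for this example.
    import numpy as np
    from itertools import product
    from sklearn.svm import LinearSVC

    PARTS = ["head", "torso", "arms", "legs"]            # assumed part set
    POSES = ["raised", "extended", "bent", "neutral"]    # assumed pose vocabulary
    VOCAB = {pair: i for i, pair in enumerate(product(PARTS, POSES))}

    def clip_descriptor(frames):
        """Encode a clip as a normalized histogram over (part, pose) activations.

        frames: list of dicts, one per detected person per frame, mapping a
        body part name to the pose label detected for that part.
        """
        hist = np.zeros(len(VOCAB))
        for frame in frames:
            for part, pose in frame.items():
                hist[VOCAB[(part, pose)]] += 1.0
        return hist / (hist.sum() + 1e-8)

    def train_event_classifier(clips, event_labels):
        """Train a multi-class event classifier (e.g. handshake vs. hug) on clip descriptors."""
        X = np.vstack([clip_descriptor(c) for c in clips])
        return LinearSVC(C=1.0).fit(X, np.asarray(event_labels))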

Keywords (English)

attributes, depth, surveillance, action, semantic, pose, part

