
基於視覺化訊息之人體動作及手勢辨識之研究

A Study on Vision-based Human Action and Hand Gesture Recognition

Advisor: 林維暘

Abstract


This dissertation develops three systems for hand gesture and action recognition.

The first system recognizes and retrieves the meaning of sign language from hand motion trajectories. It first projects each trajectory into a high-dimensional space with Kernel Principal Component Analysis (KPCA), where trajectories become easier to separate. Nonparametric Discriminant Analysis (NDA) then extracts the most discriminative features from the KPCA space. The synergy of KPCA and NDA yields better separability between trajectory classes. We validate the system on the Australian Sign Language (ASL) data set, and the results show that it outperforms recent methods in both trajectory classification and retrieval.

The second system takes video as input and classifies the action performed in each clip. It is based on a local-learning boosting algorithm, whose central idea is to use local learners to form a highly accurate classification rule. The system first extracts clouds of interest points and uses them to build a set of discriminative features that capture both static characteristics (e.g., the aspect ratio of the human figure) and body dynamics. Local learning is then applied to these features to build locally adaptive classifiers, one per training sample. Because each local classifier describes the local data distribution well, combining the local classifiers can be expected to improve classification accuracy. Experiments on the KTH data set show that the proposed system achieves performance comparable to recent methods; moreover, compared with AdaBoost, which uses global learning, local learning is more efficient in the training iterations.

The third system recognizes both hand gestures and actions. For gestures, we use video directly as input rather than trajectories; for actions, we evaluate the system on more realistic action videos. The system recognizes gestures and actions with dual complementary tensors: the input video is normalized into two simple yet discriminative tensors, one produced from the raw video volume and the other from Histogram of Oriented Gradients (HOG) features. Each tensor is factored into matrices, and the similarity between factored matrices is evaluated with Canonical Correlation Analysis (CCA). We further propose an information-fusion method that combines the matrix similarities across the two tensors; this fusion effectively enhances the discriminability between action categories and improves recognition accuracy. We evaluate the system on two public databases (UCF sports and Cambridge-Gesture), and the results show recognition accuracy comparable to recent methods.
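The projection pipeline of the first system can be illustrated with a minimal numpy sketch: an RBF-kernel KPCA with the standard double-centering of the kernel matrix, followed by a two-class Fisher discriminant as a simple parametric stand-in for NDA (NDA itself has no common library implementation). The ring-shaped data below is a synthetic stand-in for the ASL trajectory features, not the dissertation's actual data.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    # Pairwise RBF kernel between the rows of X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca_fit_transform(X, n_components=2, gamma=1.0):
    # Kernel PCA: double-centre the kernel matrix, then eigendecompose.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    J = np.full((n, n), 1.0 / n)
    Kc = K - J @ K - K @ J + J @ K @ J          # centred kernel matrix
    vals, vecs = np.linalg.eigh(Kc)             # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components] # keep the largest ones
    vals, vecs = vals[idx], vecs[:, idx]
    alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))
    Z = Kc @ alphas                             # projected training data
    return Z, (X, K, J, alphas, gamma)

def kpca_transform(model, Xnew):
    # Project new samples using the stored training kernel statistics.
    X, K, J, alphas, gamma = model
    Knew = rbf_kernel(Xnew, X, gamma)
    Jn = np.full((Xnew.shape[0], X.shape[0]), 1.0 / X.shape[0])
    Kc = Knew - Jn @ K - Knew @ J + Jn @ K @ J
    return Kc @ alphas

def fisher_direction(Z, y):
    # Two-class Fisher discriminant in KPCA space (parametric stand-in
    # for the nonparametric discriminant analysis used in the thesis).
    m0, m1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
    Sw = np.cov(Z[y == 0].T) + np.cov(Z[y == 1].T)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Z.shape[1]), m1 - m0)
    return w / np.linalg.norm(w)

# Synthetic two-class data: two concentric rings of radius 1 and 3.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 40)
r = np.where(np.arange(40) < 20, 1.0, 3.0)
X = np.c_[r * np.cos(t), r * np.sin(t)] + 0.05 * rng.standard_normal((40, 2))
y = (np.arange(40) >= 20).astype(int)

Z, model = kpca_fit_transform(X, n_components=2, gamma=0.5)
w = fisher_direction(Z, y)   # most discriminative direction in KPCA space
```

The double-centering step is what makes the eigendecomposition equivalent to PCA in the implicit feature space; projecting a new sample only requires kernel evaluations against the training set.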

Parallel Abstract


In this dissertation, we develop three systems for gesture and action recognition.

The first system uses the trajectory of hand motion as the data source for sign language classification and retrieval. In this system, a trajectory is first projected by Kernel Principal Component Analysis (KPCA), which can be considered an implicit mapping to a much higher-dimensional feature space. The high dimensionality can effectively improve the accuracy of recognizing motion trajectories. Then, Nonparametric Discriminant Analysis (NDA) is used to extract the most discriminative features from the KPCA feature space. The synergistic effect of KPCA and NDA leads to better class separability and makes the proposed trajectory representation a more powerful discriminator. The experimental validation of the proposed method is conducted on the Australian Sign Language (ASL) data set. The results show that our method performs significantly better, in both trajectory classification and retrieval, than the state-of-the-art techniques.

The second system takes a video clip as input and classifies its action category. This system is based on a local-learning boosting algorithm. The idea of local learning is to use local learners to form a highly accurate classification rule. In this system, we first extract the cloud of interest points from the video and use it to construct more discriminative features. These features encode not only static characteristics, such as the aspect ratio of the human figure, but also the body dynamics of each action class. We then perform efficient local learning on the extracted features to learn locally adaptive classifiers; in particular, a local classifier is trained specifically for each training sample. A local classifier can better describe the local data distribution, so combining multiple local classifiers leads to better classification accuracy. We conduct several experiments on the KTH data set and obtain very encouraging results. Our approach achieves performance comparable to that of the state-of-the-art methods. Compared with a popular global-learning method, AdaBoost, local learning provides significantly better accuracy with little additional cost in training time.

The third system can recognize either gestures or actions. For gestures, we use the video clip directly as the data source; for actions, we use a more realistic data set to evaluate the performance of the proposed system. This system uses dual complementary tensors to recognize gestures and human actions. In particular, the proposed method constructs a compact yet discriminative representation by normalizing the input video volume into dual tensors. One tensor is obtained from the raw video volume and the other from Histogram of Oriented Gradients (HOG) features. Each tensor is converted to factored matrices, and the similarity between factored matrices is evaluated using Canonical Correlation Analysis (CCA). We further propose an information-fusion method to combine the resulting similarity scores. The proposed fusion strategy effectively enhances the discriminability between different action categories and leads to better recognition accuracy. We have conducted several experiments on two publicly available databases (UCF sports and Cambridge-Gesture). The results show that our proposed method achieves recognition accuracy comparable to the state-of-the-art methods.
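The matching step of the third system can be illustrated in isolation: the CCA similarity between two factored matrices reduces to the canonical correlations between their column spaces, i.e., the singular values of Q₁ᵀQ₂ for orthonormal bases Q₁ and Q₂. The numpy sketch below uses synthetic factor matrices, and the equal-weight fusion rule is a hypothetical placeholder for the dissertation's actual fusion method.

```python
import numpy as np

def cca_similarity(A, B):
    # Canonical correlations between the column spaces of two factor
    # matrices: orthonormalize each with a thin QR, then the singular
    # values of Qa^T Qb are the cosines of the principal angles between
    # the subspaces (all in [0, 1]).  Their mean serves as the score.
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    corr = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)
    return corr.mean()

def fuse_scores(s_raw, s_hog, w=0.5):
    # Hypothetical weighted fusion of the raw-volume and HOG tensor
    # similarity scores; the dissertation's fusion rule may differ.
    return w * s_raw + (1.0 - w) * s_hog

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 4))   # factor matrix of one tensor mode
M = rng.standard_normal((4, 4))    # invertible mixing: same column space
B = rng.standard_normal((50, 4))   # unrelated factor matrix

same = cca_similarity(A, A @ M)    # identical subspaces: similarity 1
diff = cca_similarity(A, B)        # random subspaces: lower similarity
```

Because the score depends only on column spaces, it is invariant to how the tensor factorization mixes the columns of each factor matrix, which is exactly what makes it a robust matrix-to-matrix similarity.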

