
Architecture Design of Feature Extraction for Real-time Action Recognition

Advisor: 陳良基 (Liang-Gee Chen)

Abstract


Computer vision has been studied for many years. With the help of machine learning algorithms, electronic devices can automatically learn useful knowledge from large data sources such as the Internet, correcting and improving themselves over time. The combination of computer vision and machine learning has enabled many applications, making our lives faster and more convenient. The ultimate goal of computer vision is to build an intelligent robot whose perception and interaction are indistinguishable from an ordinary person's. We believe the first step toward this goal is enabling machines to interpret the semantic meaning behind videos. Compared with still images, videos carry spatio-temporal information and therefore usually contain richer knowledge, so human action recognition has become one of the most important foundations of robot vision. However, the many sources of variation in video greatly increase the difficulty of analysis, and many researchers have focused on improving recognition accuracy. In past research, the algorithms that extract features from video have remained too complex to run in real time.

In this thesis, we first introduce basic computer vision applications and several feature extraction methods. After comparing the strengths and weaknesses of these algorithms, we choose space-time local features for action recognition. Considering both system efficiency and accuracy, we adopt the MoFREAK feature extraction algorithm to describe action videos: the FREAK feature captures the static (appearance) information of an action, and the MIP feature captures its dynamic (motion) information. We then design a hardware architecture based on this algorithm, using a block-based keypoint technique to save architecture bandwidth and improve hardware performance. After optimization, the synthesis results of the proposed architecture in a TSMC 40 nm process meet a real-time specification: at a 200 MHz operating frequency with full HD (1920×1080) video resolution, it requires only about 1100 K logic gates and 7.9 Kbytes of memory. Moreover, by packing neighboring keypoints together before processing, the proposed block-based keypoint technique can deliver 1.2 K block-based feature points at 120 fps with 417.6 Mbytes/sec bandwidth, or 0.5 K block-based feature points at 240 fps with 835.2 Mbytes/sec bandwidth. These maxima assume the worst case, in which all 10 points within a block-based feature qualify as keypoints and must be described; in typical cases, the architecture can deliver even more block-based feature points at the same frame rate.
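As a rough illustration of the descriptor structure described above, the following is a minimal sketch, not the thesis implementation: a MoFREAK-style descriptor concatenates a FREAK-like binary appearance part (pairwise intensity comparisons around a keypoint) with a MIP-like binary motion part (patch-difference comparisons against the previous frame). The sampling pairs, shift set, patch size, and threshold here are all illustrative assumptions; the real FREAK pattern uses a retina-like arrangement of receptive fields with 512 selected pairs.

```python
import numpy as np

# Hypothetical sampling-pair layout (offsets in pixels); the real FREAK
# pattern is a retina-like arrangement, not these four pairs.
PAIRS = [((0, 0), (0, 3)), ((0, 0), (3, 0)), ((-3, 0), (0, 0)), ((0, -3), (3, 3))]

def appearance_bits(frame, y, x):
    """FREAK-like bits: 1 if the first sample is brighter than the second."""
    return [int(frame[y + a1, x + a2] > frame[y + b1, x + b2])
            for (a1, a2), (b1, b2) in PAIRS]

def motion_bits(prev, curr, y, x, half=2, thresh=10.0):
    """MIP-like bits: compare the current patch with shifted patches in the
    previous frame; 1 if a shifted location matches clearly better than the
    co-located one (i.e., motion along that direction is likely)."""
    patch = curr[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    center = prev[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    bits = []
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        shifted = prev[y + dy - half:y + dy + half + 1,
                       x + dx - half:x + dx + half + 1].astype(float)
        bits.append(int(np.sum((patch - shifted) ** 2) + thresh <
                        np.sum((patch - center) ** 2)))
    return bits

def mofreak_like(prev, curr, y, x):
    """Concatenate appearance (static) bits and motion (dynamic) bits."""
    return appearance_bits(curr, y, x) + motion_bits(prev, curr, y, x)
```

Keeping the two parts independent, as the abstract notes, is what lets the appearance and motion models be computed (and pipelined in hardware) separately.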

English Abstract


Computer vision has been developed for decades, and with the help of machine learning algorithms, electronic devices are able to learn knowledge from big data sources such as the Internet. The combination of computer vision and machine learning has also brought a large number of applications, making our lives more convenient. The ultimate goal of computer vision is to invent an intelligent robot, and we think the first step is to understand the semantic meaning behind videos. Videos contain spatio-temporal information that implies richer knowledge than still images. Therefore, action recognition becomes a basic application that can be implemented in robot vision. The variations in videos increase the difficulty of analysis, leading many researchers to develop better algorithms aimed at raising recognition accuracy on datasets. However, in past research the computational complexity of feature extraction from videos has remained too high for real-time operation. In this thesis, we first introduce some computer vision applications and different approaches to feature extraction. Comparing several related algorithms and examining the pros and cons of each method, we choose to use space-time local features in our approach. Considering both efficiency and accuracy, we adopt the MoFREAK feature extraction algorithm to generate robust descriptors of action videos. MoFREAK is a feature that combines an appearance model and a motion model independently: we capture static information with FREAK and dynamic information with MIP, and show good performance on datasets. We then implement MoFREAK feature extraction in a hardware architecture, introducing a block-based feature technique to improve hardware performance, reduce bandwidth, and solve the problem of irregularly distributed feature points.
After optimization, the synthesis results of our proposed design achieve the real-time specification with about 1100 K gate counts and 7.9 Kbytes of memory usage, operating at 200 MHz with full HD (1920×1080) video resolution. Furthermore, thanks to the block-based keypoint technique, we can extract features from full HD video sequences and offer 1.2 K block-based feature points at 120 fps with 417.6 Mbytes/sec bandwidth, or 0.5 K block-based feature points at 240 fps with 835.2 Mbytes/sec bandwidth, assuming the worst case in which all 10 points in a block-based feature are detected as keypoints. Outside the worst case, we can offer more feature points at the same frame rate.
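The block-based keypoint idea can be sketched in software terms. The 10-point cap per block follows the abstract; the block size and the packing rule itself are assumptions for illustration. The point of packing is that the hardware can fetch one regular block of pixels for up to 10 nearby keypoints instead of issuing scattered per-point window reads, which is what regularizes memory access and bounds the bandwidth.

```python
from collections import defaultdict

def pack_keypoints(keypoints, block_w=64, block_h=64, cap=10):
    """Group (x, y) keypoints by spatial block, keeping at most `cap`
    points per block (the worst case is every block holding `cap`)."""
    blocks = defaultdict(list)
    for x, y in keypoints:
        key = (x // block_w, y // block_h)   # which block this point falls in
        if len(blocks[key]) < cap:
            blocks[key].append((x, y))
    return dict(blocks)

# Three of these five points share the top-left 64x64 block, so they
# would be fetched and described together in one block access.
kps = [(10, 12), (20, 30), (70, 12), (65, 70), (15, 15)]
packed = pack_keypoints(kps)
```

In hardware, each packed block maps to one burst of pixel reads, so the per-frame bandwidth scales with the number of blocks rather than the number of individual keypoints.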

