
Graduate Student: Tsai, Jen-Kai (蔡仁凱)
Thesis Title: Deep Learning Based Real-Time Multiple-Person Action Recognition System (以深度學習為基礎之多人即時動作辨識系統)
Advisors: Hsu, Chen-Chien (許陳鑑); Wang, Wei-Yen (王偉彥)
Degree: Master
Department: Department of Electrical Engineering
Publication Year: 2020
Academic Year of Graduation: 108 (ROC calendar; 2019-2020)
Language: Chinese
Number of Pages: 82
Keywords (Chinese): 動作辨識 (action recognition), 深度學習 (deep learning), 人物追蹤 (human tracking), 智慧型監控 (smart surveillance), 三維卷積 (3D convolution), 人臉辨識 (face recognition)
Keywords (English): action recognition, deep learning, face recognition, human tracking, smart surveillance, 3D convolution
DOI URL: http://doi.org/10.6345/NTNU202001187
Thesis Type: Academic thesis
Table of Contents:
Acknowledgements i
Abstract (Chinese) ii
Abstract (English) iii
List of Figures viii
List of Tables xi
Chapter 1 Introduction 1
  1.1 Research Background and Motivation 1
  1.2 Thesis Organization 3
Chapter 2 Literature Review 4
  2.1 Convolution 4
    2.1.1 2D Convolution 4
    2.1.2 3D Convolution 5
  2.2 Action Recognition 6
    2.2.1 Categories of Action Recognition 7
    2.2.2 Skeleton-Based Action Recognition 7
    2.2.3 Image-Based Action Recognition 14
  2.3 Object Detection 19
  2.4 Object Tracking 20
  2.5 Face Detection and Recognition 22
  2.6 Action Datasets 25
Chapter 3 Experimental Platform, Hardware, and Software 29
  3.1 Experimental Platform 29
  3.2 Hardware 30
  3.3 Execution Environment and Software 33
Chapter 4 Real-Time Action Recognition Based on Skeleton Data 36
  4.1 System Flow 36
  4.2 Sliding Window 37
  4.3 Recognition Architecture 37
  4.4 Skeleton Data Preprocessing 38
  4.5 Building the Training Data 40
  4.6 Training and Execution 42
  4.7 Experimental Results 44
Chapter 5 Deep Learning Based Real-Time Multiple-Person Action Recognition System 47
  5.1 System Architecture 47
  5.2 YOLOv3 49
  5.3 Deep SORT 49
  5.4 FaceNet 53
  5.5 Zoom In 56
  5.6 Background Blurring 57
  5.7 Sliding Windows 59
  5.8 Inflated 3D ConvNet (I3D) 61
  5.9 Non-Maximum Suppression 62
Chapter 6 Experimental Results of Real-Time Multiple-Person Action Recognition 64
  6.1 Training Data 64
  6.2 Training and Execution Flow 66
  6.3 Effect of Zoom In 68
  6.4 Effect of Background Blurring 70
  6.5 Effect of NMS 72
  6.6 System Accuracy 72
  6.7 Experimental Results in Real Environments 73
Chapter 7 Conclusions 75
  7.1 Conclusions 75
  7.2 Future Work 75
References 77
Autobiography 81
Academic Achievements 82


Full Text: Electronic full text embargoed; scheduled for public release on 2025/08/01.