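Human action recognition is a popular and challenging goal in deep learning. As hardware continues to improve, applications of action recognition have sprung up in many fields. In healthcare, image recognition is used to detect whether elderly people have fallen and whether a fitness movement may cause injury; in sports, recognizing the body and the point of ball contact helps players refine their technique; in retail, recognizing what customers do with products provides a reference for purchase intent, and recognizing unsafe behavior on escalators helps avert impending danger; in vehicles, recognizing whether truck drivers are using a mobile phone improves safety for other road users. These examples show how varied the applications are. Skeleton extraction has clearly improved accuracy in action recognition, but most work is confined to images, or uses the extracted keypoint coordinates only to render images; skeleton coordinates are rarely used directly as the training target. This study combines two-dimensional coordinate information with the time dimension, packages it into a three-dimensional "picture" format, and trains on it with the VGG architecture, which has proven effective in image classification. The result is then compared, in time cost and accuracy, with a ResNet trained on 3+1D images and a ResNet trained on 3+1D skeleton images. The results show that on the KTH dataset, skeleton-extracted images raise accuracy from 96.3% to 98.71%. Using skeleton coordinates directly yields 93.75%, three to five percentage points below the current 3+1D-based models, but cuts training time by 60% and speeds up recognition by about 45%.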
Recognizing human behavior is a popular and challenging goal in the field of deep learning. With the rapid development of hardware, applications of human behavior recognition have sprung up, such as using images in the medical field to detect falls among the elderly and to identify improper weight-training form that may cause injury. Skeleton extraction has improved the accuracy of human behavior recognition, but most approaches are confined to images, or use the extracted joint coordinates only to enhance those images; the skeleton joint coordinates are seldom used directly as the training target. In this study, the two-dimensional joint coordinates are combined with the time dimension and packaged into a three-dimensional "picture" format, which is then used to train a VGG network, an architecture that has proven very effective in image classification. We compare three methods in terms of time cost and accuracy: a 3+1D ResNet trained on images, a 3+1D ResNet trained on skeleton images, and VGG-10 trained only on joint coordinates. On the KTH dataset, using skeleton-extracted images raises accuracy from 96.3% to 98.71%. Using skeleton joint coordinates alone is slightly less accurate (93.75%) than the current 3+1D-based models; nevertheless, it reduces training time by 60% and increases recognition speed by 45%.
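
To make the "picture" packaging concrete, the following is a minimal sketch of how per-frame 2D joint coordinates might be stacked along the time axis into a fixed-size array that an image classifier could consume. It is an illustration under stated assumptions, not the pipeline used in this study: the function name `pack_skeleton_sequence`, the axis order (frames, joints, coordinates), the fixed clip length of 32 frames, the nearest-frame resampling, and the min-max normalization are all hypothetical choices.

```python
import numpy as np


def pack_skeleton_sequence(frames, num_frames=32):
    """Stack per-frame joint coordinates into a 3D "picture".

    frames: sequence of (J, 2) arrays holding (x, y) for J joints.
    Returns a float32 array of shape (num_frames, J, 2) scaled to [0, 1].
    The layout and the clip length are assumptions made for this sketch.
    """
    seq = np.asarray(frames, dtype=np.float32)  # expected shape (T, J, 2)
    if seq.ndim != 3 or seq.shape[0] == 0 or seq.shape[2] != 2:
        raise ValueError("expected a non-empty sequence of (J, 2) arrays")

    # Resample along time to a fixed length by picking the nearest frames,
    # so clips of different durations produce equally sized inputs.
    t = seq.shape[0]
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
    seq = seq[idx]

    # Min-max normalize x and y separately over the whole clip.
    mins = seq.min(axis=(0, 1), keepdims=True)
    maxs = seq.max(axis=(0, 1), keepdims=True)
    span = np.where(maxs - mins > 0, maxs - mins, 1.0)
    return ((seq - mins) / span).astype(np.float32)


if __name__ == "__main__":
    # Synthetic clip: 50 frames, 18 joints, pixel coordinates (made-up data).
    rng = np.random.default_rng(0)
    clip = [rng.uniform(0, 160, size=(18, 2)) for _ in range(50)]
    picture = pack_skeleton_sequence(clip)
    print(picture.shape, picture.dtype)
```

An array of this kind is far smaller than a stack of video frames, which is consistent with the lower training and recognition cost reported above, although the exact input size used in the study is not stated here.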