
深層特徵與關聯結合長短期記憶網路於視訊影片之犬隻的追蹤與情緒識別

Deep features and associations combine long short-term memory networks for dog tracking and emotion recognition in video surveillance

Advisor: 林春宏
Co-advisor: 黃馨逸 (Hsin-I Huang)

Abstract


This study focuses on recognizing the emotions of dogs in video in order to identify potentially aggressive dogs. It uses multiple convolutional neural network (CNN) architectures for dog detection, tracking, and emotion recognition. Dogs are first detected in each frame of the video, then tracked across frames, and finally their emotions are recognized. For per-frame dog detection, this study adopts the third version of the YOLO (you only look once) CNN architecture. For tracking, it uses realtime dogs tracking with a deep association metric (DeepDogTrack), which applies a Kalman filter to the bounding boxes produced by detection and then uses the appearance and position of the object in each box to track and identify each dog. Dog emotions were defined through manual judgments by veterinary experts and customs dog handlers and divided into three categories: angry (or aggressive), happy (or excited), and normal (or general) behavior. For emotion recognition, the sub-video of each tracked dog is extracted and checked for whether it contains enough frames to recognize the emotion; the long short-term deep features of dog memory network (LDFMN) architecture then recognizes each dog's emotion.

The dog-detection experiments used two datasets, achieving detection accuracies of 97.59% and 94.62%, respectively. Detection errors were caused by occluded facial features, unusual breeds, occluded or cropped bodies, and incomplete detection regions. The tracking experiments used three videos containing single and multiple dogs viewed from the front, back, and side; the highest single-dog tracking accuracy was 93.02%, and the highest multi-dog tracking rate was 86.45%. Tracking mainly failed when a dog entered or left the frame and the occluded portion of its body grew until tracking was lost; lowering the matching threshold could improve future tracking results. The emotion-recognition experiments used two datasets, achieving accuracies of 81.73% and 76.02%.

The poorer emotion-recognition rates were mainly due to image errors introduced by background removal, which propagated into the subsequent recognition step. Among the three categories, angry (or aggressive) behavior is the most distinctive, so its recognition rate was markedly higher than the other two. Recognition errors were caused by indistinct emotional movements, motion blur from fast movement, poor camera angles, low video resolution, difficulty separating multiple dogs, and mouths opened only slightly with movements too small to recognize.

In the end-to-end experiments, two test videos were evaluated. The results show that when a dog moves over a wide area of the image, tracking accuracy strongly affects the emotion-recognition results. Because background removal tends to corrupt the images, using the original tracked sub-images for emotion recognition works better.

The experiments demonstrate a complete pipeline for recognizing dog emotions in video. Because no complete dataset currently exists for dog tracking and dog emotions, there is still room to improve tracking and recognition accuracy. The method is expected to be applicable to street surveillance systems to detect potentially aggressive dogs early.
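The tracking stage described above uses a Kalman filter to predict where each detected bounding box will be in the next frame. The sketch below is an illustration only, not the thesis code: a minimal constant-velocity Kalman filter for a single box coordinate (in a DeepSORT-style tracker, one such state is maintained per box coordinate, and the covariance bookkeeping is done with full matrices). All names and noise values here are illustrative assumptions.

```python
class KalmanCV:
    """Minimal 2-state (position, velocity) Kalman filter,
    constant-velocity motion model, scalar measurement of position."""

    def __init__(self, x0, q=0.01, r=1.0):
        self.x = [x0, 0.0]                    # state: [position, velocity]
        self.P = [[1.0, 0.0], [0.0, 1.0]]     # state covariance
        self.q, self.r = q, r                 # process / measurement noise

    def predict(self):
        """Propagate state one frame ahead: x' = F x, P' = F P F^T + Q."""
        x, v = self.x
        self.x = [x + v, v]
        p = self.P
        self.P = [[p[0][0] + p[0][1] + p[1][0] + p[1][1] + self.q,
                   p[0][1] + p[1][1]],
                  [p[1][0] + p[1][1],
                   p[1][1] + self.q]]
        return self.x[0]

    def update(self, z):
        """Fold in a measured coordinate z (e.g. from a YOLO detection)."""
        y = z - self.x[0]                     # innovation
        s = self.P[0][0] + self.r             # innovation covariance
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s   # Kalman gain
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        p = self.P
        self.P = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
                  [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        return self.x[0]


# Usage: track one coordinate of a dog box moving 2 px/frame.
kf = KalmanCV(0.0)
for t in range(1, 21):
    kf.predict()
    kf.update(2.0 * t)
```

Because the filter also carries a velocity estimate, its predicted position stays meaningful for a few frames even when detection drops out (e.g. a partially occluded dog), which is what allows the tracker to bridge short gaps.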

Parallel Abstract


This study focuses on the recognition of dogs' emotions in video and identifies potentially aggressive dogs. The main purpose of this study was to use a multi-convolutional-neural-network (CNN) architecture for dog detection, tracking, and emotion recognition. First, the dogs in each frame of the video are detected, then they are tracked through the video, and finally the dogs' emotions are identified. This study used the third version of the YOLO (you only look once) convolutional neural network architecture for dog detection. Dog tracking uses realtime dogs tracking with a deep association metric (DeepDogTrack), which employs a Kalman filter to predict each dog's next position; this location is then used to track the dog. Finally, this study used manual judgments made by veterinary specialists and customs dog handlers to classify dog behavior into three emotional categories: angry (or aggressive), happy (or excited), and normal (or general) behavior. In the emotion-recognition process, the sub-video of each tracked dog is first extracted, and it is then determined whether the sub-video contains sufficient frames (16 frames). After processing by the long short-term deep features of dog memory networks (LDFMN) architecture, each dog's emotion is recognized.

The dog-detection experiment used two data sets for testing, with detection accuracy rates of 97.59% and 94.62%, respectively. The detection errors were caused by obscured facial features, special breeds, obscured or cropped bodies, and incomplete detection regions. The tracking experiment used three videos containing single and multiple dogs viewed from the front, back, and side; the highest single-dog tracking accuracy was 93.02%, and the highest multi-dog tracking rate was 86.45%. Tracking failed mainly when a dog entered or left the frame and more of its body became occluded; future improvements can lower the matching threshold to improve the tracking results. The emotion-recognition experiment used two data sets, with accuracy rates of 81.73% and 76.02%, respectively.

The poor emotion-recognition rate was mainly due to image errors caused by background removal, which led to errors in the subsequent recognition. Among the categories, the angry (or aggressive) emotion is the most distinctive, so its recognition rate is significantly higher than the other two. The causes of recognition errors include emotional actions that were not obvious, blurry images from dogs moving too fast, poor shooting angles, low video resolution, multiple dogs that were difficult to separate, and mouths only slightly open with movements too small to recognize.

In the overall-process experiment, two test videos were used. The results show that if a dog moves over a wide range in the image, the accuracy of tracking has a great influence on the emotion-recognition results. Since background removal is likely to cause image errors, it is better to use the original tracking sub-images for emotion recognition.

The experiments demonstrate the complete dog-emotion-recognition pipeline proposed in this study. Since there is currently no complete data set for dog tracking and dog emotions, there is still room for improvement in tracking and emotion-recognition accuracy. It is expected that this method can be applied to street monitoring systems in the future to detect potentially aggressive dogs in advance.
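The abstract suggests that tracking failures at frame boundaries might be reduced by lowering the matching threshold. The sketch below shows the general idea of threshold-based track-to-detection association using IoU (intersection-over-union) overlap. This is a simplified stand-in for the combined appearance-and-position metric used by DeepDogTrack, not the thesis implementation; boxes, names, and the greedy matching strategy are illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0


def match(tracks, detections, threshold=0.3):
    """Greedily pair each track with its best-overlapping unused detection.

    A detection is accepted only if its IoU with the track exceeds
    `threshold`; a lower threshold tolerates larger frame-to-frame jumps
    (or partial occlusion) at the risk of wrong associations.
    """
    pairs, used = [], set()
    for ti, t in enumerate(tracks):
        best, best_iou = None, threshold
        for di, d in enumerate(detections):
            if di in used:
                continue
            s = iou(t, d)
            if s > best_iou:
                best, best_iou = di, s
        if best is not None:
            used.add(best)
            pairs.append((ti, best))
    return pairs


# A dog half-exiting the frame overlaps its old box only slightly:
# with the default threshold the track is dropped, with a lower one it survives.
old_box, shifted_box = (0, 0, 10, 10), (8, 0, 18, 10)
strict = match([old_box], [shifted_box], threshold=0.3)   # no match
lenient = match([old_box], [shifted_box], threshold=0.05)  # matched
```

This is exactly the trade-off the abstract describes: a lower threshold keeps tracks alive as dogs enter or leave the frame, at the cost of more potential identity switches when multiple dogs overlap.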

