
基於音訊方式之人類活動辨識

A Study on Audio-Based Human Activity Recognition

Advisor: 丁英智

Abstract


This thesis proposes an audio-based approach to human activity recognition. The main line of research combines the Gaussian Mixture Model (GMM) with a state machine to construct a hierarchical behavior recognition mechanism consisting of three layers: a feature layer, an acoustic event layer, and a behavior layer. The feature layer performs basic acoustic feature extraction to obtain feature vectors from the audio data. The acoustic event layer uses GMMs to build an acoustic event model database against which the feature vectors output by the feature layer are matched. Within the acoustic event layer there are two recognition architectures, non-hierarchical GMM and hierarchical GMM: the former is the original GMM recognition method, which matches the test audio directly against all acoustic event models, while the latter matches the test data twice, first classifying it as vocal or non-vocal and then comparing it against the vocal or non-vocal models to obtain the final recognition result. Finally, the recognition results serve as trigger conditions for state transitions in the behavior-layer state machine; when the state machine's behavior flow completes, the outcome is the behavior recognition result for that recording.

In the feature layer of the three-layer mechanism described above, acoustic sensors are deployed using not only conventional condenser microphones but also the microphone array of the popular Kinect sensor, which conveniently supports the audio data fusion recognition conducted in this study.

Since acoustic event recognition directly affects behavior recognition, this thesis further proposes a voting-based GMM, called Voting-GMM, to strengthen the aforementioned non-hierarchical and hierarchical GMM schemes; Voting-GMM delivers excellent recognition performance. To make Voting-GMM more robust, this thesis additionally develops a Voting-GMM integrated with a Type-2 fuzzy controller, whose recognition performance exceeds that of Voting-GMM alone.

The proposed methods are applied to recognition experiments on three human behaviors common in a laboratory environment: "teacher meeting discussion", "casual chat among classmates", and "internal research conversation". In the acoustic event layer, nine vocal and eight non-vocal acoustic events, seventeen in total, are classified. Because the acoustic event recognition rate affects the behavior recognition results of the final behavior layer, extra care is taken in microphone deployment to improve acoustic event recognition. The data fusion method used is a model-based approach that fuses the GMM likelihood scores computed from the audio signals received by each microphone. Applying this model-based fusion to the hierarchical GMM behavior recognition architecture raises the behavior recognition rate from 33.33% with unfused single-microphone data to 56.86% with fused multi-microphone data, and fused Voting-GMM raises it further from 56.86% to 58.82%. Since this rate is still not ideal, a Type-2 fuzzy controller is added to the Voting-GMM method to improve the GMM recognition decisions; better acoustic event decisions allow the behavior recognition rate to improve further. The experimental results confirm that Voting-GMM integrated with the Type-2 fuzzy controller achieves the best recognition performance, with the behavior recognition rate rising to 68.63%.
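To make the layered mechanism concrete, the following is a minimal sketch assuming scikit-learn-style GMMs and pre-extracted MFCC feature matrices; the function names, event labels, and transition table are illustrative assumptions, not the thesis's actual models or its seventeen event classes.

    from sklearn.mixture import GaussianMixture

    def train_event_models(train_data, n_components=8):
        """Acoustic event layer: fit one GMM per acoustic event class.
        train_data maps an event label to an (n_frames, n_mfcc) array."""
        models = {}
        for label, feats in train_data.items():
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag", random_state=0)
            gmm.fit(feats)
            models[label] = gmm
        return models

    def classify_event(models, feats):
        """Pick the event whose GMM gives the highest average log-likelihood."""
        return max(models, key=lambda label: models[label].score(feats))

    class BehaviorStateMachine:
        """Behavior layer: classified events act as trigger conditions for
        state transitions; reaching an accepting state yields a behavior."""
        def __init__(self, transitions, start, accept):
            self.transitions = transitions  # {(state, event): next_state}
            self.state = start
            self.accept = accept            # {final_state: behavior_label}

        def step(self, event):
            self.state = self.transitions.get((self.state, event), self.state)
            return self.accept.get(self.state)  # None until the flow completes

Feeding the events classified from successive audio segments to step(), the first non-None return value is the recognized behavior; the hierarchical variant would simply run classify_event twice, once over vocal/non-vocal models and once within the selected group.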

English Abstract


In recent years, human activity recognition has been applied in smart environments to provide necessary services to people, so it has gradually become an important research topic. This thesis proposes an audio-based approach to human activity recognition. The main work combines the Gaussian mixture model (GMM) with a state machine to construct a hierarchical behavior recognition mechanism. The mechanism is divided into a feature layer, an acoustic event layer, and a behavior layer. The feature layer performs basic feature extraction to obtain feature vectors. The acoustic event layer uses GMMs to build acoustic event models and performs classification. In the acoustic event layer there are two recognition frameworks: non-hierarchical GMM and hierarchical GMM. The former is the original GMM recognition method without event pre-classification; the latter first classifies the input as vocal or non-vocal, and then performs second-layer event recognition to obtain the final result. Finally, the recognition result serves as a trigger condition for state transitions in the behavior layer; when the state machine's process completes, a behavior is identified. In the feature layer of the three-layer mechanism, both conventional condenser microphones and the microphone array of the Kinect for Windows are deployed, and data fusion is used in this study. The recognition rate of the acoustic event layer influences that of the behavior layer; the Kinect's multi-microphone array therefore increases the sound-capture range in the environment. Since each microphone's audio data is independent, data fusion is required to combine all received signals. The thesis uses a model-based method in which all likelihood scores derived from the GMM calculations are combined, and applies it to the hierarchical GMM. With data fusion, the behavior recognition rate increases from the previous 33.33% to 56.86%. To further enhance GMM, the work uses a voting scheme for GMM decisions, called Voting-GMM, which outperforms the conventional GMM in recognition performance. Behavior recognition by Voting-GMM with hierarchical classification reaches 58.82%, which demonstrates the effectiveness of the method. Furthermore, this study uses a type-2 fuzzy controller to adjust the voting behavior of Voting-GMM and further improve the behavior recognition rate. The experimental results show that human behavior recognition by the proposed method achieves a competitive recognition rate of 68.63%.
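The model-based score fusion and the voting decision described above can be sketched in the same style, reusing the per-event GMMs from the previous sketch. Summing per-microphone log-likelihoods and majority voting over time segments are hedged readings of the abstract; the segment-based data layout is an assumption, not the thesis's exact formulation.

    from collections import Counter

    def fused_scores(models, mic_feats):
        """Model-based fusion: sum each event model's log-likelihood over
        the signals received by all microphones.
        mic_feats is a list of (n_frames, n_mfcc) arrays, one per microphone."""
        return {label: sum(gmm.score(feats) for feats in mic_feats)
                for label, gmm in models.items()}

    def voting_gmm(models, segments):
        """Voting-GMM flavor: classify each time segment with the fused
        scores, then take the majority label over all segments."""
        votes = Counter()
        for mic_feats in segments:  # one mic_feats list per time segment
            scores = fused_scores(models, mic_feats)
            votes[max(scores, key=scores.get)] += 1
        return votes.most_common(1)[0][0]

Because log-likelihoods are summed, a microphone with poor reception lowers every event's score roughly equally, so the fused decision tends to be dominated by the microphones that received the event clearly.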

