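Speech recognition is a technology required by many human-machine interfaces and an indispensable feature of smart TVs. However, users usually issue voice commands while watching television, so the microphone records the TV's playback sound along with the speaker's voice; at times the playback is even louder than the user's speech, which makes recognition difficult. Fortunately, the program audio can be obtained from the TV's line output and used to cancel the background sound in the spoken command. Because the playback sound recorded by the microphone is not actually identical to the signal obtained from the line output, however, directly subtracting one signal from the other cannot recover the user's speech alone. To solve this problem, this thesis develops several background sound cancellation methods: adaptive spectral subtraction, least-squares-error-based spectral subtraction, and background sound cancellation based on recurrent neural network learning. Experiments confirm that when the TV playback is louder than the user's speech, all three proposed methods effectively improve speech recognition, and the recurrent neural network-based method outperforms both adaptive spectral subtraction and least-squares-error-based spectral subtraction.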
Speech recognition is a necessity for many human-machine interfaces, especially for smart TVs. When a user issues a voice command to a smart TV, the signal received by the TV contains not only the user's speech but also background sound, mainly from the TV itself. Sometimes the background sound is louder than the user's speech, which is detrimental to speech recognition. Fortunately, the background sound from the TV can be acquired by recording the signal from "Line out". However, because the background sound emitted by the TV's speaker(s) is not the same as the signal from "Line out", it is infeasible to acquire the user's voice by direct subtraction in the time domain. To deal with this problem, we propose three approaches: adaptive spectral subtraction, least-squares-based spectral subtraction, and a recurrent neural network-based approach. Our experiments show that when the background sound is much louder than the user's speech, the three proposed approaches can help improve the accuracy of speech recognition significantly. In particular, the recurrent neural network-based approach is superior to the other two approaches.
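The abstract does not give the formulation of the proposed methods, so the following is only a minimal sketch of one plausible reading of least-squares-based spectral subtraction, not the thesis's implementation. It assumes that a few frames containing TV sound only are available, fits one gain per frequency bin that maps the "Line out" magnitude to the microphone magnitude, and then subtracts the scaled reference magnitude from the microphone magnitude. All names (`dft_magnitudes`, `fit_bin_gains`, `subtract_background`, `tone`), the spectral floor value, and the synthetic tones are illustrative assumptions.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitudes of bins 0..N/2 of a real frame (naive O(N^2) DFT, fine for a demo)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def fit_bin_gains(mic_frames, ref_frames, eps=1e-12):
    """Per-bin least-squares gain g[k] minimising sum_t (M_t[k] - g[k] * R_t[k])**2.

    Assumption: these magnitude frames contain TV sound only (no user speech).
    """
    gains = []
    for k in range(len(mic_frames[0])):
        num = sum(m[k] * r[k] for m, r in zip(mic_frames, ref_frames))
        den = sum(r[k] ** 2 for r in ref_frames)
        gains.append(num / (den + eps))  # eps avoids division by zero in empty bins
    return gains

def subtract_background(mic_mag, ref_mag, gains, floor=0.01):
    """Subtract the scaled reference magnitude; clamp to a small spectral floor."""
    return [max(m - g * r, floor * m) for m, r, g in zip(mic_mag, ref_mag, gains)]

N = 32  # frame length (assumed)

def tone(bin_index, amplitude, phase=0.0):
    """Synthetic test signal: one cosine centred on a DFT bin."""
    return [amplitude * math.cos(2 * math.pi * bin_index * t / N + phase) for t in range(N)]

# Training frames: TV only, at two volumes. The microphone hears the TV
# attenuated (x0.5) and phase-shifted, so time-domain subtraction would not cancel it.
train_ref = [dft_magnitudes(tone(3, a)) for a in (1.0, 2.0)]
train_mic = [dft_magnitudes(tone(3, 0.5 * a, phase=1.0)) for a in (1.0, 2.0)]
gains = fit_bin_gains(train_mic, train_ref)

# Test frame: TV (bin 3) plus a quieter "speech" tone (bin 5).
ref_frame = tone(3, 1.0)
mic_frame = [tv + sp for tv, sp in zip(tone(3, 0.5, phase=1.0), tone(5, 0.2))]
cleaned = subtract_background(dft_magnitudes(mic_frame), dft_magnitudes(ref_frame), gains)
```

In this toy setup the TV component in bin 3 is reduced to the spectral floor while the "speech" component in bin 5 is left essentially unchanged. Working on magnitudes sidesteps the phase mismatch between the loudspeaker path and "Line out" that defeats direct time-domain subtraction; a practical system would additionally need time alignment, overlapping windowed frames, and resynthesis using the microphone phase.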