Speech recognition systems play a pivotal role in human-computer interaction. However, additive noise and reverberation severely degrade recognition performance, posing many obstacles to real-world deployment. To improve robustness to noise, the denoising autoencoder (DAE) has been widely adopted in prior work as a front-end signal-processing model; however, the output of the speech enhancement model may be inconsistent with the input expected by the acoustic model, which can hurt recognition performance. This thesis proposes a joint training framework based on lattice-free maximum mutual information (LF-MMI) that trains the speech enhancement model and the acoustic model together, so as to enforce consistency between the former's output and the latter's input. The framework also implements noise-aware training (NAT), which explicitly informs the back-end model of the noise characteristics, making the system more robust to noise. In experiments on Aurora-4, the best proposed model achieves a relative word error rate reduction of up to 38.6%. The proposed method is also evaluated on AMI, a corpus recorded in real environments; however, because AMI consists of spontaneous speech recorded in highly challenging conditions, the improvement there is not significant.
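A minimal sketch of a joint objective of this kind (the notation here is illustrative, not the thesis's exact formulation): with enhancement front-end $E$, acoustic model $A$, noisy input $\mathbf{y}$, and clean reference $\mathbf{x}$, the two models can be optimized together as

```latex
\mathcal{L}_{\text{joint}}
  = \mathcal{L}_{\text{LF-MMI}}\!\bigl(A(E(\mathbf{y}))\bigr)
  + \lambda \,\bigl\lVert E(\mathbf{y}) - \mathbf{x} \bigr\rVert_2^2
```

where $\lambda$ weights the enhancement (denoising) loss against the discriminative LF-MMI loss, so gradients from the recognition objective flow back into the front-end and keep its output consistent with what the acoustic model expects.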
Automatic speech recognition (ASR) is a key component of human-computer interaction. In real-life applications, the performance of ASR systems usually degrades in the presence of environmental noise and reverberation. To improve robustness to noise, front-end approaches for ASR based on the denoising autoencoder (DAE) have been widely studied in the literature. However, there may be a mismatch between the DAE output and the input expected by the acoustic model, which can degrade recognition performance. In this thesis, we propose a joint training framework based on lattice-free maximum mutual information (LF-MMI) that trains the DAE front-end and the acoustic model together to ensure consistency between the two components. The proposed framework also implements noise-aware training (NAT), which explicitly informs the acoustic model of the noise features, making the system more robust to environmental noise. Experiments on the Aurora-4 database show that the best model achieves a relative reduction in word error rate (WER) of up to 38.6%. The proposed method has also been evaluated on real-world data from the AMI corpus; however, the improvement there is not significant, since AMI consists of spontaneous dialogues recorded in challenging environments.
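The core idea of noise-aware training is to append an explicit noise estimate to every frame of the acoustic features. A minimal NumPy sketch (the leading-frames noise estimator and the 40-dimensional feature size are illustrative assumptions, not the thesis's exact setup):

```python
import numpy as np

def noise_aware_features(feats, n_noise_frames=10):
    """Append a per-utterance noise estimate to each frame (NAT-style)."""
    # Estimate the noise from the first few frames, which are assumed
    # to be speech-free -- a common simple estimator in NAT.
    noise_est = feats[:n_noise_frames].mean(axis=0)
    # Tile the estimate across all frames and concatenate it to the
    # features, so the acoustic model is explicitly informed of the
    # noise characteristics of this utterance.
    noise_tiled = np.broadcast_to(noise_est, feats.shape)
    return np.concatenate([feats, noise_tiled], axis=1)

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 40))   # 100 frames of 40-dim log-Mel features
aug = noise_aware_features(feats)
print(aug.shape)                     # (100, 80)
```

The back-end acoustic model then consumes the doubled-width features; because the noise estimate is constant within an utterance, it acts as a conditioning signal rather than extra temporal information.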