A Preliminary Study on Robust Speech Feature Extraction Based on Maximizing the State Accuracy of Deep Acoustic Models

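In this study, we propose a novel robust speech feature extraction technique to improve speech recognition performance in noisy environments. The technique uses information provided by the original acoustic model in the back end of the speech recognition system: without retraining the acoustic model, a deep neural network learns the speech features that maximize the accuracy of the acoustic model's states, which in turn makes the features robust to noise. Compared with techniques that achieve noise robustness by modifying the acoustic model, the proposed technique requires less computation and trains faster. In preliminary experiments, we evaluated the method on TIMIT, a medium-sized corpus. The results show that, relative to the baseline, the proposed feature extraction method effectively reduces the phone error rate across a variety of noise types and noise levels, which highlights its effectiveness and its potential for further development.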
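The core idea, training only the front end while the back-end acoustic model stays fixed, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the back end is a two-state linear-softmax stand-in (`BACKEND_W`, `BACKEND_B`) rather than a deep acoustic model, the front end is an affine transform rather than a deep neural network, the "noise" is a constant feature offset, and state accuracy is maximized indirectly by minimizing frame-level cross-entropy against known state labels.

```python
import math
import random

# Hypothetical frozen back end: a 2-state linear-softmax "acoustic model".
BACKEND_W = [[-2.0, 0.0], [2.0, 0.0]]
BACKEND_B = [0.0, 0.0]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def backend_posteriors(W, b, y):
    """State posteriors from the frozen back end for one feature vector y."""
    return softmax([sum(w * v for w, v in zip(row, y)) + bk
                    for row, bk in zip(W, b)])

def frontend(A, c, x):
    """Trainable front end: affine map applied to a noisy feature vector x."""
    return [sum(a * v for a, v in zip(row, x)) + ci for row, ci in zip(A, c)]

def state_accuracy(W, b, A, c, data):
    """Fraction of frames whose most likely state matches the label."""
    hits = 0
    for x, s in data:
        p = backend_posteriors(W, b, frontend(A, c, x))
        hits += int(p.index(max(p)) == s)
    return hits / len(data)

def train_frontend(W, b, data, epochs=100, lr=0.02):
    """SGD on cross-entropy; gradients pass through W, but W is never updated."""
    dim = len(data[0][0])
    A = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    c = [0.0] * dim
    for _ in range(epochs):
        for x, s in data:
            p = backend_posteriors(W, b, frontend(A, c, x))
            # Gradient of cross-entropy w.r.t. back-end logits: p - onehot(s).
            dz = [pk - (1.0 if k == s else 0.0) for k, pk in enumerate(p)]
            # Back-propagate through the frozen back end: dL/dy = W^T dz.
            dy = [sum(W[k][i] * dz[k] for k in range(len(W)))
                  for i in range(dim)]
            for i in range(dim):
                c[i] -= lr * dy[i]
                for j in range(dim):
                    A[i][j] -= lr * dy[i] * x[j]
    return A, c

def make_noisy_data(n, seed=0):
    """Toy frames: state 0 near (-1, 0), state 1 near (+1, 0), plus an offset."""
    rng = random.Random(seed)
    offset = (1.5, 0.5)  # assumed stand-in for a noise-induced feature shift
    data = []
    for i in range(n):
        s = i % 2
        mean = (-1.0, 0.0) if s == 0 else (1.0, 0.0)
        x = [mean[d] + rng.gauss(0.0, 0.3) + offset[d] for d in range(2)]
        data.append((x, s))
    return data

if __name__ == "__main__":
    frames = make_noisy_data(200)
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print("before:", state_accuracy(BACKEND_W, BACKEND_B, identity, [0.0, 0.0], frames))
    A, c = train_frontend(BACKEND_W, BACKEND_B, frames)
    print("after: ", state_accuracy(BACKEND_W, BACKEND_B, A, c, frames))
```

In this sketch the only parameters that change are those of the front end (`A`, `c`), which mirrors the stated advantage that the acoustic model needs no retraining or adaptation.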

Keywords

Noise-robust speech features; speech recognition; deep learning

Parallel Abstract

In this study, we develop a novel speech feature extraction technique for noise-robust speech recognition that exploits information from the back-end acoustic models. Without retraining or adapting the back-end acoustic models, we use deep neural networks to learn a front-end speech feature representation that maximizes the state accuracy obtained from the original acoustic models. Compared with robustness methods that retrain or adapt the acoustic models, the presented method has the advantages of lower computational complexity and faster training. In preliminary evaluation experiments conducted on the medium-sized TIMIT corpus and task, we show that the presented method achieves lower phone error rates than the baseline under various noise types and levels. The method is therefore promising and worth developing further.

Parallel Keywords

Noise-robust Speech Feature; Speech Recognition; Deep Learning
