Employing Low-Pass Filtered Temporal Speech Features for the Training of Ideal Ratio Mask in Speech Enhancement

Abstract


Among the many deep-learning-based speech enhancement methods, masking-based approaches estimate a mask that is multiplied with the spectrogram of the noisy speech, so that the product spectrogram contains fewer noise components and a relatively clean speech signal can be reconstructed. As for the input features of the deep model used to learn the mask, many features long employed in speech recognition, such as the mel-frequency cepstrum, the amplitude modulation spectrogram, and perceptual linear prediction coefficients, are suitable choices that allow the learned mask to achieve effective speech enhancement. In addition, lowpass filtering the temporal sequences of speech features has traditionally been used to suppress noise-induced distortion. In this study, we therefore lowpass-filter various temporal speech feature sequences by means of the discrete wavelet transform and use them to train the deep model for the speech mask, in order to investigate whether the learned mask provides better enhancement of the original noisy spectrogram. In our preliminary experiments under babble noise, we find that the deep model trained with the lowpass-filtered feature sequences improves the quality and intelligibility of the test utterances more effectively than the model trained with the original feature sequences.
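As an illustration of the preprocessing described above, the following sketch lowpass-filters the temporal trajectory of a single feature dimension with a one-level discrete wavelet transform. The wavelet family (db4), the single decomposition level, and the scaling factor applied to the detail coefficients are assumptions made for illustration only, not the exact settings used in the paper.

```python
import numpy as np
import pywt  # PyWavelets; np is used in the row-wise example at the bottom

def lowpass_feature_sequence(feature_seq, wavelet="db4", detail_scale=0.0):
    """Lowpass-filter the temporal trajectory of one feature dimension.

    A one-level DWT splits the sequence into approximation (low-pass) and
    detail (high-pass) coefficients; shrinking the detail part before the
    inverse transform suppresses the fast, noise-like fluctuations.
    """
    approx, detail = pywt.wavedec(feature_seq, wavelet, level=1)
    smoothed = pywt.waverec([approx, detail_scale * detail], wavelet)
    return smoothed[: len(feature_seq)]  # trim a possible padding sample

# Applied independently to every row of a (num_features, num_frames) matrix:
# features_lp = np.vstack([lowpass_feature_sequence(row) for row in features])
```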

Parallel Abstract


The masking-based speech enhancement method pursues a multiplicative mask that is applied to the spectrogram of the input noise-corrupted utterance, and a deep neural network (DNN) is often used to learn the mask. In particular, the features commonly used for automatic speech recognition can serve as the input of the DNN to learn a well-behaved mask that significantly reduces the noise distortion of the processed utterances. This study proposes to preprocess the input speech features for the ideal ratio mask (IRM)-based DNN by lowpass filtering in order to alleviate the noise components. Specifically, we employ the discrete wavelet transform (DWT) to decompose the temporal speech feature sequence and scale down the detail coefficients, which correspond to the high-pass portion of the sequence. Preliminary experiments conducted on a subset of the TIMIT corpus reveal that the proposed method enables the resulting IRM to achieve higher speech quality and intelligibility for babble-noise-corrupted signals than the original IRM, indicating that the lowpass-filtered temporal feature sequences help learn a superior IRM network for speech enhancement.
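For reference, a commonly used definition of the ideal ratio mask and its application to a noisy spectrogram are sketched below; the exponent of 0.5, the small stabilizing constant, and the reuse of the noisy phase for reconstruction are conventional choices assumed here rather than details taken from the paper. During training, the DNN is fitted to this target mask from the (lowpass-filtered) input features; at test time, the predicted mask scales the noisy magnitude spectrogram as in apply_mask.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-8):
    """IRM(t, f) = (|S(t, f)|^2 / (|S(t, f)|^2 + |N(t, f)|^2))^beta."""
    return (clean_mag**2 / (clean_mag**2 + noise_mag**2 + eps)) ** beta

def apply_mask(noisy_mag, noisy_phase, mask):
    """Enhance by scaling the noisy magnitude and reusing the noisy phase."""
    return (mask * noisy_mag) * np.exp(1j * noisy_phase)
```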
