Nowadays, time-domain features, like the classic frequency-domain features, have been widely used in speech enhancement networks to effectively remove noise from input speech. This thesis investigates how to extract information from time-domain speech to create more effective features for speech enhancement. We propose employing sub-signals in multiple acoustic frequency bands of the time domain and fusing them into a unified time-domain feature. Specifically, the discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection-based fusion is performed on these signals to create the final time-domain feature. The corresponding fusion methods are bi-projection fusion (BPF) and its extension, multiple projection fusion (MPF); the latter replaces the sigmoid function used by the former with the softmax function so that more than two feature sources can be effectively integrated. The fused DWT-derived features are combined with the original time-domain features at the encoder output of the well-known speech enhancement network, the fully-convolutional time-domain audio separation network (Conv-TasNet), to estimate the mask and then generate the enhanced, low-noise time-domain utterances. Evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT speech enhancement tasks, and the results show that the proposed method achieves higher objective speech quality and intelligibility scores than the original Conv-TasNet using time-domain features alone, indicating that fusing wavelet-domain features can complement the original time-domain features and yield a superior Conv-TasNet network learned from noisy input speech, achieving better speech enhancement.
Nowadays, time-domain features have been widely used in speech enhancement (SE) networks, like frequency-domain features, to achieve excellent performance in eliminating noise from input utterances. This thesis primarily investigates how to extract information from time-domain utterances to create more effective features for speech enhancement. We propose employing sub-signals residing in multiple acoustic frequency bands of the time domain and integrating them into a unified time-domain feature set. The discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection fusion process is performed on these signals to create the final features. The corresponding fusion strategy is either bi-projection fusion (BPF) or multiple projection fusion (MPF). In short, MPF exploits the softmax function in place of the sigmoid function to create ratio masks for more than two feature sources. The concatenation of the fused DWT features and the time features serves as the encoder output of two celebrated SE frameworks, the fully-convolutional time-domain audio separation network (Conv-TasNet) and the dual-path Transformer network (DPTNet), to estimate the mask and then produce the enhanced time-domain utterances. Evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, and the results reveal that the proposed method achieves higher speech quality and intelligibility than the original Conv-TasNet using time features only, indicating that the fusion of DWT features created from the input utterances can complement the time features to learn a superior Conv-TasNet/DPTNet network for speech enhancement.
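The two core steps described above, DWT decomposition of a frame into sub-band signals and softmax-weighted (MPF-style) fusion of those signals, can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: it assumes a single-level Haar DWT (the thesis does not specify the wavelet here), and the function names `haar_subbands` and `mpf_fuse` as well as the random matrices `W` standing in for the learned projection layers are hypothetical.

```python
import numpy as np

def haar_subbands(frame):
    """Split a frame into low- and high-band time signals via a
    single-level Haar DWT (analysis + per-band synthesis)."""
    a = (frame[0::2] + frame[1::2]) / np.sqrt(2)    # approximation coefficients
    d = (frame[0::2] - frame[1::2]) / np.sqrt(2)    # detail coefficients
    low = np.zeros_like(frame)
    high = np.zeros_like(frame)
    low[0::2] = a / np.sqrt(2)                      # reconstruct low band only
    low[1::2] = a / np.sqrt(2)
    high[0::2] = d / np.sqrt(2)                     # reconstruct high band only
    high[1::2] = -d / np.sqrt(2)
    return np.stack([low, high])                    # (num_bands, frame_len)

def mpf_fuse(bands, W):
    """Fuse sub-band signals with softmax ratio masks (MPF-style).

    bands: (num_bands, frame_len) sub-band signals
    W:     (num_bands, frame_len, frame_len) hypothetical projection matrices,
           stand-ins for the learned projection layers in the network
    """
    logits = np.einsum("bij,bj->bi", W, bands)              # project each band
    weights = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0)                          # softmax across bands
    return (weights * bands).sum(axis=0)                    # ratio-masked sum
```

Because the Haar DWT is linear and orthogonal, the per-band reconstructions sum back to the original frame, so the sub-bands carry the full signal content that the fusion step then re-weights; with only two bands, replacing the softmax by a sigmoid on a single projection recovers the BPF variant.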