Nowadays, time-domain features, like the classic frequency-domain features, have been widely used in speech enhancement networks to effectively remove noise from input speech. This thesis investigates how to extract information from time-domain speech to create more effective features for speech enhancement. We propose employing sub-signals in multiple acoustic frequency bands of the time domain and fusing them into a unified time-domain feature. Specifically, the discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection-based fusion is performed on these signals to create the final time-domain feature. The corresponding fusion methods are bi-projection fusion (BPF) and its extension, multiple projection fusion (MPF); the latter replaces the sigmoid function used by the former with the softmax function so that more than two feature sources can be effectively integrated. The fused DWT-derived features are combined with the original time-domain features at the encoder output of the well-known speech enhancement network, the fully-convolutional time-domain audio separation network (Conv-TasNet), to estimate the mask and then generate the enhanced, low-noise time-domain utterances. Evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT speech enhancement tasks, and the results show that the proposed method achieves higher objective speech quality and intelligibility scores than the original Conv-TasNet using time-domain features alone, indicating that fusing wavelet-domain features can complement the original time-domain features and yield a superior Conv-TasNet network learned from noisy input speech, achieving better speech enhancement.
Nowadays, time-domain features have been widely used in speech enhancement (SE) networks, like frequency-domain features, to achieve excellent performance in eliminating noise from input utterances. This thesis primarily investigates how to extract information from time-domain utterances to create more effective features for speech enhancement. We propose employing sub-signals residing in multiple acoustic frequency bands of the time domain and integrating them into a unified time-domain feature set. The discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection fusion process is performed on these signals to create the final features. The corresponding fusion strategy is either bi-projection fusion (BPF) or multiple projection fusion (MPF). In short, MPF exploits the softmax function in place of the sigmoid function to create ratio masks for more than two feature sources. The concatenation of the fused DWT features and the time features serves as the encoder output of two celebrated SE frameworks, the fully-convolutional time-domain audio separation network (Conv-TasNet) and the dual-path Transformer network (DPTNet), to estimate the mask and then produce the enhanced time-domain utterances. Evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, and the results reveal that the proposed method achieves higher speech quality and intelligibility than the original Conv-TasNet using time features only, indicating that the fusion of DWT features created from the input utterances can complement the time features to learn a superior Conv-TasNet/DPTNet network for speech enhancement.
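The two core steps described above, DWT decomposition of a frame into sub-band signals and softmax-weighted (MPF-style) fusion of those signals, can be sketched as follows. This is a minimal NumPy illustration, not the thesis implementation: it assumes a single-level Haar DWT (the thesis does not specify the wavelet here), and the function names `haar_subbands` and `mpf_fuse` as well as the random matrices `W` standing in for the learned projection layers are hypothetical.

```python
import numpy as np

def haar_subbands(frame):
    """Split a frame into low- and high-band time signals via a
    single-level Haar DWT (analysis + per-band synthesis)."""
    a = (frame[0::2] + frame[1::2]) / np.sqrt(2)    # approximation coefficients
    d = (frame[0::2] - frame[1::2]) / np.sqrt(2)    # detail coefficients
    low = np.zeros_like(frame)
    high = np.zeros_like(frame)
    low[0::2] = a / np.sqrt(2)                      # reconstruct low band only
    low[1::2] = a / np.sqrt(2)
    high[0::2] = d / np.sqrt(2)                     # reconstruct high band only
    high[1::2] = -d / np.sqrt(2)
    return np.stack([low, high])                    # (num_bands, frame_len)

def mpf_fuse(bands, W):
    """Fuse sub-band signals with softmax ratio masks (MPF-style).

    bands: (num_bands, frame_len) sub-band signals
    W:     (num_bands, frame_len, frame_len) hypothetical projection matrices,
           stand-ins for the learned projection layers in the network
    """
    logits = np.einsum("bij,bj->bi", W, bands)              # project each band
    weights = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights /= weights.sum(axis=0)                          # softmax across bands
    return (weights * bands).sum(axis=0)                    # ratio-masked sum
```

Because the Haar DWT is linear and orthogonal, the per-band reconstructions sum back to the original frame, so the sub-bands carry the full signal content that the fusion step then re-weights; with only two bands, replacing the softmax by a sigmoid on a single projection recovers the BPF variant.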