Research on Speech Separation Techniques: Model Compression and Multitask Learning

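In this thesis, we propose two novel speech separation model architectures, one targeting model compression and the other targeting speech separation in noisy environments. By improving existing speech separation models, we aim for a more general separation system that is closer to real-world application scenarios (universal separation). For model compression, we draw on the success of parameter sharing in compressing natural language processing models: we investigate how parameter sharing affects time-domain speech separation models and design parameter-sharing strategies tailored to them. Stability evaluation is very important for a compressed model. Experiments show that the proposed MiTAS reduces the parameter count by nearly 50% while maintaining the same separation performance, and that it passes multiple stability evaluation experiments. Model compression brings speech separation closer to end users and to widespread deployment. The second research direction of this thesis is improving speech separation in noisy environments. Because speech denoising and speech separation are similar in nature, we propose a unified model architecture, SADDEL, that combines the two tasks under a single multitask learning framework, so that one model can perform both speech separation and speech denoising. Experiments show that SADDEL outperforms single-task models and is closer to real-world conditions than the other models compared. It also performs well on both separation and denoising, and remains stable under unseen noise types and noise levels. Applications of speech separation include collecting and labeling real-world speech separation data, as well as automatic speech recognition (ASR) and speaker recognition in noisy environments. Extracting speech from mixtures of overlapping voices and background noise is important for a wide range of downstream speech processing systems.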

Keywords

speech separation; model compression; multitask learning; endpoint applications; speech denoising

Parallel Abstract

In this thesis, we propose two novel speech separation model architectures that bring speech separation closer to real-world use from two directions: model compression and separation in noisy environments. We aim to improve existing speech separation models so that they generalize more widely and move a step closer to a universal separation system. Our first research interest is model compression, inspired by the success of parameter sharing in compressing natural language processing models. We investigate the effectiveness of such methods on time-domain speech separation and propose several parameter-sharing strategies. We also examine the design choices that lead to a parameter-efficient model. Because stability evaluation is very important for a compressed model, we evaluate it explicitly. Experimental results show that the proposed MiTAS can remove nearly 75% of the model parameters while maintaining the same speech separation performance. MiTAS also passes multiple stability evaluation experiments, which indicates its robustness. In summary, MiTAS represents a significant step toward separation on edge devices and enables a wider range of downstream applications. Our second research interest is improving speech separation performance in noisy environments. Because speech separation and speech denoising are similar in nature, we propose SADDEL, a joint speech separation and denoising framework based on the multitask learning criterion, to tackle the two tasks simultaneously. Under this framework, a single model can perform both speech separation and speech denoising. Experimental results demonstrate that SADDEL outperforms comparable speech denoising and speech separation models and shows promising results on various noisy separation tasks. Moreover, SADDEL remains robust across different datasets, noise types, and SNR levels.
Common applications of speech separation include labeling collected real-world separation data, as well as automatic speech recognition (ASR) and speaker recognition in noisy environments. Extracting speech from a mixture of human voices and background noise is very important for various downstream speech processing systems.
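
To make the link between parameter sharing and model size concrete, the following is a minimal sketch of how reusing one set of weights across several identical blocks reduces the number of unique parameters. It is not the sharing strategy used in MiTAS, which the abstract does not specify; the `Block` shape, the `count_params` helper, and the choice of eight blocks shared in pairs are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical 1-D convolutional block, described only by its shape."""
    in_channels: int
    out_channels: int
    kernel_size: int

    def num_params(self) -> int:
        # Convolution weights plus one bias per output channel.
        weights = self.in_channels * self.out_channels * self.kernel_size
        return weights + self.out_channels


def count_params(blocks, share_group_size=1):
    """Count unique parameters when each run of `share_group_size`
    consecutive blocks reuses one set of weights (1 means no sharing)."""
    if share_group_size < 1:
        raise ValueError("share_group_size must be at least 1")
    total = 0
    for start in range(0, len(blocks), share_group_size):
        group = blocks[start:start + share_group_size]
        # Weights can only be shared between blocks of identical shape.
        if any(block != group[0] for block in group):
            raise ValueError("blocks in a sharing group must have the same shape")
        total += group[0].num_params()  # stored once for the whole group
    return total


# Assumed toy configuration: eight identical blocks.
blocks = [Block(in_channels=64, out_channels=64, kernel_size=3)] * 8

baseline = count_params(blocks)                    # every block has its own weights
shared = count_params(blocks, share_group_size=2)  # each pair of blocks shares weights
reduction = 1 - shared / baseline

print(f"baseline={baseline}, shared={shared}, reduction={reduction:.0%}")
```

In this toy setting, sharing within pairs halves the parameter count; larger sharing groups reduce it further, at the cost of fewer independent weights, which is why the abstract stresses stability evaluation of the compressed model.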

Parallel Keywords

speech separation; speech denoising; model compression; multitask learning; endpoint applications
