Towards Domain Generalization for Speech Separation

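The central aim of this thesis is to improve existing speech separation models so that they generalize better. Earlier speech separation models based on deep networks already achieve very good results on clean speech datasets. From the standpoint of building a comprehensive, commercially viable model, however, a speech separation model must be able to cope with speech data that has different background noise or comes from different recording environments; that is, it must generalize across domains. The goal of this thesis is therefore to examine how speech separation models perform across domains and how they might be improved. The first direction examines the effect of basic generalization methods, namely dropout, regularization, and data augmentation. These methods are widely used when training deep models and can reduce overfitting without using additional datasets. Because previous speech separation research has not used these techniques, trying them is the first step toward a generalized model. The second direction examines supervised domain adaptation. The thesis defines datasets from different domains, comprising a source domain and a target domain, and asks whether, assuming labels for the target domain are available, supervised learning can give the model the ability to adapt across domains. The experimental results confirm that supervised domain adaptation is stable and effective. The third direction is unsupervised domain adaptation, that is, the setting in which target-domain labels are assumed to be unavailable. In this setting the thesis tries two methods. The first uses an adversarial learning framework so that the model learns to remove the distribution differences between domains, thereby achieving domain adaptation. The second is based on semi-supervised learning: in the spirit of semi-supervised learning, it guides the model's own decision boundary to obtain a more generalized model. The experimental results confirm that both methods succeed under some data settings, and the thesis also discusses possible reasons for failure under the settings where they do not.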

Keywords

Speech separation; transfer learning; domain adaptation

Parallel Abstract

The main aim of this thesis is to improve existing speech separation models so that they generalize better. Speech separation models based on deep neural networks already perform well on clean speech corpora. However, a fully developed, commercially viable speech separation model must be able to cope with different background noise and different recording environments in the speech data. That is to say, a generalized model should be robust across domains. The goal of this thesis is to explore the performance of speech separation models in different domains and possible directions for improvement. The first direction of the thesis explores the effects of basic generalization methods, including dropout, regularization, and data augmentation. These methods are widely used in training deep neural networks and can reduce overfitting without using extra data. Since previous research on speech separation has not used these techniques, trying them is the first step towards a generalized model. The second direction of the thesis examines supervised domain adaptation. We define datasets that belong to different domains, including a source domain and a target domain, and ask whether domain adaptation can be achieved through supervised learning when the ground truth of the target-domain dataset is available. The experimental results show that supervised domain adaptation is stable and effective. The third direction of the thesis is unsupervised domain adaptation, that is, the setting in which the ground truth of the target domain is assumed to be unavailable. Under this setting, the thesis tries two methods. The first is an adversarial learning framework, which helps the model learn to eliminate the distribution differences between domains and thereby achieve domain adaptation. The second is based on semi-supervised learning: in the spirit of semi-supervised learning, we guide the decision boundary of the model to obtain a more generalized model. The results confirm that these two methods succeed under some experimental settings. The thesis also explores possible reasons for failure under the settings where they do not.
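The abstract names data augmentation as one of the basic generalization methods but does not say which augmentation is used. As a minimal sketch under that caveat, one common way to expose a model to varied background noise is to add noise to clean speech at a chosen signal-to-noise ratio (SNR). The function names `add_noise_at_snr` and `measured_snr_db`, the toy sine-wave "speech", and the Gaussian noise are illustrative assumptions, not the thesis's implementation.

```python
import math
import random
from typing import List


def power(signal: List[float]) -> float:
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def add_noise_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Return speech plus noise, with the noise rescaled to hit the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / P_scaled_noise); solve for the noise gain.
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean: List[float], noisy: List[float]) -> float:
    """SNR in dB of a noisy signal relative to its clean reference."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10.0 * math.log10(power(clean) / power(residual))


# Toy data (hypothetical): a sine wave stands in for speech,
# Gaussian samples stand in for background noise.
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 5.0 * t / 800.0) for t in range(800)]
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]
noisy = add_noise_at_snr(speech, noise, snr_db=10.0)
```

Drawing `snr_db` at random for each training example would give the model a range of noise conditions from a single clean corpus, which matches the abstract's point that these methods need no extra dataset.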

Parallel Keywords

Speech Separation; Transfer Learning; Domain Adaptation
