
Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering

Advisor: Lin-shan Lee

Abstract


The main topic of this thesis is speaker-independent speech separation: separating the voices of two or more speakers from a mixed recording without any prior speaker information. This is useful in many speech processing systems, including speech recognition, speaker recognition, and others. When two or more speakers appear in an audio signal, the goal is to separate these voices despite their similar characteristics. Current deep learning approaches to this problem fall into two main streams: frequency-domain methods and time-domain methods. The chief difference between them lies in the model input: one takes the raw time-domain waveform, while the other takes the spectrogram obtained through the short-time Fourier transform. The two approaches also use different model architectures to handle these two kinds of input, and each has its own drawbacks. This thesis proposes a separation technique based on time-and-frequency cross-domain joint embedding and clustering, which allows the input signals from the two domains (time and frequency) to reference each other. Our models are built primarily on convolutional neural networks, and the proposed method is among the best-performing algorithms for speaker-independent speech separation to date. We analyze the influence of different neural modules on this problem and, through experimental results, discuss the advantages and disadvantages of each module for speaker-independent speech separation.
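The two input representations contrasted above can be made concrete with a short sketch. This is an illustrative example rather than code from the thesis, assuming NumPy and SciPy; a synthetic two-tone signal stands in for an overlapped-speech mixture, sampled at 8 kHz (a rate commonly used in separation benchmarks):

```python
import numpy as np
from scipy.signal import stft

# Synthetic "mixture": two sinusoids standing in for two speakers.
sr = 8000
t = np.arange(sr) / sr  # one second of audio
mixture = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

# Time-domain methods consume the raw waveform directly:
time_input = mixture  # shape: (8000,)

# Frequency-domain methods instead consume the short-time Fourier
# transform of the same signal, usually as a magnitude spectrogram:
f, frames, Z = stft(mixture, fs=sr, nperseg=256, noverlap=192)
freq_input = np.abs(Z)  # shape: (129 frequency bins, n_frames)

print(time_input.shape, freq_input.shape)
```

The point of the cross-domain approach is that both of these views describe the same mixture, so a model can let them reference each other instead of committing to one.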
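The clustering half of "joint embedding and clustering" follows the general deep-clustering idea: the network maps each time-frequency bin of the mixture to an embedding vector, and bins dominated by the same speaker are grouped together. The following is a toy sketch, not the thesis implementation; the network's embeddings are simulated as two well-separated Gaussian clouds, and a hand-rolled k-means recovers the per-speaker grouping from which a binary mask would be built:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated embeddings: 100 time-frequency bins per "speaker",
# each speaker's bins clustered around a different point in 2-D.
emb_spk1 = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(100, 2))
emb_spk2 = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(100, 2))
embeddings = np.vstack([emb_spk1, emb_spk2])  # (200 bins, 2-dim)

def kmeans(X, k=2, iters=50):
    """Plain k-means: alternate nearest-centroid assignment and
    centroid update.  Initialised from the first and last bins,
    which is adequate for this toy data."""
    centroids = X[[0, -1]].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=-1)
        labels = d.argmin(axis=1)
        centroids = np.stack([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(embeddings)
# A binary mask per speaker would select that speaker's bins
# from the mixture spectrogram for reconstruction.
print(np.bincount(labels))  # → [100 100]
```

In a real separation system the embeddings come from the trained network and the clusters define the masks applied to the mixture before inverting back to waveforms.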

