
An Analysis on the Robustness of Neural Vocoders for Speech Waveform Generation

Advisor: Lin-shan Lee (李琳山)

Abstract


A vocoder is an architecture that converts acoustic features into audio waveforms. Deep-learning-based vocoders are now widely used in speech generation applications, including text-to-speech and voice conversion systems. However, when the training and test data distributions are mismatched, the performance of deep-learning-based vocoders degrades substantially. This thesis investigates the quality of speech generated by different deep-learning-based vocoders under such train/test mismatch. The vocoders studied are WaveNet, WaveRNN, FFTNet, and Parallel WaveGAN. We first examine the effect of training and testing on different speakers and different languages. Training the vocoders on single-speaker single-language, multi-speaker single-language, and multi-speaker multi-language datasets, and testing them on the same speaker and language, unseen speakers in the same language, and unseen speakers in unseen languages, we find that speaker variety is the factor with the greatest impact on output quality, while a different language does not affect the generated results. In addition, for vocoders trained on a single speaker, we find that a gender mismatch also substantially degrades the output. Finally, this thesis applies the vocoders to other speech generation tasks, finding that WaveNet and WaveRNN are best suited to text-to-speech systems, while Parallel WaveGAN is best suited to voice conversion systems.

Parallel Abstract


A vocoder is an architecture that converts acoustic features into waveforms. Recently, neural vocoders have been widely used in speech generation applications, including text-to-speech and voice conversion. However, when encountering a data distribution mismatch between training and inference, the performance of neural vocoders degrades significantly. In this thesis, we study the performance of different deep-learning-based vocoders under such mismatch. The vocoders discussed include WaveNet, WaveRNN, FFTNet, and Parallel WaveGAN. We evaluate the models using acoustic features from seen/unseen speakers, seen/unseen languages, a text-to-speech model, and a voice conversion model. We find that speaker variety is much more important than language variety for achieving a universal vocoder. When a vocoder is trained on a single-speaker dataset, a gender mismatch also degrades its output quality. Through our experiments, we show that WaveNet and WaveRNN are more suitable for text-to-speech models, while Parallel WaveGAN is more suitable for voice conversion applications.
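All of the vocoders studied here condition on frame-level acoustic features, typically log-mel spectrograms, and are evaluated by resynthesizing waveforms from those features. As a minimal sketch of what such an acoustic feature looks like, the following NumPy-only code computes a log-mel spectrogram from a waveform; the parameter values (16 kHz sampling, 1024-point FFT, 256-sample hop, 80 mel bands) are common illustrative choices, not necessarily the thesis's exact configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=1024, n_mels=80):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(wav, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # Frame + window -> power spectrum -> mel warp -> log compression.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.maximum(mel, 1e-10))   # floor avoids log(0)
```

In the copy-synthesis setup implied by the experiments, such features would be extracted from ground-truth recordings of seen or unseen speakers and fed back to a trained vocoder, so that output quality isolates the vocoder's robustness rather than an upstream model's errors.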

Parallel Keywords

Neural Vocoder; Robustness; Speech Synthesis

