
Robustness Analysis for Neural Voice Conversion Models and an Any-to-Any Non-Parallel Sequence-to-Sequence Voice Conversion Model

Advisor: Lin-shan Lee (李琳山)

Abstract


The goal of voice conversion is to transform an utterance into another utterance without changing its linguistic content; the converted property may be accent, prosody, emotion, or speaker characteristics. Many studies have applied deep learning to voice conversion, and some provide publicly available audio demos that sound very good at first listen. However, because voice conversion is a signal generation task, it can usually only be evaluated subjectively, and with too few samples the evaluation may fail to reflect the model's true performance. Moreover, almost all papers to date test only on their training dataset and rarely report performance under real-world conditions, yet deep learning models inevitably degrade severely when the data distribution shifts.

The first part of this thesis examines the quality of speech generated by various deep-learning-based voice conversion models when the training and test data distributions differ. The models tested include FragmentVC, AutoVC, AdaIN-VC, VQVC+, BLOW, DGAN-VC, and WAStarGAN-VC, compared as fairly as possible. The test scenarios cover different recording environments, different languages, cross-gender conversion, and noise-corrupted speech; experiments show that a voice conversion model trained on a single language can still convert well across languages.

The second part of this thesis attempts to build a sequence-to-sequence voice conversion model without parallel training data and without text transcriptions. Building on the FragmentVC architecture, a prosody module is added to learn the target prosody, and various data augmentation methods are explored. Although the experiments were not completed due to time constraints, some findings remain useful to future researchers, so the experimental results and conjectures are faithfully recorded.
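One of the distribution-shift scenarios above is noise-corrupted speech. As a minimal illustration (not code from the thesis), test utterances of this kind can be prepared by mixing a noise signal into clean speech at a chosen signal-to-noise ratio; the function name and details below are assumptions for illustration only.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` so the mixture has the requested SNR in dB."""
    # Tile or truncate the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose k so that speech_power / (k^2 * noise_power) = 10^(snr_db / 10).
    k = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + k * noise
```

Sweeping `snr_db` (e.g., 20, 10, 0 dB) then gives progressively harder test conditions for the converted speech.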

Keywords

Voice conversion

Abstract (English)


The goal of the voice conversion task is to convert some property (e.g., accent, rhythm, emotion, or speaker characteristics) of the source audio without changing the linguistic content. Many studies have used deep learning to achieve voice conversion, with publicly available audio results that sound very good. However, voice conversion is a signal generation task whose actual performance requires human evaluation over a sufficient number of audio samples. In particular, very often only results on the training dataset, rather than on real-world data, are reported, yet it is well known that the performance of deep learning models may drop severely on out-of-distribution data. In the first part of this thesis, we analyze the performance of several recent and popular deep-learning-based voice conversion models when they are tested on very different datasets. The models analyzed include FragmentVC, AutoVC, AdaIN-VC, VQVC+, BLOW, DGAN-VC, and WAStarGAN-VC, in scenarios with different recording environments, different languages, and different genders. One example result shows that voice conversion models trained on a single language can generalize well to different languages. In the second part of this thesis, we attempt to achieve sequence-to-sequence, any-to-any voice conversion with non-parallel training data. We modify the framework of FragmentVC by adding a prosody module, and try several data augmentation methods. Although the experiments were not completed due to time limitations, some interesting findings are reported here for future researchers to refer to.

Keywords (English)

Voice conversion

