
以分別嵌入語者及語言內容資訊之深層生成模型達成無監督式語音轉換

Unsupervised Voice Conversion by Separately Embedding Speaker and Content Information with Deep Generative Model

Advisor: Lin-shan Lee (李琳山)
Co-advisor: Hung-yi Lee (李宏毅)

Abstract


The goal of voice conversion is to preserve the linguistic content of an utterance while converting the speaker characteristics to those of another target speaker. Because parallel corpora for this task are difficult to collect, much research has turned to voice conversion with non-parallel corpora. This thesis first extends non-parallel voice conversion from the single-target setting to the multi-target setting, that is, a single model that can convert to multiple target speakers. The model is designed around the concept of separately embedding linguistic content and speaker information. Through adversarial training, the model learns to generate speaker-invariant representations that contain only linguistic content; by changing the speaker vector, it can convert the speech to a chosen target speaker. We also propose a method based on generative adversarial networks (GANs) to alleviate the over-smoothing of the generated speech. In subjective evaluations, this model achieves naturalness and similarity to the target speaker comparable to a baseline built on cycle-consistent generative adversarial networks (CycleGANs), even though that baseline only handles the single-target setting.

In addition, we propose a one-shot voice conversion model. At the inference stage, only one utterance from the source speaker and one utterance from the target speaker are needed for the model to "speak" the content of the source utterance in the target speaker's voice. By introducing instance normalization into the content encoder of a variational autoencoder (VAE) and adaptive instance normalization into the decoder, the model learns to embed speaker information and content information separately, which enables one-shot voice conversion. In similarity evaluations, the generated speech resembles the target speaker.
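As a concrete illustration of the multi-target design described above, the following is a minimal PyTorch sketch of the idea: a content encoder produces a representation, an adversarially trained speaker classifier tries to identify the speaker from it, and a decoder reconstructs speech from that representation plus a speaker embedding. All module names, layer sizes, loss weights, and the training loop are illustrative assumptions, not the exact configuration used in the thesis.

# Minimal sketch of adversarial speaker/content disentanglement for
# multi-target voice conversion. Sizes and layers are illustrative only.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, d_content=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, d_content, kernel_size=5, padding=2),
        )

    def forward(self, x):            # x: (batch, n_mels, frames)
        return self.net(x)           # content code: (batch, d_content, frames)

class SpeakerClassifier(nn.Module):
    """Adversary: tries to identify the speaker from the content code."""
    def __init__(self, d_content=128, n_speakers=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_content, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(128, n_speakers),
        )

    def forward(self, c):
        return self.net(c)

class Decoder(nn.Module):
    """Reconstructs the spectrogram from content code + speaker embedding."""
    def __init__(self, d_content=128, d_spk=64, n_speakers=20, n_mels=80):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_spk)
        self.net = nn.Sequential(
            nn.Conv1d(d_content + d_spk, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, c, spk_id):
        s = self.spk_emb(spk_id)                        # (batch, d_spk)
        s = s.unsqueeze(-1).expand(-1, -1, c.size(-1))  # broadcast over frames
        return self.net(torch.cat([c, s], dim=1))

def training_step(enc, dec, clf, x, spk_id, opt_ae, opt_clf, lam=0.1):
    ce, l1 = nn.CrossEntropyLoss(), nn.L1Loss()

    # 1) classifier update: identify the speaker from the (detached) content code
    clf_loss = ce(clf(enc(x).detach()), spk_id)
    opt_clf.zero_grad()
    clf_loss.backward()
    opt_clf.step()

    # 2) encoder/decoder update: reconstruct well while fooling the classifier,
    #    pushing the content code toward speaker invariance
    #    (gradients that leak into clf here are cleared by zero_grad next step)
    c = enc(x)
    ae_loss = l1(dec(c, spk_id), x) - lam * ce(clf(c), spk_id)
    opt_ae.zero_grad()
    ae_loss.backward()
    opt_ae.step()
    return clf_loss.item(), ae_loss.item()

enc, dec, clf = ContentEncoder(), Decoder(), SpeakerClassifier()
opt_ae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)
opt_clf = torch.optim.Adam(clf.parameters(), lr=1e-4)

At conversion time, the source utterance's content code is simply decoded with a different speaker id, which is the "changing the speaker vector" step described in the abstract.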

Parallel Abstract


The goal of voice conversion is to preserve the linguistic content of a speech signal while converting the speaker identity to that of another target speaker. This task usually suffers from the difficulty of collecting a parallel corpus, so many studies investigate how to perform voice conversion with a non-parallel corpus. This thesis first extends non-parallel voice conversion from the single-target scenario to the multi-target scenario; in other words, one model is used to convert to multiple target speakers. The model is designed around the concept of "separately embedding language content and speaker information". Through adversarial training, the model learns to generate speaker-invariant, content-only representations; by changing the speaker latent representation, it can convert the voice to the target speaker. We also propose an approach based on generative adversarial networks (GANs) to address the problem of over-smoothed speech. In the subjective evaluation, this model achieves results comparable to the CycleGAN-based baseline in terms of similarity to the target speaker and naturalness of speech, while additionally supporting multi-target conversion.

In addition, we propose a one-shot voice conversion model. At the inference stage, only one utterance from the source speaker and one from the target speaker are needed, and the model can use the target speaker's voice to "speak" the content of the source speech. By introducing instance normalization in the content encoder of a variational autoencoder (VAE) and adaptive instance normalization in the decoder, the model learns to separately embed speaker information and content information, so that one-shot voice conversion is achieved. In the similarity evaluation, the model generates speech similar to the target speaker.
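To make the instance-normalization mechanism concrete, below is a minimal PyTorch sketch of the one-shot setup the abstract describes: instance normalization without affine parameters in the content encoder strips per-channel statistics (which tend to carry speaker characteristics), and adaptive instance normalization in the decoder re-injects statistics predicted by a speaker encoder from a single target utterance. Layer sizes, shapes, and module names are illustrative assumptions only, not the thesis's exact architecture.

# Minimal sketch of one-shot voice conversion with IN / AdaIN.
# Shapes and layer sizes are illustrative assumptions only.
import torch
import torch.nn as nn

def adaptive_instance_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, channels, frames); gamma, beta: (batch, channels)."""
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, keepdim=True) + eps
    x_norm = (x - mu) / sigma                                  # remove source statistics
    return gamma.unsqueeze(-1) * x_norm + beta.unsqueeze(-1)   # apply target statistics

class SpeakerEncoder(nn.Module):
    """Predicts per-channel (gamma, beta) from one target-speaker utterance."""
    def __init__(self, n_mels=80, channels=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, channels, kernel_size=5, padding=2)
        self.to_gamma = nn.Linear(channels, channels)
        self.to_beta = nn.Linear(channels, channels)

    def forward(self, x_tgt):
        h = self.conv(x_tgt).mean(dim=-1)            # temporal average pooling
        return self.to_gamma(h), self.to_beta(h)

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, channels=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, channels, kernel_size=5, padding=2)
        # affine=False: normalization only, no learned scale/shift
        self.inorm = nn.InstanceNorm1d(channels, affine=False)

    def forward(self, x_src):
        return self.inorm(torch.relu(self.conv(x_src)))

class Decoder(nn.Module):
    def __init__(self, n_mels=80, channels=128):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(channels, n_mels, kernel_size=5, padding=2)

    def forward(self, content, gamma, beta):
        h = adaptive_instance_norm(torch.relu(self.conv1(content)), gamma, beta)
        return self.conv2(h)

# One-shot conversion: one source utterance supplies the content,
# one target utterance supplies the speaker statistics.
enc_c, enc_s, dec = ContentEncoder(), SpeakerEncoder(), Decoder()
x_src = torch.randn(1, 80, 200)   # source mel-spectrogram (dummy data)
x_tgt = torch.randn(1, 80, 180)   # target-speaker mel-spectrogram (dummy data)
gamma, beta = enc_s(x_tgt)
converted = dec(enc_c(x_src), gamma, beta)   # (1, 80, 200)

Training would typically reconstruct an utterance from its own content code and speaker statistics; the one-shot behaviour at inference comes from swapping in the statistics predicted from a single utterance of an unseen target speaker.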

