
Unsupervised Voice Conversion using Activation Guidance and Adaptive Instance Normalization

Advisor: 李宏毅 (Hung-yi Lee)

Abstract


In recent years, applications of deep learning to voice conversion (VC) have grown rapidly, and research has gradually matured from one-to-one and many-to-many conversion to any-to-any and one-shot conversion. Many VC models rely on representation disentanglement to separate the speaker characteristics and the linguistic content of an utterance; the content is then combined with a target speaker's characteristics to synthesize the converted speech. Disentanglement yields a speaker embedding that carries speaker characteristics and a content embedding that carries linguistic content. A common approach is to impose an information bottleneck when extracting the content embedding so that speaker information is filtered out. If the bottleneck is too strong, however, content information is lost and the converted speech is of poor quality; if it is too weak, speaker information is not fully removed and the converted speech still carries the source speaker's characteristics, so the conversion fails. This is the trade-off between disentangling ability and reconstruction ability. The first part of this thesis proposes a VC model that uses a single encoder together with adaptive instance normalization (AdaIN). Compared with the prior model, it substantially reduces memory usage and computation cost while improving output quality and speaker similarity. The second part of this thesis investigates how different activation functions affect the disentanglement of speech representations. Using the single-encoder architecture described above, we apply different activation functions to the content embedding and observe how each one shifts the trade-off between disentangling ability and reconstruction ability. Experimental results show that, compared with the baseline, a single encoder combined with a suitable sigmoid function improves both disentangling ability and reconstruction ability. In subjective listening tests, the proposed method also achieves the best mean opinion score (MOS) for speech quality and the best speaker-similarity score.
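As a rough illustration of how AdaIN combines a content representation with a speaker embedding, the following is a minimal PyTorch sketch; the module and argument names are illustrative assumptions and do not come from the thesis.

    import torch
    import torch.nn as nn

    class AdaIN(nn.Module):
        # Adaptive instance normalization: normalize the content features per
        # channel over time, then re-scale and re-shift them with parameters
        # predicted from the speaker embedding.
        def __init__(self, channels, speaker_dim):
            super().__init__()
            self.affine = nn.Linear(speaker_dim, channels * 2)

        def forward(self, content, speaker):
            # content: (batch, channels, time); speaker: (batch, speaker_dim)
            mean = content.mean(dim=2, keepdim=True)
            std = content.std(dim=2, keepdim=True) + 1e-5
            normalized = (content - mean) / std  # instance normalization
            gamma, beta = self.affine(speaker).chunk(2, dim=1)
            return gamma.unsqueeze(2) * normalized + beta.unsqueeze(2)

Because the per-channel statistics of the content features are normalized away and replaced by speaker-dependent ones, the decoder receives speaker identity only through the AdaIN parameters.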

Abstract (English)


Recently, applications of and research on voice conversion (VC) have grown rapidly. From one-to-one and many-to-many to any-to-any and one-shot VC, the research has gradually matured. Many deep-learning-based VC systems use feature disentanglement to separate speaker information from linguistic content in a speech signal; they convert the voice by changing the speaker information while preserving the content information. In the process of feature disentanglement, a speaker embedding and a content embedding are extracted from an utterance. Applying an information bottleneck to the content embedding is a common way to remove speaker information from it. However, if the bottleneck is too strong, content information is lost, which results in low-quality conversion; if it is too weak, speaker information leaks into the content embedding. In short, there is a trade-off between disentangling ability and reconstruction ability. In this thesis, we first propose to use a single encoder with adaptive instance normalization (AdaIN) to achieve VC, which reduces memory usage while improving the voice quality and the speaker similarity of the generated speech. In the second part of the thesis, we explore the effects of different activation functions on the speech representation. We use the single-encoder model mentioned above as the baseline and apply different activation functions to the content embedding to see how they affect the results. The experimental results show that using a single encoder with a proper sigmoid function applied to the speech representation improves disentangling ability and reconstruction ability at the same time. The proposed method also achieves the best performance in the subjective evaluations, including the naturalness test and the speaker similarity test.
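To make the activation-function experiment concrete, the sketch below (PyTorch, with illustrative names and assumed tensor shapes) shows how a candidate activation would be applied to the content embedding before decoding; the specific sigmoid variant reported as best in the thesis is not specified here.

    import torch

    # Candidate activations applied to the single encoder's content embedding.
    # The abstract reports that a suitably chosen sigmoid gave the best trade-off
    # between disentangling and reconstruction; the variants below are generic.
    ACTIVATIONS = {
        "identity": lambda h: h,
        "tanh": torch.tanh,
        "relu": torch.relu,
        "sigmoid": torch.sigmoid,
    }

    def apply_activation_bottleneck(content_embedding, name="sigmoid"):
        # content_embedding: (batch, channels, time) output of the shared encoder.
        # Squashing each dimension (e.g. into (0, 1) for sigmoid) limits how much
        # speaker information can leak through the content path.
        return ACTIVATIONS[name](content_embedding)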

