
A Unified Model for Zero-Shot Singing Voice Conversion and Synthesis

Advisor: 蘇黎 (Li Su)
Co-advisor: 張智星 (Jyh-Shing Jang)

Abstract


Recent advances in deep learning have not only made zero-shot singing voice synthesis and singing voice conversion feasible, but also opened up the opportunity to unify the two tasks in a single general model. In this thesis we propose a model that unifies both tasks and can generate the singing voice of an arbitrary target singer from arbitrary source singing content in either text or audio format. The model jointly trains a phonetic source encoder for text input and an acoustic source encoder for audio input; through dynamic-programming-based self-supervised learning, the encoders learn during training how to best align the audio with the phonemes. The two encoders also map audio and text data into a shared latent space, so that both singing voice conversion and singing voice synthesis can be carried out by the same decoder. The target singer's reference recording is decomposed into frame-level fragments, which are retrieved and reassembled according to the source content through an attention mechanism; this allows the model to generate the voice of a target singer unseen during training from either a text or an audio source at inference time. Both objective and subjective experiments confirm that the proposed model outperforms the previous best any-to-any singing voice conversion and synthesis models.
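The dynamic-programming alignment step is only summarized in the abstract. Purely as an illustrative sketch (the function align_frames_to_phonemes and its similarity-matrix input are assumptions of this note, not the thesis's actual procedure), a DP aligner that monotonically maps audio frames onto a phoneme sequence could look like this:

    import numpy as np

    def align_frames_to_phonemes(sim: np.ndarray) -> np.ndarray:
        """Monotonic DP alignment of T audio frames to N phonemes (assumes T >= N).

        sim: (T, N) matrix of frame-phoneme similarity scores (higher = better),
             e.g. dot products between frame latents and phoneme embeddings.
        Returns a length-T index array assigning each frame to one phoneme,
        non-decreasing over time, with every phoneme covering at least one frame.
        """
        T, N = sim.shape
        NEG = -np.inf
        dp = np.full((T, N), NEG)            # dp[t, n]: best score with frame t on phoneme n
        back = np.zeros((T, N), dtype=int)   # 0 = stayed on phoneme n, 1 = advanced from n-1
        dp[0, 0] = sim[0, 0]
        for t in range(1, T):
            for n in range(min(t + 1, N)):   # need n <= t so all earlier phonemes fit
                stay = dp[t - 1, n]
                move = dp[t - 1, n - 1] if n > 0 else NEG
                if move > stay:
                    dp[t, n], back[t, n] = move + sim[t, n], 1
                else:
                    dp[t, n] = stay + sim[t, n]
        # Backtrack from the terminal cell (last frame assigned to last phoneme).
        path = np.empty(T, dtype=int)
        n = N - 1
        for t in range(T - 1, -1, -1):
            path[t] = n
            n -= back[t, n]
        return path

In the self-supervised training loop described above, such a best path could serve as the target that teaches both encoders how frames and phonemes line up; the thesis's exact cost function and path constraints may differ from this sketch.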

Parallel Abstract (English)


Recent advances in deep learning not only facilitate the implementation of zero-shot singing voice synthesis (SVS) and singing voice conversion (SVC) tasks, but also provide the opportunity to unify these two tasks into one generalized model. In this paper, we propose such a model that can generate the singing voice of any target singer from any source singing content in either text or audio format. The model incorporates self-supervised joint training of the phonetic source encoder and the acoustic source encoder, with an audio-to-phoneme alignment process in each training step, such that these encoders map the audio and text data respectively into a shared, temporally aligned, and singer-agnostic latent space. The target singer's latent representations, encoded at different granularity levels, are all trained to match the source latent representations sequentially with the attention mechanisms in the decoding stage. This enables the model to generate an unseen target singer's voice with fine-grained resolution from either text or audio sources during the inference stage. Both objective and subjective experiments confirm that the proposed model is competitive with the state-of-the-art SVC and SVS methods.
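To make the two-encoder, one-decoder layout concrete, the following is a minimal PyTorch sketch under assumed dimensions. The class UnifiedSinger, the GRU encoders, and the single cross-attention block are illustrative stand-ins chosen for brevity, not the thesis's actual architecture:

    import torch
    import torch.nn as nn

    class UnifiedSinger(nn.Module):
        """Hypothetical sketch of the unified SVC/SVS layout described above.

        Two source encoders map text or audio into the same latent space;
        a cross-attention decoder queries frame-level representations of the
        target singer's reference audio to reconstruct a mel spectrogram.
        All dimensions and module choices are illustrative assumptions.
        """

        def __init__(self, n_phonemes=100, n_mels=80, d=256, heads=4):
            super().__init__()
            # Phonetic source encoder: phoneme tokens -> shared latent frames.
            self.phone_emb = nn.Embedding(n_phonemes, d)
            self.phone_rnn = nn.GRU(d, d, batch_first=True)
            # Acoustic source encoder: source mel frames -> shared latent frames.
            self.audio_proj = nn.Linear(n_mels, d)
            self.audio_rnn = nn.GRU(d, d, batch_first=True)
            # Reference encoder: target singer's mel frames -> per-frame timbre keys/values.
            self.ref_proj = nn.Linear(n_mels, d)
            # Decoder: source latents attend over the reference frames.
            self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.out = nn.Linear(d, n_mels)

        def encode_source(self, phonemes=None, source_mel=None):
            # Either input type lands in the same latent space, so one decoder
            # serves both synthesis (text input) and conversion (audio input).
            if phonemes is not None:                              # SVS path
                h, _ = self.phone_rnn(self.phone_emb(phonemes))
            else:                                                 # SVC path
                h, _ = self.audio_rnn(self.audio_proj(source_mel))
            return h

        def forward(self, ref_mel, phonemes=None, source_mel=None):
            src = self.encode_source(phonemes, source_mel)        # (B, T, d)
            ref = self.ref_proj(ref_mel)                          # (B, T_ref, d)
            # Frame-level timbre retrieval from the (unseen) target singer.
            mixed, _ = self.cross_attn(query=src, key=ref, value=ref)
            return self.out(mixed)                                # predicted mel (B, T, n_mels)

Calling model(ref_mel, phonemes=tokens) would exercise the synthesis path and model(ref_mel, source_mel=mel) the conversion path. In the actual model, the reference is said to be encoded at several granularity levels and matched sequentially; the single attention block here deliberately simplifies that.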

