
Any-to-many Singing Style Conversion (任意對多歌唱風格轉換)

Advisor: 張智星

Abstract


Singing voice conversion (SVC), the task of converting the singer identity of a singing voice to that of another singer, has achieved great success in recent years. Most existing SVC systems convert only the timbre of a singing voice and leave all other information unchanged. This overlooks other aspects of singer identity, in particular a singer's singing style, which is reflected in the pitch and energy contours of the singing voice. To address this issue, this thesis proposes an any-to-many singing style conversion system that converts the pitch and energy contours of one singer into the style of another singer. To this end, we use two AutoVC-like autoencoders with information bottlenecks to disentangle singing style from musical content. The first autoencoder performs pitch conversion, while the second performs energy conversion conditioned on the pitch contour, which ensures consistency between the two contours. Because vibrato plays a crucial role in vocal expression, we further incorporate loss functions that emphasize vibrato features. Experimental results suggest that the proposed model can effectively convert the style of the pitch and energy features to that of the target singer in an any-to-many conversion scenario.
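The abstract does not say how the vibrato features or the vibrato-emphasizing losses are computed. As a rough illustration only, the sketch below shows one common way to quantify vibrato in a pitch contour: search a narrow band of modulation frequencies for the strongest oscillation and report its rate and extent. The function names, the 4 to 8 Hz band, the contour expressed in cents, the frame rate, and the simple feature-difference penalty are all assumptions made for this example, not the thesis's method.

```python
import math

def estimate_vibrato(pitch_cents, frame_rate_hz, lo_hz=4.0, hi_hz=8.0, step_hz=0.1):
    """Return (rate_hz, extent_cents) of the strongest oscillation in [lo_hz, hi_hz].

    Hypothetical helper: projects the mean-removed contour onto sinusoids at
    candidate rates (a direct DFT evaluated on a fine frequency grid).
    """
    n = len(pitch_cents)
    mean = sum(pitch_cents) / n
    centered = [p - mean for p in pitch_cents]
    best_rate, best_amp = lo_hz, 0.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        rate = lo_hz + k * step_hz
        w = 2.0 * math.pi * rate / frame_rate_hz
        re = sum(c * math.cos(w * i) for i, c in enumerate(centered))
        im = sum(c * math.sin(w * i) for i, c in enumerate(centered))
        amp = 2.0 * math.hypot(re, im) / n  # sinusoid amplitude in cents
        if amp > best_amp:
            best_rate, best_amp = rate, amp
    return best_rate, best_amp

def vibrato_feature_penalty(pred_cents, target_cents, frame_rate_hz,
                            rate_weight=1.0, extent_weight=1.0):
    """Illustrative (non-differentiable) penalty on vibrato rate and extent mismatch.

    An assumption for this example; the thesis's actual loss is not given here.
    """
    pred_rate, pred_extent = estimate_vibrato(pred_cents, frame_rate_hz)
    tgt_rate, tgt_extent = estimate_vibrato(target_cents, frame_rate_hz)
    return (rate_weight * abs(pred_rate - tgt_rate)
            + extent_weight * abs(pred_extent - tgt_extent))

# Usage: a synthetic one-second contour at 100 frames per second with a
# 6 Hz vibrato of 50 cents around an arbitrary centre pitch, and a flat contour.
FRAME_RATE = 100.0
vibrato_contour = [6000.0 + 50.0 * math.sin(2.0 * math.pi * 6.0 * i / FRAME_RATE)
                   for i in range(100)]
flat_contour = [6000.0] * 100

rate, extent = estimate_vibrato(vibrato_contour, FRAME_RATE)
penalty_same = vibrato_feature_penalty(vibrato_contour, vibrato_contour, FRAME_RATE)
penalty_flat = vibrato_feature_penalty(flat_contour, vibrato_contour, FRAME_RATE)
```

A trainable system would need a differentiable counterpart of such a penalty, and the rate estimate is not meaningful for a contour with no vibrato; the sketch only makes concrete what "vibrato features" of a pitch contour can look like.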

