
Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach

Advisor: Jyh-Shing Roger Jang (張智星)

Abstract




Singing voice conversion is a well-known audio processing task that aims to convert one singer's voice into another's. Recent neural network approaches have greatly improved both the quality and the practical applicability of such technology. This work focuses on "PitchNet", an unsupervised, WaveNet-based model proposed by Chengqi Deng et al. We first investigate the properties and training procedure of the original PitchNet, showing that its conversion quality degrades when converting between female and male singers, and when the source audio comes from unseen external data. We then propose three approaches to improve the model: hyperparameter tuning, pitch augmentation, and two-phase conversion. Our experiments are conducted on the NUS-48E dataset; evaluation is performed on the same dataset and on a few songs from "The 'Mixing Secrets' Free Multitrack Download Library", using two objective metrics, Pitch Zero-Normalized Cross-Correlation (PZNCC) and Singer Classification Accuracy (SCA), along with one subjective metric, Mean Opinion Score (MOS). The results show that the proposed method improves pitch accuracy (PZNCC from 0.74 to 0.81) and perceived naturalness (MOS from 2.11 to 3.33). In addition, the proposed method is more robust to unseen source data (MOS from 1.48 to 3.15).
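
At its core, the pitch augmentation described above amounts to transposing the training recordings by a few semitones so that each singer's data covers a wider pitch range. The abstract does not specify the shift range or tooling, so the following is a minimal Python sketch, assuming librosa for pitch shifting and a hypothetical range of ±2 semitones (the function name pitch_augment and the file naming scheme are illustrative, not the thesis implementation):

    import os
    import librosa
    import soundfile as sf

    def pitch_augment(wav_path, out_dir, semitone_shifts=(-2, -1, 1, 2)):
        """Write pitch-shifted copies of one recording for data augmentation."""
        y, sr = librosa.load(wav_path, sr=None)  # keep the original sample rate
        stem = os.path.splitext(os.path.basename(wav_path))[0]
        for n_steps in semitone_shifts:
            # Shift the waveform by n_steps semitones. Timbre artifacts grow
            # with larger shifts, which is why the assumed range stays small.
            y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
            out_path = os.path.join(out_dir, f"{stem}_shift{n_steps:+d}.wav")
            sf.write(out_path, y_shift, sr)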
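As for the PZNCC metric, zero-normalized cross-correlation of two pitch contours at zero lag reduces to the Pearson correlation of the mean- and variance-normalized F0 sequences. The exact definition used in the thesis (lag handling, voiced-frame masking, choice of F0 tracker) is not given in this record, so this is a hedged sketch assuming zero lag and comparison only over frames voiced in both contours:

    import numpy as np

    def pzncc(f0_ref: np.ndarray, f0_conv: np.ndarray) -> float:
        """Zero-normalized cross-correlation of two F0 contours at zero lag."""
        n = min(len(f0_ref), len(f0_conv))
        a = f0_ref[:n].astype(float)
        b = f0_conv[:n].astype(float)
        # Assumption: unvoiced frames are marked with F0 == 0 and excluded.
        voiced = (a > 0) & (b > 0)
        a, b = a[voiced], b[voiced]
        a = (a - a.mean()) / a.std()  # zero-normalize: zero mean, unit variance
        b = (b - b.mean()) / b.std()
        return float(np.mean(a * b))  # 1.0 means perfectly correlated contours

The F0 contours themselves could come from any pitch tracker applied to the source and converted audio, e.g. librosa.pyin, before being passed to this function.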

References


A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
C. Deng, C. Yu, H. Lu, C. Weng, and D. Yu, "PitchNet: Unsupervised singing voice conversion with pitch adversarial network," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7749-7753, IEEE, 2020.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," arXiv preprint arXiv:1912.01703, 2019.
Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech," in 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp. 1-9, IEEE, 2013.
M. Senior, "The 'Mixing Secrets' free multitrack download library," 2012.
