This thesis proposes a vector quantization-based voice conversion approach. Both objective and subjective evaluations show that the proposed method outperforms existing approaches in both audio naturalness and speaker similarity.
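While the thesis's implementation details are not reproduced in this summary, the core operation of vector quantization can be illustrated with a minimal sketch: each continuous feature frame is replaced by its nearest entry in a learned codebook, discretizing the speech representation. The codebook size, feature dimensionality, and function names below are illustrative assumptions, not the thesis's actual configuration.

    # Minimal sketch of the vector-quantization step used in VQ-based
    # voice conversion: map each continuous feature frame to its nearest
    # codebook vector. All sizes and names here are assumed for illustration.
    import numpy as np

    def vector_quantize(frames: np.ndarray, codebook: np.ndarray):
        """Quantize frames (T, D) against a codebook (K, D).

        Returns the quantized frames (T, D) and the chosen indices (T,).
        """
        # Squared Euclidean distance from every frame to every code vector:
        # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
        dists = (
            np.sum(frames ** 2, axis=1, keepdims=True)
            - 2.0 * frames @ codebook.T
            + np.sum(codebook ** 2, axis=1)
        )
        indices = np.argmin(dists, axis=1)  # nearest code per frame
        return codebook[indices], indices

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        codebook = rng.normal(size=(256, 64))  # K=256 codes, D=64 dims (assumed)
        frames = rng.normal(size=(100, 64))    # 100 frames of source features
        quantized, idx = vector_quantize(frames, codebook)
        print(quantized.shape, idx[:10])

In VQ-based conversion pipelines of this kind, the discretized indices tend to capture content while discarding speaker-specific detail, which is what makes re-synthesis in a target speaker's voice possible.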