Incorporating Speaker Embedding and Post-Filter Network for Improving Speaker Similarity of Personalized Speech Synthesis System

Abstract

In recent years, single-speaker speech synthesis systems have achieved high-quality output, but for multi-speaker systems, the quality of the synthesized speech and its similarity to the target speaker remain major challenges. This study addresses both issues by building a multi-speaker text-to-speech (TTS) system. For the multi-speaker issue, the goal is to achieve speaker conversion from only a small number of samples (zero-shot). We implement the multi-speaker synthesis system by introducing a speaker embedding and compare the effectiveness of embeddings built for different tasks, contrasting embeddings trained for speaker verification with those trained purely for voice conversion. Next, to raise the speaker similarity and quality of the synthesized speech, we replace the post-net, the part of the neural network architecture responsible for enhancing the output spectrum, with a post-filter network, comparing the spectra the two modules produce and examining the difference in their model parameter counts. Experimental results show that integrating the speaker embedding into the neural TTS system through an additive attention mechanism does produce synthesized speech carrying the target speaker's characteristics, and that the added post-filter network strengthens the speaker characteristics and speech quality of the synthesized speech beyond the conventional post-net approach. Synthesizing an utterance of typical length takes about 2 seconds, which approaches real-time personalized speech synthesis. Future work will incorporate additional information to help the speaker embedding improve TTS performance.
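As a concrete illustration of the embedding-integration step, below is a minimal PyTorch sketch of fusing a speaker embedding with TTS encoder outputs through additive attention. The module name, dimensions, and fusion details are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch, assuming a Tacotron-style encoder output of shape
# (batch, time, enc_dim) and a fixed-length speaker embedding. All names
# and dimensions here are hypothetical, not the paper's implementation.
import torch
import torch.nn as nn

class SpeakerFusion(nn.Module):
    """Fuse a speaker embedding into encoder outputs via additive attention."""

    def __init__(self, enc_dim=512, spk_dim=256, attn_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_spk = nn.Linear(spk_dim, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)
        self.proj = nn.Linear(enc_dim + spk_dim, enc_dim)

    def forward(self, enc_out, spk_emb):
        # enc_out: (B, T, enc_dim); spk_emb: (B, spk_dim)
        # Additive (Bahdanau-style) score of each encoder frame against
        # the speaker embedding used as the query.
        e = self.score(torch.tanh(
            self.w_enc(enc_out) + self.w_spk(spk_emb).unsqueeze(1)))  # (B, T, 1)
        alpha = torch.softmax(e, dim=1)
        # Gate a frame-broadcast copy of the speaker embedding by the
        # attention weights, then project back to the encoder dimension.
        spk_seq = alpha * spk_emb.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        return self.proj(torch.cat([enc_out, spk_seq], dim=-1))

# Usage: conditioned = SpeakerFusion()(torch.randn(2, 120, 512), torch.randn(2, 256))
```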

Parallel Abstract (English)

In recent years, speech synthesis systems have been able to generate speech of high quality. However, a multi-speaker text-to-speech (TTS) system still requires a large amount of speech data for each target speaker. In this study, we construct a multi-speaker TTS system that alleviates this problem by incorporating two submodules into an artificial neural network-based speech synthesis system. The first module adds a speaker embedding into the encoding module of the end-to-end TTS framework while using only a small amount of speech data from the training speakers. Two speaker embedding methods, namely speaker verification embedding and voice conversion embedding, are compared to decide which is more suitable for the personalized TTS system. In addition, we substitute a post-filter network for the conventional post-net module, which is normally adopted to enhance the output spectrum sequence, to further improve the quality of the generated speech. Experimental results showed that adding the speaker embedding into the encoding module is effective: the resulting utterances are indeed perceived as the target speaker's. Moreover, the post-filter network not only improves speech quality but also enhances the speaker similarity of the generated utterances. The constructed TTS system can generate an utterance of the target speaker in fewer than 2 seconds. In the future, other features such as prosody information will be incorporated to help the TTS framework improve its performance.
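The abstracts do not detail the post-filter architecture, so the following is only a sketch, assuming a simple residual convolutional refiner that can be dropped in where the conventional post-net sits; the layer count and channel widths are hypothetical.

```python
# Minimal sketch of a post-filter that refines the decoder's mel
# spectrogram in place of the conventional post-net. The residual-conv
# design below is an assumption; the paper's actual post-filter may differ.
import torch
import torch.nn as nn

class PostFilter(nn.Module):
    def __init__(self, n_mels=80, channels=256, n_layers=4):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(n_layers):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels),
                       nn.Tanh()]
            in_ch = channels
        layers.append(nn.Conv1d(channels, n_mels, kernel_size=5, padding=2))
        self.net = nn.Sequential(*layers)

    def forward(self, mel):
        # mel: (B, n_mels, T). Predicting a residual keeps the module a
        # corrective filter on top of the decoder output rather than a
        # full re-synthesis of the spectrum.
        return mel + self.net(mel)

# Usage: refined = PostFilter()(torch.randn(1, 80, 200))
```

Predicting a residual rather than the spectrum itself is a common design choice for such refiners, since the module then only has to learn a correction on top of an already-reasonable decoder output.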
