
A Taiwanese Voice Conversion System Combining Speech Recognition and Synthesis Modules

Taiwanese Voice Conversion based on Cascade ASR and TTS Framework

Abstract


Taiwanese has been listed by the United Nations as an endangered language and urgently needs to be passed on. This thesis therefore studies how to build a Taiwanese speech synthesis system that can synthesize any Taiwanese sentence in anyone's voice. To this end, we first (1) constructed the large-scale Taiwanese Across Taiwan (TAT) speech corpus, comprising 204 speakers and about 140 hours of speech, including two male and two female speakers who each recorded about 10 hours of speech dedicated to Taiwanese speech synthesis. We then (2) built a Chinese-text-to-Taiwanese speech synthesis system based on the Tacotron2 architecture, with a frontend module that converts Chinese characters to Tâi-lô romanization and a backend WaveGlow real-time speech generator. Finally, (3) we constructed a Taiwanese voice conversion system based on a cascade of Taiwanese speech recognition and speech synthesis, implementing two conversion functions: intra-lingual (Taiwanese-to-Taiwanese) and cross-lingual (Mandarin-to-Taiwanese) voice conversion. To evaluate this system, we publicly recruited 29 subjects online to perform scoring tasks on both intra-lingual and cross-lingual converted Taiwanese speech, carrying out subjective mean opinion score (MOS) evaluations of "naturalness" and "similarity". The results show that, for intra-lingual conversion using 10 minutes, 3 minutes, and 30 seconds of target-speaker speech, the average naturalness MOS was 3.45, 3.02, and 2.23, and the average similarity MOS was 3.38, 2.99, and 2.10, respectively; for cross-lingual conversion using 6 minutes and 3 minutes of target-speaker speech, the average naturalness MOS was 2.90 and 2.70, and the average similarity MOS was 2.84 and 2.54, respectively. These results show that we have indeed taken a first step toward a Taiwanese speech synthesis system that can synthesize any Taiwanese sentence in anyone's voice.
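The cascade voice-conversion dataflow described in the abstract (speech recognition, a Chinese-to-Tâi-lô frontend, Tacotron2-style synthesis conditioned on a target-speaker embedding, and a WaveGlow-style vocoder) can be sketched as follows. All function names and the one-entry lexicon here are illustrative stand-ins, not the thesis implementation; each stage is mocked with strings so the pipeline runs end to end.

```python
# Hypothetical sketch of the cascade ASR -> frontend -> TTS -> vocoder dataflow.
# Every stage below is a mock; real systems would operate on audio and spectrograms.

def recognize(source_speech: str) -> str:
    """ASR stage: source speech -> text (mocked as identity on a transcript)."""
    return source_speech

def translate_to_tailo(mandarin_text: str) -> str:
    """Frontend: Chinese characters -> Tâi-lô romanization (toy lookup table)."""
    lexicon = {"你好": "li2-ho2"}  # single made-up entry, for illustration only
    return lexicon.get(mandarin_text, mandarin_text)

def synthesize(tailo: str, speaker_embedding: str) -> str:
    """TTS stage (Tacotron2-style): Tâi-lô + target-speaker embedding -> mel spectrogram."""
    return f"mel({tailo}|spk={speaker_embedding})"

def vocode(mel: str) -> str:
    """Vocoder stage (WaveGlow-style): mel spectrogram -> waveform."""
    return f"wav[{mel}]"

def convert(source_speech: str, speaker_embedding: str) -> str:
    """Cascade voice conversion: chain the four stages in order."""
    text = recognize(source_speech)
    tailo = translate_to_tailo(text)
    mel = synthesize(tailo, speaker_embedding)
    return vocode(mel)

print(convert("你好", "target_spk"))  # wav[mel(li2-ho2|spk=target_spk)]
```

Because the conversion goes through recognized text rather than acoustic features, the same cascade supports both intra-lingual (Taiwanese input) and cross-lingual (Mandarin input) conversion by swapping the recognizer.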

Parallel Abstract


Taiwanese has been listed as an endangered language by the United Nations and urgently needs to be passed on. This study therefore investigates how to build a Taiwanese speech synthesis system that can synthesize any Taiwanese sentence in anyone's voice. To achieve this goal, we first (1) built the large-scale Taiwanese Across Taiwan (TAT) corpus, with a total of 204 speakers and about 140 hours of speech; among these speakers, two men and two women each recorded about 10 hours of speech specifically for speech synthesis. We then (2) established a Chinese-text-to-Taiwanese speech synthesis system based on the Tacotron2 architecture, together with a frontend sequence-to-sequence machine translation module that converts Chinese characters to Taiwan Minnanyu Luomazi Pinyin (Tâi-lô) and a backend WaveGlow real-time speech generator. Finally, (3) we constructed a Taiwanese voice conversion system based on a cascaded speech recognition and speech synthesis framework, implementing two voice conversion functions: intra-lingual (Taiwanese-to-Taiwanese) and cross-lingual (Chinese-to-Taiwanese) conversion. To evaluate this system, we publicly recruited 29 subjects from the Internet to perform two scoring tasks, intra-lingual and cross-lingual voice conversion, and carried out subjective "naturalness" and "similarity" mean opinion score (MOS) evaluations. The results show that, in the intra-lingual session, the average naturalness MOS was 3.45, 3.02, and 2.23 and the average similarity MOS was 3.38, 2.99, and 2.10 when using 10 minutes, 3 minutes, and 30 seconds of target speech, respectively; in the cross-lingual session, the average naturalness MOS was 2.90 and 2.70 and the average similarity MOS was 2.84 and 2.54 when using 6 minutes and 3 minutes of target speech, respectively.
These results show that the proposed system can indeed synthesize any Taiwanese sentence in anyone's voice.
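The naturalness and similarity figures above are mean opinion scores: each listener rates a stimulus on a 1-to-5 scale, and the ratings for a condition are averaged. A minimal sketch of that aggregation, using made-up ratings rather than the thesis data:

```python
# Minimal MOS aggregation sketch: average 1-5 listener ratings per condition.
# The rating lists below are invented examples, not the reported results.

def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings for one test condition."""
    return sum(ratings) / len(ratings)

naturalness_10min = [4, 3, 4, 3, 3, 4, 3, 4]  # hypothetical per-listener scores
print(mean_opinion_score(naturalness_10min))  # 3.5
```

In the study, this average was computed separately for each amount of target-speaker speech (e.g. 10 minutes vs. 30 seconds), which is why naturalness and similarity each yield one MOS value per condition.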

