
Real-Time Mandarin Speech Synthesis System

Abstract


This thesis studies and implements a real-time Mandarin speech synthesis system. The system uses a model that converts a text sequence into a mel-spectrogram sequence, followed by a vocoder that converts the mel spectrogram into synthesized speech. We implement the sequence-to-sequence conversion model with Tacotron2 and pair it with several different vocoders, including Griffin-Lim, the WORLD vocoder, and WaveGlow. The WaveGlow neural-network vocoder, which implements an invertible encoding/decoding function, stands out most: it is impressive in both synthesis speed and speech quality. The system is built on 12 hours of single-speaker recordings from the Biaobei corpus. In terms of speech quality, the system with the WaveGlow vocoder achieves a MOS of 4.08, slightly below the 4.41 of natural speech and far above the other two vocoders (average 2.93). In terms of processing speed, on a GeForce RTX 2080 Ti GPU the WaveGlow-based system needs only 1.4 seconds to generate 10 seconds of 48 kHz speech, so the system runs in real time.
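To make the two-stage pipeline concrete, the sketch below chains NVIDIA's publicly released pretrained Tacotron2 and WaveGlow checkpoints from torch.hub. These checkpoints are English models trained on LJSpeech at 22,050 Hz, not the 48 kHz Mandarin models trained in this thesis, so the code only illustrates the text → mel spectrogram → waveform flow; the text prompt and output file name are arbitrary.

    import torch
    from scipy.io.wavfile import write

    # Stage 1: text -> mel spectrogram (Tacotron2)
    # Stage 2: mel spectrogram -> waveform (WaveGlow)
    # Pretrained English checkpoints published by NVIDIA on torch.hub.
    tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
    waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow')
    utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')

    tacotron2 = tacotron2.to('cuda').eval()
    waveglow = waveglow.to('cuda').eval()

    # Convert the input text into a padded tensor of symbol IDs.
    sequences, lengths = utils.prepare_input_sequence(["Speech synthesis is fun."])

    with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)  # predicted mel-spectrogram frames
        audio = waveglow.infer(mel)                      # raw waveform samples

    # The pretrained checkpoints operate at 22,050 Hz.
    write("synthesized.wav", 22050, audio[0].data.cpu().numpy())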

Keywords

Text-to-Speech, Tacotron2, WaveGlow

Parallel Abstract


This thesis studies and implements a real-time Mandarin speech synthesis system. The system uses a conversion model from the text sequence to the mel-spectrogram sequence, and then concatenates a vocoder that maps the mel spectrogram to the synthesized speech. We use Tacotron2 to implement the sequence-to-sequence conversion model, together with several different vocoders, including Griffin-Lim, the WORLD vocoder, and WaveGlow. The WaveGlow neural-network vocoder, which implements an invertible encoder-decoder function, is the most prominent and is impressive in both synthesis speed and speech quality. We implement the system with a 12-hour single-speaker corpus. In terms of voice quality, the MOS of speech synthesized with the WaveGlow vocoder is 4.08, slightly lower than the 4.41 of real speech and far better than that of the other two vocoders (average 2.93). In terms of processing speed, with a GeForce RTX 2080 Ti GPU the system using the WaveGlow vocoder produces 10 seconds of 48 kHz speech in only 1.4 seconds, so it is a real-time system.
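Of the three vocoders compared above, Griffin-Lim is the only one that needs no training: it estimates the missing phase iteratively from a magnitude spectrogram. A minimal sketch of that inversion step with librosa is shown below; the file names, sample rate, and STFT parameters are illustrative assumptions rather than the settings used in the thesis.

    import librosa
    import soundfile as sf

    # Load a reference utterance and compute an 80-band mel spectrogram.
    y, sr = librosa.load("reference.wav", sr=48000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=512, n_mels=80)

    # Invert the mel spectrogram back to audio: mel -> linear magnitude
    # (pseudo-inverse of the mel filterbank), then iterative Griffin-Lim
    # phase estimation to recover a waveform.
    y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=2048,
                                                 hop_length=512, n_iter=60)
    sf.write("griffin_lim_resynthesis.wav", y_hat, sr)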

Parallel Keywords

TTS, Tacotron2, WaveGlow

References


Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2), 236-243. doi: 10.1109/TASSP.1984.1164317
Morise, M., Yokomori, F., & Ozawa, K. (2016). WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, E99.D(7), 1877-1884. doi: 10.1587/transinf.2015EDP7457
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-based models for speech recognition. arXiv preprint arXiv:1506.07503v1.
van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499v2.
Prenger, R., Valle, R., & Catanzaro, B. (2018). WaveGlow: A flow-based generative network for speech synthesis. arXiv preprint arXiv:1811.00002v1.
