A Study and Implementation of the Acoustic Module in a Mandarin Text-to-Speech System

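This dissertation studies and implements the acoustic module of a Mandarin text-to-speech (TTS) system. The research focuses on the design of the acoustic module and the analysis of its consistency, and the module is then implemented in a Mandarin TTS system on the basis of the analysis results. The acoustic module comprises a prosody generator, a synthesis-unit generator, and a speech synthesizer. The dissertation first explains the physical meaning of the acoustic module within a Mandarin TTS system and its importance to the system, and then builds the module using the hidden Markov model (HMM), a recurrent neural network (RNN), and the pitch-synchronous overlap-add (PSOLA) method. The consistency analysis of the acoustic module is the main emphasis of the dissertation. Consistency is defined as follows: within the same word, the warping relationship between spectrum and prosody inside a syllable is very similar. Experiments show that this intra-syllable warping relationship between spectrum and prosody is highly consistent, which also supports the soundness and importance of the acoustic module. This result indicates that the intra-syllable warping relationship between spectrum and prosody must be taken into account during synthesis in order to improve the naturalness of the synthesized speech. Based on the analysis, the dissertation proposes five strategies for implementing the acoustic module in a Mandarin TTS system. First, an HMM is used to segment the speech database automatically. Second, the prosody generation module and the synthesis-unit generation module are trained on the corpus so that the prosodic information and the synthesis units are consistent. Third, synthesis units are produced by combining statistical samples with syllable waveforms, so that the syllable waveforms also carry statistical properties. Fourth, the synthesis units embed prosodic information, which allows them to be warped in timing, energy, and spectrum and improves the quality of the synthesized speech. Fifth, a coarticulation-unit module is adopted to reduce coarticulation effects. Finally, in the performance evaluation, the system occupies only 2.4 MB, needs only 0.08 MIPS to synthesize each syllable, and produces clear, natural, and fluent speech. This shows that the completed system offers good speech quality, a small footprint, and high efficiency, and is highly competitive in the market.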

Keywords

Text-to-speech system; acoustic module; consistency; hidden Markov model

Parallel Abstract

This dissertation studies the design of the acoustic module (AM) and its consistency property, with the aim of improving the performance of Mandarin text-to-speech (TTS) systems. The AM is composed of a prosody generator, a spectrum generator, and a speech synthesizer. The physical insight behind the AM is grounded in how humans learn to speak and in the pronunciation rules of running speech. The hidden Markov model (HMM), a recurrent neural network (RNN), and the PSOLA algorithm are then employed to build the AM. Consistency is especially emphasized and analyzed in this dissertation: it means that, for a given pronunciation in the same word, the warping curves of spectrum and prosody are highly correlated. The experimental results confirm this phenomenon. This finding indicates that a TTS system must consider the warping of spectrum and prosody within the syllable in detail. Based on the analytic results, five strategies are presented for implementing the AM in Mandarin TTS. First, the HMM algorithm is employed to perform phonetic segmentation of the speech corpus automatically and to organize the representative prosody and spectrum modules. Second, the prosody and spectrum modules are trained together to keep the AM consistent. Third, the synthesis-unit generation algorithm produces synthesis units with embedded prosody information, which makes the warping process available within the synthesis unit. Fourth, the waveform-based synthesis unit and the warping process make the synthesized speech clear, natural, and fluent. Fifth, a coarticulation module is studied to alleviate the coarticulation effect. Finally, the performance of the system is examined in terms of speech quality, memory requirement, and computational complexity. On a fixed-point DSP chip, the system requires less than 2.4 MB of memory and an average of 0.08 MIPS per syllable. The synthesized speech is also clear, natural, and fluent. These results confirm the high performance of the system and suggest that high-quality, low-cost speech applications can be built on this TTS solution.
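
The consistency notion described in the abstract can be illustrated with a small sketch. It is not the procedure used in the dissertation, which the abstract does not spell out. It assumes that the warping between two renditions of a syllable is estimated by dynamic time warping (DTW), and that consistency is measured as the Pearson correlation between the warping curve obtained from a spectral track and the one obtained from a pitch track. All function names and the toy contours are hypothetical.

```python
from math import sqrt


def dtw_path(a, b):
    """Return a minimum-cost DTW alignment path between 1-D sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def warping_curve(path, length):
    """For each frame of the first sequence, average the aligned frame indices."""
    sums = [0.0] * length
    counts = [0] * length
    for i, j in path:
        sums[i] += j
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


def pearson(x, y):
    """Pearson correlation of two equal-length curves (neither may be constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant curve")
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy, made-up contours for two renditions (A and B) of one syllable.
    spec_a, spec_b = [1.0, 1.0, 2.0, 3.0, 3.0], [1.0, 2.0, 3.0, 3.0]
    f0_a, f0_b = [200.0, 200.0, 180.0, 150.0, 150.0], [200.0, 180.0, 150.0, 150.0]
    w_spec = warping_curve(dtw_path(spec_a, spec_b), len(spec_a))
    w_f0 = warping_curve(dtw_path(f0_a, f0_b), len(f0_a))
    print("warping-curve correlation:", pearson(w_spec, w_f0))
```

Under these assumptions, a correlation near 1 would mean that the spectral and prosodic tracks stretch and compress in the same places inside the syllable, which is one way to read the consistency property that the abstract reports.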

Parallel Keywords

Text-to-Speech; Acoustic Module; Consistency; HMM
