基於支援語者聲學模型之中文語音合成系統

HMM-Based Chinese Text-To-Speech System with Support Speakers

Advisor: Lin-shan Lee (李琳山)

Abstract


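In today's era of advanced information technology, people can enjoy the results of speech technology. Within speech technology, speech synthesis has received wide attention in recent years. Speech synthesis falls broadly into two categories: unit-selection synthesis and statistical parametric synthesis. The former searches a corpus for speech segments similar to the target utterance and concatenates them to synthesize speech. The latter derives the characteristics of each sound from the corpus to form a statistical model, and at synthesis time computes speech parameter values from that model.

The speech synthesis system proposed in this thesis belongs to statistical parametric synthesis. Its acoustic models are based on hidden Markov models (HMMs). Spectral features, frequency features, and context-dependent information are extracted from the corpus to train the acoustic models. After training, the sentence to be synthesized is analyzed, and the corresponding models are used to generate the speech parameters from which the utterance is synthesized.

Acoustic model training ordinarily requires a fairly large amount of training data to produce a high-quality model. Because recording and obtaining that much data is difficult, an average acoustic model combined with speaker adaptation is typically used so that users can train an acoustic model without a large corpus. However, the average acoustic model cannot yield a model that resembles the target speaker, so the target-speaker model obtained through speaker adaptation also performs poorly. The method proposed in this thesis finds speakers whose acoustic features are close to those of the target speaker to serve as support speakers, builds a support speaker acoustic model from their data, and then obtains the target speaker's acoustic model through speaker adaptation.

The thesis carried out subjective experiments based on human perception and objective experiments based on signal error. Both kinds of experiment show that the support speaker acoustic model performs better than the average acoustic model and synthesizes speech of better quality.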

Parallel Abstract


Speech technology now plays a useful role in everyday life, and speech synthesis has recently attracted considerable attention within it. Two synthesis techniques are commonly used: unit selection and HMM-based synthesis. In unit selection, the speech in the corpus is divided into small segments, which are concatenated to generate the synthesized voice. In the HMM-based technique, acoustic models are estimated from acoustic features, and the synthesized voice is generated from those models. In this thesis, I used the HMM-based technique to implement a Chinese text-to-speech (TTS) system. The system extracts spectral features, frequency features, and context-dependent labels to train the models. After the training stage, it analyzes the input text and uses the corresponding models to generate the voice. Training a high-quality acoustic model requires a large amount of training data. Because enough data is difficult to obtain, the conventional approach uses an average acoustic model with speaker adaptation to make training with less data possible. However, an average acoustic model is rarely close to the model of the target speaker, so speaker adaptation performs poorly. In this thesis, I proposed several methods for finding acoustically similar speakers to serve as support speakers for the target speaker, and used their training data to train support speaker models, which are then adapted to the target speaker. I conducted objective and subjective experiments. The experiments showed that the support speaker model technique outperforms the average acoustic model technique and yields better synthesis quality.
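
The abstract does not state how acoustic similarity between speakers is measured, so the following is only a minimal sketch of the support speaker selection idea under stated assumptions. It assumes each speaker is represented by a list of fixed-length feature frames and that similarity is the Euclidean distance between per-speaker mean feature vectors. The function names `mean_vector` and `select_support_speakers`, the parameter `k`, and the distance measure are all hypothetical and are not taken from the thesis.

```python
import math


def mean_vector(frames):
    """Average a list of equal-length feature frames into one vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def select_support_speakers(target_frames, candidate_frames, k=2):
    """Return the names of the k candidates closest to the target speaker.

    Assumption (not from the thesis): closeness is the Euclidean distance
    between the mean feature vector of each candidate and of the target.
    """
    target_mean = mean_vector(target_frames)
    distances = {
        name: math.dist(mean_vector(frames), target_mean)
        for name, frames in candidate_frames.items()
    }
    # Smallest distance first; the top k become the support speakers.
    return sorted(distances, key=distances.get)[:k]


if __name__ == "__main__":
    # Toy two-dimensional features, made up purely for illustration.
    target = [[1.0, 2.0], [1.2, 2.2]]
    candidates = {
        "spk_a": [[1.0, 2.0], [1.0, 2.4]],
        "spk_b": [[5.0, 5.0], [5.0, 5.0]],
        "spk_c": [[1.5, 2.0], [1.5, 2.0]],
    }
    print(select_support_speakers(target, candidates, k=2))
```

In a pipeline of the kind the abstract describes, the data of the selected speakers would then train the support speaker model, which speaker adaptation moves toward the target speaker. Those later steps are outside this sketch.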

