

Voice Conversion System Integrated with Emotional Speech Synthesis

Advisors: 黃志方, 成維華

Abstract




Voice conversion (VC) is the process of transforming an utterance of a source speaker so that it sounds as if it were spoken by a specified target speaker. In this thesis, voice conversion involves a training phase and a transformation phase. In the training phase, features characterizing speaker identity are extracted separately from the source and target speakers, and the relationships between the two speakers' features are captured by a Gaussian mixture model (GMM) and an artificial neural network (ANN), each trained independently. In the transformation phase, the same features are extracted from the input speech and transformed by the ANN-based and GMM-based conversion functions respectively. Spectral, excitation, and prosodic features are used in this thesis to represent speaker identity. To enhance the expressiveness of the VC system, an emotional speech synthesis module is also integrated into the transformation phase. In this module, a linear modification model (LMM) is adopted to modify the prosodic parameters, and the modified parameters are used to synthesize emotional utterances. Evaluation results show that the ANN-based VC system outperforms the GMM-based system, and that most of the synthesized emotional utterances can be correctly identified. However, the synthesized utterances are still less natural than real speech; improving their naturalness is left for future work.
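As a rough illustration of the GMM-based conversion function described above: the abstract does not give implementation details, so the sketch below assumes the standard joint-density formulation, in which a single GMM is fitted to stacked source-target feature vectors and each input frame is converted by a posterior-weighted sum of per-component linear regressions. The function names, the use of scikit-learn, and the choice of eight mixture components are illustrative assumptions, not details from the thesis.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_feats, tgt_feats, n_components=8, seed=0):
    """Fit one GMM on stacked [source, target] feature vectors.

    src_feats, tgt_feats: (n_frames, dim) arrays of time-aligned
    features (e.g. mel-cepstra aligned with dynamic time warping).
    """
    joint = np.hstack([src_feats, tgt_feats])          # (n_frames, 2*dim)
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(joint)

def convert_frame(gmm, x, dim):
    """Minimum-mean-square-error conversion of one source frame x."""
    # Split each mixture component's mean/covariance into blocks.
    mu_x = gmm.means_[:, :dim]                         # (M, dim)
    mu_y = gmm.means_[:, dim:]
    cov_xx = gmm.covariances_[:, :dim, :dim]           # (M, dim, dim)
    cov_yx = gmm.covariances_[:, dim:, :dim]

    # Posterior probability of each component given the source frame.
    lik = np.array([multivariate_normal.pdf(x, mu_x[m], cov_xx[m])
                    for m in range(gmm.n_components)])
    post = gmm.weights_ * lik
    post /= post.sum()

    # Posterior-weighted sum of per-component linear regressions.
    y = np.zeros(dim)
    for m in range(gmm.n_components):
        y += post[m] * (mu_y[m]
                        + cov_yx[m] @ np.linalg.solve(cov_xx[m], x - mu_x[m]))
    return y
```

In a full system, convert_frame would be applied to every analysis frame and the result resynthesized with a vocoder; the ANN-based variant mentioned in the abstract would instead train a neural network to map source features to target features directly.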
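The linear modification model is likewise only named in the abstract. A common reading, assumed here, is that each target emotion applies a linear scale-and-shift to the F0 contour and a linear stretch to segment durations; the per-emotion coefficients below are made-up placeholders for illustration, not values estimated in the thesis.

```python
import numpy as np

# Hypothetical per-emotion linear-modification coefficients.  Real LMM
# parameters would be estimated from an emotional speech corpus.
EMOTION_PARAMS = {
    "happy": {"f0_scale": 1.15, "f0_shift":  20.0, "dur_scale": 0.90},
    "sad":   {"f0_scale": 0.90, "f0_shift": -15.0, "dur_scale": 1.20},
    "angry": {"f0_scale": 1.25, "f0_shift":  10.0, "dur_scale": 0.85},
}

def modify_prosody(f0_hz, emotion):
    """Linearly modify an F0 contour for the requested emotion.

    f0_hz: 1-D array of per-frame F0 values in Hz, with 0.0 marking
    unvoiced frames.  Returns the modified contour plus the duration
    scaling factor to hand to the synthesizer.
    """
    p = EMOTION_PARAMS[emotion]
    out = f0_hz.astype(float)
    voiced = out > 0
    out[voiced] = p["f0_scale"] * out[voiced] + p["f0_shift"]
    return out, p["dur_scale"]
```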

