
Multi-Speaker Mandarin Speech Prosody Modeling and Its Application to Speaker Prosody Conversion

Advisor: Sin-Horng Chen (陳信宏)

Abstract


This thesis proposes a speaker prosody conversion method based on speaker prosody models. The system comprises two phases: speaker prosody model training and speech prosody conversion. The training phase is further divided into speaker-independent (SI) prosodic model training and speaker-dependent (SD) prosodic model adaptation: the PLM algorithm is first used to train an SI prosodic model and to label the training corpus with prosodic states and break types; the maximum a posteriori (MAP) adaptation rule then adapts the SI prosodic model into an SD prosodic model for each speaker, and the two kinds of models are re-estimated iteratively until convergence. The conversion phase consists of source-speaker prosody analysis and target-speaker prosody synthesis: the source speaker's SD prosodic model is used to analyze the prosodic information of the input speech and produce prosodic tags, and the target speaker's SD prosodic model then synthesizes the prosodic parameters of the output speech, including syllable pitch contour, syllable duration, syllable energy, and inter-syllable pause duration. Experiments used a self-recorded, partially parallel corpus of read speech from nine male and six female speakers. Results show that the proposed method converts slightly better than the conventional Gaussian (Z-score) normalization method, and that it produces a compensating effect where the prosodic-state affecting values of the source and target speakers differ sharply.
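
The MAP adaptation step described above has a simple closed form when each prosodic state is modeled by a Gaussian. The following is a minimal sketch of that mean update, assuming single-Gaussian state models and a fixed prior weight; the names map_adapt_means and tau are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def map_adapt_means(si_means, speaker_obs, state_ids, tau=10.0):
    """MAP-adapt speaker-independent (SI) state means toward one speaker's data.

    si_means    : (K, D) SI prosodic-state means (e.g., pitch-contour features)
    speaker_obs : (N, D) observations from one speaker
    state_ids   : (N,)   prosodic-state label of each observation
    tau         : prior weight; larger values trust the SI model more
    """
    sd_means = si_means.copy()
    for k in range(si_means.shape[0]):
        obs_k = speaker_obs[state_ids == k]
        if len(obs_k) == 0:
            continue  # no data for this state: keep the SI mean as a back-off
        # Standard MAP mean update: prior mean and data, weighted by tau and count
        sd_means[k] = (tau * si_means[k] + obs_k.sum(axis=0)) / (tau + len(obs_k))
    return sd_means
```

Re-labeling the corpus with the adapted models and re-running this update would give the iterative training loop described above.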

English Abstract


In this thesis, a speech prosody conversion method based on speaker prosody modeling is proposed. The method comprises a prosody modeling phase and a prosody conversion phase. In the prosody modeling phase, the previously proposed PLM algorithm is first employed to train a speaker-independent (SI) prosodic model from a multi-speaker training dataset and to label all training utterances with prosodic states for all syllables as well as break types for all syllable junctures. Then, the maximum a posteriori (MAP) method is applied to adapt the SI prosodic model into a speaker-dependent (SD) prosodic model for each speaker. In the prosody conversion phase, the SD prosodic model of the source speaker is first used to analyze the input speech and generate prosodic tags. Then, the prosody of the output speech is generated from these prosodic tags by the SD prosodic model of the target speaker. The prosodic information generated includes syllable pitch contour, syllable duration, syllable energy level, and syllable-juncture pause duration. A corpus containing read speech of six female and nine male speakers was used to examine the validity of the proposed method. Experimental results confirmed that the proposed method performed slightly better than the conventional Z-score normalization method.
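
To make the two-step conversion concrete, here is a minimal sketch under the same assumptions as the earlier training sketch: analysis assigns each syllable to its nearest source-speaker prosodic state (a simplified stand-in for the thesis's probabilistic decoding), and synthesis emits the target speaker's parameters for those states. All names and the toy data are hypothetical.

```python
import numpy as np

def assign_tags(obs, source_means):
    """Source-side analysis: label each syllable with its nearest prosodic state."""
    dists = np.linalg.norm(obs[:, None, :] - source_means[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def synthesize_prosody(tags, target_means):
    """Target-side synthesis: emit the target speaker's parameters for each tag."""
    return target_means[tags]

# Toy usage: 4 prosodic states x 3 features (e.g., pitch level, duration, energy).
rng = np.random.default_rng(0)
src_means = rng.normal(size=(4, 3))                         # source SD model
tgt_means = src_means + rng.normal(scale=0.5, size=(4, 3))  # target SD model
utterance = src_means[[0, 2, 1, 3]] + rng.normal(scale=0.1, size=(4, 3))
tags = assign_tags(utterance, src_means)                    # prosody analysis
converted = synthesize_prosody(tags, tgt_means)             # prosody synthesis
```

Both models here would come from the MAP adaptation sketched earlier; the thesis additionally synthesizes syllable-juncture pause durations from the break-type tags.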

