A Study on Initial/Final Duration Prediction and Energy Modeling for Corpus-based Mandarin Singing Voice Synthesis

Advisor: Jyh-Shing Roger Jang (張智星)

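Abstract


In this thesis, we propose an effective method for predicting initial/final durations and energy (volume) for corpus-based Mandarin singing voice synthesis, with the main goal of improving the clarity and naturalness of the synthesized singing. First, we introduce the architecture of the initial/final duration prediction module. We build a separate initial/final duration prediction module for each Mandarin consonant category. Each module takes linguistic and phonological features, together with music-score information, as input parameters, and uses support vector machine (SVM) regression to predict the duration of the initial. Next, for the energy model, we propose three adjustment methods. Method 1 assigns the same volume intensity to every syllable. Method 2 predicts the volume from the linguistic and phonological features and the music-score information using Classification and Regression Trees (CART). Method 3 defines rules that adjust the volume according to different combinations of pitch and duration distributions. Finally, we conduct experiments and listening tests and summarize our conclusions. The experimental results confirm that the proposed initial/final duration prediction and energy models improve the clarity and naturalness of the synthesized singing.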
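The abstract does not specify the SVM kernel, the exact feature set, or how the final's duration is obtained. The following is a minimal sketch under stated assumptions: a linear kernel, a single numeric feature (the note's score duration), hypothetical consonant-category labels such as `"stop"` and `"fricative"`, and the assumption that the final receives whatever remains of the note after the predicted initial. It fits a linear ε-insensitive support vector regression by plain subgradient descent rather than the quadratic-programming solver a real SVM library would use, and trains one model per consonant category.

```python
from collections import defaultdict


def train_linear_svr(X, y, epsilon=0.01, C=1.0, lr=0.01, epochs=2000):
    """Linear epsilon-insensitive SVR fitted by batch subgradient descent.

    Minimises 0.5*||w||^2 + C * sum(max(0, |w.x + b - y| - epsilon)),
    with the whole objective scaled by 1/n to set the step size.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # start from the gradient of the L2 term
        for xi, yi in zip(X, y):
            r = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            if abs(r) <= epsilon:
                continue  # inside the epsilon tube: no loss, no gradient
            s = C if r > 0 else -C
            for j in range(d):
                grad_w[j] += s * xi[j]
            grad_b += s
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_per_category(samples):
    """Train one SVR per consonant category.

    samples: iterable of (category, feature_vector, initial_duration_sec).
    Category names and features are hypothetical placeholders.
    """
    grouped = defaultdict(lambda: ([], []))
    for category, x, y in samples:
        grouped[category][0].append(x)
        grouped[category][1].append(y)
    return {cat: train_linear_svr(X, y) for cat, (X, y) in grouped.items()}


def split_note(models, category, features, note_duration):
    """Assumed rule: initial = clipped SVR prediction, final = the remainder."""
    initial = min(max(predict(models[category], features), 0.0), note_duration)
    return initial, note_duration - initial


if __name__ == "__main__":
    # Toy data with hypothetical categories; the only feature is note length (s).
    data = [("stop", [d], 0.05 + 0.2 * d) for d in (0.2, 0.4, 0.6, 0.8)]
    data += [("fricative", [d], 0.15 + 0.2 * d) for d in (0.2, 0.4, 0.6, 0.8)]
    models = train_per_category(data)
    print(split_note(models, "stop", [0.5], 0.5))
```

Keeping one model per category mirrors the per-consonant-class design described above: each model only has to learn how one class of initials stretches with the note, instead of mixing, for example, short stops with long fricatives.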

English Abstract


In this research, we propose effective methods for initial/final (I/F) duration prediction and energy modeling in corpus-based Mandarin singing voice synthesis (SVS), with the goal of improving the clarity and naturalness of the synthesized singing voice. First, we present the framework of the I/F duration prediction model. We construct a separate I/F duration prediction model for each category of consonants, use both linguistic/phonetic attributes and music-score information as input features, and train each model with support vector machine (SVM) regression. Second, we propose three methods for energy modeling. The first method assigns an identical volume to every syllable. The second method predicts energy with classification and regression trees (CART), using the same features as the I/F duration prediction. The third method is rule-based and modifies the energy according to different combinations of pitch and duration. Finally, we conduct several experiments and listening tests to evaluate the proposed methods. The experimental results indicate that our methods improve both the clarity and naturalness of the synthesized singing voices.
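The three energy-modeling methods can be illustrated with a short pure-Python sketch. It is not the thesis's implementation: the target RMS level, the CART stopping parameters, and in particular the pitch and duration thresholds and gains in method 3 are hypothetical values, because the abstract does not state the actual rules or features. Method 1 rescales each syllable to a common RMS level, method 2 is a small CART-style regression tree with squared-error splits, and method 3 returns a rule-based gain in dB.

```python
import math


def rms(seg):
    return math.sqrt(sum(s * s for s in seg) / len(seg))


def method1_identical_volume(syllables, target_rms=0.1):
    """Method 1: rescale every syllable's samples to the same RMS level."""
    out = []
    for seg in syllables:
        level = rms(seg)
        gain = target_rms / level if level > 0 else 0.0  # leave silence silent
        out.append([s * gain for s in seg])
    return out


def _sse(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)


def fit_cart(X, y, max_depth=3, min_leaf=2):
    """Method 2 sketch: CART regression tree with squared-error splits.

    Returns a leaf (mean energy) or a tuple (feature, threshold, left, right).
    """
    mean = sum(y) / len(y)
    if max_depth == 0 or len(y) < 2 * min_leaf:
        return mean
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [i for i, x in enumerate(X) if x[j] <= t]
            right = [i for i, x in enumerate(X) if x[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:
        return mean
    _, j, t, left, right = best

    def grow(idx):
        return fit_cart([X[i] for i in idx], [y[i] for i in idx],
                        max_depth - 1, min_leaf)

    return (j, t, grow(left), grow(right))


def predict_cart(node, x):
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node


def method3_rule_gain_db(midi_pitch, duration_sec, high_pitch=72,
                         low_pitch=57, long_note=1.0, short_note=0.25):
    """Method 3 sketch: hypothetical gain rules on pitch/duration combinations."""
    gain = 0.0
    if midi_pitch >= high_pitch:
        gain += 3.0  # hypothetical: boost high notes
    elif midi_pitch <= low_pitch:
        gain -= 2.0  # hypothetical: soften low notes
    if duration_sec >= long_note:
        gain += 1.0
    elif duration_sec <= short_note:
        gain -= 1.0
    return gain


def apply_gain_db(seg, gain_db):
    factor = 10 ** (gain_db / 20)  # amplitude factor for a dB gain
    return [s * factor for s in seg]
```

For example, `fit_cart([[60], [62], [70], [72]], [-20, -20, -10, -10], max_depth=1)` splits on pitch at 62 and predicts an energy of -20 for a note at pitch 61. A real system would feed the same linguistic, phonetic, and score features used for duration prediction into the tree.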
