Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech

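Prosody models are used in many speech processing applications, such as speech synthesis and speech recognition. The conventional way to build a prosody model is to first annotate the speech signal with prosodic tags that capture the important prosodic information, and then to build the model from those tags. Conventionally, the tags are assigned by human labelers who inspect and listen to the speech signal. This approach has three drawbacks: (1) different labelers make different subjective judgments, so labels are inconsistent across labelers; (2) even a single labeler has difficulty staying consistent over a long labeling effort; and (3) it is time-consuming. These inconsistencies may in turn degrade the performance of the prosody model in speech processing applications. To address these drawbacks, this study designs an Unsupervised Joint Prosody Labeling and Modeling (UJPLM) algorithm for Mandarin, comprising four sub-models, that performs prosody modeling and prosody labeling of a corpus automatically and simultaneously, with the aim of producing more objective and consistent prosodic tags. Two types of prosodic tags are labeled: break tags and prosodic states. Break tags mark the boundaries of prosodic units, and the sequence of prosodic states represents the pitch variation of the upper-level prosodic units (prosodic word, prosodic phrase, and breath group/prosodic phrase group). The experimental corpus consists of Mandarin texts read by a professional female announcer; the texts are short articles selected from the Academia Sinica CKIP Group's Sinica Treebank (中文句結構樹資料庫). By analyzing the trained model parameters, we examine, for this speaker: (1) the relationship among syllable pitch contour variation, prosodic tags, and linguistic parameters; (2) the relationship among break tags, prosodic parameters, and linguistic parameters; and (3) the pitch variation of upper-level prosodic units as represented by prosodic states. An in-depth analysis of break tags and their associated words both explores the link between prosodic and linguistic parameters and verifies the labeling capability of the proposed method. In addition, a comparison with human break labels shows that the prosodic parameters associated with the break tags produced by the proposed method have more consistent statistical properties. Compared with the inconsistent statistics that result from human labeling, the proposed method describes the speaker's prosodic characteristics more faithfully (or more objectively).

Building on UJPLM, this study then proposes the Advanced UJPLM (A-UJPLM) algorithm, which adds a minor-break tag and models pitch, duration, and energy simultaneously. Experimental results show that this method describes the speaker's prosodic characteristics more richly. For major breaks and non-breaks, its break labels are largely consistent with those of UJPLM, while A-UJPLM labels more minor breaks, which brings its minor-break labeling closer to the human labeling. Finally, this study proposes a prosody generation method for speech synthesis based on A-UJPLM. Experimental results show that the generated prosodic parameters largely match those of the actual speech, which confirms that A-UJPLM performs well in both prosody labeling and prosody modeling.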
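The idea of labeling and modeling at the same time can be illustrated with a deliberately simplified sketch. The abstract does not give the form of the four UJPLM sub-models, so the code below is not the thesis algorithm: it assumes a single feature (pause duration at each syllable juncture), one Gaussian per break label, and a generic hard-EM-style alternation between fitting the models and relabeling the junctures. The function names, the three-label setup, and the toy data are all hypothetical.

```python
import math


def fit_gaussian(values):
    """Return (mean, variance) of a non-empty list, with a small variance floor."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-6)


def log_gaussian(x, mean, var):
    """Log density of a one-dimensional Gaussian at x."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def joint_label_and_model(pauses, n_labels=3, n_iters=20):
    """Alternate between re-estimating per-label models and relabeling junctures."""
    if len(pauses) < n_labels:
        raise ValueError("need at least one juncture per label")
    # Initial labels: rank junctures by pause duration and cut into equal-sized groups.
    order = sorted(range(len(pauses)), key=lambda i: pauses[i])
    labels = [0] * len(pauses)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_labels // len(pauses)

    models = [None] * n_labels
    for _ in range(n_iters):
        # Modeling step: fit one Gaussian per label to its current members.
        for k in range(n_labels):
            members = [p for p, lab in zip(pauses, labels) if lab == k]
            if members:  # an emptied label keeps its previous model
                models[k] = fit_gaussian(members)
        # Labeling step: give each juncture the label whose model scores it highest.
        new_labels = [
            max(range(n_labels), key=lambda k: log_gaussian(p, *models[k]))
            for p in pauses
        ]
        if new_labels == labels:  # converged: labels and models agree
            break
        labels = new_labels
    return labels, models


# Toy pause durations in milliseconds (invented for illustration only).
pauses = [2, 4, 5, 3, 6, 4, 3, 80, 90, 75, 85, 95, 400, 420, 380]
labels, models = joint_label_and_model(pauses)
for k, (mean, var) in enumerate(models):
    print(f"label {k}: mean={mean:.1f} ms, std={math.sqrt(var):.1f} ms")
```

Because the labels are derived from the fitted models rather than from listening judgments, the per-label feature statistics are consistent by construction, which is the property the abstract contrasts with human labeling.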

Keywords

prosody modeling; prosody labeling; prosody generation; speech synthesis

Parallel Abstract

An unsupervised joint prosody labeling and modeling (UJPLM) method for Mandarin speech is proposed. It is a new scheme for constructing statistical prosodic models and labeling prosodic tags consistently. Two types of prosodic tags are determined by four prosodic models designed to reflect the hierarchy of Mandarin prosody: the break type at each syllable juncture, which demarcates prosodic constituents, and the prosodic state of each syllable, which represents the pitch-level variation imposed by upper-level prosodic constituents. The method was evaluated on an unlabeled read-speech corpus uttered by an experienced female announcer. The texts of the corpus were selected from the Sinica Treebank Corpus. Experimental results showed that the estimated parameters of the four prosodic models describe the structures and patterns of Mandarin prosody. In addition, correspondences were found between the labeled break indices and their associated words. These correspondences reveal connections between prosodic and linguistic parameters and further verify the capability of the method. A quantitative comparison of labeling results showed that the prosodic feature distributions obtained with the proposed method were more consistent and more discriminative than those obtained from human labelers, an advantage for prosody modeling applications. Building on UJPLM, the advanced UJPLM (A-UJPLM) method was designed to jointly label seven prosodic tags and model syllable pitch contour, duration and energy level. Experimental results showed that A-UJPLM gave a richer description of the speaker's prosodic characteristics. The break labeling results showed that A-UJPLM inserted more minor breaks than UJPLM, yielding minor-break labels that were more consistent with the human labeling.
Lastly, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that most predicted values of syllable pitch mean, duration and energy level matched their original counterparts well. This also reconfirmed the effectiveness of the A-UJPLM method.
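The abstract does not say how the match between predicted and original prosodic parameters was measured. One common choice, shown here purely as an assumption, is the root-mean-square error (RMSE) computed separately for each parameter over all syllables. The parameter names, units, and values below are invented for illustration and are not results from the thesis.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


def compare_prosody(predicted, reference):
    """Per-parameter RMSE; each argument maps a parameter name to per-syllable values."""
    return {name: rmse(predicted[name], reference[name]) for name in reference}


# Invented per-syllable values, for illustration only.
reference = {
    "pitch_mean_hz": [210.0, 195.0, 180.0],
    "duration_ms": [210.0, 170.0, 220.0],
    "energy_db": [62.0, 60.0, 58.0],
}
predicted = {
    "pitch_mean_hz": [205.0, 200.0, 180.0],
    "duration_ms": [200.0, 180.0, 220.0],
    "energy_db": [61.0, 60.0, 59.0],
}
errors = compare_prosody(predicted, reference)
for name, value in errors.items():
    print(f"{name}: RMSE = {value:.2f}")
```

Reporting the error per parameter keeps pitch, duration, and energy on their own scales, so a large error in one is not masked by the others.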

Parallel Keywords

prosody modeling; prosody labeling; prosody generation; speech synthesis

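Cited by

唐若華 (2010). 基於詞性之斷詞方法以改善華語語音合成系統 [Master's thesis, National Tsing Hua University]. Airiti Library. https://doi.org/10.6843/NTHU.2010.00487

游俊龍 (2015). 中文自發性語音之聲學模式及韻律模式的改進 [Master's thesis, National Chiao Tung University]. Airiti Library. https://doi.org/10.6842/NCTU.2015.00714

蔡承燁 (2010). 中英夾雜語音之階層式韻律架構建立與語音合成之應用 [Master's thesis, National Chiao Tung University]. Airiti Library. https://doi.org/10.6842/NCTU.2010.00941

許誌宏 (2010). 中文自發性語音辨認系統 [Master's thesis, National Chiao Tung University]. Airiti Library. https://doi.org/10.6842/NCTU.2010.00890

周建邦 (2009). 中文大詞彙語音辨認之語言模型改進 [Master's thesis, National Chiao Tung University]. Airiti Library. https://doi.org/10.6842/NCTU.2009.01146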
