  • Dissertation

中文自發性語音辨識中偵測修正性不流暢現象之新方法

New Approaches for Detecting Edit Disfluencies in Transcribing Spontaneous Mandarin Speech

Advisor: Lin-Shan Lee (李琳山)

Abstract


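An ideal speech recognition system must be able to handle naturally occurring conversational speech, that is, spontaneous speech. Compared with carefully read or prepared speech, spontaneous speech has characteristics that make it harder for a system to process. One important characteristic is the frequent and pervasive occurrence of edit disfluencies. To interpret what the speaker intends to convey correctly and without distortion, the system must be able to detect these edit disfluencies and handle them properly.

In this dissertation, we propose a framework for handling edit disfluencies in spontaneous speech. It locates the interruption points (IPs) of disfluencies in the speech and compares the words spoken before and after them to recover the structure of the utterance. It then removes the edit words (the reparandum and optional editing terms), i.e., words that are redundant or that the speaker misspoke and wants to correct, so that the meaning is easier to understand. Within this framework, we propose an effective set of features and models for detecting the IPs of edit disfluencies, and we use the detection results to improve the accuracy and readability of the recognition output. The features are carefully designed to take into account the linguistic characteristics specific to Mandarin speech. The IP detection model is derived from two well-known machine learning methods: decision trees (DTs) and maximum entropy models (MEs). By combining the strengths of both, we obtain a model better suited to IP detection: the decision-tree-based maximum entropy model (DT-ME). In addition, we propose a statistical method for analyzing the prosodic structure of speech: latent prosodic modeling (LPM). By analyzing the speaker's prosody in normal, fluent speech and comparing it with the prosody observed when the flow of speech is interrupted, we further improve the DT-ME model and obtain a more accurate detector. Finally, using conditional random fields (CRFs), we analyze the relationships between the words before and after each IP, identify and remove the edit words, and thereby recover the structure of the utterance and capture its meaning correctly.

Experimental results on Mandarin conversational speech show that the proposed framework effectively detects and handles edit disfluencies in spoken Mandarin and significantly reduces the detection error rate. A better grasp of utterance structure also yields better recognition results (higher recognition accuracy). We further examine the prosody uncovered by the proposed latent prosodic modeling. We also analyze how the features that are effective for detecting each type of edit disfluency differ, to better understand how these disfluencies differ in their characteristics.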
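The cleanup step described above (find the interruption points, then delete the edit words) can be illustrated with a minimal sketch. It assumes that some sequence labeller, such as a CRF, has already tagged every token; the label names `R` (reparandum), `T` (editing term), and `O` (everything else), the function names, and the English toy utterance are all hypothetical choices made for readability and are not taken from the dissertation, which works on Mandarin.

```python
from typing import List

# Hypothetical label set (an assumption, not the dissertation's):
#   "R" = reparandum token, "T" = editing-term token, "O" = any other token.
EDIT_LABELS = {"R", "T"}


def remove_edit_words(tokens: List[str], labels: List[str]) -> List[str]:
    """Drop every token labelled as an edit word (reparandum or editing term)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab not in EDIT_LABELS]


def interruption_points(labels: List[str]) -> List[int]:
    """Return each index i such that an IP falls right after token i.

    Under the usual reparandum / IP / editing term / repair layout, the IP
    sits at the end of the reparandum, so we report the last token of every
    run of "R" labels.
    """
    return [
        i
        for i, lab in enumerate(labels)
        if lab == "R" and (i + 1 == len(labels) or labels[i + 1] != "R")
    ]


# Toy utterance: "a flight to Boston, uh I mean, to Denver"
tokens = ["a", "flight", "to", "Boston", "uh", "I", "mean", "to", "Denver"]
labels = ["O", "O", "R", "R", "T", "T", "T", "O", "O"]

cleaned = remove_edit_words(tokens, labels)
ips = interruption_points(labels)
```

Here the IP falls after "Boston", and the cleaned utterance keeps only the fluent words and the repair. How the labels are actually predicted is the hard part and is what the features and models in the abstract address.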

Parallel Abstract


Detection of edit disfluencies is one of the keys to transcribing spontaneous utterances. In this dissertation, we present improved features and models to detect edit disfluencies and enhance transcription of spontaneous Mandarin speech using hypothesized disfluency interruption points (IPs) and edit word detection. A comprehensive set of prosodic features that takes into account the special characteristics of edit disfluencies in Mandarin is developed, and an improved model combining decision trees and maximum entropy is proposed to detect IPs. This model is further adapted to desired prosodic conditions by latent prosodic modeling, a probabilistic framework for analyzing speech prosody in terms of a set of latent prosodic states. These techniques contribute to higher recognition accuracy (by rescoring with the hypothesized IPs) and better edit word detection (using conditional random fields defined on Chinese characters) in the final transcription, as verified by experiments on a spontaneous Mandarin speech corpus. Detailed analysis on the output latent states of the proposed latent prosodic modeling is conducted. Further analysis on the relevance of the proposed prosodic features to each type of edit disfluency is also conducted for further insight into the characteristics of various disfluency categories.
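For readers unfamiliar with the two model families named in the abstract, their standard textbook forms are given below. These are the generic definitions, not the dissertation's DT-ME formulation, which the abstract does not spell out; the symbols $f_i$, $\lambda_i$, $g_k$, and $\mu_k$ are generic feature functions and weights assumed here for illustration.

```latex
% Conditional maximum entropy model for a binary IP decision at one word boundary,
% where x is the observed feature context and y is the boundary label.
P(y \mid x) =
  \frac{\exp\left(\sum_{i} \lambda_i f_i(x, y)\right)}
       {\sum_{y'} \exp\left(\sum_{i} \lambda_i f_i(x, y')\right)},
\qquad y \in \{\text{IP}, \text{non-IP}\}

% Linear-chain conditional random field over a label sequence y = (y_1, ..., y_T),
% e.g. one edit / non-edit label per Chinese character.
P(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\left(\sum_{t=1}^{T} \sum_{k} \mu_k \, g_k(y_{t-1}, y_t, \mathbf{x}, t)\right),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left(\sum_{t=1}^{T} \sum_{k} \mu_k \, g_k(y'_{t-1}, y'_t, \mathbf{x}, t)\right)
```

The maximum entropy model scores each boundary independently, whereas the CRF normalizes over whole label sequences, which is why it suits the edit word detection task, where neighbouring labels depend on one another.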


Cited By


吳孟謙 (2015). 以韻律訊息輔助中文自發性語音辨認之改進 [Master's thesis, National Chiao Tung University]. Airiti Library. https://doi.org/10.6842/NCTU.2015.00004
周建宇 (2009). 基於機器學習之中文語句分段 [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2009.00568
