
改善中文語音對話系統的若干相關技術

Relevant Technologies For Improved Chinese Spoken Dialog Systems

Advisor: 李琳山 (Lin-shan Lee)

Abstract


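Applying speech recognition technology to dialogue systems is a broad and deep field of research. Speech signal processing, acoustic modeling, robust speech feature extraction, speaker adaptation, language model training, language understanding, dialogue management, language generation, and speech synthesis must all be handled well before a system reaches a quality at which people are willing to converse with a machine. This dissertation covers three topics closely related to the development of Chinese dialogue systems: the first concerns speaker identification and speaker adaptation, the second concerns speech understanding, and the third concerns an interactive speech input system for Chinese names.

The speaker adaptation technique in the first topic adjusts the parameters of the acoustic model according to the user's voice characteristics in order to raise speech recognition accuracy. The Eigen-MLLR (eigen maximum likelihood linear regression) technique proposed by the author applies principal component analysis (PCA). It is more robust than maximum likelihood linear regression (MLLR) and, compared with the Eigenvoice technique, requires less storage, which makes it better suited to dialogue systems. The author compared these methods experimentally, developed a fast algorithm for estimating the coefficients, and applied the Eigen-MLLR coefficients to speaker identification.

The second topic concerns speech understanding. Speech understanding often relies on a two-stage combination of speech recognition and language understanding. Speech recognition errors, or insufficient grammar coverage in the language understanding stage, frequently lead to low understanding accuracy. The author devised an intermediate linguistic unit between the sentence and the word, called the Key Semantic Chunk, which improves the accuracy of two-stage speech understanding. The language model for speech recognition gains a more flexible training method that overcomes the quality degradation caused by insufficient corpus data. In addition, the language understanding component can flexibly process the segmented text output by the recognizer, sidestepping misrecognized portions and gaps in grammar coverage to achieve robust speech understanding. In the experimental system this method reduced speech understanding errors by about 30%. Segmenting by Key Semantic Chunks also increases resource reuse in the work needed to build a new system, which lowers the difficulty of construction.

The last topic develops practical dialogue strategies by building an interactive open-vocabulary Chinese name speech input system. The motivation came from the customer service of Chunghwa Telecom's 104 directory-assistance line. This is the largest telephone service in Taiwan, has the most users, and is used frequently. Its query targets, however, are simple: either the telephone number of a business or that of an individual. Name recognition is difficult because the vocabulary is enormous while less than two seconds of speech is available for recognition. Speech recognition alone cannot easily provide enough discriminative power, so usable dialogue strategies are needed that let the user supply more information to complete the input. Drawing on how the 104 service actually communicates with callers, the experimental system adopts novel character confirmation and character correction methods and achieves a high success rate of 86.7%.

Spoken dialogue system technology is broad and deep. This dissertation attempts improvements in several of its parts, hoping both to improve overall quality and to gain a broader understanding through research in different areas. All of the proposed improvements were tested on Chinese speech recognition and dialogue systems. However, apart from the third topic, Chinese name recognition, which is closely tied to the Chinese language, the other two techniques do not rely on language-specific properties and should be applicable to other languages. Finally, the author looks forward to speech recognition and dialogue system technologies being widely used in people's daily lives in the near future.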

Parallel Abstract


Applying automatic speech recognition in spoken dialogue systems draws on technologies from several areas: digital signal processing, robust speech feature extraction, acoustic-phonetic modeling, speaker adaptation, language modeling, language understanding, dialogue management, language generation, and speech synthesis. All of these contribute to the performance of a spoken dialogue system, which enables communication between humans and machines. This dissertation covers three technologies relevant to improved Chinese spoken dialogue systems: the first topic is speaker adaptation and speaker identification, the second is speech understanding, and the third is an interactive open-vocabulary Chinese name input system.

The speaker adaptation technology in the first topic adapts acoustic models to a speaker's voice characteristics to improve speech recognition accuracy. The proposed Eigen-MLLR approach uses Principal Component Analysis (PCA) to construct a subspace of the MLLR parameter space, so it is more robust than the MLLR approach when only a small amount of enrollment data is available. Compared with the Eigenvoice approach, it requires less storage for model-adaptation estimates, which makes it more practical for speaker-independent spoken dialogue systems. The author compared Eigen-MLLR with MLLR and Eigenvoice, developed a fast Eigen-MLLR coefficient estimation algorithm, and applied Eigen-MLLR coefficients to speaker identification.

The second topic is speech understanding. Most speech understanding systems with medium to large vocabularies take a two-stage approach: a speech recognition component as the first stage, followed by a natural language understanding component as the second. Understanding performance is usually limited by speech recognition errors and out-of-grammar problems, so robust speech understanding is necessary. The proposed approach integrates a concept layer, the Key Semantic Chunk, into the two-stage system. The Key Semantic Chunk is a language unit between the sentence and the word; it is integrated into both the speech recognition and language understanding components and serves as the interface between them. As a result, the language model used in speech recognition becomes more robust to data sparseness, and language understanding operates more robustly on the recognition output. The improved system achieved about a 30% reduction in understanding errors. In addition, the effort of building and maintaining language understanding grammars and speech recognition n-gram models is reduced.

The last topic is building an interactive open-vocabulary Chinese name input system and establishing an error correction mechanism. The motivation came from experience with the 104 directory-assistance service of Chunghwa Telecom. This service is the largest commercial telephony service in Taiwan; it has the largest group of consumers and is used frequently. Its task, however, is clear and simple: providing the telephone number of a person, a company, or a branch of a company. The difficulty of open-vocabulary Chinese name input is the huge vocabulary size: from a very short utterance of less than two seconds, the system must recognize the target name among a billion names. High recognition accuracy cannot realistically be expected from speech recognition alone. The experimental system therefore uses an intelligent and friendly dialogue strategy that incorporates the error correction mechanism to achieve a reasonably high success rate. In actual 104-service interactions, the human operator may ask the caller to describe the ambiguous characters again. Following this practice, both character confirmation and character input mechanisms were designed into the experimental system, which achieved a success rate of 86.7%.

The dissertation presents several technologies relevant to improved Chinese spoken dialogue systems, although the first two can also be applied to other languages. Through these different research topics, the author aims to gain a deeper understanding of spoken dialogue systems and to improve overall system performance. The author hopes to see speech recognition and dialogue system technologies applied widely and successfully in many applications.
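
The Eigen-MLLR idea summarized in the abstract (a PCA subspace built over MLLR parameters) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the dissertation: the function names, the toy 2x2 transforms (real MLLR mean transforms are n x (n+1) matrices), and the use of power iteration for PCA. In particular, the dissertation estimates the Eigen-MLLR coefficients with its own fast algorithm from adaptation data; the sketch substitutes a plain orthogonal projection onto the subspace as a stand-in.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of equally long vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, mean):
    """Covariance matrix (divided by N) of the centered vectors."""
    centered = [[x - m for x, m in zip(row, mean)] for row in rows]
    dim = len(mean)
    return [
        [sum(c[i] * c[j] for c in centered) / len(rows) for j in range(dim)]
        for i in range(dim)
    ]


def top_eigenvectors(cov, k, iters=200, seed=0):
    """Leading k eigenvectors of a symmetric PSD matrix, found by power
    iteration with deflation. Adequate for a small illustration only."""
    rng = random.Random(seed)
    dim = len(cov)
    cov = [row[:] for row in cov]  # work on a copy; deflation modifies it
    basis = []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to capture
                break
            v = [x / norm for x in w]
        eigval = sum(
            v[i] * cov[i][j] * v[j] for i in range(dim) for j in range(dim)
        )
        for i in range(dim):  # remove the component just found
            for j in range(dim):
                cov[i][j] -= eigval * v[i] * v[j]
        basis.append(v)
    return basis


def project_to_subspace(transform, mean, basis):
    """Stand-in for coefficient estimation: project a flattened transform
    onto the subspace and rebuild it from the mean plus weighted basis."""
    centered = [x - m for x, m in zip(transform, mean)]
    coeffs = [sum(c * e for c, e in zip(centered, vec)) for vec in basis]
    rebuilt = list(mean)
    for a, vec in zip(coeffs, basis):
        rebuilt = [r + a * e for r, e in zip(rebuilt, vec)]
    return coeffs, rebuilt


# Toy data: flattened 2x2 transforms for five hypothetical training speakers.
identity = [1.0, 0.0, 0.0, 1.0]
direction = [1.0, 2.0, 0.0, -1.0]
training = [
    [b + t * d for b, d in zip(identity, direction)]
    for t in (-0.2, -0.1, 0.0, 0.1, 0.2)
]
mean = mean_vector(training)
basis = top_eigenvectors(covariance(training, mean), k=1)

# A noisy transform estimated from very little data for a new speaker.
noisy = [1.15, 0.3, 0.3, 0.85]
coeffs, smoothed = project_to_subspace(noisy, mean, basis)
print(coeffs, smoothed)
```

The point of the sketch is the storage and robustness argument made in the abstract: a new speaker is described by a handful of coefficients over a shared basis rather than by a full transform, and components of a noisy estimate that fall outside the subspace learned from training speakers are discarded.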


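Cited By


李運寰 (2008). 以樹狀資料結構為基礎之語音對話系統 [A spoken dialogue system based on a tree data structure] [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2008.01877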
