  • Thesis/Dissertation

語言模型訓練與調適技術於中文大詞彙連續語音辨識之初步研究

An Initial Study on Language Model Estimation and Adaptation Techniques for Mandarin Large Vocabulary Continuous Speech Recognition

Advisor: 陳柏琳

Abstract


Over the past three decades, statistical language modeling has been an important research topic in a wide variety of applications related to natural language. Its purpose is to capture various kinds of information in natural language, such as contextual information and semantic information, and then to use this information to quantify, in probabilistic terms, how likely a given word sequence is. In speech recognition, for example, the role of the language model is to resolve acoustic confusion by picking the correct recognition result out of the candidate word sequences.

In recent years, speech recognition has found more and more applications in daily life, such as voice dictation and call routing systems. However, recognition performance is often severely affected by differences in vocabulary and semantics across recognition tasks, and this has given rise to research on language model adaptation. Language model adaptation exploits the lexical and semantic information inherent in the recognition task to compensate for the mismatch between the training corpus and the test corpus.

In this thesis, a topical mixture model (TMM), originally proposed for probabilistic information retrieval, is applied to dynamically exploit long-span topical information, and it yields good results when used for language model adaptation. In addition, this thesis also studies the maximum entropy (ME) principle in depth. ME is a method for combining different information sources: each information source gives rise to a set of constraints, and the combined language model is required to satisfy all of the information. The intersection of these constraint sets is the set of probability distributions that satisfy all of the information, and the distribution with the highest entropy within this set is the solution given by the method. Preliminary experimental results show that the language model obtained by combining unigram, bigram, and trigram information under the ME principle outperforms the language model trained with the conventional maximum likelihood criterion, in terms of both character error rate (CER) and perplexity, on Mandarin broadcast news transcription.
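To make the ME formulation above concrete, the following is a minimal sketch in generic notation, not taken verbatim from the thesis. Each information source contributes feature functions f_i(h, w), for instance indicator functions of particular unigram, bigram, and trigram events, and each feature imposes one constraint on the combined model:

\sum_{h,w} \tilde{P}(h)\, P(w \mid h)\, f_i(h,w) \;=\; \sum_{h,w} \tilde{P}(h,w)\, f_i(h,w), \qquad i = 1, \dots, n .

Among all distributions satisfying these constraints, the one with the highest entropy has the log-linear form

P_{\Lambda}(w \mid h) = \frac{1}{Z_{\Lambda}(h)} \exp\!\Big(\sum_{i=1}^{n} \lambda_i f_i(h,w)\Big),
\qquad
Z_{\Lambda}(h) = \sum_{w'} \exp\!\Big(\sum_{i=1}^{n} \lambda_i f_i(h,w')\Big),

where \tilde{P} denotes empirical distributions over the training data and the weights \lambda_i are typically estimated with an iterative procedure such as Generalized Iterative Scaling.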

Parallel Abstract


Statistical language modeling, which aims to capture the regularities in human natural language and quantify the acceptability of a given word sequence, has long been an important research issue in a wide variety of applications of natural language processing (NLP) over the past three decades. For example, in speech recognition, the principal role of the language model is to help resolve acoustic confusion and thus separate the correct hypothesis from the competing ones. In recent years, many applications of speech recognition technology have been developed, such as voice dictation and call routing systems. However, speech recognition performance is often seriously affected by the varying lexical and semantic characteristics of different application tasks. Thus, there is a persistent need for language model adaptation, which aims to exploit the specific lexical and semantic information inherent in the recognition domain so as to compensate for the mismatch between training and testing conditions. In this thesis, a topical mixture model (TMM) previously proposed for probabilistic information retrieval was investigated as a means of dynamically exploiting long-span latent topical information for language model adaptation. Moreover, we also studied the use of the Maximum Entropy (ME) principle for language modeling. ME is a principle for efficiently combining a variety of information sources. Under the ME criterion, each information source gives rise to a set of constraints that can be further imposed on the resultant language model. The intersection of these constraint sets is the set of language model probability distributions that satisfy all of the information sources, and the probability distribution with the highest entropy within this set is the solution given by the ME principle. Preliminary experimental results show that the ME-based language modeling approach achieves superior performance to the conventional Maximum Likelihood (ML) based approach, in terms of both character error rate and perplexity, on the Mandarin broadcast news transcription task.
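As a rough illustration of the adaptation scheme described above, the Python sketch below shows one common way a topical mixture component can be interpolated with a background n-gram model. The names topic_word_prob, topic_prior, background_prob, and the interpolation weight lam are hypothetical; the actual model and estimation procedure in the thesis may differ in their details.

import math

def topic_posterior(history_words, topic_word_prob, topic_prior):
    """Infer topic weights from the recognized history (a single E-step-like update)."""
    log_scores = []
    for k, prior in enumerate(topic_prior):
        log_score = math.log(prior)
        for w in history_words:
            log_score += math.log(topic_word_prob[k].get(w, 1e-12))
        log_scores.append(log_score)
    m = max(log_scores)
    scores = [math.exp(s - m) for s in log_scores]
    total = sum(scores)
    return [s / total for s in scores]

def tmm_prob(word, history_words, topic_word_prob, topic_prior):
    """P_TMM(w | history) = sum_k P(w | topic_k) * P(topic_k | history)."""
    weights = topic_posterior(history_words, topic_word_prob, topic_prior)
    return sum(w_k * topic_word_prob[k].get(word, 1e-12)
               for k, w_k in enumerate(weights))

def adapted_prob(word, history_words, background_prob,
                 topic_word_prob, topic_prior, lam=0.3):
    """Interpolate the dynamic topical component with the static background n-gram."""
    p_tmm = tmm_prob(word, history_words, topic_word_prob, topic_prior)
    return lam * p_tmm + (1.0 - lam) * background_prob(word, history_words)

In decoding, background_prob would typically be a smoothed trigram estimate, and the interpolation weight lam could be tuned on held-out data.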


Cited by


張文遠 (2012). An engineering and technical analysis of automated external defibrillators [Master's thesis, Chung Yuan Christian University]. Airiti Library. https://doi.org/10.6840/cycu201200605
邱炫盛 (2006). Exploiting topic and position-dependent language models for Mandarin continuous speech recognition [Master's thesis, National Taiwan Normal University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0021-0712200716132659
陳冠宇 (2010). Improvements on topic models for use in speech recognition [Master's thesis, National Taiwan Normal University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0021-1610201315213186
