A Study on Robust and Discriminative Speech Feature Extraction Techniques for Large Vocabulary Continuous Speech Recognition

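Speech is one of the primary and most convenient means of human communication. With the successful development of small electronic devices such as mobile phones and personal digital assistants (PDAs), together with the spread of wireless communication and wireless networking, it is widely believed that speech will soon play a pivotal role and serve as the main human-machine interface between people and many kinds of smart devices. Research on automatic speech recognition (ASR) has therefore received growing attention. To make ASR usable in real and varied environments, many discriminative and robust feature extraction techniques have been proposed over the past two decades. Based on these observations, this thesis studies auditory-perception-based feature extraction and data-driven linear feature transformation for robust speech recognition. For auditory-perception-based feature extraction, we extensively compare the common Mel-frequency Cepstral Coefficients (MFCC) with the Perceptual Linear Prediction Coefficients (PLPC), and we compare various methods for deriving and combining time trajectory information. For data-driven linear feature transformation, we first attempt to verify that linear discriminant analysis (LDA) does outperform principal component analysis (PCA) in feature space transformation for speech recognition. We then study several improvements on LDA, such as heteroscedastic linear discriminant analysis (HLDA) and heteroscedastic discriminant analysis (HDA); unlike conventional LDA, these methods do not need to assume that every class distribution has the same variation. In addition, we propose optimizing the linear transformation matrices with the minimum classification error (MCE) and maximum mutual information (MMI) estimation criteria, respectively, and compare them with the conventional maximum likelihood (ML) criterion. Finally, we further combine the maximum likelihood linear transformation (MLLT) with other robustness techniques such as feature mean subtraction and feature normalization. All experiments in this thesis use the Mandarin broadcast news corpus (MATBN). The experiments cover Mandarin free syllable decoding and large vocabulary continuous speech recognition (LVCSR). Preliminary results show that the approaches proposed in this thesis yield a considerable improvement in speech recognition accuracy.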
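The contrast between PCA and LDA mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names `pca_projection` and `lda_projection` are hypothetical, the class labels would in practice come from some frame-level alignment that the abstract does not specify, and the within-class scatter matrix is assumed to be positive definite. PCA keeps the directions of largest total variance, while LDA keeps the directions that best separate the class means relative to the within-class scatter.

```python
import numpy as np

def pca_projection(X, k):
    """Return a (d, k) matrix whose columns are the top-k principal directions."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    _, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]            # keep the k largest

def lda_projection(X, y, k):
    """Return a (d, k) LDA matrix; at most (number of classes - 1) columns are useful."""
    d = X.shape[1]
    global_mean = X.mean(axis=0)
    Sw = np.zeros((d, d))                  # within-class scatter
    Sb = np.zeros((d, d))                  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - global_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Whiten Sw (assumed positive definite), then diagonalize Sb in that space.
    w_vals, w_vecs = np.linalg.eigh(Sw)
    whiten = w_vecs / np.sqrt(w_vals)      # equals w_vecs @ diag(1 / sqrt(w_vals))
    _, b_vecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    return whiten @ b_vecs[:, ::-1][:, :k]
```

Projected features are obtained as `X @ W` for either matrix `W`. When the classes differ along a low-variance dimension, PCA tends to discard that dimension while LDA retains it, which is the intuition behind the comparison described in the abstract.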

Keywords

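Data-driven Linear Feature Transformation; Principal Component Analysis; Linear Discriminant Analysis; Heteroscedastic Linear Discriminant Analysis; Heteroscedastic Discriminant Analysis; Maximum Likelihood Linear Transformation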

Parallel Abstract

Speech is the primary and most convenient means of communication between people. Owing to the successful development of ever smaller electronic devices and the popularity of wireless communication and networking, it is widely believed that speech will play a more active role and will serve as the major human-machine interface between people and different kinds of smart devices in the near future. Research on automatic speech recognition (ASR) is therefore receiving more and more emphasis. Within this research, the development of discriminative and robust feature extraction approaches that allow ASR to be deployed in real and diverse environments has gained continuous attention over the past two decades. With the above observation in mind, in this thesis we studied auditory-perception-based feature extraction and data-driven linear feature transformation for robust speech recognition. For auditory-perception-based feature extraction, we extensively compared the conventional Mel-frequency Cepstral Coefficients (MFCC) with the Perceptual Linear Prediction Coefficients (PLPC), and we compared various ways to derive and combine their corresponding time trajectory information. For data-driven linear feature transformation, we started by attempting to show that linear discriminant analysis (LDA) outperforms principal component analysis (PCA) in feature transformation for speech recognition. We then investigated several improved approaches, such as heteroscedastic linear discriminant analysis (HLDA) and heteroscedastic discriminant analysis (HDA), which remove the assumption, inherent in the derivation of LDA, that all clusters share the same variation. Moreover, we proposed using the minimum classification error (MCE) and maximum mutual information (MMI) criteria, respectively, to optimize the transformation matrices, and compared them with the maximum likelihood (ML) criterion. Finally, the maximum likelihood linear transformation (MLLT) and other robustness techniques, such as feature mean subtraction and/or variance normalization, were further applied. All experiments were carried out on the Mandarin broadcast news corpus (MATBN). The initial experimental results were very promising.
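As an illustration of the feature mean subtraction and variance normalization mentioned above, here is a minimal sketch. The abstract does not say how the statistics were computed, so this assumes they are estimated per utterance and per feature dimension; the function name `mean_variance_normalize` and the `eps` guard are hypothetical choices made for the example.

```python
import numpy as np

def mean_variance_normalize(features, normalize_variance=True, eps=1e-8):
    """Normalize a (frames, dims) feature matrix of one utterance.

    Assumption: statistics are computed over the frames of this utterance only.
    """
    features = np.asarray(features, dtype=float)
    # Mean subtraction: remove the per-dimension average over all frames.
    centered = features - features.mean(axis=0, keepdims=True)
    if not normalize_variance:
        return centered
    # Variance normalization: scale each dimension to unit standard deviation.
    return centered / (features.std(axis=0, keepdims=True) + eps)
```

Calling `mean_variance_normalize(mfcc, normalize_variance=False)` applies mean subtraction only, where `mfcc` stands for any per-utterance feature matrix. Subtracting the mean removes a constant offset in each feature dimension, which is why the technique is commonly used to reduce stationary channel effects in cepstral features.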

Parallel Keywords

Data-driven Linear Feature Transformation; Principal Component Analysis; Linear Discriminant Analysis; Heteroscedastic Linear Discriminant Analysis; Heteroscedastic Discriminant Analysis; Maximum Likelihood Linear Transformation

