A Study of Data-Fitting and Clustering Methods for Robust Speech Feature Extraction

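Speech has long been the most natural and easiest-to-use medium of human communication. Speech will undoubtedly also become the primary medium of interaction between people and all kinds of intelligent electronic devices, so automatic speech recognition (ASR) technology will play the most critical role in that interaction. Most current ASR systems achieve very good recognition results in ideal, clean laboratory environments where the speech signal is free of interference. When they are applied in real-world environments, however, recognition accuracy often drops substantially, because complex environmental factors create a mismatch between the training and test conditions. Speech robustness techniques are therefore especially important and have drawn considerable attention. In terms of what they operate on, current robustness methods can be discussed from roughly two perspectives: those that start from the speech feature values themselves and those that start from their statistical distributions. Each class has its own strengths and weaknesses. This thesis attempts to combine the strengths of both perspectives and uses data-fitting techniques to improve the performance of speech recognition systems. We first propose cluster-based polynomial-fit histogram equalization (CPHEQ), which obtains polynomial transformation functions by combining the concept of histogram equalization (HEQ) with stereo training speech data. We then extend this method under some assumptions and derive two further methods. The first, polynomial-fit histogram equalization (PHEQ), remedies the drawback that conventional histogram equalization requires considerable memory and processor time. The second, selective cluster-based polynomial-fit histogram equalization (SCPHEQ), works with missing feature theory to reconstruct speech feature parameters. The speech recognition experiments were conducted on the Aurora-2 corpus. The results show that, under clean-condition training, the proposed methods significantly reduce the word error rate relative to the baseline and also perform better than other conventional speech robustness methods.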

Keywords

Speech Recognition; Speech Robustness Techniques; Histogram Equalization; Data Fitting; Missing Feature Theory

Parallel Abstract

Speech is the primary and most convenient means of communication between individuals. Automatic speech recognition (ASR) is also expected to play a more active role and to serve as the major human-machine interface between people and different kinds of intelligent electronic devices in the near future. Most current state-of-the-art ASR systems achieve quite high recognition performance in controlled laboratory environments. However, when the systems are moved out of the laboratory and deployed in real-world applications, their performance often degrades dramatically, because varying environmental effects lead to a mismatch between the acoustic conditions of the training and test speech data. Robustness techniques have therefore received considerable attention in recent years. In general, these techniques fall into two categories, depending on whether they operate in the feature domain or on the corresponding probability distributions. Each category has its own strengths and limitations. In this thesis, several attempts were made to integrate these two sources of information and to improve current speech robustness methods by using a novel data-fitting scheme. First, cluster-based polynomial-fit histogram equalization (CPHEQ), based on histogram equalization (HEQ) and polynomial regression, was proposed to directly characterize the relationship between the speech feature vectors and their corresponding probability distributions by utilizing stereo speech training data. We then extended the idea of CPHEQ under some additional assumptions and derived two further methods, namely polynomial-fit histogram equalization (PHEQ) and selective cluster-based polynomial-fit histogram equalization (SCPHEQ).
PHEQ uses polynomial regression to efficiently approximate the inverse of the cumulative distribution functions of speech feature vectors for HEQ. It thereby avoids the high computation cost and large storage consumption of traditional HEQ methods. SCPHEQ is based on missing feature theory and uses polynomial regression to reconstruct unreliable feature components. All experiments were carried out on the Aurora-2 database and task. Experimental results showed that, for clean-condition training, our methods achieved a considerable word error rate reduction over the baseline system and also significantly outperformed the other robustness methods.
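
The idea behind PHEQ can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a single feature dimension, a rank-based estimate of the cumulative distribution function, an arbitrarily chosen polynomial order of 5, and synthetic Gaussian data in place of real speech features. The function names `empirical_cdf`, `fit_inverse_cdf`, and `equalize` are hypothetical.

```python
import numpy as np

def empirical_cdf(x):
    """Rank-based CDF estimate in (0, 1) for each sample of a 1-D array."""
    ranks = np.argsort(np.argsort(x))      # 0-based rank of every sample
    return (ranks + 0.5) / len(x)

def fit_inverse_cdf(train, order=5):
    """Least-squares polynomial mapping a CDF value to a feature value.

    The polynomial approximates the inverse CDF of the training data,
    so only order + 1 coefficients have to be stored (order is an assumption).
    """
    return np.polyfit(empirical_cdf(train), train, order)

def equalize(test, coeffs):
    """Map each test sample through its own CDF value and the polynomial."""
    return np.polyval(coeffs, empirical_cdf(test))

# Synthetic stand-ins: "clean" reference data and a shifted, scaled "noisy" utterance.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, 5000)
noisy = 0.5 * rng.normal(0.0, 1.0, 400) + 2.0

coeffs = fit_inverse_cdf(clean)
restored = equalize(noisy, coeffs)
print("noisy mean/std:   ", noisy.mean(), noisy.std())
print("restored mean/std:", restored.mean(), restored.std())
```

In this sketch the equalized values follow the reference distribution approximately, while the only stored model is a short coefficient vector rather than a table of histogram bins or quantiles.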

Parallel Keywords

Speech Recognition; Robustness; Histogram Equalization; Data-Fitting; Missing Feature Theory
