Conventional speech recognition systems use Mel-frequency cepstral coefficient (MFCC) features to extract speech information from the acoustic signal, train statistical models on these features, and use the models for recognition. MFCC features, however, have inherent limitations; for example, they capture only the information within a short time span. In recent years, many studies have extracted longer-term information from speech, or variations in the temporal, spectral, and spectro-temporal domains, to obtain richer features and thereby improve recognition performance. In this thesis, we use Gabor filters to extract features that are rich in spectro-temporal information. A multilayer perceptron (MLP) is trained to learn how these Gabor features vary across phonemes, and its outputs are phoneme posterior probability vectors (Gabor posteriors). We use a Tandem system to integrate the Gabor posteriors with MFCC posteriors and find that this improves recognition accuracy. Furthermore, we use a clustered hierarchical MLP, which focuses on easily confused phonemes, to estimate more accurate posteriors, and this further improves recognition performance. Finally, we add pitch features to the feature set and account for tonal variation in the acoustic models by adopting tonal acoustic units. With these modifications, recognition accuracy improves significantly in Mandarin large-vocabulary broadcast news recognition experiments.
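To make the idea of a spectro-temporal Gabor feature concrete, the following is a minimal sketch rather than the filter bank used in this work. It assumes a real-valued 2D Gabor function (a Gaussian envelope multiplied by a cosine carrier) applied to a log-mel spectrogram stored as a list of frequency rows. The function names, the odd kernel sizes, the mean removal, and all parameter values are illustrative assumptions.

```python
import math


def gabor_kernel(n_freq, n_time, omega_f, omega_t, sigma_f, sigma_t):
    """Build a real 2D Gabor kernel indexed as kernel[freq][time].

    Assumption: n_freq and n_time are odd so the kernel has a center cell.
    omega_f / omega_t are the spectral / temporal modulation frequencies
    (radians per channel / per frame); sigma_f / sigma_t set the envelope width.
    """
    kernel = []
    for i in range(n_freq):
        f = i - n_freq // 2
        row = []
        for j in range(n_time):
            t = j - n_time // 2
            envelope = math.exp(-(f * f) / (2 * sigma_f ** 2)
                                - (t * t) / (2 * sigma_t ** 2))
            carrier = math.cos(omega_f * f + omega_t * t)
            row.append(envelope * carrier)
        kernel.append(row)
    # Remove the mean so a constant (flat) spectrogram patch gives zero response.
    mean = sum(sum(row) for row in kernel) / (n_freq * n_time)
    return [[v - mean for v in row] for row in kernel]


def gabor_feature(spectrogram, kernel, center_freq, center_time):
    """Inner product of the kernel with the patch centered at the given cell.

    Cells that fall outside the spectrogram are skipped (treated as zero).
    """
    n_freq, n_time = len(kernel), len(kernel[0])
    total = 0.0
    for i in range(n_freq):
        fi = center_freq + i - n_freq // 2
        if not 0 <= fi < len(spectrogram):
            continue
        for j in range(n_time):
            tj = center_time + j - n_time // 2
            if 0 <= tj < len(spectrogram[fi]):
                total += kernel[i][j] * spectrogram[fi][tj]
    return total


# Hypothetical usage: a purely temporal filter tuned to a period of 8 frames.
kernel = gabor_kernel(n_freq=5, n_time=9, omega_f=0.0, omega_t=math.pi / 4,
                      sigma_f=2.0, sigma_t=2.0)
```

Evaluating `gabor_feature` at every time frame for a bank of kernels with different modulation frequencies would yield one feature vector per frame, which could then serve as MLP input.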