
Towards Phoneme Recognition with Structured Learning

Advisor: 李琳山 (Lin-shan Lee)

Abstract


In present-day speech recognition, the hybrid hidden Markov model (HMM) in which a deep neural network (DNN) replaces the Gaussian mixture model (GMM) has far surpassed traditional recognition systems in accuracy and has become the mainstream. Even in this mainstream architecture, however, the speech signal is still cut into very short frames that are recognized separately, and the models at the different levels are optimized independently rather than considering the overall structure of the utterance at once. Structured learning, on the other hand, differs from training and recognizing objects one by one: it can take the structure of the entire input and output into account. Therefore, if we treat the sequence of acoustic feature vectors as the structured input and the phoneme sequence as the structured output, structured learning can exploit the information in the overall structure of the speech to find the best phoneme recognition result. In this thesis, besides implementing a phoneme recognition system based on the structured support vector machine (SVM), we propose two new models that fuse structured learning with deep learning, namely the structured deep neural network and the gradient structured deep neural network, and implement a phoneme recognition system for each. Experimental results on the TIMIT corpus show that, although the structured SVM is only a linear model, with suitable input it can reach a phoneme error rate (PER) of 22.7%. The structured deep neural network breaks through the limitation of linear models; using a non-linear deep neural network, it beats the best current mainstream models and reaches a PER of 17.8%. The gradient structured deep neural network, limited by time, has not yet achieved a good PER, but it offers a new direction and may also be a new way to solve general maximization problems.
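
For concreteness, the structured SVM referred to above follows the standard margin-rescaled formulation sketched below; this is a minimal sketch, and the exact joint feature map \(\Psi(x, y)\) the thesis builds from the acoustic feature sequence \(x\) and the phoneme sequence \(y\) is not specified in the abstract.

\[
\hat{y} \;=\; \arg\max_{y \in \mathcal{Y}} \; \mathbf{w}^{\top} \Psi(x, y)
\]
\[
\min_{\mathbf{w}} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} \;+\; C \sum_{i=1}^{N} \max_{y \in \mathcal{Y}} \Big[ \Delta(y_i, y) + \mathbf{w}^{\top} \Psi(x_i, y) - \mathbf{w}^{\top} \Psi(x_i, y_i) \Big]
\]

Here \(x_i\) is the acoustic feature vector sequence of the \(i\)-th training utterance, \(y_i\) its reference phoneme sequence, \(\Delta\) a loss between phoneme sequences (e.g., the phoneme error), and \(\mathbf{w}\) the weight vector whose linearity is the limitation mentioned above.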

Abstract (English)


Nowadays, replacing the Gaussian Mixture Model (GMM) with a Deep Neural Network (DNN) in the hybrid Hidden Markov Model (HMM) shows great improvement over traditional Automatic Speech Recognition (ASR), and this architecture has become the mainstream in ASR. However, in this architecture the waveform is still divided into separate frames, and each model is optimized individually without considering the structure of the whole utterance. On the other hand, structured learning is capable of taking the whole structured input and producing the structured output without treating the objects separately during training. Hence, we can take the acoustic feature sequence as the structured input and the phoneme sequence as the structured output; in this way, the ASR problem is transformed into a structured learning problem. In this thesis, we implement the structured Support Vector Machine (SVM) as a baseline and propose two novel structured learning models, the structured Deep Neural Network and the gradient structured Deep Neural Network, for phoneme recognition. On the TIMIT corpus, although the structured SVM is a linear model, with proper input it achieves a 22.7% Phoneme Error Rate (PER). The structured DNN is a non-linear model and reaches 17.8% PER, beating the state-of-the-art result. The gradient structured Deep Neural Network has not yet given good PER results, but it is a novel and interesting way to solve the maximization problem.
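
The abstract only names the two proposed models, so the following is a plausible reading stated as an assumption rather than the thesis's exact formulation: the structured DNN replaces the linear score \(\mathbf{w}^{\top}\Psi(x, y)\) with a network score \(F_{\theta}(x, y)\), and the gradient structured DNN attacks the maximization over \(y\) by gradient ascent on a continuous relaxation \(\tilde{y}\) of the phoneme sequence.

\[
\hat{y} \;=\; \arg\max_{y \in \mathcal{Y}} \; F_{\theta}(x, y)
\]
\[
\tilde{y}^{(t+1)} \;=\; \tilde{y}^{(t)} + \eta \, \nabla_{\tilde{y}} F_{\theta}\big(x, \tilde{y}^{(t)}\big)
\]

Under this reading, the remark that the gradient variant may offer a new way to solve the maximization problem refers to replacing exhaustive or dynamic-programming search over \(\mathcal{Y}\) with gradient-based search.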
