
Enhancing Speech Recognition by Deep Unsupervised Learning

Advisor: Lin-shan Lee

Abstract


Speech recognition has advanced considerably in recent years, yet it still relies on large quantities of human-labeled corpora for model training. Since abundant speech data remains unusable simply because no one has labeled it, this thesis adopts unsupervised methods, using nine different sets of Automatically Discovered Acoustic Patterns to improve recognition performance, and incorporates more unlabeled data into training to reduce the total amount of labeled data required.

We first perform semi-supervised training with a two-phase deep neural network. The first-phase network is trained without supervision: a large unlabeled corpus and its automatically discovered acoustic patterns are used to extract bottleneck features. These bottleneck features are then concatenated with the acoustic feature vectors and fed, together with a smaller labeled corpus, into the second-phase supervised deep neural network. This improves recognition performance and maintains comparable results even when the labeled data is reduced.

Second, through multi-target deep neural network training, we take the triphone labels as the primary training target and the nine sets of automatically discovered acoustic patterns as secondary targets, so that the patterns assist the supervised training based on triphone labels. Experimental results show consistent improvement over purely supervised training on labeled data alone, demonstrating that automatically discovered acoustic patterns can also help when labeled data is available. Finally, this thesis combines the two approaches in a deep neural network with both bottleneck features and multi-target training, and finds that the two complement each other, yielding the best experimental results.
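The two-phase scheme described above can be sketched as follows. This is a minimal illustration assuming PyTorch; the layer sizes, the 40-dimensional bottleneck, the number of pattern classes, and the number of triphone states are all illustrative placeholders, not values taken from the thesis.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Phase 1: trained on automatically discovered acoustic pattern
    labels (obtained without human annotation); the narrow middle
    layer yields bottleneck features."""
    def __init__(self, feat_dim=39, bottleneck_dim=40, n_patterns=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),      # bottleneck layer
        )
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, n_patterns),          # pattern posteriors
        )

    def forward(self, x):
        return self.head(self.encoder(x))

    def bottleneck(self, x):
        # extract bottleneck features for use in phase 2
        with torch.no_grad():
            return self.encoder(x)

class Phase2DNN(nn.Module):
    """Phase 2: supervised network whose input is the acoustic feature
    vector concatenated with the phase-1 bottleneck features."""
    def __init__(self, feat_dim=39, bottleneck_dim=40, n_states=3000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, n_states),            # triphone-state scores
        )

    def forward(self, feats, bn_feats):
        return self.net(torch.cat([feats, bn_feats], dim=-1))

# one minibatch of 8 frames of 39-dimensional acoustic features
x = torch.randn(8, 39)
phase1 = BottleneckDNN()
bn = phase1.bottleneck(x)      # (8, 40) bottleneck features
phase2 = Phase2DNN()
logits = phase2(x, bn)         # (8, 3000) triphone-state scores
```

In this sketch, only phase 2 needs labeled frames; phase 1 can be trained on the full unlabeled corpus, which is what allows the labeled data to be reduced.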

Parallel Abstract


In recent years, speech recognition has advanced considerably. However, it still depends on a huge amount of labeled speech data, while plenty of unlabeled data goes unused. This thesis uses unsupervised learning to enhance speech recognition with nine different sets of Automatically Discovered Acoustic Patterns, attempting to improve results through bottleneck features, semi-supervised learning, and multi-target learning. First, we apply a two-phase deep neural network to semi-supervised learning. The first-phase DNN is trained without supervision, using a large unlabeled corpus and its Automatically Discovered Acoustic Patterns to extract bottleneck features. These bottleneck features are then concatenated with the acoustic feature vectors. In phase two, a smaller labeled corpus is used to train the supervised DNN, which improves speech recognition while sustaining similar results when less labeled data is available. In addition, we train a multi-target DNN that takes the triphone labels as the primary target and the nine different sets of Automatically Discovered Acoustic Patterns as secondary targets, to improve the supervised learning. The experimental results show that this method outperforms training that uses only the labeled corpus, proving that Automatically Discovered Acoustic Patterns can also help speech recognition when labeled data is available. Finally, the thesis combines the two methods above into a DNN with both bottleneck features and multi-target learning; we find that the two complement each other, achieving the best result.
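The multi-target training can be sketched as a shared network with one primary head and nine secondary heads. This is a minimal PyTorch sketch under stated assumptions: the layer sizes, the per-pattern-set class counts, and the secondary loss weight `alpha=0.1` are illustrative choices, not values reported in the thesis.

```python
import torch
import torch.nn as nn

class MultiTargetDNN(nn.Module):
    """Shared hidden layers feeding a primary head (triphone states)
    and nine secondary heads, one per set of automatically discovered
    acoustic patterns."""
    def __init__(self, feat_dim=39, n_states=3000, pattern_sizes=(50,) * 9):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        self.primary = nn.Linear(512, n_states)
        self.secondary = nn.ModuleList(
            [nn.Linear(512, k) for k in pattern_sizes]
        )

    def forward(self, x):
        h = self.shared(x)
        return self.primary(h), [head(h) for head in self.secondary]

def multi_target_loss(primary_logits, secondary_logits,
                      state_y, pattern_ys, alpha=0.1):
    """Primary cross-entropy plus a down-weighted sum of the nine
    secondary cross-entropies; alpha is an assumed weight."""
    ce = nn.CrossEntropyLoss()
    loss = ce(primary_logits, state_y)
    for logits, y in zip(secondary_logits, pattern_ys):
        loss = loss + alpha * ce(logits, y)
    return loss

# one minibatch of 8 labeled frames
x = torch.randn(8, 39)
model = MultiTargetDNN()
primary_out, secondary_outs = model(x)
state_y = torch.randint(0, 3000, (8,))                    # triphone labels
pattern_ys = [torch.randint(0, 50, (8,)) for _ in range(9)]  # pattern labels
loss = multi_target_loss(primary_out, secondary_outs, state_y, pattern_ys)
```

At test time only the primary head is used; the secondary heads act as auxiliary supervision that shapes the shared layers, which is why the pattern labels can help even when triphone labels are available.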

