
Crosslingual Acoustic Modeling in Speech Recognition Using Deep Learning

Advisor: Lin-Shan Lee (李琳山)

Abstract


With the rise of big data, speech recognition technologies have matured considerably, and people look forward to what the era of voice can bring. These technologies now travel freely: they are no longer resources exclusive to advanced countries, but technology that speakers of different languages in every region of the world can enjoy. Although human speech in each language forms its own system, all languages share one common point: each is a signal medium through which humans understand one another, carrying emotion, concepts, information, and the meaning of sound.

This thesis investigates how speech data from different languages can assist one another in learning, extending the conventional monolingual speech recognition system into a multilingual one and uncovering the crosslingual knowledge hidden within, in the hope of strengthening the recognition system for each individual language. Using the GlobalPhone multilingual phone corpus, we begin with pure linguistic knowledge, then add data-driven methods, and finally merge the hidden layers of deep neural networks, exploring step by step, from coarse to fine, how the crosslingual knowledge shared across acoustic models can be combined.

Once a multilingual recognition system is available, the deep learning model becomes larger and the training procedure more complex. To accommodate this wealth of information while remaining convenient for real-time use, this thesis also investigates knowledge distillation, condensing the large model of the multilingual recognition system into a smaller one and successfully extracting richer crosslingual generalization information, helping the multilingual speech recognition system become more accurate.
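The hidden-layer merging described above is commonly realized as a network whose hidden stack is shared across languages while each language keeps its own output layer. The following is a minimal numpy sketch of that idea; all dimensions, layer counts, language names, and senone counts here are illustrative assumptions, not the configuration actually used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

class SharedMultilingualDNN:
    """Hidden layers shared across languages; one softmax output layer
    per language (illustrative sketch, not the thesis's actual model)."""

    def __init__(self, feat_dim, hidden_dim, n_states_per_lang):
        # Shared hidden stack, conceptually trained on pooled multilingual data.
        self.shared = [
            (rng.normal(0, 0.1, (feat_dim, hidden_dim)), np.zeros(hidden_dim)),
            (rng.normal(0, 0.1, (hidden_dim, hidden_dim)), np.zeros(hidden_dim)),
        ]
        # Language-specific output layers mapping to each language's state set.
        self.heads = {
            lang: (rng.normal(0, 0.1, (hidden_dim, n)), np.zeros(n))
            for lang, n in n_states_per_lang.items()
        }

    def posteriors(self, feats, lang):
        # Forward pass: shared layers first, then the chosen language's head.
        h = feats
        for W, b in self.shared:
            h = relu(h @ W + b)
        W, b = self.heads[lang]
        logits = h @ W + b
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

# Hypothetical setup: 40-dim features, two languages with different state counts.
net = SharedMultilingualDNN(feat_dim=40, hidden_dim=64,
                            n_states_per_lang={"de": 3000, "zh": 2500})
p = net.posteriors(rng.normal(size=(5, 40)), "de")
```

Because the shared layers see data from every language, they can learn crosslingual feature transformations, while each head still models its own language's acoustic states.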

Parallel Abstract


Speech signal processing technologies have matured alongside the Big Data era, and the appeal of voice-driven applications draws broad attention. These resources are no longer held by only a few powerful companies, but are shared by speakers of different languages in regions all over the world. The many varieties of human speech each have their own unique properties, yet they all share one trait: people rely on speech to understand each other. This thesis focuses on combining speech data from different languages to enhance the conventional monolingual speech recognition system, so that latent crosslingual information can be found and utilized. Using the GlobalPhone corpus, we study linguistic knowledge, data-driven methods, and model-sharing techniques; the research proceeds step by step from coarse phonetic-level merging to fine-grained model-level sharing, achieving better results through crosslingual information. Once multilingual speech recognition systems are built, the models become deep and cumbersome, and the training procedure requires more complex and time-consuming techniques. To pack the generalization ability of these huge models into a small model suitable for real-time, in-hand use, one can apply knowledge distillation to extract this information, thereby achieving model compression.
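The knowledge-distillation step mentioned above is usually implemented as training the small model against the large model's temperature-softened output posteriors, blended with the usual hard-label loss. The sketch below shows that objective in plain numpy; the temperature, weighting, and example logits are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing the teacher's "dark knowledge" about wrong classes.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      T=2.0, alpha=0.5):
    # Soft loss: cross-entropy against the teacher's tempered posteriors,
    # scaled by T^2 to keep gradient magnitudes comparable across T.
    p_teacher = softmax(teacher_logits, T)
    log_p_student_T = np.log(softmax(student_logits, T))
    soft_loss = -(p_teacher * log_p_student_T).sum(axis=-1).mean() * (T ** 2)
    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    log_p_student = np.log(softmax(student_logits))
    hard_loss = -log_p_student[np.arange(len(hard_labels)), hard_labels].mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical 3-class example: a student matching the teacher scores a
# lower loss than one that disagrees with it.
labels = np.array([0])
teacher_logits = np.array([[5.0, 0.0, 0.0]])
loss_match = distillation_loss(teacher_logits.copy(), teacher_logits, labels)
loss_mismatch = distillation_loss(np.array([[0.0, 5.0, 0.0]]), teacher_logits, labels)
```

Minimizing this objective pushes the compact student model toward the full posterior behavior of the large multilingual teacher, which is how the compressed model can retain the crosslingual generalization described in the abstract.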

