With the rise of Big Data, speech recognition technologies have matured rapidly, and people eagerly anticipate the possibilities a voice-driven era can bring. These technologies are now mobile: no longer resources exclusive to advanced nations, they can be enjoyed by speakers of different languages in every region of the world. Human speech in these different languages, though each forms its own system, shares one common trait: all are signal media through which humans understand one another, carrying emotion, ideas, information, and meaning. This thesis investigates how speech corpora of different languages can assist one another in learning, extending the conventional monolingual speech recognition system into a multilingual one and uncovering the cross-lingual knowledge latent within, in hopes of strengthening the recognition system for each language. Using the GlobalPhone multilingual phonetic corpus, we begin with purely linguistic knowledge, add data-driven methods, and finally merge the hidden layers of deep neural networks, exploring step by step, from coarse to fine, how the cross-lingual knowledge shared among acoustic models can be combined. Once a multilingual recognition system is built, the deep learning model becomes larger and the training procedure more complex. To accommodate this wealth of information while remaining convenient for real-time use, this thesis also investigates knowledge distillation, compressing the large multilingual model into a smaller one and successfully extracting richer cross-lingual generalization, helping the multilingual speech recognition system become more accurate.
Speech signal processing technologies have matured alongside the Big Data era, and the appeal of voice technology draws wide attention from modern users. These resources are no longer monopolized by a few powerful companies, but are shared by speakers in different regions, using different languages, all over the world. The many varieties of human speech each have their own unique properties, yet they all share one: people rely on speech to comprehend one another. This thesis focuses on combining speech data from different languages to enhance the conventional monolingual speech recognition system, so that the latent cross-lingual information can be found and utilized. We use the GlobalPhone corpus to study linguistic knowledge, data-driven methods, and model-sharing techniques. The research proceeds step by step, from coarse phonetic-level merging to fine-grained model-level sharing, achieving better results through cross-lingual information. Once multilingual speech recognition systems are built, the models become deep and cumbersome, and the training procedure requires more complex and time-consuming techniques. To capture the generalization ability lying inside these huge models within a small model suitable for real-time, in-hand use, one can apply knowledge distillation to extract that information, thereby achieving model compression.
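The knowledge-distillation idea mentioned above can be made concrete with a minimal sketch, assuming the standard temperature-softened formulation: the student is trained to match the teacher's softened output distribution by minimizing a KL divergence. All function names here are illustrative, not from the thesis itself.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax: a higher temperature yields a
    softer distribution that exposes the teacher's 'dark knowledge'."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence KL(teacher || student) between the two softened
    distributions; the distillation term added to the student's loss."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In practice this term is weighted against the usual cross-entropy with the hard labels, and the gradients are scaled by the squared temperature, but the core objective is just this divergence between teacher and student outputs.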