
Meta Learning in End-to-End Speech Recognition

Advisor: Hung-yi Lee (李宏毅)

Abstract


This thesis investigates how different transfer learning methods perform on automatic speech recognition under the constraint of limited labeled data. The study centers on two approaches: Model-Agnostic Meta-Learning (MAML), a meta learning method that has seen early success in few-shot image classification and reinforcement learning since 2017, and multitask learning, a long-established technique in the speech field. Three scenarios are examined: cross-language phoneme recognition, cross-accent end-to-end speech recognition, and cross-language end-to-end speech recognition. Moving from acoustic modeling to end-to-end speech recognition, from relatively simple deep neural networks to the more complex Transformer, and from the cross-accent setting, where the data are more similar, to the cross-language setting, we progressively push the boundary of applying meta learning to speech tasks. By varying the pretraining datasets, validation sets, number of fine-tuning steps, amount of fine-tuning data, and sampling strategies during pretraining, we try to identify the conditions under which meta learning brings larger gains. The experimental results show that in very-low-resource cross-language phoneme recognition, and in low-resource cross-accent end-to-end speech recognition where the data are relatively similar, meta learning transfers better than multitask learning; in cross-language end-to-end speech recognition, where the data differ more and the model cannot be trained with too little corpus, the two methods perform comparably. These findings characterize the properties a speech task should have for meta learning to be worthwhile, and can serve as a reference for subsequent research.

Parallel Abstract


This thesis surveys transfer learning methods for automatic speech recognition under low-resource settings. In addition to multitask learning, the most popular form of transfer learning in speech, we introduce meta learning methods into speech processing. We use cross-language phoneme recognition, cross-accent end-to-end speech recognition, and cross-language end-to-end speech recognition as testing scenarios. To explore the limits of applying meta learning in speech processing, we move from simple acoustic modeling to more complicated end-to-end speech recognition, from a simple multi-layer neural network to the more complicated Transformer architecture, and from the similar cross-accent setting to the more challenging, dissimilar cross-language setting. To find the suitable transfer learning method for a given scenario, we control variables such as the pretraining datasets, validation sets, number of fine-tuning steps, amount of data used in fine-tuning, and the sampling strategies during pretraining. The experiments show that under low-resource settings, meta learning outperforms multitask learning on cross-language phoneme recognition and cross-accent end-to-end speech recognition. However, on the more challenging cross-language end-to-end speech recognition task, there is no performance gap between the two methods. We believe these findings can help researchers explore further applications of meta learning in speech processing.
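The methodological contrast at the heart of the abstract, MAML-style meta pretraining versus plain multitask pretraining, can be sketched on a toy problem. Everything below (the 1-D linear-regression "tasks", the slopes, the learning rates, the support/query split) is an illustrative assumption, not the thesis setup; on this symmetric toy both methods converge to roughly the same parameter, so the sketch only shows the structure of the two updates, not the performance differences the experiments measure.

```python
# Minimal first-order MAML sketch vs. multitask pretraining.
# Tasks, slopes, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_task(slope):
    """A 'task' is fitting y = slope * x from 20 sampled points."""
    x = rng.uniform(-1, 1, size=20)
    return x, slope * x

def loss_and_grad(w, x, y):
    """Mean-squared error of the 1-parameter model y_hat = w * x."""
    err = w * x - y
    return np.mean(err ** 2), np.mean(2 * err * x)

def maml_step(w, tasks, inner_lr=0.1, outer_lr=0.05, inner_steps=1):
    """One meta-update: adapt on each task's support half, then
    accumulate the gradient of the *adapted* parameter on the query
    half (first-order MAML: second derivatives are ignored)."""
    meta_grad = 0.0
    for x, y in tasks:
        w_task = w
        for _ in range(inner_steps):              # inner-loop adaptation
            _, g = loss_and_grad(w_task, x[:10], y[:10])
            w_task -= inner_lr * g
        _, g_q = loss_and_grad(w_task, x[10:], y[10:])  # query gradient
        meta_grad += g_q
    return w - outer_lr * meta_grad / len(tasks)

def multitask_step(w, tasks, lr=0.05):
    """Multitask pretraining: one pooled gradient step over all tasks."""
    g_total = sum(loss_and_grad(w, x, y)[1] for x, y in tasks)
    return w - lr * g_total / len(tasks)

tasks = [make_task(s) for s in (1.0, 2.0, 3.0)]
w_maml = w_mtl = 0.0
for _ in range(200):
    w_maml = maml_step(w_maml, tasks)
    w_mtl = multitask_step(w_mtl, tasks)
# Both land near the mean slope on this symmetric toy; the point is
# the different shape of the updates, not a quality gap.
```

The design difference mirrors the thesis comparison: multitask pretraining optimizes one parameter for the pooled tasks, while MAML optimizes for *post-fine-tuning* performance by differentiating through a simulated fine-tuning step on each task.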
