

Self-supervised Learning for Spoken Keyword Detection and Analysis of its Transferability

Advisor: Hung-yi Lee (李宏毅)

Abstract


In deep learning, transfer learning broadly refers to techniques that apply a model trained on one task to another task, in the hope of improving performance on the latter. How much a given model benefits from such transfer is called its transferability. Self-supervised learning is one of the most important techniques in transfer learning today: a model is first pre-trained with a self-supervised objective on a large unlabeled corpus, such as text or speech collected from the web, and then transferred to a conventional supervised task. In recent years this recipe has demonstrated excellent transferability and has substantially improved performance on many human language processing tasks.

Given the strong transferability of self-supervised pre-trained models, this thesis first applies them to user-defined spoken keyword detection. Spoken keyword detection aims to detect speech containing a specific word so that a device can respond appropriately, as in the wake-word function of a voice assistant. User-defined spoken keyword detection further allows the user to customize that word, in which case the user provides only a very small amount of labeled data for it. How best to combine self-supervised models with existing few-shot learning algorithms to achieve the strongest performance remains an open question that previous work has not explored thoroughly. This thesis studies the question systematically and finds that combining the HuBERT model with the matching network algorithm achieves the best performance in this application.

The thesis then applies the transferability of self-supervised models to data entirely different in kind from the pre-training data. Unlike previous work, which uses these models only for language processing tasks, this thesis transfers language self-supervised models to non-text sequence processing tasks involving proteins, DNA, and music, and finds that the language models not only converge faster than randomly initialized counterparts but also generalize better. A review of the literature found no theory or conjecture that fully explains this finding. The analysis in this thesis further shows that language self-supervised models and protein self-supervised models exhibit relatively high representation similarity.

From an application standpoint, this thesis demonstrates the possibility of reusing language self-supervised models on non-text sequence data, reducing the computational burden of re-running self-supervised pre-training on non-text data.
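The few-shot recipe described above (a frozen self-supervised encoder plus a matching network over a handful of labeled keyword examples) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy 2-D vectors stand in for pooled HuBERT features, and the function name is hypothetical.

```python
import numpy as np

def matching_network_classify(support, support_labels, query, n_classes):
    """Few-shot classification in the style of a matching network.

    Each query embedding is labeled by a cosine-similarity-weighted
    vote over the embeddings of the few labeled support examples.
    `support` is (n_support, d), `query` is (n_query, d); in the
    keyword-detection setting these would be pooled encoder features.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    s = normalize(support)                     # (n_support, d)
    q = normalize(query)                       # (n_query, d)
    sims = q @ s.T                             # cosine similarities
    attn = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)  # softmax
    onehot = np.eye(n_classes)[support_labels] # (n_support, n_classes)
    probs = attn @ onehot                      # (n_query, n_classes)
    return probs.argmax(axis=1), probs
```

Because the support set holds only the user's few enrollment examples, no gradient update is needed at enrollment time; customizing the keyword amounts to swapping the support embeddings.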

Abstract (English)


In deep learning, transfer learning means using a model trained on one task to learn another task; a model with better transferability benefits more from such transfer. Self-supervised learning is now an important technique for transfer learning. Models pre-trained on large-scale unlabeled corpora with self-supervised objectives show extraordinary transferability when transferred to traditional supervised learning tasks, significantly improving performance on a wide variety of human language processing tasks. Motivated by this remarkable transferability, this work first utilizes self-supervised pre-trained models for user-defined spoken keyword detection. Spoken keyword detection aims to detect a specific word in speech in order to evoke an appropriate response, such as waking up a smart assistant; user-defined spoken keyword detection lets the user customize that word. In this setting, the labeled data provided by users is very limited, and how to combine self-supervised models with existing few-shot learning algorithms to achieve the best performance is an open question that previous work has not studied in depth. This work studies the problem systematically and finds that combining HuBERT with a matching network obtains the best results. Next, this work applies the transferability of self-supervised models to data that is very different from the pre-training corpora: unlike previous work, it adapts these models to non-text sequence processing tasks involving protein, DNA, and music data. The results show that self-supervised models converge faster and generalize better than their randomly initialized counterparts, and no existing theory or hypothesis fully explains these findings.
The analysis in this work indicates that representations from self-supervised models pre-trained on natural language and those from self-supervised models pre-trained on proteins share non-trivial similarities. For practitioners who need to process non-text data, this work provides a solution that reuses self-supervised models pre-trained on natural language, thereby reducing computational costs.
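A common metric for the kind of representation-similarity analysis mentioned above is linear centered kernel alignment (CKA). The abstract does not state which metric the thesis uses, so this sketch is an illustrative assumption, not the thesis's method:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices X (n, d1) and Y (n, d2) extracted from the same n inputs.

    Returns a value in [0, 1]; it is invariant to orthogonal
    transformations and isotropic scaling of either representation,
    which makes it suitable for comparing models with different
    hidden sizes (e.g., a language model vs. a protein model).
    """
    X = X - X.mean(axis=0, keepdims=True)   # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, 'fro') ** 2          # cross-covariance energy
    norm_x = np.linalg.norm(X.T @ X, 'fro')
    norm_y = np.linalg.norm(Y.T @ Y, 'fro')
    return hsic / (norm_x * norm_y)
```

In practice, X and Y would be hidden states from the two pre-trained models on a shared probe set; a higher CKA score indicates more similar learned representations.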

