
以多模態資訊強化自督導式學習

Enhanced Self-Supervised Learning by Multimodal Information

Advisor: 李宏毅 (Hung-yi Lee)
Co-advisor: 曹昱 (Yu Tsao)

Abstract


Advances in neural network architectures, together with the development of graphics processing units, have enabled computers to perform many human cognitive tasks, in some cases even surpassing human performance. Some of these tasks underlie how humans interact with the world, such as speech recognition and image-text matching. Others require models with deeper understanding, such as natural language understanding, visual question answering, and spoken language understanding. Despite the diversity of these tasks, neural approaches to them share a common trait: almost all of them ultimately rely on representation learning. Representations are most often implemented as vectors; any piece of data can be compressed into a vector of sufficiently large dimension (for example, 512) that captures the essential information of the original input. Once training is complete, these vectors can be passed to different models to carry out the final task: through a classifier for classification, or through a decoder that generates sequences for tasks such as translation and recognition. Representation learning is therefore a popular and practical research topic, and its central question is how to compress information into vectors efficiently. Prior work on text, images, and audio has developed a variety of techniques for this, most of which rely on self-supervised training and can thus learn from unlabeled data.

When data contain multiple modalities (two or more of text, images, and audio), multimodal learning is usually needed to extract information from the different modalities, which requires paired data for joint training. Given the strong performance of self-supervised models, multimodal architectures are typically designed by combining existing self-supervised models and their parameters. This makes it possible to solve tasks that require understanding the interactions between modalities, such as visual question answering and spoken language understanding. Because separate unimodal systems do not share a common vector space, moving from unimodal to multimodal representation learning poses several challenges.

In this thesis, I study three approaches to enhancing self-supervised models with multimodal data. The first is visually enhanced language understanding, which uses visual information to improve natural language understanding performance. The second examines unsupervised speech recognition, investigating how different domains of text and speech affect its results. The third is spoken language understanding, studying how the input granularity of a text self-supervised model affects final spoken language understanding performance.
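To make the pipeline described above concrete, here is a minimal sketch (not code from the thesis) of an encoder that compresses a variable-length input into a single 512-dimensional vector, which a downstream classifier head then reuses; apart from the 512-dimensional embedding mentioned in the abstract, all module choices and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

# Hypothetical encoder: compresses a variable-length input into one 512-dim vector.
class Encoder(nn.Module):
    def __init__(self, input_dim=80, embed_dim=512):
        super().__init__()
        self.rnn = nn.GRU(input_dim, embed_dim, batch_first=True)

    def forward(self, x):              # x: (batch, time, input_dim)
        _, h = self.rnn(x)             # h: (1, batch, embed_dim)
        return h.squeeze(0)            # fixed-size representation per input

encoder = Encoder()
classifier = nn.Linear(512, 10)        # downstream head for a 10-class task

features = torch.randn(4, 100, 80)     # e.g., 100 frames of 80-dim acoustic features
embedding = encoder(features)          # one 512-dim vector per utterance
logits = classifier(embedding)         # shape (4, 10)

The same embedding could instead be fed to a sequence decoder for generation tasks such as translation or recognition, which is the reuse property the abstract describes.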

Parallel Abstract (English)


The advancement of neural networks and GPUs has enabled machines to accomplish cognitive tasks, some of which outperform human baselines. Some tasks concern a general understanding of how humans interact with their environment, such as speech recognition, image-text matching, and optical character recognition. More advanced tasks dive into the semantics of these signals, such as natural language understanding, visual question answering, and spoken language understanding. Despite the variety of tasks, the common approach to solving them is representation learning. A representation typically takes the form of an embedding, a vector of sufficiently large dimension (for example, 512) that contains the essential information of the signal. Once trained, the vectorized output can be fed to a classifier to solve classification problems, or to a sequential decoder for sequence generation. Representation learning is therefore of high interest to researchers, and for signals of all types, such as text, speech, and images, different pretraining strategies have been developed to extract information efficiently. Multimodal learning is a type of representation learning that requires knowledge from multiple modalities. Transferring from single-modality representation learning to multimodality representation learning is challenging because separate single-modality systems do not share the same embedding space; learning multimodal relations therefore often involves specific cross-modal structures combined with multimodal pretraining. Applications of multimodal learning include spoken language understanding, visual question answering, and more. In my thesis, I investigate three directions related to enhancing multimodal learning. The first is visually enhanced language learning, where visual information is used to enhance natural language understanding. The second concerns the robustness of unsupervised ASR, in which I experiment with different domains of speech and text to determine how domain mismatch affects performance. The third addresses spoken language understanding, where I explore how the granularity of text model inputs affects the final results.
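As an illustration of the shared-embedding-space issue raised above, the sketch below (an assumed example, not the method used in the thesis) aligns the outputs of two unimodal encoders with learned projections and a CLIP-style symmetric contrastive loss; the dimensions, projection heads, and temperature are all invented for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection heads mapping embeddings from two pretrained unimodal
# encoders (of different widths) into one shared 256-dim space.
proj_speech = nn.Linear(768, 256)    # e.g., a 768-dim speech encoder output
proj_text = nn.Linear(1024, 256)     # e.g., a 1024-dim text encoder output

speech_emb = torch.randn(8, 768)     # stand-ins for paired unimodal embeddings
text_emb = torch.randn(8, 1024)      # (row i of each tensor is a matched pair)

s = F.normalize(proj_speech(speech_emb), dim=-1)
t = F.normalize(proj_text(text_emb), dim=-1)
logits = s @ t.T / 0.07              # all-pairs cosine similarity with temperature

# Symmetric contrastive loss: matched pairs lie on the diagonal.
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2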

