An Analysis on the Use of Self-Supervised Learning Models in Speech Translation

Advisor: Lin-shan Lee (李琳山)

Abstract


Speech translation (ST) is the task of transforming speech signals in the source language into text in the target language. With artificial neural networks, this task can now be accomplished in an end-to-end manner, although performance remains limited by the scarcity of annotated training data. Recently, self-supervised learning (SSL) models have proven powerful at extracting the structural information inherent in raw linguistic data by learning from large quantities of unlabeled speech or text. In this work, we analyze, from several perspectives, how such SSL models can support ST. Because the input of the ST task is a raw acoustic signal, we first take SSL models trained on speech data as feature extractors and compare the ST performance achieved with different SSL models under both cascade and end-to-end scenarios. We further reduce the ST task to a machine translation (MT) task by clustering SSL features into discrete units, and confirm the feasibility of this approach. Next, because the output of the ST task is text, we take SSL models trained on text data and fine-tune them from language models into translation models; using discrete units as the input, we show clear advantages under low-resource conditions. Finally, we analyze how different settings in discrete-unit generation affect the achievable ST performance, which may also serve as a guideline when applying the discrete-unit technique to other speech tasks.
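As a concrete illustration of the discrete units mentioned above, the following is a minimal sketch (not the thesis's actual code) of one common recipe: extract frame-level representations from a speech SSL model and quantize them with k-means. The wav2vec 2.0 checkpoint, the choice of K = 100 clusters, and the file names are assumptions for illustration only.

import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"    # assumed checkpoint, not the thesis setting
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

def ssl_features(wav_path):
    # Frame-level SSL representations for one utterance, shape (frames, dim).
    wave, sr = torchaudio.load(wav_path)
    wave = torchaudio.functional.resample(wave, sr, 16000).mean(0)
    inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state.squeeze(0)

# 1) Fit k-means on features pooled over (a subset of) the corpus.
corpus = ["utt1.wav", "utt2.wav"]        # hypothetical file list
feats = torch.cat([ssl_features(p) for p in corpus]).numpy()
kmeans = MiniBatchKMeans(n_clusters=100, n_init="auto").fit(feats)

# 2) Quantize an utterance into a discrete-unit sequence, collapsing
#    adjacent repeats (a common optional post-processing step).
def to_units(wav_path):
    units = kmeans.predict(ssl_features(wav_path).numpy()).tolist()
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]

print(to_units("utt1.wav"))              # a sequence of cluster IDs

Once speech is expressed this way, the translation step can be handled by an ordinary sequence-to-sequence model over unit IDs, which is what reduces ST to MT.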

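The complementary direction in the abstract, fine-tuning a text-pretrained SSL model into a unit-to-text translation model, can be sketched in the same hedged spirit: graft the unit vocabulary onto a pretrained sequence-to-sequence model and train it as ordinary MT. The BART checkpoint, the <unit_k> token scheme, and the toy training pair below are assumptions, not the thesis's actual configuration.

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

name = "facebook/bart-base"              # assumed text-pretrained checkpoint
tok = BartTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name)

# Give every discrete unit its own token so unit sequences become "text";
# K must match the number of k-means clusters used to generate the units.
K = 100
tok.add_tokens([f"<unit_{i}>" for i in range(K)])
model.resize_token_embeddings(len(tok))

def encode_pair(units, target_text):
    # Map a unit sequence and its translation into model-ready tensors.
    src = " ".join(f"<unit_{u}>" for u in units)
    return tok(src, text_target=target_text, return_tensors="pt")

# One illustrative training step; real training loops over a parallel corpus
# of (unit sequence, target text) pairs.
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)
batch = encode_pair([17, 4, 93], "hello world")  # hypothetical example pair
loss = model(**batch).loss
loss.backward()
optim.step()

Because most of the model's parameters arrive pretrained on text, comparatively few parallel examples are needed to adapt it, which is consistent with the low-resource gains the abstract reports.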