Improving End-to-End Taiwanese-to-Chinese Speech Translation through Semi-Supervised Learning

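Taiwanese speech recognition faces two main problems: (1) large, publicly available Taiwanese corpora are scarce, and (2) the Taiwanese writing system is not standardized. The former leaves speech recognition tasks short of data; the latter makes output formats inconsistent and hard to read. This study takes Taiwanese speech recognition combined with Chinese translation as its task and builds a Taiwanese speech translation model on an architecture that combines a pretrained speech model with an end-to-end deep learning model. Starting from a small corpus of Taiwanese speech paired with Chinese text, we collect a large amount of Taiwanese speech data from the internet for semi-supervised learning and design a data cleaning algorithm, improving both the Taiwanese speech translation system and the Taiwanese corpus. The study explores four directions for improvement: end-to-end speech translation models, pretrained speech model features, iterative training methods, and corpus cleaning. Experimental results show that each of these methods improves the performance of Taiwanese-speech-to-Chinese translation.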

Keywords

Automatic speech recognition; self-supervised learning; end-to-end speech recognition; machine translation

Parallel Abstract

The challenges in Taiwanese speech recognition fall into two main categories: 1) the lack of large, publicly available Taiwanese speech corpora, and 2) the inconsistency of the Taiwanese writing system. The former leaves speech recognition tasks with insufficient data, while the latter leads to inconsistent output formats that are difficult to read. In this study, we focus on the task of combining Taiwanese speech recognition with Chinese translation and propose a framework that integrates pretrained speech models with end-to-end deep learning models to build a Taiwanese speech translation system. Starting from a limited amount of Taiwanese speech paired with Chinese text, we apply semi-supervised learning to a large collection of Taiwanese speech data gathered from the internet and design data cleaning algorithms to improve both the Taiwanese speech translation system and the Taiwanese speech corpora. The research explores four main directions for improvement: end-to-end speech translation models, pretrained speech model features, iterative training methods, and data cleaning. Experimental results show that each of these approaches improves the performance of Taiwanese-speech-to-Chinese translation.

Parallel Keywords

Automatic speech recognition; Self-supervised learning; End-to-end speech recognition; Machine translation
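
The abstract names iterative training on web-collected speech together with data cleaning but does not specify either procedure. The following is only a minimal sketch of one common way such a loop can be organized (pseudo-labeling with a confidence filter), not the method used in the thesis. Every name here (`self_train`, `keep`, `toy_train`, `TOY_HYPOTHESES`), the confidence threshold, and the cleaning rule are assumptions made for illustration, and the toy model is a lookup table standing in for a real speech translation model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]                      # (utterance id, Chinese text)
Model = Callable[[str], Tuple[str, float]]  # utterance id -> (hypothesis, confidence)


def keep(text: str, confidence: float, min_confidence: float) -> bool:
    """Hypothetical cleaning rule: drop empty or low-confidence hypotheses."""
    return bool(text.strip()) and confidence >= min_confidence


def self_train(
    labeled: List[Pair],
    unlabeled: List[str],
    train: Callable[[List[Pair]], Model],
    rounds: int = 2,
    min_confidence: float = 0.5,
) -> Tuple[Model, List[Pair]]:
    """Iteratively pseudo-label unlabeled speech and retrain on the kept pairs."""
    model = train(labeled)  # seed model from the small paired corpus
    pseudo: List[Pair] = []
    for _ in range(rounds):
        pseudo = []
        for utt in unlabeled:
            text, confidence = model(utt)
            if keep(text, confidence, min_confidence):
                pseudo.append((utt, text))
        # Retrain on the original pairs plus the cleaned pseudo-labels.
        model = train(labeled + pseudo)
    return model, pseudo


# Toy stand-in for a real model, for illustration only (assumed data).
TOY_HYPOTHESES = {"web1": ("你好", 0.9), "web2": ("", 0.95), "web3": ("謝謝", 0.3)}


def toy_train(pairs: List[Pair]) -> Model:
    memory = dict(pairs)

    def model(utt: str) -> Tuple[str, float]:
        if utt in memory:  # seen during training: return the stored text
            return memory[utt], 1.0
        return TOY_HYPOTHESES.get(utt, ("", 0.0))

    return model
```

Calling `self_train([("seed1", "早安")], ["web1", "web2", "web3"], toy_train)` with this toy setup keeps only the pair for `web1`: the hypothesis for `web2` is empty and the one for `web3` falls below the assumed threshold. In a real system the filter would more plausibly rely on model scores, length ratios, or similar signals, which the abstract does not detail.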

