A Highly Efficient Chinese-English News Bilingual Concordancer Built on Transformer-Based Deep Learning Models

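A translation corpus (or parallel corpus) is a special type of text corpus that has played a key role in translation practice, translation research, and translation education (Bernardini, Stewart, Zanettin, 2003). The advent of statistical machine translation (SMT) systems, and of the neural machine translation (neural MT) systems that have become widespread in recent years, has made parallel corpora even more important: the large volume of "labeled" data, that is, the data required for supervised learning, needed to train mainstream machine translation systems is precisely parallel text. Machine translation has made great strides over the past few years, yet many translators and translation educators still rely in their everyday work on bilingual concordancers and the parallel corpora behind them. This study offers what are currently the best available approaches to three major issues encountered when building a high-performance Chinese-English concordancer: (1) improving the accuracy of word alignment within parallel sentences, (2) improving the accuracy of sentence alignment within document-aligned texts, and (3) finding hidden parallel sentences in a comparable corpus. The results show that language models built on the transformer architecture, one of the latest artificial neural network techniques in natural language processing (NLP), can accurately align words and phrases in parallel sentences (that is, minimize alignment error), which helps translators quickly locate the corresponding translation in the target language. In addition, sentence-level transformers can upgrade a document-aligned or paragraph-aligned parallel corpus into a sentence-aligned corpus and greatly reduce the manual correction needed after automatic sentence alignment. Finally, we demonstrate how to first mine parallel news articles from multilingual news websites and then extract parallel sentences from them. Explicit links or associations between parallel news articles are exploited where they exist; where they do not, the algorithm developed in this study can judge and compare articles on the basis of their semantics.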

Keywords

parallel corpus; bilingual concordancer; bitext alignment; sentence embeddings; Transformer; BERT; sentence transformer

Parallel Abstract

Translation (or parallel) corpora, a special type of text corpora, have been instrumental in the field of translation studies and in translator education (Bernardini, Stewart, Zanettin, 2003). The advent of statistical and, more recently, neural network-based machine translation (MT) systems has made the importance of parallel corpora even more pronounced, as they serve as the "labeled", or "supervised", training data essential for the success of such systems. However, even with the vast improvements MT systems have achieved over the past several years, translators and translation educators still rely heavily on parallel corpora, in the form of bilingual concordancers, in their day-to-day work. In this study, we aim to address three areas concerning the creation of a highly productive Chinese-English concordancer: (a) accuracy in word alignment for parallel sentences; (b) accuracy in sentence alignment for document-aligned parallel corpora; and (c) mining parallel sentences from comparable sources. Our findings suggest that language models created with the latest artificial neural network-based natural language processing (NLP) technology (specifically, "transformers") can accurately align words and phrases in parallel sentences, which makes it easier to identify target-language translation equivalents for a given source-language search word or phrase. Moreover, with sentence transformers, a document- or paragraph-aligned parallel corpus can be converted into aligned sentences with far less manual post-alignment correction. Finally, we demonstrate that parallel sentences can be harvested profitably from news organizations that offer multilingual news articles, even if pairs of news articles and their translations are not explicitly linked or otherwise indicated on the website.

Parallel Keywords

parallel corpus; bilingual concordancer; bitext alignment; sentence embeddings; transformer; BERT; sentence transformer
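As a rough illustration of the sentence-alignment idea summarized in the abstract, the sketch below pairs source and target sentences whose embeddings are mutual nearest neighbours under cosine similarity. It is a minimal sketch and not the study's actual algorithm: the function names, the `threshold` value, and the three-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would come from a multilingual sentence-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mutual_best_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual nearest neighbours.

    A pair is kept only if each sentence is the other's best match and
    the similarity reaches `threshold` (a hypothetical cut-off).
    """
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    pairs = []
    for i, row in enumerate(sims):
        # Best target sentence for source sentence i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source sentence for that target sentence.
        i_back = max(range(len(sims)), key=lambda k: sims[k][j])
        if i_back == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy embeddings (assumed values standing in for real model output).
zh_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
en_vecs = [[0.1, 0.9, 0.0], [1.0, 0.0, 0.1], [0.5, 0.5, 0.5]]

aligned = mutual_best_pairs(zh_vecs, en_vecs)
for src_idx, tgt_idx, score in aligned:
    print(f"zh[{src_idx}] <-> en[{tgt_idx}]  similarity={score:.3f}")
```

In this toy run the third source sentence has no confident counterpart and is left unaligned, which is the kind of case that would otherwise need manual post-alignment correction. The mutual-best-match filter favours precision over recall; a real pipeline might instead use a margin-based score or a monotonic alignment constraint.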

References

Artetxe, M., Schwenk, H. (2019a). Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings. arXiv:1811.01136 [cs]. http://arxiv.org/abs/1811.01136

Artetxe, M., Schwenk, H. (2019b). Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond. Transactions of the Association for Computational Linguistics, 7, 597–610. https://doi.org/10.1162/tacl_a_00288

Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis, E. Tognini-Bonelli (Eds.), Text and technology: In honour of John Sinclair (pp. 233–250). John Benjamins.

Baker, M., Saldanha, G. (Eds.). (2020). Routledge Encyclopedia of Translation Studies (3rd ed.). Routledge.

Baroni, M., Bernardini, S. (2004). BootCaT: Bootstrapping Corpora and Terms from the Web. In Proceedings of LREC 2004.

