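A Study on Using Domain-Related Monolingual Corpora for Cross-Domain Machine Translation Adaptation: A Hybrid Machine Translation Strategy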

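This thesis examines the problem of cross-domain machine translation. Statistical translation has gradually become the mainstream of machine translation in recent years; however, translating domain-specific sentences with a general-domain statistical translation model runs into many problems, such as ambiguity, ordering errors, and out-of-vocabulary (OOV) words. Because a domain-specific bilingual corpus does not necessarily exist, in our earlier experiments we added a bilingual dictionary and rule-based translation to assist statistical translation and obtained good results. In this thesis, we go a step further and use a domain-related monolingual corpus to improve the statistical translation model. We exploit the monolingual corpus in several ways, including extracting new translation rules (patterns) from post-editing, training domain-related statistical translation models through unsupervised and semi-supervised learning, and examining how different model combinations affect translation quality. Experiments show that the rules extracted from post-editing do improve translation quality; the models trained from the monolingual corpus by unsupervised and semi-supervised learning also both improve significantly; and the model that combines post-editing with semi-supervised learning performs best.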

Keywords

Statistical machine translation; domain adaptation; monolingual corpus; semi-supervised learning; post-editing

Parallel Abstract

This thesis addresses the problems encountered in cross-domain machine translation. Statistical machine translation has become the mainstream approach in the field; however, using a general-domain statistical model to translate domain-specific sentences causes many problems, such as word sense ambiguity errors, ordering errors, and out-of-vocabulary errors. Because a domain-specific parallel corpus may not be available, earlier experiments added a domain-specific bilingual dictionary and translation rules to support the statistical machine translation system and obtained good results. In this thesis, we build on those models and extend them by using an in-domain monolingual corpus to improve the statistical machine translation system. We exploit the monolingual corpus in several ways: mining new translation patterns from post-editing logs, training in-domain statistical machine translation models through unsupervised and semi-supervised learning, and studying how different combinations of these approaches affect performance. The experimental results show that the new translation rules mined from post-editing logs significantly improve translation quality. The model trained with semi-supervised learning also improves performance, and the combination of new translation rules and the semi-supervised method gives the best result.
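
As an illustration of what mining translation patterns from a post-editing log could look like, the following minimal Python sketch counts token-level substitutions between machine output and its post-edited version. It is not the procedure used in the thesis, which the abstract does not specify. The log format (a list of string pairs), the function name `mine_patterns`, the `min_count` threshold, and the sample sentences are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def mine_patterns(log, min_count=2):
    """Count token-level substitutions between MT output and its post-edit.

    `log` is a hypothetical list of (mt_output, post_edited) string pairs.
    Returns {(old_phrase, new_phrase): count} for substitutions that occur
    at least `min_count` times across the log.
    """
    counts = Counter()
    for mt_output, post_edited in log:
        mt_tokens = mt_output.split()
        pe_tokens = post_edited.split()
        matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only substitutions are treated as candidate patterns here;
            # pure insertions and deletions are ignored in this sketch.
            if tag == "replace":
                old = " ".join(mt_tokens[i1:i2])
                new = " ".join(pe_tokens[j1:j2])
                counts[(old, new)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical post-editing log for illustration only.
sample_log = [
    ("the mouse is broken", "the pointing device is broken"),
    ("click the mouse twice", "click the pointing device twice"),
    ("restart the machine", "restart the computer"),
]
patterns = mine_patterns(sample_log)
```

With this sample log, only the substitution of "mouse" by "pointing device" recurs often enough to be kept. A real system would presumably also need to generalize such substitutions and filter out one-off stylistic edits, which this sketch does not attempt.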

Parallel Keywords

Statistical Machine Translation; Domain Adaptation; Monolingual Corpus; Semi-Supervised Learning; Post-editing

