
改善翻譯流暢度之單語統計式機器翻譯模式

Improving Translation Fluency with a Monolingual Statistical Machine Translation Model

Advisor: Jing-Shin Chang (張景新)

Abstract


Research on machine translation (MT) has been extensive, yet the results remain far from satisfactory. Measured by the BLEU score proposed at IBM (Papineni, 2002), scores for the specific task of English-to-Chinese translation fall only between 0.21 and 0.29, a level of quality that readers perceive as very disfluent. The main goal of this work is therefore to improve the fluency of translated sentences so that their quality becomes more acceptable to human readers.

Improvements to statistical machine translation (SMT) have long been sought within the Translation Model. However, because the expressive power of SMT models is quite limited, word-to-word or phrase-to-phrase substitution combined with local re-ordering cannot reliably produce fluent target-language output. In particular, many lexical items and morphemes specific to the target language cannot be generated by such mechanisms at all. We therefore argue that one must step outside the inherent limits of SMT and look elsewhere for fluency gains, paying special attention to two components that have received comparatively little notice: the Language Model and the Searching (Decoding) process.

Our approach to improving translation fluency is a Statistical Post-Editing (SPE) model that takes a disfluent translated sentence and turns it into a fluent version. Such a system can be viewed as a “disfluent-to-fluent” SMT, and it can be trained with a rather special Monolingual SMT model, whose training material comes from a monolingual corpus. The sentences of the monolingual corpus are automatically corrupted to produce a corpus of disfluent sentences; the resulting pairs of disfluent and fluent Chinese sentences are then run through an SMT-style training procedure to obtain a set of probability parameters mapping disfluent sentences to fluent ones (a schematic formulation is given below). With these parameters, the fluent sentence corresponding to a disfluent input can be located in an existing example base of fluent sentences or in a web corpus. Compared with the usual bilingual training, the monolingual corpus required by this monolingual training scheme is very easy to acquire.

For the language model of the SPE, we replace the word-based trigram model with a phrase-based unigram model, selecting fluent phrases rather than word trigrams. Since a phrase may be a long word string, it can be more fluent than a randomly assembled trigram (a small sketch of phrase-based scoring follows below). At the same time, we abandon the practice of defining target-language phrases from bilingual word-alignment results; the best phrases are instead trained directly from the target language. Phrases obtained this way conform fully to target-language grammar and can therefore be expected to be highly fluent. They are also far fewer in number than phrases assembled arbitrarily from words, which considerably reduces the model's estimation error and the amount of training data needed, and only monolingual target-language data are required. Such monolingual data exist in enormous quantities, unlike hard-to-obtain bilingual data, so fluent phrases can be trained well.

For decoding, or searching in the target language, our improvement is to first find, for a given disfluent sentence, the most similar (or even identical) sentence in an existing example base of fluent sentences, and then make small modifications to that sentence so that it matches the intended meaning of the original translation. Intuitively, a fluent sentence that already exists in an example base or a web corpus is more satisfying than one assembled by the machine. Even when revision is needed, it is usually only small-scale local editing, because the space of revisions to be searched has been greatly restricted by the example. Such an editing setting can be viewed as a constrained decoding process, which can substantially reduce the searching error of decoding. Experimental results show that, under several simulated error types, this statistical post-editing model achieves a considerable degree of improvement.
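In noisy-channel terms, the disfluent-to-fluent training described above amounts to the standard SMT decomposition. The abstract states the model only in prose, so the notation below is ours:

```latex
% d: the disfluent (machine-translated) sentence; t: a candidate fluent sentence.
% P(t): the language model over fluent target sentences (here, phrase-based unigram).
% P(d|t): the error model estimated from the auto-generated disfluent/fluent pairs.
\hat{t} = \arg\max_{t} P(t \mid d) = \arg\max_{t} P(t)\, P(d \mid t)
```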
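To make the phrase-based unigram language model concrete, here is a minimal sketch of how such a model scores a sentence by its best segmentation into phrases, preferring long grammatical phrases over concatenations of short ones. The phrase inventory and its probabilities (PHRASE_PROB) are hypothetical; the real model would train them from a large monolingual target corpus:

```python
# Score a sentence under a phrase-based unigram model: dynamic programming
# finds the segmentation that maximizes the product of phrase probabilities
# (equivalently, minimizes the sum of negative log probabilities).
import math

PHRASE_PROB = {                      # hypothetical trained inventory
    ("this",): 0.02,
    ("is",): 0.03,
    ("a",): 0.04,
    ("fluent", "sentence"): 0.001,   # a multi-word phrase
    ("fluent",): 0.0005,
    ("sentence",): 0.002,
}
MAX_PHRASE_LEN = 5

def best_segmentation(words):
    """Return (total negative log prob, best phrase segmentation)."""
    n = len(words)
    best = [(math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_PHRASE_LEN), end):
            phrase = tuple(words[start:end])
            prob = PHRASE_PROB.get(phrase)
            if prob is None or best[start][0] == math.inf:
                continue
            cost = best[start][0] - math.log(prob)
            if cost < best[end][0]:
                best[end] = (cost, best[start][1] + [phrase])
    return best[n]

cost, segmentation = best_segmentation("this is a fluent sentence".split())
print(segmentation)  # picks the multi-word phrase ("fluent", "sentence")
```

Because the inventory contains whole phrases that already obey target-language grammar, a high-probability segmentation is fluent by construction, which is the property the proposed model exploits.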

Parallel Abstract


While there has been a great deal of SMT research over the past decades, the performance is far from satisfactory. In translating English to Chinese, for instance, BLEU scores (Papineni, 2002) range only between 0.21 and 0.29. Translations of such quality are very disfluent for human readers. The goal of the current work is to propose a statistical post-editing model for improving the fluency of translated sentences.

The main approaches to improving classical SMT have long concentrated on the Translation Model (TM). Unfortunately, classical SMT models have very low expressive power: word-for-word or phrase-to-phrase translation plus a little local re-ordering may not generate fluent target-language sentences. In particular, many target-specific lexical items and morphemes cannot be generated by models of this kind. The implication is that we may have to go beyond the limitations of classical SMT models in order to improve the fluency of the translation. In particular, the Language Model (LM) and the searching (or decoding) process, which have received little attention in past research, should play more important roles.

In the current work, we propose a Statistical Post-Editing (SPE) model that translates disfluent sentences into fluent versions. Such a system can be regarded as a “disfluent-to-fluent” SMT, which can be trained with a Monolingual SMT model. It is special in that the training corpus can be acquired easily from a large monolingual corpus of fluent target sentences. By automatically generating a disfluent version of the fluent monolingual corpus (a sketch follows below), one can acquire the model parameters for translating disfluent sentences into fluent ones through a training process similar to that of a standard SMT. With such a model, the most likely fluent counterpart of a translated sentence can be searched for in an example base. In comparison with standard SMT training, which requires a parallel bilingual corpus, a monolingual corpus is much easier to acquire.

The LM proposed for the current SPE, which is responsible for selecting fluent target segments, is a phrase-based unigram model rather than the word-based trigram model widely used in classical SMT. Since a phrase can cover more than three words, the selected phrases may be more fluent than word trigrams. Furthermore, we do not define target phrases in terms of chunks of bilingually aligned words. Instead, the best target phrases are trained directly from the monolingual target corpus by optimizing the phrase-based unigram model. Such phrases fit the target grammar perfectly and will therefore generate more fluent sentences in general. The number of such phrases is also much smaller than that of phrases combined at random from word-aligned chunks, so the estimation error is significantly reduced, and the required training corpus is smaller as well. Unlike rare parallel bilingual corpora, target-language monolingual corpora are extremely large, so fluent phrases can be extracted well.

As far as the searching or decoding process is concerned, the proposed method is to search for the most likely fluent sentence(s) in an example base or in the Web corpus. Local editing is then applied only to a local region of the example sentence, based on the disfluent sentence.
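The automatic generation of the disfluent training corpus can be pictured with the following minimal sketch. The corruption operations here (dropping a function word and swapping adjacent words) are illustrative stand-ins for the simulated error types; FUNCTION_WORDS and the always-apply policy are our assumptions:

```python
# Build (disfluent, fluent) training pairs from a fluent monolingual corpus
# by injecting simulated translation errors into each fluent sentence.
import random

FUNCTION_WORDS = {"the", "a", "of", "to", "in"}   # hypothetical list

def corrupt(words, rng):
    out = list(words)
    # Simulated error type 1: drop a function word (MT output often omits them).
    drops = [i for i, w in enumerate(out) if w in FUNCTION_WORDS]
    if drops:
        del out[rng.choice(drops)]
    # Simulated error type 2: swap adjacent words to mimic bad local re-ordering.
    if len(out) > 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def make_training_pairs(fluent_corpus, seed=0):
    rng = random.Random(seed)
    return [(corrupt(s.split(), rng), s.split()) for s in fluent_corpus]

print(make_training_pairs(["the cat sat on the mat"]))
```

The resulting pairs play the role that bilingual sentence pairs play in standard SMT training, which is why only a monolingual corpus is needed.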
Intuitively, sentences retrieved from an example base or from the Web corpus will be much more fluent than sentences combined automatically by an SMT decoding module. Even if local editing is required, the repair will be quite local: the search space for repairing is significantly constrained by the words of the most likely example sentence. Such a post-editing context can thus be regarded as constrained decoding, and the searching error is reduced significantly in comparison with the large search space of the decoding process of a typical SMT. Experiments on several simulated error types of the translation process show that the proposed statistical post-editing model does improve fluency significantly.
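A minimal sketch of this retrieve-then-repair step follows. Plain token-level similarity is used for retrieval here; the abstract does not commit to a particular similarity measure, so that choice is an assumption on our part:

```python
# Constrained decoding as retrieve-then-repair: find the most similar fluent
# sentence in the example base, then restrict any repair to the spans where
# the two sentences differ.
import difflib

EXAMPLE_BASE = [                      # hypothetical fluent example base
    "the cat sat on the mat",
    "a dog ran across the road",
]

def retrieve(disfluent, examples):
    """Return the fluent example most similar to the disfluent input."""
    return max(examples, key=lambda e: difflib.SequenceMatcher(
        None, disfluent.split(), e.split()).ratio())

def edit_regions(disfluent, example):
    """The differing spans: the only places local editing needs to search."""
    sm = difflib.SequenceMatcher(None, disfluent.split(), example.split())
    return [op for op in sm.get_opcodes() if op[0] != "equal"]

src = "cat sat the mat on"            # a disfluent translated sentence
best = retrieve(src, EXAMPLE_BASE)
print(best)                           # "the cat sat on the mat"
print(edit_regions(src, best))        # a few short local spans only
```

Because everything outside those spans is kept verbatim from the example, the search collapses from the space of all possible target sentences to a handful of short local edits, which is what reduces the searching error.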

References


[Brown 1990] Brown, Peter F., John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. “A statistical approach to machine translation.” Computational Linguistics, 16(2):79–85.
[Chiang 1992] Chiang, Tung-Hui, Jing-Shin Chang, Ming-Yu Lin, and Keh-Yih Su. 1992. “Statistical models for word segmentation and unknown word resolution.” Proceedings of ROCLING-V, pp. 123–146, Taipei, Taiwan, ROC.
[Brown 1993] Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. “The mathematics of statistical machine translation: Parameter estimation.” Computational Linguistics, 19(2):263–311.
[Knight 1994] Knight, Kevin, and Ishwar Chander. 1994. “Automated post-editing of documents.” Proceedings of the Twelfth National Conference on Artificial Intelligence, pp. 779–784, CA, USA.
[CKIP 2001] CKIP. 2001. Academia Sinica Word Segmentation Corpus, ASWSC-2001 (中研院中文分詞語料庫). Chinese Knowledge Information Processing Group, Academia Sinica, Taipei, Taiwan, ROC.
[Papineni 2002] Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: a method for automatic evaluation of machine translation.” Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318, Philadelphia, PA, USA.
