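In this paper, we propose a new method for extracting noun phrase translations from a parallel corpus. Our method first uses a noun phrase chunker to extract all candidate noun phrases from the source sentence. For each noun phrase, we use an existing word alignment tool to find its partial translation in the target sentence. Taking this partial translation as an anchor, we then generate the candidate translations that contain it. Finally, we use a phrase translation model to select the most probable candidate. The phrase translation model comprises two probabilities: the lexical translation probability and the fertility probability. The lexical translation probability measures the degree of association between words, while the fertility probability models the number of target words that a source word translates into. In the training stage, these two sets of parameters are estimated with the EM algorithm and from a probabilistic dictionary, respectively. We implemented the method and compared it with IBM Model 4 on noun phrase extraction, using a corpus of 740,000 sentences of Hong Kong news. In the experiments, our method achieved a precision of 70% and a recall of 61%. The results show that our method outperforms IBM Model 4 and indicate that it can improve the efficiency and quality of both noun phrase translation extraction and noun phrase translation in machine translation.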
We propose a new method for automatically extracting noun phrase correspondences from a sentence-aligned bilingual corpus. In our approach, noun phrases extracted from each source language sentence are aligned to phrases in the corresponding target language sentence based on a phrase translation model and maximum translation probability. The method involves generating word-level alignments with an existing word alignment technique as the basis of noun phrase alignment, estimating the Lexical Translation Probability (LTP) for noun phrases with the EM algorithm, and estimating the Fertility Probability (FP) from the Most Frequent Translation Equivalent (MFTE). At runtime, for each noun phrase in the source sentence, its partial translation in the target sentence is located. Then, each n-gram containing the partial translation is evaluated using the phrase translation probability, and the n-gram with the maximum translation probability is chosen as the output. We describe an implementation of the method using a bilingual Hong Kong news corpus. The experimental results show that our model outperforms IBM Model 4 in terms of the precision of noun phrase extraction. The method clearly improves the performance of noun phrase translation, which has been shown to be crucial for statistical machine translation.
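The runtime procedure described above (locate the partial translation, enumerate the n-grams that contain it, keep the highest-scoring one) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the function names, the smoothing floor, the toy probabilities, and in particular the scoring formula, which simply adds log lexical translation probabilities and log fertility probabilities. The abstract does not give the exact form of the phrase translation model, so this should not be read as the authors' implementation.

```python
from math import log

FLOOR = 1e-6  # assumed smoothing value for unseen word pairs and fertilities


def candidate_spans(n_target, anchor_start, anchor_end, max_len):
    """Return every (start, end) span of the target sentence that contains
    the anchor span [anchor_start, anchor_end) and has at most max_len words."""
    return [
        (start, end)
        for start in range(anchor_start + 1)
        for end in range(anchor_end, n_target + 1)
        if end - start <= max_len
    ]


def phrase_score(source_np, candidate, ltp, fp):
    """Hypothetical log-score: each target word is linked to the source word
    with the highest lexical translation probability, then each source word
    is charged the fertility probability of its number of linked words."""
    fertility = {s: 0 for s in source_np}
    score = 0.0
    for t in candidate:
        best_s = max(source_np, key=lambda s: ltp.get((s, t), FLOOR))
        score += log(ltp.get((best_s, t), FLOOR))
        fertility[best_s] += 1
    for s, k in fertility.items():
        score += log(fp.get(s, {}).get(k, FLOOR))
    return score


def extract_translation(source_np, target, anchor, ltp, fp, max_len=3):
    """Pick the n-gram containing the anchor with the maximum score."""
    spans = candidate_spans(len(target), anchor[0], anchor[1], max_len)
    start, end = max(
        spans,
        key=lambda se: phrase_score(source_np, target[se[0]:se[1]], ltp, fp),
    )
    return target[start:end]


# Toy data (made up for illustration, not taken from the paper's corpus).
source_np = ["economic", "growth"]
target = ["香港", "經濟", "增長", "放緩"]
anchor = (1, 2)  # word alignment linked "economic" to target[1] = "經濟"
ltp = {("economic", "經濟"): 0.8, ("growth", "增長"): 0.7}
fp = {
    "economic": {0: 0.05, 1: 0.9, 2: 0.05},
    "growth": {0: 0.1, 1: 0.85, 2: 0.05},
}

best = extract_translation(source_np, target, anchor, ltp, fp)
print(best)
```

In this toy case the anchor alone leaves "growth" untranslated, which the fertility term penalizes, while spans that reach "香港" or "放緩" pay the floor probability for an unrelated word. The two-word span covering both source words therefore scores highest.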