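The best-performing statistical machine translation (SMT) systems today are phrase-based (phrase-based SMT, PBSMT), and their core component is a phrase translation table. Most so-called phrases are derived from word alignment results, and the derivation mostly relies on heuristics to find phrase pairs that are "consistent" with the word alignment. The table is therefore strongly affected by the accuracy of the word alignment and by the heuristics used to judge "consistency". Because there is no objective criterion for phrase segmentation, a large number of phrase pairs that are "consistent" with the word alignment yet noisy are produced. Moreover, such phrases are in effect determined jointly by the two languages, so they do not necessarily follow the phrase structure of either language, and some peculiar phrase pairs and phrases may be derived as a result. Such a huge and noisy phrase translation table is very likely to introduce estimation errors when translation probabilities are estimated, and search (decoding) errors during decoding. The huge search space may also slow decoding down. To further improve current PBSMT, it is necessary to optimize the phrase segmentation and phrase alignment models, taking the word alignment results into account together with a non-heuristic phrase segmentation model. Decoding quality and speed may then improve significantly, and with them translation fluency. To this end, this paper proposes an EM algorithm that finds the best phrase segmentation for the source language and the target language separately, each without interference from the other. A phrase alignment model is then applied to the segmented phrases to find the best phrase pairings. In this way, it becomes possible to produce a high-quality phrase translation table and its probabilities without relying on heuristics, by using the word alignment and phrase segmentation results jointly and quantitatively.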
The phrase translation table is the core model component of state-of-the-art phrase-based statistical machine translation (SMT) systems. Most phrases are induced from word alignment results, using heuristics to find phrase pairs that are “consistent” with the word alignment. The phrase translation table is thus affected both by the word alignment accuracy and by the heuristics used to find consistent phrase pairs. Without an objective optimization criterion for phrase segmentation, however, a large number of consistent yet noisy phrase pairs may be generated. Furthermore, the phrases are essentially defined in terms of two languages at once, so they might not respect the phrase structure of either language very well, and some peculiar phrase pairs and phrases might be induced as a result. Such a huge and noisy phrase translation table is likely to introduce estimation errors when the phrase translation probabilities are estimated during training, and search (decoding) errors during decoding. The large search space might also slow the decoding process. To improve the performance of current phrase-based SMT, it is thus necessary to optimize the phrase segmentation and phrase alignment models by jointly considering the word alignment results and a non-heuristic model for phrase segmentation. Doing so might significantly improve the quality and speed of decoding, and thus the translation fluency. In particular, an EM algorithm is proposed to conduct phrase segmentation on the source and target language corpora separately, each independent of the other. A phrase alignment algorithm is then applied to the resulting segmented phrases, with phrase translation probabilities estimated from the word alignment statistics. It thus becomes possible to produce a high-quality phrase translation table and its translation probabilities by using the word alignment and phrase segmentation results jointly and quantitatively, rather than heuristically.
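To make the notion of a phrase pair being “consistent” with a word alignment concrete, the following is a minimal sketch of the commonly used heuristic criterion that the proposed approach aims to replace: a source span and a target span form a consistent pair if at least one alignment link connects them and no link connects a word inside one span to a word outside the other. The function name, the `max_len` limit, and the toy alignment are illustrative assumptions, not taken from this paper, and the usual extension of spans over unaligned boundary words is omitted for brevity.

```python
def extract_consistent_phrase_pairs(src_len, alignment, max_len=3):
    """Enumerate phrase pairs consistent with a word alignment.

    alignment: set of (source_index, target_index) links, 0-based.
    Returns a set of (s1, s2, t1, t2) tuples with inclusive span bounds.
    Illustrative sketch only; max_len is an assumed phrase-length limit.
    """
    pairs = set()
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s1 <= s <= s2]
            if not linked:
                continue  # a pair needs at least one alignment link
            t1, t2 = min(linked), max(linked)
            if t2 - t1 + 1 > max_len:
                continue
            # Reject the pair if a target word inside [t1, t2] is linked
            # to a source word outside [s1, s2].
            if any(t1 <= t <= t2 and not (s1 <= s <= s2)
                   for (s, t) in alignment):
                continue
            pairs.add((s1, s2, t1, t2))
    return pairs


# Toy example (hypothetical): three source words, the last two of which
# are aligned to the target side in swapped order.
toy_alignment = {(0, 0), (1, 2), (2, 1)}
toy_pairs = extract_consistent_phrase_pairs(3, toy_alignment)
```

Even on this three-word toy input the heuristic yields five overlapping phrase pairs, which illustrates how the table grows quickly on real corpora when every consistent span is kept.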