Learning Bilingual Parsing from Parallel Corpus and Monolingual Treebank (利用平行語料庫與單語樹庫之雙語剖析研究)

Advisor: 張俊盛

Abstract


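In this thesis, we propose a new algorithm for learning the Inversion Transduction Grammar (ITG) introduced by Wu (1997) and extend it to parsing bilingual sentences. Our model uses the learned bilingual grammar rules to produce, for a bilingual sentence pair, a parse tree with nested syntactic structure. The tree shows a single syntactic structure together with the word-order relationship between the two languages. Wu's experiments used Bracketing Transduction Grammar (BTG), a simplified version of ITG that does not consider how constituent categories affect the ordering of counterpart structures in the two languages. In contrast, our training phase uses a large-scale parallel corpus and a monolingual treebank to identify the syntactic structures of a language and to model the syntactic relationship between the two languages. Essentially, our method uses the word alignments of the parallel corpus to project the grammar rules of the monolingual treebank onto the other language. During projection, we count how often each rule occurs in straight and in inverted order, and from these counts we estimate the probabilities of the ITG rules. At run time, we use a bottom-up parser to build the most probable bilingual parse tree for a sentence pair.

We implemented the method and trained it on the news portion of the Hong Kong parallel corpus together with the production rules provided by Andrew B. Clegg. We evaluated the model's performance with the evaluation method of Och et al. (2000). The experimental results show that the word alignments produced by our method achieve a better alignment error rate than the state-of-the-art Giza++ system. This indicates that the learned bilingual grammar rules effectively help bilingual parsing and provide a more reasonable reordering penalty. The bilingual parse trees we produce for the parallel corpus can be used to improve the ITG rules and also to help train a decoder for statistical machine translation.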
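The straight/inverted counting step can be illustrated with a minimal sketch. It assumes that each binary treebank rule instance has already been projected through the word alignment, so that both children have a target-side span, and that rule probabilities are relative frequencies conditioned on the left-hand side. The function names, the span representation, and the normalization are assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter

def orientation(left_span, right_span):
    """Classify one binary rule instance by the target-side spans of its children.

    Each span is a (start, end) pair of target-word positions, end exclusive.
    Returns "straight" if the children keep the source order on the target side,
    "inverted" if they swap, and None if the spans overlap (ambiguous case).
    """
    if left_span[1] <= right_span[0]:
        return "straight"
    if right_span[1] <= left_span[0]:
        return "inverted"
    return None

def estimate_itg_probs(instances):
    """Estimate P(right-hand side, orientation | left-hand side) by relative frequency.

    instances: iterable of (rule, left_span, right_span), where rule is a
    (lhs, left_child, right_child) tuple such as ("NP", "JJ", "NN").
    Instances with overlapping spans are skipped (an assumption of this sketch).
    """
    counts = Counter()
    lhs_totals = Counter()
    for rule, left_span, right_span in instances:
        o = orientation(left_span, right_span)
        if o is None:
            continue
        counts[(rule, o)] += 1
        lhs_totals[rule[0]] += 1
    return {key: c / lhs_totals[key[0][0]] for key, c in counts.items()}

# Hypothetical projected instances, for illustration only.
instances = [
    (("NP", "JJ", "NN"), (0, 1), (1, 2)),  # straight
    (("NP", "JJ", "NN"), (0, 1), (1, 2)),  # straight
    (("NP", "JJ", "NN"), (2, 3), (0, 2)),  # inverted
    (("VP", "VB", "NP"), (3, 4), (0, 3)),  # inverted
    (("NP", "JJ", "NN"), (0, 2), (1, 3)),  # overlapping, skipped
]
for key, prob in sorted(estimate_itg_probs(instances).items()):
    print(key, round(prob, 3))
```

Conditioning on the constituent category is what distinguishes this kind of estimate from a BTG, where a single straight/inverted preference is shared by all constituents.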

Parallel Abstract


We present a new method for learning to parse a bilingual sentence pair using an Inversion Transduction Grammar (ITG) trained on a parallel corpus and a monolingual treebank. The method produces a parse tree for a bilingual sentence pair, showing the syntactic structure shared by the individual sentences and the differences in word order within each syntactic structure. The method involves estimating lexical translation probabilities based on an existing word alignment system and inferring the probabilities of ITG rules. At run time, a CYK-style bottom-up parser is employed to construct the most probable bilingual parse tree for any given sentence pair. We also describe an implementation of the proposed method. The experimental results indicate that the proposed model produces better word alignments than Giza++, a state-of-the-art word alignment system, in terms of alignment error rate and F-measure. The bilingual parse trees produced for the parallel corpus can be exploited to refine the initial ITG rules and to train a decoder for statistical machine translation.
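For readers unfamiliar with the two metrics named above, the sketch below computes alignment error rate (AER) and F-measure in the standard way, from a predicted alignment and a gold standard split into sure and possible links. The function name and the link representation (pairs of source and target word indices) are assumptions for illustration, and the thesis may compute F-measure with a different weighting.

```python
def alignment_scores(predicted, sure, possible):
    """Return (precision, recall, f_measure, aer) for one set of alignment links.

    predicted: links proposed by the system, as (source_index, target_index) pairs.
    sure:      gold links annotated as certain.
    possible:  gold links annotated as acceptable; sure links are always
               treated as possible as well.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s
    if not a or not s:
        raise ValueError("predicted and sure link sets must be non-empty")
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    # AER = 1 - (|A & S| + |A & P|) / (|A| + |S|); lower is better.
    aer = 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, f_measure, aer

# Hypothetical links, for illustration only.
predicted = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1)}
possible = {(2, 2)}
print(alignment_scores(predicted, sure, possible))
```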

References


Andrew B. Clegg and Adrian Shepherd. 2005. Evaluating and integrating treebank parsers on a biomedical corpus. In Proceedings of the Association for Computational Linguistics Workshop on Software.
David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the ACL, pages 263-270.
Yuan Ding and Martha Palmer. 2005. Machine translation using probabilistic synchronous dependency insertion grammars. In Proceedings of the 43rd Annual Meeting of the ACL, pages 541-548.
Hua Wu, Haifeng Wang, and Zhanyi Liu. 2005. Alignment model adaptation for domain-specific word alignment. In Proceedings of the 43rd Annual Meeting of the ACL, pages 467-474.
Kristina Toutanova, H. Tolga Ilhan, and Christopher D. Manning. 2002. Extensions to HMM-based statistical word alignment models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Cited by


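許雪琴 (2009). Track and field technical report: the case of athlete 許雪琴 in the women's pole vault at the 2009 National Intercollegiate Athletic Games [Master's thesis, National Taiwan Normal University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0021-1610201315161190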
