Improving Chinese Syntactic Parsing through Structure Probability Re-estimation

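Syntactic parsing is the most important step in understanding natural language, and it is essential to machine translation, question answering, information retrieval, speech recognition, and other natural language processing applications. Given an input sentence and a set of grammar rules, a syntactic parser identifies the part of speech of each word and the grammatical function of each phrase, and produces several ambiguous structures that conform to the grammar. Selecting the best syntactic structure from these ambiguous candidates is difficult, however, and depends on a robust method for estimating structure probabilities. This thesis first proposes a general model, the context-dependent probability re-estimation model (CDM), to address the imprecise structure probabilities of probabilistic context-free grammars (PCFG). The proposed model uses contextual features efficiently and flexibly to obtain more precise structure probabilities and thereby improve parsing performance. Next, to compensate for the general model's weakness on special structures, we propose disambiguation models targeted at such structures, for example structures in which a transitive verb is followed by a noun (Vt-N structures) and conjunctive structures. The main goal is to incorporate features or methods that suit these special structures into the disambiguation model, so as to re-estimate more precise structure probabilities, raise disambiguation accuracy, and integrate effectively into the existing structure probability re-estimation model. The experimental evaluation shows that the proposed re-estimation methods yield better parsing results than a standard PCFG parser and other statistical parsers.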

Keywords

Syntactic Parsing; PCFG; Structural Disambiguation; Grammar Representation

Parallel Abstract

Syntactic parsing is the first major step of natural language understanding. It plays an important role in machine translation, question answering, information retrieval, speech recognition, and other natural language processing applications. Given a sentence and a set of grammar rules, a syntactic parser identifies the parts of speech of the words and then produces several ambiguous structures accepted by the grammar rules. Selecting the best structure from these ambiguous candidates is a challenging task, and the quality of the selection usually depends on the precision of the method used to estimate structure probabilities. In this thesis we first propose a general model, a context-dependent probability re-estimation model, to improve the structure probability estimates produced by probabilistic context-free grammars (PCFG). Compared with using rule probabilities alone, the proposed model can draw on a broader range of contextual features, effectively and flexibly, to better estimate structure probabilities. Second, we propose specific models that resolve specific cases in Chinese parsing by pinpointing features particularly useful for those cases, thereby complementing the general model. The specific cases tested in this thesis are Vt-N structures and conjunctive structures. Experimental evaluation shows that the proposed models outperform the baseline parser and existing state-of-the-art statistical parsers.
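As a rough illustration of the selection problem the abstract describes, the sketch below scores two candidate structures using plain PCFG rule probabilities only and picks the more probable one. The toy grammar, its probability values, the placeholder words, and the function names `tree_log_prob` and `best_tree` are all illustrative assumptions; none of them come from the thesis, and the context-dependent re-estimation model itself is not reproduced here.

```python
import math

# Toy rule probabilities (hypothetical values, not taken from the thesis).
# Keys are (parent label, tuple of child labels).
RULE_PROB = {
    ("VP", ("Vt", "N")): 0.7,  # verb-object reading of a Vt-N sequence
    ("NP", ("Vt", "N")): 0.1,  # modifier-head reading of the same sequence
    ("S", ("VP",)): 0.6,
    ("S", ("NP",)): 0.4,
}


def tree_log_prob(tree, rule_prob=RULE_PROB):
    """Sum the log rule probabilities over all internal nodes of `tree`.

    A tree is a tuple (label, child, child, ...). A preterminal is
    (tag, word) with `word` a plain string; lexical probabilities are
    ignored in this sketch, so preterminals contribute 0.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 0.0
    rhs = tuple(child[0] for child in children)
    log_p = math.log(rule_prob[(label, rhs)])
    return log_p + sum(tree_log_prob(child, rule_prob) for child in children)


def best_tree(candidates, rule_prob=RULE_PROB):
    """Return the candidate structure with the highest PCFG log probability."""
    return max(candidates, key=lambda t: tree_log_prob(t, rule_prob))


# Two ambiguous analyses of the same two-word Vt-N sequence.
verb_object = ("S", ("VP", ("Vt", "verb_word"), ("N", "noun_word")))
modifier_head = ("S", ("NP", ("Vt", "verb_word"), ("N", "noun_word")))

chosen = best_tree([verb_object, modifier_head])
print(chosen[1][0])
```

With rule probabilities alone, the choice is fixed by the grammar regardless of the surrounding words. A re-estimation model in the spirit of the abstract would adjust these scores using contextual features; how the thesis does this is not specified in the abstract.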


