Detection, Disambiguation, and Argument Identification of Chinese Discourse Connectives

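Discourse relations describe how textual units relate to one another logically. Analyzing the discourse structure of a text helps us better understand the meaning of a document. Discourse analysis has therefore been applied in many areas, such as natural language interfaces and large-scale document analysis. Whereas English discourse corpora have long been available to researchers, large-scale Chinese discourse corpora were released only in recent years. Chinese discourse analysis also raises many distinctive issues: Chinese has a larger variety of discourse connectives, parallel connectives composed of several discontinuous words are common, and Chinese sentence structure is more complex, all of which make it harder to identify discourse structure correctly. Discourse connectives are important clues for identifying discourse relations in Chinese text, but the ambiguity of the connectives themselves makes recognizing them a challenge. In this thesis, we study four tasks concerning the explicit discourse relations signaled by discourse connectives. First, we address connective recognition, finding candidate discourse connectives in a text. Second, we examine the linking relations among the component words of connectives. Third, we disambiguate the discourse relation of each connective. Fourth, we identify the arguments of each connective. We propose a range of features to train classifiers based on Logistic Regression that identify true discourse connectives and classify the type of their discourse relations. In addition, we rank the candidate connectives and use a greedy algorithm to resolve ambiguities in how connective components link. Finally, we treat argument identification as a sequence labeling problem and use Conditional Random Fields to find the argument boundaries. Beyond explicit discourse relations, implicit discourse relations also require further study, so that a complete Chinese discourse parser can eventually be built on these components.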
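The ranking-plus-greedy step for linking ambiguities can be illustrated with a minimal sketch. The abstract does not specify how candidates are scored or what counts as a conflict, so the following assumes that each candidate carries a classifier score and that two candidates conflict when they share a token position. The `Candidate` type, the `greedy_select` function, the example sentence positions, and the scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    """A connective candidate (hypothetical structure, not from the thesis)."""
    components: Tuple[str, ...]  # component words, e.g. ("雖然", "但是")
    positions: Tuple[int, ...]   # token index of each component
    score: float                 # assumed confidence from a ranker


def greedy_select(candidates: List[Candidate]) -> List[Candidate]:
    """Accept candidates in descending score order, skipping any candidate
    that reuses a token position already claimed by an accepted one."""
    used = set()
    chosen = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if used.isdisjoint(cand.positions):
            chosen.append(cand)
            used.update(cand.positions)
    # Return the accepted candidates in textual order.
    return sorted(chosen, key=lambda c: c.positions)


# Illustrative input: a paired connective competes with its single-word readings.
candidates = [
    Candidate(("雖然", "但是"), (0, 5), 0.9),
    Candidate(("但是",), (5,), 0.6),
    Candidate(("雖然",), (0,), 0.4),
    Candidate(("因為",), (8,), 0.7),
]
selected = greedy_select(candidates)
```

In this example the paired reading wins both of its token positions, so the two single-word readings are discarded, while the non-conflicting candidate at position 8 is kept.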

Keywords

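Natural Language Processing; Chinese Discourse Analysis; Discourse Connective Recognition; Discourse Relation Disambiguation; Argument Identification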

Parallel Abstract

Discourse relations represent how textual units logically connect with each other. Analyzing the discourse structure of a text can aid the understanding of the meaning behind its paragraphs. There are many potential applications, such as natural language interfaces and large-scale content analysis. Although popular English discourse corpora have long been available to researchers, large-scale Chinese discourse corpora were not available until recently. In addition, Chinese discourse analysis has many unique issues, including the variety of discourse connectives, the common occurrence of parallel connectives, and complex sentence structures. Discourse connectives are important clues for identifying discourse relations in Chinese texts. However, their ambiguity makes it a challenge to extract true connectives. In this thesis, we investigate four tasks regarding explicit discourse relations that are signaled by discourse connectives. First, we deal with the extraction of explicit discourse connectives. Second, we investigate resolving linking ambiguities among connective components. Third, we disambiguate the discourse relation type for each connective. Fourth, we extract the arguments for each discourse connective. Several features are proposed to train Logistic Regression classifiers that distinguish discourse from non-discourse usages and determine the relation types of connectives. Additionally, we rank each connective candidate and develop a greedy algorithm to resolve linking ambiguities. Finally, argument identification is formulated as a sequence labeling problem, and Conditional Random Fields are used to determine the argument boundaries. Besides explicit discourse relations, further investigation is needed to recognize implicit relations. Built upon these components, an end-to-end discourse parser for Chinese may be constructed in future studies.
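To make the sequence labeling formulation concrete, the sketch below shows one common way to turn argument spans into per-token tags and back. The abstract does not state which tag scheme is used, so the BIO scheme, the `ARG1`/`ARG2` label names, the helper functions, and the example sentence are assumptions for illustration only. A sequence model such as a CRF would be trained to predict tags of this kind, and the argument boundaries would then be read off the predicted tags.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def encode_bio(n_tokens: int, spans: Dict[str, Span]) -> List[str]:
    """Convert named argument spans into one BIO tag per token."""
    labels = ["O"] * n_tokens
    for name, (start, end) in spans.items():
        for i in range(start, end):
            labels[i] = ("B-" if i == start else "I-") + name
    return labels


def decode_bio(labels: List[str]) -> Dict[str, Span]:
    """Recover argument spans from a BIO tag sequence.
    Stray I- tags that do not continue an open span are ignored."""
    spans: Dict[str, Span] = {}
    name, start = None, 0
    # A trailing "O" sentinel closes any span that runs to the last token.
    for i, label in enumerate(labels + ["O"]):
        if name is not None and label != "I-" + name:
            spans[name] = (start, i)
            name = None
        if label.startswith("B-"):
            name, start = label[2:], i
    return spans


# Illustrative sentence: "因為 下雨 ， 所以 比賽 取消"
tokens = ["因為", "下雨", "，", "所以", "比賽", "取消"]
gold_spans = {"ARG1": (1, 2), "ARG2": (4, 6)}
tags = encode_bio(len(tokens), gold_spans)
recovered = decode_bio(tags)
```

Here the connective tokens and the punctuation receive the tag `O`, and decoding the tag sequence returns the original argument boundaries.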

Parallel Keywords

Natural Language Processing; Chinese Discourse Analysis; Discourse Connective Recognition; Discourse Relation Disambiguation; Discourse Connective Argument Identification
