Chinese Discourse Marker Interpretation and Discourse Relation Recognition, with Application to Opinion Polarity Analysis

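A discourse relation is the rhetorical relation between discourse units (such as clauses, sentences, or groups of sentences). Common discourse relations include Temporal, Contingency, Comparison, and Expansion. Discourse relations express the logic by which sentences connect, and they affect how meaning is conveyed and interpreted. Automatic detection of discourse relations by computer is an emerging research area. With the release of corpora such as the Rhetorical Structure Theory Discourse Treebank (RST-DT) and the Penn Discourse Treebank (PDTB), English discourse relation analysis has produced results that have been applied to automatic summarization, opinion analysis, textual entailment, and event recognition. For Chinese, by contrast, the lack of corpus resources and the complexity of the language itself make discourse relation research more challenging. This dissertation gives a comprehensive study of Chinese discourse relation recognition, Chinese discourse markers, and the association between discourse relations and opinion polarity. We develop a learning model that recognizes discourse relations at two levels, intra-sentential and inter-sentential, and we also address discourse parsing. Discourse parsing resolves the hierarchy and scope of discourse units into a tree structure, which uncovers more information from complex sentences. Long Chinese sentences in particular, those with more than three or four clauses, are hard to interpret as a whole without discourse structure information. For this problem, we develop a preliminary statistical learning model that performs intra-sentential discourse parsing on Chinese sentences. In our experiments on discourse relation recognition and parsing, we found that discourse markers (connectives and other words that carry discourse information, such as 「因為」 "because" and 「但是」 "but") are important clues for discourse relation recognition. In Chinese, however, discourse markers are often ambiguous, with a single marker signaling more than one relation, and this ambiguity degrades the performance of recognition models. We use very large-scale data together with semi-supervised machine learning to explore this ambiguity and to estimate, for each discourse marker, its distribution over the four major classes of discourse relations. The distributions learned from data, when used as features for discourse relation recognition, work better than an expert-built dictionary. We also examine the association between discourse relations and opinion polarity. In a Comparison relation, for example, the two discourse units often carry opposing polarities, and the relation is more often used to present negative opinions. In contrast, content stated through Temporal and Expansion relations tends to be neutral and less often expresses emotion. Because discourse relations and opinion polarity are closely associated, the output of discourse relation recognition can serve as a clue for opinion analysis. In this dissertation, the discourse relations we handle are the four most basic types: Temporal, Contingency, Comparison, and Expansion. In the future, we hope to investigate finer-grained discourse relations and to go on to discourse parsing at different levels: within sentences, between sentences, and across groups of sentences.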

Keywords

Natural Language Processing; Chinese Discourse Analysis; Discourse Relation Recognition; Discourse Marker; Opinion Polarity

Parallel Abstract

A discourse relation is the rhetorical relation between two discourse units (e.g., clauses, sentences, or groups of sentences). Common discourse relations include Temporal, Contingency, Comparison, and Expansion. A discourse relation indicates how its two discourse units cohere, and this information influences the meaning of the text. Discourse relations are an important clue for many applications, such as summarization, opinion mining, textual entailment, and event recognition. Research on automatic English discourse relation recognition has grown rapidly in recent years, owing to the release of corpora such as the Rhetorical Structure Theory Discourse Treebank (RST-DT) and the Penn Discourse Treebank (PDTB). Chinese discourse relation recognition is more challenging than its English counterpart because of the lack of resources and issues specific to Chinese. In this dissertation, we present an in-depth study of Chinese discourse relation analysis. We propose a statistical algorithm that recognizes discourse relations at both the inter-sentential and intra-sentential levels. We also report preliminary results on Chinese discourse parsing at the sentence level. In Chinese, many long sentences contain more than two clauses and form complex discourse structures. Discourse parsing recovers the hierarchical structure of, and the relations among, the clauses in a given sentence. Discourse markers are key clues for discourse processing, but Chinese discourse markers are inherently ambiguous. To interpret ambiguous Chinese discourse markers, we propose a semi-supervised framework that estimates the distribution of each Chinese discourse marker over the discourse relation types from a large corpus, ClueWeb09. Using the estimated distributions improves the performance of Chinese discourse relation recognition. Discourse relations and sentiment polarities interact in text, and we investigate their correlation using ClueWeb09. A moderate-sized, human-annotated dataset is analyzed and compared with a much larger dataset labeled heuristically by machine. The results support an association between sentiment and discourse. In this dissertation, we focus on four-way discourse relation classification. In future work, we will investigate finer-grained classification of discourse relations and tackle Chinese discourse parsing at the paragraph and document levels.

Parallel Keywords

Natural Language Processing; Chinese Discourse Analysis; Discourse Relation Recognition; Discourse Marker; Sentiment Polarity
