A Study on the Interpretation of Chinese Discourse Markers and Sentence-Level Discourse Relation Recognition

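Discourse relation recognition aims to predict the most suitable discourse relation between any two discourse units. It strongly affects the semantic interpretation of a whole text and is an important topic in natural language processing research. Compared with English, Chinese discourse markers are relatively more ambiguous because of the particular characteristics of the language, which leads to a gap in performance when recognizing discourse relations. To improve discourse relation recognition effectively, it is important to define the meanings of discourse markers properly. Because Chinese does not yet have a large-scale, thoroughly annotated discourse corpus comparable to PDTB or RST-DT for English, this study extracts 7,601 sentences from the ClueWeb09 corpus, asks annotators to label each with the most suitable discourse relation, and then uses this small dataset to build a semi-supervised learning model. With the aid of parameter estimation, the model not only improves discourse relation recognition but also yields the probability distribution of each discourse marker over the four major discourse relations defined in PDTB. Experimental results show that the best-performing configuration reaches an average F-score of 73.22%, a statistically significant improvement over the 69.76% of the baseline model used in the experiments. The semi-supervised classifier is then applied to a larger unlabeled dataset of 302,293 sentences in order to obtain probability distributions that cover more discourse markers. Validated with two similarity measures, the resulting statistics perform well. Finally, the statistics and a simple classification method are used to define the forward/backward combination relations of discourse markers, with the aim of reducing ambiguity further.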

Keywords

discourse markers; discourse relations; marker ambiguity; semi-supervised learning model; marker combination

Parallel Abstract

Not all Chinese discourse markers have a unique interpretation, which becomes a challenge when they are used for discourse relation recognition. In this thesis, we propose a semi-supervised method to learn the interpretations of Chinese discourse markers and apply the results to discourse relation labeling. A total of 7,601 sentences, each composed of two clauses connected by a single discourse marker, are sampled from ClueWeb09 and manually annotated with discourse relations. We train an SVM discourse relation classifier on this dataset and boost it with parameter estimation. Our experimental results show that the proposed approach achieves an F-score of 73.22%. The discourse relation recognition system is then used to annotate 302,293 unlabeled sentences, and the degrees of ambiguity of discourse markers and the backward/forward combination problems are analyzed.
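As a rough illustration of what a per-marker probability distribution over the four top-level PDTB relations might look like, the sketch below estimates P(relation | marker) by relative frequency from labeled (marker, relation) pairs. The function names, the toy labels, and the use of Shannon entropy as an ambiguity score are assumptions made for this example only; the thesis may define and estimate these quantities differently.

```python
import math
from collections import Counter, defaultdict

# The four top-level relation classes defined in PDTB.
RELATIONS = ("Temporal", "Contingency", "Comparison", "Expansion")


def marker_distributions(labeled_pairs):
    """Estimate P(relation | marker) by relative frequency (hypothetical helper)."""
    counts = defaultdict(Counter)
    for marker, relation in labeled_pairs:
        counts[marker][relation] += 1
    dists = {}
    for marker, ctr in counts.items():
        total = sum(ctr.values())
        dists[marker] = {r: ctr[r] / total for r in RELATIONS}
    return dists


def entropy(dist):
    """Shannon entropy in bits; used here as an assumed ambiguity score."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical toy labels, not data from the thesis.
toy_pairs = [
    ("而", "Comparison"), ("而", "Expansion"),
    ("而", "Comparison"), ("而", "Expansion"),
    ("因為", "Contingency"), ("因為", "Contingency"),
]

dists = marker_distributions(toy_pairs)
for marker, dist in dists.items():
    print(marker, dist, round(entropy(dist), 3))
```

Under this assumed score, a marker whose probability mass falls entirely on one relation gets zero entropy, while a marker split evenly between two relations gets one bit, so higher values indicate more ambiguous markers.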

Parallel Keywords

discourse markers; discourse relation labeling; semi-supervised learning; interpretation of ambiguous markers; marker combination

