A Preliminary Study on Automatic Classification of Rhetorical Structure in Chinese Text

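This study uses discourse analysis to build natural discourse-level rhetorical structures so that a machine can automatically identify the complex-sentence relations that hold between sentences. Doing so requires considering the rhetorical structure of the whole text, that is, the semantic shifts between sentences, and constructing a rhetorical structure tree with a passage as the processing unit. We first draw on Rhetorical Structure Theory (RST) and on the sentence-structure classification of modern Chinese grammar to consolidate a set of complex-sentence relations that suit Chinese at the discourse and semantic levels, and we annotate rhetorical structure trees on the text of the National Taipei University of Technology Chinese audiobook corpus (NTUT-AB-01). After annotation, we extract features from the annotated data: sentence length in words (Length), punctuation marks (Punctuation), connectives (Connective), and shared words (Shared Word). A Multi-Layer Perceptron (MLP) is then trained on these features to learn the Chinese RST classification, so that the machine can classify complex-sentence relations and identify the correct one automatically, with the final goal of automating the detection of discourse rhetorical structure. Experimental results show that, with all features combined, the error rate is 28.16% for 15 classes and 24.08% for 4 classes; in the annotation statistics, inter-annotator agreement is 54.99%. Overall, the error rate leaves considerable room for improvement.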
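The four feature types named above can be illustrated with a minimal sketch for one pair of adjacent clauses. This is not the thesis's implementation: the function name `pair_features`, the tiny connective list, the punctuation set, and the assumption that each clause arrives already word-segmented with its closing punctuation mark as the last token are all hypothetical choices made for illustration.

```python
# Hypothetical connective inventory, for illustration only; the inventory
# actually used in the study is not given in this abstract.
CONNECTIVES = {"因為", "所以", "但是", "雖然", "而且", "如果"}

# Assumed set of clause-final punctuation marks.
PUNCTUATION = {"，", "。", "；", "？", "！", "："}


def split_clause(tokens):
    """Separate a clause's words from its closing punctuation mark, if any."""
    if tokens and tokens[-1] in PUNCTUATION:
        return tokens[:-1], tokens[-1]
    return list(tokens), ""


def pair_features(left, right):
    """Collect Length, Punctuation, Connective and Shared Word features
    for two adjacent, pre-segmented clauses (lists of tokens)."""
    left_words, left_punct = split_clause(left)
    right_words, right_punct = split_clause(right)
    return {
        # Length: number of words in each clause, punctuation excluded.
        "length": (len(left_words), len(right_words)),
        # Punctuation: the mark that closes each clause.
        "punctuation": (left_punct, right_punct),
        # Connective: connectives found in each clause.
        "connectives": (
            [w for w in left_words if w in CONNECTIVES],
            [w for w in right_words if w in CONNECTIVES],
        ),
        # Shared Word: non-connective words that occur in both clauses.
        "shared_words": sorted((set(left_words) & set(right_words)) - CONNECTIVES),
    }


left = ["因為", "天氣", "很", "冷", "，"]
right = ["所以", "他", "覺得", "天氣", "不好", "。"]
features = pair_features(left, right)
```

In a full pipeline these values would still need to be encoded numerically, for example as one-hot vectors for punctuation and connectives, before being passed to an MLP classifier; that encoding step is outside this sketch.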

Keywords

RST; modern Chinese grammar; rhetorical structure tree

Parallel Abstract

The aim of this thesis is to build an automatic discourse structure labeling system for Chinese. To this end, the text of NTUT's audiobook corpus volume I (NTUT-AB-01) was first labeled by hand according to a modified Rhetorical Structure Theory (RST). Linguistic features of each pair of neighboring sentences were then extracted to train a Multi-Layer Perceptron (MLP) classifier: (1) connectives, (2) shared words, (3) word and (4) part-of-speech (POS) subspace projections, (5) sentence lengths, and (6) punctuation marks. Experimental results show that the MLP classifier achieved error rates of 28.16% and 24.08% for the 15 detailed and 4 coarse classes, respectively.

Parallel Keywords

Rhetorical Structure Theory; syntax of modern Chinese; Rhetorical Structure Tree
