  • Degree thesis

中文文章修辭架構自動分類初步研究

A Preliminary Research on Automatic Chinese Rhetoric Structure Identification

Advisor: Yuan-Fu Liao (廖元甫)

Abstract


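This study uses discourse analysis to build natural rhetorical structures for text, so that a machine can automatically identify the complex-sentence relations that hold between clauses. Doing so requires considering the rhetorical structure of the whole text, that is, the semantic shifts between sentences; we therefore take one passage as the processing unit and construct a rhetorical structure tree for it. We first drew on Rhetorical Structure Theory (RST) and on the sentence-structure classification of modern Chinese grammar to compile a set of complex-sentence relations better suited to Chinese discourse and semantics, and used the text of the National Taipei University of Technology Chinese audiobook corpus (NTUT-AB-01) to annotate rhetorical structure trees. After annotation, we extracted the following features: sentence word count (Length), punctuation marks (Punctuation), connectives (Connective), and shared words (Shared Word). A Multi-Layer Perceptron (MLP) was then trained on these features to learn the Chinese RST classification, so that the machine can automatically assign each pair of clauses to the correct complex-sentence relation and, ultimately, automate the detection of discourse rhetorical structure. Experimental results show that, with all features combined, the error rate is 28.16% for the 15-class task and 24.08% for the 4-class task; in the annotation statistics, consistency is 54.99%. Overall, the error rate leaves considerable room for improvement.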
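The feature extraction described above can be illustrated with a minimal sketch. It is not the thesis implementation: the function name `pair_features`, the short connective list, the punctuation set, and the example clauses are all assumptions made for illustration, and the input is assumed to be already word-segmented. The resulting feature dictionary could then be vectorized and passed to any MLP classifier.

```python
from typing import Dict, List, Tuple

# Hypothetical, very small connective list; the thesis's actual list is not given here.
CONNECTIVES = ["因為", "所以", "但是", "雖然", "如果", "而且"]
# Clause-final punctuation marks considered in this sketch (an assumption).
PUNCTUATION = ["，", "。", "；", "？", "！"]


def split_clause(tokens: List[str]) -> Tuple[List[str], str]:
    """Separate a trailing punctuation mark from the words of a clause."""
    if tokens and tokens[-1] in PUNCTUATION:
        return tokens[:-1], tokens[-1]
    return tokens, ""


def pair_features(left: List[str], right: List[str]) -> Dict[str, float]:
    """Build Length, Punctuation, Connective and Shared Word features
    for two adjacent, word-segmented clauses."""
    left_words, left_punct = split_clause(left)
    right_words, right_punct = split_clause(right)

    feats: Dict[str, float] = {
        # Length: number of words in each clause.
        "len_left": float(len(left_words)),
        "len_right": float(len(right_words)),
        # Shared Word: number of distinct words that occur in both clauses.
        "shared_words": float(len(set(left_words) & set(right_words))),
    }
    # Punctuation: one-hot encoding of the mark that closes each clause.
    for mark in PUNCTUATION:
        feats[f"left_ends_{mark}"] = 1.0 if left_punct == mark else 0.0
        feats[f"right_ends_{mark}"] = 1.0 if right_punct == mark else 0.0
    # Connective: binary indicator for each connective in each clause.
    for conn in CONNECTIVES:
        feats[f"left_has_{conn}"] = 1.0 if conn in left_words else 0.0
        feats[f"right_has_{conn}"] = 1.0 if conn in right_words else 0.0
    return feats


# Example pair: "因為下雨，" / "所以我們取消比賽。"
left = ["因為", "下雨", "，"]
right = ["所以", "我們", "取消", "比賽", "。"]
feats = pair_features(left, right)
```

Keeping the left and right clause features separate lets a classifier distinguish, for example, a connective that opens the first clause from one that opens the second.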

Parallel Abstract


The aim of this thesis is to build an automatic Chinese discourse structure labeling system. To this end, the text of NTUT’s audiobook corpus volume I (NTUT-AB-01) was first labelled by hand according to a modified Rhetorical Structure Theory (RST). Then, several linguistic features of each pair of neighboring sentences were extracted to train a Multi-Layer Perceptron (MLP) classifier: (1) connectives, (2) shared words, (3) word and (4) part-of-speech (POS) subspace projections, (5) sentence lengths, and (6) punctuation marks. Experimental results show that the MLP classifier achieved error rates of 28.16% and 24.08% for the 15 fine-grained and 4 coarse classes, respectively.

