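The goal of this study is to use discourse analysis to build a natural rhetorical structure for discourse, so that a machine can automatically identify the complex-sentence relations between sentences. Achieving this requires considering the rhetorical structure of the whole text, that is, the semantic shifts among its sentences; a passage is taken as the processing unit, and a rhetorical structure tree is built for it. First, drawing on Rhetorical Structure Theory (RST) and on the classification of sentence structures in modern Chinese grammar, we consolidated a set of complex-sentence relations that suit Chinese in terms of discourse and semantics, and annotated rhetorical structure trees on the text of the National Taipei University of Technology Chinese audiobook corpus (NTUT-AB-01). After annotation, feature parameters were extracted statistically, including sentence length in words (Length), punctuation marks (Punctuation), connectives (Connective), and shared words (Shared Word). A Multi-Layer Perceptron (MLP) was then trained on the Chinese RST classification, so that the machine learns to separate and identify the correct complex-sentence relations automatically, ultimately automating the detection of discourse rhetorical structure. Experimental results show that, with all features combined, the error rate is 28.16% for the 15-class task and 24.08% for the 4-class task; in the annotation statistics, the agreement is 54.99%. Overall, the error rate leaves considerable room for improvement.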
The aim of this thesis is to build an automatic discourse structure labeling system for Chinese. To this end, the text of NTUT’s audiobook corpus, volume I (NTUT-AB-01), was first labeled by hand according to a modified Rhetorical Structure Theory (RST). Several linguistic features of each pair of neighboring sentences were then extracted to train a Multi-Layer Perceptron (MLP) classifier: (1) connectives, (2) shared words, (3) word and (4) part-of-speech (POS) subspace projections, (5) sentence lengths, and (6) punctuation marks. Experimental results show that the MLP classifier achieved error rates of 28.16% and 24.08% for the 15 fine-grained and 4 coarse classes, respectively.
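To make the surface features concrete, the following is a minimal sketch of how length, punctuation, connective, and shared-word features might be computed for one pair of neighboring sentences. It is not the thesis's implementation: the function name `extract_pair_features`, the `PairFeatures` record, and the small `CONNECTIVES` and `PUNCTUATION` sets are hypothetical, chosen only for illustration, and the input is assumed to be already word-segmented. The subspace-projection features are omitted.

```python
from dataclasses import dataclass
from typing import List

# Illustrative sets only (assumptions); the thesis's actual connective
# inventory and punctuation set are not given in the abstract.
CONNECTIVES = {"因為", "所以", "雖然", "但是", "如果", "而且"}
PUNCTUATION = {",", "。", ";", "?", "!", ":"}


@dataclass
class PairFeatures:
    len_first: int            # number of words in the first sentence
    len_second: int           # number of words in the second sentence
    end_punct_first: str      # final punctuation mark of the first sentence
    end_punct_second: str     # final punctuation mark of the second sentence
    connectives: List[str]    # connectives found in either sentence, in order
    shared_words: List[str]   # non-connective words occurring in both


def _end_punct(tokens: List[str]) -> str:
    """Return the trailing punctuation token, or '' if there is none."""
    return tokens[-1] if tokens and tokens[-1] in PUNCTUATION else ""


def extract_pair_features(first: List[str], second: List[str]) -> PairFeatures:
    """Compute surface features for two pre-segmented neighboring sentences."""
    words_first = [t for t in first if t not in PUNCTUATION]
    words_second = [t for t in second if t not in PUNCTUATION]
    connectives = [t for t in words_first + words_second if t in CONNECTIVES]
    shared = sorted((set(words_first) & set(words_second)) - CONNECTIVES)
    return PairFeatures(
        len_first=len(words_first),
        len_second=len(words_second),
        end_punct_first=_end_punct(first),
        end_punct_second=_end_punct(second),
        connectives=connectives,
        shared_words=shared,
    )


if __name__ == "__main__":
    first = ["雖然", "他", "很", "累", ","]
    second = ["但是", "他", "還是", "去", "了", "。"]
    print(extract_pair_features(first, second))
```

In a full pipeline, a record like this would still need to be encoded as a numeric vector (for example, one-hot codes for the punctuation marks and connectives) before it could be passed to an MLP classifier.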