
T3 Taiwanese Treebank and Brill Parser (T3台語剖析樹語料庫與Brill剖析器)

Advisor: 江永進

Abstract


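The T3 corpus is designed as a treebank for the three major Sinitic languages of Taiwan: Taiwanese, Hakka, and Mandarin. It is built on the example sentences and phrases of 「現代漢語八百詞」, which are first translated into parallel Taiwanese and Hakka sentences or phrases. Each sentence is then word-segmented, part-of-speech tagged, and parsed, and every constituent produced by the parse is annotated with its structure and phrase type. The source dictionary focuses on Mandarin function words, which are rich in grammatical structure and are in any case a primary target of syntactic analysis. Although the Windows program “T3bracket” helps, all of the work still has to be done manually by several people; even so, inconsistencies remain in the corpus, in translation, segmentation, and parsing alike. The Taiwanese part is now roughly complete. Using part of the T3 corpus, we also report results obtained with the Brill parser (Brill 1993). A Brill parser developed on part of the T3 corpus reaches an accuracy of 87.8% on the inside test and 89% on the outside test.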

Parallel Abstract


The T3 corpus is a treebank consisting of parallel sentences in the three major languages of Taiwan: Taiwanese, Hakka, and Mandarin. The sentences are example sentences or phrases from “現代漢語八百詞” that were translated into Taiwanese and Hakka by native speakers. The translated sentences or phrases are then segmented, part-of-speech tagged, syntactically bracketed, and further annotated with the structure type of every immediate constituent. All of the work is done manually with the help of “T3bracket”, a Windows program designed specifically for this task. Despite studying various materials before embarking on this task, we still faced many difficulties in every phase: translation, segmentation, and bracketing. Discussions were therefore held regularly over a period of two years. The Taiwanese part of the T3 treebank is almost finished, although a second check is still required. The transformation-based error-driven parsing of Brill (1993) is applied to part of the T3 treebank. The resulting parser has a non-crossing bracketing accuracy of 87.8% on the inside test and 89% on the outside test.
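The abstract does not spell out how the non-crossing bracketing accuracy is scored, so the following is only a minimal sketch of one common reading of the measure: the share of brackets proposed by the parser that do not cross any bracket in the manually annotated tree. The span representation (half-open word-index intervals), the function names, and the five-word example are assumptions made for illustration and are not taken from the thesis.

```python
from typing import Iterable, Set, Tuple

# Assumption: a bracket is a half-open word-index interval [start, end).
Span = Tuple[int, int]


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def non_crossing_accuracy(predicted: Iterable[Span], gold: Iterable[Span]) -> float:
    """Fraction of predicted brackets that cross no gold bracket."""
    gold_spans: Set[Span] = set(gold)
    predicted_spans = list(predicted)
    if not predicted_spans:
        return 1.0  # nothing predicted, so nothing can cross
    ok = sum(
        1
        for p in predicted_spans
        if not any(crosses(p, g) for g in gold_spans)
    )
    return ok / len(predicted_spans)


# Hypothetical five-word sentence with word indices 0..4.
gold = [(0, 5), (0, 2), (2, 5), (3, 5)]
predicted = [(0, 5), (1, 3), (3, 5), (0, 3)]
score = non_crossing_accuracy(predicted, gold)
print(score)
```

In this made-up example, two of the four predicted brackets, (1, 3) and (0, 3), cross a gold bracket, so the score is 0.5. Note that the measure does not penalize a parser for proposing brackets that are merely absent from the gold tree, only for brackets that contradict it.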

Parallel Keywords

No data

References


周思源 (2006). “T3台語剖析樹語料庫與Brill詞類標記” [T3 Taiwanese Treebank and Brill Part-of-Speech Tagging]. Master's thesis, Tsing Hua University (清華大學).
Defense Advanced Research Project Agency (DARPA). San Mateo, CA: Morgan Kaufmann Publishers, Inc., 112-16.
Brill, Eric. (1993a). “Automatic grammar induction and parsing free text: a
ACL held in Columbus,
