Thesis title (English): T3 Taiwanese Treebank and Brill Parser
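The T3 corpus is designed as a treebank for the three major Sinitic languages of Taiwan: Taiwanese, Hakka, and Mandarin. It is based on the example sentences and phrases in 《現代漢語八百詞》, which are first translated into parallel Taiwanese and Hakka sentences or phrases; each sentence is then word-segmented, part-of-speech tagged, and parsed, and each constituent in the parse is annotated with its structure and phrase type. That dictionary deals mainly with Mandarin function words, which are rich in grammatical structure and have always been a primary target of syntactic analysis. Although the Windows program "T3bracket" assists with the work, everything still has to be done manually by several people; even so, inconsistencies remain in the corpus, in translation, segmentation, and parsing alike. The Taiwanese part is now roughly complete. Using part of the T3 data, we also report on the performance of the Brill parser (Brill 1993). A Brill parser developed on part of the T3 corpus reaches an accuracy of 87.8% on the inside test and 89% on the outside test.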
The T3 corpus is a treebank consisting of parallel sentences in the three major languages of Taiwan: Taiwanese, Hakka, and Mandarin. The sentences are originally example sentences or phrases from “現代漢語八百詞”, translated into Taiwanese and Hakka by native speakers. The translated sentences or phrases are then segmented, part-of-speech tagged, syntactically bracketed, and further annotated with the structure type of every immediate constituent. All of the work is done manually with the help of “T3bracket”, a Windows program designed specifically for this task. Despite studying various materials before embarking on this task, we still faced many difficulties in every phase: translation, segmentation, and bracketing. Discussions were therefore held regularly over a period of two years. The Taiwanese part of the T3 treebank is almost finished; a double check is still required. The transformation-based error-driven parsing of Brill (1993) is applied to part of the T3 treebank. The resulting parser achieves a non-crossing bracketing accuracy of 87.8% on the inside test and 89% on the outside test.
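The abstract reports non-crossing bracketing accuracy without saying how it was scored. The sketch below shows one common reading of that metric: the share of predicted constituents that do not cross any constituent in the reference bracketing. The span representation (word indices, end exclusive) and the function names `crosses` and `non_crossing_accuracy` are illustrative assumptions, not taken from the thesis.

```python
from typing import List, Tuple

# Assumption: a constituent is a (start, end) pair of word indices, end exclusive.
Span = Tuple[int, int]


def crosses(a: Span, b: Span) -> bool:
    """Return True when a and b overlap and neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def non_crossing_accuracy(predicted: List[Span], gold: List[Span]) -> float:
    """Fraction of predicted spans that cross no gold span."""
    if not predicted:
        return 1.0  # convention chosen here: nothing predicted, nothing crossed
    ok = sum(1 for p in predicted if not any(crosses(p, g) for g in gold))
    return ok / len(predicted)


# Hypothetical five-word sentence: gold is ((w0 w1) (w2 (w3 w4))).
gold = [(0, 5), (0, 2), (2, 5), (3, 5)]
predicted = [(0, 5), (1, 3), (3, 5)]  # (1, 3) crosses gold (0, 2) and (2, 5)
accuracy = non_crossing_accuracy(predicted, gold)
```

Here two of the three predicted spans are compatible with the gold bracketing, so `accuracy` is 2/3. A predicted span nested inside a gold span, or sharing only a boundary with it, does not count as crossing.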
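Chapter 1 Introduction
Chapter 2 The T3 Treebank
2.1 The T3 corpus
2.2 Basic statistics for part of the T3 corpus
Chapter 3 Introduction to Taiwanese phrase structures
3.1 Common phrase structures
3.2 Other phrase structures
3.3 Examples of phrase-structure classification
Chapter 4 Transformation-based error-driven parsing
4.1 The initial parser
4.2 Computing the error rate
4.3 Learning Brill rules
4.4 Details of applying the structural transformation rule templates
Chapter 5 Brill parsing experiments
5.1 Inside test of the Brill parser
5.2 Outside test of the Brill parser
5.3 Summary
Chapter 6 Conclusion
References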
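朱德熙 (1982). 《語法講義》. Beijing: The Commercial Press (商務印書館).
朱德熙 (1984). 《語法答問》. Beijing: The Commercial Press (商務印書館).
江永進 (2005). 《台語拼音課程》. Pingtung: 安可出版社.
呂叔湘 (1980). 《現代漢語八百詞》. Beijing: The Commercial Press (商務印書館).
周思源 (2006). “T3台語剖析樹語料庫與Brill詞類標記”. Master's thesis, Tsing Hua University.
洪俊詠 (2005). “馬可夫語言模型應用di台語變調gah注音”. Master's thesis, Tsing Hua University.
陸儉明 (2003). “對‘NP+的+VP’結構的重新認識”. Peking University.
劉亦真 (2005). “建立T3剖析樹語料庫:台語部分”. Master's thesis, Tsing Hua University.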
Brill, Eric (1992). “A simple rule-based part of speech tagger”. In Proceedings of the Speech and Natural Language Workshop held in Harriman, N.Y., February 1992, by the Defense Advanced Research Projects Agency (DARPA). San Mateo, CA: Morgan Kaufmann Publishers, Inc., 112-16.
Brill, Eric (1993a). “Automatic grammar induction and parsing free text: a transformation-based approach”. In Proceedings of the thirty-first annual meeting of the ACL held in Columbus,
Brill, Eric (1993b). “A corpus-based approach to language learning”. Ph.D. diss., University of Pennsylvania.
Jurafsky, Daniel and James H. Martin (2000). “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”. Prentice Hall.
Jelinek, Frederick (1997). “Statistical methods for speech recognition”. Cambridge, Mass.: MIT Press.