  • Thesis

T3台語剖析樹語料庫與Brill詞類標記

T3 Taiwanese Treebank and Brill Part-of-Speech Tagger

Advisor: 江永進

Abstract


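Given a sequence of words, assigning each word a part of speech, that is, pairing the word sequence with an appropriate sequence of part-of-speech tags, is called part-of-speech tagging. It is a fundamental problem in natural language processing. In this thesis, we apply Brill's tagging method to a portion of the Taiwanese part of the T3 treebank corpus. Brill tagging has two stages: transformation rules are first learned in training, and they are then applied to obtain a tag sequence. Brill tagging is an error-driven learning procedure whose output is a set of tag transformation rules. It builds on the output of another tagger and improves on it. The other taggers commonly used are N-gram language models; here we tag with unigram, bigram, and trigram Markov and hidden Markov models. Besides reporting tagging performance on the T3 corpus, we also address inconsistencies in the corpus, using confusion matrices to detect, inspect, and correct them. The best tagging accuracy obtained was 92.80% on the inside test and 85.59% on the outside test.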
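The two stages described above (learn transformation rules on top of a baseline tagger, then apply them) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy English sentences and tags are invented, the baseline is a plain unigram lookup, and the only rule template is "change tag a to b when the previous tag is p". The thesis does not state which rule templates or baseline settings it used, so this is not the author's implementation.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Baseline lexicon: each word maps to its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def apply_rule(tags, rule):
    """Rule (a, b, p): change tag a to b when the previous tag is p.
    Conditions are checked against the tags as they were before this rule."""
    a, b, p = rule
    return [b if i > 0 and t == a and tags[i - 1] == p else t
            for i, t in enumerate(tags)]

def count_errors(pred, gold):
    return sum(p != g for ps, gs in zip(pred, gold) for p, g in zip(ps, gs))

def learn_rules(tagged_sents, lexicon, default_tag, max_rules=10):
    """Error-driven loop: repeatedly keep the rule that removes the most errors."""
    gold = [[t for _, t in s] for s in tagged_sents]
    current = [[lexicon.get(w, default_tag) for w, _ in s] for s in tagged_sents]
    rules = []
    for _ in range(max_rules):
        # Candidate rules are proposed only at positions that are currently wrong.
        candidates = {(cur[i], ref[i], cur[i - 1])
                      for cur, ref in zip(current, gold)
                      for i in range(1, len(cur)) if cur[i] != ref[i]}
        base = count_errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in sorted(candidates):
            after = [apply_rule(t, rule) for t in current]
            gain = base - count_errors(after, gold)  # fixes minus new errors
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:  # no rule reduces the error count any further
            break
        rules.append(best_rule)
        current = [apply_rule(t, best_rule) for t in current]
    return rules

def tag(words, lexicon, rules, default_tag):
    """Stage two: baseline tags first, then the learned rules in order."""
    tags = [lexicon.get(w, default_tag) for w in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

# Invented toy corpus: "can" is usually a modal (MD) but a noun (NN) after DT.
train = [
    [("the", "DT"), ("can", "NN"), ("fell", "VB")],
    [("they", "PR"), ("can", "MD"), ("run", "VB")],
    [("we", "PR"), ("can", "MD"), ("go", "VB")],
    [("you", "PR"), ("can", "MD"), ("see", "VB")],
    [("a", "DT"), ("can", "NN"), ("broke", "VB")],
]
lexicon = train_unigram(train)
rules = learn_rules(train, lexicon, default_tag="NN")
print(rules)
print(tag(["the", "can", "go"], lexicon, rules, default_tag="NN"))
```

In this toy run the unigram baseline tags every "can" as MD, and the learner recovers a single contextual rule that corrects it after a determiner. A real system would use a richer set of templates and a stronger baseline such as the bigram or trigram models mentioned above.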

Parallel Abstract


Part-of-speech tagging is a basic problem in natural language processing. In this thesis, we study the performance of the Brill tagger (1992) on part of the T3 Taiwanese treebank. The Brill tagger is a transformation-based, error-driven approach. Starting from the output of another tagging method, such as an N-gram language model, it learns a set of transformation rules from an annotated corpus. The learning process is error-driven in that its objective is to minimize the tagging errors, computed by comparing the transformed output with the gold-standard annotated corpus. Annotated corpora often suffer from inconsistency, and we also study this problem using the confusion matrix. The best tagging accuracy we obtained is 92.80% on the inside test and 85.59% on the outside test.
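As a rough illustration of how a confusion matrix can surface tagging errors and possible annotation inconsistencies, the sketch below counts (gold, predicted) tag pairs and lists the most frequent off-diagonal cells. The tag names and the two short tag sequences are invented for the example, and the helper functions are hypothetical; they are not taken from the thesis.

```python
from collections import Counter

def confusion_matrix(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs over two aligned tag sequences."""
    return Counter(zip(gold_tags, predicted_tags))

def accuracy(matrix):
    """Fraction of tokens on the diagonal, i.e. tagged as in the gold data."""
    total = sum(matrix.values())
    correct = sum(n for (g, p), n in matrix.items() if g == p)
    return correct / total if total else 0.0

def top_confusions(matrix, k=3):
    """Most frequent off-diagonal cells: candidates for manual inspection."""
    off_diagonal = [(pair, n) for pair, n in matrix.items() if pair[0] != pair[1]]
    return sorted(off_diagonal, key=lambda item: (-item[1], item[0]))[:k]

# Invented toy sequences (N = noun, V = verb, A = adjective).
gold = ["N", "V", "N", "A", "V", "N", "A", "N"]
pred = ["N", "V", "V", "A", "V", "N", "N", "V"]

matrix = confusion_matrix(gold, pred)
print(accuracy(matrix))
print(top_confusions(matrix, k=1))
```

Computing the matrix on the training data corresponds to an inside test and computing it on held-out data to an outside test. A pair of tags that is confused heavily in both directions is a natural place to look for inconsistent annotation.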

References


1. Brill, Eric (1992), “A Simple Rule-Based Part of Speech Tagger”, in Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.
2. Brill, Eric (1995), “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging”, Computational Linguistics, 21(4): 543-565.
3. Jurafsky, Daniel and Martin, James H. (2000), “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, Prentice Hall.
4. Xia, Fei (2000), “The Bracketing Tagging Guidelines for the Penn Chinese Treebank (3.0)”, http://www.cis.upenn.edu/~chinese/parsequide.3rd.ch.pdf.
7. Jelinek, Frederick (1997), “Statistical Methods for Speech Recognition”, Cambridge, Mass.: MIT Press.

Cited by


史滌明 (2006). T3台語剖析樹語料庫與Brill剖析器 [T3 Taiwanese Treebank and Brill Parser] [Master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-1303200709262307
