Reliable and Cost-Effective Pos-Tagging

To achieve fast, high-quality part-of-speech (POS) tagging, algorithms should reach high accuracy while requiring little manual proofreading. This study pursues these goals by defining a new criterion of tagging reliability: the estimated final accuracy of the tagging under a fixed amount of proofreading, which is used to judge how cost-effective a tagging algorithm is. We also propose a new tagging algorithm, called the context-rule model, to achieve cost-effective tagging. The context-rule model uses broad context information to improve tagging accuracy. In experiments, we compared the tagging accuracy and reliability of the context-rule model, the Markov bi-gram model, and the word-dependent Markov bi-gram model. The results showed that the context-rule model outperformed both Markov models. In terms of tagging accuracy, the context-rule model reduced the number of errors by 20% more than the two Markov models did. For the most cost-effective tagging algorithm, we estimated that reaching 99% tagging accuracy required rechecking, on average, 20% of the samples of ambiguous words. We also compared the tradeoff between the amount of proofreading needed and the final accuracy for the different algorithms. The algorithm with the highest accuracy turns out not always to be the most reliable one.
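The reliability criterion can be illustrated with a small sketch. The abstract does not give the estimation procedure, so the following rests on assumptions that are ours, not the paper's: the tagger attaches a confidence score to each tagged token, the proofreader rechecks the lowest-confidence tokens first, and every rechecked token ends up correct. The function name `final_accuracy` and the toy data are hypothetical.

```python
from typing import List, Tuple


def final_accuracy(tagged: List[Tuple[float, bool]], proofread_fraction: float) -> float:
    """Final accuracy after proofreading a fixed fraction of the tokens.

    tagged: one (confidence, is_correct) pair per tagged token.
    proofread_fraction: share of tokens rechecked, between 0.0 and 1.0.

    Assumption (not from the paper): the lowest-confidence tokens are
    rechecked first, and a rechecked token is always corrected.
    """
    if not tagged:
        raise ValueError("tagged must not be empty")
    if not 0.0 <= proofread_fraction <= 1.0:
        raise ValueError("proofread_fraction must be in [0, 1]")

    n = len(tagged)
    k = round(proofread_fraction * n)  # number of tokens to recheck

    # Sort by confidence so the least reliable tags come first.
    ranked = sorted(tagged, key=lambda pair: pair[0])

    # Rechecked tokens count as correct; the rest keep their original status.
    correct_unchecked = sum(1 for _, ok in ranked[k:] if ok)
    return (k + correct_unchecked) / n


# Toy data (hypothetical): 10 tokens, 3 of them tagged incorrectly.
toy = [
    (0.95, True), (0.55, False), (0.90, True), (0.60, True), (0.99, True),
    (0.70, False), (0.85, True), (0.65, False), (0.97, True), (0.80, True),
]

for fraction in (0.0, 0.2, 0.4):
    print(fraction, final_accuracy(toy, fraction))
```

Under this view, two taggers with the same raw accuracy can differ in reliability: the one whose errors concentrate among its low-confidence tags gains more from the same proofreading budget.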

Keywords

part-of-speech tagging; corpus; reliability; ambiguity resolution
