Dictionary-Based Identification of News Event Types: A Case Study of Sports News

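With the rapid growth of information and network technology, the Internet has become the largest data repository in existence, and gathering relevant material from such a vast collection of web pages is far from easy for users. This paper aims to help users read the information they want in a short time. Chinese word segmentation serves as the basis for classifying articles: the frequency of each word in an article is computed, and words that occur frequently are treated as that article's keywords, on the premise that the article's reporting revolves around them. Users can then look for the information they want through these keywords, which greatly reduces unnecessary search time. The experimental sample consists of 320 electronic articles taken from the 東森新聞 news website. The articles are divided into two groups: dictionary-training articles and test articles. Of these, 285 were downloaded from the sports category and serve as training articles, and 35 are test articles. The first 15 test articles were downloaded from the real-time news section, which covers all kinds of news, so these 15 are of mixed categories; the remaining 20 are used to evaluate performance. The training articles are used to build the dictionary, while the test articles are mainly used to check how well the segmentation results perform.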
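As a rough illustration of the idea described above, the following Python sketch assumes the articles have already been segmented into word lists; the segmentation step itself is not shown. The function names, the `min_count` cutoff, the top-`k` keyword count, and the overlap `threshold` are all hypothetical choices made for this example, not the procedure actually used in the paper.

```python
from collections import Counter


def build_dictionary(training_docs, min_count=2):
    """Collect terms that occur at least min_count times across all training articles.

    training_docs is a list of articles, each already segmented into a list of words.
    (min_count is an illustrative assumption.)
    """
    counts = Counter(term for doc in training_docs for term in doc)
    return {term for term, n in counts.items() if n >= min_count}


def top_keywords(doc, k=3):
    """Return the k most frequent terms of one segmented article."""
    return [term for term, _ in Counter(doc).most_common(k)]


def is_sports(doc, dictionary, k=3, threshold=0.5):
    """Label an article as sports when enough of its top keywords are in the dictionary.

    The overlap ratio and threshold are assumptions made for this sketch.
    """
    keywords = top_keywords(doc, k)
    if not keywords:
        return False
    hits = sum(1 for term in keywords if term in dictionary)
    return hits / len(keywords) >= threshold


# Toy, hand-segmented sports articles (invented for illustration).
training_docs = [
    ["棒球", "投手", "全壘打", "棒球"],   # baseball, pitcher, home run
    ["籃球", "球員", "得分", "棒球"],     # basketball, player, score
    ["投手", "球員", "比賽"],             # pitcher, player, game
]
sports_dictionary = build_dictionary(training_docs)

print(is_sports(["投手", "棒球", "投手", "勝投", "棒球"], sports_dictionary))
print(is_sports(["股市", "下跌", "股市", "投資"], sports_dictionary))
```

Keeping the dictionary as a plain set of frequent training terms makes the lookup cheap, but a single shared term such as a person's name can tip an unrelated article over the threshold.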

Keywords

Sports news; word segmentation; training dictionary; hidden Markov model

Parallel Abstract

The rapid and vigorous development of information and network technology has made the Internet the largest data repository. Collecting relevant information from such a large body of data is difficult for any user. This paper aims to help users grasp key information in a short period of time. We observe that terms occurring frequently in an article can serve as keywords for that article, and that the article's theme can be readily grasped from these keywords. Users can therefore find the information they want through keywords and significantly reduce unnecessary search time. Proper word segmentation enables theme extraction, and article classification can then be achieved by differentiating themes. We use 320 articles in the theme classification experiment. These articles are divided into two categories: training and testing. There are 285 training samples, all belonging to the sports news theme. There are 15 testing samples whose themes were picked at random. The method picks out 6 articles belonging to the sports news theme among the 15 testing samples. Among the 20 negative samples, there are 4 false positives, all due to names related to sports events.
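The counts reported in this abstract are consistent with one another, and the false-positive figure can be restated as a rate. This is only arithmetic on the numbers given above, not an additional result:

```latex
\[
285 + 15 + 20 = 320,
\qquad
\text{false positive rate} = \frac{4}{20} = 0.20
\]
```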

Parallel Keywords

Sport News; Segmentation; Training Dictionary; Hidden Markov Model

