A News Web Page Classification System Based on Chinese Word Segmentation

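With the growth of the Internet in recent years, the network has become an indispensable part of daily life. Its convenience and interactivity let users learn about recent events, and these same properties have caused the volume of online news to grow very quickly. This raises a problem: how to give users accurate or relevant information is an important issue that must now be faced. This thesis builds a news web page classification system based on Chinese word segmentation. For articles retrieved from the web, a statistical segmentation method counts how many times each candidate word occurs in the articles. A threshold is then set, and any word whose count does not exceed the threshold is deleted from the lexicon. The remaining words are used with two classification methods, naive Bayes classification and weighted Bayes classification, to compare which one performs better. Experimental results show that naive Bayes classification outperforms weighted Bayes classification, with recall reaching as high as 71%. These results indicate that using a threshold to remove incorrect words, combined with naive Bayes classification, is reasonably effective.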
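The counting and threshold steps described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names `count_ngrams` and `prune_lexicon`, the choice of n-gram lengths, and the threshold value are all assumptions made for the example. The only detail taken from the abstract is that a candidate is dropped when its count does not exceed the threshold.

```python
from collections import Counter


def count_ngrams(text, n_values=(2, 3)):
    """Count every contiguous character n-gram in text, for each n in n_values.

    Hypothetical helper: the abstract does not say which n-gram lengths
    the system uses, so n_values is an assumption.
    """
    counts = Counter()
    for n in n_values:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def prune_lexicon(counts, threshold):
    """Keep only candidates whose count exceeds the threshold.

    Candidates with count <= threshold are removed, matching the
    abstract's "does not exceed the threshold" wording.
    """
    return {term: c for term, c in counts.items() if c > threshold}


# Toy input: a short string with a repeated phrase.
candidates = count_ngrams("新聞分類系統新聞分類", n_values=(2,))
lexicon = prune_lexicon(candidates, threshold=1)
```

On this toy input the surviving bigrams are 新聞, 聞分, and 分類. The non-word 聞分 survives because it occurs as often as its neighbours, which suggests why a frequency threshold needs a reasonably large corpus before it separates real words from accidental character pairs.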

Keywords

Bayesian classification; recall

Parallel Abstract

With the vigorous development of the Internet, the network has become indispensable to many people's everyday lives. Because reading news online is convenient, the number of users who learn about recent events from the Internet is growing rapidly, which has in turn led a large number of news agencies to make their news available online. How to help users receive relevant or interesting news is therefore an important issue. One approach is to build an automatic news classification system that lets users read news from the categories that interest them. In this paper, a news page classification system based on Chinese word segmentation is built. It automatically downloads news pages and uses an n-gram algorithm for word segmentation. After word segmentation, we compare the performance of two classification schemes. The Naïve Bayes classifier achieves the higher recall rate, with an average recall rate of 71%. Experimental results show that the Naïve Bayes classifier with n-gram word segmentation performs better than the weighted Bayes classifier.
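For readers unfamiliar with the classification and evaluation steps, the sketch below shows one common form of a naive Bayes text classifier together with a per-category recall computation. It is an illustration under stated assumptions, not the system described here: the abstract does not give the probability model or any smoothing scheme, so the multinomial model with add-one (Laplace) smoothing, and the names `train_naive_bayes`, `classify`, and `recall`, are hypothetical. The weighted variant is not sketched because the abstract does not specify how the weights are defined.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Train on a list of (tokens, label) pairs.

    Returns class priors, per-class term counts, and the vocabulary.
    """
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    priors = {label: n / total for label, n in class_docs.items()}
    return priors, term_counts, vocab


def classify(tokens, priors, term_counts, vocab):
    """Return the label with the highest log posterior.

    Add-one smoothing is an assumption, used so that a term unseen in
    one class does not drive that class's probability to zero.
    """
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(prior)
        for t in tokens:
            if t not in vocab:
                continue  # ignore terms never seen in training
            score += math.log(
                (term_counts[label][t] + 1) / (total_terms + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def recall(true_labels, predicted_labels, target):
    """Fraction of documents truly in `target` that were predicted as `target`."""
    predictions = [p for t, p in zip(true_labels, predicted_labels) if t == target]
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p == target) / len(predictions)


# Toy training set with two hypothetical categories.
training = [
    (["股市", "上漲"], "finance"),
    (["股市", "下跌"], "finance"),
    (["球隊", "獲勝"], "sports"),
    (["球隊", "比賽"], "sports"),
]
priors, term_counts, vocab = train_naive_bayes(training)
label = classify(["股市"], priors, term_counts, vocab)
```

Recall is computed per category and can then be averaged across categories, which is one plausible reading of the average recall rate reported above.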

Parallel Keywords

Naive Bayes Classifier; Recall Rate


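Cited By

許桓瑜 (2012). The Effects of Long-Sentence Word Segmentation and Genetic Algorithms on News Classification [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2012.00488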
