
領域文章之自動化分類與關鍵字擷取:以新聞主題為研究案例

Automatic Classification and Keyword Extraction for Domain Articles: Case Studies on News Topics

Advisor: 林宣華

Abstract


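Today, as online social platforms of all kinds flourish, how to analyze their virtually uncountable volume of text has become a popular research topic in text mining. Article classification and keyword extraction directly affect downstream value-added services and applications. Although many classification and keyword extraction methods exist, most rely on manual annotation to classify articles and build training data for machine learning. For news, where large numbers of unclassified articles are produced every day, the predictive performance of traditional approaches based on manual annotation and machine learning is relatively limited, and the labor cost is high. News websites also sort articles into only a few to a dozen or so categories; important and urgent issues, such as food safety, do not fit fixed categories and require classification data to be generated dynamically and quickly. In addition, an article may cover several classification aspects at once, yet news websites usually file each article under a single category, so users who read only their subscribed categories may miss many articles they would want to read. For example, news stories often have both social and political aspects, and international news is frequently related to politics. This thesis therefore designs an automated system for news article classification and keyword extraction with three main goals: (1) given a few keywords that express the domains of interest, the system automatically classifies large numbers of articles into multiple domains and analyzes the weight carried by the important keywords of each aspect of an article; (2) using the classification results, it identifies representative domain keywords for each domain; (3) based on the first two results, it extracts keywords from articles and provides smart tags that assist news reading.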

Parallel Abstract


The explosive growth of web data on social networks has made text mining a popular research topic for data scientists. Both classification and keyword extraction are essential techniques for achieving good performance in downstream applications. Although many methods offer feasible solutions for classification and keyword extraction, they depend on large amounts of curated training data with class labels. Consequently, streaming data such as the daily articles published by many news websites are hard to annotate manually with reasonable class labels. In addition, news categories are fixed and do not suit dynamic, emerging topics that reflect social responses or trends. Moreover, class labels curated from one or several reports are not objective enough, because some news articles belong to several categories. If users read news only from the categories that interest them, some articles they would want may never appear in their subscriptions. We therefore implement an automatic system for classifying news articles and extracting representative keywords, with three main goals: (1) automatically classifying articles into multiple domains given only a few domain keywords; (2) extracting significant domain keywords based on the results of the automatic classifiers; (3) extracting important keywords to serve as tags for news articles.
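To make goal (1) concrete, the following is a minimal sketch of multi-domain classification driven by seed keywords. It is an illustration only and not the thesis's actual method: the seed sets in `SEEDS`, the function names `domain_weights` and `classify`, the hit-share weighting, and the 0.25 threshold are all assumptions, and articles are assumed to be already segmented into tokens.

```python
from collections import Counter

# Hypothetical seed keywords per domain; the thesis does not list its seeds.
SEEDS = {
    "politics": {"election", "parliament", "minister"},
    "society": {"police", "court", "protest"},
    "food_safety": {"additive", "recall", "contamination"},
}


def domain_weights(tokens, seeds):
    """Return each domain's share of all seed-keyword hits in the tokens."""
    counts = Counter(tokens)
    hits = {d: sum(counts[k] for k in kws) for d, kws in seeds.items()}
    total = sum(hits.values())
    if total == 0:
        # No seed keyword occurs, so no domain receives any weight.
        return {d: 0.0 for d in seeds}
    return {d: h / total for d, h in hits.items()}


def classify(tokens, seeds, threshold=0.25):
    """Assign every domain whose weight reaches the threshold (multi-label)."""
    weights = domain_weights(tokens, seeds)
    return sorted(d for d, w in weights.items() if w >= threshold)


# A pre-tokenized toy article that touches several domains at once.
article = ["minister", "court", "protest", "election", "recall", "weather"]
print(domain_weights(article, SEEDS))
print(classify(article, SEEDS))
```

Because each domain is scored independently and every domain above the threshold is kept, a single article can receive several labels, which is the behavior the abstract asks for when a story has both social and political aspects.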
