
自動標註新聞標籤以強化新聞內容擷取

Automatic News Tagging System for Improving Content Extraction

Advisor: 林宣華

Abstract


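This thesis integrates text analysis and data mining techniques [1] to crawl the titles, summaries, content, and news tags of major news websites. Known keywords are extracted by dictionary matching, and the data characteristics of the news tags and of the remaining terms in each article are analyzed. Several common Chinese term analysis methods, such as Significant Estimates [2], TF-IDF [3], and Associate Mining [4], are then used to discover potential keywords, which are analyzed and classified. The thesis sets out to design an automated keyword extraction system for news articles with three main goals: (1) discover time-sensitive terms in articles that cannot be matched against existing dictionaries, for example 反送中 (the Anti-Extradition Law Amendment Bill Movement), 中美貿易戰 (the China-United States trade war), and 韓粉 (Han supporters); (2) classify the keywords in news articles into important keywords (Keywords), general keywords (General Keywords), and domain keywords (Domain Keywords), and build a system that automatically generates a news keyword dictionary; (3) apply Associate Mining to the results of the first two goals to improve the accuracy of keyword extraction, domain detection, and similar-article recommendation.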

Parallel Abstract


In this thesis, we propose an automatic news keyword tagging system that integrates text analysis and data mining [1] methods to improve content extraction. A Python crawler retrieves data from news websites, including news titles, summaries, and tags. By analyzing the data characteristics of news tags and of phrases in news articles, the system detects and classifies potential keywords using several common methods for Chinese text analysis, such as Significant Estimates [2], TF-IDF [3], and Associate Mining [4]. The system is designed to achieve three main goals: (1) detect time-sensitive keywords in news articles that cannot be found in the dictionary currently in use, such as the Anti-Extradition Law Amendment Bill Movement (反送中), the China-United States trade war (中美貿易戰), and supporters of candidate Guo-Yu Han (韓粉); (2) categorize keywords in news articles into important keywords, general keywords, and domain keywords, and establish an automated system that generates a news keyword dictionary; (3) use Associate Mining to improve the accuracy of keyword extraction, topic detection, and similar-article recommendation.
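
As a rough illustration of how dictionary matching and TF-IDF [3] could combine to surface terms missing from a dictionary, the Python sketch below scores pre-tokenized articles and splits the top-ranked terms into dictionary hits and unseen candidates. The function names `tf_idf` and `candidate_keywords`, the sample tokens, and the dictionary contents are hypothetical and are not taken from the thesis. Word segmentation is assumed to have been done already, and the plain tf × log(N/df) weighting is one common variant, not necessarily the one the thesis uses.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # Relative term frequency times log-scaled inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def candidate_keywords(doc_scores, dictionary, top_k=3):
    """Split a document's top-scoring terms into known and unseen terms."""
    ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_k]
    known = [t for t in ranked if t in dictionary]
    unseen = [t for t in ranked if t not in dictionary]
    return known, unseen


# Hypothetical, already-segmented articles and a small keyword dictionary.
docs = [
    ["反送中", "遊行", "香港", "新聞"],
    ["中美貿易戰", "關稅", "新聞"],
    ["香港", "股市", "新聞"],
]
dictionary = {"香港", "新聞", "關稅", "股市", "遊行"}

scores = tf_idf(docs)
known, unseen = candidate_keywords(scores[0], dictionary)
print("dictionary hits:", known)
print("unseen candidates:", unseen)
```

A term that appears in every article, such as 新聞 here, receives a score of zero, while a term that ranks highly but is absent from the dictionary is only a candidate new keyword; the abstract describes further analysis and classification beyond this step.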

References


[1] Data Mining. https://en.wikipedia.org/wiki/Data_mining
[2] Chien, L. F. (1997, July). PAT-tree-based keyword extraction for Chinese information retrieval. In ACM SIGIR Forum (Vol. 31, No. SI, pp. 50-58). ACM.
[3] TF-IDF. https://en.wikipedia.org/wiki/Tf%E2%80%93idf
