The internet now holds a vast amount of text, for example on news websites, PTT, and Facebook. Because this material is voluminous and unorganized, text mining can be used to extract the useful information so that readers can efficiently grasp what the text conveys. This thesis uses the R language to build a news event tracking system. A web crawler collects news articles, which are then cleaned and segmented into words with jieba (jiebaR). A term-frequency matrix is built from the segmentation results, keywords are identified by computing TF-IDF, and the keywords extracted from each article are used for article similarity analysis, which yields a system for tracking similar articles. Following these steps, 1,500 news articles were retrieved and their similarity was analyzed by computing the cosine distance between articles. Ward's method, which keeps the total within-group variation small and the total between-group variation large, was then applied to determine the best number of clusters. The experimental results show that, after this processing, similar news articles can be retrieved through an article query function, achieving news event tracking.
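The keyword-weighting and similarity steps described above can be illustrated with a minimal sketch. It is written in Python using only the standard library, purely for illustration: the thesis itself uses R and jiebaR, and this is not the author's code. The function names (`tf_idf`, `cosine_similarity`, `most_similar`) and the toy English documents are hypothetical, the input is assumed to be already segmented into tokens, and the TF-IDF variant shown (relative term frequency times the natural log of inverse document frequency) is one common choice that may differ from the one used in the thesis. Crawling, cleaning, segmentation, and Ward clustering are left out.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (already segmented).
    Weight = (term count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, vectors, top_n=3):
    """Rank the other documents by similarity to the query document."""
    scores = [
        (i, cosine_similarity(vectors[query_index], vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Toy, pre-segmented documents (hypothetical data, not from the thesis).
docs = [
    ["election", "candidate", "vote", "poll"],
    ["election", "vote", "result", "candidate"],
    ["typhoon", "rain", "flood", "warning"],
    ["typhoon", "flood", "evacuation", "rain"],
]
vectors = tf_idf(docs)
print(most_similar(0, vectors, top_n=1))
```

Cosine distance, as used in the abstract, is simply one minus the similarity computed here, so ranking by highest similarity is equivalent to ranking by smallest distance.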