This study proposes a method for clustering online news based on the timeline significance of terms, to help readers find and track news events of interest within a large volume of online news. The method weights each feature term in a document by its number of occurrences in the document and by a χ² statistic computed for the time interval in which the document was published. It then expands the document's feature set by adding terms that frequently co-occur with the original feature terms, and raises the weights of related terms, using a timeline-based term co-occurrence analysis. For clustering, the similarity between an incoming document and each existing event is computed from the weights of the document's feature terms and the rank at which those terms appear in the event centroid; the document is then either assigned to an existing event cluster or used to start a new event. Event centroids are updated as new documents are added. Experimental results on Chinese and English online news show that the proposed method significantly outperformed the traditional TFIDF method. With timeline significance alone, the F1 measure was about 25.25% higher on average than with TFIDF; adding term expansion raised the F1 measure by a further 4.26% on average.
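To make the weighting idea concrete, the following is a minimal sketch of a time-interval χ² term weight. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the 2x2 contingency table (term present or absent, document inside or outside the interval), the product `tf * chi2` as the combined weight, and the hypothetical names `chi_square` and `term_weights`.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table.

    a: documents inside the interval that contain the term
    b: documents outside the interval that contain the term
    c: documents inside the interval that lack the term
    d: documents outside the interval that lack the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def term_weights(doc_tokens, doc_interval, corpus):
    """Weight each term of one document by tf * chi-square.

    doc_tokens:   list of tokens in the document being weighted
    doc_interval: identifier of the time interval the document falls in
    corpus:       list of (interval_id, set_of_terms) pairs, one per document
    The tf * chi2 combination is an assumption, not the paper's formula.
    """
    tf = Counter(doc_tokens)
    in_total = sum(1 for interval, _ in corpus if interval == doc_interval)
    out_total = len(corpus) - in_total
    weights = {}
    for term, count in tf.items():
        a = sum(1 for interval, terms in corpus
                if interval == doc_interval and term in terms)
        b = sum(1 for interval, terms in corpus
                if interval != doc_interval and term in terms)
        c = in_total - a
        d = out_total - b
        weights[term] = count * chi_square(a, b, c, d)
    return weights


if __name__ == "__main__":
    # Toy corpus: two time intervals, two documents each.
    corpus = [
        (0, {"quake", "city"}),
        (0, {"quake", "rescue"}),
        (1, {"city", "market"}),
        (1, {"market", "stocks"}),
    ]
    print(term_weights(["quake", "quake", "city"], 0, corpus))
```

In this toy corpus, "quake" appears only in interval 0 and receives a high weight, while "city" is spread evenly across both intervals and receives a weight of zero. Note that the χ² statistic is symmetric: a term that is unusually rare in an interval would also score highly, so a practical version would probably need to check the direction of the association as well.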