A Study of News Event Detection and Tracking Based on the Timeline Significance of Terms

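This study proposes a method based on the timeline significance of terms for clustering online news, to help readers find and track news events of interest among large volumes of online news. The method computes the weight of each feature term in a document from two quantities: the term's chi-square test value for the time interval in which the news document occurs, and the number of times the term appears in the document. It then uses the co-occurrence relevance between terms to expand the number and weights of the document's feature terms. For clustering, a similarity is computed from the weights of a newly added document's feature terms and the ranks at which those terms appear in a news event center; based on this similarity, the document is either assigned to an existing event cluster or used to create a new event. The news event center is also updated appropriately as new documents are added. Experimental results on Chinese and English news events show that system performance improves markedly once timeline significance and term expansion are added. With timeline significance alone, the F1 measure is on average about 25.25% higher than that of the traditional TFIDF method; adding term expansion raises the F1 measure by a further 4.26% on average.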
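The weighting step can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything here is an assumption: the 2x2 contingency table (documents inside or outside the time interval, with or without the term), the product `tf * chi-square` as the combined weight, and the hypothetical names `chi_square` and `timeline_weights`.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def timeline_weights(doc_tokens, interval_docs, other_docs):
    """Weight each term of a document by tf * chi-square (assumed combination).

    doc_tokens:    list of tokens of the document being weighted
    interval_docs: token sets of documents inside the document's time interval
    other_docs:    token sets of documents outside that interval
    """
    tf = Counter(doc_tokens)
    weights = {}
    for term, count in tf.items():
        a = sum(term in doc for doc in interval_docs)  # inside, contains term
        b = len(interval_docs) - a                     # inside, lacks term
        c = sum(term in doc for doc in other_docs)     # outside, contains term
        d = len(other_docs) - c                        # outside, lacks term
        weights[term] = count * chi_square(a, b, c, d)
    return weights


if __name__ == "__main__":
    inside = [{"quake", "the"}, {"quake", "the"}]
    outside = [{"the"}, {"the"}]
    print(timeline_weights(["quake", "quake", "the"], inside, outside))
```

In this toy input, "quake" appears only inside the interval and receives a positive weight, while "the" appears everywhere and receives zero, which is the behavior a time-sensitive weight is meant to capture.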

Keywords

Document Clustering

Parallel Abstract

This study proposes a timeline-significance-based method for event detection and tracking of online news. The method calculates a term's weight in a document from its number of occurrences and a χ² statistic that depends on the time interval in which the document occurs. Moreover, the method expands the feature terms of a document to include additional terms that frequently co-occur with some of the original feature terms, and raises the weights of the relevant terms through a timeline-based term co-occurrence analysis. Experimental results on Chinese and English online news indicate that the proposed method significantly outperformed the traditional TFIDF method. With timeline significance only, the proposed method achieved an average improvement of 25.25% on the F1 measure. With further term expansion, it achieved an additional average improvement of 4.26% on the F1 measure.
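The term-expansion step might look roughly like the following Python sketch. The relevance definition used here (the fraction of documents containing a term that also contain another term), the threshold, and the additive weight update are illustrative assumptions, not the paper's stated formulas; `cooccurrence_relevance` and `expand_terms` are hypothetical names. To approximate a timeline-based analysis, `docs` would be restricted to documents from the relevant time interval.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_relevance(docs):
    """Asymmetric relevance rel[t][u] = df(t and u) / df(t) over token sets."""
    df = Counter()
    pair_df = Counter()
    for tokens in docs:
        df.update(tokens)
        # Count every ordered pair of distinct terms once per document.
        pair_df.update(permutations(sorted(tokens), 2))
    rel = {}
    for (t, u), n in pair_df.items():
        rel.setdefault(t, {})[u] = n / df[t]
    return rel


def expand_terms(weights, rel, threshold=0.5):
    """Add or raise weights of terms that co-occur strongly with original terms."""
    expanded = dict(weights)
    for t, w in weights.items():
        for u, r in rel.get(t, {}).items():
            if r >= threshold:
                # Assumed update rule: propagate a share of the source weight.
                expanded[u] = expanded.get(u, 0.0) + w * r
    return expanded


if __name__ == "__main__":
    docs = [{"quake", "tsunami"}, {"quake", "tsunami"},
            {"quake", "rescue"}, {"election"}]
    rel = cooccurrence_relevance(docs)
    print(expand_terms({"quake": 3.0}, rel))
```

Here "tsunami" co-occurs with "quake" in two of the three "quake" documents and is added to the feature set, while "rescue" falls below the threshold and is left out.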

Parallel Keywords

Document Clustering; Similarity Measure; Timeline Significance; Term Weight; Relevance between Terms
