A Spam Email Classification Algorithm for Handling Concept Drift

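The flood of spam email has never been fully resolved, and a variety of anti-spam mechanisms have emerged; among them, content-based spam filtering built on machine learning is the most widely used. These methods, however, mostly assume that all data come from a fixed, unchanging environment. In practice, email content keeps changing as concepts drift, so a classifier performs well when its model is first built, but its classification accuracy gradually declines as time passes and concepts drift. A learning and adjustment mechanism is therefore needed that learns from and adjusts to both newly arrived and existing emails in the dataset. A second problem in email classification is data skew: because spam is so prevalent, spam messages usually far outnumber legitimate ones, and although the spam class achieves a high recall rate during classification, the recall rate of the legitimate class is comparatively poor. This study therefore proposes the IFWB (Incremental Forgetting Weighted Bayesian) algorithm, which builds on Bayesian classification, uses IGICF (Information Gain and Inverse Class Frequency) to extract keywords, and combines a gradual forgetting mechanism with a classification-cost framework to address concept drift and data skew in email classification. Finally, experiments are conducted to validate the proposed email classification method.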

Keywords

Email classification; concept drift; data skew

Parallel Abstract

The problem of spam overflow has not been solved completely. Many anti-spam techniques have been proposed, and machine learning techniques are the most popular among them, but these works assume a static environment. In real-world applications, email content may change with concept drift. The classification result is usually good at the beginning, but as time passes and concepts drift, the classification accuracy gradually drops. A mechanism is therefore needed to adjust the classifier according to the newly incoming emails and the old emails in the dataset. Another problem in email categorization is data skewness. Because of the spam overflow, spam emails far outnumber legitimate ones. In the classification result, the majority class has a good recall rate, but the minority class has a poor one. For these reasons, we propose an algorithm, IFWB (Incremental Forgetting Weighted Bayesian), based on Naïve Bayes and IGICF (Information Gain and Inverse Class Frequency) feature extraction, combined with a gradual forgetting mechanism and a cost-sensitive model, to tackle concept drift and data skewness. Finally, we demonstrate the effectiveness of the IFWB algorithm through a series of experiments.
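To make the combination of gradual forgetting and cost-sensitive Bayesian classification more concrete, the following is a minimal Python sketch. It is not the IFWB algorithm itself, whose details are not given in this abstract. The class name `ForgettingNaiveBayes`, the exponential decay factor `decay`, the cost parameters `cost_fp` and `cost_fn`, and the Laplace smoothing are all illustrative assumptions, and IGICF keyword extraction is omitted because its formula is not stated here.

```python
import math
from collections import defaultdict


class ForgettingNaiveBayes:
    """Hypothetical sketch: multinomial naive Bayes with decaying counts."""

    def __init__(self, decay=0.9, cost_fp=5.0, cost_fn=1.0):
        self.decay = decay      # assumed forgetting factor in (0, 1]; 1.0 = no forgetting
        self.cost_fp = cost_fp  # assumed cost of flagging legitimate mail as spam
        self.cost_fn = cost_fn  # assumed cost of letting spam through
        self.doc_counts = {"spam": 0.0, "ham": 0.0}
        self.word_counts = {"spam": defaultdict(float), "ham": defaultdict(float)}
        self.vocab = set()

    def update(self, tokens, label):
        # Gradual forgetting: shrink all old evidence, then add the new email.
        # (Decaying every count per update is O(vocabulary); fine for a sketch.)
        for cls in self.doc_counts:
            self.doc_counts[cls] *= self.decay
            for word in self.word_counts[cls]:
                self.word_counts[cls][word] *= self.decay
        self.doc_counts[label] += 1.0
        for word in tokens:
            self.word_counts[label][word] += 1.0
            self.vocab.add(word)

    def _log_joint(self, tokens, cls):
        # Laplace-smoothed log prior plus log likelihood of the tokens.
        total_docs = sum(self.doc_counts.values())
        log_p = math.log((self.doc_counts[cls] + 1.0) / (total_docs + 2.0))
        total_words = sum(self.word_counts[cls].values())
        vocab_size = max(len(self.vocab), 1)
        for word in tokens:
            count = self.word_counts[cls].get(word, 0.0)
            log_p += math.log((count + 1.0) / (total_words + vocab_size))
        return log_p

    def spam_probability(self, tokens):
        log_spam = self._log_joint(tokens, "spam")
        log_ham = self._log_joint(tokens, "ham")
        top = max(log_spam, log_ham)  # subtract the max for numerical stability
        e_spam = math.exp(log_spam - top)
        e_ham = math.exp(log_ham - top)
        return e_spam / (e_spam + e_ham)

    def predict(self, tokens):
        # Cost-sensitive rule: flag as spam only when the expected cost is lower,
        # i.e. (1 - p) * cost_fp < p * cost_fn, or p > cost_fp / (cost_fp + cost_fn).
        p = self.spam_probability(tokens)
        threshold = self.cost_fp / (self.cost_fp + self.cost_fn)
        return "spam" if p > threshold else "ham"


def train_stream(model, stream):
    """Feed (tokens, label) pairs to the model in arrival order."""
    for tokens, label in stream:
        model.update(tokens, label)
    return model


# A word that first appears in legitimate mail and later only in spam.
stream = [(["invoice"], "ham")] * 6 + [(["invoice"], "spam")] * 4
static = train_stream(ForgettingNaiveBayes(decay=1.0, cost_fp=1.0), stream)
drifting = train_stream(ForgettingNaiveBayes(decay=0.5, cost_fp=1.0), stream)
cautious = train_stream(ForgettingNaiveBayes(decay=0.5, cost_fp=5.0), stream)

print(static.predict(["invoice"]))    # ham: the older majority still dominates
print(drifting.predict(["invoice"]))  # spam: recent spam outweighs decayed ham
print(cautious.predict(["invoice"]))  # ham: a high false-positive cost raises the bar
```

In this toy stream the model without forgetting keeps following the older legitimate examples, while the decayed model tracks the newer spam usage. Raising `cost_fp` protects the minority (legitimate) class at the price of letting more spam through, which is the trade-off a cost-sensitive model is meant to expose.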

Parallel Keywords

e-mail categorization; concept drift; data skewness


