Spam Filtering: Application of Data Mining and Chinese Word Segmentation Technique

Advisor: 陳景祥


In countries that have not yet enacted clear legislation regulating spam e-mail, filtering technology is the primary means by which most Internet users protect themselves. Many filtering methods exist and the techniques continue to improve, but few achieve a 100% blocking rate. This study proposes an effective spam-filtering method. A PHP program extracts features from e-mail messages, and two data mining techniques, the C4.5 decision tree and the probabilistic neural network (PNN), classify the messages. A Chinese word segmentation system supplies the frequency, order (rank), and part of speech of Chinese words as additional input variables, and a “gray region” mail category is added as a new output variable. We compare the classification performance and the total risk (misclassification) cost on Chinese e-mail. The results show that, with the C4.5 decision tree, adding word-frequency and word-order percentages as input variables raises the rate at which spam is correctly classified. With the PNN, adding part-of-speech features as input variables raises the rate at which legitimate mail is correctly classified. Adding the “gray region” category as an output variable significantly increases spam precision and recall, most of which exceed 98.5%, and markedly reduces the total risk cost.
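The “gray region” output category and the spam precision, recall, and cost measures mentioned above can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the score thresholds (0.3 and 0.7), and the cost weights are hypothetical values chosen for illustration, and leaving gray-region mail out of the precision and recall counts is an assumption about how such a category might be scored.

```python
from collections import Counter


def assign_label(spam_score, low=0.3, high=0.7):
    """Map a spam score in [0, 1] to one of three output categories.

    The thresholds are hypothetical; scores between them fall in the
    "gray" region instead of being forced into spam or legitimate.
    """
    if spam_score >= high:
        return "spam"
    if spam_score <= low:
        return "legit"
    return "gray"


def evaluate(true_labels, predicted_labels,
             cost_legit_as_spam=9.0, cost_spam_as_legit=1.0, cost_gray=0.5):
    """Compute spam precision, spam recall, and a total misclassification cost.

    Assumptions (not from the study): gray-region mail is excluded from
    precision and recall, and each gray message incurs a small review cost.
    """
    counts = Counter(zip(true_labels, predicted_labels))
    tp = counts[("spam", "spam")]    # spam correctly flagged
    fp = counts[("legit", "spam")]   # legitimate mail flagged as spam
    fn = counts[("spam", "legit")]   # spam passed as legitimate
    gray = counts[("spam", "gray")] + counts[("legit", "gray")]

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total_cost = (fp * cost_legit_as_spam
                  + fn * cost_spam_as_legit
                  + gray * cost_gray)
    return {"precision": precision, "recall": recall,
            "gray": gray, "total_cost": total_cost}


# Toy data: classifier scores and the true category of six messages.
scores = [0.9, 0.8, 0.5, 0.2, 0.1, 0.75]
truth = ["spam", "spam", "spam", "legit", "legit", "legit"]
predicted = [assign_label(s) for s in scores]
result = evaluate(truth, predicted)
```

In this toy run the message scored 0.5 lands in the gray region, so it is neither a missed spam nor a false alarm; widening the gray band trades fewer hard errors for more messages that need manual review.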


