  • Thesis

A method of spam detection based on structural similarity


Advisor: 林柏青

Abstract (Chinese)


Not available

Keywords (Chinese)

Spam; Clustering; Document similarity

Abstract (English)


Spammers usually deliver a large number of spam instances generated from a set of templates. To identify spam messages in the same campaign, or to detect new spam instances that are likely to belong to known campaigns, we propose a method to group spam messages based on their HTML structural features. We observe that spam mails tend to have similar mail-body structures, even though the words in the bodies can differ significantly to evade spam detection. Rather than inferring the templates and representing them as regular expressions, we extract the HTML tags from the mail bodies as the structural features and build a fingerprint for each structure. With the fingerprints, we can efficiently identify clusters of similar structures using the simhash algorithm and the Jaccard similarity. The identification is useful for finding new spam instances that belong to known structures, with a recall of up to around 95%, while the false-positive rate for normal mails can be less than 5%.
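The pipeline described in the abstract (tag extraction, fingerprinting, then simhash and Jaccard comparison) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the 3-gram shingling of the tag sequence, the 64-bit fingerprint width, the use of MD5 as the per-feature hash, and all function names here are assumptions chosen for illustration.

```python
import hashlib
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening-tag names in document order; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening-tag names found in an HTML mail body."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def shingles(tags, k=3):
    """Set of k-grams over the tag sequence (k=3 is an assumed parameter)."""
    if len(tags) < k:
        return {tuple(tags)} if tags else set()
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simhash(features, bits=64):
    """Simhash fingerprint: per-bit majority vote over the hashed features."""
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(" ".join(feature).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(x, y):
    """Number of differing bits between two fingerprints."""
    return bin(x ^ y).count("1")
```

In this sketch, two mails generated from the same template but with different wording produce the same tag sequence, so their Jaccard similarity is 1.0 and their fingerprints have Hamming distance 0. A small Hamming distance between fingerprints can serve as a cheap first filter before the exact Jaccard comparison; any thresholds would need to be tuned on real data.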

Keywords (English)

Spam; Clustering; Document similarity
