基於內容結構摘要與社群網路之漸進式電子垃圾郵件偵測技術

電子郵件是現今不可或缺的一種訊息傳遞方式，但垃圾郵件的問題卻始終困擾著使用者且持續惡化。垃圾郵件偵測最主要的挑戰在於因為有利益可圖或為了達到其他目的，垃圾郵件散播者總是會發展出新穎的技巧來攻擊過濾系統。這種對抗和進化的特性使得靜態的過濾系統很難長期維持不錯的偵測表現。在本論文中我們將探討如何利用關鍵且恆久的垃圾郵件散佈特徵來設計更健全的偵測系統。並且，我們在演算法設計上考慮了漸進式更新的能力，而這個議題對於垃圾郵件偵測問題極重要但在過去文獻中較少被討論。根據觀察，擁有相同或類似內容的垃圾郵件會在短時間內大量且連續地出現。基於這個觀察，近複製(near-duplicate)相似比對的合作式過濾機制在近幾年廣泛被討論。這個機制的主要概念是收集使用者回報的已知垃圾郵件來建立資料庫，再利用此資料庫來阻擋隨後出現的其他內容相似垃圾郵件。為了能更加抵抗垃圾郵件技巧進化的特性，我們首先提出了針對郵件內容結構來做摘要的方法。這個新穎的郵件摘要表示方式能夠更有效地掌握垃圾郵件近複製的特性。並且，我們設計了一套完整的垃圾郵件偵測系統：Cosdes。這個系統不但能有效率地執行近複製相似比對的程序，還具備漸進式更新的能力。漸進式更新的機制使此系統能擁有最新回報的垃圾郵件來提高偵測的效能。另一個本論文的主題是探討基於電子郵件社群網路來設計垃圾郵件偵測演算法。這一類方法的動機在於垃圾郵件散播具備特定的模式，且和一般使用者的行為有很大的差異。相關的過去文獻主要存在以下兩個缺點。(1)在不同的郵件社群網路環境中系統無法保持穩定的偵測效能。(2)系統並未考慮在發展中網路的更新議題。為了改善這些問題，我們設計了一套名為MailNET的系統。此系統藉由社群網路中每一使用者所取出的數個特徵值來訓練一個支持向量機分類模型(SVM Classification Model)，再利用此模型來偵測垃圾郵件。而我們更提出了增量更新的機制使得當有新的一批郵件被加入網路時，系統能夠有效率地重新訓練分類模型。另一方面，我們亦設計了一個冪疊代演算法(power iteration algorithm)來對郵件社群網路中的使用者評分，再依據這個聲譽分數區別出社群網路中的垃圾郵件散播者。由於我們的目的並不是要根據分數來對使用者排名，而是要利用一門檻界線分隔開垃圾郵件散播者與一般使用者。因此，並不需要太精準的評分機制就足夠達到此目的。我們提出了一套名為ProMail的系統，此系統的設計建構在一個能加速收斂速度的PageRank-like演算法。而這個演算法同時具備漸進式更新的特性，使得系統能夠在有新的郵件加入或有老舊郵件被刪除的情形下，有效率地更新部分有關使用者的聲譽分數。為了更確實驗證所提出方法之可行性，我們使用了大學規模的真實郵件串流資料來做實驗評估。實驗結果顯示所設計的Cosdes，MailNET，以及ProMail系統在實際環境中能達到不錯之偵測結果，而各系統之漸進式更新機制亦具備有效性。

關鍵字

電子垃圾郵件偵測；漸進式更新；近複製相似比對；社群網路監督式學習模型

並列摘要

Email communication is indispensable nowadays, but the email spam problem continues growing drastically. The major challenge of this problem is that spammers will always develop new sophisticated approaches to attack spam filters. This adversarial and evolving nature makes a static spam filter difficult to constantly retain high detection performance. In this dissertation, we study how to design more robust systems based on the essential and enduring spam-sending characteristics. Moreover, we consider the progressive update issue that is significant but less discussed in the literature. Based on the observation that spams with identical or similar content are usually sent in large quantities and successively, the notion of collaborative spam filtering with near-duplicate similarity matching scheme has been widely discussed. The primary idea of the similarity matching scheme for spam detection is to maintain a known spam database, formed by user feedback, to block subsequent near-duplicate spams. To better catch the evolving nature of spams, we propose a novel email abstraction scheme, which considers using email layout structure to represent emails. This newly-devised abstraction can more effectively capture the near-duplicate phenomenon of spams. Moreover, we design a complete spam detection system Cosdes (standing for COllaborative Spam DEtection System), which possesses an efficient near-duplicate matching scheme and a progressive update scheme. The progressive update scheme enables system Cosdes to keep the most up-to-date information for near-duplicate detection. On the other hand, motivated by the fact that spammers are prone to have unusual behavior and specific patterns of email communication, another central topic of this dissertation is to explore email social networks to detect spams. Previous works related to this topic generally suffer from two problems: (1) the system is not robust in diverse environments, and (2) no update scheme is provided to catch the feature changes of evolving networks. To remedy these problems, we propose an incremental support vector machine (SVM) model for spam detection on dynamic email social networks. A complete spam detection system MailNET is devised to better adjust to diverse networks. Several features of each user in the network are extracted to train an SVM model. Moreover, to catch the evolving nature of email communication, we employ an incremental update scheme that enables MailNET to efficiently re-train an approximate SVM model when a set of new emails added into the network. In addition, we also examine the feasibility of distinguishing spam nodes from normal users in email social networks by a power iteration algorithm. This algorithm generates a reputation score for each node to determine the possibility of being a spammer. Since we do not intend to produce a ranking list but to separate suspicious nodes from normal ones, relaxed constraints are introduced to expedite the convergence of the proposed PageRank-like algorithm. On the basis of this algorithm, we design a spam detection system ProMail that models email communication as a network and calculates a reputation score for each node. Furthermore, to capture the dynamic nature of email interactions, a progressive update scheme is proposed to not only include newly arrived emails but also delete obsolete ones. The designed power iteration algorithm has the progressive update capability, and thus can update the reputation scores of associated nodes. We conduct extensive experiments over our studies. To better simulate the real email environment, we use university-scale email streams as the evaluation datasets. The experimental results show that the designed systems Cosdes, MailNET, and ProMail are effective and can be applicable to the real-world environment.

並列關鍵字

Email Spam Detection ； Progressive Update ； Near-duplicate Matching ； Social Networks ； Supervised Learning Model

參考文獻

[4] P. O. Boykin and V. Roychowdhury. Leveraging Social Networks to Fight Spam. Proc. of the IEEE Computer, 2005.

[5] C.-C. Chang and C.-J. Lin. Libsvm: a library for support vector machines. 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] S. Chhabra, W. S. Yerazunis, and C. Siefkes. Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas. Proc. of the 4th IEEE International Conference on Data Mining (ICDM), pages 347–350, 2004.

[12] E. Damiani, S. D. C. di Vimercati, S. Paraboschi, and P. Samarati. An Open Digest-based Technique for Spam Detection. Proc. of the 2004 International Workshop on Security in Parallel and Distributed Systems, pages 559–564, 2004.

[16] P. Desikan, N. Pathak, J. Srivastava, and V. Kumar. Divide and Conquer Approach for Efficient PageRank Computation. Proc. of the 6th International Conference on Web Engineering, pages 233–240, 2006.

國際替代計量

基於內容結構摘要與社群網路之漸進式電子垃圾郵件偵測技術

全文下載

主題瀏覽