Using Characteristic Words Analysis and PSO Support Vector Machines for Spam Filter

Advisor: 白炳豐


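In the knowledge-economy era of rapid information growth, people today often suffer from too much information, in clear contrast to the past, when people suffered from too little. As information is produced in large volumes, people have also come to recognize that the amount of information one person can process is limited, and that excess information causes confusion or muddles the information needed for decisions. Data mining techniques were developed to address this problem, in areas such as spam identification, financial risk assessment, medical diagnosis, and customer relationship management. Applying data mining techniques helps each of these fields carry out appropriate analysis.

In recent years spam has become an important information-processing problem. For ordinary users, spam gets in the way of using e-mail; for example, when legitimate mail and spam are mixed together, messages are easily deleted by mistake or important information is missed. For businesses, spam sharply raises information-processing costs, and missed messages can cause operational losses. Various anti-spam techniques have therefore been proposed, such as k-means, the back-propagation neural network (BP), decision trees, the Bayesian approach, and the support vector machine (SVM). Most current research on spam classification focuses on English-language spam, and little addresses Chinese spam. This study therefore takes Chinese spam as its target and examines the application of data mining techniques to Chinese e-mail.

This study compares four classification models for filtering Chinese spam data: rough set theory, the back-propagation neural network, the Taguchi support vector machine, and the particle swarm optimization support vector machine. Each is also compared in combination with characteristic-word extraction techniques. The study has two main aims. First, it compares how the method of extracting characteristic words affects data mining accuracy; in other words, if the extracted characteristic words are representative for classifying mail, accuracy should improve. The first step therefore compares the results obtained with different characteristic-word extraction methods. Second, it examines parameter optimization for the support vector machine, and applies the ChiMerge method with each technique to discretize the data and select attributes, comparing the effect of discretization and subsequent attribute selection on accuracy. The study thus jointly examines the effect of term extraction techniques on data mining techniques and improvements to the data mining techniques themselves.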


In recent years, people have suffered from having too much information, in contrast to the past, when they worried about having too little. The amount of information one person can handle is limited, and when the incoming information exceeds that limit, mistakes become likely. Data mining techniques are therefore needed, and spam filtering is one of their applications. Spam causes trouble for e-mail users: when normal and junk mail are mixed together, users easily miss important information or delete the wrong messages. In business, the cost of information processing, especially of dealing with junk mail, increases. Many anti-spam techniques have therefore been proposed, including k-means, the back-propagation network (BP), decision trees, the Bayesian approach, and the support vector machine (SVM). Few previous studies of spam filtering have addressed Chinese e-mail, compared with English e-mail, so this study focuses on filtering Chinese spam. It compares four data mining techniques: rough set theory (RST), the back-propagation neural network, SVM combined with the Taguchi method, and SVM combined with particle swarm optimization (PSO). It also combines these techniques with discretization and feature selection. In short, the study has two key points. First, the selection of key words influences the accuracy of the data mining method; in other words, the key words must be distinguishing features that can stand for the original e-mail. Second, the study discusses the optimization of SVM parameters and the influence of using the ChiMerge algorithm for discretization.
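
As a rough illustration of the ChiMerge discretization mentioned in the abstract, the sketch below merges adjacent value intervals bottom-up while their class distributions are statistically similar. It is a minimal sketch, not the thesis's implementation: the function names `chi2_adjacent` and `chimerge`, the toy feature values, the class labels, and the threshold are all assumptions made for illustration, and cells with zero expected count are simply skipped.

```python
from collections import Counter


def chi2_adjacent(counts_a, counts_b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    classes = set(counts_a) | set(counts_b)
    n = sum(counts_a.values()) + sum(counts_b.values())
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row.values())
        for c in classes:
            col_total = counts_a.get(c, 0) + counts_b.get(c, 0)
            expected = row_total * col_total / n
            # Assumption: skip cells whose expected count is zero.
            if expected > 0:
                chi2 += (row.get(c, 0) - expected) ** 2 / expected
    return chi2


def chimerge(values, labels, threshold):
    """Return the lower bounds of the intervals left after merging.

    Starts with one interval per distinct value and repeatedly merges the
    adjacent pair with the smallest chi-square, stopping once every
    adjacent pair scores at least `threshold`.
    """
    grouped = {}
    for v, y in zip(values, labels):
        grouped.setdefault(v, Counter())[y] += 1
    intervals = [(v, grouped[v]) for v in sorted(grouped)]

    while len(intervals) > 1:
        scores = [
            chi2_adjacent(intervals[i][1], intervals[i + 1][1])
            for i in range(len(intervals) - 1)
        ]
        i = min(range(len(scores)), key=scores.__getitem__)
        if scores[i] >= threshold:
            break
        lower, counts = intervals[i]
        intervals[i:i + 2] = [(lower, counts + intervals[i + 1][1])]
    return [lower for lower, _ in intervals]


# Hypothetical characteristic-word counts and mail labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
# 2.706 is the chi-square critical value for 1 degree of freedom at the 0.10 level.
print(chimerge(values, labels, threshold=2.706))  # [1, 10]
```

With this toy data the six single-value intervals collapse into two, one starting at 1 and one starting at 10, because only the boundary between the two classes produces a chi-square above the threshold.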


