
垃圾郵件過濾:資料採礦與中文斷詞技術之應用

Spam Filtering: Application of Data Mining and Chinese Word Segmentation Technique

Advisor: 陳景祥

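Abstract


In countries that have not yet enacted legislation explicitly regulating spam e-mail, using technology to block spam is the primary means of self-protection for most Internet users. Many blocking methods exist, and the techniques have been renewed continually in recent years, but few achieve a 100% blocking rate. This study proposes an effective spam-filtering method. A PHP web program extracts e-mail features, and a Chinese word segmentation system analyzes factors such as Chinese word frequency, word order, and part of speech. These features, together with a “gray region” mail category added as a new output variable, are fed into the study's mail classification system, which uses two data mining tools, the C4.5 decision tree and the probabilistic neural network, to compare classification performance and total risk cost for Chinese e-mail. The results show that, with the C4.5 decision tree, adding word frequency and word-order percentages as input variables raises the rate at which spam is correctly identified. With the probabilistic neural network, adding part-of-speech features as input variables raises the rate at which legitimate mail is correctly identified. Adding the “gray region” category as an output variable markedly improves spam precision and recall, most of which reach 98.5% or higher, and markedly lowers the total risk cost.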
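The probabilistic neural network named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the two input features (hypothetical noun and verb shares per message), the toy training values, and the smoothing parameter `sigma` are all assumptions made for illustration. The sketch only shows the general idea of averaging Gaussian kernels per class and choosing the class with the highest average.

```python
import math


def pnn_scores(x, training, sigma=0.5):
    """Return {label: average Gaussian kernel value} for feature vector x."""
    sums, counts = {}, {}
    for features, label in training:
        # Squared Euclidean distance between x and one training pattern.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, features))
        kernel = math.exp(-sq_dist / (2.0 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + kernel
        counts[label] = counts.get(label, 0) + 1
    # Average per class so that class sizes do not dominate the score.
    return {label: sums[label] / counts[label] for label in sums}


def pnn_classify(x, training, sigma=0.5):
    """Pick the class whose averaged kernel score is largest."""
    scores = pnn_scores(x, training, sigma)
    return max(scores, key=scores.get)


# Hypothetical toy data: (noun share, verb share) per message. Values are made up.
TRAINING = [
    ((0.70, 0.10), "spam"),
    ((0.65, 0.15), "spam"),
    ((0.30, 0.40), "legitimate"),
    ((0.35, 0.35), "legitimate"),
]

print(pnn_classify((0.68, 0.12), TRAINING))
```

A smaller `sigma` makes the classifier follow individual training messages more closely, while a larger one smooths the class estimates; how the thesis chose this parameter is not stated in the abstract.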

Parallel Abstract


In countries without established laws on spam e-mail, spam-filtering technologies are adopted to block it. These technologies take many forms and have improved steadily, but none of them filters out spam completely. This study proposes an effective method of spam filtering. Using a PHP program to extract the features of e-mail messages, we apply two data mining techniques, the C4.5 decision tree and the probabilistic neural network (PNN) classifier, to e-mail classification. We also apply a Chinese word segmentation system to compute the frequency, order, and part of speech of Chinese words. A “gray region” is added as a new output category. Our results show that the C4.5 method, combined with the frequency and order percentages of Chinese words, improves the accuracy of spam classification, while the PNN method, combined with part-of-speech percentages, improves the accuracy of legitimate-mail classification. With the addition of the “gray region” output category, spam precision and recall both increase significantly, with most rates exceeding 98.5%, and the misclassification cost is reduced.
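The roles of the “gray region” category, spam precision and recall, and misclassification cost can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the thresholds `lower` and `upper`, the cost weights, the choice to leave gray-region messages out of the rates, and the sample data are all hypothetical.

```python
def triage(spam_probability, lower=0.3, upper=0.7):
    """Map a spam probability to 'spam', 'legitimate', or 'gray'.

    The thresholds are hypothetical; the abstract does not state them.
    """
    if spam_probability >= upper:
        return "spam"
    if spam_probability <= lower:
        return "legitimate"
    return "gray"  # too uncertain: set aside, for example for manual review


def evaluate(actual, predicted, cost_fp=9.0, cost_fn=1.0):
    """Spam precision, spam recall, gray count, and a weighted error cost.

    Assumption: gray-region messages count as neither hits nor errors.
    The cost weights are illustrative, not taken from the study.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == "spam" and p == "spam")
    fp = sum(1 for a, p in pairs if a == "legitimate" and p == "spam")
    fn = sum(1 for a, p in pairs if a == "spam" and p == "legitimate")
    gray = sum(1 for _, p in pairs if p == "gray")
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "gray": gray,
        "total_cost": cost_fp * fp + cost_fn * fn,
    }


# Made-up classifier outputs and true labels for six messages.
PROBABILITIES = [0.95, 0.80, 0.55, 0.10, 0.45, 0.75]
ACTUAL = ["spam", "spam", "spam", "legitimate", "legitimate", "legitimate"]
PREDICTED = [triage(p) for p in PROBABILITIES]
REPORT = evaluate(ACTUAL, PREDICTED)
print(PREDICTED)
print(REPORT)
```

Weighting a blocked legitimate message more heavily than a missed spam message reflects a common convention in spam-filter evaluation; whether the study's risk cost uses the same weighting is not stated in the abstract.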
