中文垃圾郵件客製化過濾系統之研究

A Study of Customizable Chinese Spam E-mails Filtering System

Advisor: 陳景祥

Abstract


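Sending and receiving e-mail has become one of the main means of communication for people today, and the sharp increase in advertising e-mail means that our inboxes often fill up with messages before we notice. In the past, advertising e-mail was uniformly classed as spam. However, in a survey conducted by Taiwan ALS between June 28 and July 28, 2006, 27.4% of respondents said they had actually completed a purchase after receiving an advertising e-mail. Some advertising e-mails therefore do give users information and help they need, while others cause annoyance and waste time. Customized classification of e-mail is thus the main subject of this study.

This thesis builds an e-mail classification system around two machine learning methods, the C4.5 decision tree and the probabilistic neural network. E-mail classifiers generally select keywords by how frequently they occur, but many keywords chosen this way do not truly represent the category of mail they come from. This study therefore applies CKIP Chinese word segmentation and computes TF-IDF to extract keywords that genuinely characterize each category of e-mail, and combines them with 14 sending features as the criteria for classifying mail.

The study divides advertising e-mail into nine customized categories and evaluates five indicators together: overall accuracy, normal-mail precision, normal-mail recall (detection rate), customized-mail precision, and customized-mail recall. The results show that the approach also performs well in tests on personal everyday mail.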
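The keyword-selection idea in the abstract, ranking terms by TF-IDF rather than by raw frequency, can be sketched as follows. This is only an illustration under stated assumptions: the messages are assumed to be already segmented into tokens (the thesis uses CKIP for this step, which is not reproduced here), the function name `tfidf_keywords`, the category names, and the sample tokens are hypothetical, and the exact TF-IDF weighting used in the thesis is not given in the abstract, so a common variant (pooled term frequency per category times the log inverse document frequency over all messages) is used.

```python
import math
from collections import Counter


def tfidf_keywords(docs_by_category, top_k=3):
    """Rank terms per category by TF-IDF (hypothetical helper).

    docs_by_category maps a category name to a list of messages,
    each message being a list of already-segmented tokens.
    """
    all_docs = [tokens for docs in docs_by_category.values() for tokens in docs]
    n_docs = len(all_docs)

    # Document frequency: in how many messages does each term occur?
    df = Counter()
    for tokens in all_docs:
        df.update(set(tokens))

    keywords = {}
    for category, docs in docs_by_category.items():
        # Term frequency pooled over all messages of this category.
        tf = Counter(t for tokens in docs for t in tokens)
        total = sum(tf.values())
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Highest score first; ties broken by the term itself for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[category] = ranked[:top_k]
    return keywords


# Made-up sample: "優惠" (discount) is frequent in every message,
# so it carries no information about the category.
sample = {
    "travel": [["優惠", "旅遊", "機票"], ["優惠", "旅遊"], ["優惠", "行程", "旅遊"]],
    "finance": [["優惠", "貸款", "利率"], ["優惠", "貸款"], ["優惠", "信用卡", "貸款"]],
}
keywords = tfidf_keywords(sample)
print(keywords)
```

In this sample "優惠" is as frequent as "旅遊" within the travel messages, so a purely frequency-based selection would rank them together, whereas its inverse document frequency is zero and TF-IDF drops it from the keyword list.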

Parallel Abstract


E-mail has become a very popular mode of communication in the modern world; however, with the rapid growth of e-mail advertising, recipients often receive commercial e-mails that are unsolicited and sent in bulk. In the past, all unsolicited commercial e-mail was automatically categorized as spam. A survey conducted by Taiwan ALS from June 28 to July 28, 2006 shows that 27.4% of respondents had bought products through commercial e-mails. Some commercial e-mails therefore do provide recipients with useful information and assistance, while others are annoying and waste time; customizable e-mail classification is thus the main theme of this research. Two machine learning methods, the C4.5 decision tree and the Probabilistic Neural Network (PNN), form the core of the e-mail classification system. Keywords used to categorize e-mails are usually chosen by their frequency of occurrence, but many such keywords do not really represent the e-mails of their categories. This research uses CKIP Chinese word segmentation and TF-IDF to extract the keywords that actually represent each category of e-mail, together with 14 sending characteristics, as the rules for categorizing e-mails. The research divides commercial e-mails into nine customizable categories and comprehensively evaluates five indexes: overall accuracy, normal e-mail precision, normal e-mail recall (detection rate), customizable e-mail precision, and customizable e-mail recall (detection rate). The results show that the system also performs well in tests on personal everyday e-mail.
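The five evaluation indexes named above can be illustrated with a short sketch. The abstract does not define them formally, so the definitions here are assumptions: overall accuracy is taken as the share of messages given exactly the right label, and precision and recall are computed for two groups, "normal" mail and "customized" mail (any of the advertising categories, pooled together). The function name `evaluate`, the label strings, and the sample data are hypothetical.

```python
def evaluate(y_true, y_pred, normal_label="normal"):
    """Compute five indexes from true and predicted labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))

    def precision_recall(in_group):
        # A hit is a message that belongs to the group and is predicted into it.
        hits = sum(1 for t, p in pairs if in_group(t) and in_group(p))
        predicted = sum(1 for _, p in pairs if in_group(p))
        actual = sum(1 for t, _ in pairs if in_group(t))
        precision = hits / predicted if predicted else 0.0
        recall = hits / actual if actual else 0.0
        return precision, recall

    normal_p, normal_r = precision_recall(lambda label: label == normal_label)
    custom_p, custom_r = precision_recall(lambda label: label != normal_label)
    return {
        "overall_accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "normal_precision": normal_p,
        "normal_recall": normal_r,
        "customized_precision": custom_p,
        "customized_recall": custom_r,
    }


# Made-up labels for six messages: three normal, three advertising.
y_true = ["normal", "normal", "normal", "travel", "travel", "finance"]
y_pred = ["normal", "normal", "travel", "travel", "finance", "finance"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Pooling the advertising categories means a travel message predicted as finance still counts as a customized-mail hit for precision and recall, while it lowers overall accuracy; a per-category breakdown would be an equally plausible reading of the abstract.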

Cited by


吳登揚 (2017). 基於不同主題的中文情感分析比較 [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2017.01083
劉炅函 (2017). 中文情感分析應用於PTT之研究 [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2017.00019
江奕 (2013). 資料探勘技術應用於病患存活狀態之預測 [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2013.00063
林敬凱 (2012). 類神經網路於財務危機預測模式之應用:時間預測變數的比較 [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2012.00438
沈彥廷 (2012). 資料複雜度指標對資料探勘分類技術的影響 [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2012.00231
