Neural network classifiers are known to be effective for document categorization, and studies abroad have shown that a neural network can serve as the core of a spam filter. However, no published work has yet applied neural networks to spam identification for mail with Chinese content, so this study tests experimentally whether the approach is feasible. Chinese documents must be preprocessed before a neural network can classify them, and extracting keywords (terms) during preprocessing is difficult because Chinese text is unstructured. Because this study treats mail as a kind of document, keyword extraction is an important part of its preprocessing; the study therefore extracts different numbers of keywords and examines whether the number of keywords affects spam identification. Viewed as a document categorization problem, spam itself falls into different categories. The study therefore also examines whether the number of output categories affects identification, comparing a binary scheme (spam versus regular mail) with an eight-category scheme made up of the seven spam categories found in the test data plus regular mail.

The study adjusts three parameters, namely the number of keywords, the number of neural network nodes, and the number of output categories, in search of a local optimum. The experiments show that when the extracted keywords adequately represent spam, the spam identification rate rises and the rate at which regular mail is misjudged falls. The appropriate number of nodes varies with the complexity of the training data; in these experiments the identification rate was highest with five nodes. The experiments also show that the output categories of the neural network must match the actual categories of the data: spam was divided into seven categories, with non-spam added as an eighth, and the spam identification rate was highest when the network was set to eight output categories. With neural network classification, the SF1 for spam reached 0.82, demonstrating that neural networks are effective for identifying Chinese spam.
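To make the reported SF1 figure concrete, the following minimal Python sketch shows one way such a score could be computed from eight-category predictions. It assumes that SF1 denotes the F1 score of the spam class, that label 0 stands for regular mail and labels 1 to 7 for the seven spam categories, and that a spam message assigned to the wrong spam category still counts as identified spam. None of these details are stated in the abstract, and the names `HAM`, `is_spam`, and `spam_f1` are hypothetical.

```python
from typing import Sequence

# Assumption: label 0 is regular mail, labels 1..7 are the seven spam categories.
HAM = 0


def is_spam(label: int) -> bool:
    """Collapse an eight-category label into spam (True) or regular mail (False)."""
    return label != HAM


def spam_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score of the spam class after collapsing eight categories into two."""
    tp = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        truth_spam, pred_spam = is_spam(truth), is_spam(pred)
        if truth_spam and pred_spam:
            tp += 1  # spam recognised as spam (any spam category)
        elif pred_spam:
            fp += 1  # regular mail misjudged as spam
        elif truth_spam:
            fn += 1  # spam that slipped through as regular mail
    if tp == 0:
        return 0.0  # avoids division by zero when no spam was caught
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only, not data from the study.
y_true = [0, 0, 1, 2, 3, 7]
y_pred = [0, 4, 1, 5, 0, 7]
score = spam_f1(y_true, y_pred)
```

Under this convention, the binary and eight-category configurations can be scored on the same footing, since both are reduced to a spam versus regular mail decision before precision and recall are computed.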