An E-mail Classification Algorithm based on Stacking Integrated Learning

Abstract


Text filtering in traditional anti-spam systems relies mainly on keyword matching and text fingerprint analysis, which makes it difficult to identify and classify spam accurately. This paper therefore proposes an ensemble learning algorithm based on stacking. The algorithm first takes manually labeled text data from each category as samples and uses TF-IDF to build the word vector space model. It then uses linear SVC, XGBoost, and logistic regression as the base classifiers and a random forest as the meta classifier, combining them through stacking to form the classification model. The model sorts e-mail into five categories: illegal, advertisement, news, bill, and recruitment. In the simulation results, the AUC values of the stacking classifier for the five categories are 0.92, 0.95, 1.00, 0.93, and 0.97, and the AP values are 0.86, 0.88, 1.00, 0.88, and 0.94, respectively, indicating strong classification performance across all five categories.
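The pipeline described in the abstract can be sketched roughly as follows with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy vocabulary and the `make_corpus` helper are hypothetical stand-ins for the manually labeled corpus, all hyperparameters are arbitrary, and scikit-learn's `GradientBoostingClassifier` is swapped in for XGBoost so that the sketch depends on scikit-learn only (`xgboost.XGBClassifier` could take its place where that package is installed). Per-category AUC and AP are computed one-vs-rest, which is assumed to match how the abstract reports them.

```python
import random

from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy vocabulary per category; the paper's corpus is not available here.
VOCAB = {
    "illegal": ["casino", "gambling", "counterfeit", "forged", "smuggled", "betting"],
    "advertisement": ["discount", "sale", "coupon", "promotion", "offer", "bargain"],
    "news": ["election", "government", "report", "weather", "minister", "headline"],
    "bill": ["invoice", "payment", "balance", "statement", "overdue", "amount"],
    "recruitment": ["hiring", "resume", "interview", "vacancy", "salary", "position"],
}
CATEGORIES = list(VOCAB)


def make_corpus(docs_per_class, seed):
    """Generate synthetic e-mails by sampling words from each category's vocabulary."""
    rng = random.Random(seed)
    texts, labels = [], []
    for label, name in enumerate(CATEGORIES):
        for _ in range(docs_per_class):
            texts.append(" ".join(rng.choices(VOCAB[name], k=8)))
            labels.append(label)
    return texts, labels


train_texts, train_labels = make_corpus(docs_per_class=20, seed=0)
test_texts, test_labels = make_corpus(docs_per_class=5, seed=1)

# Base classifiers named in the abstract (gradient boosting stands in for XGBoost).
base_learners = [
    ("svc", LinearSVC()),
    ("gb", GradientBoostingClassifier(n_estimators=30, random_state=0)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# The random forest meta classifier is trained on cross-validated
# outputs of the base classifiers.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    cv=3,
)

# TF-IDF features feed the stacked model.
model = make_pipeline(TfidfVectorizer(), stack)
model.fit(train_texts, train_labels)

# One-vs-rest AUC and AP for each category on the held-out toy e-mails.
proba = model.predict_proba(test_texts)
for k, name in enumerate(CATEGORIES):
    is_k = [int(y == k) for y in test_labels]
    auc = roc_auc_score(is_k, proba[:, k])
    ap = average_precision_score(is_k, proba[:, k])
    print(f"{name}: AUC={auc:.2f} AP={ap:.2f}")
```

Because the toy categories use disjoint vocabularies, the scores printed here say nothing about the figures reported in the abstract; the sketch only shows how the components fit together.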
