A Study of E-mail Text Classification at a Case Company

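This experiment represents text as vectors using Word2vec and bag-of-words (TF-IDF), and combines these representations with four traditional machine learning classifiers, support vector machine (SVM), k-nearest neighbors (KNN), gradient boosting decision tree (GBDT), and random forest, and one deep learning model, long short-term memory (LSTM), to classify the e-mail texts collected from the case company into two classes (requires sign-off, does not require sign-off). The results produced by each combination of vector representation and classifier are then examined and compared. Based on the experimental results, the combination that best fits the case company's current situation is recommended to the company: the Word2vec vector representation paired with the SVM algorithm.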

Keywords

E-mail classification; machine learning; TF-IDF; Word2vec

Parallel Abstract

This study represents e-mail text as vectors using Word2vec and bag-of-words (TF-IDF), and combines these representations with four machine learning algorithms (SVM, KNN, GBDT, and Random Forest) as well as one deep learning model, LSTM. We use these tools to classify the e-mail text (security and normal), and then examine and compare the results of each combination of vector representation and classifier. Based on the results, we recommend the combination of Word2vec and the SVM algorithm to the company.
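As a rough illustration of one representation-plus-classifier pairing named above (bag-of-words TF-IDF with KNN), the sketch below uses only the Python standard library. It is not the study's implementation: the toy e-mails, the label names `needs_signoff` and `no_signoff`, the whitespace tokenizer, the smoothed IDF variant, and `k=3` are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def fit_idf(token_lists):
    # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1.
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector (dict), L2-normalized; unseen terms are dropped.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def knn_predict(query, train_vecs, train_labels, k=3):
    # Majority vote among the k training vectors most similar to the query.
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus; the real data are the case company's e-mails.
TRAIN = [
    ("please approve the purchase order for new servers", "needs_signoff"),
    ("contract requires manager approval before signing", "needs_signoff"),
    ("approve budget request for the vendor contract", "needs_signoff"),
    ("team lunch is scheduled for friday noon", "no_signoff"),
    ("reminder the office closes early on friday", "no_signoff"),
    ("weekly newsletter and lunch menu attached", "no_signoff"),
]

train_tokens = [tokenize(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_tokens)
train_vecs = [tfidf_vector(tokens, idf) for tokens in train_tokens]


def classify(text, k=3):
    return knn_predict(tfidf_vector(tokenize(text), idf), train_vecs, train_labels, k)


if __name__ == "__main__":
    print(classify("please approve this vendor purchase"))
    print(classify("lunch on friday"))
```

Swapping the classifier (for example to SVM) or the representation (for example to averaged Word2vec embeddings) while keeping the same train and predict steps gives the kind of combination grid the abstract compares.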