As the Internet reaches into every aspect of daily life, we face increasingly frequent cyberattacks. A large share of these attacks rely on URLs that carry malicious payloads or content, delivered through channels such as phishing and spam mail, which lure unsuspecting victims into opening them. The traditional way to decide whether a URL is harmful is a blacklist, but blacklists can no longer keep up with attackers who change their URLs constantly. This paper combines a blacklist with machine learning to judge whether a URL is harmful. Features are obtained from the URL text by a feature-engineering method that applies the hashing trick to its N-grams and special characters, and an XGBoost (eXtreme Gradient Boosting) model is trained and evaluated on these features. Random Forests (RF), Artificial Neural Network (ANN), and Long Short-Term Memory (LSTM, a recurrent neural network) models are also trained, and the results are analyzed and compared in an attempt to find the best prediction model and a suitable combination of features.
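The feature-extraction step named above (N-grams and special characters mapped through the hashing trick) can be illustrated with a minimal sketch. The bucket count `N_BUCKETS`, the character set `SPECIAL_CHARS`, the n-gram sizes, the use of MD5 as the hash, and all function names are assumptions chosen for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import hashlib
from collections import Counter

N_BUCKETS = 1024                 # assumed length of the hashed n-gram vector
SPECIAL_CHARS = "./?=&-_@%:~+#"  # assumed set of special characters to count


def char_ngrams(text, n):
    """Return every contiguous character n-gram of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bucket_index(token, n_buckets=N_BUCKETS):
    """Map a token to a bucket with a hash that is stable across runs."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def url_features(url, n_values=(1, 2, 3), n_buckets=N_BUCKETS):
    """Build a fixed-length vector: hashed n-gram counts, then special-character counts."""
    vector = [0] * n_buckets
    lowered = url.lower()
    for n in n_values:
        for gram in char_ngrams(lowered, n):
            # Hashing trick: colliding n-grams simply share a bucket.
            vector[bucket_index(gram, n_buckets)] += 1
    char_counts = Counter(url)
    special = [char_counts[c] for c in SPECIAL_CHARS]
    return vector + special


if __name__ == "__main__":
    features = url_features("http://example.com/login?user=a&token=b")
    print(len(features))
```

Because the vector length is fixed regardless of how many distinct n-grams appear, the rows produced this way could be stacked into a matrix and passed to any of the compared classifiers, for example a gradient-boosted tree model.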