Malicious Website Identification Based on URL Features

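As the Internet has developed, users spend more and more time online and visit a growing number and variety of websites, yet malicious websites remain an extremely serious threat to them. Without the user's knowledge, a malicious website uses drive-by download techniques to inject spyware, viruses, or other malware into the user's computer and then carries out malicious activities such as identity theft, blackmail and extortion, and the spreading of viruses or worms. Network security has therefore become a very important issue and a field that merits in-depth research. Traditional methods mainly rely on blacklists and machine learning to identify malicious websites, but they often compute over hundreds or even thousands of features, many of which are not significant. This not only increases the resources consumed by data storage and computation but may also degrade the evaluation results. This study instead uses feature selection to pick out the more significant features, which both reduces the number of features to speed up computation and improves the accuracy of malicious website identification. The study proposes a malicious website identification package that determines whether a website is malicious before the user visits it. The research consists of two steps: (1) collecting malicious websites with a web crawler and building a blacklist of malicious websites and a whitelist of benign websites; (2) performing feature selection and classification on the features associated with malicious websites in order to find the significant features and identify potentially malicious websites. The proposed method uses website features to predict unknown or potentially malicious websites and blocks them before the user visits, protecting the user from harm.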
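To make the notion of URL features concrete, the following is a minimal sketch of how a few lexical features might be computed from a URL string. The feature names and the function `extract_url_features` are hypothetical illustrations chosen here; the abstract does not list the features the study actually uses.

```python
import re
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Compute a few illustrative lexical features of a URL.

    The feature set is an assumption for illustration only; it is not
    the feature set used in the study.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # A host made of four dot-separated number groups looks like a raw IPv4 address.
    looks_like_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_ip_host": int(looks_like_ip),
        "num_path_segments": len([s for s in parsed.path.split("/") if s]),
        "has_query": int(bool(parsed.query)),
        "uses_https": int(parsed.scheme == "https"),
    }


# Example input (a made-up URL, not taken from any real blacklist).
example = extract_url_features("http://192.168.0.1/a/b.exe?x=1")
```

Each URL collected by the crawler would be turned into one such feature vector, labeled by whether it came from the blacklist or the whitelist, before feature selection is applied.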

Keywords

Web crawler; Feature selection; LASSO; Website security

Parallel Abstract

As networks have grown in popularity, people spend more and more time surfing the Internet and visiting various websites. Although websites make it very convenient to obtain information, malicious websites remain a significant threat to Internet users. Malicious websites implant malware into users' computers, without their knowledge, through drive-by download techniques. The infected computers are then used for illegal activities such as identity theft, blackmail and extortion, and the spreading of viruses or worms. Hence, network security has become an important issue and an active research topic in academia. To distinguish between normal and malicious websites, common methods rely on blacklists and machine learning. However, such methods usually use hundreds or even thousands of URL features as training data in the machine learning process, which consumes a large amount of computing resources. Furthermore, some of the URL features (called noise features) tend to decrease the accuracy of malicious website identification and should be identified and excluded from the training data. In this research, we propose methods to identify the important features in order to improve accuracy. First, we collect malicious and normal websites via web crawlers and create a blacklist and a whitelist, respectively. Second, the common URL features are processed with the LASSO method to determine whether they are significant. Finally, the significant features are used to evaluate whether a website is normal or malicious with a support vector machine (SVM). The techniques developed in this research can improve the quality of malicious website identification and can therefore protect users from being harmed by malicious websites.
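The feature selection step can be illustrated with a minimal sketch of LASSO fitted by coordinate descent, where features whose weights shrink to exactly zero are discarded. This is a generic textbook formulation written in plain Python, not the implementation used in the study; the function names, the penalty value `alpha`, the toy data, and the feature names are all assumptions made for illustration. The SVM classification step is not shown.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iters=200):
    """Minimize (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 over w.

    X is a list of rows; features are assumed to be centered already,
    so no intercept is fitted (a simplifying assumption).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iters):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # a constant-zero feature carries no information
            # Correlation of feature j with the residual that excludes
            # feature j's own current contribution.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_w
    return w


def select_features(names, weights, tol=1e-8):
    """Keep the names of features whose LASSO weight is non-zero."""
    return [name for name, wt in zip(names, weights) if abs(wt) > tol]


# Toy data (made up): labels +1 = malicious, -1 = normal.
# The first feature tracks the label; the second is unrelated noise.
feature_names = ["feature_a", "feature_b"]  # hypothetical names
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = select_features(feature_names, weights)
```

On this toy data the noise feature receives a zero weight and is dropped, while the informative feature is kept with a weight shrunk slightly below its least-squares value. In practice the retained columns would then be passed to the classifier.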

Parallel Keywords

Web crawler; Feature selection; LASSO; Website security
