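With the emergence of encrypted packets and new applications, analyzing Internet traffic has become increasingly difficult. We address this problem by proposing three categories of powerful features that machine learning algorithms can use, together with a standard operating procedure (SOP) for building classifiers.

The three categories of features are direction changes, packets before a direction change, and bytes before a direction change. Direction changes describe how often a flow changes direction; packets before a direction change is the number of packets a flow accumulates before it changes direction; bytes before a direction change is the number of bytes a flow accumulates before it changes direction. Comparing the average recall of classifiers trained with our features against classifiers trained with previously proposed features, the neural network improves from 43.34% to 58.29%, random forest from 77.73% to 82.97%, k-nearest neighbors from 55.17% to 73.93%, XGB from 77.62% to 81.91%, the support vector machine from 17.17% to 41.94%, LGB from 80.92% to 85.19%, and the decision tree from 72.03% to 82.53%.

The proposed SOP starts from Pcap files of the Tor network. Before extracting flows from them, we filter out noise and certain specific packets. The extracted flows are then split into shorter flows, and we compute the features of these shorter flows. Before feeding the features to a machine learning algorithm, we also preprocess them. Feeding the processed features to the algorithm is the final step of the SOP. What distinguishes this SOP is its flexibility: how packets are filtered, how flows are split, and how features are processed can all be adjusted. As a result, any machine learning algorithm can use this SOP to train a satisfactory classifier. The average recall obtained in practice is 95.65% for the neural network, 92.72% for random forest, 84.03% for k-nearest neighbors, 93.18% for XGB, 90.49% for the support vector machine, 94.37% for LGB, and 89.43% for the decision tree.

Our contributions are: (1) We propose three categories of powerful features that help machine learning algorithms train classifiers. (2) We develop an SOP for training classifiers of Tor traffic. With the proposed features and SOP, Internet service providers and the Tor network can greatly improve user experience without harming user privacy. We hope this allows Tor to attract more users and thereby become a more secure overlay network. Tor users can remain fully anonymous to adversaries who control or observe only part of the network, including hackers and oppressive governments, and the network as a whole can continue to run smoothly even when Tor carries heavy traffic.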
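To make the three feature categories concrete, the following is a minimal sketch of how they could be computed for a single flow. It is not the thesis implementation: the function name `direction_features`, the packet representation `(direction, size_in_bytes)` with `+1` for outgoing and `-1` for incoming, and the choice to return raw per-run lists (rather than summary statistics over them) are all assumptions made for illustration.

```python
from itertools import groupby


def direction_features(packets):
    """Compute direction-based features for one flow (illustrative sketch).

    packets: list of (direction, size_in_bytes) tuples in arrival order,
             where direction is +1 (outgoing) or -1 (incoming).
             This representation is an assumption, not the thesis format.
    """
    # Group consecutive packets that travel in the same direction into runs.
    runs = []
    for _, group in groupby(packets, key=lambda p: p[0]):
        sizes = [size for _, size in group]
        runs.append((len(sizes), sum(sizes)))

    # Every run except the last one ends with a change of direction.
    completed = runs[:-1]
    return {
        "direction_changes": len(completed),
        "packets_before_change": [count for count, _ in completed],
        "bytes_before_change": [total for _, total in completed],
    }


flow = [(1, 100), (1, 200), (-1, 1500), (-1, 1500), (-1, 500), (1, 60)]
features = direction_features(flow)
print(features)
```

In this example the flow changes direction twice: once after 2 packets (300 bytes) and once after 3 packets (3500 bytes). A classifier would typically consume statistics over these lists, such as their mean or maximum; which statistics to use is left open here.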
Traffic classification on the Internet has always been an important task because of its applications in systems such as Quality of Service (QoS) mechanisms and Security Information and Event Management (SIEM) tools. Over the past few decades, however, traffic classification has become more difficult, because more encrypted packets and more packets from new applications flow through the Internet. One reason for this growth is the increasing use of the Tor network. As people become aware of the potential dangers of surfing the Internet, more of them choose to use the Tor browser. What sets the Tor browser apart from today's prevalent browsers (for example, Chrome and Firefox) is that it provides anonymity to its users. For example, when using Tor we do not have to worry that the websites we browse will save cookies to track our activities. Tor also resists network surveillance: people living under oppressive regimes can use Tor to comment on sensitive topics without being blocked or tracked by their governments. This anonymity appeals to users but makes traffic classification much more difficult. So, if Internet service providers (ISPs) want to provide their customers with fast and safe services, Tor is an overlay network they must keep an eye on. Besides ISPs, if the Tor network itself wants to provide its users with a better environment, it should be capable of classifying the traffic flowing through it. With this in mind, we use machine learning algorithms to classify Tor traffic into eight categories: audio, browsing, chat, file transfer, mail, P2P, video, and VoIP. In this thesis: (1) We propose three categories of powerful features for machine learning algorithms to train classifiers. (2) We develop a standard operating procedure (SOP) for building classifiers for Tor traffic.
With these efforts, we aim to make traffic classification of the Tor network a far less difficult problem and, in turn, to improve the performance of Tor and of the Internet as a whole.
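The flexibility of the SOP (filter packets, extract flows, split them into shorter flows, compute features, preprocess) can be illustrated with a small sketch in which each adjustable step is a pluggable function. Everything here is hypothetical: the names `run_sop` and `split_by_duration`, the packet tuple `(flow_id, timestamp, direction, size)`, the 10-second window, and the zero-size filter are assumptions chosen for illustration, and Pcap parsing is assumed to have happened already.

```python
from collections import defaultdict


def split_by_duration(flow, window=10.0):
    """Split one flow into shorter flows spanning less than `window` seconds.

    Splitting by a fixed time window is an assumed strategy for this sketch.
    """
    pieces, current, start = [], [], None
    for pkt in flow:
        timestamp = pkt[1]
        if start is None or timestamp - start >= window:
            if current:
                pieces.append(current)
            current, start = [], timestamp
        current.append(pkt)
    if current:
        pieces.append(current)
    return pieces


def run_sop(packets, keep_packet, split_flow, compute_features, preprocess):
    """Run the pipeline with caller-supplied (adjustable) steps."""
    # Steps 1 and 2: filter packets, then group the survivors into flows.
    flows = defaultdict(list)
    for pkt in packets:
        if keep_packet(pkt):
            flows[pkt[0]].append(pkt)

    # Steps 3 and 4: split each flow and compute features per shorter flow.
    rows = []
    for flow in flows.values():
        for piece in split_flow(flow):
            rows.append(compute_features(piece))

    # Step 5: preprocess the feature rows before they reach a learner.
    return preprocess(rows)


# Packets are (flow_id, timestamp_seconds, direction, size_bytes).
packets = [
    ("a", 0.0, 1, 100),
    ("a", 1.0, -1, 0),      # zero-size packet, treated as noise in this sketch
    ("a", 4.0, -1, 1500),
    ("b", 5.0, 1, 200),
    ("a", 12.0, 1, 60),
]
rows = run_sop(
    packets,
    keep_packet=lambda p: p[3] > 0,
    split_flow=lambda f: split_by_duration(f, window=10.0),
    compute_features=lambda piece: [len(piece), sum(p[3] for p in piece)],
    preprocess=lambda r: r,  # identity; scaling or selection could go here
)
print(rows)
```

Swapping any of the four callables changes how packets are filtered, how flows are split, or how features are processed without touching the rest of the pipeline, which is the kind of adjustability the SOP is described as offering.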