Applying a Multi-Task Learning Hierarchical Long Short-Term Memory Network to Malware Detection

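As technology advances, more and more people have access to electronic devices. When ordinary users lack awareness of privacy and information security, they unknowingly expose themselves to danger and give malicious users (also called attackers) an opening, for example to steal sensitive data or extort money. Effective malware detection has therefore become an urgent need; its goal is to separate malicious samples from suspicious ones. In recent years, machine learning-based detection methods have grown in popularity, and their high adaptability greatly reduces the time and manpower that traditional methods require. However, the methods adopted in prior work often ignore the relationships between tokens (words). This thesis therefore adopts deep learning techniques to better incorporate the relationships between tokens into the program embedding. This study is the first to apply source code to PE malware detection on the basis of static analysis. Using deep learning, we build a three-level hierarchical long short-term memory (LSTM) architecture that uses function embeddings, function-segment embeddings, and program embeddings to learn a source-code embedding that fully represents a sample. In addition, to train the model better, we adopt multi-task learning and propose an auxiliary task: binary classification of standard call functions. The experimental results verify that the proposed deep learning model outperforms traditional vector space models. Adding the auxiliary task improves the model's Macro F1 score by about 3%. We also run experiments on different auxiliary tasks and data augmentation strategies and evaluate their effectiveness.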
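As a rough illustration of the multi-task setup described above, the sketch below combines a main classification loss with an auxiliary binary loss. The abstract does not state how the two losses are combined, so the weighted sum, the `aux_weight` parameter, the per-function auxiliary labels, and all function names here are assumptions made for illustration only.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the true class for one sample."""
    return -math.log(probs[target])


def binary_cross_entropy(p, target):
    """Binary cross-entropy for one predicted probability p and a 0/1 target."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def joint_loss(main_probs, main_target, aux_probs, aux_targets, aux_weight=0.5):
    """Main-task loss plus a weighted auxiliary loss.

    main_probs / main_target: class probabilities and true class for the program.
    aux_probs / aux_targets: hypothetical per-function probabilities and 0/1
    labels for the auxiliary "standard call" task.
    aux_weight: hypothetical mixing weight; the source gives no value.
    """
    main = cross_entropy(main_probs, main_target)
    aux = sum(
        binary_cross_entropy(p, t) for p, t in zip(aux_probs, aux_targets)
    ) / len(aux_probs)
    return main + aux_weight * aux


# Example: one program with two classes and three functions.
loss = joint_loss([0.7, 0.3], 0, [0.9, 0.2, 0.6], [1, 0, 1])
print(round(loss, 4))
```

Setting `aux_weight` to 0 reduces this to single-task training, which is one simple way to compare a model with and without the auxiliary task.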

Keywords

malware detection; source code; deep learning; hierarchical long short-term memory network; static analysis; multi-task learning

Parallel Abstract

The expanded accessibility of technology also appeals to users with malicious intentions. Some ordinary users are exposed to danger because they lack privacy awareness and a cybersecurity mindset. A robust malware detection method, which aims to differentiate malware from suspicious samples, has therefore become an urgent need. Recently, machine learning-based detection approaches have received great attention. Their high adaptiveness makes them more effective and efficient than traditional detection approaches. However, previous studies often overlooked the relationships between elements (words) in programs. Consequently, we adopt a deep learning approach and develop a more effective malware detection method by learning more contextual information from code sequences. This research is the first study that uses source code to detect PE malware based on a static analysis approach. We exploit a deep learning approach to construct a code embedding for the entire program through a three-level hierarchical Long Short-Term Memory (LSTM) architecture. Specifically, we represent a program based on function embedding, segment embedding, and program embedding. Moreover, we propose an auxiliary task, i.e., standard call binary classification, in a multi-task learning manner to help train the model. The experimental results indicate that our proposed deep learning-based model outperforms traditional vector space models. Furthermore, the macro-averaged F1 score increases by around 3% with the help of the auxiliary task. Additionally, we evaluate the effectiveness of different auxiliary tasks and data augmentation strategies.
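For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a macro-averaged F1 score is computed: F1 is calculated per class and then averaged with equal weight, so minority classes count as much as majority ones. The label names in the example are hypothetical; the abstract does not list the classes used in the experiments.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for illustration only.
y_true = ["malware", "malware", "malware", "benign"]
y_pred = ["malware", "malware", "benign", "benign"]
print(round(macro_f1(y_true, y_pred), 4))
```

In this example the per-class F1 scores are 0.8 for `malware` and 2/3 for `benign`, so the macro average is about 0.7333.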

Parallel Keywords

malware detection; source code; deep learning; hierarchical LSTM; static analysis; multi-task learning

