In recent years, deep learning has achieved remarkable results on natural language processing tasks. However, common everyday applications such as spam filtering and sentiment analysis are vulnerable to adversarial attacks, raising security concerns. This thesis proposes two methods: perturbation detection determines whether text has been attacked through character modification, and a recovery step then restores the modified words to likely substitutes based on context. Against word substitution attacks, the number of samples is increased by replacing important words with several possible substitutes, and the prediction is the majority class among all augmented samples. The proposed methods can defend against adversarial attacks without knowing the model parameters or adjusting the model architecture. Experiments on the IMDb dataset demonstrate that the methods effectively defend against character substitution and word substitution attacks on text classification and outperform the comparison baseline.
In recent years, deep learning models have achieved prominent success on NLP tasks. However, widely used real-world applications such as spam filtering and sentiment analysis are vulnerable to adversarial attacks. This thesis proposes two methods to defend against adversarial attacks on the sentiment analysis task. A perturbation detector determines whether a token in the sample has been perturbed by a character-level attack, and a recovery process then restores the perturbed words to plausible substitutions based on the context. For word-level attacks, inputs are augmented by replacing important words with their possible substitutions, and the prediction for the original sample is the majority class among all augmented samples. Our methods can block adversarial attacks without knowing the model parameters or modifying the model structure. Experiments on the IMDb dataset demonstrate that our methods effectively block both character-level and word-level attacks and outperform the baseline method on the text classification task.
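The word-level defense summarized above, augmenting an input with word substitutions and taking a majority vote over the model's predictions, can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual implementation: the function names, the toy keyword classifier, and the substitution lists are assumptions introduced here for demonstration.

```python
from collections import Counter


def majority_vote_predict(classify, sample, substitutions):
    """Augment `sample` by swapping listed words with candidate
    substitutions, then return the majority class over all variants.

    `classify` stands in for any black-box text classifier; the defense
    needs only its predictions, not its parameters or architecture.
    """
    augmented = [sample]
    for word, candidates in substitutions.items():
        if word in sample:
            for cand in candidates:
                augmented.append(sample.replace(word, cand))
    votes = Counter(classify(text) for text in augmented)
    return votes.most_common(1)[0][0]


# Toy keyword-based classifier standing in for the real sentiment
# model (an assumption for this sketch).
def toy_classify(text):
    return "positive" if ("good" in text or "great" in text) else "negative"


# A character-perturbed input ("g00d") fools the toy classifier on its
# own, but the majority vote over substituted variants recovers the
# intended positive label.
pred = majority_vote_predict(
    toy_classify,
    "the movie was g00d",
    {"g00d": ["good", "great"]},  # hypothetical candidate substitutions
)
```

Because only the classifier's outputs are consulted, this style of defense composes with any deployed model as a wrapper, which is why it requires no access to parameters and no architectural changes.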