Sentiment classification is widely applied in real-world tasks such as product recommendation and opinion-oriented analysis. Unfortunately, the widely deployed sentiment classification systems based on deep neural networks (DNNs) are susceptible to adversarial attacks that inject imperceptible perturbations into legitimate texts, producing so-called adversarial texts. Adversarial texts can cause erroneous outputs even when the attacker has no access to the target model, raising security concerns for systems deployed in safety-critical applications. However, research on defending against adversarial texts is still at an early stage and is not ready to tackle emerging threats, especially unknown attacks. The two mainstream defense ideas are detecting the minor differences between adversarial and legitimate texts and enhancing the robustness of the target model. Both, however, generalize poorly to unknown adversarial attacks. In this paper, we propose a general method, called TextFirewall, for defending against adversarial texts crafted by various adversarial attacks; it also shows potential for identifying newly developed adversarial attacks in the future. Given a piece of text, TextFirewall identifies adversarial text by examining the inconsistency between the target model’s output and the impact value calculated from the important words in the text. TextFirewall can be deployed as a third-party tool without modifying the target model and is agnostic to the specific type of adversarial text. Experimental results demonstrate that TextFirewall effectively identifies adversarial texts generated by three state-of-the-art (SOTA) attacks and outperforms previous defense techniques. Specifically, TextFirewall achieves an average accuracy of 90.7% on IMDB and 96.9% on Yelp in defending against the three SOTA attacks.