Due to the nested nonlinear structure inside neural networks, most existing deep learning models are treated as black boxes, and they are highly vulnerable to adversarial attacks. On the one hand, adversarial examples shed light on the decision‐making process of these opaque models and can be used to interrogate their interpretability. On the other hand, interpretability can serve as a powerful tool for generating adversarial examples by exposing the relative contribution of each input feature to the final prediction. Recently, a post‐hoc explanatory method, layer‐wise relevance propagation (LRP), has shown significant value for instance‐wise explanations. In this paper, we optimize recently proposed explanation‐based attack algorithms (EAAs) on text classification models with LRP. We empirically show that LRP provides good explanations and notably benefits existing EAAs. In addition, we propose an LRP‐based, simple but effective EAA, LRPTricker. LRPTricker uses LRP to identify important words and then applies typo‐based perturbations to these words to generate adversarial texts. Extensive experiments show that LRPTricker significantly reduces the performance of text classification models with minimal perturbations while remaining highly scalable.
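The abstract only outlines LRPTricker at a high level, but the two-step recipe (rank words by LRP relevance, then apply typo-based perturbations to the top-ranked ones) can be sketched as follows. This is a minimal illustration, not the authors' implementation: lrp_word_relevance is a hypothetical stand-in that would, in the real attack, propagate relevance through the victim classifier, and the perturbation budget and typo operations are assumptions for the sake of a runnable example.

```python
import random


def lrp_word_relevance(words):
    """Placeholder for LRP: return one relevance score per word.

    In the attack described in the paper, these scores would come from
    layer-wise relevance propagation through the victim text classifier;
    here we assign random scores so the sketch runs stand-alone.
    """
    return [random.random() for _ in words]


def typo_perturb(word):
    """Apply a single character-level typo (swap, drop, or insert)."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    op = random.choice(["swap", "drop", "insert"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    return word[:i] + random.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]


def lrp_tricker(sentence, budget=2):
    """Perturb the `budget` most relevant words of `sentence`."""
    words = sentence.split()
    relevance = lrp_word_relevance(words)
    # Attack the highest-relevance words first, up to the perturbation budget.
    targets = sorted(range(len(words)), key=lambda i: relevance[i],
                     reverse=True)[:budget]
    for i in targets:
        words[i] = typo_perturb(words[i])
    return " ".join(words)


if __name__ == "__main__":
    print(lrp_tricker("the movie was absolutely wonderful and inspiring"))
```

Because only the few most relevant words are edited, the perturbation stays small while still targeting the features the classifier relies on most, which is the intuition behind the reported effectiveness and scalability.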