Adversarial examples can evade detection by text classification models based on Deep Neural Networks (DNNs), posing a potential security threat to such systems. To address this problem, we propose WordRevert, an adversarial example defense method for Chinese text classification. The method first obtains the "positive text" containing the adversarial words by filtering out the clauses that do not contribute to the current classification label. A detection network, combined with a position importance calculation function, then locates the adversarial words. Finally, the adversarial words are restored to the original words by combining a candidate score with the detection score. Experiments show that this method effectively defends against currently popular Chinese text adversarial attack algorithms: it significantly increases accuracy on adversarial examples with only a small reduction in classification accuracy on clean samples, while achieving strong precision, recall, and F1 scores for adversarial word detection and restoration.
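The three-stage pipeline sketched in the abstract (clause filtering, adversarial word detection, word restoration) can be illustrated in simplified form. The sketch below is purely hypothetical: the toy keyword-based classifier, the leave-one-out contribution scores, and all function names are assumptions standing in for the paper's actual detection network and scoring functions, which the abstract does not specify.

```python
# Hypothetical sketch of a WordRevert-style pipeline; the real method uses a
# trained detection network and learned scores, not this toy classifier.

POSITIVE_KEYWORDS = frozenset({"good", "great"})

def classify_prob(words):
    # Toy stand-in classifier: P(positive label) grows with keyword hits.
    hits = sum(1 for w in words if w in POSITIVE_KEYWORDS)
    return hits / max(len(words), 1)

def clause_contribution(clauses, idx):
    # Leave-one-out contribution: drop in label probability when one clause
    # is removed from the text.
    full = [w for c in clauses for w in c]
    reduced = [w for i, c in enumerate(clauses) if i != idx for w in c]
    return classify_prob(full) - classify_prob(reduced)

def positive_text(clauses, threshold=0.0):
    # Stage 1: keep only clauses that contribute to the current label,
    # yielding the "positive text" that contains the adversarial words.
    return [c for i in range(len(clauses))
            if clause_contribution(clauses, i) > threshold
            for c in [clauses[i]]]

def position_importance(words):
    # Stage 2 (stand-in for the detection network + importance function):
    # importance of each position = probability change when it is masked out.
    base = classify_prob(words)
    return [base - classify_prob(words[:i] + words[i + 1:])
            for i in range(len(words))]

def restore(words, idx, candidates):
    # Stage 3 (stand-in for candidate score + detection score): replace the
    # detected adversarial word with the candidate maximizing label probability.
    best = max(candidates,
               key=lambda c: classify_prob(words[:idx] + [c] + words[idx + 1:]))
    return words[:idx] + [best] + words[idx + 1:]
```

For example, given clauses `[["good", "movie"], ["bad", "film"]]`, only the first clause contributes positively to the label and survives Stage 1; and `restore(["g00d", "movie"], 0, ["good", "grand"])` reverts the perturbed token back to `"good"` under the toy classifier.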