The Multifamily classification of Android malware aims to identify a malicious sample as one of the given malware families. This problem is believed to be much more significant than the… Click to show full abstract
The Multifamily classification of Android malware aims to identify a malicious sample as one of the given malware families. This problem is believed to be much more significant than the binary classification (simply identify a sample as malicious or benign) because it is able to reveal the behaviour patterns of multiple malware families and bring deep insights into the working mechanism of malicious payload. The main challenges of the multifamily classification involve two aspects: recognizing the behaviour patterns of malware families as well as addressing the issues of code obfuscation and polymorphic variants that are commonly used by adversaries to evade rigorous detections. To address these challenges, in this article, we utilize the regular expressions of callbacks to describe the behaviour patterns of malware families, and propose a two-step fuzzy processing strategy to resist potential polymorphic familial variants. The alphabet of such regular expressions only consists of security-sensitive API calls, this enables the regular expressions to resist various kinds of code obfuscation and metamorphism. The proposed fuzzy strategy, applied to the regular expressions, comprises two steps: the first step transforms an original regular expression to such a fuzzy regular expression that possesses a broader meaning than the original one; the second step further relaxes precise plaintext match between two regular expressions to a fuzzy match by introducing the notion of similarity of regular expressions. Applying this strategy promotes the abstract level of a regular expression and enables the behaviour pattern specified by the regular expression to be more resilient to code obfuscation and polymorphic variants. Furthermore, selecting the fuzzy regular expressions as features, we use text mining techniques to train a multifamily 1-NN classifier over 3270 samples of 65 families. The experimental results show that our approach outperforms most of the state-of-the-art approaches and tools, confirming the effectiveness of our approach.
               
Click one of the above tabs to view related content.