Abstract: Machine learning can detect malware variants that evade signature-based detection. Feature hashing is used to convert a file's features into a fixed-length vector. In this paper, we study the appropriate vector size for feature hashing on a large dataset of malware files. Through exhaustive experiments on more than 280,000 real malware and benign files, we find for the first time that the default vector size used in current feature hashing practice is unnecessarily large. We experimentally determine an appropriate vector size, which not only reduces memory usage by 70% but also increases detection accuracy compared with the state-of-the-art scheme.
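To make the role of the vector size concrete, the following is a minimal sketch of feature hashing in general, not the scheme evaluated in the paper. The function name `hash_features`, the parameter `n_buckets`, the choice of MD5 as the hash, and the example feature names are all assumptions made for illustration.

```python
import hashlib


def hash_features(features, n_buckets=1024):
    """Map a dict of {feature_name: value} to a fixed-length vector.

    Illustrative hashing-trick sketch; n_buckets is the vector size
    whose choice the abstract discusses. All names here are hypothetical.
    """
    vec = [0.0] * n_buckets
    for name, value in features.items():
        digest = hashlib.md5(name.encode("utf-8")).digest()
        # The first 8 bytes pick the bucket, so the output length
        # never depends on how many distinct features exist.
        index = int.from_bytes(digest[:8], "little") % n_buckets
        # One further bit picks a sign, so colliding features
        # tend to cancel rather than always accumulate.
        sign = 1.0 if digest[8] & 1 == 0 else -1.0
        vec[index] += sign * value
    return vec


# Hypothetical features extracted from one file.
sample = {"import:CreateFileA": 1.0, "section:.text": 1.0, "string:http": 3.0}
small = hash_features(sample, n_buckets=64)
large = hash_features(sample, n_buckets=4096)
```

Both calls encode the same features, but `large` stores 64 times as many entries as `small`. A smaller vector saves memory at the cost of more hash collisions, which is the trade-off behind choosing the vector size.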
               