Towards effective malware clustering : reducing false negatives through feature weighting and the Lp metric
Cordeiro De Amorim, Renato
In this paper we present a novel method to reduce the incidence of false negatives in the clustering of malware detected during drive-by-download attacks. Our method comprises the use of a high-interaction client honey-pot called Capture-HPC to acquire behavioural system and network data, and the application of clustering analysis. Our method addresses various issues in clustering, including (i) finding the number of clusters in the dataset, (ii) finding good initial centroids, (iii) determining the relevance of each of the features at each cluster. Our method applies partitional clustering based on the Minkowski Weighted K-Means (Lp) and anomalous pattern initialization. We have performed various experiments on a dataset containing the behaviour of 17,000 possibly infected websites gathered from sources of malicious URLs. We find that our method produces a smaller within cluster variance and a lower quantity of false negatives than other popular clustering algorithms such as K-Means and the Ward's method.