Reflections on the NASA MDP data sets
Background: The NASA Metrics Data Program (MDP) data sets have been heavily used in software defect prediction research. Aim: To highlight the data quality issues present in these data sets, and the problems that can arise when they are used in a binary classification context. Method: A thorough exploration of all 13 original NASA data sets, followed by various experiments demonstrating the potential impact of duplicate data points when data mining. Conclusions: One: Researchers need to analyse the data that forms the basis of their findings in the context of how it will be used. Two: The bulk of defect prediction experiments based on the NASA MDP data sets may have led to erroneous findings. This is mainly due to repeated/duplicate data points potentially causing substantial amounts of training and testing data to be identical.