Software Defect Prediction Using Static Code Metrics: Formulating a Methodology
Abstract
Software defect prediction is motivated by the huge costs incurred as a result of
software failures. In an effort to reduce these costs, researchers have used
software metrics to try to build predictive models capable of locating the most
defect-prone parts of a system. These areas can then be subject to some form of
further analysis, such as a manual code review. It is hoped that such defect predictors
will enable software to be produced more cost-effectively, to a higher quality,
or both.
In this dissertation I identify many data quality and methodological issues in
previous defect prediction studies. The main data source is the NASA Metrics
Data Program Repository. The issues discovered with these well-utilised data sets
include many examples of seemingly impossible values, and much redundant data.
The redundant, or repeated, data points are shown to be the cause of potentially
serious data mining problems. Other methodological issues discovered include the
violation of basic data mining principles, and the misleading reporting of classifier
predictive performance.
The issues discovered lead to a new proposed methodology for software defect
prediction. The methodology focuses on data analysis, as this appears to
have been overlooked in many prior studies. The aim of the methodology is to
obtain a realistic estimate of potential real-world predictive performance, and
also to provide simple performance baselines against which to compare the
performance actually achieved. This is important as quantifying predictive performance
appropriately is a difficult task.
The findings of this dissertation raise questions about the current body of
knowledge in defect prediction. So many data-related and/or methodological errors have
previously occurred that it may now be time to revisit the fundamental aspects of
this research area, to determine what we really know, and how we should proceed.
Publication date
2013-06-24
Published version
https://doi.org/10.18745/th.11067