Building Classifiers to Identify Split Files.
Green, P. D.
We apply machine-learning techniques to help automate the mining of software projects' version histories. Analysis of version histories is important in the study of software evolution. One of the associated problems is tracing program elements that have changed or moved as a result of file restructuring. As an initial application, we have developed classifiers to identify one such type of file change, 'split files'. Our process involves extracting features through syntactic analysis of the original source code, and then training and evaluating classifiers against a set of data assessed by visual inspection. We analysed 266K files from 84 open-source projects and filtered them down to a set of candidate files, on which our classifiers achieve either 89% overall accuracy or a false positive rate of 5%.
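The two figures reported above, overall accuracy and false positive rate, are standard measures computed from a classifier's predictions against labelled data. The following is a minimal sketch of how they might be computed, not the authors' implementation. The function name accuracy_and_fpr, the score threshold, and the toy labels and scores are all hypothetical; the abstract does not say which classifiers were used or how the two operating points were obtained.

```python
from typing import Sequence, Tuple


def accuracy_and_fpr(
    labels: Sequence[bool], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (overall accuracy, false positive rate) at a score threshold.

    labels[i] is True when file i was judged a split file by inspection;
    scores[i] is a hypothetical classifier confidence for that file.
    """
    tp = fp = tn = fn = 0
    for is_split, score in zip(labels, scores):
        predicted_split = score >= threshold
        if predicted_split and is_split:
            tp += 1
        elif predicted_split and not is_split:
            fp += 1
        elif not predicted_split and is_split:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # False positive rate: share of true non-split files flagged as split.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Hypothetical toy data, for illustration only (not from the study).
labels = [True, True, True, False, False, False, False, False]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.35, 0.65):
    acc, fpr = accuracy_and_fpr(labels, scores, threshold)
    print(f"threshold={threshold:.2f} accuracy={acc:.3f} fpr={fpr:.3f}")
```

In this toy data, raising the threshold lowers the false positive rate at the cost of missing a true split file. That is one common way a classifier can be tuned toward either higher accuracy or fewer false positives.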