Extracting Group Relationships Within Changing Software Using Text Analysis
Green, Pamela Dilys
This research looks at identifying and classifying changes in evolving software by making simple textual comparisons between groups of source code files. The two areas investigated are software origin analysis and collusion detection. Textual comparison is attractive because it can be used in the same way for many different programming languages. The research includes the first major study using machine learning techniques in the domain of software origin analysis, which looks at the movement of code in an evolving system. The training set for this study, which focuses on restructured files, is created by analysing 89 software systems. Novel features, which capture abstract patterns in the comparisons between source code files, are used to build models which classify restructured files fromunseen systems with a mean accuracy of over 90%. The unseen code is not only in C, the language of the training set, but also in Java and Python, which helps to demonstrate the language independence of the approach. As well as generating features for the machine learning system, textual comparisons between groups of files are used in other ways throughout the system: in filtering to find potentially restructured files, in ranking the possible destinations of the code moved from the restructured files, and as the basis for a new file comparison tool. This tool helps in the demanding task of manually labelling the training data, is valuable to the end user of the system, and is applicable to other file comparison tasks. These same techniques are used to create a new text-based visualisation for use in collusion detection, and to generate a measure which focuses on the unusual similarity between submissions. This measure helps to overcome problems in detecting collusion in data where files are of uneven size, where there is high incidental similarity or where more than one programming language is used. The visualisation highlights interesting similarities between files, making the task of inspecting the texts easier for the user.