Extracting Group Relationships Within Changing Software Using Text Analysis

Green, Pamela Dilys

dc.contributor.author	Green, Pamela Dilys
dc.date.accessioned	2013-10-29T13:00:32Z
dc.date.available	2013-10-29T13:00:32Z
dc.date.issued	2013-10-04
dc.identifier.uri	http://hdl.handle.net/2299/11896
dc.description.abstract	This research looks at identifying and classifying changes in evolving software by making simple textual comparisons between groups of source code files. The two areas investigated are software origin analysis and collusion detection. Textual comparison is attractive because it can be used in the same way for many different programming languages. The research includes the first major study using machine learning techniques in the domain of software origin analysis, which looks at the movement of code in an evolving system. The training set for this study, which focuses on restructured files, is created by analysing 89 software systems. Novel features, which capture abstract patterns in the comparisons between source code files, are used to build models which classify restructured files from unseen systems with a mean accuracy of over 90%. The unseen code is not only in C, the language of the training set, but also in Java and Python, which helps to demonstrate the language independence of the approach. As well as generating features for the machine learning system, textual comparisons between groups of files are used in other ways throughout the system: in filtering to find potentially restructured files, in ranking the possible destinations of the code moved from the restructured files, and as the basis for a new file comparison tool. This tool helps in the demanding task of manually labelling the training data, is valuable to the end user of the system, and is applicable to other file comparison tasks. These same techniques are used to create a new text-based visualisation for use in collusion detection, and to generate a measure which focuses on the unusual similarity between submissions. This measure helps to overcome problems in detecting collusion in data where files are of uneven size, where there is high incidental similarity or where more than one programming language is used. The visualisation highlights interesting similarities between files, making the task of inspecting the texts easier for the user.	en_US
dc.language.iso	en	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	machine learning	en_US
dc.subject	feature generation	en_US
dc.subject	text analysis	en_US
dc.subject	Ferret	en_US
dc.subject	language independence	en_US
dc.subject	software origin analysis	en_US
dc.subject	collusion detection	en_US
dc.subject	source code similarity	en_US
dc.subject	trigram analysis	en_US
dc.subject	evolving software	en_US
dc.subject	3CO	en_US
dc.subject	file comparison tool	en_US
dc.subject	one-to-many comparison	en_US
dc.subject	visualisation	en_US
dc.title	Extracting Group Relationships Within Changing Software Using Text Analysis	en_US
dc.type	info:eu-repo/semantics/doctoralThesis	en_US
dc.identifier.doi	10.18745/th.11896
dc.type.qualificationlevel	Doctoral	en_US
dc.type.qualificationname	PhD	en_US
herts.preservation.rarelyaccessed	true

Files in this item

Name: 02048994 Green Pamela final PhD ...
Size: 27.50Mb
Format: PDF

This item appears in the following Collection(s)

PhD Theses Collection
