dc.contributor.author: Green, Pamela Dilys
dc.date.accessioned: 2013-10-29T13:00:32Z
dc.date.available: 2013-10-29T13:00:32Z
dc.date.issued: 2013-10-04
dc.identifier.uri: http://hdl.handle.net/2299/11896
dc.description.abstract: This research looks at identifying and classifying changes in evolving software by making simple textual comparisons between groups of source code files. The two areas investigated are software origin analysis and collusion detection. Textual comparison is attractive because it can be used in the same way for many different programming languages. The research includes the first major study using machine learning techniques in the domain of software origin analysis, which looks at the movement of code in an evolving system. The training set for this study, which focuses on restructured files, is created by analysing 89 software systems. Novel features, which capture abstract patterns in the comparisons between source code files, are used to build models which classify restructured files from unseen systems with a mean accuracy of over 90%. The unseen code is not only in C, the language of the training set, but also in Java and Python, which helps to demonstrate the language independence of the approach. As well as generating features for the machine learning system, textual comparisons between groups of files are used in other ways throughout the system: in filtering to find potentially restructured files, in ranking the possible destinations of the code moved from the restructured files, and as the basis for a new file comparison tool. This tool helps in the demanding task of manually labelling the training data, is valuable to the end user of the system, and is applicable to other file comparison tasks. These same techniques are used to create a new text-based visualisation for use in collusion detection, and to generate a measure which focuses on the unusual similarity between submissions. This measure helps to overcome problems in detecting collusion in data where files are of uneven size, where there is high incidental similarity or where more than one programming language is used. The visualisation highlights interesting similarities between files, making the task of inspecting the texts easier for the user. [en_US]
dc.language.iso: en [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: machine learning [en_US]
dc.subject: feature generation [en_US]
dc.subject: text analysis [en_US]
dc.subject: Ferret [en_US]
dc.subject: language independence [en_US]
dc.subject: software origin analysis [en_US]
dc.subject: collusion detection [en_US]
dc.subject: source code similarity [en_US]
dc.subject: trigram analysis [en_US]
dc.subject: evolving software [en_US]
dc.subject: 3CO [en_US]
dc.subject: file comparison tool [en_US]
dc.subject: one-to-many comparison [en_US]
dc.subject: visualisation [en_US]
dc.title: Extracting Group Relationships Within Changing Software Using Text Analysis [en_US]
dc.type: info:eu-repo/semantics/doctoralThesis [en_US]
dc.identifier.doi: 10.18745/th.11896
dc.type.qualificationlevel: Doctoral [en_US]
dc.type.qualificationname: PhD [en_US]
herts.preservation.rarelyaccessed: true

