Improving Computational Predictions of Cis-regulatory Binding Sites in Genomic Data
Abstract
Cis-regulatory elements are short regions of DNA to which specific
regulatory proteins bind. These interactions influence the level of
transcription of the associated genes, either inhibiting
or enhancing the transcription process. It is known that much of
the genetic change underlying morphological evolution takes place in
these regions, rather than in the coding regions of genes. Identifying
these sites in a genome is a non-trivial problem. Experimental
(wet-lab) methods for finding binding sites exist, but all have some
limitations regarding their applicability, accuracy, availability or cost.
On the other hand, computational methods for predicting the position
of binding sites are less expensive and faster. Unfortunately,
these algorithms perform rather poorly, some missing most binding
sites and others over-predicting their presence. The aim of this thesis
is to develop and improve computational approaches for the prediction
of transcription factor binding sites (TFBSs) by integrating the
results of computational algorithms and other sources of complementary
biological evidence.
Previous related work involved the use of machine learning algorithms
for integrating predictions of TFBSs, with particular emphasis on the
use of the Support Vector Machine (SVM). This thesis has built upon,
extended and considerably improved this earlier work.
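
As a rough illustration of what integrating predictions can mean in
practice, the sketch below pairs each sequence position with the scores
of several base algorithms, giving labelled vectors of the kind a
classifier such as an SVM could be trained on. The function name, the
algorithm names and the scores are all hypothetical; the exact encoding
used in the thesis is not stated here, so this is an assumed layout
rather than the author's method.

```python
from typing import Dict, List, Tuple


def build_vectors(
    predictions: Dict[str, List[float]],
    labels: List[int],
) -> List[Tuple[List[float], int]]:
    """Pair each sequence position with the scores of all base algorithms.

    `predictions` maps a (hypothetical) algorithm name to one score per
    position; `labels` holds 1 for an annotated binding site and 0 otherwise.
    """
    names = sorted(predictions)  # fix the feature order across positions
    length = len(labels)
    for name in names:
        if len(predictions[name]) != length:
            raise ValueError(
                f"{name} has {len(predictions[name])} scores, expected {length}"
            )
    return [
        ([predictions[name][i] for name in names], labels[i])
        for i in range(length)
    ]


# Toy, made-up scores from three unnamed base algorithms over five positions.
example_predictions = {
    "algo_a": [0.0, 0.9, 0.8, 0.0, 0.1],
    "algo_b": [0.1, 0.7, 0.0, 0.0, 0.0],
    "algo_c": [0.0, 1.0, 1.0, 0.2, 0.0],
}
example_labels = [0, 1, 1, 0, 0]

vectors = build_vectors(example_predictions, example_labels)
```

Each element of `vectors` is a (features, label) pair, so the positive
and negative classes can be separated, or the negative class replaced,
before training.
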
Data from two organisms were used here. The first was yeast, whose
genome is relatively simple: its binding sites are fairly well
characterised and are normally located near the genes that they
regulate. The techniques used on the yeast genome were also tested
on the more complex genome of the mouse. The regulatory mechanisms
of this eukaryotic species are known to be considerably more complex,
and it was therefore of interest to evaluate the techniques described
here on such an organism.
The initial results, however, were not particularly encouraging: although
a small improvement on the base algorithms could be obtained,
the predictions were still of low quality. This was the case for
both the yeast and mouse genomes.
However, when the negatively labelled vectors in the training set were
changed, a substantial improvement in performance was observed.
The first change was to take regions of the mouse genome distal to
a gene (over 4000 base pairs away) as regions not containing
binding sites. This produced a major improvement in performance.
The second change was simply to use randomised training
vectors, which contained no meaningful biological information, as the
negative class. This gave some improvement on the yeast genome,
but had a very substantial benefit for the mouse data, considerably
improving on the aforementioned distal negative training data. In
fact, the resulting classifier found over 80% of the binding sites
in the test set, and moreover 80% of its predictions were correct.
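
The two figures quoted above correspond to what are usually called
recall (the fraction of true binding sites that are found) and
precision (the fraction of predictions that are correct). A minimal
sketch of how they can be computed from per-position labels follows;
the function name and the toy labels are illustrative assumptions and
are not taken from the thesis data.

```python
from typing import Sequence, Tuple


def recall_and_precision(
    true_labels: Sequence[int],
    predicted_labels: Sequence[int],
) -> Tuple[float, float]:
    """Return (recall, precision) for binary labels, 1 = binding site."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == 1 and pred == 1:
            tp += 1  # binding site correctly predicted
        elif truth == 0 and pred == 1:
            fp += 1  # prediction where no site is annotated
        elif truth == 1 and pred == 0:
            fn += 1  # annotated site that was missed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy labels: five annotated sites, four found, plus one false prediction.
toy_truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
toy_recall, toy_precision = recall_and_precision(toy_truth, toy_pred)
```

Reporting both values matters here because the base algorithms fail in
opposite directions: missing most sites lowers recall, while
over-predicting lowers precision.
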
The final experiment used an updated version of the yeast dataset,
built with more state-of-the-art algorithms and more recent TFBS
annotation data. Here it was found that using randomised or distal
negative examples once again gave very good results, comparable to
the results obtained on the mouse genome. Another source of negative
data was tried for this yeast data, namely vectors taken from intronic
regions. Interestingly, this gave the best results.
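
The three sources of negative examples described above (distal,
randomised and intronic) can be sketched as follows. This is only one
possible reading: genes are reduced to a single coordinate, introns to
half-open intervals, and "randomised" is interpreted as shuffling each
feature column independently. All function names and these
simplifications are assumptions for illustration; only the 4000 base
pair threshold comes from the text.

```python
import random
from typing import List, Sequence, Tuple


def distal_positions(
    candidates: Sequence[int],
    gene_coords: Sequence[int],
    min_distance: int = 4000,
) -> List[int]:
    """Keep positions more than `min_distance` bp from every gene coordinate."""
    return [
        pos for pos in candidates
        if all(abs(pos - gene) > min_distance for gene in gene_coords)
    ]


def intronic_positions(
    candidates: Sequence[int],
    introns: Sequence[Tuple[int, int]],
) -> List[int]:
    """Keep positions that fall inside any half-open intron [start, end)."""
    return [
        pos for pos in candidates
        if any(start <= pos < end for start, end in introns)
    ]


def randomised_vectors(
    vectors: Sequence[Sequence[float]],
    seed: int = 0,
) -> List[List[float]]:
    """Shuffle each feature column independently (assumed randomisation).

    The values of each feature are preserved, but their alignment across
    features, and with sequence position, is destroyed.
    """
    rng = random.Random(seed)
    columns = [list(column) for column in zip(*vectors)]
    for column in columns:
        rng.shuffle(column)
    return [list(row) for row in zip(*columns)]


# Toy coordinates and vectors, made up for illustration.
distal = distal_positions([100, 5000, 12000], gene_coords=[1000, 7000])
intronic = intronic_positions([5, 15, 25], introns=[(10, 20)])
shuffled = randomised_vectors([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]])
```

Whichever source is used, the resulting vectors would simply replace
the original negative class in the training set, with the positive
class left unchanged.
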