Bioinformatic Principles Underlying the Information Content of Transcription Factor Binding Sites
Empirically, it has been observed in several cases that the information content of transcription factor binding site sequences (Rsequence) approximately equals the information content of binding site positions (Rfrequency). A general framework for formal models of transcription factors and binding sites is developed to address this observation. Measures of information content in transcription factor binding sites are revisited, and existing theoretical analyses are compared on this basis. These analyses do not lead to consistent results. A comparative review reveals that none of these inconsistent approaches includes a transcription factor state space. Therefore, a state space that mathematically represents transcription factors with respect to their binding site recognition properties is introduced into the modelling framework. Analysis of the resulting comprehensive model shows that the structure of the genome state space does indeed favour equality of Rsequence and Rfrequency, but that the relation between the two information quantities also depends on the structure of the transcription factor state space. This dependence might lead to significant deviations between Rsequence and Rfrequency. However, further investigation and biological arguments show that the effects of the structure of the transcription factor state space on the relation between Rsequence and Rfrequency are strongly limited for systems that are autonomous in the sense that all DNA-binding proteins operating on the genome are encoded in the genome itself. This provides a theoretical explanation for the empirically observed equality.
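
For orientation, the two quantities compared above can be illustrated with a minimal sketch. It assumes the commonly used definitions: Rsequence as the sum over aligned site positions of (2 - H) bits, where H is the Shannon entropy of the base frequencies at that position, and Rfrequency as log2(G/γ), where G is the number of possible binding positions in the genome and γ is the number of sites. It further assumes an equiprobable four-letter background and omits any small-sample correction. The function names `r_sequence` and `r_frequency` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def r_sequence(sites):
    """Information content of aligned binding site sequences, in bits.

    Assumes equiprobable background bases (maximum of 2 bits per
    position) and applies no small-sample correction.
    """
    length = len(sites[0])
    total = 0.0
    for pos in range(length):
        # Base counts in one column of the alignment.
        counts = Counter(site[pos] for site in sites)
        n = sum(counts.values())
        # Shannon entropy of the observed base frequencies.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


def r_frequency(genome_positions, n_sites):
    """Information needed to locate n_sites among genome_positions, in bits."""
    return math.log2(genome_positions / n_sites)
```

Under these assumptions, a set of perfectly conserved four-base sites gives `r_sequence(["ACGT", "ACGT"]) == 8.0`, which matches `r_frequency(1024, 4) == 8.0` for four sites among 1024 possible positions. This toy case only illustrates what equality of the two measures means; it does not reproduce the state space analysis of the paper.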