Distribution of Base Composition around the Splice Sites in Different Species

Masahiko Mizuno (mizuno@kuicr.kyoto-u.ac.jp)
Minoru Kanehisa (kanehisa@kuicr.kyoto-u.ac.jp)

Institute for Chemical Research, Kyoto University
Uji, Kyoto 611, Japan


Abstract

We have analyzed the distribution of base composition around the 5' and 3' splice sites in genomic DNA sequences of different species. The sequences belonging to one species are aligned at the 5' splice site and, separately, at the 3' splice site, and the average base composition is calculated in 10-base windows over a range of 100 bases each in the upstream and downstream regions. Consistent with previous observations that coding regions are more guanine-cytosine (GC) rich than noncoding regions, we observe a jump in the GC content at the splice sites, except in vertebrate sequences. In addition, introns are uracil (U) rich rather than adenine-uracil (AU) rich, especially in plants and invertebrates. We also find that the pyrimidine-rich regions preceding the 3' splice site in mammals extend upstream beyond the consensus sequences, while the polypyrimidine tracts in plants and invertebrates are much shorter than those in mammals. Furthermore, the increase in pyrimidine content at the 3' splice site is striking in mammals but smaller in plants and invertebrates. Thus, we consider that a broad and intensive polypyrimidine tract is required for the recognition of the 3' splice site in the higher eucaryotes, where introns are GC rich, and that a more AU-rich intron is important in the lower eucaryotes.
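
The windowed calculation described above can be illustrated with a minimal sketch. It is not the authors' program. It assumes that the flanking sequences have already been extracted and aligned so that the same string index corresponds to the same offset from the splice site, that all sequences have equal length, and that base counts are pooled across sequences within each window (for equal-length windows without ambiguous bases this equals the per-sequence average). The function names `windowed_composition` and `gc_content` are hypothetical. Because the input is genomic DNA, T is counted in place of U.

```python
from collections import Counter

BASES = "ACGT"  # T in genomic DNA corresponds to U in the transcript


def windowed_composition(flanks, window=10):
    """Return one {base: fraction} dict per window along the aligned sequences.

    flanks: equal-length strings aligned at the splice site (assumption:
            extraction and alignment were done beforehand).
    window: window width in bases (10 in the abstract).
    """
    if not flanks:
        raise ValueError("need at least one sequence")
    length = len(flanks[0])
    if any(len(seq) != length for seq in flanks):
        raise ValueError("all sequences must have the same length")
    if length % window != 0:
        raise ValueError("sequence length must be a multiple of the window")

    profile = []
    for start in range(0, length, window):
        counts = Counter()
        for seq in flanks:
            counts.update(seq[start:start + window].upper())
        # Ambiguous symbols such as N are ignored in the denominator.
        total = sum(counts[b] for b in BASES)
        profile.append(
            {b: (counts[b] / total if total else 0.0) for b in BASES}
        )
    return profile


def gc_content(profile):
    """GC fraction for each window of a profile."""
    return [w["G"] + w["C"] for w in profile]


if __name__ == "__main__":
    # Toy data only: 10 exon-like bases followed by 10 intron-like bases.
    toy = ["GCGCGCGCGC" + "ATTTTATTTT", "GGCCGGCCGG" + "TTTTATTTTA"]
    for i, gc in enumerate(gc_content(windowed_composition(toy, window=10))):
        print(f"window {i}: GC fraction = {gc:.2f}")
```

For the analysis in this paper, each input string would cover 100 bases upstream and 100 bases downstream of the splice site, giving 20 windows per site; a jump in GC content at the site would then appear as a step between the tenth and eleventh windows.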