Finding Coding Region Using Secondary Hexamer Measure and Two-Dimensional Linear Discriminant Analysis

Katsuhiko Murakami [1] [2] (katsu@ims.u-tokyo.ac.jp)
Toshihisa Takagi [1] (takagi@ims.u-tokyo.ac.jp)

[1] Human Genome Center
Institute of Medical Science, University of Tokyo
[2] Central Research Laboratory, Hitachi Ltd.

Abstract

We have developed a coding region prediction system built from several measures that indicate the exonness of a region in a DNA sequence. The system includes a new statistical measure that we have developed, called the secondary hexamer measure. This statistic is defined in the same way as the existing hexamer measure, except for its learning data. The secondary hexamer measure and several other measures are combined by two-dimensional linear discriminant analysis (2D-LDA). The system then outputs the best gene model, that is, the model with the best score accumulated by phase-specific dynamic programming. In our test on 568 complete vertebrate gene sequences, the program achieved 61% accuracy at the exon level for exact matches and 95% accuracy at the nucleotide level. The average correlation coefficient (CC) between the predicted and actual structures was 0.80.
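
The abstract does not spell out how CC is computed. A minimal sketch follows, assuming the nucleotide-level correlation coefficient commonly used in gene prediction evaluation, CC = (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)), where each base is labeled coding (1) or noncoding (0). The function name `nucleotide_cc` and the 0/1 label encoding are illustrative assumptions, not part of the described system.

```python
import math


def nucleotide_cc(actual, predicted):
    """Correlation coefficient between two per-base labelings.

    Both arguments are equal-length sequences of 0/1 values
    (1 = coding base, 0 = noncoding base). This encoding is an
    assumption made for illustration.
    """
    if len(actual) != len(predicted):
        raise ValueError("labelings must have the same length")

    # Count the four per-base outcomes.
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)

    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        # CC is undefined when either labeling is constant;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy example: the predicted exon is shifted one base to the right.
actual = [0, 0, 1, 1, 1, 1, 0, 0]
predicted = [0, 0, 0, 1, 1, 1, 1, 0]
print(nucleotide_cc(actual, predicted))  # 0.5
```

In the toy example, TP = 3, TN = 3, FP = 1, and FN = 1, so CC = (9 - 1) / sqrt(4 * 4 * 4 * 4) = 0.5. An average CC such as the one reported above would be obtained by computing this value per sequence and averaging, assuming that is how the average was taken.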