An Algorithm for Highly Specific Recognition of Protein-coding Regions

M. S. Gelfand [1] (misha@imb.imb.ac.ru)
T. V. Astakhova [2]
M. A. Roytberg [2] (roytberg@impb.serpukhov.su)

[1] Institute of Protein Research,
Russian Academy of Sciences,
Pushchino, 142292, Russia
[2] Institute of Mathematical Problems of Biology,
Russian Academy of Sciences,
Pushchino, 142292, Russia

Abstract

Since absolutely reliable recognition of protein-coding regions in eukaryotic genomic DNA sequences by computational methods is unattainable, most existing algorithms try to strike a balance between underprediction and overprediction. In experimental practice, however, it is often sufficient to have just a few protein-coding segments, provided they are predicted with high specificity, that is, with (almost) no overprediction. Such predictions are then used to construct oligonucleotide probes and PCR primers for the analysis of cDNA libraries or total cellular RNA. Here we present a combinatorial algorithm that solves this problem. Unlike other prediction schemes, the algorithm uses only the simplest statistical parameters (codon usage and positional nucleotide frequencies in splicing sites) and can therefore be applied to poorly studied genomes, for which large learning sets are unavailable. The structure of the algorithm makes it easy to tune for various experimental settings.