Pattern Recognition Problems in Genome Research

Gary D. Stormo

Dept. of Molecular, Cellular and Developmental Biology
Univ. of Colorado
Boulder, CO 80309, USA
e-mail: stormo@beagle.colorado.edu

The total amount of DNA and protein sequence data continues to grow exponentially, and the various genome projects ensure that this growth will continue for some time. Much, perhaps most, of what we learn from those sequences will have to come from algorithms designed to extract the biologically important information from them, because traditional biochemical and genetic approaches cannot keep up with the data. At the very least, sequence analysis tools are essential for directing experimentalists toward the most promising experiments.

My talk will describe a couple of problems we have investigated over the last several years that apply pattern recognition and classification algorithms to the identification of important functional domains in DNA and protein sequences. One of these problems is to find a common sequence "motif" in a collection of sequences that are known to have a common function. This is an example of a multiple-sequence local alignment problem, and there can be a variety of criteria for the pattern that constitutes the "motif", including both sequence and structure components. We have examined a variety of methods, including greedy algorithms, Expectation Maximization algorithms and neural networks. Another problem of interest to us is the classification of genomic DNA sequences into functional domains, such as coding and non-coding regions, based on a variety of statistical tests. We have developed an approach that uses dynamic programming to obtain optimal and near-optimal solutions given a weighting of the different types of evidence, and a neural network approach that finds the weighting that maximizes the reliability of the predictions. Other classification problems can be attacked with similar methods.
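
To make the dynamic programming idea concrete, here is a minimal sketch of how a sequence might be divided into coding and non-coding regions once a weighting of the evidence is fixed. It is an illustration under stated assumptions, not the method presented in the talk: the function name `segment`, the per-position log-odds scores, the `weights` vector and the `switch_penalty` for changing region type are all hypothetical, and only the single optimal labeling is returned (near-optimal solutions are not enumerated).

```python
from typing import Sequence

def segment(evidence: Sequence[Sequence[float]],
            weights: Sequence[float],
            switch_penalty: float) -> str:
    """Label each position 'C' (coding) or 'N' (non-coding).

    Assumptions (hypothetical, for illustration only):
    - evidence[i][k] is a log-odds score from statistical test k at
      position i; positive values favor coding.
    - weights[k] is the weight given to test k.
    - switch_penalty is subtracted whenever the label changes.
    """
    n = len(evidence)
    if n == 0:
        return ""

    # Combine the different types of evidence into one score per position.
    combined = [sum(w * e for w, e in zip(weights, row)) for row in evidence]

    def emit(state: int, score: float) -> float:
        # State 0 is coding (rewarded by +score), state 1 is non-coding.
        return score if state == 0 else -score

    # best[i][s]: best total score of a labeling of positions 0..i ending in s.
    best = [[0.0, 0.0] for _ in range(n)]
    back = [[0, 0] for _ in range(n)]
    for s in (0, 1):
        best[0][s] = emit(s, combined[0])

    for i in range(1, n):
        for s in (0, 1):
            stay = best[i - 1][s]
            switch = best[i - 1][1 - s] - switch_penalty
            if stay >= switch:
                best[i][s], back[i][s] = stay + emit(s, combined[i]), s
            else:
                best[i][s], back[i][s] = switch + emit(s, combined[i]), 1 - s

    # Trace back from the better final state to recover the labeling.
    s = 0 if best[n - 1][0] >= best[n - 1][1] else 1
    labels = [s]
    for i in range(n - 1, 0, -1):
        s = back[i][s]
        labels.append(s)
    labels.reverse()
    return "".join("CN"[s] for s in labels)


# Two hypothetical tests per position; position 3 is a weak outlier.
example = ([[2.0, 1.0]] * 3 + [[-0.5, 0.0]] + [[2.0, 1.0]] * 2
           + [[-2.0, -1.0]] * 4)
print(segment(example, weights=[1.0, 0.5], switch_penalty=3.0))
```

In this toy input the penalty keeps the weak negative score at position 3 from splitting the coding region; with `switch_penalty=0.0` each position is simply labeled by the sign of its combined score. Choosing the weights is the part that the abstract assigns to a neural network, which this sketch does not attempt.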