Computational Challenges in the Analysis of Sequence

Edward C. Uberbacher

Informatics Group Leader
Computer Science and Mathematics Division
Oak Ridge National Laboratory (ORNL)
Oak Ridge, TN 37831-6364, USA
e-mail: GRAILMAIL@ornl.gov

The development of computational methods for key problems in the genome project has proved to be a challenging and fruitful ground for computer scientists. I will discuss several aspects of computation in relation to the project, including pattern recognition and combinatorial optimization methods for constructing gene models; neural network and rule-based pattern recognition of gene regulatory regions; protein structure-function classification using clustering, decision tree and neural network methods; dynamic programming methods for sequence error detection and correction; sequence comparison algorithms that permit multiple frame-shifts; and methods for remote computation such as client-server architectures. I will discuss these methods in the context of ongoing research and development in the GRAIL and genQuest sequence analysis systems at ORNL.

GRAIL is a modular expert system for the analysis and characterization of DNA sequences that facilitates the recognition of gene features such as coding regions, poly-A addition sites, potential promoters, CpG islands and repetitive DNA elements, and also constructs gene models. GenQuest allows characterization of newly obtained sequences by homology-based methods, using a number of protein, DNA, and motif databases and comparison methods such as FastA, BLAST, and parallel Smith-Waterman. These analyses are available to the user in graphic form in the X-window-based client-server system XGRAIL, through Mosaic interfaces, or by email server. We have recently developed versions of GRAIL which can locate the protein coding regions of DNA sequences from Escherichia coli, Drosophila melanogaster and Arabidopsis thaliana; methods for detecting and "correcting" potential sequence errors, which make the system insensitive to indels; and a "batch" server client which allows users to analyze groups of short (300-400 bp) sequences for coding character and automates database searches of translations of putative coding regions. Information can be obtained by sending the word "help" by email to GRAIL@ornl.gov.
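
As an illustration of the dynamic programming comparison methods named above, the following is a minimal sketch of Smith-Waterman local alignment scoring. It is not the genQuest implementation, which is parallel and is not described here; the function name smith_waterman_score, the linear gap penalty, and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions chosen only for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Hypothetical helper for illustration: linear gap penalty, and the
    scoring values are arbitrary assumptions, not taken from genQuest.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared segment ACGT gives four matches under the assumed scoring.
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only two rows of the matrix are kept, which is enough to obtain the score; recovering the alignment itself would require storing the full matrix or traceback pointers.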