Eukaryotic Gene Identification Algorithms

James W. Fickett (jwf@t10.lanl.gov)

Los Alamos National Laboratory, Los Alamos, NM 87545 USA

Abstract

First, a brief overview of genome informatics work at Los Alamos will be given. The overview will cover (1) the Los Alamos Sequence Database, (2) the AIDS Database, (3) theoretical design for genome mapping, (4) map assembly tools and methods, (5) a shotgun sequencing assembly algorithm, (6) studies in the evolution of repeats, (7) work on genome organization and estimation of genome coding density, and (8) gene identification methods.

Second, one topic, gene identification, will be treated in more technical depth. The worldwide state of the art will be described for the problem of computationally identifying genes in eukaryotic DNA. The talk will first describe the component techniques used in gene identification algorithms (measurement of statistical regularities, identification of transcription and translation signals, and matching to overall gene syntax), then the synergy that results when different component techniques are combined, and finally the integrated algorithms of (1) Fields and Soderlund, (2) Guigó et al., (3) Gelfand, and (4) Snyder and Stormo.
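As a rough illustration of how such component techniques can be combined, the following Python sketch pairs a simple statistical regularity (the uneven use of each base across the three codon positions) with a minimal syntax check (an ATG start codon followed in frame by a stop codon). It is a toy under stated assumptions, not any of the integrated algorithms named above: the function names, the asymmetry formula, and the `min_codons` threshold are all hypothetical, and introns, splice signals, and the reverse strand are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def position_asymmetry(seq):
    """Statistical regularity (assumed, simplified measure).

    For each base, count its occurrences at codon positions 0, 1 and 2
    (relative to the start of seq) and return max / (min + 1).
    Coding regions tend to use bases unevenly across codon positions,
    so larger values hint at coding potential.
    """
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(seq.upper()):
        if base in counts:
            counts[base][i % 3] += 1
    return {base: max(c) / (min(c) + 1) for base, c in counts.items()}


def find_orfs(seq, min_codons=3):
    """Signal and syntax check (assumed, simplified).

    Scan the three forward frames for ATG ... stop and return
    (start, end) pairs, with end exclusive and including the stop codon.
    min_codons is the minimum number of codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


def score_orfs(seq, min_codons=3):
    """Combine both techniques: score each syntactically valid ORF
    by the mean position asymmetry over the four bases."""
    scored = []
    for start, end in find_orfs(seq, min_codons):
        asym = position_asymmetry(seq[start:end])
        scored.append((start, end, sum(asym.values()) / len(asym)))
    return scored


if __name__ == "__main__":
    for start, end, score in score_orfs("CCATGGCTGCTGCTTAACC"):
        print(start, end, round(score, 2))
```

The syntax check restricts the statistical measure to candidate regions, and the statistic in turn ranks the candidates. That division of labor is one simple form of the synergy between techniques; a realistic system would use far richer statistics and signal models.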

Although it is impossible to describe every currently available tool in detail, enough of the leading tools will be described to give a clear understanding of the capabilities now available. In addition, since no existing tool combines all of the best available techniques, the techniques will also be reviewed in a more abstract sense, with the goal of clarifying where current algorithms are being improved and how much performance is likely to progress in the near future.