GENOME SEQUENCE ANALYSIS: IDENTIFYING AND CLASSIFYING GENES

Phil Green

Genetics Dept., Washington University Medical School, St. Louis, USA

Abstract

Interpreting the information generated by genome sequencing projects is a major challenge for computational biologists, and considerable progress has been made in recent years. Of particular interest is gene identification, which has two aspects: delineating the coding region and its associated signals, such as promoters and splice sites; and predicting possible functions for the encoded protein.

We will present our work in this area in conjunction with the C. elegans sequencing project. A computer program, Genefinder, has been developed for identifying probable genes. Key features include systematic use of likelihoods to discriminate sequence features; a dynamic programming algorithm for assembling exons into candidate genes; and an interactive graphics display showing the relative positions and statistical significance of candidate coding segments, splice sites, and database homologies. Prediction accuracy exceeds 90% on known C. elegans genes. Analysis of genomic sequence has revealed a surprisingly high density of predicted genes, many of which have been confirmed by partial cDNA sequencing.
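The exon-assembly step can be illustrated with a minimal sketch. It is not Genefinder's actual algorithm, which the abstract does not specify. It assumes each candidate exon carries a single log-likelihood-ratio score (coding versus non-coding), and it ignores reading-frame and splice-site compatibility. The names Exon and best_exon_chain are hypothetical. Under those assumptions, choosing the highest-scoring set of non-overlapping exons is a standard weighted-interval dynamic program:

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    """Hypothetical candidate exon; not a Genefinder data structure."""
    start: int    # first base, inclusive
    end: int      # last base, inclusive
    score: float  # assumed log-likelihood ratio, coding vs. non-coding


def best_exon_chain(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest-scoring set of non-overlapping exons.

    Simplifying assumption: exons are compatible whenever they do not
    overlap. Frame and splice-site consistency are not checked.
    """
    ordered = sorted(exons, key=lambda e: e.end)
    ends = [e.end for e in ordered]
    n = len(ordered)

    best = [0.0] * (n + 1)   # best[i]: best score using the first i exons
    take = [False] * (n + 1)  # take[i]: whether exon i is in that optimum
    prev = [0] * (n + 1)      # prev[i]: prefix length to continue from

    for i, exon in enumerate(ordered, start=1):
        # Count earlier exons that end strictly before this one starts.
        p = bisect_right(ends, exon.start - 1, 0, i - 1)
        with_exon = best[p] + exon.score
        if with_exon > best[i - 1]:
            best[i], take[i], prev[i] = with_exon, True, p
        else:
            best[i] = best[i - 1]

    # Trace back through the table to recover the chosen exons.
    chain: List[Exon] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


if __name__ == "__main__":
    candidates = [
        Exon(10, 50, 4.0),
        Exon(40, 90, 5.0),
        Exon(60, 120, 3.5),
        Exon(100, 125, -1.0),
        Exon(130, 200, 2.0),
    ]
    total, gene = best_exon_chain(candidates)
    print(total, [(e.start, e.end) for e in gene])
```

Because the scores are log-likelihood ratios, they add along a chain, and a candidate with a negative score is never selected. A realistic gene model would also need to constrain which exon pairs may be joined.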

At present, the only effective computational method for predicting protein function is to find similarity to known proteins. However, over 60% of genes being found in sequencing projects are not similar to anything in the databases. It has commonly been assumed that this reflects the relative incompleteness of the databases; however, our recent studies (Science 259, 1711-1716 (1993)) comparing sets of genes from distantly related organisms suggest an alternative explanation: the majority of genes appear either to be phylum-specific or to be evolving too rapidly to retain detectable similarities over long evolutionary periods. Most "ancient evolutionarily conserved regions" of proteins are already represented in known proteins, and the number of such regions is limited (fewer than 900). Finding additional homologies will accordingly require more sensitive analysis methods.