Protein sequence databases in the context of genome projects

Amos Bairoch (bairoch@cmu.unige.ch)

Medical Biochemistry Department, University of Geneva
1211 Geneva 4, Switzerland

Abstarct

Recent developments concerning the SWISS-PROT and PROSITE databases are discussed in the context of genome projects and of new network access tools
SWISS-PROT[1] is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, its domains structure, post-translational modifications, variants, etc), a minimal level of redundancy and high level of integration with other databases. In the recent months we have developed the database in the following directions:

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to be as complete as possible as well as to provide a high level of annotations. Entries originating from these organisms are cross-referenced to specialized database(s) that contain, among other data, some genetic information about the genes that code for these proteins. The organisms currently selected are (the associated specialized database is listed in brackets): B.subtilis (SubtiList); C.elegans (WormPep); D.discoideum (DictyDB); D.melanogaster (FlyBase); E.coli (EcoGene); H.sapiens (MIM) and S.cerevisiae (LISTA).
We have made an important effort in the implementation, in SWISS-PROT, of data relevant to human genetic diseases and of their characterization at the molecular level. Information concerning disease causing mutations is now available in the database.
SWISS-PROT has committed itself to work in close collaboration with a number of groups developing 2D gel databases. In particular we provide cross-references to the identificators for the spots corresponding to known or unknown microsequenced proteins. We also create new entries for micro- sequences that correspond to novel, yet unidentified, proteins.

PROSITE[2] is a compilation of sites and patterns found in protein sequences; it can be used as a method of determining the function of uncharacterized proteins translated from genomic or cDNA sequences. Recent developments include:

The extension of the collection to include profile-based motif descriptions in addition to regular expression-like patterns. This will allow the detection of protein families and domains that cannot be detected using patterns due to their extreme sequence divergence. Typical examples of important functional domains which are weakly conserved are the Ig domains, the SH2 and SH3 domains, or the Fn-III domain.
A significant increase in the number of patterns stored in PROSITE. In the current release there are 1029 patterns that allow the characterization of 18786 out of a total of 38303 entries in SWISS-PROT (close to 50%).

Both SWISS-PROT and PROSITE are available through the ExPASy World-Wide Web (WWW) server[3]. WWW is a powerful global information system merging networked information retrieval and hypertext. The ExPASy server allows access to the SWISS-PROT, PROSITE, SWISS-2DPAGE and SWISS-3DIMAGE databases and, through any SWISS-PROT protein sequence entry, to other databases such as EMBL, REBASE, FlyBase, GCRDb, MaizeDB, OMIM, PDB and Medline.