GenomeNet Database Service (October 2006)

The GenomeNet Database Service at http://www.genome.jp/ is developed and operated by the Bioinformatics Center, Institute for Chemical Research, Kyoto University.

1. KEGG: Kyoto Encyclopedia of Genes and Genomes

1.1. KEGG Databases

KEGG is a bioinformatics resource for understanding higher-order functional meanings and utilities of biological systems, such as the cell, the organism, and the biosphere, from genomic and molecular information. In order to link genomes to biological systems, the KEGG resource is organized into building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY), and hierarchical classifications involving various aspects of biological systems (KEGG BRITE).

Database | Content | Source
PATHWAY | Protein interaction and reaction networks for metabolism, various cellular processes, and human diseases | Manually entered from published materials
GENES | GENES: Gene catalogs of complete genomes with manual annotation | Generated from RefSeq and other public resources with reannotation by KEGG
GENES | DGENES: Gene catalogs of draft genomes with automatic annotation |
GENES | EGENES: Gene catalogs (consensus contigs) of EST data with automatic annotation |
GENES | GENOME: Genome maps and organism information |
GENES | SSDB: Sequence similarities with best-hit information for identifying ortholog/paralog clusters and conserved gene clusters | Computationally derived from GENES by pairwise genome comparisons of all protein-coding genes
GENES | EXPRESSION: Microarray gene expression profiles | Microarray data obtained by the Japanese groups
LIGAND | COMPOUND: Chemical compound structures | Manually entered from published materials
LIGAND | DRUG: Chemical structures of drugs |
LIGAND | GLYCAN: Glycan structures |
LIGAND | REACTION: Chemical reactions |
LIGAND | RPAIR: Chemical structure transformation patterns |
LIGAND | ENZYME: Enzyme nomenclature | Generated from IUBMB/IUPAC nomenclature
BRITE | Functional hierarchies representing our knowledge on various aspects of biological systems including KO (KEGG Orthology) grouping | Manually entered from published materials

See also: Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M.; From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354-357 (2006).

1.2. Graph Representation

KEGG is based on the concept of a graph for representing and manipulating data. Mathematically, a graph is a set of nodes (building blocks) and edges (interactions or relations). There are three types of graphs for the molecular objects of genes, proteins, and chemical compounds.

Graph | Node | Edge | Main Databases
Protein network | Protein (gene product) | Generalized protein interaction (direct protein-protein interaction, gene expression relation, enzyme-enzyme relation) |
Gene universe | Gene | Adjacency on chromosome, sequence/structural similarity, expression similarity, etc. |
Chemical universe | Chemical compound | Chemical reaction, structural similarity |
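
As a rough illustration of the node-and-edge view described above, the following Python sketch stores a graph as a set of nodes and a set of labeled edges. The class name, method names, and the toy compound names are hypothetical and are not part of KEGG or any KEGG software.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A graph as a set of nodes and a set of labeled, undirected edges."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (node_a, node_b, relation) tuples

    def add_edge(self, a, b, relation):
        # Adding an edge also registers both of its endpoints as nodes.
        self.nodes.update((a, b))
        self.edges.add((a, b, relation))

    def neighbors(self, node):
        # Collect every node that shares at least one edge with `node`.
        found = set()
        for a, b, _ in self.edges:
            if a == node:
                found.add(b)
            elif b == node:
                found.add(a)
        return found


# Hypothetical toy data standing in for a small piece of a chemical universe.
chemical_universe = Graph()
chemical_universe.add_edge("compound_A", "compound_B", "chemical reaction")
chemical_universe.add_edge("compound_B", "compound_C", "structural similarity")
```

The same structure would serve for a gene universe or a protein network by changing what the nodes and relation labels stand for.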

Another important concept in KEGG is the level of abstraction, which is represented by nested graphs. A nested graph is a graph whose nodes can themselves be graphs. Thus, a subgraph at one level corresponds to a node at a higher level. Examples are the following.

Higher-level node | Subgraph | Database
KEGG Orthology (KO) group | Set of genes | KO
Pathway module | Set of proteins | PATHWAY
Protein family | Set of proteins | BRITE
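
The idea that a subgraph at one level becomes a node at a higher level can be sketched as follows. This is only an illustrative Python example: the function name `collapse`, the gene names, and the group names are hypothetical, and the approach shown (replacing each node by its group and dropping edges inside a group) is one simple way to move up a level, not a description of how KEGG computes it.

```python
def collapse(edges, group_of):
    """Lift lower-level edges to the higher level.

    Each node is replaced by the group (higher-level node) it belongs to.
    Edges between members of the same group vanish, because that group is
    now a single node.
    """
    higher_edges = set()
    for a, b in edges:
        group_a, group_b = group_of[a], group_of[b]
        if group_a != group_b:
            higher_edges.add(frozenset((group_a, group_b)))
    return higher_edges


# Hypothetical gene-level graph and gene-to-group assignment.
gene_edges = [("geneA1", "geneB1"), ("geneA2", "geneB1"), ("geneA1", "geneA2")]
group_of = {"geneA1": "group_A", "geneA2": "group_A", "geneB1": "group_B"}

group_edges = collapse(gene_edges, group_of)
```

Here three gene-level edges reduce to a single edge between the two groups.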

1.3. Network Hierarchy

The protein network is the most distinctive data object in KEGG and is stored as a collection of pathway maps (diagrams) in the PATHWAY database. Reflecting the map resolution, the KEGG protein network, that is, the PATHWAY database, is organized in a hierarchy. The top two levels of the current hierarchy are as follows.

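First Level | Second Level
Metabolism | Carbohydrate Metabolism
Metabolism | Energy Metabolism
Metabolism | Lipid Metabolism
Metabolism | Nucleotide Metabolism
Metabolism | Amino Acid Metabolism
Metabolism | Metabolism of Other Amino Acids
Metabolism | Glycan Biosynthesis and Metabolism
Metabolism | Biosynthesis of Polyketides and Nonribosomal Peptides
Metabolism | Metabolism of Cofactors and Vitamins
Metabolism | Biosynthesis of Secondary Metabolites
Metabolism | Biodegradation of Xenobiotics
Genetic Information Processing | Transcription
Genetic Information Processing | Sorting and Degradation
Genetic Information Processing | Replication and Repair
Environmental Information Processing | Membrane Transport
Environmental Information Processing | Signal Transduction
Environmental Information Processing | Signaling Molecules and Interaction
Cellular Processes | Cell Motility
Cellular Processes | Cell Growth and Death
Cellular Processes | Cell Communication
Cellular Processes | Endocrine System
Cellular Processes | Immune System
Cellular Processes | Nervous System
Cellular Processes | Sensory System
Human Diseases | Neurodegenerative Disorders
Human Diseases | Infectious Diseases
Human Diseases | Metabolic Disorders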

1.4. KEGG Orthology (KO)

The integration of pathway information and genomic information was first achieved in KEGG through EC numbers. Once the EC numbers were correctly assigned to enzyme genes in the genome, organism-specific pathways could be generated automatically by matching against the networks of EC numbers (enzymes) in the reference metabolic pathways. However, in order to incorporate non-metabolic pathways and to overcome various problems inherent in the enzyme nomenclature, a new scheme based on ortholog IDs was introduced to replace the EC numbers. KO (KEGG Orthology) is a further extension of ortholog IDs, based not only on the pathway maps but also on the BRITE functional hierarchies, most notably classifications of protein families.

Identifier | Purpose
EC number | Mapping enzyme genes to metabolic pathways
Ortholog ID | Mapping genes to both metabolic and regulatory pathways
KO | Mapping genes to both pathways and BRITE hierarchies

Thus, under the current KO system, the KO identifiers (K numbers) are placed at the fourth (lowest) level in the network hierarchy shown above, or at the lowest level of the BRITE hierarchy.
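
The matching step described in this section, in which identifiers assigned to an organism's genes are looked up in a reference pathway, might be sketched in Python as below. The function name and all identifiers (`id_1`, `gene_x`, and so on) are placeholders invented for illustration; the same logic would apply whether the identifiers are EC numbers, ortholog IDs, or K numbers, and this is not KEGG's actual implementation.

```python
def map_genes_to_reference(reference_nodes, assignments):
    """Return the reference nodes covered by an organism, with their genes.

    reference_nodes: set of identifiers appearing in a reference pathway.
    assignments:     dict mapping each gene of the organism to one identifier.
    """
    hits = {}
    for gene, identifier in assignments.items():
        # Only identifiers present in the reference pathway are kept.
        if identifier in reference_nodes:
            hits.setdefault(identifier, []).append(gene)
    return {identifier: sorted(genes) for identifier, genes in hits.items()}


# Hypothetical reference pathway nodes and gene assignments.
reference_nodes = {"id_1", "id_2", "id_3"}
assignments = {
    "gene_x": "id_1",
    "gene_y": "id_1",
    "gene_z": "id_3",
    "gene_w": "id_9",  # assigned, but not part of this reference pathway
}

organism_pathway = map_genes_to_reference(reference_nodes, assignments)
```

Reference nodes absent from the result (here `id_2`) correspond to steps for which no gene was found in the organism.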

1.5. BRITE Functional Hierarchy

The BRITE database is a collection of hierarchical text files and binary relation files. It is intended to supplement the PATHWAY database in two ways. One is to computerize higher-level knowledge that cannot easily be represented as molecular interaction/reaction networks, in terms of a hierarchically structured vocabulary. The other is to integrate our knowledge about the genomic space (K numbers) with different types of knowledge in the chemical space (C/D/G/R/A numbers in the LIGAND database). The BRITE collection is currently categorized as follows.

Top Category | Second Category
Genes and Proteins | Network hierarchy
Genes and Proteins | Protein families
Compounds and Reactions | Compounds
Compounds and Reactions | Compound interactions
Drugs and Diseases | Drugs
Cells and Organisms | Organisms

2. DBGET/LinkDB: Integrated Database Retrieval System

2.1. Web of Molecular Biology Data

DBGET/LinkDB is the backbone retrieval system for all GenomeNet databases, including a number of molecular biology databases that are mirrored at GenomeNet. DBGET/LinkDB is based on a flat-file view of molecular biology databases, where the database is considered as a collection of entries. Because each entry is given a unique entry name (or an accession number) within a database, the molecular biology databases in the world can be retrieved uniformly by the combination of the database name and the entry name:

    database:entry

In KEGG an organism is a collection of genes, which may also be considered as a flat-file database. Any gene or gene product (protein or RNA) in KEGG can thus be specified by the combination of the organism name and the gene name:

    organism:gene

When two data entries are related in any way, it is customary to incorporate cross-reference information in the molecular biology databases. Examples include links between sequence data and literature data or between amino acid sequence data and nucleotide sequence data. The link information between two entries is a binary relation represented by:
    database1:entry1 --> database2:entry2
LinkDB is a collection of all such direct links in the GenomeNet databases as well as indirect links that are computationally obtained by combining multiple links and/or using links in reverse directions.
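
The notions of direct, reverse, and indirect links can be made concrete with a small Python sketch. The function names and the entry names (`db1:e1` and so on) are hypothetical, and the code only illustrates the idea of reversing and chaining binary relations; it does not reflect how LinkDB is actually built.

```python
def parse(name):
    """Split a 'database:entry' name into its two parts."""
    database, entry = name.split(":", 1)
    return database, entry


def reverse_links(links):
    """Use each link in the reverse direction."""
    return {(target, source) for source, target in links}


def indirect_links(links):
    """Two-step links a -> c obtained by combining a -> b and b -> c."""
    by_source = {}
    for source, target in links:
        by_source.setdefault(source, set()).add(target)
    return {
        (a, c)
        for a, b in links
        for c in by_source.get(b, ())
        if c != a  # skip trivial round trips back to the starting entry
    }


# Hypothetical direct links between entries of three databases.
direct = {("db1:e1", "db2:e2"), ("db2:e2", "db3:e3")}

forward_only = indirect_links(direct)
with_reverse = indirect_links(direct | reverse_links(direct))
```

With forward links only, `db1:e1` reaches `db3:e3`; once reverse links are included, `db3:e3` also reaches `db1:e1`.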

The web of molecular biology databases can itself be considered another type of graph, consisting of database entries as nodes and cross-reference links as edges. It is a huge graph somewhat similar to the World Wide Web (WWW).

Graph | Node | Edge
World Wide Web | Page | Hyperlink
Web of molecular biology data | Database entry | Cross-reference link
KEGG gene universe | Gene | Any relation between genes or gene products
KEGG protein network | Protein | Protein interaction or relation in known pathways
KEGG chemical universe | Chemical compound | Chemical reaction
Chemical compound | Atom | Atomic bond
Glycan structure | Monosaccharide | Glycosidic bond

2.2. Databases Available

The following are the GenomeNet databases, many of which are updated daily.

*DNA | Generic database name representing: GenBank+EMBL |
*Protein | Generic database name representing: SwissProt+PIR+PRF+PDBSTR |
*nr-nt | Non-redundant DNA database constructed from GenBank and EMBL |
*nr-aa | Non-redundant Protein database constructed from SwissProt, TrEMBL, TrEMBL_new, PIR, PRF, and GenPept |
*RefSeq | Generic database name representing: RefNuc+RefSeq | NCBI
*RefNuc | NCBI reference nucleotide sequence database |
*RefPep | NCBI reference protein sequence database |
*GenBank | GenBank nucleic acid sequence database (including DDBJ) | NCBI
*GenPept | Translated GenBank |
*EMBL | EMBL nucleic acid sequence database | EBI
*SwissProt | SwissProt protein sequence database | ExPASy / EBI
*TrEMBL | TrEMBL protein sequence database | EBI / ExPASy
*TrEMBL_new | TrEMBL_new protein sequence database |
PIR | NBRF-PIR protein sequence database | NBRF
PRF | PRF (Protein Research Foundation) protein sequence database | PRF
*PDB | RCSB Protein Data Bank for 3D structures | RCSB
*PDBSTR | Protein Data Bank reorganized as a sequence database |
EPD | Eukaryotic promoter database by Philipp Bucher | ISREC
PROSITE | Dictionary of protein sites and patterns by Amos Bairoch | ExPASy
BLOCKS | Blocks of conserved segments by Henikoff and Henikoff | FHCRC
ProDom | Protein Domain database by Corpet, Gouzy, and Kahn | INRA
PRINTS | Protein motif fingerprint database by Attwood et al. | UMBER
Pfam | Protein families and motifs by Washington U. and Sanger Centre | Wash.U / Sanger
*COMPOUND | Chemical compounds | Kyoto
*DRUG | Chemical structures of drugs |
*GLYCAN | Carbohydrate structures |
*REACTION | Chemical reactions |
*RPAIR | Reactant pairs and alignments |
*ENZYME | Enzyme nomenclature |
*PATHWAY | KEGG pathway maps and ortholog group tables |
*KO | KEGG Orthology |
*GENES | KEGG gene catalogs |
GENOME | KEGG organisms |
DGENES | KEGG draft genome gene catalogs |
EGENES | KEGG EST gene catalogs (consensus contigs) |
EXPRESSION | KEGG microarray gene expression profiles |
VGENOME | Viral genomes reorganized from RefSeq |
VGENES | Viral genes reorganized from RefSeq |
OGENES | Organella genes reorganized from RefSeq |
*OMIM | Online Mendelian Inheritance in Man | NCBI
PMD | Protein mutation database by Ken Nishikawa | DDBJ
AAindex | Amino acid index database by Kyoto U. | Kyoto
LITDB | PRF protein/peptide literature database (published as Peptide Information) | PRF
Medline | Biomedical literature database located at NCBI | NCBI
*LinkDB | Database of database links maintained by Kyoto U. |

* Daily/weekly updated databases

3. Computation Services

3.1. Sequence Analysis

BLAST | Sequence similarity search | NCBI | Kyoto
FASTA | Sequence similarity search | W. Pearson | Kyoto
MOTIF | Sequence motif search | ICR, Kyoto | Kyoto
CLUSTALW | Multiple sequence alignment | D. Higgins et al. | Kyoto
MAFFT | Multiple sequence alignment | K. Katoh | Kyoto
PRRN | Multiple sequence alignment | O. Gotoh | Kyoto
EGassembler | Generation of consensus contigs | Ali Masoudi-Nejad | Tokyo
KAAS | Automatic genome annotation | Y. Moriya | Kyoto

See also other computation services

3.2. Chemical Analysis

SIMCOMP | Compound substructure search | M. Hattori | Kyoto
SUBCOMP | Compound substructure search | N. Tanaka | Kyoto
KCaM | Glycan structure search | K.F. Aoki | Kyoto
e-zyme | Reaction prediction | M. Kotera | Kyoto

4. GenomeNet Addresses


Last updated: November 10, 2010