For the success of genome projects on human and model organisms, it is essential to develop new informatics technologies as well as experimental technologies. In addition to promoting basic research, we utilize the mechanism of the Genome Informatics Project in two ways. First, we established and maintain a computer network, GenomeNet, as an essential component of the infrastructure for genome research. Second, we organize annual workshops and tutorials and set up small working groups on databases and algorithms in order to facilitate collaboration between biologists and computer scientists. Thus, we have been emphasizing computerization and match-making, both probably very Japanese. After two and a half years, I now feel more comfortable speaking out about what we have been doing and what we plan to do. It is time to redesign our newsletter and publish, at least in part, in English. It is also time to start thinking beyond the current five-year project.
There is no doubt that genome informatics requires interdisciplinary research encompassing biological science and computer science and international collaboration, especially for database activities. However, we do not limit genome informatics to just mapping and sequence databases or analyses like sequence comparison and structure prediction. It includes molecular and cellular aspects of biological information processing up to the life cycle of an organism and the origin and evolution of species. Eventually, because the information gathered from the Human Genome Project will have profound impacts on understanding ourselves, genome informatics may become interdisciplinary research encompassing anthropology, archeology, and other social sciences.
Minoru Kanehisa
Kyoto University and University of Tokyo
Starting immediately, we are launching another type of service called Gopher, which is an easy-to-use, menu-driven system for accessing numerous Internet sites around the world. We caution, however, that Gopher can easily be misused; some people spend a lot of their time and network resources on personal interests and amusements. The GenomeNet is an academic research network supported by the Human Genome Project of the Ministry of Education, Science and Culture. We cannot justify the operating cost if the network is abused. The new Gopher server is intended to give biologists easy access to the GenomeNet services. We welcome your comments and suggestions.
Still another, yet experimental, mode of access to our services is called client-server, where a user runs a client program on his/her machine which automatically communicates with the server program on a machine at the Human Genome Center, the Supercomputer Laboratory, or other database centers in Japan and abroad. There are already fine examples of such client-server programs: network entrez developed by the NCBI and GDB/accessor developed by the Cold Spring Harbor Laboratory.
We envision a network community where informatics needs of individual researchers and individual projects are realized on their local machines by integrating databases and computational resources distributed over the network. In addition to expanding physical connections of the GenomeNet, the development of new software tools and databases is required for the genome research network community.
Minoru Kanehisa
Kyoto University and University of Tokyo
In addition, we have started offering the DBGET integrated database retrieval system under the GenomeNet World Wide Web (WWW) service. WWW is a distributed hypermedia environment that originated at CERN, and we find its capability of making hot links extremely useful. Currently, DBGET integrates fifteen databases, six of which originate from the Japanese research community. You may start with one database, retrieve an entry by keywords or an accession number, and then, just by clicking highlighted words (hot links) on the screen, move into corresponding entries in other databases.
In the past, and probably still in the present, many people consider database activities as trivial, albeit enormous, tasks of putting things in order. Thus, they argue that Japan should contribute money or manpower to the international databases. I believe database activities should be based on technology developments in both how to organize and how to distribute the data. Especially in the rapidly advancing area of molecular biology, the concept of a database is still evolving. First, we had bibliographic databases. Then, relatively simple factual databases of molecular sequences and 3-D structures appeared. Now there are many different genome databases integrating maps and sequences. In the very near future there will be efforts to organize biological knowledge of molecular and cellular functions.
International collaboration is essential for free and immediate sharing of data. However, technology development has been and will continue to be competitive. Unless we invest in our own informatics technology development, we Japanese will never be able to initiate new concepts in molecular biology databases.
Minoru Kanehisa
Kyoto University and University of Tokyo
The second level was the development of new database systems that would meet the needs of mapping and sequencing projects. In January 1992 database working groups were organized under the auspices of the Human Genome Center. The software products of these groups, Locus-in, ContigMaker, Gnome, and Genomatica, are made freely available to the international genome research community. The working groups' activity has since been expanded to include additional projects.
Now we are approaching the third level, which is the task of organizing knowledge in biological sciences. The forty-year history of molecular biology has been the history of structure determinations of nucleic acids and proteins. The genome project will eventually determine the DNA sequence of the entire genome or all genes. However, the sequence is not the goal; it is simply a means to understand how genes and genomes function. Molecular sequences and 3-D structures are simple factual data that can easily be organized in a database. In contrast, the data on molecular functions often involve interpretation by individual researchers. We need to somehow find a good representation of hard functional data in a biological knowledge base.
In this year's Genome Informatics Tutorial held in Toyama July 6-9, we learned a lot about molecular interactions in a cell. I think function-oriented databases currently available, such as Prosite and TFD, are still based on the idea of one sequence (pattern) corresponding to one function. It now seems feasible to organize data and knowledge in the form of one interaction corresponding to one function. A number of computer scientists helped us establish the infrastructure and develop the new database systems. Now we need help from biologists as well because it is their knowledge that we wish to computerize.
Minoru Kanehisa
Kyoto University and University of Tokyo
The proliferation of Web servers around the world is causing fundamental changes in the attitude of biologists toward computers. Database retrieval is now so simple, just pointing and clicking a mouse, that many biologists are starting to try it by themselves. We must acknowledge the increased government funding for the campus networks, which are finally making the Internet accessible to everyone. Because the generic client program called Mosaic is freely available and because it can be used in any discipline, a number of non-specialists may be learning about the Human Genome Project (HGP) from our Web server.
WWW is a revolution, a sort of democratization of information services which once were the privilege of big centers. While this is welcome in general, needs will also arise for quality control; each user is responsible for distinguishing research from relaxation. One way to solve the problem is to develop specialized client programs, such as Genomatica and HyperGenome in our project, which provide access to preselected information services that are relevant to specific research fields.
Another possibility is a 'software agent', which is a kind of robot that follows the instructions of a biologist and, by moving on the Internet, finds information that meets individual needs. Agents will hopefully be a boon to older molecular biologists who feel more comfortable asking someone to get the job done. During the first phase of HGP in 1991-1995, we developed the client-server mechanism in various aspects of genome research. In the second phase in 1996-2000, we plan to make mobile agents widely available for information retrieval and data exchange.
Minoru Kanehisa
Kyoto University and University of Tokyo
However, from the viewpoint of biologists there still remain a lot of problems that they want computer scientists to help them solve. The problems are practical for biologists, but some are not interesting to computer scientists. Some are interesting but ill-defined and cannot be represented in computer languages. Some can be formalized but are too hard to solve within reasonable time frames even with up-to-date computer technologies.
In order to develop genome informatics we have to explore the problems to find the good ones among them. To perform such work we need to expand and extend our collaborations with biologists and/or medical scientists. Only with such close collaborations can we construct and solve these problems that are both interesting to computer scientists and useful to biologists.
From the next fiscal year the Human Genome Project of Japan will enter its second five-year stage. We are now drawing up the plan. We invite biologists and medical scientists to join our Genome Informatics Project.
Toshihisa Takagi
The University of Tokyo
During the past five years we emphasized the research and development of new informatics technologies for database and data interpretation problems. In the new project of 1996-2000, we will concentrate more on the actual data collection and knowledge organization. In particular, it will become increasingly important to organize functional data in higher biological processes, rather than those associated with single molecules or genes. Our goal is to describe and decipher the molecular information pathways that make up living organisms.
Minoru Kanehisa
Kyoto University
In order to proceed with our research in this direction, we must establish an integrated knowledge base consisting of both molecular biological and medical data or rules. So far the Genome Informatics community has collected and integrated various kinds of molecular biological data. In contrast, most medical data or rules are still kept only as texts in literature, documents in laboratories, or understanding in researchers' minds. Considering the status quo, the following two approaches are essential: (i) development of computer technologies for extracting medical knowledge from literature, and (ii) accumulation of data from laboratories and from researchers' minds onto computers with the help of medical scientists.
These two subjects will be the main streams of the next five-year project. We appreciate your cooperation.
Toshihisa Takagi
The University of Tokyo
On December 1, 1995 two new menu items were added to the home page of the GenomeNet WWW server. One is the entry point to the Japanese genome databases for Bacillus, E. coli, cyanobacteria, and others. The other is called KEGG (Kyoto Encyclopedia of Genes and Genomes), which is an attempt to computerize molecular/genetic pathway data and to correlate them with gene catalogs of various organisms. At the moment KEGG focuses on metabolic pathways.
The links between related entries in different databases are represented as binary relations in DBGET, and reverse links and indirect links are calculated from original links. Similarly, once biological links, i.e., interactions between molecules or genes, are properly represented as binary relations in KEGG, it will become feasible to compute pathways in order to assist experiments, facilitate understanding, and even perform simulations of different aspects of living organisms.
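As a minimal sketch of the idea of computing reverse and indirect links from original links, the following Python fragment treats each link as a pair of entry names. The database prefixes and entry identifiers are hypothetical placeholders, and the functions are illustrative only; they are not the actual DBGET implementation.

```python
from collections import defaultdict

# Hypothetical original links between entries, stored as binary relations
# (source entry, target entry). All names are illustrative placeholders.
original_links = {
    ("seqdb:A1", "protdb:P1"),
    ("seqdb:A2", "protdb:P1"),
    ("protdb:P1", "structdb:S1"),
}


def reverse_links(links):
    """Swap source and target of every link."""
    return {(target, source) for source, target in links}


def indirect_links(links):
    """Compose the relation with itself: a -> b and b -> c gives a -> c."""
    targets_of = defaultdict(set)
    for source, target in links:
        targets_of[source].add(target)
    composed = {
        (a, c)
        for a, b in links
        for c in targets_of.get(b, ())
        if c != a
    }
    # Keep only links that were not already given explicitly.
    return composed - set(links)


print(sorted(reverse_links(original_links)))
print(sorted(indirect_links(original_links)))
```

Under these assumptions, the two sequence entries become indirectly linked to the structure entry through the shared protein entry, without anyone having recorded that link by hand.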
Minoru Kanehisa
Kyoto University
Once the catalogs of genes and gene products are known, the next obvious step is to understand functional implications, namely, to decipher both experimentally and computationally when, where, and how genes and molecules function in living organisms. In order to make full use of the information obtained by genome projects, it is essential that functional data obtained in wide areas of molecular and cellular biology are properly computerized.
In the existing molecular biology databases the functional data are computerized based on the concept of structure-function relationship; namely, the function is considered an attribute of the molecular structure. The collection of such data only represents how individual components (molecules) work, and it does not tell the wiring diagram (molecular pathway) of a biological system.
In the new informatics project we pay more attention to the aspects of interactions between molecules. We collect and organize the functional data based on the concept of structure-structure (molecule-molecule) relationship. In its simplest form, the basic data item may thus be represented by a binary relation of interacting molecules or genes. It is a challenging problem to compute molecular pathways from binary relations. At the same time we computerize known pathways derived by human experts. Again there will be a number of computational problems that need to be solved, for example, in pathway comparisons.
The developments of our project have been and will be posted on our Web service at
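One simple way to picture computing a pathway from binary relations is a breadth-first search over the relations viewed as a directed graph. The sketch below assumes each relation is a pair (upstream molecule, downstream molecule); the molecule names M1 to M5 and the function find_pathway are hypothetical and are not taken from the project's software.

```python
from collections import defaultdict, deque

# Hypothetical binary relations between molecules:
# (upstream molecule, downstream molecule). Names are placeholders.
relations = [
    ("M1", "M2"),
    ("M2", "M3"),
    ("M3", "M4"),
    ("M2", "M5"),
]


def find_pathway(relations, start, goal):
    """Return one shortest chain of relations from start to goal, or None."""
    successors = defaultdict(list)
    for upstream, downstream in relations:
        successors[upstream].append(downstream)

    queue = deque([[start]])  # each queue item is a partial pathway
    visited = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        for nxt in successors.get(current, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of relations connects start to goal


print(find_pathway(relations, "M1", "M4"))
print(find_pathway(relations, "M5", "M1"))
```

Real pathway computation is much harder than this sketch suggests, since relations may be reversible, conditional, or dependent on cellular context, which is part of why the problem is described as challenging.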
http://www.genome.ad.jp/
This address is also linked to all of the databases and software products that have been produced in our project. The Genome Informatics News is intended to supplement this on-line publication. In the past the News was circulated only within the Japanese scientific community. On this new occasion, we start distributing the News internationally. We welcome your comments and suggestions.
Minoru Kanehisa
Kyoto University
Knowledge discovery is the most fundamental activity in the sciences, and it has traditionally been performed by experts in each field. For example, Kepler (1571-1630) discovered the famous laws that bear his name from planetary data. Such discoveries were possible only for very talented experts, and the amount of data analyzed was limited to a human-readable size until computers and databases became available. Recently, however, a strong need has arisen to support and assist such scientific discoveries with the paradigms created in Computer Science.
"Knowledge Discovery in Databases (KDD)" is a field of Computer Science that is attacking problems of knowledge discovery in various fields. Knowledge discovery in databases varies in novelty from simple to hard. Database search or information retrieval has been the most fundamental and widespread method for acquiring knowledge. Retrieving a single sequence pattern from databases may have a chance to lead to a new discovery. A harder demand is, for example, to create hypotheses about unknown data or whole sequences as human experts do. For such demands of higher complexity, various technologies from Computer Science will work, such as knowledge bases, machine learning, parallel processing, etc. KDD consists of several stages of data processing. One of the most important and fascinating stages is called "data mining", which is a process of "mining" the nuggets of useful knowledge from well-processed refined data as an end-product of computing.
The target of knowledge discovery in genome databases is to mine the nuggets of genomic knowledge. Therefore, it should be a unified system that includes not only novel processes of high complexity but also straightforward applications of algorithms invented in Computer Science.
The major barrier to obtaining high-quality knowledge from data is the fact that the data are rarely collected for the purpose of data mining. The data are usually collected as a byproduct of other tasks, and therefore they have limitations on breadth or coverage and do not represent all aspects of a product. By circumscribing such limitations of the data, it may also become possible for extrapolations to take differences in population into account.
Although knowledge discovery in genome databases has various difficulties, I believe that knowledge discovery systems unified with databases will play an important role in Genome Science.
Satoru Miyano
The University of Tokyo