Editorials

Genome Informatics News

Invitation to Genome Informatics (October 1993)
Database Access on the GenomeNet (April 1994)
Technology Development for Molecular Biology Databases (July 1994)
Biological Knowledge Base (October 1994)
From Client to Agent (January 1995)
A Call to Biologists (April 1995)
Genome Informatics 1996-2000 (July 1995)
Development of Medical Knowledge Base (October 1995)
New Version of GenomeNet WWW Server (January 1996)
New Genome Informatics Project in Japan (May 1996)
Knowledge Discovery in Genome Databases (October 1996)

Vol. 1, No. 1, October 1993

Invitation to Genome Informatics

The Genome Informatics Research Project was started in April 1991 as a part of the Human Genome Project of the Ministry of Education, Science and Culture (MESC) of Japan, which is headed by Ken-ichi Matsubara of Osaka University. Due to budgetary constraints, the Informatics Project was funded separately from the main body of the MESC Human Genome Project, by a Grant-in-Aid for Scientific Research on Priority Areas, of which I am principal investigator. In retrospect, it was quite fortunate for us to be formally independent of the actual mapping and sequencing groups because we could set up long-range objectives and, as a result, could attract a number of young scientists from outside the traditional biological science community.

For the success of genome projects on human and model organisms, it is essential to develop new informatics technologies as well as experimental technologies. In addition to promoting basic research, we utilize the mechanism of the Genome Informatics Project in two ways. First, we established and maintain a computer network, GenomeNet, as an essential component of the infrastructure for genome research. Second, we organize annual workshops and tutorials and set up small working groups on databases and algorithms in order to facilitate collaboration between biologists and computer scientists. Thus, we have been emphasizing computerization and match-making, both probably very Japanese. After two and a half years, I now feel more comfortable speaking out about what we have been doing and what we plan to do. It is time to redesign our newsletter and publish, at least in part, in English. It is also time to start thinking beyond the current five-year project.

There is no doubt that genome informatics requires interdisciplinary research encompassing biological science and computer science and international collaboration, especially for database activities. However, we do not limit genome informatics to just mapping and sequence databases or analyses like sequence comparison and structure prediction. It includes molecular and cellular aspects of biological information processing up to the life cycle of an organism and the origin and evolution of species. Eventually, because the information gathered from the Human Genome Project will have profound impacts on understanding ourselves, genome informatics may become interdisciplinary research encompassing anthropology, archeology, and other social sciences.

Minoru Kanehisa
Kyoto University and University of Tokyo

Vol. 1, No. 2, April 1994

Database Access on the GenomeNet

Since September 1992 we have been offering database retrieval and analysis services on the GenomeNet, a genome research computer network in Japan. There are several modes of access to these services. First, a user may send an electronic mail message containing a query for homology search, motif search, or simple database retrieval and receive the result in a reply message. These services are called e-mail servers. Second, a user may log in to a password-free account, perform database retrieval and, if necessary, receive the result file by e-mail. Third, a user may use a mechanism called anonymous FTP to download small database files and programs from what is called an FTP server.

Starting immediately, we are initiating another type of service called Gopher, which is an easy-to-use, menu-driven system for accessing numerous Internet sites around the world. We caution, however, that Gopher can easily be misused; some people spend a lot of their time and network resources on personal interests and amusements. The GenomeNet is an academic research network supported by the Human Genome Project of the Ministry of Education, Science and Culture. We cannot justify the operating cost if the network is abused. The new Gopher server is intended to give biologists easy access to the GenomeNet services. We welcome your comments and suggestions.

Still another, yet experimental, mode of access to our services is called client-server, where a user runs a client program on his/her machine which automatically communicates with the server program on a machine at the Human Genome Center, the Supercomputer Laboratory, or other database centers in Japan and abroad. There are already fine examples of such client-server programs: network entrez developed by the NCBI and GDB/accessor developed by the Cold Spring Harbor Laboratory.

We envision a network community where informatics needs of individual researchers and individual projects are realized on their local machines by integrating databases and computational resources distributed over the network. In addition to expanding physical connections of the GenomeNet, the development of new software tools and databases is required for the genome research network community.

Minoru Kanehisa
Kyoto University and University of Tokyo

Vol. 1, No. 3, July 1994

Technology Development for Molecular Biology Databases

In this issue of Genome Informatics News we announce the availability of the software products that have been developed in collaboration between the scientists of the Genome Informatics Research Project and the Human Genome Center of the University of Tokyo. The database systems Locus-in and ContigMaker are to support mapping experiments on the human genome, while Genomatica is an integrated database system of maps and sequences which may be used for sequencing projects of E. coli, Bacillus, and other genomes. Gnome is a software tool targeted at more general users who want to search for sequence homologies and motifs by e-mail.

In addition, we have also started offering the DBGET integrated database retrieval system under the GenomeNet World Wide Web (WWW) service. WWW is a distributed hypermedia environment originated at CERN, and we find its capability of making hot links extremely useful. Currently, DBGET integrates fifteen databases, six of which originate from the Japanese research community. You may start with one database, retrieve an entry by key words or an accession number, and then, just by clicking highlighted words (hot links) on the screen, you can move into corresponding entries in other databases.

In the past, and probably still in the present, many people consider database activities to be trivial, albeit enormous, tasks of putting things in order. Thus, they argue that Japan should contribute money or manpower to the international databases. I consider that database activities should be based on technology developments in both how to organize and how to distribute the data. Especially in the rapidly advancing area of molecular biology, the concept of a database is still evolving. First, we had bibliographic databases. Then, relatively simple factual databases of molecular sequences and 3-D structures appeared. Now there are many different genome databases integrating maps and sequences. In the very near future there will be efforts to organize biological knowledge of molecular and cellular functions.

International collaboration is essential for free and immediate sharing of data. However, technology development has been and will continue to be competitive. Unless we invest in our own informatics technology development, we Japanese will never be able to initiate new concepts in molecular biology databases.

Minoru Kanehisa
Kyoto University and University of Tokyo

Vol. 1, No. 4, October 1994

Biological Knowledge Base

When we initiated the Genome Informatics Research Project three and a half years ago, we realized that three levels of informatics support were necessary for promoting genome research in Japan. The first level involved the infrastructure. The Tokyo-Kyoto-Osaka segment of the GenomeNet computer network became operational in September 1991. A year later the network reached Fukuoka, and we at the Human Genome Center in Tokyo and the Supercomputer Laboratory in Kyoto started offering database services.

The second level was the development of new database systems that would meet the needs of mapping and sequencing projects. In January 1992 database working groups were organized under the auspices of the Human Genome Center. The software products of these groups, Locus-in, ContigMaker, Gnome, and Genomatica, are made freely available to the international genome research community. The working groups' activity has since been expanded to include additional projects.

Now we are approaching the third level, which is the task of organizing knowledge in biological sciences. The forty year history of molecular biology has been the history of structure determinations of nucleic acids and proteins. The genome project will eventually determine the DNA sequence of the entire genome or all genes. However, the sequence is not the goal; it is simply a means to understand how genes and genomes function. Molecular sequences and 3-D structures are simple factual data that can easily be organized in a database. In contrast, the data on molecular functions often involve interpretation by individual researchers. We need to somehow find good representation of hard functional data in a biological knowledge base.

In this year's Genome Informatics Tutorial held in Toyama July 6-9, we learned a lot about molecular interactions in a cell. I think function-oriented databases currently available, such as Prosite and TFD, are still based on the idea of one sequence (pattern) corresponding to one function. It seems now feasible to organize data and knowledge in the form of one interaction corresponding to one function. A number of computer scientists helped us establish the infrastructure and develop the new database systems. Now we need help from biologists as well because it is their knowledge that we wish to computerize.

Minoru Kanehisa
Kyoto University and University of Tokyo

Vol. 2, No. 1, January 1995

From Client to Agent

In conjunction with the symposium on "New Trends in Molecular Biology Databases" that we co-organized at the Annual Meeting of the Japanese Molecular Biology Society in Kobe, an on-site demonstration of the database and software products was held. There were also six Macintoshes with Internet connections to be used by molecular biologists as free terminals. I was pleasantly surprised that the Macs were fully occupied during the four days. It seems young molecular biologists can feel comfortable in both wet, bench-side work and dry, desktop work. Many of the scientists were examining our GenomeNet WWW (World Wide Web) server. Access to this server is rapidly increasing, currently with over 1,000 queries per day from more than 50 countries.

The proliferation of Web servers around the world is causing fundamental changes in the attitude of biologists toward computers. Database retrieval is so simple now, just pointing and clicking a mouse, that many biologists are starting to try it by themselves. We must acknowledge the increased government funding for the campus networks, which are finally making the Internet accessible to everyone. Because the generic client program called Mosaic is freely available and because it can be used in any discipline, a number of non-specialists may be learning about the Human Genome Project (HGP) from our Web server.

WWW is a revolution, a sort of democratization of information services which once were the privilege of big centers. While this is welcome in general, needs will also arise for quality control; each user is responsible for distinguishing research from relaxation. One way to solve the problem is to develop specialized client programs, such as Genomatica and HyperGenome in our project, which provide access to preselected information services that are relevant to specific research fields.

Another possibility is a 'software agent', which is a kind of robot that follows the instructions of a biologist and, by moving around the Internet, finds information that meets individual needs. Agents will hopefully be a boon to older molecular biologists who feel more comfortable asking someone to get the job done. During the first phase of HGP in 1991-1995, we developed the client-server mechanism in various aspects of genome research. In the second phase in 1996-2000, we plan to make mobile agents widely available for information retrieval and data exchange.

Minoru Kanehisa
Kyoto University and University of Tokyo

Vol. 2, No. 2, April 1995

A Call to Biologists

In this issue of Genome Informatics News we report on two international meetings: the 'Genome Project and Computer Science' symposium and the HICSS95 conference. You can see in the reports that many computer scientists are interested and involved in the genome project from various points of view and that a new research field, called Genome Informatics, sitting at the boundary between biology and computer science, is being established.

However, from the viewpoint of biologists there still remain a lot of problems that they want computer scientists to help them solve. The problems are practical for biologists, but some are not interesting to computer scientists. Some are interesting but ill-defined and cannot be represented in computer languages. Some can be formalized but are too hard to solve within reasonable time frames even with up-to-date computer technologies.

In order to develop genome informatics we have to explore the problems to find the good ones among them. To perform such work we need to expand and extend our collaborations with biologists and/or medical scientists. Only with such close collaborations can we construct and solve problems that are both interesting to computer scientists and useful to biologists.

From the next fiscal year the Human Genome Project of Japan will enter into the 2nd five-year stage. We are now constructing the plan. We solicit biologists/medical scientists to join our Genome Informatics Project.

Toshihisa Takagi
The University of Tokyo

Vol. 2, No. 3, July 1995

Genome Informatics 1996-2000

We are pleased to inform you that the next five-year project of Human Genome Research has been approved by the Ministry of Education, Science and Culture. The new priority-area (juten ryoiki) research project with Yoshiyuki Sakaki of the University of Tokyo as principal investigator will have three research teams: (i) structural analysis of human genome headed by Misao Ohki of National Cancer Center Research Institute, (ii) functional analysis of human and other genomes headed by Yuji Kohara of National Institute of Genetics, and (iii) biological knowledge information of genes and genomes headed by Minoru Kanehisa of Kyoto University. Although the informatics portion will become an integral part of the new MESC priority-area research project, we expect a similar level of funding for informatics as compared with the current independent priority-area research project.

During the past five years we emphasized the research and development of new informatics technologies for database and data interpretation problems. In the new project of 1996-2000, we will concentrate more on the actual data collection and knowledge organization. In particular, it will become increasingly important to organize functional data on higher biological processes, rather than those associated with single molecules or genes. Our goal is to describe and decipher the molecular information pathways that make up living organisms.

Minoru Kanehisa
Kyoto University

Vol. 2, No. 4, October 1995

Development of Medical Knowledge Base

Diagnosis or treatment of (hereditary) diseases with the genetic information produced in the Human Genome Project is one of the most important aims of this project. Last July at the Genome Informatics Tutorial in Kobe (see inside for detail), four leading researchers who actively utilize the outcome of genome analysis for diagnosis lectured on the relationship between genomic information and hereditary diseases: triplet repeat disease, colon cancer, muscular dystrophy, Werner's syndrome and so on. Their lectures made us realize afresh that we should collaborate and contribute to the frontiers of medical science.

In order to proceed with our research in this direction, we must establish an integrated knowledge base consisting of both molecular biological and medical data or rules. So far the Genome Informatics community has collected and integrated various kinds of molecular biological data. In contrast, most medical data or rules are still kept only as texts in literature, documents in laboratories, or understanding in researchers' minds. Considering the status quo, the following two approaches are essential: (i) development of computer technologies for extracting medical knowledge from literature, and (ii) accumulation of data from laboratories and from researchers' minds onto computers with the help of medical scientists.

These two subjects will be the mainstream of the next five-year project. We appreciate your cooperation.

Toshihisa Takagi
The University of Tokyo

Vol. 3, No. 1, January 1996

New Version of GenomeNet WWW Server

The GenomeNet WWW server became officially operational in July 1994. The number of accesses per month now exceeds 300,000 from more than 50 countries, and it is still rapidly growing. The popularity of the server is due to two useful resources: DBGET for retrieval from a web of molecular biology databases and SIT (sequence interpretation tools) for homology and motif analyses.

On December 1, 1995 two new menu items were added to the home page of the GenomeNet WWW server. One is the entry point to the Japanese genome databases for Bacillus, E. coli, cyanobacteria, and others. The other is called KEGG (Kyoto Encyclopedia of Genes and Genomes), which is an attempt to computerize molecular/genetic pathway data and to correlate them with gene catalogs of various organisms. At the moment KEGG focuses on metabolic pathways.

The links between related entries in different databases are represented as binary relations in DBGET, and reverse links and indirect links are calculated from original links. Similarly, once biological links, i.e., interactions between molecules or genes, are properly represented as binary relations in KEGG, it will become feasible to compute pathways in order to assist experiments, facilitate understanding, and even perform simulations of different aspects of living organisms.
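As an illustration of the idea, the following is a minimal Python sketch of links stored as a binary relation, with reverse links and indirect links derived from the original links. It is not the DBGET implementation; the database names, entry identifiers, and the function names `reverse_links` and `compose` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical original links between entries, stored as a binary relation:
# each pair is (source entry, target entry). The identifiers are invented.
links = {
    ("seqdb:S001", "protdb:P001"),
    ("protdb:P001", "structdb:T001"),
    ("protdb:P001", "motifdb:M001"),
}


def reverse_links(relation):
    """Return the reverse relation: every (a, b) becomes (b, a)."""
    return {(b, a) for (a, b) in relation}


def compose(first, second):
    """Return indirect links: (a, c) whenever (a, b) is in `first`
    and (b, c) is in `second`."""
    by_source = defaultdict(set)
    for a, b in second:
        by_source[a].add(b)
    return {(a, c) for (a, b) in first for c in by_source.get(b, ())}


reverse = reverse_links(links)
indirect = compose(links, links)  # two-step links through an intermediate entry
```

In this sketch an indirect link is simply the composition of the relation with itself, so the sequence entry becomes linked to the structure and motif entries through the shared protein entry. Composing different relations, or repeating the composition, would give longer chains.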

Minoru Kanehisa
Kyoto University

Vol. 3, No. 2, May 1996

New Genome Informatics Project in Japan

In April 1996 the second five-year phase of the MESSC (Ministry of Education, Science, Sports and Culture) Human Genome Project was initiated with Professor Yoshiyuki Sakaki of the University of Tokyo as principal investigator of a new Grant-in-Aid for Scientific Research on Priority Areas. The informatics component was an independent project in the first five-year phase, but this time it is an integral part of the Grant-in-Aid with a similar level of funding (about $2M per year). In this new informatics project, which I continue to head, more emphasis will be placed on data collection and knowledge organization, because the informatics infrastructure and technology developments have advanced well in Japan during the last five years, and because the genome projects have already entered the phase of massive data production.

Once the catalogs of genes and gene products are known, the next obvious step is to understand functional implications, namely, to decipher both experimentally and computationally when, where, and how genes and molecules function in living organisms. In order to make full use of the information obtained by genome projects, it is essential that functional data obtained in wide areas of molecular and cellular biology are properly computerized.

In the existing molecular biology databases the functional data are computerized based on the concept of structure-function relationship; namely, the function is considered an attribute of the molecular structure. The collection of such data only represents how individual components (molecules) work, and it does not tell the wiring diagram (molecular pathway) of a biological system.

In the new informatics project we pay more attention to the interactions between molecules. We collect and organize the functional data based on the concept of structure-structure (molecule-molecule) relationship. In its simplest form, the basic data item may thus be represented by a binary relation of interacting molecules or genes. It is a challenging problem to compute molecular pathways from binary relations. At the same time we computerize known pathways derived by human experts. Again there will be a number of computational problems that need to be solved, for example, in pathway comparisons.
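To make the notion of computing a pathway from binary relations concrete, here is a minimal Python sketch. It treats each relation as a directed edge and uses a breadth-first search to find a shortest chain of interactions between two molecules. This is only one simple reading of the problem, not the method used in the project; the molecule names "A" to "E" and the function names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical binary relations of interacting molecules: (from, to).
relations = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]


def build_graph(pairs):
    """Index the binary relations by their first element."""
    graph = defaultdict(list)
    for source, target in pairs:
        graph[source].append(target)
    return graph


def shortest_pathway(pairs, start, goal):
    """Return one shortest chain of molecules from start to goal,
    or None if no chain of relations connects them."""
    graph = build_graph(pairs)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


pathway = shortest_pathway(relations, "A", "E")
```

With the relations above, `pathway` is the chain A, B, C, E. Real pathway computation would also have to handle alternative routes, reaction direction, and the comparison of computed pathways with those derived by human experts.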

The developments of our project have been and will be posted on our Web service at
http://www.genome.ad.jp/
This address is also linked to all of the databases and software products that have been produced in our project. The Genome Informatics News is intended to supplement this on-line publication. In the past the News was circulated only within the Japanese scientific community. On this new occasion, we start distributing the News internationally. We welcome your comments and suggestions.

Minoru Kanehisa
Kyoto University

Vol. 3, No. 3, October 1996

Knowledge Discovery in Genome Databases

The whole DNA sequences of many species are being determined and some have already been made public in the genome databases. Due to the rapid advance in sequence production, Genome Informatics is now getting into a new era of challenge.

Knowledge discovery is the most fundamental activity in various sciences and has been performed by experts in each field. For example, Kepler (1571-1630) discovered the famous laws that bear his name from planetary data. Such discoveries belonged only to very talented experts, and the amount of data analyzed was limited to a human-readable size until computers and databases became available. Recently, however, a strong need is arising to support and assist such scientific discoveries with the paradigms created in Computer Science.

"Knowledge Discovery in Databases (KDD)" is a field of Computer Science that is attacking problems of knowledge discovery in various fields. Knowledge discovery in databases varies in novelty from simple to hard. Database search or information retrieval has been the most fundamental and widespread method for acquiring knowledge. Retrieving a single sequence pattern from databases may have a chance to lead to a new discovery. A harder demand is, for example, to create hypotheses about unknown data or whole sequences as human experts do. For such demands of higher complexity, various technologies from Computer Science will work, such as knowledge bases, machine learning, parallel processing, etc. KDD consists of several stages of data processing. One of the most important and fascinating stages is called "data mining", which is a process of "mining" the nuggets of useful knowledge from well-processed, refined data as an end-product of computing.

The target of knowledge discovery in genome databases is to mine the nuggets of genomic knowledge. Therefore, it should be a unified system that includes not only novel processes of high complexity but also straightforward applications of algorithms invented in Computer Science.

The major barrier for obtaining high-quality knowledge from data is the fact that the data are rarely collected for the process of data mining. The data are usually collected as a byproduct of other tasks and therefore they have limitations on breadth or coverage and do not represent all aspects of a product. By circumscribing such limitations on data, there is also a possibility that extrapolations can take the difference in population into account.

Although knowledge discovery in genome databases has various difficulties, I believe that knowledge discovery systems unified with databases will play an important role in Genome Science.

Satoru Miyano
The University of Tokyo