A sensitive and efficient homology search method to find protein-coding regions using "protein-coding region DNA database"

K. Wada
Y. Wada (ATR Ltd.)
S. Tanaka (Japan Advanced Institute of Science and Technology)
H. Doi (Fujitsu Ltd.)
Y. Nakamura
K. Sugaya
T. Fukagawa
T. Ikemura (National Institute of Genetics; The Graduate Univ. for Advanced Studies)

Abstract

One important task of a genome project is to assign protein-coding regions in newly determined DNA sequences in order to elucidate their biological functions. Because gene sequences from a wide range of organisms have accumulated rapidly, homology search has become a standard and powerful method for identifying protein-coding regions in newly determined DNA sequences. In practice, protein-coding regions in new sequences can very often be assigned on the basis of homology, at the nucleotide and/or amino acid level, with at least a portion of a known gene sequence from some species. This indicates that, as far as can be judged from their domain, motif, and/or module sequences, the gene sequences accumulated in the present databases already correspond to a significant portion of the gene sequences of living organisms.

At the same time, however, this accumulation of sequence data makes it difficult in several ways to carry out an efficient and sensitive homology search, especially for human genomic sequences. The human genome contains many kinds of repetitive sequences: the well-known Alu, L1, and satellite sequences; several tens (and possibly more than several hundred) of less well characterized medium reiteration (MRE) sequences; and simple homo- or oligonucleotide tracts. When kb-level human genomic sequences are searched against the present DNA databases, the search frequently hits these repetitive sequences. It should be stressed that the number of entries for individual repetitive sequences registered in the DNA databases has already become remarkably high. Therefore, even a unique sequence that has meaningful, but not very high, homology with known protein-coding sequences (e.g., 70% identity over more than 100 nt) is easily masked by the repetitive sequences that often occur in kb-level human genomic sequences.

One practical way to reduce this problem is to analyze the 300-400 nt genomic sequences obtained in each sequencing determination separately, without joining them into longer contiguous sequences. Only the sequences found to contain repetitive sequences are then edited manually, and the homology search is repeated for the unique portion. This is a rather time-consuming process, both in computation and in manual editing. Another method is to generate the six hypothetical peptide sequences for the three reading frames in each direction of a genomic sequence and to perform a homology search against protein sequence databases. This method is known to be useful, but it has several apparent drawbacks.
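
The six-frame translation approach mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and uppercase or lowercase A/C/G/T input; the function names are hypothetical and chosen only for illustration, and this is not the implementation used in the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard code applies to the sequence being translated).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the six hypothetical peptides, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for name, peptide in six_frame_translations("ATGGCCTAA").items():
        print(name, peptide)
```

Each of the six peptides would then be searched against a protein sequence database; stop codons appear as `*` in the translated strings.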

In the present paper, we introduce a method that we have used in our human genomic sequencing study. We first constructed a "protein-coding region DNA database" based on GenBank. A "Fasta" homology search was then performed for newly determined genomic sequences against this database (a "Blast" homology search could undoubtedly be used in the same way). Because the database contains essentially no repetitive sequences and its size is drastically reduced, the search becomes very efficient and sensitive in finding protein-coding sequences. In practice, we can easily identify sequences that show relatively low but biologically meaningful homology with known protein-coding sequences and that are easily masked by adjacent repetitive sequences in an ordinary homology search. The "protein-coding region DNA database" was produced, as a kind of byproduct, during construction of our "Codon Usage Database" (1-6). We briefly explain these databases and then show an example of the results of a "Fasta" homology search using the "protein-coding region DNA database".
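
To make "relatively low but biologically meaningful homology" concrete, the following Python sketch screens a pairwise alignment using the example figure quoted earlier in the text (70% identity over more than 100 nt). The thresholds are taken from that example only; the text does not state that the authors applied them as a fixed criterion. The function names are hypothetical, and the two aligned strings are assumed to have equal length, with `-` marking gaps.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        return 0.0
    matches = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


def is_candidate_hit(aligned_a, aligned_b, min_identity=70.0, min_length=100):
    """True if the alignment is longer than min_length and reaches min_identity.

    The defaults mirror the illustrative '70% identity in more than 100 nt'
    figure from the text; they are assumptions, not a published cutoff.
    """
    return (
        len(aligned_a) > min_length
        and percent_identity(aligned_a, aligned_b) >= min_identity
    )


if __name__ == "__main__":
    query = "ACGT" * 30                # 120 nt toy sequence
    subject = "N" * 30 + query[30:]    # 30 mismatching columns
    print(percent_identity(query, subject), is_candidate_hit(query, subject))
```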

In selecting protein-coding sequences, we relied on the FEATURES tables of GenBank. In GenBank, a group of consecutive genes whose entire region has been sequenced is registered under one LOCUS name. To distinguish the different genes belonging to a single LOCUS, the symbol # followed by a number is added after the LOCUS name; the numbers represent the order of the peptides registered in the FEATURES table of GenBank. When the introns of a gene have not been completely sequenced, some of its exons are registered in separate entries (LOCUS) in GenBank. Such exons, which belong to the same gene but have different LOCUS names, were combined following the "join" comment, and the gene thus combined was given the LOCUS name with the comment, followed by the symbol *. The data set of protein-coding sequences thus obtained is called the "protein-coding region DNA database". Codon usage in the genes, each starting with an initiation codon and ending with one of the stop codons, was then calculated. The codon usage database is called the CUTG Database (1-5) and is distributed on the EMBL CD-ROM as a member of the NAR Sequence Supplement Databases (4). The CUTG codon database is also available for on-line access at DDBJ.
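
The steps described above (joining exon segments, labeling genes within a LOCUS, and counting codons) can be sketched in Python as follows. This is a simplified illustration, not the authors' pipeline: it assumes that exon coordinates have already been read from a FEATURES table as 1-based inclusive (start, end) pairs on a single sequence, and that ATG is the only initiation codon. The function names and the LOCUS name `HUMXYZ` are hypothetical.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def join_segments(sequence, segments):
    """Concatenate 1-based inclusive (start, end) segments, as a join(...) would."""
    return "".join(sequence[start - 1:end] for start, end in segments)


def gene_label(locus, index):
    """Label the index-th peptide of a LOCUS as 'LOCUS#index', as in the text."""
    return f"{locus}#{index}"


def codon_usage(cds, start_codons=("ATG",)):
    """Count codons in a coding sequence that runs from start codon to stop codon.

    Assumption: only ATG is accepted as an initiation codon by default.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    if cds[:3] not in start_codons or cds[-3:] not in STOP_CODONS:
        raise ValueError("coding sequence must start with an initiation codon "
                         "and end with a stop codon")
    return Counter(cds[i:i + 3] for i in range(0, len(cds), 3))


if __name__ == "__main__":
    # Toy genomic sequence: two exons (uppercase) separated by an intron.
    genomic = "ccATGGCCgtaagtGCCTAAtt"
    cds = join_segments(genomic, [(3, 8), (15, 20)])
    print(gene_label("HUMXYZ", 1), cds, dict(codon_usage(cds)))
```

Validating the start and stop codons before counting keeps partial or mis-joined coding sequences out of the codon usage totals.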

Figure 1 shows an example of the results of a "Fasta" homology search of one genomic sequence in the HLA class III locus using the "protein-coding region DNA database", and Figure 2 shows the results obtained with the ordinary GenBank database. In the former analysis a significant homology with PBX2 (a homeobox gene) was revealed, but in the latter this homology was masked by homology to the Alu repetitive sequence present in this genomic sequence. It should be noted that our further extensive sequence analysis around this genomic region proved the existence of an intact form of a new PBX-like gene in this locus. Since we have found the present method useful in many cases, we are planning to register the "protein-coding region DNA database" for public use at HGC (Human Genome Center) and/or DDBJ.