Knowledge Acquisition from Amino Acid Sequences by Decision Trees and Indexing

Satoru Miyano[1]
Ayumi Shinohara[1]
Setsuo Arikawa[1]
Shinichi Shimozono[2]
Takeshi Shinohara[2]
Satoru Kuhara[3]

[1]Kyushu University
[2]Kyushu Institute of Technology
[3]Kyushu University

Abstract

We present a machine learning system for knowledge acquisition that produces hypotheses from positive and negative examples, and report experiments on protein data from the PIR and GenBank databases. The system is built on an algorithmic learning theory for decision trees over regular patterns, which we developed for this research. In experiments on transmembrane domain identification, the system discovered very simple hypotheses with very high accuracy from a small number of positive and negative examples. These hypotheses show that negative motifs, namely motifs of the negative data, play a key role in such classification. In these experiments, we classified the 20 amino acid residue symbols into 3 categories according to the hydropathy index of Kyte and Doolittle. We call such a transformation of symbols an indexing. We observed that indexing by the hydropathy index is important for making the learning algorithm efficient and accurate. This observation motivated us to discover such an indexing itself by means of a learning algorithm. We succeeded by combining the above learning algorithm with a local search technique for finding good indexings. We also report experiments on signal peptides.

We have implemented this learning system, called BONSAI, which will be presented at the Computer Demonstration Session during this workshop.
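
As an illustration of the two notions used in the abstract, an indexing and a decision tree over regular patterns, the following minimal sketch may help. It is not the BONSAI implementation. The hydropathy values are the published Kyte and Doolittle indices, but the cut points `HIGH_CUT` and `LOW_CUT`, the category symbols `*`, `+`, `-`, the helper names, and the one-node tree `TOY_TREE` are all assumptions made for this example. The sketch also assumes that pattern variables may match the empty string, so that a pattern with a single constant segment reduces to a substring test.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acid residues.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hypothetical cut points (assumption): they fall in the two gaps of the
# index scale, between -0.4 and 1.8 and between -3.2 and -1.6.
HIGH_CUT = 0.0
LOW_CUT = -2.0


def index_symbol(residue):
    """Map one residue to one of three category symbols (assumed names)."""
    h = KD[residue]
    if h > HIGH_CUT:
        return "*"   # hydrophobic
    if h > LOW_CUT:
        return "+"   # intermediate
    return "-"       # hydrophilic


def apply_indexing(sequence):
    """Rewrite a sequence over 20 symbols as a sequence over 3 symbols."""
    return "".join(index_symbol(r) for r in sequence)


def matches(segments, text):
    """Test a regular pattern x0 s1 x1 s2 ... sk xk against text.

    `segments` holds the constant parts s1..sk; the variables x0..xk are
    implicit and, by assumption here, may match the empty string. Leftmost
    greedy search suffices because each variable occurs only once.
    """
    pos = 0
    for seg in segments:
        hit = text.find(seg, pos)
        if hit < 0:
            return False
        pos = hit + len(seg)
    return True


def classify(tree, text):
    """Walk a decision tree whose inner nodes are (segments, yes, no) and
    whose leaves are booleans (True = positive example)."""
    while not isinstance(tree, bool):
        segments, if_match, if_no_match = tree
        tree = if_match if matches(segments, text) else if_no_match
    return tree


# Made-up one-node tree: reject any sequence whose indexed form contains
# two adjacent hydrophilic symbols, accept everything else.
TOY_TREE = (("--",), False, True)

if __name__ == "__main__":
    for seq in ("LLVIAGAVLL", "MKDERKLA"):
        indexed = apply_indexing(seq)
        print(seq, indexed, classify(TOY_TREE, indexed))
```

The toy tree is chosen only to show how a motif that occurs in negative examples can drive the classification. A learned tree would in general have several nodes, and its patterns would be selected from the training examples.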