Protein Motif Discovery from Positive Examples by Minimal Multiple Generalization over Regular Patterns

Hiroki Arimura[1] (arim@ai.kyutech.ac.jp)
Ryoichi Fujino[1] (fujino@ai.kyutech.ac.jp)
Takeshi Shinohara[1] (shino@ai.kyutech.ac.jp)
Setsuo Arikawa[2] (arikawa@rifis.kyushu-u.ac.jp)

[1] Department of Artificial Intelligence Kyushu Institute of Technology, Iizuka 820, Japan
[2] Research Institute of Fundamental Information Science Kyushu University, Fukuoka 812, Japan

Abstract

Recently, several attempts have been made at applying machine learning method to protein motif discovery, but most of these methods require negative examples in addition to positive examples. This paper proposes an efficient method for learning protein motif from positive examples. A regular pattern is a string consisting of constant symbols and mutually distinct variables, and represents the set of the constant strings obtained by substituting nonempty constant strings for variables. Regular patterns and their languages are called extended if empty substitutions are allowed. Our learning algorithm, called k-minimal multiple generalization (k-mmg), finds a minimally general collection of at most k regular patterns that explains all the positive examples. We have implemented this algorithm for subclasses for regular patterns and extended regular patterns where the number of variables are bounded by a small constant, and run experiments on protein data taken from GenBank and PIR databases. We incorporate three heuristics into these algorithms for controlling nondeterministic choices. The experiments show that the k-mmg algorithm can very quickly find a hypothesis on the computers in practice, and that the results of our system are comparable with the results of learning method from positive and negative data.