##
Protein Motif Discovery from Positive Examples by
Minimal Multiple Generalization over Regular Patterns

Hiroki Arimura[1] (arim@ai.kyutech.ac.jp)

Ryoichi Fujino[1] (fujino@ai.kyutech.ac.jp)

Takeshi Shinohara[1] (shino@ai.kyutech.ac.jp)

Setsuo Arikawa[2] (arikawa@rifis.kyushu-u.ac.jp)

[1] Department of Artificial Intelligence
Kyushu Institute of Technology, Iizuka 820, Japan

[2] Research Institute of Fundamental Information Science
Kyushu University, Fukuoka 812, Japan

**Abstract**
Recently, several attempts have been made at applying machine learning
method to protein motif discovery, but most of these methods require
negative examples in addition to positive examples.
This paper proposes an efficient method for learning protein motif
from positive examples. A regular pattern is a string
consisting of constant symbols and mutually distinct variables, and
represents the set of the constant strings obtained by substituting
nonempty constant strings for variables. Regular patterns and their
languages are called extended if empty substitutions are
allowed. Our learning algorithm, called *k*-minimal multiple
generalization (*k*-mmg), finds a minimally general collection of at
most *k* regular patterns that explains all the positive examples.
We have implemented this algorithm for subclasses for regular patterns
and extended regular patterns where the number of variables are
bounded by a small constant, and run experiments on protein data taken
from GenBank and PIR databases. We incorporate three heuristics into
these algorithms for controlling nondeterministic choices.
The experiments show that the *k*-mmg algorithm can very quickly find
a hypothesis on the computers in practice, and that the results of our
system are comparable with the results of learning method from
positive and negative data.