Protein Motif Extraction Using Hidden Markov Model

Yukiko Fujiwara (yukiko@csl.cl.nec.co.jp)
Akihiko Konagaya (konagaya@csl.cl.nec.co.jp)

Massively Parallel Systems NEC Laboratory
4-1-1, Miyazaki, Miyamaeku, Kawasaki, Kanagawa 216, Japan
TEL:(044)856-2178, FAX:(044)856-2231

Abstract

In this paper, we study the application of hidden Markov models (HMMs) to the problem of representing protein sequences by a stochastic motif. A stochastic (protein) motif represents the portions of protein sequences that have a certain function or structure, where conditional probabilities are used to deal with the stochastic nature of the motif. We propose the iterative duplication method for HMM network learning. HMMs are much more expressive than symbolic patterns and are better suited to representing the variety of protein sequences. As an experiment, we constructed HMMs for the leucine zipper motif using 112 protein sequences as a training set, and obtained an accuracy of 79.3 percent in the prediction of protein sequences, compared with an accuracy of 14.8 percent when using a symbolic representation. Our approach can also be used for the validation of protein databases; the automatically constructed HMM indicated that one protein sequence annotated as "leucine-zipper like sequence" in the database is quite different from other leucine-zipper sequences in terms of likelihood.
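As an illustration of what comparing sequences "in terms of likelihood" can mean, the following is a minimal sketch of the standard forward algorithm, which computes the log-likelihood of a sequence under a discrete-emission HMM. It is not the iterative duplication method and does not reproduce the network used in this paper. The function names (`log_sum_exp`, `log_likelihood`), the two-state toy model, and the three-letter alphabet are assumptions made purely for the example.

```python
import math

NEG_INF = float("-inf")

def _log(p):
    # Logarithm that maps a zero probability to -inf instead of raising.
    return math.log(p) if p > 0 else NEG_INF

def log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) over a list of log-values.
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(seq, start, trans, emit):
    """Forward algorithm: log P(seq | model) for a discrete HMM.

    start[s]    : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s]     : dict mapping a symbol to its emission probability in s
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(start)
    # Initialisation: start in state s and emit the first symbol.
    alpha = [_log(start[s]) + _log(emit[s].get(seq[0], 0.0)) for s in range(n)]
    # Recursion: sum over predecessor states, then emit the next symbol.
    for sym in seq[1:]:
        alpha = [
            log_sum_exp([alpha[r] + _log(trans[r][s]) for r in range(n)])
            + _log(emit[s].get(sym, 0.0))
            for s in range(n)
        ]
    # Termination: sum over all possible final states.
    return log_sum_exp(alpha)

# Hypothetical toy model (an assumption, not taken from the paper):
# state 0 strongly prefers leucine (L), state 1 is a background state.
START = [0.5, 0.5]
TRANS = [[0.1, 0.9],
         [0.3, 0.7]]
EMIT = [{"L": 0.90, "A": 0.05, "K": 0.05},
        {"L": 0.10, "A": 0.45, "K": 0.45}]

if __name__ == "__main__":
    for s in ("LAKL", "AKAK"):
        print(s, log_likelihood(s, START, TRANS, EMIT))
```

In a setting like the database validation described above, a sequence whose log-likelihood (typically normalised by its length) falls far below that of the other annotated sequences would be flagged for inspection; the choice of threshold is left open here.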