Prediction of Beta-Sheet Structures Using Stochastic Tree Grammars

Hiroshi Mamitsuka (mami@sbl.cl.nec.co.jp)
Naoki Abe (abe@sbl.cl.nec.co.jp)

Theory NEC Laboratory, RWCP[1]
c/o NEC C & C Research Laboratories, 4-1-1 Miyazaki Miyamae-ku,
Kawasaki, 216 Japan.

[1] Real World Computing Partnership.

Abstract

We empirically demonstrate the effectiveness of a method for predicting protein secondary structures, beta-sheet regions in particular, using a class of stochastic tree grammars as the representational language for their amino acid sequence patterns. The family of stochastic tree grammars we use, the Stochastic Ranked Node Rewriting Grammars (SRNRG), is one of the rare families of stochastic grammars that are expressive enough to capture the kind of long-distance dependencies exhibited by the sequences of beta-sheet regions while still admitting relatively efficient processing. We applied our method to real data obtained from the HSSP database, and the results are encouraging: using an SRNRG trained on data for a particular protein, our method was able to predict the location and structure of beta-sheet regions in a number of different proteins whose sequences are less than 25 per cent homologous to the training sequences. The learning algorithm we use is an extension of the `Inside-Outside' algorithm for stochastic context-free grammars, with a number of significant modifications. First, we restricted the grammars to members of the `linear' subclass of SRNRG, and devised simpler and faster algorithms for this subclass. Second, we reduced the alphabet size (i.e. the number of amino acids) by clustering the amino acids according to their physico-chemical properties, gradually over the iterations of the learning algorithm. Our experiments indicate not only that our prediction method goes beyond what is possible by alignment alone, but also that the grammar acquired by our learning algorithm captures the type of long-distance dependencies that cannot be succinctly expressed by an HMM. We also stress that our method can predict the structure as well as the location of beta-sheet regions, which was not possible with previous inverse protein folding methods.
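
As a point of reference for the learning algorithm mentioned above, the following is a minimal sketch of the inside pass of the standard Inside-Outside algorithm for an ordinary stochastic context-free grammar in Chomsky normal form. It is not the SRNRG extension described in this paper. The function name `inside_probability`, the two-letter alphabet (`h`, `p`), and the toy rule probabilities are all assumptions made for illustration; the toy grammar generates even-length strings whose first and last symbols must match, which is a simple nested long-distance dependency.

```python
from collections import defaultdict

# Hypothetical toy grammar (illustration only, not taken from the paper).
# Unary rules: (nonterminal, terminal) -> probability.
UNARY = {("H", "h"): 1.0, ("P", "p"): 1.0}
# Binary rules: (nonterminal, left child, right child) -> probability.
BINARY = {
    ("S", "H", "X"): 0.3,  # S -> h S h  (via X -> S H)
    ("S", "P", "Y"): 0.3,  # S -> p S p  (via Y -> S P)
    ("S", "H", "H"): 0.2,  # S -> h h
    ("S", "P", "P"): 0.2,  # S -> p p
    ("X", "S", "H"): 1.0,
    ("Y", "S", "P"): 1.0,
}


def inside_probability(seq, unary, binary, start="S"):
    """Return P(start derives seq) for a CNF stochastic context-free grammar."""
    n = len(seq)
    # beta[(i, j)][A] = probability that nonterminal A derives seq[i:j].
    beta = defaultdict(lambda: defaultdict(float))

    # Base case: spans of length 1 are covered by unary (terminal) rules.
    for i, symbol in enumerate(seq):
        for (lhs, terminal), prob in unary.items():
            if terminal == symbol:
                beta[(i, i + 1)][lhs] += prob

    # Recursive case: combine two adjacent sub-spans with a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (lhs, left, right), prob in binary.items():
                    beta[(i, j)][lhs] += (
                        prob * beta[(i, k)][left] * beta[(k, j)][right]
                    )

    return beta[(0, n)][start]


if __name__ == "__main__":
    for s in ["hh", "hpph", "hp"]:
        print(s, inside_probability(s, UNARY, BINARY))
```

Under this toy grammar, `hpph` has a single derivation (one use of the 0.3 rule and one of the 0.2 rule), while `hp` has none and receives probability zero. The outside pass and the re-estimation step of Inside-Outside are omitted here.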