Stochastic Context-Free Grammars in Computational Biology: Applications to Modeling RNA

Yasubumi Sakakibara[1] (yasu@iias.flab.fujitsu.co.jp)
Michael Brown[2]
Rebecca C. Underwood[2]
I. Saira Mian[3]
David Haussler[2] (haussler@cse.ucsc.edu)

[1]Institute for Social Information Science, Fujitsu Laboratories Ltd.
140, Miyamoto, Numazu, Shizuoka 410-03, Japan
[2]Computer and Information Sciences
[3]Sinsheimer Laboratories
University of California, Santa Cruz, CA 95064, USA.

Keywords: Stochastic Context-Free Grammar, RNA, Transfer RNA, Multiple Sequence Alignments, Database Searching.


Abstract

Stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of homologous RNA sequences. These models capture the common primary and secondary structure of the sequences with a context-free grammar, much like those used to define the syntax of programming languages. SCFGs generalize the hidden Markov models used in related work on protein and DNA sequences. The novel aspect of this work is that the SCFGs developed here are learned automatically from initially unaligned and unfolded training sequences. To do this, a new generalization of the forward-backward algorithm, commonly used to train hidden Markov models, is introduced. This algorithm is based on tree grammars, and is more efficient than the inside-outside algorithm, which was previously proposed to train SCFGs. This method is tested on the family of transfer RNA (tRNA) sequences. The results show that the model can reliably discriminate tRNA sequences from other RNA sequences of similar length, determine the secondary structure of new tRNA sequences, and produce accurate multiple alignments of large collections of tRNA sequences. The model is also extended to handle introns present in tRNA genes.
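
To make the abstract's central idea concrete, below is a minimal illustrative sketch (in Python) of how paired productions in a context-free grammar capture RNA base pairing: a rule such as S -> a S u generates a nested A-U pair, so the nested stems of secondary structure fall out of the grammar's derivations. The specific rules and probabilities shown are assumptions for illustration only; they are not the grammar trained in the paper, whose rule probabilities are instead estimated automatically from unaligned training sequences.

# A toy stochastic context-free grammar whose paired productions model
# Watson-Crick base pairing.  Rules and probabilities are illustrative only.
import random

RULES = {
    "S": [
        (0.22, ("a", "S", "u")),   # A-U pair enclosing a sub-structure
        (0.22, ("u", "S", "a")),
        (0.22, ("c", "S", "g")),   # C-G pair enclosing a sub-structure
        (0.22, ("g", "S", "c")),
        (0.06, ("S", "S")),        # bifurcation: two adjacent stems
        (0.06, ("L",)),            # end of a stem: an unpaired loop base
    ],
    "L": [
        (0.25, ("a",)), (0.25, ("c",)), (0.25, ("g",)), (0.25, ("u",)),
    ],
}

def generate(symbol="S", max_depth=20, depth=0):
    """Sample a string (and implicitly a parse tree) from the toy SCFG."""
    if symbol not in RULES:            # terminal symbol: a nucleotide
        return symbol
    if depth >= max_depth:             # guard against unbounded recursion
        return random.choice("acgu")
    r, acc = random.random(), 0.0
    for prob, rhs in RULES[symbol]:    # pick a production by its probability
        acc += prob
        if r <= acc:
            return "".join(generate(s, max_depth, depth + 1) for s in rhs)
    return random.choice("acgu")       # numerical fallback

if __name__ == "__main__":
    random.seed(0)
    for _ in range(3):
        print(generate())   # nested, base-paired strings, e.g. "gcaugc"

Each sampled string comes with a derivation whose nesting mirrors a possible fold, which is why a context-free grammar, unlike a hidden Markov model, can represent the long-range base-pair interactions of RNA secondary structure.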