Stochastic Models and Grammars in Molecular Biology

Kiyoshi Asai

Electrotechnical Laboratory (ETL)
Umezono, Tsukuba 305, Japan
e-mail: asai@etl.go.jp　FAX: 0298-58-5939

One of the most important goal in molecular biology is to find the meanings of the biological sequences. Translation from the triplet DNAs to the amino acids were clarified, but most of the meanings are still unknown. Because of the constraints of base pairing in DNA and RNA, it is natural to adopt grammars to fine the meanings. Regular grammar, context free grammar, and many other grammars have been applied to describe the features of the biological sequences. Another way to find the hidden meaning of the sequences is statistics. Even though the translation mechanism of DNA sequences or the dynamics of the structures of protein may not be probabilistic, statistics has been widely used in molecular biology. The occurrence of certain amino acids in certain protein structures and the number of certain combinations of bases in DNA have been counted to get the hint to predict the unknown protein structures and to find the signal sequences in DNA.

These grammars and statistics are now developing into more powerful methods, stochastic models and stochastic grammars.

Hidden Markov Models (HMMs), which are commonly used stochastic models in speech recognition, are becoming popular in molecular biology. In molecular biology, HMMs have been used to predict the protein structures, to make multiple sequence alignments, to classify the protein sequences, to extract stochastic motifs of protein sequences, to model the signal patterns in DNA sequences. The word 'hidden' of HMMs means that we cannot directly see the hidden states which correspond to structures, signals etc. in biological sequences, but the amino acids and bases.

From the view point of language theory, Markov Models are regular grammar, and HMMs are stochastic regular grammar. Proteins and DNAs are far from regular, however, because there are pairwise interaction in their structures. Therefore, stochastic context free grammar and stochastic tree grammar have been used for structure modeling of RNAs and proteins.

However, the more the complicated the grammar, the more difficult to find the practical parsing algorithm. It is required not only to find the optimal grammars for the sequences but to find the practical parsing algorithm for them.

References

[1]K.Asai et.al.: ``Genetic Information Processing by Stochastic Model'' Proc. Genome Informatics Workshop 2, (1991).

[2]K.Asai et.al.: ``HMM with Protein Structure Grammar,'' Proc. 26th HICSS (1993).

[3]M.Brown et.al: ``Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families,'' Proc. ISMB93 (1993).

[4]Y.Fujiwara and A.Konagaya: ``Protein Motif Extraction Using Hidden Markov Model,'' Proc. 7th JSAI (1993).

[5]Haussler_A}D.Haussler et.al.: ``Protein modeling using hidden Markov Models: Analysis of globins,'' Proc. 26th HICSS (1993).

[6]S.Kobayashi and T.Yokomori: ``Modeling RNA Secondary Structures Using Tree Grammars,'' Proc. Genome Informatics Workshop 1994.

[7]H.Mamitsuka and N.Abe: ``Predicting Location and Structure of Beta-Sheet Regions Using Stochastic Tree Grammars,'' Proc. ISMB94 (1994).

[8]Y.Sakakibara et.al: ``Stochastic Context-Free Grammars for Modeling RNA'' Proc. 27th HICSS (1994).

[9]C.M.Stultz, J.V.White and T.F.Smith: ``Structural analysis based on state-space modeling,'' Protein Science 2 (1993).

[10]H.Tanaka et.al.: ``Classification of Proteins via Successive State Splitting of Hidden Markov Network,'' Proc. of Workshop 26 in IJCAI93 (1993).