Computational Linguistics for DNA Evolution
and RNA Structure Modeling

Takashi YOKOMORI

Dept. of Computer Science and Information Mathematics
The University of Electro-Communications
1-5-1, Chofugaoka, Chofu, Tokyo 182, Japan
e-mail: yokomori@cs.uec.ac.jp

Abstract

Since DNA sequences are not just linear sequences of deoxyribonucleic acids randomly generated but sequences intended to convey genetic codes of specific biological functions (or meanings), it appears quite reasonable to assume the existence of a certain kind of grammatical devices by which DNA (or RNA) sequences are generated and modeled.

Therefore, in order to solve the prediction problems concerning biological structures such as protein secondary (or tertiary) structures, RNA secondary structures, no matter what structures they are, it may be of great importance to propose certain formal systems (grammars, automata, etc.) to model those structures in an appropriate manner. This enables us to analyse the biological properties of DNA or amino acid sequences in terms of formal language theoretic concepts {Searl93}.

This talk concerns three DNA linguistic topics of that kind mentioned above. The first topic refers to formal rewriting systems called splicing systems, proposed by Tom Head{Head87}, and some experimental results performed on biological data. More specifically, we first propose an efficient algorithm for learning from positive data a special type of regular languages called { locally testable languages}, and then describe several experimental results via a machine identification system based on the learning algorithm developed above. Following a theoretical observation due to Head which strongly suggests that a certain type of amino acid sequences can be expressed by a locally testable language due to Head, we apply the system to identifying the protein alpha-chain region in amino acid sequences for hemoglobin. Some experimental data show that the system achieved an overall success rate of more than 95% correct identification for both positive and negative data {YIK94}.

In the second topic, we are concerned with analysing formal linguistic properties of DNA sequences in which a number of the language theoretic analysis on DNA sequences are established by means of computational methods {YK94}. First, employing formal language theoretic framework, we consider a kind of an { evolutionary problem} of DNA sequences, asking how DNA sequences were initially created and then evolved (grew up) to be a language of certain complexity, and what primitive constructs were minimally required for the process of evolution. In terms of formal linguistic concepts, we present several results that provide our view on these questions at a conceptual level. For example, one of our results tells that biological palindromes play a very primitive and essential role in generating any algorithmic language.

Based on the formal analysis on these biological questions, in the third topic, we then choose a certain type of tree generating grammars called Tree Adjunct Grammars (TAGs) {JLT75} to attack the problem of modeling the secondary structure of RNA sequences. By proposing an extended model of TAGs (called TAGs with tags and denoted by TAGs), we evaluate the usefulness of the grammars for modeling some typical RNA secondary structures including pseudoknots. It is observed that TAGs is too powerful to model the planar graphic representation for RNA secondary structures. Hence, we propose a specialized version of TAG, called tree adjunct grammar with tag for RNA (denoted by TAG_RNA ), which appears to be just the right kind of descriptive power for modeling RNA secondary structures of any complexity {KY94}. These suggests that TAG families as RNA grammars have a great potential applicability to RNA secondary structure prediction.

Note. This work contains the results of cooperative work with Satoshi Kobayashi.

References

[JLT75] A. K. Joshi, L. S. Levy and M. Takahashi. Tree adjunct grammars. Journal of Computer and System Sciences, vol.10, pp.136-163, 1975.

[Hea87] T. Head. Formal Language Theory and DNA : An Analysis of the Generative Capacity of Specific Recombinant Behaviors.

Bulletin of Mathematical Biology, vol.49, pp.737-759, 1987.
[Sea93] D. B. Searls. The computational linguistics of biological sequences. In L. Hunter, editor, Artificial Intelligence and Molecular biology : Chapter 2, AAAI Press, pp. 47-120, 1993.

[KY94] S. Kobayashi and T. Yokomori. Modeling RNA Secondary Structures Using Tree Grammars. In Proc. of Genome Informatics Workshop V, pp.29-38, 1994.

[YIK94] T. Yokomori, N. Ishida and S. Kobayashi . Learning Local Languages and Its Application to Protein alpha-chain Identification. In Proc. of 27th Hawaii International Conference on System Sciences, vol.V, pp.113-122, 1994.

[YK94] T. Yokomori and S. Kobayashi. DNA Evolutionary Linguistics and RNA Structure Modeling : A Computational Approach. To appear in Proc. of 1st Intern. IEEE Symposium on Intelligence in Neural and Biological Systems, May, 1994.