Since DNA sequences are not just linear sequences of deoxyribonucleic acids randomly generated but sequences intended to convey genetic codes of specific biological functions (or meanings), it appears quite reasonable to assume the existence of a certain kind of grammatical devices by which DNA (or RNA) sequences are generated and modeled.
Therefore, in order to solve the prediction
problems concerning biological structures such
as protein secondary (or tertiary) structures,
RNA secondary structures, no matter what structures they are,
it may be of great importance to propose certain
formal systems (grammars, automata, etc.) to model
those structures in an appropriate manner. This enables
us to analyse the biological properties of DNA or
amino acid sequences in terms of formal language
theoretic concepts {Searl93}.
This talk concerns three DNA linguistic topics of that kind mentioned above. The first topic refers to formal rewriting systems called splicing systems, proposed by Tom Head{Head87}, and some experimental results performed on biological data. More specifically, we first propose an efficient algorithm for learning from positive data a special type of regular languages called { locally testable languages}, and then describe several experimental results via a machine identification system based on the learning algorithm developed above. Following a theoretical observation due to Head which strongly suggests that a certain type of amino acid sequences can be expressed by a locally testable language due to Head, we apply the system to identifying the protein alpha-chain region in amino acid sequences for hemoglobin. Some experimental data show that the system achieved an overall success rate of more than 95% correct identification for both positive and negative data {YIK94}.
In the second topic, we are concerned with
analysing formal linguistic properties of DNA sequences in which
a number of the language theoretic analysis on DNA sequences are
established by means of computational methods {YK94}. First,
employing formal language theoretic framework, we consider
a kind of an { evolutionary problem} of
DNA sequences, asking how DNA sequences were initially
created and then evolved (grew up) to be a language
of certain complexity, and what primitive
constructs were minimally required for the process of evolution.
In terms of formal linguistic concepts, we present
several results that provide our view on these questions
at a conceptual level. For example, one of our results tells
that biological palindromes play a very primitive and
essential role in generating any algorithmic language.
Based on the formal analysis on these biological questions, in the third topic, we then choose a certain type of tree generating grammars called Tree Adjunct Grammars (TAGs) {JLT75} to attack the problem of modeling the secondary structure of RNA sequences. By proposing an extended model of TAGs (called TAGs with tags and denoted by TAGs), we evaluate the usefulness of the grammars for modeling some typical RNA secondary structures including pseudoknots. It is observed that TAGs is too powerful to model the planar graphic representation for RNA secondary structures. Hence, we propose a specialized version of TAG, called tree adjunct grammar with tag for RNA (denoted by TAG_RNA ), which appears to be just the right kind of descriptive power for modeling RNA secondary structures of any complexity {KY94}. These suggests that TAG families as RNA grammars have a great potential applicability to RNA secondary structure prediction.
Note. This work contains the results of cooperative work with Satoshi Kobayashi.
[Hea87] T. Head. Formal Language Theory and DNA : An Analysis of the Generative Capacity of Specific Recombinant Behaviors.
Bulletin of Mathematical Biology, vol.49, pp.737-759, 1987.
[Sea93] D. B. Searls.
The computational linguistics of biological sequences.
In L. Hunter, editor,
Artificial Intelligence and Molecular biology : Chapter 2,
AAAI Press, pp. 47-120, 1993.
[KY94] S. Kobayashi and T. Yokomori. Modeling RNA Secondary Structures Using Tree Grammars. In Proc. of Genome Informatics Workshop V, pp.29-38, 1994.
[YIK94] T. Yokomori, N. Ishida and S. Kobayashi . Learning Local Languages and Its Application to Protein alpha-chain Identification. In Proc. of 27th Hawaii International Conference on System Sciences, vol.V, pp.113-122, 1994.
[YK94] T. Yokomori and S. Kobayashi. DNA Evolutionary Linguistics and RNA Structure Modeling : A Computational Approach. To appear in Proc. of 1st Intern. IEEE Symposium on Intelligence in Neural and Biological Systems, May, 1994.