FOREST, a Browser for Huge DNA Sequences

R. Gras (gras@irisa.fr)
J. Nicolas (jnicolas@irisa.fr)

IRISA
Campus de Beaulieu 35042 Rennes cedex, France

Abstract

We present FOREST, a new tool for representing the content of a large nucleic acid sequence (e.g. >100 KB) in a form suitable for the biologist. More precisely, FOREST computes all subsequences that are repeated within a sequence or a set of sequences. It not only locates the occurrences of a given subsequence but also points to subsequences that are interesting with respect to a given criterion. The tool is based on two key ideas. The first is to build a suffix-tree representation of a sequence and to associate with each node of this tree a set of synthesized attributes, computed over the set of subsequences below that node. This lets the biologist "browse" the sequence while keeping a constant abstract view of what he may expect to find in the part of the tree currently under investigation. The second idea is to summarize the distribution of the information with boolean vectors associated with the sequence. These vectors can easily be displayed as a linear map of events, as is done in genetic mapping. Both representations support various efficient operations on the sequence. They provide a powerful capacity for filtering the data while keeping the set of elementary filtering operations down to a small number of conceptual operations. This allows the biologist to easily investigate the most prominent features of the lexical structure of his sequences.
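
The two ideas can be illustrated with a minimal Python sketch. It is not the FOREST implementation: for brevity it uses an uncompressed, depth-bounded suffix trie in place of a true suffix tree, it takes the list of occurrence positions as the only synthesized attribute, and it takes "covered by a repeat of at least min_len symbols" as the only filtering criterion. All names (Node, build_trie, synthesize, occurrences, repeat_mask) and the example sequence are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)   # symbol -> child Node
    ends_here: list = field(default_factory=list)  # starts of suffixes ending at this node
    positions: list = field(default_factory=list)  # synthesized: starts of all occurrences


def build_trie(seq, max_depth):
    """Insert every suffix of seq, truncated to max_depth symbols."""
    root = Node()
    for start in range(len(seq)):
        node = root
        for base in seq[start:start + max_depth]:
            node = node.children.setdefault(base, Node())
        node.ends_here.append(start)
    return root


def synthesize(node):
    """Compute the positions attribute bottom-up from the children."""
    collected = list(node.ends_here)
    for child in node.children.values():
        collected.extend(synthesize(child))
    node.positions = sorted(collected)
    return node.positions


def occurrences(root, word):
    """Start positions of word (word must not exceed max_depth symbols)."""
    node = root
    for base in word:
        node = node.children.get(base)
        if node is None:
            return []
    return node.positions


def repeat_mask(root, seq_len, min_len):
    """Boolean vector: True where a repeat of length >= min_len covers the position."""
    mask = [False] * seq_len
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if len(node.positions) < 2:
            continue  # not repeated, and no descendant can be repeated either
        if depth >= min_len:
            for start in node.positions:
                for i in range(start, start + depth):
                    mask[i] = True
        stack.extend((child, depth + 1) for child in node.children.values())
    return mask


seq = "GATTACAGATTACC"
root = build_trie(seq, max_depth=8)
synthesize(root)
mask = repeat_mask(root, len(seq), min_len=6)

print(occurrences(root, "GATTAC"))                # [0, 7]
print("".join("#" if m else "." for m in mask))   # ######.######.
```

Because each node already carries its attribute, a lookup only walks the tree along the query word and reads the result, and a subtree can be discarded as soon as its attribute fails the criterion. The printed mask is a crude linear map of the sequence; several such vectors of equal length could be combined elementwise with "and" or "or" to express compound filters.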