Approximate Multiple String Searching by Clustering

Fei Shi (shi@inf.ethz.ch)
Peter Widmayer (widmayer@inf.ethz.ch)

Department of Computer Science
Swiss Federal Institute of Technology Zurich
(ETH Zurich)
ETH Zentrum, CH-8092 Zurich, Switzerland

Abstract

We are given a finite set S of text strings and a pattern P over some fixed alphabet Sigma. The topic of this paper is the design of a data structure D(S) which supports approximate multiple string searching queries efficiently. Thereby, for a given upper bound in Z+ on the allowable distance, P = p1... pm is said to appear approximately in a text T = t1... tn, m, in Z+, if there exist positions u, v in T such that the edit distance between P and tu... tv is at most k. Let N denote the sum of the lengths of all strings in S. We present an algorithm that constructs the data structure D(S) in O(N) time and space. Afterwards, an approximate multiple string search query can be answered in O(N) expected-time if the allowable distance k is bounded above by O(m/log m). The method can be used to search large nucleotide and amino acid sequence databases for similar sequences.