Multiple Sequence Alignment with a Tree-Based Weighting System

O. Gotoh (gotoh@saitama-cc.go.jp)

Department of Biochemistry, Saitama Cancer Center Research Institute
818 komuro, Ina-machi, Saitama 362, Japan

Abstract

Multiple sequence alignment is a powerful tool for studying structure-function relationship and evolution of biological macromolecules. The problem is computationally hard, and it is impractical to get an exact solution for more than several sequences. Although the so called progressive method is most widely used today for aligning a large set of sequences, the quality of alignments thus obtained is not satisfactory especially when the sequences to be aligned are distantly related. Recent progress in methods has changed the situation, and we can now obtain a high-quality alignment of a large number of sequences in practical computation time. The strategy attempts to refine a crude alignment by iteration with rigorous pairwise alignment between randomly partitioned submembers of the sequences [1, 2]. The cost of calculation for large sets of sequences can be suppressed by hundreds times by the use of 'generalized profile operations' [3].
Although the goodness of an alignment has been evaluated with the sum-of-pairs (SP) measure in these method, this measure is not appropriate when evolutionary distances between members are not evenly distributed. When we look at the globin superfamliy, for example, there are many alpha-chain and beta-chain haemoglobins and myoglobins with similar sequences within each family, while non-vertebrate globins are highly diverged from one another. Thus, the contributions of these minor members to the total SP score are predominated by those of large families. To correct the biased contributions, Altschul et al. suggested to calculate an SP score with a weight given to each pair of members, and proposed two methods for obtaining such a set of weights [4]. The weighted sum-of-pairs (WSP) score invokes a new problem, however, since profile based operations can no longer be applicable in general, which invalidates the accelerated alignment algorithm [3]. I show here that the profile-based fast algorithm of multiple sequence alignment can be regained under some limited yet practical conditions imposed on the weights.
In brief, we first calculate an unrooted tree which represents evolutionary relationships among all the members of sequences to be aligned. We impose a limitation on the weight given to a pair of members, W_{ab}, so that it is expressed as a product of factors assigned to the edges along the path from the member a to b in the tree. One of the Altschul et al.'s weighting system [4] intrinsically satisfies the above limitation. Although the other, more sound weighting system does not meet the condition per se we can easily obtain a very good approximation in accordance with the limitation. When the tree is dissected into two groups at an edge, the weight given to a pair between one member of the left side of the tree and that of the other side is 'factorized', and so it is possible to define the profiles of the left- and right-side groups. In this way we get the WSP of a given alignment in O(N) operations in stead of O(N^{2}) with the conventional method, where N is the number of primary sequences in the alignment. Refinement of a crude alignment by the randomized iteration method is also carried out without a significant loss of efficiency if partition points are restricted to the edges in the tree. Comparing the results of sequence alignment with those derived from three- dimensional structure of proteins, we found a general tendency in the quality of alignments obtained with various sequence alignment methods, i.e. the matching of the sequence alignments to structural ones improves in the order of the used methods: (1) pairwise alignment between single members < (2) progressive method < (3) randomized iteration without weight < (4) that with weight.