next up previous contents
Next: 6 Appendix Up: Pairwise Sequence Alignments Previous: 4 Realistic Gap Models

5 Variations of Pairwise Alignment

5.1 Local Alignment and Local Similarity

  (Local Alignment) The notion of edit distance and its implementation via dynamic programming are easily adapted to variations of the original problem. Two such variations are discussed here.

We first discuss the problem of local alignment, which is also called approximate pattern matching. We then turn to the problem of local similarity, which (unfortunately) has often been called "local alignment".

Consider the problem of locating the famous TATAAT-box (a bacterial promoter) in a piece of DNA. Typically, this motif does not occur exactly as spelled here, but may show up with one or two bases changed.

Local alignment is the problem where is relatively short with respect to and we seek that subunit of which aligns best with. More precisely:

Given and , find such that is minimal among all choices of .

This is also called the approximate matching problem (of against ), and very sophisticated methods have been developed to do this efficiently. But a simple change to the dynamic programming scheme will also do the job, albeit a little slower.

(Local Alignment Recursion) Local alignment means that we do not charge costs for

  1. deletion of a prefix
  2. deletion of a suffix

(1) means that , and hence we initialize the first row of the distance matrix accordingly:

The recursion formula for the inner row is unchanged (as given in section 2.1). However

(2) means that

So the last line of the cost matrix is calculated according to

As before, gives the cost of the optimal local alignment. The matching subsequence is then found as follows:

An interesting alternative is to calculate the last line of the matrix according to the general case. Then, minimal values in the last line indicate those positions where the best matches of against substrings of end. This is the way the ``local''-option of the BioMOO alignment server is programmed.

(Local Similarity) For the second variation of our similarity theme, consider the following case. Let and be two proteins, which we suspect to carry some functionally related subunits. However, most parts of and do not contribute to this function, and may well be very different. A pairwise alignment of and would be unlikely to clearly exhibit small regions of high similarity, as it is geared to minimize distance over the full length of both sequences.

Thus we are faced with another problem, this one being symmetric between and . We ask for those subunits of and that exhibit most similarity. This is called the local similarity problem, since we use a similarity rather than a distance measure in the following way:

This still leaves a lot of freedom to differentiate degrees of (dis-)similarity. The point is that we use the score 0 as a cut-off value between subsequences with/without similarity. We are now maximizing similarity rather than minimizing distance. The border cases now become

and the general recursion formula is

The cut-off value of zero means that long stretches of dissimilarity show as regions of zeroes in the matrix, from which stretches of local similarity rise as islands of positive values.

Some Exercises involving Local Alignment / Local Similarity

  1. Calculate a dynamic programming matrix and local alignment for the sequences ATT and ATCTTC. Compare your results with the BioMOO server, i.e. type "opt_align ATT ATCTTC matrix zero with #90". (You can also use the WWW-Interface, see this tutorial.)

  2. Justify intuitively the treatment of the border cases in the local simililarity problem.
  3. What happens if you formulate "local alignment" in a symmetric fashion, not charging for terminal gaps in either of the two sequences ?
  4. If you use ``2'' as the cutoff value for local similarity calculations (see text), can you expect meaningful results ?

5.2 Heuristic Methods

(Edit Dist. Calc. Complexity) Let us now consider the computational resources needed for calculating edit distances: The dynamic programming scheme calculates (for input sequences and ) matrix entries. Each matrix element can be calculated from the three adjacent elements with a fixed number of operations. Then the execution time of this program is proportional to . In the jargon of computer science, we say that its time complexity is .

If we are only interested in the edit distance of and , we only need to store one column (or row) of the matrix at a time in order to calculate the next one. In this case, the space complexity is or . If we need to retrace the path that leads to the optimal alignment, the whole matrix must be stored and the space complexity moves up to , too.

This is quite feasible on today's computers, even when and are 10 000 or 100 000 characters long. However, for comparing a sequence with a whole database of tens or hundreds of thousands of sequences, we must seek for a more efficient solution.

The answer is to use heuristic approaches, which only approximate the proper edit distance and the optimal alignment, but do this with a time complexity close to for a given pair and . We will discuss the ideas of two such heuristics, and you will learn how to use them in the next chapter.

(FASTA) FASTA was developed by Lipman and Pearson in 1985. FASTA considers exact matches between short substrings of and , i.e. all pairs such that , for a given parameter . If a significant number of such exact matches is found, FASTA uses the dynamic programming algorithm to compute optimal alignments.

This approach allows to trade speed for precision: The larger we choose the parameter , the smaller is the number of exact matches. This makes the program faster, but loses precision: It becomes less likely that the optimal alignment contains enough exact matches of length , and the procedure may find nothing. Nevertheless, experience shows that with sensibly chosen parameters, FASTA misses very few cases of significant homology.

(BLAST) BLAST, developed by Altschul et al. in 1990, is another heuristic based on a similar idea. BLAST focusses on no-gap alignments of (again) a certain, fixed length . Rather than requiring exact matches, BLAST uses a scoring function to measure similarity (rather than distance). In particular for proteins, one can argue that segment pairs with no gaps and a high similarity score indicate regions of functional similarity. For a given threshold parameter , BLAST reports to the user all database entries which have a segment pair with the query sequence that scores higher than . If the scoring function used has a probabilistic interpretation, BLAST can also give an assessment of the statistical significance of the matches it reports.

Some Exercises involving Heuristic Methods

These exercises should be done after going through Chapter 2, which tells you where to find the FASTA and BLAST servers. More exercises are available, thanks to Francisco M. De La Vega. These also include a solution sheet.

  1. Retrieve an arbitrary sequence from Genebank/EMBL or SwissProt. (You may choose a sequence from your research context.) Slightly modify it and call the modified sequence .
  2. Search for homologues using FASTA. Is the original sequence among the reported matches? Save the results.
  3. Repeat (3) using BLAST.
  4. Compare the matches found in (3) and (4) with each other. Can you explain the differences?
  5. Redo (3) or (4) and try to use more restrictive parameters, such that the original sequence is reported as the only match. If this is not possible, explain why!

Back to VSNS BioComputing Division Home Page.
VSNS-BCD Copyright 1995/1996.
Robert Giegerich



next up previous contents
Next: 6 Appendix Up: Pairwise Sequence Alignments Previous: 4 Realistic Gap Models




Mon Apr 29 18:31:03 MET DST 1996
Valid HTML 2.0!