(Local Alignment) The notion of edit distance and its implementation via dynamic programming are easily adapted to variations of the original problem. Two such variations are discussed here.
We first discuss the problem of local alignment, which is also called approximate pattern matching. We then turn to the problem of local similarity, which (unfortunately) has often been called "local alignment".
Consider the problem of locating the famous TATAAT-box (a bacterial promoter) in a piece of DNA. Typically, this motif does not occur exactly as spelled here, but may show up with one or two bases changed.
Local alignment is the problem where
is relatively short with respect to
and we seek that subunit of
which
aligns best with. More
precisely:
Givenand
, find
such that
is minimal among all choices of
.
This is also called the approximate matching problem (of
against
), and
very sophisticated methods have been developed to do this efficiently. But
a simple change to the dynamic programming scheme will also do the job, albeit
a little slower.
(Local Alignment Recursion) Local alignment means that we do not charge costs for
(1) means that
,
and hence we initialize the first row of the distance matrix accordingly:
The recursion formula for the inner row is unchanged (as given in section 2.1). However
(2) means that
So the last line of the cost matrix is calculated according to
As before,
gives the cost of the optimal local alignment.
The matching subsequence
is then found as follows:
An interesting alternative is to calculate the last line of the matrix
according to the general case. Then, minimal values in the last line
indicate those positions where the best matches of
against substrings
of
end. This is the way the ``local''-option of the BioMOO alignment
server is programmed.
(Local Similarity)
For the second variation of our similarity theme, consider the following
case. Let
and
be two proteins,
which we suspect to carry some
functionally related subunits. However, most parts of
and
do not contribute to this function, and may well be very different.
A pairwise alignment of
and
would
be unlikely to clearly exhibit
small regions of high similarity, as it is geared to minimize distance over
the full length of both sequences.
Thus we are faced with another problem, this one being symmetric between
and
. We ask for those subunits of
and
that exhibit most similarity.
This is called the local similarity problem, since we use a similarity rather than a
distance measure in the following way:
This still leaves a lot of freedom to differentiate degrees of (dis-)similarity. The point is that we use the score 0 as a cut-off value between subsequences with/without similarity. We are now maximizing similarity rather than minimizing distance. The border cases now become
and the general recursion formula is
The cut-off value of zero means that long stretches of dissimilarity show as regions of zeroes in the matrix, from which stretches of local similarity rise as islands of positive values.
"opt_align ATT ATCTTC matrix zero with #90".
(You can also use the WWW-Interface, see
this tutorial.)
(Edit Dist. Calc. Complexity)
Let us now consider the computational resources needed for calculating edit
distances:
The dynamic programming scheme calculates (for input sequences
and
)
matrix entries. Each matrix
element can be calculated from the three adjacent elements with a fixed number
of operations.
Then the execution time of this program is proportional to
. In
the jargon of computer science, we say that its time complexity is
.
If we are only interested in the edit distance of
and
, we only need
to store one column (or row) of the matrix at a time in order to calculate the
next one. In this case, the space complexity is
or
.
If we need to retrace the path that leads to the optimal alignment, the
whole matrix must be stored and the space complexity moves up to
, too.
This is quite feasible on today's computers, even when
and
are
10 000 or 100 000 characters long. However, for comparing a sequence
with a whole database of tens or hundreds of thousands of sequences, we
must seek for a more efficient solution.
The answer is to use heuristic approaches, which only approximate the
proper edit distance and the optimal alignment, but do this with a time
complexity close to
for a given pair
and
. We will discuss the ideas of two such
heuristics, and you will learn how to use them in the next chapter.
(FASTA)
FASTA was developed by Lipman and Pearson in 1985. FASTA considers exact
matches between short substrings of
and
, i.e. all pairs
such that
,
for a given parameter
. If a significant number of such exact matches is
found, FASTA uses the dynamic programming algorithm to compute optimal alignments.
This approach allows to trade speed for precision: The larger we choose the
parameter
, the smaller is the number of exact matches. This makes the
program faster, but loses precision: It becomes less likely that the
optimal alignment contains enough exact matches of length
,
and the procedure may find nothing.
Nevertheless, experience shows that with sensibly chosen parameters, FASTA
misses very few cases of significant homology.
(BLAST)
BLAST, developed by Altschul et al. in 1990, is another heuristic based
on a similar idea. BLAST focusses on no-gap alignments of (again) a
certain, fixed length
. Rather than requiring exact matches, BLAST
uses a scoring function to measure similarity (rather than distance). In
particular for proteins, one can argue that segment pairs with no gaps and a
high similarity score indicate regions of functional similarity.
For a given threshold parameter
, BLAST reports to the user all
database entries which have a segment pair with the query sequence that
scores higher than
.
If the scoring function used has a probabilistic interpretation, BLAST can
also give an assessment of the statistical significance of the matches it
reports.
These exercises should be done after going through Chapter 2, which tells you where to find the FASTA and BLAST servers. More exercises are available, thanks to Francisco M. De La Vega. These also include a solution sheet.
.
Back to VSNS BioComputing Division Home Page.
VSNS-BCD Copyright 1995/1996.
Robert Giegerich