Weight Matrices for Sequence Similarity Scoring

Version 2.0
May 1996

David Wheeler, Ph.D.
Department of Cell Biology,
Baylor College of Medicine
Houston, Texas
E-mail: wheeler@bcm.tmc.edu

Table of Contents

  1. Weight matrices for sequence similarity scoring
  2. Importance of scoring matrices
  3. Examples of matrices
  4. Log odds matrices
  5. PAM matrix construction
  6. Properties of the PAM matrix
  7. Assumptions in the PAM model
  8. BLOSUM (Blocks Substitution Matrix) matrix
  9. Practical aspects
  10. Selecting an optimal PAM matrix
  11. Other specialized scoring matrices

Weight Matrices for Sequence Similarity Scoring

Outline:

  1. Objective: Overview of methods and theories that underlie the construction of scoring matrices.
  2. Examples of weight matrices for nucleotide and amino acid scoring.
  3. Transition probability matrix: PAM
  4. BLOSUM matrix
  5. Practical aspects
  6. Other refinements to transition probability matrices.

Reading:

Back to Table of Contents.


Importance of scoring matrices

Similarity vs. Distance

  1. Elements of the matrices specify the weight to assign a given comparison by:
  2. Distance is more naturally used for phylogenetic tree reconstruction; similarity is used for database searching.
  3. The logic of the algorithm doesn't change: maximizing a similarity is fundamentally the same as minimizing a distance.
  4. Distance and similarity matrices are inter-convertible by some mathematical transformation appropriate for the given application.
Back to Table of Contents.


Examples of matrices

A remark on notation

When we consider scoring matrices, we encounter the convention that matrices have numeric indices corresponding to the rows and columns of the matrix. That is, M11 refers to the entry at the first row and the first column. In general, Mij refers to the entry at the ith row and the jth column. To use this for sequence alignment, we simply associate a numeric value to each letter in the alphabet of the sequence. For example, if the alphabet is
[squiggly_A]={A,C,G,T}
then A = 1, C = 2, etc. Thus, one would find the score for a match between A and C at M12. Since we consider different scoring matrices in this section, we distinguish between them by using different letters for the matrix, Rij refers to the Replacement matrix, Sij to the log odds matric, and so on.

Nucleotide scoring

  1. Identity matrix (similarity)
       A  T  C  G
    
    A  1  0  0  0
    
    T  0  1  0  0
    
    C  0  0  1  0
    
    G  0  0  0  1
    

    For elements in row i by column j:

    Sij=1,i=j
    Sij=0,i!=j

  2. BLAST matrix (similarity)
       A  T  C  G
    
    A  5 -4 -4 -4
    
    T -4  5 -4 -4
    
    C -4 -4  5 -4
    
    G -4 -4 -4  5
    

  3. Transition/Transversion Matrix
       A  T  C  G
    
    A  0  5  5  1
    
    T  5  0  1  5
    
    C  5  1  0  5
    
    G  1  5  5  0
    

    Nucleotide bases fall into two categories depending on the ring structure of the base. Purines (Adenine and Guanine) are two ring bases, pyrimidines (Cytosine and Thymine) are single ring bases. Mutations in DNA are changes in which one base is replaced by another. A mutation that conserves the ring number is called a transition (e.g., A -> G or C -> T) a mutation that changes the ring number are called transversions. (e.g. A -> C or A -> T and so on).

    Although there are more ways to create a transversion, the number of transitions observed to occur in nature (i.e., when comparing related DNA sequences) is much greater. Since the likelihood of transitions is greater, it is sometimes desireable to create a weight matrix which takes this propensity into account when comparing two DNA sequences.

    Use of a Transition/Transversion Matrix reduces noise in comparisons of distantly related sequences.

Protein scoring

  1. Identity matrix

    Rij=1,i=j
    Rij=0,i!=j

  2. Genetic Code Matrix
  3. Physical/chemical characteristics
Back to Table of Contents.


Log odds matrices

Sij= log(qij/(pi*pj))
S is the log odds ratio of two probabilities: the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.


PAM Matrix

Summary of steps:

  1. Align sequences that are at least 85% identical.
  2. Reconstruct phylogenetic trees and infer ancestral sequences. 71 trees containing 1,572 exchanges were used.
  3. Tally replacements "accepted" by natural selection, in all pair-wise comparisons (each Aij is the number of times amino acid j was replaced by amino acid i in all comparisons).
  4. Compute amino acid mutability, mj, i.e., the propensity of a given amino acid, j, to be replaced.
  5. Combine data from 3 & 4 to produce a Mutation Probability Matrix for one PAM of evolutionary distance, according to the following formulae:
    Mij=(mj*Aij)/(sum_over_all_i Aij)

    Mjj=1-mj

  6. Calculate Log Odds Matrix for similarity scoring:

    Divide each element of the Mutation Data Matrix, M, by the frequency of occurance of each residue:
    Rij=Mij/fi
    R is a Relatedness Odds Matrix, fi is the frequency of residue i.

    The Log Odds Matrix, Sij, is calculated from the relatedness odds matrix, Rij, simply by taking the log of each Rij.

  7. Different protein families manifest different PAM rates.
Back to Table of Contents.


Properties of Mutation Probablitiy Matrix

  1. The sum of mj for any column, j, is one (trivial). Note that the probability that an amino acid will change is on the order of 1% for each amino acid. The probability that it will stay the same is on the order ot 99% for each amino acid.
  2. The Mutation Probability Matrix, M1, defines a unit of evolutionary change: specifically, 1 PAM (Accepted Point Mutation per 100 residues).
  3. Successive application of M1 on a sequence yields 2, 3, 4... PAMs of evolutionary change.
  4. The following operations are equivalent:
  5. The matrix has compositional information in it, since it depends on the relative frequencies of amino acids in the pool of sequences from which the tallies were drawn. In the extremes, the following obtain:
Back to Table of Contents.


Assumptions in PAM model:

  1. replacement at any site depends only on the amino acid at that site and the probability given by the table (Markov model).
  2. sequences that are being compared have average amino acid composition.
Sources of error in PAM model

  1. Many sequences depart from average composition.
  2. Rare replacements were observed too infrequently to resolve relative probabilities accurately (for 36 pairs no replacements were observed!).
  3. Errors in 1PAM are magnified in the extrapolation to 250 PAM.
  4. The Markov process is an imperfect representation of evolution: Distantly related sequences usually have islands (blocks) of conserved residues. This implies that replacement is not equally probable over entire sequence.
Back to Table of Contents.


BLOSUM (Blocks Substitution Matrix) Matrix

Steven Henikoff and Jorja G. Henikoff (1992). Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89: 10915-10919.

  1. Starting data is conserved blocks from Blocks database.
  2. Tallies of replacements are made by straight forward tallying of all pairs of aligned residues, fij
    qij = fij/(total number of residue pairs)
  3. Similar sequences in a block, above a threshold percent similarity are clustered and members of the cluster count fractionally toward the final tally.
Back to Table of Contents.


New Scoring Matrices

David T. Jones, William R. Taylor and Janet M. Thornton
(1992). The rapid generation of mutation data matrices from
protein sequences. CABIOS 8: 275-282.
An update to the PAM matrix using the method of Dayhoff. 59,190 accepted mutations in 16,130 sequences were tallied.
Gaston H. Gonnet, Mark A. Cohen, Steven A. Benner
(1992). Exhaustive Matching of the Entire Protein Sequence
Database. Science 256: 1443-1445
The answer to life the universe and everything. A scoring
matrix based on alignment of the entire SWISS-PROT data base. 1.7 x 10^6 matches were used from sequences differing by 6.4 to 100.0 PAM.
Back to Table of Contents.


Selecting an optimal PAM matrix

  1. As the evolutionary distance increases, the information content of the corresponding PAM matrix decreases.
  2. As the information content of the PAM matrix decreases a longer region of similarity is required to generate a sufficiently high score to be distinguishable from chance.
  3. Regions of similarity in real proteins are found in narrower "blocks" as the evolutionary distance increases.

    PAM                             Min. significant                
    distance        H (bits)        length (30 bits)                
     
      0             4-17              8             
     10             3-43              9             
     20             2-95             11             
     30             2-57             12             
     40             2-26             14             
     50             2-00             15             
     60             1-79             17             
     70             1-60             19             
     80             1-44             21             
     90             1-30             24             
    100             1-18             26             
    110             1-08             28             
    120             0-08             31             
    130             0-90             34             
    140             0-82             37             
    150             0-70             40             
    160             0-70             43             
    170             0-65             47             
    180             0-60             51             
    190             0-55             55             
    200             0-51             59             
    210             0-48             63             
    220             0-45             68             
    230             0-42             73             
    240             0-39             78             
    250             0-36             83             
    260             0-34             89             
    270             0-32             91             
    280             0-30            100             
    290             0-28            107             
    300             0-27            113             
    310             0-25            120             
    320             0-24            127             
    330             0-22            134             
    340             0-21            141             
    350             0-20            149             
    

Back to Table of Contents.


Other specialized scoring matrices

Back to Table of Contents.


Back to VSNS BioComputing Division Home Page
VSNS-BCD Copyright
David Wheeler

Thanks to Karen R. Lafollette and Dr Paula Burch for technical assistance, and to A. Guffanti for manuscript preparation.

HTML Checked!