
Help - About Matrices

Introduction

The search assumes that the sequences being sought share an evolutionary ancestor with the query sequence. The best guess at the actual path of evolution is the path that requires the fewest evolutionary events. Not all substitutions are equally likely, so they should be weighted to account for this. Insertions and deletions are less likely than substitutions and should be weighted accordingly. The choice of search algorithm influences the sensitivity and selectivity of the search, and the choice of similarity matrix determines both the pattern and the extent of substitutions in the sequences the database search is most likely to discover.

There have been extensive studies of the frequencies with which amino acids have substituted for each other during evolution. These studies involved carefully aligning all of the proteins in several protein families and then constructing a phylogenetic tree for each family. Each tree can then be examined for the substitutions found on each branch, and the counts used to produce tables (scoring matrices) of the relative frequencies with which amino acids replace each other over a short evolutionary period. Thus a substitution matrix describes the likelihood that two residue types would mutate into each other over evolutionary time.
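One common way to turn such relative frequencies into scores is a log-odds ratio: the observed frequency of an aligned pair divided by the frequency expected by chance, on a log scale. The sketch below is a minimal illustration of that idea, not the procedure used for any particular published matrix; the function name, the scale factor and all frequencies are hypothetical.

```python
import math

def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Log-odds score in units of 1/scale bits, rounded to an integer.

    pair_freq: observed frequency of residue a aligned with residue b
    freq_a, freq_b: background frequencies of the two residues
    """
    expected = freq_a * freq_b  # frequency of the pair if residues were independent
    return round(scale * math.log2(pair_freq / expected))

# Hypothetical numbers for illustration only (not taken from any real data set).
# A pair seen four times as often as chance predicts gets a positive score;
# a pair seen a quarter as often gets a negative score of the same size.
favoured = log_odds_score(pair_freq=0.0100, freq_a=0.05, freq_b=0.05)
disfavoured = log_odds_score(pair_freq=0.000625, freq_a=0.05, freq_b=0.05)
print(favoured, disfavoured)
```

A positive score therefore means "seen more often than chance", a negative score "seen less often than chance", which is the sign convention the example matrices on this page follow.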

A substitution is more likely to occur between amino acids with similar biochemical properties. For example, the hydrophobic amino acids isoleucine (I) and valine (V) get a positive score in the matrices, reflecting the high likelihood that one will substitute for the other. By contrast, isoleucine gets a negative score with cysteine (C), because this substitution is far less likely to occur in a protein. Matrices are thus used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.

Guidelines for using matrices

Protein Query Length   Matrix     Open Gap   Extend Gap
>300                   BLOSUM50   -10        -2
85-300                 BLOSUM62   -7         -1
50-85                  BLOSUM80   -16        -4
>300                   PAM250     -10        -2
85-300                 PAM120     -16        -4
35-85                  MDM40      -12        -2
<=35                   MDM20      -22        -4
<=10                   MDM10      -23        -4
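
The guidelines above can be expressed as a small lookup. This is a sketch only: the function name and the `family` argument are hypothetical, the two MDM rows are grouped with the PAM rows because that is how the table lists them, and where ranges in the table share a boundary (85, 35, 10) the choice of row is an assumption noted in the comments.

```python
def suggest_parameters(query_length, family="BLOSUM"):
    """Return (matrix, open_gap, extend_gap) for a protein query length.

    Values are copied from the guideline table. Boundary handling is an
    assumption: 300 and 85 fall in the 85-300 row, 35 in the <=35 row,
    and 10 in the <=10 row.
    """
    if query_length < 1:
        raise ValueError("query_length must be positive")
    if family == "BLOSUM":
        if query_length > 300:
            return ("BLOSUM50", -10, -2)
        if query_length >= 85:
            return ("BLOSUM62", -7, -1)
        if query_length >= 50:
            return ("BLOSUM80", -16, -4)
        raise ValueError("the table lists no BLOSUM matrix for queries shorter than 50")
    if family == "PAM":
        if query_length > 300:
            return ("PAM250", -10, -2)
        if query_length >= 85:
            return ("PAM120", -16, -4)
        if query_length > 35:
            return ("MDM40", -12, -2)
        if query_length > 10:
            return ("MDM20", -22, -4)
        return ("MDM10", -23, -4)
    raise ValueError(f"unknown family: {family!r}")

print(suggest_parameters(120))        # ('BLOSUM62', -7, -1)
print(suggest_parameters(30, "PAM"))  # ('MDM20', -22, -4)
```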


Importance of scoring matrices
Types of matrices

Differences between PAM and BLOSUM

Equivalent PAM and BLOSUM matrices

The following matrices are roughly equivalent...
Generally speaking...
PAM (Point Accepted Mutation) matrix

Amino acid scoring matrices are traditionally PAM (Point Accepted Mutation) matrices, each tuned to a particular evolutionary distance between the sequences being compared. Thus PAM40 is most sensitive for sequences 40 PAMs apart, while PAM250 is intended for more distantly related sequences and is considered a good general matrix for protein database searching. For nucleotide sequence searching a simpler approach is used: a PAM40 matrix is converted into match/mismatch values, which takes into consideration that a purine may be replaced by a purine and a pyrimidine by a pyrimidine.
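
A PAM distance counts accepted mutations per 100 residues, not observed differences, because a site can change more than once or change back. The sketch below illustrates this with a deliberately simple toy model that is an assumption of this example and not Dayhoff's data-derived model: all 20 residues are equally frequent and each step changes a site with probability 0.01 to a uniformly chosen different residue. The function and parameter names are hypothetical.

```python
def expected_identity(pam_distance, alphabet_size=20, step_change=0.01):
    """Fraction of sites expected to match the ancestor after pam_distance steps.

    Toy model (an assumption, not Dayhoff's matrix): every residue is equally
    frequent, and at each step a site changes with probability step_change to
    one of the other residues chosen uniformly.
    """
    same = 1.0  # probability that a site currently shows the ancestral residue
    back = step_change / (alphabet_size - 1)  # chance per step of mutating back
    for _ in range(pam_distance):
        same = same * (1.0 - step_change) + (1.0 - same) * back
    return same

for distance in (40, 120, 250):
    print(distance, round(expected_identity(distance), 3))
```

Under these assumptions the identity at 250 steps comes out near 12 percent rather than zero, even though the sequences have accumulated 250 changes per 100 sites. Real residue frequencies and mutabilities are unequal, which this sketch ignores, so the figure differs from the one obtained with the actual PAM model.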

e.g. The PAM250 matrix
This is appropriate for searching for alignments of sequences that have diverged by 250 PAMs, that is, 250 accepted mutations per 100 amino acids of sequence. Because of back mutations and silent mutations this corresponds to sequences that are about 20 percent identical.

C 12
G -3   5
P -3  -1   6
S  0   1   1   1
A -2   1   1   1   2
T -2   0   0   1   1   3
D -5   1  -1   0   0   0   4
E -5   0  -1   0   0   0   3   4
N -4   0  -1   1   0   0   2   1   2
Q -5  -1   0  -1   0  -1   2   2   1   4
H -3  -2   0  -1  -1  -1   1   1   2   3   6
K -5  -2  -1   0  -1   0   0   0   1   1   0   5
R -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y  0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
   C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W

	  
In this example isoleucine (I) is likely to be substituted by valine (V), and the pair gets a score of 4. Isoleucine (I) is unlikely to be substituted by cysteine (C), and the pair gets a score of -2.
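
To make the lookup concrete, the sketch below stores the PAM250 scores for just three residues (C, V and I), copied from the table above, and sums them over a short ungapped alignment. The data structure and function names are hypothetical; a real search program would hold the full matrix and also handle gaps.

```python
# Scores copied from the PAM250 table above for three residues only.
PAM250_SUBSET = {
    ("C", "C"): 12,
    ("V", "C"): -2, ("V", "V"): 4,
    ("I", "C"): -2, ("I", "V"): 4, ("I", "I"): 5,
}

def pair_score(a, b, matrix=PAM250_SUBSET):
    """Look up a symmetric score stored as one triangle of the matrix."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]

def ungapped_score(seq1, seq2, matrix=PAM250_SUBSET):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))

print(pair_score("I", "V"))          # 4
print(pair_score("C", "I"))          # -2
print(ungapped_score("CVI", "CIV"))  # 12 + 4 + 4 = 20
```

Because the matrix is symmetric, only one triangle needs to be stored, which is why the table lists each pair once.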

BLOSUM (Blocks Substitution Matrix)

The BLOSUM matrices, also used for protein database search scoring (BLOSUM62 is the default in blastp), are numbered by the percentage identity at which the sequence blocks used to build them were clustered, which is in a way reminiscent of PAM distances. For example, BLOSUM62 is roughly comparable to PAM160, and BLOSUM80 to PAM120. BLOSUM matrices are most sensitive for local alignment of related sequences, which makes them well suited to identifying an unknown sequence (a nucleotide query has to be translated first, since these matrices score amino acids).

e.g. The BLOSUM45 matrix

This is derived from sequence blocks clustered at the 45% identity level.


G  7
P -2   9
D -1  -1   7
E -2   0   2   6
N  0  -2   2   0   6
H -2  -2   0   0   1  10
Q -2  -1   0   2   0   1   6
K -2  -1   0   1   0  -1   1   5
R -2  -2  -1   0   0   0   1   3   7
S  0  -1   0   0   1  -1   0  -1  -1   4
T -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A  0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
   G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C



	  
In this example isoleucine (I) is likely to be substituted by valine (V), and the pair gets a score of 3. Isoleucine (I) is unlikely to be substituted by cysteine (C), and the pair gets a score of -3.
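
The phrase "clustered at the 45% identity level" can be sketched as follows: within a block of aligned, ungapped segments, any segment that is at least 45% identical to a member of a cluster joins that cluster, so that closely related sequences do not dominate the substitution counts. This is a minimal illustration under that reading, not the original BLOSUM software; the function names and the example block are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of identical positions in two equal-length, ungapped segments."""
    if len(a) != len(b):
        raise ValueError("segments must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)

def cluster_block(segments, threshold=45.0):
    """Single-linkage clustering: a segment joins (and merges) every cluster
    that contains at least one member it matches at >= threshold percent."""
    clusters = []
    for seg in segments:
        linked, others = [], []
        for cluster in clusters:
            if any(percent_identity(seg, m) >= threshold for m in cluster):
                linked.append(cluster)
            else:
                others.append(cluster)
        merged = [seg] + [m for cluster in linked for m in cluster]
        clusters = others + [merged]
    return clusters

# Hypothetical block of four aligned segments (made up for illustration).
block = ["AVILG", "AVILS", "AVMMS", "WWYFC"]
for cluster in cluster_block(block):
    print(cluster)
```

With this block the first three segments end up in one cluster and the fourth stays on its own. Lowering the threshold merges more segments, which is why a lower BLOSUM number corresponds to a matrix suited to more divergent sequences.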

Summary

These two families of matrices both generally perform well, but give slightly different results. The BLOSUM matrices have often been the better performers, reflecting the fact that they are based on the replacement patterns found in the more highly conserved regions of sequences. This seems to be an advantage, as these conserved regions are the ones discovered in database searches, and they serve as anchor points in alignments involving complete sequences.

Replacements that occur in highly conserved regions are expected to be more restricted than those that occur in highly variable regions. This is supported by the different patterns of positive and negative scores in the two families, which reflect different estimates of what constitutes a conservative or non-conservative substitution in protein evolution and which stem from the different ways the two families were constructed. Some of the difference is also likely to arise because the BLOSUM matrices are based on much more data than the PAM matrices.

The PAM matrices still perform quite well despite the small amount of data underlying them. The most likely reasons are the care taken in constructing the alignments and phylogenetic trees used to count replacements, and the fact that they are based on a simple model of evolution. As a result they still perform better than some more modern matrices that were less carefully constructed.

GONNET Matrix

A different method of measuring differences among amino acids was developed by Gonnet, Cohen and Benner (1992), using exhaustive pairwise alignments of the protein databases as they existed at that time. They first aligned the proteins using classical distance measures and used these alignments to estimate a new distance matrix. That matrix was then used to refine the alignments, a new matrix was estimated from the refined alignments, and so on iteratively. They noted that the distance matrices (all first normalised to 250 PAMs) differed depending on whether they were derived from distantly or closely homologous proteins. They suggest that for initial comparisons their resulting matrix should be used in preference to a PAM250 matrix, and that subsequent refinements should be done using a PAM matrix appropriate to the distance between the proteins.



      A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y
    0.6  0.125 -0.075      0 -0.575  0.125   -0.2   -0.2   -0.1   -0.3 -0.175 -0.075  0.075  -0.05  -0.15  0.275   0.15  0.025   -0.9  -0.55  A
         2.875   -0.8  -0.75   -0.2   -0.5 -0.325 -0.275   -0.7 -0.375 -0.225  -0.45 -0.775   -0.6  -0.55  0.025 -0.125      0  -0.25 -0.125  C
                1.175  0.675 -1.125  0.025    0.1  -0.95  0.125     -1  -0.75   0.55 -0.175  0.225 -0.075  0.125      0 -0.725   -1.3   -0.7  D
                         0.9 -0.975   -0.2    0.1 -0.675    0.3   -0.7   -0.5  0.225 -0.125  0.425    0.1   0.05 -0.025 -0.475 -1.075 -0.675  E
                               1.75   -1.3 -0.025   0.25 -0.825    0.5    0.4 -0.775  -0.95  -0.65   -0.8   -0.7  -0.55  0.025    0.9  1.275  F

                                      1.65  -0.35 -1.125 -0.275   -1.1 -0.875    0.1   -0.4  -0.25  -0.25    0.1 -0.275 -0.825     -1     -1  G
                                              1.5  -0.55   0.15 -0.475 -0.325    0.3 -0.275    0.3   0.15  -0.05 -0.075   -0.5   -0.2   0.55  H
                                                       1 -0.525    0.7  0.625   -0.7  -0.65 -0.475   -0.6  -0.45  -0.15  0.775  -0.45 -0.175  I
                                                            0.8 -0.525  -0.35    0.2  -0.15  0.375  0.675  0.025  0.025 -0.425 -0.875 -0.525  K
                                                                     1    0.7  -0.75 -0.575   -0.4  -0.55 -0.525 -0.325   0.45 -0.175      0  L

                                                                        1.075  -0.55   -0.6  -0.25 -0.425  -0.35  -0.15    0.4  -0.25  -0.05  M
                                                                                0.95 -0.225  0.175  0.075  0.225  0.125  -0.55   -0.9  -0.35  N
                                                                                        1.9  -0.05 -0.225    0.1  0.025  -0.45  -1.25 -0.775  P
                                                                                             0.675  0.375   0.05      0 -0.375 -0.675 -0.425  Q
                                                                                                    1.175  -0.05  -0.05   -0.5   -0.4  -0.45  R

                                                                                                            0.55  0.375  -0.25 -0.825 -0.475  S
                                                                                                                  0.625      0 -0.875 -0.475  T
                                                                                                                          0.85  -0.65 -0.275  V
                                                                                                                                 3.55  1.025  W
                                                                                                                                        1.95  Y



	  

DNA Identity Matrix (Unitary Matrix)

Here you only get a positive score for a match, and a score of -10000 for a mismatch. Because such a high penalty is given for a mismatch, no substitution should be allowed, although a gap may be permitted.


        A        T        G        C
A       1
T  -10000        1
G  -10000   -10000        1
C  -10000   -10000   -10000        1
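
The effect of such a matrix can be seen by scoring two short ungapped alignments. The sketch below builds the table from the two values shown above; the function names and the example sequences are hypothetical.

```python
MATCH = 1          # diagonal value shown in the table
MISMATCH = -10000  # off-diagonal value shown in the table

def identity_matrix(alphabet="ATGC"):
    """Build the full symmetric scoring table as a dict of dicts."""
    return {a: {b: (MATCH if a == b else MISMATCH) for b in alphabet}
            for a in alphabet}

def ungapped_score(seq1, seq2, matrix):
    """Sum the pair scores of two equal-length sequences aligned without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[a][b] for a, b in zip(seq1, seq2))

dna = identity_matrix()
print(ungapped_score("GATTACA", "GATTACA", dna))  # 7
print(ungapped_score("GATTACA", "GATCACA", dna))  # 6 - 10000 = -9994
```

A single mismatch outweighs any realistic run of matches, so an aligner using this matrix will in practice only report exact matches, possibly interrupted by gaps.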

PUPY Matrix

The PUPY matrix rewards both purine-purine and pyrimidine-pyrimidine transitions. This matrix or one like it may be useful in developing better PCR primers or in finding candidate noncoding RNA genes.
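
A matrix of this kind can be sketched as a function that distinguishes matches, transitions (purine to purine or pyrimidine to pyrimidine) and transversions. The score values below are hypothetical placeholders chosen for illustration, since this page does not list the actual PUPY values; the function name is also an assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def pupy_like_score(a, b, match=2, transition=1, transversion=-1):
    """Score a nucleotide pair; the default values are hypothetical."""
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"unexpected base: {base!r}")
    if a == b:
        return match
    # A transition keeps the base within its chemical class.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion

print(pupy_like_score("A", "G"))  # purine-purine transition: 1
print(pupy_like_score("C", "T"))  # pyrimidine-pyrimidine transition: 1
print(pupy_like_score("A", "C"))  # transversion: -1
print(pupy_like_score("T", "T"))  # match: 2
```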



References

Altschul, S.F. (1991)
Amino acid substitution matrices from an information theoretic perspective.
J. Mol. Biol. 219, 555-565.

Altschul, S.F., Boguski, M.S., Gish, W. and Wootton, J.C. (1994)
Issues in searching molecular sequence databases.
Nature Genetics 6, 119-129.

Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978)
A model of evolutionary change in proteins.
In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345-352.

Gonnet, G.H., Cohen, M.A. and Benner, S.A. (1992)
Exhaustive matching of the entire protein sequence database.
Science 256(5062), 1443-1445.

Henikoff, S. and Henikoff, J.G. (1992)
Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89, 10915-10919.