Help - About Matrices

Introduction

It is assumed that the sequences being sought share an evolutionary ancestor with the query sequence. The best guess at the actual path of evolution is the path that requires the fewest evolutionary events. Not all substitutions are equally likely, so they should be weighted accordingly. Insertions and deletions are less likely than substitutions and should also be weighted to reflect this. The choice of search algorithm influences the sensitivity and selectivity of the search, and the choice of similarity matrix determines both the pattern and the extent of substitutions in the sequences the database search is most likely to discover.

There have been extensive studies of the frequencies with which amino acids substituted for each other during evolution. The studies involved carefully aligning all of the proteins in several protein families and then constructing a phylogenetic tree for each family. Each tree can then be examined for the substitutions found on each branch, and these counts are used to produce tables (scoring matrices) of the relative frequencies with which amino acids replace each other over a short evolutionary period. A substitution matrix thus describes the likelihood that two residue types would mutate into each other over evolutionary time.
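The step from replacement frequencies to matrix scores is commonly written as a log-odds ratio, the formulation discussed by Altschul (1991). The symbols below are notation introduced here for illustration and do not appear elsewhere on this page: q_ij is the frequency with which residues i and j are observed aligned in related sequences, p_i and p_j are the background frequencies of the two residues, and lambda is a positive scaling constant.

```latex
s_{ij} = \frac{1}{\lambda} \, \ln \frac{q_{ij}}{p_i \, p_j}
```

Under this reading, a positive score means the pair is seen aligned more often than expected by chance, and a negative score means it is seen less often.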

A substitution is more likely to occur between amino acids with similar biochemical properties. For example, the hydrophobic amino acids isoleucine (I) and valine (V) get a positive score in the matrices, reflecting the likelihood that one will substitute for the other. In contrast, isoleucine has a negative score with cysteine (C), because this substitution is far less likely to occur in a protein. Matrices are thus used to estimate how well two residues of given types would match if they were aligned in a sequence alignment.
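As a rough illustration of how such scores are used, the sketch below sums pair scores over an ungapped alignment. The score table is a hypothetical three-residue excerpt that reuses the PAM250 values quoted on this page; the function names are made up for this example.

```python
# Hypothetical excerpt: only pairs among C, I and V, using the PAM250
# values shown on this page. Keys are alphabetically sorted residue pairs.
SCORES = {
    ("C", "C"): 12,
    ("C", "I"): -2,
    ("C", "V"): -2,
    ("I", "I"): 5,
    ("I", "V"): 4,
    ("V", "V"): 4,
}


def pair_score(a, b):
    """Score for aligning residue a with residue b (order does not matter)."""
    return SCORES[tuple(sorted((a, b)))]


def alignment_score(seq1, seq2):
    """Sum the pair scores of two equal-length, ungapped aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))
```

For example, `alignment_score("IVC", "VVC")` adds the I/V, V/V and C/C scores, while replacing a conserved residue with a dissimilar one lowers the total.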

Guidelines for using matrices

Protein Query Length	Matrix	Open Gap	Extend Gap
>300	BLOSUM50	-10	-2
85-300	BLOSUM62	-7	-1
50-85	BLOSUM80	-16	-4
>300	PAM250	-10	-2
85-300	PAM120	-16	-4
35-85	MDM40	-12	-2
<=35	MDM20	-22	-4
<=10	MDM10	-23	-4
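The guideline table can be turned into a simple lookup. The sketch below covers only the BLOSUM rows and makes two assumptions that the table does not state: a query of exactly 85 residues is assigned to the 85-300 row, and the gap-open penalty is charged for the first gap position with the gap-extend penalty charged for each further position. The names `SearchParams`, `blosum_params` and `gap_score` are hypothetical.

```python
from typing import NamedTuple, Optional


class SearchParams(NamedTuple):
    matrix: str
    gap_open: int
    gap_extend: int


def blosum_params(query_length: int) -> Optional[SearchParams]:
    """Return the BLOSUM row of the guideline table for a query length."""
    if query_length > 300:
        return SearchParams("BLOSUM50", -10, -2)
    if query_length >= 85:  # assumption: 85 falls in the 85-300 row
        return SearchParams("BLOSUM62", -7, -1)
    if query_length >= 50:
        return SearchParams("BLOSUM80", -16, -4)
    return None  # the table lists no BLOSUM matrix for shorter queries


def gap_score(params: SearchParams, gap_length: int) -> int:
    """Affine gap score under the assumed convention described above."""
    if gap_length <= 0:
        return 0
    return params.gap_open + params.gap_extend * (gap_length - 1)
```

Search programs differ in how they define the open and extend penalties, so the convention used by the program at hand should be checked before reusing these numbers.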

Importance of scoring matrices

Scoring matrices appear in all analyses involving sequence comparison.
The choice of matrix can strongly influence the outcome of the analysis.
Scoring matrices implicitly represent a particular theory of evolution.
Understanding the theory underlying a given scoring matrix can aid in making the proper choice.

Types of matrices

Differences between PAM and BLOSUM

PAM matrices are based on an explicit evolutionary model (that is, replacements are counted on the branches of a phylogenetic tree), whereas the Blosum matrices are based on an implicit rather than explicit model of evolution.
The sequence variability in the alignments used to count replacements differs. The PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions. The Blosum matrices are based only on highly conserved regions in series of alignments that are not allowed to contain gaps.
The method used to count replacements differs. Unlike the PAM procedure, the Blosum procedure uses groups of sequences within which not all mutations are counted the same.

Equivalent PAM and BLOSUM matrices

The following matrices are roughly equivalent...

PAM100 ==> Blosum90
PAM120 ==> Blosum80
PAM160 ==> Blosum60
PAM200 ==> Blosum52
PAM250 ==> Blosum45

Generally speaking...

The Blosum matrices are best for detecting local alignments.
The Blosum62 matrix is the best for detecting the majority of weak protein similarities.
The Blosum45 matrix is the best for detecting long and weak alignments.

PAM (Point Accepted Mutation) matrix

Amino acid scoring matrices are traditionally PAM (Point Accepted Mutation) matrices, which offer varying degrees of sensitivity depending on the evolutionary distance between sequence pairs. For example, PAM40 is most sensitive for sequences 40 PAMs apart. PAM250 is intended for more distantly related sequences and is considered a good general matrix for protein database searching. For nucleotide sequence searching a simpler approach is used, such as converting a PAM40 matrix into match/mismatch values that take into consideration that a purine may be replaced by a purine and a pyrimidine by a pyrimidine.

e.g. The PAM 250 matrix
This is appropriate for searching for alignments of sequences that have diverged by 250 PAMs, that is, 250 mutations per 100 amino acids of sequence. Because of back mutations and silent mutations, this corresponds to sequences that are about 20 percent identical.

C 12
G -3   5
P -3  -1   6
S  0   1   1   1
A -2   1   1   1   2
T -2   0   0   1   1   3
D -5   1  -1   0   0   0   4
E -5   0  -1   0   0   0   3   4
N -4   0  -1   1   0   0   2   1   2
Q -5  -1   0  -1   0  -1   2   2   1   4
H -3  -2   0  -1  -1  -1   1   1   2   3   6
K -5  -2  -1   0  -1   0   0   0   1   1   0   5
R -4  -3   0   0  -2  -1  -1  -1   0   1   2   3   6
V -2  -1  -1  -1   0   0  -2  -2  -2  -2  -2  -2  -2   4
M -5  -3  -2  -2  -1  -1  -3  -2   0  -1  -2   0   0   2   6
I -2  -3  -2  -1  -1   0  -2  -2  -2  -2  -2  -2  -2   4   2   5
L -6  -4  -3  -3  -2  -2  -4  -3  -3  -2  -2  -3  -3   2   4   2   6
F -4  -5  -5  -3  -4  -3  -6  -5  -4  -5  -2  -5  -4  -1   0   1   2   9
Y  0  -5  -5  -3  -3  -3  -4  -4  -2  -4   0  -4  -5  -2  -2  -1  -1   7  10
W -8  -7  -6  -2  -6  -5  -7  -7  -4  -5  -3  -3   2  -6  -4  -5  -2   0   0  17
   C   G   P   S   A   T   D   E   N   Q   H   K   R   V   M   I   L   F   Y   W

In this example isoleucine (I) is likely to be substituted by valine (V) and gets a score of 4. Isoleucine (I) is unlikely to be substituted by cysteine (C) and gets a score of -2.
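The triangular layout used for the matrix above (one row label per line, column labels on the last line) can be read into a symmetric lookup table. This is a minimal sketch, assuming whitespace-separated integer scores; the function name `parse_lower_triangle` is hypothetical, and `EXCERPT` contains only the C, V and I entries copied from the matrix above.

```python
def parse_lower_triangle(text):
    """Parse a lower-triangular matrix whose last line holds column labels."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    labels = lines[-1].split()
    scores = {}
    for row_line in lines[:-1]:
        fields = row_line.split()
        row = fields[0]
        values = [int(v) for v in fields[1:]]
        # The k-th value on a row belongs to the k-th column label.
        for col, value in zip(labels, values):
            scores[(row, col)] = value
            scores[(col, row)] = value  # the matrix is symmetric
    return scores


# Excerpt of the PAM250 matrix above, restricted to C, V and I.
EXCERPT = """
C 12
V -2   4
I -2   4   5
   C   V   I
"""

pam250_excerpt = parse_lower_triangle(EXCERPT)
```

With the excerpt loaded, `pam250_excerpt[("I", "V")]` and `pam250_excerpt[("I", "C")]` return the two scores discussed in the text.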

BLOSUM (Blocks Substitution Matrix)

The BLOSUM (Blocks Substitution Matrix) matrices are also used for scoring protein database searches (they are the default in blastp). They are divided into degrees of statistical significance which, in a way, are reminiscent of PAM distances. For example, BLOSUM80 is roughly equivalent to PAM120. BLOSUM matrices are most sensitive for local alignment of related sequences, and are therefore well suited to identifying an unknown sequence by database search.

e.g. Blosum 45 Matrix

This is derived from sequence blocks clustered at the 45% identity level.



G  7
P -2   9
D -1  -1   7
E -2   0   2   6
N  0  -2   2   0   6
H -2  -2   0   0   1  10
Q -2  -1   0   2   0   1   6
K -2  -1   0   1   0  -1   1   5
R -2  -2  -1   0   0   0   1   3   7
S  0  -1   0   0   1  -1   0  -1  -1   4
T -2  -1  -1  -1   0  -2  -1  -1  -1   2   5
A  0  -1  -2  -1  -1  -2  -1  -1  -2   1   0   5
M -2  -2  -3  -2  -2   0   0  -1  -1  -2  -1  -1   6
V -3  -3  -3  -3  -3  -3  -3  -2  -2  -1   0   0   1   5
I -4  -2  -4  -3  -2  -3  -2  -3  -3  -2  -1  -1   2   3   5
L -3  -3  -3  -2  -3  -2  -2  -3  -2  -3  -1  -1   2   1   2   5
F -3  -3  -4  -3  -2  -2  -4  -3  -2  -2  -1  -2   0   0   0   1   8
Y -3  -3  -2  -2  -2   2  -1  -1  -1  -2  -1  -2   0  -1   0   0   3   8
W -2  -3  -4  -3  -4  -3  -2  -2  -2  -4  -3  -2  -2  -3  -2  -2   1   3  15
C -3  -4  -3  -3  -2  -3  -3  -3  -3  -1  -1  -1  -2  -1  -3  -2  -2  -3  -5  12
   G   P   D   E   N   H   Q   K   R   S   T   A   M   V   I   L   F   Y   W   C

In this example isoleucine (I) is likely to be substituted by valine (V) and gets a score of 3. Isoleucine (I) is unlikely to be substituted by cysteine (C) and gets a score of -3.

Summary

Both families of matrices generally perform well, but they give slightly different results. The Blosum matrices have often been the better performers, reflecting the fact that they are based on the replacement patterns found in more highly conserved regions of the sequences. This seems to be an advantage because these conserved regions are the ones discovered in database searches, and they serve as anchor points in alignments involving complete sequences. Replacements in highly conserved regions are expected to be more restricted than those in highly variable regions, and this is supported by the different patterns of positive and negative scores in the two families of matrices. These patterns reflect different estimates of what constitutes a conservative or non-conservative substitution in the evolution of proteins, which in turn reflect the different ways in which the two families were constructed. Some of the difference is also likely to arise because the Blosum matrices are based on much more data than the PAM matrices. The PAM matrices still perform quite well despite the small amount of data underlying them. The most likely reasons are the care taken in constructing the alignments and phylogenetic trees used to count replacements, and the fact that they are based on a simple model of evolution; as a result they still perform better than some more modern matrices that were constructed less carefully.

GONNET Matrix

A different method to measure differences among amino acids was developed by Gonnet, Cohen and Benner (1992) using exhaustive pairwise alignments of the protein databases as they existed at that time. They used classical distance measures to produce an initial alignment of the proteins, used the aligned data to estimate a new distance matrix, and then repeated the cycle iteratively, refining the alignment and re-estimating the matrix each time. They noted that the distance matrices (all first normalised to 250 PAMs) differed depending on whether they were derived from distantly or closely homologous proteins. They suggest that for initial comparisons their resulting matrix should be used in preference to a PAM250 matrix, and that subsequent refinements should be done using a PAM matrix appropriate to the distance between the proteins.



      A      C      D      E      F      G      H      I      K      L      M      N      P      Q      R      S      T      V      W      Y
    0.6  0.125 -0.075      0 -0.575  0.125   -0.2   -0.2   -0.1   -0.3 -0.175 -0.075  0.075  -0.05  -0.15  0.275   0.15  0.025   -0.9  -0.55  A
         2.875   -0.8  -0.75   -0.2   -0.5 -0.325 -0.275   -0.7 -0.375 -0.225  -0.45 -0.775   -0.6  -0.55  0.025 -0.125      0  -0.25 -0.125  C
                1.175  0.675 -1.125  0.025    0.1  -0.95  0.125     -1  -0.75   0.55 -0.175  0.225 -0.075  0.125      0 -0.725   -1.3   -0.7  D
                         0.9 -0.975   -0.2    0.1 -0.675    0.3   -0.7   -0.5  0.225 -0.125  0.425    0.1   0.05 -0.025 -0.475 -1.075 -0.675  E
                               1.75   -1.3 -0.025   0.25 -0.825    0.5    0.4 -0.775  -0.95  -0.65   -0.8   -0.7  -0.55  0.025    0.9  1.275  F
                                      1.65  -0.35 -1.125 -0.275   -1.1 -0.875    0.1   -0.4  -0.25  -0.25    0.1 -0.275 -0.825     -1     -1  G
                                              1.5  -0.55   0.15 -0.475 -0.325    0.3 -0.275    0.3   0.15  -0.05 -0.075   -0.5   -0.2   0.55  H
                                                       1 -0.525    0.7  0.625   -0.7  -0.65 -0.475   -0.6  -0.45  -0.15  0.775  -0.45 -0.175  I
                                                            0.8 -0.525  -0.35    0.2  -0.15  0.375  0.675  0.025  0.025 -0.425 -0.875 -0.525  K
                                                                     1    0.7  -0.75 -0.575   -0.4  -0.55 -0.525 -0.325   0.45 -0.175      0  L
                                                                        1.075  -0.55   -0.6  -0.25 -0.425  -0.35  -0.15    0.4  -0.25  -0.05  M
                                                                                0.95 -0.225  0.175  0.075  0.225  0.125  -0.55   -0.9  -0.35  N
                                                                                        1.9  -0.05 -0.225    0.1  0.025  -0.45  -1.25 -0.775  P
                                                                                             0.675  0.375   0.05      0 -0.375 -0.675 -0.425  Q
                                                                                                    1.175  -0.05  -0.05   -0.5   -0.4  -0.45  R
                                                                                                            0.55  0.375  -0.25 -0.825 -0.475  S
                                                                                                                  0.625      0 -0.875 -0.475  T
                                                                                                                          0.85  -0.65 -0.275  V
                                                                                                                                 3.55  1.025  W
                                                                                                                                        1.95  Y

DNA Identity Matrix (Unitary Matrix)

Here you get a positive score only for a match, and a score of -10000 for a mismatch. Because such a high penalty is given for a mismatch, substitutions are effectively not allowed, although a gap may be permitted.

        A       T       G       C
A       1
T  -10000       1
G  -10000  -10000       1
C  -10000  -10000  -10000       1
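The identity matrix above is simple enough to generate rather than store. The following sketch builds it as a dictionary and scores an ungapped comparison; the function names and the dictionary representation are choices made for this example only.

```python
def identity_matrix(alphabet="ATGC", match=1, mismatch=-10000):
    """Build the DNA identity (unitary) matrix as a dict of pair scores."""
    return {
        (a, b): (match if a == b else mismatch)
        for a in alphabet
        for b in alphabet
    }


def ungapped_score(matrix, seq1, seq2):
    """Sum the pair scores of two equal-length sequences, position by position."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


dna_identity = identity_matrix()
```

A single mismatch drives the total far below zero, which is why an aligner using this matrix will prefer to open a gap (where allowed) or end the alignment rather than accept a substitution.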

PUPY Matrix

The PUPY matrix rewards both purine-purine and pyrimidine-pyrimidine transitions. This matrix or one like it may be useful in developing better PCR primers or in finding candidate noncoding RNA genes.
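A matrix of this kind can be sketched as follows. The actual PUPY score values are not given on this page, so the numbers used here (2 for a match, 1 for a transition, -3 for a transversion) are placeholders chosen purely for illustration, and `pupy_like_matrix` is a hypothetical name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def pupy_like_matrix(match=2, transition=1, transversion=-3):
    """Build a nucleotide matrix that rewards purine-purine and
    pyrimidine-pyrimidine replacements (placeholder score values)."""
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                matrix[(a, b)] = match
            elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
                matrix[(a, b)] = transition  # A<->G or C<->T
            else:
                matrix[(a, b)] = transversion  # purine <-> pyrimidine
    return matrix


pupy_like = pupy_like_matrix()
```

Compared with the identity matrix, the only structural change is the middle branch, which gives A/G and C/T pairs a positive score instead of the mismatch penalty.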

References

Altschul, S.F. (1991)
Amino acid substitution matrices from an information theoretic perspective.
J. Mol. Biol. 219: 555-565.

Altschul, S.F., Boguski, M.S., Gish, W. and Wootton, J.C. (1994)
Issues in searching molecular sequence databases.
Nature Genetics 6: 119-129.

Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978)
A model of evolutionary change in proteins.
In "Atlas of Protein Sequence and Structure" 5(3), M.O. Dayhoff (ed.), 345-352.

Gonnet, G.H., Cohen, M.A. and Benner, S.A. (1992)
Exhaustive matching of the entire protein sequence database.
Science 256(5062): 1443-1445.

Henikoff, S. and Henikoff, J.G. (1992)
Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89: 10915-10919.
