Protein Database Searches Using Compositionally Adjusted Substitution Matrices

Stephen Altschul

John Wootton

E Gertz

Richa Agarwala

Aleksandr Morgulis

Alejandro Schäffer

Yi Yu

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894

Corresponding author: Stephen F. Altschul; altschul@ncbi.nlm.nih.gov; Tel: (301) 435-7803; Fax: (301) 480-2288

Abstract

Almost all protein database search methods use amino acid substitution matrices for scoring, optimizing, and assessing the statistical significance of sequence alignments. Much care and effort has therefore gone into constructing substitution matrices, and the quality of search results can depend strongly upon the choice of the proper matrix. A long-standing problem has been the comparison of sequences with biased amino acid compositions, for which standard substitution matrices are not optimal. To address this problem, we have recently developed a general procedure for transforming a standard matrix into one appropriate for the comparison of two sequences with arbitrary, and possibly differing, compositions. Such adjusted matrices yield, on average, improved alignments and alignment scores when applied to the comparison of proteins with markedly biased compositions.

Here we review the application of compositionally adjusted matrices and consider whether they may also be applied fruitfully to general-purpose protein sequence database searches, in which related sequence pairs do not necessarily have strong compositional biases. Although it is not advisable to apply compositional adjustment indiscriminately, we describe several simple criteria under which invoking such adjustment is on average beneficial. In a typical database search, at least one of these criteria is satisfied by over half the related sequence pairs. Compositional substitution matrix adjustment is now available in NCBI's protein-protein version of BLAST.

Keywords: substitution matrices, compositional adjustment, protein database searches, BLAST, BLOSUM
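
As a rough illustration of what adjusting a matrix to new compositions can mean, the sketch below works with a toy three-letter alphabet. It treats a substitution matrix as log-odds scores derived from a table of target frequencies, then rescales that table by iterative proportional fitting so that its row and column sums match two new, differing compositions. Iterative proportional fitting gives the table closest in relative entropy to the original among those with the requested marginals. The alphabet, the numerical values, and the function names are all assumptions made for illustration, and the procedure reviewed here may impose further constraints that this sketch does not model.

```python
import math

def marginals(q):
    """Row and column sums of a joint target-frequency table."""
    n = len(q)
    rows = [sum(q[i]) for i in range(n)]
    cols = [sum(q[i][j] for i in range(n)) for j in range(n)]
    return rows, cols

def log_odds(q):
    """Scores s_ij = ln(q_ij / (p_i * p'_j)), in nats, using the marginals of q."""
    rows, cols = marginals(q)
    n = len(q)
    return [[math.log(q[i][j] / (rows[i] * cols[j])) for j in range(n)]
            for i in range(n)]

def adjust_targets(q, p_rows, p_cols, iters=500):
    """Iterative proportional fitting: alternately rescale rows and columns
    of q until its marginals match p_rows and p_cols."""
    n = len(q)
    Q = [row[:] for row in q]
    for _ in range(iters):
        for i in range(n):
            r = sum(Q[i])
            Q[i] = [x * p_rows[i] / r for x in Q[i]]
        for j in range(n):
            c = sum(Q[i][j] for i in range(n))
            for i in range(n):
                Q[i][j] *= p_cols[j] / c
    return Q

# Hypothetical "standard" target frequencies; marginals are (0.4, 0.3, 0.3).
q = [[0.30, 0.05, 0.05],
     [0.05, 0.20, 0.05],
     [0.05, 0.05, 0.20]]

# Hypothetical biased, differing compositions for the two sequences.
p = [0.6, 0.2, 0.2]
p2 = [0.5, 0.3, 0.2]

Q = adjust_targets(q, p, p2)   # adjusted target frequencies
S = log_odds(Q)                # adjusted (generally non-symmetric) scores

# Expected score for unrelated residues drawn from the two compositions.
expected = sum(p[i] * p2[j] * S[i][j] for i in range(3) for j in range(3))
```

Because the two compositions differ, the adjusted score table is in general not symmetric, and the expected score for unrelated residue pairs stays negative, as local alignment statistics require.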
