MUSCLE: multiple sequence alignment with high accuracy and high throughput
We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
The average Q score for each method over all PREFAB alignments (All), and the total CPU time in seconds are given. The remaining columns show average Q scores on subsets in which the structure pairs fall within the given pairwise identity ranges. Note that T-Coffee required 10 CPU days to complete the test, compared with <5 h for MUSCLE and ∼30 min for MUSCLE-p.
Average Q and TC scores for each method on BAliBASE are shown, together with the total CPU time in seconds. Align-m aborted on two alignments; average scores on the remainder were Q = 0.852 and TC = 0.670, requiring 2202 s.
The average Q score for each method on each BAliBASE subset is shown. Ref1 is the largest subset with 81 test sets, comprising almost 60% of the database. Other subsets are smaller. For example, Ref4 and Ref5 have 12 alignments each, and there are large variances in the individual scores from which the averages are computed. In our opinion, it is not possible to draw meaningful conclusions about the relative performance of different methods on these subsets.
The average TC score for each method on each BAliBASE subset is shown.
The average APDB score of each method on the PREFAB reference alignments is given. There is no statistically significant difference between the four best methods. The top four are significantly better than FFTNS1 (MUSCLE-p > FFTNS1 with P = 0.009), and FFTNS1 is significantly better than CLUSTALW (P = 3 × 10).
All gives the average Q score over all SABmark alignments, Superfamily and Twilight are average Q scores on the two subsets. These are computed first by averaging Q for each pair in a single multiple alignment, then averaging over multiple alignments. This corrects for the lack of independence between pairs in a given multiple alignment. Align-m aborted in nine cases; quoted averages for this program are for completed alignments. Selected P-values are: MUSCLE > T-Coffee P = 0.14, MUSCLE > MUSCLE-p P = 4×10, MUSCLE > NWNSI P = 6 × 10, MUSCLE-p > NWNSI P = 0.03, T-Coffee > MUSCLE-p P = 0.1, T-Coffee > Align-m P < 10.
The average Q and TC accuracy scores over the 267 reference alignments in SMART that have no more than 100 sequences are given. The last column is the P-value of the difference between the method in a row and the method in the next row, measured on the Q score. The P-value for MUSCLE > T-Coffee is 0.0004 on Q and 0.01 on TC; the P-value for NSI > T-Coffee is 0.19 on Q and 0.0002 on TC. The difference between MUSCLE and NWNSI is only weakly significant on the Q score (P = 0.07) and is not significant on the TC score (P = 0.3).
Each entry in the table contains the P-value assigned by a Friedman rank test to the difference between a pair of methods. The upper-right corner of the matrix is obtained from Q scores on BAliBASE, the lower-left corner from Q scores on PREFAB. If the method to the left is ranked higher than the method above, the P-value is preceded by +. If the method to the left is ranked lower, the P-value is preceded by –. If the P-value is >0.05, the difference is not considered significant and is shown in parentheses. So, for example, MUSCLE ranks higher than T-Coffee on PREFAB with P = 0.0002 and MUSCLE-p higher than CLUSTALW on BAliBASE with P = 0.02.
- 1. Wang L. and Jiang,T. (1994) On the complexity of multiple sequence alignment. J. Comput. Biol., 1, 337–348. [
- 2. Waterman M.S., Smith,T.F. and Beyer,W.A. (1976) Some biological sequence metrics. Adv. Math., 20, 367–387.
- 3. Hogeweg P. and Hesper,B. (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J. Mol. Evol., 20, 175–186. [
- 4. Feng D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360. [
- 5. Notredame C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205–217. [
- 6. Notredame C(2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics, 3, 131–144. [
- 7. Gotoh O(1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838. [
- 8. Katoh K., Misawa,K., Kuma,K. and Miyata,T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res., 30, 3059–3066.
- 9. Edgar R.C(2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Res., 32, 380–385.
- 10. Kimura M(1983) The Neutral Theory of Molecular Evolution. Cambridge University Press.
- 11. Sneath P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. Freeman, San Francisco.
- 12. Saitou N. and Nei,M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425. [
- 13. Edgar R.C. and Sjolander,K. (2004) A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, DOI: 10.1093/bioinformatics/bth090. [
- 14. Sjolander K., Karplus,K., Brown,M., Hughey,R., Krogh,A., Mian,I.S. and Haussler,D. (1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. CABIOS, 12, 327–345. [
- 15. von Ohsen N. and Zimmer,R. (2001) Improving profile–profile alignment via log average scoring. In Gascuel,O. and Moret,B.M.E. (eds), Algorithms in Bioinformatics, First International Workshop, WABI 2001. Springer-Verlag, Berlin, Germany, pp. 11–26.
- 16. Muller T., Spang,R. and Vingron,M. (2002) Estimating amino acid substitution models: a comparison of Dayhoff’s estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol., 19, 8–13. [
- 17. Brudno M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sidow,A. and Batzoglou,S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res., 13, 721–731.
- 18. Hirosawa M., Totoki,Y., Hoshida,M. and Ishikawa,M. (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. CABIOS, 11, 13–18. [
- 19. Thompson J.D., Plewniak,F. and Poch,O. (1999a) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88. [
- 20. Bahr A., Thompson,J.D., Thierry,J.C. and Poch,O. (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res., 29, 323–326.
- 21. Van Walle I., Lasters,I. and Wyns,L. (2004) Align-m—a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, DOI: 10.1093/bioinformatics/bth116. [
- 22. Schultz J., Milpetz,F., Bork,P. and Ponting,C.P. (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl Acad. Sci. USA, 95, 5857–5864.
- 23. Schultz J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231–234.
- 24. Ponting C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res., 27, 229–332.
- 25. Thompson J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
- 26. Brenner S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res., 28, 254–256.
- 27. Murzin A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540. [
- 28. Boutonnet N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and Wodak,S.J. (1995) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., 8, 647–662. [
- 29. Shindyalov I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747. [
- 30. Fischer D., Barret,C., Bryson,K., Elofsson,A., Godzik,A., Jones,D., Karplus,K.J., Kelley,L.A., MacCallum,R.M., Pawowski,K., Rost,B., Rychlewski,L. and Sternberg,M. (1999) CAFASP-1: critical assessment of fully automated structure prediction methods. Proteins, Suppl. 3, 209–217. [
- 31. Eidhammer I., Jonassen,I. and Taylor,W.R. (2000) Structure comparison and structure patterns. J. Comput. Biol., 7, 685–716. [
- 32. Sauder J.M., Arthur,J.Wand Dunbrack,R.L.,Jr (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6–22. [
- 33. Leplae R. and Hubbard,T.J. (2002) MaxBench: evaluation of sequence and structure comparison methods. Bioinformatics, 18, 494–495. [
- 34. Sadreyev R. and Grishin,N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336. [
- 35. Edgar R.C. and Sjolander,K. (2004) COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics, DOI: 10.1093/bioinformatics/bth091. [
- 36. Holm L. and Sander,C. (1998) Touring protein fold space with Dali/FSSP. Nucleic Acids Res., 26, 316–319.
- 37. Altschul S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
- 38. Schaffer A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res., 29, 2994–3005.
- 39. Pruitt K.D., Tatusova,T. and Maglott,D.R. (2003) NCBI Reference Sequence project: update and current status. Nucleic Acids Res., 31, 34–37.
- 40. Thompson J.D., Plewniak,F. and Poch,O. (1999b) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27, 2682–2690.
- 41. O’Sullivan O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and Notredame,C. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19 Suppl. 1, I215–I221. [
- 42. Friedman M(1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc., 32, 675–701.
- 43. Stoye J., Evers,D. and Meyer,F. (1998) Rose: generating sequence families. Bioinformatics, 14, 157–163. [
- 44. Sjolander K(1998) Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, 165–174. [