Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Journal: 1997/October - Nucleic Acids Research
ISSN: 0305-1048
PUBMED: 9254694
Abstract:
The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
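As a rough illustration of what a position-specific score matrix is, the sketch below builds one from a small ungapped set of aligned sequences and slides it along a target sequence. It is not the PSI-BLAST construction itself: it assumes uniform background residue frequencies, a flat pseudocount, unweighted sequences and ungapped scoring, and the function names `build_pssm` and `best_ungapped_hit` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, background=None, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score in bits.

    Assumptions (not from the paper): uniform background frequencies by default,
    a flat pseudocount per residue, and equal weight for every sequence.
    """
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    pssm = []
    for col in range(length):
        # Count observed residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in aligned if seq[col] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        column = {}
        for residue, bg_freq in background.items():
            freq = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(freq / bg_freq)
        pssm.append(column)
    return pssm


def best_ungapped_hit(pssm, sequence):
    """Slide the matrix along `sequence`; return (score, offset) of the best window.

    Returns None when the sequence is shorter than the matrix.
    Residues absent from the matrix score 0.
    """
    width = len(pssm)
    best = None
    for offset in range(len(sequence) - width + 1):
        score = sum(
            pssm[i].get(sequence[offset + i], 0.0) for i in range(width)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best


if __name__ == "__main__":
    matrix = build_pssm(["ACDE", "ACDE", "ACDF"])
    print(best_ungapped_hit(matrix, "WWACDEWW"))
```

In an iterated search, a matrix of this kind would be rebuilt from the significant alignments found in each round and used as the query for the next; the weighting and pseudocount schemes the authors actually use are described in the full text.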
Nucleic Acids Res 25(17): 3389-3402
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. altschul@ncbi.nlm.nih.gov