Protein sequence similarity searches using patterns as seeds.
Abstract
Protein families are often characterized by conserved sequence patterns or motifs. A researcher frequently wishes to evaluate the significance of a specific pattern within a protein, or to exploit knowledge of known motifs to aid the recognition of greatly diverged but homologous family members. To assist in these efforts, the pattern-hit initiated BLAST (PHI-BLAST) program described here takes as input both a protein sequence and a pattern of interest that it contains. PHI-BLAST searches a protein database for other instances of the input pattern, and uses those found as seeds for the construction of local alignments to the query sequence. The random distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many instances, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HSP90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.
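The seeding step described in the abstract can be illustrated with a minimal sketch. It is not the PHI-BLAST implementation: it assumes the pattern is written in a simplified PROSITE-like syntax, translates it to a regular expression, and reports where the pattern occurs in each database sequence. The function names, the toy pattern, and the toy sequences are hypothetical, and the extension of seeds into local alignments and the score statistics are left out.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex.

    Supported elements (an assumption of this sketch): a single residue,
    x (any residue), [ABC] (one of), {ABC} (none of), each optionally
    followed by a repeat count such as (2) or (2,4), joined by '-'.
    """
    parts = []
    for element in pattern.split("-"):
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        # A single letter or a [ABC] class is already valid regex syntax.
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_seeds(pattern: str, database: dict) -> list:
    """Return (sequence name, start, end) for each pattern occurrence.

    A lookahead is used so that occurrences starting at different
    positions are all reported, even when they overlap. Only one
    occurrence per start position is returned.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    seeds = []
    for name, sequence in database.items():
        for m in regex.finditer(sequence):
            seeds.append((name, m.start(1), m.end(1)))
    return seeds


# Toy data, invented for illustration only.
toy_pattern = "[LIVM]-x(2)-G-x(2,4)-C"
toy_database = {"seqA": "MKLAAGTTCW", "seqB": "MKPPPPPP"}
seeds = find_seeds(toy_pattern, toy_database)
```

In a full search, each reported position would anchor an alignment between the query and the database sequence around the matched pattern; here the positions are simply collected in a list.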