Publications by Robert Edgar

Publication

MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Journal: Nucleic Acids Research

July/5/2004

Abstract

We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC390337/bin/gkh340f1.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC390337/bin/gkh340f2.jpg

Authors

Robert C Edgar

Publication

Search and clustering orders of magnitude faster than BLAST.

Journal: Bioinformatics

February/15/2011

Abstract

BACKGROUND

Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification.

RESULTS

UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.
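The abstract does not describe how clusters are formed. As a rough illustration only, the sketch below shows a generic greedy, centroid-based scheme: each sequence joins the first existing centroid it matches at or above an identity threshold, and otherwise founds a new cluster. The function names, the ungapped identity measure for equal-length sequences and the linear scan over centroids are assumptions made for this example, not the UCLUST implementation, which relies on USEARCH for the search step.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions; a toy measure for equal-length, ungapped sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first centroid meeting the threshold, else start a new cluster."""
    centroids = []  # representative sequence of each cluster
    clusters = []   # members of each cluster, parallel to centroids
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                clusters[i].append(s)
                break
        else:
            # No centroid was similar enough: this sequence becomes a new centroid.
            centroids.append(s)
            clusters.append([s])
    return centroids, clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGTAC"]
    centroids, clusters = greedy_cluster(reads, threshold=0.9)
    for c, members in zip(centroids, clusters):
        print(c, len(members))
```

Because the result of a greedy scheme like this depends on input order, sorting the input (for example by length or abundance) before clustering is a common design choice.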

AVAILABILITY

Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch.

Authors

Robert C Edgar

Publication

UCHIME improves sensitivity and speed of chimera detection.

Journal: Bioinformatics

February/8/2012

Abstract

BACKGROUND

Chimeric DNA sequences often form during polymerase chain reaction amplification, especially when sequencing single regions (e.g. 16S rRNA or fungal Internal Transcribed Spacer) to assess diversity or compare populations. Undetected chimeras may be misinterpreted as novel species, causing inflated estimates of diversity and spurious inferences of differences between populations. Detection and removal of chimeras is therefore of critical importance in such experiments.

RESULTS

We describe UCHIME, a new program that detects chimeric sequences with two or more segments. UCHIME either uses a database of chimera-free sequences or detects chimeras de novo by exploiting abundance data. UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences. In testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus. UCHIME is >100× faster than Perseus and >1000× faster than ChimeraSlayer.

CONTACT

robert@drive5.com

AVAILABILITY

Source, binaries and data: http://drive5.com/uchime.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150044/bin/btr381f1.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150044/bin/btr381f2.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150044/bin/btr381f3.jpg

Authors

Robert C Edgar; Brian J Haas; Jose C Clemente; Christopher Quince; Rob Knight

Publication

MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Journal: BMC Bioinformatics

October/28/2004

Abstract

BACKGROUND

In a previous paper, we introduced MUSCLE, a new program for creating multiple alignments of protein sequences, giving a brief summary of the algorithm and showing MUSCLE to achieve the highest scores reported to date on four alignment accuracy benchmarks. Here we present a more complete discussion of the algorithm, describing several previously unpublished techniques that improve biological accuracy and/or computational complexity. We introduce a new option, MUSCLE-fast, designed for high-throughput applications. We also describe a new protocol for evaluating objective functions that align two profiles.

RESULTS

We compare the speed and accuracy of MUSCLE with CLUSTALW, Progressive POA and the MAFFT script FFTNS1, the fastest previously published program known to the author. Accuracy is measured using four benchmarks: BAliBASE, PREFAB, SABmark and SMART. We test three variants that offer highest accuracy (MUSCLE with default settings), highest speed (MUSCLE-fast), and a carefully chosen compromise between the two (MUSCLE-prog). We find MUSCLE-fast to be the fastest algorithm on all test sets, achieving average alignment accuracy similar to CLUSTALW in times that are typically two to three orders of magnitude less. MUSCLE-fast is able to align 1,000 sequences of average length 282 in 21 seconds on a current desktop computer.

CONCLUSIONS

MUSCLE offers a range of options that provide improved speed and/or alignment accuracy compared with currently available programs. MUSCLE is freely available at http://www.drive5.com/muscle.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC517706/bin/1471-2105-5-113-1.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC517706/bin/1471-2105-5-113-2.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC517706/bin/1471-2105-5-113-3.jpg

Authors

Robert C Edgar

Publication

UPARSE: highly accurate OTU sequences from microbial amplicon reads.

Journal: Nature Methods

December/10/2013

Abstract

Amplified marker-gene sequences can be used to understand microbial community structure, but they suffer from a high level of sequencing and amplification artifacts. The UPARSE pipeline reports operational taxonomic unit (OTU) sequences with ≤1% incorrect bases in artificial microbial community tests, compared with >3% incorrect bases commonly reported by other methods. The improved accuracy results in far fewer OTUs, consistently closer to the expected number of species in a community.

Authors

Robert C Edgar

Publication

Defining the core Arabidopsis thaliana root microbiome.

Journal: Nature

September/3/2012

Abstract

Land plants associate with a root microbiota distinct from the complex microbial community present in surrounding soil. The microbiota colonizing the rhizosphere (immediately surrounding the root) and the endophytic compartment (within the root) contribute to plant growth, productivity, carbon sequestration and phytoremediation. Colonization of the root occurs despite a sophisticated plant immune system, suggesting finely tuned discrimination of mutualists and commensals from pathogens. Genetic principles governing the derivation of host-specific endophyte communities from soil communities are poorly understood. Here we report the pyrosequencing of the bacterial 16S ribosomal RNA gene of more than 600 Arabidopsis thaliana plants to test the hypotheses that the root rhizosphere and endophytic compartment microbiota of plants grown under controlled conditions in natural soils are sufficiently dependent on the host to remain consistent across different soil types and developmental stages, and sufficiently dependent on host genotype to vary between inbred Arabidopsis accessions. We describe different bacterial communities in two geochemically distinct bulk soils and in rhizosphere and endophytic compartments prepared from roots grown in these soils. The communities in each compartment are strongly influenced by soil type. Endophytic compartments from both soils feature overlapping, low-complexity communities that are markedly enriched in Actinobacteria and specific families from other phyla, notably Proteobacteria. Some bacteria vary quantitatively between plants of different developmental stage and genotype. Our rigorous definition of an endophytic compartment microbiome should facilitate controlled dissection of plant-microbe interactions derived from complex soil communities.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4074413/bin/nihms-598750-f0001.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4074413/bin/nihms-598750-f0002.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4074413/bin/nihms-598750-f0003.jpg

Authors

Derek S Lundberg; Sarah L Lebeis; Sur Herrera Paredes; Scott Yourstone; Jase Gehring; Stephanie Malfatti+10 authors

Publication

The genome sequence of taurine cattle: a window to ruminant biology and evolution.

Journal: Science

May/10/2009

Abstract

To understand the biology and evolution of ruminants, the cattle genome was sequenced to about sevenfold coverage. The cattle genome contains a minimum of 22,000 genes, with a core set of 14,345 orthologs shared among seven mammalian species of which 1217 are absent or undetected in noneutherian (marsupial or monotreme) genomes. Cattle-specific evolutionary breakpoint regions in chromosomes have a higher density of segmental duplications, enrichment of repetitive elements, and species-specific variations in genes associated with lactation and immune responsiveness. Genes involved in metabolism are generally highly conserved, although five metabolic genes are deleted or extensively diverged from their human orthologs. The cattle genome sequence thus provides a resource for understanding mammalian evolution and accelerating livestock genetic improvement for milk and meat production.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943200/bin/nihms231121f2.jpg

Authors

Bovine Genome Sequencing and Analysis Consortium; Christine G Elsik; Ross L Tellam; Kim C Worley; Richard A Gibbs; Donna M Muzny+302 authors

Publication

PILER: identification and classification of genomic repeats.

Journal: Bioinformatics

June/21/2006

Abstract

SUMMARY

Repeated elements such as satellites and transposons are ubiquitous in eukaryotic genomes. De novo computational identification and classification of such elements is a challenging problem. Therefore, repeat annotation of sequenced genomes has historically largely relied on sequence similarity to hand-curated libraries of known repeat families. We present a new approach to de novo repeat annotation that exploits characteristic patterns of local alignments induced by certain classes of repeats. We describe PILER, a package of efficient search algorithms for identifying such patterns. Novel repeats found using PILER are reported for Homo sapiens, Arabidopsis thaliana and Drosophila melanogaster.

AVAILABILITY

The PILER software is freely available at http://www.drive5.com/piler.

Authors

Robert C Edgar; Eugene W Myers

Publication

Error filtering, pair assembly and error correction for next-generation sequencing reads.

Journal: Bioinformatics

May/29/2016

Abstract

BACKGROUND

Next-generation sequencing produces vast amounts of data with errors that are difficult to distinguish from true biological variation when coverage is low.

RESULTS

We demonstrate large reductions in error frequencies, especially for high-error-rate reads, by three independent means: (i) filtering reads according to their expected number of errors, (ii) assembling overlapping read pairs and (iii) for amplicon reads, by exploiting unique sequence abundances to perform error correction. We also show that most published paired read assemblers calculate incorrect posterior quality scores.
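The first of these steps can be illustrated with the standard Phred relation, in which a quality score Q corresponds to an error probability P = 10^(-Q/10) and the expected number of errors in a read is the sum of P over its bases. The sketch below is a minimal illustration under stated assumptions: the function names, the ASCII offset of 33 and the threshold of 1.0 are chosen for this example and are not taken from the abstract.

```python
def expected_errors(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities implied by Phred scores, P = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_filter(quality: str, max_ee: float = 1.0) -> bool:
    """Keep a read only if its expected number of errors does not exceed max_ee (assumed threshold)."""
    return expected_errors(quality) <= max_ee


if __name__ == "__main__":
    # Quality strings in FASTQ encoding (offset 33 assumed): 'I' is Q40, '5' is Q20, '!' is Q0.
    for qual in ["IIIIIIII", "55555555", "!!!!"]:
        print(qual, round(expected_errors(qual), 4), passes_filter(qual))
```

Summing probabilities, rather than averaging Q scores, means that a few very low-quality bases dominate the total, which is what makes the measure useful for filtering.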

AVAILABILITY

These methods are implemented in the USEARCH package. Binaries are freely available at http://drive5.com/usearch.

CONTACT

robert@drive5.com

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Authors

Robert C Edgar; Henrik Flyvbjerg

Publication

PILER-CR: fast and accurate identification of CRISPR repeats.

Journal: BMC Bioinformatics

March/5/2007

Abstract

BACKGROUND

Sequencing of prokaryotic genomes has recently revealed the presence of CRISPR elements: short, highly conserved repeats separated by unique sequences of similar length. The distinctive sequence signature of CRISPR repeats can be found using general-purpose repeat- or pattern-finding software tools. However, the output of such tools is not always ideal for studying these repeats, and significant effort is sometimes needed to build additional tools and perform manual analysis of the output.

RESULTS

We present PILER-CR, a program specifically designed for the identification and analysis of CRISPR repeats. The program executes rapidly, completing a 5 Mb genome in around 5 seconds on a current desktop computer. We validate the algorithm by manual curation and by comparison with published surveys of these repeats, finding that PILER-CR has both high sensitivity and high specificity. We also present a catalogue of putative CRISPR repeats identified in a comprehensive analysis of 346 prokaryotic genomes.

CONCLUSIONS

PILER-CR is a useful tool for rapid identification and classification of CRISPR repeats. The software is donated to the public domain. Source code and a Linux binary are freely available at http://www.drive5.com/pilercr.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1790904/bin/1471-2105-8-18-1.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1790904/bin/1471-2105-8-18-2.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1790904/bin/1471-2105-8-18-3.jpg

Authors

Robert C Edgar

Publication

Multiple sequence alignment.

Journal: Current Opinion in Structural Biology

August/17/2006

Abstract

Multiple sequence alignments are an essential tool for protein structure and function prediction, phylogeny inference and other common tasks in sequence analysis. Recently developed systems have advanced the state of the art with respect to accuracy, ability to scale to thousands of proteins and flexibility in comparing proteins that do not share the same domain architecture. New multiple alignment benchmark databases include PREFAB, SABMARK, OXBENCH and IRMBASE. Although CLUSTALW is still the most popular alignment tool to date, recent methods offer significantly better alignment quality and, in some cases, reduced computational cost.

Authors

Robert C Edgar; Serafim Batzoglou

Publication

Updating the 97% identity threshold for 16S ribosomal RNA OTUs.

Journal: Bioinformatics

July/10/2018

Abstract

BACKGROUND

The 16S ribosomal RNA (rRNA) gene is widely used to survey microbial communities. Sequences are often clustered into Operational Taxonomic Units (OTUs) as proxies for species. The canonical clustering threshold is 97% identity, which was proposed in 1994 when few 16S rRNA sequences were available, motivating a reassessment on current data.

RESULTS

Using a large set of high-quality 16S rRNA sequences from finished genomes, I assessed the correspondence of OTUs to species for five representative clustering algorithms using four accuracy metrics. All algorithms had comparable accuracy when tuned to a given metric. Optimal identity thresholds were ∼99% for full-length sequences and ∼100% for the V4 hypervariable region.

AVAILABILITY

Reference sequences and source code are provided in the Supplementary Material.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Authors

Robert C Edgar

Publication

A comparison of scoring functions for protein sequence profile alignment.

Journal: Bioinformatics

September/20/2004

Abstract

BACKGROUND

In recent years, several methods have been proposed for aligning two protein sequence profiles, with reported improvements in alignment accuracy and homolog discrimination versus sequence-sequence methods (e.g. BLAST) and profile-sequence methods (e.g. PSI-BLAST). Profile-profile alignment is also the iterated step in progressive multiple sequence alignment algorithms such as CLUSTALW. However, little is known about the relative performance of different profile-profile scoring functions. In this work, we evaluate the alignment accuracy of 23 different profile-profile scoring functions by comparing alignments of 488 pairs of sequences with identity ≤30% against structural alignments. We optimize parameters for all scoring functions on the same training set and use profiles of alignments from both PSI-BLAST and SAM-T99. Structural alignments are constructed from a consensus between the FSSP database and CE structural aligner. We compare the results with sequence-sequence and sequence-profile methods, including BLAST and PSI-BLAST.

RESULTS

We find that profile-profile alignment gives an average improvement over our test set of typically 2-3% over profile-sequence alignment and approximately 40% over sequence-sequence alignment. No statistically significant difference is seen in the relative performance of most of the scoring functions tested. Significantly better results are obtained with profiles constructed from SAM-T99 alignments than from PSI-BLAST alignments.

AVAILABILITY

Source code, reference alignments and more detailed results are freely available at http://phylogenomics.berkeley.edu/profilealignment/

Authors

Robert C Edgar; Kimmen Sjölander

Publication

Improved repeat identification and masking in Dipterans.

Journal: Gene

March/22/2007

Abstract

Repetitive sequences are a major constituent of many eukaryote genomes and play roles in gene regulation, chromosome inheritance, nuclear architecture, and genome stability. The identification of repetitive elements has traditionally relied on in-depth, manual curation and computational determination of close relatives based on DNA identity. However, the rapid divergence of repetitive sequence has made identification of repeats by DNA identity difficult even in closely related species. Hence, the presence of unidentified repeats in genome sequences affects the quality of gene annotations and annotation-dependent analyses (e.g. microarray analyses). We have developed an enhanced repeat identification pipeline using two approaches. First, the de novo repeat finding program PILER-DF was used to identify interspersed repetitive elements in several recently finished Dipteran genomes. Repeats were classified, when possible, according to their similarity to known elements described in Repbase and GenBank, and also screened against annotated genes as one means of eliminating false positives. Second, we used a new program called RepeatRunner, which integrates results from both RepeatMasker nucleotide searches and protein searches using BLASTX. Using RepeatRunner with PILER-DF predictions, we masked repeats in thirteen Dipteran genomes and conclude that combining PILER-DF and RepeatRunner greatly enhances repeat identification in both well-characterized and un-annotated genomes.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1945102/bin/nihms-17562-f0001.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1945102/bin/nihms-17562-f0002.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1945102/bin/nihms-17562-f0003.jpg

Authors

Christopher D Smith; Robert C Edgar; Mark D Yandell; Douglas R Smith; Susan E Celniker; Eugene W Myers; Gary H Karpen

Publication

Quality measures for protein alignment benchmarks.

Journal: Nucleic Acids Research

May/9/2010

Abstract

Multiple protein sequence alignment methods are central to many applications in molecular biology. These methods are typically assessed on benchmark datasets including BALIBASE, OXBENCH, PREFAB and SABMARK, which are important to biologists in making informed choices between programs. In this article, annotations of domain homology and secondary structure are used to define new measures of alignment quality and are used to make the first systematic, independent evaluation of these benchmarks. These measures indicate sensitivity and specificity while avoiding the ambiguous residue correspondences and arbitrary distance cutoffs inherent to structural superpositions. Alignments by selected methods that indicate high-confidence columns (ALIGN-M, DIALIGN-T, FSA and MUSCLE) are also assessed. Fold space coverage and effective benchmark database sizes are estimated by reference to domain annotations, and significant redundancy is found in all benchmarks except SABMARK. Questionable alignments are found in all benchmarks, especially in BALIBASE where 87% of sequences have unknown structure, 20% of columns contain different folds according to SUPERFAMILY and 30% of 'core block' columns have conflicting secondary structure according to DSSP. A careful analysis of current protein multiple alignment benchmarks calls into question their ability to determine reliable algorithm rankings.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2853116/bin/gkp1196f1.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2853116/bin/gkp1196f2.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2853116/bin/gkp1196f3.jpg

Authors

Robert C Edgar

Publication

SATCHMO: sequence alignment and tree construction using hidden Markov models.

Journal: Bioinformatics

April/19/2004

Abstract

BACKGROUND

Aligning multiple proteins based on sequence information alone is challenging if sequence identity is low or there is a significant degree of structural divergence. We present a novel algorithm (SATCHMO) that is designed to address this challenge. SATCHMO simultaneously constructs a tree and a set of multiple sequence alignments, one for each internal node of the tree. The alignment at a given node contains all sequences within its sub-tree, and predicts which positions in those sequences are alignable and which are not. Aligned regions therefore typically get shorter on a path from a leaf to the root as sequences diverge in structure. Current methods either regard all positions as alignable (e.g. ClustalW), or align only those positions believed to be homologous across all sequences (e.g. profile HMM methods); by contrast SATCHMO makes different predictions of alignable regions in different subgroups. SATCHMO generates profile hidden Markov models at each node; these are used to determine branching order, to align sequences and to predict structurally alignable regions.

RESULTS

In experiments on the BAliBASE benchmark alignment database, SATCHMO is shown to perform comparably to ClustalW and the UCSC SAM HMM software. Results using SATCHMO to identify protein domains are demonstrated on potassium channels, with implications for the mechanism by which tumor necrosis factor alpha affects potassium current.

AVAILABILITY

The software is available for download from http://www.drive5.com/lobster/index.htm

Authors

Robert C Edgar; Kimmen Sjölander

Publication

Stability of operational taxonomic units: an important but neglected property for analyzing microbial diversity.

Journal: Microbiome

May/20/2015

Abstract

BACKGROUND

The operational taxonomic unit (OTU) is widely used in microbial ecology. Reproducibility in microbial ecology research depends on the reliability of OTU-based 16S ribosomal subunit RNA (rRNA) analyses.

RESULTS

Here, we report that many hierarchical and greedy clustering methods produce unstable OTUs, with membership that depends on the number of sequences clustered. If OTUs are regenerated with additional sequences or samples, sequences originally assigned to a given OTU can be split into different OTUs. Alternatively, sequences assigned to different OTUs can be merged into a single OTU. This OTU instability affects alpha-diversity analyses such as rarefaction curves, beta-diversity analyses such as distance-based ordination (for example, Principal Coordinate Analysis (PCoA)), and the identification of differentially represented OTUs. Our results show that the proportion of unstable OTUs varies for different clustering methods. We found that the closed-reference method is the only one that produces completely stable OTUs, with the caveat that sequences that do not match a pre-existing reference sequence collection are discarded.

CONCLUSIONS

As a compromise to the factors listed above, we propose using an open-reference method to enhance OTU stability. This type of method clusters sequences against a database and includes unmatched sequences by clustering them via a relatively stable de novo clustering method. OTU stability is an important consideration when analyzing microbial diversity and is a feature that should be taken into account during the development of novel OTU clustering methods.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4438525/bin/40168_2015_81_Fig1_HTML.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4438525/bin/40168_2015_81_Fig2_HTML.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4438525/bin/40168_2015_81_Fig3_HTML.jpg

Authors

Yan He; J Gregory Caporaso; Xiao-Tao Jiang; Hua-Fang Sheng; Susan M Huse; Jai Ram Rideout+5 authors

Publication

A Comprehensive, Automatically Updated Fungal ITS Sequence Dataset for Reference-Based Chimera Control in Environmental Sequencing Efforts.

Journal: Microbes and Environments

March/13/2016

Abstract

The nuclear ribosomal internal transcribed spacer (ITS) region is the most commonly chosen genetic marker for the molecular identification of fungi in environmental sequencing and molecular ecology studies. Several analytical issues complicate such efforts, one of which is the formation of chimeric (artificially joined) DNA sequences during PCR amplification or sequence assembly. Several software tools are currently available for chimera detection, but rely to various degrees on the presence of a chimera-free reference dataset for optimal performance. However, no such dataset is available for use with the fungal ITS region. This study introduces a comprehensive, automatically updated reference dataset for fungal ITS sequences based on the UNITE database for the molecular identification of fungi. This dataset supports chimera detection throughout the fungal kingdom and for full-length ITS sequences as well as partial (ITS1 or ITS2 only) datasets. The performance of the dataset on a large set of artificial chimeras was above 99.5%, and we subsequently used the dataset to remove nearly 1,000 compromised fungal ITS sequences from public circulation. The dataset is available at http://unite.ut.ee/repository.php and is subject to web-based third-party curation.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4462924/bin/30_145_1.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4462924/bin/30_145_2.jpg

Authors

R Henrik Nilsson; Leho Tedersoo; Martin Ryberg; Erik Kristiansson; Martin Hartmann; Martin Unterseher+10 authors

Publication

COACH: profile-profile alignment of protein families using hidden Markov models.

Journal: Bioinformatics

September/20/2004

Abstract

BACKGROUND

Alignments of two multiple-sequence alignments, or statistical models of such alignments (profiles), have important applications in computational biology. The increased amount of information in a profile versus a single sequence can lead to more accurate alignments and more sensitive homolog detection in database searches. Several profile-profile alignment methods have been proposed and have been shown to improve sensitivity and alignment quality compared with sequence-sequence methods (such as BLAST) and profile-sequence methods (e.g. PSI-BLAST). Here we present a new approach to profile-profile alignment we call Comparison of Alignments by Constructing Hidden Markov Models (HMMs) (COACH). COACH aligns two multiple sequence alignments by constructing a profile HMM from one alignment and aligning the other to that HMM.

RESULTS

We compare the alignment accuracy of COACH with two recently published methods: Yona and Levitt's prof_sim and Sadreyev and Grishin's COMPASS. On two sets of reference alignments selected from the FSSP database, we find that COACH is able, on average, to produce alignments giving the best coverage or the fewest errors, depending on the chosen parameter settings.

AVAILABILITY

COACH is freely available from www.drive5.com/lobster

Authors

Robert C Edgar; Kimmen Sjölander

Publication

Local homology recognition and distance measures in linear time using compressed amino acid alphabets.

Journal: Nucleic Acids Research

February/10/2004

Abstract

Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method for achieving comparable coverage to FFT is revealed here, and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
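As a hedged illustration of the idea, the sketch below counts k-mers in two sequences with a hash table (linear in sequence length for fixed k) and turns the fraction of shared k-mers into a distance. The normalisation by the shorter sequence, the function names and the six-group amino acid alphabet are assumptions made for this example; they are not necessarily the definitions or alphabets used in the paper.

```python
from collections import Counter

# Hypothetical 6-letter grouping of the 20 amino acids, for illustration only.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq: str) -> str:
    """Map each residue to its group label; unknown letters become 'X'."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq.upper())


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every contiguous subsequence of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """1 minus the fraction of k-mers shared, normalised by the shorter sequence."""
    n = min(len(a), len(b)) - k + 1
    if n <= 0:
        return 1.0
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum(min(count, cb[kmer]) for kmer, count in ca.items())
    return 1.0 - shared / n


if __name__ == "__main__":
    s1, s2 = "AKDLLG", "VRELIG"
    print(kmer_distance(s1, s2, k=3))                      # raw 20-letter alphabet
    print(kmer_distance(compress(s1), compress(s2), k=3))  # compressed alphabet
```

In this toy example the two sequences share no exact 3-mers, but every substitution is within a group, so they become identical after compression.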

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC373290/bin/gkh180equ1.gif

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC373290/bin/gkh180equ2.gif

Authors

Robert C Edgar

Publication

Characterization and distribution of retrotransposons and simple sequence repeats in the bovine genome.

Journal: Proceedings of the National Academy of Sciences of the United States of America

August/25/2009

Abstract

Interspersed repeat composition and distribution in mammals have been best characterized in the human and mouse genomes. The bovine genome contains typical eutherian mammal repeats, but also has a significant number of long interspersed nuclear element RTE (BovB) elements proposed to have been horizontally transferred from squamata. Our analysis of the BovB repeats has indicated that only a few of them are currently likely to retrotranspose in cattle. However, bovine L1 repeats (L1 BT) have many likely active copies. Comparison of substitution rates for BovB and L1 BT indicates that L1 BT is a younger repeat family than BovB. In contrast to mouse and human, L1 occurrence is not negatively correlated with G+C content. However, BovB, Bov A2, ART2A, and Bov-tA are negatively correlated with G+C, although Bov-tA's correlation is weaker. Also, by performing genome wide correlation analysis of interspersed and simple sequence repeats, we have identified genome territories by repeat content that appear to define ancestral vs. ruminant-specific genomic regions. These ancestral regions, enriched with L2 and MIR repeats, are largely conserved between bovine and human.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2722308/bin/zpq9990989930001.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2722308/bin/zpq9990989930002.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2722308/bin/zpq9990989930003.jpg

Authors

David L Adelson; Joy M Raison; Robert C Edgar

Publication

Accuracy of taxonomy prediction for 16S rRNA and fungal ITS sequences.

Journal: PeerJ

November/13/2018

Abstract

Prediction of taxonomy for marker gene sequences such as 16S ribosomal RNA (rRNA) is a fundamental task in microbiology. Most experimentally observed sequences are diverged from reference sequences of authoritatively named organisms, creating a challenge for prediction methods. I assessed the accuracy of several algorithms using cross-validation by identity, a new benchmark strategy which explicitly models the variation in distances between query sequences and the closest entry in a reference database. When the accuracy of genus predictions was averaged over a representative range of identities with the reference database (100%, 99%, 97%, 95% and 90%), all tested methods had ≤50% accuracy on the currently-popular V4 region of 16S rRNA. Accuracy was found to fall rapidly with identity; for example, better methods were found to have V4 genus prediction accuracy of ∼100% at 100% identity but ∼50% at 97% identity. The relationship between identity and taxonomy was quantified as the probability that a rank is the lowest shared by a pair of sequences with a given pair-wise identity. With the V4 region, 95% identity was found to be a twilight zone where taxonomy is highly ambiguous because the probabilities that the lowest shared rank between pairs of sequences is genus, family, order or class are approximately equal.
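
The "lowest shared rank" quantity described above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the rank list, the function names, and the input format (tuples of identity plus two lineages listed from domain downwards) are all hypothetical.

```python
from collections import Counter, defaultdict

# Assumed rank order, from most general to most specific.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def lowest_shared_rank(lineage_a, lineage_b):
    """Return the most specific rank at which two lineages still agree.

    Returns None if the lineages differ already at the first rank.
    """
    shared = None
    for rank, a, b in zip(RANKS, lineage_a, lineage_b):
        if a != b:
            break
        shared = rank
    return shared


def rank_probabilities(pairs):
    """Estimate P(lowest shared rank | identity) from annotated pairs.

    pairs: iterable of (identity_percent, lineage_a, lineage_b).
    Returns {identity: {rank: fraction of pairs at that identity}}.
    """
    counts = defaultdict(Counter)
    for identity, lineage_a, lineage_b in pairs:
        counts[identity][lowest_shared_rank(lineage_a, lineage_b)] += 1
    return {
        identity: {rank: n / sum(c.values()) for rank, n in c.items()}
        for identity, c in counts.items()
    }
```

In this framing, a "twilight zone" identity is one where the resulting fractions for genus, family, order and class come out roughly equal, so that identity alone says little about rank.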

Authors

Robert C Edgar

Publication

Accuracy of microbial community diversity estimated by closed- and open-reference OTUs.

Journal: PeerJ

October/12/2017

Abstract

Next-generation sequencing of 16S ribosomal RNA is widely used to survey microbial communities. Sequences are typically assigned to Operational Taxonomic Units (OTUs). Closed- and open-reference OTU assignment matches reads to a reference database at 97% identity (closed), then clusters unmatched reads using a de novo method (open). Implementations of these methods in the QIIME package were tested on several mock community datasets with 20 strains using different sequencing technologies and primers. Richness (number of reported OTUs) was often greatly exaggerated, with hundreds or thousands of OTUs generated on Illumina datasets. Between-sample diversity was also found to be highly exaggerated in many cases, with weighted Jaccard distances between identical mock samples often close to one, indicating very low similarity. Non-overlapping hyper-variable regions in 70% of species were assigned to different OTUs. On mock communities with Illumina V4 reads, 56% to 88% of predicted genus names were false positives. Biological inferences obtained using these methods are therefore not reliable.
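
To make the between-sample diversity claim concrete, here is a minimal sketch of a weighted Jaccard distance on OTU abundance tables. It assumes the common definition (one minus the sum of per-OTU minima divided by the sum of per-OTU maxima); the exact variant computed by QIIME in the paper's tests may differ, and the function name and dictionary input format are illustrative.

```python
def weighted_jaccard_distance(sample_a, sample_b):
    """Weighted Jaccard distance between two {otu_id: abundance} tables.

    0.0 means identical abundance profiles; 1.0 means no OTU is shared.
    """
    otus = set(sample_a) | set(sample_b)
    # An OTU absent from a sample counts as abundance zero.
    overlap = sum(min(sample_a.get(o, 0), sample_b.get(o, 0)) for o in otus)
    total = sum(max(sample_a.get(o, 0), sample_b.get(o, 0)) for o in otus)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - overlap / total
```

Under this definition, two replicates of the same mock community should give a distance near zero. If each replicate instead acquires its own set of spurious OTUs, the denominator grows while the overlap does not, pushing the distance towards one.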

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5631090/bin/peerj-05-3889-g001.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5631090/bin/peerj-05-3889-g002.jpg

Authors

Robert C Edgar

Publication

Optimizing substitution matrix choice and gap parameters for sequence alignment.

Journal: BMC Bioinformatics

January/27/2010

Abstract

BACKGROUND

While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments.

RESULTS

POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB.

CONCLUSIONS

The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
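
The parameters being optimized here (a substitution matrix plus gap open and gap extension penalties) enter a pair-wise global alignment roughly as in the sketch below. This is a standard affine-gap dynamic program (Gotoh's algorithm), not the POP procedure itself. The `toy_score` function is a hypothetical stand-in for a real matrix such as VTML200, and the penalty convention (a gap of length n costs `gap_open + (n - 1) * gap_extend`) is an assumption.

```python
NEG_INF = float("-inf")


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2.0 if a == b else -1.0


def global_affine_score(x, y, score, gap_open, gap_extend):
    """Optimal global alignment score with affine gap penalties."""
    n, m = len(x), len(y)
    # M: x[i-1] aligned to y[j-1]; X: x[i-1] opposite a gap;
    # Y: y[j-1] opposite a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = score(x[i - 1], y[j - 1])
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Extending an existing gap is cheaper than opening a new one.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

A parameter optimization procedure would vary `score`, `gap_open` and `gap_extend` and measure how closely the resulting alignments match reference alignments. That outer loop is not shown here.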

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2791778/bin/1471-2105-10-396-1.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2791778/bin/1471-2105-10-396-2.jpg

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2791778/bin/1471-2105-10-396-3.jpg

Authors

Robert C Edgar