Alan Filipski
Publications (11)
Publication
Journal: Molecular Biology and Evolution
July/7/2014
Abstract
We announce the release of an advanced version of the Molecular Evolutionary Genetics Analysis (MEGA) software, which currently contains facilities for building sequence alignments, inferring phylogenetic histories, and conducting molecular evolutionary analysis. In version 6.0, MEGA now enables the inference of timetrees, as it implements the RelTime method for estimating divergence times for all branching points in a phylogeny. A new Timetree Wizard in MEGA6 facilitates this timetree inference by providing a graphical user interface (GUI) to specify the phylogeny and calibration constraints step-by-step. This version also contains enhanced algorithms to search for the optimal trees under evolutionary criteria and implements more advanced memory management that can double the size of sequence data sets to which MEGA can be applied. Both GUI and command-line versions of MEGA6 can be downloaded from www.megasoftware.net free of charge.
Publication
Journal: Nature
January/2/2008
Abstract
Comparative analysis of multiple genomes in a phylogenetic framework dramatically improves the precision and sensitivity of evolutionary inference, producing more robust results than single-genome analyses can provide. The genomes of 12 Drosophila species, ten of which are presented here for the first time (sechellia, simulans, yakuba, erecta, ananassae, persimilis, willistoni, mojavensis, virilis and grimshawi), illustrate how rates and patterns of sequence divergence across taxa can illuminate evolutionary processes on a genomic scale. These genome sequences augment the formidable genetic tools that have made Drosophila melanogaster a pre-eminent model for animal genetics, and will further catalyse fundamental research on mechanisms of development, cell biology, genetics, disease, neurobiology, behaviour, physiology and evolution. Despite remarkable similarities among these Drosophila species, we identified many putatively non-neutral changes in protein-coding genes, non-coding RNA genes, and cis-regulatory regions. These may prove to underlie differences in the ecology and behaviour of these diverse species.
Publication
Journal: Proceedings of the National Academy of Sciences of the United States of America
January/28/2013
Abstract
Molecular dating of species divergences has become an important means to add a temporal dimension to the Tree of Life. Increasingly larger datasets encompassing greater taxonomic diversity are becoming available to generate molecular timetrees by using sophisticated methods that model rate variation among lineages. However, the practical application of these methods is challenging because of the exorbitant calculation times required by current methods for contemporary data sizes, the difficulty in correctly modeling the rate heterogeneity in highly diverse taxonomic groups, and the lack of reliable clock calibrations and their uncertainty distributions for most groups of species. Here, we present a method that estimates relative times of divergences for all branching points (nodes) in very large phylogenetic trees without assuming a specific model for lineage rate variation or specifying any clock calibrations. The method (RelTime) performed better than existing methods when applied to very large computer-simulated datasets where evolutionary rates were varied extensively among lineages by following autocorrelated and uncorrelated models. On average, RelTime completed calculations 1,000 times faster than the fastest Bayesian method, with even greater speed differences for larger numbers of sequences. This speed and accuracy will enable molecular dating analysis of very large datasets. Relative time estimates will be useful for determining the relative ordering and spacing of speciation events, identifying lineages with significantly slower or faster evolutionary rates, diagnosing the effect of selected calibrations on absolute divergence times, and estimating absolute times of divergence when highly reliable calibration points are available.
Publication
Journal: Proceedings of the National Academy of Sciences of the United States of America
February/5/2006
Abstract
Molecular clocks have been used to date the divergence of humans and chimpanzees for nearly four decades. Nonetheless, this date and its confidence interval remain to be firmly established. In an effort to generate a genomic view of the human-chimpanzee divergence, we have analyzed 167 nuclear protein-coding genes and built a reliable confidence interval around the calculated time by applying a multifactor bootstrap-resampling approach. Bayesian and maximum likelihood analyses of neutral DNA substitutions show that the human-chimpanzee divergence is close to 20% of the ape-Old World monkey (OWM) divergence. Therefore, the generally accepted range of 23.8-35 million years ago for the ape-OWM divergence yields a range of 4.98-7.02 million years ago for human-chimpanzee divergence. Thus, the older time estimates for the human-chimpanzee divergence, from molecular and paleontological studies, are unlikely to be correct. For a given ape-OWM divergence time, the 95% confidence interval of the human-chimpanzee divergence ranges from -12% to 19% of the estimated time. Computer simulations suggest that the 95% confidence intervals obtained by using a multifactor bootstrap-resampling approach contain the true value with >95% probability, whether deviations from the molecular clock are random or correlated among lineages. Analyses revealed that the use of amino acid sequence differences is not optimal for dating human-chimpanzee divergence and that the inclusion of additional genes is unlikely to narrow the confidence interval significantly. We conclude that tests of hypotheses about the timing of human-chimpanzee divergence demand more precise fossil-based calibrations.
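The scaling argument in this abstract can be illustrated with a small arithmetic sketch. The function names are hypothetical, and the ratio is taken as exactly 0.20 purely for illustration; the abstract says only that the ratio is "close to 20%", so the output (about 4.76 to 7.0) approximates the published range rather than reproducing it.

```python
def scale_range(calibration_range, ratio):
    """Scale a (young, old) calibration range by a relative-divergence ratio."""
    low, high = calibration_range
    return (low * ratio, high * ratio)


def confidence_bounds(estimate, lower_pct=-12.0, upper_pct=19.0):
    """Bounds expressed as percentages of the estimate (-12% to +19% in the abstract)."""
    return (estimate * (1 + lower_pct / 100.0),
            estimate * (1 + upper_pct / 100.0))


# Ape-OWM calibration range from the abstract, in millions of years.
ape_owm_range = (23.8, 35.0)

# Assumption: a flat ratio of 0.20 stands in for "close to 20%".
human_chimp_range = scale_range(ape_owm_range, 0.20)

# Hypothetical point estimate of 6.0 million years, used only to show the interval shape.
example_interval = confidence_bounds(6.0)

print(human_chimp_range)
print(example_interval)
```

The asymmetric percentages mean the interval extends further toward older dates than toward younger ones for any given point estimate.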
Publication
Journal: Genome Research
April/3/2007
Abstract
DNA sequence alignment is a prerequisite to virtually all comparative genomic analyses, including the identification of conserved sequence motifs, estimation of evolutionary divergence between sequences, and inference of historical relationships among genes and species. While it is mere common sense that inaccuracies in multiple sequence alignments can have detrimental effects on downstream analyses, it is important to know the extent to which the inferences drawn from these alignments are robust to errors and biases inherent in all sequence alignments. A survey of investigations into strengths and weaknesses of sequence alignments reveals, as expected, that alignment quality is generally poor for two distantly related sequences and can often be improved by adding additional sequences as stepping stones between distantly related species. Errors in sequence alignment are also found to have a significant negative effect on subsequent inference of sequence divergence, phylogenetic trees, and conserved motifs. However, our understanding of alignment biases remains rudimentary, and sequence alignment procedures continue to be used somewhat like benign formatting operations to make sequences equal in length. Because of the central role these alignments now play in our endeavors to establish the tree of life and to identify important parts of genomes through evolutionary functional genomics, we see a need for increased community effort to investigate influences of alignment bias on the accuracy of large-scale comparative genomics.
Publication
Journal: Trends in Genetics
February/6/2012
Abstract
Modern technologies have made the sequencing of personal genomes routine. They have revealed thousands of nonsynonymous (amino acid altering) single nucleotide variants (nSNVs) of protein-coding DNA per genome. What do these variants foretell about an individual's predisposition to diseases? The experimental technologies required to carry out such evaluations at a genomic scale are not yet available. Fortunately, the process of natural selection has lent us an almost infinite set of tests in nature. During long-term evolution, new mutations and existing variations have been evaluated for their biological consequences in countless species, and outcomes are readily revealed by multispecies genome comparisons. We review studies that have investigated evolutionary characteristics and in silico functional diagnoses of nSNVs found in thousands of disease-associated genes. We conclude that the patterns of long-term evolutionary conservation and permissible sequence divergence are essential and instructive modalities for functional assessment of human genetic variations.
Publication
Journal: Molecular Biology and Evolution
September/1/2010
Abstract
The rapid expansion of sequence data and the development of statistical approaches that embrace varying evolutionary rates among lineages have encouraged many more investigators to use DNA and protein data to time species divergences. Here, we report results from a systematic evaluation, by means of computer simulation, of the performance of two frequently used relaxed-clock methods for estimating these times and their credibility intervals (CrIs). These relaxed-clock methods allow rates to vary in a phylogeny randomly over lineages (e.g., BEAST software) and in autocorrelated fashion (e.g., MultiDivTime software). We applied these methods for analyzing sequence data sets simulated using naturally derived parameters (evolutionary rates, sequence lengths, and base substitution patterns) and assuming that clock calibrations are known without error. We find that the estimated times are, on average, close to the true times as long as the assumed model of lineage rate changes matches the actual model. The 95% CrIs also contain the true time for ≥95% of the simulated data sets. However, the use of an incorrect lineage rate model reduces this frequency to 83%, indicating that the relaxed-clock methods are not robust to the violation of the underlying lineage rate model. Because these rate models are rarely known a priori and are difficult to detect empirically, we suggest building composite CrIs using CrIs produced from MultiDivTime and BEAST analysis. These composite CrIs are found to contain the true time for ≥97% of data sets. Our analyses also verify the usefulness of the common practice of interpreting the congruence of times inferred from different methods as a reflection of the accuracy of time estimates. Overall, our results show that simple strategies can be used to enhance our ability to estimate times and their CrIs when using the relaxed-clock methods.
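One way to picture a composite CrI is as the envelope of the intervals reported by the two methods. The abstract does not state how the composite is constructed, so the union-envelope rule below is an assumption, and the function names and interval values are hypothetical.

```python
def composite_interval(intervals):
    """Return the envelope (lowest lower bound, highest upper bound) of several intervals.

    Assumption: the composite CrI is the span covering every input CrI.
    """
    if not intervals:
        raise ValueError("at least one interval is required")
    lowers = [low for low, _ in intervals]
    uppers = [high for _, high in intervals]
    return (min(lowers), max(uppers))


def contains(interval, true_time):
    """True when true_time lies inside the closed interval."""
    low, high = interval
    return low <= true_time <= high


# Hypothetical 95% CrIs (millions of years) for one node from two analyses.
beast_cri = (80.0, 110.0)
multidivtime_cri = (95.0, 130.0)

composite = composite_interval([beast_cri, multidivtime_cri])
print(composite)                   # (80.0, 130.0)
print(contains(beast_cri, 120.0))  # False
print(contains(composite, 120.0))  # True
```

A wider interval trades precision for coverage, which is consistent with the higher coverage frequency the abstract reports for composite CrIs.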
Publication
Journal: Molecular Biology and Evolution
March/27/2016
Abstract
We present a procedure to test the effect of calibration priors on estimated times, which applies a recently developed calibration-free method (RelTime) that produces relative divergence times for all nodes in the tree. We illustrate this protocol by applying it to a timetree of metazoan diversification (Erwin DH, Laflamme M, Tweedt SM, Sperling EA, Pisani D, Peterson KJ. 2011. The Cambrian conundrum: early divergence and later ecological success in the early history of animals. Science 334:1091-1097.), which placed the divergence of animal phyla close to the time of the Cambrian explosion inferred from the fossil record. These analyses revealed that the two maximum-only calibration priors in the pre-Cambrian are the primary determinants of the young divergence times among animal phyla in this study. In fact, these two maximum-only calibrations produce divergence times that severely violate minimum boundaries of almost all of the other 22 calibration constraints. The use of these 22 calibrations produces dates for metazoan divergences that are hundreds of millions of years earlier in the Proterozoic. Our results encourage the use of calibration-free approaches to identify the most influential calibration constraints and to evaluate their impact in order to achieve biologically robust interpretations.
Publication
Journal: Molecular Biology and Evolution
March/6/2016
Abstract
Scientists are assembling sequence data sets from increasing numbers of species and genes to build comprehensive timetrees. However, data are often unavailable for some species and gene combinations, and the proportion of missing data is often large for data sets containing many genes and species. Surprisingly, there has not been a systematic analysis of the effect of the degree of sparseness of the species-gene matrix on the accuracy of divergence time estimates. Here, we present results from computer simulations and empirical data analyses to quantify the impact of missing gene data on divergence time estimation in large phylogenies. We found that estimates of divergence times were robust even when sequences from a majority of genes for most of the species were absent. From the analysis of such extremely sparse data sets, we found that the most egregious errors occurred for nodes in the tree that had no common genes for any pair of species in the immediate descendant clades of the node in question. These problematic nodes can be easily detected prior to computational analyses based only on the input sequence alignment and the tree topology. We conclude that it is best to use larger alignments, because adding both genes and species to the alignment augments the number of genes available for estimating divergence events deep in the tree and improves their time estimates.
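The check for problematic nodes described in this abstract can be sketched from gene-presence data alone. This is a minimal illustration under one reading of the abstract: a node is flagged when no species in one immediate descendant clade shares any gene with any species in the other. The data structures and names are hypothetical.

```python
from itertools import product


def shares_gene(genes_by_species, left_clade, right_clade):
    """True if some cross-clade species pair has at least one gene in common."""
    return any(
        genes_by_species[a] & genes_by_species[b]
        for a, b in product(left_clade, right_clade)
    )


def problematic_nodes(genes_by_species, nodes):
    """List nodes whose two descendant clades have no gene shared by any species pair.

    nodes maps a node name to (left_clade_species, right_clade_species).
    """
    return [
        name
        for name, (left, right) in nodes.items()
        if not shares_gene(genes_by_species, left, right)
    ]


# Hypothetical sparse species-gene matrix, stored as sets of sampled genes.
genes_by_species = {
    "A": {"g1"},
    "B": {"g1", "g2"},
    "C": {"g3"},
}

# Hypothetical topology ((A,B),C): each node lists its two descendant clades.
nodes = {
    "AB": (["A"], ["B"]),
    "ABC": (["A", "B"], ["C"]),
}

flagged = problematic_nodes(genes_by_species, nodes)
print(flagged)  # ['ABC']
```

Because the check needs only the alignment's gene coverage and the tree topology, it can run before any dating analysis.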
Publication
Journal: Genome
March/31/2003
Abstract
Based on published information, we have identified 991 genes and gene-family clusters for cattle and 764 for pigs that have orthologues in the human genome. The relative linear locations of these genes on human sequence maps were used as "rulers" to annotate bovine and porcine genomes based on a CSAM (contiguous sets of autosomal markers) approach. A CSAM is an uninterrupted set of markers in one genome (primary genome; the human genome in this study) that is syntenic in the other genome (secondary genome; the bovine and porcine genomes in this study). The analysis revealed 81 conserved syntenies and 161 CSAMs between human and bovine autosomes and 50 conserved syntenies and 95 CSAMs between human and porcine autosomes. Using the human sequence map as a reference, these 991 and 764 markers could correlate 72 and 74% of the human genome with the bovine and porcine genomes, respectively. Based on the number of contiguous markers in each CSAM, we classified these CSAMs into five size groups as follows: singletons (one marker only), small (2-4 markers), medium (5-10 markers), large (11-20 markers), and very large (>20 markers). Several bovine and porcine chromosomes appear to be represented as di-CSAM repeats in a tandem or dispersed way on human chromosomes. The number of potential CSAMs for which no markers are currently available was estimated to be 63 between human and bovine genomes and 18 between human and porcine genomes. These results provide basic guidelines for further gene and QTL mapping of the bovine and porcine genomes, as well as insight into the evolution of mammalian genomes.
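The five CSAM size groups defined in this abstract map directly onto marker-count thresholds. The sketch below encodes those thresholds; the function name and the example marker counts are hypothetical.

```python
from collections import Counter


def csam_size_group(marker_count):
    """Classify a CSAM by its number of contiguous markers, using the abstract's thresholds."""
    if marker_count < 1:
        raise ValueError("a CSAM contains at least one marker")
    if marker_count == 1:
        return "singleton"
    if marker_count <= 4:
        return "small"
    if marker_count <= 10:
        return "medium"
    if marker_count <= 20:
        return "large"
    return "very large"


# Hypothetical marker counts for a handful of CSAMs.
marker_counts = [1, 3, 7, 15, 42, 1]
tally = Counter(csam_size_group(n) for n in marker_counts)
print(tally)
```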
Publication
Journal: BMC Genomics
January/27/2016
Abstract
BACKGROUND
A central problem of computational metagenomics is determining the correct placement into an existing phylogenetic tree of individual reads (nucleotide sequences of varying lengths, ranging from hundreds to thousands of bases) obtained using next-generation sequencing of DNA samples from a mixture of known and unknown species. Correct placement allows us to easily identify or classify the sequences in the sample as to taxonomic position or function.
RESULTS
Here we propose a novel method (PhyClass), based on the Minimum Evolution (ME) phylogenetic inference criterion, for determining the appropriate phylogenetic position of each read. Without using heuristics, the new approach efficiently finds the optimal placement of the unknown read in a reference phylogenetic tree given a sequence alignment for the taxa in the tree. In short, the total resulting branch length for the tree is computed for every possible placement of the unknown read and the placement that gives the smallest value for this total is the best (optimal) choice. By taking advantage of computational efficiencies and mathematical formulations, we are able to find the true optimal ME placement for each read in the phylogenetic tree. Using computer simulations, we assessed the accuracy of the new approach for different read lengths over a variety of data sets and phylogenetic trees. We found the accuracy of the new method to be good and comparable to existing Maximum Likelihood (ML) approaches.
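The exhaustive search described above (score every possible placement, keep the one with the smallest total branch length) can be sketched as follows. This is not the PhyClass implementation: the scoring function is a caller-supplied placeholder, and the edge labels and tree lengths are hypothetical.

```python
def best_placement(edges, tree_length_with_read_on):
    """Return (edge, total_length) for the placement minimizing total tree length.

    edges: iterable of candidate edges in the reference tree.
    tree_length_with_read_on: hypothetical callable giving the total branch
        length of the tree when the read is attached to the given edge.
    """
    best_edge, best_length = None, float("inf")
    for edge in edges:
        length = tree_length_with_read_on(edge)
        if length < best_length:
            best_edge, best_length = edge, length
    if best_edge is None:
        raise ValueError("no candidate edges were supplied")
    return best_edge, best_length


# Hypothetical precomputed tree lengths for attaching one read to each edge.
toy_lengths = {"e1": 2.41, "e2": 2.17, "e3": 2.58}

edge, length = best_placement(toy_lengths, toy_lengths.get)
print(edge, length)  # e2 2.17
```

In practice the cost is dominated by the scoring function, which is where the computational efficiencies mentioned in the abstract would apply.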
CONCLUSIONS
In particular, we found that the consensus assignments based on ME and ML approaches are more correct than either method individually. This is true even when the statistical support for read assignments was low, which is inevitable given that individual reads are often short and come from only one gene.