treeman: an R package for efficient and intuitive manipulation of phylogenetic trees
Phylogenetic trees are hierarchical structures used for representing the inter-relationships between biological entities. They are the most common tool for representing evolution and are essential to a range of fields across the life sciences. The manipulation of phylogenetic trees—in terms of adding or removing tips—is often performed by researchers not just for reasons of management but also for performing simulations in order to understand the processes of evolution. Despite this, the most common programming language among biologists, R, has few class structures well suited to these tasks.
We present an R package that contains a new class, called TreeMan, for representing the phylogenetic tree. This class has a list structure allowing phylogenetic trees to be manipulated more efficiently. Computational running times are reduced because of the ready ability to vectorise and parallelise methods. Development is also improved due to fewer lines of code being required for performing manipulation processes.
We present three use cases—pinning missing taxa to a supertree, simulating evolution with a tree-growth model and detecting significant phylogenetic turnover—that demonstrate the new package’s speed and simplicity.
Electronic supplementary material
The online version of this article (doi:10.1186/s13104-016-2340-8) contains supplementary material, which is available to authorized users.
Phylogenetic trees have been a mainstay of the R statistical software environment since the release of Emmanuel Paradis’ APE package in 2002 [1, 2]. This package introduced the phylo object, an S3 class for the presentation and manipulation of phylogenetic tree data in the R environment. In its most basic implementation, the phylo object contains a list of three elements: an edge matrix, a vector of tip labels and an integer of the number of internal nodes. The use of an edge matrix facilitates phylogenetically structured statistical analyses because of its convenience for generating distance, cophenetic or covariance matrices. For this reason the APE package’s phylo is the dominant class for phylogenetic tree representation in R and is used by many well-known phylogenetic R packages (e.g. phangorn , phytools ). Since phylo’s first incarnation the number of available functions in the APE package has risen from 28 to 171 (versions 0.1–3.4), and to date there are 147 reverse dependencies, i.e. packages on CRAN  that depend on the phylo class. More recently, the phylo class has been updated to S4 as part of the phylobase package .
An edge matrix, however, leads to a dependence on index referencing, leading to certain computational scenarios in which the phylo object performs poorly: in particular, analyses that require the manipulation of the tree itself (i.e. tip and node addition/deletion). Such analyses include simulating, comparing, pruning, and merging trees, and calculating phylogenetic statistics such as measures of phylogenetic richness  and evolutionary distinctness . These have become the preserve of software solutions external to R, e.g. [8, 9], hindering their integration with the many packages in biomolecular, evolutionary and ecological studies already available for R. Although there are alternatives to the phylo class for phylogenetics or more generally ‘networks’ available in R , these packages and classes are rarely used for phylogenetics and may lack the intuitive functional framework for manipulating evolutionary trees.
Here we present the new phylogenetic tree manipulation class ‘TreeMan’ (see Fig. 1 for an overview); this is presented as the R package ‘treeman’ (N.B. the package name is all lowercase). This class is built around a list of named nodes rather than an index-based edge matrix as is the case for the phylo class. Using an edge matrix, whenever a node is added or removed the new positions of all nodes in the matrix must be determined and the tree must be re-computed. With a node list, however, order does not matter; nodes can be added and removed without altering the entire tree structure. Manipulations are also less dependent on tree size because all that is required is to update the local nodes: those that directly descend or ascend from the new node, converting that scale of computation time from O(N2) to O(N) (see Fig. 2 for a comparison of growing a tree with the phylo and TreeMan classes). Furthermore, with a node list the nodes in the tree can have unique IDs, which persist after insertions or deletions, allowing elements in a tree (such as node labels) to be more easily tracked during analysis. The subsequent sections of the paper describe the overall structure of the new class, describe treeman’s naming convention, and provide examples of tree manipulations that use the new package. The aims of treeman are to be conceptually intuitive for tree manipulation and as computationally efficient as possible within the R environment.
The TreeMan object in R is an S4 formal class whose main data slot is a list—which in R is a vector whose elements can be named. All nodes in a TreeMan object are named elements in this list (ndlst). Each node usually contains the following data slots: the node ID (id), the length of the preceding edge (spn, for “span”), the IDs of all connecting ascending/ancestral nodes to root (pre-node IDs, prid), the IDs of the immediately descending nodes (post-node IDs, ptid), and the IDs of all descending tips (kids). Additionally, if all nodes in a tree contain the spn slot, then each node will also contain: the total edge length of all descending nodes (phylogenetic diversity, pd), total edge length of all connected pre-nodes (prdst; in a rooted tree this is the root-to-tip distance), and the relative distance of the node in the tree (age, for a time-calibrated rooted tree). All nodes must have either a prid and/or ptid data slots: tip nodes have only prid slots, root nodes have only ptid slots, and internal nodes have both. These slots must contain IDs that are found within the ndlst; if they do not, an error is raised. These core slots are supplemented by optional slots, a non-unique taxonomic name that can be used to generate lineages (txnym) and user-defined slots that can contain any kind of information. In addition to the ndlst, the TreeMan object contains informative slots that are generated upon reading or generating the tree, and are updated whenever modified. Basic tree information can be seen by printing the tree to console.
The treeman package implements an intuitive naming convention in which each tree feature has a specific name that all methods and objects must use (see Fig. 1; Table 1). Specific tree or node information can be accessed using R indexing by character-string. For example, entering tree [“tips”] will output data on a tree’s tips. Double square brackets are used for pulling out information on individual nodes, e.g. tree [[“t1”]] will return node information on tip/node t1. The majority of methods in the treeman package are grouped into four main groups: get, set, calc and manip (see Table 1). The get methods return node or tree specific information, set methods change the tree’s or its nodes’ parameters, calc methods generate tree statistics, and manip methods alter the tree, usually by adding or removing tips and nodes. Methods that act across nodes are indicated with ‘nd’ or ‘nds’ in their function name, e.g. getNdAge() for a single node and getNdsAge() for multiple nodes. These core methods are designed to be fast and modular, allowing them to be readily combined into more complex functions. For example, evolutionary distinctness using the Fair Proportion metric  can be calculated by using the method calcFrPrp(). In implementation, this method uses getNdPrid() to get all pre-node IDs, runs getNdsKids() on these IDs to find the number of descendants per pre-node, and then sums the division of the numbers of descendants over the pre-nodes’ spans.
|get||Retrieve specific information about parts of a tree, often nodes||getNdAge, getPrnt, getNdKids, getNdLng, getNdPrid, getNdPtid, getPath, getSubtree|
|calc||Calculate tree statistics and tree associated information||calcDstMtrx, calcTrDst, calcPhyDv, calcFrPrp|
|set||Set node or overall tree values||setNdSpn, setAge, setPD, setRoot, setTol|
|manip||Change tree structure by adding or removing tips and nodes||addTip, rmTip, pinTip|
Because the TreeMan class depends on the ndlst, all functions that run over this list are vectorised. All treeman functions that can be vectorised are done so using plyr vectorisation , providing substantial performance benefits, as computation is no longer taking place at the scripting level. Through the use of plyr these functions can also be parallelised using the “.parallel” argument that is passed onto plyr functions, which work in conjunction with parallel R packages such as DoMC  and doSNOW .
Results and discussion
To demonstrate the TreeMan class and how the treeman functions can be combined to complete complex tasks, we demonstrate three use-cases: pinning missing taxa using online taxonomic databases to a molecular phylogenetic tree; simulating phylogenetic trees through time using different models of evolution; and testing for significant phylogenetic turnover between ecological communities.
Tip pinning: adding missing taxa to a tree using online taxonomies
Trees often have missing tips due to a lack of data for phylogenetic construction. One approach to placing these tips is to use the taxonomy of the missing taxa and constraining placement with a model of evolution, as implemented by the Pastis software package . Similar methods can be implemented in R using treeman. Our package provides an addTip() function that takes as arguments the incipient edge and an age range. Furthermore, TreeMan objects can be taxonomically informed: taxonomic names (txnyms) can be assigned to every node in a tree, allowing a user to constrain random placement of nodes. To demonstrate this functionality we present the mammalian supertree  that has been taxonomically informed using NCBI’s taxonomy . We retrieved all the species names listed in NCBI but not present in the supertree, and pinned an example set of 100 new tips to the supertree. New tips were added to the supertree at any point in the branches that shared the lowest matching taxonomic rank with the taxonomy of the new tip (Figs. 3, 4). This was implemented with pinTips(), a function of 49 lines. The equivalent function using a phylo object is approximately 500 lines (see ‘pinning-with-phylo.R’ in Additional file 1).
Tree simulation: generating trees using different models of evolution
Tree simulation is an important tool for exploring the processes that may have generated biodiversity. A common tool for simulating trees is the birth–death simulation or equals-rates Markov model (ERMM) that randomly removes and adds tips on a tree . Many publications have independently explored alterations of the ERMM, such as modifying the rates of speciation for different tips in the tree [18–21]. All such studies, however, have developed their own software tools for tree simulation. The TreeMan object is easily modified using the addTip() and rmTip() functions and its speed of processing and implementation makes TreeMan an ideal software tool for tree simulation. Furthermore, as part of the R environment a user has access to the wide range of eco-evolutionary R packages already available to expand on these earlier tree simulations. To demonstrate this functionality, we produced an R script for simulating a tree using an Evolutionary Distinctness Biased Markov Model (EDBMM ) with fewer than 40 lines of R code (Fig. 5). The simulation randomly adds and removes tips but with a bias towards evolutionary distinctness: more evolutionarily distinct tips have a lower rate of speciation and extinction (Fig. 6). The equivalent script for running a vectorised EDBMM with a phylo object takes 188 lines (see ‘edbmm-with-phylo.R’ in Additional file 2).
Testing for significant phylogenetic turnover
A common need in biological analysis is the detection of phylogenetic signal. Such analyses are often executed using model-based statistical analyses for which APE has been primarily designed . A model-based approach, however, is not always applicable to every question related to phylogenetic signal. One such question is whether the changes in species between habitat types are due to phylogenetic signal in species’ gains and losses, e.g. as a result of human-caused habitat loss . A useful metric for calculating such a difference is the UniFrac measure , which measures the unique and shared fractions of branches represented between communities. The TreeMan object is well suited to calculating this metric as all nodes in a tree have IDs; even if nodes are added or removed, IDs are constant. To demonstrate this we randomly generated community data with different intensities of overlap. We then ran permutation tests that detect whether there has been significant phylogenetic turnover between these communities using treeman’s calcOvrlp() (Figs. 7, 8).
TreeMan is an S4 class that encodes a phylogenetic tree using a node list. The advantage of a node list is the faster computational processing, and the ready capacity to track nodes between manipulations and vectorise or parallelise large-scale tree manipulations. The treeman package introduces new terminology to describe different elements of a tree and uses a naming convention to combine these new terms to make a more intuitive set of methods for tree manipulation.
DJB initiated the project, developed the code and wrote the paper. STT and MDS provided supervision, ideas and critically reviewed the final manuscript. All authors read and approved the final manuscript.
Thanks to Susy Echeverría-Londoño for testing the initial code, and to the volunteers at CRAN for making the code available.
The authors declare that they have no competing interests.
Availability of data and materials
The datasets supporting the conclusions of this article are available in the project GitHub repository, https://github.com/DomBennett/treeman.
This project was funded by a Natural and Environmental Research Council (NERC, UK) PhD. grant.
Project home page—https://github.com/DomBennett/treeman
Operating system(s)—platform independent
Other requirements—R v. 3+
Any restrictions to use by non-academics—none.
- 1. Paradis E, Blomberg S, Bolker B, Claude J, Cuong HS, Desper R, Didier G, Durand B, Dutheil J, Gascuel O. ape: Analyses of phylogenetics and evolution. 2016. https://cran.r-project.org/web/packages/ape/index.html. Accessed 24 Feb 2016.
- 2. APE: analyses of phylogenetics and evolution in R languageBioinformatics2004202289290
- 3. phangorn: phylogenetic analysis in RBioinformatics2011274592593
- 4. phytools: an R package for phylogenetic comparative biology (and other things)Methods Ecol Evol20123217223
- 5. Michonneau F. phylobase: Base package for phylogenetic structures and comparative data. 2016. https://cran.r-project.org/web/packages/phylobase/index.html. Accessed 3 Sep 2016.
- 6. Conservation evaluation and phylogenetic diversityBiol Conserv1992611110
- 7. Mammals on the EDGE: conservation priorities based on threat and phylogenyPLoS ONE200723e296
- 8. Phyutility: a phyloinformatics tool for trees, alignments and molecular dataBioinformatics2008245715716
- 9. TreeCmp: comparison of trees in polynomial timeEvolut Bioinform Online.20128475487
- 10. Csardi G. igraph: network analysis and visualization. 2015. https://cran.r-project.org/web/packages/igraph/index.html. Accessed 3 Sep 2016.
- 11. The split-apply-combine strategy for dataJ Stat Softw201140129
- 12. Calaway R, Weston S, Revolution analytics. doMC: foreach parallel adaptor for ‘parallel’. 2015. https://cran.r-project.org/web/packages/doMC/index.html. Accessed 17 April 2016.
- 13. Calaway R, Weston S, Revolution analytics. doSNOW: foreach parallel adaptor for the ‘snow’ package. 2015. [https://cran.r-project.org/web/packages/doSNOW/index.html]. Accessed 17 April 2016.
- 14. PASTIS: an R package to facilitate phylogenetic assembly with soft taxonomic inferencesMethods Ecol Evol201341110111017
- 15. The delayed rise of present-day mammalsNature20074467135507512
- 17. Inferring evolutionary process from phylogenetic tree shapeQ Rev Biol.1997723154
- 18. The shape of mammalian phylogeny: patterns, processes and scalesPhilos Trans Royal Soc Lond B.2011366157724622477
- 19. Age-dependent speciation can explain the shape of empirical phylogeniesSystematic Biol.2015643432440
- 20. Model inadequacy and mistaken inferences of trait-dependent speciationSyst Biol2015642340355
- 21. Bennett DJ, Sutton MD, Turvey ST. Evolutionarily distinct “living fossils” require both lower speciation and lower extinction rates. Paleobiology. (In press).
- 22. Loss of avian phylogenetic diversity in neotropical agricultural systemsScience2014
- 23. UniFrac: a new phylogenetic method for comparing microbial communitiesAppl Environ Microbiol2005711282288235
- 24. Bennett DJ. MoreTreeTools: more phylogenetic tree tools in R (development copy). 2016. https://zenodo.org/badge/latestdoi/4641/DomBennett/MoreTreeTools. Accessed 24 July 2016.