John A Lees, Michelle Kendall, Julian Parkhill, Caroline Colijn, Stephen D Bentley, Simon R Harris.
Abstract
Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made.
Keywords: bacteria; phylogenetic methods; phylogeny; simulation; tree distance
Year: 2018 PMID: 29774245 PMCID: PMC5930550 DOI: 10.12688/wellcomeopenres.14265.2
Source DB: PubMed Journal: Wellcome Open Res ISSN: 2398-502X
Figure 1. The phylogeny inferred by Kremer et al. [12] used as the true tree in simulations.
Tips are coloured by BAPS cluster inferred from the core genome alignment.
Table 1. Accuracy and resource usage of phylogenetic reconstruction methods, ordered by KC metric score.
The Method column lists the best combinations of alignment with phylogenetic method, and of distance matrix with phylogenetic method. Three accuracy scores are shown for each phylogeny: the KC metric is described in the text; the BAPS scores (for the primary and secondary clusters, respectively) show a tick if the clusters are as in the true tree, and otherwise indicate which clusters are wrong (all clusters, or just the polyphyletic clusters). The parallelisability shown is that built into the software; “completely” means that every value in a distance matrix is independent, so the computation can be parallelised up to N² times. Accessory indicates whether accessory elements (not present in all isolates) are used in the phylogenetic inference.
| Method | KC | BAPS 1 | BAPS 2 | CPU time | Memory | Overheads | Parallelisability | Accessory | Recommended |
|---|---|---|---|---|---|---|---|---|---|
| RAxML + close | 4.63 | ✓ | ✓ | 806.5 minutes | 2.7 Gb | Mapped | Pthreads | No | NA (artificial) |
| RAxML | 11.2 | ✓ | ✓ | 587 minutes | 3.0 Gb | Mapped | Pthreads | No | Accurate |
| IQ-TREE (slow) | 11.2 | ✓ | ✓ | 703 minutes | 3.2 Gb | Mapped | Pthreads or MPI | No | Accurate |
| IQ-TREE (fast) | 11.3 | ✓ | ✓ | 14.6 minutes | 1.1 Gb | Mapped | Pthreads or MPI | No | Accurate/fast |
| Parsnp | 14.0 | ✓ | ✓ | 42.5 minutes | 2.6 Gb | Assemblies | Threads | No | Artificial |
| FastTree | 16.0 | ✓ | ✓ | 189 minutes | 10.6 Gb | Mapped | Threads | No | Accurate/fast |
| RAxML + core | 18.6 | ✓ | ✓ | 29.2 minutes | 154 Mb | Core gene | Pthreads | No | Comparable |
| NJ + SNPs | 20.5 | ✓ | ✓ | Negligible | Negligible | Mapped | No | No | No |
| IQ-TREE + mixed | 24.5 | ✓ | ✓ | 1316 minutes | 3.2 Gb | Mapped | Pthreads or MPI | Yes | No |
| BIONJ + mash | 51.7 | ✓ | ✓ | 0.75 minutes | 10 Mb | Assembly | Completely | Yes | Best, when no |
| RAxML + Seven | 62.6 | ✓ | ✓ | 1.4 minutes | 19 Mb | Assembly | Pthreads | No | No |
| BIONJ + andi | 66.0 | ✓ | polyphyly | 7.48 minutes | 290 Mb | Assembly | Completely | Yes | No |
| RAxML + Cactus | 67.2 | ✓ | ✓ | 9 600 minutes | 37.4 Gb | Assembly | Threads | No | No |
| RAxML + gene | 77.3 | ✓ | polyphyly | 4.28 minutes | 20 Mb | Core gene | Threads | Yes | No |
| BIONJ + k-mer | 89.6 | ✓ | ✓ | 37.3 minutes | 180 Mb | Assembly | Threads | Yes | No |
| NJ + ANI/ | 98.1 | ✓ | polyphyly | Negligible | 230 Mb | Mapped | No | No | No |
| BIONJ + BIGSdb- | 150 | ✓ | polyphyly | 0.48 minutes | Negligible | Assembly | Completely | No | No |
| UPGMA + NCD | 210 | ✓ | all | 1 040 minutes | Negligible | Assembly | Completely | Yes | No |
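The "completely" parallelisable entries in the table rest on each pairwise distance being independent of all the others. The following is a minimal sketch of that idea, not the implementation of any tool in the table: it assumes a plain Jaccard distance on k-mer sets as a stand-in distance, and the function names (`kmer_set`, `jaccard_distance`, `pairwise_distances`) and the toy sequences are hypothetical. Threads are used only to illustrate the independence of the tasks; in CPython a process pool or a compute cluster would be needed for real CPU-bound speed-ups.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of all k-mers in a sequence (toy stand-in for a sketch)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(pair):
    """Jaccard distance between two k-mer sets; needs no other pair's result."""
    a, b = pair
    return 1.0 - len(a & b) / len(a | b)


def pairwise_distances(seqs, k=3, workers=4):
    """Compute every pairwise distance as an independent task."""
    names = sorted(seqs)
    sketches = {name: kmer_set(seqs[name], k) for name in names}
    pairs = list(combinations(names, 2))
    tasks = [(sketches[a], sketches[b]) for a, b in pairs]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = list(pool.map(jaccard_distance, tasks))
    return {frozenset(p): v for p, v in zip(pairs, values)}


# Hypothetical toy input: sequences must be at least k characters long.
toy = {"x": "ACGTACGT", "y": "ACGTACGT", "z": "TTTTTTTT"}
matrix = pairwise_distances(toy)
```

Because no task reads another task's output, the list of pairs can be split across as many workers as there are entries in the matrix.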
Figure 2. Ordered accuracies from Table 1, showing the CPU time required for each tree.
There are large changes in accuracy between the alignment and distance methods, and again between two inaccurate distance methods.
Figure 3. A multidimensional scaling plot of the KC distances between all core gene trees from a real population of 616 S. pneumoniae genomes.
Top: topology distances (λ = 0); bottom: branch length distances (λ = 1). The core genome tree from the concatenated alignment is shown in yellow; trees from ribosomal proteins, which tended to have different topologies due to their lack of variation, are shown in blue. The top twenty divergent trees by branch length are listed in Supplementary Table 2 (Supplementary File 1). The full list of distances by gene can be accessed at https://gist.github.com/johnlees/da164a4260e13528e8315e266a46bf3f.
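To make the role of λ concrete, here is a minimal sketch of the Kendall-Colijn (KC) metric as it is commonly defined; it is not the code used to produce the figure. For each pair of tips it records the number of edges (m) and the path length (M) from the root to their most recent common ancestor, appends a 1 (for m) or the pendant branch length (for M) for each tip, and takes the Euclidean distance between the blended vectors (1 − λ)m + λM of two trees. The nested-tuple tree encoding `(name, branch_length, children)` and the toy trees are assumptions made for illustration.

```python
import math
from itertools import combinations


def kc_vector(tree, lam):
    """KC vector of a rooted tree given as (name, branch_length, children)."""
    mrca_edges = {}  # tip pair -> number of edges from root to MRCA
    mrca_len = {}    # tip pair -> path length from root to MRCA
    pendant = {}     # tip -> pendant branch length

    def walk(node, n_edges, dist):
        name, length, children = node
        if not children:
            pendant[name] = length
            return [name]
        groups = [walk(c, n_edges + 1, dist + c[1]) for c in children]
        # Tips in different child subtrees have this node as their MRCA.
        for g1, g2 in combinations(groups, 2):
            for a in g1:
                for b in g2:
                    mrca_edges[frozenset((a, b))] = n_edges
                    mrca_len[frozenset((a, b))] = dist
        return [tip for g in groups for tip in g]

    walk(tree, 0, 0.0)
    vec = {k: (1 - lam) * mrca_edges[k] + lam * mrca_len[k] for k in mrca_edges}
    for tip, length in pendant.items():
        vec[tip] = (1 - lam) * 1 + lam * length
    return vec


def kc_distance(t1, t2, lam=0.0):
    """Euclidean distance between the KC vectors of two trees on the same tips."""
    v1, v2 = kc_vector(t1, lam), kc_vector(t2, lam)
    if v1.keys() != v2.keys():
        raise ValueError("trees must share the same tip labels")
    return math.sqrt(sum((v1[k] - v2[k]) ** 2 for k in v1))


def cherry(a, b, inner):
    """Hypothetical helper: two unit-length tips under one internal branch."""
    return (None, inner, [(a, 1.0, []), (b, 1.0, [])])


t1 = (None, 0.0, [cherry("A", "B", 1.0), cherry("C", "D", 1.0)])
t2 = (None, 0.0, [cherry("A", "C", 1.0), cherry("B", "D", 1.0)])  # other topology
t3 = (None, 0.0, [cherry("A", "B", 2.0), cherry("C", "D", 2.0)])  # longer branches
```

In this toy example `t1` and `t3` share a topology, so their distance is zero at λ = 0 and becomes positive only at λ = 1, whereas `t1` and `t2` already differ at λ = 0.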
Figure 4. Tree of tree methods.
The KC metric between all the inferred phylogenies in Table 1 was used to create a pairwise distance matrix, and an NJ tree was created from this matrix. This shows how the topologies from all methods are related to each other (a tree-of-trees, or supertree). The true tree is in orange at the top, and four classes of methods are labelled. For alignment-based methods the mapping of reads to the TIGR4 reference was used, unless explicitly stated. We also performed multi-dimensional scaling of these distances in two dimensions to show how the methods clustered (see interactive treespace plots or static Supplementary Figure 6; Supplementary File 1).
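The step from a pairwise distance matrix to an NJ tree can be sketched as follows. This is a textbook neighbour-joining implementation written for illustration, not the software used for the figure; the function name `neighbour_joining`, the internal node names (`node0`, `node1`, ...) and the five-taxon toy matrix are hypothetical, and in the setting of the figure the entries would be KC distances between method trees.

```python
from itertools import combinations


def neighbour_joining(labels, dist):
    """Return the NJ tree as a list of (node, neighbour, branch_length) edges.

    labels: list of tip names.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(labels)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Join the pair that minimises the Q criterion.
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        u = f"node{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        dij = d[frozenset((i, j))]
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, len_i))
        edges.append((j, u, dij - len_i))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Hypothetical additive toy matrix on five items.
names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy_dist = {
    frozenset((names[x], names[y])): rows[x][y]
    for x, y in combinations(range(len(names)), 2)
}
tree_edges = neighbour_joining(names, toy_dist)
```

The result is an unrooted tree expressed as an edge list; converting it to Newick or drawing it is left to whichever tree library is at hand.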