Subha Kalyaanamoorthy, Bui Quang Minh, Thomas K F Wong, Arndt von Haeseler, Lars S Jermiin.
Abstract
Model-based molecular phylogenetics plays an important role in comparisons of genomic data, and model selection is a key step in all such analyses. We present ModelFinder, a fast model-selection method that greatly improves the accuracy of phylogenetic estimates by incorporating a model of rate heterogeneity across sites not previously considered in this context and by allowing concurrent searches of model space and tree space.
Year: 2017 PMID: 28481363 PMCID: PMC5453245 DOI: 10.1038/nmeth.4285
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Figure 1. Assessment of the accuracy of phylogenetic estimates obtained using ModelFinder.
(a) The rooted 100-tipped tree, with a root-to-tip distance of 0.5 substitutions/site, that was used to generate the simulated data. (b) Plot showing the true values of r and w (red lines; r = (0.06, 0.42, 0.82, 1.28, 2.58) and w = (0.08, 0.34, 0.10, 0.36, 0.12)) and the estimated values of (r, w) for the 100 simulated data sets (black dots). (c) Histograms showing the number of times different models of sequence evolution (SE) were identified under different criteria (AIC, AICc and BIC) using the default (black) and advanced (red) search options. (d) Graphs showing the distribution of Robinson-Foulds (RF) distances between the true tree and (i) the tree used during the default model search (Default), (ii) the tree found, given the optimal model of SE found using the default model-search option (Combined), and (iii) the tree found during the advanced model search (Advanced). The BIC optimality criterion was used in this example.
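The true values quoted for panel (b) can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes r holds the relative rates of the rate categories and w the proportion of sites in each category (the caption does not spell this out); under that reading one would expect the weights to sum to 1 and the weighted mean rate to equal 1. The variable and function names are illustrative only.

```python
import math

# True parameter values quoted in the Figure 1b caption.
# Assumption: r = relative rate per category, w = proportion of sites per category.
rates = [0.06, 0.42, 0.82, 1.28, 2.58]
weights = [0.08, 0.34, 0.10, 0.36, 0.12]


def mean_rate(r, w):
    """Return the weighted mean rate, sum_i w_i * r_i."""
    return sum(wi * ri for ri, wi in zip(r, w))


total_weight = sum(weights)
average_rate = mean_rate(rates, weights)

print(f"sum of weights     = {total_weight:.4f}")
print(f"weighted mean rate = {average_rate:.4f}")
```

Both quantities come out at 1 (up to floating-point rounding), so the quoted values are consistent with a rate distribution normalised to a mean of one substitution-rate unit.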
Figure 2. Illustration of the advantages provided by ModelFinder.
(a) One-dimensional plot showing the BIC scores of selected models of SE, given the alignment of amino acids used by Wu et al. (ref. 19). The models are listed above the line. Numbers drawn at a 45° angle are the BIC scores and those shown in italics are the ΔBIC scores. The relative position of each model of SE is shown on the axis, with the worst model on the right and the best model on the left. (b) Plot showing the values of r and w obtained under the R14 model of rate heterogeneity across sites (RHAS) (red lines and balls) and the Γ14 model of RHAS (black lines and balls) for the alignment analyzed by Wu et al. (ref. 19). Stars (*) indicate local peaks in the R14 model of RHAS. (c) Plot showing the RF distances between the most likely tree inferred under the LG+R14 model of SE and the most likely trees inferred under the LG+Γ14, LG+Γ4, LG+I+Γ4, LG+I+Γ5 and WAG+I+Γ5 models of SE. For comparison, a histogram with the distribution of 1,000 RF distances is included; each of these distances was obtained by comparing the most likely tree inferred under the LG+R14 model of SE to a randomly generated tree with the same number of leaves.
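For readers unfamiliar with the RF distance used in panel (c), the sketch below shows the standard unnormalised definition: the number of non-trivial bipartitions (splits) found in one unrooted topology but not in the other. It is an illustration only, not the implementation used in the study; the nested-tuple tree representation, the function names and the five-leaf example trees are all assumptions made for this example.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial bipartitions of the tree, treated as unrooted.

    Each split is stored as the side that does NOT contain a fixed reference
    leaf, so the same bipartition always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits (a single leaf versus everything else).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Unnormalised Robinson-Foulds distance between two topologies."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have identical leaf sets")
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical five-leaf trees that differ in the placement of B and C.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))

print(rf_distance(tree_a, tree_b))  # 2
print(rf_distance(tree_a, tree_a))  # 0
```

An RF distance of 0 means the two trees imply the same set of splits, which is how the RF column in the table of results can be read.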
Results from analyses of five other data sets. For each data set, the table shows the number of sequences in the alignment, the number of sites in the alignment, the optimal models of SE identified using ModelFinder and using IQ-TREE’s implementations of jModelTest (ref. 9) and ProtTest (ref. 10) (Other Methods), and the differences, in terms of the ∆BIC score and RF distance, between the phylogenetic estimates inferred using these optimal models of SE.
| Data type, source & origin | Sequences | Sites | ModelFinder model | ModelFinder BIC | Other Methods model | Other Methods BIC | ∆BIC | RF |
|---|---|---|---|---|---|---|---|---|
| DNA, Lassa virus | 179 | 3,186 | SYM+R5 | 131,325 | SYM+I+Γ4 | 131,540 | 215 | 16 |
| DNA, mitochondrial, mammals | 274 | 7,370 | GTR+R8 | 681,837 | GTR+I+Γ4 | 684,469 | 2,632 | 16 |
| DNA, nuclear, birds | 200 | 394,684 | GTR+R8 | 18,891,706 | GTR+I+Γ4 | 18,969,054 | 77,348 | 4 |
| Protein, plastids, green plants | 360 | 19,449 | JTT+F+R10 | 2,830,471 | JTT+F+I+Γ4 | 2,838,957 | 8,486 | 4 |
| Protein, nuclear, yeast | 23 | 634,530 | LG+F+R7 | 25,629,204 | LG+F+I+Γ4 | 25,638,043 | 8,839 | 0 |
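The ∆BIC column can be reproduced directly from the two BIC columns. The short check below assumes that ∆BIC is the BIC of the model chosen by the other methods minus the BIC of the model chosen by ModelFinder, which is the reading consistent with the tabulated numbers; the tuple layout and variable names are illustrative.

```python
# (data set, BIC under ModelFinder's model, BIC under the other methods' model,
#  ∆BIC as reported in the table)
rows = [
    ("DNA, Lassa virus",                 131_325,    131_540,    215),
    ("DNA, mitochondrial, mammals",      681_837,    684_469,    2_632),
    ("DNA, nuclear, birds",              18_891_706, 18_969_054, 77_348),
    ("Protein, plastids, green plants",  2_830_471,  2_838_957,  8_486),
    ("Protein, nuclear, yeast",          25_629_204, 25_638_043, 8_839),
]

for label, bic_modelfinder, bic_other, reported_delta in rows:
    # Assumption: ∆BIC = BIC(other methods) - BIC(ModelFinder).
    delta = bic_other - bic_modelfinder
    status = "matches" if delta == reported_delta else "differs from"
    print(f"{label}: computed ∆BIC = {delta:,} ({status} the reported value)")
```

Every computed difference matches the reported value, and every difference is positive, meaning the model chosen by ModelFinder has the lower (better) BIC score in all five data sets.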