Vincent Lefort, Jean-Emmanuel Longueville, Olivier Gascuel.
Abstract
Model selection using likelihood-based criteria (e.g., AIC) is one of the first steps in phylogenetic analysis. One must select both a substitution matrix and a model for rates across sites. A simple method is to test all combinations and select the best one. We describe heuristics to avoid these extensive calculations. Runtime is divided by ∼2 with results remaining nearly the same, and the method performs well compared with ProtTest and jModelTest2. Our software, "Smart Model Selection" (SMS), is implemented in the PhyML environment and available using two interfaces: command-line (to be integrated in pipelines) and a web server (http://www.atgc-montpellier.fr/phyml-sms/).
Keywords: AIC and BIC criteria; PhyML; heuristic procedure; model selection; web server
Year: 2017 PMID: 28472384 PMCID: PMC5850602 DOI: 10.1093/molbev/msx149
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Interface, input, output, models, and options. (A) By default, the substitution model is selected by SMS using AIC; alternatively, the user may choose BIC or select the model manually. (B) The output contains standard PhyML results and the model selected by SMS with detailed information. (C) Models and options available in SMS.
Method Comparison with 500 DNA and 500 Protein Representative MSAs.
| Methods | Data | Criterion | Same Model | SMS Better | SMS Worse | Δ AIC & Δ BIC per taxon per site | # PhyML Runs SMS/other | Speed Increase |
|---|---|---|---|---|---|---|---|---|
| SMS versus Exhaustive | DNA | AIC | 486 | na | 14 | 4.6 × 10⁻⁵ | 6.1/16 | 1.9–2.0 |
| SMS versus Exhaustive | DNA | BIC | 476 | na | 24 | 8.0 × 10⁻⁵ | 7.5/16 | 1.7–1.9 |
| SMS versus Exhaustive | Protein | AIC | 494 | na | 6 | 3.7 × 10⁻³ | 29.3/68 | 2.2–2.1 |
| SMS versus Exhaustive | Protein | BIC | 497 | na | 3 | 3.8 × 10⁻³ | 30.2/68 | 2.1–2.0 |
| SMS versus jModelTest2 | DNA | AIC | 380 | 85 | 35 | −2.5 × 10⁻⁵ | 6.1/7.8 | 1.1–0.8 |
| SMS versus jModelTest2 | DNA | BIC | 308 | 151 | 41 | −1.1 × 10⁻⁴ | 7.5/7.8 | 0.9–0.8 |
| SMS versus ProtTest | Protein | AIC | 465 | 14 | 21 | −8.9 × 10⁻⁴ | 29.3/120 | 3.7–3.4 |
| SMS versus ProtTest | Protein | BIC | 465 | 12 | 23 | −7.5 × 10⁻⁴ | 30.2/120 | 3.5–3.2 |
Note.—The “Exhaustive” approach uses the same set of models as SMS and evaluates all of them. “Same model”: number of times (among 500 MSAs) where both methods return the same model; “SMS better”: number of times where the model returned by SMS has a lower AIC/BIC value; “SMS worse”: number of times where the model returned by SMS has a higher AIC/BIC value; “Δ AIC and Δ BIC per taxon per site”: when both models were different, we computed the difference in AIC/BIC per taxon per site, and averaged the results over all MSAs showing a model difference (a negative/positive value means that SMS’s model is better/worse in terms of AIC/BIC); “# PhyML runs”: number of PhyML runs for one method versus the other; “Speed increase”: for each MSA, we computed the computing time ratio of the method being compared with respect to SMS (e.g., 2 means that SMS is twice as fast), with the column displaying: i) the median value among the 500 speedup ratios for all MSAs, ii) the median value for the 50 largest MSAs (number of sites x number of taxa; see supplementary fig. S1, Supplementary Material online for additional computing time results with large MSAs).
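The quantities defined in the note can be illustrated with a minimal Python sketch. It is not taken from SMS or PhyML: all function names are hypothetical, the AIC and BIC formulas are the standard textbook definitions, and using the number of alignment sites as the BIC sample size is an assumption. The sign convention follows the note: a negative difference means that the model returned by SMS has the lower (better) criterion value.

```python
import math
from statistics import median


def aic(log_lik, n_params):
    """Standard AIC: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_sites):
    """Standard BIC: k ln(n) - 2 ln L.

    Assumption: the sample size n is the number of alignment sites.
    """
    return n_params * math.log(n_sites) - 2 * log_lik


def delta_per_taxon_per_site(score_sms, score_other, n_taxa, n_sites):
    """Criterion difference (SMS minus other), normalised by MSA size.

    Negative means the SMS model is better, positive means it is worse.
    """
    return (score_sms - score_other) / (n_taxa * n_sites)


def mean_delta(comparisons):
    """Average the normalised difference over MSAs where the models differ.

    Each comparison is a dict with the hypothetical keys
    'model_sms', 'model_other', 'score_sms', 'score_other',
    'n_taxa', and 'n_sites'. Returns None if no MSA shows a difference.
    """
    deltas = [
        delta_per_taxon_per_site(
            c["score_sms"], c["score_other"], c["n_taxa"], c["n_sites"]
        )
        for c in comparisons
        if c["model_sms"] != c["model_other"]
    ]
    return sum(deltas) / len(deltas) if deltas else None


def median_speedup(times_other, times_sms):
    """Median over MSAs of (time of compared method) / (time of SMS).

    A value of 2 means SMS is twice as fast on the median MSA.
    """
    ratios = [other / sms for other, sms in zip(times_other, times_sms)]
    return median(ratios)
```

For the second figure in the "Speed increase" column, the same `median_speedup` function would be applied to the 50 MSAs with the largest product of number of sites and number of taxa.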