Sebastián Duchêne, David A Duchêne, Francesca Di Giallonardo, John-Sebastian Eden, Jemma L Geoghegan, Kathryn E Holt, Simon Y W Ho, Edward C Holmes.
Abstract
BACKGROUND: Recent developments in Bayesian phylogenetic models have increased the range of inferences that can be drawn from molecular sequence data. Accordingly, model selection has become an important component of phylogenetic analysis. Methods of model selection generally consider the likelihood of the data under the model in question. In the context of Bayesian phylogenetics, the most common approach involves estimating the marginal likelihood, which is typically done by integrating the likelihood across model parameters, weighted by the prior. Although this method is accurate, it is sensitive to the presence of improper priors. We explored an alternative approach based on cross-validation that is widely used in evolutionary analysis. This involves comparing models according to their predictive performance.
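The contrast between the two approaches can be illustrated with a toy model. The sketch below is not a phylogenetic analysis and is not the procedure used in the study: it assumes a Gaussian likelihood with known standard deviation `sigma` and a zero-centred Gaussian prior on the mean with standard deviation `tau`, for which the marginal likelihood has a closed form. The data values and the function names are hypothetical. Under these assumptions, widening the prior lowers the log marginal likelihood substantially, whereas the log predictive probability of a held-out test set barely changes.

```python
import math


def log_marginal_likelihood(data, sigma, tau):
    """Exact log marginal likelihood for x_i ~ N(mu, sigma^2), mu ~ N(0, tau^2)."""
    n = len(data)
    s = sum(data)
    ss = sum(x * x for x in data)
    var, prior_var = sigma ** 2, tau ** 2
    return (
        -0.5 * n * math.log(2 * math.pi * var)
        - 0.5 * math.log(1 + n * prior_var / var)
        - ss / (2 * var)
        + prior_var * s ** 2 / (2 * var * (var + n * prior_var))
    )


def log_predictive(train, test, sigma, tau):
    """Log probability of the test set given the training set, log p(test | train)."""
    return (log_marginal_likelihood(train + test, sigma, tau)
            - log_marginal_likelihood(train, sigma, tau))


# Hypothetical observations, split 50/50 into training and test sets.
data = [1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.05]
train, test = data[:4], data[4:]

for tau in (10.0, 1e6):  # moderately vague prior vs. extremely diffuse prior
    print(f"tau={tau:g}: "
          f"log marginal = {log_marginal_likelihood(data, 1.0, tau):.3f}, "
          f"log predictive = {log_predictive(train, test, 1.0, tau):.3f}")
```

In this toy setting the predictive score depends on the prior only through the posterior obtained from the training data, which is why it is far less affected by how diffuse the prior is.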
Keywords: Bayesian phylogenetics; Cross-validation; Demographic models; Marginal likelihood; Model selection; Molecular clock
Year: 2016 PMID: 27230264 PMCID: PMC4880944 DOI: 10.1186/s12862-016-0688-y
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Details of four viral and bacterial data sets analysed in this study
| Data set | Number of sequences | Alignment length (bp) | Variable sites | Sampling time span | Reference |
|---|---|---|---|---|---|
| EV-A71 | 34 | 859 | 101 | 2011–2013 | [ |
| WNV | 68 | 10299 | 366 | 1999–2013 | [ |
| RHDV | 72 | 1737 | 571 | 1995–2014 | [ |
| Shigella sonnei | 161 | 1626 | 1626 | 1995–2014 | [ |
Molecular-clock models selected for data sets simulated with three different sequence lengths (nt) and using three different clock models: the strict clock (SC), uncorrelated lognormal relaxed clock (UCLN), uncorrelated exponential relaxed clock (UCED)
| Clock model used for simulation | Clock model used for analysis | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | 5,000 nt | | | 10,000 nt | | | 15,000 nt | | |
| | SC | UCLN | UCED | SC | UCLN | UCED | SC | UCLN | UCED |
| SC | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
| UCLN | 0.00 | 0.80 | 0.20 | 0.00 | 0.60 | 0.40 | 0.00 | 0.80 | 0.20 |
| UCED | 0.00 | 0.90 | 0.10 | 0.00 | 0.60 | 0.40 | 0.00 | 0.40 | 0.60 |
The numbers indicate the frequency with which each model was selected, out of ten simulation replicates
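The selection frequencies in a table of this kind can be computed by picking, for each replicate, the model with the highest test-set log likelihood and counting how often each model wins. The following is a minimal sketch under that assumption; the function name and the scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def selection_frequencies(replicates):
    """Return the fraction of replicates in which each model scores highest.

    `replicates` is a list of dicts mapping model name -> test-set log likelihood.
    """
    winners = Counter(max(scores, key=scores.get) for scores in replicates)
    models = sorted({m for scores in replicates for m in scores})
    return {m: winners[m] / len(replicates) for m in models}


# Hypothetical scores for four replicates (not values from the study).
replicates = [
    {"SC": -1200.5, "UCLN": -1180.2, "UCED": -1185.0},
    {"SC": -1210.1, "UCLN": -1190.7, "UCED": -1188.3},
    {"SC": -1195.4, "UCLN": -1175.9, "UCED": -1181.6},
    {"SC": -1205.0, "UCLN": -1182.3, "UCED": -1186.8},
]
print(selection_frequencies(replicates))
```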
Demographic models selected for replicate data sets simulated with three different sequence lengths (nt) and using two different demographic models: the constant-size coalescent (CSC) and exponential-growth coalescent (EGC), with a growth rate of 0.25
| Demographic model used for simulation | Demographic model used for analysis | | | | | |
|---|---|---|---|---|---|---|
| | 5,000 nt | | 10,000 nt | | 15,000 nt | |
| | CSC | EGC | CSC | EGC | CSC | EGC |
| CSC | 0.70 | 0.30 | 0.40 | 0.60 | 0.40 | 0.60 |
| EGC | 0.10 | 0.90 | 0.10 | 0.90 | 0.10 | 0.90 |
Each row corresponds to simulations performed using one of the two demographic models
Comparison of molecular clock and demographic models for four empirical data sets: Enterovirus A71 (EV-A71), West Nile Virus (WNV), Rabbit Hemorrhagic Disease Virus (RHDV), and Shigella sonnei
| Method | Data set | SC + CSC | SC + EGC | UCLN + CSC | UCLN + EGC |
|---|---|---|---|---|---|
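| Cross validation (50 % training; 50 % test) | EV-A71 | −1129.4(±3.1) | | −1921.9(±9.8) | −1396.1(±12.0) |
| Cross validation (50 % training; 50 % test) | WNV | −8216.7(±1.3) | | −8648.9(±5.3) | −8691.3(±5.0) |
| Cross validation (50 % training; 50 % test) | RHDV | −6456.1(±0.6) | −6908.8(±0.3) | −6102.8(±1.3) | |
| Cross validation (50 % training; 50 % test) | Shigella sonnei | | −7699.5(±0.3) | −25997.4(±7.9) | −25630.9(±6.3) |
| Cross validation (80 % training; 20 % test) | EV-A71 | −443.0(±1.8) | | −1246.5(±4.7) | −1286.7(±14.8) |
| Cross validation (80 % training; 20 % test) | WNV | −3615.2(±2.6) | | −3900.0(±19.9) | −3857.1(±19.2) |
| Cross validation (80 % training; 20 % test) | RHDV | −2394.7(±0.6) | −2393.5(±0.7) | −2336.7(±1.0) | |
| Cross validation (80 % training; 20 % test) | Shigella sonnei | | −2979.8(±2.0) | −3172.3(±11.6) | −3032.5(±10.0) |
| Marginal likelihoods using stepping stone | EV-A71 | −2017.0 | | −2017.9 | −2078.6 |
| Marginal likelihoods using stepping stone | WNV | −18012.7 | −17998.2 | −18009.2 | |
| Marginal likelihoods using stepping stone | RHDV | −11323.8 | −11292.6 | −11271.5 | |
| Marginal likelihoods using stepping stone | Shigella sonnei | −14739.6 | −14746.5 | −14717.8 | |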
The models correspond to four combinations of clock and demographic models: strict clock (SC), uncorrelated lognormal clock (UCLN), constant-size coalescent (CSC), and exponential-growth coalescent (EGC). Mean log likelihoods across ten replicates are given for the test set from each data set, using training sets of 50 and 80 % of the total alignment length. Marginal log likelihoods using stepping-stone sampling are also shown for comparison. Values in bold correspond to the highest log likelihood in each case. Values in parentheses indicate the standard error around the mean likelihood for ten cross-validation replicates
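The summary statistics described in this note can be sketched as follows: for each model, average the test-set log likelihoods over the cross-validation replicates, report the standard error of that mean, and select the model with the highest mean. The sketch assumes the standard error is the sample standard deviation divided by the square root of the number of replicates; the function names and scores are hypothetical and are not values from the study.

```python
import math
import statistics


def summarise(log_likelihoods):
    """Return (mean, standard error) of test-set log likelihoods across replicates."""
    mean = statistics.mean(log_likelihoods)
    se = statistics.stdev(log_likelihoods) / math.sqrt(len(log_likelihoods))
    return mean, se


def best_model(results):
    """Return the model whose mean test-set log likelihood is highest."""
    return max(results, key=lambda model: summarise(results[model])[0])


# Hypothetical replicate scores (not values from the study).
results = {
    "SC + CSC": [-450.0, -452.0, -448.0, -451.0, -449.0],
    "UCLN + CSC": [-1240.0, -1250.0, -1245.0, -1255.0, -1235.0],
}
for model, scores in results.items():
    mean, se = summarise(scores)
    print(f"{model}: {mean:.1f} (±{se:.1f})")
print("selected:", best_model(results))
```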
Fig. 1 Posterior distributions of the coefficient of variation of branch rates and the population growth rate for four empirical data sets: Enterovirus A71 (EV-A71), West Nile Virus (WNV), Rabbit Hemorrhagic Disease Virus (RHDV), and Shigella sonnei. Estimates were made using the uncorrelated lognormal clock (UCLN) and the exponential-growth coalescent (EGC). A coefficient of variation of branch rates that approaches zero indicates that evolution has been clock-like. A growth rate including zero indicates that population size has been constant
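The two diagnostics in this caption can be sketched in a few lines. The example assumes the coefficient of variation is the standard deviation of the branch rates divided by their mean (computed once per posterior sample), and that "including zero" is judged with a 95 % highest posterior density interval, taken here as the shortest interval containing 95 % of the samples. Both choices, the function names, and the sample values are assumptions for illustration and are not taken from the study.

```python
import math
import statistics


def coefficient_of_variation(branch_rates):
    """Standard deviation of the branch rates divided by their mean."""
    return statistics.pstdev(branch_rates) / statistics.mean(branch_rates)


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the sorted samples."""
    ordered = sorted(samples)
    n = len(ordered)
    k = max(1, math.ceil(mass * n))  # number of samples inside the interval
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


def includes_zero(samples, mass=0.95):
    """True if the HPD interval of the samples spans zero."""
    lower, upper = hpd_interval(samples, mass)
    return lower <= 0.0 <= upper


# Hypothetical posterior samples of the growth rate (not values from the study).
growing = [0.2 + 0.001 * i for i in range(100)]    # all samples above zero
constant = [-0.05 + 0.001 * i for i in range(100)]  # samples straddle zero
print(includes_zero(growing), includes_zero(constant))

# Identical rates on every branch give a coefficient of variation of zero.
print(coefficient_of_variation([2.0, 2.0, 2.0]))
```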