Drew J Duckett, Tara A Pelletier, Bryan C Carstens.
Abstract
Phylogenetic estimation under the multispecies coalescent model (MSCM) assumes that all incongruence among loci is caused by incomplete lineage sorting. Applying the MSCM to datasets whose incongruence is caused by other processes, such as gene flow, can therefore lead to biased phylogeny estimates. To identify possible bias when using the MSCM, we present P2C2M.SNAPP, an R package that identifies model violations using posterior predictive simulation. P2C2M.SNAPP uses the posterior distribution of species trees output by the software package SNAPP to simulate posterior predictive datasets under the MSCM. It then uses summary statistics to compare either the empirical data or the posterior distribution to the posterior predictive distribution in order to identify model violations. In simulation testing, P2C2M.SNAPP correctly classified up to 83% of datasets (depending on the summary statistic used) as violating or not violating the MSCM. P2C2M.SNAPP offers researchers a user-friendly way to perform posterior predictive model checks when using the popular SNAPP phylogenetic estimation program. It is freely available as an R package, along with additional program details and tutorials.
Keywords: Coalescent; Multispecies coalescent model; Posterior predictive simulation; Species trees
Year: 2020 PMID: 31949994 PMCID: PMC6956792 DOI: 10.7717/peerj.8271
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Workflow of the P2C2M.SNAPP pipeline.
Blue arrows represent the path of the data. Steps outlined in blue are those performed by the user and steps outlined in red are performed by P2C2M.SNAPP. The workflow proceeds from the top of the figure.
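The comparison step described in the abstract, in which a summary statistic from the empirical data is judged against the same statistic computed on posterior predictive datasets, can be illustrated with a minimal Python sketch. This is not the P2C2M.SNAPP implementation: the function name, the two-sided tail-probability rule, and the example values are all assumptions made for illustration.

```python
def posterior_predictive_p_value(observed, simulated):
    """Two-sided tail probability of `observed` within `simulated`.

    Hypothetical helper, not part of P2C2M.SNAPP. `simulated` holds one
    summary-statistic value per posterior predictive dataset.
    """
    n = len(simulated)
    if n == 0:
        raise ValueError("need at least one simulated value")
    # Fraction of simulated values at least as large / as small as observed.
    upper = sum(1 for s in simulated if s >= observed) / n
    lower = sum(1 for s in simulated if s <= observed) / n
    # Double the smaller tail and cap at 1 (a common two-sided convention).
    return min(1.0, 2 * min(upper, lower))


# Made-up statistic values for 20 posterior predictive datasets.
simulated_stats = [i / 100 for i in range(1, 21)]  # 0.01, 0.02, ..., 0.20

for observed_stat in (0.105, 0.195, 0.50):
    p = posterior_predictive_p_value(observed_stat, simulated_stats)
    print(f"observed={observed_stat:.3f}  p={p:.2f}")
```

An observed value far outside the simulated range yields a p-value near zero, which under this convention would suggest that the model used for simulation fits the data poorly. Any cutoff applied to the p-value is a choice of the analyst.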
Figure 2. Models used in simulation testing.
(A) The MSCM used for simulation testing. (B) Example of the MSCM+m model, which adds gene flow and thereby violates the MSCM implemented in SNAPP. The amount of gene flow and the taxa exchanging genes were randomly selected for each simulation replicate.
Results of simulation testing.
Results include all simulations under both the MSCM and MSCM+m models. False positives are datasets simulated under the MSCM that P2C2M.SNAPP classified as model violations. False negatives are datasets simulated under the MSCM+m model that P2C2M.SNAPP classified as not violating the model implemented in SNAPP.
| Statistic | True positives | True negatives | False positives | False negatives | Matthews correlation coefficient (MCC) |
|---|---|---|---|---|---|
| Average pairwise FST (FSTA) | 66 | 0 | 100 | 34 | −0.45 |
| Range of pairwise FST (FSTR) | 81 | 0 | 100 | 19 | −0.32 |
| FST outlier test (PFST) | 3 | 88 | 12 | 97 | −0.17 |
| Kuhner–Felsenstein distance (KF) | 100 | 0 | 100 | 0 | 0.00 |
| Robinson–Foulds distance (RF) | 0 | 100 | 0 | 100 | 0.00 |
| Mean of maximum likelihood (MLM) | 84 | 0 | 100 | 16 | −0.29 |
| Standard deviation of maximum likelihood (MLSD) | 71 | 95 | 5 | 29 | 0.68 |
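The MCC column can be recomputed from the four counts in each row. The sketch below uses the standard definition of the Matthews correlation coefficient and assumes the common convention of reporting 0 when the denominator is zero, which matches the KF and RF rows. The function and variable names are illustrative only.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a row or column of the matrix is empty.
    if denominator == 0:
        return 0.0
    return numerator / denominator


def accuracy(tp, tn, fp, fn):
    """Fraction of all datasets that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts copied from the table, in its column order: (TP, TN, FP, FN).
results = {
    "FSTA": (66, 0, 100, 34),
    "FSTR": (81, 0, 100, 19),
    "PFST": (3, 88, 12, 97),
    "KF": (100, 0, 100, 0),
    "RF": (0, 100, 0, 100),
    "MLM": (84, 0, 100, 16),
    "MLSD": (71, 95, 5, 29),
}

for name, counts in results.items():
    print(f"{name:5s} MCC={matthews_corrcoef(*counts):+.2f} "
          f"accuracy={accuracy(*counts):.2f}")
```

For the MLSD row this gives an accuracy of 166/200 = 0.83, the "up to 83%" figure quoted in the abstract.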
Figure 3. Correlations between the level of gene flow and the ability of each summary statistic to identify model violations.
The p-value for each MSCM+m simulation is plotted against the amount of gene flow simulated for that dataset. (A) FSTA: average pairwise FST. (B) FSTR: range of pairwise FST. (C) KF: Kuhner–Felsenstein distance. (D) MLM: mean of the maximum likelihood of posterior trees. (E) MLSD: standard deviation of the maximum likelihood of posterior trees. (F) RF: Robinson–Foulds distance.