Simon Laurin-Lemay, Kassandra Dickson, Nicolas Rodrigue.
Abstract
We draw attention to an under-appreciated simulation method for generating artificial data in a phylogenetic context. The approach, which we refer to as jump-chain simulation, can invoke rich models of molecular evolution having intractable likelihood functions. As an example, we simulate data under a context-dependent model allowing for CpG hypermutability and show how such a feature can mislead common codon models used for detecting positive selection. We discuss more generally how this method can serve to elucidate the ways in which currently used models for inference are susceptible to violations of their underlying assumptions. Finally, we show how the method could serve as an inference engine in the Approximate Bayesian Computation framework.
Keywords: Approximate Bayesian Computation; CpG hypermutability; Likelihood ratio test; Model violations; Positive selection; Site-interdependent models; Substitution models
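As a rough illustration of the jump-chain idea described in the abstract, the following is a minimal sketch of simulating substitutions along a single branch under a context-dependent model. It is not the authors' codon model: it uses a toy nucleotide process in which transitions occur at `kappa` times the rate of transversions, and transitions at CpG sites are further multiplied by `cpg_factor`. The function names, parameter names, default values, and the unnormalized time scale are all assumptions made for this example.

```python
import random

# Transition partner of each nucleotide (purine<->purine, pyrimidine<->pyrimidine).
TRANSITIONS = {"A": "G", "G": "A", "C": "T", "T": "C"}


def in_cpg(seq, i):
    """Return True if position i is the C or the G of a CpG dinucleotide."""
    if seq[i] == "C" and i + 1 < len(seq) and seq[i + 1] == "G":
        return True
    if seq[i] == "G" and i > 0 and seq[i - 1] == "C":
        return True
    return False


def event_rates(seq, kappa, cpg_factor):
    """List (site, new_base, rate) for every possible single substitution."""
    events = []
    for i, base in enumerate(seq):
        for new in "ACGT":
            if new == base:
                continue
            rate = 1.0  # baseline (transversion) rate, arbitrary units
            if TRANSITIONS[base] == new:
                rate *= kappa
                if in_cpg(seq, i):
                    rate *= cpg_factor  # hypothetical CpG hypermutability factor
            events.append((i, new, rate))
    return events


def jump_chain(seq, branch_length, kappa=2.0, cpg_factor=4.0, rng=None):
    """Simulate substitutions event by event until the branch length is used up."""
    rng = rng or random.Random()
    seq = list(seq)
    t = 0.0
    n_subs = 0
    while True:
        # Rates depend on the current sequence, so they are recomputed each step.
        events = event_rates(seq, kappa, cpg_factor)
        total = sum(rate for _, _, rate in events)
        t += rng.expovariate(total)  # waiting time to the next substitution
        if t > branch_length:
            return "".join(seq), n_subs
        weights = [rate for _, _, rate in events]
        site, new, _ = rng.choices(events, weights=weights)[0]
        seq[site] = new
        n_subs += 1
```

Calling `jump_chain("ACGTACGTCG", 0.5, rng=random.Random(1))` returns the evolved sequence and the number of substitutions that occurred. Because the process is simulated one event at a time, no likelihood calculation is needed, which is why this style of simulation can accommodate site-interdependent models.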
Year: 2022 PMID: 35652926 PMCID: PMC9233627 DOI: 10.1007/s00239-022-10058-0
Source DB: PubMed Journal: J Mol Evol ISSN: 0022-2844 Impact factor: 3.973
Fig. 1 Distribution of maximum likelihood parameter values obtained from analyzing simulated alignments with the M0 model from CodeML. Simulated alignments were generated under realistic conditions, corresponding to the posterior distribution of M0 obtained from analyzing a mammalian alignment of the WDR91 gene, with different ω values (black-dashed lines) and CpG transition rates (blue, orange, and red). There were 100 replicates per condition. Details of the simulation grid are presented in the supplementary materials. a All simulations were generated under a single true ω value (black-dashed line): 51%, 97%, and 100% of simulations had an estimated ω greater than the true value at the blue, orange, and red CpG transition rates, respectively. b All simulations were generated under a single true ω value (black-dashed line): 50%, 90%, and 100% of simulations had an estimated ω greater than the true value at the blue, orange, and red CpG transition rates, respectively. c All simulations were generated under a single true ω value (black-dashed line): 49%, 73%, and 95% of simulations had an estimated ω greater than the true value at the blue, orange, and red CpG transition rates, respectively. d–f Proportion of simulations (y-axis) rejecting the M7 model in a likelihood ratio test between the M7 and M8 models (2 degrees of freedom). Simulated data were generated under 5 different mixtures of ω values, with the ω values of each mixture equally distributed among sites, along with 4 levels of CpG transition rates. For realism, simulations were conducted using posterior average parameter values under M0 obtained by analyzing mammalian alignments of the STRIP1, GPAM, and WDR91 genes for panels d, e, and f, respectively. Circle, star, asterisk, triangle, and square markers correspond to ω-mixture 1 (0.1, 0.2, 0.3), mixture 2 (0.4, 0.5, 0.6), mixture 3 (0.7, 0.8, 0.9), mixture 4 (0.2, 0.5, 0.7), and mixture 5 (0.5, 0.7, 0.9), respectively (Color figure online)
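The likelihood ratio test mentioned for panels d–f can be sketched as follows, assuming the usual chi-square approximation with the 2 degrees of freedom stated in the caption. For 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2), so no statistics library is needed. The function names and the log-likelihood values used in any call are hypothetical and are not taken from the paper.

```python
import math


def lrt_pvalue_2df(lnl_null, lnl_alt):
    """Return (test statistic, p-value) for a 2-degree-of-freedom LRT.

    lnl_null: maximized log-likelihood of the null model (e.g. M7).
    lnl_alt:  maximized log-likelihood of the alternative model (e.g. M8).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    stat = max(stat, 0.0)  # guard against small negative values from optimization noise
    # Chi-square survival function with 2 degrees of freedom: exp(-x / 2).
    return stat, math.exp(-stat / 2.0)


def rejects_null(lnl_null, lnl_alt, alpha=0.05):
    """True if the null model is rejected at significance level alpha."""
    _, p_value = lrt_pvalue_2df(lnl_null, lnl_alt)
    return p_value < alpha
```

Applying `rejects_null` to each replicate and averaging the results over replicates gives a rejection proportion of the kind plotted on the y-axis of panels d–f.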
Fig. 2 Posterior distribution of recovered using CABC methodology when analyzing three simulated alignments (see Supplementary Materials) generated with a CpG transition rate of . For one of the simulations (blue histogram), has a posterior mean of 6.38 with a 95% credibility interval of 4.64–8.39. For a second simulation (red histogram), has a posterior mean of 9.78 with a 95% credibility interval of 7.74–11.99. For a third simulation (orange histogram), , and has a posterior mean of 7.99 with a 95% credibility interval of 6.62–9.40 (Color figure online)
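To make the posterior summaries above more concrete, here is a minimal sketch of plain rejection ABC followed by a posterior mean and a 95% interval computed from the accepted draws. This is not the CABC procedure used in the paper: the simulator is a toy stand-in (the mean of exponential waiting times with rate `theta`), and the prior, tolerance, and all function names are assumptions chosen only for illustration.

```python
import random
import statistics


def toy_simulate(theta, rng, n=200):
    """Toy simulator: summary statistic is the mean of n exponential waiting times."""
    return statistics.fmean(rng.expovariate(theta) for _ in range(n))


def abc_rejection(observed, simulate, prior_draw, tolerance, n_draws, rng):
    """Keep prior draws whose simulated summary falls within tolerance of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw(rng)
        if abs(simulate(theta, rng) - observed) <= tolerance:
            accepted.append(theta)
    return accepted


def summarize(samples):
    """Return the mean and an approximate central 95% interval of the accepted draws."""
    ordered = sorted(samples)
    lower = ordered[int(0.025 * (len(ordered) - 1))]
    upper = ordered[int(0.975 * (len(ordered) - 1))]
    return statistics.fmean(ordered), (lower, upper)
```

In use, one would generate an "observed" summary with a known rate, for example `toy_simulate(8.0, rng)`, run `abc_rejection` with a uniform prior such as `lambda r: r.uniform(1.0, 15.0)`, and pass the accepted draws to `summarize`. The accepted draws approximate the posterior without any likelihood evaluation, which is what allows a simulator with an intractable likelihood to be used for inference.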