Abstract
Bacterial evolution is characterized by frequent gain and loss events of gene families. These events can be inferred from phyletic pattern data, a compact representation of gene family repertoire across multiple genomes. The maximum parsimony paradigm is a classical and prevalent approach for detecting gene family gains and losses mapped on specific branches. We and others have previously developed probabilistic models that aim to account for the stochastic dynamics of gain and loss. These models are a critical component of a methodology termed stochastic mapping, in which probabilities and expectations of gain and loss events are estimated for each branch of an underlying phylogenetic tree. In this work, we present a phyletic pattern simulator in which the gain and loss dynamics are assumed to follow a continuous-time Markov chain along the tree. Various models and options are implemented to make the simulation software useful for a large number of studies in which binary (presence/absence) data are analyzed. Using this simulation software, we compared the ability of the maximum parsimony and the stochastic mapping approaches to accurately detect gain and loss events along the tree. Our simulations cover a large array of evolutionary scenarios in terms of the propensities for gene family gains and losses and the variability of these propensities among gene families. Although both methods achieve relatively low false positive rates in all simulation schemes, stochastic mapping outperforms maximum parsimony in terms of true positive rates. We further studied the factors that influence the performance of both methods. We find, for example, that the accuracy of maximum parsimony inference is substantially reduced when the goal is to map gain and loss events along internal branches of the phylogenetic tree. Furthermore, the accuracy of stochastic mapping is reduced with smaller data sets (limited number of gene families) due to unreliable estimation of branch lengths. Our simulator and simulation results are additionally relevant for the analysis of other types of binary-coded data, such as the existence of homologous restriction sites, gaps, and introns, to name a few. Both the simulation software and the inference methodology are freely available at a user-friendly server: http://gloome.tau.ac.il/.
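To make the simulation idea concrete, the following is a minimal sketch of how a presence/absence pattern could be generated under a two-state continuous-time Markov chain along a tree. It is not the authors' simulator: the toy tree, the function names `evolve_branch` and `simulate_site`, and the choice to draw the root state from the stationary distribution are all assumptions made for illustration, and a single gain rate and loss rate are used for the whole site.

```python
import random

# Hypothetical toy tree: (name, branch_length, children); leaves have no children.
TREE = ("root", 0.0, [
    ("n1", 0.3, [("A", 0.2, []), ("B", 0.4, [])]),
    ("C", 0.6, []),
])

def evolve_branch(state, length, gain, loss, rng):
    """Run the two-state chain along one branch; return (end_state, events)."""
    events = []
    elapsed = 0.0
    while True:
        # Absent (0) families are gained at rate `gain`; present (1) ones are lost at rate `loss`.
        rate = gain if state == 0 else loss
        if rate <= 0:
            break
        elapsed += rng.expovariate(rate)  # exponential waiting time to the next event
        if elapsed >= length:
            break
        events.append("gain" if state == 0 else "loss")
        state = 1 - state
    return state, events

def simulate_site(tree, gain, loss, rng):
    """Simulate one gene family; return (leaf pattern, true events per branch)."""
    pattern, events = {}, {}
    # Assumption: the root state is drawn from the stationary distribution.
    root_state = 1 if rng.random() < gain / (gain + loss) else 0

    def walk(node, parent_state):
        name, length, children = node
        state, branch_events = evolve_branch(parent_state, length, gain, loss, rng)
        events[name] = branch_events
        if not children:
            pattern[name] = state
        for child in children:
            walk(child, state)

    walk(tree, root_state)
    return pattern, events

rng = random.Random(0)
pattern, events = simulate_site(TREE, gain=0.5, loss=1.0, rng=rng)
print(pattern)  # presence/absence at the leaves A, B, C
print(events)   # the true gain/loss events on every branch
```

Because the true events on each branch are recorded alongside the leaf pattern, a simulation of this kind provides the ground truth against which inferred gains and losses can be scored.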
Year: 2011 PMID: 21971516 PMCID: PMC3215202 DOI: 10.1093/gbe/evr101
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Evaluation of Stochastic Mapping and Maximum Parsimony Performance for Events Detection in Various Simulation Schemes
| Simulation Scenario Code | Rate Distribution among Sites | Loss/Gain Ratio in Simulation | CRR | MCCs Ratio | MCC Mapping | TPR Mapping | FPR Mapping | MCC Parsimony | TPR Parsimony | FPR Parsimony |
| ER_gEql | Equal | 1 | 1.812 | 1.435 | 0.809 | 0.763 | 0.005 | 0.563 | 0.421 | 0.006 |
| ER_gVrl_1 | Equal | 1 | 1.515 | 1.296 | 0.685 | 0.577 | 0.005 | 0.529 | 0.381 | 0.006 |
| ER_gVrl_2 | Equal | 2 | 1.588 | 1.345 | 0.666 | 0.563 | 0.006 | 0.495 | 0.354 | 0.006 |
| ER_gVrl_4 | Equal | 4 | 1.709 | 1.426 | 0.616 | 0.503 | 0.007 | 0.432 | 0.294 | 0.007 |
| ER_gVrl_8 | Equal | 8 | 1.675 | 1.416 | 0.524 | 0.381 | 0.006 | 0.37 | 0.227 | 0.006 |
| VR_gEql | Gamma | 1 | 1.834 | 1.463 | 0.729 | 0.651 | 0.005 | 0.498 | 0.355 | 0.006 |
| VR_gVrl_1 | Gamma | 1 | 1.952 | 1.527 | 0.712 | 0.621 | 0.005 | 0.466 | 0.318 | 0.005 |
| VR_gVrl_2 | Gamma | 2 | 2.007 | 1.557 | 0.702 | 0.608 | 0.005 | 0.451 | 0.303 | 0.005 |
| VR_gVrl_4 | Gamma | 4 | 2.093 | 1.608 | 0.66 | 0.546 | 0.005 | 0.411 | 0.261 | 0.005 |
| VR_gVrl_8 | Gamma | 8 | 2.142 | 1.636 | 0.593 | 0.446 | 0.004 | 0.362 | 0.208 | 0.004 |
| COG_Parsimony | Parsimony | 2.89 | 1.359 | 1.22 | 0.425 | 0.246 | 0.004 | 0.348 | 0.181 | 0.004 |
| COG_Model | Model | 4.63 | 1.802 | 1.456 | 0.576 | 0.419 | 0.004 | 0.396 | 0.232 | 0.004 |
COG_Parsimony and COG_Model: rates are based on empirical estimation from COG gene families.
ER_gEql and VR_gEql: the gain-to-loss ratio is 1 and does not vary among sites.
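The MCC, TPR, and FPR columns are standard confusion-matrix measures. The sketch below shows how they could be computed from counts of true and false event calls. The function name `confusion_metrics` is hypothetical, and treating each candidate gain or loss event as one binary classification is an assumption about the scoring, not something stated in the table. The sketch also checks an observation that can be read directly from the rows above: the MCCs Ratio column matches MCC Mapping divided by MCC Parsimony, and the CRR column matches TPR Mapping divided by TPR Parsimony, up to rounding of the reported values.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Return (MCC, TPR, FPR) from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return mcc, tpr, fpr

# Values copied from three rows of the table above:
# (code, CRR, MCCs ratio, MCC mapping, TPR mapping, MCC parsimony, TPR parsimony)
ROWS = [
    ("ER_gEql",   1.812, 1.435, 0.809, 0.763, 0.563, 0.421),
    ("VR_gVrl_8", 2.142, 1.636, 0.593, 0.446, 0.362, 0.208),
    ("COG_Model", 1.802, 1.456, 0.576, 0.419, 0.396, 0.232),
]

for code, crr, mcc_ratio, mcc_map, tpr_map, mcc_pars, tpr_pars in ROWS:
    # Recomputed ratios differ slightly because the inputs are rounded to three decimals.
    print(code, round(tpr_map / tpr_pars, 3), crr, round(mcc_map / mcc_pars, 3), mcc_ratio)
```

Both ratios exceed 1 whenever stochastic mapping scores higher than maximum parsimony on the corresponding measure.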
Maximum Parsimony Performance Separated for Gain and Loss Detection under Two Parsimony Cost Matrices
| Inference | Cost Matrix (Gain:Loss) | MCC | TPR | FPR |
| Overall inference | Cost 1:1 | 0.348 | 0.181 | 0.004 |
| Overall inference | Cost 2:1 | 0.337 | 0.18 | 0.004 |
| Gain inference | Cost 1:1 | 0.388 | 0.231 | 0.003 |
| Gain inference | Cost 2:1 | 0.356 | 0.163 | 0.001 |
| Loss inference | Cost 1:1 | 0.322 | 0.131 | 0.001 |
| Loss inference | Cost 2:1 | 0.339 | 0.197 | 0.003 |
NOTE: The simulation scenario in all these evaluations is based on rates estimated by maximum parsimony from COG gene families (COG_Parsimony).
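To illustrate how a gain:loss cost matrix can change which events parsimony infers, here is a minimal sketch of weighted (Sankoff-style) parsimony for a single binary character. It is an illustration under stated assumptions, not the implementation evaluated in the table: the tree, the function name `sankoff_events`, and the tie-breaking rule (prefer absence when costs are equal) are all hypothetical choices.

```python
INF = float("inf")

# Hypothetical toy tree: (name, children); leaves have no children.
TREE = ("root", [
    ("n1", [("A", []), ("B", [])]),
    ("n2", [("C", []), ("D", [])]),
])

def sankoff_events(tree, pattern, gain_cost=1.0, loss_cost=1.0):
    """Return {branch: 'gain' | 'loss'} for one minimum-cost reconstruction."""
    step = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): gain_cost, (1, 0): loss_cost}
    cost = {}  # cost[name][s] = minimal cost of the subtree if the node has state s

    def up(node):
        name, children = node
        if not children:
            cost[name] = [0.0 if pattern[name] == s else INF for s in (0, 1)]
            return
        for child in children:
            up(child)
        cost[name] = [
            sum(min(step[s, t] + cost[child[0]][t] for t in (0, 1)) for child in children)
            for s in (0, 1)
        ]

    events = {}

    def down(node, state):
        for child in node[1]:
            child_cost = cost[child[0]]
            # Ties are broken in favour of state 0 (an assumption of this sketch).
            child_state = min((0, 1), key=lambda t: step[state, t] + child_cost[t])
            if child_state != state:
                events[child[0]] = "gain" if child_state == 1 else "loss"
            down(child, child_state)

    up(tree)
    root_state = min((0, 1), key=lambda s: cost[tree[0]][s])
    down(tree, root_state)
    return events

pattern = {"A": 1, "B": 1, "C": 0, "D": 0}
print(sankoff_events(TREE, pattern, gain_cost=1, loss_cost=1))  # {'n1': 'gain'}
print(sankoff_events(TREE, pattern, gain_cost=2, loss_cost=1))  # {'n2': 'loss'}
```

In this toy case, doubling the gain cost turns one inferred gain into one inferred loss on a different branch, which is in line with the direction of the gain and loss TPR changes between the 1:1 and 2:1 rows above.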
Performance Evaluation in Various Subsets of Events
| Evaluated Subset | CRR | MCCs Ratio | MCC Mapping | TPR Mapping | FPR Mapping | MCC Parsimony | TPR Parsimony | FPR Parsimony |
| Reference | 1.359 | 1.22 | 0.425 | 0.246 | 0.004 | 0.348 | 0.181 | 0.004 |
| External branches | 1.208 | 1.125 | 0.543 | 0.378 | 0.003 | 0.483 | 0.313 | 0.003 |
| Internal branches | 1.562 | 1.352 | 0.357 | 0.185 | 0.004 | 0.264 | 0.119 | 0.004 |
| Deep branches | 2.044 | 1.795 | 0.242 | 0.11 | 0.005 | 0.135 | 0.054 | 0.005 |
| Low rate | 1.208 | 1.123 | 0.494 | 0.309 | 0.002 | 0.44 | 0.256 | 0.002 |
| High rate | 1.47 | 1.301 | 0.387 | 0.217 | 0.005 | 0.297 | 0.147 | 0.005 |
NOTE: The simulation scenario in all these evaluations is based on rates estimated by maximum parsimony from COG gene families (COG_Parsimony).
Performance Evaluation with Variable Data Set Size
| Number of Sites Used for Model and Branch Lengths Estimation | CRR | MCCs Ratio | MCC Mapping | TPR Mapping | FPR Mapping | MCC Parsimony | TPR Parsimony | FPR Parsimony |
| 10,000 | 1.359 | 1.22 | 0.425 | 0.246 | 0.004 | 0.348 | 0.181 | 0.004 |
| 5,000 | 1.34 | 1.21 | 0.425 | 0.247 | 0.00354 | 0.352 | 0.184 | 0.00355 |
| 1,000 | 1.34 | 1.21 | 0.423 | 0.245 | 0.00352 | 0.35 | 0.183 | 0.00357 |
| 500 | 1.32 | 1.2 | 0.421 | 0.243 | 0.00351 | 0.351 | 0.183 | 0.00352 |
| 100 | 1.23 | 1.15 | 0.4 | 0.224 | 0.00354 | 0.349 | 0.182 | 0.00357 |
| 50 | 1.18 | 1.12 | 0.394 | 0.219 | 0.0035 | 0.353 | 0.185 | 0.00351 |
| 10 | 0.898 | 0.941 | 0.328 | 0.163 | 0.00328 | 0.349 | 0.181 | 0.00351 |
| 10 (branch lengths given) | 2.43 | 1.74 | 0.604 | 0.44 | 0.00357 | 0.348 | 0.181 | 0.00359 |
NOTE: A smaller number of sites available for model and branch length estimation results in lower stochastic mapping performance. The simulation scenario in all these evaluations is based on rates estimated by maximum parsimony from COG gene families (COG_Parsimony). In all cases, the overall number of sites used for performance estimation was 10,000.
In the last row, branch lengths are given rather than estimated from the data.