Wayne Delport, Konrad Scheffler, Mike B Gravenor, Spencer V Muse, Sergei Kosakovsky Pond.
Abstract
The single rate codon model of non-synonymous substitution is ubiquitous in phylogenetic modeling. Indeed, the use of a non-synonymous to synonymous substitution rate ratio parameter has facilitated the interpretation of selection pressure on genomes. Although the single rate model has achieved wide acceptance, we argue that the assumption of a single rate of non-synonymous substitution is biologically unreasonable, given observed differences in substitution rates evident from empirical amino acid models. Some have attempted to incorporate amino acid substitution biases into models of codon evolution and have shown improved model performance versus the single rate model. Here, we show that the single rate model of non-synonymous substitution is easily outperformed by a model with multiple non-synonymous rate classes, yet in which amino acid substitution pairs are assigned randomly to these classes. We argue that, since the single rate model is so easy to improve upon, new codon models should not be validated entirely on the basis of improved model fit over this model. Rather, we should strive to both improve on the single rate model and to approximate the general time-reversible model of codon substitution, with as few parameters as possible, so as to reduce model over-fitting. We hint at how this can be achieved with a Genetic Algorithm approach in which rate classes are assigned on the basis of sequence information content.
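The random multi-rate model described in the abstract can be illustrated with a minimal sketch. It assumes the universal genetic code and considers only amino acid pairs whose codons differ by a single nucleotide; the function names `one_step_amino_acid_pairs` and `random_rate_classes` are hypothetical and are not taken from the paper or from any particular software package.

```python
import itertools
import random

BASES = "TCAG"
# Universal genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}


def one_step_amino_acid_pairs():
    """Return unordered amino acid pairs linked by a single-nucleotide codon change."""
    pairs = set()
    for codon, aa in CODON_TABLE.items():
        for pos in range(3):
            for base in BASES:
                neighbour = codon[:pos] + base + codon[pos + 1:]
                other = CODON_TABLE[neighbour]
                # Skip stop codons and synonymous changes.
                if "*" not in (aa, other) and aa != other:
                    pairs.add(tuple(sorted((aa, other))))
    return sorted(pairs)


def random_rate_classes(pairs, n_classes=5, seed=None):
    """Assign each amino acid pair to one of n_classes rate classes uniformly at random."""
    rng = random.Random(seed)
    return {pair: rng.randrange(n_classes) for pair in pairs}


pairs = one_step_amino_acid_pairs()
assignment = random_rate_classes(pairs, n_classes=5, seed=1)
print(len(pairs), "pairs assigned to", len(set(assignment.values())), "classes")
```

Each such assignment defines one random permutation; fitting the resulting model by maximum likelihood is outside the scope of this sketch.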
Year: 2010 PMID: 20657773 PMCID: PMC2908124 DOI: 10.1371/journal.pone.0011587
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Comparison of single rate versus random models for 3 alignments.
| Alignment | Taxa | Sites | | |
|---|---|---|---|---|
| Pandit PF00803 | 13 | 27 | 43 (15) | 96 (80) |
| Rhodopsin | 38 | 330 | 80 (66) | 100 (99) |
| HIV-1 group M | 98 | 541 | 80 (56) | 99 (96) |
Columns give the number of taxa, the number of sites, and the number of random permutations out of 100 that showed significantly improved fit over the single rate model (likelihood ratio test). Numbers in parentheses are based on a Bonferroni-corrected significance threshold.
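The counting procedure behind this table can be sketched as follows. The significance threshold is not recoverable from the text, so `alpha=0.05` is an assumption, as is applying the Bonferroni correction across the set of permutations tested. The 4 degrees of freedom assume a 5-class model compared with a single rate model. The chi-square tail uses a closed form that holds only for even degrees of freedom, and all function names are hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def lrt_p_value(lnl_null, lnl_alt, extra_params):
    """Likelihood ratio test p-value for nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return chi2_sf_even_df(stat, extra_params)


def count_significant(lnl_null, lnl_alts, extra_params=4, alpha=0.05):
    """Count alternatives that beat the null, before and after Bonferroni correction."""
    p_values = [lrt_p_value(lnl_null, lnl, extra_params) for lnl in lnl_alts]
    raw = sum(p < alpha for p in p_values)
    corrected = sum(p < alpha / len(p_values) for p in p_values)
    return raw, corrected


# Illustrative log likelihoods only; these are not values from the paper.
print(count_significant(-1000.0, [-999.0, -995.0, -985.0]))
```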
Figure 1. Distribution of log likelihood scores for multi-rate models where amino-acid substitutions are assigned randomly to 5 non-synonymous classes.
The fits of the single rate (SR), linear combination of amino acid properties (LCAP), empirical codon model (ECM), Genetic Algorithm (GA) and general reversible (REV) models are shown as upside-down triangles. The numbers of rate classes inferred by the GA are 3, 4 and 5 for PF00803, rhodopsin and HIV-1 pol, respectively. Dashed lines indicate the log likelihood required to (i) reject the single rate model in favor of a 5 rate model (left), and (ii) reject a 5 rate model in favor of REV (right). All models were fitted using maximum likelihood estimates of position-specific nucleotide frequencies.
Comparison of empirical model fits using BIC.
| | SR | ECM | LCAP | Random (5 classes) | GA | REV |
|---|---|---|---|---|---|---|
| # rate parameters | 1 | 0 | 5 | 5 | | 75 |
| Pandit PF00803 | 20978.6 | 20667.2 | 20844.9 | 20970.3 | | 21032 |
| Rhodopsin | 27514.5 | 27994 | 27223.6 | 27454.6 | | 27224.6 |
| HIV-1 | 45729.2 | 48683.2 | 45394 | 45658.5 | | 45314.1 |
The best model (with smallest BIC) is shown in boldface. The BIC for the random model, in which amino acid substitution pairs are randomly assigned to one of 5 rate classes, is estimated as the mean BIC over 100 permutations of rate class assignment. The number of rate classes identified using the Genetic Algorithm model fitting procedure described in [20] is shown in parentheses for each alignment. All models were fitted using maximum likelihood estimates of position-specific nucleotide frequencies.
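As a rough illustration of the comparison in this table, the sketch below computes BIC from a log likelihood, averages it over permutations, and picks the model with the smallest score. It uses the standard definition BIC = -2 lnL + k ln(n). What the authors counted as the sample size n is not stated here, so `sample_size` is an assumption, and the function names are hypothetical.

```python
import math


def bic(log_likelihood, n_params, sample_size):
    """Bayesian Information Criterion; smaller is better."""
    return -2.0 * log_likelihood + n_params * math.log(sample_size)


def mean_bic(log_likelihoods, n_params, sample_size):
    """Mean BIC over several fits that share the same number of parameters."""
    scores = [bic(lnl, n_params, sample_size) for lnl in log_likelihoods]
    return sum(scores) / len(scores)


def best_model(bic_by_model):
    """Name of the model with the smallest BIC."""
    return min(bic_by_model, key=bic_by_model.get)


# Illustrative numbers only; these are not values from the paper.
scores = {
    "single_rate": bic(-5000.0, n_params=1, sample_size=300),
    "random_5": mean_bic([-4990.0, -4985.0], n_params=5, sample_size=300),
}
print(best_model(scores))
```

Because the penalty term grows with the number of rate parameters, a richer model is preferred only when its gain in log likelihood outweighs that penalty.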