Sergei Kosakovsky Pond, Wayne Delport, Spencer V Muse, Konrad Scheffler.
Abstract
Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators is biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a "corrected" empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of sequence alignments, our estimators show a significant improvement in goodness of fit compared to the approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the -style estimators.
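The role of stop codons in this bias can be illustrated with a minimal sketch. It assumes, purely for illustration, that codon equilibrium frequencies are built as products of four nucleotide frequency parameters and renormalized over the 61 sense codons of the universal genetic code; the abstract does not specify the estimator's exact form, and the names `codon_equilibrium`, `implied_nuc_freqs` and the numeric values in `params` are hypothetical. The sketch shows that the nucleotide composition implied by the resulting codon distribution differs from the frequency parameters that were plugged in, which is why equating the parameters with observed nucleotide frequencies is not innocuous.

```python
from itertools import product

NUCS = "ACGT"
STOPS = {"TAA", "TAG", "TGA"}  # universal genetic code (assumption)


def codon_equilibrium(nuc_freqs):
    """Product of nucleotide frequency parameters over sense codons,
    renormalized so the 61 sense-codon frequencies sum to one."""
    raw = {}
    for codon in map("".join, product(NUCS, repeat=3)):
        if codon in STOPS:
            continue  # stop codons are excluded from the state space
        p = 1.0
        for n in codon:
            p *= nuc_freqs[n]
        raw[codon] = p
    total = sum(raw.values())
    return {c: p / total for c, p in raw.items()}


def implied_nuc_freqs(codon_freqs):
    """Nucleotide composition implied by a codon distribution,
    averaged over the three codon positions."""
    out = dict.fromkeys(NUCS, 0.0)
    for codon, p in codon_freqs.items():
        for n in codon:
            out[n] += p / 3.0
    return out


# Hypothetical frequency parameters, chosen only for illustration.
params = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
pi = codon_equilibrium(params)
implied = implied_nuc_freqs(pi)

for n in NUCS:
    print(n, params[n], round(implied[n], 4))
```

Because the three stop codons contain A, G and T but no C, removing them shifts the implied composition away from the parameter values. A corrected estimator in the spirit described above would instead choose the parameters so that the implied composition matches the observed nucleotide counts.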
Year: 2010 PMID: 20689581 PMCID: PMC2912764 DOI: 10.1371/journal.pone.0011230
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Relationships between empirical frequencies, frequency parameters and equilibrium frequencies in codon models.
Figure 2. Comparison of frequency parameterizations fitted to simulated alignments.
The top row (A,B) shows the comparison of scores on simulated data obtained with different corrected frequency estimates; C) Bias in the estimate of the substitution rate in near-asymptotic regime () is apparent under , but does not exist for the other two estimators; D) variance of the estimate for is reduced with increasing sample size.
Figure 3. The effect of the frequency estimator on the inference of and (relative to the rate) substitution rate from alignments sampled from the Pandit database [10].
The estimate of under is biased downwards relative to .