Abstract
The CpG dinucleotide is disproportionately represented in human genetic variation due to the hypermutability of 5-methyl-cytosine (5mC). We exploit this hypermutability and a novel codon substitution model to identify candidate functionally important exonic nucleotides. Population genetic theory suggests that codon positions with high cross-species CpG frequency will derive from stronger purifying selection. Using the phylogeny-based maximum likelihood inference framework, we applied codon substitution models with context-dependent parameters to measure the mutagenic and selective processes affecting CpG dinucleotides within exonic sequence. The suitability of these models was validated on >2,000 protein coding genes from a naturally occurring biological control, four yeast species that do not methylate their DNA. As expected, our analyses of yeast revealed no evidence for an elevated CpG transition rate or for substitution suppression affecting CpG-containing codons. Our analyses of >12,000 protein-coding genes from four primate lineages confirm the systemic influence of 5mC hypermutability on the divergence of these genes. After adjusting for confounding influences of mutation and the properties of the encoded amino acids, we confirmed that CpG-containing codons are under greater purifying selection in primates. Genes with significant evidence of enhanced suppression of nonsynonymous CpG changes were also shown to be significantly enriched in Online Mendelian Inheritance in Man. We developed a method for ranking candidate phenotypically influential CpG positions in human genes. Application of this method indicates that of the ∼1 million exonic CpG dinucleotides within humans, ∼20% are strong candidates for both hypermutability and disease association.
Year: 2011 PMID: 21398426 PMCID: PMC3184784 DOI: 10.1093/gbe/evr021
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
MAA Substitutions
| Sub. type | aa sub. | Codon sub. |
| CpG transitions | A | GCG |
| | C | TGC |
| | H | CAC |
| | L | CTG |
| | L | TTG |
| | M | ATG |
| | Q | CAA |
| | R | CGG |
| | A | GCA |
| Non-CpG counterparts | L | CTA |
| | L | TTA |
| | R | AGG |
Note.—The MAAs are defined by amino acid substitutions resulting from CpG transitions. The non-CpG counterparts are other codon substitution events that give rise to the same amino acid substitution. Sub. type, substitution type; aa sub., amino acid substitution; codon sub., codon substitutions.
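To make the notion of a codon-level CpG transition concrete, the following minimal sketch flags codon pairs that differ by a single transition overlapping a CpG dinucleotide. The function name `is_cpg_transition` is hypothetical, and the sketch assumes the CpG lies entirely within one codon; CpGs that span a codon boundary are ignored. It is an illustration only, not the authors' implementation.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("C", "T"), ("T", "C"), ("G", "A"), ("A", "G")}


def is_cpg_transition(codon_from, codon_to):
    """Return True if the codons differ by exactly one transition that
    overlaps a within-codon CpG in either codon (assumption: CpGs that
    span codon boundaries are not considered)."""
    diffs = [i for i in range(3) if codon_from[i] != codon_to[i]]
    if len(diffs) != 1:
        return False  # identical codons or multiple differences
    pos = diffs[0]
    if (codon_from[pos], codon_to[pos]) not in TRANSITIONS:
        return False  # the single change is a transversion
    # A CpG can start at codon position 0 or 1; the changed position
    # must be one of the two bases of that dinucleotide.
    for codon in (codon_from, codon_to):
        for start in (0, 1):
            if codon[start:start + 2] == "CG" and start <= pos <= start + 1:
                return True
    return False


# GCG -> GCA changes the G of a CpG by a transition.
print(is_cpg_transition("GCG", "GCA"))
# CTG -> CTA is a transition, but no CpG is involved.
print(is_cpg_transition("CTG", "CTA"))
```

Checking both the source and the destination codon treats a change that destroys a CpG and one that creates a CpG symmetrically, which is a design choice of this sketch.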
Figure. Distributions of CpG substitution parameter MLEs from primate and yeast. Shown are the MLE for G.K from the CNFGTR + G.K model where LRT(1) was nominally significant (P < 0.05), and the MLE for G from the CNFGTR + G.K + G model where both LRT(1) and LRT(2) were nominally significant (P < 0.05). Frequency is the number of significant alignments whose MLE lay within the indicated bin. The vertical dotted blue line corresponds to the parameter having no effect. MLEs > 15 were collapsed into the 15+ bin.
LRTs for Distinctive Mutation and Selection Influences Affecting CpG-Containing Codons
| Clade | LRT (#) | H | +Param | % (P < 0.05) | % (Bonf P < 0.05) |
| Primate | 1 | CNFGTR | | 66.33 | 17.36 |
| | 2 | CNFGTR + G.K | | 17.63 | 0.82 |
| | 3 | CNFGTR + G.K | α | 22.57 | 0.75 |
| | 4 | CNFGTR + G.K + | | 11.02 | 0.07 |
| | 5 | CNFGTR + G.K | | 22.50 | 1.03 |
| Yeast | 1 | CNFGTR | | 15.59 | 0.00 |
| | 2 | CNFGTR + G.K | | 9.20 | 0.00 |
| | 3 | CNFGTR + G.K | α | 38.29 | 1.11 |
| | 4 | CNFGTR + G.K + | | 11.37 | 0.00 |
| | 5 | CNFGTR + G.K | | 18.12 | 0.16 |
Note.—The total number of alignments was 12,092 for primate and 2,533 for yeast. H, null hypothesis; +Param, parameter added to the null to make the alternate hypothesis; % (P < 0.05), percentage of total alignments significant at the nominal 0.05 level; % (Bonf P < 0.05), percentage of alignments significant after adjustment for multiple tests. If a test was the first in a series, the adjustment was for the total number of alignments; if it was the second in a series, the adjustment was for the number of alignments that remained significant after correction in the earlier test. For example, the correction for primate LRT(1) was 12,092, but for LRT(2) it was 2,099, the number of LRT(1) results significant after correcting for multiple tests.
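The sequential correction described in the note can be sketched as follows. This is a minimal illustration under stated assumptions: each LRT is taken to add a single parameter, so the statistic is compared against a chi-square distribution with 1 degree of freedom, and the function names and example log-likelihoods are hypothetical rather than taken from the study.

```python
from math import erfc, sqrt


def lrt_p_value(lnl_null, lnl_alt):
    """P value of a likelihood ratio test, assuming 1 degree of freedom.

    For 1 df the chi-square survival function reduces to
    erfc(sqrt(x / 2)), so no external statistics library is needed."""
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0.0:
        return 1.0
    return erfc(sqrt(stat / 2.0))


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Bonferroni criterion: compare p * n_tests against alpha."""
    return p_value * n_tests < alpha


# First test in the series: correct for all primate alignments.
N_PRIMATE = 12_092
# Second test in the series: correct only for the alignments that survived
# correction in LRT(1), i.e. 17.36% of 12,092.
n_lrt1_sig = round(0.1736 * N_PRIMATE)
print(n_lrt1_sig)

# Hypothetical log-likelihoods for one alignment (illustrative values only).
p1 = lrt_p_value(lnl_null=-1000.0, lnl_alt=-985.0)
print(bonferroni_significant(p1, N_PRIMATE))
```

Because the second-stage correction uses the smaller count of alignments that passed the first stage, it is less stringent than correcting every test for the full 12,092 alignments.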
Figure. Selection affecting CpG-containing codons in primates. MLEs were from CNFGTR + G.K + G.K.ω + α for genes with nominally significant LRT(5) (P < 0.05).
Genes Identified with Suppression of Nonsynonymous CpG Substitutions Were Enriched in OMIM
| | Nondisease Association | Disease Association | P |
| Non-CpG effect | 5,026 | 1,242 | |
| CpG effect | 1,040 | 321 | 1.14 × 10⁻³ |
Disease association status was defined according to presence/absence of an allelic variant in the OMIM record.
The probability from a Fisher's exact test that the frequency of CpG effect genes in the disease association class is the same as, or lower than, the frequency in the nondisease class.
Both LRT(3) and LRT(5) were nominally significant; Ĝ.K from LRT(3) was > 1 and ω from LRT(5) was < 1.
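The one-sided Fisher's exact test on the 2 × 2 table above can be sketched with the hypergeometric distribution. This is a minimal standard-library illustration; the function names are hypothetical, and the choice of the upper tail (at least as many disease-associated CpG effect genes as observed) is an assumption based on the footnote's wording.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_upper_tail(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= d) with all margins fixed, where X is the count in
    the bottom-right cell (here: CpG effect genes with disease association)."""
    n = a + b + c + d
    row2 = c + d  # CpG effect genes
    col2 = b + d  # disease-associated genes
    log_denom = log_comb(n, col2)
    total = 0.0
    for k in range(d, min(row2, col2) + 1):
        log_p = log_comb(row2, k) + log_comb(n - row2, col2 - k) - log_denom
        total += exp(log_p)
    return total


# Counts from the table: rows are non-CpG effect / CpG effect,
# columns are nondisease / disease association.
p = fisher_upper_tail(5026, 1242, 1040, 321)
odds_ratio = (5026 * 321) / (1242 * 1040)
print(p, odds_ratio)
```

Working in log space with `lgamma` avoids building the very large integers that exact binomial coefficients would require for a table of several thousand genes.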
Figure. Identification of phenotypically influential CpG-containing codons in F8. Horizontal red lines: the 1st and 99th percentiles of the LRT from data simulated under the null hypothesis; yellow triangle: codons with conserved CpG; magenta star: codons with synonymous CpG substitutions; green diamond: codons with nonsynonymous CpG substitutions.
Ranking of Exonic CpG for Potential Deleterious Impact of Genetic Variation
| Rank | Criteria | Number |
| 1 | No CpG in humans | 185,919 |
| 2 | CpG in humans but nonsynonymous changes in other species or nonsynonymous in humans at a codon position not within the CpG | 222,582 |
| 3 | CpG in humans with nonsynonymous SNP, CpG is minor allele | 5,630 |
| 4 | Conserved CpG | 359,388 |
| 5 | CpG in humans with synonymous change in human or other species or with SNP, CpG is major allele | 123,896 |
| 6 | Conserved CpG within gene showing a nominally significant LRT(5) | 102,738 |
| 7 | CpG in humans with synonymous change in human or other species within gene showing a nominally significant LRT(5) or with human SNP where CpG is major allele | 42,057 |
Note.—From 14,348 genes, there were 1,042,210 exonic CpG. Number is the total of CpG at each rank.
CpG within humans satisfying criteria.
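As a quick arithmetic check on the ranking table, the per-rank counts can be summed and expressed as shares of all exonic CpG. The sketch below uses only the numbers reported in the table; the variable names are illustrative.

```python
# Number of exonic CpG at each rank, as reported in the table.
rank_counts = {
    1: 185_919,
    2: 222_582,
    3: 5_630,
    4: 359_388,
    5: 123_896,
    6: 102_738,
    7: 42_057,
}

total = sum(rank_counts.values())
# Percentage of all exonic CpG falling at each rank.
shares = {rank: 100.0 * n / total for rank, n in rank_counts.items()}

print(total)
for rank, pct in shares.items():
    print(rank, round(pct, 1))
```

The sum reproduces the 1,042,210 exonic CpG stated in the note.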