Kuangyu Wang, Shuhui Yu, Xiang Ji, Clemens Lakner, Alexander Griffing, Jeffrey L Thorne.
Abstract
Models of protein evolution tend to ignore functional constraints, although structural constraints are sometimes incorporated. Here we propose a probabilistic framework for codon substitution that evaluates the joint effects of relative solvent accessibility (RSA), a structural constraint, and gene expression, a functional constraint. First, we explore the relationship between RSA and codon usage at the genomic scale as well as at the individual gene scale. Motivated by these results, we construct our framework by determining how probable an amino acid is, given RSA and gene expression, and then evaluating the relative probability of observing a codon compared to its synonymous codons. We come to the biologically plausible conclusion that both RSA and gene expression are related to amino acid frequencies but that, among synonymous codons, the relative probability of a particular codon is more closely related to gene expression than to RSA. To illustrate the potential applications of our framework, we propose a new codon substitution model. Using this model, we obtain estimates of 2Ns, the product of the effective population size N and the relative fitness difference s of an allele. For a training data set consisting of human proteins with known structures and expression data, 2Ns is estimated separately for synonymous and nonsynonymous substitutions in each protein. We then contrast the patterns of synonymous and nonsynonymous 2Ns estimates across proteins while also taking the gene expression levels of the proteins into account. We conclude that our 2Ns estimates are too concentrated around 0, and we discuss potential explanations for this lack of variability.
Keywords: codon usage; gene expression; protein evolution; protein structure; scaled selection coefficient; solvent accessibility
Year: 2015 PMID: 25987828 PMCID: PMC4415675 DOI: 10.4137/EBO.S22911
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Table 1. Test at genomic scale of whether RSA tendencies vary among synonymous codons.
| SYNONYMOUS CODON GROUPS | AA | HUMAN | MOUSE |
|---|---|---|---|
| GCT, GCC, GCA, GCG | A | 0.0003* | 0.2473 |
| TGT, TGC | C | 0.9381 | 0.2330 |
| GAT, GAC | D | 0.0115* | 0.4074 |
| GAA, GAG | E | <0.0001* | 0.5510 |
| TTT, TTC | F | 0.4852 | 0.3127 |
| GGT, GGC, GGA, GGG | G | 0.0153* | 0.9709 |
| CAT, CAC | H | 0.2765 | 0.3875 |
| ATT, ATC, ATA | I | <0.0001* | 0.0703 |
| AAA, AAG | K | 0.0012* | 0.0320 |
| TTA, TTG, CTT, CTC, CTA, CTG | L | 0.0057* | 0.0975 |
| AAT, AAC | N | 0.0001* | 0.0188 |
| CCT, CCC, CCA, CCG | P | 0.0016* | 0.6651 |
| CAA, CAG | Q | 0.0058* | 0.1058 |
| CGT, CGC, CGA, CGG, AGA, AGG | R | 0.0338* | 0.2547 |
| TCT, TCC, TCA, TCG, AGT, AGC | S | <0.0001* | 0.0326 |
| ACT, ACC, ACA, ACG | T | 0.0034* | 0.3692 |
| GTT, GTC, GTA, GTG | V | 0.0255* | 0.8705 |
| TAT, TAC | Y | 0.0082* | 0.2580 |
Notes: The first column lists the synonymous codon groups, and the second column gives the corresponding amino acid. The final two columns show the P-values of the individual k-sample Anderson-Darling (A-D) tests. To control the false discovery rate at level α = 0.05, the Benjamini-Hochberg (BH) method [48] was applied to the human P-values and, separately, to the mouse P-values. An * indicates significance at α = 0.05.
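The BH step-up procedure mentioned in the notes can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the function name `benjamini_hochberg` and the list names are hypothetical, the P-values are copied from the table above, and entries reported as "<0.0001" are replaced by the upper bound 0.0001, which is an assumption made only so the example can run.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the BH step-up procedure rejects."""
    m = len(p_values)
    # Indices of the P-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Row order of the table: A, C, D, E, F, G, H, I, K, L, N, P, Q, R, S, T, V, Y.
AMINO_ACIDS = "ACDEFGHIKLNPQRSTVY"
# Assumption: "<0.0001" entries (E, I, S) are entered as the upper bound 0.0001.
HUMAN_P = [0.0003, 0.9381, 0.0115, 0.0001, 0.4852, 0.0153, 0.2765, 0.0001,
           0.0012, 0.0057, 0.0001, 0.0016, 0.0058, 0.0338, 0.0001, 0.0034,
           0.0255, 0.0082]
MOUSE_P = [0.2473, 0.2330, 0.4074, 0.5510, 0.3127, 0.9709, 0.3875, 0.0703,
           0.0320, 0.0975, 0.0188, 0.6651, 0.1058, 0.2547, 0.0326, 0.3692,
           0.8705, 0.2580]

human_hits = "".join(aa for aa, r in zip(AMINO_ACIDS, benjamini_hochberg(HUMAN_P)) if r)
mouse_hits = "".join(aa for aa, r in zip(AMINO_ACIDS, benjamini_hochberg(MOUSE_P)) if r)
print("human:", human_hits)
print("mouse:", mouse_hits or "(none)")
```

Under these assumptions the procedure rejects the 15 human tests marked with an asterisk and none of the mouse tests. Several mouse P-values fall below 0.05 (for example N at 0.0188), but none falls below its rank-dependent BH threshold.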
Table 2. Test at individual gene scale of whether RSA tendencies vary among synonymous codons.
| SYNONYMOUS CODON GROUPS | AA | HUMAN | MOUSE |
|---|---|---|---|
| GCT, GCC, GCA, GCG | A | 0.0941 | 0.2033 |
| TGT, TGC | C | 0.2529 | 0.5224 |
| GAT, GAC | D | 0.4482 | 0.2829 |
| GAA, GAG | E | 0.9285 | 0.9118 |
| TTT, TTC | F | 0.4127 | 0.9098 |
| GGT, GGC, GGA, GGG | G | 0.0071 | 0.4487 |
| CAT, CAC | H | 0.2685 | 0.3649 |
| ATT, ATC, ATA | I | 0.1844 | 0.6461 |
| AAA, AAG | K | 0.1098 | 0.7943 |
| TTA, TTG, CTT, CTC, CTA, CTG | L | 0.3073 | 0.4867 |
| AAT, AAC | N | 0.5365 | 0.0688 |
| CCT, CCC, CCA, CCG | P | 0.3841 | 0.2274 |
| CAA, CAG | Q | 0.4883 | 0.7631 |
| CGT, CGC, CGA, CGG, AGA, AGG | R | 0.0730 | 0.4402 |
| TCT, TCC, TCA, TCG, AGT, AGC | S | 0.0466 | 0.4255 |
| ACT, ACC, ACA, ACG | T | 0.1324 | 0.2816 |
| GTT, GTC, GTA, GTG | V | 0.0713 | 0.5674 |
| TAT, TAC | Y | 0.5540 | 0.5869 |
Notes: The first column lists the synonymous codon groups, and the second column gives the corresponding amino acid. The final two columns show the P-values of the individual k-sample Anderson-Darling (A-D) tests. To control the false discovery rate at level α = 0.05, the Benjamini-Hochberg (BH) method [48] was applied to the human P-values and, separately, to the mouse P-values. No tests for either human or mouse were significant at α = 0.05.
Figure 1. Distance matrix for amino acids using combined human data. The distance between two amino acid types is defined by the normalized Kolmogorov-Smirnov (KS) test statistic (Equation 2). The complete-linkage method was used to perform hierarchical clustering.
Figure 2. Distance matrix for amino acids using stratified human data. The distance between two amino acid types is defined by the normalized Kolmogorov-Smirnov (KS) test statistic (Equation 2). The complete-linkage method was used to perform hierarchical clustering. Entries in the matrix are computed by averaging the distances between amino acids over all proteins.
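The construction behind Figures 1 and 2 (pairwise KS distances between RSA distributions, followed by complete-linkage clustering) can be illustrated with a small self-contained sketch. Everything here is an assumption for illustration: the RSA samples are invented toy values, the function names are hypothetical, and the raw two-sample KS statistic is used because the normalization of Equation 2 is not reproduced in this record.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def cluster_distance(c1, c2, dist):
    """Complete linkage: distance between clusters is the largest member distance."""
    return max(dist[frozenset((x, y))] for x in c1 for y in c2)


def complete_linkage(labels, dist):
    """Agglomerative clustering; returns the merges as (cluster, cluster, height)."""
    clusters = [(label,) for label in labels]
    merges = []
    while len(clusters) > 1:
        # Merge the pair of clusters that is closest under complete linkage.
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist),
        )
        merges.append((c1, c2, cluster_distance(c1, c2, dist)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy RSA samples (invented): two buried-type residues and one exposed-type residue.
rsa = {
    "L": [0.05, 0.10, 0.15, 0.20],
    "I": [0.06, 0.12, 0.18, 0.22],
    "K": [0.60, 0.70, 0.80, 0.90],
}
dist = {
    frozenset(pair): ks_statistic(rsa[pair[0]], rsa[pair[1]])
    for pair in combinations(rsa, 2)
}
merges = complete_linkage(list(rsa), dist)
for c1, c2, height in merges:
    print(c1, c2, height)
```

In this toy input the two low-RSA residues merge first and the high-RSA residue joins last. For the stratified variant described in Figure 2, one would compute `dist` per protein and average the entries before clustering.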
Table 3. Codons: MLR-estimated coefficients for gene expression using human data.
| AA | REF CODON | NON-REF CODON | INTERCEPT COEF | EXPRESSION COEF | INTERCEPT P-VALUE | EXPRESSION P-VALUE |
|---|---|---|---|---|---|---|
| A | GCC | GCA | −0.3091 | −0.1650 | 0.0002 | <0.0001 |
| | | GCG | −1.6278 | 0.0331 | <0.0001 | 0.3824 |
| | | GCT | −0.0943 | −0.1363 | 0.2237 | <0.0001 |
| G | GGC | GGA | 0.1279 | −0.1639 | 0.1294 | <0.0001 |
| | | GGG | −0.3834 | −0.0039 | <0.0001 | 0.8958 |
| | | GGT | −0.5043 | −0.0909 | <0.0001 | 0.0076 |
| I | ATC | ATA | −0.6378 | −0.2427 | <0.0001 | <0.0001 |
| | | ATT | 0.0974 | −0.1437 | 0.2063 | <0.0001 |
| L | CTG | CTA | −1.3565 | −0.1966 | <0.0001 | <0.0001 |
| | | CTC | −0.5859 | −0.0597 | <0.0001 | 0.0089 |
| | | CTT | −0.6472 | −0.2047 | <0.0001 | <0.0001 |
| | | TTA | −1.1326 | −0.2292 | <0.0001 | <0.0001 |
| | | TTG | −0.9156 | −0.1259 | <0.0001 | <0.0001 |
| P | CCC | CCA | 0.1390 | −0.1315 | 0.1450 | 0.0001 |
| | | CCG | −1.4229 | 0.0621 | <0.0001 | 0.1675 |
| | | CCT | 0.1333 | −0.1263 | 0.1619 | 0.0001 |
| R | CGG | AGA | 0.3689 | −0.1976 | 0.0012 | <0.0001 |
| | | AGG | 0.2032 | −0.1231 | 0.0754 | 0.0013 |
| | | CGA | 0.0133 | −0.2373 | 0.9170 | <0.0001 |
| | | CGC | 0.0172 | −0.0124 | 0.8795 | 0.7315 |
| | | CGT | −0.6141 | −0.0978 | <0.0001 | 0.0428 |
| S | AGC | AGT | −0.2386 | −0.1159 | 0.0157 | 0.0012 |
| | | TCA | −0.3185 | −0.1327 | 0.0018 | 0.0004 |
| | | TCC | −0.1431 | 0.0058 | 0.1140 | 0.8499 |
| | | TCG | −1.7424 | 0.1118 | <0.0001 | 0.0167 |
| | | TCT | −0.0936 | −0.1069 | 0.3207 | 0.0016 |
| T | ACC | ACA | −0.0842 | −0.0641 | 0.3423 | 0.0341 |
| | | ACG | −1.1397 | 0.0214 | <0.0001 | 0.5834 |
| | | ACT | −0.1699 | −0.1289 | 0.0705 | 0.0001 |
| V | GTG | GTA | −0.9888 | −0.2038 | <0.0001 | <0.0001 |
| | | GTC | −0.6873 | −0.0132 | <0.0001 | 0.6153 |
| | | GTT | −0.6146 | −0.1727 | <0.0001 | <0.0001 |
Notes: The first column denotes amino acid types for synonymous codon groups. Within each synonymous codon group, the most frequent codon for each amino acid is chosen as the reference category, and these codons are shown in the second column. The non-reference codons are listed in the third column. The fourth and fifth columns contain the estimated coefficients for intercept and gene expression in Equation 4. The corresponding P-values of the estimated coefficients can be found in the last two columns.
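To show how such coefficients translate into codon probabilities, here is a minimal sketch that assumes Equation 4 has the standard baseline-category logit form, log(P(codon) / P(reference codon)) = intercept + coefficient × expression. That form, the function name `codon_probabilities`, and the expression values passed in are assumptions; the coefficients are the isoleucine rows of the table above.

```python
import math

# Isoleucine rows of the table: non-reference codon -> (intercept, expression coef).
ILE_REFERENCE = "ATC"
ILE_COEFS = {
    "ATA": (-0.6378, -0.2427),
    "ATT": (0.0974, -0.1437),
}


def codon_probabilities(coefs, reference, expression):
    """Probabilities of synonymous codons under a baseline-category logit model.

    Assumes log(P(codon) / P(reference)) = intercept + coef * expression,
    with the reference codon's linear predictor fixed at 0.
    """
    scores = {reference: 0.0}
    for codon, (intercept, slope) in coefs.items():
        scores[codon] = intercept + slope * expression
    total = sum(math.exp(s) for s in scores.values())
    return {codon: math.exp(s) / total for codon, s in scores.items()}


# Hypothetical expression values on whatever scale the model was fitted.
for x in (0.0, 2.0, 4.0):
    probs = codon_probabilities(ILE_COEFS, ILE_REFERENCE, x)
    print(x, {codon: round(p, 3) for codon, p in probs.items()})
```

Because both expression coefficients for isoleucine are negative, the probability of the reference codon ATC increases with expression under this assumed form, while ATA and ATT become relatively less probable.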
Table 4. Codons: LR-estimated coefficients for gene expression using human data.
| AA | REF CODON | NON-REF CODON | INTERCEPT COEF | EXPRESSION COEF | INTERCEPT P-VALUE | EXPRESSION P-VALUE |
|---|---|---|---|---|---|---|
| C | TGC | TGT | 0.1474 | −0.1109 | 0.1622 | 0.0035 |
| D | GAC | GAT | 0.0650 | −0.0985 | 0.3275 | <0.0001 |
| E | GAG | GAA | 0.1872 | −0.1632 | 0.0018 | <0.0001 |
| F | TTC | TTT | 0.1959 | −0.1009 | 0.0091 | 0.0001 |
| H | CAC | CAT | −0.1216 | −0.1132 | 0.2212 | 0.0018 |
| K | AAG | AAA | −0.0871 | −0.1144 | 0.1714 | <0.0001 |
| N | AAC | AAT | 0.0934 | −0.1340 | 0.2320 | <0.0001 |
| Q | CAG | CAA | −0.6138 | −0.1920 | <0.0001 | <0.0001 |
| Y | TAC | TAT | −0.1045 | −0.0688 | 0.2184 | 0.0213 |
Notes: The first column denotes amino acid types for synonymous codon groups. Within each synonymous codon group, the most frequent codon for each amino acid is chosen as the reference category, and these codons are shown in the second column. The non-reference codons are listed in the third column. The fourth and fifth columns contain the estimated coefficients for intercept and gene expression in Equation 4. The corresponding P-values of the estimated coefficients can be found in the last two columns.
Table 5. Amino acids: MLR-estimated coefficients for RSA and gene expression using human data.
| AA | INTERCEPT COEF | INTERCEPT P-VALUE | RSA COEF | RSA P-VALUE | EXPRESSION COEF | EXPRESSION P-VALUE |
|---|---|---|---|---|---|---|
| A | −0.9449 | <0.0001 | 0.0208 | <0.0001 | 0.0378 | 0.0065 |
| C | −1.2789 | <0.0001 | −0.0080 | <0.0001 | −0.0255 | 0.2123 |
| D | −1.9803 | <0.0001 | 0.0479 | <0.0001 | 0.0320 | 0.0292 |
| E | −2.0101 | <0.0001 | 0.0523 | <0.0001 | 0.0478 | 0.0006 |
| F | −0.7462 | <0.0001 | −0.0034 | 0.0105 | −0.0020 | 0.9009 |
| G | −1.3812 | <0.0001 | 0.0354 | <0.0001 | 0.0309 | 0.0283 |
| H | −2.0176 | <0.0001 | 0.0322 | <0.0001 | −0.0128 | 0.5008 |
| I | −0.5001 | <0.0001 | −0.0057 | <0.0001 | −0.0282 | 0.0634 |
| K | −2.2285 | <0.0001 | 0.0537 | <0.0001 | 0.0526 | 0.0003 |
| M | −1.9446 | <0.0001 | 0.0096 | <0.0001 | 0.0381 | 0.0813 |
| N | −2.0827 | <0.0001 | 0.0455 | <0.0001 | −0.0246 | 0.1393 |
| P | −2.0514 | <0.0001 | 0.0455 | <0.0001 | 0.0295 | 0.0571 |
| Q | −2.1544 | <0.0001 | 0.0464 | <0.0001 | 0.0176 | 0.2721 |
| R | −1.9454 | <0.0001 | 0.0441 | <0.0001 | 0.0373 | 0.0136 |
| S | −1.3001 | <0.0001 | 0.0366 | <0.0001 | −0.0126 | 0.3740 |
| T | −1.5064 | <0.0001 | 0.0329 | <0.0001 | 0.0305 | 0.0421 |
| V | −0.5067 | <0.0001 | 0.0034 | 0.0020 | 0.0304 | 0.0263 |
| W | −1.9197 | <0.0001 | 0.0022 | 0.2615 | −0.0159 | 0.5218 |
| Y | −1.2309 | <0.0001 | 0.0109 | <0.0001 | 0.0019 | 0.9128 |
Notes: The first column shows amino acid types in their one-letter format. The second and third columns contain estimated values and their corresponding P-values of the intercept (see Equation 5). The fourth and fifth columns show estimated coefficients and their P-values of RSA in MLR. The last two columns show coefficient estimates and P-values of log-transformed gene expression. The most frequent amino acid leucine (L) is not included in the table because it is chosen as the reference category for MLR.
Figure 3. Predicted probabilities of four amino acids across possible ranges of RSA and gene expression. Probabilities for each amino acid were calculated from the estimated RSA and gene expression coefficients (see Table 5). The horizontal axis covers the possible range of RSA, and the vertical axis covers gene expression. Red regions indicate relatively high probabilities, while blue regions indicate low probabilities. The four amino acids depicted here (histidine, leucine, proline, and serine) were selected for the diversity of heatmap patterns that they represent. Heatmaps for the other 16 amino acids can be found in Supplementary Figure 1.
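The probabilities behind such heatmaps can be recomputed from the Table 5 coefficients. The sketch below assumes that Equation 5 is a baseline-category logit with leucine as the reference, log(P(aa) / P(L)) = intercept + b_RSA × RSA + b_expr × log expression. It also assumes, judging only from the magnitude of the RSA coefficients, that RSA is measured in percent (0 to 100). The function name and the grid values are hypothetical.

```python
import math

# Table 5 coefficients: amino acid -> (intercept, RSA coef, expression coef).
# Leucine (L) is the reference category, so its linear predictor is 0.
AA_COEFS = {
    "A": (-0.9449, 0.0208, 0.0378), "C": (-1.2789, -0.0080, -0.0255),
    "D": (-1.9803, 0.0479, 0.0320), "E": (-2.0101, 0.0523, 0.0478),
    "F": (-0.7462, -0.0034, -0.0020), "G": (-1.3812, 0.0354, 0.0309),
    "H": (-2.0176, 0.0322, -0.0128), "I": (-0.5001, -0.0057, -0.0282),
    "K": (-2.2285, 0.0537, 0.0526), "M": (-1.9446, 0.0096, 0.0381),
    "N": (-2.0827, 0.0455, -0.0246), "P": (-2.0514, 0.0455, 0.0295),
    "Q": (-2.1544, 0.0464, 0.0176), "R": (-1.9454, 0.0441, 0.0373),
    "S": (-1.3001, 0.0366, -0.0126), "T": (-1.5064, 0.0329, 0.0305),
    "V": (-0.5067, 0.0034, 0.0304), "W": (-1.9197, 0.0022, -0.0159),
    "Y": (-1.2309, 0.0109, 0.0019),
}


def amino_acid_probabilities(rsa, log_expression):
    """P(amino acid | RSA, expression) under the assumed baseline-category logit."""
    scores = {"L": 0.0}
    for aa, (intercept, b_rsa, b_expr) in AA_COEFS.items():
        scores[aa] = intercept + b_rsa * rsa + b_expr * log_expression
    total = sum(math.exp(s) for s in scores.values())
    return {aa: math.exp(s) / total for aa, s in scores.items()}


# Small grid over assumed ranges: RSA in percent, log expression in arbitrary units.
for rsa in (0, 50, 100):
    probs = amino_acid_probabilities(rsa, log_expression=0.0)
    print(rsa, {aa: round(probs[aa], 3) for aa in "HLPS"})
```

Under these assumptions the predicted probability of leucine falls as RSA increases, while residues with large positive RSA coefficients, such as lysine, become more probable at exposed sites.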
Figure 4. Comparison of S = 2Ns estimates between nonsynonymous and synonymous point mutations in human genes. For each protein-coding human gene, estimates of S were obtained for possible nonsynonymous (red) and synonymous (blue) point mutations at each site. The x-axis represents the logarithm of the geometric mean across tissues of the expression measurements for a gene. The y-axis represents the inferred scaled selection coefficients (S). For each gene, the distribution of scaled selection coefficients among point mutations is summarized by the 5th percentile (square), the 50th percentile (triangle), and the 95th percentile (circle).
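The two summaries plotted in Figure 4 are simple to compute. The sketch below uses only the Python standard library. The function names are hypothetical, the natural logarithm is an assumption (the caption does not state the base), the percentile interpolation rule is also an assumption, and the input numbers are invented for illustration.

```python
import math
import statistics


def log_geometric_mean(values):
    """Logarithm of the geometric mean, computed as the mean of the logs."""
    return math.fsum(math.log(v) for v in values) / len(values)


def percentile_summary(s_values):
    """Return the 5th, 50th, and 95th percentiles of a gene's S estimates.

    Uses linear interpolation between order statistics (method="inclusive").
    """
    cuts = statistics.quantiles(s_values, n=100, method="inclusive")
    return cuts[4], cuts[49], cuts[94]


# Invented expression measurements for one gene across four tissues.
tissue_expression = [12.0, 30.0, 7.5, 18.0]
# Invented S estimates for the possible point mutations in that gene.
s_estimates = [-0.8, -0.3, -0.1, 0.0, 0.0, 0.05, 0.1, 0.2, 0.4]

x = log_geometric_mean(tissue_expression)
p5, p50, p95 = percentile_summary(s_estimates)
print(f"x = {x:.3f}; S percentiles: 5th = {p5:.3f}, 50th = {p50:.3f}, 95th = {p95:.3f}")
```

Each gene would contribute one x value and two such percentile triples, one for its nonsynonymous and one for its synonymous point mutations.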