Literature DB >> 25987828

Roles of solvent accessibility and gene expression in modeling protein sequence evolution.

Kuangyu Wang¹, Shuhui Yu², Xiang Ji¹, Clemens Lakner¹, Alexander Griffing¹, Jeffrey L Thorne¹.

Abstract

Models of protein evolution tend to ignore functional constraints, although structural constraints are sometimes incorporated. Here we propose a probabilistic framework for codon substitution that evaluates joint effects of relative solvent accessibility (RSA), a structural constraint; and gene expression, a functional constraint. First, we explore the relationship between RSA and codon usage at the genomic scale as well as at the individual gene scale. Motivated by these results, we construct our framework by determining how probable is an amino acid, given RSA and gene expression, and then evaluating the relative probability of observing a codon compared to other synonymous codons. We come to the biologically plausible conclusion that both RSA and gene expression are related to amino acid frequencies, but, among synonymous codons, the relative probability of a particular codon is more closely related to gene expression than RSA. To illustrate the potential applications of our framework, we propose a new codon substitution model. Using this model, we obtain estimates of 2N s, the product of effective population size N, and relative fitness difference of allele s. For a training data set consisting of human proteins with known structures and expression data, 2N s is estimated separately for synonymous and nonsynonymous substitutions in each protein. We then contrast the patterns of synonymous and nonsynonymous 2N s estimates across proteins while also taking gene expression levels of the proteins into account. We conclude that our 2N s estimates are too concentrated around 0, and we discuss potential explanations for this lack of variability.

Entities: CellLine Chemical Disease Gene Species

Keywords: codon usage; gene expression; protein evolution; protein structure; scaled selection coefficient; solvent accessibility

Year: 2015 PMID： 25987828 PMCID： PMC4415675 DOI： 10.4137/EBO.S22911

Source DB: PubMed Journal: Evol Bioinform Online ISSN： 1176-9343 Impact factor: 1.625

Introduction

Proteins perform a vast array of functions within living organisms. As a result, natural selection affects both the structure of proteins and the regulation of their production. A major source of structural constraints arises because proteins require a stable and suitable three-dimensional structure to function. Mutations that destabilize proteins will be selected against.1 Also, selective constraints in a protein vary according to structural locations. For example, Franzosa and Xia2 found a strong, positive, and linear relationship between the ratio of nonsynonymous and synonymous rates and a measure of solvent exposure at the structural location. While structural constraints have mainly been investigated to explain nonsynonymous rate variation within proteins, functional constraints have been shown to explain a large amount of the variation in average nonsynonymous rate among proteins.3–5 For example, Drummond et al.4 showed that expression level explains roughly half the variation in average nonsynonymous rate among Saccharomyces cerevisiae protein-coding genes. To reflect structural constraints when modeling protein evolution, the pioneering simulation work of Parisi and Echave6 had a statistical potential to govern the amino acid replacement process. A similar approach was adopted by later inferential studies.7–10 Statistical potentials can include terms related to solvent accessibility, pairwise distance interactions, torsion angles, and flexibility of the residues.11 Structurally constrained evolutionary models have tended to place less emphasis on codon usage. The assumption that synonymous mutations are selectively neutral has often been made.7,8,10 This may be a reasonable first assumption, because so much natural selection depends on the protein sequence. However, synonymous change is also affected by natural selection. For example, Agashe et al.12 found that synonymous mutations to a key-enzyme coding gene can decrease gene expression and fitness by more than 90% compared to wild type. Also, selection acting on gene expression was found to be the single dominant predictor, among all predictors considered, of the number of nonsynonymous substitutions per site in yeast.4 To explain the negative correlations between rates of coding sequence evolution and gene expression levels that have been inferred across a wide range of taxa, Drummond and Wilke13 suggested that mistranslation-induced protein misfolding explains much of coding sequence evolution. Another study discovered that translationally optimal codons are associated with structurally sensitive sites.14 These findings indicate that structural and functional constraints are coupled. While protein structures are increasingly considered in protein evolution models, less attention has been paid to the combined effect of gene expression and protein structure. The overall codon usage in a genome can be dramatically different between species.15 Chen et al.16 concluded that different patterns of codon usage between species are determined primarily by mutational processes that act throughout the genome and only secondarily by selective forces acting on protein-coding sequences. From a selectionist point of view, a classic explanation for systematic variation across a genome is that certain preferred codons are translated more accurately and/or efficiently.17–19 Strong evidence for this hypothesis has been found in several species.20–24 However, 30% of bacterial species show no evidence of translational selection.25 Understanding codon usage patterns continues to be an active area of research. Relative solvent accessibility (RSA) is a summary of local structural environment at a protein location that aims to quantify the relative exposure of an amino acid in a globular protein to water molecules. It has been a widely used summary that is correlated with rates and patterns of protein evolution.2 Here, we attempt to characterize the roles played by RSA and gene expression in modeling protein evolution. We propose a probabilistic framework aiming to combine effects of RSA and gene expression on amino acid usage as well as codon usage. We carry out hypothesis testing in order to address whether RSA tendencies vary among synonymous codons. To capture potentially species-specific association between RSA and codon usage, the testing procedure is conducted at the genomic scale as well as at the individual gene scale in two different species (Mus musculus and Homo sapiens). We assess the effects of gene expression on synonymous codon usage and the combined effects of RSA and gene expression in influencing amino acid usage. We do this via a multinomial logistic regression (MLR) approach that connects RSA and gene expression to evolution using correlation rather than causality. To demonstrate possible applications of our probabilistic framework, we propose a new protein evolutionary model that accounts for both the structural and functional contexts of a codon. Our model has a mutation-selection balance framework that incorporates the selective impacts of possible synonymous and nonsynonymous mutations on human protein-coding genes. The selective impacts of these mutations are assessed by our estimates of scaled selection coefficients (ie, twice the product of effective population size N and the relative fitness difference s between the mutant and wild-type allele).

Materials and Methods

Structural, sequence, and expression data

We collected protein structures, amino acid sequences, nucleotide sequences, and protein-coding gene expression data for two species, H. sapiens and M. musculus. Protein structures were obtained from the Protein Data Bank (PDB).26 Only structures determined by X-ray crystallography with a resolution of 3.0 Å or better were employed. In addition, proteins were required to have lengths greater than 50 amino acids. In an effort to collect a relatively homogeneous data set to which a single model applies, we restrict our data to monomeric proteins with one chain. Also for the purposes of gathering a relatively homogeneous data set, membrane proteins and protein–DNA/RNA hybrid structures were excluded. To lessen problems arising from estimating parameters when proteins are correlated because of common ancestry, homologs were removed from our data by employing a 30% identity filter. Gene expression measurements were not used to determine which homologs were included or excluded. Because the Consensus CDS (CCDS) database27 stores a core set of human and mouse protein-coding regions that are associated with highly reliable annotation, it was used to identify nucleotide sequences that encode the proteins selected from PDB. To get CCDS IDs, protein PDB IDs were translated to the UniProt Knowledgebase (UniProtKB) IDs. The UniProt IDs were then mapped to Gene IDs. Finally, Gene IDs were converted to CCDS IDs. Proteins with CCDS IDs that could not be successfully identified were removed. To match a protein in PDB with its corresponding protein-coding nucleotide sequence, we used the Smith–Waterman algorithm28 of the water program from the EMBOSS tools.29 This let us identify the longest ungapped region with an exact match between the nucleotide sequence and amino acid sequence. This process yielded 864 and 156 matches in H. sapiens and M. musculus, respectively (see PDB IDs in Supplementary files). For each site in the ungapped region in each protein structure, relative solvent accessibility (RSA) was calculated by the NACCESS software.30 The presence of heteroatoms in some structures was ignored when calculating accessibilities. RNA-seq data for H. sapiens and M. musculus were collected as part of a multi-species multi-organ gene expression study31 and were downloaded from the Expression Atlas database.32 To establish one-to-one mapping between the PDB ID of a protein and the Ensembl ID of a transcript, PDB IDs were translated to the UniProtKB IDs and then the UniProt IDs were mapped to Ensembl IDs via the ID mapping service hosted by the Protein Information Resource.33 Following a previous study by Drummond and Wilke,13 aggregate mRNA level was quantified as the geometric mean signal of the measurements from the six tissues in the Brawand et al data set. We log-transformed these geometric means in the following analysis. There are 241 human proteins and 60 mouse proteins in our data set that have both structure and gene expression information available.

A probabilistic framework for assessing the joint effect of RSA and gene expression on amino acid and codon usage

Our probabilistic framework is based on the idea that the propensity of a codon representing a given site of a protein can be predicted by a two-step process: First, estimate the probability of the corresponding amino acid by the observed frequency of that amino acid at similar structural and functional contexts in other proteins. Second, predict the relative probability of observing a codon conditional upon the amino acid, the structural context, and the functional context. For a codon C of a protein sequence, let A be the amino acid encoded by this codon. Here, R is the RSA at the site and represents the structural context, while E is the expression level of the gene and represents the functional context. The probability of obtaining C conditional upon the context is then

Testing association between codon usage and RSA

Codon usage is correlated with gene expression.19 Furthermore, amino acid usage is RSA dependent2,34,35 and also influenced by gene expression.36,37 Zhou et al.14 tested the association between codon optimality and solvent accessibility. Instead of focusing on codon optimality that primarily reflects the influence of selection for translation speed and/or accuracy on codon usage, we test a slightly more general null hypothesis that, conditional upon the amino acid, codon usage is independent of RSA. To do this, a k-sample Anderson–Darling (A-D) test38 was applied to compare RSA distributions of different synonymous codons encoding an amino acid. Here, k equals the number of synonymous codons. For example, k = 6 if a test is conducted for serine. The A-D test is nonparametric and distribution free. To get the null distributions of A-D test statistics, we permuted RSA values among codons that specify the same amino acid type. By doing this, the association between amino acid usage and structural environments was not affected, but any association between codon usage and structural environments was eliminated. For each test, 10,000 simulated A-D test statistics were generated via this sort of permutation. The actual computation was done with the ad.test function in the R package kSamples.38 We can test whether RSA distributions of synonymous codons are different either at the genomic scale or at the individual gene scale. At the genomic scale, for each of the 18 amino acids that have more than one codon, a k-sample A-D test was conducted. Positions in all genes that have the same amino acid type were pooled together, and RSA values were permuted among these positions. In each test, the null distribution was formed by the 10,000 simulated A-D test statistics. To get the P-value, the test statistic computed from observed data is compared to the null distribution, and the proportion of the null distribution that exceeds the observed test statistic value is recorded. At the genomic scale, we have the power to detect even slight violations of the null hypothesis. A disadvantage of the genomic-level test is that permuted data sets do not retain the same codon usage patterns as the actual data. This means that association between codon usage and gene expression is disrupted in the permuted data sets. When the hypothesis was tested at the gene scale, the resampling procedure only allows RSA values belonging to positions that have the same amino acid type to be permuted within a gene. Therefore, permuted data sets do not disrupt association between codon usage and gene expression because permuted genes retain the same codon usage for each gene as found in actual data. Under this restriction, 18 × 864 and 18 × 156 null distributions were generated, since we have 864 human proteins and 156 mouse proteins. To perform a single test of association between codon usage and RSA within each amino acid type, normalized test statistics (z-scores) were obtained for the amino acid type in each gene by subtracting the sample mean of the individual k-sample A-D test statistics for each combination of gene and amino acid type and then dividing the differences by the sample standard deviation. For each amino acid type, the 10,000 × 864 and 10,000 × 156 z-scores were summed across proteins so that 10,000 sum-of-z-scores (the combined null distribution) were obtained in human and in mouse. Finally, we compute P-values by comparing the observed sum-of-z-score for an amino acid with the corresponding null distribution. Although testing at the gene scale allows gene-specific codon bias to be removed, there is less power to reject the null hypothesis at the individual gene scale because each gene has a comparatively small number of codons for each amino acid.

Clustering amino acids according to RSA preference

The RSA of an amino acid residue in a protein is affected by tertiary structure of the protein, apart from the amino acid’s physicochemical properties.39 Before we proceed to quantitative modeling, we would like to find out how interchangeable two amino acids are in terms of RSA distribution and whether amino acids can be grouped together by their RSA distribution similarities. The two-sample Kolmogorov–Smirnov (KS) test is one of the most useful and general nonparametric methods for comparing two samples. In other contexts,40–42 the KS test statistic has been used to define distances for the purpose of hierarchical clustering. Borrowing the idea from hypothesis testing, we define the distance between a pair of amino acids by the normalized two-sample KS test statistic (Equation 2) when contrasting their RSA distributions. We let n and n′ be sample sizes for each amino acid type, and we obtain a test statistic: where F1, and F2, are the empirical RSA distribution functions of the two amino acids, respectively. Similar to hypothesis testing for codon usage, an amino acid distance matrix can be constructed either at the genomic scale or at the individual gene scale. For the latter, one distance matrix can be computed within each protein and all matrices can be combined by taking element-wise means. Complete-linkage clustering43 is employed to find the grouping pattern in a distance matrix.

Logistic regression (LR)

If there is no significant association between codon usage and RSA, we can make the following simplification to Equation 1: Since we are interested in understanding how gene expression (a predictor variable) affects probabilities of amino acid types (a categorical outcome variable), LR (in cases where there are exactly two synonymous codons) and MLR can be employed. The most frequent codon in each synonymous group is denoted as C, and the probability of any other synonymous codon C is modeled relative to it: where α is the intercept and β is the slope for gene expression with codon C. When amino acids have exactly two synonymous codons, we use LR instead of MLR. Since methionine and tryptophan are encoded by a single codon, they are not included in this analysis. To measure the joint effect of RSA and gene expression on amino acid usage quantitatively, we use both gene expression and RSA as independent variables and build a model similar to the one in Equation 4, but this time the probabilities are relative to A, the reference amino acid: where β and β are the slopes for amino acid A of RSA and gene expression, respectively. The most frequent amino acid (in our case, leucine) is chosen as A. A likelihood ratio test can compare the full model in Equation 5 with reduced models in Equations 6 and 7.

A functionally and structurally constrained Markov model of codon substitution

Using the probabilistic framework that we have constructed, we develop a model of protein-coding DNA sequence evolution that explicitly accounts for structural and functional constraints. The rate of change from codon i to codon j is denoted by q. Our mutation-selection balance parameterization has each substitution rate be proportional to the mutation rate from i to j multiplied by the probability that a new mutation becomes fixed. We denote π as the stationary probability of nucleotide type k (k ∈ {A, C, G, T}) in the absence of natural selection and use π as the vector representing the four nucleotide frequencies. Let κ be the transition–transversion rate ratio. Assume a diploid population of effective size N individuals, and let P(Z) be the probability that a new mutant allele j eventually gets fixed in the population that otherwise consists of 2N − 1 alleles of type i. We set q to 0 if i or j is a stop codon, or i and j differ by more than one nucleotide. When i and j differ by exactly a single codon position and codon j has nucleotide type h at this position, our model specifies the substitution rate q from codon i to codon j as where the scale factor u is defined by the requirement that the average substitution rate for a codon substitution process at stationarity is 1. Following Choi et al.44, we design our evolutionary model to have a stationary distribution that matches a desired target probability distribution. In this case, we have our evolutionary model yield a stationary distribution that is identical to the probability distribution of codons that we estimated via MLR from solvent accessibility and gene expression data of human proteins. Accordingly, we set P(i|R,E) as the stationary probability for codon i. It can be calculated using Equations 1, 4, and 5. Let P0(i|π) be the probability of codon i under a model where stop codons are lethal, but all other codons are equally fit. Adjusting for the three stop codons in the universal genetic code, According to the approximation introduced in Choi et al.44, and yield a Markov model that is time reversible. One advantage of our model is that the number of parameters that are introduced in the effort to include RSA and gene expression is small. The trade-off is that our model will incur extra computation because RSA values can be different from site to site. We need to have one rate matrix for each site, although these rate matrices can share parameters such as κ. We set the relative fitnesses of i and j, respectively, as 1 and 1 + s. For an effective population of size N diploid individuals with multiplicative fitnesses, the scaled selection coefficient S = 2N s can be assessed by Refs. 44–46: To get reliable estimates of coefficients for MLRs in Equations 4 and 5, a sufficient number of proteins with both PDB structure and RNA-Seq data are required. Owing to insufficient multi-organ RNA-seq data from mouse, we conducted the LR and trained the evolutionary model only for humans. For this analysis, we use nucleotide frequencies observed in the human genome (41% GC base pairs and 59% AT base pairs according to International Human Genome Sequencing47) as estimates of π. This way of estimating π stems from the fact that the majority of the human genome consists of DNA with no known biological function and that is putatively evolving without much impact from natural selection. We therefore set π= π= 0.59/2 = 0.295 and π= π = 0.41/2 = 0.205. This treatment could be improved by allowing π to vary among genes to account for regional genomic differences in mutation patterns, but it is not pursued in this study.

Results

After using the Benjamini–Hochberg (BH) method48 to control for false discovery rate, no synonymous codon group in mouse was found to exhibit a significant correlation between codon usage and RSA. In contrast, 15 out of 18 synonymous codon groups in human were statistically significant (Table 1). For humans, the null hypothesis cannot be rejected in synonymous codon groups belonging to cysteine (C), phenylalanine (F), and histidine (H).

Table 1

Test at genomic scale of whether RSA tendencies vary among synonymous codons.

SYNONYMOUS CODON GROUPS	AA	HUMAN P-VALUE	MOUSE P-VALUE
GCT, GCC, GCA, GCG	A	0.0003*	0.2473
TGT, TGC	C	0.9381	0.2330
GAT, GAC	D	0.0115*	0.4074
GAA, GAG	E	<0.0001*	0.5510
TTT, TTC	F	0.4852	0.3127
GGT, GGC, GGA, GGG	G	0.0153*	0.9709
CAT, CAC	H	0.2765	0.3875
ATT, ATC, ATA	I	<0.0001*	0.0703
AAA, AAG	K	0.0012*	0.0320
TTA, TTG, CTT, CTC, CTA, CTG	L	0.0057*	0.0975
AAT, AAC	N	0.0001*	0.0188
CCT, CCC, CCA, CCG	P	0.0016*	0.6651
CAA, CAG	Q	0.0058*	0.1058
CGT, CGC, CGA, CGG, AGA, AGG	R	0.0338*	0.2547
TCT, TCC, TCA, TCG, AGT, AGC	S	<0.0001*	0.0326
ACT, ACC, ACA, ACG	T	0.0034*	0.3692
GTT, GTC, GTA, GTG	V	0.0255*	0.8705
TAT, TAC	Y	0.0082*	0.2580

Notes: The first column contains synonymous codon groups, and the second column shows the corresponding amino acid type. The P-values of individual k-sample A-D tests are shown in the final two columns. To control for false discovery at level α = 0.05, the BH method48 was applied to the human P-values and then the mouse P-values. An * indicates significance at α = 0.05.

When the same hypothesis was tested at the individual gene level, the P-value results can be found in Table 2. After applying the BH method, no synonymous codon group was found to be statistically significant in either human or mouse, although P-values in human are consistently smaller than those in mouse.

Table 2

Test at individual gene scale of whether RSA tendencies vary among synonymous codons.

SYNONYMOUS CODON GROUPS	AA	HUMAN P-VALUE	MOUSE P-VALUE
GCT, GCC, GCA, GCG	A	0.0941	0.2033
TGT, TGC	C	0.2529	0.5224
GAT, GAC	D	0.4482	0.2829
GAA, GAG	E	0.9285	0.9118
TTT, TTC	F	0.4127	0.9098
GGT, GGC, GGA, GGG	G	0.0071	0.4487
CAT, CAC	H	0.2685	0.3649
ATT, ATC, ATA	I	0.1844	0.6461
AAA, AAG	K	0.1098	0.7943
TTA, TTG, CTT, CTC, CTA, CTG	L	0.3073	0.4867
AAT, AAC	N	0.5365	0.0688
CCT, CCC, CCA, CCG	P	0.3841	0.2274
CAA, CAG	Q	0.4883	0.7631
CGT, CGC, CGA, CGG, AGA, AGG	R	0.0730	0.4402
TCT, TCC, TCA, TCG, AGT, AGC	S	0.0466	0.4255
ACT, ACC, ACA, ACG	T	0.1324	0.2816
GTT, GTC, GTA, GTG	V	0.0713	0.5674
TAT, TAC	Y	0.5540	0.5869

Figure 1 shows the hierarchical clustering result of the amino acid distance matrix computed for combined human data. As would be expected, if natural selection is associated with solvent accessibility, the hierarchical clustering suggests that RSA tendencies of amino acids in naturally occurring proteins are correlated with the physicochemical properties of these amino acids. On the whole, the 20 amino acids are naturally separated into two groups. The first group contains mainly hydrophobic amino acids (C, M, W, A, Y, F, V, I, L) and the second group contains the other amino acids (D, Q, R, P, N, T, H, G, S, K, E). According to the dendrogram, there are a few smaller and tighter clusters within each group. For example, in the first group, amino acids L, I, V, and F have more similar RSA preference than other amino acids in this group. In the second group, the cluster that contains amino acids K and E shows strong distinct signal, while two other clusters (one formed by T, H, G, S and the other formed by D, Q, R, P, N) are also present.

Figure 1

Distance matrix for amino acids using combined human data. Distance between two amino acid types is defined by the normalized KS test statistics (Equation 2). The complete-linkage method was used to perform hierarchical clustering.

As explained in the Materials and Methods section, we also constructed another distance matrix using data stratified by each gene. We computed one distance matrix for each protein in our data set and then calculated the average values for each cell in the distance matrix across all proteins. The resulting distance matrix is shown in Figure 2. The same clustering pattern was found as in Figure 1, although the signal is weaker because of the much smaller sample sizes in individual protein-coding genes.

Figure 2

Distance matrix for amino acids using stratified human data. Distance between two amino acid types is defined by the normalized KS test statistics (Equation 2). The complete-linkage method was used to perform hierarchical clustering. Entries in the matrix are computed by taking the average of the distances between amino acids among all proteins.

MLR results

Using human data, the estimated coefficients of MLR models for synonymous codon groups with more than two codons are shown in Table 3. For groups with exactly two synonymous codons, estimated coefficients of LR models are in Table 4. The estimates of the slope parameter for log-scaled gene expression are all negative for LR models, and they are all statistically significant. Across all MLR models, 24 out of 32 slope estimates for gene expression are negative and statistically significant, while only one coefficient belonging to codon TCG in amino acid serine is positive and significant. Since we define the most frequent (also called preferred) codon to be the reference category in both MLR and LR cases, these estimates coincide with findings from earlier studies21,24,49 that the probability of observing the most frequent codon increases when gene expression level is higher (with one exception).

Table 3

Codons: MLR-estimated coefficients for gene expression using human data.

AA	REF CODON	NON-REF CODON	INTERCEPT COEF	EXPRESSION COEF	INTERCEPT P-VALUE	EXPRESSION P-VALUE
A	GCC	GCA	−0.3091	−0.1650	0.0002	<0.0001
		GCG	−1.6278	0.0331	<0.0001	0.3824
		GCT	−0.0943	−0.1363	0.2237	<0.0001
G	GGC	GGA	0.1279	−0.1639	0.1294	<0.0001
		GGG	−0.3834	−0.0039	<0.0001	0.8958
		GGT	−0.5043	−0.0909	<0.0001	0.0076
I	ATC	ATA	−0.6378	−0.2427	<0.0001	<0.0001
		ATT	0.0974	−0.1437	0.2063	<0.0001
L	CTG	CTA	−1.3565	−0.1966	<0.0001	<0.0001
		CTC	−0.5859	−0.0597	<0.0001	0.0089
		CTT	−0.6472	−0.2047	<0.0001	<0.0001
		TTA	−1.1326	−0.2292	<0.0001	<0.0001
		TTG	−0.9156	−0.1259	<0.0001	<0.0001
P	CCC	CCA	0.1390	−0.1315	0.1450	0.0001
		CCG	−1.4229	0.0621	<0.0001	0.1675
		CCT	0.1333	−0.1263	0.1619	0.0001
R	CGG	AGA	0.3689	−0.1976	0.0012	<0.0001
		AGG	0.2032	−0.1231	0.0754	0.0013
		CGA	0.0133	−0.2373	0.9170	<0.0001
		CGC	0.0172	−0.0124	0.8795	0.7315
		CGT	−0.6141	−0.0978	<0.0001	0.0428
S	AGC	AGT	−0.2386	−0.1159	0.0157	0.0012
		TCA	−0.3185	−0.1327	0.0018	0.0004
		TCC	−0.1431	0.0058	0.1140	0.8499
		TCG	−1.7424	0.1118	<0.0001	0.0167
		TCT	−0.0936	−0.1069	0.3207	0.0016
T	ACC	ACA	−0.0842	−0.0641	0.3423	0.0341
		ACG	−1.1397	0.0214	<0.0001	0.5834
		ACT	−0.1699	−0.1289	0.0705	0.0001
V	GTG	GTA	−0.9888	−0.2038	<0.0001	<0.0001
		GTC	−0.6873	−0.0132	<0.0001	0.6153
		GTT	−0.6146	−0.1727	<0.0001	<0.0001

Notes: The first column denotes amino acid types for synonymous codon groups. Within each synonymous codon group, the most frequent codon for each amino acid is chosen as the reference category, and these codons are shown in the second column. The non-reference codons are listed in the third column. The fourth and fifth columns contain the estimated coefficients for intercept and gene expression in Equation 4. The corresponding P-values of the estimated coefficients can be found in the last two columns.

Table 4

Codons: LR-estimated coefficients for gene expression using human data.

AA	REF CODON	NON-REF CODON	INTERCEPT COEF	EXPRESSION COEF	INTERCEPT P-VALUE	EXPRESSION P-VALUE
C	TGC	TGT	0.1474	−0.1109	0.1622	0.0035
D	GAC	GAT	0.0650	−0.0985	0.3275	<0.0001
E	GAG	GAA	0.1872	−0.1632	0.0018	<0.0001
F	TTC	TTT	0.1959	−0.1009	0.0091	0.0001
H	CAC	CAT	−0.1216	−0.1132	0.2212	0.0018
K	AAG	AAA	−0.0871	−0.1144	0.1714	<0.0001
N	AAC	AAT	0.0934	−0.1340	0.2320	<0.0001
Q	CAG	CAA	−0.6138	−0.1920	<0.0001	<0.0001
Y	TAC	TAT	−0.1045	−0.0688	0.2184	0.0213

A likelihood ratio test suggests that the full model for amino acid probability (Equation 5) is significantly better than the model considering only RSA (Equation 6; P-value ≐ 4.1e–09) and the model considering only the level of gene expression (Equation 7; P-value ≐ 2.2e–16). Table 5 summarizes the maximum-likelihood estimates of the parameters and their P-values in the full model. With the exception of three amino acids (cysteine, phenylalanine, and isoleucine), slope estimates for RSA of the non-reference amino acids are positive. This means that the relative probability of observing these amino acids rather than the reference amino acid leucine is higher as RSA increases, holding gene expression identical. To depict a full picture of the joint effect of RSA and gene expression, heatmaps present the predicted probabilities of each amino acid (Fig. 3 and Supplementary Fig. 1). While RSA plays a dominant role in determining the region of high probabilities for amino acids such as leucine and proline, the effect of gene expression is evident in cases such as histidine and serine (Fig. 3).

Table 5

Amino acids: MLR-estimated coefficients for RSA and gene expression using human data.

AA	INTERCEPT		RSA		EXPRESSION
AA	COEF	P-VALUE	COEF	P-VALUE	COEF	P-VALUE
A	−0.9449	<0.0001	0.0208	<0.0001	0.0378	0.0065
C	−1.2789	<0.0001	−0.0080	<0.0001	−0.0255	0.2123
D	−1.9803	<0.0001	0.0479	<0.0001	0.0320	0.0292
E	−2.0101	<0.0001	0.0523	<0.0001	0.0478	0.0006
F	−0.7462	<0.0001	−0.0034	0.0105	−0.0020	0.9009
G	−1.3812	<0.0001	0.0354	<0.0001	0.0309	0.0283
H	−2.0176	<0.0001	0.0322	<0.0001	−0.0128	0.5008
I	−0.5001	<0.0001	−0.0057	<0.0001	−0.0282	0.0634
K	−2.2285	<0.0001	0.0537	<0.0001	0.0526	0.0003
M	−1.9446	<0.0001	0.0096	<0.0001	0.0381	0.0813
N	−2.0827	<0.0001	0.0455	<0.0001	−0.0246	0.1393
P	−2.0514	<0.0001	0.0455	<0.0001	0.0295	0.0571
Q	−2.1544	<0.0001	0.0464	<0.0001	0.0176	0.2721
R	−1.9454	<0.0001	0.0441	<0.0001	0.0373	0.0136
S	−1.3001	<0.0001	0.0366	<0.0001	−0.0126	0.3740
T	−1.5064	<0.0001	0.0329	<0.0001	0.0305	0.0421
V	−0.5067	<0.0001	0.0034	0.0020	0.0304	0.0263
W	−1.9197	<0.0001	0.0022	0.2615	−0.0159	0.5218
Y	−1.2309	<0.0001	0.0109	<0.0001	0.0019	0.9128

Notes: The first column shows amino acid types in their one-letter format. The second and third columns contain estimated values and their corresponding P-values of the intercept (see Equation 5). The fourth and fifth columns show estimated coefficients and their P-values of RSA in MLR. The last two columns show coefficient estimates and P-values of log-transformed gene expression. The most frequent amino acid leucine (L) is not included in the table because it is chosen as the reference category for MLR.

Figure 3

Predicted probabilities of four amino acids across possible ranges of RSA and gene expression. Probabilities for each amino acid were calculated from the estimated RSA and gene expression coefficients (see Table 5). The horizontal axis covers the possible range of RSA, and the vertical axis is for gene expression. Red regions indicate relatively high probabilities, while blue regions show low probabilities. The four amino acids depicted here (histidine, leucine, proline, and serine) were selected for the diversity of heatmap diagrams that they represent. Heatmaps for the other 16 amino acids can be found in Supplementary Figure 1.

Scaled selection coefficients

By applying Equation 12, we estimated the scaled selection coefficient S = 2N s of all possible point mutations at each site for the 241 protein-coding human genes. The estimated values fall into two groups defined by nonsynonymous and synonymous substitutions. Figure 4 depicts how the distribution of scaled selection coefficients for genes changes with the corresponding gene expression levels. We can see from the figure that when mutations are being selected against (S < 0), both nonsynonymous and synonymous mutations are inferred to be more deleterious as the gene expression level increases. The figure also shows that the degree of selective advantage for beneficial mutations gets larger with increasing gene expression. As expected, Figure 4 indicates that nonsynonymous mutations are influenced by gene expression to a lesser degree than synonymous ones.

Figure 4

Comparison of S = 2N s estimates between nonsynonymous and synonymous point mutations to human genes. For each protein-coding human gene, estimates of S were obtained for possible nonsynonymous (red) and synonymous (blue) point mutations at each site. The x-axis represents the logarithm of the geometric mean across tissues of expression measurements for a gene. The y-axis represents inferred scaled selection coefficients (S). For each gene, the scaled selection coefficient distribution among point mutations is summarized by the 5th percentile (square shape), the 50th percentile (triangular shape), and the 95th percentile (round shape).

Discussion

We developed a probabilistic framework that simultaneously considers RSA (a structural constraint) and gene expression (a functional constraint). By design, the two-step construction process of this framework (Equation 1) captures codon usage bias at the nucleotide level as well as structural and functional dependence of amino acids. Hypothesis testing at a genomic scale reveals a potentially species-specific difference in the relationship between synonymous codon choice and RSA. In humans, the null hypothesis that RSA is independent of synonymous codon choice can be rejected. In mouse, the result is not statistically significant, possibly because of insufficient data and possibly because the relationship between RSA and synonymous codon usage is less strong. At the individual gene level, we test the same hypothesis using permutation and resampling. Although we fail to reject the null hypothesis in both human and mouse, we notice that the association between RSA and codon usage is stronger in human than in mouse. This is consistent with our finding at the genomic scale. Because we derive one test statistic from each gene and combine the test statistics to get an overall test statistic and its null distribution, we are treating each protein equally. However, because of small sample size, the test statistics calculated in some proteins may not be informative and can contribute stochastic noise to the combined test statistic. Improved test statistics might yield different conclusions. In the first step of constructing our probabilistic framework, we used MLR and LR to model the probabilities of different synonymous codons, given gene expression. Slope estimates suggest that, in general, it is more likely to observe the preferred codon in highly expressed genes. This is consistent with previous findings.21,24,49 RSA is ignored in this step because our analyses indicate that the correlation between RSA and codon usage is small to nonexistent. This does not mean that there is no structure constraint on codon usage. Protein structure may lead to selection pressure on synonymous codon choice through an interaction between the translation process and protein folding.50 Codon usage is also subject to other constraints because it can affect splicing and/or mRNA stability.51 To predict amino acid types, we incorporate both RSA and gene expression as independent variables in our MLR model. Our likelihood ratio test confirms that the full model with both variables fits data significantly better than reduced models. Although the impact of gene expression on amino acid probabilities is weaker than the impact of RSA, it is still significant. The heatmaps for amino acids such as histidine and serine show that these amino acids reach their highest probabilities at intermediate RSA values (Fig. 3). We observed a trend that 2N s estimates for both nonsynonymous and synonymous point mutations become more negative as gene expression increases (Fig. 4). This observation is biologically reasonable, and a few mechanisms might contribute to it. Highly expressed proteins have been found to evolve slower because of the stronger selection pressure that they experience.4,13,52 Many highly expressed proteins are functionally important or even essential. For these proteins, the cell cannot afford their function to be disturbed or their abundance to become low.53 Some aspects of the scaled selection coefficients are qualitatively reasonable. For example, 64.2% of the possible point mutations yield negative values of 2N s. This is in keeping with the expectation that if protein-coding genes are affected by natural selection, more possible point mutations should be deleterious than advantageous. The qualitative pattern remains when we restrict consideration to synonymous point mutations. Of these mutations, 64.0% yield negative values. Among nonsynonymous mutations, 64.3% of possible point mutations are inferred to be deleterious. However, the distribution of nonsynonymous scaled selection coefficient estimates is clearly unrealistic in that it is too tightly clustered around the value of 0. Because some possible nonsynonymous mutations are lethal or at least highly deleterious, there should be some scaled selection coefficient values that are far below 0. We do not observe this expected long lower tail of the distribution of scaled selection coefficients for nonsynonymous point mutations (eg, see the 5% nonsynonymous values in Fig. 4). Because some nonsynonymous mutations should be extremely deleterious, we also expect the average over all possible nonsynonymous mutations to be substantially below zero. We do not observe this. In fact, the average estimates of 2N s for nonsynonymous mutations is about −0.162, and this is only slightly less than the −0.158 value represented by the average 2N s estimate among synonymous point mutations. Similar shortcomings have been noted for distributions of scaled selection coefficients that have been inferred via a mutation-selection balance model of molecular evolution.44,54 Much of the mismatch between our expectations and our estimates is likely because of flaws of our evolutionary model. Specifically, the rates in our model depend on the RSA and gene expression covariates, but other aspects of phenotype are clearly very relevant to natural selection and are not captured by our model. An additional possible weakness of our mutation-selection model and other mutation-selection models is the assumption that each new mutation is fixed or lost before the next one occurs. Our inferential framework does not accommodate the possibility that fitness-affecting genetic variants at one locus interfere with the survival or loss of fitness-affecting variants at other loci. This Hill–Robertson phenomenon55 is potentially important to consider when connecting interspecific models of sequence change to population genetics.56 A basic assumption of our probabilistic framework is that sites in a protein-coding sequence are independent. This assumption is commonly made, and it allows simplification of computation, even though it is clear that sites in a protein sequence do not evolve independently. It would be challenging yet worthwhile for future evolutionary studies to include structural constraints related to site dependence in a coding sequence. Supplementary Figure 1. Predicted probabilities of 16 amino acids across possible ranges of RSA and gene expression.

50 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

Authors: Kim D Pruitt; Jennifer Harrow; Rachel A Harte; Craig Wallin; Mark Diekhans; Donna R Maglott; Steve Searle; Catherine M Farrell; Jane E Loveland; Barbara J Ruef; Elizabeth Hart; Marie-Marthe Suner; Melissa J Landrum; Bronwen Aken; Sarah Ayling; Robert Baertsch; Julio Fernandez-Banet; Joshua L Cherry; Val Curwen; Michael Dicuccio; Manolis Kellis; Jennifer Lee; Michael F Lin; Michael Schuster; Andrew Shkeda; Clara Amid; Garth Brown; Oksana Dukhanina; Adam Frankish; Jennifer Hart; Bonnie L Maidak; Jonathan Mudge; Michael R Murphy; Terence Murphy; Jeena Rajan; Bhanu Rajput; Lillian D Riddick; Catherine Snow; Charles Steward; David Webb; Janet A Weber; Laurens Wilming; Wenyu Wu; Ewan Birney; David Haussler; Tim Hubbard; James Ostell; Richard Durbin; David Lipman
Journal: Genome Res Date: 2009-06-04 Impact factor: 9.043

3. Computational methods for evaluating phylogenetic models of coding sequence evolution with dependence between codons.

Authors: Nicolas Rodrigue; Claudia L Kleinman; Hervé Philippe; Nicolas Lartillot
Journal: Mol Biol Evol Date: 2009-04-21 Impact factor: 16.240

4. Coevolution of codon usage and tRNA genes leads to alternative stable states of biased codon usage.

Authors: Paul G Higgs; Wenqi Ran
Journal: Mol Biol Evol Date: 2008-08-06 Impact factor: 16.240

5. History can matter: non-Markovian behavior of ancestral lineages.

Authors: Reed A Cartwright; Nicolas Lartillot; Jeffrey L Thorne
Journal: Syst Biol Date: 2011-03-11 Impact factor: 15.683

6. Structural determinants of protein evolution are context-sensitive at the residue level.

Authors: Eric A Franzosa; Yu Xia
Journal: Mol Biol Evol Date: 2009-07-13 Impact factor: 16.240

7. Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. Differences in synonymous codon choice patterns of yeast and Escherichia coli with reference to the abundance of isoaccepting transfer RNAs.

Authors: T Ikemura
Journal: J Mol Biol Date: 1982-07-15 Impact factor: 5.469

8. The effect of linkage on limits to artificial selection.

Authors: W G Hill; A Robertson
Journal: Genet Res Date: 1966-12 Impact factor: 1.588

9. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution.

Authors: D Allan Drummond; Claus O Wilke
Journal: Cell Date: 2008-07-25 Impact factor: 41.582

10. Expression Atlas update--a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments.

Authors: Robert Petryszak; Tony Burdett; Benedetto Fiorelli; Nuno A Fonseca; Mar Gonzalez-Porta; Emma Hastings; Wolfgang Huber; Simon Jupp; Maria Keays; Nataliya Kryvych; Julie McMurry; John C Marioni; James Malone; Karine Megy; Gabriella Rustici; Amy Y Tang; Jan Taubert; Eleanor Williams; Oliver Mannion; Helen E Parkinson; Alvis Brazma
Journal: Nucleic Acids Res Date: 2013-12-04 Impact factor: 16.971