Rosina Savisaar, Laurence D Hurst.
Abstract
While the principal force directing coding sequence (CDS) evolution is selection on protein function, to ensure correct gene expression CDSs must also maintain interactions with RNA-binding proteins (RBPs). Understanding how our genes are shaped by these RNA-level pressures is necessary for diagnostics and for improving transgenes. However, the evolutionary impact of the need to maintain RBP interactions remains unresolved. Are coding sequences constrained by the need to specify RBP binding motifs? If so, what proportion of mutations are affected? Might sequence evolution also be constrained by the need not to specify motifs that might attract unwanted binding, for instance because it would interfere with exon definition? Here, we have scanned human CDSs for motifs that have been experimentally determined to be recognized by RBPs. We observe two sets of motifs: those that are enriched over a nucleotide-controlled null and those that are depleted. Importantly, the depleted set is enriched for motifs recognized by non-CDS binding RBPs. Supporting the functional relevance of our observations, we find that motifs that are more enriched are also slower-evolving. The net effect of this selection to preserve motifs is a reduction in the overall rate of synonymous evolution of 2–3% in both primates and rodents. Stronger motif depletion, on the other hand, is associated with stronger selection against motif gain in evolution. The challenge faced by our CDSs is therefore not only one of attracting the right RBPs but also of avoiding the wrong ones, all while also evolving under selection pressures related to protein structure.
Keywords: RNA-binding proteins; avoidance selection; dual coding; synonymous sites
Year: 2017 PMID: 28138077 PMCID: PMC5400389 DOI: 10.1093/molbev/msx061
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
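The abstract describes scanning CDSs for experimentally determined RBP motifs and comparing their density with a null expectation. As a rough illustration of the scanning step only, the sketch below computes the fraction of bases in a sequence that are covered by at least one motif hit. The function name `motif_density` and this coverage-based definition of density are assumptions made for the example; the record above does not spell out how density was computed.

```python
def motif_density(sequence, motifs):
    """Fraction of bases in `sequence` covered by at least one motif hit.

    Assumption for illustration: density is defined as base coverage,
    and overlapping hits are counted only once per base.
    """
    if not sequence:
        return 0.0
    covered = [False] * len(sequence)
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            for i in range(start, start + len(motif)):
                covered[i] = True
            # Step forward by one base so that overlapping hits are found.
            start = sequence.find(motif, start + 1)
    return sum(covered) / len(sequence)


if __name__ == "__main__":
    # Toy sequence and motif, not taken from the study.
    print(motif_density("AAGAAGCC", ["GAAG"]))
```

In the toy call, the single hit covers four of eight bases. A real analysis would compare such a value with the same statistic in composition-matched simulated sequences.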
Spearman Correlation between Normalized Density (ND) and Various Expression Parameters, Determined Based on FANTOM5 Data.
| | Expression Breadth (fraction of tissues where the gene is expressed) | Maximum Expression | Median Expression | Median Expression in Tissues Where the Gene Is Expressed |
|---|---|---|---|---|
| Spearman's ρ | ≈−0.151 | ≈−0.035 | ≈−0.157 | ≈−0.016 |
| P (Bonferroni-corrected P) | ≈9.576×10−30 (≈3.830×10−29) | ≈0.010 (≈0.038) | ≈3.071×10−32 (≈1.228×10−31) | ≈0.280 (1.000) |
A gene is considered to be expressed in a given tissue if more than five tags per million map to the promoter region (see “Materials and Methods” section for further details).
The parentheses contain the Bonferroni-corrected P value.
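The four expression parameters and the Bonferroni correction can be sketched as follows. The helper names `expression_summary` and `bonferroni` are hypothetical, and the number of tests (four) is an assumption inferred from the table, where each corrected P value is about four times the raw value and is capped at 1. The threshold of five tags per million comes from the note above.

```python
from statistics import median


def expression_summary(tpm_by_tissue, threshold=5.0):
    """Summarise one gene's expression across tissues.

    `tpm_by_tissue` is a list of tags-per-million values, one per tissue.
    A gene counts as expressed in a tissue if its value exceeds `threshold`.
    """
    expressed = [v for v in tpm_by_tissue if v > threshold]
    return {
        "breadth": len(expressed) / len(tpm_by_tissue),
        "maximum": max(tpm_by_tissue),
        "median": median(tpm_by_tissue),
        # Undefined when the gene is not expressed anywhere.
        "median_if_expressed": median(expressed) if expressed else None,
    }


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Toy expression values, not taken from FANTOM5.
    print(expression_summary([0.0, 2.0, 10.0, 30.0]))
    # Raw P values from the table; four tests is an assumption.
    for raw in (9.576e-30, 3.071e-32, 0.280):
        print(raw, bonferroni(raw, 4))
```

Under this assumption, the raw value 0.280 gives 1.12 before capping, which matches the reported corrected value of 1.000.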
Motif Density and Conservation Parameters for Various Genic Regions.
| | CDSs | 5′-UTRs | 3′-UTRs | Introns | Upstream Intronic | Downstream Intronic |
|---|---|---|---|---|---|---|
| Median motif density | ≈0.573 | ≈0.537 | ≈0.573 | ≈0.578 | ≈0.580 | ≈0.560 |
| Median ND | ≈0.115 | ≈0.145 | ≈0.103 | ≈0.129 | ≈0.167 | ≈0.130 |
| Enrichment P | ≈0.001 | ≈0.010 | ≈0.010 | ≈0.010 | ≈0.010 | ≈0.010 |
| Rate of evolution | ≈0.064 | ≈0.052 | ≈0.043 | ≈0.055 | ≈0.051 | ≈0.054 |
| Normalized rate of evolution | ≈−0.041 | ≈−0.019 | ≈−0.026 | ≈−0.034 | ≈−0.035 | ≈−0.017 |
| Conservation P | ≈0.003 | ≈0.030 | ≈0.040 | ≈0.010 | ≈0.030 | ≈0.149 |
| Global reduction | ≈−2.4% | ≈−1.0% | ≈−1.5% | ≈−2.0% | ≈−2.0% | ≈−0.9% |
Upstream/downstream intronic regions correspond to 100 bp slices immediately upstream/downstream from an exon.
ND: normalized density.
P values: one-tailed P derived from an empirical distribution of simulant statistics. 1,000 simulants were used for CDSs and 100 in the other cases.
Rate of evolution: measured at synonymous sites for CDSs and at noncoding sites for all other sequence regions.
The global reduction is the product of the motif density and the conservation statistic (multiplied by 100). It is an estimate for the extent to which the (synonymous) substitution rate is decreased in the relevant region because of selection to preserve RBP motifs. Note that in the table, the density and the normalized conservation estimates have been rounded to the third decimal, whereas exact values were used when calculating the global reduction.
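The arithmetic behind the global reduction can be checked directly from the table. The sketch below multiplies the rounded density and normalized conservation values for each region and compares the result with the reported percentage. The function name `global_reduction_percent` is hypothetical, and because the inputs are the rounded table values, small discrepancies from the reported figures are expected.

```python
def global_reduction_percent(motif_density, normalized_conservation):
    """Product of motif density and the conservation statistic, times 100."""
    return 100.0 * motif_density * normalized_conservation


# (median motif density, normalized conservation, reported global reduction in %)
# All values are the rounded entries from the table above.
TABLE = {
    "CDSs": (0.573, -0.041, -2.4),
    "5'-UTRs": (0.537, -0.019, -1.0),
    "3'-UTRs": (0.573, -0.026, -1.5),
    "Introns": (0.578, -0.034, -2.0),
    "Upstream intronic": (0.580, -0.035, -2.0),
    "Downstream intronic": (0.560, -0.017, -0.9),
}

if __name__ == "__main__":
    for region, (density, conservation, reported) in TABLE.items():
        recomputed = global_reduction_percent(density, conservation)
        print(f"{region}: recomputed {recomputed:.2f}%, reported {reported}%")
```

For CDSs, for example, 0.573 × −0.041 × 100 gives about −2.35%, close to the reported −2.4%. All six recomputed values fall within 0.1 percentage points of the reported ones, which is consistent with the rounding note above.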
Figure. (A) Each data point corresponds to the probability that a given motif set (recognized by a particular RBP) would be found at its current density (or higher) by chance given the underlying dinucleotide composition. The black line traces the distribution of enrichment P values obtained in the same sequences for size-matched sets of random k-mers. Note that RBP motifs display a peak at either extreme of the distribution whereas the random motifs do not. In other words, RBP motifs show a disproportionate tendency to occur at a density that deviates from neutral expectations. Importantly, this can mean both enrichment (P value approaching 0) and depletion (P value approaching 1). (B) As A, except that only RBPs for which we found crosslinking and immunoprecipitation studies on binding preferences are shown. Motif sets associated with CDS-binding RBPs (blue) have a peak near 0 (enrichment), whereas the other sets (yellow) have a peak near 1 (depletion).
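A one-tailed empirical P value of the kind plotted in panel A can be sketched as below: the observed density is compared with densities from simulated sequences, and the P value is the share of simulants at or above the observed value. The function name `empirical_p` and the add-one form of the estimator are assumptions; the record does not state the exact formula. The add-one form is at least consistent with the enrichment values in the table of genic regions, since 1/1001 ≈ 0.001 for 1,000 simulants and 1/101 ≈ 0.010 for 100.

```python
def empirical_p(observed, simulated):
    """One-tailed empirical P: chance of a value >= `observed` under the null.

    Assumption for illustration: the add-one estimator
    (r + 1) / (n + 1), which never returns exactly 0.
    """
    at_least = sum(1 for value in simulated if value >= observed)
    return (at_least + 1) / (len(simulated) + 1)


if __name__ == "__main__":
    # Toy null distributions, not taken from the study.
    enriched = empirical_p(0.6, [0.5] * 1000)  # no simulant reaches 0.6
    depleted = empirical_p(0.4, [0.5] * 100)   # every simulant exceeds 0.4
    print(enriched, depleted)
```

A strongly enriched motif set therefore yields a P value near 0 and a strongly depleted one a P value near 1, matching the two peaks described in the caption.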
Figure. (A) Correlation between a motif set’s normalized density (ND) and its nucleotide-normalized d from alignment to macaque. Motif sets that are more strongly enriched are also more conserved, controlling for nucleotide composition. The dashed lines intersect the plot at the points where expected and observed frequencies would be equal. (B) Correlation between ND and the nucleotide-normalized propensity to gain the motifs over evolution (measured by determining how frequently macaque sites that are orthologous to human 4-fold degenerate sites that are a single base substitution away from the motif in human contain the base that would give rise to the motif in human). Note that because our analysis did not make use of an outgroup, we cannot know on which branch the substitution occurred in cases where the human and macaque sequences differ. See caption to subplot A for interpretation of the dashed lines.
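The gain measure in panel B can be illustrated with a simplified sketch. For each supplied human site, it finds the substitutions that would create a motif hit overlapping that site, then checks whether the orthologous macaque base is one of them. The names `creates_motif` and `gain_rate` are hypothetical. The sketch assumes a gapless alignment, takes the 4-fold degenerate positions as given, and returns only the raw frequency, so the nucleotide normalization mentioned in the caption is not reproduced here.

```python
BASES = "ACGT"


def creates_motif(seq, pos, base, motifs):
    """True if placing `base` at `pos` yields a motif hit overlapping `pos`."""
    mutated = seq[:pos] + base + seq[pos + 1:]
    for motif in motifs:
        k = len(motif)
        # Every window of length k that contains position `pos`.
        for start in range(max(0, pos - k + 1), min(pos, len(seq) - k) + 1):
            if mutated[start:start + k] == motif:
                return True
    return False


def gain_rate(human, macaque, sites, motifs):
    """Share of candidate sites where macaque carries a motif-creating base.

    `sites` are human positions (assumed to be 4-fold degenerate).
    A site is a candidate if at least one substitution there would create
    a new motif hit in the human sequence. Returns None without candidates.
    """
    candidates = 0
    gains = 0
    for pos in sites:
        gain_bases = {
            b for b in BASES
            if b != human[pos] and creates_motif(human, pos, b, motifs)
        }
        if gain_bases:
            candidates += 1
            if macaque[pos] in gain_bases:
                gains += 1
    return gains / candidates if candidates else None


if __name__ == "__main__":
    # Toy aligned sequences and motif, not taken from the study.
    print(gain_rate("ACGTAC", "ACGAAC", [0, 3], ["GAAC"]))
```

In the toy call, only position 3 is one substitution away from the motif, and the macaque base there is the motif-creating one. As the caption notes, without an outgroup such a difference cannot be assigned to either branch.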