David S Lawrie, Philipp W Messer, Ruth Hershberg, Dmitri A Petrov.
Abstract
Synonymous sites are generally assumed to be subject to weak selective constraint. For this reason, they are often neglected as a possible source of important functional variation. We use site frequency spectra from deep population sequencing data to show that, contrary to this expectation, 22% of four-fold synonymous (4D) sites in Drosophila melanogaster evolve under very strong selective constraint while few, if any, appear to be under weak constraint. Linking polymorphism with divergence data, we further find that the fraction of synonymous sites exposed to strong purifying selection is higher for those positions that show slower evolution on the Drosophila phylogeny. The function underlying the inferred strong constraint appears to be separate from splicing enhancers, nucleosome positioning, and the translational optimization generating canonical codon bias. The fraction of synonymous sites under strong constraint within a gene correlates well with gene expression, particularly in the mid-late embryo, pupae, and adult developmental stages. Genes enriched in strongly constrained synonymous sites tend to be particularly functionally important and are often involved in key developmental pathways. Given that the observed widespread constraint acting on synonymous sites is likely not limited to Drosophila, the role of synonymous sites in genetic disease and adaptation should be reevaluated.
Year: 2013 PMID: 23737754 PMCID: PMC3667748 DOI: 10.1371/journal.pgen.1003527
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Figure 1. The signal of strong selection acting on 4D sites.
(A) Overview of the bootstrap method. We sample 4D sites and their nearby (<1 KB apart) short intron pairs with replacement in order to control for linked selection and variation in GC content and mutation/recombination rates between the neutral reference (short introns) and the test set (4D sites). The short intron site and the 4D site in each pair must have the same nucleotide as their major allele. (B) The folded Site Frequency Spectra (SFS) of observed SNPs from short introns, 4D sites, and the theoretical neutral distribution in a population with constant size. The SNPs were resampled to 130 strains and folded using the minor allele frequency. (C) The ratio of the amount of polymorphism in short introns versus 4D sites in all, conserved, and variable amino acids with standard error bars. Conserved amino acids are those present and identical in the 12 sequenced Drosophila genomes. Variable amino acids are those that are not conserved according to the above definition. Ten bootstraps were done for each category (all, conserved, and variable) of 4D site. Lifting the restriction on distance and controlling only for GC content in the bootstrap produces results identical to those above (not shown). To be conservative, we continued to use the distance restriction in the bootstrap. Note that, had we simply taken the density of polymorphism as is, without correcting for GC content, we would have seen only a 7% drop in the density of polymorphism from short introns to 4D sites (5.58% vs 6.0% segregating in 4D versus short intron sites).
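Panels A and B can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the input layout (one 0/1 polymorphism flag per site in each pair), and the choice to report the 4D-to-intron ratio are assumptions made for illustration. The neutral expectation uses the standard constant-size result that the unfolded spectrum is proportional to 1/i, folded by adding classes i and n − i.

```python
import random

def neutral_folded_sfs(n):
    """Expected folded SFS proportions for n sampled chromosomes under
    neutrality and constant population size (unfolded class i ~ 1/i)."""
    weights = []
    for j in range(1, n // 2 + 1):
        w = 1.0 / j + 1.0 / (n - j)
        if j == n - j:          # middle class of an even sample is counted once
            w /= 2.0
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]

def paired_bootstrap_ratios(pairs, n_boot=10, seed=0):
    """Resample (4D flag, intron flag) pairs with replacement and return the
    4D-to-intron polymorphism ratio of each replicate.
    Each flag is 1 if the site is polymorphic, else 0 (hypothetical layout)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        # Drawing whole pairs keeps each 4D site with its nearby intron site.
        sample = [rng.choice(pairs) for _ in pairs]
        poly_4d = sum(p[0] for p in sample)
        poly_intron = sum(p[1] for p in sample)
        if poly_intron > 0:     # skip replicates with no intron SNPs
            ratios.append(poly_4d / poly_intron)
    return ratios
```

For a sample of 130, `neutral_folded_sfs(130)` returns 65 minor-allele frequency classes. Resampling the 4D site and its intron partner as a unit is what preserves the local matching of GC content and linked selection described in panel A.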
Estimated proportion of 4D sites and 4Nes for each selection class.
| Selection Category | Fraction of Sites | Strength (4Nes) |
| Neutral | 77.4% (+/−0.6%) | 0 |
| Weak Constraint | 0 | N/A |
| Strong Constraint | 22.6% (+/−0.6%) | −283 (+/−28.3) |
Selection categories are defined as follows. Neutral: 4Nes = 0; Weak Constraint: |4Nes| < 5; Strong Constraint: |4Nes| > 100. Defining Strong Constraint as |4Nes| > 5 gives exactly the same maximum likelihood estimate (MLE) for the fraction and strength of the strong category;
mean of the MLEs for the fraction of sites in each category over the ten bootstrap runs (+/− s.e.);
mean of the MLEs for the strength of strong selection over the ten bootstrap runs (+/− s.e.); 4Neμ (θ) = 0.0132.
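The category thresholds in the footnote can be written as a small classifier. This is only an illustrative sketch: the function name and the "unassigned" label for values between the weak and strong thresholds are assumptions, since the table does not name that range.

```python
def selection_category(scaled_s):
    """Map a scaled selection coefficient to the categories of the table."""
    magnitude = abs(scaled_s)
    if magnitude == 0:
        return "neutral"
    if magnitude < 5:
        return "weak constraint"
    if magnitude > 100:
        return "strong constraint"
    # 5 <= |value| <= 100 is not named in the table (label is hypothetical).
    return "unassigned"
```

The estimated strength of −283 falls well inside the strong category, which is consistent with the footnote's remark that lowering the strong threshold to 5 leaves the estimate unchanged.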
Figure 2. Conservation versus constraint at 4D sites in conserved amino acids.
For each 4D site in a conserved amino acid, we use GERP to infer the number of substitutions that have occurred at that site across the Drosophila tree (removing D. melanogaster and D. willistoni from the analysis). We define eight rate classes by the number of inferred substitutions across the tree - a proxy for the rate of evolution at the site - and bin the 4D sites accordingly. The class of the slowest evolving sites consists of those codons completely conserved across the ten Drosophila species (0 inferred substitutions along the tree at the 4D site). The fastest evolving class, meanwhile, has sites with greater than or equal to 9.3 substitutions per site. The remaining substitution classes are spread at intermediate values so as to roughly equalize the number of sites in each class. The substitution bins (b) are as follows: (b1 = 0, 0
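The binning idea can be sketched in Python. The intermediate bin boundaries are cut off in this copy of the caption, so the sketch does not reproduce them. As a stand-in assumption, it splits the intermediate sites into six equal-count classes, keeping only the two boundaries the caption states (exactly 0 substitutions, and 9.3 or more). The function and parameter names are hypothetical.

```python
from bisect import bisect_left

def assign_rate_classes(subs, n_middle=6, top_cutoff=9.3):
    """Return a class index 0..n_middle+1 for each substitution count.
    Class 0: exactly 0 substitutions; last class: >= top_cutoff.
    Middle classes: equal-count split (an assumption, not the paper's bins)."""
    middle = sorted(s for s in subs if 0 < s < top_cutoff)
    if len(middle) < n_middle:
        raise ValueError("too few intermediate sites to form the middle classes")
    # Upper edge of each middle class except the last one.
    edges = [middle[(k * len(middle)) // n_middle - 1] for k in range(1, n_middle)]
    classes = []
    for s in subs:
        if s == 0:
            classes.append(0)
        elif s >= top_cutoff:
            classes.append(n_middle + 1)
        else:
            classes.append(1 + bisect_left(edges, s))
    return classes
```

With the default arguments this yields eight classes in total, matching the number of rate classes in the caption, though not necessarily their boundaries.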
Figure 3. Constraint across codons.
For each amino acid, we list the codons and, in parentheses, the number of 4D sites from each codon used in the bootstrap analysis – representing, in relative terms, the abundance of each codon in the genome. P-codons are all 4D sites from optimal codons grouped together, while U-codons are all 4D sites from non-optimal codons. 4D sites were binned into codons either by their ancestral allele as determined by parsimony to D. sechellia or by major allele if there is a substitution at that site between D. sechellia and D. melanogaster. Gold bars are the optimal codons for each amino acid, while dark grey bars are the non-optimal codons. 10 bootstraps determine the fraction of sites under constraint for each codon-type. Error bars represent the s.e. of the estimates. A negative value indicates an excess of polymorphism at 4D sites compared to short introns and is likely due to mispolarization assigning SNPs to the wrong codon.
Figure 4. Codon optimality versus constraint in conserved codons.
Codons are conserved from D. sechellia-D. grimshawi (excluding D. willistoni). The conserved codons were separated into those that were ancestrally preferred (P) and those that were ancestrally unpreferred (U) using polarization with the D. sechellia-D. grimshawi (excluding D. willistoni) outgroup. 10 bootstraps were done within each class. Error bars represent the s.e. of the estimates. The dark bars represent the counts of all sites that fall into each class while the light bars represent the number of sites estimated to be under strong constraint via the bootstrap procedure. The dashed line indicates what the count of total unpreferred conserved codons would have been had unpreferred 4D sites been conserved to the same extent as preferred 4D sites in otherwise conserved amino acids, i.e. the dashed line represents the proportion of U:P in all conserved amino acids. More than half (53%) of those unpreferred codons that are conserved across the ten Drosophila species are under strong purifying selection in D. melanogaster; 38% of preferred conserved codons are under strong selection.
Strong constraint in genes grouped by codon bias.
| FOP | Fraction of Sites | ENC | Fraction of Sites |
| high FOP | 18.7% (+/−1.7%) | low ENC | 21.8% (+/−1.5%) |
| medium FOP | 23.1% (+/−1.0%) | medium ENC | 21.8% (+/−0.8%) |
| low FOP | 23.2% (+/−0.9%) | high ENC | 23.0% (+/−1.0%) |
Genes are ranked in descending order by their Frequency of Optimal Codons (FOP) with the top, middle, and bottom third forming the high, medium, and low FOP classes respectively;
genes are ranked in ascending order by their Effective Number of Codons (ENC) with the top, middle, and bottom third forming the low, medium, and high ENC classes respectively;
mean fraction of 4D sites under strong constraint in each category over 10 bootstrap runs (+/− s.e.).
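The ranking into thirds described in these footnotes can be sketched as follows. The function name and the dictionary input (gene identifier mapped to its FOP or ENC value) are assumptions for illustration, as is the handling of gene counts that are not divisible by three.

```python
def split_into_thirds(values, descending=True):
    """Rank genes by a per-gene value and return (top, middle, bottom) thirds.
    `values` maps a gene identifier to its statistic (e.g. FOP or ENC)."""
    ranked = sorted(values, key=values.get, reverse=descending)
    n = len(ranked)
    cut1, cut2 = n // 3, (2 * n) // 3
    return ranked[:cut1], ranked[cut1:cut2], ranked[cut2:]
```

For FOP, the descending ranking gives the high, medium, and low classes in that order. For ENC, calling the function with `descending=False` gives the low, medium, and high ENC classes.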
Strong constraint over different genic features.
| Category | Fraction of Sites | Category | Fraction of Sites |
| 5′ 75 bp of CDS | 30.7% (+/−3.0%) | 3′ 75 bp of CDS | 31.5% (+/−2.5%) |
| Bulk Nucleosomes | 24.2% (+/−0.7%) | splice junctions | 26.0% (+/−1.0%) |
| multi-exon genes | 22.0% (+/−0.6%) | single-exon genes | 21.8% (+/−4.4%) |
| long genes | 25.8% (+/−0.9%) | long CDSs | 24.1% (+/−0.8%) |
| medium genes | 19.3% (+/−0.6%) | medium CDSs | 19.2% (+/−1.0%) |
| short genes | 17.3% (+/−1.3%) | short CDSs | 20.3% (+/−1.8%) |
| autosomal genes | 22.6% (+/−0.5%) | X-linked genes | 19.2% (+/−1.3%) |
Mean fraction of 4D sites under strong constraint in each category over 10 bootstrap runs (+/− s.e.);
4D sites within 75 bp of the translation start site (longest transcript);
4D sites within 75 bp of stop codon (longest transcript);
4D sites in bulk nucleosome footprints;
4D sites within 48 bp of a splice junction;
4D sites from multi-exon genes;
4D sites from single-exon genes;
genes are ranked in descending order by their gene length (UTR + all exons + all introns) with the top, middle, and bottom third forming the long, medium, and short gene classes respectively;
genes are ranked in descending order by their CDS length (longest transcript) with the top, middle, and bottom third forming the long, medium, and short CDS classes respectively;
4D sites from Autosomal genes;
4D sites from X-linked genes.
Figure 5. Strong constraint versus gene expression across development.
Within each developmental time point, genes were ranked by their level of expression and then grouped into high, moderate, and low expression levels, each group comprising one-third of all genes. Within each gene set within each time point, the fraction of 4D synonymous sites under strong constraint was calculated using the bootstrap. Ten bootstraps were done within each such class. Error bars represent the s.e. of the estimates.
Functional clusters in genes enriched for strong constraint.
| Cluster # | Overall Functional Annotation | Enrichment |
| 1 | transcriptional regulation | 9.69 |
| 2 | imaginal disc development | 9.28 |
| 3 | homeobox protein domain | 7.57 |
| 4 | eye morphogenesis | 7.49 |
| 6 | epithelium development | 6.07 |
| 8 | immunoglobulin domain | 5.93 |
| 9 | ribosomal proteins | 5.36 |
| 10 | cell signaling | 4.59 |
| 12 | gamete generation | 4.33 |
| 13 | neuron development | 3.51 |
Functional annotation clusters ranked by significance by DAVID [86], [87]. These clusters are groups of similar or related biological annotation terms, with similarity determined by a simple stringency setting; in the above, a high stringency setting was used. The significance of the overall cluster reflects the combined enrichment in the test gene set of the individual annotation terms within a cluster (see the note on the enrichment score below). Clusters 5, 7, and 11 are not reported here because their biological terms were similar to clusters 4, 1, and 4 & 13 respectively, and so provided no new information. The full information for the top 13 clusters is reported in Table S3;
Summary description of the type of annotation terms within each cluster. The specific annotation terms for each cluster are in the supplement;
The enrichment score of the overall cluster as calculated by DAVID in the test gene set with respect to the background gene set. According to DAVID's description of enrichment scores, each individual annotation term within a cluster has a p-value, or significance, for the enrichment of that term in the test gene set. The enrichment score of the overall cluster is then the geometric mean of these p-values on a −log scale. Thus the higher the enrichment score, the lower the p-values are for the terms in the overall annotation cluster and the more significantly enriched the overall cluster is in the test gene set. The p-values for the enrichment of the annotation terms in each cluster are reported in Table S3.
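As a small worked sketch of this score: DAVID describes the cluster enrichment score as the geometric mean of the member terms' p-values expressed on a −log scale, which is the same as the mean of −log10(p). The function name and the example p-values below are hypothetical and are not taken from Table S3.

```python
import math

def cluster_enrichment_score(p_values):
    """Mean of -log10(p) over a cluster's terms, i.e. the -log10 of the
    geometric mean of the p-values."""
    if not p_values:
        raise ValueError("a cluster needs at least one p-value")
    return -sum(math.log10(p) for p in p_values) / len(p_values)

# Hypothetical p-values for two terms in one cluster.
score = cluster_enrichment_score([1e-3, 1e-5])
```

Under this reading, a score of about 9.69 (cluster 1) corresponds to a geometric-mean p-value of roughly 2 × 10^−10 across that cluster's terms.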