Daniel Gebert, Julia Jehn, David Rosenkranz.
Abstract
Codon composition, GC content and local RNA secondary structures can have a profound effect on gene expression, and mutations affecting these parameters, even though they do not alter the protein sequence, are not neutral in terms of selection. Although evidence exists that, in some cases, selection favours more stable RNA secondary structures, we currently lack a concrete idea of how many genes are affected within a species, and whether this is a universal phenomenon in nature. We searched for signs of structural selection in a global manner, analysing a set of 1 million coding sequences from 73 species representing all domains of life, as well as viruses, by means of our newly developed software PACKEIS. We show that codon composition and amino acid identity are main determinants of RNA secondary structure. In addition, we show that the arrangement of synonymous codons within coding sequences is non-random, yielding extremely high, but also extremely low, RNA structuredness significantly more often than expected by chance. Taken together, we demonstrate that selection for high and low levels of secondary structure is a widespread phenomenon. Our results provide another line of evidence that synonymous mutations are less neutral than commonly thought, which is of importance for many evolutionary models.
Keywords: PACKEIS; RNA secondary structure; mRNA backfolding; natural selection
Year: 2019 PMID: 31138098 PMCID: PMC6544989 DOI: 10.1098/rsob.190020
Source DB: PubMed Journal: Open Biol ISSN: 2046-2441 Impact factor: 6.411
Figure 1. (a) DBF of artificial ORFs (aORFs) from mosquito-borne RNA viruses, shown for illustration. The X-axis shows the distribution of DBFs relative to the average value of all aORFs. The genomes of the yellow fever virus and the Edge Hill virus each encode a single polyprotein. DBFs and the corresponding DBF scores are indicated in red. Only the ORF of the Edge Hill virus shows an exceptional DBF, which is considerably higher than that of the corresponding aORFs (DBF score with model0 = 0.99, DBF score with model2 = 1.00). (b) The analysis of DBF scores for all available viral ORF sequences reveals a consistent and significant enrichment for extremely high DBF scores. (c) Rows in the heatmap represent species; columns represent DBF scores from 0 to 1 in steps of 0.01 using model2 (shuffle). The colour indicates row Z-scores, with Z-scores above 1.96 (p < 0.05 for the two-tailed hypothesis) indicated in shades from yellow to red. Rows for viruses represent dsDNA, dsRNA, ssDNA, ssRNA(−) and ssRNA(+) viruses.
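The Z-score colouring described for panel (c) can be illustrated with a minimal sketch. It assumes that each species contributes one list of ORF counts per DBF-score bin and that the Z-score uses the population standard deviation of that list; neither detail is stated in the caption, and the function names and example numbers are hypothetical, not taken from PACKEIS.

```python
from statistics import mean, pstdev

# 1.96 is the two-tailed critical value for p < 0.05 quoted in the caption.
Z_CRIT = 1.96


def row_z_scores(row):
    """Standardise one species' bin counts to zero mean and unit variance.

    Assumption: population standard deviation (pstdev); the caption does
    not say which estimator was used.
    """
    mu = mean(row)
    sd = pstdev(row)
    if sd == 0:
        # A flat row has no bin that stands out.
        return [0.0 for _ in row]
    return [(x - mu) / sd for x in row]


def enriched_bins(row, threshold=Z_CRIT):
    """Return indices of bins whose Z-score exceeds the threshold."""
    return [i for i, z in enumerate(row_z_scores(row)) if z > threshold]


if __name__ == "__main__":
    # Hypothetical counts for ten DBF-score bins of a single species.
    counts = [10, 10, 10, 10, 10, 10, 10, 10, 10, 100]
    print(enriched_bins(counts))
```

In this toy row only the last bin is flagged, which mirrors the kind of enrichment at extreme DBF scores that the heatmap is meant to highlight.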
Figure 3. PACKEIS uses different models to generate artificial ORFs based on the oORF. Colours refer to the probability of a specific codon being placed at a given position. When applying model0 (free), PACKEIS uses equal probabilities for all codons of a specific amino acid. When applying model1 (strict), the probabilities are derived from the global codon usage of the species in question. When applying model2 (shuffle), codons of the oORF are randomly shuffled. Note that, in the above example, valine can only be encoded by GUG when applying model2 since no alternative valine codons are present in the given ORF.
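The three models can be sketched in a few lines of Python. This is an illustration of the sampling schemes as the caption describes them, not the PACKEIS implementation: the codon table is a small hypothetical excerpt of the standard genetic code, all function names are invented for this example, and model2 is assumed to shuffle codons only among positions that encode the same amino acid, so that the peptide sequence is preserved.

```python
import random
from collections import defaultdict

# Illustrative excerpt of the standard genetic code (RNA alphabet).
CODON_TO_AA = {
    "GUU": "V", "GUC": "V", "GUA": "V", "GUG": "V",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "AUG": "M",
}


def split_codons(orf):
    """Cut an ORF string into consecutive codons, dropping a trailing partial codon."""
    usable = len(orf) - len(orf) % 3
    return [orf[i:i + 3] for i in range(0, usable, 3)]


def synonymous(codon):
    """All codons in the table that encode the same amino acid as `codon`."""
    aa = CODON_TO_AA[codon]
    return sorted(c for c, a in CODON_TO_AA.items() if a == aa)


def model0_free(codons, rng):
    """model0 (free): every synonymous codon is equally likely."""
    return [rng.choice(synonymous(c)) for c in codons]


def model1_strict(codons, usage, rng):
    """model1 (strict): weight synonymous codons by a species-wide usage table."""
    out = []
    for c in codons:
        options = synonymous(c)
        weights = [usage.get(o, 0.0) for o in options]
        if sum(weights) == 0:
            # No usage data for this amino acid: fall back to equal weights.
            weights = [1.0] * len(options)
        out.append(rng.choices(options, weights=weights, k=1)[0])
    return out


def model2_shuffle(codons, rng):
    """model2 (shuffle): permute the ORF's own codons among same-amino-acid positions."""
    positions = defaultdict(list)
    for i, c in enumerate(codons):
        positions[CODON_TO_AA[c]].append(i)
    out = list(codons)
    for idxs in positions.values():
        pool = [codons[i] for i in idxs]
        rng.shuffle(pool)
        for i, c in zip(idxs, pool):
            out[i] = c
    return out


if __name__ == "__main__":
    rng = random.Random(1)
    orf = split_codons("AUGGUGGCAAAAGUGGCCAAG")
    print("".join(model0_free(orf, rng)))
    print("".join(model2_shuffle(orf, rng)))
```

Because model2 only reuses codons already present in the ORF, an input whose valines are all GUG can never yield another valine codon, which is the constraint the caption points out.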
Figure 2. (a) Estimate of the fraction of ORFs under structural selection. (b) Virus ORFs are significantly more often under structural selection. p-values are two-tailed p-values from unpaired t-tests. Error bars show the standard deviation. (c) Codon frequencies in ORFs sorted by DBF score. Example data from Mus musculus. (d) Amino acid frequencies in ORFs sorted by DBF score. Example data from M. musculus. (e) Structuring of ORFs that code for identical peptide sequences but use different sets of codons (top) and structuring of ORFs that code for peptides composed of different sets of amino acids (bottom).