Josh T Cuperus, Benjamin Groves, Anna Kuchina, Alexander B Rosenberg, Nebojsa Jojic, Stanley Fields, Georg Seelig.
Abstract
Our ability to predict protein expression from DNA sequence alone remains poor, reflecting our limited understanding of cis-regulatory grammar and hampering the design of engineered genes for synthetic biology applications. Here, we generate a model that predicts protein expression from the 5' untranslated region (UTR) of mRNAs in the yeast Saccharomyces cerevisiae. We constructed a library of half a million 50-nucleotide-long random 5' UTRs and assayed their activity in a massively parallel growth selection experiment. The resulting data allow us to quantify the impact on protein expression of Kozak sequence composition, upstream open reading frames (uORFs), and secondary structure. We trained a convolutional neural network (CNN) on the random library and showed that it performs well at predicting the protein expression of both a held-out set of the random 5' UTRs and native S. cerevisiae 5' UTRs. The model was additionally used to computationally evolve highly active 5' UTRs. We confirmed experimentally that the great majority of the evolved sequences led to higher protein expression rates than the starting sequences, demonstrating the predictive power of this model.
Year: 2017 PMID: 29097404 PMCID: PMC5741052 DOI: 10.1101/gr.224964.117
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Experimental design and biological discovery. (A) Experimental design of a liquid-based growth assay of 489,348 5′ UTR variants. Random 50 nt were introduced directly upstream of the HIS3 coding sequence, replacing the 56 nt of the 5′ UTR of the CYC1 promoter. These constructs were introduced into a low copy number plasmid, transformed into yeast without a native copy of HIS3, and competed in media lacking histidine. The enrichment of each UTR after growth was measured by using massively parallel sequencing before and after selection. (B) 5′ UTR enrichment scores per nucleotide were averaged at each position. (C) The Kozak sequences (−5 to −1 position) leading to the highest His3 protein expression compared to the most abundant yeast Kozak sequence (AAAAA). (D) The enrichment of 5′ UTRs based on the predicted minimum free energy of the −50 to +70 sequences. (E) The enrichment of 5′ UTRs based on the presence of an upstream AUG (uAUG) and a stop codon within the UTR. Upstream open reading frames (uORFs) are characterized by an in-frame uAUG followed by a termination codon before the primary ORF start codon, or an out-of-frame uAUG followed by a stop codon before or after the primary ORF start codon.
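The uORF definition in panel E can be sketched in code. The following is a minimal illustration, not the authors' analysis pipeline: the function name `classify_uaugs`, its label strings, and the choice to pass the coding sequence as a separate argument are all assumptions made for this example. Sequences are handled as DNA (ATG for AUG), and an out-of-frame uAUG is only called a uORF if a stop codon is found within the sequence supplied.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_uaugs(utr, cds):
    """Label every upstream ATG in `utr` following the Figure 1E definition.

    `utr` is the 5' UTR and `cds` the coding sequence starting at the
    primary ATG. Returns a list of (position_in_utr, label) tuples.
    Labels are hypothetical names chosen for this sketch.
    """
    utr = utr.upper().replace("U", "T")
    cds = cds.upper().replace("U", "T")
    seq = utr + cds
    n = len(utr)  # index of the primary start codon within seq
    calls = []
    for i in range(n - 2):
        if utr[i:i + 3] != "ATG":
            continue
        in_frame = (n - i) % 3 == 0
        # Walk codon by codon in the uAUG's reading frame to the first stop.
        stop_at = None
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOP_CODONS:
                stop_at = j
                break
        if in_frame:
            # A stop before the primary start closes a uORF; otherwise the
            # uAUG simply extends the primary ORF at its N terminus.
            if stop_at is not None and stop_at < n:
                label = "in_frame_uorf"
            else:
                label = "in_frame_extension"
        else:
            # Out of frame: a stop before or after the primary start counts.
            if stop_at is not None:
                label = "out_of_frame_uorf"
            else:
                label = "out_of_frame_no_stop_found"
        calls.append((i, label))
    return calls
```

For example, `classify_uaugs("AAAATGAAATAA", "ATGGCTTAA")` reports a single in-frame uORF starting at UTR position 3, because the stop codon at position 9 lies upstream of the primary start codon.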
Figure 2. A convolutional neural network (CNN) approach to model random 5′ UTR sequences. (A) A three-layer convolutional neural network model trained on random 5′ UTRs was tested on a held-out test set of the top 5% based on input read depth. Tested 5′ UTRs are specified by color for those with or without an upstream open reading frame. (B) Four hundred eighty-eight thousand random 13-mers were scored for each filter in layer 1 of the CNN. The top 1000 13-mers were used to create a positional weight matrix (PWM) for each filter. These PWMs include motifs of start codons, stop codons, and guanine quadruplexes. Positive Pearson correlations indicate a positive effect on enrichment, while negative correlations indicate a negative effect on enrichment. (C) The effect of each motif per position was measured by assessing the Pearson correlation of motif score and enrichment at each position. Heat maps of all 5′ UTRs (left) and those lacking upstream AUGs (right), including specific examples highlighting filters with different positional patterns are shown.
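The procedure in panel B (score random k-mers against a first-layer filter, then build a PWM from the top scorers) might look roughly like the sketch below. It is an illustration under stated assumptions: the filter is represented as a plain list of per-position weights in A, C, G, T order, the score is the unbiased dot product with the one-hot encoded k-mer, and the PWM is a simple base-frequency matrix. The function names and the small default sample sizes are hypothetical; the caption's own figures are 488,000 random 13-mers and the top 1000.

```python
import random

BASES = "ACGT"


def score_kmer(kmer, filt):
    """Dot product of a one-hot encoded k-mer with the filter weights."""
    return sum(filt[pos][BASES.index(base)] for pos, base in enumerate(kmer))


def pwm_from_top_kmers(filt, n_random=5000, n_top=100, seed=0):
    """Score random k-mers with one filter and return per-position base
    frequencies (rows ordered A, C, G, T) among the top-scoring k-mers."""
    rng = random.Random(seed)
    k = len(filt)
    kmers = ["".join(rng.choice(BASES) for _ in range(k))
             for _ in range(n_random)]
    top = sorted(kmers, key=lambda s: score_kmer(s, filt), reverse=True)[:n_top]
    pwm = []
    for pos in range(k):
        counts = [0] * len(BASES)
        for s in top:
            counts[BASES.index(s[pos])] += 1
        pwm.append([c / len(top) for c in counts])
    return pwm


def consensus(pwm):
    """Most frequent base at each PWM position."""
    return "".join(BASES[row.index(max(row))] for row in pwm)
```

With a toy 5-position filter whose only nonzero weights favor A, T, and G at positions 1 to 3, the central consensus of the resulting PWM recovers the start-codon motif ATG, which is the kind of motif the caption says the learned filters capture.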
Figure 3. Validation of the CNN model on native 5′ UTRs. (A) Native 5′ UTR sequences were synthesized in 50-nt fragments and introduced into the HIS3-based selection system. (B) Correlation of a native library with the predictions from our convolutional neural network built from random sequences.
Figure 4. Model-guided optimization of 5000 random sequences. (A) Using our convolutional neural network, we iteratively predicted the optimal single nucleotide change in 100 random 5′ UTR sequences until no additional increase in enrichment was predicted. An example of these changes can be seen in the inset. (B) The start, midpoint, and endpoints from evolutions in A were tested experimentally. The predicted and observed enrichments are plotted. (C) Experimental data from endpoints of the optimized 5′ UTR sequences derived from both the random and native sets of sequences are compared to the enrichment distribution from the original random and native libraries. (D) Five thousand sequences from our random library were evolved over 40 steps and assayed for enrichment and depletion of common nucleotide features. (E) Analysis of the enrichment (left) and depletion (right) of motifs identified from the first convolutional layer of our model (the same motifs as described in Figure 2).
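The iterative optimization in panel A reads as a greedy search over single-nucleotide substitutions, which can be sketched as follows. This is a minimal sketch, not the authors' code: `evolve` and its `max_steps` parameter are names invented here, and `toy_predict` is a deliberately crude, hypothetical stand-in for the trained CNN so that the example runs on its own.

```python
BASES = "ACGT"


def toy_predict(utr):
    """Hypothetical scorer standing in for the CNN's predicted enrichment:
    rewards A content and penalizes every upstream ATG."""
    return utr.count("A") - 5 * utr.count("ATG")


def evolve(utr, predict, max_steps=None):
    """Greedy in-silico evolution.

    At each step, score every single-nucleotide substitution with `predict`
    and keep the best one. Stop when no substitution raises the predicted
    score, or after `max_steps` accepted changes if a limit is given.
    Returns the trajectory as a list of (sequence, score) pairs.
    """
    trajectory = [(utr, predict(utr))]
    steps = 0
    while max_steps is None or steps < max_steps:
        current, current_score = trajectory[-1]
        best, best_score = current, current_score
        for pos in range(len(current)):
            for base in BASES:
                if base == current[pos]:
                    continue
                candidate = current[:pos] + base + current[pos + 1:]
                score = predict(candidate)
                if score > best_score:
                    best, best_score = candidate, score
        if best == current:
            break  # no single change improves the prediction
        trajectory.append((best, best_score))
        steps += 1
    return trajectory
```

Calling `evolve("CATGC", toy_predict)` first removes the upstream ATG and then converts the remaining positions to A, ending at `AAAAA`. Passing `max_steps=40` would mirror the fixed 40-step runs mentioned in panel D, and the first, middle, and last entries of the returned trajectory correspond to the start, midpoint, and endpoint sequences referred to in panel B.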