Geoffrey Siwo1, Andrew Rider2, Asako Tan3, Richard Pinapati4, Scott Emrich2, Nitesh Chawla2, Michael Ferdig4.
Abstract
The quantitative prediction of the transcriptional activity of genes from promoter sequence is fundamental to the engineering of biological systems for industrial purposes and to understanding the natural variation in gene expression. To catalyze the development of new algorithms for this purpose, the Dialogue on Reverse Engineering Assessment and Methods (DREAM) organized a community challenge seeking predictive models of promoter activity given normalized promoter activity data for 90 ribosomal protein promoters driving expression of a fluorescent reporter gene. By developing an unbiased modeling approach that performs an iterative search for predictive DNA sequence features using the frequencies of various k-mers, inferred DNA mechanical properties and spatial positions of promoter sequences, we achieved best-performer status in this challenge. The specific predictive features used in the model included the frequency of the nucleotide G, the length of polymeric tracts of T and TA, the frequencies of 6 distinct trinucleotides and 12 tetranucleotides, and the predicted protein deformability of the DNA sequence. Our method predicted the activity of 20 natural variants of ribosomal protein promoters (Spearman correlation r = 0.73) more accurately than that of 33 laboratory-mutated variants of the promoters (r = 0.57) in a test set that was hidden from participants. Notably, our model differed substantially from the rest in 2 main ways: i) it did not explicitly utilize transcription factor binding information, implying that subtle DNA sequence features are highly associated with gene expression, and ii) it was based exclusively on features extracted from the 100 bp region upstream of the translational start site, demonstrating that this region encodes much of the overall promoter activity.
The findings from this study have important implications for the engineering of predictable gene expression systems and the evolution of gene expression in naturally occurring biological systems.
Keywords: DNA sequence; DREAM challenges; Expression prediction; Gene expression; Gene regulation; Machine learning; Promoter activity; Transcription modeling
Year: 2016 PMID: 27347373 PMCID: PMC4916984 DOI: 10.12688/f1000research.7485.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. Summary of the DREAM6 gene expression challenge.
(A) Training data consisted of DNA sequences for 90 yeast RP promoters whose activities were experimentally determined [30, 34]. DNA sequences were also provided for a blinded test set of 53 promoters whose activities had likewise been experimentally determined but were withheld from the challenge participants. (B) Outline of the strategy for modeling promoter activity. Each promoter was segmented into 100 bp non-overlapping windows, with the full promoter regarded as a separate window. For each window, DNA sequence features were extracted and feature selection using a linear regression wrapper was performed prior to machine learning. The performance of machine learning models trained on each window was determined in 5- and 10-fold cross-validations using Pearson correlation.
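The windowing step in panel B can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `segment_promoter` and the window labels are hypothetical, and it assumes each promoter is written 5' to 3' and ends at the translational start site, so that windows are counted backwards from the 3' end.

```python
def segment_promoter(seq, window=100):
    """Split a promoter into non-overlapping windows plus the full sequence.

    Assumption (not stated in the source): the sequence ends at the
    translational start site, so window "w1" is the `window` bases
    immediately upstream of it, "w2" the next block upstream, and so on.
    """
    seq = seq.upper()
    windows = {}
    end = len(seq)
    index = 1
    while end > 0:
        start = max(0, end - window)  # the most distal window may be shorter
        windows[f"w{index}"] = seq[start:end]
        end = start
        index += 1
    windows["full"] = seq  # the full promoter is treated as its own window
    return windows
```

Under this convention, `segment_promoter(seq)["w1"]` is the 100 bp region closest to the gene, and a separate model could be trained on the features of each entry.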
DNA sequence features predictive of promoter activity.
| DNA feature | Description |
|---|---|
| Mononucleotides | Frequency of G |
| Dinucleotides | Frequency of GT |
| Trinucleotides | Frequency of 6 trinucleotides |
| Tetranucleotides | Frequency of 12 tetranucleotides |
| T-tracts | Length of T-tracts |
| TA-tracts | Length of TA-tracts |
| DNA deformability | Negatively correlated to activity |
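Two of the feature families in the table, k-mer frequencies and tract lengths, are simple to compute from a sequence. The sketch below is illustrative only: the helper names are hypothetical, the frequency is normalized by the number of overlapping positions, and "length of tracts" is interpreted as the longest uninterrupted run, none of which is specified in the source.

```python
import re


def kmer_frequency(seq, kmer):
    """Fraction of overlapping positions in `seq` at which `kmer` starts."""
    seq, kmer = seq.upper(), kmer.upper()
    k = len(kmer)
    positions = len(seq) - k + 1
    if positions <= 0:
        return 0.0
    hits = sum(1 for i in range(positions) if seq[i:i + k] == kmer)
    return hits / positions


def longest_tract(seq, unit):
    """Length in bp of the longest run of back-to-back copies of `unit`."""
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    runs = [len(m.group(0)) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)
```

For example, `kmer_frequency(window, "G")` and `longest_tract(window, "TA")` would give the mononucleotide and TA-tract entries for one window. DNA deformability is not covered here because it depends on a lookup table of published parameters.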
Figure 2. Performance of the SVM model on the validation test set, as evaluated by the DREAM consortium.
(A) Correlation between the activity predicted by the SVM model and the actual activity of the 53 promoters whose activity was not available to participants. (B) Performance of team FIrST relative to the other 20 teams based on a combined score.
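The agreement between predicted and measured activities is reported as a Spearman correlation, which is the Pearson correlation of the ranks. The following is a minimal standard-library sketch of that calculation, with ties given their average rank. The function names are illustrative, and it is not the challenge's scoring code.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because only ranks matter, any monotonic relationship between predicted and actual activity scores 1, even if it is not linear.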
Figure 3. Relationship between protein deformability of promoters and activity.
Among the top 20 promoters with extreme activities (high and low), a significant deviation in deformability occurs in the -40 to -60 bp region from the TrSS (t-test, P = 0.008).
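A comparison of this kind, deformability in one region for high-activity versus low-activity promoters, can be sketched with a two-sample t statistic. The caption does not say which variant of the t-test was used. The version below assumes Welch's unequal-variance form, uses a hypothetical function name, and computes only the statistic, since converting it to a P value needs the t distribution, which the standard library does not provide.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples (each needs >= 2 values)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error
```

Here `group_a` and `group_b` would hold the mean deformability of the region of interest for each promoter in the high- and low-activity groups, respectively.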
Figure 4. Dependence of prediction error on promoter class or activity.
(A) Natural promoters had a lower prediction error than synthetically mutated promoters. (B) Prediction error is negatively correlated with promoter activity.
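The per-class comparison in panel A amounts to grouping prediction errors by promoter class. The sketch below assumes that "prediction error" means the absolute difference between actual and predicted activity, which the caption does not define. The function name and record layout are illustrative.

```python
from collections import defaultdict


def mean_abs_error_by_class(records):
    """Mean absolute prediction error per promoter class.

    `records` is an iterable of (promoter_class, actual, predicted) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # class -> [error sum, count]
    for promoter_class, actual, predicted in records:
        totals[promoter_class][0] += abs(actual - predicted)
        totals[promoter_class][1] += 1
    return {cls: total / count for cls, (total, count) in totals.items()}
```

A call such as `mean_abs_error_by_class([("natural", 1.0, 1.5), ("mutated", 1.0, 2.0)])` uses made-up values purely to show the input shape. The per-promoter errors could likewise be correlated against actual activity to examine the trend in panel B.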