Ramit Bharanikumar, Keshav Aditya R Premkumar, Ashok Palaniappan.
Abstract
We present PromoterPredict, a dynamic multiple regression approach to predict the strength of Escherichia coli promoters binding the σ70 factor of RNA polymerase. σ70 promoters are ubiquitously used in recombinant DNA technology, but characterizing their strength is demanding in terms of both time and money. We parsed a comprehensive database of bacterial promoters for the −35 and −10 hexamer regions of σ70-binding promoters and used these sequences to construct the respective position weight matrices (PWMs). Next, we used a well-characterized set of promoters to train a multivariate linear regression model and learn the mapping between the PWM scores of the −35 and −10 hexamers and the promoter strength. We found that the log of the promoter strength is significantly linearly associated with a weighted sum of the −10 and −35 sequence profile scores. We applied our model to 100 sets of 100 randomly generated promoter sequences to generate a sampling distribution of mean strengths of random promoter sequences and obtained a mean of 6E-4 ± 1E-7. Our model was further validated by cross-validation and on independent datasets of characterized promoters. PromoterPredict accepts −10 and −35 hexamer sequences and returns the predicted promoter strength. It is capable of dynamic learning from user-supplied data to refine the model construction and yield more robust estimates of promoter strength. PromoterPredict is available as both a web service (https://promoterpredict.com) and a standalone tool (https://github.com/PromoterPredict). Our work presents an intuitive generalization applicable to modelling the strength of other promoter classes.
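The PWM construction and scoring step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the log2-odds scoring, the pseudocount of 1, the uniform background of 0.25, the function names `build_pwm` and `score`, and the toy alignment (which is not the database-derived set used by the authors).

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(hexamers, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from aligned sequences.

    The pseudocount and uniform background are illustrative assumptions,
    not parameters reported in the abstract.
    """
    length = len(hexamers[0])
    if any(len(h) != length for h in hexamers):
        raise ValueError("all sequences must have the same length")
    total = len(hexamers) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(h[pos].upper() for h in hexamers)
        pwm.append({
            b: math.log2(((counts[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, hexamer):
    """Sum the per-position log-odds scores of a hexamer."""
    return sum(col[base] for col, base in zip(pwm, hexamer.upper()))

# Toy -35 alignment, for illustration only.
toy_minus35 = ["TTGACA", "TTGACG", "TTTACA", "CTGACA", "TTGATA"]
pwm35 = build_pwm(toy_minus35)
consensus_score = score(pwm35, "TTGACA")
mismatch_score = score(pwm35, "CCCCCC")
print(consensus_score > mismatch_score)
```

A hexamer close to the consensus of the alignment scores higher than a dissimilar one; scores of this kind for the −35 and −10 hexamers are the two predictors fed to the regression.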
Keywords: Data mining; Genetic engineering; PWM construction; Promoter sequences; Promoter strength prediction; Regression modelling; Sigma70 promoters; Software tools; Weak promoters
Year: 2018; PMID: 30425888; PMCID: PMC6228582; DOI: 10.7717/peerj.5862
Source DB: PubMed; Journal: PeerJ; ISSN: 2167-8359; Impact factor: 2.984
Figure 1. Sequence logos of the −35 and −10 hexamers of the selected RegulonDB promoters.
(A) −35 motif; (B) −10 motif. Figure was made using WebLogo (Crooks et al., 2004).
Summary of promoter information.
| Promoter | −35 hexamer | −10 hexamer | Promoter activity (strength) | ln(strength) | Predicted ln(strength) |
|---|---|---|---|---|---|
| BBa_J23100 | TTGACG | TACAGT | 1 | 0 | −1.6336486579 |
| BBa_J23101 | TTTACA | TATTAT | 0.7 | −0.35667494 | 0.0555718065 |
| BBa_J23102 | TTGACA | TACTGT | 0.86 | −0.15082289 | −1.0957849491 |
| BBa_J23104 | TTGACA | TATTGT | 0.72 | −0.32850407 | 0.1647181133 |
| BBa_J23105 | TTTACG | TACTAT | 0.24 | −1.42711636 | −2.2871659092 |
| BBa_J23106 | TTTACG | TATAGT | 0.47 | −0.75502258 | −1.3174788735 |
| BBa_J23107 | TTTACG | TATTAT | 0.36 | −1.02165125 | −1.0266628468 |
| BBa_J23108 | CTGACA | TATAAT | 0.51 | −0.67334455 | −0.4282477098 |
| BBa_J23109 | TTTACA | GACTGT | 0.04 | −3.21887582 | −3.3693144659 |
| BBa_J23110 | TTTAGG | TACAAT | 0.33 | −1.10866262 | −3.3946866337 |
| BBa_J23111 | TTGACG | TATAGT | 0.58 | −0.54472718 | −0.3731455955 |
| BBa_J23112 | CTGATA | GATTAT | 0.01 | −4.60517019 | −3.1533888284 |
| BBa_J23113 | CTGATG | GATTAT | 0.01 | −4.60517019 | −4.2356234817 |
| BBa_J23114 | TTTATG | TACAAT | 0.1 | −2.30258509 | −2.5943689001 |
| BBa_J23115 | TTTATA | TACAAT | 0.15 | −1.89711998 | −1.5121342469 |
| BBa_J23116 | TTGACA | GACTAT | 0.16 | −1.83258146 | −1.5897942167 |
| BBa_J23117 | TTGACA | GATTGT | 0.06 | −2.81341072 | −1.1644781255 |
| BBa_J23118 | TTGACG | TATTGT | 0.56 | −0.5798185 | −0.91751654 |
Note:
The promoter activities (strengths) span two orders of magnitude, from 0.01 to 1.0. The promoters follow the naming in the Anderson dataset.
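As a quick check on the table above, the fifth value in each row appears to be the natural logarithm of the promoter activity. The sketch below recomputes it for a few rows copied from the table; the variable names are illustrative.

```python
import math

# (promoter, reported activity, reported log value) copied from the table above
rows = [
    ("BBa_J23100", 1.0, 0.0),
    ("BBa_J23101", 0.7, -0.35667494),
    ("BBa_J23109", 0.04, -3.21887582),
    ("BBa_J23112", 0.01, -4.60517019),
]

for name, strength, reported in rows:
    computed = math.log(strength)  # natural log
    print(f"{name}: ln({strength}) = {computed:.8f} (table: {reported})")

# Ratio of the strongest to the weakest activity, in orders of magnitude
orders_of_magnitude = math.log10(1.0 / 0.01)
print(orders_of_magnitude)
```

The log transform is what the regression is fitted against, so predicted values must be exponentiated to return to the activity scale.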
Figure 2. The regression surface of the estimated model with the training data points (red).
x- and y-axes represent PWM scores and the z-axis (vertical) represents the predicted ln(promoter strength).
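The surface in this figure corresponds to a linear model of the following general form. The symbols are notation introduced here for illustration ($S_{-35}$ and $S_{-10}$ for the two PWM scores, $\beta_i$ for the fitted coefficients), and the presence of an intercept term is an assumption; the fitted coefficient values are not given in this record.

```latex
\ln(\text{strength}) = \beta_0 + \beta_1\, S_{-35} + \beta_2\, S_{-10} + \varepsilon
```

Under this form, the predicted strength on the original scale is $\exp(\beta_0 + \beta_1 S_{-35} + \beta_2 S_{-10})$.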
Cross-validation results.
| Fold | PWM_35 | PWM_10 | Combined | logStrength | cvpred | cvres |
|---|---|---|---|---|---|---|
| 1 | 6.5966 | 2.398 | 9 | 0 | −1.757 | 1.757 |
| 2 | 6.9195 | 8.089 | 15.01 | −0.357 | 0.145 | −0.50 |
| 3 | 9.1308 | 0.402 | 9.53 | −0.151 | −1.3 | 1.15 |
| 4 | 9.1308 | 5.025 | 14.16 | −0.329 | 0.286 | −0.62 |
| 5 | 4.3854 | 3.465 | 7.85 | −1.427 | −2.36 | 0.93 |
| 6 | 4.3854 | 7.022 | 11.41 | −0.755 | −1.377 | 0.62 |
| 7 | 4.3854 | 8.089 | 12.47 | −1.022 | −1.027 | 0.00 |
| 8 | 4.5119 | 10.086 | 14.6 | −0.673 | −0.362 | −0.31 |
| 9 | 6.9195 | −4.474 | 2.45 | −3.219 | −3.463 | 0.24 |
| 10 | 4.3854 | 5.462 | 9.85 | −1.109 | −1.792 | 0.68 |
| 11 | 6.5966 | 7.022 | 13.62 | −0.545 | −0.349 | −0.20 |
| 12 | 2.5179 | 3.213 | 5.73 | −4.605 | −2.847 | −1.76 |
| 13 | −0.0162 | 3.213 | 3.2 | −4.605 | −3.977 | −0.63 |
| 14 | 2.3914 | 5.462 | 7.85 | −2.303 | −2.646 | 0.34 |
| 15 | 4.9255 | 5.462 | 10.39 | −1.897 | −1.485 | −0.41 |
| 16 | 9.1308 | −1.411 | 7.72 | −1.833 | −1.518 | −0.32 |
| 17 | 9.1308 | 0.15 | 9.28 | −2.813 | −0.796 | −2.02 |
| 18 | 6.5966 | 5.025 | 11.62 | −0.58 | −0.944 | 0.36 |
Note:
In each fold of cross-validation, the instance corresponding to the fold was designated as the test instance, while the prediction model was built using the rest of the instances (leave-one-out cross-validation). This process was repeated 18 times, once for each test instance, and the cross-validation (CV) residuals were obtained. Combined, sum of the PWM scores; cvpred, predicted log strength of the test instance; cvres, cross-validation residual.
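The leave-one-out procedure in the note can be sketched in plain Python using the PWM_35, PWM_10 and logStrength columns copied from the table above. The model form (intercept plus two predictors, fitted by ordinary least squares) and the helper names `fit_ols` and `loocv` are assumptions for illustration; because the table values are rounded and the exact model specification is assumed, the output may differ from the cvpred column.

```python
# (PWM_35, PWM_10) and logStrength, one entry per fold of the table above
X = [(6.5966, 2.398), (6.9195, 8.089), (9.1308, 0.402), (9.1308, 5.025),
     (4.3854, 3.465), (4.3854, 7.022), (4.3854, 8.089), (4.5119, 10.086),
     (6.9195, -4.474), (4.3854, 5.462), (6.5966, 7.022), (2.5179, 3.213),
     (-0.0162, 3.213), (2.3914, 5.462), (4.9255, 5.462), (9.1308, -1.411),
     (9.1308, 0.15), (6.5966, 5.025)]
y = [0.0, -0.357, -0.151, -0.329, -1.427, -0.755, -1.022, -0.673, -3.219,
     -1.109, -0.545, -4.605, -4.605, -2.303, -1.897, -1.833, -2.813, -0.58]

def fit_ols(rows, targets):
    """Least-squares fit of t = b0 + b1*x1 + b2*x2 via the normal equations."""
    design = [[1.0, *r] for r in rows]
    k = len(design[0])
    # Augmented normal-equation matrix [X'X | X't]
    A = [[sum(d[i] * d[j] for d in design) for j in range(k)]
         + [sum(d[i] * t for d, t in zip(design, targets))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def loocv(rows, targets):
    """Hold out each instance once, fit on the rest, predict the held-out one."""
    preds = []
    for i in range(len(targets)):
        b = fit_ols(rows[:i] + rows[i + 1:], targets[:i] + targets[i + 1:])
        preds.append(b[0] + sum(bj * xj for bj, xj in zip(b[1:], rows[i])))
    residuals = [t - p for t, p in zip(targets, preds)]
    return preds, residuals

cvpred, cvres = loocv(X, y)
for fold, (p, r) in enumerate(zip(cvpred, cvres), start=1):
    print(f"fold {fold:2d}: cvpred = {p:7.3f}, cvres = {r:6.2f}")
```

The residual is defined here as observed minus predicted, which agrees with the sign convention of the cvres column (for example, fold 1: 0 − (−1.757) = 1.757).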
Validation results: using data of Davis, Rubin & Sauer (2011).
| Actual rank | Promoter | −35 sequence | −10 sequence | Strength | Predicted exp(logStrength) | Predicted rank |
|---|---|---|---|---|---|---|
| 1 | pro1 | tttacg | gtatct | 0.009 | 0.0079073845 | 1 |
| 2.5 | pro2 | gcggtg | tataat | 0.017 | 0.0306978849 | 2.5 |
| 2.5 | pro3 | ttgacg | gaggat | 0.017 | 0.0306978849 | 2.5 |
| 4 | proA | tttacg | taggct | 0.03 | 0.0482647297 | 4 |
| 5 | pro4 | tttacg | gatgat | 0.033 | 0.0809816409 | 5 |
| 6 | pro5 | tttacg | taggat | 0.05 | 0.0867400443 | 6 |
| 7 | proB | tttacg | taatat | 0.119 | 0.1534857959 | 7 |
| 8 | pro6 | tttacg | taaaat | 0.193 | 0.2645364297 | 8 |
| 9 | proC | tttacg | tatgat | 0.278 | 0.3059490889 | 9 |
| 10 | proD | tttacg | tataat | 1 | 0.6173668247 | 10 |
Note:
The promoters were ordered by the rank of their measured strength and given as input to our model. The ranks of the predicted promoter log strengths were then compared with the actual ranks, and the predicted ordering matched the original ordering. The individual predicted values for pro2 and pro3 were 0.0024 and 0.059, respectively; the table lists their mean, 0.0307, as the two promoters have the same measured strength.
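The rank comparison can be reproduced from the table with a small average-rank helper, in which tied values share the mean of the positions they occupy. The function name `average_ranks` is illustrative, and the inputs are the Strength and Predicted exp(logStrength) columns as printed.

```python
def average_ranks(values):
    """Rank values in ascending order; ties receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

strength = [0.009, 0.017, 0.017, 0.03, 0.033, 0.05, 0.119, 0.193, 0.278, 1.0]
predicted = [0.0079073845, 0.0306978849, 0.0306978849, 0.0482647297,
             0.0809816409, 0.0867400443, 0.1534857959, 0.2645364297,
             0.3059490889, 0.6173668247]

actual_ranks = average_ranks(strength)
predicted_ranks = average_ranks(predicted)
print(actual_ranks)
print(actual_ranks == predicted_ranks)
```

With the values as listed, pro2 and pro3 share rank 2.5 in both columns and the two rank vectors are identical.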
Validation with T. maritima strong promoter candidates.
| Promoter | −35 sequence | −10 sequence | Strength | Predicted exp(logStrength) | Predicted class |
|---|---|---|---|---|---|
| TM0373 | ttgaca | tataat | Strong | 4.6845788997 | Strong |
| TM1016 | ttgaat | tttaat | Strong | 0.3808572257 | Strong |
| TM1272 | ttgaca | tttaat | Strong | 1.6386551999 | Strong |
| TM1429 | ttgaca | tataat | Strong | 4.6845788997 | Strong |
| TM1667 | ttgaaa | tataat | Strong | 2.5859432664 | Strong |
| TM1780 | ttcata | tataat | Strong | 0.463878289 | Strong |
| Tmt11 | ttgaat | taaaat | Strong | 0.4665383797 | Strong |
| TM0032 | tcgaaa | cataat | Strong | 0.0562167049 | |
| TM0477 | ttgaat | tataat | Strong | 1.0887926414 | Strong |
| TM1067 | ttgacc | tattat | Strong | 0.7046782664 | Strong |
| TM1271 | ttgaca | tataat | Strong | 4.6845788997 | Strong |
| Tmt45 | ttgaac | tataat | Strong | 0.670434893 | Strong |
| TM1490 | ttgact | taaaat | Strong | 0.8451600149 | Strong |
Figure 4. Model diagnostics plots for investigating the assumptions underlying linear modelling.
(A) Residuals vs. fitted values; (B) homogeneity of residual variances; (C) normal Q-Q plot; and (D) residuals vs. leverage plot.
Validation with major (A1, A2, A3) and minor (C, D) promoters.
| Promoter | −35 sequence | −10 sequence | Strength | Predicted exp(logStrength) | Predicted class |
|---|---|---|---|---|---|
| A1 | ttgact | gatact | strong | 0.2904988307 | Medium |
| A2 | ttgaca | taagat | strong | 0.9947607331 | Strong |
| A3 | ttgaca | tacgat | strong | 0.658183377 | Strong |
| C | ttgacg | tagtct | minor | 0.1452865585 | Minor |
| D | ttgact | taggct | minor | 0.1541996302 | Minor |
Figure 3. Effects plots of promoter sites on promoter strength.
(A) −35 promoter site; and (B) −10 promoter site.
Correlation matrix of features and response variables.
| Correlation coefficient | PWM–35 | PWM–10 | Combined | Strength | Log-strength |
|---|---|---|---|---|---|
| PWM–35 | 1 | −0.3715610 | 0.3401672 | 0.4558838 | 0.5153622 |
| PWM–10 | −0.3715610 | 1 | 0.7466500 | 0.3025062 | 0.4115533 |
| Combined | 0.3401672 | 0.7466500 | 1 | 0.6330488 | 0.7861173 |
| Strength | 0.4558838 | 0.3025062 | 0.6330488 | 1 | 0.8665495 |
| Log-strength | 0.5153622 | 0.4115533 | 0.7861173 | 0.8665495 | 1 |
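The feature-to-feature entries of this matrix can be recomputed as Pearson correlation coefficients from the PWM_35 and PWM_10 columns of the cross-validation table, with the combined score taken as their sum. This is a minimal sketch: the helper name `pearson` is illustrative, and small deviations from the printed values are expected because the table inputs are rounded.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# PWM_35 and PWM_10 columns copied from the cross-validation table
pwm35 = [6.5966, 6.9195, 9.1308, 9.1308, 4.3854, 4.3854, 4.3854, 4.5119,
         6.9195, 4.3854, 6.5966, 2.5179, -0.0162, 2.3914, 4.9255, 9.1308,
         9.1308, 6.5966]
pwm10 = [2.398, 8.089, 0.402, 5.025, 3.465, 7.022, 8.089, 10.086, -4.474,
         5.462, 7.022, 3.213, 3.213, 5.462, 5.462, -1.411, 0.15, 5.025]
combined = [a + b for a, b in zip(pwm35, pwm10)]

r_35_10 = pearson(pwm35, pwm10)
r_35_comb = pearson(pwm35, combined)
r_10_comb = pearson(pwm10, combined)
print(f"PWM-35 vs PWM-10:   {r_35_10:.4f}")
print(f"PWM-35 vs Combined: {r_35_comb:.4f}")
print(f"PWM-10 vs Combined: {r_10_comb:.4f}")
```

The negative correlation between the two PWM scores (about −0.37) indicates that the predictors are not strongly collinear, while the combined score correlates more strongly with log-strength than either score alone, as the matrix shows.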