Cédric Nadiras1,2, Eric Eveno1, Annie Schwartz1, Nara Figueroa-Bossi3, Marc Boudvillain1.
Abstract
Bacterial transcription termination proceeds via two main mechanisms triggered either by simple, well-conserved (intrinsic) nucleic acid motifs or by the motor protein Rho. Although bacterial genomes can harbor hundreds of termination signals of either type, only intrinsic terminators are reliably predicted. Computational tools to detect the more complex and diversiform Rho-dependent terminators are lacking. To tackle this issue, we devised a prediction method based on Orthogonal Projections to Latent Structures Discriminant Analysis [OPLS-DA] of a large set of in vitro termination data. Using previously uncharacterized genomic sequences for biochemical evaluation and OPLS-DA, we identified new Rho-dependent signals and quantitative sequence descriptors with significant predictive value. Most relevant descriptors specify features of transcript C>G skewness, secondary structure, and richness in regularly-spaced 5'CC/UC dinucleotides that are consistent with known principles for Rho-RNA interaction. Descriptors collectively warrant OPLS-DA predictions of Rho-dependent termination with a ∼85% success rate. Scanning of the Escherichia coli genome with the OPLS-DA model identifies significantly more termination-competent regions than anticipated from transcriptomics and predicts that regions intrinsically refractory to Rho are primarily located in open reading frames. Altogether, this work delineates features important for Rho activity and describes the first method able to predict Rho-dependent terminators in bacterial genomes.
Year: 2018 PMID: 29931073 PMCID: PMC6144790 DOI: 10.1093/nar/gky563
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Rho-dependent termination of transcription. (A) Schematic representation of the termination process. Putative contacts between Rho and RNAP (74) are not depicted. Transcriptional R-loops, sometimes formed behind RNAPs, are also dissociated by the Rho factor (68), as depicted. (B) Genomic regions encoding Rho-dependent terminators usually contain a C>G bubble upstream from the termination sites (10,13,29,37). The case of the pgaA terminator from E. coli is shown as a representative example (10). (C) Configuration of the RNA interaction network within the Rho hexamer based on crystal structures of E. coli’s Rho (16,75). The tethered tracking mechanism used by Rho implies that PBS-RNA contacts are preserved while other contacts change as the RNA chain is translocated within the SBS, leading to the progressive lengthening of the PBS→SBS linker (see (76) and references therein).
Figure 2. C>G-less genomic regions are devoid of significant Rho-dependent signals. (A) Schematic depiction of the DNA templates used in our standard in vitro transcription termination experiments. (B) Representative denaturing PAGE gels illustrate the absence of formation of Rho-specific truncated transcripts during transcription of C>G-less templates. The RO bands correspond to runoff transcripts. RNAs migrating more slowly than the RO bands result from template switching events (once RNAP has reached a template end; see (12) and references therein). (C) Representative transcription experiments with the four unusual C>G-less templates yielding low amounts of truncated transcripts (identified by stars next to the gels) in the presence of Rho.
Figure 3. Rho-dependent termination signals encoded by DNA templates containing C>G bubbles. Representative denaturing PAGE gels illustrate the different classes of transcription termination signals. Rho did not change the transcription profiles for 5.6% of the tested ‘C>G-plus’ DNA templates (‘None’ category). Rho-dependent signals were considered ‘Weak’ when bands migrating faster than runoff transcripts appeared in the presence of Rho while the intensity of the ‘runoff’ band was hardly affected (47.2% of the ‘C>G-plus’ templates). They were considered ‘Strong’ when Rho elicited the appearance of fast-migrating bands and a sharp decrease of the intensity of the runoff band (47.2% of the ‘C>G-plus’ templates).
Figure 4. Statistical analysis of the Rho-dependent signals detected with the training set of DNA templates (Supplementary Table 1). (A) Dot plots for representative sequence descriptors. ANOVA F- and P-values are shown above each plot. (B) Unsupervised PCA 3D score plot obtained for the 104 DNA templates of the training set using the complete set of 111 descriptors. Q2cum = 0.464 and R2cum = 0.561 for the four PCA components. The gray sphere represents the Hotelling's T2 = 0.05 limit. (C) The bar graph shows the best positive and negative variable loadings for the first principal component (PC1). DenBubLength: cumulated length of all C>G bubbles in non-template DNA strand relative to full strand length (density); (C–G)max%Bub: maximal difference between %C and %G in longest C>G bubble of non-template DNA strand; Nb_Reg1Bu: Number of YC dimers in longest C>G bubble; (C–G)av%Bub: average difference between %C and %G in longest C>G bubble; %CmaxBub: highest %C in longest C>G bubble. %GGC, %TGG, %GGG, %GG and %G are percentages of respective motifs in the non-template DNA strand.
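The bubble descriptors defined in this caption can be illustrated with a minimal Python sketch. The precise definition of a C>G bubble is given in the Supplementary information and is not reproduced here, so the sketch assumes one: a bubble is a maximal run of sliding-window positions where %C exceeds %G. The window size of 50 nt, the function names, and the dictionary keys are all hypothetical, and lengths are counted in window positions rather than nucleotides.

```python
from typing import Dict, List, Tuple


def cg_skew_profile(seq: str, window: int = 50) -> List[float]:
    """%C - %G for every sliding window along the non-template strand."""
    seq = seq.upper()
    profile = []
    for i in range(len(seq) - window + 1):
        w = seq[i:i + window]
        profile.append(100.0 * (w.count("C") - w.count("G")) / window)
    return profile


def find_bubbles(profile: List[float]) -> List[Tuple[int, int]]:
    """Half-open (start, end) runs of window positions with %C > %G.

    ASSUMPTION: this stands in for the authors' bubble definition.
    """
    bubbles, start = [], None
    for i, skew in enumerate(profile):
        if skew > 0 and start is None:
            start = i
        elif skew <= 0 and start is not None:
            bubbles.append((start, i))
            start = None
    if start is not None:
        bubbles.append((start, len(profile)))
    return bubbles


def bubble_descriptors(seq: str, window: int = 50) -> Dict[str, float]:
    """Approximate analogues of a few descriptors named in the caption."""
    profile = cg_skew_profile(seq, window)
    bubbles = find_bubbles(profile)
    if not bubbles:
        return {"den_bub_length": 0.0, "cg_max_bub": 0.0,
                "cg_av_bub": 0.0, "nb_yc_longest": 0}
    s, e = max(bubbles, key=lambda b: b[1] - b[0])  # longest bubble
    skews = profile[s:e]
    # Nucleotides covered by the windows of the longest bubble.
    region = seq.upper()[s:e - 1 + window]
    # YC dimers: pyrimidine (C or T in DNA) followed by C, overlaps counted.
    nb_yc = sum(1 for i in range(len(region) - 1)
                if region[i] in "CT" and region[i + 1] == "C")
    return {
        "den_bub_length": sum(b - a for a, b in bubbles) / len(profile),
        "cg_max_bub": max(skews),
        "cg_av_bub": sum(skews) / len(skews),
        "nb_yc_longest": nb_yc,
    }
```

Calling `bubble_descriptors(seq)` on a non-template strand sequence returns one value per descriptor; a real analysis would need the authors' window size and bubble thresholds.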
Figure 5. OPLS-DA modelling of Rho-dependent termination of transcription. (A) Standard 2D score plot obtained for the OPLS-DA-111 model with the training set of DNA templates. (B) Analysis of the E. coli MG1655 genome with the OPLS-DA-111 model predicts that regions refractory to Rho-dependent termination (Predicted ‘None’ regions) are fewer and smaller than regions eliciting termination (see Supplementary methods for definition of predicted regions). The proportions of genomic positions for which model predictions are reliable (in-model), or not (out-of-model), according to the PmodXPS+ = 0.05 threshold are shown inset. (C) The C>G bubble content strongly correlates with the predicted termination strength of the regions (ANOVA P < 0.001; F = 2713). Regions as in panel B. (D) The Bicyclomycin Sensitive Transcript (BST) start points (25) fall overwhelmingly within predicted ‘Strong’ regions. (E) Comparison of the termination classes predicted with the OPLS-DA-111 model for 500 nt-long sequences located either upstream or downstream from BST start points (see diagram inset). The distributions of the predicted response scores (Ypred) from SIMCA for the ‘Strong’ class are also shown.
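The comparison in panel E relies on 500 nt-long sequences taken on either side of each BST start point. A minimal sketch of how such flanking windows could be extracted is shown below. The function names are hypothetical, coordinates are assumed to be 0-based on the forward strand, the downstream window is assumed to begin at the start point itself, and the genome is treated as linear for simplicity even though the E. coli chromosome is circular.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def flanking_windows(genome: str, start: int, strand: str, size: int = 500):
    """Return (upstream, downstream) windows read 5'->3' on `strand`.

    ASSUMPTIONS: `start` is a 0-based forward-strand coordinate, the
    downstream window includes the start point, and windows that run off
    the end of the (linearised) genome are simply truncated.
    """
    if strand == "+":
        upstream = genome[max(0, start - size):start]
        downstream = genome[start:start + size]
    elif strand == "-":
        # On the reverse strand, upstream lies at higher coordinates.
        upstream = revcomp(genome[start + 1:start + 1 + size])
        downstream = revcomp(genome[max(0, start + 1 - size):start + 1])
    else:
        raise ValueError("strand must be '+' or '-'")
    return upstream, downstream
```

Each returned window could then be turned into descriptors and scored by the model; that scoring step depends on the fitted OPLS-DA model and is not sketched here.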
Main features of the multivariate predictive models of Rho-dependent termination
| Model | Class | PCs<sup>a</sup> | R2Xcum | R2Ycum | Q2cum | CV-ANOVA<sup>b</sup> F | CV-ANOVA<sup>b</sup> P | Classification<sup>c</sup>: Weak | Classification<sup>c</sup>: Strong | Classification<sup>c</sup>: None | Classification<sup>c</sup>: Total | Fisher's P<sup>d</sup> | AUC<sup>e</sup>: Weak | AUC<sup>e</sup>: Strong | AUC<sup>e</sup>: None | Remark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PCA-class-111 | N | 3 | 0.546 | n.a. | 0.277 | n.a. | n.a. | 89.5% | 79.4% | 90.6% | 86.5% | 1.2 × 10⁻⁶ | 0.861 | 0.928 | 0.956 | One PCA per class with all 111 descriptors |
| PCA-class-111 | W | 3 | 0.480 | n.a. | 0.302 | | | | | | | | | | | |
| PCA-class-111 | S | 5 | 0.580 | n.a. | 0.233 | | | | | | | | | | | |
| PLS-DA-111 | | 1 | 0.271 | 0.359 | 0.345 | 11.5 | 2.1 × 10⁻⁸ | | 91.2% | 90.6% | | 3.0 × 10⁻⁷ | | 0.876 | 0.973 | PLS-DA with all 111 descriptors |
| OPLS-DA-111 | | 2 (2)<sup>f</sup> | 0.513 | 0.639 | 0.410 | 4.4 | 2.1 × 10⁻⁷ | 86.8% | 82.3% | 87.5% | 85.6% | 1.1 × 10⁻⁶ | 0.927 | 0.968 | 0.989 | OPLS-DA with all 111 descriptors |
| OPLS-DA-40 | | 2 (1) | 0.709 | 0.536 | 0.454 | 7.9 | 4.3 × 10⁻¹² | 71% | 76.5% | 84.4% | 76.9% | 9.6 × 10⁻⁷ | 0.861 | 0.942 | 0.976 | VIPs < 1 in OPLS-DA-111 removed<sup>g</sup> |
| OPLS-DA-83 | | 2 (2) | 0.572 | 0.627 | 0.453 | 5.9 | 2.4 × 10⁻¹⁰ | 81.6% | 82.4% | 90.6% | 84.6% | 1.1 × 10⁻⁶ | 0.918 | 0.965 | 0.991 | OPLS-DA with only the 83 descriptors that pass ANOVA<sup>g</sup> |
| OPLS-DA-6 | | 2 (0) | 0.962 | 0.398 | 0.372 | 8.6 | 4.4 × 10⁻¹⁰ | 44.7% | 67.7% | 96.9% | 68.3% | 8.7 × 10⁻⁷ | 0.734 | 0.904 | 0.961 | OPLS-DA with C>G Bubble descriptors only<sup>h</sup> |
<sup>a</sup>Orthogonal Principal Components (PCs) are in parentheses when relevant. R2Xcum values do not include contributions from orthogonal PCs.
<sup>b</sup>ANOVA of the cross-validated predictive residuals as implemented in SIMCA software.
<sup>c</sup>Classification based on jackknife cross-validation (see methods).
<sup>d</sup>Fisher's probability of the classification occurring by chance.
<sup>e</sup>The AUC of the ROC curve varies from 0.5 (random prediction) to 1.0 (perfect prediction).
<sup>f</sup>Sources of orthogonal variance in OPLS-DA-111 uncorrelated to termination are discussed in Supplementary information.
<sup>g</sup>Although excluding low ranking descriptors based on ANOVA or VIP (Variable Importance in the Projection) scores sometimes improves OPLS-DA predictions, this strategy was detrimental in our case (see also Supplementary Figure S5).
<sup>h</sup>Descriptors used were SumBubLength, SumBubSurf, DenBubLength, DenBubSurf, BublongLength and BublongSurf (see Supplementary information for details).
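Two of the table's summary statistics can be illustrated with a short Python sketch: the total classification rate as a count-weighted combination of the per-class rates, and a one-vs-rest ROC AUC computed as a rank statistic. The class sizes (38 Weak, 34 Strong, 32 None) and the correct counts used in the example are not stated in this record; they are inferred from the PCA-class-111 percentages and the 104-template training set, and should be read as an assumption. Whether SIMCA computes its AUC in exactly this way is likewise not confirmed by the source.

```python
from typing import Dict, Sequence


def overall_rate(correct: Dict[str, int], sizes: Dict[str, int]) -> float:
    """Total % of observations assigned to their known class."""
    return 100.0 * sum(correct.values()) / sum(sizes.values())


def auc_rank(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outscores a
    random negative (ties count one half). 0.5 is random, 1.0 is perfect."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# ASSUMED class sizes and correct counts, inferred from the table
# percentages (e.g. 34/38 = 89.5%), not taken from the source.
sizes = {"Weak": 38, "Strong": 34, "None": 32}
correct_pca = {"Weak": 34, "Strong": 27, "None": 29}
total_pca = round(overall_rate(correct_pca, sizes), 1)  # 86.5

# Hypothetical predicted scores for 'Strong' versus all other templates.
example_auc = auc_rank([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

The total reproduces the 86.5% reported for PCA-class-111, which supports the inferred class sizes without proving them.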
Figure 6. Distribution of the predicted ‘Strong’, ‘None’, and ‘Weak’ regions as a function of gene location in the MG1655 genome. Four distinct categories (Sense-Intragenic, Sense-Intergenic, Antisense-Intragenic and Antisense-Intergenic) were defined with respect to the arrangement of regions and genes (key is inset). Categorization of the intergenic regions as sense/antisense was done with respect to the next downstream gene (‘sense’ if the region and gene are in same strand orientation, ‘antisense’ otherwise). Contributions to these categories were calculated for each predicted region and then summed up for all regions of the same class, as shown in the diagrams.
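The categorization described in this caption can be sketched as a per-nucleotide tally. This is a simplified illustration, not the authors' procedure: the `Feature` type and function names are hypothetical, "downstream" is assumed to be relative to the region's own strand, a position overlapping genes on both strands is assumed to count as sense, and the genome is treated as linear so positions with no downstream gene are left unassigned.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Feature(NamedTuple):
    start: int   # 0-based, inclusive
    end: int     # exclusive
    strand: str  # "+" or "-"


def next_downstream_gene(pos: int, strand: str,
                         genes: List[Feature]) -> Optional[Feature]:
    """Closest gene lying downstream of `pos` in the reading direction."""
    if strand == "+":
        return min((g for g in genes if g.start > pos),
                   key=lambda g: g.start, default=None)
    return max((g for g in genes if g.end <= pos),
               key=lambda g: g.end, default=None)


def categorize(region: Feature, genes: List[Feature]) -> Counter:
    """Count nucleotides of `region` falling in each of the four categories."""
    counts: Counter = Counter()
    for pos in range(region.start, region.end):
        covering = [g for g in genes if g.start <= pos < g.end]
        if covering:
            sense = any(g.strand == region.strand for g in covering)
            counts[("Sense" if sense else "Antisense") + "-Intragenic"] += 1
        else:
            g = next_downstream_gene(pos, region.strand, genes)
            if g is None:
                counts["Unassigned"] += 1  # linear-genome simplification
            else:
                sense = g.strand == region.strand
                counts[("Sense" if sense else "Antisense") + "-Intergenic"] += 1
    return counts
```

Summing the returned counters over all regions of one predicted class (for example with `sum(counters, Counter())`) gives class-level contributions of the kind shown in the diagrams. The nested scan is quadratic and only suitable for small examples; an interval index would be needed at genome scale.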