P Roback, J Beard, D Baumann, C Gille, K Henry, S Krohn, H Wiste, M I Voskuil, C Rainville, R Rutherford.
Abstract
The prediction of operons in Mycobacterium tuberculosis (MTB) is a first step toward understanding the regulatory network of this pathogen. Here we apply a statistical model using logistic regression to predict operons in MTB. As predictors, our model incorporates intergenic distance and the correlation of gene expression calculated for adjacent gene pairs from over 474 microarray experiments with MTB RNA. We validate our findings with known examples from the literature and experimentation. From this model, we rank each potential operon pair by the strength of evidence for cotranscription, choose a classification threshold with a true positive rate of over 90% at a false positive rate of 9.1%, and use it to construct an operon map for the MTB genome.
Year: 2007 PMID: 17652327 PMCID: PMC1976454 DOI: 10.1093/nar/gkm518
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Known operons in MTB, along with the source of published evidence of laboratory confirmation. This table also includes gene pairs (which may or may not represent complete operons) whose cotranscription was verified in our laboratory by RT-PCR (see Materials and Methods section). Annotations for genes are derived from TubercuList (http://genolist.pasteur.fr/TubercuList/). Descriptions of adjacent genes are separated by semicolons.
| Operon (or gene pair) name | Gene pairs | Annotation | Source |
|---|---|---|---|
| Rv0167-Rv0174 | 7 | CONSERVED HYPOTHETICAL INTEGRAL MEMBRANE PROTEIN YRBE1A; CONSERVED HYPOTHETICAL INTEGRAL MEMBRANE PROTEIN YRBE1B; MCE-FAMILY PROTEIN MCE1A; MCE-FAMILY PROTEIN MCE1B; MCE-FAMILY PROTEIN MCE1C; MCE-FAMILY PROTEIN MCE1D; POSSIBLE MCE-FAMILY LIPOPROTEIN LPRK | ( |
| Rv0490-Rv0491 | 1 | TWO COMPONENT REGULATORY SYSTEM SenX3;RegX3 | ( |
| Rv0933-Rv0936 | 3 | PHOSPHATE-TRANSPORT ATP-BINDING PROTEIN ABC TRANSPORTER PSTB; PERIPLASMIC PHOSPHATE-BINDING LIPOPROTEIN PSTS1 (PBP-1) (PSTS1); PHOSPHATE-TRANSPORT INTEGRAL MEMBRANE ABC TRANSPORTER PSTC1; PHOSPHATE-TRANSPORT INTEGRAL MEMBRANE ABC TRANSPORTER PSTA | ( |
| Rv0986-Rv0988 | 2 | PROBABLE ADHESION COMPONENT TRANSPORT ATP-BINDING PROTEIN ABC TRANSPORTER; PROBABLE ADHESION COMPONENT TRANSPORT ATP-BINDING PROTEIN ABC TRANSPORTER; POSSIBLE CONSERVED EXPORTED PROTEIN | ( |
| Rv1161-Rv1164 | 3 | PROBABLE RESPIRATORY NITRATE REDUCTASE (ALPHA CHAIN) NARG; PROBABLE RESPIRATORY NITRATE REDUCTASE (BETA CHAIN) NARH; PROBABLE RESPIRATORY NITRATE REDUCTASE (DELTA CHAIN) NARJ; PROBABLE RESPIRATORY NITRATE REDUCTASE (GAMMA CHAIN) NARI | ( |
| Rv1411c-Rv1410c | 1 | PROBABLE CONSERVED LIPOPROTEIN LPRG; AMINOGLYCOSIDES/TETRACYCLINE-TRANSPORT INTEGRAL MEMBRANE PROTEIN | ( |
| Rv1477-Rv1478 | 1 | HYPOTHETICAL INVASION PROTEIN; HYPOTHETICAL INVASION PROTEIN | ( |
| Rv1483-Rv1484 | 1 | 3-OXOACYL-[ACYL-CARRIER PROTEIN] REDUCTASE FABG1; NADH-DEPENDENT ENOYL-[ACYL-CARRIER-PROTEIN] REDUCTASE INHA | ( |
| Rv1964-Rv1966 | 2 | CONSERVED HYPOTHETICAL INTEGRAL MEMBRANE PROTEIN YRBE3A; CONSERVED HYPOTHETICAL INTEGRAL MEMBRANE PROTEIN YRBE3B; MCE-FAMILY PROTEIN MCE3A | ( |
| Rv1966-Rv1971 | 5 | MCE-FAMILY PROTEIN MCE3A; MCE-FAMILY PROTEIN MCE3B; MCE-FAMILY PROTEIN MCE3C; MCE-FAMILY PROTEIN MCE3D; POSSIBLE MCE-FAMILY LIPOPROTEIN LPRM; MCE-FAMILY PROTEIN MCE3F | ( |
| Rv2358-Rv2359 | 1 | PROBABLE TRANSCRIPTIONAL REGULATORY PROTEIN; PROBABLE FERRIC UPTAKE REGULATION PROTEIN FURB | ( |
| Rv2431c-Rv2430c | 1 | PE FAMILY PROTEIN; PPE FAMILY PROTEIN | ( |
| Rv2594c-Rv2592c | 2 | PROBABLE CROSSOVER JUNCTION ENDODEOXYRIBONUCLEASE RUVC; PROBABLE HOLLIDAY JUNCTION DNA HELICASE RUVA; PROBABLE HOLLIDAY JUNCTION DNA HELICASE RUVB | ( |
| Rv2688c-Rv2686c | 2 | PROBABLE ANTIBIOTIC-TRANSPORT ATP-BINDING PROTEIN ABC TRANSPORTER; PROBABLE ANTIBIOTIC-TRANSPORT INTEGRAL MEMBRANE LEUCINE AND VALINE RICH PROTEIN ABC TRANSPORTER; PROBABLE ANTIBIOTIC-TRANSPORT INTEGRAL MEMBRANE LEUCINE AND ALANINE AND VALINE RICH PROTEIN ABC TRANSPORTER | ( |
| Rv3083-Rv3089 | 6 | PROBABLE MONOOXYGENASE; PROBABLE ACETYL-HYDROLASE/ESTERASE LIP; PROBABLE SHORT-CHAIN TYPE DEHYDROGENASE/REDUCTASE; PROBABLE ZINC-TYPE ALCOHOL DEHYDROGENASE ADHD (ALDEHYDE REDUCTASE); CONSERVED HYPOTHETICAL PROTEIN; CONSERVED HYPOTHETICAL PROTEIN; PROBABLE CHAIN-FATTY-ACID-CoA LIGASE FADD13 | ( |
| Rv3134c-Rv3132c | 2 | CONSERVED HYPOTHETICAL PROTEIN; TWO COMPONENT TRANSCRIPTIONAL REGULATORY PROTEIN DEVR; TWO COMPONENT SENSOR HISTIDINE KINASE DEVS | ( |
| Rv3874-Rv3875 | 1 | 10 KDA CULTURE FILTRATE ANTIGEN ESXB; 6 KDA EARLY SECRETORY ANTIGENIC TARGET ESXA | ( |
| Rv3793-Rv3795 | 2 | INTEGRAL MEMBRANE INDOLYLACETYLINOSITOL ARABINOSYLTRANSFERASE EMBC; INTEGRAL MEMBRANE INDOLYLACETYLINOSITOL ARABINOSYLTRANSFERASE EMBA; INTEGRAL MEMBRANE INDOLYLACETYLINOSITOL ARABINOSYLTRANSFERASE EMBB | ( |
| Rv0047c-Rv0046c | 1 | CONSERVED HYPOTHETICAL PROTEIN; MYO-INOSITOL-1-PHOSPHATE SYNTHASE INO1 | This study |
| Rv0287-Rv0288 | 1 | ESAT-6 LIKE PROTEIN ESXG; LOW MOLECULAR WEIGHT PROTEIN ANTIGEN 7 ESXH | This study |
| Rv1304-Rv1305 | 1 | PROBABLE ATP SYNTHASE A CHAIN ATPB; PROBABLE ATP SYNTHASE C CHAIN ATPE | This study |
| Rv1334-Rv1335 | 1 | CONSERVED HYPOTHETICAL PROTEIN; 9.5 KDA CULTURE FILTRATE ANTIGEN CFP10A | This study |
| Rv1465-Rv1466 | 1 | POSSIBLE NITROGEN FIXATION RELATED PROTEIN; CONSERVED HYPOTHETICAL PROTEIN | This study |
| Rv1826-Rv1827 | 1 | PROBABLE GLYCINE CLEAVAGE SYSTEM H PROTEIN GCVH; CONSERVED HYPOTHETICAL PROTEIN CFP17 | This study |
| Rv2745c-Rv2744c | 1 | POSSIBLE TRANSCRIPTIONAL REGULATORY PROTEIN; CONSERVED 35 KDA ALANINE RICH PROTEIN | This study |
| Rv2934-Rv2937 | 3 | PHENOLPTHIOCEROL SYNTHESIS TYPE-I POLYKETIDE SYNTHASE PPSD; PHENOLPTHIOCEROL SYNTHESIS TYPE-I POLYKETIDE SYNTHASE PPSE; PROBABLE DAUNORUBICIN-DIM-TRANSPORT ATP-BINDING PROTEIN ABC TRANSPORTER DRRA; PROBABLE DAUNORUBICIN-DIM-TRANSPORT INTEGRAL MEMBRANE PROTEIN ABC TRANSPORTER DRRB | This study |
| Rv3152-Rv3153 | 1 | PROBABLE NADH DEHYDROGENASE I (CHAIN H) NUOH; PROBABLE NADH DEHYDROGENASE I (CHAIN I) NUOI | This study |
| Rv3516-Rv3517 | 1 | POSSIBLE ENOYL-CoA HYDRATASE ECHA19; CONSERVED HYPOTHETICAL PROTEIN | This study |
DNA microarray data sets used in this work
| Experimental treatment | Number of microarrays | Microarray technology | Methods reference |
|---|---|---|---|
| Ethambutol | 62 | Amplicon | ( |
| Hydrogen peroxide | 28 | Amplicon | ( |
| Hydrogen peroxide | 55 | Oligo | ( |
| Hypoxia | 37 | Amplicon | ( |
| Hypoxia | 32 | Oligo | ( |
| Iron | 19 | Amplicon | ( |
| Potassium cyanide | 15 | Amplicon | ( |
| Nitric oxide | 135 | Amplicon | ( |
| Nitric oxide | 9 | Oligo | ( |
| Protonophores | 18 | Amplicon | Unpublished |
| Sigma B deletion | 48 | Oligo | Unpublished |
| Sigma E | 14 | Amplicon | ( |
Figure 1. Density estimates of (A) intergenic distance and (B) gene expression correlation for known operon pairs (solid) and non-operon pairs (dashed), computed as nonparametric kernel density estimates with Gaussian kernels. (C) A scatterplot showing the relationship between coexpression (vertical axis) and intergenic distance (horizontal axis) for all known operon pairs (red dots) and potential operon pairs (blue dots).
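The two predictors plotted in Figure 1 can be illustrated for a single adjacent gene pair. The sketch below is only a minimal illustration under stated assumptions: the distance convention (bases strictly between the two genes, negative for overlaps), the use of Pearson correlation, and the names `intergenic_distance` and `pearson` are not taken from this excerpt, and all coordinates and expression values are hypothetical.

```python
from math import sqrt


def intergenic_distance(upstream_end: int, downstream_start: int) -> int:
    """Bases strictly between two adjacent genes (1-based, inclusive coordinates).

    Assumed convention: a negative value means the coding regions overlap.
    """
    return downstream_start - upstream_end - 1


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical log-ratios for two adjacent genes across five arrays.
gene_1 = [0.2, 1.1, -0.4, 0.9, 1.5]
gene_2 = [0.1, 1.3, -0.2, 0.7, 1.6]

distance = intergenic_distance(upstream_end=1000, downstream_start=1011)
correlation = pearson(gene_1, gene_2)
print(distance, round(correlation, 3))
```

In practice one correlation would be computed per experiment type and array technology, since the models below use separate correlation terms for several of the data sets.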
Logistic regression coefficients and associated Wald tests of significance for three models
| Predictor | Type | Model A Est. | Model A SE | Model A z | Model A P | Model B Est. | Model B SE | Model B z | Model B P | Model C Est. | Model C SE | Model C z | Model C P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Intercept | | −5.88 | 0.487 | −12.1 | <0.001 | −5.60 | 0.477 | −11.7 | <0.001 | −2.54 | 0.149 | −17.0 | <0.001 |
| Distance | | −0.012 | 0.003 | −4.29 | <0.001 | −0.012 | 0.002 | −5.01 | <0.001 | −0.010 | 0.002 | −5.25 | <0.001 |
| Ethambutol | Amplicon | 2.491 | 0.772 | 3.23 | 0.001 | | | | | | | | |
| H2O2 | Oligo | 1.371 | 0.592 | 2.32 | 0.021 | 1.934 | 0.551 | 3.51 | <0.001 | | | | |
| Hypoxia | Oligo | 2.217 | 0.588 | 3.77 | <0.001 | 2.729 | 0.589 | 4.63 | <0.001 | | | | |
| Potassium cyanide | Amplicon | 1.507 | 0.520 | 2.90 | 0.004 | | | | | | | | |
| Sigma B | Oligo | 2.284 | 0.623 | 3.67 | <0.001 | 3.027 | 0.600 | 5.04 | <0.001 | | | | |
Model A: To be used for gene pairs with distance, oligo and amplicon data.
Model B: To be used for gene pairs lacking amplicon data.
Model C: To be used for gene pairs with no expression data.
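As an illustration of how these coefficients turn predictors into a score, the sketch below applies the standard logistic form p = 1 / (1 + exp(−η)), where η is the intercept plus the sum of coefficient times predictor. Several details are assumptions rather than statements from this excerpt: distance is taken in base pairs, each expression predictor is taken as a raw correlation coefficient, the outcome is coded with operon pair as the positive class, and the feature names and the helpers `choose_model` and `operon_probability` are hypothetical.

```python
from math import exp

# Coefficients copied from the table above; key names are illustrative only.
MODELS = {
    "A": {"intercept": -5.88, "distance": -0.012,
          "ethambutol_amplicon": 2.491, "h2o2_oligo": 1.371,
          "hypoxia_oligo": 2.217, "kcn_amplicon": 1.507, "sigb_oligo": 2.284},
    "B": {"intercept": -5.60, "distance": -0.012,
          "h2o2_oligo": 1.934, "hypoxia_oligo": 2.729, "sigb_oligo": 3.027},
    "C": {"intercept": -2.54, "distance": -0.010},
}

OLIGO = ("h2o2_oligo", "hypoxia_oligo", "sigb_oligo")
AMPLICON = ("ethambutol_amplicon", "kcn_amplicon")


def choose_model(features):
    """Pick Model A, B or C according to which expression data are available."""
    has_oligo = all(name in features for name in OLIGO)
    has_amplicon = all(name in features for name in AMPLICON)
    if has_oligo and has_amplicon:
        return "A"
    if has_oligo:
        return "B"
    return "C"


def operon_probability(features):
    """Logistic score for one adjacent gene pair under the selected model."""
    coef = MODELS[choose_model(features)]
    eta = coef["intercept"] + sum(
        value * features[name] for name, value in coef.items() if name != "intercept"
    )
    return 1.0 / (1.0 + exp(-eta))


# Hypothetical gene pair: 10 bp apart, strongly coexpressed in every data set.
pair = {"distance": 10, "ethambutol_amplicon": 0.9, "h2o2_oligo": 0.9,
        "hypoxia_oligo": 0.9, "kcn_amplicon": 0.9, "sigb_oligo": 0.9}
print(choose_model(pair), round(operon_probability(pair), 3))
```

Fitted probabilities of this kind reflect the class balance of the training pairs, so a classification cutoff need not sit at 0.5; the abstract describes choosing the threshold from the trade-off between true and false positive rates.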
Measures of performance for operon prediction from logistic regression models with different sets of explanatory variables
| Model label and description | c Index | Kendall's tau-a | BIC | AIC | Overall accuracy |
|---|---|---|---|---|---|
| (A) Dist + Oligo(3) + Amplicon(2) | 0.954 | 0.072 | 284.9 | 248.5 | 0.908 |
| (B) Dist + Oligo(3) | 0.946 | 0.071 | 300.1 | 274.1 | 0.884 |
| (C) Distance | 0.777 | 0.044 | 432.5 | 422.1 | 0.716 |
| (D) Dist + Oligo(1) | 0.929 | 0.068 | 293.4 | 277.8 | 0.876 |
| (E) Dist + Oligo(4) | 0.947 | 0.071 | 304.3 | 273.2 | 0.889 |
| (F) Dist + Oligo(1) + Amplicon(1) | 0.935 | 0.069 | 292.1 | 271.3 | 0.887 |
| (G) Dist + Coexpression(1) | 0.921 | 0.067 | 303.4 | 287.8 | 0.876 |
| (H) Dist + Oligo(4) + Amplicon(8) | 0.960 | 0.073 | 316.1 | 243.3 | 0.905 |
In the model descriptions, Oligo(1) means that a single correlation of expression is used for all oligo experiments, Oligo(4) means that separate gene expression correlations are used for each experiment type involving oligo technology, and Oligo(3) means the correlation from one experiment type involving oligo technology was removed via backward elimination. Similarly, Amplicon(1) means that a single correlation of expression is used for all amplicon experiments, Amplicon(8) means that separate gene expression correlations are used for each experiment type involving amplicon technology and Amplicon(2) means the correlations from six experiment types involving amplicon technology were removed via backward elimination. Finally, Coexpression(1) means that a single correlation of expression was used for all experiment types.
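The AIC and BIC columns can be cross-checked against each other using the standard definitions AIC = −2 ln L + 2k and BIC = −2 ln L + k ln n, which give BIC − AIC = k (ln n − 2). The sketch below is an illustrative consistency check, not part of the original analysis. The parameter counts k are an assumption (intercept, distance, and one coefficient per correlation term in each model description), and the implied sample size is an inference from the table rather than a figure stated in this excerpt.

```python
from math import exp

# model label -> (assumed parameter count k, AIC, BIC) from the table above
TABLE = {
    "A": (7, 248.5, 284.9),
    "B": (5, 274.1, 300.1),
    "C": (2, 422.1, 432.5),
    "D": (3, 277.8, 293.4),
    "E": (6, 273.2, 304.3),
    "F": (4, 271.3, 292.1),
    "G": (3, 287.8, 303.4),
    "H": (14, 243.3, 316.1),
}


def implied_log_n(k, aic, bic):
    """Solve BIC - AIC = k * (ln n - 2) for ln n."""
    return (bic - aic) / k + 2.0


for label, (k, aic, bic) in TABLE.items():
    log_n = implied_log_n(k, aic, bic)
    print(label, k, round(log_n, 2), round(exp(log_n)))

best_by_aic = min(TABLE, key=lambda m: TABLE[m][1])
best_by_bic = min(TABLE, key=lambda m: TABLE[m][2])
print("lowest AIC:", best_by_aic, "lowest BIC:", best_by_bic)
```

Under these assumed counts every row implies nearly the same ln n, which suggests the counts are reasonable. The comparison also shows why the two criteria disagree: Model H has the lowest AIC, but BIC penalizes its extra terms more heavily and favors Model A.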
Figure 2. ROC curves comparing three predictive models. The best model shown (Model A; solid line) uses intergenic distance and coexpression data from oligo and amplicon microarrays, while the next best model (Model B; dashed line) uses intergenic distance and coexpression data from only oligo microarrays and the poorest performing model (Model C; dotted line) uses only intergenic distance.
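Each point on an ROC curve such as those in Figure 2 is the pair (false positive rate, true positive rate) obtained at one classification threshold. The sketch below shows that calculation in a minimal form; the scores and labels are hypothetical toy values, and the function names are illustrative rather than taken from the paper.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (true positive rate, false positive rate) when score >= threshold is called an operon pair."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def roc_points(scores, labels):
    """(FPR, TPR) at every distinct score, from the strictest threshold to the loosest."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = rates_at_threshold(scores, labels, threshold)
        points.append((fpr, tpr))
    return points


# Hypothetical model scores; 1 = known operon pair, 0 = known non-operon pair.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]

print(rates_at_threshold(scores, labels, 0.35))
print(roc_points(scores, labels))
```

Choosing a working threshold then amounts to picking the point on the curve with an acceptable balance of the two rates, as the abstract does when it reports a true positive rate above 90% at a false positive rate of 9.1%.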
Figure 3. To further test the predictions of the model described in this work, two operons were subjected to additional laboratory testing. Starting from gene pairs that we had previously amplified successfully in our laboratory (marked with an asterisk), we selected adjacent newly predicted operon pairs and tested them by RT-PCR. In the summary table of results (A), gene pairs without an asterisk are therefore new results. For each gene pair, the intergenic distance, coexpression rank percentile and model prediction are shown. The final column indicates whether we have successfully verified by RT-PCR that the gene pair coexists on a single RNA molecule, as described in the Materials and Methods section. Panel B shows some of the associated gel image data. Of the six newly predicted operon pairs tested, we confirmed all but one by generating a PCR fragment. The lone exception (Rv1462-Rv1463) may indicate that (i) the genes are not cotranscribed as predicted, (ii) we have not yet amplified (but may eventually amplify) a fragment that bridges this pair by RT-PCR or (iii) other factors are at work that could confound RT-PCR. As a result, we have labeled Rv1462-Rv1463 ‘unconfirmed’ in the table.