Minseung Kim1, Violeta Zorraquino2, Ilias Tagkopoulos1.
Abstract
A tantalizing question in cellular physiology is whether the cellular state and environmental conditions can be inferred from the expression signature of an organism. To investigate this relationship, we created an extensive normalized gene expression compendium for the bacterium Escherichia coli that was further enriched with meta-information through an iterative learning procedure. We then constructed an ensemble method to predict environmental and cellular state, including strain, growth phase, medium, oxygen level, antibiotic and carbon source presence. Results show that gene expression is an excellent predictor of environmental structure, with multi-class ensemble models achieving balanced accuracy between 70.0% (±3.5%) and 98.3% (±2.3%) for the various characteristics. Interestingly, this performance can be significantly boosted when environmental and strain characteristics are considered simultaneously: a composite classifier that captures the inter-dependencies of three characteristics (medium, phase and strain) achieved 10.6% (±1.0%) higher performance than any individual model. Contrary to expectations, only 59% of the top informative genes were also identified as differentially expressed under the respective conditions. Functional analysis of the respective genetic signatures implicates a wide spectrum of Gene Ontology terms and KEGG pathways with condition-specific information content, including iron transport, transferases, and enterobactin synthesis. Further experimental phenotypic-to-genotypic mapping that we conducted for knock-out mutants argues for the information content of top-ranked genes. This work demonstrates the degree to which genome-scale transcriptional information can be predictive of latent, heterogeneous and seemingly disparate phenotypic and environmental characteristics, with far-reaching applications.
Year: 2015 PMID: 25774498 PMCID: PMC4361189 DOI: 10.1371/journal.pcbi.1004127
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Compendium analysis and normalization.
(A) The E. coli Gene Expression Compendium (EcoGEC) is constructed from raw genome-wide transcriptional data. (B) Principal Component Analysis on the EcoGEC before (inset) and after (main) normalization through linear transformation. p1 and p2 represent the first and second principal components, respectively. Platform biases are corrected by performing platform-specific categorization of gene expression values.
Class label distributions.
| Medium | Count | Strain | Count | Phase | Count | Oxygen | Count | Nor | Count | Amp | Count | Carbon | Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LB | 1356 | MG1655 | 1368 | E-Exp | 148 | Y | 2178 | Y | 227 | Y | 56 | Glucose | 471 |
| M9 | 301 | BW25113 | 148 | ML-Exp | 1368 | N | 64 | N | 2015 | N | 2186 | Glycerol | 94 |
| MOPS | 86 | EMG2 | 132 | Stat | 132 | | | | | | | Acetate | 49 |
| Others | 499 | Others | 594 | Missing | 601 | | | | | | | Others | 1628 |
| 60.4% | | 61% | | 61% | | 97.1% | | 89.8% | | 97.5% | | 72% | |
Fig 2. Gene expression compendium and classification workflow.
The workflow is divided into three steps: (A) data preprocessing, which combines RNA-Seq and microarray datasets; EcoGEC is categorized into three differential expression bins (under-expressed, UE; wild-type, WT; over-expressed, OE) and pre-processed for batch-effect and bias correction; (B) model training, where parameters are trained with four different machine learning methods for each of the classification tasks; and (C) model testing, where each new sample is assigned the class label that receives the majority of votes from the four prediction methods, for each of the eight characteristic predictors.
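The voting step in (C) can be illustrated with a minimal sketch. It assumes each of the four methods outputs one label per sample; the tie-breaking rule (first label in the list wins) and the example votes are assumptions made for illustration, since the caption does not specify them.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by the most classifiers.

    `predictions` holds one label per classifier. Ties go to the label
    that appears first in the list (an assumption, not from the paper).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label

# Hypothetical votes from four classifiers for one sample's growth medium.
votes = ["LB", "M9", "LB", "LB"]
print(majority_vote(votes))
```

In this sketch the same function would be called once per characteristic (medium, strain, phase, and so on) for each new sample.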
Fig 3. Classification performance.
(A) Balanced accuracy in the testing set for the 8 classification tasks as a function of the number of genes selected. Genes (x-axis) are ordered by the mutual information of their expression to the predictor variable. For each classifier, the optimal number of features (derived from the training data) and the minimum number of genes at near-optimal (within 2%) classification are shown in the legend (first and second value, respectively). (B) Leave-one-batch-out cross-validation, in which the training and testing balanced accuracy of each classifier is compared with the baseline. The baseline is estimated by dividing the maximum accuracy (100) by the number of classes for any given characteristic. (C) Combined multi-modal predictions using a set of individual classifiers. The parameter k is the number of characteristics to be classified (two antibiotics, aerobic or anaerobic respiration, medium, phase and strain); it increases from 2 to 7 (x-axis), and all possible combinations of k characteristics are considered. The average accuracy over the combinations of k characteristics is reported. (D) ROC curve (left) and PR curve (right) for the predictor of each characteristic (TPR, true-positive rate; FPR, false-positive rate; E-Exp, early exponential phase; M/L-Exp, mid/late exponential phase; Stat, stationary phase).
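The baseline in (B) and the two legend values in (A) can be sketched as follows. This is an illustration under stated assumptions: "within 2%" is read as two absolute percentage points, both values are taken from a single accuracy curve (the caption derives the optimum from the training data), and the curve itself is made-up example data.

```python
def baseline_accuracy(n_classes):
    """Baseline as described in (B): maximum accuracy (100) / number of classes."""
    return 100.0 / n_classes

def gene_count_summary(curve, tolerance=2.0):
    """Return (optimal gene count, smallest near-optimal gene count).

    `curve` is a list of (n_genes, balanced_accuracy_percent) pairs.
    A point is near-optimal if its accuracy is within `tolerance`
    percentage points of the best accuracy (assumed interpretation).
    """
    best_n, best_acc = max(curve, key=lambda point: point[1])
    near_n = min(n for n, acc in curve if acc >= best_acc - tolerance)
    return best_n, near_n

# Hypothetical accuracy curve, not taken from the paper.
curve = [(10, 80.0), (50, 91.5), (100, 93.0), (500, 92.0)]
print(gene_count_summary(curve))
print(baseline_accuracy(4))
```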
Contingency table of composite classifier.
| Actual (rows) / Predicted (columns) Medium/Phase/Strain | LEM | LEB | LEE | LEO | 9EM | 9EB | 9EO | MEM | MEB | OEM | OEE | OEO | O | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LEM | 822 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 831 |
| LEB | 3 | 97 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 104 |
| LEE | 2 | 0 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 54 |
| LEO | 13 | 2 | 0 | 250 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 2 | 271 |
| 9EM | 0 | 0 | 0 | 1 | 215 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 216 |
| 9EB | 0 | 0 | 0 | 0 | 0 | 30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 31 |
| 9EO | 0 | 0 | 0 | 0 | 1 | 0 | 49 | 1 | 0 | 0 | 0 | 0 | 1 | 52 |
| MEM | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 50 | 3 | 0 | 0 | 0 | 0 | 58 |
| MEB | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 22 | 0 | 0 | 0 | 0 | 24 |
| OEM | 2 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 174 | 0 | 1 | 8 | 188 |
| OEE | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 29 | 0 | 5 | 35 |
| OEO | 4 | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 186 | 2 | 199 |
| O | 31 | 9 | 15 | 18 | 0 | 0 | 2 | 1 | 0 | 20 | 3 | 22 | 58 | 179 |
| Total | 880 | 111 | 57 | 277 | 217 | 30 | 52 | 53 | 25 | 199 | 32 | 213 | 96 | 2242 |
(1) LEM, LB medium + mid/late exponential phase + MG1655; (2) LEB, LB medium + mid/late exponential phase + BW25113; (3) LEE, LB medium + mid/late exponential phase + EMG2; (4) LEO, LB medium + mid/late exponential phase + strains other than MG1655, BW25113 and EMG2; (5) 9EM, M9 + mid/late exponential phase + MG1655; (6) 9EB, M9 + mid/late exponential phase + BW25113; (7) 9EO, M9 + mid/late exponential phase + strains other than MG1655, BW25113 and EMG2; (8) MEM, MOPS + mid/late exponential phase + MG1655; (9) MEB, MOPS + mid/late exponential phase + BW25113; (10) OEM, media other than LB, M9, or MOPS + mid/late exponential phase + MG1655; (11) OEE, media other than LB, M9, or MOPS + mid/late exponential phase + EMG2; (12) OEO, media other than LB, M9, or MOPS + mid/late exponential phase + strains other than MG1655, BW25113, or EMG2; (13) O, all samples that do not belong to any of the preceding twelve classes.
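As a worked illustration, per-class recall and balanced accuracy can be computed from the diagonal counts and row totals of the contingency table. The sketch assumes that table rows are the actual classes, listed in the same order as the predicted-class columns; the derived figures are computed here from the table and are not values reported by the authors.

```python
# Class order follows the column headers of the contingency table.
labels = ["LEM", "LEB", "LEE", "LEO", "9EM", "9EB", "9EO",
          "MEM", "MEB", "OEM", "OEE", "OEO", "O"]

# Diagonal entries (correct predictions) and row totals from the table,
# assuming rows are actual classes in column order.
diagonal = [822, 97, 41, 250, 215, 30, 49, 50, 22, 174, 29, 186, 58]
row_totals = [831, 104, 54, 271, 216, 31, 52, 58, 24, 188, 35, 199, 179]

# Recall per class: correctly predicted samples / all samples of that class.
recall = {name: d / t for name, d, t in zip(labels, diagonal, row_totals)}

# Overall accuracy weights every sample equally; balanced accuracy
# weights every class equally (mean of per-class recall).
overall_accuracy = sum(diagonal) / sum(row_totals)
balanced_accuracy = sum(recall.values()) / len(recall)

for name in labels:
    print(f"{name}: recall = {recall[name]:.3f}")
print(f"overall accuracy  = {overall_accuracy:.3f}")
print(f"balanced accuracy = {balanced_accuracy:.3f}")
```

Balanced accuracy is the more informative summary here because the classes are heavily imbalanced (831 LEM samples versus 24 MEB samples).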
Fig 4. Feature and functional enrichment analysis.
(A) Mutual information (MI) content for each of the 8 classifiers. The 4166 genes are sorted in decreasing order of their MI. Solid and dashed lines correspond to empirical data and inverse log-linear fitting, respectively. (B) The common set of the most informative genes across different classifiers. For each of the 8 classifiers, the genes that account for the top 10% of the MI of all genes are extracted (side bars depict the size of the corresponding gene set). The top histogram depicts the number of unique features (genes) per classifier. (C) Functional annotations of the selected features for each classifier. The six most significantly enriched ontology terms are depicted. As some of the functional terms were synonyms, we extracted the non-duplicated associated terms. Ratios represent the proportion of the specific ontology terms present in an MI gene set.
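The MI-based ranking in (A) and (B) can be sketched in a few lines. Several choices here are assumptions rather than the authors' procedure: expression values are taken as already discretized (for example UE/WT/OE), MI is estimated from empirical frequencies in bits, and "top 10% of MI" is read as the smallest set of top-ranked genes whose cumulative MI reaches 10% of the total. The gene names and data are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def top_mi_genes(expression, labels, share=0.10):
    """Rank genes by MI with the labels and select the top share of total MI.

    `expression` maps gene name -> list of discretized values per sample.
    Returns (all genes ranked by decreasing MI, selected top genes).
    """
    scores = {g: mutual_information(v, labels) for g, v in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    target = share * sum(scores.values())
    selected, running = [], 0.0
    for gene in ranked:
        if running >= target:
            break
        selected.append(gene)
        running += scores[gene]
    return ranked, selected

# Hypothetical data: four samples, three genes, medium as the class label.
labels = ["LB", "LB", "M9", "M9"]
expression = {
    "geneA": ["OE", "OE", "UE", "UE"],  # tracks the label perfectly
    "geneB": ["WT", "WT", "WT", "WT"],  # constant, carries no information
    "geneC": ["OE", "WT", "OE", "WT"],  # independent of the label
}
ranked, selected = top_mi_genes(expression, labels)
print(ranked[0], selected)
```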
Fig 5. Highly informative genes on a genetic interaction network.
(A) Genes are grouped into five separate modules that are distinct from the core network. Ontology of pathways and compositions of transporter complexes are based on EcoCyc for E. coli K-12 MG1655. Green edges represent genetic interactions identified in [47]. Histograms show the frequencies of MI genes for the different classifiers in the 5 pathway modules. (B) A higher-resolution representation of the biosynthesis and transporter complex pathways that are highly enriched in a number of classifiers. The genes shown are the top-ranked in each classification task. The node color denotes the classification task for which the gene is highly informative (task legend on the upper right of the figure).