Xiaokang Wang, Violeta Zorraquino, Minseung Kim, Athanasios Tsoukalas, Ilias Tagkopoulos.
Abstract
A tantalizing question in evolutionary biology is whether evolution can be predicted from past experiences. To address this question, we created a coherent compendium of more than 15,000 mutation events for the bacterium Escherichia coli under 178 distinct environmental settings. Compendium analysis provides a comprehensive view of the explored environments, mutation hotspots and mutation co-occurrence. While the mutations shared across all replicates decrease with the number of replicates, our results argue that the pairwise overlapping ratio remains the same, regardless of the number of replicates. An ensemble of predictors trained on the mutation compendium and tested in forward validation over 35 evolution replicates achieves a 49.2 ± 5.8% (mean ± std) precision and 34.5 ± 5.7% recall in predicting mutation targets. This work demonstrates how integrated datasets can be harnessed to create predictive models of evolution at a gene level and elucidate the effect of evolutionary processes in well-defined environments.
Year: 2018 PMID: 30177705 PMCID: PMC6120903 DOI: 10.1038/s41467-018-05807-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1. Overview of predicting mutations in E. coli using a data-driven approach. a A compendium was constructed with mutation profiles across 178 conditions over 83 features that capture attributes related to the strain, medium and stress from experiments reported in 95 publications. b We built three individual predictors, namely an Artificial Neural Network (ANN), a Support Vector Machine (SVM) and a Naive Bayes (NB) model, which are integrated under one Ensemble method. c The predictions from all three individual predictors and the Ensemble method are assessed through forward validation in a novel experimental setting, by evolving and whole-genome resequencing 35 cell lines.
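The integration step in panel b can be illustrated with a minimal sketch. The caption does not say how the three predictors are combined, so the unweighted averaging, the 0.5 threshold, and the function names `ensemble_probability` and `predict_mutated` below are illustrative assumptions, not the authors' method.

```python
from statistics import mean


def ensemble_probability(p_ann: float, p_svm: float, p_nb: float) -> float:
    """Combine three per-gene mutation probabilities by unweighted averaging.

    Averaging is an assumption made for illustration; the source only states
    that the ANN, SVM and NB predictors are integrated under one Ensemble method.
    """
    for p in (p_ann, p_svm, p_nb):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return mean([p_ann, p_svm, p_nb])


def predict_mutated(p_ann: float, p_svm: float, p_nb: float,
                    threshold: float = 0.5) -> bool:
    """Call a gene 'mutated' when the averaged probability reaches the threshold.

    The 0.5 threshold is a hypothetical default, not a value from the source.
    """
    return ensemble_probability(p_ann, p_svm, p_nb) >= threshold
```

Averaging is only one possible combination rule; majority voting or a learned weighting would fit the same description.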
Fig. 2. Database summary and statistics. a The distribution of culture conditions in terms of medium, strain and stress. In the case of medium, the legend is restricted to evolution runs with multiple replicates to save space. b The frequencies of different mutation types in the database. c The distribution of mutations among 3819 genome sites, with the top genes presented inside the plot. d The frequency of the genome sites hit by a mutation under 178 culture conditions. The count distribution for each genome site (log y-axis) is shown in the inset. The histogram is approximated by a gamma distribution (green curve). The dotted orange line depicts p value < 0.05. e All mutations are shown along the genome of E. coli MG1655, with the top three hotspots flagged. When visualizing the mutations along the genome, we used a 5 kb sliding window to find the regions that were most and least likely to be hit by a mutation. The dotted orange line depicts p value < 0.05.
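The 5 kb sliding window in panel e can be sketched as follows. Only the window size comes from the caption; the step size, the treatment of the genome as linear rather than circular, and the function name `sliding_window_counts` are assumptions made for illustration.

```python
from bisect import bisect_left


def sliding_window_counts(positions, genome_length, window=5000, step=1000):
    """Count mutations falling in each window [start, start + window).

    `positions` are mutation coordinates on a linear genome. The step size is
    a hypothetical choice, and wrap-around at the origin is ignored.
    """
    sorted_pos = sorted(positions)
    counts = []
    last_start = max(genome_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        # Number of positions p with start <= p < end.
        n = bisect_left(sorted_pos, end) - bisect_left(sorted_pos, start)
        counts.append((start, n))
    return counts
```

Windows with the highest and lowest counts are then the candidate hot and cold regions; assessing their significance would need a null model, which this sketch does not include.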
The hotspot genes in non-mutator and mutator strains
| Non-mutator gene | Function | Frequency | p value | Mutator gene | Function | Frequency | p value |
|---|---|---|---|---|---|---|---|
|  | RNA polymerase sigma 24 | 29.8% | −99 |  | DNA gyrase | 52.9% | −22 |
|  | Pyruvate kinase I | 20.2% | −85 |  | RNA polymerase sigma 24 | 41.2% | −18 |
|  | RNase PH | 15.5% | −47 |  | Endochitinase | 41.2% | −19 |
|  | MalT-maltotriose-ATP DNA-binding transcriptional activator | 14.8% | −42 |  | RNA polymerase sigma 24 | 35.3% | −14 |
|  | Putative transport protein, monovalent cation:proton antiporter-2 (CPA2) family | 14.3% | −38 |  | Endo-1,4- | 35.3% | −15 |
|  | Ribokinase | 14.3% | −40 |  | Aldehyde alcohol dehydrogenase | 35.3% | −13 |
|  | Ribose ABC transporter | 14.3% | −41 |  | Membrane protein required for maintenance of rod shape | 29.4% | −12 |
|  | Guanosine 3 | 13.7% | −26 |  | Protein translocation ATPase | 29.4% | −13 |
|  | DNA topoisomerase I | 13.1% | −35 |  | Predicted peptidase with chaperone function | 29.4% | −11 |
|  | NadR DNA-binding transcriptional repressor and NMN adenylyl transferase | 13.1% | −36 |  | Kdo2-lipid A phospho-ethanolamine7-transferase | 29.4% | −10 |
|  | Ribose pyranase | 13.1% | −38 |  | Predicted outer membrane usher protein | 29.4% | −10 |
|  | Ribose ABC transporter | 13.1% | −39 |  | Obactin (enterochelin) transport | 29.4% | −11 |
|  | RbsR-ribose | 13.1% | −36 |  | Regulatory | 29.4% | −9 |
|  | Fis DNA-binding transcriptional dual regulator | 13.1% | −32 |  | Conserved protein | 29.4% | −11 |
|  | HslVU protease | 12.5% | −35 |  | Glutamine synthetase adenylyl transferase glutamine synthetase deadenylase | 29.4% | −12 |
|  | EnvZ sensory histidine kinase | 12.5% | −35 |  | RhsC protein in rhs element | 29.4% | −9 |
|  | Putative transport protein, major facilitator superfamily (MFS) | 11.9% | −30 |  | Inner membrane protein YbjL | 29.4% | −8 |
|  | Aspartate kinase/homoserine dehydrogenase | 11.3% | −27 |  | e14 prophage; predicted SAM-dependent methyltransferase | 29.4% | −11 |
|  | Rod shape-determining membrane protein; sensitivity and drug | 10.7% | −30 |  | Inner membrane protein inhibits the Rcs signaling pathway | 29.4% | −10 |
|  | IclR-glyox | 10.1% | −28 |  | Glycyl-tRNA synthetase | 29.4% | −12 |
The p value is in log10 scale
Fig. 3. Mutation profile analysis for co-occurrence and functional relationships. a Spectral clustering of mutations occurring in two or more conditions. Squares along the diagonal represent clusters of genome sites that are highly correlated with respect to mutation profiles. The average pairwise mutual information in each cluster decreases from cluster 1 to cluster 6. The heatmap is based on mutual information. b The enriched molecular function GO terms and corresponding genes. In each cluster, the genes mapped to the same GO terms are grouped together (clusters have different colors and indices). The numbers following each GO term are p values calculated by DAVID (p value threshold is 0.1). The color bars at the bottom and right indicate the membership of the GO terms/genes among the three clusters. Only clusters with enriched GO terms are shown. c The number of mutations as a function of generations elapsed, with hypermutator strains excluded. The inset shows the pattern when hypermutators are included, with the fitted exponential curve in the inner plot being $N = 5.6e^{0.00012g}$. The $R^2$ values for the linear and exponential fits are 0.72 and 0.35, respectively. d A dendrogram illustrating the clusters of antibiotics generated by hierarchical clustering. The legend describes the mechanism of action of each category of antibiotics. The pairwise distance was computed as the Euclidean distance between mutation profiles.
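The mutual information underlying panel a can be sketched for a pair of binary mutation profiles. This assumes each profile is a 0/1 vector over culture conditions and uses a plug-in estimate in bits; the caption does not specify the estimator, and the function name `mutual_information` is illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in mutual information (in bits) between two discrete profiles.

    x[i] and y[i] record whether two genome sites are mutated (1) or not (0)
    under condition i. The estimator and log base are assumptions.
    """
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi
```

Two sites that are always mutated together across a balanced set of conditions give 1 bit, while statistically independent profiles give 0.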
Fig. 4. Frequency and co-occurrence of mutations as a function of biological replicates. a The average frequency with which a genome site is mutated among all replicates of a culture condition is linearly related to the reciprocal of the number of biological replicates. Culture conditions with the same number of replicates were grouped into one data point (dots without a standard deviation bar correspond to cases where only one condition has that number of replicates; this also applies to b and c). The dashed line shows the relation expected if each mutation occurred in only one replicate, in which case the frequency equals the reciprocal of the number of replicates. b The average overlap between the mutated genome sites in a biological replicate and all the mutated genome sites under a culture condition, as a function of the number of replicates. c The pairwise overlap in mutated genome sites between biological replicates of a culture condition. d The pairwise overlapping ratio under different conditions, plotted against medium, strain and stress, respectively. The number in each bar is the number of culture conditions corresponding to that data point. The error bar represents the standard deviation of the pairwise overlapping ratio across the culture conditions corresponding to that data point.
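A pairwise overlapping ratio of the kind shown in panels c and d can be sketched as below. The caption does not define the ratio, so this sketch assumes the Jaccard index (shared sites divided by the union of sites) averaged over all replicate pairs; the function name and the site labels in the usage note are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_overlap_ratio(replicates):
    """Mean Jaccard index over all pairs of replicates of one culture condition.

    Each replicate is a set of mutated genome sites. Using the Jaccard index
    is an assumption; the source does not state the exact definition.
    """
    if len(replicates) < 2:
        raise ValueError("need at least two replicates")
    ratios = []
    for a, b in combinations(replicates, 2):
        union = a | b
        ratios.append(len(a & b) / len(union) if union else 0.0)
    return mean(ratios)
```

For three replicates with sites `{"s1", "s2"}`, `{"s1"}` and `{"s1", "s3"}`, the pairwise ratios are 1/2, 1/3 and 1/2, so the mean is 4/9. Because the statistic is computed per pair, it need not shrink as replicates are added, unlike the set of sites shared by all replicates.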
Fig. 5. Predictor performance. a Schematic depiction of the Ensemble predictor and an example of how the probability of a mutation is calculated given an input. The model input consists of 83 binary variables that capture experimental factors. The model output is a binary variable that captures the presence/absence of mutation(s) in a specific gene (rph here), given a condition (MG1655, M9 with glucose and osmotic stress here). b The prediction performance of the Ensemble predictor and each individual predictor in leave-one-condition-out cross-validation, shown as ROC curves. c Precision–recall curve. A-N-S is the Ensemble predictor. The baseline refers to the prediction based on the frequency of a mutation across different culture conditions in the database.
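Precision and recall for a set of predicted mutation targets can be computed as in the sketch below. It uses the standard definitions; the function name `precision_recall` and the gene labels in the usage note are hypothetical and not taken from the source.

```python
def precision_recall(predicted, observed):
    """Return (precision, recall) for predicted versus observed mutated genes.

    precision = correctly predicted genes / all predicted genes
    recall    = correctly predicted genes / all observed mutated genes
    """
    predicted, observed = set(predicted), set(observed)
    true_positives = len(predicted & observed)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(observed) if observed else 0.0
    return precision, recall
```

If four genes `g1` to `g4` are predicted and the observed mutated genes are `g1`, `g2` and `g5`, two predictions are correct, giving a precision of 0.5 and a recall of 2/3.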
Mutations discovered in 35 E. coli isolates under osmotic pressure
| Gene | Mutation type | Frequency | Product |
|---|---|---|---|
|  | SNP | 34 |  |
|  | Insertion | 34 |  |
|  | Deletion | 27 | RNase PH |
|  | Deletion | 13 |  |
|  | SNP | 4 | SufBCD Fe-S cluster scaffold complex |
|  | SNP | 3 | Pyruvate kinase I |
|  | Insertion | 3 | Predicted diguanylate cyclase |
|  | Deletion | 3 | L-cysteine desulfurase |
|  | SNP | 2 | Glycine betaine/proline ABC transporter |
|  | SNP | 2 | RNA polymerase sigma 24 |
|  | SNP | 2 | SufBC2D Fe-S cluster scaffold complex |
|  | SNP | 1 | Guanosine 3-diphosphate 5-triphosphate 3-diphosphatase (multifunctional) |
|  | Deletion | 1 | Transcription termination/antitermination L factor |
|  | Insertion | 1 |  |
|  | SNP | 1 |  |
|  | SNP | 1 |  |
|  | Deletion | 1 | Predicted protein |
|  | SNP | 1 | Mechanosensitive channel of miniconductance YbdG |
|  | SNP | 1 | Chemotaxis signaling complex–ribose/galactose/glucose sensing |
|  | SNP | 1 | SufBC2D Fe-S cluster scaffold complex |
|  | SNP | 1 | Predicted oxidoreductase, Zn-dependent and NAD(P)-binding |
|  | SNP | 1 | Peptide chain release factor RF2 |
|  | SNP | 1 | Nickel ABC transporter |
The product of each mutated gene was annotated according to http://ecocyc.org/; the product of intergenic regions was left blank. Genome sites in bold were predicted to be mutated in more than 5 of 10 bootstrapping runs.