Escherichia coli Data-Driven Strain Design Using Aggregated Adaptive Laboratory Evolution Mutational Data.

Patrick V Phaneuf¹, Daniel C Zielinski²,³, James T Yurkovich², Josefin Johnsen³, Richard Szubin², Lei Yang³, Se Hyeuk Kim³, Sebastian Schulz³, Muyao Wu², Christopher Dalldorf², Emre Ozdemir³, Rebecca M Lennen³, Bernhard O Palsson¹,²,⁴,³, Adam M Feist²,³.

Abstract

Microbes are being engineered for an increasingly large and diverse set of applications. However, designing microbial genomes remains challenging due to the general complexity of biological systems. Adaptive Laboratory Evolution (ALE) leverages nature's problem-solving processes to generate optimized genotypes currently inaccessible to rational methods. The large amount of public ALE data now represents a new opportunity for data-driven strain design. This study describes how novel strain designs, or genome sequences not yet observed in ALE experiments or published designs, can be extracted from aggregated ALE data and demonstrates this by designing, building, and testing three novel Escherichia coli strains with fitnesses comparable to ALE mutants. These designs were achieved through a meta-analysis of aggregated ALE mutation data (63 Escherichia coli K-12 MG1655 based ALE experiments, described by 93 unique environmental conditions, 357 independent evolutions, and 13 957 observed mutations), which additionally revealed global ALE mutation trends that inform ALE-derived strain design principles. These trends indicate that ALE-derived strain designs will be largely gene-centric, as opposed to noncoding, and composed of a relatively small number of beneficial variants (approximately 6). These results demonstrate how strain design efforts can be enhanced by the meta-analysis of aggregated ALE data.

Keywords: adaptive laboratory evolution; data-driven strain design; genome design variables; meta-analysis; mutation functional analysis; structural biology

Year: 2021; PMID: 34762392; PMCID: PMC8870144; DOI: 10.1021/acssynbio.1c00337

Source DB: PubMed; Journal: ACS Synth Biol; ISSN: 2161-5063; Impact factor: 5.110

Introduction

The ability to precisely engineer microbial strains continues to improve. Multiple fields have emerged to take advantage of recent advances in biomolecular methods and technologies.[1] However, an incomplete understanding of biology renders the rational design of microbial strains challenging.[2] Introducing rationally designed changes into the DNA of a cellular chassis often leads to a perturbed suboptimal metabolic or regulatory state, resulting in the underachievement of goals[3] and the need for a prolonged engineering effort that has been seen to require 6–8 years and over $50 million in costs.[2]

The opportunity exists to leverage the built-in problem-solving processes of adaptive evolution to discover and elucidate biological functions and generate solutions for applications. Adaptive Laboratory Evolution (ALE) is the formalization of a controlled evolutionary process that can successfully be applied to understand and engineer bacterial strains. ALE can provide adaptive mutations to optimize a strain’s growth rate or related fitness properties useful for both microbial engineering research and applications.[4] When a production pathway (e.g., metabolic) can be coupled with growth, ALE can provide adaptive mutations that will increase the throughput of this pathway.[5,6] Furthermore, if a strain’s fitness has been severely disrupted by a designed change, ALE methods can identify mutations that rebalance the strain’s homeostasis.[4,7−11] ALE can additionally serve to harden strains against industrial conditions[3,4,12,13] and improve their utilization of secondary or non-native substrates.[4,14−19]

Due to ALE’s potential for discovery and application,[4] almost 700 manuscripts and over 18 000 experimental evolutions have been published.[20] The growth in ALE data has inspired multiple efforts toward its consolidation and analysis.[20−23] Aggregating public data is expected to enable new discoveries not evident through single experiments.[24] Critical biomedical efforts, such as cancer and antimicrobial resistance research, have benefited from the meta-analysis of big data sets describing their respective fields.[25−27] Previous work has mined public ALE data, though its results were limited to genetic loci and therefore did not explore the specific mutational sequence changes.[21] Nucleotide-level resolution is ideal for strain design, and the aggregation of ALE mutation sequence changes could better reveal the specific types of changes adaptive evolution selects for with conditions of interest. Thus, one could initiate a strain design based on an observed ALE mutation that best represents a mutation trend in aggregated mutational data. This is the current typical end point of strain engineering with ALE. One could further strive to derive the local functional objectives of the ALE mutation trend to a genomic feature and therefore characterize a useful design variable.[4] Comprehensive understanding of genome design variables can serve to both advance the basic understanding of genomic feature functions as well as contribute novel solutions for strain engineering.[28]

Strain engineering makes use of rational genome design methods, though these primarily manipulate the presence/absence of genes[29−31] or have their scope limited to reaction networks[32−34] and gene product kinetic properties.[35] Design variables within gene sequences are currently inaccessible to these rational model-based methods, though sequence-level data can be integrated through the results of mutagenesis experiments and libraries of sequence variants. Sequence-level design variables can be used to design variants or biological parts for the engineering of a more fit strain in related conditions.[36] Knowledge of the functional bounds for a DNA encoded design variable could also solve issues in synthesizing genomes, where streamlining is required that does not allow for the original ALE mutation sequence.[37] Finally, the meta-analysis of aggregated ALE mutations may also reveal general strain design principles that could inform strain design strategies.[38]

This study seeks to demonstrate how novel genome design variables can be derived from aggregated ALE data using multiscale mutation annotations, metadata, structural biology methods, and meta-analysis methods. Aggregated ALE mutational and experimental conditions data were acquired from ALEdb, a web-based platform reporting on experimental evolution mutations and their conditions.[22] Analysis methods were organized into a workflow that functioned to narrow down ALE mutational data from the entire input set to those relevant for experimental conditions of interest. The combination of multiscale genome annotations, aggregated ALE data, and metadata enabled mutated systems to be associated with experimental conditions to aid in deconvoluting mutation selection pressures.[23] Structural biology methods and functional annotations revealed properties being targeted by ALE mutation trends related to protein function. Further, the workflow enabled interpretation of ALE mutation trend impact on specific features for specific conditions of interest, which were leveraged to define and validate genome design variables by building novel variants of similar benefit to the ALE mutations. As ALE studies continue to accumulate, the aggregation of their mutational data represents a valuable opportunity for deriving novel insights into genome design variables, therefore enhancing our functional understanding of the individual components encoded within a genome.

Results

A Meta-analysis Workflow Leveraging Mutation Trends in Consolidated ALE Data to Design Variants

A generalized workflow was developed to (1) organize data and analysis methods in a way that identifies a subset of ALE mutations from an entire input data set that are relevant to experimental conditions of interest, (2) derive the beneficial functional impact that ALE mutations converge upon, and (3) design and build variants using the resulting knowledge on genome design variables (Figure ). The following describes the workflow’s overall general steps, and specific sections are referenced for each step to provide details:

Figure 1

A general workflow to derive strain designs from aggregated ALE mutations and experimental conditions. The multiple resources used in this workflow are described in the Methods section. Resources and processes are described in the section text.

1. Select conditions of interest. This step is often dictated by the desired application and is informed by known biological mechanisms associated with the desired phenotype. The conditions available for investigation come from the metadata for ALE experiments which describe the experimental conditions (section ). This study used ALEdb (7) as its source for ALE mutational data and experimental conditions. As an aggregated ALE data set’s diversity increases, the queryable cases will also increase. In this work, growth on glycerol as a carbon source (section ) and toxic concentrations of isobutyric acid (section ) were targeted as relevant bioprocessing phenotypes.[39−43] There remain many other potential selection pressures in the data set used for this study.

2. Extract all mutated features associated with conditions. Mutated genomic features and their experimental conditions are extracted from an aggregated and curated data set to seed the analysis. Statistically significant associations are established between mutated features and conditions to aid in deconvoluting the selection pressures for mutated features. It is important to note that the scale of association analyses used to link mutated features to conditions depends on the amount and variety of annotated genomic features in a data set. Extending mutation annotations beyond the standard types included in genome references (genes and intergenic regions) improves the variety of mutated systems that may be associated with conditions of interest (section ).[23]

3. Understand what mutated features to investigate. Rank-order mutated features of interest by mutation frequency and strength of association to condition of interest (sections and 2.3.2.1). Relationships between mutated features should also be considered (e.g., in the case of epistasis[44]) and can help avoid potentially incompatible sequence changes and provide insights into the systemic changes that result from multiple mutations (sections and 2.3.2.1).

4. Derive ALE mutation trend functional impact on features. Derive the functional impact of ALE mutation trends on genomic features of interest using a variety of methods and resources (sections and 2.3.2.2, Methods section, Supplementary Table S11). Interpretations can be informed using structural biology and mutation effect prediction methods, among others. The product of this step is knowledge that can be encapsulated as genome design variables.

5. Experimentally validate genome design variables for target strain. Perform an assay to experimentally validate that a strain harboring a designed variant based on a derived genome design variable is beneficial relative to the wild-type (sections and 2.3.2.4). To validate the potential for deriving the functional impact of ALE mutation trends, the designed variants described in this study are based on novel sequence changes not observed in ALE mutations. Further, ALE mutations representative of observed trends can be used as controls for comparing a designed variant’s fitness against an observed ALE mutation (sections and 2.3.2.3).

Meta-analysis Trends Suggest Types of Genome Design Variables to Expect from ALE Data

This study used ALE mutations and metadata from ALEdb, a database for experimental evolution mutation data,[22] and an enriched set of mutation annotations generated by a multiscale annotation method.[23] The data set contained 63 Escherichia coli K-12 MG1655 based ALE experiments from ALEdb, totaling 357 independent evolutions and 13 957 observed mutations (Figure a). The observed mutations were filtered to exclude hypermutator strains, mutations with frequencies below 0.5 from population sequencing samples (24% of the total samples), and ALE-uniqueness, resulting in 3921 dominant ALE-unique mutations (Figure a). Mutations were annotated with 10 different genomic feature types (gene, intergenic, promoter, transcription factor binding site, ribosomal binding site, terminator, attenuator terminator, operon, pathway, regulon) to enable the identification of mutation convergence on a broad set of genomic features and biological functions[23] (Figure a, Figure a). The data set tracked 10 different condition types describing the strain and environment of ALE experiments with a total of 93 unique conditions (Figure a). Meta-analysis of this consolidated data set revealed trends that predict the general shape of the workflow’s results and will be described in the following sections.
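The filtering described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Mutation` record, its field names, and the toy values are hypothetical, and "ALE-uniqueness" is assumed here to mean that a given sequence change is counted once per independent evolution. Only the 0.5 frequency cutoff and the exclusion of hypermutator samples come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    ale: str            # independent evolution the sample came from (hypothetical field)
    position: int       # genome coordinate
    change: str         # sequence change, e.g. "C>A"
    frequency: float    # allele frequency in the sample (1.0 for a clonal isolate)
    hypermutator: bool  # sample flagged as a hypermutator strain


def dominant_ale_unique(mutations, min_frequency=0.5):
    """Drop hypermutator and sub-threshold mutations, then keep one
    record per (ALE, position, change). The uniqueness key is an assumption."""
    seen = set()
    kept = []
    for m in mutations:
        if m.hypermutator or m.frequency < min_frequency:
            continue
        key = (m.ale, m.position, m.change)
        if key not in seen:
            seen.add(key)
            kept.append(m)
    return kept


# Toy records for illustration only; none of these values come from ALEdb.
example = [
    Mutation("ALE1", 100, "C>A", 1.0, False),
    Mutation("ALE1", 100, "C>A", 0.8, False),  # same change, another sample of ALE1
    Mutation("ALE1", 250, "G>T", 0.3, False),  # below the 0.5 cutoff
    Mutation("ALE2", 100, "C>A", 1.0, False),  # same change in a different ALE: kept
    Mutation("ALE3", 900, "A>G", 1.0, True),   # hypermutator sample
]
filtered = dominant_ale_unique(example)
```

Keeping the same change when it recurs in a different ALE preserves the signal that later sections call convergence.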

Figure 2

Dimensions and properties of the mutational data set used in this study. (a) A plot of the different dimensions of the ALE data used within this study as extracted from ALEdb. (b) A visual representation of the different condition types across ALEs in the targeted set. The mapping between individual colors and labels can be found in Supplementary Figures S1–S9.

Figure 3

Mutation types and effects exhibit a bias toward specific genomic feature types. (a) Table of mutation type and mutated feature frequencies. Synonymous SNPs are abbreviated as “syn”, nonsynonymous SNPs are abbreviated as “non-syn”, and truncating mutations are abbreviated as “trunc”. (b) The distribution of mutation sizes and amount of genomic features affected according to mutation types. Abbreviations: SNP, single nucleotide polymorphism; DEL, deletion; MOB, mobile insertion elements; INS, insertion; SUB, substitution; CNV, copy number variant. (c) The proportion of mutations to individual features across feature types that are truncations. (d) The number of sequence truncating and nontruncating mutations for individual genomic features. Abbreviations: TFBS, transcription factor binding site; RBS, ribosomal binding site.

ALE mutations come in a variety of types and can affect a variety of features encoded on the genome. Six different types of mutations were found within the data set (Figure a): single nucleotide polymorphisms (SNP), deletions (DEL), insertions (INS), mobile element insertions (MOB), multinucleotide substitutions (SUB), and copy number variations (CNV). Different mutation types manifest at different frequencies, resulting in different amounts for each mutation type within this data set (Figure a). The mutated feature annotations used in this study range from small genomic features (e.g., terminators) to larger features (i.e., operons, regulons, and pathways) describing biological function (Figure a). There was a clear meta-analysis trend in the frequency of small mutated features: genes were most often mutated, as also observed with long-term evolution experiments,[45] and promoters hosted the most mutations for noncoding features. Multinucleotide mutations or overlapping features can result in more than one feature affected by a mutation; therefore, more mutated features than mutations can occur within an ALE experiment[23] (Figure a, Figure b).

The Majority of ALE Mutations to Noncoding Regulatory Features Resulted in Truncations

Amino acid substitutions and some of the disruptive effects mutations can have on genomic feature sequences can be confidently predicted. These predictions rely on mutation size (Figure b) or specific coding sequence changes. The effects of mutations on the translation of genes were the most straightforward to predict, primarily by considering the effects mutations have on the open reading frame. SNPs can result in synonymous or nonsynonymous codon changes, where a nonsynonymous SNP can result in an amino acid substitution or a truncation due to the introduction of a premature stop codon or the removal of a start codon. Truncations due to structural variants (SVs) can also be predicted. Open reading frames were expected to be functionally truncated if the SV caused a frameshift. The effects of SVs on noncoding regulatory features of the genome were more difficult to predict, though it is likely safe to assume that SVs of 10 nucleotides or more truncate both coding and noncoding features.

SNPs were the most frequent mutation type and the largest contributor to the mutation of known features (Figure a). SNPs were also most frequently found within genes and were most often predicted to result in a nonsynonymous amino acid substitution, with the majority of the coding SNPs resulting in nontruncations (Figure a). These meta-analysis SNP trends have also been observed in long-term evolution experiments.[45,46] Deletions were the most frequent SV and contributed a substantial share of the truncations to features (Figure a).

Different feature types display different mutation type and effect trends. All noncoding feature types, besides intergenic regions, were more often targeted by truncations while genes were equally targeted by truncating and nontruncating mutations (Figure c). This suggested that noncoding features were more likely to be truncated than refined by ALE mutations. Individual features were often mutated with both truncating and nontruncating mutations, though some were more often affected by one of the predicted mutation effects (Figure d). The features affected by only predicted nontruncating mutations may be benefiting from gain-of-function mutations, which represent potential for variant designs involving something other than truncations.
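The prediction rules described in this section can be written down as a small classifier. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, any change to a start codon is treated as a lost start (ignoring alternative start codons), and short in-frame structural variants default to "nontruncating", a case the text does not address. The standard genetic code, the frameshift rule, and the 10-nucleotide threshold are the only inputs.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def snp_effect(ref_codon, alt_codon, is_start_codon=False):
    """Classify a coding SNP as synonymous, nonsynonymous, or truncating."""
    if is_start_codon and alt_codon != ref_codon:
        return "truncating"  # start codon removed (simplifying assumption)
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "truncating"  # premature stop codon introduced
    return "nonsynonymous"


def sv_effect(size_nt, coding):
    """Classify a structural variant by its size and whether it hits an ORF."""
    if size_nt >= 10:
        return "truncating"  # assumed for coding and noncoding features alike
    if coding and size_nt % 3 != 0:
        return "truncating"  # frameshift within an open reading frame
    return "nontruncating"   # default for small in-frame or noncoding SVs (assumption)
```

For example, `snp_effect("TCC", "TAC")` returns `"nonsynonymous"` (serine to tyrosine), and `sv_effect(5, coding=True)` returns `"truncating"` because a 5-nucleotide change shifts the reading frame.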

Unique Mutated Features Are Correlated with Only a Small Number of Other Mutated Features

Correlations between all mutated genomic features (gene, promoter, TFBS, intergenic, attenuator terminator, terminator, RBS) across the entire set of mutations can be leveraged to approximate coarse-grain relationships. Positively correlated genomic features should represent features that can be mutated together to optimize a strain for one or more conditions, which may constitute part of or the whole set of adaptive mutations from an ALE. Negative correlations should represent features that result in neutral or negative effects on the strain’s fitness when mutated together. The nature of selection for growth with ALE experiments results in more and stronger positive correlations than negative correlations between mutated features (Supplementary Figure S13, Figure a). Hierarchical clustering of correlated features results in a distribution of cluster sizes with a median of six features (Figure a, Figure b), potentially describing the general need for approximately six mutated features to achieve a substantially optimized strain.
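As a rough illustration of the correlation step only (the hierarchical clustering is not reproduced here), the sketch below computes Pearson coefficients between features encoded as presence/absence vectors across samples. That encoding is an assumption, the vectors are toy values chosen to mimic a co-occurring pair and a mutually exclusive pair, and they are not taken from the data set. The last line checks a piece of arithmetic: the 656 085 feature pairs reported in the Figure 4 caption equal the number of unordered pairs among 1146 features.

```python
from itertools import combinations
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# 1 = feature mutated in that sample, 0 = not mutated (toy values only).
presence = {
    "glpK": [1, 1, 1, 0, 0, 0],
    "cyaA": [1, 1, 0, 0, 0, 0],
    "crr":  [0, 0, 1, 0, 0, 0],
}

corr = {
    (a, b): pearson(presence[a], presence[b])
    for a, b in combinations(presence, 2)
}

# Unordered pairs among 1146 features: 1146 * 1145 / 2
n_pairs = comb(1146, 2)
```

In this toy matrix the features that are never mutated in the same sample receive a negative coefficient, while features that co-occur receive a positive one, which is the pattern the text interprets as compatible versus incompatible mutation sets.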

Figure 4

ALE adapted genotypes are gene-centric and involve few mutated features per condition. (a) A clustermap of the Pearson correlation coefficients for all genomic feature pairs (656 085). (b) The distribution of cluster sizes from the clustering of all genomic feature pairs according to their correlation. The median cluster size was six, as highlighted. (c) The total amount of statistically significant associations between unique features and conditions according to feature types. (d) The amount of significantly associated conditions per unique genomic feature.

Unique Mutated Features Are Associated with Only a Small Number of Conditions

The ALE conditions metadata enables efforts in linking mutated features to conditions, isolating subsets of mutations from across experiments that may be related to conditions of interest. There were 93 unique conditions across 10 different condition types (Figure b, Supplementary Figures S1–S9). Due to the variety of experiments consolidated in the data of this study, some conditions will contain more potential for designs than others. The supplement, starting-strain, and carbon-source conditions have substantially more associated features than the other condition types (Supplementary Figure S11), likely reflecting the variety of conditions (Figure b, Supplementary Figures S1–S9). Operons and genes had the largest amount of associated conditions (Figure c). This is likely due to operons being composed of genes, which have the largest variety of uniquely mutated features in this data set (Figure a). These associations can also describe the specificity of the mutated features for the given conditions. Most features across all types were associated with only a small set of conditions, though there exist some outliers that were associated with a broad range of conditions and could therefore be applicable to a broad range of stresses (Figure d). According to these trends, it was expected that variants derived from these associations were going to be gene-centric and only involve a small number of coding and noncoding features.
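One common way to test whether a mutated feature is associated with a condition is a one-sided Fisher's exact test on a 2×2 table of samples. The text here does not name the test that was used, so the sketch below is an assumed stand-in rather than the authors' method; the function name and the toy counts are hypothetical, and no multiple-testing correction is shown.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment.

    2x2 table of sample counts:
        a = feature mutated,     condition present
        b = feature mutated,     condition absent
        c = feature not mutated, condition present
        d = feature not mutated, condition absent
    Returns P(at least `a` co-occurrences) under the hypergeometric null.
    """
    n = a + b + c + d
    mutated = a + b
    with_condition = a + c
    total = comb(n, with_condition)
    tail = 0
    for k in range(a, min(mutated, with_condition) + 1):
        tail += comb(mutated, k) * comb(n - mutated, with_condition - k)
    return tail / total


# Toy counts for illustration only: 8 of 10 mutated samples carry the condition.
p_value = fisher_exact_greater(8, 2, 2, 28)
```

A feature enriched in this way across only one or two conditions would count as condition-specific, whereas the outliers mentioned above would pass the test for many conditions.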

Aggregated ALE Data Reveal Common Low-Frequency Mutation Targets Across Multiple Experiments

Some mutated features are more often selected across independent ALE replicates of individual experiments and are described as having a measure of parallelism or convergence[23] (Figure , Supplementary Figure S10). The phenomenon of convergence is leveraged in evolution experiments to identify potentially beneficial mutations for given conditions. Some highly beneficial mutations occur less frequently due to their relative inaccessibility to mutation processes, requiring complex sequence changes or specific mutations present beforehand.[47] Accessibility can be used to describe the ability to access and benefit from a mutation,[48] where the degree of mutation convergence likely reflects this combination of attributes. Mutated features generally have low convergence (Supplementary Figure S10), though there were some that demonstrate high convergence and therefore high accessibility (Figure , Supplementary Figure S10). Some features mutated in many experiments were observed with low convergence and associated conditions (Figure ), suggesting that these mutations were accessible to secondary selection pressures. These mutated features may represent genome design variables that are more easily identified through the aggregation of ALE experiment mutations. For example, the nagBAC-umpH operon is mutated across 10 different ALE experiments in this data set, though has an average convergence of 0.36 (Figure ). nagA and nagC, the most frequently mutated genes of this operon, are involved in the recycling of cell wall peptidoglycan and may be introducing broadly applicable beneficial changes. Mutations to the cell envelope are often seen in ALE experiments and have been shown to be beneficial.[49]

Figure 5

Aggregated ALE data reveal common low-frequency mutation targets with potential benefit to a broad set of conditions.

Case Studies of Application to Strain Design

The meta-analysis results reveal many opportunities for variant designs with the aggregated ALE data set. The following sections describe case studies deriving three variant designs using ALE data and the presented workflow (Figure ). The case studies target specific conditions that have potential for application and either have a large number of samples or involve genes frequently mutated in multiple experiments. The case studies also demonstrate designs based on either truncating or nontruncating mutations to explore the design potential of both mutation types.

An E. coli K-12 MG1655 Strain Design for Glycerol as a Carbon Source

The use of different carbon sources for bioproduction could prove to be an important strategy in maintaining feedstock flexibility. Glycerol has been shown to serve as feedstock for the production of valuable biochemicals, such as 1,3-PDO,[39] ethanol,[40] and limonene.[41] Glycerol-associated ALE mutations[23,50] thus represent an opportunity for an ALE-derived strain design with possible valuable commercial applications and are targeted as a demonstrative case study (Figure workflow step 1).

Associations with Conditions and Analysis of Multiscale Mutation Annotations Outlines Systems Important for Adaptation to Glycerol as a Carbon Source

Within the data set, 149 mutations had at least one of their host features significantly associated with glycerol as a carbon source (Figure workflow step 2, Supplementary Figure S14). The CRP regulon hosts the most mutated features for ALEs using glycerol as a carbon source (Supplementary Table S3), is associated with this selection pressure, and is one of the most frequently mutated regulons across this data set’s ALE experiments (Figure workflow step 2 and 3.1, Figure , Supplementary Figure S16). The CRP regulon describes many operons that encode for catabolic functions, including secondary carbon source metabolism. The CRP regulon is statistically associated with six different conditions, three of which were secondary carbon sources (Supplementary Figure S16). The three most frequently mutated operons associated with glycerol as a carbon source and linked to CRP were glpFKX, cyaA, and ptsHI-crr (Supplementary Table S2, Supplementary Figure S15, Supplementary Figure S16). The genes glpK, cyaA, and crr were the most frequently mutated features of their operons (Supplementary Figure S15, Figure a). Mutations to the CRP regulon, glpFKX operon, and glpK were strongly selected for (Figure ) and glpFKX mutations were strongly associated with conditions involving glycerol and a temperature of 30 °C (Supplementary Table S1, Supplementary Figure S16). While the cyaA and ptsHI-crr operons and the cyaA and crr genes were less strongly selected for by their ALE experiments, they were still seen mutated in multiple ALE experiments (Figure ) and were associated with a partially overlapping set of conditions (Supplementary Figure S16). Multiple strong associations to ALE conditions suggest the potential that mutations to these features could be beneficial for a broad set of stresses. Mutations to glpK and cyaA or glpK and crr were found in the same samples, though mutations to crr and cyaA were not (Figure a). 
This, along with correlations between the mutated genomic features associated with glycerol as a carbon source (Supplementary Figure S29), suggests that mutations to crr and cyaA have a negative epistatic relationship[23] (Figure workflow step 3.2).

Figure 6

Clustering of truncating or nontruncating mutations reveal variant designs for glycerol as a carbon source. (a) An oncoplot demonstrating the types of mutations to genomic features on operons of interest (operons cyaA, glpFKX, and ptsHI-crr) and the conditions for the ALE samples hosting these mutations. Values within parentheses represent concentrations in g/L unless otherwise stated. (b) A mutation needle plot for mutated amino acids across GlpK’s amino acid chain. (c) GlpK’s 3D structure and mutated residues from mutations. The residue chain and transparent surfaces are colored according to the legend of the corresponding mutation needle plot. Mutations are represented by a small opaque sphere with a value representing their amino acid position on the corresponding mutation needle plot. The color of the mutation’s sphere corresponds to the mutation’s predicted effect as described by the legend on the corresponding mutation needle plot. The transparent sphere centered on the mutations’ opaque sphere represents the number of mutations with a specific predicted effect on that position. The angle shown illustrates how all the GlpK–GlpK interface surfaces are oriented on the same side of the 3D structure along with the clustering of mutations on or near these surfaces. (d) A mutation needle plot for mutated amino acids across CyaA’s amino acid chain. (e) The accumulation of the truncated amino acids downstream of truncating mutation from the mutation needle plot. (f) CyaA’s protein structure and mutated residues from nontruncating mutations. (g) The growth rates of the mutants harboring ALE mutations and designed variants for GlpK in the selection pressure of glycerol as a carbon source. (h) The growth rates of the mutants harboring ALE mutations and designed variants for CyaA in the selection pressure of glycerol as a carbon source. (i) The growth rates of the mutants harboring ALE mutations and designed variants for CyaA in the selection pressure of Δpgi.

ALE Mutation Trends in glpK, cyaA, and crr

Among glpK, cyaA, and crr, glpK was the most frequently mutated (Figure workflow step 3.1). Further, it was observed to mutate in ALE experiments that involve substrate switching between glucose and glycerol (Figure a); all ALE experiment mutations to glpK were investigated to understand if mutation types clustered within the gene according to conditions. All mutations to glpK were SNPs and frequently landed in codons for amino acids involved in GlpK’s subunit interface (Figure b), where GlpK can form both a dimer and tetramer. The subunit interface amino acids are spread across GlpK’s sequence (Figure b), though in the 3D structure model their residue surfaces all group together to face the same direction (Figure c), revealing further clustering of mutations specific to 3D space. Finding the distances between mutated residues and GlpK features according to their 3D positions on GlpK’s structure provided for a potentially more accurate measure of nearness between mutations and features. GlpK subunit interfaces continue to be nearest to or directly host the most SNPs (Supplementary Figure S17), though the mutations to the GlpK subunit binding sites have the highest proportion of mutations with a predicted effect (Supplementary Figure S18). Four out of five of the SNPs to the GlpK subunit binding sites were accomplished through the same nucleotide substitution S59Y (TCC → TAC) and were predicted deleterious via a SIFT score <0.05. cyaA was the second most frequently mutated gene of those associated with glycerol and its mutations were often found in samples with mutations to glpK (Figure a). cyaA was mutated in multiple experiments of different conditions (Figure a); all ALE experiment mutations to cyaA were investigated to understand if mutation types clustered within the gene according to conditions. Mutations from multiple experiments cluster near the middle of the amino acid sequence (Figure d). 
Many of the mutations within this cluster caused a frameshift or truncation, thereby disrupting the downstream coding sequence. The accumulation of truncated amino acids demonstrated that the regulatory region was often targeted for disruption across varying ALE stressors, including that of glycerol as a carbon source (Figure e). Observing the mutated residues on CyaA’s structure from mutations other than truncations, there was no novel 3D clustering of mutated residues not already apparent with the linear analysis (Figure f). The clustering of mutations from multiple ALE experiments with various selection pressures suggests that a specific change to cyaA may have broad applicability. crr was the third most frequently mutated gene of those associated with glycerol (Figure a) and its mutations were often found in samples with mutations to glpK and never found in samples with mutations to cyaA (Figure a, Figure workflow step 3.2). The absence of mutated cyaA and crr genes in the same sample has been hypothesized to be due to their mutations having similar effects on the same system in the case of ALEs with a glycerol carbon source.[23] crr was mutated in multiple experiments of different conditions (Figure a); all ALE experiment mutations to crr were investigated to understand if mutation types clustered within the gene according to selection pressure. Most mutations to crr’s amino acid sequence fall on or near interfaces (Figure workflow step 4). These interfaces describe separate surfaces of Crr’s structure[51] (Supplementary Figure S21). Finding the distances between mutated residues and Crr features according to their 3D positions on Crr’s structure[51] provided for a potentially more accurate measure of nearness between mutations and features. Crr feature residues truncated by an upstream coding disruption were additionally included. 
The Crr-Crr interface hosts the most mutated residues (Supplementary Figure S22), though its mutation type differs from those mutations to all other features (Supplementary Figure S23). Mutations to the Crr-Crr interface were also specific to Δpgi ALEs while mutations to all other features were specific to glycerol carbon source ALEs (Supplementary Figure S20). Mutations from glycerol carbon source ALEs additionally cluster near each other on or near the same surface of the 3D structure (Supplementary Figure S21). This surface hosts the GlpK, PtsG, PtsH, PtsI, and FrsA interfaces as well as binding and active sites. All but one mutation clustering near these multi-interface residues were predicted to be disruptive (Supplementary Figure S20).

glpK and cyaA Genome Design Variables and Novel Variant Designs

In defining the genome design variable for glpK, multiple amino acid substitution properties were considered. Amino acid S59 was targeted for design due to its high frequency and specificity of mutation on an active site (Figure workflow step 5.3). Across all possible amino acid substitutions for glpK S59, the ALE observed tyrosine substitution (Y) was very highly ranked according to SIFT scores, size, and flexibility difference relative to wild-type residues (Supplementary Figure S19). High scores in these dimensions were chosen to represent the genome design variable to glpK (Figure workflow step 4). The S59Y (TCC → TAC) substitution was chosen as the representative ALE mutation for the mutation trend on glpK (Figure workflow step 5.3). The residue substitution of tryptophan (W) scored higher than or similar to the tyrosine ALE-derived substitution in the previously mentioned categories, and is likewise categorized as having a bulky aromatic side chain (Supplementary Figure S19b), thus making it a good candidate for a derived design. A substitution of tryptophan at this position has already been characterized to eliminate inhibition of GlpK’s catalytic activity.[52] Since the tryptophan substitution had already been characterized, another novel substitution was pursued for the purposes of this study. Phenylalanine (F) was the only other residue characterized as bulky with an aromatic side chain that also scored similarly to the tyrosine and tryptophan substitutions; therefore, the final proposed novel design for GlpK was that of S59F (Figure workflow step 5.1). Mutations to cyaA have a more interpretable effect than those to crr, which made it easier to define a genome design variable for cyaA: the truncation of CyaA’s regulatory region (Figure workflow step 4). The 5 bp deletion starting within amino acid 455 in cyaA was chosen as the representative ALE mutation for those of CyaA and Crr (Figure workflow step 5.3). 
This mutation was selected due to being mutated in two separate ALE replicates along with being an easily reintroducible mutation type. To truncate the regulatory region with more accuracy, CyaA’s novel variant design inserted 3 stop codons at amino acid 540, immediately upstream of the regulatory region (Figure workflow step 5.1).
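The stop-codon insertion design can be illustrated with a minimal sketch. It assumes that "at amino acid N" means the stop codons are placed immediately before codon N (1-based), and that three TAA codons are used. Neither detail is stated in the text, and the function name and toy sequence below are hypothetical.

```python
# Hypothetical helper: not the authors' design script.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # the three standard stop codons

def insert_stop_codons(cds, aa_position, stops=("TAA", "TAA", "TAA")):
    """Insert in-frame stop codons immediately before codon `aa_position` (1-based).

    Assumption: the choice of TAA and the exact placement relative to the
    named amino acid are illustrative; the study does not specify them.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    if any(stop not in STOP_CODONS for stop in stops):
        raise ValueError("only TAA, TAG, or TGA may be inserted")
    n_codons = len(cds) // 3
    if not 1 <= aa_position <= n_codons:
        raise ValueError("aa_position is outside the coding sequence")
    cut = (aa_position - 1) * 3  # nucleotide index where codon aa_position starts
    return cds[:cut] + "".join(stops) + cds[cut:]

# Toy coding sequence (Met-Ala-Gly-stop), truncated before the third codon.
toy_cds = "ATGGCTGGTTAA"
designed = insert_stop_codons(toy_cds, 3)
```

Inserting whole codons keeps the reading frame intact, so translation ends at the first inserted stop and the downstream sequence is left unchanged at the nucleotide level.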

Experimental Validation of glpK and cyaA Novel Variant Designs

To examine the fitness changes from glpK and cyaA mutants relative to wild-type with glycerol as a carbon source, growth screens were performed on reconstructed strains harboring the ALE mutations with the designed variants. The results show that the ALE-derived mutants and designs have similar growth rates along with higher growth rates than wild-type (Figure g, Figure h, Figure workflow step 5.5). To test the potential applicability of ALE-data-driven-designs with multiple different stresses, the cyaA mutants and design were additionally tested in the background of a Δpgi strain. Some of the cyaA ALE mutations that clustered in the center of the gene were selected for by a Δpgi ALE experiment (Figure d). The growth screen results demonstrated that the mutants and designs have similar growth rates along with higher growth rates than wild-type (Figure i). These results also show that a partial truncation to cyaA granted a higher fitness than a full truncation (Figure i, Figure workflow step 5.5), which differs from the glycerol carbon-source stressor growth screen, where partial and full truncations had similar fitness.

An E. coli K-12 MG1655 Strain Design for High Concentrations of Isobutyric Acid

Tolerance is a key phenotype for microbial cell factories. Industrial requirements can have strains exposed to or produce toxic concentrations of substrates or products. Genotypes that provide any improved tolerance to detrimental conditions can be valuable in that their tolerance can translate to higher concentrations of product, especially with large scale operations.[42] Isobutyric acid is a biochemical with a market size of 100 000 tons in 2011[43] and can be produced with an engineered E. coli strain.[42] Isobutyric acid associated ALE mutations[42] represent an opportunity for an ALE-derived strain design with possible applications (Figure workflow step 1).

Associations with Conditions and Analysis of Multiscale Mutation Annotations Outlines Systems Important for Adaptation to Isobutyric Acid Tolerance

Within the data set, 79 mutations had at least one of their host features significantly associated with the condition of toxic isobutyric acid concentrations (Supplementary Figure S24, Figure workflow step 2). The purine metabolism pathway hosts the most mutations (Supplementary Table S6), with the majority of the mutations coming from the pykF and rplKAJL-rpoBC operons (Supplementary Figure S25, Supplementary Table S5, Figure workflow step 3.1). Of these two, the pykF operon is strongly associated with toxic isobutyric acid concentrations (Supplementary Figure S28). The purine metabolic pathway, rplKAJL-rpoBC operon, pykF operon, and pykF were frequently mutated features across this data set’s ALE experiments (Figure ). The purine metabolic pathway and rplKAJL-rpoBC operon were strongly selected for in ALE experiments, while mutations to pykF and its operon were less so, on average (Figure ). pykF and its operon were also more specific in their associations to conditions, where the rplKAJL-rpoBC operon and purine metabolic pathway mutation associations were more broad (Figure , Supplementary Figure S28). Correlations between frequently mutated genomic features demonstrate pykF and rpoB as positively correlated (Supplementary Table S4, Supplementary Figure S30). The positive correlation between pykF and rpoB along with their differing associations suggested that these compatible mutations were adapting for different selection pressures (Figure workflow step 3.2).

ALE Mutation Trends in pykF

The pykF operon is frequently mutated across this study’s many ALE experiments. These mutations were gathered and investigated to understand if mutations clustered according to ALE experiment conditions. Most of these mutations target pykF’s coding sequence, with some to the operon’s noncoding features (Figure a). A mutation to the pykF operon’s noncoding features seems to preclude a mutation to any of the others, suggesting that they have related effects on the host strain (Figure a). Mutations to pykF were spread across its sequence, with the majority contained in the second half of the sequence (Figure b). Due to the large variety of experiments mutating pykF, the broad distribution of mutations across pykF’s sequence, and the large amount and variety of structural feature annotations available for PykF, it is expected that the trends described by the aggregate of all ALE experiment mutations would be most revealing. Many of the mutations to pykF were predicted to truncate the coding sequence or disrupt the potential function and/or structural stability of a domain through a nonsynonymous amino acid substitution. Considering the accumulation of truncated amino acids, the PykF subunit interfaces were the most frequently mutated features on PykF (Figure b, Figure d, Supplementary Figure S26). The truncations also seem to cluster on the second half of PykF’s sequence, resulting in the downstream coding disruptions primarily affecting the PykF tetramer subunit interfaces while avoiding the active sites in the first half of the sequence (Figure b, Figure d). Those mutations predicted not to cause truncations or frameshifts were found on PykF’s 3D structure nearest to binding sites within the cleft of PykF’s barrel domain (Figure c, Supplementary Figure S27). 
All mutations to PykF and those specifically from the toxic isobutyric acid ALEs follow the same trend across features on PykF’s structure (Supplementary Figure S26), providing evidence that mutations to pykF across different sets of conditions may accomplish similar outcomes. These trends suggest unique functional targets for the different mutation types, with truncations having the clearest outcome: a disruption to PykF’s ability to form a complex (Figure workflow step 4).

Figure 7

Clustering of truncating mutations reveals a variant design for toxic concentrations of isobutyric acid. (a) An oncoplot demonstrating mutations linked to the pykF operon across all ALE experiments of this study’s data. Values within parentheses represent concentrations in g/L unless otherwise stated. (b) A mutation needle plot for mutated amino acids across PykF’s amino acid chain. (c) PykF’s 3D structure and mutated residues. No truncating mutations are included. The residue chain and transparent surfaces are colored according to the legend of the corresponding mutation needle plot. Mutations are represented by a small opaque sphere with a value representing their amino acid position on the corresponding mutation needle plot. The color of the mutation’s sphere corresponds to the mutation’s predicted effect as described by the legend on the corresponding mutation needle plot. The transparent sphere centered on the mutation’s opaque sphere represents the number of mutations with a specific predicted effect on that position. The angle shown illustrates how most of the mutations cluster in 3D space around the area which hosts most of the catalytic domains. (d) The accumulation of the truncated amino acids downstream of truncating mutations from mutation needle plot. (e) The growth rates of WT, a ΔpykF strain, the pykF ALE mutant, and the pykF designed variant with inhibiting concentration of isobutyric acid (12.5 g/L). A ΔpykF mutant was used to investigate for any difference between the strains that partially truncate pykF and its full truncation. (f) The growth rates of WT, a ΔpykF strain, the pykF ALE mutant, and the pykF designed variant with glucose as a carbon source. A ΔpykF mutant was used to investigate for any difference between the strains that partially truncate pykF and its full truncation.

pykF Genome Design Variable and Novel Variant Design

The more interpretable trend of truncated amino acids accumulating across PykF subunit interfaces led to the genome design variable of truncating the later half of pykF containing these interfaces. A truncating insertion to amino acid 266 was chosen as the representative ALE mutation to pykF for the toxic isobutyric acid condition since it represents a clear trend in disrupting the PykF subunit interfaces and manifested in two independent ALE replicates with this selection pressure (Figure workflow step 5.3). A novel design variant for PykF was defined as an insertion of 3 stop codons at amino acid 253 immediately upstream of all the PykF subunit interfaces (Figure workflow step 5.1). The genome design variable was inspired by the strong selection of pykF mutations in the presence of high concentrations of isobutyric acid, though their benefit may instead be linked to growth on an abundance of glucose as a carbon source, as suggested by other studies.[45,49,53] All ALEs with mutations to pykF include abundant glucose as a carbon source (Figure a).

Experimental Validation of pykF Novel Variant Design

Growth screens were performed on the reconstructed strains harboring the pykF ALE mutation along with the designed truncation to examine the differences in fitness between mutants and wild-type (Figure workflow step 5.5). Growth screens for both toxic isobutyric acid levels and competitive glucose uptake were performed. The results show the mutant and designs have similar growth rates as well as higher growth rates than wild type with toxic isobutyric acid concentrations (Figure e) while not demonstrating obvious benefit in competitive glucose screens (Figure f).

Discussion

Meta-analysis methods were used to explore the potential for deriving novel genome design variables and variant designs from aggregated ALE data. Multiple methods were implemented to find ALE mutation trends at different levels of detail. These methods were organized into a sequence that functioned to narrow down ALE mutation trend targets and impact toward the derivation of genome design variables and novel variant designs for specific experimental conditions. Associations between mutated features and conditions extracted the mutated features potentially relevant for a condition of interest. Multiscale annotations helped group mutated features that belong to the same system, clarifying whether a beneficial change can be accomplished through few or many mutated features. High-level mutation trends were revealed by the meta-analysis methods. The majority of ALE mutations to noncoding regulatory features resulted in predicted truncations. Biological robustness is achieved by the regulation and maintenance of a variety of biological functions.[54] Deregulation of biological functions to remove restrictions on flux or the deactivation of unnecessary functions may ultimately benefit a host’s performance in highly specific ALE environmental selection pressures. In essence, the optimization of an organism through ALE may result in a simplification of its systems toward maximizing the use of a subset of beneficial functions. Most ALE-data-derived genotypic solutions for a selection pressure may only describe a small number of mutated genomic features and those features will likely be genes. This small mutated feature count trend was demonstrated by the small median cluster size of correlated mutant genomic features and the small median amount of conditions associated with unique mutated features. 
This result does not fully reflect the findings of the LTEE, where adaptive mutations continue to accumulate indefinitely.[55,56] ALE experiments typically demonstrate diminishing fitness returns per adaptive mutation, with the initial few providing drastic fitness increases, and the remainder providing substantially less fitness. ALE experiments not designed for longer-term study primarily report on those initial highly beneficial mutations, which is what this study’s meta-analysis results reflect. Genes were, by far, the most frequently mutated genomic feature. The bias toward genetic mutations may be due to many factors, one of which is that the genome of E. coli K-12 MG1655 is mostly composed of coding sequences (24). Another potential factor is that this study’s ALE selection pressures may only have required small alterations to functions encoded by genes. These high-level mutation trends may represent ALE-derived strain design principles that could inform rational design methods. The case studies in this work demonstrated how novel genome design variables and variants can be derived from aggregated ALE data. The case studies additionally validated these results by building strains that included novel variant designs based on the genome design variables and assayed their fitness against that of the wild-type strain. The meta-analysis methods, as organized in the proposed workflow, generated enough evidence to derive the functional impact of convergent ALE mutations. From this understanding, novel sequence variants were designed, built, and tested. Two different types of genome design variables were derived from two different types of mutation trends on genes. The genome design variables involving truncations were interpreted from the mutation trend of accumulating truncated amino acids on functional annotations across a gene’s amino acid sequence. 
The variant designs involving nontruncating mutations were interpreted from the clustering of mutated residues on a gene product’s functionally annotated 3D structure and residue properties. All design variables of this work seemed to target the reduction of functionality for a specific functional site encoded within a gene. These designs and their original mutations may be trading the robustness of a system for more simple and higher performing processes. A comprehensive screen of growth and robustness in multiple conditions would shed light on such trade-offs, but is technically challenging to perform given the vast range of potential growth conditions and stress combinations E. coli is known to have encountered in its evolutionary history. GlpK’s variant design involved an amino acid substitution that may have increased the glycerol kinase reaction rate. GlpK (glycerol kinase) is part of the pathway for utilizing glycerol as a carbon source. GlpK can form a catalytically active homotetramer and homodimer, though the homotetramer can be allosterically inhibited by fructose-1,6-bisphosphate, a downstream product serving as negative feedback. Mutations to Serine 59 have already been shown to disable homotetramer formation through steric incompatibility.[52] This leaves the homodimer formation, which may accomplish a higher overall rate of glycerol metabolism due to the lack of inhibition. Mutations to cyaA and crr may be maintaining Carbon Catabolite Repression (CCR). 
CyaA (adenylate cyclase) is part of the pathway that generates the activated CRP complex (cAMP-CRP), which goes on to activate genes for multiple secondary carbon source metabolic systems.[57] CyaA’s regulatory region is thought to activate the enzyme.[58] For the condition of glycerol as a carbon source, CyaA would become activated and produce cAMP, therefore activating these metabolic systems.[58] A truncation of cyaA has been shown to prevent cAMP production[59] and the downstream activation of CRP is expected to be nullified. Crr can also play a role in CCR. Unphosphorylated Crr binds to and inhibits GlpK in vitro,[57] though in the case of a glycerol as a carbon source, phosphorylated Crr should be more abundant. Crr interacts with CyaA and will activate CyaA’s cAMP production if phosphorylated.[57] Though this work does not include CyaA interface data for Crr, the phosphoryl group active sites are located in the middle of the multi-interface residues shared with GlpK; mutations to this interface surface may prevent the interaction with CyaA that activates cAMP production. The repression of these secondary carbon source metabolic systems is known as CCR and is normally enforced by the phosphotransferase system in the presence of glucose.[57] With glycerol as a carbon source, ALE-derived strains may have found a way to maintain CCR through mutations to cyaA while still allowing for the activation of glycerol metabolism. CCR maintenance could enable a more efficient cell metabolism in this circumstance, since it represses the activation of multiple unnecessary metabolic systems. Additionally, the cyaA case-study results demonstrated that the same mechanism can provide fitness across a variety of stressors. This result emphasizes the value of aggregated ALE data in that it enables the identification of broadly applicable variants. 
As suggested with the glpK and cyaA ALE mutations from the glycerol carbon-source case study, designs could also result from a combined set of compatible variants, where each variant optimizes for a separate stress. A strain’s environment is naturally composed of multiple conditions; therefore, compatible optimizations would be valuable in addressing multistress circumstances such as those of industrial scale fermentation (physical, chemical, and biological).[60−62] The ALE and designed mutant strains of this work were screened for their phenotypes. The mutant strains were found to be more fit than wild-type in the conditions that initially selected for their presence in ALEs. The ability of the strains hosting designed variants to achieve similar growth rates to beneficial ALE mutants demonstrates that beneficial variants can be designed from aggregated ALE data. The similarity between ALE and the novel variant growth rates also suggests that ALE processes of the current scale and time-span used in the meta-analysis may not find all possible beneficial sequence changes. This lack of full coverage for beneficial sequence changes by ALE processes may be due to the probability of specific mutational sequence changes. For example, GlpK’s ALE mutation involved a single base pair substitution, while both its designed variants involved two base pair substitutions within the same codon. The designed variant sequence changes may be less likely to occur with ALE. Thus, there likely exist beneficial variants that are not revealed through ALE but can be identified with the methods outlined in this study, such as the mutation clustering and structural biology approaches. In the cases involving truncations, ALE mutations and designed variants could produce more benefit than full gene truncations depending on the stressor. 
This was demonstrated with the cyaA mutants, where partial truncations had greater benefit than full truncations in a Δpgi background, though no substantial difference in benefit was observed between partial and full truncations of cyaA with glycerol as a carbon source. The representative mutation, controls, and designed variant of pykF also involved truncations. Full truncations of pykF are thought to be valuable with scarce glucose, as seen with long-term evolution experiments,[45,53] where partial truncations may be more valuable in abundant glucose.[49] All of this study’s pykF mutations manifested in strains using glucose as a carbon source (Figure a) and the majority of the truncations occurred near the middle of the coding sequence, where truncations mostly avoided catalytic sites encoded in the first half of the gene (Figure b, Figure d). The results from the phenotypic screens demonstrated otherwise: no obvious benefit was gained from pykF mutants in competitive glucose growth screens (Figure f). The beneficial effects of pykF truncations relative to glucose may require a different genetic background such as a different strain or an already present set of variants. Finally, there may still exist conditions in which a partial truncation to pykF is more beneficial than a full truncation, though these conditions are currently unclear according to this study’s results. The methods in this work demonstrate the value in leveraging diverse public resources to describe ALE variants and have much potential for improvement. The successful characterization of the beneficial mechanisms of mutations relies upon the availability of genome annotations local to the mutations of interest as well as tools that can describe the nature and magnitude of a mutation’s severity. 
The efforts of this work leveraged an already available whole-genome multiscale annotation framework and further enriched mutation annotations local to the genes of interest.[23] These additional annotations, coming from resources such as Uniprot,[63] EcoCyc,[64] Pfam,[65] and Mutfunc,[66] are available for the E. coli strain of this study, though efforts have not yet been made to consolidate them into a unified computational resource. Additionally, the mutation characterization efforts of this study greatly benefitted from investigating the other local mutations to a feature of interest. Variants can therefore also serve as genome annotations, and are especially useful if described by the conditions they were found in. Traditional genome annotations typically do not include variants, though recent efforts have developed a bioinformatics resource that combines genome annotations as well as variants: the Bitome.[67] The Bitome could serve as the locus for consolidating whole-genome multiscale feature annotations along with variants and their metadata for a comprehensive, high-resolution, genomic resource per organism. There exists the potential to substantially increase sets of consolidated comparable mutations by including variants across strains. Recently accessible sequencing and computational tools have greatly expanded the amount and variety of publicly available annotated genomes. Pan-genomic methods can leverage this growing set of genomes to compare and group similar gene sequences across strains and enable cross-strain variant functional analysis.[68] Extensions of these methods may be possible for cross-species functional analysis. Integrating the sequence variations between alleles established through pangenomic methods offers the potential to significantly increase the number of variants available for the functional analysis of genome design variables.

Conclusion

This work demonstrates how to derive nucleotide-level Escherichia coli genome design variables from aggregated ALE mutational data. These design variables are impactful in that they can be leveraged in variant and strain designs, and are inaccessible to current rational strain design methods. Toward this effort, meta-analysis methods involving nucleotide-level mutation data, multiscale functional annotations, and predicted mutation effects were used to anticipate the general characteristics of ALE-data-driven strain designs. These predictions described condition-specific strain designs involving a small amount of sequence changes that primarily target genes with both truncating and nontruncating effects. A workflow was developed that executes the meta-analysis methods in an order that derives specific nucleotide-level design variables and variants applicable to specific conditions from the broader aggregated ALE data set. This study focused on E. coli based variant designs, though it is expected that these methods could be applied to various microbial species with available ALE data. Two case-studies were included to serve as proof-of-concepts for ALE-data-driven genome design variables and may hold value for applications in microbial cell factories and beyond. The case-studies demonstrated beneficial designs based on point mutations and partial-truncations, where mutation functional targets were highlighted by either the accumulation of truncated codons on the gene sequence or point mutation clustering on the 3D structure. In the case of point mutations, the predicted effects of mutations on gene-product properties were necessary to elucidate ALE mutation trend functional impact. Variant designs were also shown to be beneficial for multiple stressors, and there exists mutational evidence that variant designs can be combined in the same strain. Finally, depending on the stressor, partial truncations were shown to be more beneficial than full gene knockouts. 
Until rational methods can predict all possible biological paths between genotypes and phenotypes, data-driven methods will continue to provide value toward strain design.

Methods

Biological Material

All mutants used the base strain of E. coli K-12 MG1655 (ATCC 47076). Mutant strains were generated following a CRISPR/Cas9-assisted protocol outlined by Zhao et al.[69] (Supplementary Table S10). This method relies on Cas9 to cut the genome of the starting strain while leaving intact the successfully mutated genome. A single plasmid encoding the CRISPR/Cas9 and Lambda Red Recombinase systems along with repair arms and a 20 nt guide RNA targeting the starting strain sequence was constructed using Golden Gate Assembly. In this case the repair arms were generated using PCR instead of annealing two oligos. Depending on the target mutation, one of two strategies for the placement of the 20 nucleotide guide RNA was used: (1) spanning the bases targeted for mutation if that region was next to a PAM sequence, or (2) close to the region being mutated and next to a PAM sequence that could be eliminated by introducing a synonymous codon with the repair arms. When option 2 was used to introduce a single nucleotide change, one of the 20 bases of the guide RNA unrelated to that position was deliberately changed so that the successfully mutated strain would have two mismatched bases with respect to the guide RNA while the starting strain would have just one mismatch. Colonies were screened using ARMS PCR in which one of the primers was designed to work on the starting strain and not on the mutated sequence. All mutations were verified by Sanger sequencing an amplicon generated with primers targeting the genome distal to the ends of the repair arms to avoid sequencing the plasmid. Finally, the plasmid was eliminated by growth at 37 °C, verified by parallel plating on media with and without kanamycin.
Each chromosomal cyaA, pgi, and pykF deletion was introduced using a temperature-sensitive pGE3 carrying λ-recombinases and MAD7 nuclease (MADzyme), the sequence of which was obtained from Inscripta Inc. Briefly, E. coli MG1655 was transformed with a temperature-sensitive pGE3 (Supplementary Table S8, Supplementary Table S9), modified from pMP11 by replacing Cas9 with MAD7.[70] Lambda recombinases were induced with 0.2% arabinose at 37 °C for 45 min. Then, induced cells were transformed with 200 ng gRNA plasmid together with 100 pmol of a synthetic oligo containing the flanking 45 bp homologous sequences for each gene. Upon transformation, cells were recovered at 30 °C for 1 h 30 min. Then, the cells were transferred to 2 mL of LB containing ampicillin and chloramphenicol and grown overnight at 30 °C. Knockout transformants were isolated by plating and validated by colony PCR and Sanger sequencing. The guide RNA plasmid was cured by growing confirmed isolates in LB containing ampicillin and 200 μg/L of anhydrotetracycline at 30 °C for at least 6 h. Loss of pGE3 was achieved by propagating at 37 °C. Loss of both plasmids was validated by checking the cell’s sensitivity to antibiotics. The cyaA knockout was performed on the Δpgi strain to construct the pgi and cyaA double-knockout strain.

Physiological Characterizations

The growth rates of strain clones were screened by inoculating cells from an overnight culture to a low OD and then sampling the OD600 until the stationary phase was reached. A linear regression of the log–linear region was computed using the linregress function from the scipy.stats Python package and the growth rates were determined from the resulting slopes. Screen condition details are given in Supplementary Table S7.
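The growth rate calculation can be sketched as follows. This is a minimal illustration, not the study's script: the OD600 window used to pick out the log-linear region (`od_min`, `od_max`) is a hypothetical choice, because the text does not say how that region was selected.

```python
import math
from scipy.stats import linregress

def growth_rate(times_h, od600, od_min=0.02, od_max=0.3):
    """Estimate the specific growth rate (1/h) as the slope of ln(OD600) vs. time.

    Assumption: od_min and od_max are illustrative bounds standing in for the
    log-linear region; the study does not report how this window was chosen.
    """
    points = [(t, math.log(od)) for t, od in zip(times_h, od600)
              if od_min <= od <= od_max]
    if len(points) < 3:
        raise ValueError("need at least three points in the log-linear window")
    xs, ys = zip(*points)
    fit = linregress(xs, ys)
    return fit.slope, fit.rvalue ** 2  # growth rate and goodness of fit (R^2)

# Synthetic exponential growth at 0.7 1/h, sampled every 30 min.
times = [0.5 * i for i in range(12)]
ods = [0.02 * math.exp(0.7 * t) for t in times]
rate, r_squared = growth_rate(times, ods)
```

Reporting R² alongside the slope gives a quick check that the selected window is actually log-linear.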

Residue Properties

Residue properties were acquired from the software package ssbio.[71]

SIFT and ΔΔG Scores for the Mutation Effect Prediction of Amino Acid Substitutions

In the case studies, two more effect types were predicted for mutations: deleterious and structural destabilization. Both of these were specific to coding regions. Deleterious effects were assumed according to significant SIFT scores (SIFT score <0.05).[72] Structural destabilization was assumed according to predicted significant ΔΔG scores (ΔΔG > 2).[73] SIFT and ΔΔG scores were acquired from Mutfunc.[66]
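The two thresholds above amount to a simple labeling rule, sketched below. The function name and the handling of missing scores are assumptions. Only the cutoffs (SIFT < 0.05, ΔΔG > 2) come from the text.

```python
from typing import List, Optional

SIFT_DELETERIOUS_BELOW = 0.05   # cutoff stated in the text
DDG_DESTABILIZING_ABOVE = 2.0   # cutoff stated in the text

def predicted_effects(sift: Optional[float], ddg: Optional[float]) -> List[str]:
    """Label an amino acid substitution from its SIFT and ddG scores.

    Assumption: a missing score (None) contributes no label; the study does
    not describe how substitutions without scores were treated.
    """
    effects = []
    if sift is not None and sift < SIFT_DELETERIOUS_BELOW:
        effects.append("deleterious")
    if ddg is not None and ddg > DDG_DESTABILIZING_ABOVE:
        effects.append("structural destabilization")
    return effects
```

Both comparisons are strict, so a substitution scoring exactly 0.05 or exactly 2 receives no label, matching the inequalities as written.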

Gene Product Feature Annotations

GlpK features were acquired from Uniprot,[63] EcoCyc,[64] Pfam,[65] Mutfunc,[66] and individual publications.[74]

Gene Product Structures and Distances between Mutated Residues and Features

Distances between mutated residues and the residues of functional features were calculated using gene product 3D structures and the Cartesian distance formula. GlpK was represented by the 3EZW PDB model. Structures for Crr, CyaA, and PykF were obtained from those provided with the iML1515 model of E. coli K-12 MG1655.[51]
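A minimal sketch of the distance calculation is given below. It assumes each residue is represented by a single 3D coordinate, for example its alpha carbon. The text does not state which atom or centroid was used, and the function and feature names are hypothetical.

```python
import math

def nearest_feature(mutated_xyz, features):
    """Return the feature closest to a mutated residue and that distance.

    mutated_xyz: (x, y, z) coordinate of the mutated residue.
    features: dict mapping a feature name to a list of (x, y, z) residue
    coordinates. Assumption: one representative coordinate per residue.
    """
    distances = {
        name: min(math.dist(mutated_xyz, xyz) for xyz in coords)
        for name, coords in features.items()
    }
    name = min(distances, key=distances.get)
    return name, distances[name]

# Toy coordinates in angstroms; not taken from any PDB model.
toy_features = {
    "subunit interface": [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)],
    "active site": [(0.0, 0.0, 12.0)],
}
feature, distance = nearest_feature((0.0, 0.0, 0.0), toy_features)
```

A residue that is itself part of a feature gets a distance of zero to that feature, which corresponds to a mutation directly hosted by the feature.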

Software Scripts

The software scripts and data supporting the conclusions of this article are archived in the following open-access archive repository: 10.5281/zenodo.5108958.

Availability of Data and Materials

The data sets supporting the conclusions of this article are available in the following open-access archive repository: 10.5281/zenodo.5108958. These data sets are also available in the ALEdb database.[22]

Mutation Data Cleaning

The mutations from ALEdb are initially described relative to the sample and the genome reference. Since some of these experiments include midpoint samples, mutations that emerge within a midpoint sample and are carried through to an end point sample can be counted more than once per ALE, which would result in an inappropriately inflated mutation count. Unique ALE mutations were therefore only considered once per ALE. Starting strain mutations were filtered out of the ALE experiment mutation data sets according to their publications after being exported from ALEdb. ALE replicates containing hypermutators were also removed from the data set. An ALE replicate was predicted to host a hypermutator strain if two conditions were met: (1) a hypermutator gene[75] was mutated within the ALE replicate, and (2) the number of mutations found within the ALE replicate was labeled as an outlier relative to boxplot quartiles when considering the distribution of mutation counts for all ALE replicates used in this study. Mutations in population samples with a frequency below 50% were filtered out to instead focus on mutations that demonstrate dominant selection within a sample. In calculating correlations between mutated genomic features and generating the network diagrams of multiscaled mutated features, large deletions were removed to filter out large sets of mutated features that were only mutated once.
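Three of the cleaning steps above (counting each mutation once per ALE, removing predicted hypermutator replicates, and dropping mutations below 50% frequency) can be sketched as follows. Several details are assumptions: the record layout, the gene set in `HYPERMUTATOR_GENES` (illustrative, not the list from ref 75), the reading of "outlier relative to boxplot quartiles" as the common upper fence Q3 + 1.5 × IQR, and the order in which the filters are applied.

```python
from statistics import quantiles

# Illustrative mutator genes only; the study's list comes from its ref 75.
HYPERMUTATOR_GENES = {"mutS", "mutL", "mutH", "mutT", "mutY"}


def unique_per_ale(records):
    """Keep each (ALE, mutation) pair once, e.g. across midpoint and end point samples."""
    seen, kept = set(), []
    for rec in records:
        key = (rec["ale"], rec["mutation"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def hypermutator_ales(records):
    """ALEs with a mutated mutator gene AND an outlier mutation count."""
    counts = {}
    for rec in records:
        counts[rec["ale"]] = counts.get(rec["ale"], 0) + 1
    q1, _, q3 = quantiles(list(counts.values()), n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)  # assumed boxplot outlier rule
    with_gene = {r["ale"] for r in records if r["gene"] in HYPERMUTATOR_GENES}
    return {ale for ale, n in counts.items() if n > upper_fence and ale in with_gene}


def drop_low_frequency(records, min_frequency=0.5):
    """Remove mutations observed below the frequency cutoff."""
    return [r for r in records if r["frequency"] >= min_frequency]


# Synthetic records: five ordinary ALEs and one (A6) with 40 mutations.
raw = [
    {"ale": ale, "mutation": f"{ale}-m{i}", "gene": "glpK", "frequency": 1.0}
    for ale, n in [("A1", 2), ("A2", 3), ("A3", 2), ("A4", 3), ("A5", 3), ("A6", 40)]
    for i in range(n)
]
raw[2]["frequency"] = 0.2      # A2-m0: subdominant in a population sample
raw[-1]["gene"] = "mutS"       # A6 carries a mutation in a mutator gene
raw.append(dict(raw[0]))       # A1-m0 observed again in a later sample

deduped = unique_per_ale(raw)
flagged = hypermutator_ales(deduped)
cleaned = drop_low_frequency([r for r in deduped if r["ale"] not in flagged])
```

In this synthetic data set, A6 is the only replicate that satisfies both hypermutator conditions. An ALE with many mutations but no mutated mutator gene would be retained.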

Quantitative Plots

Unless otherwise stated, figure plots were generated using Matplotlib version 3.0.3[76] and Seaborn version 0.11[77] or Plotly[78] Python software packages.

Network Diagrams

The network diagrams of multiscale mutated features were generated using Cytoscape.js.[79]

Mutation Needle Plots

The mutation needle plots were generated using the trackViewer R software package.[80]

Oncoplots

The oncoplots were generated using the ComplexHeatmap R software package.[81]

3D Protein Structures

The visualizations for the 3D protein structures were generated using the NGL software package.[82]

78 in total

Review 1. Comparative genomics: the bacterial pan-genome.

Authors: Hervé Tettelin; David Riley; Ciro Cattuto; Duccio Medini
Journal: Curr Opin Microbiol Date: 2008-10 Impact factor: 7.934

Review 2. Engineering microbial membranes to increase stress tolerance of industrial strains.

Authors: Yanli Qi; Hui Liu; Xiulai Chen; Liming Liu
Journal: Metab Eng Date: 2018-12-31 Impact factor: 9.783

3. ssbio: a Python framework for structural systems biology.

Authors: Nathan Mih; Elizabeth Brunk; Ke Chen; Edward Catoiu; Anand Sastry; Erol Kavvas; Jonathan M Monk; Zhen Zhang; Bernhard O Palsson
Journal: Bioinformatics Date: 2018-06-15 Impact factor: 6.937

Review 4. Biofuel production in Escherichia coli: the role of metabolic engineering and synthetic biology.

Authors: James M Clomburg; Ramon Gonzalez
Journal: Appl Microbiol Biotechnol Date: 2010-02-09 Impact factor: 4.813

Review 5. Genomes by design.

Authors: Adrian D Haimovich; Paul Muir; Farren J Isaacs
Journal: Nat Rev Genet Date: 2015-08-11 Impact factor: 53.242

6. Development of a fast and easy method for Escherichia coli genome editing with CRISPR/Cas9.

Authors: Dongdong Zhao; Shenli Yuan; Bin Xiong; Hongnian Sun; Lijun Ye; Jing Li; Xueli Zhang; Changhao Bi
Journal: Microb Cell Fact Date: 2016-12-01 Impact factor: 5.328

7. k-OptForce: integrating kinetics with flux balance analysis for strain design.

Authors: Anupam Chowdhury; Ali R Zomorrodi; Costas D Maranas
Journal: PLoS Comput Biol Date: 2014-02-20 Impact factor: 4.475

8. Cytoscape.js: a graph theory library for visualisation and analysis.

Authors: Max Franz; Christian T Lopes; Gerardo Huck; Yue Dong; Onur Sumer; Gary D Bader
Journal: Bioinformatics Date: 2015-09-28 Impact factor: 6.937

9. Machine learning and structural analysis of Mycobacterium tuberculosis pan-genome identifies genetic signatures of antibiotic resistance.

Authors: Erol S Kavvas; Edward Catoiu; Nathan Mih; James T Yurkovich; Yara Seif; Nicholas Dillon; David Heckmann; Amitesh Anand; Laurence Yang; Victor Nizet; Jonathan M Monk; Bernhard O Palsson
Journal: Nat Commun Date: 2018-10-17 Impact factor: 14.919