A review of computational tools for design and reconstruction of metabolic pathways.

Lin Wang¹, Satyakam Dash¹, Chiam Yu Ng¹, Costas D Maranas¹.

Abstract

Metabolic pathways reflect an organism's chemical repertoire and hence their elucidation and design have been a primary goal in metabolic engineering. Various computational methods have been developed to design novel metabolic pathways while taking into account several prerequisites such as pathway stoichiometry, thermodynamics, host compatibility, and enzyme availability. The choice of the method is often determined by the nature of the metabolites of interest and preferred host organism, along with computational complexity and availability of software tools. In this paper, we review different computational approaches used to design metabolic pathways based on the reaction network representation of the database (i.e., graph or stoichiometric matrix) and the search algorithm (i.e., graph search, flux balance analysis, or retrosynthetic search). We also put forth a systematic workflow that can be implemented in projects requiring pathway design and highlight current limitations and obstacles in computational pathway design.

Year: 2017 PMID: 29552648 PMCID: PMC5851934 DOI: 10.1016/j.synbio.2017.11.002

Source DB: PubMed Journal: Synth Syst Biotechnol ISSN: 2405-805X

Introduction

Nature has endowed many organisms with specific biochemical capabilities spanning diverse metabolic pathways, ranging from carbon dioxide fixation by Clostridium ljungdahlii using the Wood-Ljungdahl pathway [1] to ammonia assimilation by cyanobacteria using the glutamine synthetase-glutamate synthase (GS-GOGAT) cycle [2]. Advancements in metabolic engineering have enabled us to engineer and express enzymes and construct novel pathways for various applications including drug discovery [3], [4] and value-added biochemical production [5]. Notably, Galanie et al. recently engineered the complete opioid biosynthesis pathways, consisting of 21 and 23 native and heterologous enzymes, to produce thebaine and hydrocodone, respectively, in yeast [4]. In addition, multi-enzymatic steps can nowadays be engineered in a cell-free system for in vitro synthesis [6], [7].

The pathway search involves finding the right combination of enzymes to form the pathway connecting a given source molecule (e.g., carbon substrate or any native metabolite in a cell) to a target molecule. Computational pathway design algorithms enumerate potential routes linking the two molecules, while often taking into consideration a multitude of criteria such as shortest route, minimal number of heterologous reactions, thermodynamic feasibility, and enzyme availability. While most methods capitalize on the large number of enzymatic reactions available in nature, there is also an increasing number of tools that employ biotransformation rules derived from the existing reactions to design de novo pathways [8], [9]. The latter rely on the remarkable malleability of enzymes [10], [11], [12] to accept a broad range of substrates as well as the potential of protein engineering [13], [14] and de novo enzyme design [15]. As an example, Savile et al. carried out in vitro synthesis of enantiopure anti-diabetic sitagliptin by combining computational protein engineering and directed evolution to broaden the substrate range of a transaminase enzyme [7].

Pathway discovery tools have successfully guided several metabolic engineering efforts. In particular, Yim et al. [5] demonstrated the production of up to 18 g/L of 1,4-butanediol (BDO) in E. coli by engineering the best pathways after surveying over 10,000 computationally designed pathways. The BDO titer was increased to 110 g/L with improved downstream enzymes [16]. Their success highlights the potential application of pathway design algorithms to a variety of projects [6]. Pathway design tools are not only applicable to pathway prospecting for biosynthesis of commodity chemicals, biofuels, or pharmaceuticals, but have also been applied to develop biosensing pathways for target molecules. For example, Libis et al. used the retrosynthetic approach (XTMS) to identify pathways from undetectable target molecules such as drugs, pollutants, and biomarkers to known inducer molecules, which could then activate transcription factors [17]. The activated transcription factors can be used to regulate an easily detectable metabolite, antibiotic marker, or fluorescent protein, which can subsequently be used to screen for strains producing target molecules [17].

As a large number of pathway design tools have been published, identifying the best method depending on the overarching project goal and available computational tools is a non-trivial task. Although a number of review articles have been published to help with this task, they generally focus on specific aspects such as existing de novo pathway design tools [8], [9], [18], reconstructing metabolic pathways in organisms of interest [19], or identifying/refactoring parts and circuit designs beyond pathway prediction [20]. In this review, we discuss in detail all the steps involved in implementing pathway design algorithms (e.g., database construction, pathway ranking, enzyme selection, etc.). There exist several classifications of pathway design tools based on different aspects of the implementation procedures. For example, Koreta et al. classified these tools into reference-based, reaction-filling, and compound-filling frameworks [19]; Nakamura et al. classified them as fingerprint-based, maximum common substructure-based, and rule-based methods [21]; Cho et al. classified them as chemical structural changes-based, enzymatic information-based, and reaction mechanism-based methods [22]. In this review, we choose to classify the tools based on their algorithmic choices such as graph theory, integer optimization, and retrosynthetic organic synthesis. In particular, the algorithms are classified into graph-based [23] (i.e., reactions and metabolites represented as a graph), stoichiometry-based [24] (i.e., reactions and metabolites represented using a stoichiometric matrix), and retrosynthesis-based [8] (i.e., iteratively identifying reaction rules that can transform a reactant molecule) approaches. We also compare the pathway design algorithms based on their database curation, reaction network representation and its pruning, search algorithm, and pathway ranking methods used to prioritize the often-expansive list of possible pathways. As the next step for pathway design, we discuss the possibilities of applying protein engineering and de novo enzyme design tools to aid in protein design and discovery for the designed pathways. Finally, we highlight current limitations and explore potential applications of these tools.

Generalized in silico pathway design workflow

A generalized pathway design workflow highlighting the five steps is presented in Fig. 1: (1) database construction, (2) metabolic network representation, (3) network pruning, (4) search algorithm implementation, and (5) pathway ranking to select the best pathways of interest. In the following section, we discuss a number of pathway design algorithms that follow this design workflow (see Table 1) and highlight the challenges that new tools could be developed to tackle.

Fig. 1

A conceptualized pathway design workflow.

Table 1

Graph-based, stoichiometry-based, and retrosynthesis-based pathway design tools and their characteristics.

Category	Name	Database	Network representation	Network pruning	Search algorithm	Pathway ranking	Reference
Graph-based	ReTrace	KEGG	Bipartite graph	Atom mapping	Heuristic search	Atom conservation and pathway length	[23]
	PathComp	KEGG	Substrate graph	–	Depth-first search (DFS)	–	[26]
	MetaRoute	KEGG	Reaction graph	Weighted graph and atom mapping	Eppstein's k-shortest path	Atom conservation and metabolite connectivity	[51]
	Pathway Hunter Tool	KEGG	Substrate graph	–	Breadth-first search (BFS) with higher-order Horn logic (HOHL)	Structure similarity and pathway length	[63]
	FMM	KEGG	Substrate graph	Manual cofactor removal	BFS	Compare pathway across organisms	[64]
	RouteSearch	MetaCyc	Substrate graph	Atom mapping	Branch and Bound	Atom conservation and pathway length	[65]
	MRE	KEGG	Substrate graph	Weighted graph	Yen's loopless k-shortest path	Thermodynamics and genes from host organism	[66]
	CMPF	KEGG, RPAIR	Bipartite graph	Weighted graph	Bounded depth path enumeration	Metabolite connectivity, reaction occurrence frequency, and pathway switching	[67]
	NeAT	MetaCyc	Bipartite graph	Weighted graph	Takahashi–Matsuyama, Pairwise K-shortest paths, and kWalks	Metabolite connectivity	[68]
	LPAT/BPAT	KEGG	Bipartite graph	Atom mapping	BPAT-M Search	Atom conservation and pathway length	[69], [139]
	Rahnuma	KEGG	Hypergraph	Phylogeny or sub-network	DFS	–	[70]
	Metabolic Tinker	CHEBI, Rhea	Hypergraph	Weighted graph	Heuristic search	Pathway length, structure similarity, and thermodynamics	[71]
	FogLight	KEGG, MetaCyc	Hypergraph	And/Or graph	Brute-force search	Pathway length	[73]
	MRSD	KEGG	Substrate graph	Weighted graph	Eppstein's k-shortest path	Reaction occurrence frequency	[78]
	DESHARKY	KEGG	–	Phylogeny	Monte Carlo	Metabolic burden	[88]
Stoichiometry-based	optStoic	KEGG, MetRxn	S matrix	Design overall stoichiometry	MILP	Pathway length or total metabolic flux	[24]
	PathTracer	BIGG, iJO1366	Substrate graph, S matrix	Atom mapping (MapMaker)	MILP	Pathway length or most active path	[50]
	CFP	BIGG	Substrate graph, S matrix	Atom mapping (carbon exchange network)	MILP	Pathway length	[75]
	METATOOL 5.0/k-shortest EFM	BIGG, iAF1260	S matrix	–	MILP	Pathway length	[76], [89]
	OptStrain	KEGG	S matrix	–	MILP	Number of heterologous reactions	[99]
Retrosynthesis-based	SimPheny	BIGG	Substrate graph	Molecule sizes	Retrosynthetic enumeration	Pathway length, thermodynamics, product yield, number of known metabolites/enzymes, and existence of reaction operators	[5]
	GEM-Path	BIGG, iJO1366	Substrate graph	Third level EC number and substrate similarity	Retrosynthetic enumeration	Thermodynamics and product yield	[57]
	XTMS/RetroPath/RetroPath 2.0	MetaCyc, BioCyc	S matrix	Molecular signature with predetermined distance	Retrosynthetic enumeration and MILP	Thermodynamics, gene prediction, pathway length, number of putative steps, and product yield	[52], [82], [84]
	BNICE	KEGG, ATLAS	Substrate graph	Qualitative/Quantitative pruning	Retrosynthetic enumeration	Pruning criteria assessment (thermodynamics, pathway length, etc.)	[8], [53]
	UM-PPS	UM-BBD	Substrate graph	Rule priority	Retrosynthetic enumeration	–	[56]
	PathPred	KEGG, RPAIR	Substrate graph	Structure similarity	Retrosynthetic enumeration	Compound similarity and pathway score	[54]
	Route Designer	MOS, Beilstein CrossFire	Substrate graph	Heuristics and user-defined limits	Retrosynthetic enumeration	Weighted function (wastage, example counts, and balanced disconnections)	[55]
	SimIndex/SimZyme	BRENDA	Substrate graph	Structure similarity	Byers–Waterman type pathway search	Pathway length	[83]
	Method by Cho et al.	KEGG	Substrate graph	–	Retrosynthetic enumeration	Combination of five priority factors	[22]

Databases

All pathway search tools rely on a database from which biochemical reactions and molecules can be recruited to constitute the pathway of interest. Currently, a number of databases have been constructed for known biochemical reactions and pathways (e.g., BIGG [25], KEGG [26], MetaCyc [27], BRENDA [28], ModelSEED [29], MetRxn [30], Rhea [31], UM-BBD [32], MOS [33], and Beilstein CrossFire [34]), as well as hypothetical metabolites and reactions (such as ATLAS of Biochemistry [35] and MINE [36]) (see Fig. 1A). The current version of the BIGG database consists of 80 manually curated organism-specific genome-scale metabolic models (GSMs) [25], while KEGG and MetaCyc catalog a more comprehensive array of organisms and their metabolic pathways [37]. KEGG contains 4102 more metabolites than MetaCyc while MetaCyc contains 3695 more reactions (as of August 20, 2017) [37]. BRENDA includes detailed enzyme information such as measured kinetic parameters [28], whereas ModelSEED provides the reaction mapping between KEGG and curated GSMs [29].

Existing databases sometimes contain incorrect stoichiometries, imbalanced charges, and redundancies due to molecule and reaction synonyms, and may lack chemical structures. Since these databases are ultimately used for pathway design and the construction of hypothetical reaction rules, manual curation is often a necessary step in order to unify the metabolite and reaction names (to ensure network connectivity) and remove stoichiometrically imbalanced and redundant reactions [38], [39]. Attempts to standardize reaction and metabolite names (e.g., MetRxn [30] and Rhea [31]) were previously made; however, automation of the process remains elusive and is further compounded by the continued discovery of new reactions and metabolites [40]. In a recent effort, MetaNetX [41] employed a reconciliation algorithm, MNXref [42], to resolve the discrepancies in reaction and metabolite naming between GSMs. BKM-react matched metabolites and reactions by comparing InChI structures and compound synonyms [43]. RxnFinder used the PubChem database to unify compound synonyms and added more than 50,000 reactions curated from literature to its in-house database [44]. Alternatively, the Chemical Translation Service [45] and UniChem [46] provide simple web applications that can interconvert metabolite IDs across different databases.

Organism-specific GSMs or knowledge-bases (e.g., EcoCyc [47], AraCyc [48], and HumanCyc [49]) are often constructed or extracted from the larger databases to ensure the design or identification of alternative pathways within an organism's native network. While certain tools limit the search space to only reactions within a particular organism (e.g., PathTracer [50] uses the E. coli GSM iJO1366), the search for a heterologous pathway would entail the use of a more comprehensive database encompassing multiple organisms, thereby ensuring that a desirable biotransformation (i.e., gene/enzyme) can be found (e.g., optStoic [24] and MetaRoute [51] use the curated KEGG database; XTMS [52] uses MetaCyc). The potential of broad-substrate enzymes or synthetic enzymes to catalyze previously unknown or de novo reactions has also garnered interest in de novo pathways involving various non-natural molecules (e.g., pharmaceutical drugs). Such de novo pathways can be designed by exploiting the generalized reaction rules (e.g., ATLAS of Biochemistry [35]) which could act upon structurally similar metabolites in the databases (e.g., MINE [36]). Although hypothetical reactions have been generated for the implementation of most de novo pathway design algorithms, only two databases, namely ATLAS of Biochemistry and MINE, are currently available for public access.
The database of hypothetical reactions can be developed using (but not limited to) five different reaction operators, each of which encodes chemical transformation mechanisms in a different manner: (i) BNICE uses a bond-electron matrix (BEM) to define non-bonded valence electrons and bond orders [53]; (ii) XTMS uses a molecular signature to generate reaction rules based on the substructure of adjacent atoms [52]; (iii) PathPred uses the RDM pattern (developed by KEGG researchers) consisting of the reaction center atom (R), atoms of the different region (D), and atoms of the matched region (M) [54]; (iv) Route Designer applies a rule similar to that of RDM, by defining a reaction core and an extended reaction core with primary and secondary bonds and non-reacting neighborhood atoms [55]; (v) UM-PPS [56] and GEM-Path [57] use SMIRKS and SMARTS, which exploit a feature string encoding the chemical properties of each atom. Reaction operators generally lose information while converting a known reaction to a rule due to the inherent assumptions in their method of encoding chemical transformation mechanisms [58]. In particular, stereochemical changes are often overlooked, for example by BEM (in BNICE) and molecular signatures (in XTMS), as they do not contain chiral center information [58]. As a result, the predicted pathway would use stereoisomers (such as L-alanine and D-alanine) interchangeably. This would increase the number of potential pathways with several biologically incorrect predictions (i.e., subsequent reaction steps may use different stereoisomers). However, stereochemical changes have already been captured by many computational tools such as EC-Blast [59] and Reaction Decoder Tool [60], which use the Chemistry Development Kit (CDK) [61], and CLCA [62] to model stereochemical changes by appending stereochemical descriptors to the canonical labeling of each atom. It is therefore timely for reaction rule-based methods to incorporate such advanced descriptors to generate more detailed reaction descriptions.
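The curation step of removing stoichiometrically imbalanced reactions can be illustrated with a minimal sketch. It is not taken from any of the cited tools: the metabolite formulas, the reaction dictionaries, and the function name `element_imbalance` are hypothetical choices for illustration, and charge balance is ignored.

```python
from collections import Counter

# Hypothetical toy data: elemental composition of each metabolite.
FORMULAS = {
    "glucose": {"C": 6, "H": 12, "O": 6},
    "ethanol": {"C": 2, "H": 6, "O": 1},
    "CO2": {"C": 1, "O": 2},
}

# Reactions as {metabolite: coefficient}; negative = consumed, positive = produced.
BALANCED = {"glucose": -1, "ethanol": 2, "CO2": 2}
IMBALANCED = {"glucose": -1, "ethanol": 2}  # CO2 was left out of the record


def element_imbalance(reaction, formulas):
    """Return the net count of every element that does not cancel out."""
    totals = Counter()
    for metabolite, coeff in reaction.items():
        for element, count in formulas[metabolite].items():
            totals[element] += coeff * count
    return {element: net for element, net in totals.items() if net != 0}


if __name__ == "__main__":
    print(element_imbalance(BALANCED, FORMULAS))    # empty dict: balanced
    print(element_imbalance(IMBALANCED, FORMULAS))  # missing carbon and oxygen
```

A reaction whose result is non-empty would be flagged for manual inspection rather than silently dropped, since the imbalance may stem from a missing co-product or from a wrong formula.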

Representation of the database (metabolic network)

The curated reaction and metabolite database used for pathway search is henceforth denoted as a metabolic network in the text. A metabolic network (with or without hypothetical reactions) can be represented by a graph (i.e., substrate graph, bipartite graph, hypergraph, or reaction graph) or a stoichiometric matrix (S matrix) (see Fig. 1B). The vertices of a substrate graph are metabolites and its edges represent reactions, while the bipartite graph uses both metabolites and reactions as vertices, with the edges connecting either a substrate to a reaction or a reaction to a product. Most pathway design tools are based on substrate graphs, including Pathway Hunter Tool [63], FMM [64], RouteSearch [65], and MRE [66]. In contrast to the substrate graph, the bipartite graph accounts for enzyme information in its reaction vertices. Bipartite graph-based tools (such as CMPF [67] and NeAT [68]) enable tracking of the reactions identified during pathway search, thereby avoiding the post-processing step needed in substrate graphs to link the identified edges to reactions [23], [67], [68], [69]. On the other hand, a hypergraph is a more direct representation of a biochemical reaction, wherein a hyper-edge (representing the reaction) connects all of its participating metabolites (represented as vertices). However, due to the dearth of sophisticated algorithms that can be applied to search hypergraphs, only three methods, namely Rahnuma [70], Metabolic Tinker [71], and the method developed by Carbonell et al. [72], use hypergraphs directly, whereas another tool, FogLight, simplifies the hypergraph into matrices before performing pathway search [73]. Although bipartite graphs and hypergraphs can be interconverted, it has been shown that, unlike hypergraphs, bipartite graphs fail to ensure co-reactant availability when predicting pathway feasibility [74].

In addition, the stoichiometric matrix representation of a metabolic network is equivalent to a hypergraph. The stoichiometric matrix and hypergraph representations have also been shown to be superior to the substrate or bipartite graphs as they retain (co)metabolite information from the original network [74]. Graph-based methods require an additional post-processing step to balance the (co)metabolites of the identified pathway due to missing stoichiometry information, which was resolved in CFP [75] and PathTracer [50] by combining graph search with additional stoichiometry constraints to ensure the steady state of the identified pathways. Thus, with careful addition of co-reactant/co-product availability and their stoichiometry information, graph-based methods can make predictions with similar accuracy to stoichiometry-based methods. Moreover, stoichiometry-based pathway search methods can operate on the S matrix alone (such as METATOOL 5.0 [76], optStoic [24], and XTMS [52]) to identify cofactor-balanced pathways. However, the reversibility of a reaction has to be defined as constraints alongside the S matrix, whereas a directed graph inherently contains this information.
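To make the difference between these representations concrete, the following sketch builds a substrate graph, a bipartite graph, and an S matrix from the same two toy reactions. The reaction identifiers (`HEX`, `PYK`), the data layout, and the function names are assumptions made for illustration and do not reproduce the data model of any cited tool.

```python
# Hypothetical toy network: {reaction: {metabolite: stoichiometric coefficient}}.
REACTIONS = {
    "HEX": {"glucose": -1, "ATP": -1, "G6P": 1, "ADP": 1},
    "PYK": {"PEP": -1, "ADP": -1, "pyruvate": 1, "ATP": 1},
}


def substrate_graph(reactions):
    """Metabolite -> metabolite edges: every reactant linked to every product."""
    edges = set()
    for stoich in reactions.values():
        reactants = [m for m, c in stoich.items() if c < 0]
        products = [m for m, c in stoich.items() if c > 0]
        edges.update((r, p) for r in reactants for p in products)
    return edges


def bipartite_graph(reactions):
    """Edges run substrate -> reaction or reaction -> product."""
    edges = set()
    for rxn, stoich in reactions.items():
        for met, coeff in stoich.items():
            edges.add((met, rxn) if coeff < 0 else (rxn, met))
    return edges


def stoichiometric_matrix(reactions):
    """Rows are metabolites (sorted), columns are reactions (insertion order)."""
    metabolites = sorted({m for stoich in reactions.values() for m in stoich})
    rxn_ids = list(reactions)
    matrix = [[reactions[r].get(m, 0) for r in rxn_ids] for m in metabolites]
    return metabolites, rxn_ids, matrix


if __name__ == "__main__":
    # The substrate graph links ADP to pyruvate although no carbon is exchanged.
    print(("ADP", "pyruvate") in substrate_graph(REACTIONS))
    mets, rxns, S = stoichiometric_matrix(REACTIONS)
    for met, row in zip(mets, S):
        print(met, row)
```

Only the S matrix keeps the coefficients, which is why graph-based searches need extra information before cofactor balance can be checked.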

Network pruning

Graph-based methods search for pathways from a given source metabolite to a target metabolite by looking for adjacent reactions which share metabolites, as done in PathComp [26]. However, this procedure often arrives at irrelevant biological transitions due to the overwhelming participation of cofactors in metabolic networks, as highlighted by Rahman et al. [63]. These procedures rely on a substrate graph-based metabolic network representation, which maps all reactants to all products of a reaction in the graph. However, this representation also connects metabolites that do not exchange any carbon atoms but are participants of the same reaction (e.g., ADP to pyruvate). A simple resolution to this problem is the exclusion of cofactors and other highly connected metabolites (hub metabolites) from the search space (see Fig. 1C), but this option could miss pathways such as nucleotide biosynthesis which involve hub metabolites such as ADP as major intermediates [51]. A more systematic approach involves incorporating structural similarity between the intermediate metabolites to guide pathway search by using a 1-D chemical fingerprint of the metabolites [63]. Alternatively, this can be achieved by using weighted edges where the network hub metabolites (such as ATP, NAD, etc. that have high participation) can be penalized [77]. The pathway searches can also be made biologically relevant by weighting the reaction edges in the graphs with more information. This was achieved in MRSD using reaction occurrence frequency across multiple organisms as reaction weights to account for biochemical transformations which are conserved across multiple species [78]. Similarly, reactions with more negative Gibbs free energy can be assigned larger weights according to the thermodynamic favorability-based weighting scheme used in MRE [66]. However, none of these approaches track the atoms from the substrate which are lost during the transformation to the target metabolites. By including the atom conservation criteria, RouteSearch [65] enables us to measure the fraction of the carbon atoms from substrates which are lost by the pathway while producing the target metabolite, thus capturing the efficiency of the discovered pathway.

The network pruning steps can also take advantage of more systematic chemical structure information such as atom mapping and the KEGG RPAIR database [79]. Atom mapping methods map the transfer of C, O, N, P, and S atoms between metabolites in a given reaction. Thus, atom-mapping rules for reactions can also be incorporated to ensure the chemical feasibility of the identified pathways, as done by MetaRoute [51], CFP [75], PathTracer [50], AGPathFinder [80], and RouteSearch [65]. KEGG RPAIR data offer a manually curated catalog of main and side metabolites, thus avoiding irrelevant biological transitions [79]. Stoichiometry-based methods employ flux balance analysis (FBA) during pathway search, which only identifies mass-balanced pathways [23]. However, the mass balance restrictions can be relaxed by allowing cofactors, co-reactants, or co-products to be exchanged with the environment, which reflects the biological reality that pathways do not exist in isolation and exchange metabolites with their surroundings or other pathways. Moreover, the stoichiometry can also be predefined, as in the first step of the optStoic algorithm [24], to establish an overall stoichiometry design goal that is necessary for the selection of cofactors and co-reactants (see Fig. 1C). Upon identifying the overall stoichiometry equation, the stoichiometric coefficients of the reactants and products are fixed as “uptake” and “secretion” fluxes of the network.
The mixed-integer linear programming (MILP)-based minFlux [24] (i.e., minimize total flux through the network) or minRxn [24] (i.e., minimize the total number of reactions) formulation can then be used to identify an internal network of reactions that could convert the reactants to the products in a mass-balanced manner. Unlike the CFP [75] or the PathTracer [50] approach, which generates carbon exchange networks a priori, the minFlux/minRxn [24] formulation uses a metabolic network identical to that of a typical FBA analysis and can be easily extended to any currently available GSM. Moreover, CFP-related approaches might predict pathways with biologically irrelevant carbon (or other elemental) exchanges [75], [81] due to inaccuracies in the carbon exchange network, which can be resolved using more sophisticated atom-mapping algorithms [81].

In addition to the network pruning steps of existing metabolic networks, retrosynthesis-based approaches for designing de novo pathways require the generation of a metabolic network based on all the hypothetical reaction rules derived using reaction operators (see the Databases section and Fig. 1C). This hypothetical network can be appended to the network of known reactions, thereby allowing the search of both putative and verified reactions. However, the extended metabolic network is often too large for exhaustive pathway exploration. For example, the initial BNICE computational framework generates an exponentially growing range of hypothetical molecules [53]. In order to reduce the search space, XTMS/RetroPath uses a signature diameter, which defines the graph distance of the atoms considered around each atom, to control the network size [52], [82]. UM-PPS applies reaction rules based on an ‘absolute aerobic likelihood’ to prune unlikely biotransformations, thus avoiding exploration of a redundant reaction network [56]. The number of reaction rules that can act upon a metabolite can also be culled based on the availability of (broad-substrate or promiscuous) reactions/enzymes (e.g., GEM-Path uses third level EC number [57]; SimZyme/SimIndex quantifies a molecule's similarity to the typical substrate of an enzyme [83]; RetroPath 2.0 uses enzyme score [84]), molecule sizes (e.g., SimPheny), and expert knowledge such as THERESA [85].
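The effect of penalizing hub metabolites can be sketched on a toy substrate graph. The sketch below assumes one simple weighting scheme, in which each edge costs as much as the connectivity of the metabolite it enters; the cited tools may weight edges differently. The edge list, the lumped F6P to PEP step, and the function names are hypothetical.

```python
import heapq
from collections import defaultdict

# Hypothetical toy substrate graph; ADP acts as a highly connected hub.
EDGES = [
    ("glucose", "G6P"), ("G6P", "F6P"), ("F6P", "PEP"), ("PEP", "pyruvate"),
    ("glucose", "ADP"), ("ADP", "pyruvate"), ("ATP", "ADP"),
] + [("ADP", m) for m in ("ATP", "AMP", "acetate", "succinate", "citrate")]


def connectivity(edges):
    """Number of edges touching each metabolite (in plus out)."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def shortest_path(edges, source, target, weight):
    """Dijkstra's algorithm; weight(u, v) must return a non-negative cost."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt in adjacency[node]:
            if nxt not in visited:
                heapq.heappush(queue, (cost + weight(node, nxt), nxt, path + [nxt]))
    return None


if __name__ == "__main__":
    degree = connectivity(EDGES)
    # Unweighted search takes the shortcut through the ADP hub.
    print(shortest_path(EDGES, "glucose", "pyruvate", lambda u, v: 1))
    # Penalizing entry into highly connected metabolites avoids the hub.
    print(shortest_path(EDGES, "glucose", "pyruvate", lambda u, v: degree[v]))
```

Unlike removing cofactors outright, the weighting keeps ADP reachable, so pathways that genuinely use a hub metabolite as an intermediate are made more expensive but not excluded.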

Search algorithms

The selection of a pathway search algorithm is inherently dependent on the underlying representation of the metabolic network (see Fig. 1D). Breadth-first search (BFS) is a widely used algorithm to find the k-shortest paths in a loopless unweighted graph, as applied in the Pathway Hunter Tool [63]. As many preprocessing steps assign weights to a graph based on thermodynamics or other criteria, the search requires algorithms such as Yen's k-shortest path algorithm [86], which returns loopless paths, and Eppstein's k-shortest path algorithm and its modifications [87], which do not require the paths to be loopless. In addition to the weighted graph with a fixed cost at each edge, RouteSearch defined an additional criterion at each edge for non-static atom loss and applied branch-and-bound search to find the best paths that minimize the loss [65]. On the other hand, DESHARKY [88] applied a Monte Carlo method to search for reaction combinations. To find the shortest path in an S matrix representation, stoichiometry-based pathway design algorithms commonly use MILP (e.g., k-shortest Elementary Flux Modes (k-shortest EFMs) [89] and optStoic [24]). k-shortest EFM has been shown to provide more accurate pathway designs from fatty acids to glucose than graph-based methods such as Pathway Hunter Tool [90], [91], [92], [93]. Stoichiometry-based methods have the advantage over graph-based methods that they account for mass balance constraints by directly incorporating the stoichiometry information. Currently, MILP can be solved by many open-source solvers (e.g., SCIP [94]) and commercial solvers (e.g., CPLEX [95] and GUROBI [96]), which employ algorithms such as Branch-and-Bound and Branch-and-Cut along with customized heuristic searches. Alternative pathways can also be identified by adding integer cut constraints.

The selection of search algorithms also relies on the desired type of pathways, namely linear or branching pathways. The abovementioned graph-based algorithms only search for linear pathways with one source molecule and one target molecule. In order to identify branching pathways, ReTrace [23] combines shortest paths into branched pathways to reach a higher fraction of atom transfer from source to target metabolite. LPAT [69] developed another linear pathway merging algorithm, BPAT-M, using atom tracking information. In addition, graph-based algorithms that can efficiently find paths between two nodes (e.g., A*) can be adapted to recursively search the relevant branched sub-paths without atom mapping information. In contrast to graph-based algorithms, MILP algorithms can find branched or even cyclic pathways [24].
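As a minimal illustration of enumerating the k shortest loopless paths in an unweighted graph, the sketch below runs a breadth-first search over partial paths. This is a naive stand-in whose cost grows quickly with network size; it is neither Yen's nor Eppstein's algorithm, and the adjacency list and function name are hypothetical.

```python
from collections import deque

# Hypothetical toy directed graph (metabolite -> reachable metabolites).
ADJACENCY = {
    "S": ["M1", "M2"],
    "M1": ["M2", "T"],
    "M2": ["T"],
    "T": ["S"],  # a back edge, to show that loops are never followed
}


def k_shortest_loopless_paths(adjacency, source, target, k):
    """Return up to k loopless paths in order of increasing number of steps."""
    found = []
    queue = deque([[source]])
    while queue and len(found) < k:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        for nxt in adjacency.get(node, []):
            if nxt not in path:  # skip metabolites already on this path
                queue.append(path + [nxt])
    return found


if __name__ == "__main__":
    for path in k_shortest_loopless_paths(ADJACENCY, "S", "T", k=3):
        print(" -> ".join(path))
```

Because the queue is processed in first-in, first-out order, shorter paths are always reported before longer ones, which is the property that lets BFS serve as a shortest-path search when all edges have equal cost.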

Pathway ranking

Pathway design tools often identify multiple pathways for a given substrate and target metabolite pair, which can be distinguished based on several factors such as host compatibility, availability of natural enzymes (or protein engineering), and protein solubility (see Fig. 1E). The most common method used to rank pathways is by the number of reaction steps, as this can be easily translated into an objective function in a number of methods (e.g., optStoic [24], CFP [75], k-shortest EFM [89], and FindPath [97]) to find the shortest pathway or the pathway with the least total flux. The shortest pathway also implies the fewest reaction steps or minimal enzyme requirement, thereby reducing the metabolic/genetic burden on the host cells. This is based on the assumption that each reaction is catalyzed by a single gene, which however is not always true [98]. A reduced number of genetic modifications also enables a faster and simpler experimental implementation. Alternatively, one could directly aim for the minimal number of genes as the objective (e.g., SimOptStrain [98] identifies the minimal number of genetic interventions). Likewise, if the host organism is predefined, then it is also possible to minimize the number of heterologous reactions that need to be added (e.g., in OptStrain [99] and MRE [66]). This is a plausible objective as dealing with heterologous enzymes in a metabolic engineering project often poses a different set of challenges including enzyme activity, protein solubility, codon optimization, and foreign cofactor utilization. In particular, for rule-based approaches, if an existing enzyme that can perform the biotransformation is not known, then it is often necessary to search for a natural promiscuous enzyme for the substrate of interest or even design a de novo protein [100]. Despite various successful cases (e.g., Merck & Co.'s in vitro sitagliptin synthesis [7] and deep learning to rank the most suitable reaction rules [101]), protein engineering to confer a novel enzyme activity is a time-consuming effort with an uncertain outcome. Therefore, when using a rule-based pathway design approach, it is common to rank pathways based on the number of known enzymes [5].

Thermodynamic feasibility is another commonly used method for pathway ranking. Similar to the preprocessing step that assigns weights to a graph to prune infeasible pathways and select pathways with more negative ΔG, the designed pathways can be sorted based on their overall ΔG, which sums up the ΔG of each reaction step, with the most negative first. The Group Contribution Method [102], the more recently developed and publicly available Component Contributions Method [103], or eQuilibrator [104] can be used to estimate the standard transformed Gibbs free energy of reaction under the host cell environment (e.g., cellular compartment pH, growth temperature). Knowledge of the intracellular metabolite concentrations can be used to further refine the estimation of the actual Gibbs free energy of reaction, but it is generally not used due to the lack of metabolome data. Instead, the Max-min Driving Force [105] approach can be used to optimize the concentrations of metabolites within a pathway given the physiological concentration ranges and quantify the thermodynamic feasibility of the pathway. Although the availability of intracellular metabolite concentrations is often limited, this could be overcome by an approach that uses a support vector machine (SVM) model to infer the theoretical intracellular metabolite concentration [106]. In order to rank pathways based on product yield, a designed pathway can be introduced into the GSM of the host strain and FBA can be performed to identify the maximum achievable yield from the pathway [3], [57], [107].
This is particularly important as many pathway design tools only target a short pathway from any precursor metabolite that can be produced by a host cell to the target product. In addition to simulating whether a carbon source (e.g., glucose) could drive flux towards the target product, FBA also ensures that all cofactors/co-substrates and the biomass of the cell can be produced and that every reaction (native or heterologous) used in the identified pathway carries non-zero flux [50], [108]. Another possible method is to construct kinetic models of the designed pathway and calculate the flux through the pathway. This method was recently used to evaluate a large number of trunk glycolytic pathways [109]. However, the paucity of kinetic parameters could hamper such an approach. Alternatively, an ensemble of kinetic parameters can be sampled and used to determine the stability of the pathway (EMRA) [110]. These approaches are, however, more computationally demanding and may not be suitable for initial filtering of a large number of designed pathways [111], [112]. A simpler, more tractable approach based on modular kinetic rate laws [113], [114], [115] can be applied to evaluate the protein cost of a pathway as an alternative pathway ranking criterion. Other possible approaches include a scoring system based on assessing the toxicity of intermediate metabolites of a pathway in a host cell [116], [117]. For example, a database like Tox21 [118] and a deep learning based algorithm (DeepTox) [119] have been applied to identify potentially toxic effects of chemical compounds. The selection of pathway ranking method(s) depends on the design goal. For example, to design a pathway for the production of a certain biomolecule, one may consider thermodynamics, theoretical yield from FBA simulation, the minimal number of reactions, and the toxicity of intermediate metabolites to the host cell as primary ranking criteria. 
Often, a combination of these filtering, ranking or scoring systems can be used as demonstrated by Yim et al. [5].
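As an illustration of how such criteria might be combined, the following minimal Python sketch ranks candidate pathways first by the number of steps lacking a known enzyme, then by the overall ΔG (the sum of the per-step ΔG values), and finally by pathway length. The `Pathway` class, the ordering of the criteria, and all numerical values are assumptions made for this example; they are not taken from any of the reviewed tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pathway:
    name: str
    step_dg: List[float]   # estimated ΔG of each step in kJ/mol (hypothetical values)
    known_enzymes: int     # number of steps with a known enzyme (hypothetical)

    @property
    def overall_dg(self) -> float:
        # Overall ΔG is the sum of the ΔG of each reaction step.
        return sum(self.step_dg)

    @property
    def novel_steps(self) -> int:
        # Steps that would require enzyme discovery or engineering.
        return len(self.step_dg) - self.known_enzymes


def rank_pathways(pathways: List[Pathway]) -> List[Pathway]:
    # Lexicographic ranking (an assumed priority order):
    # fewest novel steps, then most negative overall ΔG, then fewest steps.
    return sorted(
        pathways,
        key=lambda p: (p.novel_steps, p.overall_dg, len(p.step_dg)),
    )


candidates = [
    Pathway("A", [-10.0, 5.0, -20.0], known_enzymes=3),
    Pathway("B", [-30.0, -15.0], known_enzymes=1),
    Pathway("C", [-5.0, -5.0, -5.0, -5.0], known_enzymes=4),
]
ranked = rank_pathways(candidates)
for p in ranked:
    print(p.name, p.novel_steps, p.overall_dg)
```

A weighted score could replace the lexicographic key when no single criterion should dominate; the appropriate weights depend on the design goal.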

DNA sequence selection, protein engineering, and de novo enzyme design

The selection and design of DNA sequences still remain elusive for pathway design. Most pathway design tools identify the list of reactions needed to fill the gap between the source and target metabolites. However, there is currently a large natural catalog of enzyme sequences from which the user has to select to express the pathway. A number of pathway design tools prioritize the selection based on binding site covalence, chemical similarity, and organism specificity [22]. Additional screening criteria such as protein solubility in the host system can also be employed to refine the sequence selection. Generally, the host organism is known a priori, native enzymes are assigned higher priority, and a minimal number of heterologous reactions is selected from closely related species. However, in certain projects where the target metabolite is not common (e.g., xenobiotics), it is necessary to select a host organism based on the pathways that are identified. In addition, promiscuous enzymes are required to perform the biotransformation in reactions without natural enzymes. The in vivo discovery of such enzymes is a daunting task and requires the assistance of computational prediction tools. For example, Carbonell et al. [120] performed a machine learning-based promiscuity analysis to predict whether a reaction rule can be catalyzed by a natural enzyme. However, the prediction relies on manually defined promiscuity instead of actual in vivo data. Supervised machine learning algorithms can give better predictions with more correctly labeled “big data”. Hence, databases of DNA sequences and the corresponding enzyme substrates and enzyme activities (such as BRENDA [28] and SABIO-RK [121]) should be included in machine learning workflows to predict enzyme-substrate pairs with a high likelihood of interaction. 
When natural enzymes fail to perform the predicted reaction steps, protein engineering fills the void by altering existing enzyme activity and specificity [100] and ultimately designing de novo enzymes. Computational protein engineering tools can guide rational protein design and facilitate efforts involving random mutagenesis-based directed evolution. Two widely applied strategies are used to predict protein designs: (i) statistical methods that compare modified sequences with a sequence database (e.g. GenBank [122]), and (ii) molecular modeling methods that take advantage of atom-level structural information to predict a protein's biochemical properties. Taking advantage of both these methods, Pantazes et al. [123], [124], [125] developed the Iterative Protein Redesign and Optimization (IPRO) suite, which alternates protein backbone perturbations and amino acid sequence mutations to design proteins with a desired catalytic activity. In addition, a number of computational tools (e.g. DEZYMER [126], [127], ORBIT [128], ROSETTA [129], CCBuilder [130], [131], and Protein WISDOM [132]) are aimed at de novo enzyme design by exploiting the underlying protein biochemistry and biophysics. However, certain enzyme properties such as configurational entropy changes are beyond the scope of computational tools [15]; thus, in practice, the computational predictions are often complemented with directed evolution methods, with the computation providing a starting design for in vitro improvement. For example, Rothlisberger et al. [15] used directed evolution after the molecular modeling method to further improve the enzyme's catalytic efficiency. In spite of the numerous successes in engineering and designing new proteins, protein engineering and design tools have not been integrated with traditional pathway design tools, as the latter focus on naturally occurring enzymes. 
However, retrosynthesis-based pathway design tools often propose novel biotransformations by natural enzymes based on the structural similarity between the enzyme's native substrate and the new substrate (e.g. SimZyme [83]). The identified enzyme can serve as a starting point for protein engineering and design tools to improve its catalytic activity towards the novel biotransformation. For detailed tools and applications of protein engineering for pathways, we refer the reader to the following resources [133], [134], [135], [136].
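The idea of shortlisting natural enzymes by substrate similarity can be sketched with a Tanimoto (Jaccard) coefficient computed over sets of structural features. This is only an illustrative stand-in: it is not the scoring used by SimZyme or any other reviewed tool, and the enzyme names and feature sets below are hypothetical. In practice the feature sets would come from a cheminformatics fingerprint of each molecule.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[str], b: Set[str]) -> float:
    # Tanimoto (Jaccard) coefficient: shared features / all features.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_enzymes(
    query: Set[str], native_substrates: Dict[str, Set[str]]
) -> List[Tuple[str, float]]:
    # Score each enzyme by the similarity between its native substrate
    # and the query substrate, highest similarity first.
    scores = {
        enzyme: tanimoto(query, features)
        for enzyme, features in native_substrates.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical feature sets for the native substrate of each enzyme.
native = {
    "enzyme_1": {"carboxyl", "hydroxyl", "C4_chain"},
    "enzyme_2": {"carboxyl", "amine", "aromatic_ring"},
    "enzyme_3": {"hydroxyl", "C4_chain", "aldehyde", "carboxyl"},
}
# Hypothetical features of the new (non-native) substrate.
query = {"carboxyl", "hydroxyl", "C4_chain", "aldehyde"}

ranking = rank_enzymes(query, native)
for enzyme, score in ranking:
    print(enzyme, round(score, 3))
```

The top-ranked enzymes would then be candidates for experimental testing or for further protein engineering.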

Perspective

In this paper, we have compared different computational approaches used to design metabolic pathways in terms of the database used, its representation and pruning, the search algorithm, and pathway ranking. Graph-based methods often need additional post-processing steps to balance the co-metabolites of predicted pathways, which may be unbalanced. Stoichiometry-based methods avoid the preprocessing steps that remove no-carbon transfer connections because the S matrix representation, similar to a hypergraph representation, accounts for all participating metabolites of a reaction. In addition, stoichiometry-based methods can incorporate pathway ranking criteria, such as thermodynamics, relative cost, and pathway length, into the MILP optimization framework as an objective function, thus homing in first on the most desirable designs and avoiding the exhaustive enumeration of pathways followed by a posteriori ranking. Retrosynthesis-based methods employ search methods similar to those used by graph-based or stoichiometry-based methods, designing pathways by searching through an extended graph or stoichiometric network. Of the reviewed retrosynthesis-based pathway design tools, only XTMS uses stoichiometry-based EFM tools to search for pathways, while the other methods rely on graph-based tools [52]. The performance of de novo pathway tools can be improved by switching to stoichiometry-based tools in order to circumvent the unbalanced pathways and pathway post-processing that are inherent in graph-based methods. Although XTMS [52] enumerates pathways based on EFMs, it does not directly consider the stoichiometry of the source and target metabolites. A retrosynthetic tool that can design the stoichiometry a priori based on the first step of optStoic [24] and identify cofactor-balanced pathways by design while limiting the number of novel reaction steps would be an important advancement for de novo pathway design. 
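The co-metabolite balancing issue can be made concrete with a small Python sketch that sums the stoichiometric coefficients along a candidate pathway and reports every metabolite, other than the intended source and target, with a non-zero net coefficient. The two-reaction network and the function names are hypothetical and serve only to illustrate the check; they do not reproduce any reviewed tool.

```python
from collections import defaultdict
from typing import Dict, Set

Stoich = Dict[str, float]  # metabolite -> coefficient (negative = consumed)


def net_stoichiometry(
    fluxes: Dict[str, float], reactions: Dict[str, Stoich]
) -> Dict[str, float]:
    # Sum coefficient * flux over all reactions in the pathway
    # (equivalent to computing S·v for the pathway's flux vector).
    net: Dict[str, float] = defaultdict(float)
    for rxn_id, flux in fluxes.items():
        for met, coeff in reactions[rxn_id].items():
            net[met] += coeff * flux
    # Drop metabolites that cancel out (balanced intermediates).
    return {m: c for m, c in net.items() if abs(c) > 1e-9}


def unbalanced_cometabolites(
    net: Dict[str, float], allowed: Set[str]
) -> Dict[str, float]:
    # Anything with a net coefficient that is not the source or target.
    return {m: c for m, c in net.items() if m not in allowed}


# Hypothetical toy network: A -> B -> C with cofactor usage.
reactions = {
    "R1": {"A": -1, "NADH": -1, "B": 1, "NAD": 1},
    "R2": {"B": -1, "ATP": -1, "C": 1, "ADP": 1},
}
pathway = {"R1": 1.0, "R2": 1.0}

net = net_stoichiometry(pathway, reactions)
leftover = unbalanced_cometabolites(net, allowed={"A", "C"})
print(leftover)
```

Here the linear route from A to C is connected in a graph sense, but the net conversion still consumes NADH and ATP, which a host or additional reactions would have to regenerate.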
Nevertheless, it is worth mentioning that stoichiometry-based methods are not without their challenges, such as longer computational time and the possible presence of thermodynamically infeasible cycles within designed pathways. The computational time depends highly on the objective function and the constraints (or integer cuts) that are imposed. For example, the minRxn formulation of optStoic requires a significantly higher computational cost than the minFlux formulation [24]. Furthermore, the computational time also scales with the search space, which can be reduced by first removing blocked reactions (i.e., reactions that cannot carry any flux under a specific condition) from the S matrix. Alongside formulating the strongest MILP problem, a number of heuristics based on branch-and-bound and branch-and-cut methods are available or under development in open-source, academic, and commercial software packages to improve the solving time [137]. Stoichiometry-based methods sometimes identify futile cycles to balance cofactors. In CFP [75] and PathTracer [50], this is remedied by preventing flux from re-visiting a metabolite (node), whereas an updated version of optStoic is currently in development to resolve this issue in a systematic manner (Ng, Chowdhury, Maranas, manuscript in preparation). Despite the success of current computational design tools in identifying pathways, challenges still remain in the selection of genes, the discovery of promiscuous enzymes, the engineering of proteins, and the design of de novo enzymes to catalyze putative reactions. Overall, a completely automated pipeline that goes from the selection of source and target molecules to the final output of DNA sequences of a pathway would significantly facilitate the discovery of new metabolic pathways for various applications. 
Experimental validation of multiple pathway designs has already been accelerated through an automated digital-to-biological DNA manufacturing system [138]. Although there remain several limitations that need to be addressed, exemplary efforts such as RetroPath 2.0 [84] and ATLAS [35] provide a benchmark that can be improved upon to realize the ultimate goal.
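As a rough illustration of the blocked-reaction pruning mentioned above, the sketch below iteratively removes reactions that touch dead-end metabolites, i.e., internal metabolites that are only produced or only consumed and therefore cannot be balanced at steady state. It assumes that all reactions are irreversible and that a set of exchangeable metabolites is given; it detects only a subset of blocked reactions, and a complete identification would typically require solving linear programs (e.g., flux variability analysis). The toy network and names are hypothetical.

```python
from typing import Dict, Set

Stoich = Dict[str, float]  # metabolite -> coefficient (negative = consumed)


def prune_dead_end_reactions(
    reactions: Dict[str, Stoich], exchangeable: Set[str]
) -> Dict[str, Stoich]:
    # Assumes every reaction is irreversible (an assumption of this sketch).
    active = dict(reactions)
    changed = True
    while changed:
        changed = False
        produced: Set[str] = set()
        consumed: Set[str] = set()
        for stoich in active.values():
            for met, coeff in stoich.items():
                if coeff > 0:
                    produced.add(met)
                elif coeff < 0:
                    consumed.add(met)
        # Internal metabolites that cannot be balanced at steady state.
        dead = {
            m
            for m in produced | consumed
            if m not in exchangeable and not (m in produced and m in consumed)
        }
        # Any reaction involving a dead-end metabolite cannot carry flux.
        blocked = [r for r, s in active.items() if any(m in dead for m in s)]
        for rxn_id in blocked:
            del active[rxn_id]
            changed = True
    return active


# Hypothetical toy network: S -> A -> P is viable; A -> D -> E ends in a dead end.
network = {
    "R1": {"S": -1, "A": 1},
    "R2": {"A": -1, "P": 1},
    "R3": {"A": -1, "D": 1},
    "R4": {"D": -1, "E": 1},
}
kept = prune_dead_end_reactions(network, exchangeable={"S", "P"})
print(sorted(kept))
```

Removing R4 exposes D as a new dead end, so R3 is removed in the next pass; this cascade is why the pruning is repeated until no further reactions are dropped.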

132 in total

Review 1. Computational tools for the synthetic design of biochemical pathways.

Authors: Marnix H Medema; Renske van Raaphorst; Eriko Takano; Rainer Breitling
Journal: Nat Rev Microbiol Date: 2012-01-23 Impact factor: 60.633

2. Can sugars be produced from fatty acids? A test case for pathway analysis tools.

Authors: Luis F de Figueiredo; Stefan Schuster; Christoph Kaleta; David A Fell
Journal: Bioinformatics Date: 2008-09-19 Impact factor: 6.937

3. Generation of an atlas for commodity chemical production in Escherichia coli and a novel pathway prediction algorithm, GEM-Path.

Authors: Miguel A Campodonico; Barbara A Andrews; Juan A Asenjo; Bernhard O Palsson; Adam M Feist
Journal: Metab Eng Date: 2014-07-28 Impact factor: 9.783

4. Rewiring yeast sugar transporter preference through modifying a conserved protein motif.

Authors: Eric M Young; Alice Tong; Hang Bui; Caitlin Spofford; Hal S Alper
Journal: Proc Natl Acad Sci U S A Date: 2013-12-16 Impact factor: 11.205

5. Combining chemoinformatics with bioinformatics: in silico prediction of bacterial flavor-forming pathways by a chemical systems biology approach "reverse pathway engineering".

Authors: Mengjin Liu; Bruno Bienfait; Oliver Sacher; Johann Gasteiger; Roland J Siezen; Arjen Nauta; Jan M W Geurts
Journal: PLoS One Date: 2014-01-08 Impact factor: 3.240

6. XTMS: pathway design in an eXTended metabolic space.

Authors: Pablo Carbonell; Pierre Parutto; Joan Herisson; Shashi Bhushan Pandit; Jean-Loup Faulon
Journal: Nucleic Acids Res Date: 2014-05-03 Impact factor: 16.971

7. CCBuilder: an interactive web-based tool for building, designing and assessing coiled-coil protein assemblies.

Authors: Christopher W Wood; Marc Bruning; Amaurys Á Ibarra; Gail J Bartlett; Andrew R Thomson; Richard B Sessions; R Leo Brady; Derek N Woolfson
Journal: Bioinformatics Date: 2014-07-26 Impact factor: 6.937

8. A genome-scale Escherichia coli kinetic metabolic model k-ecoli457 satisfying flux data for multiple mutant strains.

Authors: Ali Khodayari; Costas D Maranas
Journal: Nat Commun Date: 2016-12-20 Impact factor: 14.919

9. GenBank.

Authors: Dennis A Benson; Mark Cavanaugh; Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

10. MetaNetX/MNXref--reconciliation of metabolites and biochemical reactions to bring together genome-scale metabolic networks.

Authors: Sébastien Moretti; Olivier Martin; T Van Du Tran; Alan Bridge; Anne Morgat; Marco Pagni
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971
