Comparative analyses of parasites with a comprehensive database of genome-scale metabolic models.

Maureen A Carey^1,2, Gregory L Medlock^3, Michał Stolarczyk^4,5, William A Petri^2, Jennifer L Guler^2,4, Jason A Papin^2,3,6.

Abstract

Protozoan parasites cause diverse diseases with large global impacts. Research on the pathogenesis and biology of these organisms is limited by economic and experimental constraints. Accordingly, studies of one parasite are frequently extrapolated to infer knowledge about another parasite, across and within genera. Model in vitro or in vivo systems are frequently used to enhance experimental manipulability, but these systems generally use species related to, yet distinct from, the clinically relevant causal pathogen. Characterization of functional differences among parasite species is confined to post hoc or single target studies, limiting the utility of this extrapolation approach. To address this challenge and to accelerate parasitology research broadly, we present a functional comparative analysis of 192 genomes, representing every high-quality, publicly-available protozoan parasite genome including Plasmodium, Toxoplasma, Cryptosporidium, Entamoeba, Trypanosoma, Leishmania, Giardia, and other species. We generated an automated metabolic network reconstruction pipeline optimized for eukaryotic organisms. These metabolic network reconstructions serve as biochemical knowledgebases for each parasite, enabling qualitative and quantitative comparisons of metabolic behavior across parasites. We identified putative differences in gene essentiality and pathway utilization to facilitate the comparison of experimental findings and discovered that phylogeny is not the sole predictor of metabolic similarity. This knowledgebase represents the largest collection of genome-scale metabolic models for both pathogens and eukaryotes; with this resource, we can predict species-specific functions, contextualize experimental results, and optimize selection of experimental systems for fastidious species.

Year: 2022 PMID: 35196325 PMCID: PMC8901074 DOI: 10.1371/journal.pcbi.1009870

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

Malaria, African sleeping sickness, many diarrheal diseases, and leishmaniasis are all caused by eukaryotic single-celled parasites; these infections result in over one million deaths annually and contribute significantly to disability-adjusted life years [1-3]. In addition, human-infectious and related parasites infect domestic and wild animals, resulting in a large reservoir of human pathogens and a diseased animal population [4]. This combined global health burden makes parasitic diseases a top priority of many economic development and health advocacy groups [5-7]. However, effective prevention and treatment strategies are lacking. No widely-used, efficacious vaccine exists for any parasitic disease (e.g. [8-12]). Patients have limited treatment options because few drugs exist for many of these diseases, drug resistance is common, and many drugs have stage specificity (e.g. [13-15]). Thus, there is a pressing need for novel, effective therapeutics.

Beyond the economic constraints associated with antimicrobial development [16,17], antiparasitic drug development is technically challenging for two primary reasons: these parasites are eukaryotes and they are challenging to manipulate in vitro. As protozoa, these parasites share many more features with their eukaryotic host than prokaryotic pathogens do. Thus, antiparasitics must target the parasite while minimizing the effect on potentially similar host targets, similar to cancer therapeutics. Enzyme kinetics can be leveraged such that the drug targets the pathogen’s weak points while remaining below the lethal dose for the host [18], or drugs can synergize with the host immune response (e.g. [19,20]). Unique parasite features (e.g. signalling cascades as in [21] or plastid organelles as in [22]) can also be targeted once identified. Drug target identification and validation are further complicated by experimental challenges associated with these parasites.

Many of these organisms have no in vitro culture system, such as Plasmodium vivax (malaria) and Cryptosporidium hominis (diarrheal disease), or no in vivo model system, such as Cryptosporidium meleagridis (diarrheal disease). Some parasite species have additional unique biology and resultant experimental challenges hindering drug development, such as resistance to genetic modification. For example, Plasmodium falciparum (malaria) was considered refractory to genetic modification until recently [23,24]. Entamoeba histolytica (diarrheal disease) has also been refractory to efficient genetic manipulation, and the genomes of Leishmania develop significant aneuploidy under selective pressure [25,26]. Although these challenges may be circumvented with new technology, the use of clinical samples, and reductionist approaches, little data exist relative to that which is available for most bacterial pathogens. Without adequate profiling data (genome-wide essentiality, growth profiling in diverse environmental conditions, etc.), we do not have the knowledge to rationally identify novel drug targets. Untargeted and unbiased screens of chemical compounds for antiparasitic effects have proven useful (if the parasite can be cultured, e.g. [27-32]), but this approach provides little information about mechanism of action or mechanisms of resistance development. Typical approaches to study drug resistance, such as evolving resistance to identify mutations in a drug’s putative target, are not possible without a long-term culture system and a relatively well-annotated genome.

As a result of these difficulties, data collected in one organism are frequently extrapolated to infer knowledge about another parasite, across and within genera. Toxoplasma gondii is frequently used as a model organism for other apicomplexa due to its genetic and biochemical manipulability [33-36]. Mouse models of malaria [37,38] and cryptosporidiosis [39,40] imperfectly represent the disease and/or use different species than the human pathogen. However, the modest characterization of functional differences among parasite species, beyond comparative genomics (e.g. [41-44]), limits the utility of this extrapolation-based approach, especially broadly among protozoa. Systematic assembly of existing knowledge about parasites and their predicted capabilities could greatly improve the extrapolation-based knowledge transfer by facilitating rigorous in silico comparison. Such systems biology approaches (e.g. genome-scale metabolic modeling) provide a framework to understand parasite genomes, highlight knowledge gaps, and generate data-driven hypotheses about parasite metabolism.

EuPathDB databases.

EuPathDB is the Eukaryotic Pathogens database and serves as a repository for parasite ‘omics data; EuPathDB contains field-specific databases including GiardiaDB, AmoebaDB, MicrosporidiaDB, TriTrypDB, TrichDB, CryptoDB, ToxoDB, PlasmoDB, and PiroplasmaDB (all shown), as well as FungiDB, HostDB, and MicrobiomeDB. Here, a phylogenetic tree of database member parasites is shown (lines are not to scale). Each EuPathDB sub-database is in a rough phylogenetic grouping, but the parasites on the EuPathDB databases are genetically and phenotypically highly diverse. Database color-coding shown here will be used through other figures.

Summary of select parasitic diseases and their causal organism.

Parasites cause important human and animal diseases and have unique biological and experimental challenges that have made interpretation of in vivo and in vitro data challenging. Several examples are shown. Current treatments and associated observed drug resistance are noted. Many well-studied parasites remain refractory to genetic modification and/or still have poor genome annotation. ‘Uncharacterized’ genes were identified via EuPathDB searches for terms such as ‘uncharacterized’, ‘putative’, ‘hypothetical’, etc., for a representative strain. Because each database is heavily influenced by the respective scientific community, some databases such as CryptoDB do not use these terms because the function of so few genes has been validated in the Cryptosporidium parasites. Thus, the genomes of the Cryptosporidium parasites are mostly hypothetical and proposed functions are only putative; the reported percent of genome that is hypothetical is low for this reason (highlighted by an asterisk).

Genome-scale metabolic models are built from genomic data and by inferring function to complete or connect metabolic pathways; these models are supplemented with data from functional genetic and biochemical studies, representing our best understanding of an organism’s biochemistry and cellular biology. Unfortunately, existing approaches for the construction of metabolic network models are lacking in standardization and scalability and/or biological relevance for eukaryotes. While there are pipelines that include compartmentalization (e.g. RAVEN [45] and merlin [46]), individual high-quality parasite reconstructions (e.g. [47-51]), and scalable pipelines for the construction of many networks (e.g. CarveMe [52], ModelSEED [53]), we sought to build on these tools and the Eukaryotic Pathogens Database (EuPathDB [54]) to leverage genomic information on the EuPathDB database and existing efforts towards manual curation of individual reconstructions.
Here, we present a parasite knowledgebase, Parasite Database Including Genome-scale metabolic Models (ParaDIGM), for this purpose. ParaDIGM is a collection of publicly available genome-scale metabolic models, and the computational tools needed to generate and re-generate these models iteratively as new data becomes available. Importantly, these tools also enable the propagation of experimental data collected in a manual curation to closely related organisms. The integration of this genomic and experimental evidence into genome-scale metabolic models enables direct comparison of predicted metabolic capabilities in specific contexts, rather than the purely qualitative comparisons that can be performed with traditional genomic approaches. We demonstrate the utility of ParaDIGM by comparing metabolic capacity, gene essentiality, and pathway utilization. Ultimately, ParaDIGM can be used to better leverage experimentally tractable model systems for the study of eukaryotic parasites and antiparasitic drug development.

Results

Building ParaDIGM, a parasite knowledgebase

To build a comprehensive collection of genome-scale network reconstructions representing parasite metabolism, we designed a novel network reconstruction pipeline optimized for eukaryotic organisms. Our pipeline builds on publicly available, open-source software and resources [52,54-56] and focuses on the compartmentalization of biochemical reactions. We applied this pipeline to assemble networks for all publicly available reference genomes from parasite isolates representing 119 species (see Data Availability for link to code and reconstructions). In brief, we obtained 192 high-quality genomes from the parasite genome resource, EuPathDB [54], to generate a de novo reconstruction for each genome (step 1). We mapped the protein sequence of all open reading frames against a biochemical database [56] to identify putative metabolic functions via gene-protein-reaction mappings. Reaction compartmentalization was adjusted to maintain each gene-protein-reaction mapping but only with the subcellular compartments relevant for each organism. A large proportion of parasite gene-reaction pairs would otherwise be misassigned or removed from the network because they are assigned to an incorrect compartment, owing to the lack of orthologous and compartmentalized reactions in biochemical databases; our pipeline reassigns these reactions to the cytosol or extracellular space. Although not all functions annotated on EuPathDB are integrated into our de novo reconstructions using this approach, well studied enzymes and pathways are well represented; discrepancies between EuPathDB and de novo reconstructions can be prioritized in future curation efforts. We also identify metabolic functions not currently annotated on EuPathDB.
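The compartment adjustment described above can be sketched in a few lines. This is an illustration only, not the pipeline's code: it assumes BiGG-style metabolite identifiers whose final underscore-delimited suffix names the compartment (for example `_c` for cytosol, `_e` for extracellular, `_p` for periplasm). The function name, the example compartment set, and the fallback rule (periplasm to extracellular, anything else to cytosol) are assumptions chosen to mirror the text.

```python
def reassign_compartment(metabolite_id, allowed):
    """Return a metabolite ID whose compartment is valid for the organism."""
    base, _, compartment = metabolite_id.rpartition("_")
    if not base or compartment in allowed:
        # No recognisable suffix, or the compartment is already relevant.
        return metabolite_id
    # Assumed fallback: periplasm -> extracellular, everything else -> cytosol.
    fallback = "e" if compartment == "p" else "c"
    return f"{base}_{fallback}"


# Hypothetical organism with cytosol, extracellular space and mitochondrion.
organism_compartments = {"c", "e", "m"}
examples = ["glc__D_p", "atp_x", "atp_m"]
remapped = [reassign_compartment(m, organism_compartments) for m in examples]
```

A full implementation would apply the same mapping to every metabolite of a reaction while keeping its gene-protein-reaction rule, and would merge reactions that become duplicates after remapping.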

Building a parasite knowledgebase.

Genetic data (from EuPathDB), orthology information (from EuPathDB’s OrthoMCL), and biochemical data from metabolomics studies (acquired from a literature review) were used to build our reconstructions in a multistep process; gene essentiality data were used to evaluate the resultant models. (A): Reconstruction pipeline. First, de novo reconstructions are built from annotated genomes and supplemented with KEGG reaction-associated genes on the database. Next, we curated an existing manually curated reconstruction for P. falciparum 3D7. Third, we mapped orthologous genes so that (fourth) we could add all metabolic functions from our curated iPfal22 into the de novo reconstruction by transforming each gene-protein-reaction rule via orthology. Lastly, we performed automated curation by gapfilling reconstructions to known metabolic capabilities and to generate biomass. With the resulting reconstruction, we can compare simulations to experimental data such as gene essentiality screens. (B): Considering compartmentalization. Our approach moves a large proportion of the reconstruction’s reactions from compartments in a biochemical database to biologically-relevant compartments (e.g. periplasm to extracellular). Thus, our de novo reconstruction approach accounts for compartmentalization, unlike many previous metabolic network reconstruction pipelines. Each model is represented by a point. Boxplots for each database denote the interquartile range with the median value at center; whiskers extend to 1.5 times the inter-quartile range (i.e. distance between the first and third quartiles) above or below the median. (C): Orthology adds information. Orthology-based curation improves reconstruction scope regarding total number of genes and reactions. These semi-curated reconstructions (each labeled dark point) are larger in scope due to the addition of reactions associated with genes added via orthologous-transformation. Semi-curated reconstructions are connected via a line to the draft uncurated reconstruction for that genome. Reconstructions are named by the associated species; Plasmodium species are labeled with species name. Light colored dots represent previously published Plasmodium reconstructions (iPfal22, from [57] and [47], iPfa2017 from [49], iPbe-blood and iPbe-liver from [58], all others from [48]). (D): Prediction accuracy. Semi-curated reconstructions (diamonds) recapitulate the biology of experimentally-facile parasites as well as published, manually-curated reconstructions (circles). We tested accuracy of model predictions from the de novo reconstruction (triangle) and the final orthology-translated and semi-curated reconstruction (diamond) for P. berghei and compared these summary statistics to the prediction accuracy generated by our well-curated iPfal22 and other previously published reconstructions [48-50,58,59]. This comparison was used to motivate our approach over de novo reconstruction building as our pipeline generates a reconstruction with greater predictive accuracy than de novo reconstruction and comparable to a well-curated reconstruction.

We next leveraged the manual curation in one parasite reconstruction, P. falciparum (step 2, curation from [47,57]), to generate a semi-curated reconstruction for a subset of phylogenetically-related organisms, specifically all Plasmodium spp. To build these semi-curated reconstructions, we transformed the manually-curated reconstruction using genetic orthology (step 3) and added all transformed reactions to the recipient de novo reconstruction (step 4), improving the overlap between our curated and draft networks for Plasmodium reconstructions.
Lastly, all draft and semi-curated reconstructions were gapfilled using parsimonious flux balance analysis (pFBA)-based gapfilling [60,61] to complete biochemical requirements identified in the experimental literature (step 5) and to produce biomass (see the Online Methods). Gapfilling also adds to the metabolic scope of all reconstructions. As a result, when compared to manually-curated parasite reconstructions [47-50], semi-curated reconstructions are larger in scope than de novo reconstructions and generate predictions with comparable or improved accuracy. These reconstructions are also more compliant with community standards [62,63] than previous reconstructions for parasites (representative MEMOTE examples shown at https://github.com/maureencarey/paradigm/tree/master/memote_reports). Our de novo draft reconstructions contain only genetically supported information (prior to gapfilling) and, unsurprisingly, reconstruction size is correlated with genome size. The large genome of Chromera velia CCMP2878 (a non-parasitic organism on CryptoDB with 31,799 ORFs and 3,064 reactions) corresponds to the reconstruction with the second-most unique reactions (58). Unique reactions are defined here as reactions found in only one reconstruction and no other reconstructions. However, even small reconstructions contain unique reactions prior to gapfilling, and the vast majority of these unique reactions are well connected within the network. In fact, 39 reconstructions contain at least one unique reaction, and every database has unique functions, that is, functions found in every reconstruction within a database but in no other reconstructions. For example, the group of Plasmodium networks share over 200 reactions that are only found in Plasmodium reconstructions; these reactions include hemoglobin breakdown. The number of unique reactions is correlated with genome size, both before and after gapfilling.
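The two notions of uniqueness used above (a reaction found in exactly one reconstruction, and a reaction found in every reconstruction of one database but nowhere else) reduce to simple set operations. The sketch below assumes each reconstruction is available as a set of reaction identifiers; the model names, reaction identifiers, and function names are placeholders invented for illustration and do not come from ParaDIGM.

```python
from collections import Counter


def unique_per_model(models):
    """Reactions present in exactly one reconstruction, keyed by model name."""
    counts = Counter(r for rxns in models.values() for r in set(rxns))
    return {name: {r for r in rxns if counts[r] == 1}
            for name, rxns in models.items()}


def unique_per_database(models, database_of):
    """Reactions in every model of a database and in no model outside it."""
    result = {}
    for db in set(database_of.values()):
        inside = [set(models[m]) for m in models if database_of[m] == db]
        outside = [set(models[m]) for m in models if database_of[m] != db]
        shared = set.intersection(*inside)
        elsewhere = set().union(*outside)
        result[db] = shared - elsewhere
    return result


# Toy data: three hypothetical reconstructions from two databases.
models = {
    "plasmo_a": {"GLYCOLYSIS_STEP", "HB_BREAKDOWN", "RXN_X"},
    "plasmo_b": {"GLYCOLYSIS_STEP", "HB_BREAKDOWN"},
    "crypto_a": {"GLYCOLYSIS_STEP", "STARCH_STEP"},
}
database_of = {"plasmo_a": "PlasmoDB", "plasmo_b": "PlasmoDB",
               "crypto_a": "CryptoDB"}

per_model = unique_per_model(models)
per_database = unique_per_database(models, database_of)
```

Because reactions in different compartments are treated as distinct, the identifiers compared here would need to carry their compartment.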

Reconstructions for all eukaryotic organisms with published genomes.

(A): Model summary. Genome size is measured here by the number of amino acid sequences encoded by the genome (triangle) and model size is measured by the number of reactions present in the network (square points). Grey rings highlight 100, 500, 1000, 5000, and 10,000 ORFs moving from the center outwards. Genomes are grouped by database, a rough phylogenetic grouping. Note: T. gondii RH is excluded from all subsequent analyses because only a subset of the genome is available from EuPathDB. (B): Model size is correlated with genome size. Larger genomes tend to generate larger models. Line is fit to a linear regression with R^2 noted (p-value < 0.001); the standard error is not shown. Points are color-coded by database. (C): Unique reactions by database. Number of unique metabolic reactions per database. Unique reactions are defined here as reactions found in every reconstruction within a database grouping and in no other reconstructions outside of that database grouping. Reactions found in different cellular compartments are considered distinct reactions.

In sum, 34% of reactions are in fewer than 10% of models (light grey) and 352 reactions are unique to just a single model. Importantly, these unique reactions are typically well-connected within the network and rarely represent blocked or unconnected reactions. A core set of 45 reactions is contained in all 192 reconstructions. Just 3% of reactions are in at least 90% of models (dark grey); reactions shared by many models include functions such as glycolytic enzymes. The relationship between genome size and model size is weakened following gapfilling, likely due to the same biomass formulation for all reconstructions, and the frequency of rare reactions (light grey reactions) increases. ParaDIGM can be used to tease apart the difference between unique, species-specific functions and poorly annotated functions to illuminate the uncharacterized fraction of parasite genomes. To illustrate additional examples of using this resource, we identified niche-specific functions, predicted fluxomics studies to identify divergent enzymes, and identified representative model systems for drug development.

Reaction frequency ranges from unique to core metabolism.

Reconstructions help identify rare metabolic functions (light grey box and on histogram, in fewer than 10 reconstructions) and core parasite metabolism (dark grey box and on histogram, in more than 166 reconstructions). Example rare reactions include seven metabolic reactions that are found in only one reconstruction. Of the 45 reactions found in all reconstructions (core metabolism), most reactions correspond to ABC transporter functions for ions or phospholipids. One reaction corresponds with a tRNA synthetase and the remaining correspond to fatty acid-CoA ligases for various fatty acids.

Niche-specific metabolic functions

To identify niche-specific functions, we used ParaDIGM to compare the enzymatic capacity of each organism. Specifically, we compared which enzymes are genetically supported and, therefore, present in each reconstruction prior to gapfilling. We performed classical multidimensional scaling using the Euclidean distance between reaction presence for each reconstruction. We observe that phylogenetically-related parasites tend to contain similar reactions. However, while networks generated from genomes within a common genus or species cluster together, models also cluster within environmental niche rather than broader phylogenetic grouping such as phylum. Apicomplexan parasites cluster tightly within genus but not across genera (Apicomplexa colored by database). Cryptosporidium parasites cluster with other gut pathogens (gut pathogens in black) rather than other Apicomplexa. Thus, phylogeny is not the sole predictor of model similarity (permutational multivariate analysis of variance, p = 0.001 using groups of Cryptosporidium, Toxoplasma, Plasmodium, and all other, and homogeneity of dispersion, p<0.001).

Identifying metabolic niches.

(A): Reaction content. Classical multidimensional scaling was performed on the reaction content of all de novo reconstructions; each reconstruction is represented by a point (grey/black or colored by database for emphasis). Thus, this analysis focuses exclusively on the genetically supported features of each reconstruction. Apicomplexan parasites (colored by database) and all other gut pathogens (black points) are highlighted. (B): Reaction content with alternative color scheme. Parasites that invade red blood cells (triangles, Plasmodium and Babesia) or can replicate extracellularly (circles) are highlighted; all other parasites are in lighter grey squares. (C): Important variables for the classification of gut pathogens. We performed a random forest classification to distinguish organisms that are considered gut pathogens from other organisms in ParaDIGM (AUC = 0.98 and an out-of-bag error rate of less than 8%). Important variables with a difference in occurrence score of 1 were present in 100% of gut pathogens and 0% of other organisms’ reconstructions and those with a score of -1 were present in 100% of non-gut pathogens and 0% of gut pathogens’ reconstructions. (D): Transporter profile. Again, parasites that invade red blood cells (triangles) or can replicate extracellularly (circles, like the kinetoplastids and Giardia, among others) are highlighted, with all other parasites in lighter grey squares. Red blood cell-invading parasites cluster.

Next, we performed random forest classification using reconstruction reaction content to identify the specific metabolic reactions associated with the metabolic niche of the gut environment. The classifier performed well with an AUC of 0.98 and an out-of-bag error rate of less than 8%, supporting our observation that gut parasites contain distinct metabolic reactions. Most of the important variables (reactions) were more frequently observed in non-gut pathogens, including gamma-glutamylcysteine synthetase (GLUCYS), glycerol-3-phosphate dehydrogenase (G3PD and G3PD4), an extracellular membrane proton pump (PPA_1), the glycine cleavage system (GCCb, GLYCL_2), phosphogluconate dehydrogenase (GND), and a pyruvate dehydrogenase using lipoamide (PDHa). Reactions associated with gut pathogenicity included Butanal:NAD+ oxidoreductase (BNORhc), glucan 1,4-alpha-glucosidase (GLCGSD), and starch synthase (STARCH300Sc). Similarly, parasites that invade red blood cells, including Plasmodium and Babesia, are dissimilar when comparing their full reaction content (triangles); however, the same analysis limited to each organism’s genetically encoded transporters reveals that these parasites have relatively similar transporter capabilities (triangles). This result indicates that these red blood cell-invading parasites rely on similar nutrients from their host red blood cell. On the contrary, the broad metabolic niche of extracellular growth yielded some outliers regarding enzymatic capacity and transporter profile (circles), likely due to the range of environments that parasites capable of extracellular growth encounter.
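The "difference in occurrence" score reported alongside the random forest variables can be computed independently of the classifier. The sketch below assumes the score is the fraction of gut-pathogen reconstructions containing a reaction minus the fraction of all other reconstructions containing it, which matches the stated endpoints of 1 and -1. The function name, model names, reaction identifiers, and presence values are invented toy data, not results from ParaDIGM.

```python
def occurrence_difference(presence, is_gut):
    """Per-reaction difference in occurrence: gut pathogens minus all others."""
    gut = [m for m in presence if is_gut[m]]
    other = [m for m in presence if not is_gut[m]]
    reactions = set().union(*presence.values())
    scores = {}
    for rxn in reactions:
        frac_gut = sum(rxn in presence[m] for m in gut) / len(gut)
        frac_other = sum(rxn in presence[m] for m in other) / len(other)
        scores[rxn] = frac_gut - frac_other
    return scores


# Toy reaction content for four hypothetical reconstructions.
presence = {
    "gut_1": {"RXN_A", "RXN_C"},
    "gut_2": {"RXN_A"},
    "other_1": {"RXN_B", "RXN_C"},
    "other_2": {"RXN_B"},
}
is_gut = {"gut_1": True, "gut_2": True, "other_1": False, "other_2": False}

scores = occurrence_difference(presence, is_gut)
```

In this toy data `RXN_A` occurs only in gut pathogens, `RXN_B` only in the others, and `RXN_C` in half of each group, so the three scores span the full range from 1 to -1 with 0 in between.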

Predicting metabolic function

Beyond the direct comparison of enzyme presence, we can use ParaDIGM to predict metabolic functions and the functional consequences of reaction presence and network connectivity. This approach augments the analysis beyond mere genetic comparisons: some enzymes may not be discovered in the genome despite being necessary to perform biochemical functions observed experimentally; such enzymes are included in these models. Relatively few fluxomics or controlled biochemical studies have been conducted for any one organism but these data can be predicted in silico. We simulate fluxomics studies to profile the metabolic capability of an organism using both genetic evidence and inferred network structure. To do this, we identified which metabolites can be consumed or produced in each model following gapfilling in a rich in silico medium. This environment is simulated by permitting import of any metabolite for which there is a genetically-encoded transporter or gapfilled transport reaction. A schematic of each metabolite categorization, the experimental data (with untested or unknown results in white), and the analogous in silico results are shown in the accompanying figure panels. All models except for one (Chromera velia CCMP2878 with the largest genome) required gapfilling to synthesize one or more metabolites and/or biomass. We can expand the in silico predictions to all metabolites in all models (a total of 5,141 metabolites by 192 models) to generate hypotheses about understudied metabolites and enzymes.

Predicting metabolic function.

(A): Advantage of network-based approaches. Metabolic models include hypothetical functions (i.e. the enzyme encoded by gene2) that are unsupported by direct genetic evidence but may be indirectly required based on biochemical evidence. These functions are added through gapfilling. Using models augments our analysis beyond mere genetic comparisons: some enzymes may not be discovered in the genome despite being necessary for the biochemical observations made, and these are included in the models. (B): Defining metabolic capacities. With our gapfilled models, we can identify if metabolites are consumed and/or produced. (C): Experimentally-derived metabolic functions. We compiled data providing evidence for consumption or production of select metabolites from the literature. Consumed metabolites are imported by the parasite from the extracellular environment (e.g. the in vitro growth medium). Produced metabolites are synthesized by the parasite even when the metabolite is not in the extracellular environment. See the Online Methods for more detail. Data are sparse. (D): Analogous inferred metabolic capacity of each organism and every metabolite from panel C. Data from panel C were used to gapfill reconstructions to generate the data presented in panel D (see the Online Methods). See panel B for definitions. Metabolites that are neither produced nor consumed are consumed intracellularly but are not taken up from the extracellular environment. Metabolites noted as ‘complex or unknown’ here are represented by multiple metabolite identifiers in the reconstructions (e.g., lactate is measured experimentally, but could represent both D-lactate and L-lactate within the reconstruction). (E-G): Example gapfilled functions in the Vitamin B6 pathway. These reactions were added to support the observed metabolic functions in panel C or to support in silico growth. Panel E shows L-alanine-alpha-keto acid aminotransferase (ASPTA6, added to 58 reconstructions), panel F shows pyridoxamine-pyruvic transaminase (PDYXPT_c, added to 64 reconstructions), and panel G shows pyridoxamine oxidase (PYDXO, named pyridoxal oxidase in BiGG, added to 90 reconstructions). Note, a deaminating pyridoxamine:oxygen oxidoreductase (PYDXO_1) is also added to 12 reconstructions to interconvert pyridoxal and pyridoxamine.

Interestingly, several metabolic enzymes were consistently predicted to be necessary for observed metabolic capabilities (metabolic tasks) or growth across all parasites (gapfilled reactions); three common examples, all steps in Vitamin B6 metabolism, are shown in panels E-G. Pyridoxamine oxidase is an understudied enzyme involved in Vitamin B6 metabolism; fewer than 300 articles on PubMed describe the enzyme. Not surprisingly given the lack of literature, the reaction associated with this enzyme is in just seven reconstructions in the BiGG database, including two iterations of the S. cerevisiae S288C model [56]. The deaminating version of this reaction is in only 10 reconstructions in the BiGG database; all ten of these reconstructions are for eukaryotes including five Plasmodium genomes. Pyridoxamine oxidase was only added to the V. brassicaformis CCMP3155 and G. niphandrodes reconstructions in the bioinformatics-driven model construction steps; however, this enzyme was added in 90 gapfilling solutions to satisfy experimentally-derived functions. Thus, we predict that it is important for parasite growth. We also predict that the unidentified sequences for pyridoxal oxidase are highly divergent from known sequences because they were not identified using bioinformatic annotation methods. By comparing the reconstructions within ParaDIGM, we can identify high-confidence reactions that are encoded by divergent genetic sequences and missed by purely bioinformatic approaches.

Most frequently gapfilled reactions.

These reactions (in the BiGG namespace) were the most commonly added reactions as a result of all gapfilling steps.

Selecting the most representative model system for an experiment

Genome-wide essentiality screens are available for Plasmodium falciparum [64] and Plasmodium berghei [58,65], Toxoplasma gondii [35], and Trypanosoma brucei. Using the models generated with ParaDIGM, we can perform the equivalent in silico simulations regardless of experimental genetic tractability. These analyses can be used to identify drugs for repurposing or the best model system for testing a novel drug target. To do this, we sequentially removed each reaction from the reconstruction to identify which reactions are necessary for growth (i.e. production of biomass). These simulations are performed in an unconstrained model (i.e. all metabolites with a transporter can be imported, all enzymes can be used) to simulate the parasite’s growth intracellularly in the nutrient-rich host cell. Dissimilarity of reaction essentiality for all Toxoplasma, Plasmodium, and Cryptosporidium reconstructions was calculated using the Euclidean distance.
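The single-reaction knockout procedure can be illustrated on a toy network. The sketch below poses flux balance analysis as a linear program (maximize biomass flux subject to steady state and flux bounds) and solves it with `scipy.optimize.linprog`; using SciPy here is an assumption made for a self-contained example and is not a statement about the tooling behind ParaDIGM. The five-reaction network, the reaction names, the bounds, and the growth tolerance are all invented for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def max_biomass(S, bounds, biomass_index):
    """Maximum biomass flux at steady state (S v = 0) within flux bounds."""
    c = np.zeros(S.shape[1])
    c[biomass_index] = -1.0  # linprog minimizes, so negate the objective
    res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                  bounds=bounds, method="highs")
    return -res.fun if res.status == 0 else 0.0


def essential_reactions(S, bounds, names, biomass_index, tol=1e-6):
    """Reactions whose removal drops the maximum biomass flux below tol."""
    essential = []
    for j, name in enumerate(names):
        if j == biomass_index:
            continue  # knocking out the objective itself is uninformative
        knocked = list(bounds)
        knocked[j] = (0.0, 0.0)  # force zero flux through reaction j
        if max_biomass(S, knocked, biomass_index) < tol:
            essential.append(name)
    return essential


# Toy network: uptake of A; A -> B directly (R1) or via C (R2, R3);
# B -> D (R4); D is drained as biomass.
names = ["EX_A", "R1", "R2", "R3", "R4", "BIOMASS"]
S = np.array([
    [1, -1, -1,  0,  0,  0],   # metabolite A
    [0,  1,  0,  1, -1,  0],   # metabolite B
    [0,  0,  1, -1,  0,  0],   # metabolite C
    [0,  0,  0,  0,  1, -1],   # metabolite D
], dtype=float)
bounds = [(0.0, 10.0)] + [(0.0, 1000.0)] * 5

wild_type = max_biomass(S, bounds, biomass_index=5)
essential = essential_reactions(S, bounds, names, biomass_index=5)
```

Here R1 is dispensable because R2 and R3 provide a bypass, whereas the uptake reaction and R4 have no alternative and are therefore predicted essential.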

Selecting experimental model systems using reaction essentiality.

Single reaction knockouts were performed on unconstrained models to identify the reactions that are essential for generating biomass. Dissimilarity scores were calculated from binary essentiality results using Euclidean distance (root sum-of-squares of differences). A dissimilarity score of 0 indicates that enzyme essentiality is identical between the two models; a high score indicates many differences. Each point represents a pairwise comparison with genera labels on the x-axis. Within-genus comparisons are shown on the left; across-genus comparisons are shown on the right. Several examples are highlighted with genome names. Genome-wide reaction essentiality is more similar between Toxoplasma and Cryptosporidium than Toxoplasma and Plasmodium. Mean dissimilarity score is significantly different (by two-sided Student’s t-test with multiple testing correction) between every labeled group.

Reaction essentiality is generally more similar for closely related organisms (i.e. within genera). However, some genera generate more similar predictions than others. Essentiality predictions were more similar when comparing Plasmodium genomes to one another than when comparing Toxoplasma genomes (despite all Toxoplasma genomes being of the same species) or Cryptosporidium genomes. Cryptosporidium genomes generate predictions that are significantly less similar than those of Plasmodium genomes. Essentiality predictions in T. gondii are less similar to Cryptosporidium parasites than to Plasmodium. As T. gondii is a popular model system for other parasites, this result supports the use of T. gondii to test hypotheses about Plasmodium over Cryptosporidium. Moreover, we can identify organisms that are particularly distinct within a genus. For example, C. parvum is a poor representative of C. ubiquitum whereas C. muris and C. andersoni are quite similar. Despite being distinct immunotypes, T. gondii VEG and GT1 are the most similar Toxoplasma genomes. P. vivax Sal-1, an unculturable and clinically relevant Plasmodium species, is more similar to P. knowlesi H than the average pair of Plasmodium genomes, whereas P. falciparum 3D7 and P. berghei ANKA are among the most dissimilar Plasmodium genomes. Importantly, more complete models generate less similar predictions, indicating that differences in essentiality reflect functional differences, not merely incomplete genome annotation resulting in incomplete reconstructions. These results highlight how ParaDIGM can be used to identify functional similarities and differences between parasites that directly inform experiments for developing and studying new drugs.
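The dissimilarity score used in this comparison (Euclidean distance over binary essentiality calls) can be illustrated as follows. This is a sketch under stated assumptions: the text does not say how a reaction absent from one of the two models is scored, so here it is treated as non-essential (0). The reaction identifiers and the function name are hypothetical.

```python
import math
from typing import Iterable, Set


def essentiality_dissimilarity(essential_a: Set[str], essential_b: Set[str],
                               reactions: Iterable[str]) -> float:
    """Euclidean distance between two binary essentiality vectors.

    Each reaction contributes 1 if it is essential in a model and 0 otherwise.
    Assumption: reactions missing from a model are scored 0 (non-essential).
    """
    squared = sum((int(r in essential_a) - int(r in essential_b)) ** 2 for r in reactions)
    return math.sqrt(squared)


all_reactions = ["R1", "R2", "R3", "R4", "R5"]   # hypothetical shared reaction list
model_a = {"R1", "R2"}                            # reactions essential in model A
model_b = {"R1", "R3", "R4"}                      # reactions essential in model B

score = essentiality_dissimilarity(model_a, model_b, all_reactions)
print(round(score, 3))  # 1.732, i.e. sqrt(3): the calls differ for R2, R3 and R4
```

Because the vectors are binary, the squared score equals the number of reactions whose essentiality call differs between the two models, so identical predictions give exactly 0.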

Discussion

Here, we presented a novel pipeline for generating metabolic network reconstructions from eukaryotic genomes and applied it to create 192 reconstructions for parasites, expanding the scope of parasite modeling. These reconstructions represent the first genome-scale metabolic network reconstructions for all but nine of these organisms, making ParaDIGM the broadest computational biochemical resource for eukaryotes to date. ParaDIGM uses reaction and metabolite nomenclature from the Biochemical, Genetic and Genomic knowledge base (BiGG, which includes both microbial and mammalian genome-scale metabolic network reconstructions) [56], facilitating future work involving host-pathogen interaction modeling. Gene nomenclature used in ParaDIGM is from the Eukaryotic Pathogens Database (EuPathDB) [54], consistent with parasitology field standards and ‘omics data collection. Reproducible data integration approaches are used to curate each reconstruction, making this the first fully automated reconstruction pipeline for eukaryotes; code and data are available in the Data Availability section for iterative improvements by ourselves and others. ParaDIGM or individual reconstructions can be used for comparative analyses or applied to interrogate clinically- and biologically-relevant phenotypes. The adherence to community standards for metabolic modeling throughout ParaDIGM enables easier manual curation for users interested in studying a specific parasite in more detail. Together, this adherence to standards and the automated approach for integration of experimental data will accelerate further curation of ParaDIGM itself as genome annotation improves, more experiments are performed with individual parasites, and ParaDIGM users provide feedback on reconstruction usage and performance.

This eukaryote-specific reconstruction process generates comprehensive networks of comparable quality to manually curated parasite reconstructions.
However, manual network curation and adding condition-specific constraints remain the gold standard approaches to maximize the accuracy of network predictions, especially for modeling stage-specific metabolism (e.g. [58]). Even so, our semi-automated curation approach enhances the genome-wide coverage of each reconstruction and generates models with comparable accuracy to previously published manually curated reconstructions; maximum model accuracy is dependent on including compartmentalized reactions in the reconstruction process. To evaluate these networks, we compare in silico predictions to experimental results; all have imperfect accuracy regarding gene essentiality, emphasizing how challenging it is to make a truly predictive model without integrating extensive stage-specific experimental data. High rates of false positives (when the model incorrectly identifies a gene as essential) are a product of the model building process; these reconstructions are built to summarize all metabolic capabilities of the organism, not the specific stage-dependent phenotype of an organism in the experimental system. Thus, constraining a reconstruction with in vitro expression data will reduce the false positive rate (e.g. [47,58]). We also compared our manually curated P. falciparum 3D7 reconstruction to our new semi-curated reconstruction for the same species. Differences between the two iterations fall into three groups: (1) non-specific genes that map to multiple reactions, (2) non-enzymatic genes (specifically, tRNAs), and (3) metabolic functions not yet encoded in the BiGG database for which reaction objects were created in our manual curation efforts. These differences can inform the first round of curation for semi-curated reconstructions.

However, this pipeline has a few key limitations.
First, simulation accuracy remains low, largely because annotation pipelines may over-annotate function and, importantly, these models represent the metabolic capacity of many life stages, whereas experimental data are derived from a single timepoint. Thus, without constraining these models with stage-specific data, models will underpredict essential genes and have poor accuracy. Furthermore, all reconstructions are limited by the data used for their construction; for example, we used the reactions already documented in the BiGG database as a universal set of reactions. Thus, reactions not contained in BiGG will not be included in ParaDIGM and only the reactions with specific cofactor utilization or directionality documented in BiGG will be included. Similarly, limited experimental data are available for the localization of specific enzymes and transporters and there has been limited successful experimental validation of computational predictions. Transporters in particular will influence pathway usage and a large proportion of transporters were added in the gapfilling process. Lastly, the data incorporated in objective reactions and the extracellular environment (i.e. media formulation) heavily influence which reactions are considered essential and which non-genetically supported reactions are added via gapfilling. Currently, these constraints within ParaDIGM are not experimentally-derived for each organism. The metabolic capacity represented in ParaDIGM will be expanded and the accuracy of each reconstruction and associated simulations will be improved as (1) BiGG is further expanded, (2) more confidence is gained regarding protein localization, and (3) metabolomics analyses improve biomass and media formulation. Additionally, we used our orthology-based semi-curation approach for only Plasmodium models; however, this approach can be used for other organisms to propagate manual curation efforts (from our group and others, e.g. [48-50]) from one species to closely-related organisms as well. Finally, these reconstructions have not been manually curated and require such attention for improved accuracy, especially in well-studied pathways and transporters and to represent stage-specific phenotypes.

Despite these limitations, the models within ParaDIGM perform similarly to manually curated models and so we highlight example use cases. First, we used ParaDIGM to better leverage model systems for drug development by identifying divergent or conserved metabolic pathways between select human pathogens. Network structures were quite distinct, with only 25.8% of all reactions present in more than 50% of the reconstructions; network topology did, however, cluster by genus, and transport ability is associated with specific host environments. Despite these structural similarities, minor topological differences in networks (and unique functions) confer key metabolic strengths or weaknesses. We compare metabolic reaction (or enzyme) essentiality to identify the best in vitro system or non-primate infection model of disease for drug development. For example, enzyme essentiality is broadly more consistent between Toxoplasma gondii and Cryptosporidium parasites than between T. gondii and the malaria parasites. By leveraging network context, we can impute fluxomic studies in all 192 parasites to contextualize the variable results across species in relatively few in vitro fluxomics studies and to expand these observations to untested organisms and metabolites.

Beyond our use cases of ParaDIGM, the pipeline and reconstructions presented here can be used broadly by the field. The study of microbial pathogens has generated paradigm-shifting results in biology. The study of viruses revealed basic cellular machinery present nearly ubiquitously in eukaryotic cells, such as the discovery of alternative RNA splicing in adenovirus [66].
The study of bacteria has provided a nearly real-time observation of evolution, allowing researchers to perform hypothesis-driven evolutionary biology experiments in addition to observational research [67]. These microorganisms have shed light on cell biology and the history of life in impactful yet highly unanticipated ways; experimental challenges associated with parasites have slowed their utility in this regard. However, both the genetic ‘dark matter’ of eukaryotic parasites and known parasite-specific functions are abundant; thus, parasites too have the capacity to inform our understanding of life. The reconstructions in ParaDIGM can be used broadly to contextualize existing experimental data and generate novel hypotheses about eukaryotic parasite biochemistry as it relates to the rest of the tree of life.

ParaDIGM provides a framework for organizing and interpreting knowledge about eukaryotic parasites. The reconstruction pipeline designed for ParaDIGM implements and builds on field-accepted standards for genome-scale metabolic modeling and the latest genome annotations in the parasitology field; moreover, it is uniquely tailored to eukaryotic cells by recognizing the importance of compartmentalization and the design of the objective function. The pipeline can be implemented with other organisms and re-implemented iteratively to incorporate novel genome sequences, biochemical datasets, genome annotations, and reconstruction curation efforts. The genome-scale metabolic network reconstructions organized in ParaDIGM can also be used broadly by the scientific community, either as-is as biochemical and genetic knowledgebases or as draft reconstructions for further manual curation to maximize the utility and predictive accuracy of the models. These reconstructions can be used to generate targeted experimental hypotheses for exploring parasite phenotypes, ultimately improving the accessibility of modeling approaches, increasing the utility of parasites as model systems, and accelerating clinically-motivated research in parasitology.

Automated curation tasks.

All reconstructions were gapfilled to ensure the network could consume or produce all relevant metabolites outlined below. Data from multiple strains of one species were aggregated. The first two columns describe each metabolite with a subsystem and name. The first two rows represent the genus and species for which literature evidence was compiled. Each {i,j} position in the matrix represents whether there is experimental evidence for a species’ consumption or production of the metabolite. Blank cells indicate no literature evidence was found for that metabolite in that species. (XLSX)

Genetically-encoded transporters.

Transporters in each reconstruction prior to gapfilling. These transporters are annotated into each genome. (CSV)

Gapfilled transporters.

Transporters added to each reconstruction in the gapfilling process. These transporters are necessary to generate flux through the biomass reaction using parsimony-based gapfilling. (CSV)

Available essentiality datasets.

Experimental genome-wide essentiality datasets that are available in the literature. These data for Toxoplasma and Plasmodium were used to evaluate model performance and specifically simulated gene essentiality. (CSV)

Compartmentalization.

Subcellular compartments used for reconstructions in each genus. (XLSX)

Blocked reactions.

Each row represents a reconstruction (named in column 1). All following columns list the BiGG identifiers for the blocked/unconnected reactions in that reconstruction. The products of blocked reactions are not used in any other reaction, whereas the reactants of unconnected reactions are not generated by any other reaction. Row headings are arbitrary counters of the reactions in each reconstruction in that category. (XLS)
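The two definitions above translate directly into a structural check on a reaction list. The following is a minimal sketch, assuming one reading of the caption: a reaction is "blocked" when none of its products is a reactant of any other reaction, and "unconnected" when none of its reactants is a product of any other reaction. Exchange reactions are modeled here as reactions with an empty reactant or product set and are skipped for the corresponding test. All identifiers are hypothetical.

```python
from typing import Dict, Set, Tuple

# reaction id -> (reactants, products); hypothetical toy network
Network = Dict[str, Tuple[Set[str], Set[str]]]


def classify_reactions(reactions: Network) -> Tuple[Set[str], Set[str]]:
    """Return (blocked, unconnected) reaction ids under the assumptions stated above."""
    blocked, unconnected = set(), set()
    for rid, (reactants, products) in reactions.items():
        others = [v for k, v in reactions.items() if k != rid]
        consumed_elsewhere = set().union(*(r for r, _ in others))
        produced_elsewhere = set().union(*(p for _, p in others))
        # Blocked: the reaction makes something, but nothing else uses it.
        if products and not (products & consumed_elsewhere):
            blocked.add(rid)
        # Unconnected: the reaction needs something, but nothing else makes it.
        if reactants and not (reactants & produced_elsewhere):
            unconnected.add(rid)
    return blocked, unconnected


toy_network: Network = {
    "EX_glc": (set(), {"glc"}),      # uptake exchange
    "R1": ({"glc"}, {"pyr"}),
    "R2": ({"pyr"}, {"lac"}),
    "EX_lac": ({"lac"}, set()),      # secretion exchange
    "R3": ({"pyr"}, {"x"}),          # product x is a dead end
    "R4": ({"y"}, {"pyr"}),          # reactant y is never produced
    "R5": ({"z"}, {"w"}),            # isolated reaction
}

blocked, unconnected = classify_reactions(toy_network)
print(sorted(blocked), sorted(unconnected), sorted(blocked & unconnected))
```

In this toy network R3 is blocked, R4 is unconnected, and R5 falls into both categories, mirroring the three groups reported in the supplementary table.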

Comparison of ParaDIGM gene coverage to genes annotated on EuPathDB with GO terms relating to metabolism.

(A): Schematic for gene count comparisons. Genes found on EuPathDB with a GO term related to ‘metabolic processes’, genes incorporated into de novo reconstructions (N = 192 reconstructions), and genes in both of these categories are described. (B): Number of genes per category. Boxplots represent the total number of genes per category across all reconstructions. (C): Number of EuPathDB genes associated with three example pathways. These genes represent the intersection of EuPathDB genes and reconstruction genes for three select pathways. For all panels, the box extends from the lower to upper quartile values of the data; the center line marks the median and whiskers show the range of the data. Outliers are not shown. (D): Percent of EuPathDB genes that are represented in reconstructions. Again, these genes represent the intersection of EuPathDB genes and reconstruction genes for three select pathways. For each example pathway in Panel C, the percent of total genes from EuPathDB that are represented in the de novo reconstructions is shown. (EPS)

Reaction content overlap between Plasmodium reconstructions in ParaDIGM and a manually-curated reconstruction, iPfal22.

Semi-curation improves reaction content overlap between Plasmodium reconstructions in ParaDIGM and a well-curated reconstruction. Venn diagram of reaction content between three Plasmodium species or strains (falciparum 3D7 in A, falciparum Dd2 in B, and berghei ANKA in C) for the draft reconstruction including only genes identified by Diamond and the semi-curated reconstruction (see Online Methods), compared to iPfal22. (EPS)

Benchmarking of gapfilling approach on Plasmodium reconstructions.

Only Plasmodium reconstructions were assessed here because we can compare the resultant reconstructions to a manually-curated reconstruction, iPfal22. (A): Number of reactions added at each step of the reconstruction pipeline per reconstruction. For each boxplot, the box extends from the lower to upper quartile values of the data; the center line marks the median and whiskers show the range of the data. Outliers are not shown. (B-D): Venn diagrams highlighting the shared reaction content of three Plasmodium reconstructions with and without gapfilling. Gapfilling increases the coverage of functions contained in the well-curated reconstruction, iPfal22. (EPS)

Further characterization of ParaDIGM reconstructions.

(A): Unique reactions by reconstruction. 39 reconstructions contain at least one unique metabolic reaction, i.e. a reaction not found in any of the other 191 models. Reconstructions are colored by EuPathDB grouping, like in panel A, and the bar represents the number of unique reactions in that reconstruction. (B-C): Unique reactions are well connected. Percent (B) and total number (C) of all unique reactions per reconstruction that are blocked, unconnected, or both blocked and unconnected. Blocked reactions are defined as those whose products are not utilized by any other reactions (including transport and exchange reactions), whereas unconnected reactions are those whose reactants are not produced by other reactions. For panel B, the box extends from the lower to upper quartile values of the data; the center line marks the median and whiskers show the range of the data. Outliers are shown as points. For panel C, each column or bar represents an individual reconstruction. Most unique reactions are connected and unblocked. (D-E): Gapfilled model size remains correlated with genome size. Following gapfilling, the relationship between genome size (as measured by open reading frames or genes with metabolic GO terms, D and E respectively) and model size remains; however, the correlation is weak due to an increase in the number of reactions for reconstructions built from medium-sized genomes. (F-G): Larger genomes have more unique reactions before (F) and after (G) gapfilling. Genome size is measured by open reading frames. For panels D-G, the line is a linear regression fit with R² noted (p-value < 0.001); the standard error is not shown. Points in D-G are color coded by database. (EPS)
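The notion of a unique reaction used in panel A (a reaction found in exactly one reconstruction) is easy to state as code. This is an illustrative sketch only; the model names and reaction identifiers are hypothetical and the function is not part of the ParaDIGM code base.

```python
from collections import Counter
from typing import Dict, Set


def unique_reactions(models: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each model to the reactions that appear in no other model."""
    # Count in how many models each reaction id occurs.
    counts = Counter(rxn for rxns in models.values() for rxn in rxns)
    return {name: {r for r in rxns if counts[r] == 1} for name, rxns in models.items()}


# Hypothetical reaction content of three reconstructions
models = {
    "model_A": {"R1", "R2", "R3"},
    "model_B": {"R1", "R2"},
    "model_C": {"R1", "R4"},
}

unique = unique_reactions(models)
n_with_unique = sum(1 for rxns in unique.values() if rxns)
print(unique["model_A"], unique["model_C"], n_with_unique)
```

Here two of the three toy models carry a unique reaction; the same tally over the 192 reconstructions would reproduce the kind of count the caption reports.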

Complete in silico metabolic capacity.

Inferred metabolic capacity of each organism (rows) for metabolites (columns) for every reconstruction and metabolite in ParaDIGM (5,141 metabolites by 192 models). Note the sheer volume of data acquired from ParaDIGM. (EPS)

Evaluation of prediction dissimilarity and reconstruction coverage.

(A-C): Distribution of reconstruction completeness scores for Cryptosporidium, Plasmodium, and Toxoplasma reconstructions. Completeness scores were calculated as the ratio of metabolic functions (reactions) in a single reconstruction to the sum of metabolic functions covered by all reconstructions from the respective genus. (D): Prediction dissimilarity is correlated with model completeness. For each pair of models, reaction essentiality predictions were compared to generate a dissimilarity score. Identical predictions have a dissimilarity score of 0 whereas a high dissimilarity score indicates divergent predictions. (EPS)
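The completeness score can be sketched as a set calculation. One assumption is made loudly here: the "sum of metabolic functions covered by all reconstructions" is read as the union of distinct reactions across the genus, so a score of 1 means the reconstruction contains every reaction seen in that genus. The model names and reaction identifiers are hypothetical.

```python
from typing import Dict, Set


def completeness_scores(genus_models: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of the genus-wide reaction set present in each reconstruction.

    Assumption: the denominator is the union of reactions over all
    reconstructions from the same genus.
    """
    genus_reactions = set().union(*genus_models.values())
    return {name: len(rxns) / len(genus_reactions) for name, rxns in genus_models.items()}


# Hypothetical reconstructions from a single genus
genus_models = {
    "strain_1": {"R1", "R2", "R3"},
    "strain_2": {"R1", "R2"},
    "strain_3": {"R1", "R4"},
}

scores = completeness_scores(genus_models)
print(scores)  # {'strain_1': 0.75, 'strain_2': 0.5, 'strain_3': 0.5}
```

Scoring against the genus rather than against all 192 models keeps the measure comparable within a genus, which is the comparison panel D relies on.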

Compartmentalization improves prediction accuracy and coverage of the genome.

(A): Sensitivity and specificity of in silico gene essentiality predictions when compared to experimental data. Reconstructions generated using our pipeline with and without compartmentalization are connected with a line; other points represent published models for reference. Our compartmentalization approach improves the sensitivity and/or specificity of gene essentiality predictions for P. berghei ANKA, P. falciparum 3D7, and T. gondii GT1. (B): Venn diagram of gene content for the ParaDIGM P. falciparum 3D7 reconstruction (with and without compartmentalization) and our manually curated P. falciparum 3D7 reconstruction, iPfal22. Incorporating compartmentalization improves genome coverage by adding 12 genes also found in iPfal22 and 17 genes not found in iPfal22. (EPS)
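For reference, the sensitivity and specificity plotted in panel A can be computed from paired essentiality calls as below. This is a generic sketch: "positive" means essential, matching the paper's definition of a false positive as a gene the model incorrectly identifies as essential. Gene identifiers and calls are hypothetical, and only genes present in both the prediction and the screen are compared (an assumption).

```python
from typing import Dict, Tuple


def sensitivity_specificity(predicted: Dict[str, bool],
                            observed: Dict[str, bool]) -> Tuple[float, float]:
    """Compare in silico essentiality calls to an experimental screen.

    True means "essential". Genes missing from either dictionary are ignored.
    """
    shared = predicted.keys() & observed.keys()
    tp = sum(1 for g in shared if predicted[g] and observed[g])
    fn = sum(1 for g in shared if not predicted[g] and observed[g])
    tn = sum(1 for g in shared if not predicted[g] and not observed[g])
    fp = sum(1 for g in shared if predicted[g] and not observed[g])
    sensitivity = tp / (tp + fn)   # fraction of truly essential genes recovered
    specificity = tn / (tn + fp)   # fraction of dispensable genes called dispensable
    return sensitivity, specificity


# Hypothetical calls for six genes
observed = {"g1": True, "g2": True, "g3": True, "g4": False, "g5": False, "g6": False}
predicted = {"g1": True, "g2": True, "g3": False, "g4": True, "g5": False, "g6": False}

sens, spec = sensitivity_specificity(predicted, observed)
print(round(sens, 3), round(spec, 3))  # 0.667 0.667
```

In this toy case the model misses one essential gene (g3) and wrongly flags one dispensable gene (g4), so both measures equal 2/3.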

Only half of transporters are genetically supported.

(A): Percent of all transporters with gene annotations. (B): Percent of intracellular transporters with gene annotations. Red dotted line indicates mean. (EPS)

Distribution of blocked and unconnected reactions.

The distribution of poorly connected reactions (A: blocked, B: unconnected, C: both) was similar before and after gapfilling. (TIFF)

20 Oct 2021

Dear Dr. Carey,

Thank you very much for submitting your manuscript "Comparative analyses of parasites with a comprehensive database of genome-scale metabolic models" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.
Sincerely,

Pedro Mendes, PhD
Associate Editor
PLOS Computational Biology

Kiran Patil
Deputy Editor
PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: I congratulate the authors for creating a very important knowledge-base for the parasite community. The paper is very well written, the methods used are solid and the results reported are of extreme importance for the scientific community. I especially appreciate that the authors propose that the method will be used iteratively and models can be improved as more data arise. As I mention in the revision below, I believe that more and more the scientific community should be careful when only propagating data from closely related species. And the authors indeed show in their results that "Phylogeny is not the sole predictor of model similarity". This idea can also be extended to the genomic prediction, annotation, metabolic models reconstruction, and so on and we should keep in mind that much information might be lost when we try to tailor-make the models based mostly on related species. This is also why I believe that creating this extremely rich and diverse knowledge-base becomes of utter importance to future advances in the area.

Comments and questions:

1. You describe that reconstructions were built based on the annotated genomes, and then proteins were re-annotated using Diamond against the BiGG database. However, I would like to know (and you could also discuss this within the paper) about the genomic annotation present within EuPathDB, as I don't imagine that all annotations are done in a homogeneous way across distinct databases.
And if this is correct, how do you handle these annotations coming from different sources? Could you discuss how robust your pipeline is, taking into account this kind of difference? It would be interesting to also include for each genome within your dataset the proportion of uncharacterized protein coding genes or genes having DUFs (domains of unknown function) and whether these correlate with the proportion of reactions that are gap-filled in the final step.

2. In terms of compartmentalization, what was the proportion of transport reactions included in the models in order to allow for correct fluxes of reactions within compartments and how many of these are orphan reactions? Is this one of the reasons you did not include compartments for all species or were there other reasons for this choice?

3. In page 4 (lines 7 to 11) and in Supp. Fig 1: I am not sure if I understood Suppl Figure 1C and D correctly. Are these the categories of genes that are only in EuPathDB (C) or are these all genes in EuPathDB? Also, regarding these genes that were not included in the reconstructions and by referring also to the methods described on page 17 lines 23-40: what is the distribution of genes belonging to each of the 3 groups you mention? And do these percentages change depending on the database used (for example TritrypDB versus ToxoDB)?

4. You mention that you perform a brief manual curation of models. I would like to know time-wise how long is brief in this case, for comparison to full curation that can take up much time. This is a minor comment, but I think it would be important to state in the text whether the authors are planning to perform this type of curation with other species. This could be discussed as a perspective as it would be of great interest to the broader parasite community to have models that were constructed similarly in order to be comparable within different taxa.

5. Can you describe a little more the reactions that were present in the manual curation that were not added to the final semi-curated Plasmodium models? Why do you think these are not included and is it possible to propose a way to improve this step of semi-curation in the future in order to arrive at models closer to curated ones?

6. In Figure 3: You could use the colors of databases provided in 3A and 3C on the individual points for Figure 3B. Also, it seems that not only number of reactions but also number of unique reactions are related to genome size, correct? It would be interesting to see the same trend as shown in Figure 3B but also taking into account unique reactions, especially since your barplots in 3C do not account for per genome variability. I would also highlight the genomes that actually deviate from this linear regression curve in order to pinpoint whether these differences arise from biological traits or if they are due to technical differences.

7. Still in Figure 3 (and the text): Could you explain why you don't find basic pathways such as glycolysis, TCA cycle, Pentose Phosphate, etc within your core reactions? It would be interesting to know in which models these reactions are not taking place and not only knowing they take place in ~90% of the models. Can you also explain if this is merely a technical issue or if this is in accordance with the biology of these species lacking central metabolism enzymes?

8. In page 5, line 33 you mention that even small reconstructions contain unique reactions prior to gapfilling. I am also wondering if these numbers change a lot after the gapfilling process.

9. I am not sure if there is a particular reason the authors do not discuss or compare also previous databases of metabolic networks such as Metacyc. I would at least advise you to explain if you perform better and if possible add an explanation as to why or state if these are not comparable instances of models from the same species.
Also this is of interest since as you mention in page 11 line 3, reactions not included in BiGG database are not within your set of universal reactions.

10. Also, it was not clear to me whether the authors will regularly maintain and update the generated databases as new data arise or if this was more of a proposition that others can use the code in order to generate new models.

Finally, in my opinion, an extremely interesting finding is also buried within the "Niche-specific metabolic function" paragraph and could be brought much more into play either in discussion or conclusion if the authors are confident enough, which is: "Phylogeny is not the sole predictor of model similarity". And I say this at the end because I believe that your findings support the idea that we should not just propagate ortholog reactions from the most similar species available because much of the information might be lost by doing this.

Minor comments:

0. I could not find table 1 within the material provided.

1. In Figure 2C, I would label the "unlabeled dots" or at least provide a subtitle in terms of which species each is referring to.

2. Supplementary tables could have more information in tabs, column names and rows within the tables to improve readability. Also In Suppl. Table 4 I don't understand if the category "both" should be a sum of the other two (blocked/unconnected). If this is the case, I don't think the numbers add up.

3. In Suppl Figures 2 and 3, the blue color circle on top of a purple color does give a good contrast to interpretation.

4. In Suppl Figure 3A: Do you see different trends from distinct databases regarding the number of reactions added at each step or is this very similar to all groups?

4. Page 5 line 31: I believe the sentence ends abruptly and is missing some words.

5. In Figure 3E you could indicate from which database the species having the 7 metabolic unique reactions are coming from for easier readability.

6. In general I don't see the point of including "data not shown", so if it's not relevant to your paper, I would just exclude it, otherwise please support the information necessary for this statement.

7. Page 7, line 4: "Most important variables" are variables here related to "reactions"?

8. In page 10: you created the reconstructions for all "but nine" organisms. Which are these nine organisms? And why were they excluded? I found the explanation for T. gondii RH, but could not find other examples of excluded genomes.

Reviewer #2: This is a timely and important database. The paper is well-written. I have a few comments below, one of them more pressing, on the importance of transporter annotations.

The Introduction is very good at presenting the problem and need for this study, but it lacks a bit in presenting the state of the art for similar databases of GSMs (genome-scale models) and large-scale batch reconstruction efforts (e.g. ModelSEED, CarveME, BiGG, etc), as well as for large-scale comparative works with GSMs (mostly done in the aforementioned automated reconstruction tools, or evaluation tools e.g. MEMOTE). Comparisons of curated GSMs are very few, and use few models (e.g. Xavier et al. Plos Comp Bio 2018), which highlights the need for databases such as this one. I would advise a new paragraph before the last of the Introduction to do this - the authors already do mention much of this literature throughout the rest of the paper. In theory, the community as a whole would benefit from one, centralized source of good GSM data (e.g. as done for genome data with NCBI), yet new ones continue to appear. Why is ParaDIGM important in this scenario, and how does it not increase division? This can be returned to in the Discussion as well. The usage of MEMOTE and BiGG nomenclature by the authors is commendable, and shows concern for community standards.

Fig. 1A is more of a table than a figure.
I think for the community it would be more helpful as a table, annotated with the references that convey the information given. To keep it as a figure though, it would really help to ditch the current grey color that does not convey information, and perhaps color code "yes" and "no"; it would also really help to sort the table, perhaps by the last column, perhaps split it by disease (which could come in a first column, non-repeated).

How was KEGG information mapped to BiGG? Where is this mapping available for review? Please detail in the methods.

The most sensitive point: could you please expand on the process of annotation of transporters in the methods and discussion, perhaps consider a small analysis on how many were annotated per genome, and how many added through gap-filling? These have a huge influence in GSMs, but in the case of parasites it becomes extreme (more so if one considers transport between compartments, important in the case of eukaryotes as the authors highlight). So as the authors recognize, particularly with the correlation of some results with the environment of each species (Fig. 4), transport will have a major impact in this work. KEGG does not have transporter annotations in the form of reactions, so they must come from Diamond-BiGG + Gapfilling, but how much of each? Being very explicit about this point will increase the value of ParaDIGM substantially.

In general the Figures are too crowded, with much text and schematics - things that would be better as boxes, tables or supplementary material. Some colors are undefined in the figures themselves; some figures are just crowded with gridlines etc., e.g. fig 5C the grey squares are confusing as the grey color is part of the key. This also occurs in Supplementary Figures, e.g. Supp Fig 5 - there's an excess of color and data and it's hard to discern what one can infer from this figure. Please review the figures in general for readability, consistency, and to remove excess content.
Lines 19-26, page 11: the authors mean 4D instead of 4C, twice, I believe.

Extensive work is required on reference formatting: some names are in the format "Last, First", others in "First Last"; URLs are heterogeneous, etc.

Reviewer #3:

1) “Reaction compartmentalization was adjusted to maintain each gene-protein-reaction mapping but only with the subcellular compartments relevant for each organism. A large proportion of parasite gene-reaction pairs would otherwise be misassigned or removed from the network due to assignment to an incorrect compartment, due to lack of orthologous and compartmentalized reactions in biochemical databases; our pipeline reassigns these reactions to the cytosol (Figure 2B).”
- This should be explained more clearly as far as the “adjustments” that are happening. To be sure, does Figure 2B show only the percentage of reactions assigned to the cytosol? When I look at the legend below, I see that adjustment from the periplasm to extracellular is mentioned, not just cytosol re-assignments.

2) “B): Considering compartmentalization. Our approach moves a large proportion of the reconstruction’s reactions from compartments in a biochemical database to biologically-relevant compartments (e.g. periplasm to extracellular).”
- “biologically-relevant compartments (e.g. periplasm to extracellular).” Is the periplasm not a biologically-relevant compartment? Although I understand why to assign periplasm to extracellular as an abstraction, I don’t understand the point being made about compartments in a database vs biologically-relevant compartments in the context of the example provided.
- It is excellent that the authors are accounting for compartmentalization, as they mentioned: “our de novo reconstruction approach accounts for compartmentalization, unlike previous metabolic network reconstruction pipelines.” I am unsure that the statement is correct.
I believe that at least the metabolic reconstruction pipeline “merlin” accounts for compartmentalization as well (unsure about other tools that might have been released more recently).
- A little more clarification/changes in the paragraph text and legend would be appreciated to clarify this.
- Also, I only see the “pros” of making this compartment adjustment; as the authors mention, “large proportion of parasite gene-reaction pairs would otherwise be misassigned or removed”. What about the cons of doing this re-adjustment of reaction compartments and keeping those reactions?

3) “(D): Prediction accuracy. Semi-curated reconstructions (filled squares and triangles)”
- In the figure legend, filled squares are shown as “uncurated”.

4) “semi-curated reconstructions are larger in scope than de novo reconstructions (Figure 2C) and generate predictions with comparable or improved accuracy (Figure 2D).”
- What does the increase in scope provide? What are we gaining from all those “extra” reactions when we only get “comparable accuracy” in the models? I looked at the supplementary material data to see if this increase in the number of reactions also meant an increase in the number of blocked reactions in the model but could not find out.

5) “The relationship between genome size and model size is weakened following gapfilling (Supplemental Figure 4D)”
- Can you explain why this happens? I guess that biomass requirements/biomass equation will be the same for many of these draft reconstructions (since accurate data on this is limited), meaning gap-filling will add a lot of the same reactions to fulfill the exact biomass requirements.

6) Page 8, line 23: “Interestingly, several metabolic enzymes were consistently predicted to be necessary for observed metabolic capabilities (Figure 5C) or growth across all parasites (Table 1)”
- I am trying to understand how Figure 5C supports this claim along with Table 1. But it seems that Table 1 is missing.
I can only find a legend for it: “Table 1: Most frequently gapfilled reactions. These reactions (in the BiGG namespace) were the most commonly added reactions as a result of all gapfilling steps.” Either Table 1 is missing, or does it refer to Supplementary Table 1, and do we have to parse the data from there to understand what is claimed here?

7) “While reaction essentiality is generally more similar for closely related organisms (i.e. within genera), essentiality predictions were more similar when comparing Plasmodium genomes to one another than between Toxoplasma or Cryptosporidium genomes.”
- I am confused here. Aren’t the results showing that, as expected (“generally”), essentiality results are more similar for closely related Plasmodium genomes? I guess my confusion stems from the phrase starting with “while”; I assumed that the conclusion would be contrary to what “generally” happens. So it’s either that or I am interpreting Figure 6 wrong.

I want to compliment the authors on clearly explaining the critical limitations of their pipeline in the Discussion section. That addressed multiple questions I was going to include in my review.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Mariana Galvao Ferrarini
Reviewer #2: No
Reviewer #3: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols.
Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

8 Dec 2021

Submitted filename: response_to_reviewers.docx

27 Jan 2022

Dear Ms. Carey,

We are pleased to inform you that your manuscript 'Comparative analyses of parasites with a comprehensive database of genome-scale metabolic models' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication.
We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,
Pedro Mendes, PhD
Associate Editor
PLOS Computational Biology

Kiran Patil
Deputy Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: As I previously mentioned, the paper is very well written, the methods used are solid and the results reported are of extreme importance for the scientific community. I also did not find the missing end of the sentence I mentioned in the previous review. It must have been my error when reading the first time around. I fully support the acceptance of this paper and I am pleased with the answers to my questions and with the changes made to the original manuscript. I also congratulate the authors for a very nice read. On a side note: I have recently come across a novel model for T. cruzi which might be valuable for the refinement efforts of the authors in the future (in case the authors haven't seen it yet): https://pubmed.ncbi.nlm.nih.gov/33021977/

Reviewer #2: The authors have revised the text thoroughly and provided all the information I asked for. I have no further comments. Congratulations!

Reviewer #3: Thank you for answering my questions and making the necessary changes to the manuscript.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file).
The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Mariana G. Ferrarini
Reviewer #2: No
Reviewer #3: No

14 Feb 2022

PCOMPBIOL-D-21-01649R1

Comparative analyses of parasites with a comprehensive database of genome-scale metabolic models

Dear Dr Carey,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date.
The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,
Zsanett Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Table 1

Summary of select parasitic diseases and their causal organisms.

Species	Disease	Treatable?	Drug resistance?	Culturable?	Genetically tractable?	Percent of genome 'hypothetical'
Trypanosoma brucei	African sleeping sickness	yes	yes	yes	yes	76.40%
Babesia bovis	babesiosis	yes	no	yes	yes	72.00%
Trypanosoma cruzi	Chagas disease	yes	yes	yes	yes	52.90%
Cryptosporidium hominis	diarrhea	no	-	no	no	54.10%
Cryptosporidium parvum	diarrhea	no	-	yes	yes	4.1%*
Entamoeba histolytica	diarrhea	yes	yes	yes	yes	79.80%
Giardia intestinalis	diarrhea	yes	yes	yes	yes	39.20%
Naegleria fowleri	encephalitis	yes	yes	yes	no	31.70%
Leishmania major	leishmaniasis	yes	yes	yes	yes	76.60%
Plasmodium falciparum	malaria	yes	yes	yes	yes	37.60%
Plasmodium vivax	malaria	yes	yes	no	no	43.50%
Toxoplasma gondii	toxoplasmosis	yes	yes	yes	yes	56.20%
Trichomonas vaginalis	trichomoniasis	yes	yes	yes	yes	94.00%

Table 2

Most frequently gapfilled reactions.

These reactions (in the BiGG namespace) were the most commonly added reactions as a result of all gapfilling steps.

Reaction	Times gapfilled	Reaction name
NADPPPS	96	NADP phosphatase
PYDXO	90	Pyridoxal oxidase
IMPtr	86	Transport of IMP
SO4HCOtex	84	Sulfate transport via bicarbonate countertransport
EX_lyslyslys_e	81	LysLysLys exchange
LYSLYSLYSr	81	Metabolism (Formation/Degradation) of LysLysLys
LYSLYSLYSt	81	LysLysLys transport
PSERT	80	Phosphoserine transaminase
PGCD	75	Phosphoglycerate dehydrogenase
GTHOXti	74	Glutathione transport
CYSLY3	65	Cysteine lyase
NNDPR	65	Nicotinate-nucleotide diphosphorylase (carboxylating)
PDYXPT_c	64	Pyridoxamine-pyruvic transaminase
H2O2t	63	Hydrogen peroxide transport
PSP_L	60	Phosphoserine phosphatase (L-serine)
EX_ileargile_e	59	IleArgIle exchange
ILEARGILEr	59	Metabolism (Formation/Degradation) of IleArgIle
ILEARGILEt	59	IleArgIle transport
lipid2	59	aggregation of all fatty acyl-CoAs
ASPTA6	58	L-alanine-alpha-keto acid aminotransferase
GMPR	56	GMP reductase
GTHRDH_syn	55	Glutathione hydralase
GTHPe	53	Glutathione peroxidase
H2Ot	51	Water transport
HISD_c	48	Histidine degradation to glutamate
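
A tally like the one in Table 2 could be produced roughly as follows. This is a minimal sketch, not the authors' pipeline: it assumes the gapfilled reactions of each reconstruction are available as a list of BiGG reaction IDs, and it assumes a reaction is counted once per reconstruction even if several gapfilling steps added it. The function name `tally_gapfilled`, the model names, and the toy input are hypothetical and are not values from ParaDIGM.

```python
from collections import Counter


def tally_gapfilled(gapfilled_by_model):
    """Count how many reconstructions had each reaction added by gapfilling.

    gapfilled_by_model maps a reconstruction name to the list of BiGG
    reaction IDs added across all of its gapfilling steps (assumed format).
    """
    counts = Counter()
    for reactions in gapfilled_by_model.values():
        # Deduplicate so a reaction added in several gapfilling steps of the
        # same reconstruction is counted once (an assumption of this sketch).
        counts.update(set(reactions))
    return counts


# Hypothetical toy input: three reconstructions and their gapfilled reactions.
toy = {
    "model_A": ["NADPPPS", "PYDXO", "H2Ot"],
    "model_B": ["NADPPPS", "IMPtr", "NADPPPS"],
    "model_C": ["NADPPPS", "PYDXO"],
}

counts = tally_gapfilled(toy)
for reaction_id, n in counts.most_common(2):
    print(reaction_id, n)
```

Sorting with `most_common` gives the "most frequently gapfilled" ordering used in the table; whether the published counts are per reconstruction or per gapfilling step is not stated here, so the deduplication line should be adjusted to match the actual definition.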
