
The secondary metabolite bioinformatics portal: Computational tools to facilitate synthetic biology of secondary metabolite production.

Abstract

Natural products are among the most important sources of lead molecules for drug discovery. With the development of affordable whole-genome sequencing technologies and other 'omics tools, the field of natural products research is currently undergoing a paradigm shift. While, for decades, mainly analytical and chemical methods gave access to this group of compounds, genomics-based methods now offer complementary approaches to find, identify and characterize such molecules. This paradigm shift has also created a high demand for computational tools to assist researchers in their daily work. In this context, this review summarizes the tools and databases that are currently available to mine, identify and characterize natural product biosynthesis pathways and their producers based on 'omics data. A web portal called the Secondary Metabolite Bioinformatics Portal (SMBP at http://www.secondarymetabolites.org) is introduced to provide a one-stop catalog of, and links to, these bioinformatics resources. In addition, an outlook is presented on how the existing tools and those yet to be developed will influence synthetic biology approaches in the natural products field.


Keywords: A, adenylation domain; Antibiotics; BGC, biosynthetic gene cluster; Bioinformatics; Biosynthesis; C, condensation domain; GPR, gene-protein-reaction; HMM, hidden Markov model; LC, liquid chromatography; MS, mass spectrometry; NMR, nuclear magnetic resonance; NRP, non-ribosomally synthesized peptide; NRPS; NRPS, non-ribosomal peptide synthetase; Natural product; PCP, peptidyl carrier protein; PK, polyketide; PKS; PKS, polyketide synthase; RiPP, ribosomally and post-translationally modified peptide; SVM, support vector machine

Year: 2016 PMID: 29062930 PMCID: PMC5640684 DOI: 10.1016/j.synbio.2015.12.002

Source DB: PubMed Journal: Synth Syst Biotechnol ISSN: 2405-805X

Introduction

Antimicrobial resistance is projected to be one of the major global challenges for maintaining our future health systems. According to the report commissioned by the Department of Health of the UK government and chaired by the economist Jim O'Neill, antimicrobial resistance will result in more than 10 million annual deaths and a loss of 2.0–3.5% of world gross domestic product, equivalent to 60–100 trillion USD, by 2050 [e.g., references 1, 2, 3]. While this report may predict a worst-case scenario, it is clear that the problem of antimicrobial resistance has to be addressed urgently and globally. As there will be no simple single solution, efforts have to be undertaken in various fields, for example optimizing hygiene, access to clean water, vaccinations, increased efforts to prevent infections, and reduced use of antibiotic families that are used in both human medicine and livestock farming. Another important challenge will be to develop novel antimicrobial therapies and drugs. Historically, natural products have been the major source of lead compounds for antimicrobial drugs, but they are also used in other application fields, such as anti-cancer drugs, insecticides, anthelmintics, painkillers, flavors, cosmeceuticals and crop protection. Nevertheless, most big pharma companies have severely reduced their research efforts on natural products during the last 20 years due to high rediscovery rates of known molecules and a lack of innovative screening approaches. It is therefore all the more notable that the majority of newly approved small-molecule drugs are still natural products or their derivatives.
With the broad availability of ‘omics technologies, we are currently experiencing a paradigm shift in natural product research. For decades, the only way to get access to new compounds was to cultivate antibiotics-producing microorganisms, mainly fungi and bacteria, under different growth conditions, and then isolate and characterize the compounds with sophisticated analytical chemistry. Nowadays, ‘omics approaches offer complementary access to natural products: by identifying natural product/secondary metabolite biosynthetic gene clusters (BGCs), it is possible to assess the genetic potential of producer strains and to identify previously unknown metabolites more effectively. While this approach has led to some renaissance of natural product research in academia and industry, this information will also be the basis to rationally engineer molecules or develop “designer molecules” using synthetic biology approaches in the future. When the first whole genome sequences of the model streptomycete Streptomyces coelicolor A3(2) and the avermectin producer Streptomyces avermitilis [10, 11] were determined, both strains were found to possess more secondary metabolite BGCs than had been estimated from the number of their already known secondary metabolites. This is especially remarkable as both strains have served as model organisms and, in the case of S. avermitilis, as industrial production strains for many years, and thus have been studied by many researchers all over the world. With the rise of novel sequencing technologies and a growing number of microbial whole genome sequences, it became evident that a high number of BGCs is a common feature among various groups of bacteria, for example actinomycetes. Although the diversity of natural product chemical scaffolds is vast, the biosynthetic principles are highly conserved for many secondary metabolites.
A set of enzyme families is often and very specifically associated with the biosynthesis of different classes of secondary metabolites. Thus, sequence information of these known gene families can be used to mine genomes for the presence of secondary metabolite biosynthetic pathways. There are two principal strategies in the implementation of bioinformatic tools. Rule-based approaches can be used to identify gene clusters encoding known biosynthetic routes with high precision. In the first step of the mining process, these tools identify genes encoding conserved enzymes/protein domains that have associated roles in secondary metabolism, for example the “condensation (C)”, “adenylation (A)” and “peptidyl carrier protein (PCP)” domains of non-ribosomal peptide synthetases (NRPSs). In the second step, predefined rules are used to associate the presence of such hits with defined classes of natural products. In the above example, an NRPS BGC can be simply and unambiguously identified if genes are present that code for at least one C-, A- and PCP domain. More complex rules may take into account whether specific genes are encoded in close proximity; for example, type II polyketide BGCs can be detected using a rule that evaluates whether a ketosynthase α, a ketosynthase β/chain length factor and an acyl-carrier protein are encoded by 3 individual genes in direct proximity. Such rule-based search strategies are, for example, implemented as one option in the pipeline antibiotics and Secondary Metabolite Analysis SHell (antiSMASH) [13, 14, 15], which, currently in its version 3, can detect 44 different classes of BGCs. In particular, clusters containing modular polyketide synthase (PKS) or NRPS genes can be easily detected by scanning the genome for genes that encode their characteristic enzyme domains, as also implemented in NaPDoS, NP.searcher, GNP/PRISM, and SMURF.
All these approaches are very precise in detecting gene clusters of known families and classes for which rules can be defined. Because they depend on predefined rules, these algorithms cannot detect novel pathways that use different biochemistry and enzymes. To avoid this limitation, less biased rule-independent methods have also been developed, for example as implemented in ClusterFinder and EvoMining (see below for details on how they work). These tools use machine learning-based approaches or automated phylogenomics analyses to make their predictions. For fungi, algorithms that evaluate transcriptome data can also efficiently predict clusters of co-transcribed genes. As computational approaches to natural product discovery are a rather new and dynamic field, we intend to give an overview of existing computational tools and databases that help scientists solve the abovementioned tasks and to develop perspectives on how these approaches will change the discovery of new natural products (Fig. 1).
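The two rules described above (an NRPS cluster requires at least one C, A and PCP domain; a type II PKS cluster requires ketosynthase α, ketosynthase β/chain length factor and acyl-carrier protein on three adjacent, separate genes) can be sketched in a few lines of Python. This is only an illustration of the rule-based idea, not the implementation used by antiSMASH or any other tool: the `Gene` record, the domain labels and the example gene list are hypothetical, and a real pipeline would first obtain the domain hits from profile HMM searches.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Gene:
    """A gene plus the domain labels found in its product (hypothetical format)."""
    name: str
    domains: frozenset


def is_nrps_cluster(genes):
    """Rule from the text: at least one C, A and PCP domain is encoded."""
    found = set()
    for gene in genes:
        found |= gene.domains
    return {"C", "A", "PCP"} <= found


# Hypothetical labels for the three type II PKS core functions.
TYPE2_PKS = ("KS_alpha", "KS_beta_CLF", "ACP")


def find_type2_pks(genes):
    """Index of the first three adjacent genes that encode KS-alpha,
    KS-beta/CLF and ACP on three separate genes, or None."""
    for start in range(len(genes) - 2):
        run = genes[start:start + 3]
        # Try every assignment of the three domains to the three genes.
        for order in permutations(TYPE2_PKS):
            if all(dom in gene.domains for dom, gene in zip(order, run)):
                return start
    return None


# Made-up example: genes are listed in chromosomal order.
genes = [
    Gene("orf1", frozenset({"C", "A", "PCP"})),
    Gene("orf2", frozenset({"ACP"})),
    Gene("orf3", frozenset({"KS_alpha"})),
    Gene("orf4", frozenset({"KS_beta_CLF"})),
]
print(is_nrps_cluster(genes))  # True
print(find_type2_pks(genes))   # 1
```

Here "direct proximity" is simplified to "three consecutive genes"; a distance threshold in base pairs would be an equally plausible reading.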

Fig. 1

Overview of the most commonly used and freely accessible tools specialized for the analysis of secondary metabolites and their pathways.

Computational tools for natural product research

Recently, several reviews have been published describing the different strategies employed by the genome mining tools commonly used to detect secondary metabolite BGCs [e.g., references 23, 24, 25, 26]. In this review, we therefore give a concise but comprehensive, up-to-date overview of the tools and databases that are currently available for mining for BGCs, analyzing biosynthetic pathways, combining genomic and metabolomic data, and generating genome-scale metabolic models of the secondary metabolite producers (Table 1, Table 2). More importantly, this overview is coherently provided through the newly established Secondary Metabolite Bioinformatics Portal (SMBP), along with links to references and websites of the tools and databases. We also discuss perspectives on further development of the field.

Table 1

Comprehensive collection of freely accessible software programs and databases dedicated to natural product research. Only software programs and databases properly functioning as of December 2015 are listed in this table. A more comprehensive list can be found at the SMBP (http://www.secondarymetabolites.org).

Software program or database	URL	Reference	Last publication or documented update	Main content and/or function
Tools for mining of secondary metabolite gene clusters (^R: rule-based, ^N: non-rule-based algorithms used to detect the BGCs)
2metDB^R	http://secmetdb.sourceforge.net/	²⁷	2013	Standalone (Mac) tool to mine PKS/NRPS gene clusters
antiSMASH^R/N	http://antismash.secondarymetabolites.org	13, 14, 15	2015	Web application and standalone tool (LINUX, MacOS and MS Windows) to mine and analyze BGCs; includes comparative genomics tools and a homology-based metabolic modeling pipeline
BAGEL^R	http://bagel2.molgenrug.nl/	28, 29, 30	2013	Web application to mine and analyze RiPPs
CLUSEAN^R	https://bitbucket.org/tilmweber/clusean	³¹	2013	Standalone (LINUX and MacOS) tool to mine and analyze BGCs, mainly PKS/NRPS
ClusterFinder^N	https://github.com/petercim/ClusterFinder	²⁰	2014	Standalone tool (LINUX and MacOS) to identify BGCs with a non-rule-based approach
eSNaPD^R	http://esnapd2.rockefeller.edu/	32, 33, 34	2014	Web application to mine metagenomic datasets for BGCs
EvoMining^N	http://148.247.230.39/newevomining/new/evomining_web/index.html	²¹	2015	Web application for phylogenomic approach of cluster identification
GNP/Genome Search^R	http://magarveylab.ca/gnp/#!/genome	³⁵	2015	Web application to mine and analyze BGCs, mainly PKS/NRPS
GNP/PRISM^R	http://magarveylab.ca/prism	¹⁸	2015	Web application to mine and analyze BGCs, mainly PKS/NRPS, including glycosylations and structure prediction
MIDDAS-M^N	http://133.242.13.217/MIDDAS-M/	³⁶	2013	Web application to use transcriptome data to identify BGC coordinates in fungal genomes
MIPS-CG^N	http://www.fung-metb.net/	37, 38	2015	Web application to identify BGC coordinates in fungal genomes without transcriptome data
NaPDoS^R	http://napdos.ucsd.edu/	¹⁶	2012	Web application offering phylogenomic analysis of PKS-KS and NRPS-C domains
SMURF^R	http://jcvi.org/smurf/index.php	¹⁹	2010	Web application to mine PKS/NRPS/terpenoid gene clusters in fungal genomes
Software for the analysis of type I PKS and NRPS pathways
ClustScan Professional	http://bioserv.pbf.hr/cms/index.php?page=clustscan	³⁹	2008	Java-based standalone tool to mine for PKS/NRPS BGCs
NP.searcher	http://dna.sherman.lsi.umich.edu/	¹⁷	2009	Web application/standalone tool (LINUX) to mine for PKS/NRPS BGCs
NRPS-PKS/SBSPKS	http://www.nii.ac.in/~pksdb/sbspks/master.html	40, 41	2010	Web application to mine for PKS BGCs
SEARCHPKS	http://linux1.nii.res.in/~pksdb/DBASE/pagesearchpks.html	⁴²	2003	Web application to mine for PKS BGCs
Software for predicting substrate specificities
LSI-based A-domain function predictor	http://bioserv7.bioinfo.pbf.hr/LSIpredictor/AdomainPrediction.jsp	⁴³	2014	Web application to predict A-domain specificities
NRPS/PKS substrate predictor	http://www.cmbi.ru.nl/NRPS-PKS-substrate-predictor/	⁴⁴	2013	Web application to predict A-domain/AT-domain specificities
NRPSpredictor/NRPSpredictor2	http://nrps.informatik.uni-tuebingen.de	45, 46	2011	Web application/standalone tool (LINUX, MS Windows, MacOS) to predict A-domain specificities
NRPSsp	http://www.nrpssp.com/	⁴⁷	2012	Web application to predict A-domain specificities
PKS/NRPS Web Server/Predictive Blast Server	http://nrps.igs.umaryland.edu/nrps/	²⁷	2009	Web application to determine domain organization and A-domain specificities
SEARCHGTr	http://linux1.nii.res.in/~pankaj/gt/gt_DB/html_files/searchgtr.html	⁴⁸	2005	Web application to predict glycosyltransferase specificities
SEQL-NRPS	http://services.birc.au.dk/seql-nrps/	⁴⁹	2015	Web application to predict A-domain specificities
Databases focusing on gene clusters
Bactibase	http://bactibase.pfba-lab-tun.org	50, 51	2011	Web accessible database of bacteriocins
ClusterMine360	http://www.clustermine360.ca/	⁵²	2013	Web accessible database of BGCs
ClustScan Database	http://csdb.bioserv.pbf.hr/csdb/ClustScanWeb.html	⁵³	2013	Web accessible database of PKS/NRPS BGCs
DoBISCUIT	http://www.bio.nite.go.jp/pks/	⁵⁴	2015	Web accessible database of PKS/NRPS BGCs
IMG-ABC	http://img.jgi.doe.gov/abc	⁵⁵	2015	Web accessible database of BGCs, tightly integrated into JGI's IMG platform
MIBiG	http://mibig.secondarymetabolites.org	⁵⁶	2015	Web accessible repository of BGCs
Recombinant ClustScan Database	http://csdb.bioserv.pbf.hr/csdb/RCSDB.html	⁵⁷	2013	Database of in silico recombined BGCs
Databases focusing on bioactive compounds
Antibioticome	http://magarveylab.ca/antibioticome	Unpublished	2015	Web accessible database on compounds, compound families and modes of action
ChEBI	https://www.ebi.ac.uk/chebi/	⁵⁸	2015	Web accessible database and ontology on compounds focused on small molecules
ChEMBL	https://www.ebi.ac.uk/chembl/	⁵⁹	2015	Web accessible database on bioactive compounds with drug-like properties
ChemSpider	http://www.chemspider.com/	⁶⁰	2015	Web accessible database on structures and properties of over 35 million structures
KNApSAcK database	http://kanaya.aist-nara.ac.jp/KNApSAcK/	61, 62	2015	Web accessible database on compounds; standalone version of KNApSAcK metabolite database available
NORINE	http://bioinfo.lifl.fr/norine	63, 64	2015	Web accessible database on NRPs
Novel Antibiotics Database	http://www.antibiotics.or.jp/journal/database/database-top.htm	Unpublished	2008	Web accessible database on compounds
PubChem	http://pubchem.ncbi.nlm.nih.gov/	⁶⁵	2015	Web accessible database on compounds and bioactivities; source data available for download
StreptomeDB	http://www.pharmaceutical-bioinformatics.de/streptomedb	66, 67	2015	Web accessible database on compounds produced by streptomycetes; download of compounds and metadata in SD format.
Metabolomics tools
Cycloquest	http://cyclo.ucsd.edu	⁶⁸	2011	Web application to correlate tandem MS data of cyclopeptides with gene clusters
GNPS	http://gnps.ucsd.edu/	unpublished	2015	Generic metabolomics portal to analyze MS/MS data (dereplication and molecular networking)
GNP/iSNAP	http://magarveylab.ca/gnp/	35, 69, 70, 71	2015	Web application to automatically identify metabolites in MS/MS data based on genomic data
NRPquest	http://cyclo.ucsd.edu	⁷²	2014	Web application to correlate NRP tandem MS data with gene clusters
Pep2Path	http://pep2path.sourceforge.net	⁷³	2014	Standalone application to correlate peptide sequence tags with NRP and RiPP BGCs
RiPPquest	http://cyclo.ucsd.edu	⁷⁴	2014	Web application to correlate RiPP tandem MS data with gene clusters

Table 2

High-throughput metabolic modeling tools that can facilitate engineering of actinomycetes for secondary metabolite production. Tools are shown in the order of the year they appeared.

Software program	URL	Reference	Year of publication	Main content and/or function
Model SEED	http://seed-viewer.theseed.org/seedviewer.cgi?page=ModelView	⁹⁹	2010	First online high-throughput metabolic modeling tool
MEMOSys	https://memosys.i-med.ac.at/MEMOSys/home.seam	¹⁰⁰	2011	Allows management, storage, and development of metabolic models
SuBliMinaL Toolbox	http://www.mcisb.org/resources/subliminal/	¹⁰¹	2011	Has strengths in managing chemical information for metabolites in a metabolic model
FAME	http://f-a-m-e.fame-vu.vm.surfsara.nl/ajax/page1.php	¹⁰²	2012	Allows streamlined analysis of a newly built metabolic model using various simulation methods
GEMSiRV	http://sb.nhri.org.tw/GEMSiRV/en/GEMSiRV	¹⁰³	2012	Allows metabolic model reconstruction, simulation and visualization
MetaFlux in Pathway Tools	http://bioinformatics.ai.sri.com/ptools/	¹⁰⁴	2012	Provides strong support for predicting, modeling, curating and visualizing metabolic pathways
MicrobesFlux	http://www.microbesflux.org/	¹⁰⁵	2012	Allows both flux balance analysis (FBA) and dynamic FBA of a newly generated metabolic model
RAVEN Toolbox	http://biomet-toolbox.org/index.php?page=downtools-raven	¹⁰⁶	2013	Allows metabolic model reconstruction, simulation and visualization in MATLAB environment
CoReCo	https://github.com/esaskar/CoReCo	¹⁰⁷	2014	Useful for modeling metabolisms of multiple related species
merlin	http://www.merlin-sysbio.org/	¹⁰⁸	2015	Most recently released metabolic modeling program with comprehensive genome annotation functionalities necessary for model generation
antiSMASH	http://www.secondarymetabolites.org	¹³	2015	Provides comprehensive genome mining platform for BGCs; currently the only platform offering automated modeling including secondary metabolite specific reactions

Manual genome mining

Before automated tools (see below) became available, genome mining was undertaken by “manually” identifying key biosynthetic enzymes in genome data. For this, either amino acid sequences of characterized proteins of interest were used as queries for BLAST or PSI-BLAST, or, if alignments of a family of query sequences were available, these were used to generate profile Hidden Markov Models (HMMs), which served as queries using the software HMMer. Gene clusters were then identified by analyzing the genes encoded up- and downstream of the hit sequence. While this approach has been superseded by automatic tools for most of the commonly observed gene cluster types, it is still highly relevant for identifying gene clusters that are not covered by the rulesets of the common tools and for which prototypes have only just been discovered and described. Manual genome mining can be further improved with tools like MultiGeneBlast, which allow BLAST-based analyses of whole operons or gene clusters.
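The last manual step, inspecting the genes encoded up- and downstream of a BLAST or HMMer hit, amounts to a simple coordinate filter. The following sketch assumes a hypothetical list of `(name, start, end)` tuples for one contig and an arbitrary 20 kb flank; neither the data format nor the window size comes from the tools discussed here.

```python
def flanking_genes(genes, hit_name, flank_bp=20000):
    """Return the names of all genes overlapping the region that extends
    flank_bp up- and downstream of the hit, in chromosomal order.

    genes: list of (name, start, end) tuples on one contig (hypothetical format).
    """
    hit = next(g for g in genes if g[0] == hit_name)
    lo, hi = hit[1] - flank_bp, hit[2] + flank_bp
    ordered = sorted(genes, key=lambda g: g[1])
    return [name for name, start, end in ordered if end >= lo and start <= hi]


# Made-up coordinates for illustration only.
contig = [
    ("orfA", 0, 900),
    ("orfB", 1000, 2000),
    ("hit", 30000, 33000),
    ("orfC", 34000, 35000),
    ("orfD", 60000, 61000),
]
print(flanking_genes(contig, "hit"))  # ['hit', 'orfC']
```

The resulting gene list is what a researcher would then annotate by hand, or pass to a multi-gene homology search.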

Tools for identification of BGCs

Identifying BGCs with BLAST and HMMer works very well, with low false positive rates, for many different classes of secondary metabolites, for example polyketides (PKs) synthesized by type I or type II PKS, ribosomally and post-translationally modified peptides (RiPPs), or NRPs. Therefore, a number of tools have been developed that use rule-based approaches, i.e., the specific search for distinct enzymes or enzymatic domains (Fig. 1). BAGEL [28, 29, 30] is a web-based comprehensive mining suite to identify and characterize RiPPs in microbial genomes. BAGEL provides an annotation-independent identification of the genes encoding precursor peptides, classification of the RiPP types as well as a database of known RiPPs. A particularly wide variety of tools exists for identifying the BGCs of type I PKS, NRPS and hybrid PKS/NRPS. ClustScan is a Java-based desktop application that offers mining for PKS and NRPS gene clusters in a convenient graphical user interface. ClustScan was used to compile and analyze the data contained in the ClustScan database (see below). NP.searcher is a web-based software program with an emphasis on structure prediction of the putative peptide or polyketide metabolites. NaPDoS uses BLAST and HMMer to identify ketosynthase domain (in PKS) and condensation domain (in NRPS) encoding genes in genomic and metagenomic datasets and provides a detailed phylogenetic analysis of these domains, which are then classified into functional categories. GNP/Genome Search [35, 69, 78] and GNP/PRISM are web-based tools to mine for and analyze PKS and NRPS pathways, including identification of similar known pathways, the latter with an emphasis on the prediction of putative products. They are closely interconnected with the metabolomics platform iSNAP, which uses information on predicted products to identify corresponding peaks in liquid chromatography/tandem mass spectrometry (LC-MS/MS) data (see the section on metabolomics tools for natural product identification below).
The Secondary Metabolite Unknown Region Finder SMURF can detect fungal PKS, NRPS and terpenoid gene clusters involving dimethylallyltryptophan synthase-type prenyltransferases. With pipelines such as CLUster SEquence ANalyzer (CLUSEAN), there are also tools available that can automate the analysis of larger datasets using scripts instead of interactive web pages. While the tools mentioned above are specialized in detecting and analyzing specific classes of secondary metabolites, antiSMASH [13, 14, 15] provides detection rules for 44 different classes and subclasses of secondary metabolites. In addition to the identification of gene clusters, antiSMASH also provides detailed annotation of the domain structures of modular PKS and NRPS, analysis of lanthipeptide pathways, substrate predictions, genome-scale metabolic modeling and comparative genomics tools to identify conserved subclusters biosynthesizing building blocks, similar gene clusters in other sequenced genomes and in the Minimum Information about a Biosynthetic Gene cluster (MIBiG)-standard dataset. With this functionality, antiSMASH is currently the most comprehensive software for mining microbial genomes for BGCs. In the future, it is planned to extend antiSMASH as a generic platform integrating various tools such as CRISPy-web, a web-based tool to design single guide RNAs (sgRNAs) for CRISPR applications (Blin et al. in this issue). All rule-based BGC-mining approaches can precisely identify BGCs of known biosynthetic types, but fail to identify pathways that use non-homologous enzymes or enzymes with biochemistry that is presently unknown. However, there are some alternative approaches that try to identify BGCs independently of pre-defined rulesets.
The software ClusterFinder, which is also implemented as an alternative cluster detection algorithm in antiSMASH, uses an HMM-based approach to detect chromosomal regions in genomes that aggregate protein domains associated with secondary metabolite biosynthetic pathways. The EvoMining approach identifies gene clusters based on the observation that many BGCs encode isoenzymes that are closely related to enzymes of primary metabolism but display a different phylogeny. By scanning the genomes for the occurrence of such enzymes, it is possible to detect secondary metabolite BGCs without relying on their conserved enzymology.
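To illustrate how an HMM can flag regions that aggregate biosynthesis-associated domains, the sketch below decodes a two-state model ("background" versus "bgc") over a chromosome-ordered list of domain labels with the Viterbi algorithm. It is a toy under stated assumptions and not ClusterFinder's actual model: the domain labels and all emission, transition and start probabilities are invented for this example, and Viterbi decoding is used here simply as one standard way to read a state path out of an HMM.

```python
import math

STATES = ("background", "bgc")


def viterbi(domains, emit, trans, start, floor=1e-3):
    """Most likely state path for a sequence of domain labels (log space).
    Domains missing from an emission table get the small probability `floor`."""
    def log_emit(state, dom):
        return math.log(emit[state].get(dom, floor))

    score = {s: math.log(start[s]) + log_emit(s, domains[0]) for s in STATES}
    back = []
    for dom in domains[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(trans[p][s]))
            score[s] = prev[best] + math.log(trans[best][s]) + log_emit(s, dom)
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# All numbers below are invented for illustration.
emit = {
    "background": {"ribosomal": 0.4, "transporter": 0.3,
                   "ketosynthase": 0.02, "adenylation": 0.02},
    "bgc": {"ribosomal": 0.01, "transporter": 0.2,
            "ketosynthase": 0.3, "adenylation": 0.3},
}
trans = {
    "background": {"background": 0.9, "bgc": 0.1},
    "bgc": {"background": 0.1, "bgc": 0.9},
}
start = {"background": 0.9, "bgc": 0.1}

domains = ["ribosomal", "ribosomal", "ketosynthase", "adenylation",
           "transporter", "ketosynthase", "ribosomal", "ribosomal"]
path = viterbi(domains, emit, trans, start)
print(path)
# ['background', 'background', 'bgc', 'bgc', 'bgc', 'bgc', 'background', 'background']
```

Note that the transporter in the middle is kept inside the predicted region: leaving and re-entering the "bgc" state costs more than emitting one less typical domain, which is the behavior that lets such a model tolerate non-biosynthetic genes within a cluster.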

Tools for analyzing specific enzymes

In addition to the general genome mining tools mentioned above, a whole set of tools was developed specifically to provide automated specificity prediction for NRPS A-domains and to detect the enzymatic domains in multi-modular PKS and NRPS, such as SEARCHPKS or NRPS-PKS/SBSPKS [40, 41]. One of the milestones of computational analysis of secondary metabolite biosynthetic pathways was the deciphering of the NRPS A-domain specificity-conferring code by Stachelhaus et al. and Challis et al., who found that conserved amino acids near the active site of NRPS A-domains can be used to map the substrate specificity of these enzymes, which is an important prerequisite for the computational prediction of the biosynthetic products. The PKS/NRPS Web Server, Predictive Blast Server, and 2metDB deliver predictions based on BLAST analyses against the signatures determined by Challis et al. Later tools introduced further methods: profile HMMs (for example an algorithm by Minowa et al., NRPSsp, and the NRPS/PKS substrate predictor); machine learning based on transductive Support Vector Machines (SVMs), as implemented for example in NRPSpredictor [45, 46]; Latent Semantic Indexing, which is used by the LSI-based A-domain function predictor; and the Sequence Learner algorithm, which is used in SEQL-NRPS. There have also been first successful reports on using structural bioinformatics, involving crystal structures or homology models and docking analyses with putative substrates, to help predict the substrate specificities of A-domains. However, this approach is currently very compute-intensive, and no automated tools have been reported so far. For other enzymes involved in secondary metabolite biosynthesis, only a few tools are available. PKSIIIexplorer uses transductive SVMs to classify type III PKSs. SEARCHGTr is currently the only tool that offers prediction of glycosyltransferase specificities.
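The idea behind a specificity-conferring code can be sketched as a lookup: read the residues at a fixed set of alignment columns and compare the resulting signature with signatures of A-domains whose substrates are known. In the sketch below the column positions, the four-residue signatures and the substrate names are placeholders invented for this example; they are not the published codes, and the simple identity count stands in for the BLAST, HMM or SVM scoring used by the tools named above.

```python
def extract_signature(aligned_seq, positions):
    """Concatenate the residues found at the given 0-based alignment columns."""
    return "".join(aligned_seq[i] for i in positions)


def predict_substrate(signature, known):
    """Return (substrate, identical_positions) for the closest known signature."""
    def identity(a, b):
        return sum(x == y for x, y in zip(a, b))

    best = max(known, key=lambda substrate: identity(signature, known[substrate]))
    return best, identity(signature, known[best])


# Placeholder data: NOT real specificity codes or real alignment positions.
POSITIONS = [2, 5, 7, 11]
KNOWN = {"substrate_1": "DAWT", "substrate_2": "DVGE"}

query = "MKDLLVQGLSVQG"          # hypothetical aligned A-domain fragment
sig = extract_signature(query, POSITIONS)
print(sig)                        # DVGQ
print(predict_substrate(sig, KNOWN))  # ('substrate_2', 3)
```

A real predictor would additionally report how reliable a partial match is, since a signature that differs at even one position may correspond to a different substrate.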

Databases focusing on biosynthesis genes and their clusters

All the tools mentioned in the previous section can be used to identify or analyze secondary metabolite BGCs or specific enzymes of the pathways in user-submitted gene cluster/genome data. To allow cross-species comparison, several databases have been developed focusing on different aspects of secondary metabolism. The ClustScan database, DoBISCUIT, and ClusterMine360 provide collections of a limited set of mostly hand-curated PKS and NRPS gene clusters. The recombinant ClustScan database r-CSDB in addition contains more than 20,000 in silico recombined sequences that are expected to produce novel molecules. Recently, the MIBiG standard has been developed. In the course of this project, a MIBiG repository was generated, containing more than 1000 characterized BGCs; more than 400 of them were manually annotated and curated by the original researchers who carried out the experimental characterizations. In addition to these databases, data collections were also established based on large-scale sequencing efforts. The Integrated Microbial Genomes: Atlas of Biosynthetic Gene Clusters (IMG-ABC) is a very large data collection that is based on manually curated BGCs but also includes automatically mined BGCs from public genome data and from genomes that were sequenced at the US Department of Energy Joint Genome Institute (JGI). Currently, IMG-ABC is the largest collection of BGC data. So far, the genome data used for genome mining of whole biosynthetic pathways have almost exclusively originated from cultivable organisms. Considering that only a small percentage of environmental bacteria can be grown in culture, the unculturable microorganisms remain a huge and currently under-exploited resource. The environmental Surveyor of NAtural Product Diversity (eSNaPD) [32, 33, 85] is a system to map amplicon datasets to known BGCs. As eSNaPD can also use location metadata, the data can be analyzed based not only on the sequences but also on information about the sampling sites.

Databases focusing on compounds

In addition to general public molecule databases, such as PubChem, ChEMBL [86, 87], and ChEBI [88, 89], which contain information on a very large number of chemical compounds including secondary metabolites, commercial natural product compound databases are available, including antiBASE (Wiley-VCH, Weinheim, Germany) and the Dictionary of Natural Products (Taylor and Francis Group LLC, USA). Recently, several freely accessible or openly licensed databases have also been developed. The KNApSAcK [61, 62] website offers information on various secondary metabolites with respect to their basic chemical properties and bioactivities. Although the KNApSAcK system is mostly focused on plant metabolites, it also contains information on microbial bioactive compounds. A component of the KNApSAcK system dealing with metabolites can also be downloaded and used as a standalone Java-based tool. StreptomeDB [66, 67] is a database focusing on secondary metabolites isolated from streptomycetes. Bactibase [50, 51] is focused on ribosomally synthesized antimicrobial peptides, while NORINE [63, 64] is a hand-curated database of NRPs and their activities.

Metabolomics tools for natural product identification

LC-MS and nuclear magnetic resonance (NMR)-based metabolomics approaches are gaining increasing importance in natural product studies [for reviews, see references 90, 91]. While some of the tools or databases on natural product compounds and their BGCs already have histories of more than ten years, computational approaches that use cheminformatics to automatically classify and map metabolomics data (i.e., MS and MS/MS data) to natural product families and corresponding biosynthetic pathways have been published only very recently. This has been especially successful for identifying peptides (RiPPs and NRPs) in the mass spectra of complex samples. Software programs for these approaches include Pep2Path, RiPPquest, NRPquest, and Cycloquest. The GNP/iSNAP (From Genes to Natural Products) web application provides a user-friendly interface to carry out analyses of MS/MS data of NRP-producing strains [35, 69, 70]. Signals corresponding to NRPs or NRP analogs are detected by comparison to databases containing computationally generated fragments of known secondary metabolites (e.g., those extracted from NORINE or PubChem). Recently, iSNAP has also been extended to identify PK compounds and analogs of known molecules. The Global Natural Products Social Molecular Networking system (GNPS) provides workflows for automated spectra deconvolution, molecular networking to identify compound families, and dereplication against a database of known molecules (unpublished). In addition to the analysis function, GNPS has a social network component that allows users to share their mass spectrometry datasets (including continuous identification by re-analyzing the deposited datasets against updated spectra libraries) or datasets of reference compounds.
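Comparing a measured spectrum with computationally generated fragments can be illustrated with a minimal shared-peak count. The sketch below is a generic simplification and does not reproduce the scoring of iSNAP, GNPS or any other tool mentioned above; the compound names, fragment masses and the 0.01 Da tolerance are hypothetical values chosen for the example.

```python
def shared_peaks(observed, predicted, tol=0.01):
    """Count predicted fragment masses that have an observed peak within tol (Da)."""
    return sum(any(abs(o - p) <= tol for o in observed) for p in predicted)


def rank_candidates(observed, library, tol=0.01):
    """Rank library compounds by the fraction of their predicted fragments
    that are matched in the observed spectrum (best first)."""
    scored = [
        (shared_peaks(observed, fragments, tol) / len(fragments), name)
        for name, fragments in library.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical fragment library and observed peak list (masses in Da).
library = {
    "compound_A": [114.09, 227.18, 340.26],
    "compound_B": [100.08, 247.14, 360.23],
}
observed = [100.076, 114.091, 227.175, 340.262, 500.3]

ranking = rank_candidates(observed, library)
print(ranking[0])  # (1.0, 'compound_A')
```

Published tools refine this basic comparison considerably, for example by weighting peaks by intensity and by estimating the statistical significance of a match.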

High-throughput metabolic modeling tools

The availability of genomic information allows the generation of genome-scale metabolic models, which have now become one of the standard tools in the systems biology and metabolic engineering communities. This technology links genotype, including BGCs of secondary metabolites, to the metabolic phenotype of secondary metabolite-producing microorganisms. A genome-scale metabolic model is a type of mathematical model that is based on mass balances of all the metabolites known or predicted to be present in an organism of interest and is represented as a large-scale stoichiometric matrix that can be simulated with various numerical optimization tools. One of the unique features of a genome-scale metabolic model is the description of gene-protein-reaction (GPR) associations in a Boolean format; the GPR associations logically connect genomic information with the organism's metabolism, and hence enable prediction of various metabolic phenotypes using gene-level information. In the field of secondary metabolites, genome-scale metabolic models have largely contributed to studies on (i) predicting intracellular flux distributions of actinomycetes under specific environmental/genetic conditions [93, 94] and (ii) identifying gene manipulation targets for overproduction of target secondary metabolites [95, 96]. Although development of a genome-scale metabolic model is a laborious and time-consuming procedure, involving a total of 96 steps in a protocol, a large fraction of the procedure can now be automated. Such high-throughput metabolic modeling tools allow streamlined system-wide metabolic studies for newly sequenced genomes of actinomycetes and other secondary metabolite producers, whose number keeps growing due to increased attention to novel antibiotics discovery. Among currently available high-throughput metabolic modeling tools, to our knowledge, only Model SEED has been deployed to reconstruct multiple actinomycete species in a high-throughput manner for large-scale metabolic studies.
Currently available high-throughput modeling tools are summarized in Table 2. For a detailed comparison of high-throughput metabolic modeling tools, see Hamilton and Reed, and Dias et al. Finally, modeling secondary metabolite producers poses two challenges: none of the available metabolic modeling tools considers secondary metabolite biosynthetic reactions and their relevant precursors, and most secondary metabolites are biosynthesized in the stationary phase rather than in the exponential growth phase, which conflicts with the pseudo-steady-state assumption of this modeling approach. These special circumstances will therefore require additional efforts in optimizing the metabolic models.

Table 2

High-throughput metabolic modeling tools that can facilitate engineering of actinomycetes for secondary metabolite production. Tools are shown in the order of the year they appeared.
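For orientation, the pseudo-steady-state assumption mentioned in this section enters through the mass-balance constraint of the optimization problem that is commonly solved on a stoichiometric matrix (flux balance analysis). The formulation below is a generic sketch; the symbols are assumptions introduced here and do not appear in the original text: S is the stoichiometric matrix (metabolites by reactions), v the vector of reaction fluxes, c the objective weights (for example, selecting a biomass reaction), and v^min and v^max the flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\mathsf{T}} v \\
\text{subject to}\quad & S\,v = 0, \\
& v^{\min} \le v \le v^{\max}
\end{aligned}
```

The constraint S v = 0 states that no intracellular metabolite accumulates or is depleted, which is the assumption that stationary-phase production of secondary metabolites calls into question.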

The secondary metabolite bioinformatics portal

The field of secondary metabolite bioinformatics is changing rapidly, with new tools being released and old services discontinued. We therefore started the web portal SMBP as a one-stop access point containing a manually curated collection of all the relevant tools and databases for ‘omics-based secondary metabolism research, including short descriptions of the tools, literature references and links to the web sites and/or download pages (Fig. 2). Currently, the tools and databases are assigned to one or more categories of content or functionality: secondary metabolite compounds, genome mining, PKS/NRPS analysis, specificity predictors, metabolomics analysis, metabolic modeling and generic tools. A full-text search engine provides easy access to the relevant information. The SMBP is openly available at http://www.secondarymetabolites.org, and the Markdown source code for the portal is available at https://bitbucket.org/secmetbioinf/portal.

Fig. 2

A screenshot of the antiSMASH page in the Secondary Metabolite Bioinformatics Portal at http://www.secondarymetabolites.org.

Future challenges

Despite significant advances in computational approaches to identify and characterize BGCs, several challenges remain to be addressed in the near future. Even for well-studied secondary metabolite classes such as PK or NRP pathways, prediction of the core scaffold structure of a compound is incomplete, either because the biochemical knowledge on these systems is not yet implemented in the software (relatively easy to fix) or because the relevant biochemical knowledge is not sufficiently available to serve as the basis for novel computational algorithms (more difficult to overcome). In particular, for machine learning-based approaches, the availability of the medium- to large-scale biochemical data required to train good models is a limiting factor in many cases. Another unsolved problem is the currently inaccurate prediction of gene cluster borders. The most widely used genome mining software, antiSMASH, simply assigns n kb upstream or downstream of the core biosynthetic genes to the cluster (for example, n = 20 kb for PKS and NRPS clusters, and n = 10 kb for lanthipeptides). SMURF, which addresses fungal PK, NRP and terpenoid metabolites, uses a different approach; a statistical analysis of 22 clusters of the model strain Aspergillus fumigatus led to the identification of a total of 27 protein domains that commonly co-occur with the PK, NRP and terpenoid biosynthetic genes. The occurrence of these domains in genes flanking the core biosynthetic genes, together with the intergenic distance, is then used to calculate the cluster borders. Another promising approach to predicting BGC borders is to use comparative genomics data; genes within a putatively identified BGC that are conserved among other producers of similar compounds are likely to belong to the BGC, whereas genes not belonging to the cluster are more divergent.
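The fixed-window border rule described above for antiSMASH can be sketched in a few lines. This is an illustration of the rule as stated in the text, not antiSMASH code: the `Gene` class, the function names and the clamping to the contig ends are assumptions, and the window is applied to both flanks of the core genes. Only the three extension sizes quoted in the text are included.

```python
from dataclasses import dataclass

# Extension sizes quoted in the text, in base pairs. Other cluster types are
# deliberately left out because the text does not give values for them.
EXTENSION_BP = {"PKS": 20_000, "NRPS": 20_000, "lanthipeptide": 10_000}


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record with 0-based contig coordinates."""
    name: str
    start: int
    end: int


def fixed_window_borders(core_genes: list[Gene], cluster_type: str,
                         contig_length: int) -> tuple[int, int]:
    """Extend the span of the core biosynthetic genes by n kb on each side."""
    n = EXTENSION_BP[cluster_type]
    start = max(0, min(g.start for g in core_genes) - n)             # clamp at contig start
    end = min(contig_length, max(g.end for g in core_genes) + n)     # clamp at contig end
    return start, end


def genes_in_cluster(all_genes: list[Gene], borders: tuple[int, int]) -> list[str]:
    """Names of genes that lie entirely within the predicted borders."""
    start, end = borders
    return [g.name for g in all_genes if g.start >= start and g.end <= end]


# Hypothetical example: one core PKS gene on a 200 kb contig.
core = [Gene("pksA", 50_000, 60_000)]
borders = fixed_window_borders(core, "PKS", 200_000)  # (30000, 80000)
```

The sketch makes the limitation visible: every gene inside the window is assigned to the cluster regardless of its function, which is why approaches based on co-occurring domains or comparative genomics are discussed as alternatives.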
An algorithm implementing this comparative genomics strategy for filamentous fungi (MIPS-CG) has been described by Takeda et al. For fungal BGCs, it has further been demonstrated that, in addition to the mining and analysis methods described above, transcriptome data can provide valuable information on the borders of the BGCs.22, 36 For prokaryotes, to the best of our knowledge, no such observations have been reported so far.

Analyses involving the integration of different kinds of data (e.g., genome with transcriptome or metabolome data) generally suffer from poor integration of the functionalities available across the tools and from the requirement for specific input and output formats; these barriers make the relevant software difficult to use for researchers not familiar with bioinformatics. In fact, this is a chronic problem in bioinformatics and systems biology in general. Advances in integrating heterogeneous ‘omics data would offer new dereplication opportunities to identify already known metabolites at a very early stage of the metabolite discovery process. In relation to this, proteome data can deliver important information on secondary metabolite biosynthesis when they are correlated with metabolome data (e.g., obtained by LC-MS) and bioactivity profiles. Using a set of different growth conditions, which lead to the differential expression of BGCs and thus different bioactivity profiles, Gubbens et al. were able to correlate the expression levels of biosynthetic enzymes with the occurrence of secondary metabolites. Using this approach, it was possible to identify juglomycin C and the corresponding gene cluster in Streptomyces sp. MBT70.

Furthermore, the power of combining large-scale genome and metabolome data with computational approaches was explored to identify novel secondary metabolites. Doroghazi et al. identified 11,422 gene clusters encoding PKSs, NRPSs, NRPS-independent siderophores, lanthipeptides and thiazole-oxazole modified microcins in 830 genome sequences of actinomycetes. The gene cluster sequences were then clustered based on a combination of different distance metrics, resulting in 4122 gene cluster families. For a subset of 178 analyzed strains, this network was then automatically correlated with high-resolution mass spectrometric data of known compounds, leading to the automatic identification of 110 molecules and 27 molecule families. Thus, for some of these molecule families, previously unidentified gene clusters could be automatically related to the produced metabolite. Taken together, as demonstrated in the studies discussed above, it is highly desirable to interconnect the existing tools and data and to automate the analysis workflows for streamlined characterization of genomes and their resulting secondary metabolites. Current bottlenecks in such integrative approaches can be relieved by standardizing APIs and data structures for programmatic access to the different tools.
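The grouping of gene clusters into families from pairwise distances, as described above for the study by Doroghazi et al., can be illustrated with a simple single-linkage scheme. This is a minimal sketch under stated assumptions, not the published method: the original work combined several distance metrics, whereas here a single precomputed distance per pair, the threshold value, the single-linkage rule and all identifiers are hypothetical.

```python
def cluster_families(distances: dict[tuple[str, str], float],
                     threshold: float) -> list[list[str]]:
    """Group gene clusters into families by single linkage.

    Two gene clusters end up in the same family if they are connected by a
    chain of pairs whose distance is at or below the threshold.
    """
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        root_a, root_b = find(a), find(b)  # also registers both identifiers
        if dist <= threshold and root_a != root_b:
            parent[root_b] = root_a

    families: dict[str, set[str]] = {}
    for member in list(parent):
        families.setdefault(find(member), set()).add(member)
    return sorted(sorted(members) for members in families.values())


# Hypothetical pairwise distances between four gene clusters (0 = identical).
distances = {
    ("bgc1", "bgc2"): 0.10,
    ("bgc2", "bgc3"): 0.20,
    ("bgc1", "bgc3"): 0.60,
    ("bgc1", "bgc4"): 0.95,
    ("bgc2", "bgc4"): 0.90,
    ("bgc3", "bgc4"): 0.90,
}
families = cluster_families(distances, threshold=0.3)
# -> [['bgc1', 'bgc2', 'bgc3'], ['bgc4']]
```

A shared, simple data structure of this kind (identifiers plus pairwise distances) is also an example of what standardized interfaces between tools could exchange.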

Implications for synthetic biology applications in natural product studies

While the availability of computational tools provides new possibilities for identifying and characterizing novel secondary metabolites, such tools are also essential for the development of synthetic biology strategies, which aim at the efficient production of rationally designed molecules. Although several generic synthetic biology tools exist to predict, prioritize, model, select and implement pathways, as reviewed in reference, only a few reports exist on their use to engineer natural product biosynthetic pathways. In particular, engineering PKS and NRPS megasynthases will need further emphasis; from a formal perspective, these enzymes are excellent candidates for synthetic biology approaches because they display a modular organization and a well-defined division of “enzymatic tasks”, which invites plug-and-play approaches. Although many successful module and/or domain replacements have been reported during the last 15 years that led to rationally [e.g., references114, 115, 116, 117, 118] or combinatorially [e.g., references] engineered products, the failure rates are still high and the yields obtained with the engineered assembly lines usually decrease severely. The main reason is likely that the modified enzymes were designed mostly on the basis of sequence divergence at the linker regions between the enzymatic domains, or even by trial and error; such designs may have interfered with the 3D structure and the intra- and intermolecular protein–protein interactions within the highly complex megaenzymes, causing the suboptimal performance of the engineered assembly lines (i.e., inactivity or drastically decreased yields). Because structural data of not only separate enzymatic domains but also complete modules for both NRPS120, 121 and type I PKS122, 123 recently became available, they now offer the molecular background to overcome current challenges in engineering the PKS or NRPS assembly lines.
Along the same lines, biochemical studies have been carried out that specifically address how different domains interact with one another within the PKS or NRPS assembly lines and may help to better understand the molecular mechanisms within the assembly lines [e.g., references124, 125]; this knowledge has yet to be integrated into synthetic biology design software. Certainly, these approaches will be supported by the availability of heterologous expression and genome engineering tools like CRISPR, which recently also became available for secondary metabolite producers.126, 127, 128, 129 These technologies will drastically reduce the effort needed to generate the required recombinant strains and thus allow the high-throughput generation of many variants.

Conclusions

Genome mining and other ‘omics-based approaches to identify and characterize secondary metabolites and their producers have become essential technologies complementing the classical approaches of natural product discovery. This trend is manifested by an increasing number of new and improved bio- and cheminformatic tools and databases bridging computational biology and wet-lab work in the field. Because of the ever-growing number of computational tools and databases dedicated to secondary metabolites, we herein release the SMBP (http://www.secondarymetabolites.org), where researchers in the field can explore diverse tools and databases in one place. The SMBP is expected to enable users to compare the utility of these tools and to make further contributions to the field of secondary metabolites.

124 in total

1. Active-site residue, domain and module swaps in modular polyketide synthases.

Authors: Francesca Del Vecchio; Hrvoje Petkovic; Steven G Kendrew; Lindsey Low; Barrie Wilkinson; Rachel Lill; Jesús Cortés; Brian A M Rudd; Jim Staunton; Peter F Leadlay
Journal: J Ind Microbiol Biotechnol Date: 2003-06-14 Impact factor: 3.346

2. KNApSAcK family databases: integrated metabolite-plant species databases for multifaceted plant research.

Authors: Farit Mochamad Afendi; Taketo Okada; Mami Yamazaki; Aki Hirai-Morita; Yukiko Nakamura; Kensuke Nakamura; Shun Ikeda; Hiroki Takahashi; Md Altaf-Ul-Amin; Latifah K Darusman; Kazuki Saito; Shigehiko Kanaya
Journal: Plant Cell Physiol Date: 2011-11-28 Impact factor: 4.927

Review 3. Computational tools for the synthetic design of biochemical pathways.

Authors: Marnix H Medema; Renske van Raaphorst; Eriko Takano; Rainer Breitling
Journal: Nat Rev Microbiol Date: 2012-01-23 Impact factor: 60.633

4. CRISPR-Cas9 Based Engineering of Actinomycetal Genomes.

Authors: Yaojun Tong; Pep Charusanti; Lixin Zhang; Tilmann Weber; Sang Yup Lee
Journal: ACS Synth Biol Date: 2015-04-07 Impact factor: 5.110

Review 5. In silico tools for the analysis of antibiotic biosynthetic pathways.

Authors: Tilmann Weber
Journal: Int J Med Microbiol Date: 2014-02-19 Impact factor: 3.473

Review 6. Microbial genome mining for accelerated natural products discovery: is a renaissance in the making?

Authors: Brian O Bachmann; Steven G Van Lanen; Richard H Baltz
Journal: J Ind Microbiol Biotechnol Date: 2013-12-17 Impact factor: 3.346

7. Construction and completion of flux balance models from pathway databases.

Authors: Mario Latendresse; Markus Krummenacker; Miles Trupp; Peter D Karp
Journal: Bioinformatics Date: 2012-01-18 Impact factor: 6.937

8. BAGEL: a web-based bacteriocin genome mining tool.

Authors: Anne de Jong; Sacha A F T van Hijum; Jetta J E Bijlsma; Jan Kok; Oscar P Kuipers
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

9. Genome-scale metabolic network guided engineering of Streptomyces tsukubaensis for FK506 production improvement.

Authors: Di Huang; Shanshan Li; Menglei Xia; Jianping Wen; Xiaoqiang Jia
Journal: Microb Cell Fact Date: 2013-05-24 Impact factor: 5.328

10. Motif-independent prediction of a secondary metabolism gene cluster using comparative genomics: application to sequenced genomes of Aspergillus and ten other filamentous fungal species.

Authors: Itaru Takeda; Myco Umemura; Hideaki Koike; Kiyoshi Asai; Masayuki Machida
Journal: DNA Res Date: 2014-04-11 Impact factor: 4.458
