Literature DB >> 34042443

A Machine Learning Bioinformatics Method to Predict Biological Activity from Biosynthetic Gene Clusters.

Abstract

Research in natural products, the genetically encoded small molecules produced by organisms in an idiosyncratic fashion, deals with molecular structure, biosynthesis, and biological activity. Bioinformatics analyses of microbial genomes can successfully reveal the genetic instructions, biosynthetic gene clusters, that produce many natural products. Genes to molecule predictions made on biosynthetic gene clusters have revealed many important new structures. There is no comparable method for genes to biological activity predictions. To address this missing pathway, we developed a machine learning bioinformatics method for predicting a natural product's antibiotic activity directly from the sequence of its biosynthetic gene cluster. We trained commonly used machine learning classifiers to predict antibacterial or antifungal activity based on features of known natural product biosynthetic gene clusters. We have identified classifiers that can attain accuracies as high as 80% and that have enabled the identification of biosynthetic enzymes and their corresponding molecular features that are associated with antibiotic activity.

Entities: Chemical Disease Species

Year: 2021 PMID： 34042443 PMCID： PMC8243324 DOI： 10.1021/acs.jcim.0c01304

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Natural products from bacteria, fungi, and plants have long been a rich source of useful molecules. Some 23.5% of FDA-approved drugs are small molecule natural products or chemically modified analogs of natural products, and an additional 22.5% are synthetic drugs designed to hit the same target as a natural product.[1] High-throughput screening and bioinformatics analysis have increased efficiencies in steps on the discovery to drug pipeline, but identifying desired functions such as antibiotic activity require production, purification, and assaying, which collectively form a major bottleneck.[2] The historic order, assaying, production, and purification, has the same collective bottleneck but with the added disadvantage of ultimately identifying a previously discovered molecule.[3,4] Several different mass spectrometry or NMR-based methods can be used to increase efficiencies in some steps, but these in themselves do not obviate the need to have molecules to test.[5−8] Overall, the process of going from activity in an extract of a bacterial culture to the discovery of a novel active molecule remains a time-consuming roadblock in using natural products as a source of antibiotics and other therapeutic agents. One step that can be performed efficiently at scale is identifying promising biosynthetic gene clusters (BGC) in the growing number of available bacterial genomes. However, the “promising” rankings that emerge have been based on genes to structure connections. Existing genome mining tools, which include antiSMASH,[9] PRISM,[10] Deep-BGC,[11] SMURF,[12] as well as others that are geared toward identifying specific classes of natural products such as PKminer,[13] 2metDB,[14] RiPP Miner,[15] BAGEL,[16] RODEO,[17] and NeuRiPP,[18] can be used to identify BGCs in genomes and compare them to BGCs for known natural products. There are over 147,000 BGC sequences that have been identified by antiSMASH alone,[19,20] and prioritization based on structural novelty, which has resulted in many new molecules being discovered, has not really addressed the issue of finding molecules with useful functions. There are numerous architecturally fascinating molecules being published that lack any identified function. If the activity of a natural product could be predicted from its BGC, searches could be prioritized to increase focus only on those most likely to produce a natural product with the activity of interest. Typically, it is not possible to predict the activity of natural products from a BGC. The presence of known resistance markers or duplication of an essential gene can indicate that a BGC produces an inhibitor for the duplicated gene.[21] Existing tools to prioritize BGCs with resistance markers include the Resistance Gene Identifier (RGI)[22] and the Antibiotic Resistant Target Seeker (ARTS).[23] However, these methods limit discovery to natural products with antibacterial activity and, except in the case of resistance through gene duplication, are less likely to identify natural products that act through novel mechanisms of action. A recently reported genome mining tool, Deep-BGC, uses machine learning to predict antibacterial, cytotoxic, inhibitor, and antifungal activity, but due to a small training set size (370), it does so with only low accuracy.[11] The activity predictions made by Deep-BGC were based only on protein family (PFAM) domains and no other types of genetic feature that could be predictive of activity. PRISM 4, another recently reported method for predicting natural product activity from the sequence of BGCs, first predicted the structure of the natural product from the BGC and then used chemical fingerprints to predict its activity.[24] There is still an unmet need for a more general method of predicting a natural product’s biological activity from its BGC sequence using both biosynthetic genes and other genetic features such as resistance markers. We here report a bioinformatics-based machine learning method to predict natural product activities from BGC sequences. To do this, we broke BGCs down into features that describe the type of genes and biosynthetic capabilities present in the cluster using automatic annotations such as PFAM domains and the Resistance Gene Identifier (RGI). We then trained machine learning models to predict the likelihood that a natural product will have activity such as antibacterial or antifungal with a library assembled from the MiBIG database and the literature. The resulting classifiers predict activity with accuracies up to 80%. In theory, this tool could also be trained to predict any other bioactivity if a high quality training dataset was available.

Results and Discussion

Generating a Training Data Set

In order to develop a machine learning model that can predict natural product bioactivity, we assembled a dataset of known BGCs paired with the activity of their products, represented BGCs as vectors based on the number of times various gene annotations appeared in the cluster, and trained several different binary classifiers on the resulting dataset (Figure A). To assemble a training dataset, we used bacterial BGCs listed in the Minimum Information about a Biosynthetic Gene Custer (MIBiG) database (version 1.4).[25] For each metabolite, we searched the literature to determine what was known about its bioactivity. Specifically, we searched to determine if a metabolite had recorded antibacterial, antifungal, cytotoxic, antitumor, or other activities. We only included compounds that had a documented activity or function or that had been shown to lack antibacterial, antifungal, or antitumor activity in our dataset. We recorded activity as a binary yes/no value and due to the difficulties in comparing the level of activity measured by different assays and did not require a specific level of activity. We assumed that a natural product had only the activities that it was reported to have and not any additional activities that were not reported. It is likely that this assumption will be incorrect in some cases because not every natural product has been exhaustively tested for the activities for which we developed classifiers. This incomplete information will likely contribute to error in predictions. If a BGC was documented to produce multiple products, we counted it as active if one or more of the produced compounds was active. A full list of natural products used in the training set is available in Table S1.

Figure 1

Machine learning method for predicting natural product activities. (A) Schematic illustrating our machine method for predicting natural product bioactivity. First, we generated a dataset of known BGCs and the activity of the natural product they produce, and we then represented the BGCs as a vector denoting the number of times different types of genes occurred in the cluster. Finally, we trained machine learning models to predict bioactivity. (B) Illustration of how example annotations provide information about the structure of erythromycin, e.g., how many polyketide units and sugars are in the molecule, as well as the approximate level of reduction of the polyketide backbone. Gene cluster rendered using Geneious Prime 2020.0.3 (https://www.geneious.com). Using existing bioinformatics techniques, we identified the number of times different families of proteins occurred in the biosynthetic gene cluster. We used the annotations produced by antiSMASH 4.1,[26] protein family (PFAM) classifications, smCOG, CDs motifs, and predictions of monomers for polyketides and nonribosomal peptides. To include more information about the biosynthetic machinery in each BGC, we further broke down some of the PFAM domains that were most associated with activity into sub-PFAM domains using sequence similarity networks (SSNs), which allow for the clustering of protein sequences into groups with high sequence similarity and often similar function.[27,28] These annotations provide information about the types of functional groups or substructures present in the molecule, such as sugars, amines, and halogens. They can also provide information about the general level of oxidation or reduction of the molecule, especially in the case of polyketides where modules require specific domains to reduce the growing polyketide chain (Figure B). We used the Resistance Gene Identifier (RGI) 3.2.1[29] to identify genes that had similarity to known resistance genes, which we expect to be predictive of antibacterial activity. In total, this analysis resulted in 1809 features for 1003 BGCs.

Training and Optimizing Machine Learning Models

We trained and optimized binary classifiers on six different binary classification problems: (1) antibacterial, (2) anti-Gram-positive, (3) anti-Gram-negative, (4) antifungal, antitumor, or cytotoxic (antifungal/antitumor/cytotoxic), (5) antifungal, and (6) antitumor or cytotoxic. We used binary classifiers rather than multilabel classification because many molecules have multiple activities and therefore belong to multiple classes. We chose to make the combined antifungal/antitumor/cytotoxic classifier, by considering a compound active if it has one or more of these activities, since they all indicate activity against eukaryotic cells. We used classifiers available in the Python scikit-learn library to perform classifications. We used random forest with extra-randomized trees, support vector machine (SVM), and logistic regression with regularization because these classifiers enable interpretation of which features are important for predictions. Parameters for each classifier were optimized to maximize average accuracy in a 10-fold cross-validation, a process in which over 10 trials, a different one-tenth of the data is held out from training and used to evaluate classifier accuracy (Figure S1 and Table S2). The balanced accuracy metric in scikit-learn was used to compare the performance of the optimized classifiers. The balanced accuracy metric is a more accurate reflection of classifier performance in the case of imbalanced datasets, where one class dominates the dataset. Our dataset is very imbalanced for the antifungal (20% of the dataset) and anti-Gram-negative (17% of the dataset) classification problems. We compared each classifier to the performance of a classifier trained on a scrambled version of the features, which represents random guessing. All of the scrambled data classifiers had balanced accuracies of approximately 50%; this is the expected result because binary classifiers can trivially achieve 50% balanced accuracy by always guessing the same label, regardless of how balanced the class labels are (see proof in the Supporting Information). All classifiers, except for the antifungal logistic regression classifier, perform significantly better (p < 0.001 for antifungal classification, p < 0.0001 for all other binary classifiers) than their randomized counterpart on 10-fold cross-validation (Figure ). Performance was generally independent of the classification method (Figure ). Classifiers for the antibacterial, anti-Gram-positive, and antifungal/antitumor/cytotoxic classification problems were all highly accurate, with balanced accuracies ranging from 74 to 80% (Figure ). The accuracy for classifying natural products as cytotoxic or antitumor was slightly less accurate (ranging from 69 to 73%). The anti-Gram-negative and antifungal classifiers are the least accurate (with accuracies between 57 and 64% for antifungal and between 66 and 70% for anti-Gram-negative). We used cross-validation to limit the degree that overfitting influenced our estimated accuracies. However, we did use the entire training set to optimize parameters for the classifiers, so it is likely that there is still some degree of overfitting.

Figure 2

Balanced accuracy of classifiers. Balanced accuracy, or the average of the true positive rate and true negative rate, in a 10-fold cross-validation for (A) antibacterial classifiers and (B) antifungal/antitumor classifiers. Solid bars represent classifiers trained on training dataset; hashed bars are trained on randomly scrambled. Significance was determined using a one-way ANOVA. In addition to balanced accuracy, we also generated receiver operator characteristic (ROC) curves and precision–recall curves for each classifier. ROC curves plot the false positive rate of a classifier vs the true positive rate, with a higher area under the curve (AROC), indicating that the classifier has a better true to false positive ratio. Our classifiers perform well with AROCs ranging between 0.57 and 0.79 (Figure S2). Precision–recall curves (Figure S3) plot the recall (x axis), or true positives/(true positives + false negatives), vs precision (y axis), or true positives/(true positives + false positives), and are a better way to gauge the accuracy of classifiers applied to unbalanced datasets. A good classifier will have a higher area under the curve, not sacrificing precision for a higher recall. By both the ROC and precision–recall curve metrics, the antibacterial, anti-Gram-positive, and antifungal/antitumor/cytotoxic classifiers perform very well while the anti-Gram-negative and antifungal classifiers do not perform as well. One possible reason for the lower accuracies of the anti-Gram-negative and antifungal classifiers is that natural products with anti-Gram-negative or antifungal activity make up a smaller portion of the dataset, 17% and 20%, respectively. The assumption that if a natural product was never recorded to have an activity then it does not have that activity likely contributes to error in all our classifiers but may especially affect the antifungal classifier because many compounds in the dataset were only tested for antitumor activity and not antifungal activity. Due to historical focus on the discovery of natural products with the activities for which we developed classifiers along with the use of bioactivity guided fractionation, our dataset is likely biased toward active natural products, especially antibacterial molecules. Therefore, it is likely that the probabilities calculated by our method are overestimated. Future work will focus on assessing the degree of this overestimate by experimentally validating the classifiers’ predictions. Our classifiers performed better than those in Deep-BGC[11] for the same activity classification, likely because of our larger training sets and inclusion of additional features such as resistance markers and sub-PFAM domains. Deep-BGC used only a random forest classifier, which is more difficult to interpret in terms of which features are associated with activity than the logistic regression classifier we used to interpret how our classifiers make predictions. Random forests have an importance score that can identify which features are the most important for classification, but the importance score does not specify whether the feature is associated with the presence or absence of activity, which the coefficients of a logistic regression do. Our analysis of the coefficient scores and the inclusion of sub-PFAM domains as features made it possible for us to identify molecular features associated with activity and to understand how our classifiers make their predictions. It is more difficult to compare the accuracy of our classifiers to PRISM 4 because they only reported the ROC curve, but comparison of these curves shows that our antibacterial and antitumor classifiers performed better than PRISM 4,[24] while the antifungal classifiers performed approximately as well as PRISM 4. Our improved performance could be due to inclusion of non-biosynthetic features such as resistance markers and transporters or due to more thorough optimization of classifier hyperparameters. Our classifiers were trained on activity alone and do not make any predictions about the mechanism of action, although some information about mechanism of action is contained in the resistance gene features and PFAM domains. Therefore, it is possible that our classifiers could identify natural products with previously undiscovered mechanisms of action. A recently reported deep learning method for predicting antibiotic activity based on molecular structure successfully used a similar mechanism-blind approach.[30] Future experimental work will focus on determining mechanism of action for any active molecules identified by our classifiers in order to test this hypothesis. The precision–recall curves for our classifiers provide insight into how our classifiers could be used to prioritize BGCs for study. For example, if a researcher is using heterologous expression of a variety of BGC classes, a relatively time-consuming approach, they would want to screen only a few BGCs and would want a few false positives (high precision). Even for the antifungal classifier, the precision–recall curve shows that it is possible to achieve a high precision of 78% (2 in 10 of the BGCs screened would be expected to be a false positive) with a recall of 33% (Figure S3). If the BGCs in question were amenable to high throughput cloning and heterologous expression, then the researcher would be willing to sacrifice some precision to obtain a higher recall. In the case of the antifungal classifier, the precision–recall curve shows that it is possible to achieve a recall of 90% and still have a precision of 30%. In practice, these accuracies might be slightly overestimated due to the similarity of the training and validation data and because the historical use of bioactivity guided fractionation likely inflates the number of active compounds in our training set. Specifically, we expect that the classifiers will be more accurate for BGCs that share features with those in our training set and will therefore be most accurate for BGCs for the major classes of natural products (nonribosomal and ribosomal peptides, polyketides, terpenes, alkaloids, and saccharides). If a BGC for a novel natural product class is discovered, it is possible that our classifiers could still make accurate predictions if it shares tailoring enzymes, transporters, or resistance markers with BGCs in the training set but researchers should treat these predictions with caution. Future work will determine the TPR and FPRs for the classifiers on BGCs for different levels of novelty relative to the training data set.

Different Features Play a Role in Different Prediction Problems

We used the trained logistic regression classifiers to determine which features are most important for each classification problem. Using the coefficients from the logistic regression, we ranked features by their importance to the prediction. Features with a large positive coefficient are highly associated with the activity, while those with a large negative coefficient are strongly associated with not having activity. Because we used elastic net regularization, most of the coefficients are zero or close to zero, indicating that they are not important for making predictions (Figure S4). We chose to analyze only the 50 features with the highest coefficients and 50 with the lowest coefficients. Figure shows pie charts with classification of the top and bottom 50 features for each classification problem; Table S3 lists the specific features.

Figure 3

PFAM domains associated with activity. Pie charts showing the 50 protein family domains most or least associated with activity, as measured by their coefficients in the logistic regression classifier. A full list of these PFAM domains is available in Table S3. The types of features associated with activity vary considerably between the different types of activities for which we developed classifiers. For example, for all antibacterial classifiers, many transporter genes were highly associated with activity, accounting for 16–36% of the top 50 features. Transporters are less important for predicting antifungal or antitumor activity, accounting for only 6–14% of the top 50 features (Figure ). This is probably due to the importance of transporters as a resistance mechanism; a dedicated transporter may be required to provide bacteria resistance to antibacterial secondary metabolites, while antifungal molecules could possibly be transported by a general transporter. Some transporter genes also appear in the bottom 50 features for each classifier, suggesting that only certain transporters are associated with those activities. For example, in the case of the antibacterial classifier, multiple PFAM domains that are part of ABC-transporters are highly associated with antibacterial activity. Conversely, some major facilitator transporters are in the top 50 genes and some are in the bottom 50 (Table S3). More work will be required to understand why certain transporters are associated with antibacterial compounds and others are not. Genes associated with resistance that were not transporters were generally not highly correlated with antibacterial activity. Initially, we were surprised by this result but resistance mechanisms such as drug modification, target modification, or target duplication are all specific to a class of molecules or mechanism of action. Transporters can provide resistance to many molecule classes that act through different mechanisms and are therefore more useful as a general predictor of antibacterial activity. There were also many biosynthetic genes associated with activity. Seven aminotransferase genes were in the top 50 genes associated with anti-Gram-negative activity (none were in the top 50 for Gram-positive). This is consistent with a known Gram-negative permeability rule, that molecules containing an amine are more likely to accumulate in Gram-negative bacteria.[31] Genes associated with RiPP biosynthesis, especially lantipeptides, are strongly associated with antibacterial activity in general and with anti-Gram-positive activity but not with anti-Gram-negative activity or other activities (Figure ). Lantipeptides often have activity against Gram-positive bacteria but rarely have activity against Gram-negative bacteria,[32] fungi,[33] or tumor cells. Genes for synthesizing the other major classes of natural products are present in both the top and bottom 50 genes, and it is difficult to draw any conclusions about how these correlations relate to structure–activity relationships. This is unsurprising because many of these genes, such as those responsible for NRPS and PKS synthesis, are responsible for the synthesis of diverse molecules with a variety of functions.

PFAM Sub-Families Give Insight into Molecular Features Associated with Function

While analyzing how PFAM, antiSMASH, and resistance marker features correlated with activity was useful to draw some general conclusions about what types of biosynthetic genes and natural product classes are associated with various activities, they failed to provide much insight into how specific chemical features might contribute to activity. Therefore, we made Sequence Similarity Networks (SSNs) for some of the PFAMs most associated with activity to break them down into sub-PFAMs. SSNs, which group proteins based on pairwise sequence similarity, have been shown to cluster proteins with similar enzymatic activities and substrates.[27,28] We colored each SSN based on the activity of the natural product produced by the cluster the domain was from and prioritized SSNs where there were one or more clusters dominated by a single activity. We first examined the SSN network generated for the aminotransferase class I and II PFAM domain (accession number PF00155) because it contained multiple clusters that were associated with different activities (Figure A). We then examined the literature to determine the enzymatic activity of each activity-associated cluster. We identified three sub-PFAM domains that were both associated with activity and had a single enzymatic activity (Figure A,B).

Figure 4

Aminotransferase class I and II SSN. (A) Each node in the sequence similarity network represents a domain from a BGC. Nodes are colored based on the activity of the natural product produced by the biosynthetic gene cluster, pink denotes antibacterial activity, blue antifungal, antitumor, or cytotoxic activity, red both activities, green neither activity, and black unknown activity. Clusters are numbered based on the number of sequences assigned to them. For clarity, clusters with three or fewer members are not shown. (B) Cluster 3 is involved in the biosynthesis of phenylglycine and its analogs. Cluster 6 installs an amine on the sugar mycosamine, and cluster 8 is involved in the synthesis of various sugars found in aminoglycoside antibiotics. (C) Examples of nonribosomal peptide antibiotics that contain at least one phenylglycine analog, highlighted in red. The sixth largest cluster associated with antifungal activity produced the sugar mycosamine.[34] BGCs containing this mycosamine-producing sub-PFAM domain are all polyene macrolides with similar structures (Figure S5). The eighth largest cluster is associated with antibacterial activity, and the sub-PFAM domains in this cluster all transfer amines to sugars in aminoglycoside antibiotics, e.g., kanosamine, part of the antibiotic kanamycin (Figure B).[35] Therefore clusters 6 and 8 both produce substructures of a single class of natural product. In the case of the polyene macrolides, the mycosamine sugar has been shown to be essential for activity, likely through its interactions with sterols in the fungal membrane.[36] This is encouraging as it shows that the SSN approach can rediscover classes of active molecules and identify the biosynthetic machinery that installs the groups that are essential for activity. Unlike clusters 6 and 8, cluster 3, whose members produce l-phenylglycine and its hydroxylated analogs, hydroxyphenylglycine (Hpg) and dihydroxyphenylglycine (Dpg), was not confined to one class of molecule.[37,38] All the BGCs in our database containing phenylglycine transaminase produced natural products with antibacterial activity, but they produced molecules with very divergent structures (Figure C). In line with this observation, these natural products also do not all act through the same mechanism of action. For example, glycopeptide antibiotics such as vancomycin inhibit cell wall synthesis by binding to the d-Ala-d-Ala terminal of growing peptide chains in the cell wall, preventing cell wall remodeling,[39] while nocardicin A is a β-lactam that inhibits cell wall synthesis by binding to penicillin binding proteins,[40] and pristinamycin inhibits protein synthesis by binding to the ribosome.[41] This suggests that, unlike mycosamine in polyenes, phenylglycine supports activity in a general way rather than through a specific mechanism or binding interaction. One possible explanation for phenylglycine’s association with activity is that unlike proteinogenic amino acids, phenylglycine does not have a β-carbon, so it has fewer rotatable bonds than proteinogenic aromatic amino acids like phenylalanine.[38] Therefore, phenylglycine-containing peptides should be more rigid than they would be if they contained a proteinogenic amino acid instead, reducing the entropic cost for binding to a target. There is evidence that the structural rigidity of glycopeptide antibiotics are thought to be especially important for their activity.[37,42] Alanine scanning studies have shown that two Hpg residues in the antibiotic feglymycin and four Hpg residues in ramoplanin are important for activity.[43,44] However, it is not clear why phenylglycine appears to be associated only with antibacterial activity and no other activities. It is possible that the reason for this is evolutionary, that once a bacteria’s competitor evolved resistance to an ancient phenylglycine-containing antibiotic, evolutionary pressure maintained the antibacterial function of the molecule but altered it to work through an alternative mechanism. Interestingly, many of the phenylglycine-containing antibiotics, including glycopeptide antibiotics, nocardicin, ramoplanin, and feglymycin, act through inhibiting cell wall formation through different targets.[39,40,44,45] Additional study will be required to determine if phenylglycine transaminases are useful as a handle for antibacterial discovery and to determine the mechanism by which phenylglycine and its analogs support activity. We examined 46 other SSNs for trends similar those observed for the aminotransferase class I and II PFAM domain. We found that five SSNs had at least one cluster associated with activity and, as determined by literature search, had an enzymatic activity distinct from other enzymes not in the cluster. The functional groups produced by these enzymes are summarized in Figure , and the SSNs are shown in Figures S6–S9. While five of these functional groups, including the previously mentioned mycosamine and aminoglycoside sugars, were restricted to one class of active molecule, another 10 of these were not specific to one natural product scaffold (Figure ). The sub-PFAM domains that produce these functional groups could be useful handles for discovering novel bioactive natural product scaffolds. It is likely that most of these functional groups are associated with activity because they change general properties of the molecule to improve its activity. For example, we identified one sub-PFAM domain that N-methylates the backbone or side chain of non-ribosomal peptides. It is known that N-methylation of peptides can improve their cell permeability in mammalian cells, increase protease resistance, and stabilize the conformation of cyclic peptides.[46−48] It is likely that many of the other functional groups identified through this analysis also support activity by making the molecule more permeable to the target organism’s membrane or improve binding to target proteins. Bacteria may have evolved these sub-PFAM domains to improve the activity of their natural products analogous to how medicinal chemists use specific chemical modifications to improve the drug-like properties of a molecule. These activity-associated domains could both be used to identify new BGCs that produce active molecules and to engineer known natural products to improve their activity.

Figure 5

Summary of molecular features associated with activity. List of molecular features associated with (A) antibacterial, (B) antifungal, antitumor, or cytotoxic, and (C) antibacterial, antifungal, antitumor, or cytotoxic activities identified through analysis of SSNs of PFAM domains from BGCs. Yellow boxes indicate features that were not associated with a single molecular scaffold or mechanism of action. These sub-PFAM domains could be useful handles for the discovery of bioactive compounds with novel scaffolds.

Classifiers Perform Well on Holdout Set BGCs with Some Similarity to BGCs in the Training Set

To assess how well our classifiers will perform on novel BGCs, we applied it to a holdout set of BGCs that were not used during the development and optimization of the classifiers. A holdout set is useful for assessing the algorithm because it ensures that the optimization process did not produce a classifier that works only on the training set and will not work as well on other data. To generate a holdout set, we used BGCs that were added to the MiBIG database when it was upgraded from version 1.4 to version 2.0.[49] Our holdout set consisted of 258 BGCs for which we could find information about the activity of the associated natural product(s) (Table S5). To assess the accuracy of the classifiers on the holdout set, we split the holdout set into subsets based on how similar the BGCs were to BGCs in the training set. To do this, we used the “knownclusterblast” feature of antiSMASH,[26] which scores how similar BGCs are to known clusters. We split the holdout set into six subsets: clusters not recognized by antiSMASH and therefore lacking a knownclusterblast score (ND) and maximum percentage of genes that show similarity to genes in a single training set cluster of 0, (0–25], (25–50], (50–75], and (75–100] (Figure and Figures S10–S12). The balanced accuracy increased with the similarity of all genes for almost all classifiers. This means that while the classifiers cannot accurately predict the activity for a natural product when its BGC has no similarity to any training set BGC, as the similarity increases, so does the accuracy of the prediction. All classifiers, except the antifungal SVM classifier, perform well on the 75–100 holdout set with balanced accuracies ranging from 67 to 98% (Figure and Figures S10–S12). Most classifiers also perform well on the 50–75 holdout set, with accuracies as high as 85%. This indicates that if a novel cluster has a knownclusterblast similarity score of at least 50 with a BGC in the training set, then the predictions made by the classifiers are fairly accurate. We expect that the addition of more BGCs to the training set would increase its diversity and therefore improve the accuracy of predictions made by the classifiers.

Figure 6

Performance of classifiers on holdout set. (A) Balanced accuracy of random forest antibacterial, antifungal/antitumor/cytotoxic, and antifungal classifiers applied to subsets of the holdout set, split based on the maximum known cluster BLAST similarity score to BGCs in the training set. ND indicates the subset where antiSMASH did not identify the cluster and therefore did not determine a known cluster BLAST score. (B) Comparison of aurantinin B[50] and bacillaene[51] structures and BGCs; genes that were identified as similar by known cluster BLAST are colored in the same color. (C) Comparison of aurantimycin A[53] and polyoxypeptin[54] structures and BGCs; genes that were identified as similar by known cluster BLAST are colored in the same color. The molecules produced by BGCs with similarity scores higher than 50 can still be quite different. One example from the holdout set that demonstrates this is the aurantinin cluster.[50] The closest match to aurantinin in the training set is the bacillaene cluster,[51] with a similarity score of 57%. Despite the high similarity, aurantinin and bacillaene have quite different molecular structures (Figure B). Aurantinin has multiple cycles and is glycosylated,[50] while bacillaene is linear, not glycosylated, and contains peptides in addition to the main polyketide backbone.[51] Despite their different structures, aurantinin and bacillaene both have antibacterial activity.[50,52] The classifiers were able to successfully predict that aurantinin has antibacterial activity (60% average probability). We also looked for examples in the holdout set that were very similar structurally to a molecule from the training set but that had different activities. We found that aurantimycin[53] is similar both in structure and BGC sequence (67% similarity score) to polyoxypeptin[54] (Figure C). Despite their similarity, aurantimycin is documented to have both antibacterial and antitumor activity,[55] both of which were correctly predicted by the classifier, with probabilities of 58% for antibacterial activity and 85% for antifungal, antitumor, or cytotoxic activity. In the literature we reviewed, polyoxypeptin was only documented to have antitumor activity.[56] Together, these examples suggest that the classifiers are not merely relying on the most similar cluster to make their predictions and are able to distinguish between similar BGCs that produce molecules with different activities. Therefore, we expect that the model will perform well on novel BGCs if they have some degree of similarity to BGCs in the training set and that the similarity required for accurate predictions does not mean that the molecules produced by the BGCs will themselves be similar.

Conclusions

We developed a bioinformatics method for predicting biological activities, antibacterial, antifungal, or antitumor, from biosynthetic gene clusters. The ability to predict natural product activity from BGC sequences alone will make it possible to prioritize clusters that are most likely to produce molecules with a desired activity, minimizing the number of times the bottlenecks of natural product discovery need to be encountered to discover a novel active compound. Our approach used machine learning to generate classifiers, which have balanced accuracies of at least 57% and up to 79%. Classifiers’ performance depends on training sets, and they work best for problems where there are many positive and negative examples in the dataset, which is why we focused on antibacterial activity. The method does not perform as well when molecules with the desired activity sparsely populate the dataset. In our analysis, this feature of machine learning is most pronounced for the antifungal classifiers, for which our method has the lowest accuracy. It is likely that increasing the size of the dataset will further improve the accuracy of all classifiers, especially the antifungal and anti-Gram-negative classifiers. By examining which features of BGCs are associated with activity, we gained new insights into which molecular features of natural products are associated with their activity. Some of these observations were consistent with previously discovered structure–activity relationships, for example, that molecules with an amine are more likely to accumulate in Gram-negative bacteria[31] or that mycosamine is essential for the activity of antifungal polyenes.[36] This concordance demonstrates that trained classifiers can be used to identify structural features associated with activity. Future work will be focused on validating our machine learning approach as a tool to accelerate functional natural product discovery and on investigating the connection between molecular features identified by our classifiers and activity.

Methods

Assembly of Training Dataset

We assembled a training dataset from BGCs available from the MiBIG (version 1.4) database.[25] We then searched the literature for each natural product to determine if it had documented antibacterial, antifungal, antitumor, or cytotoxic activity. If a natural product was documented to not have a given activity, we recorded it as not having the activity in the dataset. Unfortunately, we found that reports rarely note the absence of activity. Therefore, we also assumed that a molecule did not have a given activity if there were no reports stating that it had that activity and if it was described as having a different activity (e.g., activity against a different organism, siderophore activity, ionophore activity, antioxidant activity, etc.). If the molecule had no recorded activity or function and no assays specifically showing its inactivity, we assumed that it was not tested for activity and we excluded it from our dataset.

Selection of Features

We ran antiSMASH 4.1[26] on the BGCs in the dataset and wrote a Python script to extract PFAM, CDS motif, smCOG, and polyketide and non-ribosomal peptide monomer prediction annotations of BGCs from an antiSMASH output file. We also ran RGI 3.2.1[29] on all clusters and wrote a script to extract resistance marker features. Features were only included if they appeared at least five times in the dataset. Further details on settings used to run antiSMASH and RGI as well as more details on the scripts are available in the Supporting Information. Scripts are available as a Jupyter notebook here: https://github.com/allie-walker/Natural-product-function. Since our initial development of the classification algorithm, new versions of antiSMASH and RGI became available. The command line version of our prediction algorithm, also available on GitHub, accepts input files from antiSMASH 5 and RGI 5 as well as the versions used in this paper.

Sequence Similarity Networks of PFAM Domains

To add additional information about the biosynthetic capabilities of the BGCs to the features, we broke down PFAM domains into sub-PFAM domains using the SSN algorithm on the EFI-EST webserver.[28] We made SSNs for the top 30 PFAM domains associated with antibacterial or antifungal/antitumor/cytotoxic activity as measured by χ2 value, excluding any PFAM domains that appeared in fewer than 35 training set BGCs. We applied the EFI-EST tool to the sequences of selected PFAM domains that appeared in our training set BGCs. We chose alignment scores for clustering cutoffs that would result in distinct clusters in the SSN (listed in Table S4). We designated each cluster that contained more than five sequences and occurred in at least three BGCs a sub-PFAM domain and included it as a feature for training. SSN networks shown in Figure and Figures S6–S9 were visualized using Cytoscape.[57]

Optimization and Training of Classifiers

We used modules from the scikit-learn Python library[58] to perform all machine learning. Before training regression and SVM classifiers, we scaled all features using a min max scaler (scikit-learn MinMaxScaler). To assess which parameters produced the best results, we measured the average accuracy for each parameter set over a 10-fold cross-validation, where one-tenth of the data was held out from training in each trial and then used to assess accuracy. For the logistic regression classifier, we used the Stochastic Gradient Decent Classifier module (SGDClassifier) from scikit-learn with loss=’log’ and reg=’elasticnet’. We then tested all combinations of α = 0.5, 0.3, 0.2, 0.1, 10–2, 10–3, 10–4, 10–5, and 10–6 and l1 ratios = 0.5, 0.2, 0.1, 0.05, 10–2, 10–3, and 10–4 (Figure S1A and Table S2). For the SVM classifier, we used the SVC module from scikit-learn. We tested the accuracy of the SVM with linear and rbf kernels and tested the C (l2 regularization parameter) values = 100, 10, 1, 0.5, 0.1, and 0.01. For the rbf kernel, we also tested kernel coefficients, γ = 0.001, 0.01, 0.1, 1, and 10 (Figure S1B and Table S2). To train random forest classifiers, we used the ExtraTreesClassifier module from scikit-learn with bootstrap = True, max_features=’auto’, and criterion=’gini’. We then tested all combinations of n_estimators = 1, 5, 10, 15, 20, 25, 50, and 100 and max_depth = 10, 20, 50, 100, 1000, and None (Figure S1C). We then chose the combination of parameters with the highest average accuracy for each classification problem for all subsequent analysis (Table S2).

Calculating Accuracy Metrics for Trained Classifiers

Methods from the scikit-learn metrics module were used to analyze the accuracy of our classifiers. The methods we used were balanced_accuracy_score, roc_curve, roc_auc_score, and precision_recall_curve. For the balanced accuracy and AROC scores, we took the average of scores from a 10-fold cross-validation. For plots showing ROC and precision recall curves, we displayed a curve from one trial of the cross-validation. Statistical significance of improvement over random guessing was determined in PRISM using and one-way ANOVA test.

57 in total

1. Predictive compound accumulation rules yield a broad-spectrum antibiotic.

Authors: Michelle F Richter; Bryon S Drown; Andrew P Riley; Alfredo Garcia; Tomohiro Shirai; Riley L Svec; Paul J Hergenrother
Journal: Nature Date: 2017-05-10 Impact factor: 49.962

Review 2. Modern natural products drug discovery and its relevance to biodiversity conservation.

Authors: David G I Kingston
Journal: J Nat Prod Date: 2010-12-07 Impact factor: 4.050

3. SMURF: Genomic mapping of fungal secondary metabolite clusters.

Authors: Nora Khaldi; Fayaz T Seifuddin; Geoff Turner; Daniel Haft; William C Nierman; Kenneth H Wolfe; Natalie D Fedorova
Journal: Fungal Genet Biol Date: 2010-06-08 Impact factor: 3.495

4. Polyoxypeptins A and B Produced by Streptomyces: Apoptosis-Inducing Cyclic Depsipeptides Containing the Novel Amino Acid (2S,3R)-3-Hydroxy-3-methylproline.

Authors: Kazuo Umezawa; Kumi Nakazawa; Yoko Ikeda; Hiroshi Naganawa; Shinichi Kondo
Journal: J Org Chem Date: 1999-04-30 Impact factor: 4.354

5. The identification of bacillaene, the product of the PksX megacomplex in Bacillus subtilis.

Authors: Rebecca A Butcher; Frank C Schroeder; Michael A Fischbach; Paul D Straight; Roberto Kolter; Christopher T Walsh; Jon Clardy
Journal: Proc Natl Acad Sci U S A Date: 2007-01-18 Impact factor: 11.205

6. Alanine scan of the peptide antibiotic feglymycin: assessment of amino acid side chains contributing to antimicrobial activity.

Authors: Anne Hänchen; Saskia Rausch; Benjamin Landmann; Luigi Toti; Antje Nusser; Roderich D Süssmuth
Journal: Chembiochem Date: 2013-02-27 Impact factor: 3.164

7. Alanine scan of [L-Dap(2)]ramoplanin A2 aglycon: assessment of the importance of each residue.

Authors: Joonwoo Nam; Dongwoo Shin; Yosup Rew; Dale L Boger
Journal: J Am Chem Soc Date: 2007-06-26 Impact factor: 15.419

8. Identification and characterization of the biosynthetic gene cluster of polyoxypeptin A, a potent apoptosis inducer.

Authors: Yanhua Du; Yemin Wang; Tingting Huang; Meifeng Tao; Zixin Deng; Shuangjun Lin
Journal: BMC Microbiol Date: 2014-02-08 Impact factor: 3.605

9. MIBiG 2.0: a repository for biosynthetic gene clusters of known function.

Authors: Satria A Kautsar; Kai Blin; Simon Shaw; Jorge C Navarro-Muñoz; Barbara R Terlouw; Justin J J van der Hooft; Jeffrey A van Santen; Vittorio Tracanna; Hernando G Suarez Duran; Victòria Pascal Andreu; Nelly Selem-Mojica; Mohammad Alanjary; Serina L Robinson; George Lund; Samuel C Epstein; Ashley C Sisto; Louise K Charkoudian; Jérôme Collemare; Roger G Linington; Tilmann Weber; Marnix H Medema
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

10. NeuRiPP: Neural network identification of RiPP precursor peptides.

Authors: Emmanuel L C de Los Santos
Journal: Sci Rep Date: 2019-09-16 Impact factor: 4.379

3 in total

Review 1. Strategies to access biosynthetic novelty in bacterial genomes for drug discovery.

Authors: Franziska Hemmerling; Jörn Piel
Journal: Nat Rev Drug Discov Date: 2022-03-16 Impact factor: 84.694

Review 2. Benefiting from big data in natural products: importance of preserving foundational skills and prioritizing data quality.

Authors: Nadja B Cech; Marnix H Medema; Jon Clardy
Journal: Nat Prod Rep Date: 2021-11-17 Impact factor: 13.423

Review 3. Natural product drug discovery in the artificial intelligence era.

Authors: F I Saldívar-González; V D Aldas-Bulos; J L Medina-Franco; F Plisson
Journal: Chem Sci Date: 2021-12-13 Impact factor: 9.825

3 in total