Alice Capecchi, Jean-Louis Reymond.
Abstract
Microbial natural products (NPs) are an important source of drugs; however, their structural diversity remains poorly understood. Here we used our recently reported MinHashed Atom Pair fingerprint with diameter of four bonds (MAP4), a fingerprint suitable for molecules of very different sizes, to analyze the Natural Products Atlas (NPAtlas), a database of 25,523 NPs of bacterial or fungal origin. To visualize NPAtlas by MAP4 similarity, we used the dimensionality reduction method tree map (TMAP). The resulting interactive map organizes molecules by physico-chemical properties and compound families such as peptides and glycosides. Remarkably, the map separates bacterial and fungal NPs from one another, revealing that these two compound families are intrinsically different despite their related biosynthetic pathways. We used these differences to train a machine learning model capable of distinguishing between NPs of bacterial and fungal origin.
Keywords: natural products; chemical space; cheminformatics; databases; machine learning; molecular fingerprints; origin classification; support vector machine; visualization
Year: 2020 PMID: 32998475 PMCID: PMC7600738 DOI: 10.3390/biom10101385
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Calculated properties of NPAtlas molecules available as TMAP color-codes.
| Property | Min. Value | Max. Value | 25% Quantile | 50% Quantile | 75% Quantile |
|---|---|---|---|---|---|
| Molecular weight A | 70.1 | 2901.3 (1000 F) | 292 | 408.9 | 562.6 |
| Sp3 C fraction A | 0.0 | 1.0 | 0.4 | 0.6 | 0.7 |
| HBA count A,B | 0 | 68 (20 F) | 4 | 6 | 9 |
| HBD count A,C | 0 | 47 (10 F) | 3 | 2 | 4 |
| AlogP A,D | −28.9 (−2 G) | 33.8 (8 F) | 1.2 | 2.5 | 4.1 |
| TPSA A,E | 0.0 | 1135.81 (500 F) | 69.64 | 99.66 | 152.8 |
| Boiling point A,H | 311.5 | 7806.5 (2000 F) | 890.8 | 1141.6 | 1518.5 |
| Is Lipinski | Categorical: yes/no | ||||
| Substructures I | Categorical: contains dipeptide moiety/contains glycoside moiety/contains dipeptide and glycoside moieties | ||||
| Origin | Categorical: Bacterial/Fungal | ||||
| MAP4 SVM J prediction | Categorical: Bacterial/Fungal | ||||
| MAP4 SVM J performances | Categorical: correct/wrong | ||||
A Continuous properties; also shown as rank in the map. B Hydrogen bond acceptors (HBA). C Hydrogen bond donors (HBD). D LogP calculated following Crippen’s approach (AlogP). E Topological polar surface area (TPSA). F The maximum value shown in the map; all values above it are represented with the same color code. G The minimum value shown in the map; all values below it are represented with the same color code. H Joback calculated boiling point. I SMARTS matched substructures. J Support vector machine (SVM).
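The display caps in footnotes F and G amount to clipping each property to a fixed range before it is mapped to a color. The following is a minimal sketch of that clipping, assuming the caps are applied exactly as listed in the table; the names `DISPLAY_CAPS` and `clamp_for_display` are hypothetical and are not taken from the authors' code.

```python
# Display caps read from the table above: (lower cap, upper cap); None means no cap.
# The dictionary and function names are illustrative assumptions.
DISPLAY_CAPS = {
    "Molecular weight": (None, 1000.0),
    "HBA count": (None, 20.0),
    "HBD count": (None, 10.0),
    "AlogP": (-2.0, 8.0),
    "TPSA": (None, 500.0),
    "Boiling point": (None, 2000.0),
}


def clamp_for_display(prop, value):
    """Clip a property value to the range shown in the map.

    Values beyond a cap all receive the cap's value, and therefore
    the same color code. Properties without caps pass through unchanged.
    """
    low, high = DISPLAY_CAPS.get(prop, (None, None))
    if low is not None and value < low:
        return low
    if high is not None and value > high:
        return high
    return value
```

With these caps, the heaviest entry (2901.3) would be colored like any other compound at or above 1000, which keeps the color scale from being dominated by a few extreme values.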
Number of NPAtlas entries and unique publications by origin and molecular weight.
| | Fungal A | Bacterial A |
|---|---|---|
| NPAtlas entries (≥1000 Da) | 15,759 (347) | 9764 (1392) |
| Unique publications B | 6110 (145) | 4653 (711) |
| Peptides (≥1000 Da) C | 722 (311) | 2144 (901) |
| Glycosides (≥1000 Da) D | 814 (12) | 1616 (421) |
| Glycopeptides (≥1000 Da) E | 1 (0) | 112 (89) |
| Aromatic NPs (≥1000 Da) F | 1322 (0) | 800 (31) |
| Aliphatic NPs (≥1000 Da) G | 2184 (59) | 1366 (220) |
A Natural product origin. B Number of unique publications used for the extraction of all NPAtlas entries. C Containing a dipeptide moiety. D Containing a glycoside moiety. E Containing both a glycoside and a dipeptide moiety. F fsp3C < 0.2. G fsp3C > 0.8.
Figure 1. (A) NPAtlas MAP4 TMAP colored by MW, with a rainbow scale where the lowest values are purple and the highest values are red. Two areas of the map are zoomed and colored by SMARTS substructure match: compounds containing a dipeptide moiety are highlighted in green, compounds containing a glycoside moiety are highlighted in magenta, and compounds containing both moieties are highlighted in yellow; six examples of NPAtlas entries are reported with the same color code. (B) The NPAtlas MAP4 TMAP colored by fsp3C, with a rainbow scale where the lowest values are purple and the highest values are red. A low and a high fsp3C area of the map are zoomed, and two examples each of polyphenols and terpenoids are reported. (C) The NPAtlas MAP4 TMAP colored by microbial origin: compounds originating from fungi are colored in magenta, and compounds produced by bacteria are colored in green.
Figure 2. The structural formulas of natural product examples selected from the TMAPs in Figure 1.
Performance of the SVM and k-NN classifiers on the test set.
| Classifier | ROC AUC A | F1 Score A | Balanced Accuracy A | MCC A |
|---|---|---|---|---|
| MAP4 SVM B | 0.97 | 0.91 | 0.93 | 0.86 |
| MAP4 k-NN C | 0.96 | 0.88 | 0.90 | 0.81 |
| Physchem SVM D | 0.86 | 0.73 | 0.78 | 0.56 |
A Area under the receiver operating characteristic curve (ROC AUC), F1 score, balanced accuracy, and Matthews correlation coefficient (MCC) are metrics used to evaluate a machine learning model. MCC can assume values from −1 to 1, all other metrics can assume values from 0 to 1, and in all cases 1 is a perfect classification. Refer to Section 2 for details. B SVM classifier trained with the MAP4 fingerprint. C k-NN classifier trained with the MAP4 fingerprint. D SVM trained with physicochemical properties.
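Three of the four metrics in the table can be computed directly from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions of F1 score, balanced accuracy, and MCC; it is not the authors' evaluation code, the function name `binary_metrics` is hypothetical, and which origin is treated as the positive class is an assumption left to the caller. ROC AUC is omitted because it requires the classifier's continuous scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute F1, balanced accuracy, and MCC from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Undefined ratios (zero denominator) return 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    balanced_accuracy = (recall + specificity) / 2

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"f1": f1, "balanced_accuracy": balanced_accuracy, "mcc": mcc}
```

Balanced accuracy and MCC are less sensitive to class imbalance than plain accuracy, which matters for a data set with unequal numbers of fungal and bacterial entries.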
MAP4 SVM classification of new microbial natural products and of Phakefustatin A.
| Natural Product | MAP4 SVM A | Training Set | JD from NN B |
|---|---|---|---|
| Epicospirocin 1 | 0.99, 0.01 | Aspermicrone A (NPA024935) | 0.66 |
| Penicimeroterpenoid A | 1.0, 0.0 | Isocitreohybridone H (NPA016454) | 0.63 |
| Rhizolutin | 0.83, 0.17 | Monacolin K (NPA009354) | 0.80 |
| Bosamycin A | 0.04, 0.96 | AIP I (NPA010987) | 0.77 |
| Phakefustatin A | 0.12, 0.88 | Samoamide A (NPA022212) | 0.68 |
A Predicted origin, listed as fungal, bacterial. B Approximated Jaccard distance (JD) from the nearest neighbor (NN) in the training set; see Section 2 for details.
Figure 3. Examples of natural products reported in 2020 and absent from NPAtlas, annotated with their predicted origin and each connected to its MAP4 NN in the training set.
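Because MAP4 is a MinHashed fingerprint, the distance between two molecules is estimated from their signatures rather than computed on the full feature sets. The sketch below illustrates the general MinHash estimate of Jaccard distance on plain sets of strings. It is not the MAP4 implementation: the string items stand in for the atom-pair shingles, and the salted BLAKE2b hash family, the signature length, and all function names are assumptions chosen for illustration.

```python
import hashlib


def minhash_signature(items, num_perm=256):
    """Build a MinHash signature for a non-empty collection of strings.

    For each of num_perm salted hash functions, keep the minimum
    hash value observed over all items.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "little")
            for item in items
        ))
    return signature


def approx_jaccard_distance(sig_a, sig_b):
    """Estimate Jaccard distance as 1 minus the fraction of matching slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


def exact_jaccard_distance(items_a, items_b):
    """Exact Jaccard distance on the full sets, for comparison."""
    a, b = set(items_a), set(items_b)
    return 1.0 - len(a & b) / len(a | b)
```

The estimate converges to the exact distance as the signature length grows, so a distance such as 0.66 in the table should be read as an approximation whose precision depends on the number of hash functions used.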