Myungwon Seo1, Hyun Kil Shin2, Yoochan Myung3, Sungbo Hwang1,4, Kyoung Tai No5,6.
Abstract
Computer-aided research on the relationship between the molecular structures of natural compounds (NC) and their biological activities has been carried out extensively because the molecular structures of new drug candidates are usually analogous to, or derived from, those of NCs. To express this relationship in a physically realistic way on a computer, it is essential to have a molecular descriptor set that adequately represents the characteristics of the molecular structures in the NC chemical space. Although several topological descriptors have been developed to describe the physical, chemical, and biological properties of organic molecules, especially synthetic compounds, and have been widely used in drug discovery research, these descriptors are limited in their ability to express NC-specific molecular structures. To overcome this limitation, we developed a novel molecular fingerprint, called Natural Compound Molecular Fingerprints (NC-MFP), for explaining NC structures related to biological activities and for applying them to natural product (NP)-based drug development. NC-MFP was developed to reflect the structural characteristics of NCs and the commonly used NP classification system. NC-MFP is a scaffold-based molecular fingerprint method comprising scaffolds, scaffold-fragment connection points (SFCP), and fragments. The scaffolds of the NC-MFP have a hierarchical structure. In this study, we introduce the 16 structural classes of NPs in the Dictionary of Natural Products database (DNP), and the hierarchical scaffolds of each class were calculated using the Bemis and Murcko (BM) method. The scaffold library in NC-MFP comprises 676 scaffolds. To compare how well the NC-MFP represents the structural features of NCs relative to molecular fingerprints that are widely used for representing organic molecules, two kinds of binary classification tasks were performed.
Task I is a binary classification of the compounds in commercially available library DBs as either NCs or synthetic compounds. Task II classifies NCs with inhibitory activity against seven biological target proteins as active or inactive. Both tasks were carried out with several molecular fingerprints, including NC-MFP, using the 1-nearest neighbor (1-NN) method. In task I, NC-MFP proved to be a practical molecular fingerprint for identifying NC structures in the data set compared with the other molecular fingerprints. In task II, NC-MFP outperformed the other molecular fingerprints, suggesting that NC-MFP is useful for explaining NC structures related to biological activities. In conclusion, NC-MFP is a robust molecular fingerprint for classifying NC structures and explaining their biological activities. Therefore, we suggest NC-MFP as a potent molecular descriptor for the virtual screening of NCs in natural product-based drug development.
Keywords: Dictionary of Natural Product database (DNP); Molecular descriptor; Natural compound (NC); Natural product (NP); Natural product-based drug development; Virtual screening
Year: 2020 PMID: 33431009 PMCID: PMC6977316 DOI: 10.1186/s13321-020-0410-3
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Schematic diagram of the NC-MFP concept. The diagram illustrates the idea underlying the hierarchical structure of the NC-MFP: a query natural compound is described in terms of its Scaffold (blue), Scaffold-Fragment Connection Points (yellow), and Fragments (green), and its NC-MFP is produced as a bit string from these three components
Fig. 2 The hierarchical tree of the molecular scaffolds. Following the Bemis and Murcko (BM) scaffold method, the functional groups of the compounds are removed. The ring systems in the molecular scaffolds are then removed iteratively until only a single ring remains. In the hierarchical tree, each node represents a molecular scaffold and is assigned a level based on its position in the tree
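The iterative ring removal described for Fig. 2 can be illustrated with a toy sketch. It does not operate on real molecules: a scaffold is reduced to a hypothetical ring-adjacency graph, the pruning priority (fewest attached rings first, then label order) is an assumption made only for illustration, and numbering the single-ring scaffold as level 0 is likewise an assumption. The names `RingGraph`, `is_connected`, and `scaffold_levels` are not from the source.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical representation: ring label -> labels of rings it is fused or linked to.
RingGraph = Dict[str, Set[str]]


def is_connected(rings: Set[str], graph: RingGraph) -> bool:
    """Return True if `rings` form a single connected ring system."""
    if not rings:
        return False
    start = next(iter(rings))
    seen = {start}
    stack = [start]
    while stack:
        ring = stack.pop()
        for neighbor in graph[ring] & rings:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == rings


def scaffold_levels(graph: RingGraph) -> List[FrozenSet[str]]:
    """Prune one ring at a time; index 0 of the result is the single-ring scaffold."""
    current = set(graph)
    if not is_connected(current, graph):
        raise ValueError("the scaffold must be one connected ring system")
    chain = [frozenset(current)]
    while len(current) > 1:
        # Only rings whose removal keeps the scaffold in one piece may be pruned.
        removable = [r for r in current if is_connected(current - {r}, graph)]
        # Toy priority rule (assumption): fewest attached rings first, then label order.
        pruned = min(removable, key=lambda r: (len(graph[r] & current), r))
        current = current - {pruned}
        chain.append(frozenset(current))
    chain.reverse()
    return chain


# A linear three-ring scaffold A-B-C.
levels = scaffold_levels({"A": {"B"}, "B": {"A", "C"}, "C": {"B"}})
for level, rings in enumerate(levels):
    print(level, sorted(rings))
```

A real implementation would work on molecular graphs with a cheminformatics toolkit and a chemically motivated pruning order; the sketch only shows how each pruning step yields one node of the hierarchy.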
Fig. 3 The DB coverage calculation. The DB coverage of the molecular scaffolds was calculated for scaffold levels 0 to 3 using the NCDBs
Fig. 4 Heat map of the classification accuracy according to scaffold level. The heat map shows how accurately the NC structures of the DNP are assigned to the 16 DNP classes using the scaffold libraries of levels 0 to 3. Each value is the proportion of correctly classified structures and ranges from 0 to 1; values closer to 1 are better. The abbreviations of the 16 classes are given in Table 2
The classes of the Dictionary of Natural Products (DNP) and scaffold levels are listed
| No | Class (a) | Class designation | No. of representative compounds (b) | No. of scaffolds (Lv0) (c) | No. of scaffolds (Lv1) (d) | No. of scaffolds (Lv2) (e) | No. of scaffolds (Lv3) (f) |
|---|---|---|---|---|---|---|---|
| 1 | Aliphatic natural products | ANP | 31 | 16 | 10 | 4 | 3 |
| 2 | Alkaloids | Alk | 303 | 107 | 177 | 218 | 190 |
| 3 | Aminoacids and peptides | Ape | 13 | 9 | 9 | 7 | 5 |
| 4 | Benzofuranoids | Bfu | 11 | 5 | 6 | 6 | 3 |
| 5 | Benzopyranoids | Bpy | 15 | 7 | 8 | 4 | 4 |
| 6 | Carbohydrates | Car | 30 | 10 | 13 | 14 | 10 |
| 7 | Flavonoids | Fla | 19 | 8 | 8 | 10 | 2 |
| 8 | Lignans | Lig | 20 | 9 | 10 | 9 | 1 |
| 9 | Oxygen heterocycles | Oxy | 12 | 8 | 7 | 3 | 1 |
| 10 | Polycyclic aromatic natural products | PANP | 13 | 6 | 8 | 8 | 3 |
| 11 | Polyketides | Pke | 12 | 10 | 9 | 11 | 8 |
| 12 | Polypyrroles | Ppy | 6 | 6 | 6 | 6 | 6 |
| 13 | Simple aromatic natural products | SANP | 18 | 9 | 10 | 7 | 0 |
| 14 | Steroids | Ste | 17 | 5 | 5 | 5 | 6 |
| 15 | Tannins | Tan | 21 | 8 | 8 | 9 | 6 |
| 16 | Terpenoids | Ter | 141 | 34 | 33 | 28 | 14 |
| Total | | | 682 | 257 | 327 | 349 | 262 |
The DNP classes are listed with their designations, together with the number of representative compounds in each class and the number of scaffolds at levels 0, 1, 2, and 3
(a) The 16 classes were taken from the Dictionary of Natural Products database (DNP)
(b) The number of representative natural compounds in each class of the DNP (“No. of representative compounds”)
(c) The number of scaffolds at level 0 (“No. of scaffolds (Lv0)”)
(d) The number of scaffolds at level 1 (“No. of scaffolds (Lv1)”)
(e) The number of scaffolds at level 2 (“No. of scaffolds (Lv2)”)
(f) The number of scaffolds at level 3 (“No. of scaffolds (Lv3)”)
Fig. 5 Workflow for generating the NC-MFP. The NC-MFP algorithm consists of six steps. The preprocessing step prepares the input query compound for the NC-MFP calculation. The scaffold matching step finds the scaffolds related to the query compound. The fragment list generation step generates fragments by removing the scaffold from the query compound. The scaffold-fragment connection point (SFCP) assigning step identifies the position on the scaffold at which each fragment is attached. The fragment identifying step looks up the fragments of the query compound in the complete fragment list. The fingerprint representation step encodes the features of the NC-MFP as a bit string
Fig. 6 Preprocessing step in NC-MFP algorithm
Fig. 7 Scaffold matching step in NC-MFP algorithm
Fig. 8 Fragment list generation step in NC-MFP algorithm
Fig. 9 Scaffold-fragment connection point (SFCP) assigning step in NC-MFP algorithm
Fig. 10 Fragment identifying step in NC-MFP algorithm
Fig. 11 Fingerprint representation step in NC-MFP algorithm
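The final fingerprint representation step can be sketched as follows, assuming the earlier steps have already produced the matched scaffolds, SFCPs, and fragments of a query compound. The three vocabularies, their identifiers (such as `"S2:4"`), the segment order, and the function names are hypothetical placeholders, not the layout used by the NC-MFP software.

```python
from typing import Iterable, List, Sequence


def encode_segment(vocabulary: Sequence[str], present: Iterable[str]) -> List[int]:
    """One bit per vocabulary entry: 1 if the query compound has it, else 0."""
    present_set = set(present)
    return [1 if key in present_set else 0 for key in vocabulary]


def scaffold_based_bits(
    scaffolds: Iterable[str],
    sfcps: Iterable[str],
    fragments: Iterable[str],
    scaffold_vocab: Sequence[str],
    sfcp_vocab: Sequence[str],
    fragment_vocab: Sequence[str],
) -> str:
    """Concatenate the scaffold, SFCP, and fragment segments into one bit string."""
    bits = (
        encode_segment(scaffold_vocab, scaffolds)
        + encode_segment(sfcp_vocab, sfcps)
        + encode_segment(fragment_vocab, fragments)
    )
    return "".join(str(bit) for bit in bits)


# Hypothetical vocabularies and a hypothetical query compound.
fingerprint = scaffold_based_bits(
    scaffolds=["S2"],
    sfcps=["S2:4"],
    fragments=["OH", "COOH"],
    scaffold_vocab=["S1", "S2", "S3"],
    sfcp_vocab=["S2:1", "S2:4"],
    fragment_vocab=["OH", "OCH3", "COOH"],
)
print(fingerprint)
```

Keeping the three segments in fixed positions means that two compounds sharing a scaffold but differing in substituents agree on the scaffold bits and differ only in the SFCP and fragment bits.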
Fig. 12 Two types of binary classification tasks
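Both classification tasks use the 1-nearest neighbor (1-NN) method. A minimal sketch of 1-NN on binary fingerprints is given here, assuming the Tanimoto coefficient as the similarity measure (the source does not state which measure was used) and representing each fingerprint as the set of its on-bit indices. The function names and the toy training data are illustrative only.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def predict_1nn(query: Set[int], training: List[Tuple[Set[int], str]]) -> str:
    """Return the label of the training fingerprint most similar to the query."""
    if not training:
        raise ValueError("training set is empty")
    _, label = max(training, key=lambda item: tanimoto(query, item[0]))
    return label


# Toy training set: (on-bit indices, label).
training_set = [
    ({0, 1, 2}, "NC"),
    ({5, 6}, "synthetic"),
]
print(predict_1nn({1, 2, 3}, training_set))
print(predict_1nn({5}, training_set))
```

With 1-NN there is no model fitting, so any difference in performance between fingerprints comes from how each fingerprint places similar compounds close together.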
The numbers of active and inactive compounds for each target protein are summarized
| Biological activity | NPASS Target ID (b) | No. of active compounds (c) | No. of inactive compounds (d) | No. of total compounds with ring structures (e) | No. of total compounds without ring structures (f) |
|---|---|---|---|---|---|
| Protein-tyrosine phosphatase 1B inhibitors | NPT178 | 81 | 171 | 252 | 3 |
| Acetylcholinesterase inhibitors | NPT204 | 54 | 108 | 162 | 4 |
| Aldose reductase inhibitors | NPT68 | 57 | 68 | 125 | 3 |
| Beta-secretase 1 inhibitors | NPT740 | 35 | 73 | 108 | 1 |
| Cyclooxygenase-2 inhibitors | NPT31 | 31 | 62 | 93 | 1 |
| Butyrylcholinesterase inhibitors | NPT439 | 28 | 53 | 81 | 1 |
| Cyclooxygenase-1 inhibitors | NPT324 | 27 | 49 | 76 | 0 |
Seven target proteins were selected from the NPASS DB (a), together with the active and inactive compounds for each target protein
(a) Seven biological activities, along with the related protein targets, were selected from the Natural Product Activity & Species Source Database (NPASS DB)
(b) Target ID code of the NPASS DB, with which the protein information can be accessed (“NPASS Target ID”)
(c) The number of active natural compounds with ring structures, as determined by the experimental inhibitory assay (“No. of active compounds”)
(d) The number of inactive natural compounds with ring structures, as determined by the experimental inhibitory assay (“No. of inactive compounds”)
(e) The total number of natural compounds with ring structures used for model development (“No. of total compounds with ring structures”)
(f) The total number of natural compounds without ring structures (“No. of total compounds without ring structures”)
The result of DB coverage
| NCDBs (Y) | Level 0 coverage (%) | Level 1 coverage (%) | Level 2 coverage (%) | Level 3 coverage (%) |
|---|---|---|---|---|
| KNApSAcK | 99.95 | 75.70 | 43.08 | 12.79 |
| IBScreen | 99.96 | 79.49 | 22.07 | 3.43 |
| NPACT | 100.00 | 80.67 | 54.31 | 18.13 |
| Specs | 99.88 | 85.78 | 64.69 | 33.29 |
| TCM | 99.98 | 74.39 | 34.99 | 13.38 |
| NPASS | 99.97 | 72.24 | 37.52 | 13.10 |
| Avg. performance | 99.96 | 78.05 | 42.77 | 15.69 |
The coverage of the natural compound databases (NCDBs), defined by Eqs. (2) and (3), is summarized at different scaffold levels
“NCDBs” means Natural Compound Databases. “Avg. performance” is the coverage averaged over the six databases
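Eqs. (2) and (3) are not reproduced in this record, so the following is only a plausible reading of DB coverage, not the authors' definition: the percentage of compounds in a database that contain at least one scaffold from the library of a given level. The scaffold matches per compound are assumed to be precomputed, and all names and data are hypothetical.

```python
from typing import List, Set


def db_coverage(compound_scaffolds: List[Set[str]], library: Set[str]) -> float:
    """Percentage of compounds matching at least one scaffold of the library.

    Assumed definition for illustration; the source defines coverage in
    equations that are not available here.
    """
    if not compound_scaffolds:
        raise ValueError("the database is empty")
    matched = sum(1 for scaffolds in compound_scaffolds if scaffolds & library)
    return 100.0 * matched / len(compound_scaffolds)


# Hypothetical database: the scaffolds found in each of four compounds.
toy_db = [{"benzene"}, {"benzene", "indole"}, {"pyran"}, set()]

coverage_simple = db_coverage(toy_db, {"benzene", "pyran"})
coverage_complex = db_coverage(toy_db, {"indole"})
print(coverage_simple, coverage_complex)
```

Under this reading, a library of simple, frequently occurring scaffolds matches most compounds while a library of larger scaffolds matches fewer, which agrees with the decrease from level 0 to level 3 in the table.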
Binary classification results of task I (a). Performance of each molecular fingerprint, averaged over ten external validation runs (a, b)
| Molecular fingerprint | Natural compound classification: Avg. TP | Natural compound classification: Avg. FN | Natural compound classification: Avg. sensitivity (c) (%) | Synthetic compound classification: Avg. TN | Synthetic compound classification: Avg. FP | Synthetic compound classification: Avg. specificity (d) (%) |
|---|---|---|---|---|---|---|
| NC-MFP | 183 | 14 | 92.65 | 113 | 87 | 56.50 |
| MACCS | 169 | 30 | 84.60 | 146 | 53 | 73.35 |
| PubChemFP | 165 | 34 | 82.60 | 154 | 46 | 77.00 |
| GraphFP | 161 | 38 | 80.75 | 143 | 56 | 71.80 |
| APFP | 153 | 46 | 76.55 | 141 | 58 | 70.70 |
(a) Performance results for binary classification task I. The external validation set was randomly selected ten times, each time comprising 20% of the data set. “NC-MFP” stands for Natural Compound Molecular Fingerprints, “APFP” for AtomPairs2DFingerprint, “GraphFP” for GraphOnlyFingerprint, “MACCS” for Molecular ACCess System keys fingerprint, and “PubChemFP” for PubChem fingerprint
(b) The performance indices are sensitivity and specificity. “TP” stands for true positive, “FN” for false negative, “TN” for true negative, and “FP” for false positive
(c) Sensitivity is the proportion of the positive class that was correctly identified
(d) Specificity is the proportion of the negative class that was correctly identified
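The two indices defined in footnotes c and d follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function names are illustrative, and the counts are the averaged values reported for PubChemFP and NC-MFP in the table.

```python
def sensitivity(tp: float, fn: float) -> float:
    """Percentage of the positive class that was correctly identified."""
    return 100.0 * tp / (tp + fn)


def specificity(tn: float, fp: float) -> float:
    """Percentage of the negative class that was correctly identified."""
    return 100.0 * tn / (tn + fp)


# Averaged counts taken from the task I table.
pubchem_specificity = specificity(tn=154, fp=46)
ncmfp_specificity = specificity(tn=113, fp=87)
ncmfp_sensitivity = sensitivity(tp=183, fn=14)
print(pubchem_specificity, ncmfp_specificity, round(ncmfp_sensitivity, 2))
```

Values recomputed from the averaged, rounded counts need not match the tabulated percentages exactly, presumably because the table averages the per-run percentages rather than the counts.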
Binary classification results of task II. Performance (a) of each molecular fingerprint, averaged over ten external validation runs (b)
| Protein target | NC-MFP ACC (c) (%) | NC-MFP F1 (d) (%) | NC-MFP MCC (e) | MACCS ACC (c) (%) | MACCS F1 (d) (%) | MACCS MCC (e) | PubChemFP ACC (c) (%) | PubChemFP F1 (d) (%) | PubChemFP MCC (e) | GraphFP ACC (c) (%) | GraphFP F1 (d) (%) | GraphFP MCC (e) | APFP ACC (c) (%) | APFP F1 (d) (%) | APFP MCC (e) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Protein-tyrosine phosphatase 1B (NPT 178) | 78.98 | 80.65 | 0.57 | 66.90 | 72.56 | 0.32 | 69.66 | 74.40 | 0.36 | 67.24 | 71.88 | 0.33 | 61.03 | 58.07 | 0.29 |
| Acetylcholinesterase (NPT 204) | 73.42 | 76.42 | 0.49 | 70.79 | 75.75 | 0.42 | 70.00 | 76.15 | 0.41 | 66.58 | 72.05 | 0.30 | 59.74 | 63.94 | 0.18 |
| Aldose reductase (NPT 68) | 83.20 | 83.51 | 0.76 | 76.00 | 77.35 | 0.56 | 75.60 | 75.03 | 0.59 | 69.60 | 71.01 | 0.41 | 59.20 | 47.03 | 0.24 |
| Beta-secretase (NPT 740) | 87.20 | 88.64 | 0.83 | 77.20 | 80.48 | 0.55 | 73.20 | 77.46 | 0.45 | 77.20 | 81.44 | 0.53 | 71.20 | 74.78 | 0.48 |
| Cyclooxygenase-2 (NPT 31) | 84.76 | 86.37 | 0.78 | 74.28 | 79.30 | 0.56 | 69.52 | 74.69 | 0.45 | 73.33 | 77.36 | 0.45 | 63.81 | 60.26 | 0.35 |
| Butyrylcholinesterase (NPT 439) | 87.89 | 88.82 | 0.88 | 78.95 | 81.53 | 0.64 | 71.05 | 75.05 | 0.51 | 74.74 | 77.13 | 0.55 | 77.35 | 78.57 | 0.56 |
| Cyclooxygenase-1 (NPT 324) | 88.33 | 89.42 | 0.76 | 79.45 | 82.93 | 0.63 | 78.89 | 83.32 | 0.65 | 77.78 | 82.39 | 0.65 | 73.89 | 73.73 | 0.52 |
| Average | 83.40 | 84.83 | 0.72 | 74.80 | 78.56 | 0.53 | 72.56 | 76.59 | 0.49 | 72.35 | 76.18 | 0.46 | 66.60 | 65.20 | 0.37 |
The seven target proteins of task II and their compounds are summarized in Table 1
(a) The performance indices are accuracy (ACC), F1-score (F1), and the Matthews Correlation Coefficient (MCC)
(b) Performance results for binary classification task II. For each target protein, the external validation set was randomly selected ten times, each time comprising 20% of both the active and the inactive compounds of that target. “NC-MFP” stands for Natural Compound Molecular Fingerprints, “APFP” for AtomPairs2DFingerprint, “GraphFP” for GraphOnlyFingerprint, “MACCS” for Molecular ACCess System keys fingerprint, and “PubChemFP” for PubChem fingerprint
(c) Accuracy (ACC) is the proportion of all predictions that are correct
(d) F1-score (F1) is the harmonic mean of precision and sensitivity
(e) The Matthews Correlation Coefficient (MCC) is used to evaluate binary classification performance. MCC ranges from −1 to 1, where −1 means a completely wrong binary classifier and 1 means an entirely correct binary classifier
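The three indices in footnotes c to e can be computed from the confusion-matrix counts with their standard formulas, sketched below. The function names and the example counts are hypothetical and are not taken from the study; returning 0 when a denominator is zero is a common convention adopted here as an assumption.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and sensitivity, written in terms of counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient, ranging from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical validation result: 8 TP, 9 TN, 1 FP, 2 FN.
acc_value = accuracy(tp=8, tn=9, fp=1, fn=2)
f1_value = f1_score(tp=8, fp=1, fn=2)
mcc_value = mcc(tp=8, tn=9, fp=1, fn=2)
print(round(acc_value, 4), round(f1_value, 4), round(mcc_value, 4))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which makes it informative when the active and inactive sets differ in size, as they do for every target in task II.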