Carlos Garcia-Hernandez, Alberto Fernández, Francesc Serratosa.
Abstract
Extended reduced graphs (ErGs) provide summary representations of chemical structures, using pharmacophore-type node descriptions to encode the relevant molecular properties. Commonly used similarity measures based on reduced graphs convert these graphs into 2D vectors, such as fingerprints, before chemical comparisons are made. This study investigates the effectiveness of a purely graph-driven molecular comparison that combines extended reduced graphs with graph edit distance (GED) methods to calculate molecular similarity, as a tool for ligand-based virtual screening applications, which estimate the bioactivity of a chemical on the basis of the bioactivity of similar compounds. The results proved to be very stable, and the graph edit distance method performed better than other methods previously used on reduced graphs. This is exemplified with six publicly available data sets: DUD-E, MUV, GLL&GDD, CAPST, NRLiSt BDB, and ULS-UDS. The screening and statistical tools available on the ligand-based virtual screening benchmarking platform and in RDKit were also used. In most of the experiments, our method performed better than other molecular similarity methods that use array representations. Overall, it is shown that extended reduced graphs combined with graph edit distance have numerous applications and can identify bioactivity similarities in a structurally diverse group of molecules.
Year: 2019 PMID: 30920214 PMCID: PMC6668628 DOI: 10.1021/acs.jcim.8b00820
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
Figure 1. Molecular comparison flowcharts. Difference between traditional ErG methods and our proposal.
Figure 2. Example of molecule reduction using ErG. At the top there is the original molecule and at the bottom the ErG representation. Ac: H-bond acceptor; Hf: hydrophobic group; Ar: aromatic ring system; +: positive charge. Colors are used to show how different parts of the original structure are reduced to nodes in the ErG.
Figure 3. Fingerprint-based method flowchart.
Figure 4. SED-based method flowchart.
Figure 5. GED-based method flowchart.
Figure 6. One of the edit paths that transforms graph A into graph B.
Description of the Node and Edge Attributes That Make up an ErG
| node attributes | |
|---|---|
| attribute | description |
| [0] | hydrogen-bond donor |
| [1] | hydrogen-bond acceptor |
| [2] | positive charge |
| [3] | negative charge |
| [4] | hydrophobic group |
| [5] | aromatic ring system |
| [6] | carbon link node |
| [7] | noncarbon link node |
| [0, 1] | hydrogen-bond donor + hydrogen-bond acceptor |
| [0, 2] | hydrogen-bond donor + positive charge |
| [0, 3] | hydrogen-bond donor + negative charge |
| [1, 2] | hydrogen-bond acceptor + positive charge |
| [1, 3] | hydrogen-bond acceptor + negative charge |
| [2, 3] | positive charge + negative charge |
| [0, 1, 2] | hydrogen-bond donor + hydrogen-bond acceptor + positive charge |
| edge attributes | |
| attribute | description |
| - | single bond |
| = | double bond |
| ≡ | triple bond |
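The attribute table above can be read as a small labeling scheme: each node carries one or more feature indices, and each edge carries a bond symbol. The following is a minimal sketch of that reading. The names `NODE_FEATURES`, `EDGE_TYPES`, `describe_node`, and the toy graph are hypothetical and only illustrate the table; they are not the authors' data structures.

```python
# Feature indices and bond symbols copied from the attribute table.
NODE_FEATURES = {
    0: "hydrogen-bond donor",
    1: "hydrogen-bond acceptor",
    2: "positive charge",
    3: "negative charge",
    4: "hydrophobic group",
    5: "aromatic ring system",
    6: "carbon link node",
    7: "noncarbon link node",
}
EDGE_TYPES = {"-": "single bond", "=": "double bond", "≡": "triple bond"}


def describe_node(label):
    """Turn a node label such as (0, 1) into its textual description."""
    return " + ".join(NODE_FEATURES[i] for i in label)


# Hypothetical toy ErG: node id -> label, (node id, node id) -> bond symbol.
toy_nodes = {"a": (5,), "b": (6,), "c": (0, 1)}
toy_edges = {("a", "b"): "-", ("b", "c"): "-"}

for node_id, label in toy_nodes.items():
    print(node_id, list(label), describe_node(label))
for (u, v), bond in toy_edges.items():
    print(u, v, EDGE_TYPES[bond])
```

Representing labels as tuples keeps combined attributes such as `[0, 1, 2]` hashable, which is convenient when they are later used as keys in a cost lookup.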
Substitution and Insertion/Deletion Costs Used in the GED and SED Calculation
| matrix of substitution costs | | | | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | [0] | [1] | [2] | [3] | [4] | [5] | [6] | [7] | [0, 1] | [0, 2] | [0, 3] | [1, 2] | [1, 3] | [2, 3] | [0, 1, 2] | - | = | ≡ |
| [0] | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 1 | 1 | 1 | 2 | 2 | 2 | 1 | 2 | 3 | 3 |
| [1] | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 3 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 3 | 3 |
| [2] | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 3 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 2 | 3 | 3 |
| [3] | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 3 | 2 | 2 | 1 | 2 | 1 | 1 | 2 | 2 | 3 | 3 |
| [4] | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |
| [5] | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |
| [6] | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |
| [7] | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| [0, 1] | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |
| [0, 2] | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 3 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |
| [0, 3] | 1 | 2 | 2 | 1 | 2 | 2 | 2 | 3 | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 3 | 3 |
| [1, 2] | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 3 | 3 |
| [1, 3] | 2 | 1 | 2 | 1 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 2 | 3 | 3 |
| [2, 3] | 2 | 2 | 1 | 1 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 2 | 3 | 3 |
| [0, 1, 2] | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 2 | 3 | 3 |
| 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 0 | 3 | 3 | |
| 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 | 3 | |
| ≡ | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 0 |
| insertion/deletion costs | ||||||||||||||||||
| insert | 1 | 1 | 1 | 1 | 1 | 1 | 0.5 | 0.5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.1 | 1 | 1 |
| delete | 1 | 1 | 1 | 1 | 1 | 1 | 0.5 | 0.5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0.1 | 1 | 1 |
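As a rough illustration of how these costs combine, the sketch below scores a single edit path by summing the cost of each operation. The rule inside `substitution_cost` is only a compact restatement of the pattern visible in the matrix above (0 on the diagonal, 3 whenever `[7]`, `=`, or `≡` is involved, 1 between a single feature and a combined label that contains it, 2 otherwise), not a formula stated by the source. The function names and the example path are hypothetical. A full GED computation would additionally search for the cheapest edit path, which this sketch does not attempt.

```python
# Node labels are tuples of feature indices; edge labels are bond symbols.
# Labels whose substitution with any other label costs 3 in the matrix.
EXPENSIVE = {(7,), "=", "≡"}


def substitution_cost(a, b):
    """Cost of replacing label a with label b (assumed rule, see lead-in)."""
    if a == b:
        return 0.0
    if a in EXPENSIVE or b in EXPENSIVE:
        return 3.0
    if isinstance(a, tuple) and isinstance(b, tuple):
        short, long_ = sorted((a, b), key=len)
        # A single feature versus a combined label containing that feature.
        if len(short) == 1 and len(long_) > 1 and short[0] in long_:
            return 1.0
    return 2.0


def indel_cost(label):
    """Insertion and deletion share the same cost row in the table."""
    if label in {(6,), (7,)}:
        return 0.5
    if label == "-":
        return 0.1
    return 1.0


def edit_path_cost(operations):
    """Sum the costs of a given sequence of edit operations."""
    total = 0.0
    for op in operations:
        kind = op[0]
        if kind == "substitute":
            total += substitution_cost(op[1], op[2])
        elif kind in ("insert", "delete"):
            total += indel_cost(op[1])
        else:
            raise ValueError(f"unknown operation: {kind}")
    return total


# Hypothetical edit path: relabel a donor as donor+acceptor, drop a carbon
# link node together with its single bond, and add an aromatic ring system.
path = [
    ("substitute", (0,), (0, 1)),
    ("delete", (6,)),
    ("delete", "-"),
    ("insert", (5,)),
]
print(round(edit_path_cost(path), 2))
```

Spot checks against the matrix, such as `[3]` versus `[1, 3]` (cost 1) or `[4]` versus `[7]` (cost 3), are a quick way to confirm that the assumed rule agrees with the tabulated values.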
Input Data Used for the Experiments
| data set | targets used |
|---|---|
| ULS-UDS | 5HT1F_Agonist, MTR1B_Agonist, OPRM_Agonist, PE2R3_Antagonist |
| GLL&GDD | 5HT1A_Agonist, 5HT1A_Antagonist, 5HT1D_Agonist, 5HT1D_Antagonist, 5HT1F_Agonist, 5HT2A_Antagonist, 5HT2B_Antagonist, 5HT2C_Agonist, 5HT2C_Antagonist, 5HT4R_Agonist, 5HT4R_Antagonist, AA1R_Agonist, AA1R_Antagonist, AA2AR_Antagonist, AA2BR_Antagonist, ACM1_Agonist, ACM2_Antagonist, ACM3_Antagonist, ADA1A_Antagonist, ADA1B_Antagonist, ADA1D_Antagonist, ADA2A_Agonist, ADA2A_Antagonist, ADA2B_Agonist, ADA2B_Antagonist, ADA2C_Agonist, ADA2C_Antagonist, ADRB1_Agonist, ADRB1_Antagonist, ADRB2_Agonist, ADRB2_Antagonist, ADRB3_Agonist, ADRB3_Antagonist, AG2R_Antagonist, BKRB1_Antagonist, BKRB2_Antagonist, CCKAR_Antagonist, CLTR1_Antagonist, DRD1_Antagonist, DRD2_Agonist, DRD2_Antagonist, DRD3_Antagonist, DRD4_Antagonist, EDNRA_Antagonist, EDNRB_Antagonist, GASR_Antagonist, HRH2_Antagonist, HRH3_Antagonist, LSHR_Antagonist, LT4R1_Antagonist, LT4R2_Antagonist, MTR1A_Agonist, MTR1B_Agonist, MTR1L_Agonist, NK1R_Antagonist, NK2R_Antagonist, NK3R_Antagonist, OPRD_Agonist, OPRK_Agonist, OPRM_Agonist, OXYR_Antagonist, PE2R1_Antagonist, PE2R2_Antagonist, PE2R3_Antagonist, PE2R4_Antagonist, TA2R_Antagonist, V1AR_Antagonist, V1BR_Antagonist, V2R_Antagonist |
| CAPST | CDK2, CHK1, PTP1B, UROKINASE |
| DUD-E | COX2, DHFR, EGFR, FGFR1, FXA, P38, PDGFRB, SRC, AA2AR |
| NRLiSt_BDB | AR_Agonist, AR_Antagonist, ER_Alpha_Agonist, ER_Alpha_Antagonist, ER_Beta_Agonist, FXR_Alpha_Agonist, GR_Agonist, GR_Antagonist, LXR_Alpha_Agonist, LXR_Beta_Agonist, MR_Antagonist, PPAR_Alpha_Agonist, PPAR_Beta_Agonist, PPAR_Gamma_Agonist, PR_Agonist, PR_Antagonist, PXR_Agonist, RAR_Alpha_Agonist, RAR_Beta_Agonist, RAR_Gamma_Agonist, RXR_Alpha_Agonist, RXR_Alpha_Antagonist, RXR_Gamma_Agonist, VDR_Agonist |
| MUV | 466, 548, 600, 644, 652, 689, 692, 712, 713, 733, 737, 810, 832, 846, 852, 858, 859 |
The column entitled “data set” contains the name of each data set, and the column entitled “targets used” contains the names of the targets used during the experiments for each data set. Note that in the result plots shown below, per-target points are arranged in the same order as in this table.
Figure 7. AUC and BEDROC (α = 20) over all available targets in the LBVS benchmarking platform. The scattered values on the left of both subplots represent the median value from 10 predefined, randomly built splits, using different colors and shapes per similarity method. Vertical segmented lines mark the boundaries between data sets (from left to right: ULS-UDS, GLL&GDD, CAPST, DUD-E, NRLiSt_BDB, and MUV). The box-and-whisker plots on the right of both subplots show the distribution of the resulting values for each similarity method. The boxes show the first and third quartiles, the line is the median value (second quartile), and the whiskers extend from the boxes to show the range of the data (outliers are included if there are any).
Figure 8. AUC results for all available targets in the LBVS benchmarking platform, separated by data set. Each scattered value on the left of each subplot represents the median value of 10 predefined, randomly built splits. A different color and shape is used for each similarity method. Box-and-whisker plots on the right of each subplot show the distribution of the resulting values for each similarity method.
Figure 9. BEDROC (α = 20) results for all available targets in the LBVS benchmarking platform, separated by data set. Each scattered value on the left of each subplot represents the median value of 10 predefined, randomly built splits. A different color and shape is used for each similarity method. Box-and-whisker plots on the right of each subplot show the distribution of the resulting values for each similarity method.
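For readers unfamiliar with the AUC values plotted in these figures, the sketch below computes an AUC from a ranking by distance: it is the fraction of (active, decoy) pairs in which the active is closer to the query than the decoy, with ties counted as one half. The function name and all distances are hypothetical, and this is a generic pairwise formulation rather than the benchmarking platform's own implementation.

```python
def roc_auc_from_distances(distances, is_active):
    """AUC for a screen where a smaller distance means 'more similar'."""
    actives = [d for d, a in zip(distances, is_active) if a]
    decoys = [d for d, a in zip(distances, is_active) if not a]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for d_active in actives:
        for d_decoy in decoys:
            if d_active < d_decoy:
                wins += 1.0      # active ranked ahead of decoy
            elif d_active == d_decoy:
                wins += 0.5      # tie counts as half a win
    return wins / (len(actives) * len(decoys))


# Hypothetical distances from one query molecule to five test molecules.
distances = [0.5, 3.0, 1.2, 0.8, 2.0]
is_active = [True, False, True, False, False]
print(roc_auc_from_distances(distances, is_active))
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a ranking that places every active ahead of every decoy. BEDROC additionally weights early recognition and is not covered by this sketch.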
P-Values of a Friedman Test for AUC and BEDROC Results, Comparing All Three Similarity Methods (GED-Based, FP-Based, and SED-Based) at the Same Time
| | Friedman test (AUC) | Friedman test (BEDROC) |
|---|---|---|
| ULS-UDS | 0.173774 | 0.173774 |
| GLL&GDD | 2.97804 × 10^-14* | 5.30798 × 10^-15* |
| DUD-E | 0.000911882* | 0.000300185* |
| NRLiSt_BDB | 7.48518 × 10^-5* | 5.77775 × 10^-8* |
| MUV | 0.00102573* | 0.00714619* |
| CAPST | 0.0497871* | 0.0497871* |
| all data sets | 7.46387 × 10^-24* | 1.89219 × 10^-28* |
The test is done per data set, and all data sets are combined in the last row. Here, a significance level of α = 0.05 is used, so p-values lower than α indicate statistically significant differences, which are marked with an asterisk (*) in the table.
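To make the Friedman comparison concrete, here is a minimal pure-Python sketch for exactly three methods. It ranks the methods within each target, computes the Friedman statistic Q from the rank sums, and uses the chi-square approximation with two degrees of freedom, whose upper-tail probability is exp(-Q/2). The sketch assumes no tied scores within a target, the function name is hypothetical, and the scores are invented for illustration; a statistics library would normally be used in practice.

```python
import math


def friedman_three_methods(scores):
    """Friedman test for k = 3 methods; scores has one row per target.

    Returns (Q, p) using the chi-square approximation with 2 degrees of
    freedom. Assumes the three values in each row are distinct (no ties).
    """
    k = 3
    n = len(scores)
    rank_sums = [0.0] * k
    for row in scores:
        if len(row) != k or len(set(row)) != k:
            raise ValueError("each row needs three distinct values")
        # Rank 1 goes to the smallest value in the row, rank 3 to the largest.
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    p = math.exp(-q / 2.0)  # chi-square survival function for 2 d.o.f.
    return q, p


# Hypothetical AUC values for four targets (columns: three methods).
hypothetical_auc = [
    [0.90, 0.80, 0.70],
    [0.85, 0.75, 0.65],
    [0.95, 0.90, 0.60],
    [0.60, 0.90, 0.80],
]
q, p = friedman_three_methods(hypothetical_auc)
print(f"Q = {q:.3f}, p = {p:.6f}")
```

With only four targets the smallest attainable p-values are limited, which is one reason small data sets can fail to reach α = 0.05 even when one method ranks first on most targets.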
P-Values of a Pairwise Wilcoxon Signed-Rank Test for AUC and BEDROC Results, Comparing All Three Similarity Methods (GED-Based, FP-Based, and SED-Based)
| | Wilcoxon test (AUC) | | | Wilcoxon test (BEDROC) | | |
|---|---|---|---|---|---|---|
| | SED | FP | GED | SED | FP | GED |
| SED | | 1.20841 × 10^-13* | 7.4432 × 10^-21* | | 2.08456 × 10^-16* | 1.18024 × 10^-21* |
| FP | 1.20841 × 10^-13* | | 0.000748788* | 2.08456 × 10^-16* | | 0.000585085* |
| GED | 7.4432 × 10^-21* | 0.000748788* | | 1.18024 × 10^-21* | 0.000585085* | |
The test is applied to all targets in the data sets combined. Here, a significance level of α = 0.05 is used, so p-values lower than α indicate statistically significant differences, which are marked with an asterisk (*) in the table.
P-Values of a Pairwise Wilcoxon Signed-Rank Test for AUC Results, Comparing All Three Similarity Methods (GED-Based, FP-Based, and SED-Based)
| | ULS-UDS | | | GLL&GDD | | |
|---|---|---|---|---|---|---|
| | SED | FP | GED | SED | FP | GED |
| SED | | 0.273322 | 0.0678892 | | 4.26037 × 10^-8* | 3.34327 × 10^-12* |
| FP | 0.273322 | | 0.465209 | 4.26037 × 10^-8* | | 0.0018207* |
| GED | 0.0678892 | 0.465209 | | 3.34327 × 10^-12* | 0.0018207* | |
The test is done using all targets, separated by data set. Here, a significance level of α = 0.05 is used, so p-values lower than α indicate statistically significant differences, which are marked with an asterisk (*) in the table.
P-Values of a Pairwise Wilcoxon Signed-Rank Test for BEDROC Results, Comparing All Three Similarity Methods (GED-Based, FP-Based, and SED-Based)
| | ULS-UDS | | | GLL&GDD | | |
|---|---|---|---|---|---|---|
| | SED | FP | GED | SED | FP | GED |
| SED | | 0.273322 | 0.0678892 | | 1.41571 × 10^-9* | 7.73521 × 10^-13* |
| FP | 0.273322 | | 1 | 1.41571 × 10^-9* | | 0.125128 |
| GED | 0.0678892 | 1 | | 7.73521 × 10^-13* | 0.125128 | |
| | (data set name missing) | | | (data set name missing) | | |
| | SED | FP | GED | SED | FP | GED |
| SED | | 0.0678892 | 0.0678892 | | 0.00988213* | 0.00359936* |
| FP | 0.0678892 | | 0.465209 | 0.00988213* | | 0.758312 |
| GED | 0.0678892 | 0.465209 | | 0.00359936* | 0.758312 | |
| | (data set name missing) | | | (data set name missing) | | |
| | SED | FP | GED | SED | FP | GED |
| SED | | 0.015156* | 0.00768579* | | 0.000318217* | 3.43006 × 10^-5* |
| FP | 0.015156* | | 0.00768579* | 0.000318217* | | 0.000284994* |
| GED | 0.00768579* | 0.00768579* | | 3.43006 × 10^-5* | 0.000284994* | |
The test is done using all targets, separated by data set. Here, a significance level of α = 0.05 is used, so p-values lower than α indicate statistically significant differences, which are marked with an asterisk (*) in the table.
Three Sample Molecules from the Target VDR_Agonist in the NRLiSt_BDB Dataset
Distances between Molecules Shown in Table Computed Using the FP-Based, SED-Based, and GED-Based Similarity Methods
| | Mol ID 2 ligand | Mol ID 3 decoy |
|---|---|---|
| Mol ID 1 ligand | SED: 0.06 | SED: 0.28 |
| | FPD: 0.40 | FPD: 0.92 |
| | GED: 0.49 | GED: 3.05 |