| Literature DB >> 32802980 |
Joel Kowalewski1, Anandasankar Ray1,2.
Abstract
There is an urgent need for the identification of effective therapeutics for COVID-19 and we have developed a machine learning drug discovery pipeline to identify several drug candidates. First, we collect assay data for 65 target human proteins known to interact with the SARS-CoV-2 proteins, including the ACE2 receptor. Next, we train machine learning models to predict inhibitory activity and use them to screen FDA registered chemicals and approved drugs (~100,000) and ~14 million purchasable chemicals. We filter predictions according to estimated mammalian toxicity and vapor pressure. Prospective volatile candidates are proposed as novel inhaled therapeutics since the nasal cavity and respiratory tracts are early bottlenecks for infection. We also identify candidates that act across multiple targets as promising for future analyses. We anticipate that this theoretical study can accelerate testing of two categories of therapeutics: repurposed drugs suited for short-term approval, and novel efficacious drugs suitable for a long-term follow up.Entities:
Keywords: ACE2; Chemical informatics; Computer-aided drug design; Covid-19; Drug discovery; Machine learning; Microbiology; SARS-CoV-2; Structure activity relationship; Toxicology; Viral disease; Virology; Viruses
Year: 2020 PMID: 32802980 PMCID: PMC7409807 DOI: 10.1016/j.heliyon.2020.e04639
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1Machine learning pipeline to identify chemicals that interfere with SARS-CoV-2 targets.a) Overview of the pipeline to predict chemicals for 65 SARS-CoV-2 human targets selected from Gordon et al., 2020 and using bioassay data from publicly available databases. b) Graphically depicts the pipeline details. Available bioassay data on the viral targets were mined for information to use in machine learning or structural analysis. This resulted in 24 targets that could be modeled using values for the most abundant inhibitory assay measure (e.g. Ki or IC50) and 21 targets modeled by classifying broad inhibition or actvity against the proteins (34 unique targets in total). The remaining targets with limited data were funneled into a structural similarity analysis, which aids in developing more bioassay data and helps clarify the chemical features contributing to bioactivity. For targets modeled with supervised machine learning, optimal chemical features were identified on subsets of training data. The top features were sampled by support vector machines (SVM). These models were then aggregated. In certain cases, the Random Forest algorithm was inlcuded to improve the fit. External chemicals were used to verify successful predictions. Models trained for the 34 targets predicted large chemical databases including FDA registered chemicals and approved drugs, as well as 10+ million purchasable chemicals from the ZINC database. Top scoring predicted chemicals were subsequently assigned theoretical toxicity, log vapor pressure, and MLOGP, which estimates membrane permeability.
Important chemical features for regression models. Top three physicochemical features for the viral targets with known bioassay activities.
| Feature | Target | Description |
|---|---|---|
| GATS5s | ABCC1 | Geary autocorrelation of lag 5 weighted by I-state |
| RDF055m | ABCC1 | Radial Distribution Function - 055/weighted by mass |
| SpMax_B(s) | ABCC1 | leading eigenvalue from Burden matrix weighted by I-State |
| CATS2D_08_AA | BRD2 | CATS2D Acceptor-Acceptor at lag 08 |
| RDF035s | BRD2 | Radial Distribution Function - 035/weighted by I-state |
| SpDiam_X | BRD2 | spectral diameter from chi matrix |
| HATS8p | BRD4 | leverage-weighted autocorrelation of lag 8/weighted by polarizability |
| R5i+ | BRD4 | R maximal autocorrelation of lag 5/weighted by ionization potential |
| RDF035m | BRD4 | Radial Distribution Function - 035/weighted by mass |
| Eig02_EA(bo) | CSNK2A2 | eigenvalue n. 2 from edge adjacency mat. weighted by bond order |
| Eig05_EA(bo) | CSNK2A2 | eigenvalue n. 5 from edge adjacency mat. weighted by bond order |
| SpMax2_Bh(m) | CSNK2A2 | largest eigenvalue n. 2 of Burden matrix weighted by mass |
| CATS2D_04_AA | CSNK2B | CATS2D Acceptor-Acceptor at lag 04 |
| SHED_DN | CSNK2B | SHED Donor-Negative |
| SpMin1_Bh(m) | CSNK2B | smallest eigenvalue n. 1 of Burden matrix weighted by mass |
| DISPm | DCTPP1 | displacement value/weighted by mass |
| HATS7u | DCTPP1 | leverage-weighted autocorrelation of lag 7/unweighted |
| Mor31s | DCTPP1 | signal 31/weighted by I-state |
| MATS1e | DNMT1 | Moran autocorrelation of lag 1 weighted by Sanderson electronegativity |
| Mor23m | DNMT1 | signal 23/weighted by mass |
| TDB06u | DNMT1 | 3D Topological distance based descriptors - lag 6 unweighted |
| GATS4m | GFER | Geary autocorrelation of lag 4 weighted by mass |
| Mor14m | GFER | signal 14/weighted by mass |
| R5i | GFER | R autocorrelation of lag 5/weighted by ionization potential |
| DISPp | HDAC2 | displacement value/weighted by polarizability |
| IC2 | HDAC2 | Information Content index (neighborhood symmetry of 2-order) |
| P_VSA_MR_5 | HDAC2 | P_VSA-like on Molar Refractivity, bin 5 |
| F04[C–C] | IMPDH2 | Frequency of C - C at topological distance 4 |
| HOMA | IMPDH2 | Harmonic Oscillator Model of Aromaticity index |
| VE1_B(s) | IMPDH2 | coefficient sum of the last eigenvector (absolute values) from Burden matrix weighted by I-State |
| Eig02_AEA(dm) | ITGB1 | eigenvalue n. 2 from augmented edge adjacency mat. weighted by dipole moment |
| SHED_AA | ITGB1 | SHED Acceptor-Acceptor |
| SpMax2_Bh(s) | ITGB1 | largest eigenvalue n. 2 of Burden matrix weighted by I-state |
| F10[C–N] | MARK2 | Frequency of C - N at topological distance 10 |
| nPyrroles | MARK2 | number of Pyrroles |
| SaaNH | MARK2 | Sum of aaNH E-states |
| max_conj_path | MARK3 | maximum number of atoms that can be in conjugation with each other |
| SaaNH | MARK3 | Sum of aaNH E-states |
| VE1_H2 | MARK3 | coefficient sum of the last eigenvector (absolute values) from reciprocal squared distance matrix |
| GATS3s | NSD2 | Geary autocorrelation of lag 3 weighted by I-state |
| HOMA | NSD2 | Harmonic Oscillator Model of Aromaticity index |
| Mor16s | NSD2 | signal 16/weighted by I-state |
| H7m | PABPC1 | H autocorrelation of lag 7/weighted by mass |
| JGI7 | PABPC1 | mean topological charge index of order 7 |
| P_VSA_MR_2 | PABPC1 | P_VSA-like on Molar Refractivity, bin 2 |
| GATS4m | PLAT | Geary autocorrelation of lag 4 weighted by mass |
| Mor04s | PLAT | signal 04/weighted by I-state |
| R6p+ | PLAT | R maximal autocorrelation of lag 6/weighted by polarizability |
| nPyrroles | PRKACA | number of Pyrroles |
| RDF040v | PRKACA | Radial Distribution Function - 040/weighted by van der Waals volume |
| SpMin3_Bh(m) | PRKACA | smallest eigenvalue n. 3 of Burden matrix weighted by mass |
| Eig02_EA(bo) | PSEN2 | eigenvalue n. 2 from edge adjacency mat. weighted by bond order |
| nArX | PSEN2 | number of X on aromatic ring |
| VE1sign_D/Dt | PSEN2 | coefficient sum of the last eigenvector from distance/detour matrix |
| SHED_DL | PTGES2 | SHED Donor-Lipophilic |
| VE2sign_G | PTGES2 | average coefficient of the last eigenvector from geometrical matrix |
| VE3sign_G | PTGES2 | logarithmic coefficient sum of the last eigenvector from geometrical matrix |
| CATS3D_08_AL | RIPK1 | CATS3D Acceptor-Lipophilic BIN 08 (8.000–9.000 Å) |
| MATS5i | RIPK1 | Moran autocorrelation of lag 5 weighted by ionization potential |
| VE3sign_RG | RIPK1 | logarithmic coefficient sum of the last eigenvector from reciprocal squared geometrical matrix |
| BLTA96 | SIGMAR1 | Verhaar Algae base-line toxicity from MLOGP (mmol/l) |
| F10[C–C] | SIGMAR1 | Frequency of C - C at topological distance 10 |
| TPSA(Tot) | SIGMAR1 | topological polar surface area using N,O,S,P polar contributions |
| Eig01_AEA(dm) | TBK1 | eigenvalue n. 1 from augmented edge adjacency mat. weighted by dipole moment |
| HATS4i | TBK1 | leverage-weighted autocorrelation of lag 4/weighted by ionization potential |
| SdssC | TBK1 | Sum of dssC E-states |
| AROM | VCP | aromaticity index |
| E1m | VCP | 1st component accessibility directional WHIM index/weighted by mass |
| MATS5m | VCP | Moran autocorrelation of lag 5 weighted by mass |
| H5s | ACE2 | H autocorrelation of lag 5/weighted by I-state |
| Mor10m | ACE2 | signal 10/weighted by mass |
| Mor17m | ACE2 | signal 17/weighted by mass |
Important chemical features for classification models. Top three physicochemical features for viral targets where the models classified chemicals as active vs inactive relative to broad inhibition or activition rather than a specific assay value (e.g. Ki, IC50, and AC50).
| Feature | Target | Description |
|---|---|---|
| Mor18s | BRD4 | signal 18/weighted by I-state |
| SpMAD_G/D | BRD4 | spectral mean absolute deviation from distance/distance matrix |
| SpMax3_Bh(p) | BRD4 | largest eigenvalue n. 3 of Burden matrix weighted by polarizability |
| P_VSA_LogP_3 | HDAC2 | P_VSA-like on LogP, bin 3 |
| SHED_DA | HDAC2 | SHED Donor-Acceptor |
| SHED_DL | HDAC2 | SHED Donor-Lipophilic |
| G(N..N) | IDE | sum of geometrical distances between N..N |
| SM1_Dz(i) | IDE | spectral moment of order 1 from Barysz matrix weighted by ionization potential |
| Wap | IDE | all-path Wiener index |
| CATS2D_08_DA | TBK1 | CATS2D Donor-Acceptor at lag 08 |
| F08[N–N] | TBK1 | Frequency of N - N at topological distance 8 |
| P_VSA_e_3 | TBK1 | P_VSA-like on Sanderson electronegativity, bin 3 |
| H7m | PRKACA | H autocorrelation of lag 7/weighted by mass |
| H7s | PRKACA | H autocorrelation of lag 7/weighted by I-state |
| RDF060m | PRKACA | Radial Distribution Function - 060/weighted by mass |
| GATS6e | MARK3 | Geary autocorrelation of lag 6 weighted by Sanderson electronegativity |
| GATS6m | MARK3 | Geary autocorrelation of lag 6 weighted by mass |
| Mor02m | MARK3 | signal 02/weighted by mass |
| CATS2D_02_DL | IMPDH2 | CATS2D Donor-Lipophilic at lag 02 |
| CATS3D_07_DL | IMPDH2 | CATS3D Donor-Lipophilic BIN 07 (7.000–8.000 Å) |
| NaasC | IMPDH2 | Number of atoms of type aasC |
| C-039 | ABCC1 | Ar-C(=X)-R |
| VE2sign_Dz(p) | ABCC1 | average coefficient of the last eigenvector from Barysz matrix weighted by polarizability |
| VE3sign_Dz(v) | ABCC1 | logarithmic coefficient sum of the last eigenvector from Barysz matrix weighted by van der Waals volume |
| Mor31s | ABHD12 | signal 31/weighted by I-state |
| RTi+ | ABHD12 | R maximal index/weighted by ionization potential |
| VE3sign_Dz(p) | ABHD12 | logarithmic coefficient sum of the last eigenvector from Barysz matrix weighted by polarizability |
| E2m | BRD2 | 2nd component accessibility directional WHIM index/weighted by mass |
| GATS2m | BRD2 | Geary autocorrelation of lag 2 weighted by mass |
| TDB03i | BRD2 | 3D Topological distance based descriptors - lag 3 weighted by ionization potential |
| MAXDP | COMT | maximal electrotopological positive variation |
| nDB | COMT | number of double bonds |
| P_VSA_MR_2 | COMT | P_VSA-like on Molar Refractivity, bin 2 |
| CATS2D_02_AL | DNMT1 | CATS2D Acceptor-Lipophilic at lag 02 |
| Mor04s | DNMT1 | signal 04/weighted by I-state |
| VE3sign_Dt | DNMT1 | logarithmic coefficient sum of the last eigenvector from detour matrix |
| ChiA_B(i) | EIF4H | average Randic-like index from Burden matrix weighted by ionization potential |
| F05[C–O] | EIF4H | Frequency of C - O at topological distance 5 |
| NaasC | EIF4H | Number of atoms of type aasC |
| CENT | LOX | centralization |
| EE_G | LOX | Estrada-like index (log function) from geometrical matrix |
| VE2_D/Dt | LOX | average coefficient of the last eigenvector (absolute values) from distance/detour matrix |
| Eta_D_beta | MARK2 | eta measure of electronic features |
| Mor29v | MARK2 | signal 29/weighted by van der Waals volume |
| SpPosA_B(i) | MARK2 | normalized spectral positive sum from Burden matrix weighted by ionization potential |
| CATS2D_07_AL | NEK9 | CATS2D Acceptor-Lipophilic at lag 07 |
| CATS2D_08_AL | NEK9 | CATS2D Acceptor-Lipophilic at lag 08 |
| TDB05p | NEK9 | 3D Topological distance based descriptors - lag 5 weighted by polarizability |
| CATS2D_06_DL | NEU1 | CATS2D Donor-Lipophilic at lag 06 |
| TDB04i | NEU1 | 3D Topological distance based descriptors - lag 4 weighted by ionization potential |
| X3A | NEU1 | average connectivity index of order 3 |
| nR06 | RHOA | number of 6-membered rings |
| R8s+ | RHOA | R maximal autocorrelation of lag 8/weighted by I-state |
| SpMin1_Bh(m) | RHOA | smallest eigenvalue n. 1 of Burden matrix weighted by mass |
| CATS3D_08_NL | SIRT5 | CATS3D Negative-Lipophilic BIN 08 (8.000–9.000 Å) |
| O-057 | SIRT5 | phenol, enol, carboxyl OH |
| SpMax2_Bh(s) | SIRT5 | largest eigenvalue n. 2 of Burden matrix weighted by I-state |
| CATS2D_04_AL | TK2 | CATS2D Acceptor-Lipophilic at lag 04 |
| JGI3 | TK2 | mean topological charge index of order 3 |
| MATS1i | TK2 | Moran autocorrelation of lag 1 weighted by ionization potential |
| P_VSA_e_3 | VCP | P_VSA-like on Sanderson electronegativity, bin 3 |
| RDF020p | VCP | Radial Distribution Function - 020/weighted by polarizability |
| SpMaxA_AEA(dm) | VCP | normalized leading eigenvalue from augmented edge adjacency mat. weighted by dipole moment |
Figure 2Models of chemical features accurately predict inhibitors of SARS-CoV-2 targets.a) Pipeline for fitting and validating models that predict IC50, Ki, or AC50 or a classification score, which reflects broad inhibitory activity against the listed viral targets. b) , mean absolute error (MAE) in predicting the log transformed endpoints (IC50, Ki, AC50). classification of chemicals for broad inhibition or activity against targets, validating using the area under the receiver operating characteristic (ROC) curve (AUC). Plots are for 10-fold cross validation, repeated 5 times. The model predictions are from an ensemble of three support vector machines (SVM), trained on different chemical feature sets or in some cases SVM and Random Forest. c) , external test set performance for regression models, where possible. , external test set performance for classification models, where possible. More comprehensive performance data in Supplementary Information 1.
Figure 3Approved drugs with putative activity against SARS-CoV-2 targets. a) The best predicted activity against SARS-CoV-2 targets among databases of approved drugs. Viral targets with few promising candidates are omitted. Comprehensive table in Supplementary Information 2. b) Network showing drugs that are among the top 25 for multiple viral targets (drugs: black nodes; viral targets: red nodes).
Figure 4Predicting activity against SARS-CoV-2 targets among theoretical volatile chemicals. a) , count of chemicals per target after initially filtering based on predicted scores. , chemical counts across all viral targets for the models predicting general inhibitory or activity against () and those for specific inhibitory endpoints () (e.g. IC50). b) Pipeline for further prioritizing chemical sets according to estimated log vapor pressure and low mammalian toxicity (LD50). c) Top ranking predictions of general inhibition or activity against targets () and/or specific inhibitory endpoints () against SARS-CoV-2 targets from the ZINC database, filtered to the highest estimated log vapor pressures.
Figure 5Predicted chemicals rank highly for multiple SARS-CoV-2 targets. a) Network of chemicals predicted to have low toxicity that are ranked highly for >1 viral targets. Chemicals were considered if for multiple viral targets they had >0.75 activity/class scores or predictions of specific assay measures (Ki, IC50, and AC50) < 100 nM.
Figure 6Predictions of SARS-CoV-2 targets among chemicals lacking odorant properties.a) Sample of ZINC chemicals scoring highly for activity against the viral targets (classification or regression models, Score). Comprehensive tables in Supplementary Information 4, detailing the model type and predicted assay endpoint.
KEY RESOURCES TABLE
| Reagent or Resource | Source | Identifier |
|---|---|---|
| Deposited Data | ||
| ZINC 15 | Sterling and Irwin, 2015 | |
| chEMBL 25 | EMBL-EBI, 2011; Mendez et al., 2019 | |
| EPI Suite Data | EPA, 2015 | |
| DrugBank | Wishart et al., 2018 | |
| Therapeutic Targets Database (TTD) | Chen, 2002; Zhu et al., 2009 | |
| FDA: Substance Registration Database (FDA UNII) | FDA, 2020 | |
| Hazardous Substances Data Bank (HSDB) | Fonger et al., 2014 | |
| Viral Targets | Gordon et al. 2020 | |
| Acutoxbase | Kinsner-Ovaskainen et al., 2009 | |
| DSSTox | Richard and Williams, 2002 | |
| Top 50 physicochemical features to predict inhibitory assay activity for each SARS-CoV-2 target | This paper | Supplementary Table 2 |
| Top 50 physicochemical features to predict broadly inhibiting activity for each SARS-CoV-2 target | This paper | Supplementary Table 3 |
| Top predicted drug and FDA registered chemicals. | This paper | Supplementary Information 2 |
| Top predicted chemicals from ZINC, rank ordered by estimated vapor pressure | This paper | Supplementary Information 3 |
| Top predicted chemicals from ZINC, filtered for toxicity | This paper | Supplementary Information 4 |
| Software and Algorithms | ||
| Classification and regression training (caret) | Kuhn, 2008 | |
| Kernlab | Karatzoglou et al., 2004 | |
| Regularized Random Forest (RRF) | Deng and Runger, 2013 | |
| RDKit | Landrum, 2006 | |
| ggplot2 | Wickham, 2016 | |