
Predicting novel drugs for SARS-CoV-2 using machine learning from a >10 million chemical space.

Abstract

There is an urgent need for the identification of effective therapeutics for COVID-19 and we have developed a machine learning drug discovery pipeline to identify several drug candidates. First, we collect assay data for 65 target human proteins known to interact with the SARS-CoV-2 proteins, including the ACE2 receptor. Next, we train machine learning models to predict inhibitory activity and use them to screen FDA registered chemicals and approved drugs (~100,000) and ~14 million purchasable chemicals. We filter predictions according to estimated mammalian toxicity and vapor pressure. Prospective volatile candidates are proposed as novel inhaled therapeutics since the nasal cavity and respiratory tracts are early bottlenecks for infection. We also identify candidates that act across multiple targets as promising for future analyses. We anticipate that this theoretical study can accelerate testing of two categories of therapeutics: repurposed drugs suited for short-term approval, and novel efficacious drugs suitable for a long-term follow up.


Keywords: ACE2; Chemical informatics; Computer-aided drug design; Covid-19; Drug discovery; Machine learning; Microbiology; SARS-CoV-2; Structure activity relationship; Toxicology; Viral disease; Virology; Viruses

Year: 2020 PMID: 32802980 PMCID: PMC7409807 DOI: 10.1016/j.heliyon.2020.e04639

Source DB: PubMed Journal: Heliyon ISSN: 2405-8440

Introduction

SARS-CoV-2 is a novel coronavirus responsible for COVID-19, a rapidly evolving global pandemic. Coronaviruses primarily target the upper respiratory tract and the lungs, with varying degrees of severity. Related coronaviruses, such as SARS-CoV, which emerged in China in 2002, and MERS-CoV, which emerged in the Middle East in 2012, cause severe respiratory conditions. SARS-CoV-2 produces similarly severe respiratory conditions, albeit at a lower rate but with a higher contagion factor [1]. Alarmingly, infected individuals may be asymptomatic carriers, presumably harboring the viral infection in the upper airway tract, increasing the likelihood of infecting populations that are most susceptible to severe complications [2, 3]. Although the mechanisms underlying SARS-CoV-2 infection are not completely understood, select human proteins, including ACE2, are targets for the virus [4]. The SARS-CoV-2 receptor binding domain (RBD) interacts strongly with the human ACE2 receptor and TMPRSS2 to enter a human cell [5]. In addition to ACE2, a recent systems-level analysis of protein-protein interactions with peptides encoded in the SARS-CoV-2 genome identified ~300 additional human proteins, of which 66 were considered suitable candidates for identification of therapeutics [6]. Gordon et al. performed an in vitro assay with human cells expressing 26 SARS-CoV-2 proteins, followed by an analysis for high-confidence interactions. Of the hundreds of reported interactions, 66 were prioritized, and the authors subsequently mined and tested FDA-approved drugs that were known or suspected to target these human proteins. Most of the human target proteins are overexpressed in the respiratory tract. Of particular note is the entry receptor ACE2, which is expressed at high levels in a few cell types of the nasal epithelium, as well as elsewhere [6, 7].
This could be an unusual opportunity for volatile inhaled therapeutics and prophylactics that will have direct access to the cells that are infected by the virus. The Gordon et al. study also identified FDA-approved drugs that have known activity against these human protein targets or are structurally related to chemicals with known activity on the targets. While these drugs have not been comprehensively tested on the virus, another study performed high-throughput testing of ~12,000 FDA-approved or clinical-stage drugs on viral replication in cell lines [8]. This study identified at least 6 potential leads, including a kinase inhibitor, a CCR1 inhibitor and 4 cysteine protease inhibitors, that are candidates for testing in clinical trials. Since the regulatory process for the approval of new drugs can take several years, the repurposing of FDA-approved drugs for COVID-19 offers a potential fast track to approval. One of the more promising candidates being tested is the antiviral remdesivir, which has been effective in vitro [9] as well as in non-human primates [10], with human trials currently ongoing. Another drug being tested is the antimalarial hydroxychloroquine, which showed some promise alongside the antibiotic azithromycin in small clinical trials [11, 12]. However, hydroxychloroquine has shown less promise in larger trials for treating COVID-19 [13]. While drug repurposing is expedient, it is possible that drugs designed for other diseases will not be as well suited to respiratory organs, where a large percentage of putative human proteins targeted by the virus are enriched [6], or to the nervous system, implicated by neurological symptoms as well as prior evidence that coronaviruses can cross the blood-brain barrier [14, 15]. Drug-development strategies are also often guided by minimizing off-target interactions.
Repurposed drugs might have to be used in combination, and the side effects and interactions that this entails are presently not well defined. While there are recent efforts exploring novel, directed therapies from small molecule libraries [16], it is desirable to identify hundreds to thousands of putative chemicals, as the majority may be difficult to synthesize at scale, prove toxic at therapeutic concentrations, or yield inconsistent benefits across patients due to genetic variability. These shortcomings have significantly increased the demand for additional drugs or small molecules that might interfere with viral entry and replication. Additionally, even in mild cases that do not require hospitalization or experimental drug treatments, contracting the virus may nevertheless impact long-term health and community transmission [17], so prophylactics or non-toxic, easy-to-use therapeutics would be valuable for such cases as well. There are consequently unmet needs in COVID-19 research, including identification of compounds that target the relevant SARS-CoV-2 human proteins from (1) approved drugs, (2) FDA registered chemicals or (3) a large repository of ~14 million purchasable chemicals from the ZINC 15 database [18], for which we computed additional properties such as mammalian toxicity, vapor pressure, and logP. For the 65 human protein targets of SARS-CoV-2 that had publicly available bioassay and chemical data [6], we generated a database of predictions, first based on structural similarity to chemicals that interact with the targets and then based on machine learning models (34 targets). Many chemicals we have identified have little or no known biological activities and are predicted to have low toxicity in addition to a wide range of vapor pressures. These data are a resource to rapidly identify and test novel, safe treatment strategies for COVID-19 and other diseases where the target proteins are relevant.

Results

Identification of important structural features from known inhibitors of human target proteins

In order to test whether there is a structural basis for inhibitors of the target proteins identified previously [5, 6], we used two complementary approaches to evaluate each target's training set of compounds with known activity, compiled from the literature. First, we performed an exhaustive search for maximum common substructures among active chemicals. In some cases, enriched substructures were apparent among known ligands, with slight variation in the substructure based on the sensitivity to the targets, suggesting physicochemical features may be relevant in predicting activity against these targets (Supplementary Table 1). Next, we used a machine learning pipeline for predicting chemicals that interfere with SARS-CoV-2 targets. It involves selection of important physicochemical features for each target, followed by fitting support vector machines (SVM) with these features and then evaluating the predictions using various computational validation methods (Figure 1A). The chemical features that best predicted activity for the different targets included simple 2D information, describing the type and number of bonds, but also more abstract 3D geometries (Tables 1 and 2). Identification of each target-specific feature set provides a foundation to better understand the physicochemical basis of the activity. To that end, Supplementary Tables 2-3 include more comprehensive rank ordered lists of the physicochemical features that optimally predict activity against the targets (details about the feature ranking algorithms in Materials and Methods).

Figure 1

Machine learning pipeline to identify chemicals that interfere with SARS-CoV-2 targets. a) Overview of the pipeline to predict chemicals for 65 SARS-CoV-2 human targets selected from Gordon et al., 2020 and using bioassay data from publicly available databases. b) Graphical depiction of the pipeline details. Available bioassay data on the viral targets were mined for information to use in machine learning or structural analysis. This resulted in 24 targets that could be modeled using values for the most abundant inhibitory assay measure (e.g. Ki or IC50) and 21 targets modeled by classifying broad inhibition or activity against the proteins (34 unique targets in total). The remaining targets with limited data were funneled into a structural similarity analysis, which aids in developing more bioassay data and helps clarify the chemical features contributing to bioactivity. For targets modeled with supervised machine learning, optimal chemical features were identified on subsets of training data. The top features were sampled by support vector machines (SVM). These models were then aggregated. In certain cases, the Random Forest algorithm was included to improve the fit. External chemicals were used to verify successful predictions. Models trained for the 34 targets predicted large chemical databases including FDA registered chemicals and approved drugs, as well as 10+ million purchasable chemicals from the ZINC database. Top scoring predicted chemicals were subsequently assigned theoretical toxicity, log vapor pressure, and MLOGP, which estimates membrane permeability.

Table 1

Important chemical features for regression models. Top three physicochemical features for the viral targets with known bioassay activities.

Feature	Target	Description
GATS5s	ABCC1	Geary autocorrelation of lag 5 weighted by I-state
RDF055m	ABCC1	Radial Distribution Function - 055/weighted by mass
SpMax_B(s)	ABCC1	leading eigenvalue from Burden matrix weighted by I-State
CATS2D_08_AA	BRD2	CATS2D Acceptor-Acceptor at lag 08
RDF035s	BRD2	Radial Distribution Function - 035/weighted by I-state
SpDiam_X	BRD2	spectral diameter from chi matrix
HATS8p	BRD4	leverage-weighted autocorrelation of lag 8/weighted by polarizability
R5i+	BRD4	R maximal autocorrelation of lag 5/weighted by ionization potential
RDF035m	BRD4	Radial Distribution Function - 035/weighted by mass
Eig02_EA(bo)	CSNK2A2	eigenvalue n. 2 from edge adjacency mat. weighted by bond order
Eig05_EA(bo)	CSNK2A2	eigenvalue n. 5 from edge adjacency mat. weighted by bond order
SpMax2_Bh(m)	CSNK2A2	largest eigenvalue n. 2 of Burden matrix weighted by mass
CATS2D_04_AA	CSNK2B	CATS2D Acceptor-Acceptor at lag 04
SHED_DN	CSNK2B	SHED Donor-Negative
SpMin1_Bh(m)	CSNK2B	smallest eigenvalue n. 1 of Burden matrix weighted by mass
DISPm	DCTPP1	displacement value/weighted by mass
HATS7u	DCTPP1	leverage-weighted autocorrelation of lag 7/unweighted
Mor31s	DCTPP1	signal 31/weighted by I-state
MATS1e	DNMT1	Moran autocorrelation of lag 1 weighted by Sanderson electronegativity
Mor23m	DNMT1	signal 23/weighted by mass
TDB06u	DNMT1	3D Topological distance based descriptors - lag 6 unweighted
GATS4m	GFER	Geary autocorrelation of lag 4 weighted by mass
Mor14m	GFER	signal 14/weighted by mass
R5i	GFER	R autocorrelation of lag 5/weighted by ionization potential
DISPp	HDAC2	displacement value/weighted by polarizability
IC2	HDAC2	Information Content index (neighborhood symmetry of 2-order)
P_VSA_MR_5	HDAC2	P_VSA-like on Molar Refractivity, bin 5
F04[C–C]	IMPDH2	Frequency of C - C at topological distance 4
HOMA	IMPDH2	Harmonic Oscillator Model of Aromaticity index
VE1_B(s)	IMPDH2	coefficient sum of the last eigenvector (absolute values) from Burden matrix weighted by I-State
Eig02_AEA(dm)	ITGB1	eigenvalue n. 2 from augmented edge adjacency mat. weighted by dipole moment
SHED_AA	ITGB1	SHED Acceptor-Acceptor
SpMax2_Bh(s)	ITGB1	largest eigenvalue n. 2 of Burden matrix weighted by I-state
F10[C–N]	MARK2	Frequency of C - N at topological distance 10
nPyrroles	MARK2	number of Pyrroles
SaaNH	MARK2	Sum of aaNH E-states
max_conj_path	MARK3	maximum number of atoms that can be in conjugation with each other
SaaNH	MARK3	Sum of aaNH E-states
VE1_H2	MARK3	coefficient sum of the last eigenvector (absolute values) from reciprocal squared distance matrix
GATS3s	NSD2	Geary autocorrelation of lag 3 weighted by I-state
HOMA	NSD2	Harmonic Oscillator Model of Aromaticity index
Mor16s	NSD2	signal 16/weighted by I-state
H7m	PABPC1	H autocorrelation of lag 7/weighted by mass
JGI7	PABPC1	mean topological charge index of order 7
P_VSA_MR_2	PABPC1	P_VSA-like on Molar Refractivity, bin 2
GATS4m	PLAT	Geary autocorrelation of lag 4 weighted by mass
Mor04s	PLAT	signal 04/weighted by I-state
R6p+	PLAT	R maximal autocorrelation of lag 6/weighted by polarizability
nPyrroles	PRKACA	number of Pyrroles
RDF040v	PRKACA	Radial Distribution Function - 040/weighted by van der Waals volume
SpMin3_Bh(m)	PRKACA	smallest eigenvalue n. 3 of Burden matrix weighted by mass
Eig02_EA(bo)	PSEN2	eigenvalue n. 2 from edge adjacency mat. weighted by bond order
nArX	PSEN2	number of X on aromatic ring
VE1sign_D/Dt	PSEN2	coefficient sum of the last eigenvector from distance/detour matrix
SHED_DL	PTGES2	SHED Donor-Lipophilic
VE2sign_G	PTGES2	average coefficient of the last eigenvector from geometrical matrix
VE3sign_G	PTGES2	logarithmic coefficient sum of the last eigenvector from geometrical matrix
CATS3D_08_AL	RIPK1	CATS3D Acceptor-Lipophilic BIN 08 (8.000–9.000 Å)
MATS5i	RIPK1	Moran autocorrelation of lag 5 weighted by ionization potential
VE3sign_RG	RIPK1	logarithmic coefficient sum of the last eigenvector from reciprocal squared geometrical matrix
BLTA96	SIGMAR1	Verhaar Algae base-line toxicity from MLOGP (mmol/l)
F10[C–C]	SIGMAR1	Frequency of C - C at topological distance 10
TPSA(Tot)	SIGMAR1	topological polar surface area using N,O,S,P polar contributions
Eig01_AEA(dm)	TBK1	eigenvalue n. 1 from augmented edge adjacency mat. weighted by dipole moment
HATS4i	TBK1	leverage-weighted autocorrelation of lag 4/weighted by ionization potential
SdssC	TBK1	Sum of dssC E-states
AROM	VCP	aromaticity index
E1m	VCP	1st component accessibility directional WHIM index/weighted by mass
MATS5m	VCP	Moran autocorrelation of lag 5 weighted by mass
H5s	ACE2	H autocorrelation of lag 5/weighted by I-state
Mor10m	ACE2	signal 10/weighted by mass
Mor17m	ACE2	signal 17/weighted by mass

Table 2

Important chemical features for classification models. Top three physicochemical features for viral targets where the models classified chemicals as active vs inactive relative to broad inhibition or activation rather than a specific assay value (e.g. Ki, IC50, and AC50).

Feature	Target	Description
Mor18s	BRD4	signal 18/weighted by I-state
SpMAD_G/D	BRD4	spectral mean absolute deviation from distance/distance matrix
SpMax3_Bh(p)	BRD4	largest eigenvalue n. 3 of Burden matrix weighted by polarizability
P_VSA_LogP_3	HDAC2	P_VSA-like on LogP, bin 3
SHED_DA	HDAC2	SHED Donor-Acceptor
SHED_DL	HDAC2	SHED Donor-Lipophilic
G(N..N)	IDE	sum of geometrical distances between N..N
SM1_Dz(i)	IDE	spectral moment of order 1 from Barysz matrix weighted by ionization potential
Wap	IDE	all-path Wiener index
CATS2D_08_DA	TBK1	CATS2D Donor-Acceptor at lag 08
F08[N–N]	TBK1	Frequency of N - N at topological distance 8
P_VSA_e_3	TBK1	P_VSA-like on Sanderson electronegativity, bin 3
H7m	PRKACA	H autocorrelation of lag 7/weighted by mass
H7s	PRKACA	H autocorrelation of lag 7/weighted by I-state
RDF060m	PRKACA	Radial Distribution Function - 060/weighted by mass
GATS6e	MARK3	Geary autocorrelation of lag 6 weighted by Sanderson electronegativity
GATS6m	MARK3	Geary autocorrelation of lag 6 weighted by mass
Mor02m	MARK3	signal 02/weighted by mass
CATS2D_02_DL	IMPDH2	CATS2D Donor-Lipophilic at lag 02
CATS3D_07_DL	IMPDH2	CATS3D Donor-Lipophilic BIN 07 (7.000–8.000 Å)
NaasC	IMPDH2	Number of atoms of type aasC
C-039	ABCC1	Ar-C(=X)-R
VE2sign_Dz(p)	ABCC1	average coefficient of the last eigenvector from Barysz matrix weighted by polarizability
VE3sign_Dz(v)	ABCC1	logarithmic coefficient sum of the last eigenvector from Barysz matrix weighted by van der Waals volume
Mor31s	ABHD12	signal 31/weighted by I-state
RTi+	ABHD12	R maximal index/weighted by ionization potential
VE3sign_Dz(p)	ABHD12	logarithmic coefficient sum of the last eigenvector from Barysz matrix weighted by polarizability
E2m	BRD2	2nd component accessibility directional WHIM index/weighted by mass
GATS2m	BRD2	Geary autocorrelation of lag 2 weighted by mass
TDB03i	BRD2	3D Topological distance based descriptors - lag 3 weighted by ionization potential
MAXDP	COMT	maximal electrotopological positive variation
nDB	COMT	number of double bonds
P_VSA_MR_2	COMT	P_VSA-like on Molar Refractivity, bin 2
CATS2D_02_AL	DNMT1	CATS2D Acceptor-Lipophilic at lag 02
Mor04s	DNMT1	signal 04/weighted by I-state
VE3sign_Dt	DNMT1	logarithmic coefficient sum of the last eigenvector from detour matrix
ChiA_B(i)	EIF4H	average Randic-like index from Burden matrix weighted by ionization potential
F05[C–O]	EIF4H	Frequency of C - O at topological distance 5
NaasC	EIF4H	Number of atoms of type aasC
CENT	LOX	centralization
EE_G	LOX	Estrada-like index (log function) from geometrical matrix
VE2_D/Dt	LOX	average coefficient of the last eigenvector (absolute values) from distance/detour matrix
Eta_D_beta	MARK2	eta measure of electronic features
Mor29v	MARK2	signal 29/weighted by van der Waals volume
SpPosA_B(i)	MARK2	normalized spectral positive sum from Burden matrix weighted by ionization potential
CATS2D_07_AL	NEK9	CATS2D Acceptor-Lipophilic at lag 07
CATS2D_08_AL	NEK9	CATS2D Acceptor-Lipophilic at lag 08
TDB05p	NEK9	3D Topological distance based descriptors - lag 5 weighted by polarizability
CATS2D_06_DL	NEU1	CATS2D Donor-Lipophilic at lag 06
TDB04i	NEU1	3D Topological distance based descriptors - lag 4 weighted by ionization potential
X3A	NEU1	average connectivity index of order 3
nR06	RHOA	number of 6-membered rings
R8s+	RHOA	R maximal autocorrelation of lag 8/weighted by I-state
SpMin1_Bh(m)	RHOA	smallest eigenvalue n. 1 of Burden matrix weighted by mass
CATS3D_08_NL	SIRT5	CATS3D Negative-Lipophilic BIN 08 (8.000–9.000 Å)
O-057	SIRT5	phenol, enol, carboxyl OH
SpMax2_Bh(s)	SIRT5	largest eigenvalue n. 2 of Burden matrix weighted by I-state
CATS2D_04_AL	TK2	CATS2D Acceptor-Lipophilic at lag 04
JGI3	TK2	mean topological charge index of order 3
MATS1i	TK2	Moran autocorrelation of lag 1 weighted by ionization potential
P_VSA_e_3	VCP	P_VSA-like on Sanderson electronegativity, bin 3
RDF020p	VCP	Radial Distribution Function - 020/weighted by polarizability
SpMaxA_AEA(dm)	VCP	normalized leading eigenvalue from augmented edge adjacency mat. weighted by dipole moment

Machine learning models can successfully predict activity from chemical structure

We identified 24 targets with training sets large enough to model the log IC50, Ki, or AC50 (Figure 2A). Rigorous computational validation was performed and the results on training (Figure 2B, left) and test data that had been set aside (Figure 2C, left) indicated good overall performance according to the average mean absolute error (MAE) and the correlation between predicted and observed assay measures (MAE = 0.48; R = 0.62). Predictions of log Ki for the viral entry receptor, ACE2, were also accurate (test set R = 0.92; test set mean absolute error (MAE) = 0.53) (Figure 2C, left; Supplementary Information 1).
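The two summary statistics reported above (MAE and the correlation R between predicted and observed values) can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names and the toy values are hypothetical, R is assumed to be the Pearson correlation, and the endpoints are assumed to be on a log scale.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical log-transformed assay values for a held-out test set.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

mae = mean_absolute_error(observed, predicted)
r = pearson_r(observed, predicted)
print(f"MAE = {mae:.2f}, R = {r:.2f}")
```

Because the endpoints are log transformed, an MAE of about 0.5 corresponds to predictions that are typically off by a multiplicative factor rather than a fixed number of nM.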

Figure 2

Models of chemical features accurately predict inhibitors of SARS-CoV-2 targets. a) Pipeline for fitting and validating models that predict IC50, Ki, or AC50 or a classification score, which reflects broad inhibitory activity against the listed viral targets. b) Left, mean absolute error (MAE) in predicting the log transformed endpoints (IC50, Ki, AC50). Right, classification of chemicals for broad inhibition or activity against targets, validated using the area under the receiver operating characteristic (ROC) curve (AUC). Plots are for 10-fold cross validation, repeated 5 times. The model predictions are from an ensemble of three support vector machines (SVM), trained on different chemical feature sets, or in some cases SVM and Random Forest. c) Left, external test set performance for regression models, where possible. Right, external test set performance for classification models, where possible. More comprehensive performance data in Supplementary Information 1.

For some of the viral targets, we noticed that assay data included additional inhibitory measurements or descriptions of general activity against the targets. Some of the available data, such as % inhibition, are less quantitative. However, to include as much of the available data as possible, we created models to identify physicochemical features that might broadly contribute to inhibition or activity against the targets. We therefore assigned binary (active and inactive) labels to the chemicals, then trained models as outlined before (Figure 2A; Materials and Methods). The models that were developed using this classification approach similarly proved successful, validating over partitions of the training data (avg. AUC = 0.87, avg. shuffle AUC = 0.50, p < 10^-19) (Figure 2B, right), as well as over sets of external test chemicals (avg. AUC = 0.83, avg. shuffle AUC = 0.51, p < 10^-8) (Figure 2C, right) (Supplementary Information 1).
Collectively, these results suggested the models provided accurate predictions and could be used to screen approved drug libraries as well as databases of commercially available chemicals for novel therapeutics.
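The comparison between a model's AUC and a shuffled-label AUC can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy labels and scores, and the number of shuffles are assumptions. The AUC is computed as the probability that a randomly chosen active chemical scores higher than a randomly chosen inactive one.

```python
import random


def roc_auc(labels, scores):
    """AUC as P(score of an active > score of an inactive); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def shuffled_auc(labels, scores, n_shuffles=200, seed=0):
    """Average AUC after randomly permuting the labels (a null baseline)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_shuffles):
        permuted = list(labels)
        rng.shuffle(permuted)
        total += roc_auc(permuted, scores)
    return total / n_shuffles


# Hypothetical labels (1 = active) and classifier scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
null_auc = shuffled_auc(labels, scores)
print(f"AUC = {auc:.4f}, shuffled AUC = {null_auc:.2f}")
```

A shuffled AUC close to 0.5 indicates that the scores carry no information once the link between chemicals and labels is broken, which is the baseline the reported averages of 0.50 and 0.51 reflect.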

Predicting candidates for repurposing of FDA-approved drugs

Repurposing of existing FDA-approved drugs offers a path towards rapid deployment of therapeutics against SARS-CoV-2. Approved drugs may have activity that extends beyond the original target protein. Accordingly, we used the machine learning models to predict activities of ~100,000 FDA registered chemicals (UNII database) [19] as well as the DrugBank [20] and Therapeutic Targets [21, 22] databases, which include information on drug interactions, pathways, and approval status. Interestingly, some of the approved drugs are predicted to have high activity against the SARS-CoV-2 targets (Figure 3A). In order to identify more efficacious candidates, we isolated the drugs scoring in the top 25 for multiple targets and found a few of high priority (Figure 3B). The structural analysis suggested that hits also visually display 2D similarity to known active chemicals (Supplementary Information 2).
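The step of isolating drugs that score in the top 25 for more than one target can be sketched as follows. The data layout, function name, drug names and scores are hypothetical, and the sketch assumes scores are oriented so that higher means more active (for regression endpoints such as IC50, where lower is better, the ranking would need to be reversed).

```python
from collections import defaultdict


def multi_target_hits(scores_by_target, top_k=25, min_targets=2):
    """Return {drug: [targets]} for drugs ranked in the top_k of several targets.

    scores_by_target maps target -> {drug: predicted score}, higher = better.
    """
    targets_per_drug = defaultdict(list)
    for target, scores in scores_by_target.items():
        ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
        for drug in ranked:
            targets_per_drug[drug].append(target)
    return {
        drug: sorted(targets)
        for drug, targets in targets_per_drug.items()
        if len(targets) >= min_targets
    }


# Toy predictions; the drug names and scores are made up for illustration.
predictions = {
    "ACE2": {"drug_a": 0.91, "drug_b": 0.40, "drug_c": 0.30},
    "BRD4": {"drug_a": 0.88, "drug_b": 0.95, "drug_c": 0.10},
    "TBK1": {"drug_a": 0.20, "drug_b": 0.90, "drug_c": 0.30},
}

hits = multi_target_hits(predictions, top_k=2)
print(hits)
```

The resulting mapping is the edge list of a drug-target network of the kind shown in Figure 3B.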

Figure 3

Approved drugs with putative activity against SARS-CoV-2 targets. a) The best predicted activity against SARS-CoV-2 targets among databases of approved drugs. Viral targets with few promising candidates are omitted. Comprehensive table in Supplementary Information 2. b) Network showing drugs that are among the top 25 for multiple viral targets (drugs: black nodes; viral targets: red nodes).

Predicting volatile drug candidates from a large ~14M chemical space

Given that many of the human target proteins are overexpressed in the respiratory tract, including the entry receptor ACE2 in only a few cell types of the nasal epithelium, the upper airways and lungs [7, 23], we reasoned that volatile chemicals may offer a unique opportunity as inhaled therapeutics that will have direct access to the cells and tissues that are infected by the virus. We used the machine learning models to search a large database of ~14 million commercially available chemicals (ZINC) for volatile candidates. We initially isolated the top 1% of the predicted scoring distribution (Figure 4A, left), which resulted in >1 million chemicals in total (Figure 4A, right). To prioritize the hits for potential human use, we next developed machine learning models to predict volatility (vapor pressure) (Supplementary Figure 1) and mammalian toxicity (LD50) (Supplementary Figure 2). The toxicity and vapor pressure estimates helped identify smaller priority sets (Figure 4B). Although the vapor pressures were not especially high, we rank-ordered the top candidates according to the best values (Figure 4C; Supplementary Information 3).
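The prioritization described above (keep the top 1% of predicted scores, filter on estimated toxicity and vapor pressure, then rank by vapor pressure) can be sketched as below. The record fields, function names and both cutoffs are hypothetical, since the text does not state the thresholds used. The sketch assumes that higher scores mean higher predicted activity and relies on the fact that a higher LD50 means lower toxicity.

```python
def top_percent(records, key, percent=1):
    """Keep the highest-scoring `percent` of records (always at least one)."""
    keep = max(1, len(records) * percent // 100)
    return sorted(records, key=lambda r: r[key], reverse=True)[:keep]


def prioritize(records, min_log_ld50, min_log_vp, percent=1):
    """Score filter first, then toxicity and volatility filters, then rank."""
    top = top_percent(records, "score", percent)
    kept = [
        r for r in top
        if r["log_ld50"] >= min_log_ld50 and r["log_vp"] >= min_log_vp
    ]
    # Most volatile candidates first.
    return sorted(kept, key=lambda r: r["log_vp"], reverse=True)


# Synthetic stand-ins for screened chemicals; none of these values are real.
records = [
    {"id": f"chem_{i}", "score": i / 200, "log_ld50": i % 5, "log_vp": -(i % 7)}
    for i in range(200)
]

ranked = prioritize(records, min_log_ld50=3, min_log_vp=-5)
print([r["id"] for r in ranked])
```

Applying the score filter before the property filters mirrors the order in the text and keeps the more expensive property predictions confined to the shortlisted chemicals.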

Figure 4

Predicting activity against SARS-CoV-2 targets among theoretical volatile chemicals. a) Left, count of chemicals per target after initially filtering based on predicted scores. Right, chemical counts across all viral targets for the models predicting general inhibition or activity against the targets and those for specific inhibitory endpoints (e.g. IC50). b) Pipeline for further prioritizing chemical sets according to estimated log vapor pressure and low mammalian toxicity (LD50). c) Top ranking predictions of general inhibition or activity against targets and/or specific inhibitory endpoints against SARS-CoV-2 targets from the ZINC database, filtered to the highest estimated log vapor pressures.

Chemicals with suspected odorant properties, however, represent only a fraction of the chemical space, and these chemicals may not have the activity levels suited for COVID-19 cases. Volatile compounds, for instance, may be biased towards structurally simple chemicals that do not resemble drugs. We therefore also focused on additional chemicals with high predicted activities for their targets and low estimated toxicities regardless of vapor pressure. We identified numerous candidates with potential activity against multiple viral targets (Figure 5A) and many others with significant activity against a single target (Figure 6A; Supplementary Information 4).

Figure 5

Predicted chemicals rank highly for multiple SARS-CoV-2 targets. a) Network of chemicals predicted to have low toxicity that are ranked highly for >1 viral targets. Chemicals were considered if for multiple viral targets they had >0.75 activity/class scores or predictions of specific assay measures (Ki, IC50, and AC50) < 100 nM.

Figure 6

Predictions of SARS-CoV-2 targets among chemicals lacking odorant properties. a) Sample of ZINC chemicals scoring highly for activity against the viral targets (classification or regression models, Score). Comprehensive tables in Supplementary Information 4, detailing the model type and predicted assay endpoint.

Discussion

SARS-CoV-2 is a significant world health crisis. The full scope of COVID-19 disease and any long-term health complications following infection remain unclear. Although vaccines are the best long-term solution, treatments will be necessary to mitigate disease severity in the short term. What is concerning is that several repurposed drugs have already been tested in some form of clinical trial, and only one drug, remdesivir, has shown a clear benefit in randomized clinical trials. Additionally, there is no guarantee that an effective vaccine can be found for the SARS-CoV-2 virus, and therefore drug candidate pipelines are extremely important to pursue for the long-term research effort against COVID-19. A vaccine against SARS-CoV-2 would likely need to stimulate local immunity, since the infection is limited to mucosal surfaces, and these could be short-lived immunities. We have therefore taken a comprehensive approach to provide a pipeline for short- and long-term use, and for a potentially local application route via inhalation. Existing FDA-approved drugs that target a single protein important for viral replication and host entry are currently the highest priority for repurposing as new COVID-19 drugs. However, we think that there are compelling reasons to create pipelines to explore many putative targets, and chemical spaces that are far larger and more diverse than the known approved drugs. We have therefore screened 10+ million potentially purchasable compounds from the ZINC database and also predicted toxicity values for the numerous candidates. In addition, we have identified chemicals that are predicted to affect more than one of the host proteins, suggesting these may have more efficacy. One unusual category we have emphasized is volatiles, as these compounds may be biologically sourced, and therefore microbes could be genetically engineered to produce them at scale [24].
This would in turn reduce the strain on global supply chains for chemicals that are necessary in synthesizing certain pharmaceuticals. These chemicals are also intriguing options for drug cocktails. If present in metabolic pathways, they possibly already interact in vivo. Therefore, short-term therapeutic concentrations may be better tolerated in humans. It is nevertheless important to note that machine learning depends on available data. Because the size and diversity of publicly available bioassay data are limited, caution is required in interpreting the predictions. It is common to find past bioassays focused on similarly shaped chemicals, limiting the scope of the machine learning approach to find new chemistries. Importantly, apart from ACE2, the other human proteins that were identified to interact with SARS-CoV-2 are yet to be tested in vivo for efficacy. And although some of the candidate chemicals we identified may be biologically sourced, their concentrations are poorly defined or unknown, nor is there any understanding of a therapeutic concentration in this scenario. These data are presented as a forward-looking resource and a pipeline to evaluate chemical data with additional research. While our motivation was the evolving COVID-19 pandemic, the 65 SARS-CoV-2 targets including ACE2 are relevant to a range of other diseases and conditions. We therefore anticipate that the AI-based predictions of purchasable compounds from 10+ million chemicals will accelerate drug discovery in general and facilitate research on these chemicals in the future for a number of diseases. In general, the use of AI-driven tools could provide additional valuable solutions for tackling COVID-19 [25].

Materials and methods

Data sources for machine learning

ZINC

ZINC is a free database comprising 230 million chemicals for in silico analyses. It was developed as a resource for non-commercial research. Chemicals predicted here are from a purchasable subset; however, availability is subject to change and pricing may vary widely [18, 26].

Bioassay data

Bioassay data were retrieved from ChEMBL 25 using the associated Python module, which enables access to the API services via Python [27, 28]. The various inhibitory measures/endpoints were, wherever possible, standardized to nM units; the logarithm of the standardized values was used for machine learning. Regression models were fit for a single endpoint. For classification models, however, ‘active’ class chemicals were defined using the deposited activity comments, such as those for assays of general activity against proteins; we also assigned active labels for endpoints with values up to 10,000 nM (Ki and IC50) and, for the semi-quantitative % inhibition, values greater than 10%. The majority class was downsampled during the training and model tuning phases to adjust for possible class imbalances. Because the class labels were assigned using arbitrary cutoffs and the predicted activities for classification models from various assay endpoints are not clearly defined, we also compared each model fit against shuffled labels. Training for the regression and classification approaches was done on 85% of the total data. To reduce bias, feature selection (recursive feature elimination (RFE) algorithm) was always run on 85% of the data over 250–300 different partitions (iteratively running the 10-fold cross validation 25–30 times). Notably, in a small number of cases the remaining 15% was insufficient to effectively estimate performance using an external test set. For these cases, the held-out portion (15%) was then incorporated back into the dataset to better estimate performance of the trained model by 10-fold cross-validation (repeated 5 times) and obtain a better fit. We also fit 3 different radial basis function (RBF) support vector machine (SVM) models, wherein the chemical features (predictors) were randomly sampled (50%) from the top 70. This makes the performance estimates more conservative (see Key Resources Table for machine learning algorithm source files).
However, the structural diversity and size of the datasets imply some bias in the performance estimates.
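The labeling rules above can be illustrated with a minimal Python sketch. It is not the code used in the study: the unit table, the endpoint name "Inhibition", the choice of log10 as the logarithm, and balancing the classes to equal size are all assumptions made for illustration.

```python
import math
import random

# Conversion factors to nM (illustrative subset; an assumption, not the paper's list).
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}


def standardize_to_nm(value, unit):
    """Convert a concentration to nM."""
    return value * TO_NM[unit]


def log_activity(value_nm):
    """Log-transformed response for regression (base 10 is assumed here)."""
    return math.log10(value_nm)


def is_active(endpoint, value):
    """Apply the cutoffs stated in the text.

    Ki and IC50 values are in nM; "Inhibition" (hypothetical label) is in percent.
    """
    if endpoint in ("Ki", "IC50"):
        return value <= 10_000
    if endpoint == "Inhibition":
        return value > 10
    raise ValueError(f"no cutoff defined for endpoint {endpoint!r}")


def downsample_majority(records, seed=0):
    """Randomly drop majority-class records until both classes have equal size."""
    rng = random.Random(seed)
    actives = [r for r in records if r["active"]]
    inactives = [r for r in records if not r["active"]]
    minority, majority = sorted((actives, inactives), key=len)
    return minority + rng.sample(majority, len(minority))
```

For example, an IC50 of 2.5 uM standardizes to 2,500 nM and is labeled active, whereas a 10% inhibition value falls just short of the "greater than 10%" rule.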

Toxicity data

Training and testing data are curated by various government agencies and provided freely to the general public as databases (see Key Resources Table) [29, 30, 31].

Vapor pressure data

Training and testing data are from EPI Suite [32], which is developed and maintained by the Environmental Protection Agency (EPA) (see Key Resources Table). Methods for fitting these models are as outlined in the Figure 1 pipeline. To compare the vapor pressure model predictions across different machine learning methods and against EPI Suite, data were split into train/test partitions as defined in a previous study [33].

Selecting optimally predictive chemical features

Optimizing chemical structures

Chemical features were computed as ~5300 AlvaDesc descriptors (AlvaDesc is from the developers of the DRAGON software). 3D coordinates were generated and optimized using RDKit in Python [34].

Chemical feature ranking and importance

Cross-validated recursive feature elimination (CV-RFE)

Recursive feature elimination iteratively selects subsets of features to identify optimal sets. The algorithm is a “wrapper” and therefore relies on an additional algorithm to supply predictions and quantify importance. We used two different algorithms, depending on the size and composition of data: (1) Random Forest and (2) Support Vector Machine (SVM). Random forest determines the importance in relation to the % increase in error when permuting a feature or predictor. There is no equivalent method for computing importance with the SVM. Accordingly, the importance is based on fitting a model between the response and each predictor or feature as compared to null. If the response is numeric, importance is derived from the pseudo R2 (non-linear regression). If, however, the response is binary, the AUC is instead computed for each predictor or feature (see Key Resources Table for algorithm source files). Including cross-validation with the recursive feature elimination (RFE) partitions the training data into multiple folds. This step avoids biasing performance estimates but results in lists of top predictors over the cross-validation folds such that importance of a predictor is based on a selection rate.
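The selection rate mentioned above can be sketched as follows. This is an illustrative Python fragment rather than the caret implementation used in the study; the descriptor names in the usage note are hypothetical.

```python
from collections import Counter


def selection_rates(selected_per_resample):
    """Fraction of resamples (folds) in which each feature was selected."""
    n = len(selected_per_resample)
    counts = Counter(f for sel in selected_per_resample for f in set(sel))
    return {f: c / n for f, c in counts.items()}


def top_by_selection_rate(selected_per_resample, k):
    """Return the k features selected most often; ties are broken by name."""
    rates = selection_rates(selected_per_resample)
    return sorted(rates, key=lambda f: (-rates[f], f))[:k]
```

Given the per-fold selections `[["MW", "logP", "TPSA"], ["MW", "logP"], ["MW", "nHBD"]]`, the feature `MW` has a selection rate of 1.0 and `logP` a rate of 2/3, so both would rank ahead of features chosen in only one fold.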

Selection bias

Selecting features or predictors on the same dataset used for cross validation results in models that have already “seen” possible partitions of the data and therefore performance metrics will be biased. Selection bias [35] was addressed by bootstrapping and cross validation, which ensure some separation between predictor/feature selection and model-fitting/validation. In addition to these methods, we used hidden test sets or more generally performed the feature selection on a portion of the data.
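A minimal sketch of keeping feature selection separate from validation is given below. It assumes a simple stand-in ranking criterion (absolute Pearson correlation with the response) instead of RFE, and the function names are hypothetical; the point is only that features are ranked using the training rows of each fold and never the held-out rows.

```python
import random


def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def select_features(X, y, rows, n_keep):
    """Rank columns of X using only the given rows (the training fold)."""
    n_cols = len(X[0])
    scores = [
        abs_pearson([X[r][c] for r in rows], [y[r] for r in rows])
        for c in range(n_cols)
    ]
    return sorted(range(n_cols), key=lambda c: -scores[c])[:n_keep]
```

In use, `select_features(X, y, train, n_keep)` would be called inside the loop over `kfold_indices`, and the model fit on the selected columns would then be scored on the corresponding `test` rows only.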

Selecting optimal machine learning algorithms

The support vector machine (SVM) with the radial basis function kernel (RBF) outperformed regularized Random Forest (regRF) or performed comparably. Rather than utilize many different approaches, we aggregated multiple SVM models to improve generalizability. However, in the case of the classification model for EIF4H, we included the regularized random forest algorithm, as the aggregated prediction (SVM and regRF) was clearly optimal on the test data. Algorithm selection and training was done using the classification and regression training package in R [36], caret [37], and the implementation of the Support Vector Machine (SVM) algorithm in Kernlab [38].
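The aggregation of several SVM models can be sketched in Python as below. The sketch assumes that each model is trained on a random half of the top 70 ranked features and that aggregation is an unweighted mean of the per-chemical scores; the text does not state the exact aggregation rule, and the study itself used caret and Kernlab in R.

```python
import random
from statistics import mean


def sample_feature_subsets(ranked_features, n_models=3, pool_size=70,
                           frac=0.5, seed=0):
    """Draw one random feature subset per model from the top-ranked pool."""
    rng = random.Random(seed)
    pool = ranked_features[:pool_size]
    size = int(len(pool) * frac)
    return [rng.sample(pool, size) for _ in range(n_models)]


def aggregate_scores(per_model_scores):
    """Average per-chemical scores across models (unweighted mean, an assumption)."""
    return [mean(scores) for scores in zip(*per_model_scores)]
```

Each subset would be passed to a separately trained model, and `aggregate_scores` would then combine the three score vectors into one prediction per chemical.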

Enriched substructures/cores

Enriched cores were analyzed using RDKit through Python [34]. The algorithm performs an exhaustive search for a maximum common substructure among a set of chemicals. In practice, larger sets often yield fewer substantive cores. To remedy this, the algorithm includes a threshold parameter that relaxes the proportion of chemicals that must contain the core. We used a threshold of 0.55, which ensures that the majority of the chemicals contain the core.

Chemical fingerprinting

Extended Connectivity Fingerprints (ECFP) are a class of cheminformatic algorithms that iteratively combine chemical features that are present within a predefined radius/diameter, representing them by a set of integer values. Typically, the fingerprint is converted into a binary string of fixed length using a hash function. Here, the bit length was set to 1024 and the radius to 2 (diameter = 4, or ECFP4). This structural representation was preferred as it is strongly associated with activity [39]. Accordingly, it is a suitable alternative to identify drug candidates in the absence of machine learning models. We used the ECFP algorithm in RDKit (Morgan or circular fingerprint) [34]. The similarity between the fingerprints of chemicals with known activity against the SARS-CoV-2 targets and prospective chemicals was computed using the Tanimoto index. This index is a similarity coefficient (0–1; 1 = max similarity). It is the overlap of the “on-bits” divided by the sum of the unique “on-bits”:

$$T(A, B) = \frac{c}{a + b - c}$$

where c = overlapping “on-bits”; a = “on-bits” in A; b = “on-bits” in B. Notably, coefficients of 1 need not imply identical chemicals.
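As a minimal illustration of the Tanimoto index, the following Python sketch operates on sets of “on-bit” positions. It is independent of RDKit, which provides its own implementation (`DataStructs.TanimotoSimilarity`); the bit positions in the example are made up.

```python
def tanimoto(on_bits_a, on_bits_b):
    """Tanimoto index c / (a + b - c) for two collections of on-bit positions."""
    a, b = set(on_bits_a), set(on_bits_b)
    c = len(a & b)                     # overlapping on-bits
    union = len(a) + len(b) - c        # unique on-bits across both fingerprints
    return c / union if union else 0.0


# Two fingerprints sharing 2 of 4 unique on-bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Because hashing folds many substructures into a fixed 1024-bit string, two different chemicals can set exactly the same bits, which is why a coefficient of 1 does not guarantee identical structures.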

Support vector machine (SVM)

Training the support vector machine (SVM) involves identifying a set of parameters that optimize a cost function, where cost₁ and cost₀ correspond to training chemicals labeled as “Active” and “Inactive,” respectively. $\theta^{T}f$ is the scoring function or output of the support vector machine. If the output is ≥0, the prediction is “Active.” The function (ƒ) is a kernel function. The kernel determines the shape of the decision boundary between the active and inactive chemicals from the training set. The radial basis function (RBF) or Gaussian kernel enables the learning of more complex, non-linear boundaries. It is therefore well suited for problems in which the biologically active chemicals cannot be properly classified as a linear function of physicochemical properties. This kernel computes the similarity for each chemical (x) and a set of landmarks (l):

$$f = \exp\left(-\frac{\lVert x - l \rVert^{2}}{2\sigma^{2}}\right)$$

where σ² is a tunable parameter determined by the problem and data. The similarity with respect to these landmarks is used to predict new chemicals (“Active” vs. “Inactive”).
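The scoring rule can be sketched in Python as follows, assuming the usual Gaussian form of the kernel, exp(−‖x − l‖² / (2σ²)), and a parameter vector whose first element is an intercept. The landmarks and weights in the usage note are invented for illustration and are not fitted values from the study.

```python
import math


def rbf_similarity(x, landmark, sigma2):
    """Gaussian (RBF) similarity between a chemical x and one landmark."""
    sq_dist = sum((xi - li) ** 2 for xi, li in zip(x, landmark))
    return math.exp(-sq_dist / (2.0 * sigma2))


def svm_score(x, landmarks, theta, sigma2):
    """Weighted sum of landmark similarities; theta[0] is the intercept."""
    f = [1.0] + [rbf_similarity(x, l, sigma2) for l in landmarks]
    return sum(t * fi for t, fi in zip(theta, f))


def predict(x, landmarks, theta, sigma2):
    """Label a chemical 'Active' when the score is >= 0, else 'Inactive'."""
    return "Active" if svm_score(x, landmarks, theta, sigma2) >= 0 else "Inactive"
```

With landmarks `[[0, 0], [3, 3]]`, weights `[0, 1, -1]` and σ² = 1, a chemical near the first landmark scores positive (“Active”) and one at the second landmark scores negative (“Inactive”). A smaller σ² makes the similarity fall off faster with distance, giving a more flexible boundary.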

Model performance metrics

The Area under the ROC Curve (AUC) assesses the true positive rate (TPR or sensitivity) as a function of the false positive rate (FPR or 1-specificity) while varying the probability threshold (T) for a label (Active/Inactive). If the computed probability score (x) is greater than the threshold (T), the observation is assigned to the active class. Integrating the curve provides an estimate of classifier performance:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR})$$

where TPR and FPR are both evaluated at the variable threshold T applied to the probability score x. A curve passing through the top left corner gives an AUC of 1.0, denoting maximum sensitivity to detect all targets or actives in the data without any false positives. The theoretical random classifier is reported at AUC = 0.5. However, we generated baseline classifiers that are more realistic than theoretical random classification by shuffling the chemical feature values in the models and statistically comparing the mean AUCs across multiple partitions of the data. This controls against optimally tuned algorithms predicting well simply because of specific predictor attributes (e.g. range, mean, median, and variance) or models that are of a specific size (number of predictors) performing well even with shuffled values. Additionally, biological data sets are often small, with stimuli or chemicals that reflect research biases rather than random selection, possibly leading to optimistic validation estimates without the proper controls.

We used the AUC for evaluating classification models. For the classification-based training, we initially converted the inhibitory data into a binary label (Active/Inactive). For predictions of quantitative bioassay measures (e.g. Ki, IC50, AC50, Log LD50), we computed the mean absolute error (MAE), the correlation coefficient (R) and the squared correlation coefficient (R2). The MAE is the mean of the absolute difference between predicted and observed values. It therefore assigns equal weight to all prediction errors, whether large or small:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{y}_i - y_i \rvert$$

where $\hat{y}_i$ = predicted and $y_i$ = observed.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

where TP = True Positive and FN = False Negative.

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

where TN = True Negative and FP = False Positive.
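The metrics described in this section can be written compactly in Python. The sketch below is illustrative only: it computes the AUC through its rank interpretation (the probability that a randomly chosen active scores higher than a randomly chosen inactive, with ties counted as one half), which is equivalent to integrating the ROC curve, and it assumes labels are coded 1 for active and 0 for inactive.

```python
def auc(labels, scores):
    """Rank-based AUC; labels are 1 (active) or 0 (inactive)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mae(predicted, observed):
    """Mean absolute error between predicted and observed values."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def sensitivity(labels, predicted_labels):
    """TP / (TP + FN)."""
    tp = sum(1 for y, p in zip(labels, predicted_labels) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted_labels) if y == 1 and p == 0)
    return tp / (tp + fn)


def specificity(labels, predicted_labels):
    """TN / (TN + FP)."""
    tn = sum(1 for y, p in zip(labels, predicted_labels) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predicted_labels) if y == 0 and p == 1)
    return tn / (tn + fp)
```

For labels `[1, 1, 0, 0]` and scores `[0.9, 0.4, 0.6, 0.1]`, three of the four active/inactive pairs are ordered correctly, so the AUC is 0.75.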

Declarations

Author contribution statement

Joel Kowalewski: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. Anandasankar Ray: Conceived and designed the experiments; Wrote the paper.

Funding statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing interest statement

J.K. and A.R. are listed as inventors in patents submitted by the University of California Riverside. A.R. is also founder of Sensorygen Inc.

Additional information

No additional information is available for this paper.

KEY RESOURCES TABLE

Reagent or Resource	Source	Identifier
Deposited Data
ZINC 15	Sterling and Irwin, 2015	https://zinc.docking.org/substances/home/
ChEMBL 25	EMBL-EBI, 2011; Mendez et al., 2019	https://www.ebi.ac.uk/chembl/
EPI Suite Data	EPA, 2015	http://esc.syrres.com/interkow/EPiSuiteData.htm
DrugBank	Wishart et al., 2018	https://www.drugbank.ca/
Therapeutic Targets Database (TTD)	Chen, 2002; Zhu et al., 2009	http://db.idrblab.net/ttd/
FDA: Substance Registration Database (FDA UNII)	FDA, 2020	https://fdasis.nlm.nih.gov/srs/
Hazardous Substances Data Bank (HSDB)	Fonger et al., 2014	https://www.nlm.nih.gov/databases/download/hsdb.html
Viral Targets	Gordon et al., 2020	https://www.nature.com/articles/s41586-020-2286-9
Acutoxbase	Kinsner-Ovaskainen et al., 2009	https://www.acutetox.eu/
DSSTox	Richard and Williams, 2002	https://www.epa.gov/chemical-research/distributed-structure-searchable-toxicity-dsstox-database
Top 50 physicochemical features to predict inhibitory assay activity for each SARS-CoV-2 target	This paper	Supplementary Table 2
Top 50 physicochemical features to predict broadly inhibiting activity for each SARS-CoV-2 target	This paper	Supplementary Table 3
Top predicted drug and FDA registered chemicals. Structural similarity between drugs and chemicals with bioassay activities for SARS-CoV-2 targets	This paper	Supplementary Information 2
Top predicted chemicals from ZINC, rank ordered by estimated vapor pressure	This paper	Supplementary Information 3
Top predicted chemicals from ZINC, filtered for toxicity	This paper	Supplementary Information 4
Software and Algorithms
Classification and regression training (caret)	Kuhn, 2008	https://github.com/topepo/caret
Kernlab	Karatzoglou et al., 2004	https://github.com/cran/kernlab
Regularized Random Forest (RRF)	Deng and Runger, 2013	https://github.com/softwaredeng/RRF
RDKit (Python wrapper)	Landrum, 2006	https://github.com/rdkit/rdkit
ggplot2	Wickham, 2016	https://github.com/tidyverse/ggplot2

18 in total

1. TTD: Therapeutic Target Database.

Authors: X Chen; Z L Ji; Y Z Chen
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

2. Update of TTD: Therapeutic Target Database.

Authors: Feng Zhu; BuCong Han; Pankaj Kumar; XiangHui Liu; XiaoHua Ma; Xiaona Wei; Lu Huang; YangFan Guo; LianYi Han; ChanJuan Zheng; YuZong Chen
Journal: Nucleic Acids Res Date: 2009-11-20 Impact factor: 16.971

3. Covid-19: four fifths of cases are asymptomatic, China figures indicate.

Authors: Michael Day
Journal: BMJ Date: 2020-04-02

4. DrugBank 5.0: a major update to the DrugBank database for 2018.

Authors: David S Wishart; Yannick D Feunang; An C Guo; Elvis J Lo; Ana Marcu; Jason R Grant; Tanvir Sajed; Daniel Johnson; Carin Li; Zinat Sayeeda; Nazanin Assempour; Ithayavani Iynkkaran; Yifeng Liu; Adam Maciejewski; Nicola Gale; Alex Wilson; Lucy Chin; Ryan Cummings; Diana Le; Allison Pon; Craig Knox; Michael Wilson
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

5. ChEMBL: towards direct deposition of bioassay data.

Authors: David Mendez; Anna Gaulton; A Patrícia Bento; Jon Chambers; Marleen De Veij; Eloy Félix; María Paula Magariños; Juan F Mosquera; Prudence Mutowo; Michal Nowotka; María Gordillo-Marañón; Fiona Hunter; Laura Junco; Grace Mugumbate; Milagros Rodriguez-Lopez; Francis Atkinson; Nicolas Bosc; Chris J Radoux; Aldo Segura-Cabrera; Anne Hersey; Andrew R Leach
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

6. An orally bioavailable broad-spectrum antiviral inhibits SARS-CoV-2 in human airway epithelial cell cultures and multiple coronaviruses in mice.

Authors: Timothy P Sheahan; Amy C Sims; Shuntai Zhou; Rachel L Graham; Andrea J Pruijssers; Maria L Agostini; Sarah R Leist; Alexandra Schäfer; Kenneth H Dinnon; Laura J Stevens; James D Chappell; Xiaotao Lu; Tia M Hughes; Amelia S George; Collin S Hill; Stephanie A Montgomery; Ariane J Brown; Gregory R Bluemling; Michael G Natchus; Manohar Saindane; Alexander A Kolykhalov; George Painter; Jennifer Harcourt; Azaibi Tamin; Natalie J Thornburg; Ronald Swanstrom; Mark R Denison; Ralph S Baric
Journal: Sci Transl Med Date: 2020-04-06 Impact factor: 17.956

7. Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2.

Authors: Yuanyuan Zhang; Yaning Li; Renhong Yan; Lu Xia; Yingying Guo; Qiang Zhou
Journal: Science Date: 2020-03-04 Impact factor: 47.728

Review 8. The neuroinvasive potential of SARS-CoV2 may play a role in the respiratory failure of COVID-19 patients.

Authors: Yan-Chao Li; Wan-Zhu Bai; Tsutomu Hashikawa
Journal: J Med Virol Date: 2020-03-11 Impact factor: 2.327

9. Remdesivir and chloroquine effectively inhibit the recently emerged novel coronavirus (2019-nCoV) in vitro.

Authors: Manli Wang; Ruiyuan Cao; Leike Zhang; Xinglou Yang; Jia Liu; Mingyue Xu; Zhengli Shi; Zhihong Hu; Wu Zhong; Gengfu Xiao
Journal: Cell Res Date: 2020-02-04 Impact factor: 25.617

10. ZINC: a free tool to discover chemistry for biology.

Authors: John J Irwin; Teague Sterling; Michael M Mysinger; Erin S Bolstad; Ryan G Coleman
Journal: J Chem Inf Model Date: 2012-06-15 Impact factor: 4.956
