Deisy Morselli Gysi1,2, Ítalo Do Valle1, Marinka Zitnik3, Asher Ameli4,5, Xiao Gan1,2, Onur Varol1, Helia Sanchez4, Rebecca Marlene Baron6, Dina Ghiassian4, Joseph Loscalzo7, Albert-László Barabási1,2,8. 1. Network Science Institute and Department of Physics, Northeastern University, Boston, MA 02115, USA. 2. Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA. 3. Department of Biomedical Informatics, Harvard University, Boston, MA 02115, USA. 4. Scipher Medicine, 260 Charles St, Suite 301, Waltham, MA 02453, USA. 5. Department of Physics, Northeastern University, Boston, MA 02115, USA. 6. Division of Pulmonary and Critical Care Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA. 7. Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA. 8. Department of Network and Data Science, Central European University, Budapest 1051, Hungary.
Abstract
The COVID-19 pandemic demands the rapid identification of drug-repurposing candidates. In the past decade, network medicine has developed a framework consisting of a series of quantitative approaches and predictive tools to study host-pathogen interactions, unveil the molecular mechanisms of the infection, identify comorbidities, and rapidly detect drug-repurposing candidates. Here, we adapt the network-based toolset to COVID-19, recovering the primary pulmonary manifestations of the virus in the lung as well as observed comorbidities associated with cardiovascular diseases. We predict that the virus can manifest itself in other tissues, such as the reproductive system and brain regions; moreover, we predict neurological comorbidities. We build on these findings to deploy three network-based drug repurposing strategies, relying on network proximity, diffusion, and AI-based metrics, allowing us to rank all approved drugs based on their likely efficacy for COVID-19 patients, aggregate all predictions, and thereby arrive at 81 promising repurposing candidates. We validate the accuracy of our predictions using drugs currently in clinical trials, and an expression-based validation of selected candidates suggests that these drugs, with known toxicities and side effects, could be moved to clinical trials rapidly.
The speed and the disruptive nature of the COVID-19 pandemic have taken both public health and biomedical research by surprise, demanding the rapid deployment of new interventions and the development and testing of an effective cure and vaccine. Given the compressed timescales, the traditional methodologies relying on iterative development, experimental testing, clinical validation, and approval of new compounds are not feasible. A more realistic strategy relies on drug repurposing, requiring us to identify clinically approved drugs, with known toxicities and side effects, that may have a therapeutic effect in COVID-19 patients.
In the past decade, network medicine has developed and validated a series of computational tools that help us identify drug repurposing opportunities[1-7]. Here we deploy these tools to analyze the molecular perturbations induced by the virus SARS-CoV2, causing a pathophenotype (disease) known as COVID-19 (Coronavirus Disease 2019), and to identify potential drug repurposing candidates. We start by characterizing the COVID-19 disease module (Fig. 1A), representing the network neighborhood of the human interactome perturbed by SARS-CoV2, and its integrity in 56 tissues, to identify the tissues and organs the virus could invade. We then explore multiple network-based strategies to prioritize existing drugs based on their ability to interact with their protein targets and, thereby, perturb the disease module: network proximity-based methods that use a graph theoretic repurposing strategy[2]; diffusion-based methods to capture node similarity[8]; and approaches relying on an artificial intelligence network (AI-Net) that embed all available data to detect efficacy[5,6]. These three predictive approaches offer us twelve ranked lists, normally applied independently and validated on different datasets.
Here, we combine them using a rank aggregation algorithm[9], allowing us to exploit their relative advantages and to obtain a final prioritized ranking of drug repurposing candidates that offers higher accuracy than any of the pipelines alone. After eliminating drugs based on toxicity, delivery, and appropriateness of their use in COVID-19 patients, we selected 81 approved drugs as candidates for drug repurposing. Finally, we integrate experimental data from in vitro models to help identify the network-based mechanism of action for selected compounds and offer further validation using existing gene expression data (Fig. 1B)[10,11].
Figure 1:
Network Medicine Approaches to Drug Repurposing.
(A) The physical interactions that we use as input in the network medicine framework: Virus-human protein interaction, capturing the human proteins to which the viral proteins can bind; human protein-protein interactions, defining the human interactome of 18,508 proteins linked by 332,749 pairwise physical interactions; and the drug-human protein interactions, capturing the human protein targets of each drug in DrugBank. (B) A schematic representation of the input data we use for the predictions, the three prediction methods and the resulting pipelines, and the outcomes provided by the analysis.
Results
Mapping SARS-CoV2 Targets to the Human Interactome
SARS-CoV2 infects human cells by hijacking the host’s translation mechanisms to generate 29 viral proteins, which bind to multiple human proteins to initiate the molecular processes required for viral replication and additional host infection[12]. Gordon et al[13] expressed 26 of the 29 SARS-CoV2 proteins and used affinity-purification followed by mass spectrometry to identify 332 human proteins to which the viral proteins bind (Table S1)[13]. We mapped these 332 proteins to the human interactome, consisting of 18,508 proteins and 332,749 pairwise interactions between them (see Methods). Of the 332 viral targets, 239 proteins form a multiply connected subnetwork of viral targets (Fig. 2A), and 93 viral targets do not interact with other targets, but only with other human proteins.
Figure 2:
The COVID-19 Disease Module.
(A) Proteins targeted by SARS-CoV2 are not distributed randomly in the human interactome, but form a large connected component (LCC) consisting of 208 proteins, as well as multiple small subgraphs. We do not show the 93 viral targets that do not interact with other viral targets. Proteins not expressed in the lung are shown in orange, indicating that almost all proteins in SARS-CoV2 LCC are expressed in the lung, explaining the effectiveness of the virus in causing pulmonary infections. (B) The random expectation of the LCC size, indicating that the observed COVID-19 LCC, whose size is indicated by the red arrow, is larger than expected by chance. (C) Similarly, the lung-based LCC is also greater than expected by chance.
We find that 208 viral targets form a large connected component (LCC) (Fig. 2A). To test whether the observed LCC could have emerged by chance, we randomly placed 332 proteins in the interactome while matching the degrees of the original viral targets. The random LCC size of 183.23 ± 14.93 proteins and the resulting Z-Score = 1.65 indicate that the SARS-CoV2 target proteins aggregate in the same network vicinity[3,14], defining the location of the COVID-19 disease module within the human interactome. Potential drug repurposing candidates must target proteins either within this disease module or in its network vicinity.
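The significance test described above can be illustrated with a minimal sketch. It assumes the interactome is given as an adjacency dictionary and uses exact-degree matching for the random sets, which is a simplification (real pipelines often bin nodes by degree). The function names `largest_component_size`, `z_score`, and `lcc_z_score` are hypothetical and are not taken from the paper's code.

```python
import random
import statistics
from collections import defaultdict, deque


def largest_component_size(adj, nodes):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj.get(u, ()):
                if v in nodes and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def z_score(observed, mean, std):
    """Standard score of an observed value against a null distribution."""
    return (observed - mean) / std


def lcc_z_score(adj, nodes, n_random=1000, seed=0):
    """Compare the observed LCC with degree-matched random node sets.

    Assumption: each random set has exactly the same degree sequence as `nodes`.
    """
    rng = random.Random(seed)
    by_degree = defaultdict(list)
    for node, nbrs in adj.items():
        by_degree[len(nbrs)].append(node)
    wanted = defaultdict(int)  # how many nodes of each degree to draw
    for node in nodes:
        wanted[len(adj[node])] += 1
    observed = largest_component_size(adj, nodes)
    null = []
    for _ in range(n_random):
        sample = []
        for degree, k in wanted.items():
            sample.extend(rng.sample(by_degree[degree], k))
        null.append(largest_component_size(adj, sample))
    mu, sigma = statistics.mean(null), statistics.pstdev(null)
    z = z_score(observed, mu, sigma) if sigma > 0 else float("nan")
    return observed, mu, sigma, z
```

Plugging the reported values into `z_score(208, 183.23, 14.93)` gives approximately 1.66, in line with the full-network Z-Score listed in Table 1.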
Tissue Specificity
Previous work indicates that the expression of a disease-associated gene in a particular tissue is insufficient for the disease to manifest in that tissue; rather, a statistically significant disease LCC must be expressed there[15]. We, therefore, measured the statistical significance of the COVID-19 LCC in 56 tissues, using data from GTEx[16]. Using a GTEx median expression cutoff of 5, only 10,823 (58%) of the 18,406 proteins in the interactome are expressed in lung[15,16], while 214 (64%) of the 332 viral targets are expressed (Fig. 2C). We find that 182 viral targets form a tissue-specific LCC, and given the random expectation of 155.61 ± 14.82 for this LCC, we obtain a Z-Score = 1.78 for the lung, larger than the Z-Score = 1.65 of the LCC in the full network. Overall, in 30 tissues the LCC exceeds the Z-Score of the full network, helping us to identify tissues where the virus-induced disease could be manifested (Table 1). The list contains pulmonary and cardiovascular tissues, supporting the clinical observations that COVID-19 manifests itself in the respiratory system[17,18], that infected patients often present significant cardiovascular involvement[17,19], and that patients with underlying cardiovascular diseases show increased risk of death[20]. Interestingly, Table 1 indicates that the LCC is also expressed in multiple brain regions, likely explaining the recently reported neurological manifestations[21-23] of the disease. We also observe multiple tissues related to the digestive system (colon, esophagus, pancreas) in this analysis, again consistent with clinical observations. Finally, an unexpected finding is that Table 1 indicates expression in multiple reproductive system tissues (vagina, uterus, testis, cervix, ovary), as well as spleen, potentially related to disruptions in the regulation of the immune system[24,25].
Table 1:
Tissues Affected by SARS-CoV2.
The list of 30 tissues whose Z-Scores are higher than the overall Z-Score of the COVID-19 LCC. Tissues in the same or similar systems or organs are shaded by the same color.
| Tissue | LCC | Z-Score |
| --- | --- | --- |
| Immortalized cell line | 171 | 2.114 |
| Vagina | 185 | 2.062 |
| Brain-Frontal Cortex | 162 | 1.923 |
| Pancreas | 133 | 1.908 |
| Heart-Left Ventricle | 129 | 1.897 |
| Brain-Cortex | 161 | 1.889 |
| Brain-Hippocampus | 149 | 1.884 |
| Colon-Sigmoid | 179 | 1.870 |
| Kidney-Cortex | 151 | 1.848 |
| Fibroblasts | 183 | 1.843 |
| Adrenal Gland | 168 | 1.816 |
| Uterus | 184 | 1.808 |
| Cervix-Endocervix | 185 | 1.801 |
| Bladder | 179 | 1.799 |
| Testis | 189 | 1.794 |
| Lung | 182 | 1.780 |
| Artery | 178 | 1.777 |
| Spleen | 173 | 1.761 |
| Colon | 179 | 1.760 |
| Brain-Hypothalamus | 157 | 1.757 |
| Esophagus-Mucosa | 175 | 1.757 |
| Cervix-Ectocervix | 184 | 1.730 |
| Ovary | 182 | 1.726 |
| Skin | 178 | 1.720 |
| Heart-Atrial Appendage | 153 | 1.716 |
| Prostate | 183 | 1.715 |
| Brain-Spinal cord | 169 | 1.713 |
| Kidney | 167 | 1.704 |
| Brain-Anterior cingulate cortex | 152 | 1.690 |
| All | 208 | 1.658 |
Predicting Disease Comorbidity
Pre-existing conditions worsen prognosis and recovery of COVID-19 patients[26]. Previous work has shown that the disease relevance of the human proteins targeted by the virus can predict the symptoms/signs and diseases caused by a pathogen[14], prompting us to identify diseases whose molecular mechanisms overlap with cellular processes targeted by SARS-CoV2, allowing us to predict potential comorbidity patterns[27-29]. We retrieved 3,173 disease-causing genes for 299 diseases[30], finding that 110 of the 320 proteins targeted by SARS-CoV2 are implicated in disease; however, the overlap between SARS-CoV2 targets and the pool of the disease genes is not statistically significant (Fisher’s exact test; FDR-BH p-value > 0.05). We, therefore, evaluated the network-based overlap between the proteins associated with each of the 299 diseases and the targets of SARS-CoV2, using the S metric[30], where S < 0 signals a network-based overlap between the SARS-CoV2 viral targets v and the gene pool associated with disease b. We find that S > 0 for each disease, indicating that the SARS-CoV2 disease module does not directly overlap with any major disease module (Fig. S1 and Table S2). The diseases closest to the COVID-19 proteins (smallest S) include several cardiovascular diseases and cancer, whose comorbidity in COVID-19 patients is well documented[19,31,32] (Fig. 3). The same metric predicts comorbidity with neurological diseases, in line with our observation that the viral targets are expressed in the brain (Table 1).
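To make the S metric concrete, the sketch below assumes the commonly used definition of network separation, S = ⟨d_vb⟩ − (⟨d_vv⟩ + ⟨d_bb⟩)/2, where each ⟨d⟩ is a mean shortest distance to the nearest node of the relevant set; whether this matches every detail of the implementation in ref. [30] is an assumption. The adjacency-dictionary input and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def nearest_distances(adj, from_set, to_set, exclude_self=False):
    """For each node in `from_set`, the distance to its closest node in `to_set`."""
    out = []
    for u in from_set:
        dist = bfs_distances(adj, u)
        candidates = [dist[v] for v in to_set
                      if v in dist and not (exclude_self and v == u)]
        if candidates:
            out.append(min(candidates))
    return out


def mean(values):
    return sum(values) / len(values)


def separation(adj, set_v, set_b):
    """S < 0 suggests overlapping modules, S > 0 suggests separated modules.

    Assumes each set has at least two mutually reachable nodes.
    """
    v, b = set(set_v), set(set_b)
    d_vv = mean(nearest_distances(adj, v, v, exclude_self=True))
    d_bb = mean(nearest_distances(adj, b, b, exclude_self=True))
    # Between-set term pools nearest distances measured from both sides.
    d_vb = mean(nearest_distances(adj, v, b) + nearest_distances(adj, b, v))
    return d_vb - (d_vv + d_bb) / 2
```

On a six-node path graph, two sets at opposite ends give a positive S, while two sets sharing a node give a negative S, matching the interpretation used in the text.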
Figure 3:
Disease Comorbidity.
We measured the network proximity between COVID-19 targets and 299 diseases. The figure represents each disease as a circle whose radius reflects the number of disease genes associated with it[30]. The diseases closest to the center, whose names are marked, are expected to have higher comorbidity with the COVID-19 outcome. The farther a disease is from the center, the more distant its disease proteins are from the COVID-19 viral targets.
In summary, we find that the SARS-CoV2 targets do not overlap with disease genes associated with any major diseases, indicating that a potential COVID-19 treatment cannot be derived from the arsenal of therapies approved for specific diseases. These findings argue for a strategy that maps drug targets without regard to their localization within a particular disease module. However, the disease modules closest to the SARS-CoV2 viral targets are those with noted comorbidity for COVID-19 infection, such as pulmonary and cardiovascular diseases, and cancer. We also find multiple lines of network-based evidence linking the virus to the nervous system, a less explored comorbidity, consistent with the observations that many infected patients initially lose olfactory function and taste[33], and that 36% of patients with severe infection requiring hospitalization have neurological manifestations[21].
Identifying Drug Repurposing Candidates for COVID-19
Traditional repurposing strategies focus on drugs that target the human proteins to which viral proteins bind[13], or on drugs previously approved for other pathogens. The network medicine approach described here is driven by the recognition that most approved drugs do not directly target disease proteins, but bind to proteins in their network vicinity[34]. Hence our goal is to identify drug candidates that may or may not target the proteins to which the virus binds, but nevertheless have the potential to perturb the network vicinity of the virus disease module. To achieve this end, we utilized several network repurposing strategies: a network proximity strategy, identifying drugs whose targets are in the immediate network vicinity of the viral targets[2]; a diffusion-based strategy[8]; and an AI-Net based strategy that uses machine learning to combine multiple sources of evidence[5,6] (Fig. 1B). We test the predictive power of each method independently using a list of drugs under clinical trial for COVID-19 and combine the evidence provided by each method, arriving at a ranked list of drug repurposing candidates derived from the complete list of drugs in DrugBank (see Methods).
Proximity-based Ranking
Proximity-based methods allow us to measure the distance between two sets of nodes in a network and to determine the statistical significance of the observed proximity. Here we use proximity to explore the distance between the viral protein targets (approximating the COVID-19 disease module), and (i) the targets of approved drugs; and (ii) the differentially expressed genes induced by each drug, arriving at three drug ranking lists.
Pipeline P1: For each drug, we measured the network distance to the closest protein targeted by SARS-CoV2, and applied a degree-preserving randomization procedure to assess its statistical significance, expecting a Z-Score < 0 for proximal drugs. For example, chloroquine, a rheumatological and antimalarial drug currently in clinical trial for COVID-19, has Z-Score = −1.82, indicating its proximity to SARS-CoV2 targets. In contrast, etanercept, another anti-inflammatory drug with no supported COVID-19 relevance, has a Z-Score = 1.29, indicating that the drug’s protein targets are far from the SARS-CoV2 viral targets (Fig. 4A). We tested the proximity of 6,116 drugs with at least one target in DrugBank, identifying 385 drugs with Z-Scores < −2, and 1,201 drugs with Z-Scores < −1, representing potential repurposing candidates (Fig. 4B).
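The proximity Z-Score used in pipeline P1 can be sketched as follows. This is a minimal illustration under stated assumptions: the "closest" measure averages, over a drug's targets, the shortest path length to the nearest viral target, and the null model redraws both node sets with exact-degree matching. The function names and the adjacency-dictionary input are hypothetical; the authors' exact procedure is described in their Methods.

```python
import random
import statistics
from collections import defaultdict, deque


def bfs_distances(adj, source):
    """Shortest path lengths from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closest_distance(adj, drug_targets, disease_proteins):
    """Mean, over drug targets, of the distance to the nearest disease protein."""
    disease = set(disease_proteins)
    nearest = []
    for target in drug_targets:
        dist = bfs_distances(adj, target)
        reachable = [dist[p] for p in disease if p in dist]
        if reachable:
            nearest.append(min(reachable))
    return sum(nearest) / len(nearest) if nearest else float("inf")


def proximity_z(adj, drug_targets, disease_proteins, n_random=1000, seed=0):
    """Z-score of the observed closest distance against degree-matched random sets."""
    rng = random.Random(seed)
    by_degree = defaultdict(list)
    for node, nbrs in adj.items():
        by_degree[len(nbrs)].append(node)

    def degree_matched(nodes):
        # One random node of equal degree per original node.
        # Simplification: duplicates may collapse when collected into a set.
        return {rng.choice(by_degree[len(adj[n])]) for n in nodes}

    observed = closest_distance(adj, drug_targets, disease_proteins)
    null = [closest_distance(adj, degree_matched(drug_targets),
                             degree_matched(disease_proteins))
            for _ in range(n_random)]
    mu, sigma = statistics.mean(null), statistics.pstdev(null)
    z = (observed - mu) / sigma if sigma > 0 else float("nan")
    return observed, z
```

A negative Z-Score means the drug's targets sit closer to the viral targets than expected for random node sets with the same degrees, which is the criterion applied to the thresholds of −1 and −2 above.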
Figure 4:
Using Proximity to Predict Repurposing Drugs:
(A) The local neighborhood of the human interactome showing the targets of the drug chloroquine and the reference drug, dextrothyroxine, and the proteins closest to them targeted by SARS-CoV2 viral proteins. (B) Distribution of proximity scores for 6,116 drugs, capturing their distance to SARS-CoV2 targets. The six lighter bars indicate the proximity of drugs currently tested in clinical trials for COVID-19.
Pipeline P2: We computed the proximity Z-Score after disregarding, for each drug, the targets that are enzymes, carriers, or transporters. These proteins are targeted by multiple drugs and are often unrelated to the known pharmacological effects of the profiled drugs. Of the 5,550 drugs remaining after the filtering, the metric identified 165 drugs with Z-Scores < −2 and 541 with Z-Scores < −1. Using this measure, chloroquine and hydroxychloroquine are less proximal to COVID-19 targets, while ribavirin, an antiviral drug in clinical trial, gains proximity (Fig. 4C).
Pipeline P3: The effect of a drug is rarely limited to its direct target proteins; the drug can also activate or repress biological cascades and biochemical pathways that change the expression patterns of multiple proteins in the network neighborhood of the drug’s targets. DrugBank compiles 17,222 differentially expressed genes (DEGs), linked to 793 drugs in multiple cell lines. We measured the proximity between DEGs and COVID-19 targets for 793 drugs, finding 18 drugs with Z-Scores < −2, and 82 drugs with Z-Scores < −1.
In summary, each of the pipelines P1-P3 offers a list of drug candidates ranked by the proximity Z-Score of the respective pipeline.
Diffusion-based Methods
Diffusion State Distance (DSD) methods rank drugs based on the network similarity of their targets to COVID-19 protein targets. The similarity of two nodes captures the overlap of two global (network-wide) states following the independent perturbation of the two nodes. We implemented three statistical measures that resulted in five ranking pipelines (see Methods).
Pipeline D1: The L1 norm (Manhattan distance) calculates similarity as the sum of the absolute differences between the elements of the two vectors, providing a symmetric measure whose lower values reflect higher similarity.
Pipeline D2: As the L1 norm may result in loss of information[35], we also implemented the Kullback-Leibler (KL) divergence[36], which calculates the relative entropy of the vector representation of the two nodes, reporting the average asymmetric similarity value over the minimum pairwise similarity values (KL-min), and resulting in values between 0 and 1.
Pipeline D3: We deployed the KL divergence measure, discussed above, but reporting the average similarity value over the median pairwise similarity (KL-median).
Pipeline D4: We implemented the Jensen-Shannon (JS) divergence[37], a modified (symmetrized and smoothed) version of the KL divergence, reporting the average over the minimum value of pairwise similarities (JS-min).
Pipeline D5: Similar to D4, but we report the average over the median value of pairwise similarities (JS-median).
We used these five metrics to rank 3,225 drugs as potential treatments for COVID-19. Baricitinib, for example, is a rheumatological drug currently in trial for COVID-19, and all diffusion-based pipelines rank it higher than tocilizumab, a drug also indicated for rheumatological and severe inflammatory diseases with no proven COVID-19 relevance.
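The three measures behind pipelines D1 to D5 can be sketched on toy probability vectors. The sketch assumes each node is already represented by a diffusion-state vector that sums to 1; computing those vectors is not shown. The `drug_score` helper reflects one reading of "average over the minimum (or median) pairwise value" and is an assumption, as are all function names and the small `eps` guard against division by zero.

```python
import math
import statistics


def l1_distance(p, q):
    """Manhattan distance: sum of absolute element-wise differences (symmetric)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D(p || q) in bits; asymmetric in p and q.

    Assumption: `eps` only protects against q entries equal to zero.
    """
    return sum(a * math.log2(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Symmetrized, smoothed KL divergence; lies in [0, 1] with base-2 logs."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def drug_score(drug_vectors, disease_vectors, measure, reducer=min):
    """Average, over a drug's targets, of the reduced pairwise values.

    reducer=min gives the '-min' variants, reducer=statistics.median the
    '-median' variants (hypothetical reading of the pipeline description).
    """
    per_target = [reducer([measure(d, v) for v in disease_vectors])
                  for d in drug_vectors]
    return sum(per_target) / len(per_target)
```

Lower scores indicate drug targets whose diffusion states resemble those of the viral targets, so drugs would be ranked in ascending order of `drug_score`.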
AI-Net based Strategy
We adopted machine learning tools previously developed for drug repurposing using the protein-protein interaction network as input[38,39], resulting in the AI-Net pipeline that exploits the power of AI in a network context[39] (see Methods).
The method learns how to represent (i.e., embed) the multimodal graph into a compact, low-dimensional vector space such that the algebraic operations in the learned embedding space reflect the topology of the input network (Fig. S4A), and specifies a deep transformation function that maps drugs and diseases to points in the learned space, termed ‘drug and disease embeddings’. As diseases are not independent of each other and genes are often shared between distinct diseases, the method embeds diseases associated with similar genes close together in the embedding space. Similarly, the effects of drugs are not limited to the proteins to which they directly bind, but spread throughout the protein-protein interaction network. To capture these effects, the method embeds closely together drugs whose target proteins have similar local neighborhoods in the underlying protein-protein interaction network.
We use the learned embeddings to generate four lists of candidate drugs for COVID-19, each ranked list containing 1,607 treatment recommendations. To obtain the four rankings, we use four distinct decoders, which decode the structure of small network neighborhoods around a drug or a disease node from the learned embeddings.
Pipeline A1: We search for drugs that are in the vicinity of the COVID-19 disease module by calculating the cosine distance between COVID-19 and all drugs in the decoded embedding space[40]. The decoding is based on the N = 10 nearest neighboring nodes in the embedding space, with a minimum distance between nodes of D = 0.25.
Pipeline A2: To prevent nodes in the decoded embedding space from packing together too closely, we choose D = 0.8 and keep N unchanged, pushing the structures apart into softer, more general features, offering a better overarching view of the embedding space at the loss of the more detailed structure.
Pipeline A3: Alternatively, to force the decoding to concentrate on the very local structure (to the detriment of the overall goal of the exercise), we choose N = 5 to explore a smaller neighborhood while setting the minimum distance at a midrange point, D = 0.5.
Pipeline A4: Instead of focusing on the finer local structure, we specify the decoder such that it preserves the broad structure (N = 10, D = 1), offering a broader view of the embedding space at the loss of detailed structure.
By inspecting the 20 highest ranked drug candidates offered by the AI-Net pipeline (Table S4), we observe several drugs that are currently in COVID-19 clinical studies (e.g., chloroquine, ritonavir). Other top-ranked drugs include anti-malarial medications and drugs used to treat autoimmune, pulmonary, and cardiovascular diseases.
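The final ranking step of the AI-Net pipelines, ordering drugs by cosine distance to the COVID-19 embedding, can be sketched in a few lines. The sketch assumes the decoded embeddings are already available as plain vectors; learning and decoding them is outside its scope. The toy vectors, drug labels, and function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 for parallel vectors, 2 for opposite ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_drugs(disease_vector, drug_vectors):
    """Return drug names ordered from closest to farthest from the disease."""
    return sorted(drug_vectors,
                  key=lambda name: cosine_distance(disease_vector, drug_vectors[name]))


# Hypothetical two-dimensional embeddings, for illustration only.
toy_disease = [1.0, 0.0]
toy_drugs = {
    "drug_a": [2.0, 0.0],   # same direction as the disease vector
    "drug_b": [1.0, 1.0],   # 45 degrees away
    "drug_c": [0.0, 1.0],   # orthogonal
    "drug_d": [-1.0, 0.0],  # opposite direction
}
toy_ranking = rank_drugs(toy_disease, toy_drugs)
```

Because cosine distance ignores vector length, `drug_a` ranks first even though its norm differs from that of the disease vector.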
Prioritizing Repurposing Candidates
The predictive pipelines discussed above offered altogether twelve rankings, each reflecting a different network-based criterion to estimate a drug’s likelihood to show efficacy in treating COVID-19 patients. As they all start from the same list of drugs and drug targets and operate on the same PPI network, the rankings provided by them are not expected to be fully independent. To quantify the similarity between them, we measure the Kendall τ rank correlation of the rankings provided by each pipeline. We find that two of the target proximity-based pipelines, P1 and P2, show high correlation with each other, as do the four AI-Net pipelines (A1-A4) and the five diffusion-based pipelines (D1-D5). Yet, the correlations across the three basic methods are much lower, and P3, relying on gene expression patterns, is also somewhat uncorrelated with the other pipelines, indicating that the different methods offer complementary ranking information (Fig. 5A).
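For readers unfamiliar with the Kendall τ, the following minimal sketch computes it for two rankings restricted to the drugs they share. It implements the tie-free variant (τ-a) with a simple quadratic loop, which is adequate for illustration but slow for thousands of drugs. The dictionary-based input and the function name are assumptions.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two rankings given as {drug: rank position}.

    Only drugs present in both rankings are compared; ties are assumed absent.
    """
    shared = sorted(set(rank_a) & set(rank_b))
    concordant = discordant = 0
    for x, y in combinations(shared, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1   # both pipelines order the pair the same way
        elif sign < 0:
            discordant += 1   # the pipelines disagree on this pair
    n_pairs = len(shared) * (len(shared) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value near 1 means two pipelines order drugs almost identically, a value near 0 means their orderings are unrelated, which is the pattern reported across the three method classes.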
Figure 5:
Comparison of the Predictive Pipelines.
(A) Heatmap of the Kendall τ capturing the correlation between the ranking predicted by the 12 drug repurposing pipelines. Methods using different approaches are not correlated, potentially prioritizing different drugs. (B) ROC Curves and AUC for each of the twelve pipelines used for drug repurposing, using as a gold standard the drugs under evaluation in clinical trial for treating COVID-19 (Table S5). (C) The performance of the overall cRank (all), which combines all pipelines into a final ranking list, is higher than the performance of each method individually (cRanks AIs, Ps and Ds).
To evaluate the predictive power of the pipelines, we test their ability to recover the drugs currently in clinical trials as COVID-19 treatment. For this purpose, we obtained a list of 67 drugs currently undergoing clinical trials from ClinicalTrials.gov (Table S5). We use the resulting list and the ranking predicted by each pipeline to compute the ROC (receiver operating characteristic) curves and the AUC (area under the curve) scores for model selection and performance analysis, measuring the quality of separation between positive and negative instances. As Fig. 5B shows, the best individual AUC values, of 0.86 to 0.87, are obtained by the four AI-Net based methods. Note that the performance of the four AI-Net pipelines is largely indistinguishable, in line with the finding that the ranking lists provided by them are highly correlated (Fig. 5A). The second-best performance, an AUC of 0.70, is provided by the proximity method P3. Close behind is P1 with AUC = 0.68, and we find that eliminating some drug targets in P2 decreases the AUC to 0.58. As a group, the diffusion methods offer AUC values between 0.55 and 0.56. Their lower performance is somewhat unexpected, as diffusion-based methods should capture higher order correlations than the proximity methods; one would thus expect a performance between the proximity-based and the AI-Net methods, which successfully integrate high order correlates.
Each method extracts its own network-based signal for prioritizing drugs. However, the scores of each method are biased differently, offering different rankings. We used a rank aggregation algorithm[9] to combine the 12 ranking lists, aiming to maximize the number of pairwise agreements between the final ranking and each input ranking. This objective, known as the Kemeny consensus, is NP-hard to compute[41,42]; hence, we used an algorithm to approximate it (see Methods).
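The evaluation and aggregation steps described above can be illustrated with a short sketch. The AUC is computed directly from a ranked list as the fraction of (positive, negative) drug pairs in which the positive is ranked higher. For aggregation, the sketch uses a Borda count, a simple and well-known heuristic swapped in here for illustration; it is not the approximation algorithm of ref. [9]. All function names and toy inputs are hypothetical.

```python
def auc_from_ranking(ranking, positives):
    """AUC of a best-first ranking against a set of known positives.

    Equals the probability that a random positive is ranked above a random
    negative; assumes no ties and at least one positive and one negative.
    """
    positives = set(positives) & set(ranking)
    n_pos = len(positives)
    n_neg = len(ranking) - n_pos
    positives_seen = 0
    wins = 0
    for drug in ranking:
        if drug in positives:
            positives_seen += 1
        else:
            # Every positive seen so far outranks this negative.
            wins += positives_seen
    return wins / (n_pos * n_neg)


def borda_aggregate(rankings):
    """Combine best-first rankings by summing Borda points.

    An item at position i (0-based) in a list of length n earns n - i points;
    items missing from a list earn 0 points from it. Ties in total score are
    broken alphabetically for determinism.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, drug in enumerate(ranking):
            scores[drug] = scores.get(drug, 0) + (n - position)
    return sorted(scores, key=lambda drug: (-scores[drug], drug))
```

In this setting, `positives` would be the drugs in clinical trials (Table S5), and `auc_from_ranking(borda_aggregate(all_rankings), positives)` would score a combined list in the same way as each individual pipeline.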
We first tested whether combining the rankings within each method class could improve the predictive power of the lists provided by the individual pipelines (Fig. 5C). The joint performance of the AI-Net group is 0.87, the same as A3. We do observe, however, an improvement for the proximity pipelines in the joint ranking, increasing performance from 0.70 to 0.72. Interestingly, the combined diffusion pipelines have lower performance (0.54) than the best diffusion pipelines (0.56, observed for D1, D2, and D4). What is particularly encouraging, however, is that when we combine all 12 pipelines, we obtain an AUC of 0.89, the highest of any individual or combination-based pipeline, confirming that the individual pipelines offer complementary information that can be harnessed by the combined ranking. It is this combined list, therefore, that defines our final ranked list of predicted drugs for repurposing.
Finally, we manually inspected the joint ranking list, removing drugs with significant toxicities, eliminating those not appropriate, and removing lower-ranked members of the same drug class (with some exceptions). Through this process, we arrived at a list of 81 drugs selected from the top 10% of the total combined rank list, representing our final repurposing candidates for COVID-19 (Table 2).
The selection contains drugs that are used for disorders of the respiratory (e.g., theophylline, montelukast) and cardiovascular (e.g., verapamil, atorvastatin) systems; antibiotics used to treat viral (e.g., ribavirin, lopinavir), parasitic (e.g., hydroxychloroquine, ivermectin, praziquantel), bacterial (e.g., rifaximin, sulfanilamide), mycotic (e.g., fluconazole), and mycobacterial (e.g., isoniazid) infections; immunomodulating/anti-inflammatory drugs (e.g., interferon-β, auranofin, montelukast, colchicine); anti-proteasomal drugs (e.g., bortezomib, carfilzomib); and a range of other less obvious drugs that warrant exploration (e.g., aminoglutethimide, melatonin, levothyroxine, calcitriol, selegiline, deferoxamine, mitoxantrone, metformin, nintedanib, cinacalcet, and sildenafil, among others) (Table 2). Our final list includes 11 previously proposed[13,43] potential drug-repurposing candidates for COVID-19, and 21 drugs that are currently being tested in clinical trials (Table 2).
Table 2:
Drug Repurposing Candidates.
The list of the 81 drugs selected for repurposing. It shows each drug’s name, the final combined rank of each drug, the number of clinical trials in which the drug is being tested for COVID-19, and references to papers that have already noted its potential COVID-19 relevance.
Validation Case Studies
The drug repurposing list provided in Table 2 ranks drugs based on their network-based relationship to the viral targets. However, for a drug to be effective, it may not be sufficient to be proximal—it also needs to induce the right perturbation in the cell, suppressing, for example, the expression of proteins the virus needs, and activating the expression of proteins essential for the cell function and survival that are suppressed by the virus. In this section we use expression data to understand how the drug affects the activity of proteins within the COVID-19 disease module, offering insights about the mechanism of action of selected drugs.
Connectivity Map
We retrieved gene expression perturbation profiles for 59 of the 81 repurposing candidates from the Connectivity Map (CMap) database[10,11], altogether including 5,291 experimental instances (combinations of different drugs, cell lines, doses, and times of treatment). To evaluate the degree to which each of these drugs modulates the activity of COVID-19 targets, we measured the overlap between the perturbed genes and COVID-19 targets. For example, for mitoxantrone, an antineoplastic drug (Table 2), we find that 75 (22%) of the COVID-19 targets are among the 2,440 genes highly perturbed by the drug (3.33 μM) in the lung cell line HCC515, a statistically significant overlap (Fisher’s exact test, FDR-BH p-value < 0.05) (Fig. 6A). When evaluated across all experimental instances, we find that for 43 of the 59 drugs, there was a statistically significant overlap of the perturbed genes with the COVID-19 targets (Fig. 6B). For random selections of 59 drugs from the pool of all drugs, only 13 ± 7 drugs on average have statistically significant overlap between perturbed genes and COVID-19 targets (Fig. S5), indicating that the repurposing candidates effectively perturb the network of the COVID-19 disease module. We observed the highest number of perturbed COVID-19 targets for carfilzomib (162, p-value = 0.004, HA1E, 10.0 μM), flutamide (162, p-value = 0.003, MCF7, 0.04 μM), and bortezomib (162, p-value = 0.02, HA1E, 20.0 μM). For cell lines derived from lung tissues (A549 and HCC515), the drugs with the highest overlap with COVID-19 targets are mitoxantrone and ponatinib. These results can help us extract direct experimental evidence that the drug repurposing candidates selected by our methods modulate processes targeted by the virus, and offer mechanistic insights into the biological processes affected by these drugs. For example, we find that mitoxantrone (HUVEC, 10 μM, 24h) perturbs COVID-19 targets related to cell cycle, viral life cycle, protein transport, and organelle organization.
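The overlap test used here can be sketched with the standard library alone. The sketch assumes a one-sided Fisher's exact test (the upper tail of the hypergeometric distribution) followed by Benjamini-Hochberg (FDR-BH) adjustment across experimental instances; whether the authors used a one-sided or two-sided test, and which gene universe they used, is not stated in this section, so both are assumptions. The function names are hypothetical.

```python
from math import comb


def overlap_p_value(n_universe, n_targets, n_perturbed, n_overlap):
    """P(overlap >= n_overlap) when n_perturbed genes are drawn at random
    from a universe containing n_targets viral targets (hypergeometric tail)."""
    total = comb(n_universe, n_perturbed)
    upper = min(n_targets, n_perturbed)
    tail = sum(comb(n_targets, k) * comb(n_universe - n_targets, n_perturbed - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With one raw p-value per experimental instance, an instance would be called significant when its adjusted value from `benjamini_hochberg` falls below 0.05.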
Figure 6:
Validation Using Gene Expression Data.
(A) Local region of the interactome showing the COVID-19 targets. The drug mitoxantrone (3.33 μM, 24h) perturbs the gene expression of 75 COVID-19 targets (labeled proteins) in the lung cell line HCC515 (green and red colors represent down- and up-regulation, respectively). (B) The comparison of bortezomib treatment (YAPC, 20 μM) and SARS-CoV2 infection perturbation profiles shows a negative correlation (Spearman ρ = −0.58, FDR-BH p-value = 1.67×10⁻⁷), indicating that the drug counteracts the effects of the infection for 65 genes (orange dots). The straight line shows a linear fit between the two profiles and the respective confidence interval. Positive values represent up-regulated expression and negative values represent down-regulated expression on both axes.
Suppressing COVID-19 Induced Expression
We next asked whether the selected drugs can counteract the gene expression perturbations caused by the virus, i.e., whether they down-regulate genes up-regulated by the virus or vice versa. For this analysis, we begin with the 120 differentially expressed genes (DEGs) in the SARS-CoV2-infected A549 cell line[44] and compare the list with the drug perturbation profiles. For example, bortezomib treatment of the cell line YAPC (20 μM) counteracts the effects of the SARS-CoV2 infection for 65 genes, resulting in an inverted expression profile (Spearman correlation ρ = −0.58) (Fig. 6C). We measured the Spearman correlation ρ between the perturbations caused by the drug and perturbations caused by the virus in the A549 cell line, where negative correlation values indicate that the drug could counteract the effects of the infection. We find that 22 of the 59 drugs profiled in the Connectivity Map have negative correlation coefficients (Spearman ρ < 0, FDR-BH p-value < 0.05), indicating that they could beneficially modulate the effects of the virus infection. Again, for random selections of 59 drugs from the pool of all drugs, only 3 ± 2 drugs on average have statistically significant negative correlation coefficients (Fig. S6), supporting, once again, the COVID-19 relevance of the repurposing list. Among the 22 drugs with significant perturbation overlap with both COVID-19 targets and DEGs in SARS-CoV2 infection, we find ivermectin and carfilzomib, each in clinical trial for COVID-19 (Table S5). Altogether, these results provide in vitro experimental support for the selected repurposing candidates as possible modulators of the biological processes targeted by the virus. They also indicate how network-based tools can utilize gene expression profiles to explore the potential efficacy of drugs.
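The comparison of drug and infection profiles can be sketched as follows. This is a minimal pure-Python illustration of the Spearman correlation (Pearson correlation of average ranks) together with a count of genes whose direction of change is reversed by the drug. The function names and the toy fold-change values are hypothetical; the significance testing and multiple-testing correction used in the study are not reproduced here.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def counteracted_genes(virus_change, drug_change):
    """Number of genes whose sign of change is opposite under the drug."""
    return sum(1 for v, d in zip(virus_change, drug_change) if v * d < 0)


# Hypothetical values for four DEGs, in the same gene order in both lists.
virus_fold_change = [2.0, 1.5, -1.0, -2.5]   # infection vs. mock
drug_perturbation = [-1.8, -0.5, 0.7, 2.0]   # drug perturbation scores

rho = spearman(virus_fold_change, drug_perturbation)
n_reversed = counteracted_genes(virus_fold_change, drug_perturbation)
```

A negative `rho` corresponds to the inverted expression profile discussed above.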
Discussion
In this study, we took advantage of recent advances in network medicine to define a list of 81 drug repurposing candidates for the treatment of COVID-19, and, using in vitro data, we show that these drugs do affect biological processes targeted by the virus. The accuracy of our predictions will further improve as the input or validation data improve. For example, we relied on the results of Gordon et al. (2020)[13] for the map of interactions between the virus and human proteins. There are, however, additional interactions not detected in the study[13]. For example, the ACE2[45,46] protein has been recently linked to initial viral association on airway epithelial cells, but in the current data set[13] no viral proteins target it.
Note that the utilized predictive pipelines select drugs that, by virtue of the network-based relationship between their targets and the SARS-CoV2 viral targets, are positioned to perturb effectively the COVID-19 disease module. Some of the perturbations may block the virus’ ability to invade the host cells, or limit the molecular level disruption caused by the infection, potentially alleviating the disease symptoms and shortening the timeline of the disease. Others, however, may cause perturbations that aggravate the symptoms and the seriousness of the phenotype. Therefore, in ordinary circumstances, we would need molecular experiments to test the efficacy of these drugs for COVID-19-infected cell lines (Table 2). Yet, as many of these drugs have well-known side effects and toxicities, given the imminent need for a cure, it may be possible to move those drugs directly into clinical trials. While we are currently pursuing this possibility, releasing the list could offer opportunities for other groups, with appropriate resources and toolset, to move some of these drugs into screening or directly to rapid clinical trials. We are, of course, cognizant of the remote, yet real, possibility that these approved drugs with known side effects may exert unique toxicities in the setting of this novel infection, an outcome that can only be identified in clinical trial.
Our study focused on ranking the existing drugs based on their expected efficacy for COVID-19 patients. This does not mean that drugs that did not make our final list could not have efficacy, or that they must be excluded from further consideration. As the input data improve, other, currently highly ranked drugs could move to a lower ranking, developing a case for experimental testing and clinical trial, and vice versa. The proposed methodology is general, allowing us to profile the potential efficacy of any drug or a family of drugs, whether or not they are included in our current reference list.
Normally, bioinformatic validation would be followed by experimental screening and potentially clinical validation before publication. We are currently pursuing these avenues, from screening in human cell lines to clinical trials. We feel, however, that given the strength of the bioinformatics validation and the obtained AUC, generating confidence in our methodologies, and the urgency of the COVID-19 crisis, there is an imminent need for disclosure to offer rationale and guidance for upcoming clinical trials.
Methods
Human Interactome, SARS-CoV2 and Drug Targets
The human interactome was assembled from 21 public databases that compile experimentally derived protein-protein interaction (PPI) data: 1) binary PPIs, derived from high-throughput yeast two-hybrid (Y2H) experiments (HI-Union[47]), three-dimensional (3D) protein structures (Interactome3D[48], Instruct[49], Insider[50]) or literature curation (PINA[51], MINT[52], LitBM17[47], Interactome3D, Instruct, Insider, BioGrid[53], HINT[54], HIPPIE[55], APID[56], InWeb[57]); 2) PPIs identified by affinity purification followed by mass spectrometry present in BioPlex2[58], QUBIC[59], CoFrac[60], HINT, HIPPIE, APID, LitBM17, InWeb; 3) kinase-substrate interactions from KinomeNetworkX[61] and PhosphoSitePlus[62]; 4) signaling interactions from SignaLink[63] and InnateDB[64]; and 5) regulatory interactions derived by the ENCODE consortium. We used the curated list of PSI-MI IDs provided by Alonso-López et al. (2019)[56] for differentiating binary interactions among the several experimental methods present in the literature-curated databases. Specifically for InWeb, interactions with curation scores < 0.175 (75th percentile) were not considered. All proteins were mapped to their corresponding Entrez ID (NCBI) and the proteins that could not be mapped were removed. The final interactome used in our study contains 18,505 proteins and 327,924 interactions between them. We retrieved interactions between 26 SARS-CoV2 proteins and 332 human proteins that were detected by Gordon et al.[13], and drug-target information from the DrugBank database, containing 26,167 interactions between 7,591 drugs and their 4,187 targets.
Tissue Specificity
We used the GTEx database[16], which contains the median gene expression from RNA-seq for 56 different tissues, assuming that genes with a median count lower than 5 are not expressed in that particular tissue. The LCC was calculated using a degree preserving approach[2], preventing the repeated selection of the same high degree nodes by choosing 100 degree bins in 1,000 simulations.
Network Proximity
Given V, the set of COVID-19 virus targets, T, the set of drug targets, and d(v,t), the shortest path length between nodes v ∈ V and t ∈ T in the network, we define[2]
$$d_c(V,T) = \frac{1}{\|T\|} \sum_{t \in T} \min_{v \in V} d(v,t).$$
We also determined the expected distances between two randomly selected groups of proteins, matching the size and degrees of the original V and T sets. To avoid repeatedly selecting the same high degree nodes, we use degree-binning[2] (see above). The mean μ and standard deviation σ of the reference distribution allow us to convert the absolute distance d_c to a relative distance Z, defined as
$$Z = \frac{d_c - \mu}{\sigma}.$$
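The proximity calculation can be sketched in a few lines. This is a minimal illustration, assuming the closest-distance measure (the average, over drug targets, of the shortest path to the nearest virus target) and a simplified degree binning with fixed-width bins; the study's binning scheme and number of randomizations differ. The toy graph, the function names, and the parameters `bin_width` and `n_random` are hypothetical, and the graph is assumed to be connected.

```python
import random
from collections import deque
from statistics import mean, pstdev


def bfs_distances(adj, source):
    """Shortest path lengths from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def closest_distance(adj, V, T):
    """Average over t in T of the distance to the nearest v in V."""
    total = 0
    for t in T:
        dist = bfs_distances(adj, t)
        total += min(dist[v] for v in V if v in dist)
    return total / len(T)


def degree_matched_sample(adj, nodes, rng, bin_width=2):
    """Random node set with the same size and binned degrees as `nodes`."""
    bins = {}
    for n in adj:
        bins.setdefault(len(adj[n]) // bin_width, []).append(n)
    needed = {}
    for n in nodes:
        b = len(adj[n]) // bin_width
        needed[b] = needed.get(b, 0) + 1
    sample = []
    for b, count in needed.items():
        sample.extend(rng.sample(bins[b], count))
    return sample


def proximity_z(adj, V, T, n_random=200, seed=0):
    """Observed distance and its Z-score against degree-matched sets."""
    rng = random.Random(seed)
    d = closest_distance(adj, V, T)
    reference = [
        closest_distance(
            adj,
            degree_matched_sample(adj, V, rng),
            degree_matched_sample(adj, T, rng),
        )
        for _ in range(n_random)
    ]
    mu, sigma = mean(reference), pstdev(reference)
    z = (d - mu) / sigma if sigma > 0 else float("nan")
    return d, z


# Hypothetical toy interactome: a path 0-1-2-3-4-5.
toy_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
virus_targets = [0, 1]
drug_targets = [2, 5]
d_obs, z_obs = proximity_z(toy_adj, virus_targets, drug_targets)
```

A negative Z indicates that the drug targets lie closer to the virus targets than expected for random protein sets of matching size and degree.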
Diffusion State Distance
The diffusion state distance (DSD)[8] algorithm uses a graph diffusion property to derive a similarity metric for pairs of nodes that takes into account how similarly they impact the rest of the network. We calculate the expected number of times He(A,B) that a random walk starting at node A visits node B, representing each node by the vector[8]
$$He(A) = \big(He(A, u_1), He(A, u_2), \ldots, He(A, u_n)\big),$$
where u_1, …, u_n denote the n nodes of the interactome. This vector describes how a perturbation initiated from that node impacts other nodes in the interactome. The similarity between nodes A and B is provided by the L1 norm of the difference between their corresponding vector representations,
$$DSD(A,B) = \| He(A) - He(B) \|_1.$$
Inspired by the DSD, we developed five new metrics to calculate the impact of drug targets t on the SARS-CoV2 targets v. The first (Pipeline D1) is defined as
where DSD(s, t) represents the diffusion state distance between nodes s and t. Since the L1 norm of two large vectors may result in loss of information[35], we also used the metric (Pipeline D2)
and (Pipeline D3)
where KL is the Kullback-Leibler (KL) divergence between the vector representations of the nodes t and s. Finally, to provide symmetric measures, we tested the measures (Pipeline D4)
and (Pipeline D5)
where JS is the Jensen-Shannon (JS) divergence between the vector representations of nodes t and s. All five measures consider t ≠ v.
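The pairwise quantities underlying these pipelines can be sketched as follows. This is a minimal illustration, not the pipelines themselves: it computes the expected-visit vector of a node with a truncated random walk (the walk length `steps` is an assumption, as is the normalization with a small pseudocount before computing divergences), and then the L1, Kullback-Leibler, and Jensen-Shannon measures between two such vectors. How the pairwise values are aggregated over drug and virus targets in Pipelines D1 to D5 is not reproduced here. The toy graph is hypothetical and assumed to have no isolated nodes.

```python
from math import log


def expected_visits(adj, start, steps=10):
    """Expected number of visits to each node by a random walk of
    `steps` steps starting at `start` (step 0 counts the start node)."""
    nodes = sorted(adj)
    p = {n: 0.0 for n in nodes}
    p[start] = 1.0
    he = dict(p)
    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        for u, mass in p.items():
            if mass:
                share = mass / len(adj[u])
                for w in adj[u]:
                    nxt[w] += share
        p = nxt
        for n in nodes:
            he[n] += p[n]
    return [he[n] for n in nodes]


def dsd(he_a, he_b):
    """L1 norm of the difference between two expected-visit vectors."""
    return sum(abs(a - b) for a, b in zip(he_a, he_b))


def normalize(vec, eps=1e-12):
    """Turn a non-negative vector into a strictly positive distribution."""
    total = sum(vec) + eps * len(vec)
    return [(x + eps) / total for x in vec]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) of two distributions."""
    return sum(a * log(a / b) for a, b in zip(p, q))


def js(p, q):
    """Jensen-Shannon divergence, symmetric in p and q."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical toy network: a path 0-1-2.
toy_adj = {0: [1], 1: [0, 2], 2: [1]}
he_0 = expected_visits(toy_adj, 0)
he_2 = expected_visits(toy_adj, 2)

l1_distance = dsd(he_0, he_2)
kl_divergence = kl(normalize(he_0), normalize(he_2))
js_divergence = js(normalize(he_0), normalize(he_2))
```

The KL divergence is not symmetric in its arguments, which is the motivation for also testing the symmetric JS divergence.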
Graph convolutional networks
We designed a graph neural network for COVID-19 treatment recommendations based on a previously developed graph convolutional architecture[38]. The multimodal graph is a heterogeneous graph with N nodes representing three distinct types of biomedical entities (i.e., drugs, proteins, diseases), and labeled edges representing four semantically distinct types of edges r between the entities (i.e., protein-protein interactions, drug-target associations, disease-protein associations, and drug-disease treatments).
COVID-19 treatment recommendation task
We cast COVID-19 treatment recommendation as a link prediction problem on the multimodal graph G. The task is to predict new edges between drug and disease nodes, so that a predicted link between a drug node v_i and a disease node v_j indicates that drug v_i is a promising treatment for disease v_j (e.g., COVID-19). Our graph neural network is an end-to-end trainable model for link prediction on the multimodal graph and has two main components: (1) an encoder: a graph convolutional network operating on G and producing embeddings for nodes in G, and (2) a decoder: a model optimizing embeddings such that they are predictive of successful drug treatments.
Overview of graph neural architecture
The neural message passing encoder takes as input a graph G and produces a d-dimensional embedding for every drug and disease node in the graph. We use the encoder[38] that learns a message passing algorithm[65] and aggregation procedure to compute a function of the entire graph that transforms and propagates information across graph G. The graph convolutional operator takes into account the first-order neighborhood of a node and applies the same transformation across all locations in the graph. Successive application of these operations then effectively convolves information across the K-th order neighborhood (i.e., the embedding of a node depends on all the nodes that are at most K steps away), where K is the number of successive operations of convolutional layers in the neural network model. The graph convolutional operator takes the form
$$h_i^{(k+1)} = \phi\Big(\sum_{r} \sum_{j \in \mathcal{N}_r^i} \alpha_r^{ij}\, W_r^{(k)} h_j^{(k)}\Big),$$
where $h_i^{(k)} \in \mathbb{R}^{d^{(k)}}$ is the hidden state of node $v_i$ in the k-th layer of the neural network, with $d^{(k)}$ being the dimensionality of this layer’s representation, r is an edge type, $\mathcal{N}_r^i$ is the set of neighbors of $v_i$ under edge type r, matrix $W_r^{(k)}$ is an edge-type specific parameter matrix, ϕ denotes a non-linear element-wise activation function (i.e., a rectified linear unit), and $\alpha_r^{ij}$ denote attention coefficients[66]. To arrive at the final embedding $z_i$ of node $v_i$, we compute its representation as $z_i = h_i^{(K)}$. Next, the decoder takes node embeddings and combines them to reconstruct labeled edges in G. In particular, the decoder scores a $(v_i, r, v_j)$ triplet through a function g whose goal is to assign a score $g(v_i, r, v_j)$ representing how likely it is that drug $v_i$ will treat disease $v_j$ (i.e., r denotes a ‘treatment’ relationship).
Training the graph neural network
During model training, we optimize model parameters using a max-margin loss function to encourage the model to assign higher probabilities to successful drug indications (v_i, r, v_j) than to random drug-disease pairs. We take an end-to-end optimization approach that jointly optimizes over all trainable parameters and propagates loss function gradients through both the encoder and the decoder. To optimize the model, we train it for a maximum of 100 epochs (training iterations) using the Adam optimizer[67] with a learning rate of 0.001. We initialize weights using the initialization described in[68]. To make the model comparable to other drug repurposing methodologies in this study, we do not integrate additional side information into node feature vectors; instead, we use one-hot indicator vectors[69] as node features. In order for the model to generalize well to unobserved edges, we apply regular dropout[70] to hidden layer units (Eq. (10)). In practice, we use efficient sparse matrix multiplications, with complexity linear in the number of edges in G, to implement the model. We use a 2-layer neural architecture with d1 = 32, d2 = 32, d = 128 hidden units in input, output, and intermediate layer, respectively, a dropout rate of 0.1, and a max-margin of 0.1. We use mini-batching[71] by sampling triplets from the multimodal graph. That is, we process multiple training mini-batches (mini-batches are of size 512), each obtained by sampling only a fixed number of triplets, resulting in dynamic batches that change during training.
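The training objective can be illustrated with a minimal sketch. It assumes the common hinge form of a max-margin loss, in which each known drug-disease indication is paired with one randomly sampled drug-disease pair and penalized when its score does not exceed the random pair's score by at least the margin. The exact loss used in the study may differ in detail; the function name and the toy scores are hypothetical, and only the margin value of 0.1 is taken from the text.

```python
def max_margin_loss(pos_scores, neg_scores, margin=0.1):
    """Mean hinge loss over paired positive and negative scores.

    Assumed form: max(0, margin - score(positive) + score(negative)).
    """
    losses = [
        max(0.0, margin - pos + neg)
        for pos, neg in zip(pos_scores, neg_scores)
    ]
    return sum(losses) / len(losses)


# Hypothetical decoder scores for two known indications and two
# randomly sampled drug-disease pairs.
known_indication_scores = [0.9, 0.5]
random_pair_scores = [0.2, 0.45]

loss = max_margin_loss(known_indication_scores, random_pair_scores)
```

The first pair already satisfies the margin and contributes nothing; only the second pair, whose scores are closer than the margin, contributes to the loss and hence to the gradient.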
Expression perturbation profiles
We retrieved drug perturbation profiles from the Connectivity Map (CMap) database[10,11] using the Python package CMapPy[72]. For each perturbation profile, we calculated the significance of the overlap of perturbed genes (|Z-score| > 2) and SARS-CoV2 targets derived from Gordon et al.[13] using Fisher’s exact test. We also retrieved gene expression data of the cell line A549 after infection with SARS-CoV2[44]. The correlation between the perturbation scores provided in CMap and the gene expression fold change caused by SARS-CoV2 infection was evaluated using the Spearman correlation coefficient. In both cases, we applied the Benjamini-Hochberg method for multiple testing correction (FDR < 0.05).
Rank aggregation
We used the CRank algorithm[9] to combine rankings returned by different methodologies into a single rank for each drug, which then determined the drug’s repurposing priority. The rank aggregation algorithm starts with ranked lists of drugs, $R_r$, each one arising from a different methodology r. Each ranked list is partitioned into equally sized groups, called bags. Each bag i in ranked list $R_r$ has an attached importance weight $K_r^i$, whose initial values are all equal.
CRank uses a two-stage iterative procedure to aggregate the individual rankings by taking into account uncertainty that is present across ranked lists. After initializing the aggregate ranking R as a weighted average of ranked lists $R_r$, CRank alternates between the following two stages until no changes are observed in the aggregated ranking R. (1) First, it uses the current aggregated ranking R to update the importance weights for each ranked list. For that purpose, the top-ranked drugs in R serve as a temporary gold standard. Given bag i and ranked list $R_r$, CRank updates importance weight $K_r^i$ based on how many drugs from the temporary gold standard appear in bag i using the Bayes factors[73,74]. (2) Second, the ranked lists are re-aggregated based on the importance weights calculated in the previous stage. The updated importance weights are used to revise R, in which the new rank R(C) of drug C is expressed as $R(C) = \sum_r \log K_r^{i_r(C)} R_r(C)$, where $K_r^{i_r(C)}$ indicates the importance weight of bag $i_r(C)$ of drug C for ranking r, and $R_r(C)$ is the rank of C according to r. By using an iterative approach, CRank allows for the importance of a ranking not to be predetermined and to vary across drugs.
The final output is a global ranked list R of drugs that represents the collective opinion of the different repurposing methodologies. The Python source code implementation of CRank is available at https://github.com/mims-harvard/crank.
In all experiments, we set the number of bags to 1,000, the size of the temporary gold standard to 0.5% of the total number of drugs in R, and the maximum number of iterations to 50. In all cases, the algorithm converged in fewer than 20 iterations.
ROC curves
We employed different methodologies to rank drug candidates. Since we lack ground-truth labels for drugs being effective against the disease, we rely on clinical trials to gather names of drugs currently in trial. We assumed that all the drugs tested in clinical trials are relevant and based on prior in vitro or in vivo observations. We used this information and the ranking of each method to compute ROC (Receiver Operating Characteristic) curves and AUC (area under the curve) scores for model selection and performance analysis. The AUC score measures the quality of the separation between positive and negative instances. For each ranked list, we applied different thresholds to compute false-positive and true-positive rates to plot the ROC curve. AUC scores range between 0 and 1, where 1 corresponds to perfect performance and 0.5 indicates the performance of a random classifier. Some methods fail to provide a ranking for every drug; to provide a fair comparison between methods, we assumed that all the missing ranks should be listed at the bottom of the ranking. We use the Python package Scikit-learn[75] for computing AUC scores and plotting ROC curves.
For the ground-truth list, we consider the ClinicalTrials.gov website the primary source of ongoing trials of drugs for COVID-19. We are cognizant of its limitations, primarily the time lag between the implementation of a trial and its appearance on the site. We also quantified the performance of models under different constraints: considering only drugs that have at least N trials and considering only the evidence provided up to a certain date (Fig. S7).
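The evaluation above can be sketched without external dependencies. The snippet below is a minimal illustration that computes the AUC of a ranking through its equivalence with the Mann-Whitney statistic: the probability that a randomly chosen drug in clinical trial is ranked above a randomly chosen drug not in trial. Drugs missing from a method's ranking are placed at the bottom with a shared worst rank, so ties among them count as one half; this tie handling is an assumption. The study itself used Scikit-learn; the function name and toy drug labels here are hypothetical.

```python
def auc_from_ranking(ranked_drugs, all_drugs, positives):
    """AUC of a ranking, with unranked drugs placed at the bottom."""
    rank = {drug: i + 1 for i, drug in enumerate(ranked_drugs)}
    worst = len(ranked_drugs) + 1
    # Higher score means better rank; missing drugs share the worst score.
    score = {drug: -rank.get(drug, worst) for drug in all_drugs}
    pos = [score[d] for d in all_drugs if d in positives]
    neg = [score[d] for d in all_drugs if d not in positives]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: a method ranks three of five drugs,
# and two drugs are currently in clinical trial.
method_ranking = ["drug_a", "drug_b", "drug_c"]
drug_pool = ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"]
in_trial = {"drug_a", "drug_d"}

auc = auc_from_ranking(method_ranking, drug_pool, in_trial)
```

Placing missing drugs at the bottom penalizes methods with incomplete coverage, which keeps the comparison between methods on the same pool of drugs.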