PharmacoDB 2.0: improving scalability and transparency of in vitro pharmacogenomics analysis.

Nikta Feizi¹, Sisira Kadambat Nair¹, Petr Smirnov^1,2, Gangesh Beri¹, Christopher Eeles¹, Parinaz Nasr Esfahani¹, Minoru Nakano¹, Denis Tkachuk¹, Anthony Mammoliti^1,2, Evgeniya Gorobets³, Arvind Singh Mer^1,2, Eva Lin⁴, Yihong Yu⁴, Scott Martin⁴, Marc Hafner⁵, Benjamin Haibe-Kains^1,2,6,7,8.

Abstract

Cancer pharmacogenomics studies provide valuable insights into disease progression and associations between genomic features and drug response. PharmacoDB integrates multiple cancer pharmacogenomics datasets profiling approved and investigational drugs across cell lines from diverse tissue types. The web-application enables users to efficiently navigate across datasets, view and compare drug dose-response data for a specific drug-cell line pair. In the new version of PharmacoDB (version 2.0, https://pharmacodb.ca/), we present (i) new datasets such as NCI-60, the Profiling Relative Inhibition Simultaneously in Mixtures (PRISM) dataset, as well as updated data from the Genomics of Drug Sensitivity in Cancer (GDSC) and the Genentech Cell Line Screening Initiative (gCSI); (ii) implementation of FAIR data pipelines using ORCESTRA and PharmacoDI; (iii) enhancements to drug-response analysis such as tissue distribution of dose-response metrics and biomarker analysis; and (iv) improved connectivity to drug and cell line databases in the community. The web interface has been rewritten using a modern technology stack to ensure scalability and standardization to accommodate growing pharmacogenomics datasets. PharmacoDB 2.0 is a valuable tool for mining pharmacogenomics datasets, comparing and assessing drug-response phenotypes of cancer models.

Year: 2022 PMID： 34850112 PMCID： PMC8728279 DOI： 10.1093/nar/gkab1084

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

With the advent of high-throughput technologies, a vast amount of genomic data is being generated across various disease domains. In oncology, genomic and pharmacological profiling of cancer cell line models has resulted in a better understanding of the relationship between the molecular features of cancers and treatment outcomes. Starting with a disease-oriented screening model in the late 1980s, the US National Cancer Institute’s NCI-60 anticancer drug screen has aided in major discoveries across many fields including anticancer therapy (1). Subsequently, several studies, including the Genomics of Drug Sensitivity in Cancer (GDSC) (2,3), Cancer Therapeutics Response Portal (CTRP) (4) and Cancer Cell Line Encyclopedia (CCLE) (5), have generated pharmacogenomic profiles of much larger panels of cancer cell lines. Collectively, such data can be used for hypothesis testing and in the discovery or repurposing of new anti-cancer therapeutics. PharmacoDB (version 1.0) was released in 2018 (6) as the largest database integrating cancer cell line pharmacogenomic datasets, enabling users to efficiently explore data across the largest published studies. PharmacoDB provided a unified analysis platform by standardizing statistical models of dose–response data and harmonizing annotations of experiments. The website has grown to approximately 500 monthly users and has served as the source of curated and easily accessible data for numerous published studies. This includes researchers using the PharmacoDB database to understand specific compounds or molecular processes and pathways (7–9), as well as for more global analyses of drug sensitivity (10,11). PharmacoDB has also empowered machine learning researchers to develop and publish novel approaches to drug–response prediction (12,13). 
The next generation of PharmacoDB (version 2.0) includes new datasets such as the US National Cancer Institute’s NCI-60 (14–19), the Broad's Profiling Relative Inhibition Simultaneously in Mixtures (PRISM) dataset (20), GDSC version 2 (GDSC2), and updates to existing datasets such as the Wellcome Trust Sanger Institute’s GDSC version 1 (GDSC1) (21) and the Genentech Cell line Screening Initiative (gCSI) (22) (Table 1). The cell lines investigated in these studies were screened for drug response and where available, profiled at the molecular level with multiple technologies, including RNA sequencing, microarray single-nucleotide and gene expression profiling, and whole-exome or whole-genome sequencing. The processing of the pharmacogenomic data is fully automated and documented to generate FAIR (findability, accessibility, interoperability and reusability) data through the use of ORCESTRA (https://orcestra.ca/) and PharmacoDI ingestion pipelines. PharmacoDB 2.0 also provides new visualization of differential drug dose response across tissues, as well as summaries of gene–drug associations showcasing their strength and reproducibility across studies. New links to drugs and cell lines have been added from Reactome (23), Drug Target Commons (DTC) (24) and Cellosaurus (25) to increase the connection to PharmacoDB from other databases in the community. The chemical identifiers are extended to include ChEMBL (26) IDs (Figure 1A).

Table 1.

Details of new and updated datasets in PharmacoDB 2.0

Dataset	Description	PSet Molecular Data	# Cell lines	# Drugs	# Tissues	Assay	Dose–response source	ORCESTRA
US National Cancer Institute 60 anticancer drug screen (NCI-60)	NCI-60 dataset consists of molecular profiles as well as the dose–response data from screening small molecules including approved and investigational drug compounds	RNA-seq isoforms RNA-seq composite Microarray MicroRNA	162	54774	15	Sulforhodamine B colorimetry	https://wiki.nci.nih.gov/download/attachments/147193864/DOSERESP.zip?version=1&modificationDate=1622830743000&api=v2	https://orcestra.ca/pset/10.5281/zenodo.5570629
The PRISM Repurposing dataset (PRISM)	The PRISM dataset consists of dose–response data from assessing the anti-cancer effects of non-oncology drugs on human cancer cell-lines using the PRISM barcoding method developed by Broad Institute of MIT and Harvard	*RNA-seq Microarray Mutation CNV	**480	1437	22	PRISM (Luminex)	https://ndownloader.figshare.com/files/20237757	https://orcestra.ca/pset/10.5281/zenodo.5570757
Genomics of Drug Sensitivity in Cancer (GDSC1)	Genomics of Drug Sensitivity in Cancer (GDSC) Project is part of a collaboration between Wellcome Trust Sanger Institute and the Massachusetts General Hospital Cancer Center. Both the GDSC1 and GDSC2 datasets contain dose–response as well as molecular data from screening anti-cancer therapeutics across genetically characterized human cancer cell lines	RNA-seq Microarray Mutation Mutation (Exome) CNV Fusion	1104	303	29	Resazurin or Syto60	ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-8.2/GDSC1_public_raw_data_25Feb20.csv	https://orcestra.ca/pset/10.5281/zenodo.3905485
Genomics of Drug Sensitivity in Cancer (GDSC2)		RNA-seq Microarray Mutation Mutation (Exome) CNV Fusion	1104	190	29	CellTiter Glo	ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-8.2/GDSC2_public_raw_data_25Feb20.csv	https://orcestra.ca/pset/10.5281/zenodo.3905481
The Genentech Cell Line Screening Initiative (gCSI)	The gCSI data were generated and shared by Genentech as part of the Genentech Cell Line Screening Initiative. The gCSI dataset includes dose–response data as well as molecular profiles from screening drugs on independently characterized cell lines	RNA-seq Mutation CNV	788	44	27	CellTiter Glo	http://research-pub.gene.com/gCSI_GRvalues2019/gCSI_GRdata_v1.3.tsv.tar.gz	https://orcestra.ca/pset/10.5281/zenodo.4737437

An overview of the new and updated datasets, types of molecular profiles included in each dataset, number of cell lines, drugs, and tissue types in each dataset, and the assay type used for measuring the dose–response values. The link to both the source of raw dose–response data and the corresponding PSets on ORCESTRA is provided in the table.

Note: The number of drugs is reported based on the PSets' unique drug IDs (Supplementary Data).

*Cell lines are directly obtained from the Broad-Novartis Cancer Cell Line Encyclopedia (CCLE) project. Molecular profiles are accessible from the CCLE dataset on ORCESTRA (https://orcestra.ca/pset/10.5281/zenodo.3905462).

**19 out of 499 original cell lines failed the STR fingerprinting comparison tests and were not included in the PSet.

Figure 1.

PharmacoDB 2.0 overview. (A) The new version of PharmacoDB includes updated and new large-scale pharmacogenomic datasets. The web-application contains enriched annotations for drugs and cell lines via connectivity to external databases. PharmacoDB 2.0 includes new analytical methods for tissue-specific and pan-cancer biomarker discovery. The new web-interface ensures scalability and simplifies maintenance. PharmacoDB 2.0 is made fully reproducible through the use of the ORCESTRA platform and automated data ingestion pipelines. (B) Bar plots showing previous (Version 1) and current (Version 2) database statistics.

ADDITION OF NEW DATASETS

The current release of PharmacoDB includes significant updates to the GDSC1000 (renamed GDSC1 following the Wellcome Trust Sanger Institute’s nomenclature) and gCSI datasets (Table 1). In addition, the NCI-60, PRISM and GDSC2 datasets have been included in this new release of PharmacoDB. The numbers of entities such as drugs, cell lines, tissues, gene–drug associations and experiments in PharmacoDB 2.0 and in the previous version are shown in Figure 1B. The newly standardized tissue curation is explained in Supplementary Data. The NCI-60 screen includes over 55 000 small molecules assayed on 60 core (and 102 additional) cell lines representing 15 tumor types. PharmacoDB 2.0 includes percentage of treated cell growth (PTC) values, downloaded from the NCI Developmental Therapeutics Program (https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-60+Growth+Inhibition+Data), for 4 557 787 experiments that included at least four measured doses required for curve fitting (Supplementary Data). The PRISM Repurposing dataset employs a molecular barcoding method to screen drugs against cell lines in pools (20,27). The barcoded cell lines from different lineages are assayed for relative mRNA abundance after treatment with a drug or chemical perturbagen to estimate cell viability. PRISM drug screening involves a two-stage screening strategy: (i) Screen-1 includes 4518 compounds and 578 cell lines that were assayed in triplicate at a single dose and (ii) Screen-2 includes 1448 drugs that were re-screened against 499 cell lines in triplicate in an 8-point dose response (20). PharmacoDB 2.0 includes the Screen-2 dose–response data with biological replicate-collapsed log-fold change values, downloaded from the Dependency Map Data Portal (https://depmap.org/repurposing/). After processing, these data included 726 814 experiments with at least four dose measurements (Supplementary Data). 
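The minimum-dose criterion mentioned for the NCI-60 and PRISM data can be illustrated with a small filter. This is only a sketch: the record layout `(cell_line, drug, dose, viability)`, the function name and the toy values are hypothetical, and counting distinct doses (rather than raw measurements) is an assumption, since the exact rule is described in the Supplementary Data.

```python
from collections import defaultdict

MIN_DOSES = 4  # threshold stated in the text; how doses are counted is an assumption


def experiments_with_enough_doses(records, min_doses=MIN_DOSES):
    """Return the (cell_line, drug) pairs measured at >= min_doses distinct doses.

    `records` is an iterable of (cell_line, drug, dose, viability) tuples,
    a hypothetical layout chosen for this illustration.
    """
    doses = defaultdict(set)
    for cell_line, drug, dose, _viability in records:
        doses[(cell_line, drug)].add(dose)
    return {key for key, seen in doses.items() if len(seen) >= min_doses}


# Toy data: CELL_A has four doses, CELL_B only three.
records = [
    ("CELL_A", "DRUG_X", 0.01, 98.0),
    ("CELL_A", "DRUG_X", 0.1, 90.0),
    ("CELL_A", "DRUG_X", 1.0, 55.0),
    ("CELL_A", "DRUG_X", 10.0, 12.0),
    ("CELL_B", "DRUG_X", 0.1, 95.0),
    ("CELL_B", "DRUG_X", 1.0, 70.0),
    ("CELL_B", "DRUG_X", 10.0, 40.0),
]
kept = experiments_with_enough_doses(records)
```

In this toy example only the CELL_A experiment would be retained for curve fitting.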
The GDSC2 dataset was generated using a new screening platform from the Wellcome Trust Sanger Institute's Genomics of Drug Sensitivity in Cancer project (28,29). The dataset, previously included as GDSC1000 in PharmacoDB, has been renamed to GDSC1 and updated to include more experiments (323 032 GDSC1 versus 225 480 GDSC1000), reflecting the updated data available from the GDSC project. The GDSC2 dataset represents a different screening approach (using the same cellular viability assay as the Broad and Genentech studies, and an increase in biological replicates) on a similar set of cell lines as GDSC1, with both overlapping and new compounds, and a total of 215 780 experiments which were included in PharmacoDB 2.0. More information on the protocol differences between the GDSC1 and GDSC2 data can be found on the GDSC website at https://www.cancerrxgene.org/. Finally, PharmacoDB 2.0 includes significant updates to Genentech’s gCSI dataset. The number of experiments available has increased from 6455 to 16 688, primarily through the inclusion of additional 35 compounds and extensive biological replicate information. For all new additions and updates to the database, drug annotations were curated by mapping drugs to PubChem (30) identifiers such as compound identifiers (CID), SMILES and InChIKeys using the PUG-REST API (31). Whenever possible, PubChem index names were used as the standard name in PharmacoDB. Most exceptions occurred in the NCI-60 dataset, where a large proportion of PubChem index names followed IUPAC nomenclature and were difficult for humans to read. For these, we followed the process described in Supplementary Figure S1 to decide when to prefer the NCI-60 dataset ID as the standard name for PharmacoDB. Cell lines were annotated using Cellosaurus (25), an online cell line knowledge resource that documents cell lines used in biomedical research. 
OncoTree (32) was used to re-label the tissues of cancer cell lines profiled into an externally defined ontology.
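As an illustration of the drug annotation step, the sketch below builds a PUG-REST request URL for a compound name and parses a response of the `PropertyTable` shape. It is not the authors' curation code: the helper names are hypothetical, only the `InChIKey` property is requested, the sample payload is hand-written for illustration, and no network call is made.

```python
import json
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(drug_name, properties=("InChIKey",)):
    """Build a PUG-REST URL asking for compound properties by name (JSON output)."""
    name = quote(drug_name, safe="")  # escape spaces, slashes, etc.
    return f"{PUG_REST_BASE}/compound/name/{name}/property/{','.join(properties)}/JSON"


def parse_properties(payload):
    """Extract CID and InChIKey from a PropertyTable-style JSON string."""
    rows = json.loads(payload)["PropertyTable"]["Properties"]
    return [{"cid": row["CID"], "inchikey": row.get("InChIKey")} for row in rows]


# Hand-written sample payload (aspirin), used here instead of a live request.
sample = (
    '{"PropertyTable": {"Properties": '
    '[{"CID": 2244, "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}]}}'
)
url = property_url("dabrafenib")
annotations = parse_properties(sample)
```

In a real pipeline the URL would be fetched with an HTTP client, subject to PubChem's request-rate limits, and unmatched names would need manual review.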

IMPLEMENTATION OF REPRODUCIBLE PIPELINES

With large amounts of pharmacogenomic data released from multiple studies, the reproducibility of computational pipelines used to process these multimodal data is essential. This includes adhering to the FAIR data principles, along with ensuring that there is a standardized manner in which the pipelines are executed and the data is hosted. To address this important issue, we used ORCESTRA (https://orcestra.ca/), a platform that allows researchers to process biomedical data into unified data objects in a reproducible and transparent manner, where data provenance is tracked (33). At the heart of ORCESTRA is Pachyderm (https://www.pachyderm.com/), an open-source data versioning tool used to execute pipelines processing the molecular and compound screening data, and packaging the datasets into R objects called PharmacoSets (PSets), which are implemented by the PharmacoGx package (34). At the end of each PSet processing pipeline, the created object is automatically deposited on Zenodo and assigned a digital object identifier (DOI). Once the highly curated and standardized PSets are released via ORCESTRA, they need to be preprocessed into tables which match the PharmacoDB Entity Relationship Diagram (ERD) before being loaded into the database. To ensure that the data ingestion standards in PharmacoDB adhere to FAIR data principles, the Pharmaco-Data Ingestion (PharmacoDI) project was initiated to create an Extract Transform Load (ETL) pipeline which adheres to modern data engineering best practices (35,36). The PharmacoDI project consists of three major components. First, the rPharmacoDI (https://github.com/bhklab/rPharmacoDI) R package provides an interface to download and export PSets in a Python compatible format. 
Second, the PharmacoDI (https://pypi.org/project/PharmacoDI) Python package contains a set of functions for transforming the exported raw PSet data into tables which match the PharmacoDB schema, leveraging the Python Datatable package to allow larger-than-memory data processing. Finally, the Snakemake workflow management tool is used to integrate ORCESTRA, rPharmacoDI, PharmacoDI and PharmacoDB into a fully modular and scalable ETL pipeline (https://github.com/bhklab/PharmacoDI_snakemake_pipeline) which ensures that all software dependencies, metadata and code are version controlled, transparent and fully reproducible. This pipeline includes a number of additional features which keep PharmacoDB annotations up to date, such as dynamic queries to ChEMBL (26) and Cellosaurus (25) to fetch the respective drug and cell line metadata. Automated quality control checks are implemented throughout the pipeline to ensure data integrity and flag data for manual review if problems are detected. Once quality control passes, the Python SQLAlchemy package is used to connect with our Azure MySQL database, where database tables are automatically created and loaded before being deployed for use in the PharmacoDB web-application (Figure 2).
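The "check, then load" pattern described above can be sketched in a few lines. This is an assumption-laden illustration, not the PharmacoDI code: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for SQLAlchemy and MySQL, and the `cell_line` table, its columns and the two checks are hypothetical.

```python
import sqlite3


def quality_check(rows):
    """Return rows that should be flagged for manual review.

    Two illustrative checks: a missing name, or an id that was already seen.
    """
    seen, flagged = set(), []
    for row_id, name in rows:
        if not name or row_id in seen:
            flagged.append((row_id, name))
        seen.add(row_id)
    return flagged


def ingest(conn, rows):
    """Load rows only if quality control passes; return the flagged rows."""
    flagged = quality_check(rows)
    if flagged:
        return flagged  # nothing is written when problems are detected
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cell_line (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO cell_line (id, name) VALUES (?, ?)", rows)
    conn.commit()
    return []


conn = sqlite3.connect(":memory:")  # stand-in for the production database
problems = ingest(conn, [(1, "CELL_A"), (2, "CELL_B"), (3, "CELL_C")])
loaded = conn.execute("SELECT COUNT(*) FROM cell_line").fetchone()[0]
```

Keeping the check separate from the load makes it easy to run the same validation inside a workflow step before any table is touched.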

Figure 2.

Computational processing pipeline of raw pharmacogenomic data for ingestion into PharmacoDB. Different panels show the process of ingesting public datasets into PharmacoDB 2.0. The first panel highlights the sources of the newly added datasets, while the subsequent panels highlight the tools and technologies used for Data Processing and Standardization, Data Ingestion and Annotation, and for building the PharmacoDB 2.0 web app itself.

CONNECTIVITY TO EXTERNAL DATABASES

While the original PharmacoDB database included links to external databases for drugs, genes and cell lines, PharmacoDB 2.0 focuses on creating bidirectional links to improve discoverability of the available pharmacogenomic data. Unique stable identifiers were created for both drugs and cell lines in the PharmacoDB database, allowing external databases to link directly to entities in PharmacoDB. A collection of drugs from PharmacoDB 2.0 are bidirectionally linked to Reactome (23), an online database of biological pathways including drug mechanisms of action. Reactome provides detailed insights into drug targets, binding partners and subsequent biological pathways associated with target action. PharmacoDB drugs are also linked to Drug Target Commons (DTC) (24), a community-driven web platform for compound-target bioactivity assay annotation profiles relevant for drug discovery and repurposing. ChEMBL (26) IDs matching our compound identifiers are added in addition to PubChem identifiers. The cell lines from the datasets are bidirectionally linked and annotated to Cellosaurus (25), which is a cell line information resource. Additional cell line metadata such as disease, metastasis site, species and links to external databases such as DepMap are available from Cellosaurus for the linked cell lines. The external links can be found in the individual drug and cell line pages of PharmacoDB as well as in external web-applications. In addition, the PharmacoDB APIs are implemented in JavaScript using Express, a back-end web application framework for NodeJS. The APIs are structured with GraphQL (https://graphql.org/), an open-source data query and manipulation language that also provides a runtime for fulfilling queries with existing data. The newly created APIs provide better performance and are open source to facilitate user operability. The APIs can be accessed without any authentication process or tokens.
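Because the APIs are GraphQL based and need no token, a client only has to send a JSON body containing a query string and its variables. The sketch below builds such a body; the query, the `compound` field and its `compoundName` argument are hypothetical placeholders, not the documented PharmacoDB schema, and the endpoint itself is deliberately left out.

```python
import json

# Hypothetical query: field and argument names are placeholders, not the real schema.
QUERY = """
query Compound($name: String!) {
  compound(compoundName: $name) {
    id
    name
  }
}
"""


def build_request_body(compound_name):
    """Serialize a GraphQL request as the JSON body of an HTTP POST."""
    return json.dumps({"query": QUERY, "variables": {"name": compound_name}})


body = build_request_body("Dabrafenib")
decoded = json.loads(body)
```

The body would be POSTed with a `Content-Type: application/json` header; no authorization header is needed according to the text above.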

ENHANCED DATA ANALYSIS

Tissue-specific analysis

Building on top of the newly standardized tissue ontology, PharmacoDB 2.0 has been updated with a focus on tissue-specific analysis throughout the web application. PharmacoDB 2.0 contains cell response data for a large panel of 1757 cell lines spanning across 30 tissues. Moreover, PharmacoDB includes data for a large portfolio of 589 FDA approved drugs, several investigational drugs, tool or lead compounds, and natural substances, all with varying levels of activity across and within tissue types. Therefore, the visualizations within the web-application have been extended to help identify patterns of sensitivity across and within tissue(s). On each individual drug page, the differential sensitivity to a compound across tissues is displayed as a boxplot, giving a quick overview of the tissues that are sensitive or resistant to a compound. As with all the plots in the web-application, this can be displayed by integrating data across all datasets or filtered by dataset(s). For example, viewing this plot across all datasets for Dabrafenib clearly indicates the well known preferential sensitivity of BRAF mutated skin cancers to BRAF inhibition (37), and its absence in other tissues (even tissues known to harbour mutations in the BRAF pathway, such as bowel) (Figure 3A). Alternatively, within PharmacoDB 2.0 it is possible to investigate the differential sensitivity to a compound within a tissue type by comparing the drug dose–response curves across all cell lines tested in all datasets for a particular tissue. Continuing with the example of Dabrafenib, it is possible to identify the most sensitive and resistant skin cell lines, using either the sensitivity summary metrics provided in PharmacoDB (such as the area above the curve [AAC] or the drug concentration necessary to inhibit 50% of the maximal cell viability, IC50), or visually (Figure 3B). 
This will be a useful addition for experimentalists interested in identifying the most sensitive and/or resistant models to a particular compound for investigating mechanism, synthetic lethality or possible drug combinations.
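The two summary metrics mentioned above can be made concrete with a small numerical sketch. It assumes a three-parameter Hill curve with viability scaled to [0, 1], and it computes the AAC by trapezoidal integration over log10 concentration, normalized by the tested range; the function names are illustrative and the exact definitions used by PharmacoDB and PharmacoGx may differ in detail.

```python
import math


def hill_viability(conc, ec50, hs, e_inf):
    """Viability in [0, 1] under a three-parameter Hill model (an assumption here)."""
    return e_inf + (1.0 - e_inf) / (1.0 + (conc / ec50) ** hs)


def ic50(ec50, hs, e_inf):
    """Concentration at which modelled viability equals 0.5, or None if never reached."""
    if e_inf >= 0.5:
        return None  # the curve plateaus above 50% viability
    return ec50 * (0.5 / (0.5 - e_inf)) ** (1.0 / hs)


def aac(concs, viabilities):
    """Area above the curve on a log10 concentration axis, normalized to [0, 1]."""
    xs = [math.log10(c) for c in concs]
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        area += width * ((1.0 - viabilities[i]) + (1.0 - viabilities[i - 1])) / 2.0
    return area / (xs[-1] - xs[0])


concs = [0.01, 0.1, 1.0, 10.0]
curve = [hill_viability(c, ec50=1.0, hs=1.0, e_inf=0.0) for c in concs]
sensitivity = aac(concs, curve)
```

A larger AAC indicates a more sensitive cell line, whereas a smaller IC50 indicates higher potency; ranking by one or the other can therefore order cell lines differently.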

Figure 3.

Visualization of tissue-specific drug–response and gene–drug associations. (A) Drug response (AAC) of Dabrafenib across various tissues from all datasets. (B) Differential sensitivity of skin cell lines to Dabrafenib; cell lines and datasets of interest can be highlighted in the plot by checking the boxes. (C) Forest plot of Pearson correlations between Lapatinib response and ERBB2 expression in breast tissue. Data from RNA sequencing are shown here. The significant associations (FDR < 0.05 and Pearson correlation coefficient, r > 0.7) are highlighted in bright pink. (D) Manhattan plot showing the association of copy number alterations with Lapatinib response in all datasets and across all tissue types, with ERBB2 highlighted. The genomic coordinates are displayed on the x-axis, and the negative logarithm of the association P-value is displayed on the y-axis. The different colors of each block show the extent of each chromosome.

Query biomarkers

PharmacoDB 2.0 improves the users’ ability to explore, evaluate and compare the molecular features most strongly correlating with response to particular compounds across pharmacogenomic studies. The original version of PharmacoDB contained precomputed tables for pan-cancer associations of drug response with gene expression, mutation and copy number variation computed using the PharmacoGx package (34). The data were presented in a plain table format, useful only for ranking associations by significance or effect size. PharmacoDB 2.0 extends this by providing interpretable visualizations to compare and contextualize these associations, extending the analysis to tissue-specific associations and incorporating updates to the statistical methodology used to evaluate the strength and significance of these markers. The previous pipeline used to evaluate the associations between all feature types and drug response was based on a linear modelling approach and relied on analytical P-values derived from assumptions of normality. However, the distributions of both molecular features and drug–response metrics can deviate significantly from normal, potentially inducing biases in the estimation of these P-values (38). This is exacerbated by our addition of tissue-specific associations to the database, for which sample sizes are much smaller (dozens versus hundreds of cell lines). Therefore, PharmacoDB 2.0 includes both analytic and permutation-estimated P-values for associations, increasing the reliability of statistical significance estimates in the database (details on statistical methodology are available in the Supplementary Data). Tissue-specific gene-compound associations require evaluating the correlation between a feature and drug response within cell lines limited to a single tissue type. 
Repeating this for each of the 30 tissues in PharmacoDB becomes a significant undertaking, increasing the table associations from approximately 33.5 million pan-cancer associations to over 400 million tissue specific associations (after filtering out associations based on <20 cell lines). While analytical test results are available for all associations, permutation testing of all 400 million associations is a lengthy and ongoing process. Currently, PharmacoDB contains the results of permutation testing associations appearing in at least three datasets. As we compute the results of permutation tests for more compounds and gene features, we will be continuously updating the database. Finally, each gene-compound association (tissue-specific or pan-cancer) is now associated with a biomarker page within the web application, which contains summary information about the gene, the compound, and two types of plots contextualizing the associations available for this gene and compound pair in the database. A forest plot allows users to compare the strength of associations between the compound and gene across datasets. In Figure 3C, we show the association between the validated biomarker of ERBB2 expression (39,40) and Lapatinib response in breast tissue. The forest plot gives a visual summary of the reproducibility of this marker across studies. As a forest plot is displayed for each available molecular feature type, it also facilitates comparison of the drug–response effect between expression, copy number, and mutation alterations of the same gene. The second plot, specifically designed for copy number alteration (CNA) associations, is a Manhattan plot, displaying the strength and significance of each CNA association with this particular compound across the dataset (Figure 3D). 
This provides important context to the association, allowing a visual identification of associations that are driven by focal copy number changes, versus those that are likely passengers to larger genomic events.
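To make the idea of a permutation-estimated P-value concrete, the sketch below shuffles the drug-response values and counts how often the shuffled absolute Pearson correlation reaches the observed one. It is a generic textbook version under stated assumptions (two-sided test, 999 permutations, a +1 correction so the estimate is never zero); the procedure actually used for PharmacoDB is described in the Supplementary Data and may differ.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(feature, response, n_perm=999, seed=0):
    """Two-sided permutation P-value for the correlation of feature and response."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(pearson(feature, response))
    shuffled = list(response)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any feature-response pairing
        if abs(pearson(feature, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy example: 20 cell lines with a perfectly linear expression-response relation.
expression = list(range(20))
aac_values = [2 * v + 1 for v in expression]
p_value = permutation_p_value(expression, aac_values)
```

The smallest attainable P-value is 1 / (n_perm + 1), which is one reason why permutation testing of hundreds of millions of associations is computationally expensive.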

CONCLUSION AND FUTURE DIRECTIONS

Large-scale compound screens across various biological model systems are being carried out at a fast pace, generating valuable pharmacogenomic data for biomarker discovery, a key challenge in precision medicine. PharmacoDB 2.0 is a major update that enhances the user interface of the web application, greatly expands the pharmacogenomic data available within the database and implements pipelines following FAIR principles, which allow researchers to track the provenance of data included in each release of PharmacoDB going forward. PharmacoDB provides a reference platform for cancer researchers designing their experiments. It can be used either to check whether an experiment has been carried out by other research groups or to compare experiment outcomes with public data. Information on the cell line and tissue breakdown of drug studies in datasets, sensitivity across tissues and top gene–drug associations is particularly helpful in this regard. PharmacoDB also helps users check the association of their gene of interest with drug response and further analyze the reliability of a potential biomarker across various studies. Finally, PharmacoDB serves as a tool for drug repurposing by providing drug response on tissue types other than the drug’s approved indication. Since the first publication of PharmacoDB in 2018, several other web applications have arisen which integrate pharmacogenomics studies, including CellMinerCDB (41), and the Dependency Map Portal integrating GDSC, CTRP and PRISM drug response data with RNAi and CRISPR essentiality screening data (3,4,20,27). These three web tools have converged on a core set of functionality, allowing the exploration of dose–response data, comparisons between datasets and exploration of molecular features associated with drug response. However, each tool fills a different niche. 
DepMap is focused on providing interactive access to the Dependency Map project data, integrating molecular profiling, compound screening and essentiality datasets. As such, researchers who are interested in comparing data modalities generated by this project would benefit from access to all different data modalities through one portal. CellMinerCDB is focused on integrating NCI-derived data with the larger public datasets from the Broad, Sanger and MD Anderson institutes and uniquely provides interactive tools for multivariate modelling of compound response from molecular features. PharmacoDB, in contrast to the other websites, has focused on integrating across a wider range of pharmacogenomics studies, and includes major pan-cancer screening initiatives as well as smaller, tissue-specific studies. Researchers interested in a large collection of independent datasets for integrative or meta-analysis would benefit from the larger number of studies covered by PharmacoDB. This can be especially useful for machine learning researchers, where it is important to have independent training and multiple testing datasets to show generalizability. However, PharmacoDB does not at this time integrate any gene essentiality data, and researchers looking for such data will be better served by the other two tools. Future work on PharmacoDB will be focused on building upon its strengths. In the upcoming releases, we plan to add smaller published studies that lack the visibility of the DepMap and NCI datasets, and we are working on a formal statistical meta-analysis to leverage this growing number of datasets for biomarker discovery. PharmacoDB has already proven to be a valuable resource for computational and wet-lab researchers and, with continued improvements, we aim to make pharmacogenomics data more discoverable and accessible for the cancer research community.

DATA AVAILABILITY

The PharmacoSets (PSets) can be downloaded from https://orcestra.ca/, with the link to each dataset provided in Table 1.

CODE AVAILABILITY

All the code is available on GitHub. PharmacoDB: https://github.com/bhklab/PharmacoDB-JS; rPharmacoDI: https://github.com/bhklab/rPharmacoDI; PharmacoDI: https://github.com/bhklab/PharmacoDI; and the Snakemake database creation pipelines: https://github.com/bhklab/PharmacoDI_snakemake_pipeline

37 in total

1. mRNA and microRNA expression profiles of the NCI-60 integrated with drug activities.

Authors: Hongfang Liu; Petula D'Andrade; Stephanie Fulmer-Smentek; Philip Lorenzi; Kurt W Kohn; John N Weinstein; Yves Pommier; William C Reinhold
Journal: Mol Cancer Ther Date: 2010-05-04 Impact factor: 6.261

2. The Cellosaurus, a Cell-Line Knowledge Resource.

Authors: Amos Bairoch
Journal: J Biomol Tech Date: 2018-05-10

3. Feasibility of drug screening with panels of human tumor cell lines using a microculture tetrazolium assay.

Authors: M C Alley; D A Scudiero; A Monks; M L Hursey; M J Czerwinski; D L Fine; B J Abbott; J G Mayo; R H Shoemaker; M R Boyd
Journal: Cancer Res Date: 1988-02-01 Impact factor: 12.701

4. PharmacoGx: an R package for analysis of large pharmacogenomic datasets.

Authors: Petr Smirnov; Zhaleh Safikhani; Nehme El-Hachem; Dong Wang; Adrian She; Catharina Olsen; Mark Freeman; Heather Selby; Deena M A Gendoo; Patrick Grossmann; Andrew H Beck; Hugo J W L Aerts; Mathieu Lupien; Anna Goldenberg; Benjamin Haibe-Kains
Journal: Bioinformatics Date: 2015-12-09 Impact factor: 6.937

5. Exon array analyses across the NCI-60 reveal potential regulation of TOP1 by transcription pausing at guanosine quartets in the first intron.

Authors: William C Reinhold; Jean-Louis Mergny; Hongfang Liu; Michael Ryan; Thomas D Pfister; Robert Kinders; Ralph Parchment; James Doroshow; John N Weinstein; Yves Pommier
Journal: Cancer Res Date: 2010-03-09 Impact factor: 12.701

6. The NCI60 human tumour cell line anticancer drug screen.

Authors: Robert H Shoemaker
Journal: Nat Rev Cancer Date: 2006-10 Impact factor: 60.716

7. PharmacoDB: an integrative database for mining in vitro anticancer drug screening studies.

Authors: Petr Smirnov; Victor Kofia; Alexander Maru; Mark Freeman; Chantal Ho; Nehme El-Hachem; George-Alexandru Adam; Wail Ba-Alawi; Zhaleh Safikhani; Benjamin Haibe-Kains
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

8. Drug mechanism-of-action discovery through the integration of pharmacological and CRISPR screens.

Authors: Emanuel Gonçalves; Aldo Segura-Cabrera; Clare Pacini; Gabriele Picco; Fiona M Behan; Patricia Jaaks; Elizabeth A Coker; Donny van der Meer; Andrew Barthorpe; Howard Lightfoot; Tatiana Mironenko; Alexandra Beck; Laura Richardson; Wanjuan Yang; Ermira Lleshi; James Hall; Charlotte Tolley; Caitlin Hall; Iman Mali; Frances Thomas; James Morris; Andrew R Leach; James T Lynch; Ben Sidders; Claire Crafter; Francesco Iorio; Stephen Fawell; Mathew J Garnett
Journal: Mol Syst Biol Date: 2020-07 Impact factor: 11.429

9. The Aurora kinase/β-catenin axis contributes to dexamethasone resistance in leukemia.

Authors: Kinjal Shah; Mehreen Ahmed; Julhash U Kazi
Journal: NPJ Precis Oncol Date: 2021-02-17

10. PubChem in 2021: new data content and improved web interfaces.

Authors: Sunghwan Kim; Jie Chen; Tiejun Cheng; Asta Gindulyte; Jia He; Siqian He; Qingliang Li; Benjamin A Shoemaker; Paul A Thiessen; Bo Yu; Leonid Zaslavsky; Jian Zhang; Evan E Bolton
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971