Thomas Cokelaer (1), Elisabeth Chen (2), Francesco Iorio (3), Michael P Menden (4), Howard Lightfoot (2), Julio Saez-Rodriguez (3,5), Mathew J Garnett (2)
1. Institut Pasteur-Bioinformatics and Biostatistics Hub-C3BI, USR 3756 IP CNRS, Paris, France.
2. Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK.
3. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, UK.
4. Oncology Innovative Medicines and Early Development, AstraZeneca, Cambridge, UK.
5. RWTH Aachen University, Joint Research Centre for Computational Biomedicine, Aachen, Germany.
Abstract
Motivation: Large pharmacogenomic screenings integrate heterogeneous cancer genomic datasets as well as anti-cancer drug responses on a thousand human cancer cell lines. Mining these data to identify new therapies for cancer sub-populations would benefit from common data structures, modular computational biology tools and user-friendly interfaces.
Results: We have developed GDSCTools, a software package aimed at the identification of clinically relevant genomic markers of drug response. The Genomics of Drug Sensitivity in Cancer (GDSC) database (www.cancerRxgene.org) integrates heterogeneous cancer genomic datasets as well as anti-cancer drug responses on a thousand cancer cell lines. By including statistical tools (analysis of variance) and predictive methods (Elastic Net), as well as common data structures, GDSCTools allows users to reproduce published results from GDSC and to implement new analytical methods. In addition, non-GDSC data resources can also be analysed, since drug responses and genomic features can be encoded as CSV files.
Contact: thomas.cokelaer@pasteur.fr or saezrodriguez.rwth-aachen.de or mg12@sanger.ac.uk.
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Cancers arise from genetic alterations that accumulate in cells over the lifespan of an individual. Cancers are genetically heterogeneous and, as a consequence, patients with similar diagnoses may vary in their response to the same therapy. The path towards precision cancer medicine requires the identification of specific biomarkers, such as genetic alterations, that allow effective patient selection strategies for therapy. Large-scale pharmacological screens such as the Genomics of Drug Sensitivity in Cancer (GDSC) (Garnett et al.) and Cancer Cell Line Encyclopaedia projects (Barretina et al.) have been used to identify potential new treatments and to explore biomarkers of drug sensitivity in cancer cells. In particular, the GDSC project releases database resources periodically (www.cancerRxgene.org) (Yang et al.). A recent installment of this resource (version 17) includes cancer-driven alterations identified in 11 289 tumors from 29 tissues across 1001 molecularly annotated human cancer cell lines, and cell line sensitivity data for 265 anti-cancer compounds. A systematic identification of clinically relevant markers of drug response uncovered numerous alterations that sensitize to anti-cancer drugs (Iorio et al.).
Here, we present GDSCTools, a Python library that allows users to perform pharmacogenomic analyses such as those presented in Iorio et al. Our software complements an existing tool (Smirnov et al.) by giving access to the full GDSC dataset and by providing a platform for statistical analyses and data mining, supported by visualization tools.
2 Data formats and data wrangling tools
The GDSC database provides large-scale genomics and drug sensitivity datasets. The drug sensitivity dataset contains dose-response curves (e.g. cell viability for 5–9 drug concentrations), which can be used to derive drug sensitivity indicators (Garnett et al.; Vis et al.), such as the half-maximal inhibitory concentration (IC50) or the area under the curve (AUC) (Fig. 1A). In GDSCTools, logged IC50s are encoded as an Nc × Nd matrix, where Nc is the number of cell lines, labeled with their COSMIC identifier (http://cancer.sanger.ac.uk/cosmic), and Nd is the number of drugs. For a given drug, we denote with Yd the vector of logged IC50s across the Nc cell lines. The genomic feature dataset is also encoded as an Nc × Nf matrix, where Nf is the number of genomic features. In addition to the subset of data files available in GDSCTools (version 17 only), users can retrieve additional datasets online (e.g. methylation data, copy number variants, etc.). Database-like queries can be used to extract and use specific features (e.g. only gene amplifications or deletions). These database-like functionalities are part of the OmniBEM builder (Supplementary Material).
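The matrix encoding described above can be illustrated with a minimal sketch that uses only the Python standard library and does not call GDSCTools itself. The column name `COSMIC_ID`, the drug labels `Drug_1` and `Drug_2`, the helper names and all values are assumptions made up for this illustration; the exact file layout expected by GDSCTools is described in its documentation.

```python
import csv
import io

# Hypothetical CSV content: one row per cell line, one column per drug.
# Identifiers and logged IC50 values are invented for illustration.
IC50_CSV = """COSMIC_ID,Drug_1,Drug_2
1001,1.52,-0.30
1002,,2.10
1003,0.75,0.05
"""


def read_matrix(text, index_column="COSMIC_ID"):
    """Parse CSV text into (row_ids, column_names, rows).

    Empty cells (cell line not screened against a drug) become None.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = [name for name in reader.fieldnames if name != index_column]
    row_ids, rows = [], []
    for record in reader:
        row_ids.append(int(record[index_column]))
        rows.append(
            [float(record[c]) if record[c] != "" else None for c in columns]
        )
    return row_ids, columns, rows


def drug_vector(row_ids, columns, rows, drug):
    """Return Y_d as {COSMIC id: logged IC50}, skipping missing values."""
    j = columns.index(drug)
    return {cid: row[j] for cid, row in zip(row_ids, rows) if row[j] is not None}


cosmic_ids, drugs, ic50 = read_matrix(IC50_CSV)
y_d = drug_vector(cosmic_ids, drugs, ic50, "Drug_1")
print(len(cosmic_ids), "cell lines x", len(drugs), "drugs")
print(y_d)
```

A genomic feature matrix (Nc × Nf) could be read the same way, with binary feature columns in place of drug columns; the shared cell line identifiers are what allow the two matrices to be aligned.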
Fig. 1
(A) Drug response (cell viability versus drug concentrations) and derived drug response metrics (AUC and IC50s). (B) Distribution of IC50s in response to a given drug across a dichotomy of cell lines induced by the status of a genomic feature. (C) P-values from an ANOVA analysis versus signed effect sizes (all drug-genomic feature interactions). (D) Weight distributions resulting from training a sparse linear regression model of a given drug response using all the genomic features.
3 Data analysis tools
Using GDSCTools, genomic features can be investigated as possible predictors of differential drug sensitivity across screened cell lines. The statistical interaction Yd ∼ X between drug response and genomic features can be tested within a sample population of cell lines from the same cancer type with a t-test. However, to account for possible confounding factors (including the tissue of origin, when performing pan-cancer analyses), a more versatile analysis of variance (ANOVA) is implemented. In this model, the variability observed in Yd is first explained using the tissue covariate, subsequently using additional factors (e.g. microsatellite instability, denoted by MSI), and finally by each of the genomic features in X (one model per feature). This can be mathematically expressed as Yd ∼ C(tissue) + C(MSI) + … + feature, where the C() operator indicates a categorical variable. An ANOVA test is performed for each combination of drug and genomic feature (Fig. 1B). Outcomes of this large number of tests (Nd × Nf) are corrected for multiple hypothesis testing using Bonferroni or Benjamini-Hochberg corrections. To account for P-value inflations due to differences in sample sizes, the effect sizes of the tested statistical interactions (computed with the Cohen and Glass models) are also reported (Fig. 1C).
Unlike the ANOVA analysis, which is performed on a one drug/one feature basis, linear regression models assume that drug response can be expressed as a linear combination of the status of a set of genomic features. GDSCTools includes three linear regression methods: (i) Ridge, based on an L2 penalty term, which limits the size of the coefficient vector; (ii) Lasso, based on an L1 penalty term, which imposes sparsity among the coefficients (i.e. makes the fitted model more interpretable); and (iii) Elastic Net, a compromise between Ridge and Lasso techniques with a mixed penalty between L1 and L2 norms (see Supplementary Material for details).
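For the one drug/one feature tests described above, the multiple-testing corrections and effect sizes can be sketched with the standard library alone. This is an illustration of the textbook definitions, not the GDSCTools implementation: the function names are hypothetical, the P-values are assumed to come from tests that have already been run, and scaling Glass's delta by the feature-negative group is an arbitrary choice made here.

```python
from math import sqrt
from statistics import mean, stdev


def cohen_d(positives, negatives):
    """Cohen's d: difference in means scaled by the pooled standard deviation."""
    n1, n2 = len(positives), len(negatives)
    pooled_var = (
        (n1 - 1) * stdev(positives) ** 2 + (n2 - 1) * stdev(negatives) ** 2
    ) / (n1 + n2 - 2)
    return (mean(positives) - mean(negatives)) / sqrt(pooled_var)


def glass_delta(positives, negatives):
    """Glass's delta: difference in means scaled by one group's standard deviation.

    Assumption of this sketch: the feature-negative group is the reference.
    """
    return (mean(positives) - mean(negatives)) / stdev(negatives)


def bonferroni(pvalues):
    """Bonferroni correction: multiply each P-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented logged IC50s for cell lines with and without a given alteration.
mutated = [1.0, 2.0, 3.0]
wild_type = [3.0, 4.0, 5.0]
signed_effect = cohen_d(mutated, wild_type)

# Invented P-values standing in for four drug/feature tests.
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
print(signed_effect, glass_delta(mutated, wild_type))
print(bonferroni(raw_p), fdr)
```

With this sign convention, a negative effect size means that cell lines carrying the alteration have lower logged IC50s, i.e. they are more sensitive to the drug.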
These three methods require the optimization of an α parameter (importance of the L1 and L2 penalties) and a ρ parameter (mix ratio between the L1 and L2 penalties; Elastic Net case only). This is performed via cross-validation to avoid over-fitting. The best model is determined using as objective function the Pearson correlation between predicted and actual drug responses on the training set. The final regressor weights are output as shown in Fig. 1D. Significance of the final selected models is computed against.
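The tuning procedure described above can be sketched in pure Python. The code below is a sketch under stated assumptions, not the GDSCTools implementation: it uses a scikit-learn-style parameterisation of the objective, (1/2n)·||y − Xw||² + α·ρ·||w||₁ + ½·α·(1 − ρ)·||w||₂², fits it by cyclic coordinate descent, scans a small hypothetical grid of α values at fixed ρ, and scores each fold on its held-out cell lines. The data are synthetic and all function names are illustrative.

```python
import random
from math import sqrt


def soft_threshold(z, t):
    """Shrink z towards zero by t (proximal operator of the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_elastic_net(X, y, alpha, rho, n_iter=100):
    """Cyclic coordinate descent on centred data; returns (weights, intercept)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    norms = [sum(v * v for v in col) / n for col in cols]
    w = [0.0] * p
    resid = [yi - y_mean for yi in y]  # residuals of the all-zero model
    for _ in range(n_iter):
        for j in range(p):
            if norms[j] == 0.0:
                continue  # constant feature carries no information
            # Covariance of feature j with the residual that excludes feature j.
            z = sum(c * r for c, r in zip(cols[j], resid)) / n + norms[j] * w[j]
            new_w = soft_threshold(z, alpha * rho) / (norms[j] + alpha * (1.0 - rho))
            if new_w != w[j]:
                delta = new_w - w[j]
                resid = [r - c * delta for c, r in zip(cols[j], resid)]
                w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def predict(X, w, intercept):
    return [intercept + sum(wj * xj for wj, xj in zip(w, row)) for row in X]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def cross_validate(X, y, alpha, rho, k=5):
    """Mean Pearson correlation between predictions and observations over k folds."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        w, b = fit_elastic_net([X[i] for i in train], [y[i] for i in train], alpha, rho)
        scores.append(pearson(predict([X[i] for i in test], w, b), [y[i] for i in test]))
    return sum(scores) / k


# Synthetic data: 60 cell lines, 8 binary features, only the first two matter.
random.seed(0)
X = [[random.randint(0, 1) for _ in range(8)] for _ in range(60)]
y = [2.0 * row[0] - 1.5 * row[1] + random.gauss(0.0, 0.3) for row in X]

rho = 0.5
best_alpha = max([0.01, 0.1, 1.0], key=lambda a: cross_validate(X, y, a, rho))
weights, intercept = fit_elastic_net(X, y, best_alpha, rho)
print(best_alpha, [round(wj, 2) for wj in weights])
```

In practice ρ would be scanned as well, and a dedicated library would replace the hand-written solver; the sketch only shows how the penalty parameters, the cross-validation loop and the correlation-based score fit together.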
4 Implementation and future directions
GDSCTools is available at http://github.com/CancerRxGene/gdsctools. It is fully documented at http://gdsctools.readthedocs.io. Pre-compiled versions of the library are available at https://bioconda.github.io/. GDSCTools can be used via standalone applications to analyse a user-defined set of drugs (and genomic features) and assemble the results in an HTML report. We also provide solutions based on the Snakemake framework (Köster and Rahmann, 2012) to parallelize the analysis on distributed cluster farm architectures such as LSF or SLURM (Supplementary Material). Besides the analysis of pharmacogenomic datasets, GDSCTools can provide the framework for discovering new biomarkers through integration and mining of novel and heterogeneous datasets, including pharmacological, RNA interference or increasingly available genetic screens (e.g. CRISPR), for using alternative drug response metrics (e.g. AUC), or for implementing new analytical tools. The augmentation of genomic features with information obtained from online web services (Cokelaer et al.), such as pathway enrichment [e.g. via OmniPath (Turei et al.)], will further extend the functionality and usefulness of GDSCTools.
Conflict of Interest: none declared.
Authors: Petr Smirnov; Zhaleh Safikhani; Nehme El-Hachem; Dong Wang; Adrian She; Catharina Olsen; Mark Freeman; Heather Selby; Deena M A Gendoo; Patrick Grossmann; Andrew H Beck; Hugo J W L Aerts; Mathieu Lupien; Anna Goldenberg; Benjamin Haibe-Kains Journal: Bioinformatics Date: 2015-12-09 Impact factor: 6.937
Authors: Daniel J Vis; Lorenzo Bombardelli; Howard Lightfoot; Francesco Iorio; Mathew J Garnett; Lodewyk Fa Wessels Journal: Pharmacogenomics Date: 2016-05-16 Impact factor: 2.533
Authors: Jordi Barretina; Giordano Caponigro; Nicolas Stransky; Kavitha Venkatesan; Adam A Margolin; Sungjoon Kim; Christopher J Wilson; Joseph Lehár; Gregory V Kryukov; Dmitriy Sonkin; Anupama Reddy; Manway Liu; Lauren Murray; Michael F Berger; John E Monahan; Paula Morais; Jodi Meltzer; Adam Korejwa; Judit Jané-Valbuena; Felipa A Mapa; Joseph Thibault; Eva Bric-Furlong; Pichai Raman; Aaron Shipway; Ingo H Engels; Jill Cheng; Guoying K Yu; Jianjun Yu; Peter Aspesi; Melanie de Silva; Kalpana Jagtap; Michael D Jones; Li Wang; Charles Hatton; Emanuele Palescandolo; Supriya Gupta; Scott Mahan; Carrie Sougnez; Robert C Onofrio; Ted Liefeld; Laura MacConaill; Wendy Winckler; Michael Reich; Nanxin Li; Jill P Mesirov; Stacey B Gabriel; Gad Getz; Kristin Ardlie; Vivien Chan; Vic E Myer; Barbara L Weber; Jeff Porter; Markus Warmuth; Peter Finan; Jennifer L Harris; Matthew Meyerson; Todd R Golub; Michael P Morrissey; William R Sellers; Robert Schlegel; Levi A Garraway Journal: Nature Date: 2012-03-28 Impact factor: 49.962
Authors: Mathew J Garnett; Elena J Edelman; Sonja J Heidorn; Chris D Greenman; Anahita Dastur; King Wai Lau; Patricia Greninger; I Richard Thompson; Xi Luo; Jorge Soares; Qingsong Liu; Francesco Iorio; Didier Surdez; Li Chen; Randy J Milano; Graham R Bignell; Ah T Tam; Helen Davies; Jesse A Stevenson; Syd Barthorpe; Stephen R Lutz; Fiona Kogera; Karl Lawrence; Anne McLaren-Douglas; Xeni Mitropoulos; Tatiana Mironenko; Helen Thi; Laura Richardson; Wenjun Zhou; Frances Jewitt; Tinghu Zhang; Patrick O'Brien; Jessica L Boisvert; Stacey Price; Wooyoung Hur; Wanjuan Yang; Xianming Deng; Adam Butler; Hwan Geun Choi; Jae Won Chang; Jose Baselga; Ivan Stamenkovic; Jeffrey A Engelman; Sreenath V Sharma; Olivier Delattre; Julio Saez-Rodriguez; Nathanael S Gray; Jeffrey Settleman; P Andrew Futreal; Daniel A Haber; Michael R Stratton; Sridhar Ramaswamy; Ultan McDermott; Cyril H Benes Journal: Nature Date: 2012-03-28 Impact factor: 49.962
Authors: Thomas Cokelaer; Dennis Pultz; Lea M Harder; Jordi Serra-Musach; Julio Saez-Rodriguez Journal: Bioinformatics Date: 2013-09-23 Impact factor: 6.937
Authors: Francesco Iorio; Theo A Knijnenburg; Daniel J Vis; Graham R Bignell; Michael P Menden; Michael Schubert; Nanne Aben; Emanuel Gonçalves; Syd Barthorpe; Howard Lightfoot; Thomas Cokelaer; Patricia Greninger; Ewald van Dyk; Han Chang; Heshani de Silva; Holger Heyn; Xianming Deng; Regina K Egan; Qingsong Liu; Tatiana Mironenko; Xeni Mitropoulos; Laura Richardson; Jinhua Wang; Tinghu Zhang; Sebastian Moran; Sergi Sayols; Maryam Soleimani; David Tamborero; Nuria Lopez-Bigas; Petra Ross-Macdonald; Manel Esteller; Nathanael S Gray; Daniel A Haber; Michael R Stratton; Cyril H Benes; Lodewyk F A Wessels; Julio Saez-Rodriguez; Ultan McDermott; Mathew J Garnett Journal: Cell Date: 2016-07-07 Impact factor: 41.582
Authors: Wanjuan Yang; Jorge Soares; Patricia Greninger; Elena J Edelman; Howard Lightfoot; Simon Forbes; Nidhi Bindal; Dave Beare; James A Smith; I Richard Thompson; Sridhar Ramaswamy; P Andrew Futreal; Daniel A Haber; Michael R Stratton; Cyril Benes; Ultan McDermott; Mathew J Garnett Journal: Nucleic Acids Res Date: 2012-11-23 Impact factor: 16.971