Joshua M Dempster, Clare Pacini, Sasha Pantel, Fiona M Behan, Thomas Green, John Krill-Burger, Charlotte M Beaver, Scott T Younger, Victor Zhivich, Hanna Najgebauer, Felicity Allen, Emanuel Gonçalves, Rebecca Shepherd, John G Doench, Kosuke Yusa, Francisca Vazquez, Leopold Parts, Jesse S Boehm, Todd R Golub, William C Hahn, David E Root, Mathew J Garnett, Aviad Tsherniak, Francesco Iorio.
Abstract
Genome-scale CRISPR-Cas9 viability screens performed in cancer cell lines provide a systematic approach to identify cancer dependencies and new therapeutic targets. As multiple large-scale screens become available, a formal assessment of the reproducibility of these experiments becomes necessary. We analyze data from recently published pan-cancer CRISPR-Cas9 screens performed at the Broad and Sanger Institutes. Despite significant differences in experimental protocols and reagents, we find that the screen results are highly concordant across multiple metrics with both common and specific dependencies jointly identified across the two studies. Furthermore, robust biomarkers of gene dependency found in one data set are recovered in the other. Through further analysis and replication experiments at each institute, we show that batch effects are driven principally by two key experimental parameters: the reagent library and the assay length. These results indicate that the Broad and Sanger CRISPR-Cas9 viability screens yield robust and reproducible findings.
Year: 2019 PMID: 31862961 PMCID: PMC6925302 DOI: 10.1038/s41467-019-13805-y
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Comparison of experimental protocols and gene score results.
a Experimental settings and reagents used in the experimental pipelines underlying the two compared data sets. b Densities of individual gene scores in individual cell lines, in the Broad and Sanger data sets, across processing levels. The distributions of gene scores for previously identified essential genes[12] are shown in red. c Examples of the relationship between a gene’s score rank in a cell line and the cell line’s rank for that gene using Broad unprocessed gene scores, with gene ranks in their 90th percentile of least dependent lines highlighted. Cell lines in the 90th percentile of least dependent lines on RPS8 (a common essential gene) still rank this gene among the strongest of their dependencies. d Distribution of gene ranks for the 90th percentile of least dependent cell lines for each gene in both data sets. Black dotted lines indicate natural thresholds at the minimum gene density along each axis. The y-axis is equivalent to the y-axis in (c) at the 90th percentile mark, as indicated by the arrows.
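The rank comparison described in panels c and d can be illustrated with a minimal sketch. It assumes gene scores are stored as a nested dictionary (cell line, then gene) and that lower scores mean stronger dependency. The function name, the toy scores, and the placeholder genes `GENE_SEL` and `GENE_NON` are hypothetical and are not taken from the study.

```python
def rank_in_least_dependent_line(scores, gene, percentile=0.9):
    """Return (cell_line, rank) for `gene` in the cell line that sits at the
    given percentile when lines are ordered from most to least dependent.

    scores: {cell_line: {gene: score}}; lower score = stronger dependency
    (an assumption of this sketch). Rank 1 is the line's strongest dependency.
    """
    # Order cell lines from most to least dependent on the gene.
    lines = sorted(scores, key=lambda line: scores[line][gene])
    idx = min(int(percentile * len(lines)), len(lines) - 1)
    line = lines[idx]
    # Rank the gene among all genes scored in that cell line.
    genes_by_score = sorted(scores[line], key=scores[line].get)
    return line, genes_by_score.index(gene) + 1


# Toy data: RPS8 is strongly depleted everywhere, GENE_SEL only in line0,
# GENE_NON nowhere. All values are invented for illustration.
toy_scores = {
    f"line{i}": {
        "RPS8": -2.0 + 0.05 * i,
        "GENE_SEL": -1.5 if i == 0 else 0.01 * i,
        "GENE_NON": 0.1,
    }
    for i in range(10)
}

print(rank_in_least_dependent_line(toy_scores, "RPS8"))
print(rank_in_least_dependent_line(toy_scores, "GENE_SEL"))
```

In this toy setting the common essential gene keeps the top rank even in one of its least dependent lines, whereas the selective gene does not, which is the pattern the caption describes for RPS8.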
Fig. 2 Reproducibility of gene and cell line dependency profiles.
a Examples of gene score pattern comparisons for selected known cancer genes. b Distribution of correlations of scores for individual genes in unprocessed data. c Gene scores for strongly selective dependencies across all cell lines, with the threshold for calling a line dependent set at an FDR of 0.05. d tSNE visualization of cell lines in unprocessed data based on the correlation between cell line profiles of gene scores. Colors represent the cell line while shape denotes the study of origin. e The same as in (d) but for data batch-corrected using ComBat. f Recovery of a cell line’s counterpart in the other data set before (Uncorrected) and after correction (Corrected). Values on the y-axis show the percentage of cell lines whose matching counterpart in the other data set is within their k nearest cell lines, i.e. the k-neighborhood on the x-axis, based on a Pearson correlation distance metric. nAUC values are shown in brackets. Three different gene sets were considered to calculate the correlation between cell lines: first, all genes (uncorrected and corrected all); second, genes that are dependencies for at least one cell line (corrected variable); and third, strongly selective dependency (SSD) genes (corrected SSD).
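The counterpart-recovery curve in panel f can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes neighbors are searched among all other profiles pooled from both studies, and it takes the normalized area under the curve (nAUC) to be the mean recovery over k = 1..K. The function names and the toy profiles are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def counterpart_ranks(profiles):
    """profiles: {(study, cell_line): [gene scores]}.
    Returns {key: rank of the same cell line screened in the other study},
    where rank 1 means the counterpart is the nearest neighbor."""
    ranks = {}
    for key, vec in profiles.items():
        study, line = key
        others = [k for k in profiles if k != key]
        # Pearson correlation distance: 1 - r (smaller = more similar).
        ordered = sorted(others, key=lambda k: 1.0 - pearson(vec, profiles[k]))
        for rank, other in enumerate(ordered, start=1):
            if other[1] == line and other[0] != study:
                ranks[key] = rank
                break
    return ranks


def recovery_curve(ranks, max_k):
    """Fraction of profiles whose counterpart lies within the k-neighborhood."""
    total = len(ranks)
    return [sum(r <= k for r in ranks.values()) / total
            for k in range(1, max_k + 1)]


def normalized_auc(curve):
    """Mean recovery across k (assumed definition of nAUC for this sketch)."""
    return sum(curve) / len(curve)


# Invented gene-score profiles for two cell lines screened in both studies.
toy_profiles = {
    ("Broad", "A"): [-1.0, 0.0, -0.2, 0.1],
    ("Sanger", "A"): [-0.9, 0.1, -0.3, 0.0],
    ("Broad", "B"): [0.0, -1.0, 0.1, -0.2],
    ("Sanger", "B"): [0.1, -0.8, 0.0, -0.3],
}
toy_ranks = counterpart_ranks(toy_profiles)
toy_curve = recovery_curve(toy_ranks, max_k=3)
print(toy_ranks, toy_curve, normalized_auc(toy_curve))
```

Restricting the score vectors to a subset of genes (for example only variable or SSD genes) before calling `counterpart_ranks` would mimic the three gene sets compared in the panel.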
Fig. 3 Reproducibility of biomarkers.
a Results from a systematic association test between molecular features and differential gene dependencies (of the SSD genes) across the two studies. Each point represents a test for differential dependency on a given gene (on the second line of the point label) based on the status of a molecular feature (on the first line). b Precision/Recall and Recall/Specificity curves obtained when considering as positive controls the top significant molecular-feature/gene-dependency associations found in one of the studies and ranking all the tested molecular-feature/gene-dependency associations based on their p-values in the other study. To define top-significant associations, different significance thresholds matching the quantile threshold specified in the legend are considered, where 100% includes all associations with FDR less than 5%. c Examples of significant statistical associations between genomic features and differential gene dependencies across the two studies. The box covers the interquartile range with the median line drawn within it. The whiskers of the boxplot extend to a maximum of 1.5 times the size of the interquartile range. d Comparison of results of a systematic correlation test between gene expression and dependency of SSD genes across the two studies. The gray dashed lines indicate the thresholds of significant correlations at a 5% false discovery rate identified for each study. Labeled points show the gene expression marker on the first line and gene dependency on the second line. Each tested association between gene expression and SSD dependency is represented by a single purple point. Regions with higher density of points are shown in white. e Examples of significant correlations between gene expression and dependencies consistently identified in both studies.
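The cross-study evaluation in panel b can be sketched as a ranking exercise: associations called significant in one study serve as positives, and every tested association is ranked by its p-value in the other study. The function name, the association labels, and the p-values below are invented for illustration, and all positives are assumed to have been tested in the other study.

```python
def precision_recall_curve(pvalues_other, positives):
    """pvalues_other: {association: p-value in the second study}.
    positives: set of associations called significant in the first study.
    Returns a list of (recall, precision) pairs, one per rank cutoff."""
    ranked = sorted(pvalues_other, key=pvalues_other.get)  # smallest p first
    curve = []
    hits = 0
    for cutoff, assoc in enumerate(ranked, start=1):
        if assoc in positives:
            hits += 1
        curve.append((hits / len(positives), hits / cutoff))
    return curve


# Hypothetical "feature/dependency" labels and p-values.
toy_pvalues = {
    "featA/geneA": 1e-8,
    "featB/geneB": 1e-6,
    "featC/geneC": 1e-3,
    "featD/geneD": 1e-2,
}
toy_positives = {"featA/geneA", "featB/geneB", "featD/geneD"}

toy_pr = precision_recall_curve(toy_pvalues, toy_positives)
for recall, precision in toy_pr:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Repeating the call with progressively stricter positive sets (for instance only the top quantile of significant associations) corresponds to the multiple curves drawn for the different quantile thresholds in the legend.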
Fig. 4 Influence of reagent library on gene score.
a Distributions of sgRNA depletion score correlations for sgRNAs targeting genes with varying NormLRT scores within each data set (left) and between them (right). Each gene is binned according to the mean of its NormLRT score across the two data sets. The x-axis defines the color gradient. The y-axis reports the average of all correlations between pairs of sgRNAs that belong to the same data set and target that gene. Boxes cover the interquartile range with the median indicated by a horizontal line. Whiskers extend up to 1.5 times the interquartile range with outliers shown as fliers. b Relationship between sgRNA correlation within data sets and gene correlation between data sets. The linear trend is shown for SSD genes. c The mean depletion of guides targeting common dependencies across all replicates vs Azimuth estimates of guide efficacy. The x-axis defines the color gradient. d Comparison of Broad and Sanger unprocessed gene scores for genes matching SSD with highest minimum median estimated sgRNA efficacy (MESE) across both libraries (left, TFA2C), common dependency in either data set and greatest difference between KY and Avana MESE (center, EIF3F), and the SSD with worst KY MESE (right, MDM2).
Fig. 5 Influence of time point.
a Distribution of early and late common dependency gene scores in the Broad and Sanger data sets averaged across cell lines. Boxes cover the interquartile range with the median indicated by a horizontal line. Whiskers extend up to 1.5 times the interquartile range with outliers shown as fliers. b Distribution of corrected gene scores for asparagine synthetase (ASNS) by media and institute. Blue and orange lines indicate the median of nonessential and essential gene scores, respectively. c GO terms significantly enriched in Broad-exclusive dependencies. For each GO term the bar length indicates the ratio of cell lines showing Broad-exclusive dependencies with a statistically significant enrichment of that GO term.
Fig. 6 Results of replication experiments.
a Original and replication screens from each institute plotted by their first two principal components. HT-29 screens are highlighted. Axes are scaled to the variance explained by each component. b Correlations of the changes in gene score caused when changing a single experimental condition. c The difference in unprocessed gene scores between Broad screens of HT-29 and the original Sanger screen (Sanger minus Broad), beginning with the Broad’s original screen and ending with the Broad’s screen using the KY library at the 14-day time point. Each point is a gene. The horizontal axis is the mean difference of the gene’s score between the Sanger and Broad original unprocessed data sets. d A similar plot taking the Broad’s original screen as the fixed reference and varying the Sanger experimental conditions (Broad minus Sanger).