Literature DB >> 26780180

Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9.

John G Doench¹, Nicolo Fusi², Meagan Sullender¹, Mudra Hegde¹, Emma W Vaimberg¹, Jennifer Listgarten², Katherine F Donovan¹, Ian Smith¹, Zuzana Tothova^1,3, Craig Wilen⁴, Robert Orchard⁴, Herbert W Virgin⁴, David E Root¹.

Abstract

CRISPR-Cas9-based genetic screens are a powerful new tool in biology. By simply altering the sequence of the single-guide RNA (sgRNA), one can reprogram Cas9 to target different sites in the genome with relative ease, but the on-target activity and off-target effects of individual sgRNAs can vary widely. Here, we use recently devised sgRNA design rules to create human and mouse genome-wide libraries, perform positive and negative selection screens and observe that the use of these rules produced improved results. Additionally, we profile the off-target activity of thousands of sgRNAs and develop a metric to predict off-target sites. We incorporate these findings from large-scale, empirical data to improve our computational design rules and create optimized sgRNA libraries that maximize on-target activity and minimize off-target effects to enable more effective and efficient genetic screens and genome engineering.

Entities: CellLine Chemical Disease Gene Mutation Species

Mesh：

Substances：
RNA, Guide

Year: 2016 PMID： 26780180 PMCID： PMC4744125 DOI： 10.1038/nbt.3437

Source DB: PubMed Journal: Nat Biotechnol ISSN： 1087-0156 Impact factor: 54.908

INTRODUCTION

The Cas9 protein, an RNA-directed DNA endonuclease, is a powerful tool for manipulating the genome[1-4]. The ease of programming Cas9 has enabled CRISPR-based genetic screens[5], identifying well-established genes and providing novel insight into gene function for multiple phenotypes[6-8]. Initial libraries were designed with little knowledge of sgRNA activity rules, a critical design parameter, as interpreting screening data requires consistency among multiple sgRNAs targeting the same gene to distinguish true hits from false positives. Inactive and non-specific sgRNAs reduce the effective gene coverage of the library and the accuracy of the hit list. Many studies indicate that Cas9 off-target activity depends on both sgRNA sequence and experimental conditions[10-14]. These studies have provided qualitative but incomplete understanding of specificity determinants. Finding generalizable patterns is quite challenging, requiring large datasets to adequately sample the vast number of possible imperfect sgRNA:DNA interactions to reveal sequence features for prediction of off-target activity. Here, we present the design and characterization of human and mouse genome-wide sgRNA libraries based on our previously published rules for predicting on-target efficiency[9]. Building on screening data generated with the new libraries and large-scale assessment of off-target activity, we develop improved algorithms for on- and off-target activity prediction, allowing further optimization of our genome wide libraries.

RESULTS

Genetic screens with the Avana and Asiago libraries

Previously, we examined the activity of 1,841 sgRNAs to determine sequence features leading to increased efficacy and developed rules for improved sgRNA design (Rule Set 1)[9]. We implemented these rules in human and mouse genome-wide libraries, named Avana and Asiago, respectively, and tested their performance in phenotypic screens. We selected six sgRNAs per gene according to three criteria: Rule Set 1 score, specificity within protein coding regions, and the target site location within the gene (). The distribution of Rule Set 1 scores for the previously-published GeCKO[6,15] and Koike-Yusa et al.[8] libraries resembles the null distribution, as these were not designed with on-target criteria (). The Wang et al.[7] library incorporated early rules for increasing sgRNA activity and has higher Rule Set 1 scores. The Avana and Asiago libraries are, by design, the most enriched for sgRNAs predicted to be active. We first tested our library using a well-established screening system, resistance to vemurafenib in A375 melanoma cells, which carry the BRAF V600E mutation and are sensitive to MAPK pathway inhibition ()[6,16,17]. We transduced each of the six Avana subpools, each containing 1 sgRNA per gene, in biological replicates at low multiplicity of infection and performed the same screen with the GeCKOv1 and GeCKOv2 libraries; for cells infected with the Avana library, we also applied the MEK-inhibitor selumetinib[18,19], another small molecule previously examined in this resistance model (). We ranked sgRNAs by their log2-fold-change relative to their abundance in the plasmid DNA pool and averaged ranks from the two replicates (). We first used the RIGER algorithm to analyze the GeCKOv1, GeCKOv2, and Avana screens (). The weighted sum option with RIGER, however, incorporates information from only the top two perturbations targeting a gene, discarding additional evidence about specificity of the hit list provided by other sgRNAs. We thus developed STARS, an alternative gene-ranking system to generate false discovery rates (FDR)[20], rewarding genes where a high fraction of sgRNAs score, similar to the MAGeCK algorithm[21]. Using STARS, we observed that six genes previously validated to confer vemurafenib resistance (CUL3, MED12, NF1, NF2, TADA1, TADA2B) all score with FDR < 1% (). We directly compared the performance of the three libraries in the lentiGuide vector. At FDR < 10%, 27 genes scored with GeCKOv1 whereas 60 genes scored with GeCKOv2 (), an increase likely due to library size, as GeCKOv1 averages 3 - 4 sgRNAs per gene while GeCKOv2 contains 6 sgRNAs per gene. 92 genes scored at FDR < 10% with Avana, which also contains 6 sgRNAs per gene. Similarly, 94 genes scored with the Avana library screened in lentiCRISPRv2, and 36 genes scored at FDR < 10% across both vectors. We examined the data for new modulators of vemurafenib resistance. With the Avana library, the tumor suppressors PTEN, TP53, and RB1 scored with FDR < 1%, while in the GeCKOv2 library, these genes score with FDR of 78%, 7%, and 2%, respectively (). We thus annotated the hit lists for PanCancer genes, a highly curated and validated set containing many genes whose loss may restore the MAPK pathway or activate alternative survival pathways ()[22]. We observed that the Avana library identified more PanCancer genes than either version of GeCKO. At FDR < 10%, we identified 4 PanCancer genes with GeCKOv1 (p = 1.1 × 10−5, hypergeometric distribution), 6 genes with GeCKOv2 (p = 2.2 × 10−7), and 10 with Avana (p = 2.9 × 10−11). In the selumetinib screen, Avana identified 20 genes (p = 4.6 × 10−24), including 9 of the 10 genes identified at the same threshold in the comparable vemurafenib dataset (). We used a competition assay to validate a selection of newly-identified hits. Cells carrying a single sgRNA were mixed with EGFP-labeled cells; if that sgRNA provided a selective growth advantage, then EGFP-negative cells would enrich over time, as measured by flow cytometry. We tested this assay with 9 positive control sgRNAs targeting validated hits (NF1, NF2, CUL3); 10 negative control sgRNAs with no known target; and 5 sgRNAs to a new hit, TP53. When grown without vemurafenib, the fraction of sgRNA-containing cells decreased slightly, on average 8.4% across all sgRNAs (). In the presence of vemurafenib, the populations with negative control sgRNAs died and could not be assessed by flow cytometry, whereas positive control and TP53 sgRNAs came to represent an average of 93% and 82% of the population, respectively (). In a similar competition assay for the 6 previously-validated genes and 9 newly-identified genes, the average representation of sgRNA-containing cells after addition of vemurafenib ranged from 80% to 90% and 58% to 84%, respectively, validating the reproducibility of all 9 selected hits identified by the Avana primary screen (). Several of these newly-validated hits represent additional members of complexes previously identified[23,24]. Although screening more sgRNAs per gene increases the power of a library to detect hits, larger libraries require larger-scale screening, reducing the number of conditions that could be screened and barring models with limited cell numbers. We performed subsampling analysis to determine how decreasing the number of sgRNAs affected the hit list for the vemurafenib screen. The Avana library screened in lentiGuide identified 92 hits at FDR < 10%. For combinations of four 1 sgRNA/gene subpools, on average 52% of genes were recovered at the same FDR (). At a relaxed FDR threshold of < 75%, 92% of genes were recovered with 4 sgRNAs; similar rates were seen for the Avana library screened in lentiCRISPRv2 (93%) and for the selumetinib screen with both lentiGuide (93%) and lentiCRISPRv2 (93%) (). This observation suggests a useful screening strategy, especially for arrayed screens or pooled models where scale-up is costly or prohibitive: perform a primary screen with a small number of sgRNAs at genome-scale, use a relatively relaxed cut-off for hit selection, then perform a secondary screen on hundreds of primary screen hits using additional sgRNAs per gene. We next examined library performance for negative selection, depletion of sgRNAs targeting proliferation-essential genes. Unlike positive selection, where small fractions of cells can strongly enrich, dropout screens represent a more challenging screening modality, as large fractions of cells receiving perturbations must show the phenotype in order to score. We performed a viability screen using the Avana, GeCKOv1, and GeCKOv2 libraries in A375 cells () and analyzed sgRNAs targeting a curated set of 291 core essential genes ()[25]. To enable comparison across libraries of different sizes, we performed ROC-AUC analysis of individual sgRNAs. The Avana library performed significantly better than either version of the GeCKO library, producing AUC values of 0.77 and 0.80, versus GeCKO AUC values of 0.67 to 0.70 (). Using STARS, Avana identified 171 of the 291 genes (59%), while GeCKOv2 identified 76 (29%) at FDR < 10% (). Similarly, using Gene Set Enrichment Analysis (GSEA)[26], the Avana library identified essential cellular complexes and processes with greater statistical confidence than GeCKOv2 (). Combining data from the Avana and GeCKO libraries followed by STARS analysis identified 1,545 and 1,952 essential genes in A375 cells at FDR < 10% and < 25%, respectively, a range consistent with a recent report ()[27]. We confirmed the improved negative-selection screen performance of the Avana library in HT29 cells, a colon cancer line (). Here, we screened the first four subpools of the Avana library and observed a significant improvement in depletion of core essential genes, despite employing fewer sgRNAs: Avana with 4 sgRNAs identified 161 genes, whereas GeCKOv2 with 6 sgRNAs identified 92 genes (). To further test the Avana library in different assays, we screened for resistance to the purine analog 6-thioguanine in three different cell lines. In A375 cells, single sgRNAs dominated each of the six subpools, and all six targeted HPRT1, a gene well-established in this assay ()[28]. The HPRT1 sgRNA in each subpool was enriched at least 700-fold more than the second-ranked sgRNA, suggesting that none of the other ~110,000 sgRNAs in this library led to significant off-target activity at the HPRT1 locus. In HT29 and 293T cells, HPRT1 sgRNAs were highly-enriched, but targeting of NUDT5 also produced 6-thioguanine resistance (). We validated these observations by individual infections in these 3 cell types with 2 sgRNAs each against HPRT1 and NUDT5 (). To understand the differential enrichment of NUDT5 sgRNAs among cell lines, we examined genomic DNA by the TIDE technique[29]. Both sgRNAs led to modification of NUDT5 in all three cell lines in the absence of selection (), suggesting that observed phenotypic differences between cell lines were not due to ineffective gene editing. 6-thioguanine is a substrate for HPRT1 in the purine salvage pathway, substituting for hypoxanthine and thereby causing toxicity (). NUDT5 plays an earlier role in purine synthesis, ultimately leading to production of 5-phosphoribosyl diphosphate (PRPP), another substrate for HPRT1-catalyzed production of inosine monophosphate[30]. Depletion of NUDT5 thus may prevent the toxic effects of 6-thioguanine incorporation ()[31]. Differences in phenotypic severity between A375 and HT29 cells might relate to the fact that A375 cells have lost one allele of the NUDT5 gene[32] and may have upregulated alternative means of producing PRPP. Finally, we tested the Asiago mouse library in a model of interferon signaling in which BV2 cells were challenged with interferon gamma and STARS analysis of surviving cells revealed enrichment of well-established mediators of interferon signaling (). Nine of the ten most-enriched sgRNAs targeted Jak1, Stat1, Ifngr1, or Ifngr2, and all 4 sgRNAs for each of these genes scored in the top 25 of the 79,641 sgRNAs screened. Interestingly, query of the Molecular Signatures Database (MSigDB)[33] with human homologs for the genes identified at FDR < 50% identified a significant enrichment for mitochondrial genes (FDR 8.99×10−12, hypergeometric test) not previously implicated in interferon signaling.

Rule Set 2 for sgRNA on-target activity

We originally modeled on-target activity with 1,841 sgRNAs targeting 9 human and mouse genes assayed by flow cytometry (FC dataset)[9]. To improve predictions, we designed a tiling library targeting all possible NGG PAM-containing sites in 15 genes known to confer resistance to vemurafenib, selumetinib, 6-thioguanine, or etoposide. In A375 cells, several genes reported as hits for resistance to etoposide (TOP2A, CDK6)[7] and 6-thioguanine (MLH1, MSH2, MSH6, and PMS2)[8] in other cell types failed to show significant enrichment under the conditions tested here. For small molecules that yielded resistant cells, we examined sgRNA enrichment (). As expected, the activity of individual sgRNAs correlated well between vemurafenib and selumetinib (). These 2,549 sgRNAs, targeting 8 genes, comprise the RES (resistance) dataset. With combined data for over 4,000 sgRNAs targeting 17 genes, we examined the efficacy of each sgRNA versus its position in the protein coding region (). Initial design guidelines recommended targeting close to the N'terminus. Alternatively, Vakoc and colleagues recently reported on improved generation of loss-of-function alleles when targeting specific domains of several well-characterized proteins[34]. For the 17 genes under consideration here, however, we did not observe discrete regions of the protein-coding sequence that were obviously more-productive target sites for gene inactivation (). Overall, we observed that only the C-terminal 10% of the protein-coding region showed a statistically significant reduction in activity (). An expanded target site window allows more flexibility in sgRNA selection, both to optimize for on-target efficacy and to minimize off-target potential. Previously, to generate Rule Set 1, we treated the top 20% of sgRNAs for each gene as highly active and trained on a 20% vs. 80% classification model to identify significant predictive features.[9] We compared the performance of this model, which used support vector machine (SVM) with logistic regression, to other classification-based approaches applied to sgRNA activity predictions, SVM[7,35] and L1 logistic regression[36], evaluating performance with a leave-one-gene-out approach. Assessing the FC, RES, and combined datasets, we observed the best performance from SVM plus logistic regression (). The SVM-alone or L1 logistic regression models augmented with dinucleotide features, previously found to be informative, performed worse than SVM plus logistic regression on these datasets (). Discretizing sgRNA activity into 20% vs. 80% classes, although convenient, likely resulted in considerable information loss, and thus we explored regression models using normalized sgRNA ranks. We initially retained the same evaluation metric used in the derivation of Rule Set 1, AUC, and observed that L1-regularized linear regression (a continuous regression model) outperformed L1-regularized logistic regression (a discrete classification model) when trained on the same FC data with the same features; we observe a similar result with Spearman correlation, a more suitable metric for regression models (). These results suggested further exploration of linear regression models. Another important factor besides the modeling approach for predicting activity is the feature set. Previously, we used single and dinucleotide position-specific nucleotides, and the GC count of the sgRNA[9]. We hypothesized that additional features, such as position-independent nucleotide counts and the location of the sgRNA target site within the gene, could improve predictions. Biochemical and structural studies indicate that the sgRNA participates in step-wise association with DNA, suggesting that localized thermodynamic properties may also be useful[37,38]. These additional features improved the L1 regression model for all datasets (). Microhomology features, suggested to improve sgRNA activity[39], were predictive on their own but did not improve performance when added to our final model. Because regression models and additional features gave improved results, we investigated the performance of five different regression models (see Methods). On the RES data and the combined dataset, the gradient-boosted regression trees model exhibited better Spearman correlation than all others, while for the FC dataset, it performed comparably to L1- and L2-regression models (). We conclude that the gradient-boosted regression trees model was the best among those compared here. In this model, dinucleotide and single nucleotide identities at each position of the sgRNA continued to contribute most to activity predictions, representing a total of 58% of the Gini importance, a measure of the weight each feature contributes to the overall model (). New features contributed substantially: position-independent counts of single and dinucleotides, location of the sgRNA within the protein, and melting temperatures of different regions gave Gini importance values of 16%, 13%, and 11%, respectively. To gauge if the amount of data had saturated the model, we systematically added genes to the training set, assessed performance, and observed a plateau in performance with the combined dataset (). Furthermore, when trained and tested on all pairwise combinations of datasets, the model trained with the combined dataset produced the best results across all test sets (). We refer to the gradient-bossed regression trees model with the augmented feature set trained on the combined dataset as Rule Set 2. For further validation of Rule Set 2, we assessed performance on independent negative selection datasets not used in the construction or evaluation of the final model, from two human cell lines[7] and mouse embryonic stem cells[8], curated by Xu and colleagues[7,8,36]. Rule Set 1 scores clearly distinguished sgRNAs classified as effective vs. ineffective in all three datasets, with p-values of 1.4×10−32, 1.8×10−16, and 1.1×10−11 (two-sample Kolmogorov-Smirnov test, ). With Rule Set 2 we observed even more significant distinctions between effective and ineffective sgRNAs, with respective p-values of 5.9×10−80, 2.1×10−24, and 3.9×10−35 (), illustrating the generalizability of our new modeling approach. We next examined curated datasets from screens that used CRISPRa/i technology[36,40]. In the largest CRISPRi dataset, comprising over 5,000 sgRNAs in a negative selection screen, Rule Set 2 exhibited the greatest difference in the distributions of scores for sgRNAs classified as effective vs. ineffective (p-value = 1.8×10−40, two-sample Kolmogorov-Smirnov test, ). A smaller positive selection screen with 571 sgRNAs gave a p-value of 1.1×10−4. In the smallest dataset, 532 sgRNAs from a positive selection CRISPRa screen, Rule Set 2 again favored effective sgRNAs but with a higher p-value (0.14). Together, these observations suggest that commonalities between CRISPR knockout and CRISPRa/i, such as the interactions between the sgRNA and Cas9, and the sgRNA and DNA, represent meaningful components of the predictive power of Rule Set 2.

CFD score to predict sgRNA off-target effects

The off-target effects of sgRNAs have been extensively investigated across many experimental systems and conclusions regarding the extent of off-target activity have varied widely.[5] We sought to understand off-target effects under typical pooled-screening conditions in mammalian cells using a library targeting the coding sequence of human CD33 with all possible sgRNAs, regardless of PAM. For all sites with the canonical NGG PAM, in addition to the perfect-match sgRNAs, we introduced three types of sgRNA mutations: first, all 1 nucleotide deletions; second, all 1 nucleotide insertions; third, all 1 nucleotide mismatches to the target DNA, generating a library with 27,897 unique sgRNAs. To these we added 10,618 sgRNAs targeting the mouse Thy1 locus to serve as negative controls. We infected MOLM13 cells and performed flow cytometry to isolate CD33-negative cells (). We examined the activity profile of perfect match sgRNAs with an NGG PAM to identify CD33 target sites that led to robust loss-of-function, and subsequent analyses only examined sgRNAs in this region (). Using negative control sgRNAs to define inactivity, we compared the activities of sgRNAs with different PAM sequences. As expected, sgRNAs with an NGG PAM had the highest proportion of active sgRNAs, 91% (). Some alternative PAM sequences led to notable but smaller rates of sgRNA activity: NAG (26%), NCG (11%), and NGA (7%). While alternative PAMs should be avoided to maximize on-target activity, they must be considered as potential off-target sites[14]. To examine sgRNAs with mismatches, deletions, or insertions to their target DNA, we identified 65 perfect-match sgRNAs that produced > 4.8-fold enrichment for CD33 knockout, two standard deviations above the mean of the negative control sgRNAs (), and their 9,914 associated variant sgRNAs. The activity distribution of these variants was bimodal, providing a clear threshold to classify each as active or inactive. We then calculated the percent-activity for sgRNAs sharing the same nucleotide mutation at the same position, that is, the fraction of individual sgRNA:DNA interactions that led to activity, a calculation enabled by sampling from large numbers of sgRNAs (). We also calculated the delta-log-fold-change (dLFC) to quantify the decrease in activity for each variant sgRNA relative to its perfect match sgRNA (). The two metrics for assessment of mismatch variants, percent-activity and dLFC, were highly correlated (R = −0.97, Pearson correlation coefficient, ). We next examined the changes in activity produced by the three different types of variants. Deletions in the RNA, which create a single bulged DNA base, seldom led to activity (). Of the 76 nucleotide and position combinations (we did not assess nucleotide number 1), only 14 of these types of deletions retained high levels of cleavage using the dLFC metric (p-value > 10−4), 13 of which occurred at positions 2, 3, 19, and 20. Notably, at positions 19 and 20 we observed that deletion of cytidine in the RNA to create a bulged guanine in the DNA was poorly tolerated, with only 4% of sgRNAs showing activity, while deletion of other nucleotides led to an average of 44% of active sgRNAs. Likewise, an extra base in the sgRNA sequence rarely preserved activity; only positions 2 and 3 gave rise to high levels of activity at the same p-value threshold (). Asymmetric sgRNA:DNA interactions that lead to bulged RNA or DNA bases have been examined for a small number of examples and our results are qualitatively consistent[41]. Finally, we examined single-base mismatches between the sgRNA and DNA. Overall, both the position and identities of mismatched nucleotides played major roles in determining activity (). For example, rG:dT mismatches were generally tolerated even in the PAM-proximal region: across all positions, 256 of 289 (89%) of these interactions were active. Conversely, only 37% of rC:dC interactions preserved sgRNA activity. Other nucleotide mismatches, however, had quite different effects at different positions. For example, 0 of 46 purine:purine mismatches at position 16 were active, compared to 30 of 31 mismatches with a thymidine in the DNA. By contrast, at position 19, 41% of purine:purine mismatches and 48% of mispaired thymidines led to active sgRNAs. These results suggest that each type of mismatch must be evaluated individually and that metrics to score off-target sites must incorporate both the position and identity of imperfectly matched RNA:DNA interactions. Using these data, we propose the Cutting Frequency Determination (CFD) score to calculate the off-target potential of sgRNA:DNA interactions (see Methods). We tested the ability of the CFD score to predict off-target, imperfect-match activity for an independent dataset of 89 sgRNAs designed to target H2-D, some of which were previously shown to produce effective protein knockout of H2-K, a gene with highly similar sequence[9]. We compared the CFD score to two previously-described off-target scoring metrics, the CCtop[42] and Hsu-Zhang[14] scores, both of which consider the position and number, but not identity, of mismatches between the RNA and DNA. The CFD score correlated best with the measured activity of these 89 off-target sgRNAs (). Dividing the dataset by number of mismatches, the CFD score performed best in each subset; with 2 or more mismatches, both the Hsu-Zhang and CCtop scores showed poor correlation to measured activity (). To further evaluate the CFD score, we examined results obtained with GUIDE-Seq, an experimental approach for detection of sgRNA target sites[43]. In those data, 9 sgRNA sequences produced a total of 402 off-target sites. We compared the number of GUIDE-Seq reads, a measure of sgRNA activity, to the calculated off-target scores and observed the best Pearson correlation with the CFD score (R = 0.40), followed by CCtop (0.31) and Hsu-Zhang (0.26). As GUIDE-Seq is an ostensibly unbiased method to detect off-target sites, it is noteworthy that only 11 (2.5%) of the observed sites received CFD scores less than 0.05 and were among the weakest, averaging 100-fold fewer reads than the perfect match sgRNAs. We next examined the specificity of these off-target scoring metrics. Importantly, we observed that the alignment tool Bowtie2[44], used by the E-CRISP portal[45], does not return all possible mismatched sites, especially as the number of mismatches increases (). Thus, the poor performance of the E-CRISP and Zhang portals in identifying off-target sites documented by Joung and colleagues[43] may be partly a failure to identify all high-homology candidate sites rather than the scoring of the sites once found. We therefore used Cas-OFFinder[46], a comparatively slow but comprehensive tool, to identify all possible sites with 6 or fewer mismatches to the 9 sgRNAs examined by GUIDE-Seq and scored them with the various metrics. Since the number of potential off-target sites increases exponentially with the number of mismatches, we examined each separately (excluding sites with one mismatch, as there were no such sites with zero Guide-Seq reads, precluding a determination of sensitivity). We observed that the CFD score performed best, with AUC values ranging from 0.82 – 0.98, while Hsu-Zhang ranged from 0.61 to 0.87 and CCtop ranged from 0.64 to 0.77 (). We conclude that the CFD score will enable avoidance of the majority of high-frequency off-target effects. To determine the impact of off-target activity on screening results, we used a curated set of 927 non-essential genes[25]. The Avana library contains 4,950 sgRNAs targeting these genes, and we examined their behavior in the viability screen in A375 cells described above. As expected, proportionally far fewer of these sgRNAs showed evidence of strong depletion (). Using the CFD score, there was a statistically-significant increase in the number of off-target sites predicted for the most-lethal of these sgRNAs (). This observation suggests that sgRNAs with more predicted off-target sites are more frequently erroneously deplete in a negative selection screen, and thus avoidance of such promiscuous sgRNAs will lead to improved library performance.

Optimized sgRNA libraries: Brunello and Brie

Finally, we incorporated our improved on-target activity predictions, Rule Set 2, with our off-target avoidance metric, the CFD score, to design fully optimized libraries for the human and mouse genomes, named Brunello and Brie, respectively (). These new libraries are designed to maximize Rule Set 2 scores and minimize off-target sites with high CFD scores (). Brunello and Brie thus likely represent a clear improvement over existing libraries and are available via Addgene. Our web portals incorporate Rule Set 2 and CFD models (http://www.broadinstitute.org/rnai/public/analysis-tools/sgrna-design and http://research.microsoft.com/en-us/projects/azimuth/) and provide code for incorporation of these scores into existing tools (Supplementary Information).

DISCUSSION

Using our previous rules for rational design of sgRNAs, Rule Set 1, we created human and mouse genome-wide libraries and performed genetic screens. In both positive and negative selection screens with the Avana library, expected hits (e.g. PanCancer genes[22], core essentials[25], and previously-validated targets[6]) scored as strong hits with multiple sgRNAs, and the screens produced a greater number of statistically-significant genes compared to existing libraries, allowing the identification and confirmation of new genes in these phenotypic assays. Predictions of sgRNA activity with Rule Set 1 were imperfect, however, leaving opportunity for further improvements in library design. By doubling the size of the sgRNA activity dataset and identifying a more effective modeling approach, we developed Rule Set 2, which shows demonstrably improved performance versus Rule Set 1 across multiple datasets from multiple laboratories. Notably, the ability of Rule Set 2 to predict performance extends to CRISPRa and CRISPRi screens. Thus, sgRNA designs based on Rule Set 2 (such as our Brunello and Brie libraries) should deliver substantial additional improvements in library performance. Another major factor driving library performance is specificity. Joung and colleagues compared their experimentally-detected off-target sites to those found by two common off-target prediction algorithms and noted that “neither program identified the vast majority of off-target sites found by GUIDE-seq.”[43] Whereas Rule Sets 1 and 2 represent data-driven, quantitative models of on-target sgRNA effectiveness, a quantitative analysis of off-target interactions based on similarly large-scale data has been lacking, and the ability to predict off-target effects therefore quite limited. Our analysis of 9,914 sgRNA variants that represent all single-base mismatches and indels for 65 DNA targets revealed that the pattern of activity effects is complex, and would not be fully apparent by examining smaller numbers of sgRNA:DNA interactions. We used these data to derive the CFD score to predict the likelihood of off-target cutting and showed that it outperformed other off-target scoring metrics. Together, these results provided direction for the creation of even more active and specific libraries for the human and mouse genomes. The current work provides a resource for the design of improved sgRNA reagents for large-scale screens and small-scale gene editing experiments. As one of the biggest challenges for screens is identifying faithful models for the biology of interest, confirming the generalizability of a finding may require examination of multiple cellular models. Smaller libraries with a high fraction of active and specific sgRNAs, enabled by better on- and off-target activity predictions, will facilitate cost-effective screening across a range of model systems. Directly comparing the effectiveness of CRISPR knockout libraries to other gene-depletion approaches, such as state-of-the-art RNAi[47] and CRISPRi[40] libraries, will likely reveal different strengths, and thus complete assessment of model systems will benefit from screening with multiple modalities. Additionally, future work will determine if the results obtained using Streptococcus pyogenes Cas9 provide useful lessons regarding the activity of other Cas9 proteins. The experimental and analytical approaches described here illustrate a powerful method to uncover factors contributing to sgRNA activity and specificity and to optimize reagent design for large-scale functional genomics.

Online Methods

Avana and Asiago Libraries

To design these libraries, we targeted protein-coding transcripts annotated by the Consensus Coding Sequence Database (CCDS), totaling 18,675 genes for the human genome and 20,077 genes for the mouse genome. When a gene had more than one CCDS ID, we picked the shortest transcript per gene. We annotated NGG protospacer adjacent motifs (PAMs) on both the plus and minus strands and then selected sgRNAs for inclusion in the library on the basis of three criteria, and divided these criteria into tiers. A most-preferred sgRNA would fulfill the first tier of all three criteria. However, not all sgRNAs can have these properties, and thus to reach a quota of 6 sgRNAs per gene, step-wise relaxation of tiers across criteria was necessary, and the properties of each step-wise round of relaxation are given in . Additionally, we excluded sgRNAs with a BsmBI site in their sequence or with a run of four or more thymidines. We selected up to 6 sgRNAs per gene, and this resulted in a human library (Avana) of 110,257 sgRNAs and a mouse library (Asiago) of 120,453 sgRNAs (). The final distributions of sgRNAs across these criteria within each tier chosen for inclusion in the Avana and Asiago libraries are provided. Criterion A: Location of the target site in the protein coding sequence, with the four tiers divided by quartiles of the target: (i) 0 - 25% of the protein coding region, (ii) 25 – 50%, (iii) 50 – 75%, (iv) 75 – 100%. Criterion B: To mitigate off-target effects, sequence uniqueness of various lengths, counting from the 3’ end of the sgRNA, e.g. of all possible sgRNAs targeting protein coding genes, the PAM-proximal (i) 13 nts are unique, (ii) 17 nts are unique, (iii) 20 nts are unique or (iv) the sgRNA sequence is not unique. Criterion C: Rule Set 1 on-target score to maximize gene knockout efficacy, divided into deciles; e.g. (i) score of 0.9 – 1.0, (ii) score of 0.8 – 0.9, etc.

Library Creation

Oligonucleotides were synthesized at the Broad Technology Laboratory (BTL) on a B3 Synthesizer (CustomArray). To each sgRNA sequence, BsmBI recognition sites were appended along with the appropriate overhang sequences (underlined) for cloning into sgRNA expression plasmids. Additional primer sites were appended to allow differential amplification of subsets from the same synthesis pool. The final oligonucleotide sequence was thus: 5’-[Forward Primer]CGTCTCACACCG[sgRNA, 20 nt]GTTTCGAGACG[Reverse Primer]. Unique primer sets were used to amplify individual subpools using 25 μL 2x NEBnext PCR master mix (New England Biolabs), 2 μL of oligonucleotide pool (~40 ng), 5 μL of primer mix at a final concentration of 0.5 μM, and 18 μL water. PCR cycling conditions: 30 seconds at 98°C, 30 seconds at 53°C, 30 seconds at 72°C, for 24 cycles. The resulting amplicons were PCR-purified (Qiagen), digested with Esp3I (Fisher Scientific) and cloned into either lentiGuide (pXPR_003, Addgene 52963) or lentiCRISPRv2 (pXPR_023, Addgene 52961). The ligation product was isopropanol precipitated and electroporated into Stbl4 electrocompetent cells (Life Technologies) and grown at 30°C for 16 hours on agar with 100 μg/mL carbenicillin. Colonies were scraped and plasmid DNA (pDNA) was prepared (HiSpeed Plasmid Maxi, Qiagen). To confirm library representation and distribution, the pDNA was sequenced by Illumina. After mapping of Illumina reads (see below) we calculated the overall fraction of reads that contained intended sgRNAs, which serves as a surrogate for the quality of the oligonucleotide synthesis. By this cloning scheme, only 21 nts of the synthesized oligonucleotide, the prepended G and the 20 nt variable sequence, become incorporated in the final library, in contrast to ligation-independent cloning schemes (e.g. Gibson) in which both the sgRNA and flanking sequences are derived from synthesis. We deem a library to have passed quality control if > 85% of the sequencing reads map to an intended sgRNA, which corresponds to an oligonucleotide synthesis error rate of 0.75% per base or lower (85% = 211-0.0075). A distribution of sgRNA abundance for the subpools, as well as GeCKOv2 for comparison, is given in .

Cell Culture

Stock cell lines were maintained without added antibiotics and supplemented with 1% penicillin/streptomycin during screens. A375, HT29, and MOLM13 cells were obtained from the Cancer Cell Line Encyclopedia[32]. 293T cells were obtained from ATCC (CRL-3216). BV2 cells are a mouse microglial cell line that has been described[49,50]. All cells tested negative for mycoplasma contamination. Cas9 derivatives were made by introducing the lentivector pLX_311-Cas9, which expresses blasticidin resistance from the SV40 promoter and Cas9 from the EF1a promoter, as described[9].

Virus Production

293T cells were plated at a density of 1.5e6 cells per well (2 mL volume) 24 hours pre-transfection in a 6-well dish. Transfection was performed using TransIT-LT1 Transfection Reagent (Mirus) according to the manufacturer's protocol. Briefly, two solutions were prepared for each well. One solution contained 8.25 μL of LT1 diluted in 66.75 μL of Opti-Mem (Corning) and incubated at room temperature for 5 minutes. The second solution contained 250 ng pCMV-VSVG (Addgene 8454), 1250 ng psPAX2 (Addgene 12260), and 1250 ng transfer vector (e.g. lentiGuide) in a final volume of 75 μL with Opti-MEM. The two solutions were combined and incubated at room temperature for 20 - 30 minutes. During this incubation period, the media on the 293T cells was changed. The transfection mixture was added dropwise to the cells, and then plates were centrifuged at room temperature at 1000 × g for 30 minutes and returned to 37°C. 6 – 8 hours post-centrifugation, transfection media was removed, leaving ~0.5 mL in each well, and replaced with 5.5 mL viral harvest media (DMEM + 10% FBS + 1% BSA). Virus was harvested 24 hours post-transfection, the media was replenished, and a second harvest occurred at 48 hours post-transfection.

Determination of Infection Conditions for Pooled Screens

Optimal infection conditions were determined for each batch of virus prep in each cell line in order to achieve 30 - 50% infection efficiency, corresponding to a multiplicity of infection (MOI) of ~ 0.5 – 1. Infections were performed in 12-well plate format with 3.0e6 cells per well for adherent lines and 2.5e6 cells per well for suspension lines. Optimal conditions were determined by infecting cells with different virus volumes (0, 50, 100, 300, and 500 μL for lentiGuide virus; 0, 100, 200, 400, and 600 μL for lentiCRISPRv2 virus) followed by trypsinization and replating equal numbers of cells per each virus volume into 2 wells of a 6-well plate, each with complete medium, one supplemented with the appropriate concentration of puromycin. Cells were counted 3 - 5 days post selection to determine the infection efficiency, comparing survival with and without puromycin selection. Volumes of virus that yielded ~30 – 50% infection efficiency were used for screening.

Screening

Screening-scale infections were performed with the pre-determined volume of virus in the same 12-well format as the viral titration described above, and pooled 4 - 6 hours post-centrifugation. Based on library size, infections were performed with enough cells to achieve a representation of ~400 cells per sgRNA in the library following puromycin selection, unless otherwise noted. Cells were selected with puromycin for 5-7 days following infection to remove uninfected cells. A flowchart demonstrates the experimental timeline ().

Vemurafenib and selumetinib resistance screens

Cas9-containing cells were infected in two biological replicates with the Avana and GeCKO libraries in lentiGuide; unmodified, parental cells were infected in two biological replicates with the Avana library in lentiCRISPRv2. Small molecules were added to puromycin-selected cells 7 days post-infection. Cells either received a media change or were passaged every two or three days over the course of the screen in complete media supplemented with 1% penicillin/streptomycin. Vemurafenib (PLX-4032, Selleckchem, S1267) was screened at a final concentration of 2 μM. Selumetinib (AZD-6244, Selleckchem, S1008) was screened at a final concentration of 200 nM. Surviving cells were harvested after 14 days of small molecule treatment. For analysis, the log2-fold-change of each sgRNA was determined relative to the starting plasmid DNA (pDNA) pool for each biological replicate, which we have previously validated as representative of an early time point in pooled screens[6].

6-thiogunaine resistance screens, Avana library

Cas9-containing cells were infected in two biological replicates with the Avana library in lentiGuide. 6-thioguanine (2 μg/mL final concentration, Sigma A4660) was added to puromycin-selected cells 7 days post-infection. Cells either received a media change or were passaged every two or three days over the course of the screen in complete media supplemented with 1% penicillin/streptomycin. Surviving cells were harvested after 14 days of small molecule treatment. For analysis, the log2-fold-change of each sgRNA was determined relative to the starting plasmid DNA (pDNA) pool for each biological replicate.

Interferon-gamma resistance screen, Asiago library

BV2-Cas9 cells were infected in a single biological replicate with four different subpools of the Asiago library in lentiGuide. Cells were challenged with interferon gamma (R&D systems, 485-MI-100) at a final concentration of 10 units/mL eight days post-infection. Two days after interferon gamma treatment, surviving cells were harvested. For analysis, the log2-fold-change of each sgRNA was determined relative to mock-challenged cells.

Resistance screens, tiling library for further on-target activity modeling

A375-Cas9 cells were infected in four biological replicates. Small molecules were added to puromycin-selected cells 7 days post-infection. Cells either received a media change or were passaged every two or three days over the course of the screen in complete media supplemented with 1% penicillin/streptomycin. Vemurafenib (PLX-4032, Selleckchem, S1267) was screened at a final concentration of 2 μM. Selumetinib (AZD-6244, Selleckchem, S1008) was screened at a final concentration of 200 nM. 6-thioguanine (Sigma A4660) was screened at a final concentration of 2 μg/mL. Etoposide (Sigma E1383) was screened at a final concentration of 1 μg/mL. Surviving cells were harvested after 14 days of small molecule treatment. For analysis, the log2-fold-change of each sgRNA was determined relative to control cells treated with DMSO.

Off-target library screen

MOLM13-Cas9 cells were infected in one biological replicate at 1000 cells per sgRNA, and were selected with puromycin in complete media supplemented with 1% penicillin/streptomycin for the length of the screen. 21 days post-infection, cells were co-stained with PE conjugated anti-CD33 (Miltenyi Biotec #130-091-732) and DAPI, and then sorted on a BD-FACS Aria II. Gates were set on the BD FACSDiVa software using a CD-33 DAPI positive control and a DAPI-only negative control. The cell population was first gated to exclude debris; two additional gates were applied to select singlets and a fourth to select viable cells via DAPI. Finally, the population was gated in order to sort for PE negative cells (). For analysis, the log2-fold-change of each sgRNA was determined relative to the pDNA.

Dropout screens, Avana and GeCKOv2 libraries

Cas9-containing cells were infected in two or three biological replicates per cell line per library in lentiGuide, and selected with puromycin. Cells were passaged every two or three days for three weeks in complete media supplemented with 1% penicillin/streptomycin. For analysis, the log2-fold-change of each sgRNA was determined relative to the starting plasmid DNA (pDNA) pool for each biological replicate.

Screen analysis

For analysis, the log2-fold-change of each sgRNA was determined relative to the starting plasmid DNA (pDNA) pool for each biological replicate unless otherwise stated. The pDNA has been shown to serve as a good surrogate for an early time point, and is more cost effective when performing many screens[6]. We then determined the percent-rank of each sgRNA: each sgRNA was ranked by log2-fold change, and this number was then divided by the total number of sgRNAs in the pool to determine a percent-rank. These percent-rank values were then averaged across biological replicates. For performing RIGER and STARS analysis, the percent-rank values across subpools were merged. In the small number of cases where an sgRNA sequence was present in more than one subpool, the average of the percent-rank values was used. An exception to the above analysis method is subpool 5 in the selumetinib screen, where one replicate each of lentiGuide and lentiCRISPRv2 was not completed, and thus a single biological replicate was used for analysis. For of the GeCKOv1-lentiCRISPRv1 library, we re-analzed the previously-published data and normalized against the pDNA pool rather than the DMSO treatment, to be consistent with the analysis of the Avana and GeCKOv2 data[6].

Validation of Hits

Validation assays were performed in an arrayed format with infection conditions of 1e6 cells per well of a 24 well-plate. Cells were selected with puromycin for 5-7 days following infection to remove uninfected cells, and were treated according to the methods of their respective resistance screen. Sequences of sgRNAs and primers are included in .

Vemurafenib resistance assay

A375-Cas9 cells were infected with sgRNAs cloned into lentiGuide and selected with puromycin for one week. Cells were mixed ~1:1 () or 1:9 () with A375-Cas9 cells expressing EGFP but with no sgRNA, and vemurafenib was added to the cell mixture. Flow cytometry was performed in 96 well-plate format on a BDAccuri C6 Sampler system to measure the EGFP negative fraction of samples at the day of initial mixing of cells and then again 10 – 12 days after addition of vemurafenib. Gates were set on the cytometer software using EGFP positive and negative controls; the cell populations were first gated by forward and side scatter to exclude debris, and a second gate was applied to measure the EGFP fluorescence.

6-thiogunaine resistance assay

Cas9-containing cells were infected in two biological replicates with lentiGuide constructs, 2 targeting HPRT, 2 targeting NUDT5, and one control sgRNA targeting EGFP. Cells were infected in arrayed format in a 24 well plate with 1e6 cells per well. Cells were divided into small molecule and no-selection conditions, cell counts were taken at each passage to determine population doublings, and pellets were collected at day 0 and day 14 of 6-thioguanine treatment for analysis of indels using the TIDE web portal[29].

Genomic DNA Preparation and Sequencing

Genomic DNA (gDNA) was isolated using Maxi (3e7 - 1e8 cells), Midi (5e6 - 3e7 cells), and Mini (<5e6 cells) kits according to the manufacturer's protocol (Qiagen).

Illumina sequencing

PCR of gDNA and pDNA was performed to attach sequencing adaptors and barcode samples, which were divided into multiple 100 μl reactions (total volume) containing a maximum of 10 μg gDNA. Per 96 well plate, a master mix consisted of 75 μL ExTaq DNA Polymerase (Clontech), 1000 μL of 10x Ex Taq buffer, 800 μL of dNTP provided with the enzyme, 50 μL of P5 stagger primer mix (stock at 100 μM concentration), and 2075 μL water. Each well consisted of 50 μL gDNA plus water, 40 μL PCR master mix, and 10 μL of a uniquely barcoded P7 primer (stock at 5 μM concentration). PCR cycling conditions: an initial 1 minute at 95°C; followed by 30 seconds at 94°C, 30 seconds at 52.5°C, 30 seconds at 72°C, for 28 cycles; and a final 10 minute extension at 72°C. P5/P7 primers were synthesized at Integrated DNA Technologies (IDT). Samples were purified with Agencourt AMPure XP SPRI beads according to manufacturer's instructions (Beckman Coulter, A63880). Samples were sequenced on a HiSeq2000 (Illumina). Reads were counted by first searching for the CACCG sequence in the primary read file that appears in the vector 5’ to all sgRNA inserts. The next 20 nts are the sgRNA insert, which was then mapped to a reference file of all possible sgRNAs present in the library. The read was then assigned to a condition (e.g. a well on the PCR plate) on the basis of the 8nt barcode included in the P7 primer. The resulting matrix of read counts was first normalized to a reads per million within each condition by the following formula: read per sgRNA / total reads per condition × 106. Reads per million was then log2-transformed by first adding one to all values, which is necessary in order to take the log of sgRNAs with zero reads.

STARS

STARS is a gene-ranking algorithm for genetic perturbation screens. This algorithm takes a ranked list of perturbations as input. The algorithm computes a score for genes using the probability mass function of a binomial distribution: where n is the total number of perturbations targeting a gene, k is the within-gene-rank of the perturbation, and p is the ratio of the rank of the kth perturbation over the total number of perturbations in the experiment. This calculation is performed for all sgRNAs that rank above a user-defined threshold, e.g. the top x% of sgRNAs from a ranked list. The value of the least probable perturbation for each gene is then assigned to the gene as the STARS Score. To avoid single sgRNA hits, we require that at least two perturbations rank above the user-defined threshold for a gene to receive a STARS Score. Permutation testing is also performed on the list of perturbations used in the experiment to generate the null distribution, allowing the calculation of p-values, FDR, and q-value (corrected FDR) for hit genes. STARS is written in Python, and is available on our website: http://www.broadinstitute.org/rnai/public/software/index

Development of Rule Set 2

Data processing for on-target prediction

For each sgRNA targeting a given gene, the log2-fold-change (LFC) in abundance, as judged from read-counts was computed. Next, normalized ranks were obtained for sgRNAs within each gene by assigning ranks to each guide, and then re-scaling them to lie between 0 and 1. If more than one cell type was available, these normalized ranks were averaged across cell types. Thus, a final sgRNA score for a given sgRNA and gene was in [0,1], where 1 was indicative of successful knockout. Note that normalized ranks were used instead of raw LFC because the LFC has a different maximum for genes that have more sgRNAs, suggesting that the LFC across different genes would not be comparable.

Predictive Models

We used the following statistical models in our experiments: (1) linear regression, (2) L1-regularized linear regression, (3) L2-regularized linear regression, (4) the hybrid SVM plus logistic regression approach used previously,[9] (5) Random Forest, (6) Gradient-boosted regression tree, (7) L1 logistic regression (a classifier), (8) SVM Classification. Implementations for each of these used scikit-learn package in python. For (2) and (3), we set the regularization parameter range to search over100 points evenly spaced in log space, with a minimum of 10−6 and a maximum of 1.5 × 105. Random Forest used the default setting, as did the Gradient-boosted regression trees (learning rate of 0.1, and 100 base estimators each with a maximum depth of three). For the SVM, we used a linear kernel with default L2 regularization unless otherwise noted.

Featurization

A “one-hot” encoding of the nucleotide sequences refers to taking a single categorical variable and converting it to more variables each of which can take on the value 0 or 1, with at most, one of them being “hot,” or on. For example, position 1 of the 30-mer target site plus context can take on A/C/T/G, and it gets converted to four binary variables, one for each possible nucleotide. These are “order 1” features. For “order 2” features, we looked at all adjacent pairwise nucleotides as features, such as AA/AT/AG/etc. There are 4 × 4 = 16 such pairs, thus a single variable representing one such pair gets one-hot encoded into 16 binary variables. Previously, only “position-specific” nucleotide features were used in this manner, meaning that for each position on the sgRNA, a different one-hot-encoded feature was used. Here, however, taking inspiration from some of the string kernel literature, we augmented these features sets by also including “position-independent” features, where, for order 1 features for example, this would mean a feature for how many A's and how many T's, etc. were in the sgRNA, ignoring their position, and similarly for order 2.[51] Therefore, for a 30mer (20mer sgRNA plus context), we obtained 80 order 1 and 320 order 2 position-specific features, and 4 order 1 and 16 order 2 position-independent features. GC counts features were computed as before, that is, the number of Gs and Cs in the 20mer were counted, yielding one feature, and then another feature with count>10 was also used. The two nucleotides in the N and N positions relative to the PAM “NGGN” were one-hot encoded yielding 16 features, one for each NN possibility (e.g., AT). Thermodynamic features were computed from the melting temperatures of the DNA version of the RNA guide sequence, or portions thereof, using the Biopython Tm_staluc function. [52,53] In addition to using the melting temperature of the entire 30mer target site plus context, we also separated the thermodynamic features into three further features corresponding to the melting temperature of three distinct parts of the sgRNA—in particular, the 5 nucleotides immediately proximal to the PAM, the 8 nucleotides adjacent to that (away from the PAM), and then the 5 nucleotides in turn adjacent to the 8mer (again, away from the PAM). Guide positional features such as amino acid cut position and percent peptide correspond to how far from the start of the protein coding region of the gene the cut site of the sgRNA was positioned. We also trained a version of Rule Set 2 without features that relate to protein information, as not all potential targets are protein-coding. Not all features will appear in , as features that did not vary at all were removed from all computations.

Stratification/CV

Unless otherwise noted, cross-validation was always performed by leaving out all sgRNAs for one gene at a time. For the inner cross-validation in the required nested cross-validation used to set the regularization weights for L1/L2, we also used leave-one-gene-out cross-validation. When examining feature importance, we averaged either the Gini importance (for gradient boosted regression trees), or the weights (for L1-regularized regression) across genes.

Statistical significance

The only test of statistical significance used was to compare the Spearman correlation prediction measure between two approaches when using exactly the same test data and cross-validation. To do so, we used three sets of numbers: (i) the Spearman correlation performance measure for one approach, (ii) the Spearman correlation performance measure for the other approach, and (iii) the Spearman correlation between the predictions from each approach. These values were used in conjunction with a method to test dependent measures of correlations (here they are dependent because each approach uses the same ground truth)54. The precision of the implementation for this method resulted in the occasional p-value of 0.0. In such cases, we instead report p < 10−16, the smallest p-value that we observed the method reporting (strictly speaking this depends on the number of samples, and correlation between predictions, but it seemed a reasonably practical place-holder). Also of note is that this measure of statistical significance does not give equal weight to each gene, rather, each gene is represented in proportion to its number of sgRNAs.

Off-target scoring

The CCtop scoring algorithm was implemented as described[42]. Since the CCtop score is in the opposite direction of the Hsu-Zhang and CFD score, ranks of the CCtop scores were sorted in the opposite direction for AUC calculation, and we report the negative of the calculated correlation values. In other words, for the CFD score, a value of 0 indicate no predicted off-target activity while a value of 1 indicate a perfect match, while a low CCtop score indicates a high likelihood of off-target activity and a high CCtop score predicts no off-target activity. The scoring algorithm that derives from data published by Hsu et al. is described on the MIT CRISPR design server (http://www.crispr.mit.edu/about)[14]. The Cutting Frequency Determination (CFD) score is calculated by using the percent activity values provided in . For example, if the interaction between the sgRNA and DNA has a single rG:dA mismatch in position 6, then that interaction receives a score of 0.67. If there are two or more mismatches, then individual mismatch values are multiplied together. For example, an rG:dA mismatch at position 7 coupled with an rC:dT mismatch at position 10 receives a CFD score of 0.57 × 0.87 = 0.50.

Evaluation of Bowtie2 performance

We initially attempted to use Bowtie2 to find off-target sites. This approach was successful and relatively straightforward for returning all perfect matches as well as those sites with only one mismatch to the query sequence. However, as the Bowtie2 algorithm is optimized for longer query sequences, we were unable to discover a set of options that would reliably find multi-mismatch sites (such as many of those identified by Guide-Seq) for our relatively short sgRNA query sequences in a single search invocation. Moreover, the runtime performance of Bowtie2 degraded significantly as we adjusted its options for greater multi-mismatch recall. One common workaround for tools with poorly performing fuzzy search is to convert the single query sequence into a list of relevant variants, perform a “less fuzzy” search on each variant, and accumulate the results. Employing this technique, we arrived at an “optimized Bowtie2” method. Specifically, we performed a systematic exploration of the Bowtie2 parameter space to find the combination of the amount and type of pre-permutation and parameter settings that yielded the highest average recall while still maintaining search time of approximately 1 second on standard computing hardware. This optimized Bowtie strategy scaled reasonably to whole-genome batches, but still did not identify all sites found by GUIDE-Seq or Cas-OFFinder.

GUIDE-Seq analysis

The sites identified by Guide-Seq were available as a supplementary table in that publication[43]. Cas-OFFinder was used to find all off-target sites with up to 6 mismatches in the human genome for the 10 sgRNAs assayed with Guide-Seq[46]. To make the fairest comparison between off-target scoring algorithms, we only scored off-target sites with NRG PAMs, as required by CCtop and the Hsu et al. algorithms, which meant excluding 38 sites.

Design of Brunello and Brie libraries

To design these libraries, we targeted protein-coding transcripts annotated by the GENCODE database (http://www.gencodegenes.org). GENCODE annotation is created by merging the Havana manual annotation and Ensembl automated gene annotation. We picked a transcript for each gene based on three tags defined by GENCODE: a) APPRIS (http://appris.bioinfo.cnio.es) annotations, which prioritize transcripts based on protein structural information, functionally important residues and evidence from cross-species alignments; b) level of confidence on the transcript, such as whether the transcript was a verified loci, manually annotated loci or automatically annotated loci; and c) transcript level support in GENCODE, which is based on how well mRNA and EST alignments match over the full length of the transcript. We annotated all possible sgRNAs targeting each transcript for its Rule Set 2 score, and used these scores to rank, per transcript, the on-target activity of sgRNAs. In this implementation, we used the version of Rule Set 2 that does not incorporate protein target site information, as that criterion was used later in picking sgRNAs. Likewise, we annotated each sgRNA by the number of potential off-target sites within various tiers of CFD scores: (i) 1.0; (ii) 0.2 – 1.0; (iii) 0.05 – 0.2; (iv) < 0.05. We then ranked, per transcript, each sgRNA by the fewest number of off-target sites in tier (i), using the additional tiers as tiebreakers. To break ties within a tier, we prioritized sgRNAs with the fewest off-target sites in protein-coding regions. The on-target and off-target ranks of each sgRNA were then combined at equal weight to provide a final rank for each sgRNA targeting a particular transcript. To pick sgRNAs for each transcript, we first chose the best-ranked sgRNA that targeted within the 5 – 65% of the protein-coding region of the target gene. We then selected additional sgRNAs per transcript, requiring that the next-picked sgRNA targets a site at least 5% away, from a protein-coding standpoint, from previously-picked sgRNAs. This ensures diversity in target space, especially useful due to the potential for exons that are present in the reference transcript not to be included in any particular cellular model to which the library is applied. We selected up to 4 sgRNAs per gene using this heuristic. In order to meet quota for some genes, we eliminated the 5 – 65% protein-coding region and 5% spacing criteria.

Tier	Description	Avana	Asiago
i	0%-25%	57%	57%
ii	25%-50%	43%	43%
ii	50%-75%	0.06%	0.04%
iv	75%-100%	0.02%	0.01%

Tier	Description	Avana	Asiago
i	Unique 13 nts	84%	83%
ii	Unique 17 nts	13%	13%
ii	Unique 20 nts	0.2%	0.4%
iv	not unique	3%	4%

Tier	Description	Avana	Asiago
i	0.9-1.0	3%	3%
ii	0.8-0.9	14%	15%
iii	0.7-0.8	22%	23%
iv	0.6-0.7	23%	22%
v	0.5-0.6	17%	16%
vi	0.4-0.5	10%	10%
vii	0.3-0.4	6%	6%
viii	0.2-0.3	3%	3%
ix	0-0.2	1%	1%

Primer Set	Forward Primer, 5’ – 3’	Reverse Primer, 5’ – 3’
1	AGGCACTTGCTCGTACGACG	ATGTGGGCCCGGCACCTTAA
2	GTGTAACCCGTAGGGCACCT	GTCGAGAGCAGTCCTTCGAC
3	CAGCGCCAATGGGCTTTCGA	AGCCGCTTAAGAGCCTGTCG
4	CTACAGGTACCGGTCCTGAG	GTACCTAGCGTGACGATCCG
5	CATGTTGCCCTGAGGCACAG	CCGTTAGGTCCCGAAAGGCT
6	GGTCGTCGCATCACAATGCG	TCTCGAGCGCCAATGTGACG

Cell Line	Media	Puromycin	Blasticidin	Polybrene
A375	RPMI + 10% FBS	1 μg/mL	5 μg/mL	1 μg/mL
293T	DMEM + 10% FBS	1 μg/mL	5 μg/mL	1 μg/mL
HT29	DMEM + 10% FBS	1 μg/mL	5 μg/mL	1 μg/mL
MOLM13	RPMI + 10% FBS	2 μg/mL	5 μg/mL	4 μg/mL
BV2	DMEM + 10% FBS + 1% HEPES	2.5 μg/mL	4 μg/mL	---

Table 1

Screening results for mediators of interferon-gamma signaling using the mouse genome-wide Asiago library. Full results are provided in Supplementary Table 15.

Gene Symbol	sgRNA Ranks	STARS Score	p-value	FDR
Jak1	4;5;10;13	15.15	< 3.66×10⁻⁶	< 2.61×10⁻⁴
Ifngr2	7;11;12;15	14.90	< 3.66×10⁻⁶	< 2.61×10⁻⁴
Stat1	1;2;9;16	14.79	< 3.66×10⁻⁶	< 2.61×10⁻⁴
Ifngr1	3;6;14;22	14.23	< 3.66×10⁻⁶	< 2.61×10⁻⁴
Jak2	18;20;55;178	10.60	< 3.66×10⁻⁶	< 2.61×10⁻⁴
Ifnar1	131;145;273;1176	7.32	1.10×10⁻⁵	6.53×10⁻⁴
Irf9	71;74;196;56950	7.23	1.10×10⁻⁵	5.60×10⁻⁴
Ifnar2	37;137;348;8433	6.48	7.68×10⁻⁵	3.43×10⁻³

52 in total

Review 1. Genetic screens and functional genomics using CRISPR/Cas9 technology.

Authors: Ella Hartenian; John G Doench
Journal: FEBS J Date: 2015-03-16 Impact factor: 5.542

2. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation.

Authors: John G Doench; Ella Hartenian; Daniel B Graham; Zuzana Tothova; Mudra Hegde; Ian Smith; Meagan Sullender; Benjamin L Ebert; Ramnik J Xavier; David E Root
Journal: Nat Biotechnol Date: 2014-09-03 Impact factor: 54.908

3. Low incidence of off-target mutations in individual CRISPR-Cas9 and TALEN targeted human stem cell clones detected by whole-genome sequencing.

Authors: Adrian Veres; Bridget S Gosis; Qiurong Ding; Ryan Collins; Ashok Ragavendran; Harrison Brand; Serkan Erdin; Chad A Cowan; Michael E Talkowski; Kiran Musunuru
Journal: Cell Stem Cell Date: 2014-07-03 Impact factor: 24.633

4. Discovery of cancer drug targets by CRISPR-Cas9 screening of protein domains.

Authors: Junwei Shi; Eric Wang; Joseph P Milazzo; Zihua Wang; Justin B Kinney; Christopher R Vakoc
Journal: Nat Biotechnol Date: 2015-05-11 Impact factor: 54.908

5. CCTop: An Intuitive, Flexible and Reliable CRISPR/Cas9 Target Prediction Tool.

Authors: Manuel Stemmer; Thomas Thumberger; Maria Del Sol Keyer; Joachim Wittbrodt; Juan L Mateo
Journal: PLoS One Date: 2015-04-24 Impact factor: 3.240

6. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens.

Authors: Wei Li; Han Xu; Tengfei Xiao; Le Cong; Michael I Love; Feng Zhang; Rafael A Irizarry; Jun S Liu; Myles Brown; X Shirley Liu
Journal: Genome Biol Date: 2014 Impact factor: 13.583

7. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics.

Authors: Traver Hart; Kevin R Brown; Fabrice Sircoulomb; Robert Rottapel; Jason Moffat
Journal: Mol Syst Biol Date: 2014-07-01 Impact factor: 11.429

8. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases.

Authors: Shengdar Q Tsai; Zongli Zheng; Nhu T Nguyen; Matthew Liebers; Ved V Topkar; Vishal Thapar; Nicolas Wyvekens; Cyd Khayter; A John Iafrate; Long P Le; Martin J Aryee; J Keith Joung
Journal: Nat Biotechnol Date: 2014-12-16 Impact factor: 54.908

9. Genome-Scale CRISPR-Mediated Control of Gene Repression and Activation.

Authors: Luke A Gilbert; Max A Horlbeck; Britt Adamson; Jacqueline E Villalta; Yuwen Chen; Evan H Whitehead; Carla Guimaraes; Barbara Panning; Hidde L Ploegh; Michael C Bassik; Lei S Qi; Martin Kampmann; Jonathan S Weissman
Journal: Cell Date: 2014-10-09 Impact factor: 41.582

10. Easy quantitative assessment of genome editing by sequence trace decomposition.

Authors: Eva K Brinkman; Tao Chen; Mario Amendola; Bas van Steensel
Journal: Nucleic Acids Res Date: 2014-10-09 Impact factor: 16.971

1175 in total

1. Tissue- and development-stage-specific mRNA and heterogeneous CNV signatures of human ribosomal proteins in normal and cancer samples.

Authors: Anshuman Panda; Anupama Yadav; Huwate Yeerna; Amartya Singh; Michael Biehl; Markus Lux; Alexander Schulz; Tyler Klecha; Sebastian Doniach; Hossein Khiabanian; Shridar Ganesan; Pablo Tamayo; Gyan Bhanot
Journal: Nucleic Acids Res Date: 2020-07-27 Impact factor: 16.971

2. Inhibition of Dpp8/9 Activates the Nlrp1b Inflammasome.

Authors: Marian C Okondo; Sahana D Rao; Cornelius Y Taabazuing; Ashley J Chui; Sarah E Poplawski; Darren C Johnson; Daniel A Bachovchin
Journal: Cell Chem Biol Date: 2018-01-27 Impact factor: 8.116

3. A Genome-wide CRISPR Death Screen Identifies Genes Essential for Oxidative Phosphorylation.

Authors: Jason D Arroyo; Alexis A Jourdain; Sarah E Calvo; Carmine A Ballarano; John G Doench; David E Root; Vamsi K Mootha
Journal: Cell Metab Date: 2016-09-22 Impact factor: 27.287

4. Bromodomain Inhibitors Correct Bioenergetic Deficiency Caused by Mitochondrial Disease Complex I Mutations.

Authors: Joeva J Barrow; Eduardo Balsa; Francisco Verdeguer; Clint D J Tavares; Meghan S Soustek; Louis R Hollingsworth; Mark Jedrychowski; Rutger Vogel; Joao A Paulo; Jan Smeitink; Steve P Gygi; John Doench; David E Root; Pere Puigserver
Journal: Mol Cell Date: 2016-09-22 Impact factor: 17.970

5. Genome-Wide CRISPR Screening Identifies the Tumor Suppressor Candidate OVCA2 As a Determinant of Tolerance to Acetaldehyde.

Authors: Amin Sobh; Alex Loguinov; Alessia Stornetta; Silvia Balbo; Abderrahmane Tagmount; Luoping Zhang; Chris D Vulpe
Journal: Toxicol Sci Date: 2019-05-01 Impact factor: 4.849

6. Stem Cell Lineage Infidelity Drives Wound Repair and Cancer.

Authors: Yejing Ge; Nicholas C Gomez; Rene C Adam; Maria Nikolova; Hanseul Yang; Akanksha Verma; Catherine Pei-Ju Lu; Lisa Polak; Shaopeng Yuan; Olivier Elemento; Elaine Fuchs
Journal: Cell Date: 2017-04-20 Impact factor: 41.582

Review 7. CRISPR Knockouts in Ciona Embryos.

Authors: Shashank Gandhi; Florian Razy-Krajka; Lionel Christiaen; Alberto Stolfi
Journal: Adv Exp Med Biol Date: 2018 Impact factor: 2.622

8. Inactivation of Cancer Mutations Utilizing CRISPR/Cas9.

Authors: Christina Gebler; Tim Lohoff; Maciej Paszkowski-Rogacz; Jovan Mircetic; Debojyoti Chakraborty; Aylin Camgoz; Martin V Hamann; Mirko Theis; Christian Thiede; Frank Buchholz
Journal: J Natl Cancer Inst Date: 2016-08-30 Impact factor: 13.506

9. Genetic interaction mapping and exon-resolution functional genomics with a hybrid Cas9-Cas12a platform.

Authors: Thomas Gonatopoulos-Pournatzis; Michael Aregger; Kevin R Brown; Shaghayegh Farhangmehr; Ulrich Braunschweig; Henry N Ward; Kevin C H Ha; Alexander Weiss; Maximilian Billmann; Tanja Durbic; Chad L Myers; Benjamin J Blencowe; Jason Moffat
Journal: Nat Biotechnol Date: 2020-03-16 Impact factor: 54.908

10. The disruption of the MAPKK gene triggering the synthesis of flavonoids in endophytic fungus Phomopsis liquidambaris.

Authors: Qian Yang; Mei Wu; Ya-Li Zhu; Ya-Qiong Yang; Yan-Zhen Mei; Chuan-Chao Dai
Journal: Biotechnol Lett Date: 2020-10-31 Impact factor: 2.461