
Model-based understanding of single-cell CRISPR screening.

Bin Duan^1,2, Chi Zhou¹, Chengyu Zhu¹, Yifei Yu¹, Gaoyang Li^3,4, Shihua Zhang⁵, Chao Zhang¹, Xiangyun Ye⁶, Hanhui Ma⁷, Shen Qu¹, Zhiyuan Zhang⁸, Ping Wang^9,10, Shuyang Sun¹¹, Qi Liu^12,13.

Abstract

The recently developed single-cell CRISPR screening techniques, independently termed Perturb-Seq, CRISP-seq, or CROP-seq, combine pooled CRISPR screening with single-cell RNA-seq to investigate functional CRISPR screening at single-cell granularity. Here, we present MUSIC, an integrated pipeline for model-based understanding of single-cell CRISPR screening data. Comprehensive tests applied to all the publicly available data revealed that MUSIC accurately quantifies and prioritizes the individual gene perturbation effect on cell phenotypes, with tolerance for the substantial noise that exists in such data. MUSIC facilitates single-cell CRISPR screening analysis from three perspectives, i.e., prioritizing the gene perturbation effect as an overall perturbation effect, prioritizing it in a functional topic-specific way, and quantifying the relationships between different perturbations. In summary, MUSIC provides an effective and applicable solution to elucidate perturbation function and biological circuits by a model-based quantitative analysis of single-cell CRISPR screening data.

Substances：
RNA, Guide

Year: 2019 PMID： 31110232 PMCID： PMC6527552 DOI： 10.1038/s41467-019-10216-x

Source DB: PubMed Journal: Nat Commun ISSN： 2041-1723 Impact factor: 14.919

Introduction

Pooled CRISPR knockout screening is a powerful technique for evaluating the biological function of genes. However, it only recognizes genes with very distinct phenotypes, such as those that substantially affect cellular growth or can be detected directly with antibodies or fluorescent protein reporters, which limits its ability to detect genes with subtle phenotypes[1-3]. Recently described methods, i.e., single-cell-based CRISPR knockout or knockdown screening (independently termed Perturb-Seq[4,5], CRISP-seq[6], and CROP-seq[7,8]), combine pooled CRISPR screening with single-cell RNA-seq to investigate functional CRISPR screening at the single-cell level. These screening methods make it possible to carry out large-scale gene perturbation studies in finer detail. The key technical innovation of single-cell CRISPR screening, shared by Perturb-Seq[4,5], CRISP-seq[6], and CROP-seq[7,8], lies in modifying the lentiviral vector so that the sgRNA in a single cell can be identified from deep sequencing of mRNAs (the polyadenylated RNA fraction)[3]. By taking advantage of this innovation when performing mRNA-seq on individual cells, large numbers of cells with distinct perturbations within a heterogeneous cell population can be investigated[3,9]. Several computational challenges exist in the analysis of such single-cell CRISPR screening data: (1) Data sparsity and noise. Single-cell RNA-seq data is sparse[10,11]. In addition, both single-cell RNA-seq data and pooled CRISPR screening data are inherently noisy[12,13], and this is further exacerbated by their combination. Efficient data filtering and normalization are needed to meet these challenges. (2) sgRNA perturbation and off-target effects should be carefully investigated when linking such perturbations with the gene expression readout[14,15], particularly for heterogeneous cell-to-cell comparisons. (3) The effect of each perturbation, and the relationships between perturbations, must be estimated and prioritized quantitatively and in parallel across cells, in the presence of cellular heterogeneity and technical complexity. (4) The perturbation results need to be visualized intuitively across a large, heterogeneous cell population. To this end, we developed MUSIC, an easy-to-use, model-based, integrated analytical tool designed specifically for single-cell CRISPR screening data analysis.

Results

General pipeline of MUSIC

MUSIC comprises three steps for single-cell CRISPR screening data analysis (Fig. 1): data preprocessing, model building, and perturbation effect prioritizing.

Fig. 1

General workflow of MUSIC. MUSIC comprises three steps for single-cell CRISPR screening data analysis: data preprocessing, model building, and perturbation effect prioritizing. In the first step, besides the conventional consideration of cell quality, several factors specific to single-cell CRISPR screening are also considered: the ratio of nonzero perturbed expression values across all cells, sgRNA efficiency, and the minimal perturbed cell number per perturbation. In the second step, MUSIC applies a topic-model-based computational framework to derive the functional topics of each cell (including controls) with a specific perturbation (PE, perturbation). In the third step, MUSIC quantitatively estimates and prioritizes the individual gene perturbation effect on cell phenotypes from three different perspectives, i.e., prioritizing the gene perturbation effect as an overall perturbation effect, or in a functional topic-specific way, and quantifying the relationships between different perturbations.

In the first step (Fig. 1 and see Methods), besides the routine quality control and data normalization processes applied in single-cell RNA-seq analysis, MUSIC also applies a data imputation step (achieved by SAVER[16]) to improve data quality. In addition, MUSIC addresses two issues that should be taken into account for such a novel data type: (1) filtering perturbed cells with invalid edits; (2) filtering perturbations according to a minimal number of cells per perturbation.

Second, MUSIC builds a computational framework based on topic models to handle single-cell CRISPR screening data (Fig. 1 and see Methods). The concept of topic models was initially presented in the machine-learning community[17] for the discovery of hidden semantic structures in a body of text and has been successfully applied to gene expression data analysis[18-20]. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently. The topics generated by topic modeling are represented by classes of words with similar semantic meanings. A topic model is a probabilistic framework that examines a given set of documents and discovers their topic profiles from such word frequency representations. By analogy, in single-cell CRISPR screening data a single perturbed cell can be taken as a document, and gene expression is analogous to the word frequency in the document. A topic here represents a specific biological function associated with a group of highly differentially expressed genes. A topic model therefore allows us to examine a set of perturbed cells and discover, from the gene expression in each cell, which biological functions the perturbation may have induced.

Two key advantages of the topic model applied here are: (1) It allows each perturbed sample to possess a proportion of membership in each functional topic rather than categorizing the sample into a discrete cluster. Such a topic profile, derived from a large number of differently perturbed cells, makes the subsequent ranking of perturbation impact straightforward and quantitative. As illustrated in Fig. 2, whereas traditional clustering makes a hard assignment of cells into different subclasses, topic modeling calculates a topic probability profile for each sample. (2) Topic modeling is sensitive enough to detect subtle phenotype changes from the change in topic probability profile with and without perturbation, while traditional clustering generally fails to detect such subtle phenotype changes, which are widespread in single-cell CRISPR screening data (Fig. 2).
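The contrast between a hard cluster assignment and a topic probability profile can be illustrated with a minimal sketch. The two profiles below are hypothetical values, and the total variation distance is used here only as a simple stand-in for "change of topic probability profile"; it is not the score MUSIC computes.

```python
# Hypothetical topic probability profiles (each sums to 1) for a control cell
# and a perturbed cell with a subtle phenotype. Values are illustrative only.
control = [0.60, 0.30, 0.10]
perturbed = [0.45, 0.40, 0.15]


def hard_cluster(profile):
    """Hard assignment: index of the most probable topic."""
    return max(range(len(profile)), key=lambda k: profile[k])


def profile_shift(p, q):
    """Total variation distance between two profiles (0 means identical).

    Assumption: chosen for illustration, not taken from MUSIC.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Both cells fall into the same hard cluster, so clustering sees no change,
# while the profile comparison still registers a shift.
same_cluster = hard_cluster(control) == hard_cluster(perturbed)
shift = profile_shift(control, perturbed)
```

In this toy case the dominant topic is unchanged, yet the profile shift is nonzero, which is the situation described for subtle phenotypes.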

Fig. 2

Comparisons between traditional clustering-based analysis and topic-model-based analysis for single-cell CRISPR screening data. a Difference between traditional clustering-based analysis and topic-model-based analysis when a perturbation has a significant phenotypic effect on the cells. Both analyses can detect such a phenotype change (see the cell sample with the red dotted line). b Difference between traditional clustering-based analysis and topic-model-based analysis when a perturbation has a subtle phenotypic effect on the cells. Topic modeling calculates a topic probability profile for each sample, while traditional clustering makes a hard assignment of the sample to a cluster. Topic-model-based analysis can therefore detect such a phenotype change from the change in topic probability profile with and without perturbation, while traditional clustering-based analysis fails to detect it (see the cell sample with the red dotted line).

In addition, MUSIC addresses several specific issues when applying the topic model to this data type: (1) The distribution of topics between cases and controls is affected by the ratio of their sample numbers, and this sample imbalance is addressed by a bootstrapping strategy when prioritizing the perturbation effect (see Methods). (2) The optimal topic number is automatically selected by MUSIC in a data-driven manner (see Methods). Finally, with the topic-model-based perturbation analysis, MUSIC can quantitatively estimate and prioritize the individual gene perturbation effect on cell phenotypes from three different perspectives (Fig. 1 and see Methods), i.e., prioritizing the gene perturbation effect as an overall perturbation effect, or in a functional topic-specific way, and quantifying the relationships between different perturbations.

Evaluating the performance of MUSIC

To evaluate the performance of MUSIC, we performed two analyses. We first applied MUSIC to all 14 publicly available single-cell CRISPR screening datasets, including Perturb-Seq[4,5], CRISP-seq[6], and CROP-seq[7,8] data (Supplementary Table 1). For illustration, we use the doxorubicin-treated MCF10A cells[8], in which 29 tumor suppressors were perturbed, as an example (Fig. 3a, b). Detailed analysis results for all the other datasets are available in the supplementary materials (Supplementary Data 1–14 and Supplementary Figs. 1–14).

Fig. 3

An illustrative result of MUSIC for single-cell CRISPR screening data analysis. We take the dataset of MCF10A cells treated with doxorubicin (GSM2911346), generated by the updated version of CROP-seq[8], as an example, as illustrated in (a, b). The overall perturbation effect ranking lists identified by MUSIC were also compared between cells with different treatments, as illustrated in (c). a The functional annotations of each topic derived from topic modeling for dataset GSM2911346. b The overall perturbation effect ranking list and the topic-specific perturbation effect ranking list for dataset GSM2911346. c The differences in perturbation impact between different experimental conditions, shown for Perturb-Seq[5] and CROP-seq[7,8] data, respectively.

Then, we compared MUSIC with two other tools, MIMOSCA[5] and LRICA[4] (Tables 1 and 2). MIMOSCA is a computational framework for multiple-input multiple-output single-cell data analysis. LRICA was proposed to decipher the driver signal/component of the data by low-rank matrix factorization. Although the MIMOSCA and LRICA models were presented in the literature, they were developed only as prototypes, without executable and user-friendly implementations. In addition, the output of MUSIC differs from that of these tools, so they are not straightforward to compare. Therefore, we provide preliminary comparison results in Tables 1–3 for several datasets to indicate the effectiveness of MUSIC.

Table 1

Comparison of detailed analysis results between MUSIC and MIMOSCA

Datasets	Technology	Demonstrated perturbation	Output	MIMOSCA	MUSIC
Mouse BMDC (3 h post-LPS, GSM2396856)	Perturb-Seq[5]	Cebpb	Overall perturbation effect	—	Rank 2nd
			Topic-specific functional perturbation effect	Immune cells activation	• Immune cells activation[21] • Cell migration[22]
			Perturbations relationship	• Cebpb and Nfkb1, Runx1, Irf4, Spi1 have opposing effects.	cor(Cebpb, Nfkb1) = -0.99 cor(Cebpb, Runx1) = −0.99 cor(Cebpb, Irf4) = − 0.99 cor(Cebpb, Spi1) = − 0.96
				• Cebpb and Rela, HIF1a, Stat3, Junb have reinforcing activation.	cor(Cebpb, Rela) = 0.99 cor(Cebpb, HIF1a) = 0.98 cor(Cebpb, Stat3) = 0.99 cor(Cebpb, Junb) = 0.93
Human K562 (7 days post transduction, GSM2396858)	Perturb-Seq[5]	GABPA	Overall perturbation effect	—	Rank 2nd
			Topic-specific functional perturbation effect	Mitochondrial function	• Heme metabolic process • Neutrophil activation[35]
			Perturbation relationship	—	cor(GABPA, ELK1) = 0.89[36]
Human K562 (cell cycle regulators, GSM2396861)	Perturb-Seq[5]	AURKA	Overall perturbation effect	—	Rank 1st
			Topic-specific functional perturbation effect	Proliferation	Proliferation
			Perturbation relationship	AURKA, TOR1AIP1, and RACGAP1 perturbed similar.	cor(AURKA, TOR1AIP1) = 0.70 cor(AURKA, RACGAP1) = 0.85 cor(TOR1AIP1, RACGAP1) = 0.75

cor(a,b) represents the Pearson correlation coefficient of topic distribution profile between perturbation a and perturbation b
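The cor(a,b) values in the tables are Pearson correlation coefficients between topic distribution profiles. A minimal sketch of that calculation follows; the three profiles are hypothetical numbers, and how MUSIC summarizes a perturbation's cells into one profile is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical topic distribution profiles of three perturbations.
profile_a = [0.50, 0.30, 0.20]
profile_b = [0.45, 0.35, 0.20]  # shifts topics in the same direction as a
profile_c = [0.20, 0.30, 0.50]  # shifts topics in the opposite direction

cor_ab = pearson(profile_a, profile_b)
cor_ac = pearson(profile_a, profile_c)
```

A value near 1 corresponds to the "reinforcing" relationships in the table and a value near -1 to the "opposing" ones.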

Table 2

Comparison of detailed analysis results between MUSIC and LRICA

Datasets	Technology	Demonstrated perturbation	Output	LRICA	MUSIC
Human K562 (3 UPR related genes, GSM2406677)	Perturb-seq[4]	ATF6, PERK, IRE1α	Overall perturbation effect	—	The three perturbations’ overall perturbation effect ranks 1st
			Topic-specific functional perturbation effect	UPR	• UPR • Apoptosis[23]
			Perturbation relationship	The perturbation of PERK has a greater impact than those of ATF6 and IRE1α.	TPDS(PERK) = 94.0 TPDS(IRE1α) = 23.2 TPDS(ATF6) = 11.0
Human K562 (83 UPR related genes, GSM2406681)	Perturb-seq[4]	EIF2S1	Overall perturbation effect	—	Rank 1st
			Topic-specific functional perturbation effect	UPR	UPR
			Perturbation relationship	—	cor(EIF2S1, DHDDS) = 0.99

cor(a,b) represents the Pearson correlation coefficient of topic distribution between perturbation a and perturbation b

TPDS(a) represents the impact score to evaluate the overall perturbation effect of perturbation a

Table 3

Other representative analysis results of MUSIC

Datasets	Technology	Demonstrated perturbation	Output	Original study	MUSIC
Mouse myeloid cell (GSE90486)	CRISP-seq[6]	Cebpb	Overall perturbation effect	—	Rank 1st
			Topic-specific functional perturbation effect	Immune cell differentiation	• Immune cell differentiation[24] • Cell migration[22]
			Perturbation relationship	—	cor(Cebpb,Rela) = 0.99[25]
Human MCF10A cell (treated with doxorubicin, GSM2911346)	Updated version of CROP-seq[8]	TP53	Overall perturbation effect	—	Rank 1st
			Topic-specific functional perturbation effect	DNA replication	DNA replication[37]
			Perturbation relationship	—	cor(TP53, MLH1) = 0.99
Human Jurkat cell (stimulated by anti-CD3/CD28, GSM2439086~GSM2439090)	CROP-seq[7]	LCK	Overall perturbation effect	—	Rank 6th
			Topic-specific functional perturbation effect	TCR signature	leukocyte differentiation
			Perturbations Relationship	LCK, ZAP70, LAT have similar effect on TCR activation signature.	cor(LCK, ZAP70) = 0.93 cor(LCK, LAT) = 0.50 cor(ZAP70, LAT) = 0.78

cor(a,b) represents the Pearson correlation coefficient of topic distribution between perturbation a and perturbation b

First, the comparison between the analysis results of MUSIC and MIMOSCA is presented in Table 1. MUSIC recapitulated the findings of MIMOSCA, such as the perturbation impact of Cebpb on immune cell activation[21]. MUSIC also identified a novel knockout effect on cell migration[22], which is consistent with previous knowledge. MUSIC further identified gene–gene perturbation relationships, such as the recognized associations between Cebpb knockout and other gene perturbations, through quantitative correlation calculations (Table 1).

Second, a similar comparison between MUSIC and LRICA is presented in Table 2. Again, MUSIC recapitulated the findings of LRICA. For example, ATF6, PERK, and IRE1α are all important proteins related to the unfolded protein response (UPR). The original study indicated that the perturbation of PERK has a greater impact than those of ATF6 and IRE1α. MUSIC recapitulated this finding quantitatively. In addition, a novel perturbation effect on apoptosis from knocking out the three genes simultaneously[23] was identified, which indicates that in the absence of the three branches of the UPR, K562 cells significantly enhance the positive regulation of the apoptosis signaling pathway (Supplementary Data 8 and Supplementary Fig. 8).

Finally, analysis of the remaining datasets also recapitulated the original findings or identified novel results. Representative analysis results of MUSIC on the remaining datasets are shown in Table 3. MUSIC recapitulated the original findings, such as the important influence of Cebpb perturbation on immune cell differentiation[24]. MUSIC further identified several novel findings, such as the high correlation between Cebpb and Rela[25] perturbations (Supplementary Data 10). MUSIC also identified the special response to TP53 knockout when cells are treated with doxorubicin, which is consistent with previous knowledge[26-28] (Fig. 3c).

Evaluating the impact of the data preprocessing strategies adopted in MUSIC

Because of the substantial noise in single-cell CRISPR screening data, MUSIC adopts several data preprocessing strategies (see Methods) that effectively improve its performance. Here, we further explore their impact on the outputs of MUSIC from the following three aspects. First, we provide an overview of how many cells are filtered from the datasets during data preprocessing. A statistical summary of the proportion of cells filtered by quality control is shown in Fig. 4a, indicating that an average of 6% of cells are filtered. A statistical summary of the proportion of cells filtered by removing low-efficiency sgRNAs is shown in Fig. 4b (Supplementary Data 15). This step filtered an average of 41% of cells, and the ratio differs between datasets and techniques. It should be noted that prior studies already indicated that the single-cell CRISPR screening technique is very noisy: 20–30% of the cells with a detected sgRNA show a wild-type phenotype[29,30], and these cells should be filtered.

Fig. 4

Evaluating the impact of the data preprocessing strategies adopted in MUSIC. a The proportion of cells filtered by quality control for all datasets. The red dashed line represents the mean of the data. b The proportion of cells filtered by removing low-efficiency sgRNAs for all datasets. The red dashed line represents the mean of the data. c zero_rate plot of all knockouts/knockdowns in all datasets. The red dashed line represents the mean of all the knockout/knockdown zero_rates. d Comparisons of overall perturbation effect ranking with or without imputation/filtering for all the available datasets.

Second, since single-cell CRISPR screening data are noisy and zero-inflated, we provide a statistic showing how frequently genes have a zero expression value across all cells, and we demonstrate that our filtering strategy does not remove lowly expressed but functional genes such as transcription factors. To this end, for all 326 knockouts/knockdowns in all 14 datasets, we calculated the proportion of zero expression values across all cells, denoted as the zero_rate of these genes (Fig. 4c and Supplementary Data 16). Our filtering strategy filters CDKN2A in doxorubicin-treated and untreated MCF10A cells[8], which is expected since MCF10A breast epithelial cells carry a deletion of the CDKN2A locus. Only two other genes were filtered: PTPRD in doxorubicin-treated MCF10A cells[8] and IER3IP1 in K562 cells[4], probably due to noise in these datasets. These genes are not transcription factors, and all functional transcription factors are retained. To further evaluate the impact of this filtering on the results of MUSIC, we also tested what happens if this filtering step is removed. We reran MUSIC and compared the overall perturbation effect ranking with and without the zero expression filter for the three affected datasets (doxorubicin-treated MCF10A, untreated MCF10A, and K562 cells). More specifically, we normalized the overall ranking score (see the section Obtaining the overall perturbation effect ranking list in Methods) in the ranking lists calculated with and without the zero expression filter. Then we calculated the Pearson correlation coefficients of the normalized overall ranking score profiles with and without the zero expression filter. The Pearson correlation coefficients were 0.99 for doxorubicin-treated MCF10A, 0.93 for untreated MCF10A, and 0.98 for K562 cells. Taken together, these results show that filtering zero expression does not induce substantial changes in the overall rankings, which means that filtering the corresponding knockouts generally leaves the other knockouts or knockdowns unaffected.

Third, we evaluated the impact of the imputation and filtering strategies in the data preprocessing step on the final perturbation ranking results. To this end, we took a group of genes tested by Perturb-Seq[5] as a benchmark; that study indicated that Cebpb has a strong reinforcing effect with Rela, Hif1a, Stat3, and Junb, and a strong opposing effect with Nfkb1, Runx1, Irf4, and Spi1. The relationships among these genes are so evident that they are well suited as a gold standard. As shown in Supplementary Table 2, a comparison with and without imputation/filtering was performed on this dataset. Imputation and filtering together uncover these strong positive and negative correlations correctly and accurately. We further made a global evaluation to assess the overall impact of data preprocessing on all the datasets (Fig. 4d). In this study, the overall impact is calculated as the correlation of the overall perturbation effect ranking with and without imputation/filtering for all 14 datasets (Supplementary Data 17). More specifically, we first normalized the overall ranking score (see the section Obtaining the overall perturbation effect ranking list in Methods) in the ranking lists calculated with and without imputation/filtering. Then we calculated the Pearson correlation coefficients of the normalized overall ranking score profiles with and without imputation/filtering. The bar plots of these similarity comparisons are shown in Fig. 4d, indicating how imputation only, filtering only, or their combination affects the final perturbation effect ranking as a whole. All three strategies changed the ranking list, with a similarity of ~0.6 on average. The combined strategy changed the ranking list the most, as expected.

Discussion

In this study, we developed MUSIC, an integrated model-based pipeline designed specifically for single-cell CRISPR screening. MUSIC takes the raw count data with the corresponding perturbation information as inputs, and it quantitatively estimates and prioritizes the perturbation effect of each knockout or knockdown from three different perspectives, i.e., prioritizing the gene perturbation effect as an overall perturbation effect, prioritizing it in a functional topic-specific way, and quantifying the relationships between different perturbations. Extensive tests demonstrated that MUSIC is an effective and applicable pipeline for analyzing single-cell CRISPR screening data. Single-cell CRISPR screening is a powerful technique that makes it feasible to perform large-scale perturbations at single-cell granularity. However, it is inherently noisy, which makes the analysis of such data challenging. The current version of MUSIC contains a series of carefully designed filtering steps to reduce data noise, and future improvements are expected to refine and update these filtering steps to make them more effective.

Methods

Cell quality control

MUSIC evaluates cell quality based on three factors[29], i.e., number of genes detected (default 500), number of unique molecular identifiers induced (default 1000), and percentage of mitochondrial genes detected (default 10% among all the detected genes). Only cells with the first two factors above the thresholds and the third factor below the threshold are retained.
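The quality-control rule above can be sketched as a simple predicate. This is an illustrative sketch, not MUSIC's own code: the field names (`n_genes`, `n_umis`, `mito_frac`) and the example cells are hypothetical, and strict inequalities at the thresholds are an assumption.

```python
def passes_qc(cell, min_genes=500, min_umis=1000, max_mito_frac=0.10):
    """Keep a cell only if genes and UMIs are above, and the mitochondrial
    fraction is below, the default thresholds quoted in the text.

    Assumption: comparisons at exactly the threshold are treated as failing.
    """
    return (
        cell["n_genes"] > min_genes
        and cell["n_umis"] > min_umis
        and cell["mito_frac"] < max_mito_frac
    )


# Hypothetical per-cell summaries.
cells = [
    {"id": "c1", "n_genes": 1200, "n_umis": 5000, "mito_frac": 0.03},
    {"id": "c2", "n_genes": 300, "n_umis": 5000, "mito_frac": 0.03},   # too few genes
    {"id": "c3", "n_genes": 1200, "n_umis": 800, "mito_frac": 0.03},   # too few UMIs
    {"id": "c4", "n_genes": 1200, "n_umis": 5000, "mito_frac": 0.25},  # high mito
]

kept = [c["id"] for c in cells if passes_qc(c)]
```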

Data imputation

Single-cell RNA-seq data is sparse[10,11]: only a small fraction of the transcripts present in each cell is sequenced. To improve data quality, MUSIC adopts SAVER[16], an R package for single-cell RNA-seq data imputation, which proved necessary for MUSIC to discover the correct regulatory relationships (Supplementary Table 2). It should be noted that SAVER has been shown to recover the true expression level of each gene in each individual cell while avoiding the introduction of spurious correlations or false-positive gene pairs that have no biological correlation.

Evaluation of sgRNA knockout efficiency

The sgRNA knockout efficiency in CRISPR screening should also be carefully evaluated. The sgRNA targets Cas9 to a specific gene locus, but only 70–80% of these events generate a true loss of function of the targeted gene[30,31]. This implies that in 20–30% of the cells with a detected sgRNA, the gene can be active or partially active and show a wild-type phenotype (false positive), which influences the estimate of the perturbation impact. Thus, a step to filter such cells is needed. Intuitively, our filtering algorithm rests on the assumption that a perturbed cell should be filtered if its differentially expressed gene profile is more similar to that of the control cells than to that of the other cells with the same perturbation. Specifically, for each type of perturbation, we performed the following steps: (1) If the expression values of the perturbed gene are zero in all cells, the perturbation is filtered directly; otherwise, the following steps are performed. (2) Genes that are differentially expressed between control and perturbed cells are identified by the Kolmogorov–Smirnov test at p < 0.05. (3) For each perturbed cell i, the median cosine similarity of the differentially expressed gene profile between i and all the other cells with the same perturbation is calculated, denoted as M(). (4) For each perturbed cell i, the median cosine similarity of the differentially expressed gene profile between i and all the control cells is calculated, denoted as M(C). (5) For each cell i, if M(C) is bigger than M(), i.e., the cell resembles the controls more than its fellow perturbed cells, the cell is filtered. (6) For a specific perturbation, if the filtered cells amount to a high proportion (default 90%) of all its cells, the perturbation itself is filtered.
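The median cosine similarity comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the differentially expressed genes are taken as already selected (the Kolmogorov–Smirnov step is not reproduced), the function names and toy vectors are hypothetical, and a perturbation is dropped when the filtered fraction reaches the 90% default.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two expression vectors (0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def filter_perturbed_cells(perturbed, controls, max_filtered_frac=0.9):
    """Return (indices of kept perturbed cells, whether the perturbation is kept).

    perturbed, controls: lists of expression vectors restricted to the
    differentially expressed genes (assumed to be precomputed).
    """
    if len(perturbed) < 2 or not controls:
        raise ValueError("need at least two perturbed cells and one control cell")
    kept = []
    for i, cell in enumerate(perturbed):
        same = [cosine(cell, o) for j, o in enumerate(perturbed) if j != i]
        ctrl = [cosine(cell, c) for c in controls]
        # Filter the cell when it looks more like the controls than its peers.
        if median(ctrl) <= median(same):
            kept.append(i)
    filtered_frac = 1 - len(kept) / len(perturbed)
    return kept, filtered_frac < max_filtered_frac


# Toy data: the last "perturbed" cell has a control-like profile.
controls = [[1.0, 0.0, 1.0], [1.0, 0.1, 0.9], [0.9, 0.0, 1.0]]
perturbed = [[0.0, 1.0, 0.2], [0.1, 1.0, 0.1], [0.0, 0.9, 0.3], [1.0, 0.0, 1.0]]

kept_cells, keep_perturbation = filter_perturbed_cells(perturbed, controls)
```

Here the control-like fourth cell is removed, and the perturbation is retained because only a quarter of its cells were filtered.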

The minimal perturbed cell number per perturbation

Datlinger et al.[7] concluded that at least 30 cells are required to capture each perturbation phenotype. Therefore, perturbations with fewer than 30 perturbed cells (default) are not considered in MUSIC.
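A minimal sketch of this cell-count filter, assuming each perturbed cell carries a single perturbation label (the function name and example labels are hypothetical):

```python
from collections import Counter


def perturbations_with_enough_cells(cell_labels, min_cells=30):
    """Return the perturbations represented by at least min_cells cells."""
    counts = Counter(cell_labels)
    return {p for p, n in counts.items() if n >= min_cells}


# Hypothetical labels: one perturbation label per perturbed cell.
labels = ["TP53"] * 30 + ["MLH1"] * 29
retained = perturbations_with_enough_cells(labels)
```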

Selecting highly dispersion differentially expressed (DDE) genes

MUSIC identifies differentially expressed genes in single-cell sequencing data as dispersion differentially expressed (DDE) genes, i.e., genes with a maximum dispersion difference (DD) between the case and control. MUSIC selects DDE genes based on the following statistic:

$$DD_i = ZD_{case}(i) - ZD_{control}(i)$$

where $DD_i$ is the i-th gene's dispersion difference, and $ZD_{case}(i)$ and $ZD_{control}(i)$ are the z-scores of the i-th gene's dispersion in the case and control cells, respectively. Before calculating the z-score, the genes are binned based on their average expression, and the z-score of the dispersion is calculated within the corresponding bin. The z-score of the i-th gene's dispersion ($ZD_i$) is calculated as

$$ZD_i = \frac{D_i - \mu}{\sigma}$$

where μ and σ are the mean and variance, respectively, within the bin corresponding to the i-th gene, and $D_i$ is the dispersion of the i-th gene expression, which is calculated as

$$D_i = \frac{\sigma_i}{\mu_i}$$

where $\sigma_i$ and $\mu_i$ are the variance and mean, respectively, of the i-th gene expression.
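The dispersion-difference idea can be sketched in a few lines. This is an illustration under several assumptions, not MUSIC's implementation: dispersion is taken as variance divided by mean, genes are split into equal-sized bins by rank of average expression, the z-score divides by the standard deviation of the dispersions in the bin (the usual z-score convention), and DD is the plain difference of case and control z-scores. All names and data are hypothetical.

```python
import math
from statistics import mean, pstdev, pvariance


def dispersion(values):
    """Variance-to-mean ratio of one gene's expression (0 for an all-zero gene)."""
    m = mean(values)
    return pvariance(values) / m if m > 0 else 0.0


def dispersion_zscores(expr, n_bins=1):
    """expr maps gene -> list of expression values across cells.

    Genes are binned by average expression; the dispersion z-score is
    computed within each bin.
    """
    genes = sorted(expr, key=lambda g: mean(expr[g]))
    disp = {g: dispersion(expr[g]) for g in genes}
    bin_size = math.ceil(len(genes) / n_bins)
    z = {}
    for start in range(0, len(genes), bin_size):
        members = genes[start:start + bin_size]
        mu = mean([disp[g] for g in members])
        sd = pstdev([disp[g] for g in members])
        for g in members:
            z[g] = (disp[g] - mu) / sd if sd > 0 else 0.0
    return z


def dispersion_difference(case_expr, control_expr, n_bins=1):
    """DD per gene: case z-score minus control z-score (same genes in both)."""
    z_case = dispersion_zscores(case_expr, n_bins)
    z_ctrl = dispersion_zscores(control_expr, n_bins)
    return {g: z_case[g] - z_ctrl[g] for g in case_expr}


# Toy data: only g3 becomes much more dispersed in the case cells.
control = {"g1": [1, 2, 1, 2], "g2": [2, 3, 2, 3], "g3": [4, 5, 4, 5]}
case = {"g1": [1, 2, 1, 2], "g2": [2, 3, 2, 3], "g3": [0, 9, 0, 9]}

dd = dispersion_difference(case, control)
top_gene = max(dd, key=dd.get)
```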

Normalizing and rounding the expression value

The expression levels of different genes are normalized and rounded to fit the topic model: the final expression value is the original normalized expression value magnified 10-fold and then rounded.

Topic models

The topic model was originally introduced in the machine-learning and natural language processing communities for discovering latent topics in a collection of documents[17]. This generative hierarchical model assumes that each word in a document is generated in two steps: a topic is selected from the document with a certain probability, and then a word is selected from that topic with a certain probability. The generative process of the topic model is formulated as follows:

$$\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad \phi_t \sim \mathrm{Dirichlet}(\beta)$$

where $\theta_d$ and $\phi_t$ are, respectively, the distribution over topics of document d and the distribution over words of topic t, and α and β are the hyper-parameters of the Dirichlet priors. To generate word i in document d, topic $Z_{d,i}$ is first sampled from the document's distribution over topics, and then word $W_{d,i}$ is sampled from that topic's distribution over words:

$$Z_{d,i} \sim \mathrm{Multinomial}(\theta_d), \qquad W_{d,i} \sim \mathrm{Multinomial}(\phi_{Z_{d,i}})$$

In our study, the topic model is used to process single-cell CRISPR screening data. We draw a direct analogy between text mining and perturbation effect evaluation: documents correspond to the cells profiled by single-cell CRISPR screening, and the word frequencies in a document correspond to the gene expression values of a given cell. We determined the joint probability of gene expression for each cell by integrating out the parameters θ and φ, and applied collapsed Gibbs sampling to assign the genes of each cell to topics. Details can be found in ref. [17]. In summary, topic modeling was performed on the entire screen dataset so that the impacts of different perturbations are compared against the same background. Topic modeling produces two outputs: (1) the probability distribution of the topics, represented as a topic profile, which is used to characterize each perturbation (including the control), and (2) the enriched functional profile of each topic, which is calculated by enrichment analysis of the top 10% differentially expressed genes in each topic.
With these two profiles in hand, we can quantitatively calculate the overall perturbation effect ranking, the topic-specific perturbation ranking, and the relationships between perturbations.
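To make the cell-as-document analogy concrete, the following Python sketch simulates the generative process only (it does not perform inference by collapsed Gibbs sampling). All names (`sample_dirichlet`, `generate_cell`) and all parameter values are illustrative assumptions, not part of MUSIC.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_cell(n_counts, phi, alpha, rng):
    """Generate one 'cell' (document) of gene counts (word frequencies).

    phi[t] is the gene distribution of topic t; alpha is the Dirichlet
    hyper-parameter for the cell's topic distribution theta.
    """
    n_topics, n_genes = len(phi), len(phi[0])
    theta = sample_dirichlet(alpha, n_topics, rng)  # topic profile of this cell
    counts = [0] * n_genes
    for _ in range(n_counts):
        z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
        w = rng.choices(range(n_genes), weights=phi[z])[0]  # pick a gene from it
        counts[w] += 1
    return theta, counts


rng = random.Random(0)
n_topics, n_genes = 3, 8  # hypothetical sizes
phi = [sample_dirichlet(0.5, n_genes, rng) for _ in range(n_topics)]
theta, counts = generate_cell(50, phi, alpha=0.5, rng=rng)
print(theta, counts)
```

Inference runs this process in reverse: given only the count vectors of many cells, it estimates each cell's `theta` (the topic profile) and each topic's `phi`.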

Annotating each topic’s function

MUSIC obtains the occurrence probability of every gene in each topic. For each topic, MUSIC uses these topic profiles to perform a weighted biological function annotation. Intuitively, genes with large occurrence probabilities are more representative of a topic's function and should be selected to annotate it. Specifically, for each topic, MUSIC performs the following steps: (1) MUSIC selects the top 10% of genes in the topic based on their occurrence probabilities. (2) The genes selected in step 1 are used for functional enrichment annotation with clusterProfiler[32]. (3) The n top-ranked GO terms (default n = 5, ranked by q value) are selected to represent the topic's function.
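Step 1 of this procedure can be sketched in Python as below. The function name `top_genes`, the gene labels, and the probabilities are hypothetical, and rounding the 10% cutoff to the nearest whole gene is an assumption; the enrichment step itself (clusterProfiler, an R package) is not shown.

```python
def top_genes(gene_probs, fraction=0.10):
    """Return the genes with the highest occurrence probability in a topic.

    gene_probs maps gene name -> occurrence probability within one topic.
    How a fractional cutoff is rounded is an assumption of this sketch.
    """
    k = max(1, round(len(gene_probs) * fraction))
    ranked = sorted(gene_probs, key=gene_probs.get, reverse=True)
    return ranked[:k]


# Hypothetical topic with 20 genes whose probabilities decrease from g0 to g19.
weights = list(range(20, 0, -1))
topic = {f"g{i}": w / sum(weights) for i, w in enumerate(weights)}
selected = top_genes(topic)
print(selected)  # ['g0', 'g1']
```

The selected gene list would then be passed to the enrichment tool of choice.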

Automatically selecting the optimal topic number

Topic distribution is influenced by the topic number, so MUSIC applies an automatic strategy to select the optimal topic number. Intuitively, an optimal topic number should distinguish cells with different perturbation effects from each other as much as possible. In our study, for a given topic number n, we define a matrix G with m rows and n columns that holds the occurrence probabilities of the n topics in the m cells, as derived from the topic model. An optimal topic number should make G satisfy the following two criteria: (I) For each topic, its occurrence probability should differ as much as possible across cells with different perturbations. This is measured by a specificity score (SS) over all topics under a given topic number n, as calculated in Eq. (9). The larger the specificity score, the better the selected topic number. (II) The fewer topic functions that dominate each cell, the better. This is measured by a purity score (PS) over all topics under a given topic number n, as calculated in Eq. (10). The larger the purity score, the better the selected topic number. Finally, MUSIC defines the combination score (CS), a weighted average of the specificity score and the purity score, as shown in Eq. (11). Again, the larger the score, the better the selected topic number. In Eq. (9), n is the selected topic number, and σ and μ are the variance and mean, respectively, of the j-th column of G. In Eq. (10), n is the selected topic number, m is the number of rows in G, and σ is the variance of the i-th row of G. In Eq. (11), n is the selected topic number and α (default 0.5) is a weight in [0, 1]. Considering the time cost and the biological interpretability of the results, we recommend trying a limited range of topic numbers (currently 4 to 6), informed by prior knowledge of biological functional categories.
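The idea behind the three scores can be sketched in Python. The exact forms of Eqs. (9) to (11) are not reproduced in this text, so the formulas below are assumed stand-ins built only from the quantities the text names (column variance and mean for SS, row variance for PS, a weight α for CS); they should not be read as MUSIC's definitions.

```python
from statistics import mean, pvariance


def specificity_score(G):
    """Assumed form: average over topics (columns) of variance / mean."""
    columns = list(zip(*G))
    return mean(pvariance(col) / mean(col) for col in columns)


def purity_score(G):
    """Assumed form: average over cells (rows) of the row variance."""
    return mean(pvariance(row) for row in G)


def combination_score(G, alpha=0.5):
    """Weighted average of the two scores; which term alpha weights is assumed."""
    return alpha * specificity_score(G) + (1 - alpha) * purity_score(G)


# Hypothetical topic-probability matrices: rows are cells, columns are topics.
peaked = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
flat = [[1 / 3, 1 / 3, 1 / 3]] * 3
print(combination_score(peaked), combination_score(flat))
```

A matrix in which each cell is dominated by one topic scores higher than a flat one, which matches the intuition of criteria (I) and (II). In practice the score would be computed for each candidate topic number and the largest kept.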

Considering off-target effects

An sgRNA off-target effect may exist in these data because they are generated with the CRISPR knockout/knockdown screening technique. For CRISPRi data, MUSIC skips this step, since CRISPRi knockdown is highly specific with minimal off-target effects[33]. In the current version, MUSIC provides off-target information only for knockouts. MUSIC integrates the sgRNA sequence information with the expression of the corresponding knockout gene to determine whether the sgRNA has induced an off-target effect, as follows: (1) CRISPRseek[34] is used to predict possible off-targets from the sgRNA sequence information. (2) Correlations of the transcriptional expression values between the knockout gene and the possible off-targets are calculated for the case and the control, respectively. (3) If a significant increase in the correlations between the case and the control is detected, MUSIC reports a possible off-target effect for this knockout.

Obtaining the topic-specific ranking list

To analyze the functional impact of perturbations, MUSIC prioritizes the perturbation effect in a topic-specific way. For a specific topic, MUSIC prioritizes the perturbation effect by calculating the topic probability difference (TPD) between the case and the control. Intuitively, the ranking list is obtained by evaluating the perturbation effect on this specific topic: a highly ranked perturbation should influence this topic as much as possible while leaving the other topics as unaffected as possible. Specifically, MUSIC performs the following steps. MUSIC calculates the TPD based on the Student t-test. To meet the conditions of the Student t-test, the topic probabilities of cells with different perturbations are normalized to the standard normal distribution. For the i-th perturbation on the j-th topic, each topic probability is z-normalized with respect to the mean and standard deviation of the corresponding control population:

$$z = \frac{p - \mu_{\mathrm{control},j}}{\sigma_{\mathrm{control},j}} \qquad (12)$$

where p is the probability of the j-th topic in a cell carrying the i-th perturbation, and $\mu_{\mathrm{control},j}$ and $\sigma_{\mathrm{control},j}$ are the mean and standard deviation of the j-th topic's probability in the control cells. The number of cells carrying each edit generally varies greatly, i.e., there is a sample imbalance issue, which can strongly affect the analysis of perturbation effects. To address this issue, MUSIC first identifies the minimal cell number (M) among all perturbations. Then, for each perturbation, MUSIC adopts a bootstrapping strategy: it randomly samples M cells and performs the Student t-test, repeats this 1000 times, and takes the median. The test statistic of the i-th perturbation on the j-th topic, TPD, is calculated by Eq. (13) from four quantities: the mean of the normalized topic probabilities from Eq. (12) for the i-th perturbation on the j-th topic; the mean of the normalized topic probabilities of control cells for the j-th topic; S, the standard deviation of the normalized topic probabilities of cells for the i-th perturbation on the j-th topic; and S_control, the standard deviation of the normalized topic probabilities of control cells for the j-th topic.
In our study, the test statistic TPD is used for two reasons: (a) TPD is a valid metric for the difference in means between two populations; (b) TPD can be positive or negative, and so indicates the direction of a perturbation's impact. MUSIC then prioritizes a perturbation by considering both its effect on this specific topic and its influence on other topics. MUSIC uses the topic probability difference ratio (TPDR) to evaluate the influence on other topics: the larger the ratio, the less the perturbation influences other topics. The TPDR of the i-th perturbation on the j-th topic is calculated from the TPD of Eq. (13). Finally, MUSIC defines a score (CS) that evaluates the effect of the i-th perturbation on a specific topic by considering both TPD and TPDR. The larger the score, the higher the rank. MUSIC also calculates a threshold to determine whether a perturbation has a statistically significant impact on a specific topic. Intuitively, the impact of a perturbation on a functional topic is significant if it is greater than that generated randomly. MUSIC first obtains TPDrandom by Eq. (16), which uses the mean of the normalized topic probabilities from Eq. (12) for M randomly selected control cells on the j-th topic, and then applies the same procedure to obtain the score (CS) between the selected cells and all control cells. This process is repeated 1000 times and the median is taken as the threshold. The impact of the i-th perturbation on topic j is considered significant when its CS is larger than the threshold.
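The normalization and bootstrapping steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the textbook pooled-variance Student t statistic stands in for Eq. (13), whose exact form is not reproduced here; sampling is done with replacement; the full control population is used in every test; and all function names and data values are hypothetical.

```python
import random
from statistics import mean, median, stdev


def z_normalize(values, control):
    """Z-normalize topic probabilities against the control mean and SD."""
    mu, sigma = mean(control), stdev(control)  # sample SD is an assumption
    return [(v - mu) / sigma for v in values]


def student_t(a, b):
    """Pooled-variance two-sample t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def bootstrap_tpd(perturbed, control, m, n_boot=1000, seed=0):
    """Median t statistic over n_boot resamples of m perturbed cells."""
    rng = random.Random(seed)
    z_control = z_normalize(control, control)
    z_perturbed = z_normalize(perturbed, control)
    stats = [student_t(rng.choices(z_perturbed, k=m), z_control) for _ in range(n_boot)]
    return median(stats)


# Hypothetical probabilities of one topic in control and perturbed cells.
control = [0.18, 0.22, 0.20, 0.25, 0.15, 0.21, 0.19, 0.23]
raised = [0.45, 0.52, 0.48, 0.55, 0.50, 0.47]
lowered = [0.05, 0.08, 0.06, 0.07, 0.04]
tpd_up = bootstrap_tpd(raised, control, m=4)
tpd_down = bootstrap_tpd(lowered, control, m=4)
print(tpd_up, tpd_down)
```

The sign of the result reflects the direction of the perturbation's effect on the topic, which is reason (b) given above for using the test statistic.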

Obtaining the overall perturbation effect ranking list

To calculate the overall perturbation effect ranking list, the sum of each topic's TPD (TPDS) is calculated for each perturbation by Eq. (17). In practice, the TPD used here needs to be adjusted by performing the same bootstrapping on the control cells; the adjusted TPD is denoted TPDA.

Obtaining the relationships between different perturbations

MUSIC quantifies the relationship between two perturbations by calculating the Pearson correlation coefficient of their TPDA profiles. Furthermore, MUSIC can automatically visualize the perturbation correlation network for each dataset tested.
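As a minimal Python sketch of this step, the pairwise Pearson correlation between per-topic profiles can be computed as below. The perturbation names and profile values are hypothetical placeholders, not results from any dataset.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


# Hypothetical adjusted TPD profiles: one value per topic for each perturbation.
profiles = {
    "KO_A": [2.1, -0.5, 0.3, 1.2],
    "KO_B": [1.8, -0.2, 0.1, 1.0],
    "KO_C": [-1.5, 0.9, 0.2, -0.8],
}
edges = {
    (p, q): pearson(profiles[p], profiles[q])
    for p, q in combinations(profiles, 2)
}
for pair, r in edges.items():
    print(pair, round(r, 3))
```

Each entry of `edges` can serve as a weighted edge when drawing a perturbation correlation network.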

Prioritizing perturbation effect difference under different treatment conditions

When cells are treated under different experimental conditions, MUSIC can prioritize the difference in perturbation effect between two conditions and identify perturbations whose effect changes substantially. Intuitively, by comparing the TPDS of a specific perturbation under the two conditions, MUSIC identifies the perturbations whose impact changes significantly. Specifically, MUSIC first selects the perturbations common to both conditions and then defines the perturbation impact difference (PID) score to quantify the difference in perturbation impact between the two conditions. For a perturbation i, PID is calculated from the TPDS values under the two conditions, where n is the number of common perturbations and TPDS is calculated by Eq. (17).

Comparisons between negative control and blank control

Because the preceding steps rely on comparisons between perturbed cells and negative control cells, we performed a statistical test comparing the negative control with the blank control to assess the suitability of the negative control in these experiments. First, we expect the negative control (cells induced with non-targeting gRNAs) and the blank control (cells with no gRNAs induced) to differ slightly in single-cell CRISPR screening experiments. In previous studies[4-8], researchers have tended to choose the negative control rather than the blank control to keep the comparison relatively fair, since it is necessary to eliminate the effects of the induction itself on the cells. Second, the differences between the negative control and the blank control should be less significant than those between knockouts/knockdowns and the blank control. To test this, we used the stimulated Jurkat cell dataset[7], which includes cells without any induced gRNAs (blank control). Routine imputation and filtering were performed on these cells. A bootstrap sampling strategy was then applied to the blank control cells: 10% of them were randomly selected and compared with the negative control cells and the knockout cells. We calculated the similarity for these comparisons over 100 samplings. The statistical comparison is shown in Supplementary Fig. 15. The negative control cells are significantly more similar to the blank control cells than any of the knockout cells are (t-test p < 2.2e−16).

Robust test

For each dataset, we randomly relabeled 20% of the control cells as a control test subset, processed it alongside the knockouts or knockdowns, and calculated the rank of the control test subset in the overall perturbation effect ranking. We then calculated the rate of knockouts or knockdowns ranked below the control test subset among the total number of knockouts or knockdowns. This process was repeated 10 times for each dataset to reduce randomness. The average rate is about 0.06 across all available datasets, indicating that the control test subsets generally disturb the final ranking list only slightly. In addition, for each dataset, Pearson correlation coefficients were calculated between the overall perturbation effect ranking obtained from this random test and that from the original studies. The average Pearson correlation coefficient is 0.82, further indicating that the data preprocessing steps in MUSIC are reliable and robust to random noise.
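The rate described above can be sketched in Python. The function name `rate_below_control`, the label `control_test`, and the example rankings are hypothetical; the sketch assumes the ranking lists perturbations from the strongest to the weakest overall effect.

```python
from statistics import mean


def rate_below_control(ranking, control_label="control_test"):
    """Fraction of perturbations ranked below the relabeled control subset.

    `ranking` lists labels from strongest to weakest overall effect and
    contains `control_label` exactly once.
    """
    position = ranking.index(control_label)
    n_perturbations = len(ranking) - 1  # everything except the control subset
    n_below = len(ranking) - 1 - position
    return n_below / n_perturbations


# Hypothetical rankings from two repeats of the random relabeling.
repeats = [
    ["KO_A", "KO_B", "KO_C", "KO_D", "control_test", "KO_E"],
    ["KO_B", "KO_A", "KO_C", "KO_E", "KO_D", "control_test"],
]
rates = [rate_below_control(r) for r in repeats]
print(rates, mean(rates))  # [0.2, 0.0] 0.1
```

A rate near zero means the relabeled control cells rank at or near the bottom, as expected for cells that carry no real perturbation.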

36 in total

Review 1. Regulation of C/EBPβ and resulting functions in cells of the monocytic lineage.

Authors: René Huber; Daniel Pietsch; Thomas Panterodt; Korbinian Brand
Journal: Cell Signal Date: 2012-02-22 Impact factor: 4.315

2. Single-minded CRISPR screening.

Authors: Bryan R Lanning; Christopher R Vakoc
Journal: Nat Biotechnol Date: 2017-04-11 Impact factor: 54.908

3. Dissecting Immune Circuits by Linking CRISPR-Pooled Screens with Single-Cell RNA-Seq.

Authors: Diego Adhemar Jaitin; Assaf Weiner; Ido Yofe; David Lara-Astiaso; Hadas Keren-Shaul; Eyal David; Tomer Meir Salame; Amos Tanay; Alexander van Oudenaarden; Ido Amit
Journal: Cell Date: 2016-12-15 Impact factor: 41.582

4. Genetic screens in human cells using the CRISPR-Cas9 system.

Authors: Tim Wang; Jenny J Wei; David M Sabatini; Eric S Lander
Journal: Science Date: 2013-12-12 Impact factor: 47.728

5. Decreased cytotoxic effects of doxorubicin in a human ovarian cancer-cell line expressing wild-type p53 and WAF1/CIP1 genes.

Authors: F Vikhanskaya; M D'Incalci; M Broggini
Journal: Int J Cancer Date: 1995-05-04 Impact factor: 7.396

6. β-elemene regulates endoplasmic reticulum stress to induce the apoptosis of NSCLC cells through PERK/IRE1α/ATF6 pathway.

Authors: Ying Liu; Zi-Yu Jiang; Yuan-Li Zhou; Hui-Hui Qiu; Gang Wang; Yi Luo; Jing-Bing Liu; Xiong-Wei Liu; Wei-Quan Bu; Jie Song; Li Cui; Xiao-Bin Jia; Liang Feng
Journal: Biomed Pharmacother Date: 2017-06-30 Impact factor: 6.529

7. CRISPR-mediated modular RNA-guided regulation of transcription in eukaryotes.

Authors: Luke A Gilbert; Matthew H Larson; Leonardo Morsut; Zairan Liu; Gloria A Brar; Sandra E Torres; Noam Stern-Ginossar; Onn Brandman; Evan H Whitehead; Jennifer A Doudna; Wendell A Lim; Jonathan S Weissman; Lei S Qi
Journal: Cell Date: 2013-07-11 Impact factor: 41.582

8. CIRCLE-seq: a highly sensitive in vitro screen for genome-wide CRISPR-Cas9 nuclease off-targets.

Authors: Shengdar Q Tsai; Nhu T Nguyen; Jose Malagon-Lopez; Ved V Topkar; Martin J Aryee; J Keith Joung
Journal: Nat Methods Date: 2017-05-01 Impact factor: 28.547

9. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis.

Authors: Emma Pierson; Christopher Yau
Journal: Genome Biol Date: 2015-11-02 Impact factor: 13.583

10. On the design of CRISPR-based single-cell molecular screens.

Authors: Andrew J Hill; José L McFaline-Figueroa; Lea M Starita; Molly J Gasperini; Kenneth A Matreyek; Jonathan Packer; Dana Jackson; Jay Shendure; Cole Trapnell
Journal: Nat Methods Date: 2018-02-19 Impact factor: 28.547
