
A systematic evaluation of nucleotide properties for CRISPR sgRNA design.

Pei Fen Kuan1, Scott Powers2, Shuyao He3, Kaiqiao Li3, Xiaoyu Zhao2, Bo Huang4.   

Abstract

BACKGROUND: CRISPR is a versatile gene editing tool which has revolutionized genetic research in the past few years. Optimizing sgRNA design to improve the efficiency of target/DNA cleavage is critical to ensure the success of CRISPR screens.
RESULTS: By borrowing knowledge from oligonucleotide design and nucleosome occupancy models, we systematically evaluated candidate features computed from a number of nucleic acid, thermodynamic and secondary structure models on real CRISPR datasets. Our results showed that taking into account position-dependent dinucleotide features improved the design of effective sgRNAs with area under the receiver operating characteristic curve (AUC) >0.8, and the inclusion of additional features offered marginal improvement (∼2% increase in AUC).
CONCLUSION: Using a machine-learning approach, we proposed an accurate prediction model for sgRNA design efficiency. An R package predictSGRNA implementing the predictive model is available at http://www.ams.sunysb.edu/~pfkuan/softwares.html#predictsgrna .

Keywords:  CRISPR; Machine learning; Predictive modeling; Thermodynamics

Year:  2017        PMID: 28587596      PMCID: PMC5461693          DOI: 10.1186/s12859-017-1697-6

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

The Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR)/Cas system is a heritable and adaptive prokaryotic immune system that protects cells by destroying foreign genetic elements [1]. Over the past few years, CRISPR has emerged as a powerful gene editing technology [2, 3]. As an editing tool, CRISPR consists of a single guide RNA (sgRNA) and an enzyme called Cas9. The sgRNA is composed of a short synthetic RNA (approximately 20 base pairs (bp), known as the spacer target) located within a N-bp scaffold. The spacer target is designed to bind to a specific sequence in the genome, whereas the Cas9 protein acts as a biomolecular scissor. This system has proven to be a powerful tool for studying individual gene function and for genome engineering. The design of the sgRNA is an important factor in the success of CRISPR-Cas9 screens. It is desirable to design sgRNA libraries that have maximum on-target and minimum off-target effects. The binding specificity of the sgRNA is determined by the 20 bp spacer target and a protospacer adjacent motif (PAM) sequence (generally NGG or NAG) on the genome. Once the sgRNA binds to the target sequence, the Cas9 nuclease cuts 3 bp upstream of the PAM sequence. Different groups have studied the sequence features of spacer target sites that predict sgRNA on-target efficiency [4-7]. In particular, [5] investigated the effect of position-dependent sequence features on sgRNA efficiency and whether these features could reproducibly predict sgRNA efficiency in several publicly available CRISPR datasets. They proposed a predictive model using the position-dependent mono-nucleotide composition across a 40 bp sequence encompassing the 5’ flanking region, the spacer target and the 3’ flanking region, and further demonstrated that their model performed better than the model of [4].
On the other hand, [6, 7] proposed a predictive model based on gradient-boosted regression trees using position-dependent and independent sequence properties, location of the sgRNA within the protein and melting temperatures. Aspects of sgRNA design share similarities to oligonucleotide designs used for microarrays. In both cases, optimal oligonucleotide design aims to increase binding sensitivity and specificity while minimizing off target hybridization. A position dependent sequence bias has been observed in the design of oligonucleotides in Affymetrix microarrays [8], whereas in our earlier work [9] we showed that the thermodynamic and secondary features of the oligonucleotides affect the hybridization intensities in Nimblegen arrays. In addition, [6, 7] investigated position dependent and independent features, position of the guide within the genes, interaction with the PAM sequence and melting temperatures, and showed that these features improved the prediction model in CRISPR/Cas9 screens; whereas microhomology features did not improve the prediction. In this paper, we computed a comprehensive list of features of the target sequence from a number of nucleic acid, thermodynamic, and secondary structure models by adopting some ideas of microarray designs. In a similar manner as [6, 7], we systematically characterized the effect of these features on the efficiency of sgRNA design, and seek to understand if the inclusion of these features improves the design of effective sgRNAs in CRISPR/Cas9 knockout screens.

Methods

We used the sets of efficient and inefficient sgRNAs from the CRISPR/Cas9 screens of [10] and [11] compiled by [5]. The first dataset consists of 731 efficient and 438 inefficient sgRNAs targeting ribosomal genes [10], the second dataset consists of 671 efficient and 237 inefficient sgRNAs targeting non-ribosomal genes [10] and the third dataset consists of 830 efficient and 234 inefficient sgRNAs targeting essential genes in mouse embryonic stem cell (mESC) line, JM8 [11]. The procedures for identifying efficient and inefficient sgRNAs were used exactly as described in [5]. Spacer lengths in the reported studies were 20 bp [10] and 19 bp [11]. Using these sets of sgRNAs, we computed primary sequence, thermodynamic, and secondary structures as candidate features. Further details are provided below.

DNA sequence candidate features

Position-dependent nucleotide composition

Similar to [5], we created vectors of position-dependent mono-nucleotide composition (PD Mono) for the 40 bp long sequences comprised of the spacer targets, and 5’ and 3’ flanking regions. In addition, we extracted position-dependent dinucleotide composition (PD Dinuc) for these 40 bp sequences and computed the single and dinucleotide frequencies (Freq) for the spacer target. Since positions 32 and 33 were part of the PAM sequence (GG), they were excluded from the analysis.
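The three sequence-based feature classes described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the feature naming scheme, and the choice to drop every dinucleotide that overlaps an excluded PAM position are assumptions made here.

```python
from itertools import product

NUCS = "ACGT"
DINUCS = ["".join(p) for p in product(NUCS, repeat=2)]  # AA, AC, ..., TT


def pd_mono(seq, skip=()):
    """Position-dependent mono-nucleotide indicators (positions are 1-based)."""
    feats = {}
    for pos, base in enumerate(seq.upper(), start=1):
        if pos in skip:
            continue  # e.g. the GG of the PAM carries no information
        for n in NUCS:
            feats[f"pos{pos}_{n}"] = int(base == n)
    return feats


def pd_dinuc(seq, skip=()):
    """Position-dependent dinucleotide indicators, keyed by start position."""
    seq = seq.upper()
    feats = {}
    for pos in range(1, len(seq)):
        # Assumption: drop any dinucleotide touching an excluded position.
        if pos in skip or pos + 1 in skip:
            continue
        pair = seq[pos - 1:pos + 1]
        for d in DINUCS:
            feats[f"pos{pos}_{d}"] = int(pair == d)
    return feats


def freq(spacer):
    """Relative mono- and dinucleotide frequencies of the spacer target."""
    spacer = spacer.upper()
    out = {n: spacer.count(n) / len(spacer) for n in NUCS}
    n_pairs = len(spacer) - 1
    for d in DINUCS:
        hits = sum(spacer[i:i + 2] == d for i in range(n_pairs))
        out[d] = hits / n_pairs
    return out
```

For a 40 bp sequence with positions 32 and 33 excluded, `pd_mono` yields 38 × 4 = 152 indicators.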

Thermodynamics and secondary structure properties of [9] (Thermo)

Motivated by our earlier work, which studied the relationship between oligonucleotide properties and hybridization signal intensities in microarray design [9], we computed the following thermodynamic properties: melting temperature (T_m), GC content, entropy change (ΔS), enthalpy change (ΔH) and free energy change (ΔG); and the following secondary structure properties: longest polyN, repetitive sequence (repeat), length of a potential stem-loop (LSL) and minimum energy folding (MEF). T_m was computed according to the formula in [12], with [Na+] assumed to be 0.2 M. ΔS, ΔH and ΔG were calculated by summing the respective entropy, enthalpy and free energy parameters of each dinucleotide, including the initiation parameters and the penalty for self-complementary duplexes, according to the position-dependent nearest neighbor approach described in [13]. These parameters were provided in Tables 1 and 2 of [13]. MEF was computed using the hybrid-ss-min program in the OligoArrayAux package, whereas LSL was computed using the palindrome function in the EMBOSS package. Longest polyN and repeat were calculated as previously described [9]. These properties were computed for the spacer target sequence.
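A few of the simpler Thermo features can be sketched in plain Python. The function names are hypothetical. "Longest polyN" is assumed here to mean the longest run of one repeated nucleotide. The nearest-neighbor helper expects the caller to supply the dinucleotide parameter table (for instance, values taken from [13]); no parameter values are reproduced here, and the self-complementarity penalty is omitted.

```python
from itertools import groupby


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_poly_n(seq):
    """Length of the longest run of a single repeated nucleotide (assumed definition)."""
    return max(len(list(run)) for _, run in groupby(seq.upper()))


def nearest_neighbor_sum(seq, dinuc_params, initiation=0.0):
    """Sum a per-dinucleotide parameter (entropy, enthalpy or free energy).

    dinuc_params is a caller-supplied dict mapping each dinucleotide to its
    parameter value; initiation is the initiation term added once.
    """
    seq = seq.upper()
    return initiation + sum(dinuc_params[seq[i:i + 2]] for i in range(len(seq) - 1))
```

Calling `nearest_neighbor_sum` three times with the entropy, enthalpy and free energy tables would give ΔS, ΔH and ΔG under these assumptions.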

DNA secondary structures based on dinucleotide and tetra nucleotide properties of [14] and [15] (Packer)

Following a previously described approach [16], we computed the minimum, maximum and average values of both the tetranucleotide energy and flexibility scores as described [15]. These scores were given in Tables 3 and 4 of [15]. In addition, we computed the minimum, maximum and average values of the dinucleotide roll, twist, slide and shift scores as described [14]. The dinucleotide values of these properties were given in Tables 1, 2 and 3 of [14]. These scores were representations of the three-dimensional DNA structure and anisotropic flexibility [14]. Similar to above, we computed these properties for the spacer target sequence.

Physicochemical properties of [17] (PhyChem)

We adapted the approach described by [17], which was developed for predicting nucleosome occupancy, and computed 12 physicochemical properties (A-philicity, base-stacking, B-DNA twist, bendability, DNA bending stiffness, DNA denaturation, duplex disrupt energy, duplex free energy, propeller twist, protein deformation, protein-DNA twist and Z-DNA). For each property, we computed the minimum, maximum and average dinucleotide scores for the spacer target sequence. The dinucleotide values of the 12 physicochemical properties were given in Table 1 of [17].
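The minimum, maximum and average summaries used for the Packer and PhyChem feature classes reduce to a sliding-window lookup. The sketch below assumes the caller provides the score table (for example, values from [14], [15] or [17]); the function name and the `k` argument are illustrative and not taken from the paper.

```python
def property_summary(seq, scores, k=2):
    """Return (min, max, mean) of a k-mer property along the sequence.

    scores maps each k-mer to a numeric value; use k=2 for dinucleotide
    properties and k=4 for tetranucleotide properties.
    """
    seq = seq.upper()
    vals = [scores[seq[i:i + k]] for i in range(len(seq) - k + 1)]
    return min(vals), max(vals), sum(vals) / len(vals)
```

Each property therefore contributes three features per spacer target, so 12 dinucleotide properties would give 36 features.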

Pseudo k-tuple nucleotide composition of [18] (PseKNC)

The PseKNC model was also originally developed for predicting nucleosome occupancy by taking into account global sequence-order effects. PseKNC represents a DNA sequence of length L as a vector

$$D = [d_1, d_2, \ldots, d_{4^k}, d_{4^k+1}, \ldots, d_{4^k+\lambda}]^T,$$

where

$$d_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{4^k} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 4^k, \\[2ex] \dfrac{w\,\theta_{u-4^k}}{\sum_{i=1}^{4^k} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 4^k < u \le 4^k + \lambda, \end{cases}$$

$$\theta_j = \frac{1}{L-j-1} \sum_{s=1}^{L-j-1} \frac{1}{m} \sum_{t=1}^{m} \left[ P_t(r_s r_{s+1}) - P_t(r_{s+j} r_{s+j+1}) \right]^2 .$$

Here the $f_u$'s are the k-tuple nucleotide frequencies, m is the number of local DNA properties considered, and $P_t(r_s r_{s+1})$ and $P_t(r_{s+j} r_{s+j+1})$ are the scores of the t-th DNA local structural property for the dinucleotides $r_s r_{s+1}$ and $r_{s+j} r_{s+j+1}$ at positions s and s+j, respectively. λ is the order of correlations along the DNA sequence and w is the weight factor. Our candidate k, λ and w took values of k=2,3,…,6, λ=1,2,…,15, and w=0,0.1,0.2,…,1. We used the following strategy to choose the optimal parameters for the PseKNC model. A three-way cross validation was performed on each dataset using elastic net [19]. The parameters corresponding to the PseKNC model with the largest average area under the receiver operating characteristic curve (AUC) were selected for subsequent analysis. Based on this criterion, we set k=2, λ=1 and w=0.5. Similar to [18], we considered m=6 DNA local structural properties, which were divided into local translational (rise, slide and shift) and angular (twist, roll and tilt) properties.
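A PseKNC-style feature vector can be sketched as below, following the general formulation of [18]: k-tuple frequencies followed by λ sequence-order correlation terms, all divided by a common normalizer. This is an illustrative sketch. The function name and argument layout are assumptions, the dinucleotide property tables must be supplied by the caller, and any standardization of the property values applied in [18] is left out.

```python
from itertools import product


def pseknc(seq, props, k=2, lam=1, w=0.5):
    """Return a list of 4**k + lam PseKNC-style features.

    props is a list of dicts, one per local structural property, each
    mapping all 16 dinucleotides to a score (caller-supplied values).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    n_k = len(seq) - k + 1
    # k-tuple frequencies f_u
    f = [sum(seq[i:i + k] == km for i in range(n_k)) / n_k for km in kmers]

    dinucs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    theta = []
    for j in range(1, lam + 1):
        # mean squared property difference between dinucleotides j steps apart
        terms = [
            sum((p[dinucs[s]] - p[dinucs[s + j]]) ** 2 for p in props) / len(props)
            for s in range(len(dinucs) - j)
        ]
        theta.append(sum(terms) / len(terms))

    denom = sum(f) + w * sum(theta)
    return [x / denom for x in f] + [w * t / denom for t in theta]
```

With k=2 and λ=1 the vector has 17 entries. With the k-tuples in alphabetical order, the 16th entry corresponds to the TT dinucleotide.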

Optimal pairwise alignment (Align)

We computed the optimal global pairwise alignment scores between the seed region and scaffold using the Needleman-Wunsch algorithm [20] which served as a measure of the potential of the k PAM-proximal seed region of the spacer target to interact with the scaffold sequence. The seed region was defined as the immediate k nucleotides next to the PAM sequence. We considered k=5,…,L, where L is the length of spacer target.
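The Align features can be illustrated with a compact Needleman-Wunsch scorer. The scoring scheme (match +1, mismatch −1, linear gap −1) is an assumption, since the scores used in the paper are not stated here. The seed is taken as the last k nucleotides of the spacer, because the PAM lies immediately 3’ of the spacer target. The scaffold sequence is an input the caller must provide.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def seed_alignment_scores(spacer, scaffold, k_min=5):
    """Score the k PAM-proximal nucleotides against the scaffold, k = k_min..L."""
    return {
        k: nw_score(spacer[-k:], scaffold)
        for k in range(k_min, len(spacer) + 1)
    }
```

For a 20 bp spacer this produces 16 alignment scores, one for each seed length from 5 to 20.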

Results and discussion

For each dataset, we computed a score for every feature as a measure of its strength of association with sgRNA efficiency. If the feature was a binary variable, a log odds ratio between efficient and inefficient sgRNAs was computed. If the feature was a continuous variable, a two-sample t-statistic was computed. We divided the features into 8 classes: (1) position-dependent mono-nucleotide (PD Mono); (2) position-dependent dinucleotide (PD Dinuc); (3) frequencies of mono- and dinucleotides (Freq); (4) optimal pairwise alignment between spacer target and scaffold (Align); (5) thermodynamics and secondary structures of [9] (Thermo); (6) secondary structures of [14, 15] (Packer); (7) physicochemical properties of [17] (PhyChem); and (8) pseudo k-tuple nucleotide composition of [18] (PseKNC). We found that most of the features were consistently associated with sgRNA efficiency across datasets (Figs. 1 and 2).
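The two association scores can be computed as in the sketch below. Two details are assumptions made for this illustration, as the text does not specify them: a 0.5 continuity correction in the odds ratio (to avoid division by zero for sparse indicators) and the unequal-variance (Welch) form of the t-statistic.

```python
import math
from statistics import mean, variance


def log_odds_ratio(x_eff, x_ineff, pseudo=0.5):
    """Log odds ratio of a 0/1 feature, efficient vs inefficient sgRNAs."""
    a = sum(x_eff)            # efficient, feature present
    b = len(x_eff) - a        # efficient, feature absent
    c = sum(x_ineff)          # inefficient, feature present
    d = len(x_ineff) - c      # inefficient, feature absent
    return math.log(((a + pseudo) * (d + pseudo)) / ((b + pseudo) * (c + pseudo)))


def t_statistic(x_eff, x_ineff):
    """Two-sample t-statistic with unequal variances (Welch form, assumed)."""
    se = math.sqrt(variance(x_eff) / len(x_eff) + variance(x_ineff) / len(x_ineff))
    return (mean(x_eff) - mean(x_ineff)) / se
```

A positive score means the feature is more common, or takes larger values, among efficient sgRNAs. Plotting the scores from one dataset against those from another gives one point per feature.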
Fig. 1

Pairwise correlation plot for each class of features. Left column is the pairwise correlation plot between ribosomal and non-ribosomal genes from [10]. Middle column is the pairwise correlation plots between ribosomal genes from [10] and mESC essential genes from [11]. Right column is the pairwise correlation plots between non-ribosomal genes from [10] and mESC essential genes from [11]. Each point is a feature

Fig. 2

Pairwise correlation plot for each class of features. Left column is the pairwise correlation plot between ribosomal and non-ribosomal genes from [10]. Middle column is the pairwise correlation plots between ribosomal genes from [10] and mESC essential genes from [11]. Right column is the pairwise correlation plots between non-ribosomal genes from [10] and mESC essential genes from [11]. Each point is a feature

Candidate feature ranking

To rank the contribution of each feature to the efficiency of sgRNA design, we fitted a logistic regression model within each dataset using the binary sgRNA efficiency indicator as the response and the features as predictors. The Bayesian Information Criterion (BIC) for the fitted model was computed. The features were ranked by the BIC scores and the top 10 most important features were shown in Additional file 1: Figure S1. The top ranked feature based on average BIC scores across the three datasets was the 16-th feature from PseKNC model. This feature is a function of TT dinucleotide frequency. In addition, we computed the area under receiver operating characteristic curves (AUCs) for continuous features. The top 10 features ranked by AUC were shown in Fig. 3, in which the 16-th feature from the PseKNC model was also ranked number one. The third measure we considered for feature ranking was the permutation based variable importance score from the random forest prediction algorithm. Random forest [21] is a non-parametric ensemble approach based on a large number of classification trees trained on bootstrap samples. The permutation based variable importance score of a feature is defined as the difference in prediction accuracy before and after permuting this feature, averaging over all trees. We used the unscaled version of variable importance score as recommended by [22, 23] to avoid bias due to number of trees grown. The top 10 features ranked by variable importance are shown in Additional file 1: Figure S2. Based on these results, the frequencies of T and TT had the strongest association with sgRNA efficiency, in which higher frequencies of T and TT were associated with decreased efficiency.
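Ranking continuous features by single-feature AUC can be sketched with the rank-based (Mann-Whitney) definition of AUC. The function names are hypothetical. Reporting max(AUC, 1 − AUC), so that features negatively associated with efficiency such as T and TT frequency still rank highly, is an assumption of this sketch.

```python
def auc(scores_pos, scores_neg):
    """Probability that a positive outscores a negative; ties count one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def rank_features(features, labels):
    """features: dict name -> list of values; labels: list of 0/1 (1 = efficient)."""
    ranking = []
    for name, values in features.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        a = auc(pos, neg)
        ranking.append((name, max(a, 1.0 - a)))  # direction-agnostic strength
    return sorted(ranking, key=lambda item: item[1], reverse=True)
```

The pairwise loop is quadratic in the number of sgRNAs, which is acceptable for datasets of about a thousand guides.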
Fig. 3

Top 10 most informative features ranked by AUC by dataset. The last panel is the ranking by average AUC aggregating the three datasets

Predictive modeling

To assess the contribution of the 8 different feature classes in predicting sgRNA efficiency, we formed all possible combinations of feature classes (2^8 − 1 = 255 combinations). We adapted the strategy in [5] for constructing and evaluating the predictive model for sgRNA efficiency. To evaluate intra-platform consistency within the same class of genes, we performed 3-way cross validation within dataset 1 (sgRNAs targeting ribosomal genes) from [10]. We randomly split dataset 1 into 3 parts of equal sample size, trained the model on two parts (training set) and evaluated the performance of the resulting predictive model on the remaining part (test set). This process was repeated 3 times by leaving out a different test set, and results were averaged over 10 iterations of random sampling. To evaluate intra-platform consistency across different classes of genes, the predictive algorithm was trained on dataset 1 (ribosomal genes) and tested on dataset 2 (non-ribosomal genes). To evaluate inter-platform consistency, the predictive algorithm was trained on datasets 1 and 2 (ribosomal + non-ribosomal genes) from [10] and tested on dataset 3 (mESC essential genes) from [11]. The elastic net algorithm [19] was used to construct the predictive model on the training set based on 10-fold cross-validation. Since the features we considered in this paper were functions of the nucleotide composition, they were correlated, and the elastic net algorithm automatically selected non-redundant informative features. The objective function of elastic net consists of a loss function plus a penalty that combines the L1 (lasso) and L2 (ridge) norms of the coefficient vector. We evaluated the performance on the test set in terms of AUC. The optimal cutpoints were determined by maximizing the Youden index J = Se + Sp − 1, where Se denotes sensitivity and Sp denotes specificity. The results are shown in Tables 1, 2 and 3.
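As a point of reference, one common way to write an elastic net objective for a binary outcome is the parameterization used by the glmnet software. Whether the authors used exactly this form and notation is an assumption. The symbols β_0, β, α and n are introduced here for illustration, and the λ below is the penalty strength, unrelated to the correlation order λ of the PseKNC model.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}
    \Big[\, y_i\,(\beta_0 + x_i^{T}\beta)
          - \log\!\big(1 + e^{\beta_0 + x_i^{T}\beta}\big) \Big]
  \;+\; \lambda \Big[\, \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^{2}
          + \alpha\,\lVert \beta \rVert_1 \Big],
\qquad \lambda \ge 0,\quad 0 \le \alpha \le 1 .
```

Setting α = 1 gives the lasso and α = 0 gives ridge regression. Intermediate values keep the sparsity of the lasso while tending to retain or drop groups of correlated features together.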
For each test set, we reported these performance measures for the predictive models constructed using each of the 8 feature classes, as well as the combinations of feature classes with the maximum AUC (Comb Feature). Across all comparisons, integrating multiple feature classes showed improvements in terms of AUC compared to position-dependent mono-nucleotide models (PD Mono) in [5]. Among the 8 individual feature classes, position-dependent dinucleotide models (PD Dinuc) consistently outperformed other feature classes in predicting sgRNA efficiency and were close to results from the combination of feature classes models in all 3 scenarios. A similar pattern was also observed in [6, 7], in which they showed that position dependent dinucleotide features yielded the largest average Gini importance among the set of features considered in their dataset [4, 7].
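Choosing the cutpoint that maximizes the Youden index can be sketched as a scan over the observed prediction scores. The function name, the tie-breaking rule (the smallest maximizing cutpoint wins) and the convention of predicting "efficient" when the score is at or above the cutpoint are assumptions of this sketch.

```python
def youden_cutpoint(scores, labels):
    """Return (cutpoint, J, Se, Sp) maximizing J = Se + Sp - 1.

    scores are predicted probabilities or decision values; labels are 0/1
    with 1 = efficient. A guide is called efficient when score >= cutpoint.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cut in sorted(set(scores)):
        tp = sum(s >= cut and y == 1 for s, y in zip(scores, labels))
        tn = sum(s < cut and y == 0 for s, y in zip(scores, labels))
        se, sp = tp / n_pos, tn / n_neg
        j = se + sp - 1
        if best is None or j > best[1]:
            best = (cut, j, se, sp)
    return best
```

As a check on the definition, the PD Mono row of Table 1 reports Se = 0.855 and Sp = 0.680, which gives J = 0.855 + 0.680 − 1 = 0.535, matching the tabulated value.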
Table 1

AUC, Youden index (J), Sensitivity (Se) and Specificity (Sp) from the 3-way cross validation within dataset 1 (ribosomal genes)

Feature class    AUC      J        Se       Sp
PD Mono          0.826    0.535    0.855    0.680
PD Dinuc         0.848    0.575    0.788    0.787
Freq             0.778    0.441    0.677    0.764
Align            0.613    0.188    0.746    0.442
Thermo           0.525    0.086    0.812    0.273
Packer           0.601    0.186    0.634    0.551
PhyChem          0.722    0.380    0.711    0.669
PseKNC           0.731    0.376    0.683    0.693
Comb Feature     0.867    0.618    0.826    0.792

Comb Feature: PD Mono+PD Dinuc+Freq+Thermo+Packer+PhyChem+PseKNC. We reported the average performance from the 3-way cross validation over 10 iterations of random sampling

Table 2

AUC, Youden index (J), Sensitivity (Se) and Specificity (Sp) from intra-platform comparison (training set: ribosomal genes, test set: non-ribosomal genes)

Feature class    AUC      J        Se       Sp
PD Mono          0.785    0.443    0.717    0.726
PD Dinuc         0.792    0.478    0.765    0.713
Freq             0.700    0.332    0.779    0.553
Align            0.594    0.159    0.881    0.278
Thermo           0.616    0.222    0.639    0.580
Packer           0.637    0.207    0.431    0.776
PhyChem          0.659    0.241    0.633    0.608
PseKNC           0.647    0.243    0.694    0.549
Comb Feature     0.806    0.492    0.851    0.641

Comb Feature: PD Mono+PD Dinuc +Thermo+Packer+PhyChem

Table 3

AUC, Youden index (J), Sensitivity (Se) and Specificity (Sp) from inter-platform comparison (training set: ribosomal and non-ribosomal genes, test set: mESC essential genes)

Feature class               AUC      J        Se       Sp
PD Mono                     0.797    0.486    0.751    0.735
PD Dinuc                    0.832    0.544    0.792    0.752
Freq                        0.751    0.382    0.716    0.667
Align                       0.574    0.131    0.490    0.641
Thermo                      0.641    0.261    0.817    0.444
Packer                      0.667    0.241    0.514    0.726
PhyChem                     0.726    0.351    0.718    0.632
PseKNC                      0.733    0.370    0.660    0.709
Comb Feature                0.848    0.566    0.843    0.722
azimuth                     0.795    0.463    0.857    0.607
sgRNA Scorer                0.669    0.288    0.548    0.739
azimuth (retrained)         0.833    0.543    0.787    0.756
sgRNA Scorer (retrained)    0.804    0.474    0.786    0.688

Comb Feature: PD Mono+PD Dinuc+Freq+Align+Thermo+Packer+PhyChem+PseKNC. azimuth and sgRNA Scorer are the results based on the software of [7] and [27], respectively, which were developed using different training datasets. azimuth (retrained) and sgRNA Scorer (retrained) are the results obtained by refitting the algorithms on the current training set (ribosomal and non-ribosomal genes)

We also compared the results using the random forest and boosted regression to construct the predictive model. Random forest [21] was implemented in the R package randomForest, whereas the boosted regression based on extensions to AdaBoost [24] and gradient boosted machine [25] was implemented in the R package gbm. The results were shown in Additional file 1: Tables S1, S2 and S3 (randomforest) and Additional file 1: Tables S4, S5 and S6 (gbm). These results were comparable to the results from elastic net. Related work for predicting CRISPR/Cas9 guide efficiency based on nucleotide properties and melting temperatures includes azimuth [4, 6, 7], which constructed a predictive model based on gradient-boosted regression trees as described earlier. This method was recommended by [26] for in-vivo (U6) transcribed guides.
In contrast, the sgRNA scorer of [27] was a predictive model based on the support vector machine (SVM) algorithm using position dependent mono-nucleotide on 5’ flanking (5 bp), spacer target and 3’ flanking (NGG + 5 bp) region. We included these two methods for comparison in Table 3 and Fig. 4. In this comparison, each method was trained on different datasets, but the performance was evaluated on the same test dataset generated by an independent research group, i.e., [11] dataset. The statistical significance for pairwise AUC comparisons was based on DeLong’s test [28]. Our proposed predictive algorithm achieved higher AUC compared to both azimuth and sgRNA scorer (p<0.001 in both cases). On the other hand, azimuth had better performance than sgRNA scorer (p<0.001). We have also implemented azimuth (based on continuous outcome gbm model) and sgRNA scorer (based on binary outcome SVM model) using the sequence features identified by [6, 7] and [27], respectively on the same training data (i.e., [10] ribosomal and non-ribosomal genes) (Table 3). As expected, the performance of sgRNA scorer was comparable to the model using position dependent mono-nucleotide (Table 3), whereas the performance of azimuth was comparable to the gbm results in Additional file 1: Table S15. Our proposed predictive algorithm achieved higher AUC compared to the refitted sgRNA scorer (p=0.048) and comparable performance to the refitted azimuth (p>0.1).
Fig. 4

ROC curves, with the corresponding AUC values, for our proposed predictive model using combination features (Comb Feature), azimuth and sgRNA Scorer. azimuth and sgRNA Scorer are the results based on the software of [7] and [27], respectively, which were developed using different training datasets. azimuth (retrained) and sgRNA Scorer (retrained) are the results obtained by refitting the algorithms on the current training set (ribosomal and non-ribosomal genes)

We also included a comparison using a regression model based on (1) the average log2 fold change (12 cell doublings vs initial seeding states) of the HL-60 and KBM-7 cell lines for the data of [10] and (2) the average log2 fold change (mESC vs plasmid control) of replicate 1 and replicate 2 of the mouse ESC JM8 cell line for the data of [11]. We compared the predictive performance of the sequence properties in terms of AUC, Pearson correlation coefficient, Spearman rank correlation coefficient and mean squared error on the test data. The results are presented in Additional file 1: Tables S7, S8 and S9. As in the binary outcome model described above, position-dependent dinucleotide models (PD Dinuc) consistently outperformed other feature classes in predicting sgRNA efficiency and were comparable to results from the combination of feature classes models in all 3 scenarios. Fusi et al. [6] and Doench et al. [7] showed that the regression model outperformed the classification model on their dataset [4, 7]. However, we observed that the regression model and the classification model yielded comparable performance on both the [10] and [11] datasets. The combination feature prediction model from the regression model (Comb Feature) exhibited a larger AUC than both azimuth and sgRNA scorer (p<0.001 for all pairwise AUC comparisons using DeLong’s test [28]), but showed no difference in Spearman rank correlation coefficient for Comb Feature versus azimuth (p=0.88 from Fisher’s Z-transformation test [29, 30]), as shown in Additional file 1: Table S9.
The results from random forest and boosted regression are presented in Additional file 1: Tables S10, S11 and S12 (randomforest) and Additional file 1: Tables S13, S14 and S15 (gbm). These results were comparable to the results from elastic net. Following [6, 7], we also included the results from a leave-one-gene-out prediction framework, to assess how well our prediction model generalizes to new genes, in Additional file 1 (Section 5 and Tables S19 and S20). The conclusion remained the same, i.e., Comb Feature yielded the largest AUC and PD Dinuc followed closely. Additional results, including a performance evaluation using a 30 bp sequence [6, 7] instead of a 40 bp sequence, are presented in Additional file 1: Tables S16, S17 and S18. The results indicated that the performance of the prediction models was comparable regardless of whether a 40 bp or 30 bp sequence was used. We created an R package predictSGRNA implementing the proposed predictive algorithm based on the position-dependent dinucleotide model, available at http://www.ams.sunysb.edu/~pfkuan/softwares.html#predictsgrna.

Conclusions

In this paper, we explored various aspects of nucleotide composition, including position-dependent models, secondary structure and thermodynamics, to gain a better understanding of the effect of nucleotide properties on CRISPR sgRNA design efficiency, in a similar way as [6, 7]. Candidate feature ranking in terms of association with sgRNA efficiency identified features which characterize the flexibility of the underlying DNA structure. Specifically, we found that the frequencies of T and of the TT dinucleotide exhibited the strongest negative association with sgRNA efficiency. Packer et al. [14] illustrated that the TT dinucleotide forms the most rigid step and is the least flexible in terms of its ability to slide and shift, which could explain the decreased efficiency of sgRNAs with a higher abundance of TT dinucleotides. The results from the different predictive algorithms showed that, across datasets, the position-dependent mono-nucleotide model [5] achieved good operating characteristics, while the prediction algorithm trained on the position-dependent dinucleotide model offered additional improvement in terms of AUC. The advantage of the position-dependent dinucleotide model in predicting sgRNA efficiency was also observed in [6, 7]. One factor that may guide improvement of future predictive algorithms is chromatin structure. Chromatin accessibility (packed vs unpacked) has been shown to be the major determinant of genome-wide binding of dCas9-sgRNA in [16]. Examples of epigenetic marks which are implicated in chromatin remodeling and accessibility include DNase I hypersensitive sites, transcription factor binding, DNA methylation and histone modification. Future work will include integrating both the nucleotide composition features and chromatin structures as features in the predictive model to characterize the binding efficiency of sgRNA. In this study, we used datasets totaling 3141 sgRNAs and achieved an AUC of > 0.8.
Prior efforts to improve the efficiency of RNAi design utilized high-throughput functional testing of the efficacy of different RNAi sequences to generate large (2182) [31] and very large (∼250000) [32] datasets. These large datasets in turn were used to develop improved prediction algorithms using machine-learning approaches similar to those used here [33, 34]. Although it is generally accepted that the first large test set (2182) was very useful for improving RNAi design, there is still uncertainty regarding the utility of examining very large datasets [34]. Among the unresolved issues are the degree to which different prediction algorithms depend upon the vector used for shRNA expression [35], as well as the sequence context in the genome outside of the immediate target [36]. Therefore, as more CRISPR/Cas9 screen datasets become available, we anticipate that the specificity of sgRNA efficacy prediction can be further improved by considering the vector-dependent level of expression of the sgRNA.
  27 in total

1.  Sequence-dependent DNA structure: dinucleotide conformational maps.

Authors:  M J Packer; M P Dauncey; C A Hunter
Journal:  J Mol Biol       Date:  2000-01-07       Impact factor: 5.469

2.  Design of a genome-wide siRNA library using an artificial neural network.

Authors:  Dieter Huesken; Joerg Lange; Craig Mickanin; Jan Weiler; Fred Asselbergs; Justin Warner; Brian Meloon; Sharon Engel; Avi Rosenberg; Dalia Cohen; Mark Labow; Mischa Reinhardt; François Natt; Jonathan Hall
Journal:  Nat Biotechnol       Date:  2005-07-17       Impact factor: 54.908

3.  The effect of regions flanking target site on siRNA potency.

Authors:  Li Liu; Qian-Zhong Li; Hao Lin; Yong-Chun Zuo
Journal:  Genomics       Date:  2013-07-25       Impact factor: 5.736

4.  Genome-wide binding of the CRISPR endonuclease Cas9 in mammalian cells.

Authors:  Xuebing Wu; David A Scott; Andrea J Kriz; Anthony C Chiu; Patrick D Hsu; Daniel B Dadon; Albert W Cheng; Alexandro E Trevino; Silvana Konermann; Sidi Chen; Rudolf Jaenisch; Feng Zhang; Phillip A Sharp
Journal:  Nat Biotechnol       Date:  2014-04-20       Impact factor: 54.908

5.  Genetic screens in human cells using the CRISPR-Cas9 system.

Authors:  Tim Wang; Jenny J Wei; David M Sabatini; Eric S Lander
Journal:  Science       Date:  2013-12-12       Impact factor: 47.728

6.  iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties.

Authors:  Wei Chen; Hao Lin; Peng-Mian Feng; Chen Ding; Yong-Chun Zuo; Kuo-Chen Chou
Journal:  PLoS One       Date:  2012-10-29       Impact factor: 3.240

7.  Gene selection and classification of microarray data using random forest.

Authors:  Ramón Díaz-Uriarte; Sara Alvarez de Andrés
Journal:  BMC Bioinformatics       Date:  2006-01-06       Impact factor: 3.169

8.  An accurate and interpretable model for siRNA efficacy prediction.

Authors:  Jean-Philippe Vert; Nicolas Foveau; Christian Lajaunie; Yves Vandenbrouck
Journal:  BMC Bioinformatics       Date:  2006-11-30       Impact factor: 3.169

9.  Sequence determinants of improved CRISPR sgRNA design.

Authors:  Han Xu; Tengfei Xiao; Chen-Hao Chen; Wei Li; Clifford A Meyer; Qiu Wu; Di Wu; Le Cong; Feng Zhang; Jun S Liu; Myles Brown; X Shirley Liu
Journal:  Genome Res       Date:  2015-06-10       Impact factor: 9.043

10.  Unraveling CRISPR-Cas9 genome engineering parameters via a library-on-library approach.

Authors:  Raj Chari; Prashant Mali; Mark Moosburner; George M Church
Journal:  Nat Methods       Date:  2015-07-13       Impact factor: 28.547
