Karambir Kaur, Amit Kumar Gupta, Akanksha Rajput, Manoj Kumar.
Abstract
Genome editing using sgRNA, a component of the CRISPR/Cas system, has emerged as a preferred technology in recent years. However, the activity and stability of an sgRNA in genome targeting are greatly influenced by its sequence features. A few prediction tools have been developed to design effective sgRNAs, but these methods have their own limitations. We have therefore developed "ge-CRISPR", which uses high-throughput data to predict and analyze the genome editing efficiency of sgRNAs. Predictive models were developed using SVM for pipeline-1 (classification) and pipeline-2 (regression) from 2090 and 4139 experimentally verified sgRNAs, respectively, from Homo sapiens, Mus musculus, Danio rerio and Xenopus tropicalis. During 10-fold cross validation, pipeline-1 achieved an accuracy of 87.70% and a Matthew's correlation coefficient of 0.75 on the training dataset (T(1840)), and it performed equally well on the independent dataset (V(250)). In pipeline-2, the best models attained Pearson correlation coefficients of 0.68 and 0.69 on the training (T(3619)) and independent (V(520)) datasets, respectively. For a given genomic region, ge-CRISPR (http://bioinfo.imtech.res.in/manojk/gecrispr/) identifies potent sgRNAs and reports their qualitative and quantitative efficiencies along with potential off-targets. It will be useful to the scientific community engaged in CRISPR research and therapeutics development.
Year: 2016 PMID: 27581337 PMCID: PMC5007494 DOI: 10.1038/srep30870
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Performance of pipeline-1 (geCRISPRc) predictive models on sgRNA sequences (T1840) using support vector machine during 10-fold cross validation, and on the independent dataset (V250).
| S.No | sgRNA features | Vector | T1840 Acc | T1840 MCC | T1840 AUC | V250 Acc | V250 MCC | V250 AUC |
|---|---|---|---|---|---|---|---|---|
| 1 | Mononucleotide composition | 4 | 65.6 | 0.32 | 0.70 | 69.2 | 0.39 | 0.73 |
| 2 | Dinucleotide composition | 16 | 66.3 | 0.35 | 0.73 | 69.2 | 0.40 | 0.76 |
| 3 | Trinucleotide composition | 64 | 73.26 | 0.47 | 0.81 | 75.2 | 0.51 | 0.82 |
| 4 | Tetranucleotide composition | 256 | 72.45 | 0.45 | 0.80 | 74 | 0.49 | 0.81 |
| 5 | Pentanucleotide composition | 1024 | 65.82 | 0.33 | 0.71 | 68.8 | 0.39 | 0.73 |
| 6 | 1 + 2 | 20 | 79.57 | 0.6 | 0.86 | 85.2 | 0.71 | 0.92 |
| 7 | 1 + 2 + 3 | 84 | 82.17 | 0.64 | 0.88 | 87.2 | 0.75 | 0.92 |
| 8 | 1 + 2 + 3 + 4 | 340 | 81.85 | 0.64 | 0.88 | 83.6 | 0.69 | 0.92 |
| 9 | 1 + 2 + 3 + 4 + 5 | 1364 | 80.65 | 0.61 | 0.87 | 82.8 | 0.66 | 0.9 |
| 10 | Mononucleotide binary | 80 | 85.54 | 0.71 | 0.92 | 90 | 0.8 | 0.95 |
| 12 | Dinucleotide (2-degree) binary | 288 | 84.57 | 0.70 | 0.90 | 86.4 | 0.73 | 0.93 |
| 13 | Dinucleotide (3-degree) binary | 272 | 83.7 | 0.68 | 0.90 | 86 | 0.72 | 0.93 |
| 14 | 10 + 11 | 384 | 77.23 | 0.58 | 0.78 | 90.4 | 0.81 | 0.95 |
| 15 | 11 + 12 + 13 | 864 | 86.47 | 0.73 | 0.92 | 89.6 | 0.79 | 0.95 |
| 16 | 10 + 11 + 12 + 13 | 944 | 86.96 | 0.74 | 0.93 | 88.8 | 0.78 | 0.95 |
| 17 | Secondary structure | 20 | 53.7 | 0.10 | 0.54 | 58 | 0.16 | 0.59 |
| 18 | Thermodynamic features | 21 | 75.38 | 0.51 | 0.82 | 74.4 | 0.49 | 0.82 |
| 19 | 7 + 16 | 1028 | 86.41 | 0.73 | 0.93 | 91.2 | 0.82 | 0.95 |
| 20 | 7 + 16 + 18 | 1049 | 83.7 | 0.68 | 0.91 | 84.4 | 0.70 | 0.93 |
| 21 | 7 + 16 + 17 | 1048 | 86.25 | 0.73 | 0.93 | 90.4 | 0.81 | 0.95 |
| 22 | 7 + 16 + 17 + 18 | 1069 | 83.48 | 0.67 | 0.91 | 83.6 | 0.67 | 0.92 |
Acc, accuracy; MCC, Matthew’s correlation coefficient; AUC, area under curve.
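The feature sets listed in the table can be illustrated with a short sketch. This is not the authors' code. It assumes 20-nt sgRNAs over the alphabet A, C, G, T, that "composition" means k-mer frequency, that "binary" means a one-hot positional profile, and that a d-degree dinucleotide pairs positions i and i + d. Under these assumptions the vector lengths agree with the "Vector" column (4, 16, 64, 80, 288, 272). The function names and the example sequence are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: uppercase DNA alphabet, no ambiguity codes


def kmer_composition(seq, k):
    """Frequency of each of the 4**k possible k-mers among the k-mers of seq."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1
    for i in range(total):
        counts[seq[i:i + k]] += 1
    return [counts[m] / total for m in kmers]


def mono_binary(seq):
    """One-hot profile: four 0/1 values per position."""
    return [1 if base == a else 0 for base in seq for a in ALPHABET]


def di_binary(seq, degree=1):
    """One-hot profile of the pair (seq[i], seq[i + degree]), 16 values per pair."""
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    vec = []
    for i in range(len(seq) - degree):
        pair = seq[i] + seq[i + degree]
        vec.extend(1 if pair == p else 0 for p in pairs)
    return vec


sgrna = "GACGTTAGCCATGCAATCGG"  # hypothetical 20-nt sequence

# Hybrid "1 + 2 + 3": concatenate mono-, di- and trinucleotide composition.
hybrid_123 = (kmer_composition(sgrna, 1)
              + kmer_composition(sgrna, 2)
              + kmer_composition(sgrna, 3))

print(len(hybrid_123), len(mono_binary(sgrna)), len(di_binary(sgrna, 2)))
```

Concatenating feature lists is the simplest way to form the hybrid rows; any scaling applied before training is not stated in this record and is left out here.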
Figure 1. ROC curves showing the area under the curve for different hybrid features.
In the composition profile, the hybrid of mono-, di- and trinucleotide has an AUC of 0.88 (blue); the dinucleotide binary profile has an AUC of 0.92 (red); and the hybrid of mono-di-trinucleotide composition and dinucleotide binary displays an AUC of 0.93 (green).
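For readers unfamiliar with AUC, the quantity reported in this legend can be computed directly from classifier scores. The sketch below uses the rank interpretation of AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). It is a generic illustration, not the authors' evaluation script; the labels and scores are made-up values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Each (positive, negative) pair contributes 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = effective sgRNA) and decision scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of examples, which is acceptable for datasets of the size described here.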
Performance of pipeline-2 (geCRISPRr) predictive models on training/testing (T3619) and independent (V520) datasets using support vector machine during 10-fold cross validation.
| S.No | sgRNA features | Vector | T3619 PCC | V520 PCC |
|---|---|---|---|---|
| 1 | Mononucleotide composition | 4 | 0.32 | 0.31 |
| 2 | Dinucleotide composition | 16 | 0.33 | 0.31 |
| 3 | Trinucleotide composition | 64 | 0.44 | 0.45 |
| 4 | Tetranucleotide composition | 256 | 0.44 | 0.44 |
| 5 | Pentanucleotide composition | 1024 | 0.34 | 0.36 |
| 6 | 1 + 2 | 20 | 0.52 | 0.48 |
| 7 | 1 + 2 + 3 | 84 | 0.58 | 0.54 |
| 8 | 1 + 2 + 3 + 4 | 340 | 0.58 | 0.56 |
| 9 | 1 + 2 + 3 + 4 + 5 | 1364 | 0.57 | 0.54 |
| 10 | Mononucleotide binary | 80 | 0.65 | 0.67 |
| 11 | Dinucleotide (1-degree) binary | | | |
| 12 | Dinucleotide (2-degree) binary | 288 | 0.59 | 0.60 |
| 13 | Dinucleotide (3-degree) binary | 272 | 0.60 | 0.61 |
| 14 | 10 + 11 | 384 | 0.68 | 0.69 |
| 15 | 11 + 12 + 13 | 864 | 0.65 | 0.65 |
| 16 | 10 + 11 + 12 + 13 | 944 | 0.65 | 0.65 |
| 17 | Secondary structure | 20 | 0.01 | 0.05 |
| 18 | Thermodynamic features | 21 | 0.44 | 0.50 |
| 19 | 7 + 14 | 468 | 0.66 | 0.63 |
| 20 | 7 + 14 + 17 | 488 | 0.12 | 0.60 |
| 21 | 7 + 14 + 18 | 489 | 0.66 | 0.63 |
| 22 | 7 + 14 + 17 + 18 | 509 | 0.12 | 0.60 |
PCC, Pearson correlation coefficient.
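The PCC values in the table measure the linear agreement between observed and predicted editing efficiencies. As a reminder of what is being computed, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The efficiency values are invented for illustration and do not come from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined when one input is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observed vs. predicted efficiencies for four sgRNAs.
observed = [0.10, 0.40, 0.35, 0.80]
predicted = [0.20, 0.30, 0.50, 0.70]
print(round(pearson(observed, predicted), 2))
```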
Figure 2. Two-sample sequence logo depicting the preference for A, T, G and C at the 20 positions of highly effective and least effective sgRNAs.
Figure 3. Workflow of pipeline-2.
(a) sgRNA scanner for extracting sgRNAs from a user-provided genome/gene; (b) output of pipeline-2 depicting the efficiency of each sgRNA; (c) sgRNA profile, displaying the secondary structure and off-targets associated with an individual sgRNA.
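Step (a) of this workflow, scanning a sequence for candidate sgRNAs, can be sketched as follows. The record does not describe how the ge-CRISPR scanner is implemented, so this is only an illustration under stated assumptions: candidates are 20-nt protospacers immediately followed by an NGG PAM (the motif recognized by Streptococcus pyogenes Cas9), and both strands are searched. The function names and the test sequence are hypothetical.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_sgrnas(dna, guide_len=20):
    """Return (strand, start, guide, pam) for each guide followed by an NGG PAM.

    Assumption: NGG PAM. For the '-' strand, start is the 0-based offset
    within the reverse-complemented sequence, not the original coordinate.
    """
    dna = dna.upper()
    # A lookahead lets overlapping candidates be reported.
    pattern = re.compile(rf"(?=([ACGT]{{{guide_len}}})([ACGT]GG))")
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for match in pattern.finditer(seq):
            hits.append((strand, match.start(), match.group(1), match.group(2)))
    return hits


# Hypothetical input: a 20-nt guide followed by a TGG PAM, padded with A's.
region = "AAAAA" + "GACGTTAGCCATGCAATCGG" + "TGG" + "AAAAA"
for hit in scan_sgrnas(region):
    print(hit)
```

Each reported guide could then be passed to a feature encoder and a trained model to obtain an efficiency score; off-target search is a separate step not covered by this sketch.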
Comparison of pipeline-1 (geCRISPRc) with existing sgRNA potency prediction algorithms.
| Algorithm | Dataset Train/Test | Length of sgRNA | Acc | MCC | AUC | References |
|---|---|---|---|---|---|---|
| sgRNA designer | 736 = 368p + 368n | 20 | NA | NA | NA | Doench |
| WU-CRISPR | 736 = 368p + 368n | 20 | NA | NA | 0.91 | Wong |
| Our Study | | | | | | |
| sgRNAScorer_sp | 279 = 133p + 146n | 23 | 73.2 | NA | NA | Chari |
| Our Study | | | | | | |
| sgRNAScorer_st | 171 = 82p + 89n | 27 | 81.5 | NA | NA | Chari |
| Our Study | | | | | | |
Acc, accuracy; MCC, Matthew’s correlation coefficient; AUC, area under curve; NA, not available.
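The Acc and MCC columns both derive from the confusion matrix of a binary classifier. The sketch below shows the standard definitions, with accuracy expressed as a percentage to match the tables. It is a generic illustration with made-up labels, not a reproduction of any comparison in the table.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Percentage of correct predictions."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels: 1 = effective sgRNA, 0 = ineffective sgRNA.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(accuracy(*counts), mcc(*counts))  # prints 75.0 0.5
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are unbalanced.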
Comparison of pipeline-2 (geCRISPRr) with existing sgRNA efficiency prediction algorithms.
| Algorithm | Dataset Train/Test | PCC | Dataset Independent | PCC | Reference |
|---|---|---|---|---|---|
| CRISPRscan | 1020 | 0.45 | NA | 0.58 | Mateos |
| | 1020 | 0.43 | NA | NA | Our study |
PCC, Pearson correlation coefficient; NA, not available.
Figure 4. Diagrammatic representation of ge-CRISPR pipeline development.