Bin Liu, Zhihua Luo, Juan He.
Abstract
Identifying the on-target activity of single-guide RNAs (sgRNAs) is a key step in using the CRISPR-Cas9 system and is critical for both theoretical research (investigation of RNA functions) and real-world applications (genome editing and synthetic biology). Several computational predictors have therefore been proposed to predict sgRNA on-target activity. These methods have clearly advanced the field, but they still have limitations. We propose two new methods, "sgRNA-PSM" and "sgRNA-ExPSM", for sgRNA on-target activity prediction. They capture long-range sequence information and evolutionary information, and they use a new way of reducing the dimension of the feature vector to lower the risk of overfitting. Rigorous leave-one-gene-out cross-validation on a benchmark dataset with 11 human genes and 6 mouse genes, as well as tests on an independent dataset, indicated that the two new methods outperformed other competing methods. To make the proposed sgRNA-PSM predictor easier to use, we have established a corresponding web server, which is available at http://bliulab.net/sgRNA-PSM/.
Keywords: XGBoost; position-specific mismatch; sgRNAs on-target activity
Year: 2020 PMID: 32199128 PMCID: PMC7083770 DOI: 10.1016/j.omtn.2020.01.029
Source DB: PubMed Journal: Mol Ther Nucleic Acids ISSN: 2162-2531 Impact factor: 8.886
Figure 1. Graph Showing AUC Scores of the sgRNA-PSM Predictors with Different n Values, where n Denotes the Number of Selected Features
The 10 Most Important Features in the sgRNA-PSM Predictor
| No. | PSM Feature | Sequence Position | F_score |
|---|---|---|---|
| 1 | *G*GG | 23–27 | 185.6 |
| 2 | G*GG* | 24–28 | 185.6 |
| 3 | C*G*G | 24–28 | 136.2 |
| 4 | C**GG | 24–28 | 136.2 |
| 5 | *C*GG | 23–27 | 129.0 |
| 6 | C*GG* | 24–28 | 129.0 |
| 7 | **GGG | 24–28 | 128.0 |
| 8 | *GGG* | 25–29 | 128.0 |
| 9 | GGG** | 26–30 | 128.0 |
| 10 | **TTC | 20–24 | 113.0 |
Parameters were k = 5, m = 2.
The sequence position of mismatches.
Calculated by F regression.
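The F_score column is described only as "calculated by F regression". Assuming this means the standard univariate F-test between one binary feature column and the measured activity (the statistic that, for example, scikit-learn's `f_regression` reports), a minimal sketch looks like the following. The function name `f_score` and the toy data are illustrative and do not come from the paper.

```python
def f_score(feature, target):
    """Univariate F statistic for one feature against a continuous target.

    Assumption: F = r^2 * (n - 2) / (1 - r^2), where r is the Pearson
    correlation between the feature column and the target values.
    """
    n = len(feature)
    mean_x = sum(feature) / n
    mean_y = sum(target) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(feature, target))
    sxx = sum((x - mean_x) ** 2 for x in feature)
    syy = sum((y - mean_y) ** 2 for y in target)
    r2 = sxy ** 2 / (sxx * syy)
    return r2 * (n - 2) / (1 - r2)


# Toy data (hypothetical): 1 means the PSM pattern occurs in the sequence.
pattern_present = [0, 0, 1, 1]
activity = [0.1, 0.2, 0.8, 0.9]
score = f_score(pattern_present, activity)
```

Under this reading, two features that impose the same constraint on a sequence receive the same score. If the position range gives the span of the 5-mer window, rows 1 and 2 of the table both require G at positions 24, 26, and 27, which would explain their identical F_score of 185.6.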
List of AUC Scores Obtained by Various Methods via the Leave-One-Gene-Out Cross-Validation on the Same Benchmark Dataset S (cf. Equation 3)
| Methods | AUC (%) |
|---|---|
| Azimuth | 71.9 |
| ge-CRISPR | 71.7 |
| CRISPRpred | 71.6 |
| sgRNA-PSM | 73.8 |
| sgRNA-ExPSM | 74.4 |
AUC is the area under the ROC curve; a larger AUC indicates a better predictor.
Results obtained by in-house implementation from Doench et al.
Results obtained by in-house implementation from Kaur et al.
Results obtained by in-house implementation from Rahman and Rahman.
For the proposed predictor in this article, see Equations 9 and 10 with k = 5, m = 2, = 3, R = 0.1, F = 800.
For the proposed predictor in this article, see Equations 10 and 11 with k = 5, m = 2, = 3, R = 0.1, F = 800.
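The evaluation protocol behind this table can be sketched as follows, under stated assumptions: each sgRNA carries a gene label, every fold holds out all sgRNAs of one gene, and the AUC is computed from the held-out predictions. The helper names are hypothetical, and the source does not say whether per-gene AUCs are averaged or predictions are pooled before scoring, so the sketch only shows the split and the AUC computation.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking identity.

    Counts the fraction of (positive, negative) pairs in which the positive
    example gets the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def leave_one_gene_out(genes):
    """Yield (held_out_gene, train_indices, test_indices) for each gene."""
    for gene in sorted(set(genes)):
        test = [i for i, g in enumerate(genes) if g == gene]
        train = [i for i, g in enumerate(genes) if g != gene]
        yield gene, train, test


# Toy gene labels (hypothetical), one per sgRNA.
genes = ["A", "A", "B", "C"]
folds = list(leave_one_gene_out(genes))
```

Splitting by gene rather than by individual sgRNA keeps guides that target the same gene out of the training set of the fold that tests them.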
Figure 2. Graph Showing the Predictive Quality of the Aforementioned Predictors via the ROC Curves
Under leave-one-gene-out cross-validation on the same benchmark dataset S, the corresponding AUC scores are 0.717, 0.716, 0.719, 0.738, and 0.744 for the ge-CRISPR, CRISPRpred, Azimuth, sgRNA-PSM, and sgRNA-ExPSM predictors, respectively.
List of the AUC Scores Obtained by Various Methods on the On-Target Dataset Reported in Chuai et al.
| Cell Type | Methods | AUC (%) |
|---|---|---|
| hct116 | Azimuth | 74.1 |
| hct116 | DeepCRISPR (pt+aug CNN) | 87.4 |
| hct116 | sgRNA-PSM | 91.7 |
| hct116 | Retrained sgRNA-PSM | 74.0 |
| hela | Azimuth | 67.5 |
| hela | DeepCRISPR (pt+aug CNN) | 78.2 |
| hela | sgRNA-PSM | 82.8 |
| hela | Retrained sgRNA-PSM | 72.1 |
| hl60 | Azimuth | 79.2 |
| hl60 | DeepCRISPR (pt+aug CNN) | 73.9 |
| hl60 | sgRNA-PSM | 77.6 |
| hl60 | Retrained sgRNA-PSM | 83.7 |
The cell type of the independent test dataset.
Results reported in Chuai et al.
The sgRNA-PSM predictor trained with the dataset reported in Chuai et al.; see Equations 9 and 10 with k = 4, m = 2, = 9, R = 0.05, F = 2,300.
The sgRNA-PSM predictor trained with each of the three datasets (hct116, hela, and hl60).
Figure 3. Graphic of the Homepage of the Web Server http://bliulab.net/sgRNA-PSM/
Figure 4. Schematic Diagram Illustrating How to Generate the PSM Vector for a DNA Sequence
(A) Example of PSM with parameters of k = 2, m = 1. (B) Example of PSM with parameters of k = 3, m = 1.
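The construction illustrated in Figure 4 can be sketched roughly as follows, assuming that a PSM feature is a k-mer window at a given start position in which m of the k positions are masked (written `*`, as in the feature table), and that the vector is a binary occurrence indicator over all (position, pattern) pairs. The binary encoding, the 0-based start index, and the function names are assumptions made for illustration; the paper's Equation 9 is not reproduced in this record.

```python
from itertools import combinations, product

ALPHABET = "ACGT"


def psm_feature_names(seq_len, k, m):
    """List every (start, pattern) pair; '*' marks a masked position."""
    names = []
    for start in range(seq_len - k + 1):
        for masked in combinations(range(k), m):
            kept = [i for i in range(k) if i not in masked]
            for letters in product(ALPHABET, repeat=k - m):
                pattern = ["*"] * k
                for i, ch in zip(kept, letters):
                    pattern[i] = ch
                names.append((start, "".join(pattern)))
    return names


def psm_vector(seq, k, m):
    """Return (names, vector), with a 1 for each masked k-mer found in seq."""
    names = psm_feature_names(len(seq), k, m)
    index = {name: j for j, name in enumerate(names)}
    vector = [0] * len(names)
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        for masked in combinations(range(k), m):
            pattern = "".join(
                "*" if i in masked else ch for i, ch in enumerate(window)
            )
            vector[index[(start, pattern)]] = 1
    return names, vector


# Toy sequence (hypothetical) with the parameters of panel (A): k = 2, m = 1.
names, vector = psm_vector("ACGT", k=2, m=1)
```

For the toy sequence `ACGT`, the first window `AC` switches on the two features `*C` and `A*` at start position 0.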
Comparison between the PS Feature Vector's Dimension (cf. Equation 8) and the PSM Feature Vector's Dimension (cf. Equation 9)
| k | Dimension of PS Vector | m | Dimension of PSM Vector | Ratio γ |
|---|---|---|---|---|
| 2 | 464 | 1 | 232 | ∼2 |
| 3 | 1,792 | 1 | 1,344 | ∼1.3 |
| | | 2 | 336 | ∼5.3 |
| 4 | 6,912 | 1 | 6,912 | 1 |
| | | 2 | 2,592 | ∼2.7 |
| | | 3 | 432 | ∼16 |
| 5 | 26,624 | 2 | 16,640 | ∼1.6 |
| | | 3 | 4,160 | ∼6.4 |
| | | 4 | 520 | ∼51.2 |
| 6 | 102,400 | 4 | 6,000 | ∼17.07 |
| | | 5 | 600 | ∼170.67 |
Calculated by Equation 8.
Calculated by Equation 9.
Ratio of the value in column 2 to the value in column 4; it equals $4^m / \binom{k}{m}$, where m is given in column 3.
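The table entries can be reproduced with a short calculation. Equations 8 and 9 are not shown in this record, so the closed forms below are inferred from the tabulated values rather than quoted: they match every row when the input sequence is 30 nucleotides long, which is itself an assumption (it agrees with the largest position, 30, in the feature table).

```python
from math import comb

SEQ_LEN = 30  # assumption: sequence length inferred from the table values


def ps_dim(k, seq_len=SEQ_LEN):
    """Inferred PS dimension: 4^k possible k-mers at each window position."""
    return 4 ** k * (seq_len - k + 1)


def psm_dim(k, m, seq_len=SEQ_LEN):
    """Inferred PSM dimension: choose m masked positions, 4^(k-m) fillings."""
    return comb(k, m) * 4 ** (k - m) * (seq_len - k + 1)


def ratio(k, m):
    """Ratio of PS to PSM dimension; the window count cancels out."""
    return 4 ** m / comb(k, m)


# Recompute the rows of the table as (k, m, PS dim, PSM dim, ratio).
rows = [
    (k, m, ps_dim(k), psm_dim(k, m), round(ratio(k, m), 2))
    for k, m in [(2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3),
                 (5, 2), (5, 3), (5, 4), (6, 4), (6, 5)]
]
```

With k = 5 and m = 2, the setting used for the benchmark predictor, this gives 16,640 PSM features against 26,624 PS features, before the further selection of F features mentioned in the table footnotes.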