Guishan Zhang, Zhiming Dai, Xianhua Dai.
Abstract
Accurate prediction of guide RNA (gRNA) on-target efficacy is critical for effective application of the CRISPR/Cas9 system. Although several machine learning-based and convolutional neural network (CNN)-based methods have been proposed, prediction accuracy still leaves room for improvement. Here, we first improve the architectures of current CNNs for predicting gRNA on-target efficacy. Second, we propose a novel hybrid system that combines our improved CNN with support vector regression (SVR). This CNN-SVR system has two major components: a merged CNN as the front end, which extracts gRNA features, and an SVR as the back end, which performs regression and predicts gRNA cleavage efficiency. We demonstrate that CNN-SVR can effectively exploit feature interactions from feed-forward directions to learn deeper features of gRNAs and their corresponding epigenetic features. Experiments on commonly used datasets show that our CNN-SVR system outperforms available state-of-the-art methods in terms of prediction accuracy, generalization, and robustness. Source code is available at https://github.com/Peppags/CNN-SVR.
Keywords: CRISPR/Cas9; convolutional neural network; guide RNA; on-target; support vector regression
Year: 2020; PMID: 31969902; PMCID: PMC6960259; DOI: 10.3389/fgene.2019.01303
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
Figure 1. An illustration of the procedure for cell line-specific gRNA on-target activity prediction based on CNN-SVR. Here, $[f_1, f_2, \ldots, f_n]$ is a subset of $[f_1, f_2, \ldots, f_{80}]$.
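For orientation, the SVR back end can be written in its standard ε-insensitive form. This is the textbook formulation, not a statement of the authors' exact configuration. As assumptions for illustration, $\mathbf{z}_i$ denotes the selected CNN feature vector $[f_1, f_2, \ldots, f_n]$ of the $i$-th gRNA, $y_i$ its measured efficacy, $m$ the number of training gRNAs, and $\phi$ the feature map induced by the chosen kernel. The hyperparameters $C$ and $\varepsilon$ are not given in this record.

$$
\begin{aligned}
\min_{\mathbf{w},\, b,\, \xi,\, \xi^*} \quad & \frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{m} \left( \xi_i + \xi_i^* \right) \\
\text{s.t.} \quad & y_i - \mathbf{w}^\top \phi(\mathbf{z}_i) - b \le \varepsilon + \xi_i, \\
& \mathbf{w}^\top \phi(\mathbf{z}_i) + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i \ge 0, \quad \xi_i^* \ge 0, \qquad i = 1, \ldots, m.
\end{aligned}
$$

Predictions whose error is within $\varepsilon$ incur no penalty, so only gRNAs outside that tube contribute slack and become support vectors.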
Performance comparison between CNN-SVR and CNN models for gRNA activity prediction on four cell-line datasets under 10 repetitions of 10-fold cross-validation.
| Cell line | Spearman (CNN-SVR) | Spearman (CNN) | AUROC (CNN-SVR) | AUROC (CNN) |
|---|---|---|---|---|
| HCT116 | | 0.661 ± 0.030 | | 0.932 ± 0.001 |
| HEK293T | | 0.725 ± 0.029 | | 0.972 ± 0.001 |
| HELA | | 0.699 ± 0.006 | | 0.916 ± 0.001 |
| HL60 | | 0.576 ± 0.040 | | 0.914 ± 0.003 |
Performance is shown as mean ± standard deviation. The best performance under cross-validation is highlighted in bold. Both conventions recur in the tables that follow.
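The two metrics and the "mean ± standard deviation" summary can be sketched as follows. This is a minimal standard-library illustration, not the authors' evaluation code. The function names are hypothetical, AUROC is assumed to be computed on efficacy labels binarized to 0/1, and the sample standard deviation is assumed for the spread across folds.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity; labels are 0 or 1."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, lab in zip(ranks, labels) if lab == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def summarize(fold_scores):
    """Format per-fold scores as 'mean ± standard deviation' (sample std)."""
    return f"{mean(fold_scores):.3f} ± {stdev(fold_scores):.3f}"
```

Calling `summarize` on the list of per-fold Spearman or AUROC values gives one table cell per model and cell line.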
Performance comparison between CNN-SVR and CNNs combined with other regression models for gRNA activity prediction on four cell-line datasets under 10 repetitions of 10-fold cross-validation.
| Metric | Model | HCT116 | HEK293T | HELA | HL60 |
|---|---|---|---|---|---|
| Spearman | CNN-SVR | | | | |
| Spearman | CNN-L1 | 0.712 ± 0.010 | 0.793 ± 0.004 | 0.633 ± 0.020 | 0.542 ± 0.033 |
| Spearman | CNN-L2 | 0.670 ± 0.025 | 0.731 ± 0.032 | 0.683 ± 0.009 | 0.517 ± 0.034 |
| Spearman | CNN-L1L2 | 0.701 ± 0.008 | 0.803 ± 0.012 | 0.682 ± 0.005 | 0.589 ± 0.018 |
| AUROC | CNN-SVR | | | | |
| AUROC | CNN-L1 | 0.931 ± 0.001 | 0.982 ± 0.001 | 0.924 ± 0.002 | 0.930 ± 0.003 |
| AUROC | CNN-L2 | 0.919 ± 0.002 | 0.975 ± 0.002 | 0.923 ± 0.002 | 0.895 ± 0.008 |
| AUROC | CNN-L1L2 | 0.918 ± 0.003 | 0.977 ± 0.001 | 0.915 ± 0.002 | 0.912 ± 0.004 |
From top to bottom, the rows report the Spearman correlation and then the AUROC of CNN-SVR and the three CNN-based regression methods.
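The cross-validation protocol named in the captions (10-fold, repeated 10 times) can be sketched with the standard library alone. The function name `repeated_kfold`, the seed, and the interleaved fold assignment are assumptions for illustration; the record does not say how the authors shuffled or stratified the data.

```python
import random


def repeated_kfold(n_samples, n_splits=10, n_repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repetition
        # Fold f takes every n_splits-th shuffled index, so fold sizes
        # differ by at most one sample.
        folds = [indices[f::n_splits] for f in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            train_idx = [
                i for g, other in enumerate(folds) if g != fold for i in other
            ]
            yield repeat, fold, train_idx, test_idx
```

With the defaults this produces 100 train/test splits per dataset. Whether the reported standard deviation is taken over all 100 folds or over the 10 repetition means is not stated here.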
AUROC of different deep learning-based methods when considering the gRNA sequence only and when incorporating both gRNA sequence and epigenetic features.
| Input | Model | HCT116 | HEK293T | HELA | HL60 | Average |
|---|---|---|---|---|---|---|
| Sequence only | CNN-SVR | | | | | |
| Sequence only | DeepCRISPR | 0.887 | 0.474 | 0.788 | 0.584 | 0.683 |
| Sequence only | Seq-deepCpf1 | 0.931 | 0.925 | 0.920 | 0.938 | |
| Sequence + epigenetic | CNN-SVR | | | | | |
| Sequence + epigenetic | DeepCRISPR | 0.919 | 0.506 | 0.820 | 0.643 | 0.722 |
Figure 2. Performance comparison of CNN-SVR and other prediction models on various testing cell line data.
Figure 3. Performance comparison of CNN-SVR and other prediction models on various testing cell line data with a leave-one-cell-out procedure.
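A leave-one-cell-out procedure holds out one cell line for testing and trains on the remaining ones. The sketch below only enumerates those splits for the four cell lines named in the tables. The function name is hypothetical, and how the authors pooled the training cell lines is not described in this record.

```python
def leave_one_cell_out(cell_lines):
    """Yield (training cell lines, held-out cell line) pairs."""
    for held_out in cell_lines:
        training = [c for c in cell_lines if c != held_out]
        yield training, held_out


CELL_LINES = ["HCT116", "HEK293T", "HELA", "HL60"]

for training, held_out in leave_one_cell_out(CELL_LINES):
    print(f"train on {', '.join(training)}; test on {held_out}")
```

Because the held-out cell line contributes no training data, this setting probes generalization across cell lines more directly than a random independent test split.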
Differences in Spearman correlation and AUROC between the independent test and the leave-one-cell-out procedure for CNN-SVR and DeepCRISPR.
| Metric | Model | HCT116 | HEK293T | HELA | HL60 |
|---|---|---|---|---|---|
| Spearman | CNN-SVR | -0.015 | | | |
| Spearman | DeepCRISPR | -0.107 | 0.805 | -0.043 | |
| AUROC | CNN-SVR | | | | |
| AUROC | DeepCRISPR | -0.045 | 0.455 | -0.038 | 0.096 |
Figure 4. Spearman correlation between different deep learning-based models and datasets. Models considering (A) gRNA sequence composition only and (B) both gRNA sequence and epigenetic information.
Figure 5. (A) Visualization of the importance of different nucleotides and epigenetic features at different positions for our model trained on the benchmark dataset. The colors represent the contribution of each position-specific nucleotide to determining an efficient gRNA. The x-axis shows the position of the nucleotide in the sequence. The y-axis lists all possible nucleotides. (B) Preference of nucleotide sequences that affect CRISPR/Cas9 gRNA activity.