Zhong-Rui Zhang, Zhen-Ran Jiang.
Abstract
The CRISPR/Cas9 gene-editing system is a third-generation gene-editing technology that is widely used in biomedical applications. However, off-target effects of the CRISPR/Cas9 system remain a challenging problem in practical applications. Although many models have been developed to predict off-target activity, current models do not use sequence pair information effectively, so there is still room to improve accuracy. This study aims to use sequence pair information effectively to improve the prediction of off-target activity. We propose a new coding scheme for sequence pairs and design a new model, CRISPR-IP, for predicting off-target activity. Our coding scheme distinguishes regions with different functions in the sequence pairs through the function channel, and it distinguishes between bases and base pairs using type channels, so that the sequence pair information is represented effectively. The CRISPR-IP model uses CNN, BiLSTM, and attention layers to learn features of sequence pairs. We verified performance on two data sets and found that our coding scheme represents sequence pair information effectively and that CRISPR-IP performs better than the other models compared. Data and source code are available at https://github.com/BioinfoVirgo/CRISPR-IP.
Keywords: A, Adenine; BiLSTM, Bi-directional Long-Short Term Memory; C, Cytosine; CFD, Cutting frequency determination; CNN, Convolutional Neural Networks; CRISPR-Cas9; CRISPR-IP, CRISPR model based on Identity and Position; CRISPR/Cas9, Clustered Regularly Interspaced Short Palindromic Repeats / CRISPR associated protein 9; DNN, Dense Neural Networks; Deep learning; Encoding scheme; G, Guanine; GRU, Gated Recurrent Unit; LOGOCV, Leave-one-gRNA-out cross-validation; LSTM, Long-Short Term Memory; Off-target prediction; PAM, Protospacer adjacent motif; PR-AUC, Area Under the Precision-Recall Curve; RNN, Recurrent Neural Networks; ROC-AUC, Area Under the Receiver Operating Characteristic Curve; T, Thymine; U, Uracil; gRNA, Guide RNA
Year: 2022 PMID: 35140885 PMCID: PMC8804193 DOI: 10.1016/j.csbj.2022.01.006
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Three cases of off-target types.
Network layer types used by each model.
| Model | Convolution | Recurrent | Attention | Dense |
|---|---|---|---|---|
| CRISPR-Net | Yes | Yes | No | Yes |
| CRISPR-OFFT | Yes | No | Yes | Yes |
| AttnToMismatch_CNN | Yes | No | Yes | Yes |
| CNN_std | Yes | No | No | Yes |
| DeepCRISPR | Yes | No | No | Yes |
Notes: 'Yes' means that the model uses this kind of network layer, and 'No' means that it does not.
Performance for each predictive model on the CIRCLE-seq data set.
| Metric | CRISPR-IP | FNN3 | FNN5 | FNN10 | CNN3 | CNN5 | LSTM | GRU | Encoding |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.989 | 0.962 | 0.955 | 0.988 | 0.988 | 0.984 | encoding scheme 1 | ||
| Accuracy | 0.981 | 0.976 | encoding scheme 2 | ||||||
| F1 score | 0.375 | 0.004 | 0.005 | encoding scheme 1 | |||||
| F1 score | 0.138 | 0.096 | 0.090 | 0.171 | 0.240 | encoding scheme 2 | |||
| PR-AUC | 0.483 | 0.230 | 0.265 | encoding scheme 1 | |||||
| PR-AUC | 0.103 | 0.065 | 0.069 | 0.242 | 0.226 | encoding scheme 2 | |||
| Precision | 0.666 | 0.057 | 0.240 | 0.396 | 0.401 | encoding scheme 1 | |||
| Precision | 0.240 | 0.146 | 0.190 | encoding scheme 2 | |||||
| ROC-AUC | 0.961 | 0.770 | 0.873 | encoding scheme 1 | |||||
| ROC-AUC | 0.856 | 0.812 | 0.937 | 0.929 | 0.901 | encoding scheme 2 | |||
| Recall | 0.295 | 0.002 | 0.002 | encoding scheme 1 | |||||
| Recall | 0.123 | 0.113 | 0.099 | 0.161 | 0.308 | encoding scheme 2 |
Notes: Better results are indicated in bold. Encoding scheme 1 was proposed by Lin et al.; encoding scheme 2 is the one proposed in this study.
Performance for each predictive model on the CIRCLE-seq data set.
| Metric | CRISPR-IP | FNN3 | FNN5 | FNN10 | CNN3 | CNN5 | LSTM | GRU | Encoding |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.990 | 0.670 | 0.796 | 0.955 | 0.982 | 0.982 | 0.988 | 0.987 | encoding scheme 1 |
| Accuracy | | | | | | | | | encoding scheme 2 |
| F1 score | 0.621 | 0.144 | 0.133 | 0.364 | 0.222 | 0.255 | 0.560 | 0.518 | encoding scheme 1 |
| F1 score | | | | | | | | | encoding scheme 2 |
| PR-AUC | 0.695 | 0.209 | 0.082 | 0.302 | 0.316 | 0.319 | 0.665 | 0.616 | encoding scheme 1 |
| PR-AUC | | | | | | | | | encoding scheme 2 |
| Precision | 0.117 | 0.087 | 0.325 | 0.691 | 0.672 | 0.669 | 0.688 | encoding scheme 1 | |
| Precision | 0.791 | encoding scheme 2 | |||||||
| ROC-AUC | 0.973 | 0.767 | 0.775 | 0.855 | 0.891 | 0.885 | 0.970 | 0.965 | encoding scheme 1 |
| ROC-AUC | | | | | | | | | encoding scheme 2 |
| Recall | 0.526 | 0.194 | 0.211 | 0.569 | 0.504 | encoding scheme 1 | |||
| Recall | 0.473 | 0.583 | 0.555 | encoding scheme 2 |
Notes: Better results are indicated in bold. Encoding scheme 1 was proposed by Lin et al.; encoding scheme 2 is the one proposed in this study.
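The tables above report accuracy close to 0.99 alongside much lower F1 scores. The following minimal sketch shows how the threshold-based metrics are computed from confusion-matrix counts and why this pattern is expected when off-target pairs are rare. The function name and the counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical, strongly imbalanced example:
# 1,000 true off-target pairs among 100,000 candidate pairs.
example = classification_metrics(tp=300, fp=150, tn=98850, fn=700)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because true negatives dominate the total, accuracy stays high even when most off-target pairs are missed, which is why PR-AUC and F1 are the more informative columns for comparing models on such data.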
Fig. 2. Performance evaluation of CRISPR-IP and CRISPR-Net. (a) Results on the CIRCLE-seq data set; (b) results on the SITE-seq data set.
Fig. 3. Performance evaluation of CRISPR-IP and other models on the SITE-seq data set.
Fig. 4. Results of ablation experiments. (a) Results on the CIRCLE-seq data set; (b) results on the SITE-seq data set.
Fig. 5. Results of models on Dataset_C and Dataset_NC. (a) and (b) show the results on Dataset_NC; (c) and (d) show the results on Dataset_C. Models named Model_Name_C were trained on Dataset_C, and models named Model_Name_NC were trained on Dataset_NC.
The results of TopN.
| Metric | Model | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 |
|---|---|---|---|---|---|---|---|---|
| NOT | CRISPR_IP_NC | 189 | 369 | 545 | 733 | 885 | 1036 | 1204 |
| NOT | CRISPR_IP_C | 768 | 1329 | 1882 | 2351 | 2759 | 3109 | 3408 |
| NB | CRISPR_IP_NC | 880 | 1708 | 2536 | 3323 | 4118 | 4893 | 5633 |
| NB | CRISPR_IP_C | 40 | 101 | 136 | 206 | 274 | 350 | 441 |
| MPS | CRISPR_IP_NC | 1.000 | 0.999 | 0.998 | 0.995 | 0.990 | 0.983 | 0.974 |
| MPS | CRISPR_IP_C | 0.968 | 0.891 | 0.799 | 0.721 | 0.658 | 0.606 | 0.563 |
Note: NOT: Number of off-target sequence pairs. NB: Number of sequence pairs with bulges. MPS: Mean of the predicted scores.
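The three quantities in the table above can be sketched as follows, assuming that "TopN" refers to the N sequence pairs with the highest predicted scores (the record does not define it explicitly). The record layout and the function name are hypothetical.

```python
def top_n_summary(records, n):
    """Summarize the n highest-scoring sequence pairs.

    records: iterable of (predicted_score, is_off_target, has_bulge) tuples.
    Assumption: TopN means the n pairs ranked highest by predicted score.
    """
    top = sorted(records, key=lambda r: r[0], reverse=True)[:n]
    if not top:
        return {"NOT": 0, "NB": 0, "MPS": 0.0}
    return {
        "NOT": sum(1 for _, is_off, _ in top if is_off),      # off-target pairs
        "NB": sum(1 for _, _, bulge in top if bulge),         # pairs with bulges
        "MPS": sum(score for score, _, _ in top) / len(top),  # mean predicted score
    }


# Toy data, not taken from the paper.
toy = [
    (0.9, True, False),
    (0.8, False, True),
    (0.7, True, True),
    (0.2, False, False),
]
print(top_n_summary(toy, 2))
print(top_n_summary(toy, 3))
```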
Fig. 6. Results of CRISPR-IP, CRISPR-Net, CRISPR-OFFT and CNN_std on the K562 data set.
Fig. 7. Performance evaluation of no processing and two resampling methods on the CRISPR-IP model. (a) Results on the CIRCLE-seq data set; (b) results on the SITE-seq data set.
Fig. 8. Representation of the gRNA-DNA pair.
Fig. 9. An example of gRNA-DNA pair coding.
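To make the idea of type channels and a function channel concrete, here is a simplified sketch of one way an aligned gRNA-DNA pair could be encoded. The channel layout, the 3-nt PAM length, and the function name are assumptions made for illustration; this is not the exact scheme used by CRISPR-IP, which is defined in the paper and in the linked repository.

```python
SYMBOLS = "ACGT-"  # '-' marks a bulge (gap) position in the alignment


def encode_pair(grna, dna, pam_len=3):
    """Encode an aligned gRNA-DNA pair as one 6-element vector per position.

    Assumed (hypothetical) layout:
      channels 0-4: type channels, the OR of the one-hot codes of both symbols,
                    so a match sets one bit and a mismatch or bulge sets two
      channel 5   : function channel, 1 inside the PAM region
                    (taken here to be the last pam_len positions)
    """
    grna = grna.upper().replace("U", "T")  # treat uracil as thymine
    dna = dna.upper()
    if len(grna) != len(dna):
        raise ValueError("sequences must be aligned to the same length")
    encoded = []
    for i, (g, d) in enumerate(zip(grna, dna)):
        if g not in SYMBOLS or d not in SYMBOLS:
            raise ValueError(f"unsupported symbol at position {i}")
        vec = [0] * 6
        vec[SYMBOLS.index(g)] = 1
        vec[SYMBOLS.index(d)] = 1
        vec[5] = int(i >= len(grna) - pam_len)
        encoded.append(vec)
    return encoded


# Toy 7-position pair with one mismatch (G/T) and one bulge (-/A).
for row in encode_pair("ACG-TGG", "ACTATGG"):
    print(row)
```

In this simplified layout a G/T mismatch and a T/G mismatch produce the same vector; a fuller scheme would need extra channels to record which strand carries which base.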
Fig. 10. Architecture of CRISPR-IP.
Taxonomy of the models used in coding schemes experiments and their respective architectures.
| Name | Type | Architecture |
|---|---|---|
| DNN3 | DNN | 3 dense layers |
| DNN5 | DNN | 5 dense layers |
| DNN10 | DNN | 10 dense layers |
| CNN2 | CNN | 1 convolutional layer, 1 dense layer |
| CNN3 | CNN | 2 convolutional layers, 1 dense layer |
| LSTM | RNN | 1 LSTM layer, 2 dense layers |
| GRU | RNN | 1 GRU layer, 2 dense layers |