Lei Wang, Juhua Zhang.
Abstract
BACKGROUND: One of the main challenges for the CRISPR-Cas9 system is selecting optimal single-guide RNAs (sgRNAs). Recently, deep learning has improved sgRNA prediction in eukaryotes. However, prokaryotic chromatin structure differs from that of eukaryotes, so models trained on eukaryotic data may not apply to prokaryotes.
Keywords: CRISPR-Cas9; Deep learning; On-target activity; Prokaryotes
Year: 2019 PMID: 31651233 PMCID: PMC6814057 DOI: 10.1186/s12859-019-3151-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of samples and range of on-target activity value in Set 1 and Set 2
| Descriptions | Cas9, Set 1 | Cas9, Set 2 | eSpCas9, Set 1 | eSpCas9, Set 2 | Cas9 (△recA), Set 1 | Cas9 (△recA), Set 2 |
|---|---|---|---|---|---|---|
| Size | 44,163 | 40,605 | 45,071 | 41,426 | 48,112 | 43,950 |
| Min | 0.0016 | 0.0016 | 0.0007 | 0.0007 | 0.0080 | 0.0080 |
| Max | 48.3807 | 48.3807 | 45.1725 | 45.1725 | 22.0268 | 22.0268 |
| Mean | 24.6415 | 24.6381 | 16.9593 | 16.9825 | 12.4479 | 12.4518 |
Average Spearman correlation coefficients under 5-fold cross-validation for several network architectures
| Networks | Cas9, Set 1 | Cas9, Set 2 | eSpCas9, Set 1 | eSpCas9, Set 2 | Cas9 (△recA), Set 1 | Cas9 (△recA), Set 2 |
|---|---|---|---|---|---|---|
| DeepCRISPR | 0.5149 | 0.5139 | 0.6617 | 0.6631 | 0.3108 | 0.3049 |
| CNN_Lin | 0.5217 | 0.5214 | 0.6665 | 0.6685 | 0.3176 | 0.3144 |
| DeepCas9 | 0.5554 | 0.5517 | 0.6951 | 0.6881 | 0.3400 | 0.3362 |
| CNN_5layers | 0.5817 | 0.5787 | 0.7105 | 0.7063 | 0.3602 | 0.3577 |
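The scores in this table are Spearman correlation coefficients, that is, Pearson correlations computed on ranks. As a minimal sketch of how such a score can be computed (the function names `average_ranks`, `pearson`, and `spearman` are illustrative and not taken from the paper; tie handling by average rank is an assumption):

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(predicted, measured):
    """Spearman correlation between predicted and measured activities."""
    return pearson(average_ranks(predicted), average_ranks(measured))
```

Because only ranks matter, a model can score well here even if its predictions are on a different scale from the measured activities.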
Fig. 1 Components of the constructed convolution layers and overall CNN_5layers construction. a The top picture shows the details of the constructed convolution layer, which applies a convolution operation, batch normalization, and a leaky rectified linear unit (LeakyReLU) in turn. The convolution kernel size is 3 × 1 and the number of output channels is 120. The simplified diagram is on the bottom. b The picture shows the overall CNN_5layers schema, which includes five convolution layers, five maximum pooling layers, and two fully-connected layers. All activation functions in CNN_5layers are LeakyReLU. A dropout layer with a rate of 30 percent follows each pooling operation and the first fully-connected layer
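The layer order described in the Fig. 1 caption can be written out as a plain list, which may help when re-implementing the network in a deep learning framework. This is only a sketch: the function name `cnn_5layers_spec` is hypothetical, and the pooling size, the fully-connected widths, the use of 120 channels in every convolution layer, and the placement of dropout after the activation of the first fully-connected layer are not stated in the caption, so they are left unspecified or marked as assumptions.

```python
from collections import Counter


def cnn_5layers_spec(n_blocks=5, out_channels=120, kernel_size=3, dropout_rate=0.3):
    """List the CNN_5layers layers in order, following the Fig. 1 caption."""
    layers = []
    for _ in range(n_blocks):
        layers += [
            # Constructed convolution layer: convolution, batch norm, LeakyReLU.
            # Assumption: every block uses the same kernel size and channel count.
            ("conv", {"kernel_size": kernel_size, "out_channels": out_channels}),
            ("batch_norm", {}),
            ("leaky_relu", {}),
            ("max_pool", {}),  # pool size is not given in the caption
            ("dropout", {"rate": dropout_rate}),  # after each pooling operation
        ]
    layers += [
        ("fully_connected", {}),  # width is not given in the caption
        ("leaky_relu", {}),  # assumption: activation precedes dropout
        ("dropout", {"rate": dropout_rate}),  # after the first fully-connected layer
        ("fully_connected", {}),  # output layer producing the activity score
    ]
    return layers


layer_counts = Counter(name for name, _ in cnn_5layers_spec())
```

`layer_counts` then gives a quick check against the caption: five convolutions, five pooling layers, two fully-connected layers, and six dropout layers.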
Comparison of Spearman correlation coefficients between eukaryotic sgRNA activity and eukaryotic model predictions
| Independent test datasets | Size | DeepCas9 | TSAM_U6 | CNN_5layers |
|---|---|---|---|---|
| chari2015Train293T | 1234 | — | 0.3812 | 0.3607 |
| doench2014HsA375 | 1276 | 0.3237 | 0.3187 | 0.3369 |
| doench2016 | 2333 | 0.3527 | 0.3439 | 0.3945 |
| hart2016-GbmAvg | 4272 | 0.3795 | 0.4242 | 0.4404 |
| hart2016-Hct1162lib1Avg | 4239 | 0.3679 | 0.4161 | 0.4288 |
| hart2016-Hct1162lib2Avg | 3617 | 0.3196 | 0.3598 | 0.3829 |
| hart2016-HelaLib1Avg | 4256 | 0.3403 | 0.3879 | 0.4033 |
| hart2016-HelaLib2Avg | 3845 | 0.3617 | 0.3942 | 0.4390 |
| hart2016-Rpe1Avg | 4214 | 0.2519 | 0.3094 | 0.3044 |
| wang2015hg19 | 2921 | 0.2030 | 0.1882 | 0.2291 |
| xu2015TrainMEsc | 981 | 0.3668 | 0.4088 | 0.4111 |
The DeepCas9 training set contains all chari2015Train293T samples
The Spearman correlation coefficients between on-target activity and six melting temperatures, four RNA fold scores, and four POSs
| Abbreviations | Features | Cas9 | eSpCas9 | Cas9 (△recA) |
|---|---|---|---|---|
| t1 | T(1_7) | -0.0302 | -0.0322 | — |
| t2 | T(8_15) | -0.0915 | -0.1304 | -0.0579 |
| t3 | T(16_20) | -0.1424 | -0.0695 | -0.0789 |
| t4 | T(1_20) | -0.1439 | -0.1346 | -0.0762 |
| t5 | T(-5_-1) | 0.0239 | 0.0385 | 0.0332 |
| t6 | T(21_+2) | 0.0098 | 0.0107 | 0.0119 |
| f1 | MFE | 0.0944 | 0.0895 | 0.0601 |
| f2 | FETE | 0.0862 | 0.0832 | 0.0517 |
| f3 | FMSE | -0.0246 | -0.0276 | -0.0323 |
| f4 | ED | 0.0166 | 0.0225 | 0.0286 |
| p1 | Cropit_POS | -0.1083 | -0.0998 | -0.0527 |
| p2 | Cctop_POS | -0.1088 | -0.1003 | -0.0518 |
| p3 | Mit_POS | -0.1130 | -0.1073 | -0.0607 |
| p4 | Cfd_POS | -0.1131 | -0.0985 | -0.0579 |
T(1_7) in the Cas9 (△recA) scenario is not statistically significant
Descriptions of several feature combinations
| Combinations | Descriptions |
|---|---|
| t | t1, t2, t3, t4, t5, t6 |
| t_c | t1, t2, t3, t4, t5, t6, c |
| t_p | t1, t2, t3, t4, t5, t6, p1, p2, p3, p4 |
| t_p_c | t1, t2, t3, t4, t5, t6, p1, p2, p3, p4, c |
| t_p_f | t1, t2, t3, t4, t5, t6, p1, p2, p3, p4, f1, f2, f3, f4 |
| t_p_f_c | t1, t2, t3, t4, t5, t6, p1, p2, p3, p4, f1, f2, f3, f4, c |
| t_f | t1, t2, t3, t4, t5, t6, f1, f2, f3, f4 |
| t_f_c | t1, t2, t3, t4, t5, t6, f1, f2, f3, f4, c |
| p | p1, p2, p3, p4 |
| p_c | p1, p2, p3, p4, c |
| p_f | p1, p2, p3, p4, f1, f2, f3, f4 |
| p_f_c | p1, p2, p3, p4, f1, f2, f3, f4, c |
| f | f1, f2, f3, f4 |
| f_c | f1, f2, f3, f4, c |
| c | CNN_5layers output |
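The combination names follow a simple pattern: each one is a non-empty subset of the four feature groups t, p, f, and c, joined with underscores. A short sketch that enumerates them (the names `FEATURE_GROUPS` and `feature_combinations` are illustrative, not from the paper) yields fifteen subsets, matching the fifteen combinations evaluated by linear regression:

```python
from itertools import combinations

# Feature groups as abbreviated in the tables: melting temperatures (t),
# POSs (p), RNA fold scores (f), and the CNN_5layers output (c).
FEATURE_GROUPS = {
    "t": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "p": ["p1", "p2", "p3", "p4"],
    "f": ["f1", "f2", "f3", "f4"],
    "c": ["c"],
}


def feature_combinations(groups=FEATURE_GROUPS):
    """Map each combination name (e.g. 't_p_c') to its flat feature list."""
    names = list(groups)
    combos = {}
    for size in range(1, len(names) + 1):
        for chosen in combinations(names, size):
            combos["_".join(chosen)] = [
                feature for group in chosen for feature in groups[group]
            ]
    return combos
```

For example, `feature_combinations()["p_c"]` gives `["p1", "p2", "p3", "p4", "c"]`.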
Average performances in training set and test set under 5-fold cross-validation for fifteen feature combinations by Linear Regression
| Combinations | Cas9, training set | Cas9, test set | eSpCas9, training set | eSpCas9, test set | Cas9 (△recA), training set | Cas9 (△recA), test set |
|---|---|---|---|---|---|---|
| t | 0.1782 | 0.1777 | 0.1604 | 0.1604 | 0.1070 | 0.1056 |
| p | 0.1217 | 0.1207 | 0.1121 | 0.1121 | 0.0630 | 0.0604 |
| f | 0.0962 | 0.0956 | 0.0947 | 0.0940 | 0.0674 | 0.0667 |
| t_p | 0.1931 | 0.1917 | 0.1772 | 0.1759 | 0.1157 | 0.1130 |
| t_f | 0.1888 | 0.1880 | 0.1740 | 0.1724 | 0.1191 | 0.1174 |
| p_f | 0.1480 | 0.1467 | 0.1408 | 0.1399 | 0.0864 | 0.0837 |
| t_p_f | 0.2026 | 0.2010 | 0.1892 | 0.1875 | 0.1258 | 0.1228 |
| c | 0.6631 | 0.5817 | 0.8060 | 0.7105 | 0.4765 | 0.3602 |
| t_c | 0.6631 | 0.5813 | 0.8064 | 0.7112 | 0.4747 | 0.3574 |
| p_c | 0.6636 | 0.5827 | 0.8068 | 0.7122 | 0.4768 | 0.3601 |
| f_c | 0.6650 | 0.5851 | 0.8077 | 0.7137 | 0.4773 | 0.3639 |
| t_p_c | 0.6637 | 0.5824 | 0.8072 | 0.7125 | 0.4753 | 0.3579 |
| t_f_c | 0.6655 | 0.5848 | 0.8085 | 0.7142 | 0.4753 | 0.3619 |
| p_f_c | 0.6656 | 0.5861 | 0.8085 | 0.7149 | 0.4775 | 0.3640 |
| t_p_f_c | 0.6663 | 0.5860 | 0.8092 | 0.7155 | 0.4758 | 0.3624 |
Fig. 2 Base performance at 41 positions. The figure shows base preference of high on-target activity at 41 positions in a prokaryotic Cas9, b prokaryotic eSpCas9, and c eukaryotic scenarios
Fig. 3 Importance scores at 41 positions. The figure shows the importance scores at 41 positions in the prokaryotic Cas9, prokaryotic eSpCas9, and eukaryotic scenarios. Each curve is normalized so that its 41 values sum to one
Fig. 4 Flow chart of training CNN_5layers and further improving predictive performance under 5-fold cross-validation. The datasets were randomly split into five equal subgroups. In turn, four subgroups were used as the training set to train the models, and the remaining subgroup was used to test the generalization capacity of the trained models. We then combined the output of the trained CNN_5layers with extra features to train simple linear regression models. Finally, the remaining subgroup was used again to test the improved performance
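The two-stage procedure in Fig. 4 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `k_fold_indices` and `cross_validate` are hypothetical names, and `fit_cnn`, `fit_linear`, and `score` are placeholder callables that the caller supplies (for instance a CNN trainer, a linear regression fitter, and a Spearman correlation function).

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k random, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for held_out in range(k):
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, folds[held_out]


def cross_validate(samples, targets, extra_features, fit_cnn, fit_linear, score,
                   k=5, seed=0):
    """Return mean test scores for the CNN alone and for CNN + extra features.

    fit_cnn(xs, ys) and fit_linear(rows, ys) must each return a predict function.
    """
    cnn_scores, stacked_scores = [], []
    for train, test in k_fold_indices(len(samples), k, seed):
        y_train = [targets[i] for i in train]
        y_test = [targets[i] for i in test]

        # Stage 1: train the CNN on four folds, test on the held-out fold.
        cnn = fit_cnn([samples[i] for i in train], y_train)
        c_train = [cnn(samples[i]) for i in train]
        c_test = [cnn(samples[i]) for i in test]
        cnn_scores.append(score(c_test, y_test))

        # Stage 2: append the CNN output to the extra features and fit a
        # linear model on the same training folds, then test again.
        rows_train = [list(extra_features[i]) + [c] for i, c in zip(train, c_train)]
        rows_test = [list(extra_features[i]) + [c] for i, c in zip(test, c_test)]
        linear = fit_linear(rows_train, y_train)
        stacked_scores.append(score([linear(r) for r in rows_test], y_test))
    return sum(cnn_scores) / k, sum(stacked_scores) / k
```

Keeping the fold split fixed across both stages means the held-out subgroup is never seen by either the CNN or the linear model during training.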