Alexander Krohannon, Mansi Srivastava, Simone Rauch, Rajneesh Srivastava, Bryan C Dickinson, Sarath Chandra Janga.
Abstract
BACKGROUND: The recent discovery of the CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) gene editing system and its associated (Cas) proteins has resulted in its widespread use for improved understanding of a variety of biological systems. Cas13, a lesser-studied Cas protein, has been repurposed to allow efficient and precise editing of RNA molecules. The Cas13 system uses base complementarity between a crRNA/sgRNA (CRISPR RNA or single guide RNA) and a target RNA transcript to bind preferentially to the target transcript alone. Compared with the upstream regulatory regions of protein-coding genes in the genome, the transcriptome is significantly more redundant, so many transcripts share long stretches of identical nucleotide sequence. Transcripts also exhibit complex three-dimensional structures and interact with an array of RBPs (RNA-binding proteins), both of which may affect how effectively target sequences are depleted. However, our understanding of the features, and of the corresponding methods, that can predict whether a specific sgRNA will effectively knock down a transcript is very limited.
Keywords: CRISPR/Cas13; Functional genomics; Gene editing; Machine learning; Protein expression; mRNA regulation
Year: 2022 PMID: 35236300 PMCID: PMC8889671 DOI: 10.1186/s12864-022-08366-2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Algorithmic Framework for CRISPR Cas13 Guide RNA Prediction. CRISPR Cas13 knockdown experiments, protein occupancy, and transcriptomic alignment data were gathered for consideration and analysis by the model. Feature lists were created through composition analysis and k-mer capture. The significance and contribution of each feature were estimated to create candidate feature lists. The final feature list for the model was chosen by comparing 3-fold and 5-fold cross-validation experiments. Model predictions were validated through direct comparison with performed experiments. The model was then used to predict sgRNAs spanning all transcripts associated with 5000 genes. The results were collected and analyzed for any potential biological relevance of the predictions
Fig. 2 K-mer Analysis to study the guide composition. A Bar plot of the population of sgRNAs that contain a specific dinucleotide at position 8. B Box plot of target transcript expression values as a function of the nucleotide at position 8. C Bar plot of the negative log of the univariate linear regression significance p-value for all monomers at all positions across the guide. D Bar plot of the feature contribution score for each feature in the Random Forest Gini feature list.
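The composition analysis and k-mer capture named in the captions above can be illustrated with a minimal sketch. It assumes guides are plain nucleotide strings and that features are one-hot indicators for each k-mer at each position plus overall base fractions. The function names, the feature naming scheme (`pos8_G`, `frac_A`), and the conversion of T to U are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

NUCLEOTIDES = "ACGU"


def _normalize(guide):
    """Upper-case the guide and write it in the RNA alphabet (assumption)."""
    guide = guide.upper().replace("T", "U")
    if not guide:
        raise ValueError("guide sequence must not be empty")
    return guide


def positional_kmer_features(guide, k=1):
    """One-hot encode every possible k-mer at every position of a guide.

    Returns a dict mapping names such as 'pos8_G' (k=1) or 'pos8_GA' (k=2)
    to 0 or 1. Positions are 1-based, as in the figure captions.
    """
    guide = _normalize(guide)
    kmers = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    features = {}
    for start in range(len(guide) - k + 1):
        window = guide[start:start + k]
        for kmer in kmers:
            features[f"pos{start + 1}_{kmer}"] = int(window == kmer)
    return features


def composition_features(guide):
    """Fraction of each nucleotide over the whole guide."""
    guide = _normalize(guide)
    return {f"frac_{base}": guide.count(base) / len(guide) for base in NUCLEOTIDES}


if __name__ == "__main__":
    toy_guide = "ACGUACGUAC"  # hypothetical guide, for illustration only
    mono = positional_kmer_features(toy_guide, k=1)
    print(mono["pos8_U"], composition_features(toy_guide)["frac_A"])
```

A feature table built this way can feed either a per-feature univariate test or a tree-based importance score, the two kinds of ranking the Fig. 2 caption describes.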
Model Architecture Performance by Feature Set
| Z > 2 | Z > 3 | Gini | Gini DT | ||
|---|---|---|---|---|---|
| 3-Fold | |||||
| Random Forest | 0.717 [55.4] | 0.713 [57.55] | 0.714 [51.1] | 0.716 [54.35] | 0.717 [50.55] |
| KNN | 0.715 [2] | 0.715 [2] | 0.711 [2] | 0.717 [3] | 0.717 [3] |
| SVC [linear] | 0.71 | 0.609 | 0.505 | 0.6 | 0.64 |
| SVC [poly] | 0.698 | 0.607 | 0.517 | 0.599 | 0.616 |
| SVC [sigmoid] | 0.62 | 0.551 | 0.47 | 0.54 | 0.553 |
| SVC [rbf] | 0.642 | 0.521 | 0.487 | 0.567 | 0.581 |
| Decision Tree | 0.715 | 0.715 | 0.711 | 0.715 | 0.714 |
| 5-Fold | |||||
| Random Forest | 0.654 [41.3] | 0.651 [46.75] | 0.641 [45.75] | 0.652 [43.95] | 0.654 [41.3] |
| KNN | 0.558 [3.39] | 0.496 [9.71] | 0.515 [8.44] | 0.535 [3.67] | 0.558 [3.39] |
| SVC [linear] | 0.615 | 0.553 | 0.504 | 0.589 | 0.634 |
| SVC [poly] | 0.593 | 0.535 | 0.501 | 0.562 | 0.574 |
| SVC [sigmoid] | 0.551 | 0.489 | 0.458 | 0.517 | 0.521 |
| SVC [rbf] | 0.563 | 0.504 | 0.469 | 0.537 | 0.541 |
| Decision Tree | 0.634 | 0.636 | 0.634 | 0.64 | 0.637 |
Distribution of model accuracy across different architectures and feature lists for both 3-fold and 5-fold cross-validation. For KNN and Random Forest, the average parameter values that gave the highest accuracy are recorded in brackets
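The table compares mean accuracy under 3-fold and 5-fold cross-validation. As a minimal sketch of that evaluation scheme, the standard-library code below splits samples into k folds and averages held-out accuracy. The source does not say how the folds were built, so the shuffled, unstratified split, the `fit`/`predict` callback interface, and the majority-class baseline are all assumptions for illustration. None of the tabulated models is reproduced here.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k, seed=0):
    """Mean held-out accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    """
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct = sum(predict(model, X[j]) == y[j] for j in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


def fit_majority(X_train, y_train):
    """Hypothetical baseline: remember the most common training label."""
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    """Predict the remembered label regardless of the input features."""
    return model


if __name__ == "__main__":
    X = [[i] for i in range(12)]                 # toy features
    y = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1]     # toy labels
    for k in (3, 5):
        print(k, cross_validate(X, y, fit_majority, predict_majority, k))
```

Each 3-fold split trains on about two thirds of the data and each 5-fold split on about four fifths, so the two schemes differ in training-set size as well as in the number of held-out evaluations.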
Fig. 3 CASowary Model Performance. ROC curve for the CASowary Decision Tree model using the Random Forest feature list.
Fig. 4 Comparison of CASowary Predictions with CIRTS Results. SMARCA4 (add transcript ID) transcript measurements from CIRTS experiments, correlated with high-efficiency CASowary guide predictions (transcript expression value between 0 and 0.25; red) and low-efficiency CASowary guide predictions (transcript expression value between 0.75 and 1; blue)
Fig. 5 Comparison of Training Data with Gene Predictions. A Density plot of Efficient (Highly Efficient and Efficient) and Inefficient (Inefficient and Highly Inefficient) guides from the training data. B Density plot of Efficient and Inefficient guides from the 5000 random genes. C Pie chart for the breakdown of guide predictions from the training data. D Pie chart for the breakdown of the guide predictions from the 5000 random genes
Fig. 6 Cell Line Specific Predictions. IGV tracks for ENST00000251507.8, a protein-coding transcript for RABGAP1L. The top collection of tracks corresponds to protein occupancy (blue), high-quality guide locations (transcript expression value between 0 and 0.5) (green), and transcript abundance for the HEK293 cell line (gray). The bottom collection of tracks is the same for the HeLa cell line
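The captions for Figs. 4 to 6 group guides by the transcript expression value remaining after knockdown: 0 to 0.25 is high efficiency, 0 to 0.5 is high quality, and 0.75 to 1 is low efficiency. The sketch below shows one way to turn such a value into the four labels used in Fig. 5. The boundaries at 0.25, 0.5, and 0.75 are inferred from those captions. Which side of each boundary is inclusive, and the handling of values above 1, are assumptions.

```python
from collections import Counter

CLASSES = ("Highly Efficient", "Efficient", "Inefficient", "Highly Inefficient")


def efficiency_class(expression):
    """Map a relative transcript expression value to one of four guide classes.

    Lower remaining expression means a more effective knockdown. Values above
    0.75 (including any above 1) are all labelled 'Highly Inefficient', which
    is an assumption of this sketch.
    """
    if expression < 0:
        raise ValueError("expression value must be non-negative")
    if expression <= 0.25:
        return "Highly Efficient"
    if expression <= 0.5:
        return "Efficient"
    if expression <= 0.75:
        return "Inefficient"
    return "Highly Inefficient"


def class_breakdown(expression_values):
    """Fraction of guides in each class, as a pie-chart style summary."""
    if not expression_values:
        raise ValueError("need at least one expression value")
    counts = Counter(efficiency_class(v) for v in expression_values)
    total = len(expression_values)
    return {label: counts.get(label, 0) / total for label in CLASSES}


if __name__ == "__main__":
    toy_values = [0.1, 0.4, 0.6, 0.9]  # hypothetical values, not from the paper
    print(class_breakdown(toy_values))
```

Merging the first two labels gives the "Efficient" group and merging the last two gives the "Inefficient" group plotted in Fig. 5A and 5B.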