Guohui Chuai1, Fayu Yang2, Jifang Yan1, Yanan Chen1, Qin Ma3, Chi Zhou1, Chenyu Zhu1, Feng Gu2, Qi Liu1. 1. Department of Central Laboratory, Shanghai Tenth People's Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, China. 2. School of Ophthalmology and Optometry, Eye Hospital, Wenzhou Medical University, State Key Laboratory Cultivation Base and Key Laboratory of Vision Science, Ministry of Health and Zhejiang Provincial Key Laboratory of Ophthalmology and Optometry, Zhejiang, China. 3. Department of Plant Science, BioSNTR, South Dakota State University, Brookings, South Dakota, USA.
To the Editor: CRISPR-based gene editing is widely implemented in various cell types and has great potential for molecular therapy.[1] The CRISPR-Cas9 system creates sequence-specific double-strand DNA breaks that are repaired by the dominant, error-prone nonhomologous end-joining (NHEJ) pathway, often resulting in gene inactivation through frameshift alleles.[1,2,3,4,5,6,7] However, CRISPR-based gene knockout (KO) often produces in-frame variants that retain functionality, which reduces KO efficiency. Recently, Sangsu Bae et al. pioneered the use of microhomology prediction to improve CRISPR-based KO efficiency in cell lines through in silico selection of target sites that reduce in-frame mutations.[2] They proposed that the preference for in-frame mutations at a given target site can be predicted from its microhomology profile, and that an alternative NHEJ pathway, microhomology-mediated end joining (MMEJ), is involved.[2,3] A score was defined to predict microhomology-based out-of-frame mutation preferences.[2,4] Their approach of improving CRISPR-based KO by reducing in-frame mutations is creative, since in-frame mutations can retain protein functionality and thereby reduce KO efficiency. Nevertheless, further work is needed to systematically investigate the relationship between sequence microhomology and in-frame mutations, as well as other factors that may influence the occurrence of in-frame mutations in CRISPR-based KO, by taking advantage of posterior analysis of the high-throughput next-generation sequencing data from CRISPR-based KO experiments. To address this issue, we performed a comprehensive analysis of the 68-sgRNA HeLa cell line deep sequencing data[2] with our pipeline, CAGE (CRISPR KO Analysis based on Genomic Editing data; Supplementary Materials). The analysis examined in depth the relationship between the microhomology profile and in-frame mutation occurrence, and provided new clues for efficient CRISPR sgRNA design in terms of reducing in-frame mutations.
Microhomology Profile May not be Considered as a General sgRNA Design Feature
Although several previous studies reported the involvement of MMEJ in mutations introduced by CRISPR-Cas or TALEN,[2,8] the reported statistics on the occurrence of MMEJ-mediated indels remain controversial, probably because of vague definitions distinguishing NHEJ from MMEJ and the heterogeneity of the cell types tested. In our study, we strictly followed the review article by Mitch McVey et al.,[3] which indicated that MMEJ and NHEJ can be distinguished by the length of the microhomologous sequence: microhomology of 5–25 bp suggests MMEJ, whereas microhomology shorter than 5 bp indicates NHEJ.[3] Based on this definition, our sequence-level analysis pipeline identified only one occurrence of MMEJ (microhomology over 5 bp) among all single-deletion reads (1/134,008), indicating that the MMEJ pathway is rarely used compared with NHEJ-based DNA repair, at least in HeLa cells (Supplementary Table S1, Supplementary Table S3a). This is not surprising, as MMEJ serves as a complementary pathway only when NHEJ is unavailable.[3] We then tested whether microhomology is a crucial factor for the frameshifting paradigm in the HeLa dataset by performing a contingency table analysis (Supplementary Table S2) that compared the enrichment ratio of in-frame mutations between deletions occurring with and without microhomology (Supplementary Materials). We found no statistically significant correlation between the frameshifting paradigm and microhomology for any of the 68 sgRNAs. We further investigated the microhomology profile in mouse mESCs[9] and zebrafish cells[10] through posterior analysis of the next-generation sequencing data using CAGE. In these two cell types, MMEJ also rarely occurred, and the contingency-table-based statistical analysis indicated that, for most sgRNAs, the correlation between the frameshifting paradigm and microhomology is not statistically significant (Supplementary Materials). In addition to our study, recent work by John Doench et al. reported that “Microhomology features, suggested to improve sgRNA activity, were predictive on their own but did not improve performance when added to our final model”. Their study, approached from the perspective of feature selection for tuning an sgRNA activity prediction model, also indicated that microhomology features may be redundant for the final prediction performance.[11,12]
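The classification rule and the contingency table analysis described above can be illustrated with a minimal sketch. It is not the CAGE implementation: the function names are hypothetical, microhomology is counted as the number of positions by which a deletion can be shifted without changing the repaired sequence, and Fisher's exact test is assumed as the contingency table statistic because the test used is not named here. Deletions with microhomology above 25 bp fall outside the stated definition and are left unclassified.

```python
from math import comb


def microhomology_length(ref, start, length):
    """Count microhomology bases flanking the deletion ref[start:start+length].

    Counted as the number of positions the deletion can be shifted without
    changing the repaired sequence, capped at the deletion length (assumption).
    """
    end = start + length
    right = 0
    while (right < length and end + right < len(ref)
           and ref[start + right] == ref[end + right]):
        right += 1
    left = 0
    while (left < length - right and start - 1 - left >= 0
           and ref[start - 1 - left] == ref[end - 1 - left]):
        left += 1
    return left + right


def classify_repair(mh_len):
    """Apply the 5-25 bp rule: shorter microhomology is attributed to NHEJ."""
    if mh_len < 5:
        return "NHEJ"
    if mh_len <= 25:
        return "MMEJ"
    return "unclassified"  # longer homology is outside the stated definition


def contingency_table(deletions):
    """Build the 2x2 table from (microhomology_length, deletion_length) pairs.

    Returns (with_mh_in_frame, with_mh_out_of_frame,
             without_mh_in_frame, without_mh_out_of_frame).
    """
    a = b = c = d = 0
    for mh_len, del_len in deletions:
        in_frame = del_len % 3 == 0
        if mh_len > 0:
            a, b = (a + 1, b) if in_frame else (a, b + 1)
        else:
            c, d = (c + 1, d) if in_frame else (c, d + 1)
    return a, b, c, d


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for a 2x2 table (practical for small counts)."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

For instance, in the toy reference `GGACGTATTTACGTACC`, deleting the 8 bases starting at index 2 leaves a junction flanked by the repeat `ACGTA`, so `microhomology_length` returns 5 and `classify_repair` labels the event as MMEJ.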
Further Experimental Validation Using EGFP Reporter System Detecting no Microhomology-Related NHEJ
To test our hypothesis in another cell type, we performed a CRISPR-based gene knockout experiment in HEK293 cells using our enhanced green fluorescent protein (EGFP) reporter system, as previously described.[13] Because CRISPR/Cas9-mediated gene knockout generally relies on functional NHEJ events, we analyzed CRISPR/Cas9-mediated EGFP KO to obtain functional NHEJ events (EGFP-negative cells), which is a more direct way to study functional microhomology-related NHEJ rather than total NHEJ. Specifically, we designed three sgRNAs (Supplementary Table S3b) targeting the EGFP DNA sequence (Supplementary Materials). Next, the EGFP gene was inactivated by transfection of the corresponding CRISPR/Cas9 plasmids. EGFP-negative cells were obtained by fluorescence-activated cell sorting. The whole coding sequence for GFP was amplified and cloned into a cloning vector, and individual clones were analyzed by Sanger sequencing. Lastly, we examined the functional indel patterns in the sequencing results and found no microhomology pattern in the NHEJ-mediated indels (Supplementary Table S3b). It should be noted that this analysis of EGFP-negative cells focuses on functional NHEJ rather than total NHEJ; within this scope, the results indicate that MMEJ is rare in this test.
Frequency of Out-of-Frame Deletions/Indels is not a Proper Indicator for sgRNA Efficiency Estimation
To estimate sgRNA efficiency, Sangsu Bae et al. defined the out-of-frame score, which correlated well with the frequency of out-of-frame indels in their study. That frequency was calculated as the ratio of out-of-frame reads among all deletion/indel reads per sgRNA; we consider it more appropriate to use the ratio of out-of-frame reads among all sequencing reads per sgRNA to quantitatively represent sgRNA efficiency. Notably, the occurrence of an indel is the prerequisite for CRISPR-based KO under the frameshifting paradigm. Careful analysis of the sequencing data indicated that although many sgRNAs have a high frequency of out-of-frame deletions among indels, they generate very few indels in the first place, resulting in low KO efficiency. We calculated the Pearson correlation coefficient between the out-of-frame scores and the frequencies of out-of-frame reads among all sequencing reads for one TALEN and two RGEN datasets,[2] and the correlations were significantly lower than those in the previous report.
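The difference between the two frequency definitions can be made concrete with a short sketch. The function names and the toy numbers are hypothetical and serve only to show how an sgRNA can score well on one measure and poorly on the other; indel sizes are taken as the net length change of each mutated read.

```python
from math import sqrt


def out_of_frame_fractions(total_reads, indel_sizes):
    """Return (fraction among indel reads, fraction among all reads).

    indel_sizes holds the net length change of each mutated read
    (negative for deletions); a read is out-of-frame when that change
    is not a multiple of 3.
    """
    out_of_frame = sum(1 for size in indel_sizes if size % 3 != 0)
    among_indels = out_of_frame / len(indel_sizes) if indel_sizes else 0.0
    among_all = out_of_frame / total_reads if total_reads else 0.0
    return among_indels, among_all


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy sgRNA: 1,000 reads, only 10 indels, 9 of them out-of-frame.
among_indels, among_all = out_of_frame_fractions(1000, [-1] * 9 + [-3])
```

In this toy case the out-of-frame fraction is 0.9 among indel reads but only 0.009 among all reads, which is the situation described above: a high out-of-frame frequency among indels combined with low overall KO efficiency. The `pearson` helper can then be applied to out-of-frame scores against either fraction across sgRNAs.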
A Learning-Based Model to Predict the Out-of-Frame Mutation Occurrence Rate in sgRNA Design
We first collected a comprehensive set of genomic features for sgRNAs and modeled their effects on the frequency of out-of-frame reads among all sequencing reads (defined as the “OTF ratio”; Supplementary Materials). These features were dummy coded (Supplementary Table S4), and the genomic feature representation of the 68 sgRNAs in the HeLa cell line is presented in Supplementary Table S5. The features were incorporated into a LASSO model, and crucial features were selected (Supplementary Table S6, Supplementary Materials). Our prediction model was evaluated by fivefold cross-validation on the 68 sgRNAs, achieving a mean correlation of 0.87 (P value < 0.01) in predicting the out-of-frame mutation occurrence rate with the selected genomic features. We then generated a group of epigenetic features (Supplementary Table S8) describing the 68 sgRNAs (Supplementary Table S7) and modeled their predictive ability; these epigenetic features seemed to have less predictive power, probably because of the small number of samples tested. We further tested our epigenetic model on three relatively larger sgRNA efficiency datasets,[14] with improved prediction ability (Supplementary Table S9). Recent work also indicates that both sequence composition and locus accessibility are important in determining sgRNA KO efficiency.[15]
Table S1. sgRNA-Indel table of the 68-sgRNA HeLa cell line dataset.
Table S2. A contingency table analysis to investigate the correlation between frameshifting paradigm and microhomology.
Table S3. Microhomology analysis for the 68-sgRNA HeLa cell line dataset and EGFP dataset.
Table S4. Dummy coding scheme for genomics feature.
Table S5. The genomic feature representation of the 68-sgRNA HeLa cell line dataset.
Table S6. The selected genomic factors that may influence the OTF ratio by LASSO model of the 68-sgRNA HeLa cell line dataset.
Table S7. Epigenetic feature representations of 4 datasets.
Table S8. Epigenetic feature description of 3 cell lines for 4 datasets.
Table S9. The prediction performance of three sgRNA efficiency datasets presented by X.
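The dummy coding and LASSO feature selection used for the learning-based model can be illustrated with a minimal sketch. It does not reproduce the coding scheme of Table S4 or the fitted model of Table S6: the per-position one-hot encoding of nucleotides, the function names, and the plain coordinate-descent solver are assumptions chosen for illustration.

```python
def one_hot_encode(seq, alphabet="ACGT"):
    """Dummy-code a sequence: one 0/1 indicator per (position, base) pair."""
    return [1.0 if base == letter else 0.0 for base in seq for letter in alphabet]


def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty (the LASSO proximal step)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_fit(X, y, penalty, n_iter=500):
    """Minimise (1/2n)*||y - b - Xw||^2 + penalty*||w||_1 by coordinate descent.

    X is a list of feature rows, y a list of responses (e.g. OTF ratios).
    Returns (weights, intercept); zero weights mark features that were dropped.
    """
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centre features and response so the intercept can be recovered afterwards.
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    residual = [yi - y_mean for yi in y]  # residual for all-zero weights
    col_sq = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant column carries no information
            # Correlate feature j with the residual after adding its own effect back.
            rho = sum(Xc[i][j] * (residual[i] + Xc[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= Xc[i][j] * delta
                w[j] = new_w
    intercept = y_mean - sum(w[j] * col_means[j] for j in range(p))
    return w, intercept
```

In such a setup, each sgRNA sequence would be passed through `one_hot_encode` to form a row of `X`, with the measured OTF ratio as the response. Features whose weight stays at zero as the penalty grows are the ones the model discards, and the penalty itself would normally be chosen by cross-validation.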
Author contributions
G.H.C., F.Y.Y., J.F.Y., Y.N.C., Z.C., C.Y.Z., Q.L., and F.G. performed the whole data analysis and pipeline construction. Q.M. analyzed the prediction model and helped polish the manuscript. F.G. compared NHEJ with MMEJ based on the sequencing data. Q.L. and G.H.C. conceived the study and wrote the manuscript.