Jacqueline Jufen Zhu, Albert Wu Cheng.
Abstract
Zinc finger protein-, transcription activator-like effector-, and CRISPR-based methods for genome and epigenome editing and imaging have provided powerful tools to investigate functions of genomes. Targeting sequence design is vital to the success of these experiments. Although existing design software mainly focuses on designing target sequences for specific elements, we report here the implementation of Jackie and Albert's Comprehensive K-mer Instances Enumerator (JACKIE), a suite of software for enumerating all single- and multicopy sites in the genome that can be incorporated for genome-scale designs as well as loaded onto genome browsers alongside other tracks for convenient web-based graphic-user-interface-enabled design. We also implement fast algorithms to identify sequence neighborhoods or off-target counts of targeting sequences so that designs with a low probability of off-targets can be identified among millions of design sequences in a reasonable time. We demonstrate the application of JACKIE-designed CRISPR site clusters for genome imaging.
Year: 2022 PMID: 35830604 PMCID: PMC9527058 DOI: 10.1089/crispr.2022.0042
Source DB: PubMed Journal: CRISPR J ISSN: 2573-1599
FIG. 1. Implementation and evaluation of JACKIE. (a) JACKIE.bin scans the genome for k-mers matching a specified motif (e.g., XXXXXXXXXXXXXXXXXXXXNGG matches a 20-mer, denoted by 20X, followed by the NGG PAM required by the CRISPR/SpCas9 system) and encodes each occurrence in a data structure called KeyedPosition, consisting of NucKey (unsigned 64-bit integer, uint64_t), chrID (unsigned 32-bit integer, uint32_t), and pos (signed 32-bit integer, int32_t). NucKey records a binary representation of the k-mer sequence using three bits per nucleotide. chrID records an integer representing a chromosome. A chrID-chromosome mapping file is generated. pos records the location of the binding site, with negative and positive integers representing the minus strand and positive strand, respectively. JACKIE.bin outputs KeyedPosition records to files as it scans the genome. To parallelize this step, four processes are started as separate cluster jobs, each handling k-mers that begin with A, C, T, or G. To allow for parallelization in subsequent steps, KeyedPosition records are appended to one file per 6-mer prefix of the k-mer (<6merPrefix>.bin; e.g., ATGAGC.bin contains the records for all k-mers starting with ATGAGC). (b) JACKIE.sortToBed loads each <6merPrefix>.bin file, sorts the KeyedPosition records by NucKey, and then traverses the sorted list to output a BED interval file containing the coordinate position of each k-mer site with item named by
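The encode, bin, and sort steps described in the FIG. 1 caption can be illustrated with a minimal Python sketch. It is not the JACKIE source code. The caption gives the field widths and the three-bits-per-nucleotide encoding, but not the code table, so the values in `NUC_CODE` are an assumption. The function names (`encode_kmer`, `scan`, `group_by_key`), the 1-based signed `pos` convention, and the use of in-memory lists in place of `<6merPrefix>.bin` files are likewise hypothetical choices made for illustration. The sketch assumes an uppercase input sequence and the 20X + NGG motif.

```python
from collections import defaultdict
from itertools import groupby
from typing import NamedTuple

BITS_PER_NT = 3                               # stated in the caption
NUC_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}   # assumed code table, not from the source
ACGT = frozenset(NUC_CODE)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


class KeyedPosition(NamedTuple):
    nuc_key: int   # 20 nt x 3 bits = 60 bits, so it fits a uint64_t
    chr_id: int    # integer standing in for a chromosome name
    pos: int       # assumed 1-based; sign encodes strand (+ plus, - minus)


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, BITS_PER_NT bits per nucleotide."""
    key = 0
    for nt in kmer:
        key = (key << BITS_PER_NT) | NUC_CODE[nt]
    return key


def revcomp(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def scan(chr_id: int, seq: str, k: int = 20) -> dict:
    """Collect KeyedPosition records for kX+NGG sites, binned by 6-mer prefix."""
    bins = defaultdict(list)
    site_len = k + 3
    for i in range(len(seq) - site_len + 1):
        window = seq[i:i + site_len]
        if not ACGT.issuperset(window):       # skip windows containing N, etc.
            continue
        if window.endswith("GG"):             # plus strand: k-mer then NGG
            kmer = window[:k]
            bins[kmer[:6]].append(KeyedPosition(encode_kmer(kmer), chr_id, i + 1))
        if window.startswith("CC"):           # minus strand: CCN then k-mer
            kmer = revcomp(window[3:])
            bins[kmer[:6]].append(KeyedPosition(encode_kmer(kmer), chr_id, -(i + 1)))
    return bins


def group_by_key(records):
    """Sort one prefix bin by nuc_key and group identical k-mers together."""
    ordered = sorted(records, key=lambda r: r.nuc_key)
    return [(key, list(grp)) for key, grp in groupby(ordered, key=lambda r: r.nuc_key)]


if __name__ == "__main__":
    kmer = "ATGAGCACGTACGTACGTAC"
    seq = kmer + "TGG" + "AAAAA" + kmer + "TGG"
    for key, recs in group_by_key(scan(0, seq)["ATGAGC"]):
        print(len(recs), [r.pos for r in recs])
```

In this toy input the same 20-mer occurs twice, so the sorted bin yields one group with two records. Groups of size one would correspond to single-copy sites and larger groups to multicopy sites. Binning by prefix keeps each sort independent, which is what allows the per-file parallelization the caption describes.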
FIG. 2. Distributions of one-copy gRNA sites or multicopy gRNA clusters in the human genome. (a) Column plot showing the number of one-copy sites (y-axis) across the human genome in megabase bins (x-axis). Chromosome numbers are indicated on top. (b) Column plot showing the number of one-copy sites with no off-targets up to three mismatches across the human genome in megabase bins. (c) Column plot showing the number of multicopy gRNA site clusters with two or more sites across the human genome in megabase bins. (d) Column plot showing the number of multicopy gRNA site clusters with four or more sites across the human genome in megabase bins. (e) Column plot showing the number of multicopy gRNA site clusters with 12 or more sites across the human genome in megabase bins. (f) Column plot showing the number of multicopy gRNA site clusters with 24 or more sites across the human genome in megabase bins.