Annekatrien Boel, Wouter Steyaert, Nina De Rocker, Björn Menten, Bert Callewaert, Anne De Paepe, Paul Coucke, Andy Willaert.
Abstract
Targeted mutagenesis by the CRISPR/Cas9 system is currently revolutionizing genetics. The ease of this technique has enabled genome engineering in vitro and in a range of model organisms and has pushed experimental dimensions to unprecedented proportions. Owing to tremendous progress in speed, read length, throughput and cost, Next-Generation Sequencing (NGS) has been increasingly used for the analysis of CRISPR/Cas9 genome editing experiments. However, the current tools for genome editing assessment lack flexibility and fall short in the analysis of large amounts of NGS data. Therefore, we designed BATCH-GE, an easy-to-use bioinformatics tool for batch analysis of NGS-generated genome editing data, available from https://github.com/WouterSteyaert/BATCH-GE.git. BATCH-GE detects and reports indel mutations and other precise genome editing events and calculates the corresponding mutagenesis efficiencies for a large number of samples in parallel. Furthermore, this new tool provides flexibility by allowing the user to adapt a number of input variables. The performance of BATCH-GE was evaluated in two genome editing experiments, aiming to generate knock-out and knock-in zebrafish mutants. This tool will not only contribute to the evaluation of CRISPR/Cas9-based experiments, but will be of use in any genome editing experiment and can analyze data from every organism with a sequenced genome.
Year: 2016 PMID: 27461955 PMCID: PMC4962088 DOI: 10.1038/srep30330
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Implementation of BATCH-GE.
Multiple singleplex PCR products (S1, S2, …, Sn) (upper panel, left) that correspond to different genomic sequences in one specific or in different genomes are pooled in equimolar amounts. Subsequently, the pools are used as DNA input for NGS library preparation using the Nextera XT library preparation kit, which simultaneously fragments and tags input DNA (upper panel, middle). The tagging involves the addition of unique adapter sequences in order to provide sequencing indices on both sides of the amplicons (depicted by yellow, grey, light and dark blue bars). In a final step, all molecules are pooled in a single tube prior to sequencing (upper panel, right). BATCH-GE analyses the data sample by sample in an automated batchwise manner. The experimental specifications needed to run BATCH-GE are supplied via two input files (middle panel, E (Experiment.csv) and C (Cutsites.bed) icons). First, raw sequencing data are converted into the SAM file format. Secondly, BATCH-GE screens the reads in the SAM file for their coverage of the region(s) of interest. These are user-defined regions encompassing the theoretical CRISPR/Cas9 cut site, which lies 3 base pairs upstream of the PAM sequence (middle panel, grey sequence). Thirdly, reads that do not fully cover the region of interest are discarded from the analysis, since they lack information about the presence or absence of indels in this region (middle panel, indicated by a cross). Subsequently, the remaining reads (indicated by a tick) are screened for insertions and deletions initiated within the same user-defined region of interest (middle panel, grey dash-lined box). The detected indel variants, along with their position, type, length and frequency, are written to a ‘Variants’ text file. Reads that do not contain any indel are screened for the presence of intended base pair alterations. Frequencies of partial and full repairs are listed in the ‘RepairReport’ file.
Additionally, general indel and repair rates are reported in the ‘Efficiencies’ file. Lastly, URLs (‘URL’ file) enable read visualization in the freely available UCSC Genome Browser database (ref. 22).
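The read screening described in this caption can be illustrated with a minimal Python sketch. It is not the BATCH-GE implementation: the function names are hypothetical, reads are reduced to their SAM `POS` and `CIGAR` fields, and the region of interest is assumed to be given in 1-based inclusive coordinates.

```python
import re

# One CIGAR element: a length followed by an operation code (SAM specification).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string such as '50M3D50M' into (length, op) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def covers_region(pos, ops, roi_start, roi_end):
    """True if the alignment spans the whole region of interest.

    pos is the 1-based leftmost aligned reference position (SAM POS).
    Only M, D, N, = and X consume reference bases.
    """
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    return pos <= roi_start and pos + ref_len - 1 >= roi_end

def indels_in_region(pos, ops, roi_start, roi_end):
    """List (type, reference position, length) of indels that start in the region."""
    found = []
    ref = pos
    for n, op in ops:
        if op == "I":
            # An insertion consumes no reference bases; report it at the next one.
            if roi_start <= ref <= roi_end:
                found.append(("insertion", ref, n))
        elif op == "D":
            if roi_start <= ref <= roi_end:
                found.append(("deletion", ref, n))
            ref += n
        elif op in "MN=X":
            ref += n
    return found

def screen_reads(reads, roi_start, roi_end):
    """reads: iterable of (pos, cigar). Returns (retained reads, reads with an indel)."""
    retained = with_indel = 0
    for pos, cigar in reads:
        ops = parse_cigar(cigar)
        if not covers_region(pos, ops, roi_start, roi_end):
            continue  # read lacks information about the region: discard
        retained += 1
        if indels_in_region(pos, ops, roi_start, roi_end):
            with_indel += 1
    return retained, with_indel
```

Under these assumptions, an indel rate would be `with_indel / retained`, that is, computed only over reads that fully cover the region of interest.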
BATCH-GE input files: the Experiment.csv file.
| FastqDir | Sample Numbers | Genome | CutSite | OutputDir | CutSites File | Repair Sequence |
|---|---|---|---|---|---|---|
| /location of the FastQ files/ | a, b, f, g, h | Genome A | Gene A amplicon 1 | /location of the output files/runX/ | /location of the CutSites file/Cutsites.bed | (N)NNNNNN(N)NNNNNNNNNN(N)NNNNNNNNNNNN(N)NNNN[N] |
| /location of the FastQ files/ | a–z | Genome A | Gene A amplicon 2 | /location of the output files/runY/ | /location of the CutSites file/Cutsites.bed | / |
| /location of the FastQ files/ | a, b, d–g | Genome B | Gene B | /location of the output files/runZ/ | /location of the CutSites file/Cutsites.bed | / |
The experiment file contains all information that is specific to the experiment. The mandatory headers are: 1) FastqDir, the full path of the directory containing the FastQ files; 2) SampleNumbers, the identifier(s) of the sample(s) to be analysed in the specified NGS run. The notation x,y means that both samples x and y will be analysed; if numerical identifiers are used, the notation x-y means that all samples from x to y will be analysed; 3) Genome, the build of the reference genome the reads should be mapped to (cf. installation notes); 4) CutSite, an identifier of choice for the particular cut site, which must be identical to the identifier given in the 4th column of the BED file; 5) OutputDir, the directory where the output of BATCH-GE should be stored; and 6) Location CutSites File, the BED file containing the genomic coordinates of all cut sites used in the experiment. An optional header is RepairSequence; this column holds the HDR template sequence. Square brackets around bases of the repair template mark base pair alterations that need to be introduced in the zebrafish genome. Round brackets, on the other hand, mark base pair alterations that do not necessarily need to be introduced in the genome, e.g. alterations needed for codon optimization of the template.
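As an illustration of the two notations described above, the following Python sketch expands a SampleNumbers entry and splits a bracketed repair template into its plain sequence and the positions of required and optional alterations. The function names are hypothetical and the parsing rules are assumptions drawn from this legend only, not from the BATCH-GE source code.

```python
def expand_samples(spec):
    """Expand 'x,y' lists and numeric 'x-y' ranges into a list of identifiers.

    Assumption: only purely numeric ranges are expanded; any other token is
    kept as a literal identifier.
    """
    samples = []
    for token in spec.split(","):
        token = token.strip().replace("\u2013", "-")  # accept an en dash as a hyphen
        if not token:
            continue
        low, sep, high = token.partition("-")
        if sep and low.strip().isdigit() and high.strip().isdigit():
            samples.extend(str(i) for i in range(int(low), int(high) + 1))
        else:
            samples.append(token)
    return samples

def parse_repair_template(template):
    """Return (plain sequence, required positions, optional positions).

    Bases inside [...] are required alterations, bases inside (...) are
    optional alterations. Positions are 0-based indices into the plain sequence.
    """
    seq, required, optional = [], [], []
    target = None  # list currently collecting positions, if inside brackets
    for ch in template:
        if ch == "[":
            target = required
        elif ch == "(":
            target = optional
        elif ch in "])":
            target = None
        else:
            if target is not None:
                target.append(len(seq))
            seq.append(ch)
    return "".join(seq), required, optional
```

For example, `expand_samples("1,2,4-7")` yields the six identifiers 1, 2, 4, 5, 6 and 7, and `parse_repair_template("AC(G)TT[A]C")` yields the sequence `ACGTTAC` with position 5 required and position 2 optional.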
BATCH-GE input files: the Cutsites.bed file.
| Chromosome | Chromosomal position start region of interest | Chromosomal position end region of interest | Designation region of interest |
|---|---|---|---|
| chrA | Theoretical cut site −30 | Theoretical cut site +30 | Gene A amplicon 1 |
| chrA | Theoretical cut site −30 | Theoretical cut site +30 | Gene A amplicon 2 |
| chrB | Theoretical cut site −30 | Theoretical cut site +30 | Gene B |
The region of interest named in the ‘CutSite’ column of the ‘Experiment.csv’ file is defined in the ‘Cutsites.bed’ file. Each row represents one region of interest and contains the chromosome, the user-defined chromosomal start and end positions and the designation of the region of interest (which should be identical to the name in the ‘CutSite’ column of the ‘Experiment.csv’ file). No header should be included. In this file, the user specifies the region of interest surrounding the theoretical CRISPR/Cas9 cut site. This is generally a region of 20 (position −10 to +10, relative to the theoretical CRISPR/Cas9 cut site) to 100 (−50 to +50) base pairs.
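A row of this file can be derived from a cut site position and a flank size, as in the Python sketch below. The helper names are hypothetical, the columns are assumed to be tab-separated, and the start and end are taken literally as cut site minus flank and cut site plus flank, as in the table above. Whether BATCH-GE interprets them as 0-based or 1-based coordinates is not stated here.

```python
def bed_row(chrom, cut_site, name, flank=30):
    """Build one header-less, tab-separated row: chrom, start, end, designation."""
    if flank < 0:
        raise ValueError("flank must be non-negative")
    start = max(0, cut_site - flank)  # never run off the start of the chromosome
    return f"{chrom}\t{start}\t{cut_site + flank}\t{name}"

def read_cutsites(lines):
    """Map each designation to (chromosome, start, end)."""
    regions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, start, end, name = line.split("\t")[:4]
        regions[name] = (chrom, int(start), int(end))
    return regions

def missing_cutsites(experiment_names, regions):
    """Designations used in the experiment file that have no row in the BED file."""
    return [name for name in experiment_names if name not in regions]
```

A check such as `missing_cutsites` reflects the requirement that the designation in the BED file and the name in the ‘CutSite’ column be identical.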
Figure 2. BATCH-GE output files for a specific genome editing experiment targeting the tprkb gene.
(a) The ‘Variants’ text file lists chromosome, chromosomal location of the variant, type of the variant, length, the reference sequence surrounding the indel (10 bp upstream and 10 bp downstream of the indel) with [] marking the inserted sequence or with [deleted base pairs] marking the deleted sequence, and absolute and relative frequency of the variants. (b) In case of HDR analysis, the reads that do not contain any indel are screened for the presence of the intended base pair alterations. BATCH-GE can distinguish between full and partial repair when multiple base pair alterations are to be introduced in the region of interest. If partial repair is encountered, the specific sequence of the partial repair is listed. (c) General indel and repair rates are shown in the ‘Efficiencies’ file. (d) URLs are generated (‘URL’ file) which allow visualization of the reads in the freely available UCSC Genome Browser database (ref. 26). However, if the total number of reads (including the reads that are discarded by the tool) exceeds 1000, visualization via UCSC is no longer possible. As an alternative, raw NGS result files (FastQ) can be uploaded into the Integrative Genomics Viewer (IGV) (refs. 19, 20).
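The distinction between full and partial repair in (b) can be sketched in Python as follows. This is an illustrative assumption, not the BATCH-GE algorithm: each indel-free read is assumed to be already trimmed to the same window as the repair template, the required positions are 0-based indices of the intended alterations, and rates are computed over the supplied reads.

```python
from collections import Counter

def classify_repair(read_seq, template_seq, required_positions):
    """Label one indel-free read as 'full', 'partial' or 'none'.

    'full' means every intended alteration is present, 'partial' means
    at least one but not all are present.
    """
    if len(read_seq) != len(template_seq):
        raise ValueError("read and template must cover the same window")
    matched = [p for p in required_positions if read_seq[p] == template_seq[p]]
    if required_positions and len(matched) == len(required_positions):
        return "full"
    if matched:
        return "partial"
    return "none"

def repair_report(read_seqs, template_seq, required_positions):
    """Relative frequency of each repair class among the supplied reads."""
    counts = Counter(
        classify_repair(read, template_seq, required_positions) for read in read_seqs
    )
    total = len(read_seqs)
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ("full", "partial", "none")
    }
```

With a template `ACCTT` and intended alterations at positions 2 and 4, the read `ACCTT` would be classified as a full repair, `ACCTA` as a partial repair and `ACGTA` as unrepaired.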
Figure 3. Indel rates and read number as a function of the size of the region of interest used in BATCH-GE.
The raw sequencing data derived from CRISPR/Cas9 assays (slc2a10, pls3, tapt1a, myt1la, tprkb) injected with 25 pg sgRNA and 250 pg Cas9 and analysed at 1 dpf were reanalysed while varying the size of the region of interest from 20 to 100 bp. The blue bars represent the number of reads retained by BATCH-GE when screened for coverage of the user-defined region of interest. The red line represents the indel rate as a function of the size of the region of interest.
Comparison between available tools for analysis of NGS-derived genome editing data.
| Feature | Visualization in genome browser | CRISPR-GA | BATCH-GE |
|---|---|---|---|
| Freely available | | | |
| Mapping reads against complete genomic sequence of the organism of interest | | | |
| Specific analysis genome editing experiment | | | |
| Calculation rate mutagenic events | | | |
| Graphical interpretation mutagenic events | | | |
| Generation list of specific variants | | | |
| Distinction between full and partial HDR | | | |
| Adjustment input parameters | | | |
| Analysis multiple samples in batch | | | |