Omer S Alkhnbashi, Fabrizio Costa, Shiraz A Shah, Roger A Garrett, Sita J Saunders, Rolf Backofen.
Abstract
MOTIVATION: The discovery of CRISPR-Cas systems almost 20 years ago rapidly changed our perception of the bacterial and archaeal immune systems. CRISPR loci consist of several repetitive DNA sequences called repeats, interspaced by stretches of variable-length sequences called spacers. This CRISPR array is transcribed and processed into multiple mature RNA species (crRNAs). A single crRNA is integrated into an interference complex, together with CRISPR-associated (Cas) proteins, to bind and degrade invading nucleic acids. Although existing bioinformatics tools can recognize CRISPR loci by their characteristic repeat-spacer architecture, they generally output CRISPR arrays of ambiguous orientation and thus do not determine the strand from which crRNAs are processed. Knowledge of the correct orientation is crucial for many tasks, including the classification of CRISPR conservation, the detection of leader regions, the identification of target sites (protospacers) on invading genetic elements and the characterization of protospacer-adjacent motifs.
Year: 2014 PMID: 25161238 PMCID: PMC4147912 DOI: 10.1093/bioinformatics/btu459
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The three major phases of CRISPR-Cas immune systems. First, in the adaptation phase, Cas proteins excise the protospacer sequence from foreign DNA and insert it into the repeat, adjacent to the leader at the CRISPR locus. Second, CRISPR arrays are transcribed and then processed into multiple crRNAs, each carrying a single spacer sequence and part of the adjoining repeat sequence. Third, in the interference phase, the crRNAs are assembled into different classes of protein targeting complexes (Cascades) that anneal to, and cleave, spacer-matching sequences on either invading elements or their transcripts
Summary of our REPEATS dataset derived from all available CRISPR loci
| Data statistics | Archaea | Bacteria | Total |
|---|---|---|---|
| Genomes (total) | 309 | 4590 | 4899 |
| Genomes with CRISPRs (%) | 217 (70) | 1409 (30) | |
| CRISPRs on forward strand | 516 | 1810 | 2326 |
| CRISPRs on reverse strand | 530 | 1859 | 2389 |
| Repeats per array (median) | 2–198 (20) | 2–1371 (16) | |
| Repeat lengths (median) | 20–44 (29) | 19–48 (30) | |
| Spacer lengths (median) | 20–54 (38) | 19–72 (35) | |
Summary of REPEATS dataset: published CRISPR-Cas systems with experimental evidence of the processing mechanism
| Organism | Motif | Cas subtype | Summary |
|---|---|---|---|
| | M2 | I-E | Structure predicted, but stable; 8-nt-5′-tag; cleavage by |
| | M2 | I-E | Structured; 8-nt-5′-tag; cleavage by |
| | M3 | I-C | Cleavage by |
| | M4 | I-F | Cleavage by Cas6f (Csy4); 8-nt-5′-tag; crystal structure and mutational analyses of repeat hairpin in |
| | M5 | I-D III-variant | Cleavage by |
| | M9 | I-C | Cleavage by |
| | M13 | I-B III-B | Cleavage by |
| | M14 | III-variant | Biochemical analysis of |
| | M28 | III-A | Cleavage by |
| | M29 | I-B | Cleavage by |
Note: In particular, these are systems for which (i) the Cas endoribonuclease has been characterized and/or (ii) the repeat structure has been verified. Published results are consistent with the data of Lange.
Fig. 2. Graph encoding the consensus repeat sequence. The consensus nucleotide information is represented as a path graph, and additional information is modelled as a chain of additional vertices. The terminal parts of the repeat are marked with block identifiers
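The path-graph part of this encoding can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `encode_repeat`, the `block_len` parameter, the block labels and the example sequence are all hypothetical, and the chain of additional vertices mentioned in the caption is left out.

```python
def encode_repeat(consensus, block_len=5):
    """Encode a consensus repeat as a path graph (illustrative sketch only).

    Returns (vertices, edges). Each vertex stores its nucleotide and a block
    label: 'start' for the first block_len positions, 'end' for the last
    block_len positions, 'mid' otherwise. block_len is an assumed parameter.
    """
    n = len(consensus)
    vertices = []
    for i, nt in enumerate(consensus):
        if i < block_len:
            block = "start"
        elif i >= n - block_len:
            block = "end"
        else:
            block = "mid"
        vertices.append({"nt": nt, "block": block})
    # Consecutive positions are joined by an edge, giving a path graph.
    edges = [(i, i + 1) for i in range(n - 1)]
    return vertices, edges


# Hypothetical 14-nt consensus repeat, used only for illustration.
vertices, edges = encode_repeat("GTTTCAGACGAACC", block_len=3)
```

Labelling the terminal blocks separately lets a downstream learner distinguish features that occur at the repeat ends from the same nucleotides occurring in the middle.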
Fig. 3. The NSPDK approach extracts a large number of features taking only specific fragments into account. The procedure is parametrized by the radius R and the distance D. Each vertex is considered in turn as a root. A neighbourhood graph of radius R is extracted around each root. All possible pairs of neighbourhood graphs of the same radius R are considered, provided that their respective roots are exactly at distance D. To understand the importance of the sequential order of the attributes, consider the left part of the figure: here we depict a feature with radius 1 and distance 0, which encodes three pieces of information: (i) the specific dinucleotide combination, (ii) the block ID and (iii) whether a mutation is likely to occur on the first nucleotide of the dinucleotide. As we increase the maximal distance between the roots in the pair, the encoded information is further specialized. In the middle part of the figure, we show a feature that additionally includes the presence of a mutation at distance 5. When the radius is increased to 2, the specific position within the block is also considered
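The root-and-neighbourhood pairing described in this caption can be sketched for the simplest case of a plain path graph. This is an assumption-laden illustration, not the NSPDK implementation used in the paper: the function names are hypothetical, neighbourhoods are kept as label tuples instead of being hashed, and the graph is reduced to a list of vertex labels in sequence order.

```python
from collections import Counter


def neighbourhood(labels, root, radius):
    """Labels of all vertices within `radius` steps of `root` on a path graph."""
    lo = max(0, root - radius)
    hi = min(len(labels), root + radius + 1)
    return tuple(labels[lo:hi])


def nspdk_path_features(labels, max_radius, max_distance):
    """Count (radius, distance, neighbourhood, neighbourhood) features.

    For every radius r <= max_radius and distance d <= max_distance, each
    pair of roots exactly d steps apart contributes one feature built from
    the two radius-r neighbourhoods. Sketch for path graphs only.
    """
    feats = Counter()
    n = len(labels)
    for r in range(max_radius + 1):
        for d in range(max_distance + 1):
            for i in range(n - d):
                a = neighbourhood(labels, i, r)
                b = neighbourhood(labels, i + d, r)
                feats[(r, d, a, b)] += 1
    return feats


def kernel(fa, fb):
    """Dot product of two sparse feature-count vectors."""
    return sum(count * fb[key] for key, count in fa.items())


# Toy label sequence; real vertices would also carry block IDs and other attributes.
feats = nspdk_path_features(list("ACGT"), max_radius=1, max_distance=1)
```

Keeping the roots in sequence order, first `i` and then `i + d`, preserves the left-to-right order of the attributes, which is one simple way to make the features sensitive to strand orientation.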
Fig. 4. AUC ROC performance comparison of the five models that encode increasing amounts of information about the CRISPR arrays
Fig. 5. Performance comparison between our method and the Biswas method. The test database contains 948 CRISPR repeats
Fig. 6. (A) Given the newly predicted orientation, Family 5 could be merged with Family 8 and Motif 4 with Motif 6. (B) The 33 structural motifs from Lange are clustered (i) with and (ii) without the orientation prediction