CRISPRCasFinder, an update of CRISPRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins.

David Couvin¹, Aude Bernheim²,³, Claire Toffano-Nioche¹, Marie Touchon²,³, Juraj Michalik⁴, Bertrand Néron⁵, Eduardo P C Rocha²,³, Gilles Vergnaud¹, Daniel Gautheret¹, Christine Pourcel¹.

Abstract

CRISPR (clustered regularly interspaced short palindromic repeats) arrays and their associated (Cas) proteins confer bacteria and archaea adaptive immunity against exogenous mobile genetic elements, such as phages or plasmids. CRISPRCasFinder allows the identification of both CRISPR arrays and Cas proteins. The program includes: (i) an improved CRISPR array detection tool facilitating expert validation based on a rating system, (ii) prediction of CRISPR orientation and (iii) a Cas protein detection and typing tool updated to match the latest classification scheme of these systems. CRISPRCasFinder can either be used online or as a standalone tool compatible with the Linux operating system. All third-party software packages employed by the program are freely available. CRISPRCasFinder is available at https://crisprcas.i2bc.paris-saclay.fr.

Year: 2018 PMID: 29790974 PMCID: PMC6030898 DOI: 10.1093/nar/gky425

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Clustered regularly interspaced short palindromic repeats (CRISPR) and associated proteins (Cas) form the CRISPR-Cas systems. CRISPRs consist of a succession of 24–50 bp long direct repeats or ‘repeats’ separated by similarly sized unique sequences called spacers. They are transcribed from a promoter present in the leader (often a 100–200 bp AT-rich sequence) and therefore CRISPR arrays are functionally oriented (1). A community effort resulted in the classification of CRISPR-Cas systems into two classes, six types and 22 subtypes, according to their Cas proteins (2,3). Since the development of genome editing technologies based on elements of the CRISPR-Cas systems, these genomic entities have attracted a lot of attention. Indeed, components of these biological systems, present in about 80% of archaea and half of bacteria, can be used in multiple applications in genetic engineering (4,5). The rapid rate of evolution of certain CRISPR arrays also allows their effective use in typing bacterial isolates (6). Several programs have been developed to identify CRISPR arrays in genomic sequences, the most frequently cited being CRISPRFinder (7), CRT (8) and PILER-CR (9). Additional programs such as CRISPRDetect (10), CRISPRdigger (11) and CRF (12) are also available. Three programs have been proposed for CRISPR array strand prediction based on the characteristics of the CRISPR repeat and the leader: CRISPRDirection using CRISPRDetect (10,13), CRISPRstrand (14,15) and CRISPRleader (16). Cas proteins and systems can be identified using the program Macromolecular System Finder (MacSyFinder), which has a dedicated module (CasFinder) (17). The program is based on the search of protein similarity using Hidden Markov Models (HMM) and a model of genetic composition and organization of the identified components. HMMCAS is a web tool that can be queried online to identify Cas proteins (18).
The search for CRISPRs and Cas in user-submitted data can be done on the web using CRISPRone (19) or locally using the CRISPRdisco pipeline (20). Here, we present CRISPRCasFinder, which is an updated, improved, and integrated version of CRISPRFinder and CasFinder with freely available third-party software dependencies. CRISPRCasFinder now includes a standalone version, and presents enhanced CRISPR detection performance.

RESULTS

Availability and implementation

The CRISPRCasFinder web server is based on independent front and back ends. The front end was implemented as a user-friendly web application using the .NET Core (dotnet core) development platform and the C# (C sharp) programming language. The Bootstrap framework was used to design the web application. CRISPRCasFinder is also available as a standalone program compatible with Linux (including Windows Subsystem for Linux) and MacOS systems. The program was written in Perl. A workflow is shown in Figure 1 and Supplementary Figure S1, and details on dependencies are provided in Supplementary Material.

Figure 1.

CRISPRCasFinder workflow.

Input

The web server currently accepts (multi-)Fasta DNA sequence files of size up to 50 Mb including up to 100 sequences. The standalone application has no pre-defined input size limit and is only limited by the available computer memory.

Output

The web server produces a summary table with an overview of the results (Figure 2A) and the possibility to visualize each array separately (Figure 2B). CRISPR arrays and Cas protein analyses are returned as .xls, GFF3, JSON, TSV and Fasta formatted files. The standalone program returns the same files as well as optional files (see Supplementary Material for further details).

Figure 2.

Output of CRISPRCasFinder. (A) The summary displays information on CRISPR arrays and cas gene clusters in the order in which they lie along the chromosome. “Direction” is the proposed orientation of the CRISPR cluster according to the CRISPRDirection program. (B) Details on individual CRISPR arrays and cas gene clusters can be viewed.

Improved detection of CRISPR arrays and evidence level rating

To identify CRISPR arrays, CRISPRCasFinder uses CRISPRFinder v4.2, which is itself based on Vmatch version 2.3 (21) (http://www.vmatch.de/) to identify the CRISPR repeats. CRISPRFinder v4.2 has evolved from the first version described by Grissa et al. (7) and the differences are listed in Supplementary Material. In order to help the user to discriminate spurious CRISPR-like elements from true CRISPRs, we included a rating system based on several criteria. Short candidate arrays made of one to three spacers often do not correspond to CRISPRs (22) and are therefore given the lowest evidence level (rated 1). Evidence levels 2–4 are attributed based on combined degrees of similarity of repeats and spacers. In the majority of cases, repeats are very well conserved and can be defined as a stretch of sequence with a 100% similarity inside the CRISPR array when excluding the distal truncated/diverged repeat. Arrays showing repeat heterogeneity often correspond to coding sequences with a repeated element and are rarely real CRISPRs (23). In contrast, spacers are not expected to show a significant degree of similarity, except in the case of rare recombination or duplication events (24). Therefore, the degree of similarity between spacers is expected to be very low in bona fide CRISPRs. We thus implemented an algorithm to measure CRISPR repeat conservation based on Shannon’s entropy (Supplementary Material, Table S1, Figures S2–4) and produce an EBcons (entropy-based conservation) index. We empirically determined EBcons thresholds based on the analysis of 128 CRISPR arrays from 128 genomes in CRISPRdb (23) (see Supplementary Material and Supplementary dataset 1). Putative CRISPR arrays with at least four spacers are assigned to levels 2–4 as follows: repeats EBcons < 70 (level 2); repeats EBcons ≥ 70 and spacers overall percentage identity > 8% (level 3); repeats EBcons ≥ 70 and spacers overall percentage identity ≤ 8% (level 4).
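The exact EBcons formula is given only in the Supplementary Material, so it is not reproduced here. As a rough illustration of what an entropy-based conservation score can look like, the sketch below computes the Shannon entropy of each column of a set of equal-length repeats and rescales the mean to 0–100. The function names, the normalization by log2(4) and the averaging over columns are all assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols found in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_index(repeats, alphabet_size=4):
    """Mean per-column conservation scaled to 0-100.

    Illustrative only: this is NOT the published EBcons formula.
    `repeats` must be equal-length (already aligned) sequences.
    """
    if not repeats:
        raise ValueError("no repeats given")
    length = len(repeats[0])
    if length == 0 or any(len(r) != length for r in repeats):
        raise ValueError("repeats must be non-empty and of equal length")
    max_entropy = math.log2(alphabet_size)  # 2 bits for A, C, G, T
    scores = [1.0 - column_entropy(col) / max_entropy for col in zip(*repeats)]
    return 100.0 * sum(scores) / length


if __name__ == "__main__":
    # Toy repeats, invented for the example.
    print(conservation_index(["GTTTCAGA", "GTTTCAGA", "GTTTCAGA"]))
    print(conservation_index(["GTTTCAGA", "GTATCAGA", "GTTTCTGA"]))
```

Perfectly identical repeats score 100 under this scheme, and any column variation lowers the score, which is the qualitative behaviour the text describes.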
CRISPR arrays having evidence-levels 3 and 4 may be considered as highly likely candidates, whereas evidence-levels 1 and 2 indicate potentially invalid CRISPR arrays. The ambiguous notion of ‘confirmed’ or ‘hypothetical’ CRISPR array (associated with CRISPRFinder v1.0) is no longer used in CRISPRFinder v4.2. We used a panel of 400 genomic sequences (260 bacteria and 140 archaea) from different species (Supplementary dataset 2) taken in alphabetical order, to evaluate the distribution of CRISPR arrays in the four different evidence-level groups. Out of 3251 arrays, there were respectively 1969, 63, 76 and 1143 arrays with evidence level 1, 2, 3 or 4. The identification of false-positive arrays when they possess fewer than four spacers is not an easy task and some of the evidence-level 1 arrays may in fact be real CRISPRs (see Supplementary Material for an example). Therefore we give the possibility either to view all the detected CRISPR arrays or to hide those with evidence-level 1. When a short CRISPR array has the same consensus repeat as an evidence-level 4 array, it can be considered as a level 4 CRISPR. Applying this rule would upgrade about 8% of the level 1–3 arrays in the test panel to level 4 (163 arrays out of 2108). This scoring correction will be automatically applied in the future when the CRISPR database is integrated in the system. In addition, evidence-level 1 arrays which are not associated with cas genes will be deleted (42 arrays in 26 genomes out of the 400 test genomes).
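The rating rules stated above (fewer than four spacers, an EBcons threshold of 70 and a spacer identity threshold of 8%) can be written as a small decision function. This is a minimal sketch of the rules as described in the text; the function and parameter names are hypothetical and it is not taken from the CRISPRCasFinder source code.

```python
def evidence_level(n_spacers, repeat_ebcons, spacer_identity_pct):
    """Return the evidence level (1-4) of a candidate CRISPR array.

    n_spacers            number of spacers in the candidate array
    repeat_ebcons        repeat conservation index (EBcons), 0-100
    spacer_identity_pct  overall percentage identity between spacers
    """
    if n_spacers < 4:
        return 1  # short arrays (one to three spacers) get the lowest level
    if repeat_ebcons < 70:
        return 2  # poorly conserved repeats
    if spacer_identity_pct > 8:
        return 3  # conserved repeats, but spacers too similar to each other
    return 4      # conserved repeats and dissimilar spacers


if __name__ == "__main__":
    # Values reported in the text for the Pantoea ananatis false positive:
    # 26 spacers, repeat conservation 57, spacer conservation 12%.
    print(evidence_level(26, 57, 12))
```

With the Pantoea ananatis values quoted in the use case section, the function returns level 2, in agreement with the level reported there.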

Orientation of the CRISPR array

CRISPRFinder provides two indicators of CRISPR array orientation. First, orientation was predicted by CRISPRDirection for a curated dataset of consensus CRISPR repeats and this result is shown for CRISPR arrays with a matching repeat. We provide an additional method that does not require the existence of previously oriented homologous systems and is based on the AT% in the 100 bp region flanking the array on both sides. As the 5′ region of an oriented array is often AT-rich (25), the flanking side showing the higher AT% is used as a second indicator of orientation. An option in the standalone program allows users to determine the length of the flanking region to be analyzed. The results of the two tests sometimes differ, as illustrated in Figure 2 with the genome of Bacillus halodurans, showing that additional developments are still necessary to orientate CRISPR arrays with accuracy. The search for an AT-rich region flanking the CRISPR array has been used by different authors to orientate the CRISPR array (e.g. with CRISPRmap or CRISPRDirection) but it is not relevant for all genomes, particularly those which are globally AT-rich. In addition, Alkhnbashi et al. (16) showed that 13% of 980 archaeal CRISPR loci and 24% of 2852 bacterial loci were leaderless.
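The AT% indicator described above can be sketched as follows. The code compares the AT content of the two regions flanking an array and reports which side is richer in A and T, that side being the candidate leader (5′) side. The function names, the 0-based end-exclusive coordinates and the "undetermined" result for ties are assumptions for illustration; how the program itself handles ties or sequence ends is not stated in the text.

```python
def at_percent(seq):
    """Percentage of A and T bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("A") + seq.count("T")) / len(seq)


def at_rich_side(genome, array_start, array_end, flank=100):
    """Report which flank of a CRISPR array has the higher AT%.

    array_start and array_end are 0-based, end-exclusive coordinates.
    flank is the flanking length in bp (100 bp as in the text; the
    standalone program lets users change it).
    """
    left = genome[max(0, array_start - flank):array_start]
    right = genome[array_end:array_end + flank]
    left_at, right_at = at_percent(left), at_percent(right)
    if left_at > right_at:
        return "left"
    if right_at > left_at:
        return "right"
    return "undetermined"


if __name__ == "__main__":
    # Toy sequence: AT-rich flank, a fake "array", then a GC-rich flank.
    toy = "AT" * 50 + "GTTTCAGACGGATC" + "GC" * 50
    print(at_rich_side(toy, 100, 114))
```

In a globally AT-rich genome both flanks may score similarly, which illustrates why the text treats this only as a second, weaker indicator.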

Update of the detection and typing of Cas proteins

CRISPRCasFinder identifies and types Cas systems based on known Cas protein sequences to increase sensitivity and specificity. Putative coding sequences (CDSs) are identified using Prodigal v2.6.3 (26) in the input nucleotide sequences. Translated CDSs are analyzed by CasFinder to identify the systems and their components. CasFinder is based on MacSyFinder (17), which provides a flexible framework to identify systems in genomes using a model describing their genetic architecture and the minimally sufficient number of protein components. Putative Cas proteins are searched by sequence similarity using HMM protein profiles. The assignment of a protein to a given system is decided based on its compliance with the content and organization defined in the model. Components can be defined as mandatory, accessory or forbidden. The latter (proteins that cannot belong to a defined subtype) are useful for discriminating between systems' types or sub-types. Cas systems can be identified at three levels of precision (‘General’, ‘Typing’ and ‘SubTyping’) using increasingly specific descriptions of the systems. This allows dealing with the trade-off between the detection of a maximum of Cas clusters by using the more sensitive General model, and a precise identification of each cluster, using the more stringent sub-type model. The General Cas model was designed to detect any cluster of at least three Cas proteins (class 1 systems), or containing at least Cas9, Csn2, Cas12 or Cas13 for class 2 systems. The first version of CasFinder contained a set of models to identify and classify CRISPR-Cas systems in three types and thirteen subtypes (17). The last update (CasFinder version 2.0) fits with the new community-based classification of CRISPR-Cas systems (2,3). The program distinguishes class 1 from class 2 systems, detects three new types and nine new subtypes of CRISPR-Cas systems (a description of the sub-typing models is provided in Supplementary Figure S5). 
Moreover, as 394 new profiles were published with the new classification, previously existing models and protein profiles were revised to improve speed, limit the redundancy between protein profiles, and enhance subtyping accuracy (Supplementary Figure S6). As a result, the new version contains 43 additional protein profiles, for a total of 120 profiles (Supplementary dataset 3).
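To make the notions of mandatory, accessory and forbidden components concrete, here is a minimal sketch of how a cluster of profile hits could be checked against a system model. The class, its field names and the toy model are hypothetical and greatly simplified: real MacSyFinder models are defined in their own file format and also constrain gene co-localization, which is ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemModel:
    """Simplified, hypothetical stand-in for a Cas system model."""
    name: str
    mandatory: frozenset
    accessory: frozenset = frozenset()
    forbidden: frozenset = frozenset()
    min_mandatory: int = 1  # minimum number of mandatory components required
    min_genes: int = 3      # minimum number of components overall


def matches(model, cluster_hits):
    """Return True if the set of hits in a gene cluster satisfies the model."""
    hits = set(cluster_hits)
    if hits & model.forbidden:
        return False  # a forbidden component excludes this (sub)type
    mandatory_found = hits & model.mandatory
    allowed_found = hits & (model.mandatory | model.accessory)
    return (len(mandatory_found) >= model.min_mandatory
            and len(allowed_found) >= model.min_genes)


if __name__ == "__main__":
    # Toy model invented for the example; it is not an actual CasFinder model.
    toy = SystemModel(
        name="toy_model",
        mandatory=frozenset({"cas3"}),
        accessory=frozenset({"cas1", "cas2", "cas5", "cas7"}),
        forbidden=frozenset({"cas9"}),
    )
    print(matches(toy, ["cas1", "cas2", "cas3"]))
    print(matches(toy, ["cas1", "cas2", "cas3", "cas9"]))
```

Raising `min_genes` or adding forbidden components makes a model more stringent, which mirrors the trade-off between the sensitive General model and the more specific sub-typing models.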

Evaluation of CRISPR and Cas detection by CRISPRCasFinder

To test the accuracy of the novel CRISPR detection method we used the test panel of 400 genomic sequences. Predictions were compared to the expert-curated annotations in CRISPRdb (used as a reference-set) and to three CRISPR detection programs: PILER-CR, CRT and CRISPRDetect. Only CRISPR arrays having at least four spacers were taken into account in the evaluation. In CRISPRdb, 1263 arrays were displayed for the above mentioned set of 400 genomes after manual curation (Supplementary dataset 4). Precision, recall and F-measure metrics were used to compare the four programs with the reference-set (Supplementary Tables S2 and 3). These metrics showed that the four programs performed similarly, with precision from 0.921 to 0.982, recall from 0.935 to 0.987, and F-measure from 0.932 to 0.9776. Detailed validation procedures are provided in Supplementary Material. Predictions of CRISPRCasFinder are visible in Supplementary dataset 5. A comparison between the detection of CasFinder v2.0 and the summary presented with the new classification in (2) revealed few differences (Jaccard coefficients between 77 and 96%, Supplementary Figure S7). Overall, CasFinder v2.0 is more conservative and finds fewer systems, because it requires at least three genes in the Class 1 Cas system, whereas the previous study only required two genes. We opted for the conservative approach because, to the best of our knowledge, no study identified a fully functional Class 1 Cas system capable of adaptation and interference with only two genes. Expert users can change the underlying models to identify Cas systems in the standalone version and lower the threshold. However, the use of the ‘General’ model already allows the identification of relevant small clusters such as in Melioribacter roseus (NC_018178) which possesses an evidence-level 4 CRISPR with a 46-bp repeat located immediately adjacent to two class 2 cas genes (cas2_TypeI-II-III, cas1_TypeII).
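For readers unfamiliar with the metrics used in this evaluation, the sketch below gives their standard definitions. How predicted arrays were matched to reference arrays is described only in the Supplementary Material, so the counts of true positives, false positives and false negatives are taken here as inputs; the function names and the example numbers are illustrative and are not results from the study.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from raw counts.

    tp: predictions that match the reference set
    fp: predictions absent from the reference set
    fn: reference entries that were not predicted
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def jaccard(a, b):
    """Jaccard coefficient of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(precision_recall_f(tp=90, fp=10, fn=10))
    # Made-up sets of system identifiers found by two methods.
    print(jaccard({"sysA", "sysB", "sysC"}, {"sysB", "sysC", "sysD"}))
```

Because the F-measure is the harmonic mean of precision and recall, it always lies between the two for a given program.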
We evaluated the performance of CRISPRCasFinder, CRISPRDetect and CRISPRone online using a set of 30 genomes (Supplementary dataset 6) selected because they possess particular sets of CRISPRs observed while curating CRISPRdb (Supplementary Table S5). As compared to CRISPRCasFinder and CRISPRone, CRISPRDetect proposes options to edit the CRISPR array and provides a directional analysis based on seven characteristics of the leader and the CRISPR repeat (10); however, it relies on NCBI annotations to identify Cas proteins and therefore often fails to produce a result. CRISPRone performs an HMM search to identify Cas proteins but the method is less stringent than CasFinder v2.0, and Cas-like proteins are frequently displayed (see Supplementary Material for selected examples). Online, the duration of an analysis was variable with the three programs, presumably depending on the capacity and the workload of the associated servers. Single bacterial genomes of 4–6 Mbases (Mb) were analyzed by CRISPRCasFinder in 1–2 min whereas a 50-Mb file (the current limit on the web server) of 10 fasta files made by concatenating a 5-Mb genome (containing two Cas and seven CRISPR loci) ran in 5 min. The same 50-Mb file split into 100 fasta files ran at the same speed. Runtimes and memory usage were calculated with the standalone versions of CRISPRCasFinder and CRISPRDetect using four genomes of 0.5, 5, 10 and 53 Mb. The results showed that runtimes were similar between the two programs but CRISPRCasFinder tends to require more memory (Supplementary Table S6). Finally, we believe that the output of CRISPRCasFinder in the form of a clear and compact summary is an advantage over the other programs.

Use case studies

The simultaneous search for CRISPR and Cas by CRISPRCasFinder greatly facilitates the evaluation of tentative CRISPR and cas loci, as exemplified by several cases. Some genes possess tandem repeats which can be misidentified as CRISPRs. For example, analysis of the Pantoea ananatis LMG20103 genome (NC_013956) reveals the existence of a putative CRISPR array with a 23-bp repeat and 26 spacers (Supplementary Figure S8). The evidence level of this array is 2, with repeats and spacers conservations of 57 and 12%, respectively, and no Cas protein detected in the genome. In fact this sequence is part of an ice nucleation protein gene. An opposite situation is that of Streptococcus sanguinis SK36 (CP000387) which displays a cluster of Type III-A cas genes intermixed with two evidence-level 2 CRISPR arrays showing highly dissimilar repeat sequences (Supplementary Figure S9). In both cases, CRISPRone (19) and CRISPRCasFinder were in agreement. Another interesting feature of CRISPRCasFinder is the possibility to compare the repeat of short arrays with evidence-level 1 to that of larger arrays present in the same genome, making it possible to confirm the small loci as valid CRISPRs, such as in the genome of Methanosarcina thermophila MT-1 (AP017646) (Supplementary Figure S10).

DISCUSSION AND CONCLUSION

The updated CRISPRCasFinder shows enhanced performance and capabilities to identify both CRISPR arrays and Cas proteins, improving the previously existing separate tools CRISPRFinder and CasFinder. In addition, CRISPRCasFinder and CRISPRCasViewer (Supplementary Figure S11) are available as standalone programs for users willing to analyze large volumes of sequences (see Supplementary Material for details). We are developing a dedicated tool for the analysis of large metagenomic datasets, allowing a simpler and faster CRISPR array and Cas protein detection. CRISPRCasFinder will continue to evolve, notably by providing a better prediction of CRISPR array orientation using curated data from the currently developed database. Key criteria for array orientation are the presence of a leader/promoter sequence immediately before the first repeat, the existence of a diverged/truncated repeat at the 3′ end, the nature of the repeat sequence and its secondary structure, and the position of the cas genes cluster. The program will also be updated to match novel typing methods for Cas systems if and when sufficient examples become available. Finally, using extra information can improve the ability to distinguish small CRISPRs from false positives, including the existence of a similar repeat in a larger CRISPR array, the presence of cas genes, generally situated upstream of the CRISPR array (27), or of a characteristic leader sequence. The next version of CRISPRFinder will incorporate these elements for improved CRISPR classification. CRISPRCasFinder will be part of a new integrated CRISPR-Cas analysis system, eventually replacing CRISPRdb and associated tools (CRISPRtionary, CRISPRcompar, MyCRISPRdb), which were not designed as actual web services.

25 in total

1. Identification of genes that are associated with DNA repeats in prokaryotes.

Authors: Ruud Jansen; Jan D A van Embden; Wim Gaastra; Leo M Schouls
Journal: Mol Microbiol Date: 2002-03 Impact factor: 3.501

2. Clustered regularly interspaced short palindromic repeats (CRISPRs) for the genotyping of bacterial pathogens.

Authors: Ibtissem Grissa; Gilles Vergnaud; Christine Pourcel
Journal: Methods Mol Biol Date: 2009

3. HMMCAS: A Web Tool for the Identification and Domain Annotations of CAS Proteins.

Authors: Guoshi Chai; Min Yu; Lixu Jiang; Yaocong Duan; Jian Huang
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2017-02-07 Impact factor: 3.710

4. Prodigal: prokaryotic gene recognition and translation initiation site identification.

Authors: Doug Hyatt; Gwo-Liang Chen; Philip F Locascio; Miriam L Land; Frank W Larimer; Loren J Hauser
Journal: BMC Bioinformatics Date: 2010-03-08 Impact factor: 3.169

5. Discovery and Functional Characterization of Diverse Class 2 CRISPR-Cas Systems.

Authors: Sergey Shmakov; Omar O Abudayyeh; Kira S Makarova; Yuri I Wolf; Jonathan S Gootenberg; Ekaterina Semenova; Leonid Minakhin; Julia Joung; Silvana Konermann; Konstantin Severinov; Feng Zhang; Eugene V Koonin
Journal: Mol Cell Date: 2015-10-22 Impact factor: 17.970

Review 6. An updated evolutionary classification of CRISPR-Cas systems.

Authors: Kira S Makarova; Yuri I Wolf; Omer S Alkhnbashi; Fabrizio Costa; Shiraz A Shah; Sita J Saunders; Rodolphe Barrangou; Stan J J Brouns; Emmanuelle Charpentier; Daniel H Haft; Philippe Horvath; Sylvain Moineau; Francisco J M Mojica; Rebecca M Terns; Michael P Terns; Malcolm F White; Alexander F Yakunin; Roger A Garrett; John van der Oost; Rolf Backofen; Eugene V Koonin
Journal: Nat Rev Microbiol Date: 2015-09-28 Impact factor: 60.633

7. CRISPR recognition tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats.

Authors: Charles Bland; Teresa L Ramsey; Fareedah Sabree; Micheal Lowe; Kyndall Brown; Nikos C Kyrpides; Philip Hugenholtz
Journal: BMC Bioinformatics Date: 2007-06-18 Impact factor: 3.169

8. CRISPRstrand: predicting repeat orientations to determine the crRNA-encoding strand at CRISPR loci.

Authors: Omer S Alkhnbashi; Fabrizio Costa; Shiraz A Shah; Roger A Garrett; Sita J Saunders; Rolf Backofen
Journal: Bioinformatics Date: 2014-09-01 Impact factor: 6.937

9. CRISPRDetect: A flexible algorithm to define CRISPR arrays.

Authors: Ambarish Biswas; Raymond H J Staals; Sergio E Morales; Peter C Fineran; Chris M Brown
Journal: BMC Genomics Date: 2016-05-17 Impact factor: 3.969

Review 10. CRISPR-Cas: biology, mechanisms and relevance.

Authors: Frank Hille; Emmanuelle Charpentier
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2016-11-05 Impact factor: 6.237
