Victor A Padilha (1), Omer S Alkhnbashi (2), Van Dinh Tran (2), Shiraz A Shah (3), André C P L F Carvalho (1), Rolf Backofen (2,4).
1. Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, SP 13566-590, Brazil.
2. Bioinformatics Group, Department of Computer Science, University of Freiburg, 79110 Freiburg, Germany.
3. COPSAC, Copenhagen University Hospitals Herlev and Gentofte, DK-2820 Gentofte, Denmark.
4. Signalling Research Centres BIOSS and CIBSS, University of Freiburg, 79104 Freiburg, Germany.
Abstract
MOTIVATION: CRISPR-Cas systems are found in most archaeal and many bacterial genomes, where they provide adaptive immunity against mobile genetic elements. Each CRISPR-Cas system is encoded by a set of consecutive cas genes, here termed a cassette. Identifying cassette boundaries is key to finding cassettes in CRISPR research. This is often carried out using Hidden Markov Models and manual annotation. In this article, we propose the first method able to automatically define the cassette boundaries. In addition, we present a Cas-type predictive model that the method uses to assign a Cas label, drawn from a set of pre-defined Cas types, to each gene located within a cassette's boundaries. Furthermore, the proposed method can detect potentially new cas genes and decompose a cassette into its modules. RESULTS: We evaluate the predictive performance of the proposed method on data collected from the two most recent CRISPR classification studies. In our experiments, we obtain an average similarity of 0.86 between the predicted and expected cassettes. We also achieve F-scores above 0.9 for the classification of cas genes of known types and 0.73 for those of unknown types. Finally, we conduct two additional case studies, in which we investigate the occurrence of potentially new cas genes and the occurrence of module exchange between different genomes. AVAILABILITY AND IMPLEMENTATION: https://github.com/BackofenLab/Casboundary. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.