Literature DB >> 29911924

Comprehensive search for accessory proteins encoded with archaeal and bacterial type III CRISPR-cas gene cassettes reveals 39 new cas gene families.

Shiraz A Shah1,2, Omer S Alkhnbashi3, Juliane Behler4, Wenyuan Han2, Qunxin She2, Wolfgang R Hess4,5, Roger A Garrett2, Rolf Backofen3,6.   

Abstract

A study was undertaken to identify conserved proteins that are encoded adjacent to cas gene cassettes of Type III CRISPR-Cas (Clustered Regularly Interspaced Short Palindromic Repeats - CRISPR associated) interference modules. Type III modules have been shown to target and degrade dsDNA, ssDNA and ssRNA and are frequently intertwined with cofunctional accessory genes, including genes encoding CRISPR-associated Rossman Fold (CARF) domains. Using a comparative genomics approach, and defining a Type III association score accounting for coevolution and specificity of flanking genes, we identified and classified 39 new Type III associated gene families. Most archaeal and bacterial Type III modules were seen to be flanked by several accessory genes, around half of which did not encode CARF domains and remain of unknown function. Northern blotting and interference assays in Synechocystis confirmed that one particular non-CARF accessory protein family was involved in crRNA maturation. Non-CARF accessory genes were generally diverse, encoding nuclease, helicase, protease, ATPase, transporter and transmembrane domains with some encoding no known domains. We infer that additional families of non-CARF accessory proteins remain to be found. The method employed is scalable for potential application to metagenomic data once automated pipelines for annotation of CRISPR-Cas systems have been developed. All accessory genes found in this study are presented online in a readily accessible and searchable format for researchers to audit their model organism of choice: http://accessory.crispr.dk .

Entities:  

Keywords:  CARF; CRISPR; accessory; ancillary; archaea; auxillary; bacteria; csx1; csx3; helicase; nuclease; protease; type III

Mesh:

Substances:

Year:  2018        PMID: 29911924      PMCID: PMC6546367          DOI: 10.1080/15476286.2018.1483685

Source DB:  PubMed          Journal:  RNA Biol        ISSN: 1547-6286            Impact factor:   4.652


Introduction

Type III CRISPR-Cas systems have been classified into four main subtypes A to D of which subtypes III-A and III-B have been studied extensively [1,2]. They often coexist within cells carrying Type I systems and are assumed to complement the latter functionally. However, whereas Type I systems specifically target dsDNA, Type III systems can degrade dsDNA, ssDNA and ssRNA [3-9]. Importantly, an earlier survey of archaeal cas gene cassettes revealed that Type III modules are exceptional in that they frequently carry accessory genes either within or immediately bordering their core cas gene modules [1]. Some of these accessory genes encode putative proteases or nucleases suggesting that they provide additional functions that modulate, complement or extend, Type III interference functions. Some accessory cas genes were initially identified during the first comprehensive CRISPR-Cas subtype classification [10]. Core cas genes were found to be conserved across most CRISPR-Cas subtypes, while a different class of cas genes displayed a more variable distribution and were not consistently associated with specific subtypes. Such genes were designated csx and six gene families, csx1-6, were classified [10]. In the next major CRISPR-Cas systematics update, some accessory cas gene families were merged into the existing, and new, core cas gene families, with only the csx1 and csx3 families remaining separate [11]. This more robust definition of core cas gene families enabled a more rigorous distinction to be made from eventual new accessory genes. Building on this, an archaea-specific study identified almost twenty new cas accessory gene families that were considered distinct from core cas gene families, an inference that was reinforced by their variable distribution amongst CRISPR-Cas subtypes [1]. Most of the accessory genes were found to be associated with Type III systems, and most of the encoded proteins were related to the original Csx1 family proteins [10]. Two additional Type III subtypes, III-C and III-D were also introduced, and they were seen to be associated with many of the same accessory genes as those linked to Type III-A and III-B modules [1]. Later, the Csx1 protein family was formally defined by its ligand binding CARF domain and was annotated in both archaea and bacteria [12]. In addition, a few bacterial accessory protein families lacking CARF domains were found and added to the TIGRFAMs database by an author of the original study [10]. The latest CRISPR-Cas systematics update [2] adopted the new Type III subtypes and integrated many of the newly identified archaeal and bacterial accessory gene families. Moreover, subtype III-D was extended to include disparate bacterial Type III systems that had previously defied classification. As a result, their core genes were relatively poorly characterized and were sometimes annotated as accessory genes. Nevertheless, Makarova et al. (2015)[12] provided detailed annotations of all known archaeal and bacterial CRISPR-Cas systems as Supplementary Material for future reference; the present study is based on those data. Although experimental studies on non-CARF accessory proteins have been limited to Cmr7 from Sulfolobus [13], numerous studies have investigated the role of common CARF family accessory proteins such as Csx1, Csm6 and Csa3. The crystal structure of Csa3 was first resolved in 2011 [14] and it demonstrated that the C-terminal DNA binding HTH domain was likely to be under allosteric control of the N-terminal Rossman fold domain implicated in dinucleotide ligand sensing, of the type that is normally involved in signal recognition. Furthermore, a crystal structure of Cmr2, with its cyclase domain, was also compatible with the occurrence of such a signal [15]. These early insights were later undermined by genetic studies on Type III systems [16,17] that linked the presence of Csx1 and Csm6 to a DNA interference phenotype. This was enigmatic at the time because several crystallographic, biochemical and bioinformatic studies had predicted that the proteins must be primarily RNases [18-20]. A key proposal was subsequently made that Csx3 was a distant CARF family member and that the ligands sensed by CARF proteins in general were cyclic oligonucleotides [21,22]. This hypothesis was confirmed in three recent independent studies [23-25], two of which also showed that the cyclic oligoadenylate (cOA) signaling molecules were synthesized by the polymerase domain of Cas10 upon recognition of its target [15,26]. Apparently any target recognition by cellular Type III interference complexes triggers cOA synthesis which, in turn, ensures the coordinated activation of a potentially complex and layered defence response that is dependent on intracellular CARF proteins. Although CARF proteins constitute the most widespread Type III accessory proteins, accessory proteins lacking CARF domains are much more diverse and almost as widespread. For example, it was shown that archaeal Type III interference gene cassettes are often flanked by diverse accessory genes encoding protein domains associated with nucleases, proteases, helicases, toxins or transcriptional regulators [1,13,16]. As for bacterial genomes, a search has already been undertaken specifically for CARF-encoding genes in both archaea and bacteria, and it yielded a comprehensive catalogue of CARF domain-carrying proteins [12]. Here we identify new families of non-CARF cas accessory genes associated with Type III modules, in addition to those encoding CARF domains, in archaeal and bacterial genomes. Since non-CARF accessory genes do not share a common sequence motif or domain, they are particularly challenging to identify. Our method relies on a guilt-by-association approach, coupled with criteria for coevolution and specificity to exclude false positives. Type I, II, IV, V and VI cas gene cassettes were not examined systematically because the incidence of accessory genes is too low to distinguish signal from noise with our method given the limited number of annotated genomes available, although an independent study has been undertaken for these CRISPR-Cas types [27]. In the present study, by focusing on Type III systems, we are better able to identify reliable candidate accessory genes from a spurious gene background.

Results

From a total of 1263 archaeal and bacterial genomes in which CRISPR-Cas cassettes are fully annotated [2], 381 genomes encoded 512 distinct Type III genetic modules which were surveyed for accessory genes (Table 1). Selecting five genes immediately upstream and downstream from each module yielded a total of 4467 genes. This number was less than the theoretical total of 5120 genes because neighbouring adaptation modules, Type I interference modules and cas6 genes were omitted from the analysis. All the encoded protein sequences were aligned with one another and a custom similarity metric [1] was calculated and used for Markov clustering (MCL) [28]. This yielded 231 gene clusters each with more than three members. Cas association scores were calculated for each gene cluster using information about Type III coevolution, Type III subtype specificity, and host genome self-similarity. Gene clusters with low Cas association scores were considered spurious and disregarded. The Cas association score ranged from 100 to −100 corresponding to a strong through weak Type III association, and a cut off of 24 was set by comparing the results to the manually curated accessory gene families obtained in the previous archaeal genome study [1]. Thereafter, 76 of the 231 Type III-associated gene clusters were considered strongly associated; the remainder were discarded. The 76 gene clusters encompassed a total of 982 potential accessory genes distributed across the 512 Type III modules. This translated, on average, to two accessory genes per Type III module, varying from no additional genes to more than the core Type III genes (Figure 1). Properties of the 76 gene families are summarized (Table 2) and a more elaborate version (Table S1) is available online.
Table 1.

Summary of results from the current study compared to previous studies that identified accessory cas genes, divided between archaea and bacteria. The previous studies were limited to a (a) archaeal genomes [1] and (b) a representative subset of genomes [12]. The present study is more comprehensive.

 present study
Makarova 2014
Vestergaard 2014
 ArBaArBaAr
Number of genomes surveyed2172534172484159
Number of genomes with CRISPR-Cas systems131113150133124
Number of genomes with Type III systems783043411483
Total number of Type III systems11040261251125
Number of genomes with accessory genes672627818076
Total number of putative accessory genes248734190454239
Accessory genes found outside cas cassettes001041520
Accessory genes associated to Type III systems24873461251210
Total number of non-CARF accessory genes found13636900135
Figure 1.

Gene maps of example Type III gene cassettes including accessory genes. Core Type III genes are drawn in red with csm/cmr numbers for Type III-A/B modules, and cas numbers for Type III-D modules. cas6 and the adaptation module genes are orange and blue respectively, also with cas gene numbers. Accessory genes are drawn in purple with the gene cluster number indicated.

Table 2.

List of putative accessory gene clusters conserved near Type III genetic modules which passed the Type III association score cut-off (>24). A few gene families with lower scores are also included because they have been confirmed as accessory proteins in earlier studies. For each putative accessory protein family, the cluster id, the size (i.e. number of members per cluster), and the calculated Type III association score are listed. An example (gene-) locus id is also provided for reference. Names are provided for accessory protein families identified in earlier studies [1,10,12], while those identified in the present study are indicated in bold with C3a numbers (Cas type 3 Associated). 39 of 76 putative accessory protein families are newly identified. Commonly associated Type III subtypes are listed for each putative accessory protein family. Unclassified variant subtypes are indicated by ‘III’. Most accessory gene families can function with different Type III subtypes. A predicted function is given in the last column.

clustersizescoreexamplenamesubtypeannotation
111673.88JTY_2831Csx1/Csm6A,B,DCARF+HEPN
29469.25Selin_1039Csx1A,B,DCARF+HEPN
36962.33FFONT_0074Csm6A,B,DCARF+RelE
56459.36ERE_12150Csx19/24(A,B),Dcore gene (cas11)
66158.51TOPB45_1115Csx1A,B,(C,D)CARF+HEPN
75142.59Tph_c24580WYL/Csx1A,B,DCARF+WYL
94937.91Vdis_1148Csx1(A),B,(C)DCARF+Nuclease
114135.59Thebr_0941Csx15/20A,B,Dpeptidase
173124.7VMUT_1481Cas_RecFA,B,(C,D)SMC ATPase
23204.61CYB_0598Csx1(A),B,Dfound elsewhere
251855.83Dtur_0613Csx3A,B,DCsx3 (CARF)
271750.07Cyan10605_3521Csx18B,(D)adaptation associated
281766.76CYB_0586Csx2117Type III-D associated
291445.91CBO2177CorAB,(C),DCorA-like
331310.47M1425_0883Csa3(A),B,CType I-A associated
351261.07CYB_0599Csx3(A),B,DCsx3-AAA
361150.71Nos7107_4284WYL(A,B),DHTH-WYL
371156.62SacN8_09300Csx26III,A,(B)HNH nuclease
381054.35Mefer_0963C3a38A,Bunknown
42974.4CFT03427_1632Csx19Dcore gene (cas11)
43958.83Ferpe_1557C3a43A,BLon protease
45928.54Marme_0670PD-DExKA,B,Dpossible nuclease
47927.09SacN8_09290proteaseIII,A,B,Dpeptidase
50861.07Swol_2530Csx13(A),B,DTM+CARF
55744.44Nos7107_2836C3a55B,DABC permease
57720.46Calkr_2542HerAA,B,(C)Type III coevolved
58733.01VMUT_1471NurAA,BType III coevolved
59755.84Nos7107_2826C3a59A,B,Dtrans-membrane (TM)
64743.07SacRon12I_09300C3a64III,Aunknown
67640.03Caur_2269C3a67(A),BAAA-Csx3
69629.69B005_5545C3a69Dunknown
76639LS215_0703C3a76III,A,Dunknown
77629.73NE0116Csx16A,B,Da.k.a. cas_VVA1548
80627.48PTH_0706C3a80B,C,DAAA+ ATPase
816−9.48YN1551_2131Csx1(A,B),DType I associated
83688.69VMUT_1493cas_RFasA,(B)cluster 17 associated
84666.57TTX_1229cas_RFasA,(B)cluster 17 associated
87674.65SSO1986Cmr7BCmr7
93530.57Thebr_0950C3a93B,C,Dposs. AAA ATPase
96542.25Mvol_0529Mvol_0529-famBDNA binding C-ter.
104539.74Msed_1167Csx1A,B,DCARF+PIN
107584.37B005_5544Csx1B,DCARF
108578.8Pcal_0278cas_RFasBcluster 17 associated
116457.92Rru_A0181C3a116B,Dcluster 29 associated
121462.04Desac_1715Csx23A,B,Dunknown
123437.09Caur_2303C3a123B,Dunknown
124444.42PCC7418_1341C3a124Bunknown
139428.91Tthe_0931C3a139A,B,DHKD+Snf2
146468.43SSO1421C3a146DDNA binding HTH
152325.85Cylst_6373C3a152CType III-C specific
156328.1Metin_0159C3a156B,Coxidoreductase
159327.06Calkr_2554C3a159A,B,Cmethyl transferase
162340.43SYO3AOP1_0653C3a162III,A,Bunknown
166332.1Hbut_0719C3a166B,Dposs. crRNA proc.
168331.1slr7083C3a168Bunknown
173370Adeg_0988Csx21III,B,Dunknown
174374.3Adeg_0809C3a174B,Cnucleotidyl trans.
178337.97Csac_0071C3a178A,B,Dputative invertase
181325.55Tpen_1359C3a181A,B,CDiadenylate cyclase
186328.77RoseRS_2594C3a186A,C,DC-3ʹ,4ʹ desaturase
187339Dole_0738C3a187A,BNERD+UvrD
189350.73Hoch_5581C3a189B,Dkinase
193343.78Thebr_0949C3a193B,Dposs. alt. Cas3
194326.83Mcup_1132C3a194B,DAA transporter
196346.29TREPR_1099C3a196III,A,Bunknown
198340.67Mrub_1490C3a198BTPR protein
205367.5MLP_11360C3a205Dunknown
206329.39Ndas_2980C3a206DTAP-like protein
211351.67Pars_1112cas_RFasBunknown
212339.67RoseRS_0371C3a212C,DSNc+LTD
214330.16Rcas_4246C3a214B,CGlgB
216325.6SSA_1254C3a216A,Badaptation associated
222344Vpar_1800C3a222A,Dunknown
223343.33YG5714_0635C3a223III,Dunknown
227336.73Sfum_1354PrimPolA,Badaptation polymerase
230328.76TVNIR_1454C3a230BRecX family protein
Summary of results from the current study compared to previous studies that identified accessory cas genes, divided between archaea and bacteria. The previous studies were limited to a (a) archaeal genomes [1] and (b) a representative subset of genomes [12]. The present study is more comprehensive. List of putative accessory gene clusters conserved near Type III genetic modules which passed the Type III association score cut-off (>24). A few gene families with lower scores are also included because they have been confirmed as accessory proteins in earlier studies. For each putative accessory protein family, the cluster id, the size (i.e. number of members per cluster), and the calculated Type III association score are listed. An example (gene-) locus id is also provided for reference. Names are provided for accessory protein families identified in earlier studies [1,10,12], while those identified in the present study are indicated in bold with C3a numbers (Cas type 3 Associated). 39 of 76 putative accessory protein families are newly identified. Commonly associated Type III subtypes are listed for each putative accessory protein family. Unclassified variant subtypes are indicated by ‘III’. Most accessory gene families can function with different Type III subtypes. A predicted function is given in the last column. Gene maps of example Type III gene cassettes including accessory genes. Core Type III genes are drawn in red with csm/cmr numbers for Type III-A/B modules, and cas numbers for Type III-D modules. cas6 and the adaptation module genes are orange and blue respectively, also with cas gene numbers. Accessory genes are drawn in purple with the gene cluster number indicated.

Online availability of results

The complete set of results is provided online in an indexed and readily searchable format for further validation (Table S1, http://accessory.crispr.dk). Multiple sequence alignments are provided for protein sequences encoded by each gene cluster. Gene names, and host genome accession numbers are given for each gene together with the adjacent Type III module subtype (Table S2, http://accessory.crispr.dk/genes.html). Profile-profile matches to the sequence family databases Pfam [29], TIGRFAMs [10] and CDD [11] are also presented for each gene family, along with matches to protein structures in PDB [30]. A multi-FASTA file containing all 982 accessory protein sequences along with the designated cluster ID is also available (http://accessory.crispr.dk/Shah2018.TypeIIIaccessory.faa). Finally, gene maps have been drawn for each of the 982 separate putative accessory genes found, illustrating their genomic context including the neighbouring Type III genetic module and other accessory genes or core cas genes from other systems flanking the same module. Clicking the gene maps links to the gene entry in the RefSeq database, enabling easy retrieval of protein or gene sequences, along with the option of initiating a BLAST or CDD search with an additional click. Code for reproducing the results is available GitHub (https://github.com/BackofenLab/Accessory.Crispr).

Largest gene clusters

Of the five largest gene clusters (1 to 5), the first three had 116, 94 and 69 members, respectively (Table 2). All gave highly significant profile-profile matches to known CARF protein profiles (e.g. TIGR02710 or Csx1 and the 5FSH PDB entry for the Csm6 crystal structure from Thermus thermophilus). This result confirms the earlier observation for archaea that CARF domain-encoding genes comprise the most dominant class of accessory genes. The largest cluster with no CARF match, cluster 4, had a low Cas association score (only 3% Type III specificity) and was considered spurious and not included in Table 2. This particular gene cluster encodes an ancient, ubiquitous, family of ABC-transporters with high sequence conservation that would probably appear enriched regardless of the genomic locus under survey. Gene cluster 5 with 64 members was almost exclusively associated with the less well characterised III-D subtype modules. These genes were always closely linked with their cognate Type III genetic modules and co-transcribed with the core cas genes. The genes encode a relatively small protein (about 160 aa) with no significant matches in sequence databases, a pattern typical for  Cas11 (SS). Additionally, a cas11 gene assignment for these systems was missing from the original 2015 annotation. Therefore, we inferred that the gene, like cluster 42 (Figure 1(e)), encodes the Cas11 analog, of an uncharacterized subfamily of Type III-D modules and that cluster 5 represents a core Type III-D gene family rather than an accessory gene family.

Most strongly associated gene clusters

The five most strongly associated accessory gene clusters according to the calculated Cas association score were clusters 83, 107, 108, 87 and 42 (Table 2). Cluster 83 carries six genes all located adjacent to Type III-A gene cassettes from crenarchaeal orders including the thermoacidophilic Sulfolobales and the thermoneutrophilic Thermoproteales. These genes are invariably located immediately adjacent to cluster 17 genes (Figure 1(b)) encoding a putative ATPase domain protein (330 aa) (Table 2) common to DNA repair proteins of the SMC (Structural Maintenance of Chromosomes) type including rad50 and recF. The protein sequences that make up cluster 83 average 160 aa and yield no good profile matches in public databases. The cluster 83–17 pair of accessory genes is also accompanied by gene clusters 6 and 84 (Figure 1(b)), the former of which encodes a CARF protein of the MJ1666 type [10], while genes of the latter are similar in size to those of cluster 83 and appear to show weak sequence similarity. Cluster 107 has five gene members present in divergent bacterial genera Nocardiopsis and Thermus, members of the Actinobacterial and Thermus-Deinococcus phyla, respectively. Cluster 107 proteins produce significant sequence matches to known CARF domain-containing proteins but are larger (about 700 aa) and long regions in the middle and at the C-terminal end yield no profile-profile matches in databases. Thus, while it is likely that these CARF proteins are also activated by cOA, their effector function remains obscure. In Nocardiopsis the host carries subtype III-D modules (Figure 1(d)) and also harbours cluster 206 genes which encode a protease of the AB hydrolase family. Typically for CARF proteins, cluster 107 is found associated with different subtypes III-B and III-D. Cluster 108 genes are found amongst crenarchaeal thermoneutrophiles, often in multiple divergent copies immediately adjacent to cluster 17 genes, similar to the coexistence of genes of clusters 17, 83 and 84. Cluster 108 proteins (about 180 aa) show no sequence similarity to cluster 83/84 proteins and yield no good matches to protein family databases. However, they contain a transmembrane (TM) signal at the N-terminal end consistent with the protein being membrane bound. Cluster 87 corresponds to cmr7 [13] and is an additional cas gene so far exclusive to Sulfolobus species. Exceptionally Cmr7 has been characterized experimentally and shown to associate tightly with the Cas10/Cmr3 sub-complex within the cognate Type III-B effector complex [13]. Cluster 87 genes are associated exclusively with the Cmr-β subclass of Sulfolobus Type III-B modules. Cmr-β exhibits RNase activity cleaving at U-A dinucleotide pairs [13] whereas most other Type III-B complexes cleave at regular spatial intervals along the target RNA, via their Cmr4 subunit [31-33]. Cluster 42, like cluster 5, contains an uncharacterized Type III-D core gene encoding the Cas11 small subunit (SS) analog of a subclass of Type III systems which, to date, are poorly characterized. The genes are always located within the Type III-D gene cassette and appear co-transcribed with the other Type III-D interference genes (Figure 1(e)). The subclass of Type III-D modules carrying the gene cluster occur in the bacterial species Campylobacter, Helicobacter, Fibrobacter, Tannerella and Saprospira.

Least strongly associated gene clusters

Some gene families were enriched adjacent to Type III genetic modules but were predicted not to be cofunctional. The ten lowest ranking Type III-associated gene families according to the Cas association score are summarized (Table 3). They are dominated by different transposase domains, and include a single ABC transporter family, all of which are ubiquitous in the mobilome and, therefore, less significant. The result reinforces that Type III genetic modules tend to lie in genomic regions with relatively high HGT activity. The result also underlines the importance of employing criteria for coevolution and specificity, in addition to conservation, when searching for cas accessory genes.
Table 3.

The ten lowest ranking genes in terms of significance of association to Type III modules, based on the Type III association score. Even though these gene clusters are often found near genomic Type III systems, they are inferred not to bear any functional link with them, and, therefore, were not considered accessory.

clustersizescoreexample locusdomain matchescomments
2093−88.9CFBS_2969InsAputative transposase
826−72.2UDA_2812rve/Tra5integrase/transposase
1444−66.3Pcal_0278COG5552/DUF2277unknown function
1239−60.4SSO1518Tnp 1/InsEtransposase
548−46.7MAF_28180Int C/XerCtyrosine recombinase
489−42.6YN1551_2375InsG/IS 4transposase
1693−33.3Cagg_3808Ftnnonheme ferritin
851−33.0Smar_0302PotA/MalKABC transporter
1533−32.6Athe_0143AmyAc MTasealpha amylase
1543−30.33Athe_0142YqiItrans-membrane
The ten lowest ranking genes in terms of significance of association to Type III modules, based on the Type III association score. Even though these gene clusters are often found near genomic Type III systems, they are inferred not to bear any functional link with them, and, therefore, were not considered accessory.

Diversity of accessory proteins with no CARF domain

Although CARF proteins are the most common accessory proteins, their diversity is quite limited with the CARF domain linked to different combinations of a few domains including HTH, WYL, HEPN and PIN toxins. In contrast, the non-CARF accessory proteins appear more diverse, exhibiting a wider range of protein domains and their genes span many smaller gene clusters. Nevertheless, some domain classes are more common than others and they are covered below. Nucleases constitute the broadest class of non-CARF accessory proteins, associated with clusters 37, 45, 58, 139, 166, 187 and 212. Clusters 37 and 45 encode unrelated restriction endonuclease domains and cluster 37 is probably a core cas gene family for the Sulfolobus subtype IIIV-1 system, judging from observed gene syntenies. The protein may associate with the Type III effector complex, and facilitate DNA targeting. Cluster 45 genes may encode restriction activity against invader DNA and complement Type III DNA targeting activity. Cluster 166 proteins give good matches to the Nob1 rRNA maturation endonuclease and may be involved in crRNA maturation given that the associated Type III modules are not linked to a cas6 gene. Nucleases encoded by clusters 58, 139 and 187 are all associated with helicase domains, either as a separate domain in the protein or encoded by co-transcribed genes. Type III systems lack the processive invader dsDNA digestion characteristic of Type I systems, via Cas3, and helicase/nuclease combinations such as the above may contribute the same type of functionality to Type III systems. Proteases and peptidases represent an important class of non-CARF accessory proteins encoded by clusters 11, 43, 47 and 206. Clusters 43 and 206 encode relatively long proteins possibly with multiple domains that remain of unknown function. Clusters 11 and 47 encode small single domain proteins (about 100 aa), most of which match aspartic acid peptidases. Some cluster 11 members carry no identifiable domains, reminiscent of core Cas11 proteins and may have co-clustered with the peptidases as a result of sequence similarity cut-offs being set too low. However, genes encoding viable peptidases are often found within operons of the cognate Type III Cas proteins which suggests that they are core effector genes and may be involved in maturation of Cas proteins. ABC and AAA ATPases comprise another well represented group of putative non-CARF accessory proteins with the cluster 17-encoded SMC ATPase being the most prominent (Figure 1(b)). Clusters 80 and 93 also encode ATPase domains. The former produces a large protein matching DUF499 in Pfam that probably contains multiple domains and resembles SMC and RecF ATPases at the C terminal end. Cluster 93 encodes a small protein and often accompanies cluster 80, with their genes sometimes co-transcribed, and they carry an AAA ATPase motif. Cluster 35 encodes a protein family found mainly in cyanobacteria; it has a Csx3 domain in the N-terminal end and an AAA ATPase domain in the C-terminal region. Cluster 67 is similar but encodes the domains organized in the opposite order. Csx3 is predicted to be a distant member of the CARF superfamily [22], and the AAA ATPase is likely to be under allosteric regulation from the CARF domain, with a cOA signal as possible activator. Another class of accessory gene families are linked to Type III adaptation modules. Members are often located adjacent to adaptation genes including cas1 and cas2. Cluster 227 encodes a primase domain and is found with adaptation modules of Type III-A and Type III-B systems. A related gene occurs adjacent to Type III-D-linked adaptation modules (Figure 1(e)). Functions may include DNA repair following spacer acquisition, or a replacement for Cas4, or reverse transcription to generate DNA spacers from RNA invaders. The cyanobacterial Type III-B associated clusters 216 and 27 (Figure 1(a)) encode small proteins and the former matches a part of Cas1 from Type III-A-associated adaptation modules of other bacteria. Since Type III interference is protospacer adjacent motif (PAM)-independent, this Cas1-associated domain may have evolved to bypass the requirement for a spacer acquisition motif (SAM) [34] that occurs in Type I and II systems. A final class of putative accessory gene families includes those encoding proteins with no significant domain matches. Most of them are small (100 to 200 aa) but a few are larger including those of cluster 124 (average 430 aa). Gene families without identifiable encoded domains are sometimes associated with other accessory genes, including the Cas_RecF-associated gene families (clusters 83, 84 and 108) (Figure 1(b)) [1], the CorA-associated cluster 96 (Figure 1(c)) and cluster 38 which is often associated with cluster 43-encoded proteases. Other clusters occur alone (cluster 76) or show no clear association pattern (cluster 69), indicating that they can function independently from other accessory proteins. Next we present two experimental sections to gain further insights into the potential roles and significance of a few accessory proteins. The first constitutes a study of the cyanobacterium Synechocystis sp. PCC 6803 where a series of accessory genes was selectively deleted and the effect on Type III-B interference and crRNA maturation was investigated. In the second, the dependence of RNase activity on cyclic adenylates was examined for accessory proteins of Type III-B systems of Sulfolobus with a view to compare the results with those obtained earlier on bacterial Type III-A systems.

Type III-B interference activity was not directly impaired in candidate accessory gene deletion mutants

The major defense plasmid pSYSA, present in the cyanobacterium Synechocystis sp. PCC 6803, harbours three separate CRISPR-Cas systems – CRISPR1, CRISPR2 and CRISPR3. CRISPR1 and 2 are classified as subtype I-D and III-D CRISPR-Cas systems, respectively [35-37], while CRISPR3 is a subtype III-B variant system (III-Bv) carrying an unusual fusion of Cmr1-Cmr6 and lacks an obvious Cas6 homolog [38],(Figure 2). Recently, it was demonstrated that CRISPR3 is a viable CRISPR-Cas system that functions independently of the non-cognate Cas6 endonucleases associated with CRISPR1 and CRISPR2. In addition, the host-encoded RNase E was shown to perform crRNA processing in Synechocystis sp. PCC 6803 [38].
Figure 2.

Gene map of the Synechocystis sp. PCC 6803 pSYSA Type III-Bv module with flanking genes. Core Type III genes are coloured red and denoted with Cmr numbers. Adaptation module genes are coloured blue and marked with Cas protein numbers. Accessory genes found in this study are coloured purple and indicated with cluster numbers. Genes deleted in mutants subject to the interference assay are marked by a dot below. None of the individual deletions resulted in a marked decrease in interference activity.

Gene map of the Synechocystis sp. PCC 6803 pSYSA Type III-Bv module with flanking genes. Core Type III genes are coloured red and denoted with Cmr numbers. Adaptation module genes are coloured blue and marked with Cas protein numbers. Accessory genes found in this study are coloured purple and indicated with cluster numbers. Genes deleted in mutants subject to the interference assay are marked by a dot below. None of the individual deletions resulted in a marked decrease in interference activity. To test the functionality of the CRISPR3 system in wild type Synechocystis sp. PCC 6803, an interference assay was developed based on two invader plasmids, each containing the reporter gene gentamicin and a fused protospacer sequence in either sense or antisense orientation [38]. In this study, the same assay was used to test the effects of single candidate accessory gene knock-outs on interference activity. For each accessory gene knock-out mutant we observed a significant reduction in the number of transconjugants for both invader plasmids relative to the control (Figure 3(a)), as previously seen for wild type Synechocystis sp. PCC 6803 [38]. Hence, the interference system was fully operational. We conclude that the gene products of the investigated accessory genes are not directly involved in the interference stage as the efficient degradation of invading nucleic acids was not impaired in any tested mutant strain and behaved the same as the wild type strain. However, we found that the abundance of CRISPR3 crRNA transcripts was clearly reduced in the knock-out strain of the accessory gene slr7088 (Figure 3(b)) belonging to cluster 11 and encoding a peptidase within the CRISPR3 Cas cassette [38]. Therefore, the selected accessory genes do not appear to encode any functionality with regard to the interference stage but they might well play a role in other CRISPR-Cas related processes, e.g. affecting CRISPR3 crRNA stability and/or turnover as demonstrated for slr7088.
Figure 3.

Experimental investigation of accessory gene knock-out mutants in Synechocystis sp. PCC 6803. a) Interference activity of subtype III-Bv-associated accessory gene knock-out mutants. Conjugation efficiencies are calculated by the ratio of the plasmid target (pT) to the plasmid non-target (pNT, control). The conjugation efficiency of the control plasmid was set to 1 and the number of colonies for the plasmid targets was normalized to the control plasmid. Data points represent mean values and standard deviations were calculated for three independent biological replicates. The accessory gene slr7080 is included in cluster 35, slr7083 in cluster 168 and slr7088 belongs to cluster 11.‘s’, invader plasmid with protospacer in sense orientation, ‘as’, in antisense orientation, ‘c’, control plasmid without protospacer. b) Northern hybridization using a radioactively labelled transcript probe spanning CRISPR3 spacers 1–4. The knock-out mutant ∆slr7088 shows decreased CRISPR3 crRNA accumulation compared to the wildtype (WT) strain. After normalization against 5S rRNA the WT clones accumulated in average 1.24 times more CRISPR3 crRNA than the slr7088 deletion mutants. A representative of two independent experiments is shown.

Experimental investigation of accessory gene knock-out mutants in Synechocystis sp. PCC 6803. a) Interference activity of subtype III-Bv-associated accessory gene knock-out mutants. Conjugation efficiencies are calculated by the ratio of the plasmid target (pT) to the plasmid non-target (pNT, control). The conjugation efficiency of the control plasmid was set to 1 and the number of colonies for the plasmid targets was normalized to the control plasmid. Data points represent mean values and standard deviations were calculated for three independent biological replicates. The accessory gene slr7080 is included in cluster 35, slr7083 in cluster 168 and slr7088 belongs to cluster 11.‘s’, invader plasmid with protospacer in sense orientation, ‘as’, in antisense orientation, ‘c’, control plasmid without protospacer. b) Northern hybridization using a radioactively labelled transcript probe spanning CRISPR3 spacers 1–4. The knock-out mutant ∆slr7088 shows decreased CRISPR3 crRNA accumulation compared to the wildtype (WT) strain. After normalization against 5S rRNA the WT clones accumulated in average 1.24 times more CRISPR3 crRNA than the slr7088 deletion mutants. A representative of two independent experiments is shown.

Compatibility between SisCmr-α and different Sulfolobus CARF proteins

Most putative accessory gene families, especially those encoding CARF proteins, can associate with different Type III subtypes (Table 2). Moreover, highly similar CARF genes, including sisCsx1 (locus tag: SiRe_0884) are also located adjacent to gene cassettes of different Type III subtypes (Figure 4). This suggests that some degree of compatibility can occur between CARF proteins and different Type III subtypes.
Figure 4.

Three subtrees from a neighbor-joining tree of all CARF proteins found in this study. Gene (locus) ids are shown along with the subtype of the associated Type III system. The branch length corresponding to a 25 % dissimilarity at the amino acid sequence level is indicated with the ruler. Closely similar CARF proteins can associate with, and cofunction with, different subtypes of Type III systems. SisCsx1 is included in the final subtree (in bold).

Three subtrees from a neighbor-joining tree of all CARF proteins found in this study. Gene (locus) ids are shown along with the subtype of the associated Type III system. The branch length corresponding to a 25 % dissimilarity at the amino acid sequence level is indicated with the ruler. Closely similar CARF proteins can associate with, and cofunction with, different subtypes of Type III systems. SisCsx1 is included in the final subtree (in bold). The Type III-B system (Cmr-α) of Sulfolobus islandicus (Sis) synthesizes cyclic tetra-adenylate (c-A4) which binds to the CARF domain of SisCsx1, the cognate accessory RNase of Cmr-α, producing strong RNase activity (WH, QS unpublished). We examined the potential for c-A4 to activate three additional Sulfolobus CARF proteins, namely SiRe_0765 (Csa3), SiRe_0811 (Csm6) and ST0035 (a CARF-PIN toxin) from S. tokodaii. ST0035 was shown, by UV crosslinking, to weakly interact with c-A4 but it did not induce additional RNase activity. SisCsx1 was the only protein that interacted strongly with c-A4, leading to efficient induction of RNA cleavage by activation of the HEPN toxin domain. It was concluded that the other Sulfolobus CARF proteins may be activated by other species of cOAs synthesized by different Type III effector systems within the cells. Although the comparative genomics results suggest that compatibility can occur between CARF proteins and different Type III subtypes, the experiment indicates that this is not a general rule and that other factors may be important for the CARF protein-Type III association to be productive.

Discussion

The computational method developed for identifying putative accessory genes employed similarity cut-offs and criteria for defining clusters based on considerable prior experience with manual annotation of genomic CRISPR-Cas cassettes. The Type III association score, based on criteria for host genome self-similarity, Type III specificity and coevolution, yields good granular control over sensitivity versus specificity. The cut-off employed involves a tradeoff which produced a few false positives and some false negatives. However, we infer that the vast majority of the gene clusters selected do carry some level of Type III accessory functionality (Table 2). The method relies on accurate prior annotation of Type III core genes so that flanking genes can be evaluated; for inaccurately annotated Type III systems the method may identify core cas genes as accessory genes. The fact that the most widespread CARF and non-CARF accessory genes were identified previously (Table 2) does not undermine the significance of our approach. Accessory protein families have been defined and annotated over a decade involving extensive manual comparisons and curation whereas the present method is semi automatic and scalable and can accelerate the discovery of non-CARF accessory proteins, especially when coupled to an automated pipeline for annotating CRISPR-Cas systems in newly sequenced genomes, an approach that is being developed [2]. The present study indicates that the diversity of non-CARF accessory proteins is still under-sampled whereas the major families of CARF accessory proteins have probably been identified [12]. Searching additional diverse genomes, and metagenomes with the method, should help identify new accessory protein families with increased precision. Therefore, we plan to integrate the current method into our earlier methods [1,2,39-41] and generate a fully automated system for annotation and characterisation of CRISPR-Cas systems.

Detection of previously confirmed accessory proteins

37 of the 76 accessory gene families have been noted in previous comparative genome studies (Table 2). The most important group comprise CARF proteins that were characterized at an early stage [10] and were more recently catalogued systematically together with WYL domain proteins [12]. The major families of previously characterized accessory proteins lacking CARF domains include: (a) clusters 5 and 42 (Csx19 and 24) that are annotated as CRISPR-associated in the TIGRFAMs database, and are both likely to encode core Type III Cas11 proteins; (b) clusters 57 and 58 encode the HerA/NurA helicase-nuclease pair and were initially found associated with crenarchaeal thermoneutrophile Type III systems [42] and later found in other crenarchaea and aigarchaea [1]; (c) cluster 29 encodes a CorA-like putative magnesium transporter and was originally found linked to Type III-B systems in Methanococcus and Clostridium [1]; here we show it also occurs adjacent to Type III-D systems and in diverse bacteria; (d) cluster 11 or csx15/20 encodes a Type III-associated peptidase that is erroneously catalogued in CDD as being Type I-U associated [11]; we found here that it is linked to crRNA biogenesis in Synechocystis; (e) cluster 47 encodes a Type III-associated peptidase found together with cluster 37 encoding a nuclease; both are associated with the crenarchaeal IIIV-1 variant subtype [1] and may comprise core cas genes for this  Type III subtype; (f) cluster 17 or Cas_RecF encodes a repair-associated ABC ATPase found adjacent to diverse crenarchaeal Type III genetic modules [1] and it is often accompanied by clusters 83, 84 and 108 which do not yield significant sequence database matches. ATPases, although common and functionally diverse, are well conserved sequence-wise and while a functional link to DNA repair is possible it could be an artefact. However, ATPases could provide the energy required for shifting the chemical equilibrium towards more efficient cleavage of target nucleic acid, given that Type III systems lack the helicases of Type I systems that ensure processive target cleavage. During the review process of this manuscript, another manuscript was released in preprint [27] which also describes the detection of new cas genes. The authors used a similar approach by employing a score, but looked at genes flanking all Cas types in addition to CRISPR arrays. In spite of having used twice the number of input genomes they only found around half the number of accessory proteins found here (560 proteins covering 58 profiles). The lower sensitivity may have resulted from the noise coming from a broader search. Data from both studies should prove valuable for future experimental validation, due to their respective deep vs. wide approaches yielding independent results.

Compatibility between CARF accessory proteins and different type III subtypes

The results (Table 2, Figure 4) suggest that CARF accessory proteins in particular are able to cofunction with various Type III systems regardless of subtype. A mechanistic rationale for this compatibility could be the shared capacity of Type III subtypes to synthesize the cOA signal required to activate CARF proteins upon invader RNA recognition. The CARF domain is likely invariant for Type III modules as long as a cOA messenger can activate it. While this explains the observed subtype compatibility, experimental evidence from Type III-A systems in Streptococcus thermophilus (St), Thermus thermophilus (Tt) and Enterococcus italicus (Ei) [23,24] and a Type III-B system in Sulfolobus islandicus (Sis) (WH, QS unpublished), suggest that the mechanisms are more complex (Table 4). The different Type III-A Csm effector complexes tested produced different cOA profiles; some synthesising mostly cyclic hexa-adenylate (c-A6) while others produced mainly cyclic triadenylate (c-A3). The archaeal Type III-B system synthesized a signal dominated by cyclic tetra-adenylate (c-A4) (WH, QS unpublished). Moreover, the synthesis efficiencies varied by more than an order of magnitude between the different Csm and Cmr effector complexes tested (Table 4) and the E. italicus system had to be genetically modified to produce significant yields. In addition, the CARF RNases tested showed strong preferences for specific cOA species, with some favouring c-A6 and others c-A4 and, moreover, their activation efficiencies varied markedly, with some CARF proteins requiring several orders of magnitude higher cOA concentrations to trigger efficient RNA digestion.
Table 4.

Comparisons of reaction efficiencies of different CARF proteins with respect to RNase activity, and for Type III effector complexes with respect to cOA synthesis. Substrates comprise RNA for CARF proteins and ATP for Type III complexes. Reaction times, substrate and effector concentrations shown are the minimum required for digestion of 50% of the RNA substrate or for converting more than 80% of ATP into cOA. The concentration of cOA required for activation of CARF proteins differs by several orders of magnitude, as does the efficiency with which the Type III complexes synthesize cOA. The c-A6 required to activate StCsm’s cognate Csm6 protein comprises a minor species (only 0.5% total cOA synthesized), with the major species being c-A3. In contrast, almost all cOA synthesized by SisCmr and EiCsm* was of the type required by cognate CARF proteins. Table contents were compiled from published data [23,24]. EiCsm* contained dEiCsm3, a nuclease-dead mutant protein. The SisCsx1 and SisCmr data were produced in the Copenhagen laboratory (WH, QS unpublished).

 SisCsx1TtCsm6StCsm6StCsm6’EiCsm6SisCmrStCsmEiCsm*
classCsx1Csm6Csm6Csm6Csm6III-BIII-AIII-A
active cOA speciesc-A4c-A4c-A6c-A6c-A6c-A4c-A6c-A6
effector conc (nM)100100.110.1510200160
Substrate520 nM10 nM10 nM10 nM40 nM100 µM50 µM500 µM
cOA (nM)205000.555~ 70%0.5 %~ 100 %
reaction time (min)203030304201030
Comparisons of reaction efficiencies of different CARF proteins with respect to RNase activity, and for Type III effector complexes with respect to cOA synthesis. Substrates comprise RNA for CARF proteins and ATP for Type III complexes. Reaction times, substrate and effector concentrations shown are the minimum required for digestion of 50% of the RNA substrate or for converting more than 80% of ATP into cOA. The concentration of cOA required for activation of CARF proteins differs by several orders of magnitude, as does the efficiency with which the Type III complexes synthesize cOA. The c-A6 required to activate StCsm’s cognate Csm6 protein comprises a minor species (only 0.5% total cOA synthesized), with the major species being c-A3. In contrast, almost all cOA synthesized by SisCmr and EiCsm* was of the type required by cognate CARF proteins. Table contents were compiled from published data [23,24]. EiCsm* contained dEiCsm3, a nuclease-dead mutant protein. The SisCsx1 and SisCmr data were produced in the Copenhagen laboratory (WH, QS unpublished). The cOA specificities of CARF proteins and the effector complexes appear to transcend the broad protein family, and effector subtype categories, because they presumably vary over shorter evolutionary distances. For example, the Type III subtypes are not linked to specific cOA species (Table 4) and such mechanistic differences probably occur at a finer subfamily level. CARF accessory proteins and core Type III effectors must be able to cofunction in order for their linked genes to persist. However, the experiments summarized above indicate that variables such as the sensitivity to the signal, the species and its concentration, limit productive CARF-Type III interactions.

Mobilome-associated genes adjacent to type III modules

In addition to accessory gene families, we expected mobilome-associated gene families to be enriched around genomic Type III cas modules. The Type III association score was devised to address this problem, and five out of the ten lowest ranking gene clusters encoded putative transposases (Table 3). Remarkably, no toxin-antitoxin (TA) gene pairs were found despite (a) previous studies showing that they are common in mobilomes, with some bordering CRISPR-Cas modules [43,44] and (b) additional evidence demonstrating coevolution between TA gene pairs and adjacent Type III modules [45]. The TA systems may influence mobility of adjacent CRISPR-Cas modules or they may help induce dormancy or programmed cell death at key stages of the immune response, in order to minimize the spread of invader genetic material in the cellular population [18]. Thus, they could provide a semi-accessory function to Type III immunity while also being able to function independently. This would explain why some specific subfamilies of TA systems are enriched adjacent to CRISPR-Cas gene cassettes while others are not [45] and, as a result, they were not detected by our method. Of the many accessory proteins identified that yield no protein domain matches with databases some of these could be anti-anti-CRISPR proteins. An increasing number of anti-CRISPR proteins have been characterized, encoded by phages [46] and by archaeal viruses [47], and it remains a possibility that some of the CRISPR-Cas systems encode accessory proteins that can neutralize the anti-CRISPR proteins before they inhibit an immune response.

On the role of Type I associated CARF proteins

Type III effectors activate CARF proteins via a cOA messenger molecule signaling the presence of intracellular invader RNA [23-25] and many CARF protein genes were identified earlier adjacent to Type III modules [10,48]. Less clear is why CARF protein genes are common adjacent to Type I systems, especially amongst archaeal hyperthermophiles [1], when Type I core Cas proteins do not apparently possess the GGDD domain responsible for cOA derivatives. Most types of CARF protein genes that are associated with Type III modules, are occasionally found adjacent to Type I cassettes, with csx3 and especially csa3 being the most common. csa3 encodes a transcriptional repressor of Type I-A interference modules with a DNA binding HTH domain coupled to a ligand sensing CARF domain. Its pattern of invariant presence suggests that it is a core gene for that subtype. Type I-A gene cassettes in several crenarchaeal thermoacidophiles, including species of Sulfolobus, Thermosphaera, Thermogladius, Fervidicoccus and Desulfurococcus carry two csa3 genes and recent studies have shown that one encodes a transcriptional repressor of Type I interference [49] while the other activates the adaptation module [50]. The former study detected invader nucleic acid-dependent activation of transcription of the Type I-A interference gene cassette. Moreover, a model was proposed that involved Csa3 binding to the Type I-A effector complex which then bound to the interference module promoter and repressed further transcription. Protospacer-containing DNA then recruited the effector complex and derepressed the Type I-A interference operon. A more likely model, given our current knowledge of Type III systems, involves sensing of invader nucleic acid by a Type III effector complex and, after recognition of invader RNA, the cOA signal produced may derepress the Type I-A interference module by inducing a conformational change in its CARF domain-containing transcriptional repressor. This model would also imply that spacer acquisition is under the positive control of Type III-mediated invader sensing, reminiscent of the primed adaptation seen for other Type I subtypes but involving a different mechanism. Thus, all CARF proteins associated with Type I systems may facilitate cofunctioning with Type III systems, either directly, as for Csa3, or indirectly by selecting for two types of CRISPR-Cas systems. Genomes that harbour Type I systems flanked by CARF genes like csx1 are thus more likely to exhibit and maintain Type III systems, even when located on a distant genomic locus. This model also implies that Type I-A systems can be dependent on Type III systems for activating spacer acquisition and/or effector functions. Such an interdependence could give rise to synergies that would ensure a more robust immune response in a natural environment with extensive viral diversity, especially as occurs amongst archaeal hyperthermophiles.

Type III systems as a general purpose RNA interference platform

Our finding that deletion mutants of Synechocystis sp. PCC 6803 Type III-Bv accessory gene candidates were not impaired in interference activity confirmed that RNA silencing and transcription-dependent DNA silencing are inherent to core Type III functionality and do not rely on accessory gene functions. The additional functions provided by the deleted accessory genes are likely to lie beyond basic interference. Furthermore, it was shown that the gene product of one accessory candidate gene, slr7088, is involved in crRNA stability and/or turnover, further supporting an auxiliary rather than a core role. The extensive diversity of non-CARF accessory gene families located adjacent to Type III interference modules, suggests that they are multifunctional, likely extending functionality beyond immune defence. The basic RNA interference activity of core Type III effectors is non-processive in contrast to Type I interference activity where Cas3 processively digests invader DNA after protospacer recognition [51-53]. Type I systems are streamlined towards efficient invader dsDNA degradation, whereas core Type III systems are more versatile interfering with different types and configurations of nucleic acid. Thus Type III systems may be more useful for invader surveillance than clearance, and the cOA signaling pathway is ideal for directing effector functionality to other cellular systems. The specialisation of Type I systems towards supercoiled dsDNA targets suggests that they could have evolved from a more general purpose Type III-like ancestor, after the addition of a helicase-nuclease pair. This is also consistent with a recent hypothesis for evolution of Type I systems from Type III systems [54]. Adaptive immunity, although important, could well be one specialisation from a pool of functions offered by Type III systems. RNA interference has a variety of potential applications beyond immunity, particularly within information processing, and different accessory proteins may provide the crucial functional links to other cellular systems.

Concluding remarks

The developed computational method successfully identified 76 putative accessory gene families flanking genomic Type III genetic modules of archaea and bacteria, more than half of which had not been identified previously. The results expand in particular the repertoire of diverse non-CARF accessory gene families for which functions are currently unknown. One of the detected accessory protein families was found to be involved crRNA biogenesis efficiency for a Type III-Bv system in Synechocystis. The diversity of the gene families found provides evidence of Type III systems being coupled to numerous functions additional to invader nucleic acid silencing. The study also suggests that more accessory gene families will be detected in the future when automated annotation of an increasing number of genomic CRISPR-Cas cassettes becomes available.

Materials and methods

The comparative genomics method for detecting putative accessory genes employs a ‘guilt-by-association’ approach, where genes flanking known Type III modules are first clustered by sequence similarity. The extent of conservation across a wide range of host genomes is taken as an indication that the flanking gene clusters are functionally linked to their cognate Type III modules. Since genomic CRISPR-Cas cassettes are most often located in genomic regions implicated in regular horizontal gene transfer (HGT), the mobilome, any genes flanking the cassettes will be shuffled even over short evolutionary distances. The method assumes that gene conservation adjoining Type III gene modules over larger evolutionary distances stems from a selective pressure to maintain those genes in close vicinity of the cognate Type III systems, indicative of a likely functional link. The mechanistic rationale underlying this, is that any horizontal transfer event involving a Type III module will be more likely to include any co-functional accessory genes the closer they are located to the core genetic module. This would ensure the transfer of fully functional cassettes into a new host and provide it with a selective advantage, and increase the chances of the cassette replicating in the new host. The method used for identifying putative accessory genes relies on prior accurate annotation of core Type III genes. Therefore, the data set of annotated Type III modules from the latest CRISPR-Cas classification update is used [2]. For less well annotated Type III modules, the method is expected to pick up core genes in addition to any accessory genes. Furthermore, some gene families with no functional link may border Type III genetic modules and be enriched. Such gene families may include transposases, toxin-antitoxin gene pairs and genes within mobile genetic elements, all of which are generally enriched in mobilome regions. To distinguish true accessory genes from such false positive gene clusters, the Type III specificity and degree of Type III coevolution is estimated for each gene cluster. Gene clusters that have a history of accompanying Type III systems are predicted to display a high degree of coevolution and specificity for their cognate Type III module, and the results can thus be used to remove spurious gene clusters from those that are functionally conserved. By setting optimal cutoffs for defining gene clusters, specificities and estimates for coevolution, it is expected that most genes that pass them comprise accessory genes. While false positives may still occur, the chances of them being selected is minimized.

Definition of putative accessory gene clusters

Five genes upstream and downstream of annotated Type III gene modules were selected from the most recently classified archaeal and bacterial CRISPR-Cas systems [2]. They were then pooled and subjected to an all-against-all protein sequence similarity comparison. Alignments with an E-value above 1 were discarded as were alignments shorter than 40% of the length of each of the two proteins in question. A further filtering step that took into account total protein sizes in relation to alignment length and the relative locations of similar sequence regions between the two compared proteins was performed to detect orthologous pairs of proteins (Equation 1). Equation 1: For the two compared proteins i and j, l denotes their respective lengths, b the position of the beginning of the alignment and e the end. The sum of the three addends must not exceed 0.1 for the proteins to be considered orthologs. The size disparity and alignment positioning disparity (first and last addends) are thus penalized against the compensating (transformed) alignment coverage (middle addend). Protein pairs which passed the filter were considered orthologous and forwarded to Markov clustering [28] which generated the clusters of orthologous proteins. A multiple sequence alignment (MSA) was made for each gene cluster using MUSCLE [55] and used for sensitive profile-profile sequence searches [56] against the PFAM, CDD, TIGRFAMs and COG databases [10,11,26,29] as well as the latest version of the PDB70 database that is distributed together with the HHSuite package [30,56]. The database searches were used to make rudimentary predictions of protein function (Table 2 and S1). MSAs were initially also used for a phylogenetic analyses in order to clarify whether the gene clusters had co-evolved with their cognate Type III systems. Aggregating scores from multiple pairwise alignments was introduced later to automate this step in order to estimate Type III co-evolution (Equation 2). Subclusters were defined for each cluster in order to estimate diversity and as a fall-back option when the main clusters were too broad. Hidden Markov models (HMMs) corresponding to each cluster were constructed and used for genome-wide searches to quantify the specificity of each gene cluster with regard to Type III gene module association. Later, pairwise sequence similarity searches were used for this purpose, as the results proved more accurate. A cas association score was then devised to distinguish spuriously associated gene families from those that were strongly associated. The score combined a coevolution and Type III specificity estimate, along with an estimate for the self-similarity of hosts carrying the gene cluster according to Equation 2. Equation 2: Let c be an integer denoting the size of the cluster of interest, i the ith member of the cluster, and o a logical vector of length c with boolean values for whether the top c best genome wide orthologs (according to Equation 1) were already flanking Type III systems. Summing the contents of that vector and dividing it by c, and obtaining the mean of that quantity for all members of the cluster yields the average proportion of cluster orthologs that are Type III adjacent. This quantity (i.e. the first addend) denotes the cluster Type III specificity and also comprises a measure for co-evolution, because only the best matching c orthologs are considered as opposed to all found orthologs that may include recently diverged paralogs. The other addend represents the cluster self similarity (taken as an estimate for host self similarity), which is the alignment score s of each member protein against another member protein j over the alignment score against self, done for all member proteins and averaged. Comparisons of the same protein against itself were omitted from the latter as they would otherwise artificially inflate the score. The two quantities are subtracted and multiplied by 100 to yield the Type III association score. A cutoff of 24 was set by comparison with manually curated examples from previous studies. Thus a gene cluster seen conserved across a limited range of closely related host genomes was downgraded compared to a gene cluster conserved adjacent to type III systems across widely divergent hosts. The cas association score was used to rank all conserved gene families, and only the top-ranking gene families (with a score above 24) were included for further analyses, with lower ranking gene families assumed to be spuriously associated. Gene clusters corresponding to previously confirmed Type III accessory genes [1,12] were included regardless of association score. Graphical representations of the gene neighbourhoods surrounding all surveyed type III modules were then created and significant accessory genes were marked in order to inspect the results visually and individually. In particular, the visualisations were used to gauge operonic context, in order to assess whether the accessory genes appeared to be cotranscribed with core Type III genes or were encoded as separate transcripts.

Synechocystisinterference assay

The self-replicating conjugative vector pVZ322 and the gentamicin resistance cassette were used for the construction of invader plasmids to check interference activities in Synechocystis sp. PCC 6803 as described in [38]. Spacer 2 was selected to derive an appropriate protospacer sequence and the protospacer sequence was fused in frame to the gentamicin resistance (reporter) gene just upstream of its stop codon in both orientations, sense and antisense, respectively. Mean values of conjugation efficiency and standard deviations were calculated by the ratio of the number of conjugants of target plasmids to non-target plasmids in biological triplicates and in parallel for the control plasmid. Interference activity was observed if the conjugation efficiency was <1. Conjugation efficiency of the non-target plasmid was set to 1, which corresponds to no interference activity.

Creation of knock-out mutants and transformation of Synechocystis 6803

Synechocystis sp. PCC 6803 cultures were grown as described [35]. Single gene knock-out mutants of slr7080-slr7084, slr7088 and slr7091 were generated by integrating a kanamycin resistance cassette into the corresponding loci. The up- and downstream flanking regions (1,000 base pairs in size each) were amplified via PCR to ensure site-directed integration via homologous recombination. The resistance cassette was placed in between the upstream and downstream flanking regions in vector pUC19 by Gibson assembly. Transformation of 10 mLSynechocystis sp. PCC 6803 aliquots was performed as described [35]. Transformants were tested for full segregation of the introduced knock-out via PCR screening.

Synechocystis 6803 conjugation

Plasmids were conjugated into Synechocystis sp. PCC 6803 by triparental mating as described in [36] and few variations described in [38]. Briefly, the helper strain E. coli J53/RP4 and the donor strain E. coli DH5α with the plasmid of interest are combined to allow the transfer of the RP4 plasmid into the plasmid of interest bearing cells. After the addition of the recipient strain Synechocystis sp. PCC 6803, the plasmid of interest is transferred from the E. coli donor strain to the cyanobacterial recipient strain by conjugational transfer. 40 µL of cell suspension were plated on BG11 agar plates containing 5 µg/mL gentamicin. Conjugants were counted after further incubation at 30°C for 2 weeks.

RNA analysis and hybridization conditions

RNA extraction from 50 ml Synechocystis sp. PCC 6803 cultures was performed as described [35] with the variations mentioned in Behler et al. 2018 [38]. 10 µg of total RNA per lane were separated on 1.5% denaturing agarose gel (1.5% agarose, 16% formaldehyde, 10% 10 x MOPS-EDTA-sodium acetate (MEN) buffer) and RNA was transferred to Hybond-N+ membranes (Amersham, Germany) with 20 x SSC buffer by capillary blotting overnight. Generation of a CRISPR3 transcript probe and hybridization conditions were performed as described [38].

Interaction between c4A and Sulfolobus CARF proteins

c4A was synthesized by incubation of Cmr-α with ATP and target RNA (WH, QS unpublished). The interaction between c4A and CARF proteins was determined using a UV cross-link assay as described previously [25]. The extent of activation of the Sulfolobus CARF proteins by c4A was determined by measuring RNase activity as described previously [25].
  30 in total

1.  CRISPRcasIdentifier: Machine learning for accurate identification and classification of CRISPR-Cas systems.

Authors:  Victor A Padilha; Omer S Alkhnbashi; Shiraz A Shah; André C P L F de Carvalho; Rolf Backofen
Journal:  Gigascience       Date:  2020-06-01       Impact factor: 6.524

2.  Multiple origins of reverse transcriptases linked to CRISPR-Cas systems.

Authors:  Nicolás Toro; Francisco Martínez-Abarca; Mario Rodríguez Mestre; Alejandro González-Delgado
Journal:  RNA Biol       Date:  2019-07-11       Impact factor: 4.652

3.  CRISPR-Cas: more than ten years and still full of mysteries.

Authors:  Emmanuelle Charpentier; Alexander Elsholz; Anita Marchfelder
Journal:  RNA Biol       Date:  2019-04       Impact factor: 4.652

4.  Cryo-EM structure of Type III-A CRISPR effector complex.

Authors:  Yangao Huo; Tao Li; Nan Wang; Qinghua Dong; Xiangxi Wang; Tao Jiang
Journal:  Cell Res       Date:  2018-11-21       Impact factor: 25.617

5.  Casboundary: automated definition of integral Cas cassettes.

Authors:  Victor A Padilha; Omer S Alkhnbashi; Van Dinh Tran; Shiraz A Shah; André C P L F Carvalho; Rolf Backofen
Journal:  Bioinformatics       Date:  2021-06-16       Impact factor: 6.937

Review 6.  Chemistry of Class 1 CRISPR-Cas effectors: Binding, editing, and regulation.

Authors:  Tina Y Liu; Jennifer A Doudna
Journal:  J Biol Chem       Date:  2020-08-14       Impact factor: 5.157

7.  Csx3 is a cyclic oligonucleotide phosphodiesterase associated with type III CRISPR-Cas that degrades the second messenger cA4.

Authors:  Sharidan Brown; Colin C Gauvin; Alexander A Charbonneau; Nathaniel Burman; C Martin Lawrence
Journal:  J Biol Chem       Date:  2020-08-21       Impact factor: 5.157

8.  Structure and Mechanism of a Cyclic Trinucleotide-Activated Bacterial Endonuclease Mediating Bacteriophage Immunity.

Authors:  Rebecca K Lau; Qiaozhen Ye; Erica A Birkholz; Kyle R Berg; Lucas Patel; Ian T Mathews; Jeramie D Watrous; Kaori Ego; Aaron T Whiteley; Brianna Lowey; John J Mekalanos; Philip J Kranzusch; Mohit Jain; Joe Pogliano; Kevin D Corbett
Journal:  Mol Cell       Date:  2020-01-10       Impact factor: 17.970

Review 9.  Three New Cs for CRISPR: Collateral, Communicate, Cooperate.

Authors:  Andrew Varble; Luciano A Marraffini
Journal:  Trends Genet       Date:  2019-04-27       Impact factor: 11.639

10.  Specificities and functional coordination between the two Cas6 maturation endonucleases in Anabaena sp. PCC 7120 assign orphan CRISPR arrays to three groups.

Authors:  Viktoria Reimann; Marcus Ziemann; Hui Li; Tao Zhu; Juliane Behler; Xuefeng Lu; Wolfgang R Hess
Journal:  RNA Biol       Date:  2020-06-10       Impact factor: 4.652

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.