Sequence, structure and functional diversity of PD-(D/E)XK phosphodiesterase superfamily.

Kamil Steczkiewicz, Anna Muszewska, Lukasz Knizewski, Leszek Rychlewski, Krzysztof Ginalski.

Abstract

Proteins belonging to PD-(D/E)XK phosphodiesterases constitute a functionally diverse superfamily with representatives involved in replication, restriction, DNA repair and tRNA-intron splicing. Their malfunction in humans triggers severe diseases, such as Fanconi anemia and xeroderma pigmentosum. To date, there have been several attempts to identify and classify new PD-(D/E)XK phosphodiesterases using remote homology detection methods. Such efforts are complicated, because the superfamily exhibits extreme sequence and structural divergence. Using advanced homology detection methods supported with superfamily-wide domain architecture and horizontal gene transfer analyses, we provide a comprehensive reclassification of proteins containing a PD-(D/E)XK domain. The PD-(D/E)XK phosphodiesterases span over 21,900 proteins, which can be classified into 121 groups of various families. Eleven of them, including DUF4420, DUF3883, DUF4263, COG5482, COG1395, Tsp45I, HaeII, Eco47II, ScaI, HpaII and Replic_Relax, are newly assigned to the PD-(D/E)XK superfamily. Some groups of PD-(D/E)XK proteins are present in all domains of life, whereas others occur within small numbers of organisms. We observed multiple horizontal gene transfers even between human pathogenic bacteria or from Prokaryota to Eukaryota. Uncommon domain arrangements greatly elaborate the PD-(D/E)XK world. These include domain architectures suggesting regulatory roles in Eukaryotes, like stress sensing and cell-cycle regulation. Our results may inspire further experimental studies aimed at identification of exact biological functions, specific substrates and molecular mechanisms of reactions performed by these highly diverse proteins.

Year:  2012        PMID: 22638584      PMCID: PMC3424549          DOI: 10.1093/nar/gks382

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The large and extremely diverse superfamily of PD-(D/E)XK phosphodiesterases is a remarkable example of adapting a common structural scaffold to various biological activities. These enzymes encompass mainly nucleases (and their inactive homologs) and fill a variety of functional niches including DNA restriction (1), tRNA splicing (2), transposon excision (3), DNA recombination (4), Holliday junction (HJC) resolution (5), DNA repair (6), Pol II termination (7), or DNA binding (8). The involvement of PD-(D/E)XK enzymes in housekeeping processes suggests that these proteins may be engaged in the development of genetic diseases. PD-(D/E)XK phosphodiesterases exhibit very little sequence similarity, despite retaining a common core fold and a few residues responsible for cleavage. The extreme sequence diversity, multiple insertions into a relatively small structural core, circular permutations (9) and migration of active site residues (10) render this superfamily a difficult subject for homology inference and hinder the identification of new families with traditional sequence- or even structure-based approaches. In the present study our aim was to identify, classify and expand the existing repertoire of proteins belonging to the PD-(D/E)XK fold, in order to obtain a more complete picture of this superfamily. The common conserved structural core of PD-(D/E)XK phosphodiesterases consists of a central, four-stranded, mixed β-sheet flanked by two α-helices on both sides (with αβββαβ topology), forming a scaffold adopted for active site formation (11) (Figures 1 and 2). This architecture and topology are classified in the SCOP (Structural Classification of Proteins) database (12) as a restriction endonuclease-like fold.
The active site is located in a characteristic β-sheet Y-shaped bend (the second and third core β-strands) that exposes the catalytic residues (aspartic acid, glutamic acid and lysine, in a canonical active site) from the relatively conserved PD-(D/E)XK motif. In addition to this motif, conserved acidic residues from the core α-helices (usually glutamic acid from the first α-helix) often contribute to active site formation, at least in a subset of families (10,13). Altogether, these residues play various catalytic roles, which include coordination of up to three divalent metal ion cofactors, depending on the family. In addition, residues from the second, positively charged α-helix can also contribute to the active site, although their major role is to facilitate substrate binding and quaternary structure formation (14). The last, fourth core β-strand tends to be strongly hydrophobic as it is buried deeply within the hydrophobic core of the structure. This α/β/α sandwich fold is capable of accommodating a number of modifications (15) that often blur the image of the canonical structure of these enzymes. For a long time, proteins belonging to the PD-(D/E)XK nuclease-like superfamily were considered to be exclusively restriction enzymes. However, many later experiments showed their contribution to the resolution of branched DNA structures (5), double-strand break maintenance (16), or RNA maturation (17). In the following years PD-(D/E)XK phosphodiesterases were extensively studied and reclassified (18), and their realm was consequently enlarged. Currently, there are 60 diverse families grouped into the ‘PD-(D/E)XK nuclease superfamily’ clan in the Pfam 26 database (19). This clan includes restriction enzymes, HJC resolvases, herpes virus exonucleases and various other nucleases from all kingdoms of life, sugar fermentation proteins, and several domains of unknown function (DUFs).
In addition, there are over 100 structures of PD-(D/E)XK nucleases cataloged in the SCOP database (12), clustered into four main groups: restriction endonuclease-like enzymes, tRNA–intron splicing endonucleases, the eukaryotic RPB5 N-terminal domain and TBP-interacting protein-like proteins.
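The canonical active-site signature described above (a PD dipeptide followed, after a variable spacer, by (D/E)-X-K) can be illustrated with a naive pattern scan. This is only a minimal sketch: the function name, the toy sequence and the 5 to 40 residue spacer bound are illustrative assumptions, not values taken from this study. Because the superfamily shows extreme divergence, circular permutations and migration of active-site residues, a pattern of this kind would be expected to miss many true members and to report many false positives, which is why profile-based remote homology methods are needed.

```python
import re

# Assumed, simplified signature: "PD", a variable spacer, then (D/E)-X-K.
# The 5-40 residue spacer bound is an illustrative assumption only.
CANONICAL_MOTIF = re.compile(r"PD.{5,40}?[DE].K")


def find_canonical_motifs(sequence):
    """Return (start, matched_text) for each non-overlapping candidate motif."""
    return [
        (match.start(), match.group())
        for match in CANONICAL_MOTIF.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    toy_sequence = "MSTLPDAVLLNGKRIAIEVKSGNA"  # hypothetical sequence, not a real protein
    for start, text in find_canonical_motifs(toy_sequence):
        print(start, text)
```

A hit from such a scan is at best a weak hint that needs confirmation by fold recognition or structure comparison.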
Figure 1.

The commonly conserved core of PD-(D/E)XK nuclease fold. Critical active site residues are shown as red sticks and marked in corresponding sequence logo. Sequence logo was derived from multiple sequence alignment for PD-(D/E)XK phosphodiesterase superfamily using WebLogo (20).

Figure 2.

Multiple sequence alignment for the conserved core regions of the PD-(D/E)XK superfamily. Each group of closely related Pfam, COG, KOG families and PDB90 structures (detectable with PSI-BLAST) is represented by available PDB90 sequence or selected representative if the cluster does not contain solved structure. Sequences are labeled according to the group number followed by NCBI gene identification number or PDB code. The first residue numbers are indicated before each sequence, while the numbers of excluded residues are specified in parentheses. Sequence given in italic corresponds to circularly permuted α-helix. Residue conservation is denoted with the following scheme: uncharged, highlighted in yellow; polar, highlighted in grey; active site PD-(D/E)XK signature residues, highlighted in black; other conserved polar/charged residues augmenting the active site, highlighted in red. Locations of secondary structure elements are shown above the corresponding alignment blocks.

The PD-(D/E)XK proteins constitute a functionally diverse superfamily that addresses multiple nucleic acid maintenance issues. For instance, PD-(D/E)XK domain occurs in all classes of restriction enzymes, including those of type I, II, III and IV. Type II restriction endonucleases form the most diverged group of PD-(D/E)XK phosphodiesterases. These enzymes, in concert with methyltransferases, set up the restriction–modification systems which protect bacterial and archaeal genomes against foreign genetic material (21). Host DNA is marked through methylation and therefore it is protected from accidental cleavage by a restriction enzyme which recognizes only unmethylated, foreign nucleic acid.
Jeltsch and Pingoud proposed an evolutionary dependence between methyltransferases and restriction endonucleases (22). They showed that bacterial cells had acquired both a relevant methyltransferase and a restriction enzyme simultaneously in order to provide sufficient protection of host genetic material. Other restriction endonuclease-like fold proteins include the mismatch repair enzymes MutH and Vsr. These enzymes are a part of the machinery that recognizes and removes nucleotides improperly incorporated during replication. MutH, which is a part of the MutHLS mismatch repair system, is a methylation- and sequence-specific nuclease (6,23). Vsr nuclease is a part of the Very Short Patch Repair system, which compensates for a MutHLS deficiency connected with the spontaneous deamination of methylated cytosine. The PD-(D/E)XK proteins can also resolve HJCs emerging from homologous recombination. An HJC fastens together two homologous DNA molecules and, if unresolved, can lead to mutations (24). There are several PD-(D/E)XK protein families conserved through all kingdoms of life that recognize and cut branched DNA structures. These enzymes include RecU (25) and bacteriophage T7 HJC resolvase (endonuclease I) involved in genetic recombination during viral infection (26). XPF, ERCC4, Mus81 and Dna2 are also PD-(D/E)XK nucleases with structure-based specificity for branched DNA structures (27,28). They may cleave HJCs or, as proven for Dna2, cut the remaining long flap RNA primers during Okazaki fragment maturation (29). XPF was identified to process damaged DNA structures in mammalian nucleotide excision repair (NER) (27). Additionally, together with ERCC1, it cleaves DNA duplexes during homologous recombination. Mus81 participates in recombination and cell-cycle regulation (28).
PD-(D/E)XK phosphodiesterases also embrace exonucleases involved in homologous recombination and various DNA repair pathways, including RecB and its inactive homolog RecC from the RecBCD complex (16). The assortment of functional niches for PD-(D/E)XK proteins also encompasses mobile genetic element transposition, exemplified by TnsA transposase (3). Viral nucleases constitute another PD-(D/E)XK group. The alkaline exonuclease maintains extensively expressed viral DNA and degrades host mRNA molecules (30). Bacteriophage λ-exonuclease facilitates double strand break repair and single strand annealing (31). A eukaryotic Rai1-like protein (PF08652, KOG1982) plays an important role in pre-rRNA maturation by removing two phosphates from the 5′-termini, leaving a 5′-monophosphate (7). The mitochondrial, membrane-bound Pet127 (PF08634) protein is suggested to process the apocytochrome-b precursor during mRNA maturation (32). RPB5, a universal subunit of all three major eukaryotic RNA polymerase complexes, also retains the PD-(D/E)XK fold. RPB5 interacts with several transcription factors, such as TFIIB or HBx, and the TIP120 pre-initiation complex (8). The tRNA splicing endonucleases, which constitute a well distinguishable group of archaeal and eukaryotic proteins within the PD-(D/E)XK phosphodiesterase realm, are a very interesting example of alternative function gain through acquisition of a novel active site. They are vital for maturation of tRNA molecules, performing intron excision from the anticodon loop (2). Their activity is crucial for tRNA intron identification and removal, allowing ligases and phosphotransferases to complete the tRNA maturation process. In humans, the malfunction of some PD-(D/E)XK phosphodiesterases is linked to severe, inherited diseases involving neurological abnormalities and susceptibility to early onset malignancies.
Mutations in tRNA splicing endonuclease lead to pontocerebellar hypoplasia (PCH) (33), which is related to mental and motor impairments. Mutations in XPF-ERCC1, an NER repair pathway structure-dependent endonuclease, are one of the primary causes of xeroderma pigmentosum (XP) (34). XP manifests itself by increased sensitivity to sunlight with the development of carcinomas. Fanconi anemia (FA) is a consequence of mutations in PD-(D/E)XK proteins participating in DNA repair [e.g. FANCM (35)] and involves developmental abnormalities, bone marrow failure, and a predisposition to cancer. To date there have been several attempts to identify and classify new PD-(D/E)XK phosphodiesterases, such as the YhgA (36), UL24 (37), NERD (38), CoiA (39) and RmuC (39) protein families or various restriction enzymes (1). Those studies were mainly based on remote homology detection methods, as the extreme sequence divergence of the PD-(D/E)XK enzymes remains the main obstacle in detection of new superfamily members. This inspired the development of a dedicated SVM (Support Vector Machines) algorithm for the identification of the PD-(D/E)XK active site signature within protein sequences (11). The discussed analyses covered a large part of the PD-(D/E)XK phosphodiesterase world; however, each approach individually relied on a limited set of initial sequences and did not provide a broad view of the PD-(D/E)XK fold. Therefore, in order to give our work a broader perspective, we first collected the structures and families annotated as restriction endonuclease-like enzymes. This set was used as a starting point for exhaustive, transitive fold recognition searches aiming to obtain the most complete set of PD-(D/E)XK proteins available in current databases. Here we report a comprehensive reclassification of proteins containing a PD-(D/E)XK domain, including their domain architecture, taxonomic distribution and genomic context.

MATERIALS AND METHODS

A brief overview of our methods is presented below, with further details given in Supplementary Materials (see ‘Materials and Methods’ section). Detection of PD-(D/E)XK families (Pfam, COG, KOG) and structures (PDB90) was performed with a distant homology detection method, Meta-BASIC (40). Non-trivial assignments were additionally confirmed with a consensus fold recognition method, 3D-Jury (41). Sequences of proteins belonging to the identified families were collected with PSI-BLAST (42) searches against the NCBI nr database. Multiple sequence alignments were prepared using PCMA (43). In addition, a structure-based alignment was derived from a manually curated superimposition of PD-(D/E)XK structures. The final alignment for the PD-(D/E)XK superfamily was assembled from sequence-to-structure mappings using a consensus alignment and 3D assessment approach (44). The collected PD-(D/E)XK fold proteins were clustered into groups of closely related families and structures based on detectable sequence similarity with both PSI-BLAST and RPS-BLAST. Structure similarity based searches were performed with the ProSMoS program (45). Domain architecture was analyzed with RPS-BLAST against COG, KOG and Pfam, and with HMMER3 against Pfam. Transmembrane regions were detected with the TMHMM server (46). Cellular localization was predicted with PSORTb (47) for prokaryotic sequences and with Cello (48), WoLF PSORT (49) and Multiloc (50) for eukaryotic sequences. Taxonomic assignment was based on NCBI taxonomic identifiers. HGT events were identified using a phylogenetic approach. Phylogenetic trees for each cluster were calculated using PhyML. The genomic context was analyzed with The SEED (51), GeContII (52), MicrobesOnline (53) and NCBI genomic resources. Clustering of all 21 911 sequences was performed with CLANS (54), with high resolution figures drawn with an in-house script based on CLANS scores.
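The grouping step described above, in which families and structures are merged whenever sequence similarity between them is detectable, can be pictured as single-linkage clustering over a list of pairwise hits. The sketch below is a hypothetical illustration using a union-find structure. The function name and the `famA` to `famD` labels are invented, and the hit list stands in for whatever PSI-BLAST or RPS-BLAST would report. It is not the procedure or the thresholds actually used in this study.

```python
def cluster_by_hits(items, hits):
    """Group items into connected components of the 'detectable hit' graph.

    items: iterable of family/structure labels (hypothetical names here).
    hits:  iterable of (a, b) pairs meaning a and b detect each other.
    """
    items = list(items)
    parent = {item: item for item in items}

    def find(item):
        # Walk up to the representative, compressing the path on the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    labels = ["famA", "famB", "famC", "famD"]
    detected = [("famA", "famB"), ("famB", "famC")]
    print(cluster_by_hits(labels, detected))
```

Single linkage is chosen here because one detectable link is enough to place two families in the same group, even when other members of that group do not detect each other directly.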

RESULTS

In order to broaden the repertoire of PD-(D/E)XK proteins we performed sensitive distant homology searches using as the initial dataset 44 Pfam 25 families and 60 representative restriction endonuclease-like proteins of known structure cataloged in the SCOP database. The exhaustive, transitive fold recognition searches against the Pfam, COG, KOG and PDB90 databases resulted in a collection of various PD-(D/E)XK families that altogether span 21 911 sequences from the NCBI nr protein database (a list of all identified proteins is provided as Supplementary Dataset S1). For instance, we found that 99 PDB90 structures, 49 COG, 11 KOG and 118 Pfam families retain the PD-(D/E)XK fold. This is significantly more than currently reported in the PD-(D/E)XK nuclease superfamily clan of the Pfam 26 database, which defines only 60 families. In addition, we found six PD-(D/E)XK fold families to be classified also in two other Pfam clans: (i) Restriction endonuclease-like (Endonuc-FokI_C, PF09254; MutH, PF02976; RE_AlwI, PF09491) and (ii) tRNA–intron endonuclease catalytic domain-like (Sen15, PF09631; tRNA_iecd, PF12858; tRNA_int_endo, PF01974). All PD-(D/E)XK proteins were identified with a single procedure as described in our previous work (36). This represents major progress in comparison with previous studies on the diversity of the PD-(D/E)XK phosphodiesterase superfamily. All collected families and structures were clustered into 121 groups of closely related proteins. The average sequence similarity between different PD-(D/E)XK groups is very low, which is reflected by low Meta-BASIC scores (Supplementary Table S1), and is below the threshold of confident recognition both with standard and even more advanced sequence comparison methods. This high sequence divergence implies the need for complex sequence and structure search strategies. Many of the identified protein groups contain uncharacterized and poorly annotated proteins or functionally studied proteins without structural annotations.
Eventually, upon further manual literature inspection, the majority of these families were linked to the PD-(D/E)XK superfamily. However, such an assignment was feasible only with a list of the proteins in question. The remaining 11 identified groups embrace the newly found PD-(D/E)XK fold families. We detected PD-(D/E)XK sequences in multiple genomes from all forms of life. The versatility of this superfamily convinced us to perform a variety of structure- and sequence-based analyses. We thoroughly examined every family in our dataset in order to determine its characteristic sequence and structure features. Here, we describe in detail the results of sequence and literature searches, domain architecture analysis, structural comparisons and phylogenetic inference, which together shed new light on the functional diversity of PD-(D/E)XK proteins. Table 1 summarizes the details of all identified PD-(D/E)XK phosphodiesterase groups. Human genes encoding PD-(D/E)XK proteins are shown in Supplementary Table S2. One should note that most of the human PD-(D/E)XK genes are involved in diseases.
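The idea of an exhaustive, transitive search, in which every newly detected family or structure is itself used as a query until nothing new turns up, can be sketched as a breadth-first traversal. In the sketch below the `search` callable is a hypothetical stand-in for a fold recognition query that returns only confident hits. The function name and the toy hit table are assumptions for illustration and do not reproduce the Meta-BASIC or 3D-Jury settings used in this work.

```python
from collections import deque


def transitive_search(seeds, search):
    """Collect everything reachable from the seeds through repeated searches.

    seeds:  initial queries (e.g. known families or structures).
    search: hypothetical callable mapping a query to an iterable of hits
            that passed whatever confidence check the caller applies.
    """
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        query = queue.popleft()
        for hit in search(query):
            if hit not in found:  # each entry is queried at most once
                found.add(hit)
                queue.append(hit)
    return found


if __name__ == "__main__":
    # Toy hit table with invented labels; "unrelated" is never reached.
    toy_hits = {"seed1": ["fam1"], "fam1": ["fam2"], "fam2": [], "unrelated": ["other"]}
    print(sorted(transitive_search(["seed1"], lambda q: toy_hits.get(q, []))))
```

In practice the risk of such a procedure is error propagation: one false positive can pull in an unrelated branch, which is why borderline assignments need independent confirmation before they are used as new queries.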
Table 1.

One hundred and twenty-one groups of proteins retaining PD-(D/E)XK nuclease fold

No.NameBiological functionTaxonomy
HGTs
Detailed distribution
1NaeIPF09126(58)Type II Restriction Endonuclease (58)+Bacteria (α-proteobacteria, Actinobacteria)Deinococcus maricopensis sequence is found in a clade with Roseobacteriales (α-proteobacteria) & Actinomycetales. The Roseobacteriales clade locates within a Actinomycetales tree.
1ev7
2BglI1dmu(59)Type II Restriction Endonuclease (59)+BacteriaOnly four sequences from distant taxa: Bacillus atrophaeus (Bacilli), Microcoleus (Oscillatoriales), Deinococcus deserti (Deinococci) suggest a HGT.
3HpaIIPF09561NewType II Restriction Endonuclease (60)+Bacteria (Bacillus/Clostridium, Bacteroidetes)Streotibacillus moniliformis (Fusobacteriales) forms a clade with Sulfurimonas denitrificans (Campylobacteriales). Bacillus thuringiensis (Bacillales) groups with Flexibacter tractuosus (Cytophagales). Single sequences of Fusobacteria, ε-proteobacteria, β-proteobacteria and γ-proteobacteria.
4NgoBV, NlaIVPF09564(1)Type II Restriction Endonuclease (61)+Bacteria (mostly Neisseria)Multiple transfers, animal related bacteria. Single representatives of: Spirochaetes, Fusobacteria, Tenericutes, ε-proteobacteria, Clostridia, Bacilli.
5ScaIPF09569NewType II Restriction Endonuclease (62)+BacteriaMultiple transfers. Ecologically and taxonomically unrelated bacteria from Bacilli, Proteobacteria, Cyanobacteria, Bacterioidetes.
6LlaMI, ScrFIPF09562(63)Type II Restriction Endonuclease (63)+Bacteria (Cyanobacteria, Bacillus/Clostridium, γ-proteobacteria)One clade grouping: Lachnospiraceae bacterium (Clostridiales), Lactococcus lactis subsp. cremoris (Lactobacillales), Prochlorococcus marinus (Cyanobacteria), Vibrio parahaemolyticus (γ-proteobacteria).
7PvuIIPF09225(64)Type II Restriction Endonuclease (64)+BacteriaMeiothermus ruber (Thermales), Bacteroides cellulosilyticus (Bacteroidales) and Arthrospira maxima (Burkholderiales) are single representatives of corresponding taxa suggesting a transfer event from Enterobacteriales.
3ksk
8XamIPF09572(11)Type II Restriction Endonuclease (65)+{1}BacteriaPatchy distribution including a Haloarcheon—Halogeometricum borinquense grouping with good support within a bacterial clade.
9XhoIPF04555(1)Type II Restriction Endonuclease (66)+{1}Bacteria (mostly Proteobacteria and Actinobacteria)Leptospirillum rubarum and 3 Actinobacteria within a Proteobacteria clade.
10ApaLIPF09499(67)Type II Restriction Endonuclease (62)+BacteriaMultiple transfers, Helicobacter felis (ε-proteobacteria) with Microscilla marina (Bacterioidetes). Patchy distribution including single sequences from Bacillales, Chloroflexales, Xantomonadales, Fusobacteriales, Beggiatoales, Borrelomycetales, Campylobacteriales.
11BamHIPF02923(68)Type II Restriction Endonuclease (68)+BacteriaMultiple transfers, extremophilic and/or aquatic bacteria.
1bam, 3odh
12BstYI, BglIIPF09195(69)Type II Restriction Endonuclease (69){1}+BacteriaMultiple transfers for example B. subtilis sequence grouped with Cyanobacteria. Ethanoligenens harbinense (Clostridiales) is located in a Proteobacteria clade.
1sdo, 1d2i
13SacIPF09566(1)Type II Restriction Endonuclease (70)+Bacteria (Bacilli)Multiple transfers. Patchy distribution: single sequences Bacteroides, Actinobacteria, γ-proteobacteria, ε-proteobacteria.
14Eco47IIPF09553NewType II Restriction Endonuclease (71){1}+BacteriaHelicobacter pylori sequence groups within a Mycoplasma clade, multiple transfers.
15HaeIIPF09554NewType II Restriction Endonuclease (72)+Bacteria (mostly γ- and β-proteobacteria)Cyanobacteria sequences not grouped. Single sequences from Cyanobacteria, Bacterioidetes.
16HindIIIPF09518(73)Type II Restriction Endonuclease (73)+Bacteria (mostly γ-proteobacteria)Multiple transfers: Citrobacter (γ-proteobacteria) within a Bacilli clade, oral bacterium Streptococcus downei grouped together with Haemophilus influenzae.
3a4k
17FokIPF09254(14)Type II Restriction Endonuclease (14)+Bacteria (Bacillus/Clostridium)Haemophilus influenzae within a Streptococcus sanguinis clade.
2fok
18EcoO109I1wtd(74)Type II Restriction Endonuclease (74)+Bacteria (Escherichia coli)No HGT observed
19EcoRVPF09233(75)Type II Restriction Endonuclease (75)+{2}BacteriaEscherichia coli in a clade with Streptococcus mitis (Lactobacillales), Listeria innocua (Bacillales), Vibrio orientalis (Vibrionales) and Thiomonas (Burkholderiales)a
1eo3
20EcoRIPF02963(76)Type II Restriction Endonuclease (76)+{1}Bacteria (BCF group, Proteobacteria, Bacillus/Clostridium)Methanobrevibacter smithii, Staphylococcus aureus, Fusobacterium ulcerans and Brucella melitensis group together with 5 E. coli Migula 1895 sequences. Multiple transfers
2oxv
21XcyIPF09571(77)Type II Restriction Endonuclease (77)+Bacteria (γ-proteobacteria, Clostridium)Pseudomonas alcaligenes (soil bacterium) in a plant pathogenic Xanthomonas clade, Proteobacteria in a extremophilic Clostridium clade. Multiple transfers
22BsoBIPF09194(78)Type II Restriction Endonuclease (78)+Bacteria (mostly Cyanobacteria)Roseiflexus castenholzii phototrophic bacterium and intestinal Alistipes sp. within a mostly Cyanobacteria clade
1dc1
23HincIIPF09226(79)Type II Restriction Endonuclease (79)+Bacteria (mostly γ-proteobacteria)Oral bacterium Capnocytophaga ochracea within a Haemophilus & Actinobacillus clade. Additionally, Prevotella bivia pathogen, joins this clade
3ebc
24SinI, AvaIIPF09570(1)Type II Restriction Endonuclease (22)+BacteriaPatchy distribution
25NgoPIIPF09521(1)Type II Restriction Endonuclease (80)++ProkaryotaPatchy distribution, possible transfer between Desulfurobacterium thermolithotrophicum (Aquificiae) and Methanothermobacter thermautotrophicus and Candidatus Parvarchaeum acidiphilum (Euryarchaeota)
26Tsp45IPF06300NewType II Restriction Endonuclease (81)+BacteriaPossible transfer between Simonsiella muelleri (β-proteobacteria) and Fusobacterium periodonticum (Fusobacteria). Patchy distribution including: Prevotella, Treponema and Chlorobium
27MspIPF09208(82)Type II Restriction Endonuclease (82)+Bacteria (mostly Bacilli/Clostridia)Two γ-proteobacteria (Idiomarina loihiensis, Moraxella) within a Firmicutes clade. Moraxella opportunistic pathogen groups with Clostridium botulinum. Deep sea I. loihiensis groups with Anoxybacillus flavithermus thermophile. Patchy distribution
1sa3
28MjaIIPF09520(11)Type II Restriction Endonuclease (83)++ProkaryotaPossible transfer between Archaea and Bacteria. Patchy distribution
29MunIPF11407(83)Type II Restriction Endonuclease (83)+{1}BacteriaDesulfurivibrio alkaliphilus and Prevotella copri prossible transfer. Cenarchaeum symbiosum groups together with Tenericutes and Clostridia. Cenarchaeum symbiosum is a partner of a marine sponge (84)
1d02
30CfrBIPF09516(1)Type II Restriction Endonuclease (85)+Bacteria (mostly proteobacteria)Anaerobic ammonium-oxidizing candidatus Kuenenia stuttgartiensis, thermophilic Geobacillus stearothermophilus and Thermodesulfovibrio yellowstonii group within a Proteobacteria tree
31NgoMIVPF09015(85)Type II Restriction Endonuclease (85)+BacteriaBacteroides finegoldii groups within Heliobacterium modesticaldum and Faecalibacterium prausnitzii (Clostridiales) clade. Thermomonospora curvata (Actinomycetaceae), Opitutaceae bacterium TAV2 (Opitutaceae) and Idiomarina baltica (Alteromionadaceae) group together
1fiu
32Cfr10I, Bse634I, SgrAIPF07832(86)Type II Restriction Endonuclease (86)+BacteriaPseudomonas stutzeri (Pseudomonadales), Nodularia spumigena (Nostocales) and Streptomyces griseus (Actinomycetales) sequences group together
1cfr, 1knv
3dpg
33Bpu10IPF09549(87)Type II Restriction Endonuclease (87)+BacteriaMultiple transfer events. One clade encompasses representatives of Cyanobacteria (Cyanothece and Nodularia), Proteobacteria (E. coli, Allochromatium vinosum, Plesiocystis pacifica), Chloroflexi (Chloroflexus aurantiacus) and Actinobacteria (Gardnerella vaginalis)
34BspD6I, AlwI, MlyIPF09491 2ewf, 2p14(88)Type II Restriction Endonuclease+{1}BacteriaMicrococcus lylae (Actinomycetales) and Methanohalobium evestigatum (Euryarchaeota) forming a common clade or Mannheimia haemolytica (γ-proteobacteria) within a Firmicutes clade are examples of possible HGT. M. haemolytica causes intramammary infection in sheep. Micrococcus lylae is a denitrifying soil bacterium whereas M. evestigatum is an extreme halophilic methanogen
Restriction Endonuclease (88)
35LlaJI, McrBCPF09563(89)Type II Restriction Endonuclease (89)++{1}ProkaryotaMobiluncus curtisii subsp. curtisii (Actinomycetales) within a Clostridium clade. Gardnerella vaginalis (Actinomycetales) forms a clade with L. lactis (Lactobacillales) and Anaerostipes caccae (Clostridiales). Batrachochytrium dendrobatidis JAM81 (Chytrydiomycota, Fungi) forms a clade with Desulfotomaculum nigrificans (Clostridiales). Methanobrevibacter ruminantium DSM 1093 (Euryarchaeota) locates in a mostly Firmicutes clade
PF10117
COG4268
36SdaI, BsuBIPF06616(90)Type II Restriction Endonuclease (90)+{1}BacteriaTreponema vincentii (Spirochaetales), B. subtilis and Paenibacillus larvae subsp. larvae (Bacillales) within a Proteobacteria clade. Shewanella sediminis (Enterobacteriales) sequence groups with Clostridium sticklandii (Clostridiales). Methanobrevibacter ruminantium (Euryarchaeota) forms a clade with 2 Prevotella (Bacteroidales) sequences. Methanobrevibacter ruminantium is a rumen bacterium of cattle and Prevotella is involved in periodontal infections
2ixs
37DpnII, MboIPF04556(91)Type II Restriction Endonuclease (91)++ProkaryotaCarboxydothermus hydrogeniformans in a Mycoplasma clade. Extremophilic Dictyoglomus thermophilum (Dictyoglomi) with M. smithii & Methanosphaera stadtmanae (Euryarchaeota)
38Ecl18kI, EcoRII, PspGIPF09019(92)Type II Restriction Endonuclease (92){2}+{1}BacteriaPhotobacterium damselae subsp. piscicida (Vibrionales) sequence locates within an Enterobacteriaceae clade (Klebsiella, Shigella, Escherichia and Yersinia)
2fqz, 1na6
3bm3
39HinP1IPF11463(93)Type II Restriction Endonuclease (93)+Bacteria (Proteobacteria)Leptotrichia goodfellowii (Fusobacteriales) in a Proteobacteria clade. Moraxella catarrhalis (Pseudomonadaceae) in a Haemophilus clade (Pasteruellaceae). Haemophilus somnous is a bovine pathogen, L. goodfellowii is found in dental plaque. Moraxella catarrhalis was recently described as a respiratory pathogen
1ynm
40NotIPF12183(94)Type II Restriction Endonuclease (94)+BacteriaDesulfobacterium sp. (Deltaproteobacteria) and Syntrophomonas wolfei (Clostridiales) in a green sulfur bacteria Chlorobium phaeobacteroides clade
3bvq
41Bsp6IPF09504(95)Type II Restriction Endonuclease (95){1}+BacteriaFusobacterium nucleatum (Fusobacteria) sequence localizes in a Ureaplasma/Mycoplasma (Borrellomycetales) clade
42HindVP, HgiDI, BsaHIPF09519(96)Type II Restriction Endonuclease (96)+BacteriaPatchy taxonomic distributiona
43MjaIPF09568(67)Type II Restriction Endonuclease{1}++ProkaryotaMethanothermobacter thermautotrophicus within a BCF group clade
44TaqIPF09573(97)Type II Restriction Endonuclease (97)+Bacteria (Thermus, Aquficae, Nitrospirae)Thermodesulfovibrio yellowstonii (Nitrospirae) in a Hydrogenivirga sp. (Aquificae) clade
45SfiIPF11487(98)Type II Restriction Endonuclease (98)+BacteriaNo HGT observed, the phylogeny could not be resolved with reliable confidence
2ezv
46MvaI, BcnI2odh, 2oa9(99)Type II Restriction Endonuclease (99)+{2}BacteriaThermoplasma volcanium (Euryarchaeota) within mixed bacterial clades
47ThaI3ndh(100)Type II Restriction Endonuclease (100)+Archaea (Thermoplasmata)No HGT observed
48HSDR_N, HSDR_N_2, EcoR124I

PF04313

PF13588

COG4748

COG2810

COG0610

2w00, 3h1t

(101)Type I Restriction Endonuclease (101); EcoR124I cleaves DNA at a location distant from specific recognition site (102).++{1}ProkaryotaSimonsiella muelleri (β-proteobacteria) in a H. influenzae (γ-proteobacteria) clade. A single sequence from Vibrio splendidus (Vibrionales) locates in an Actinobacillus pleuropneumoniae & Haemophilus parasuis (Pastereullaceae) clade
Type IV Restriction Endonuclease (predicted, found mostly in Archaea)
49HindVIP, EcoPI

COG4889

COG4096

COG3421

COG3587

3s1s

(103)

Type I Restriction Endonuclease

Type II Restriction Endonuclease (104)

Type III Restriction Endonuclease (103)

Broad sequence and function profile due to wide, multidomain definitions of COG entities

+++Prokaryota & phagesLactobacillus helveticus (Lactobacillales) and Pseudomonas stutzeri (Pseudomonadales) form a perfectly supported group. Phylogeny is not well resolveda
50Mrr_cat, DUF2034

PF04471

PF10356

COG4127

COG1715

COG1787

1y88

(105)

Mrr restriction endonuclease (Methylated adenine recognition and restriction) restricts both adenine- and cytosine-methylated DNA (106).

DUF2034 function is unknown.

{2}+++

Eukaryota (without Plantae)

Bacteria

Archaea

Phages

No HGT observed, the phylogeny could not be resolved with reliable confidence
51Archaeal HJC

PF01870

COG1591

1hh1, 1gef

1ob8, 2wcw

2eo0

(24)HJC resolvase (107)+++Prokaryota (mostly Archaea) & Archaeal phagesA handful of unrelated bacteria: Fusobacterium nucleatum subsp. polymorphum, Fusobacterium sp., Hydrogenobaculum sp., Rhizobium leguminosarum bv. viciae, Ralstonia solanacearum, E. coli TA206, Nitratiruptor sp. and Synechococcus sp. form a clade within the Archaeal tree
52ERCC4, XPF, Mus81

PF02732

KOG0442

KOG2379

COG1948

1j22, 2bgw

2ziu, 2zix

2ziv

(9)

HJC resolvase (108)

DNA repair, structure specific endonuclease

++Archaea & EukaryotaNo HGT observed
53RecU, HJC Resolvase, Penicillin-binding protein-related factor A

PF03838

COG3331

1zp7, 1y1o

(24)HJC resolvase (109). The genomic context is well conserved and includes a penicillin-binding protein, a methylase and HhH domain containing proteins. Penicillin-binding proteins are involved in cell-wall biosynthesis.+Bacteria (Bacillus/Clostridium)Catonella morbi (Clostridiales) in a Lactobacillales clade. Acholeplasma laidlawii (Tenericutes) in a Bacillus clade
54Bacteriophage T7 endonuclease I, Phage_endo_IPF05367(110)HJC resolvase (110)+++Prokaryota & phagesHalanaerobium hydrogeniformans (Firmicutes) locates with Dehalococcoides sp. and Thermomicrobium roseum (Chloroflexi). Patchy distribution suggesting multiple transfers. Phages group with their hosts
2pfj
55tRNA intron endonuclease

PF01974

KOG4133

KOG4685

COG1676

1a79, 2cv8, 2gjw

2zyz, 2ohe, 3iey

3if0, 3ajv, 3p1y

(17)tRNA intron endonuclease, in the proximity of various tRNA synthases in archaeal genomes.++Archaea & EukaryotaNo HGT observed
56Sen15

PF09631

PF12858

2gw6

(111)A structural subunit of eukaryotic tRNA intron endonuclease (111)+Eukaryota (Opisthokonta, Amoebozoa)No HGT observed
57MutH

PF02976

COG3066

1azo, 2aoq

(6)Mismatch repairing enzyme (6). MutH cleaves a newly synthesized and unmethylated daughter strand 5′ to the sequence d(GATC) in a hemi-methylated duplex.+Bacteria (γ-proteobacteria)Plautia stali symbiont (unclassified bacterium) in a γ-proteobacteria clade
58VSR, DUF559, DUF2726

PF04480

PF03852

COG3727

COG2852

1cw0, 3hrl, 3r3p

(112)

Very short patch repair (Vsr) endonuclease that specifically removes T/G mismatches in DNA sequences targeted to cytosine methyltransferase (112).

Group I intron homing endonuclease (113)

{1}++ProkaryotaNo HGT observed
59TnsAPF08722(114)Transposase (114)+{1}BacteriaRicinus communis and Vibrio harveyi form a clade, which might be a long branch attraction phenomenon. Deinococcus proteolyticus in a Proteobacteria clade. Mixed clades containing: Bacilli, Chloroflexi, Cyanobacteria and Proteobacteria
1t0f
60XisHPF08814PfamfdxN element excision controlling factor (115)+Bacteria (mostly Cyanobacteria)Herpetosiphon aurantiacus in a Cyanobacteria clade. Beggiatoa sp. (γ-proteobacteria) in a Cyanobacteria cladea
2inb, 2okf
61DUF83, Cas_Cas4PF01930(5)Cas1 protein (YgbT) has nuclease activity against single-stranded and branched DNAs including HJC, replication forks and 5′-flaps (116).++{1}ProkaryotaNot resolved phylogeny. Aureococcus anophagefferens (Stramenopile, Eukaryota) sequence is localized in a mixed Bacteria clade. Aureococcus anophagefferens causes algal blooms. Planctomycetes are isolated from marine water
COG1468
COG2251
62RecBCD, Exonuclease V

PF04257

COG1330

COG3857

COG1074

1w36

(16)Exonuclease/helicase, a component of the RecBCD complex that handles double-strand breaks (DSB) (16). RecB alone has a weak helicase activity (117) and its nuclease domain generates single-strand regions at the ends of DSBs (5).+{1}Bacteria (Clostridium/Bacillus, Chlorobiales, γ-proteobacteria)Oryza sativa protein groups in an Enterobacteriaceae clade within a Serratia proteins
63DUF2800, PDDEXK_1

PF10926

PF12705

COG2887

(118)RecB-like, probable prophage proteins++Bacteria phagesDehalococcoides ethenogenes (Chloroflexi) sequence resides in a Clostridiales clade
64Viral alkaline exonucleasePF01771(30)Exonuclease processing viral genome during recombination (4). The enzyme displays RNase activity used in mRNA degradation pathways (4).+HerpesviralesNo HGT observed
2w45, 3fhd
65YqaJ, lambda-exonuclease

PF09588

COG5377

1avq, 3k93

3slp

(31)Exonuclease facilitating phage DNA recombination (31). The λ exonuclease is an ATP-independent enzyme that binds to dsDNA ends and processively digests the 5′-ended strand to form 5′-mononucleotides and a long 3′-overhang (119).+++

Bacteria

Eukaryota phages

No HGT observed
66RecE, DUF3799PF12684(120)Exonuclease from RecET recombination system (120)++Bacteria phageNo HGT observed, the phylogeny could not be resolved with reliable confidence
3h4r, 3l0a
67DEM1, EXO5PF09810PfamMitochondrial, single-strand-specific 5′-exonuclease releasing dinucleotides as the main products of catalysis. EXO5 binds to 5′-RNA termini of chimeric DNA–RNA molecules and, after sliding across the RNA substrate, cuts the DNA 2 nt from the RNA–DNA junction (121).{1}+++Archaea (Euryarchaeota)Methanocella paludicola in an Actinobacteria clade. Methanocella paludicola is a methanogen isolated from rice paddy soil. Eubacterium eligens (Clostridiales) in an Ascomycota clade (very long branch)
KOG4760Eukaryota
68ssp6803iPF11645(122)Homing endonuclease with a specificity profile extending over a long (17-bp) target site (122)++ProkaryotaPatchy distribution including 5 Haloarcheales and 2 Ktedonobacter sequences as well as Bacillus forming a sister clade to 5 sequences Cyanobacteria suggest a HGT history
2ost
69Rpb5 N-terminal domainPF03871(8)RNA Polymerase (8). It may hold together the Rpb1-β24/25 and Rpb1-α44/47-fold of RNA polymerase II, or their counterparts in the archaeal, viral and RNA polymerase I and III enzymes (123).+EukaryotaNo HGT observed
KOG3218
1dzf, 3h0g
70Arenavirus RNA polymerase N-terminal domain, virus L-ProteinPF06317(124)RNA Polymerase N-terminal domain that utilizes ‘cap snatching’ mechanism for viral mRNA transcription (125). Similar to groups 73 and 74+ArenavirusNo HGT observed
3jsb
71RecB, DUF91PF01939(126)DNA endonuclease specialized in cleavage at double-stranded DNA (dsDNA)/ssDNA junctions on branched DNA substrates (126)++Prokaryota (Actinobacteria, β-proteobacteria)All 3 sequences from Deinococcus-Thermus are located within the Archaea clade. The Proteobacteria sequences are close to the root, this topology is not well resolved
COG1637
2vld
72ERCC1-XPF, Swi10, Rad10PF03834(127)Nuclease of NER system incising oligonucleotide from damaged DNA strand (128)+EukaryotaNo HGT observed
KOG2841
COG5241
2a1i
73La Crosse virus L-protein2xi5(129)Cap-snatching Endonuclease; cleaves short and capped host primers that are subsequently used by viral RNA-dependent RNA polymerase to transcribe viral mRNAs (129)+BunyaviridaeNo HGT observed
74Viral L-proteinPF00603(130)Cap-snatching Endonuclease, mechanism identical to that described above (131)+Influenza A virusPhylogeny not resolved
3hw3
75D212PF12187(132)Uncharacterized nuclease suggested to take part in DNA replication, repair, or recombination (132)++Archaea (Sulfolobus) archaeal phagesPhages and prophages of Sulfolobus, together form one coherent clade
2w8m
76Archaea bacterial proteins of unknown function, DUF234PF03008(5)DEXX-box ATPase belonging to AAA+ superfamily; DEXX-box ATPases act to transduce the energy of ATP-hydrolysis into a conformational stress required for the remodeling of nucleic acid or protein–nucleic acid structure (133).++ProkaryotaTwo Treponema vincentii (Spirochaetales) sequences are in a Butyrivibrio proteoclasticus/ Ruminococcus bromii/Roseburia inulinivorans rumen bacteria (Clostridiales) clade
COG1672
77RAI1-like, Dom-3zPF08652(7)Exoribonuclease. Has a pyrophosphohydrolase activity towards 5′-triphosphorylated RNA (7).+EukaryotaNo HGT observeda
KOG1982
3fqg, 3fqi
78NARG2PF10505(134)Nuclear protein involved in thickness of the brain’s cortical gray matter regulation (57)+Eukaryota (without Plantae & Chromoalveolata)No HGT observed
79DUF911, Dna2

PF06023

PF08696

KOG1805

COG4343

(39)Dna2 processes common structural intermediates that occur during diverse DNA processing (e.g. lagging strand synthesis and telomere maintenance) (135). Dna2 is a dual polarity exo/endonuclease, and 5′ to 3′ DNA helicase involved in Okazaki Fragment Processing (OFP) (136) and DSB Repair (137). DUF911 function is unknown.+++Prokaryota & EukaryotaVery long branches, dubious positioning of various taxons
80YhgA-likePF04754(36)Putative transposase (138). The genomic context is not conserved even among strains of one species suggesting recent mobility.+Bacteria (γ-proteobacteria)Three Burkholderia rhizoxinica (β-proteobacteria) sequences are present on a Enterobacteriales clade forming a sister clade to a Yersinia clade
COG5464
81CoiA-likePF06054(39)Negative regulator of competence. CoiA is probably involved after DNA uptake, either in DNA processing or recombination (139).+Bacteria (Bacillus, Lactobacillus)No HGT observed
COG4469
82DUF524PF04411(36)Predicted restriction endonuclease (36). Co-occurs with a restriction GTPase or ATPase.++Bacteria & EuryarchaeotaMixed clades like: Geobacter uraniireducens (Deltaproteobacteria) together with Gallionella capsiferriformans (β-proteobacteria) and Chlorobium luteolum (Chlorobia)
COG1700
83Mitochondrial protein Pet127PF08634(134)5′-exonuclease responsible for processing the precursor to the mature form (140) involved in modulation of mtRNAP activity+

Alveolata

Fungi

Myxomycota

Excavata

Distribution limited to different unicellular eukaryote, not enough sequencing data for a HGT hypothesis
84Eukaryotic translation initiation factor 3 subunit 7, eIF-3-zeta, eIF3 p66, moe1PF05091(134)eIF3 p66 is the major RNA-binding subunit of the eIF3 complex; Cdc48, Yin6 and Moe1 act in the same protein complex to concertedly control ERAD and chromosome segregation (141).+EukaryotaNo HGT observed
KOG2479
85Secreted endonuclease distantly related to HJC resolvasePF10107(11)Predicted secreted endonuclease distantly related to archaeal HJC resolvase++{1}ProkaryotaA sequence of a bacteria-feeding nematode Caenorhabditis remanei in an Acinetobacter clade. Archaea sequences in Bacteria clades
COG4741
86DUF1064PF06356(39)Unknown. In Firmicutes co-occurs with RecT, DnaC, DnaB and SSB, which suggests a role in recombination. In Proteobacteria phage proteins are also present.++Bacteria phagesBeggiatoa sp. (γ-proteobacteria) within a Clostridiales clade
87DUF790PF05626(39)Unknown. Co-occurs with ResIII and helicase domains.++ProkaryotaA single sequence of Rubrobacter xylanophilus (Actinobacteria) locates with Cyanobacteria and Deinococci
COG3372
88VRR-NUCPF08774(39)A DNA repair nuclease recruited to DNA damage by monoubiquitinated FANCD2 (142) exhibits endonuclease activity toward 5′ flaps and has 5′ exonuclease activity. In γ-proteobacteria co-occurs with DEAD_2 helicase and bacterial extracellular solute-binding protein family POTD/POTF.+++Bacteria & Eukaryota & phagesNo HGT observed
KOG2143
89RmuCPF02646(39)Molecular function unknown. Involved in DNA recombination (143), neighborhood of metallopeptidases and MFS1 transporters+Bacteria (mostly γ-proteobacteria)Lentisphaera araneosa (Lentisphaerae) in an Oceanospirillales (Proteobacterial) clade, forms a clade together with Neptuniibacter caesariensis. Both bacteria were isolated from a surface water sample (144,145)
COG1322
90Uncharacterized conserved proteinCOG5482NewUnknown{2}+{1}Bacteria (mostly α-proteobacteria) & phagesRicinus communis (Plantae) forms a clade with a tumorigenic Agrobacterium radiobacter (Rhizobiales) within a Rhizobiales clade
91Predicted transcriptional regulatorCOG1395NewThe function is unknown but it likely binds nucleic acids. Harbors a HTH motif, co-occurs with a two-domain protein consisting of DUF1743 and tRNA_anti (PF01336) nucleic acid-binding OB-fold domain.+ArchaeaNo HGT observed
92DUF1052PF06319PfamCo-occurs with HisKA and Lactamase_B or YkuD (PF03734) which also gives β-lactam resistance.{1}+Bacteria (mostly α-proteobacteria)An uncultured Acidobacterium within a Rhizobiales clade with Nitrobacter, Bradyrhizobium and Rhodopseudomonas palustris. Acidobacteria, Nitrobacter, Bradyrhizobium are soil related bacteria, but R. palustris is found in sea sediments
COG5321
3dnx
93Sugar fermentation stimulation protein SfsAPF03749(146)Unknown, SfsA protein binds to DNA non-specifically (147). Connected with maltose metabolism (147). In γ-proteobacteria in the proximity of LigT and Pol A or with a C4-type zinc finger and nucleotidyltransferase domain. In Cyanobacteria co-occurs with transport proteins related to virulence. In Archaea with a MSF_1 transporter or Lactamase_B.+Bacteria (mostly Proteobacteria)Plautia stali symbiont (unclassified bacterium) groups with a Pantoea sp. clade (γ-proteobacteria)a
COG1489
94NERDPF08378(38)Unknown, described as nuclease-related (38)+{2}BacteriaPlanctomyces limnophilus (Planctomycetales) groups with Puniceispirillum marinum (α-proteobacteria). Mannheimia succiniciproducens (γ-proteobacteria) locates in a Neisseria (β-proteobacteria) clade. Clades with mixed taxonomic groups
95DUF1626PF07788(36)Unknown++ProkaryotaThermodesulfovibrio yellowstonii (Nitrospirales) within a Cyanobacterial clade, mostly C. raciborskii. Cylindrospermopsis raciborskii is a bloom-forming and potentially toxic river cyanobacterium. T. yellowstonii was isolated from thermal vent water. Patchy distribution in Bacteria suggesting multiple HGT events
COG5493
96UPF0102, RPA0323PF02021PfamIs often found with a TP_methylase (PF00590) domain. Tetrapyrrole (Corrin/Porphyrin) Methylases use S-AdoMet in the methylation of diverse substrates. The genomic context is well conserved for each bacterial class.++ProkaryotaCryptobacterium curtum (Actinobacteria) in a Clostridium cladea
COG0792
COG4998
3fov
97DUF1887PF09002PfamOccasionally co-occurs with phosphorylase superfamily PNP_UDP_1 (PF01048) (uridine phosphorylase) and zinc/cadmium/mercury/lead-transporting ATPase.++ProkaryotaThree M. smithii (Euryarchaeota) sequences form a clade with 2 sequences from Synechococcus sp. from Yellowstone (Cyanobacteria) and M. ruber (Thermales). Methanobrevibacter smithii is a methanogenic archaeon highly resistant to antibiotics
1xmx
98DUF1016PF06250(39)Co-occurs with restriction MTase, ResIII and ResI S domains, and mobile element domains (phage integrase, DDE). Might act as nucleic acid-binding element in restriction enzymes.{1}+{3}{2}BacteriaTrichoplax adhaerens (Placozoa) groups with a Bacteroidales clade with two additional HGT transferred sequences: Rickettsia felis (α-proteobacteria) and Legionella longbeachae (γ-proteobacteria). Ricinus communis (Plantae) locates with a Burkholderiales clade harboring other unrelated taxa from γ-proteobacteria: Thioalkalivibrio sp., Pseudomonas aeruginosa and Dickeya dadantii
COG4804
99DUF1703PF08011(36)There are 9 DUF1703 proteins in Methanospirillum hungatei DSM 864. Some of them reside in the proximity of multiple PAS fold domains and CheY sensor related genes. In Bacteroidetes the genomic context is not conserved due to a duplication.+{1}Bacteria (mostly Bacteroidetes)Nine sequences from M. hungatei form a sister clade to a Proteobacteria clade. This clade is grouped together with a Treponema clade. The rest of the tree belongs to Bacteroidetes
100DUF4143COG1373, PF13635PfamUnknown++ProkaryotaIlyobacter polytropus (Fusobacteriales) forms a clade with C. sticklandii (Clostridiales). Ilyobacter polytropus was isolated from marine anoxic mud
101DUF511PF04373(11)Unknown+BacteriaUnrelated sequences from Fibrobacterales, Chlorobiales, Clostridiales, Flavobacteriales and Bacteroidales on a Proteobacteria tree
COG2958
102DUF2887PF11103(11)Unknown. Co-occurs with transport related proteins.+Bacteria (Cyanobacteria)Methylococcus capsulatus and Beggiatoa sequences are found within a Cyanobacteria clade
103Restriction endonuclease-like fold superfamily protein3ijmPDBUnknown+Spirosoma linguale (Cytophagales)No HGT observed
104DUF1853PF08907(11)Unknown. The genomic context is conserved within bacterial families.+Bacteria (mostly Proteobacteria)Anacystis nidulans (Cyanobacteria), Planctomycetes and Flavobacteria within a Proteobacteria clade
COG3782
105UL24PF01646(36)The molecular mechanism is unknown however the UL24 protein is able to induce G2 cell-cycle arrest (148), disperse nucleolin (149) and alter the nuclei. The PD-(D/E)XK motif preservation is crucial for these functions (150).++HerpesviralesNo HGT observed
106DUF506PF04720(36)UnknownPlantaeNo HGT observed
Green algae
107TT1808, DUF820, Uma2PF05685(39)Predicted endonuclease. In Cyanobacteria the genomic context is well conserved. In γ-proteobacteria the context is not conserved and involves mobile elements suggesting recent mobility and/or acquisition.+BacteriaProteobacteria sequences within Firmicutes or Cyanobacteria clades. Very long branches. Multiple transfer
COG4636
1wdj, 3ot2
108DUF1780PF08682SCOPUnknown. Well conserved context+Bacteria (Pseudomonadales)No HGT observed
1y0k
109DUF2130PF09903PfamUnknown+{1}BacteriaParascardovia denticolens and Scardovia inopinata (Bifidobacteriales) in a Lactobacillales clade. One archaeon M. paludicola
COG4487
110DUF2726PF10881PfamUnknown. In Fusobacteria DUF2726 proteins are surrounded by mobile elements. This feature is less pronounced in other bacteria.++BacteriaMultiple transfers. Pirellula staleyi (Planctomycetes) forms a clade with Anaerolinea thermophila (Chloroflexi)
111RAP domainPF08373PfamUnknown. Initially claimed to bind RNA and abundant in Apicomplexans, present in proteins involved in mitochondrial stress sensing (151) and plant immunity (152).{1}+EukaryotaParachlamydia acanthamoebae is located with a lycophyte, Selaginella moellendorffii, long branches
112YaeQPF07152(153)Located with bleomycin resistance (Glyoxalase) and Acetyltransf_1 (GNAT). In P. aeruginosa biofilms a YaeQ mutant has decreased expression of genes encoding NADH dehydrogenase activity and cobalamin biosynthetic process and increased expression of secretion and pathogenesis genes (e.g. exoY, pscU and exsC). This mutant has biofilm-exclusive tobramycin fitness advantages. Tobramycin is an aminoglycoside antibiotic. YaeQ was reported to compensate (154), or not to compensate (155), for the function of the hemolysin transcription elongation protein RfaH.+Bacteria (Proteobacteria)Nitrospira defluvii on a Proteobacteria tree forms a clade with Leptothrix cholodnii. Ricinus communis (Plantae) groups with Methylotenera mobilis
COG4681
2ot9, 2g3w
3c0u
113PDDEXK_2PF12784PfamPutative transposase+{1}BacteriaPhylogeny not resolved
114PDDEXK_3PF13366PfamUnknown+++Prokaryota & VirusesMultiple transfers, mixed clades for Bacteria and Archaea or different Bacterial divisions
115PDDEXK_4PF14281PfamUnknown++{1}ProkaryotaRicinus communis (Plantae) is present in a Proteobacteria clade. Parabacteroides merdae a human gut bacterium found also in wounds forms a clade with a bacteria from termite hindguts Treponema primitia
116DUF4263PF14082NewUnknown{1}+{2}{1}BacteriaPopulus balsamifera subsp. trichocarpa (Plantae) sequence forms a clade with a non-pathogenic metal resistant bacterium Ralstonia metallidurans
117DUF3883PF13020NewUnknown+++Eukaryota & ProkaryotaPhylogeny not well resolveda
118DUF4420PF14390NewPutative transposase+{2}BacteriaMethanoplanus petrolearius (Euryarchaeota) and an uncultured archaeon locate within a Bacteria (Bacteroidetes/Actinobacteria) clade. Multiple transfers
119Replic_RelaxPF13814NewPlasmid replication (156) and plasmid DNA relaxation (157){1}+Bacteria (Bacillus/Clostridium & Actinobacteria)Streptococcus (Lactobacillales) locates within an Actinobacteria clade. Paenibacillus (Bacillales) sequence is found in an Actinobacteria clade
120Dam-replacing proteinPF06044(158)DNA adenine methyltransferase replacing protein (DRP), a restriction endonuclease (158){2}+{3}BacteriaPatchy distribution possibly due to multiple transfers
121TBP-interacting protein2czr(159)A family of proteins, that interact with TATA-binding protein (TBP) (159).+Archaea (Thermococcales)No HGT observed

Groups include closely related families and structures that share relatively high sequence similarity detectable with PSI-BLAST and RPS-BLAST.

aThe tree was not rooted due to dubious position of the rooting sequence.

The curly brackets in the taxonomy columns indicate the number of sequences if kingdom is represented only by a few sequences.

One hundred and twenty-one groups of proteins retaining PD-(D/E)XK nuclease fold

Newly identified PD-(D/E)XK families

According to extensive database and literature searches, 11 groups (3, 5, 14, 15, 26, 90, 91, 116, 117, 118, 119; Table 1) include proteins not previously assigned to the PD-(D/E)XK fold superfamily. Five of them comprise completely uncharacterized proteins from the DUF4420 (PF14390), DUF3883 (PF13020), DUF4263 (PF14082), COG5482 and COG1395 families. The remaining six newly detected groups cover functionally studied protein families that nevertheless lacked a fold assignment. These include the restriction endonucleases Tsp45I (PF06300), HaeII (PF09554), Eco47II (PF09553), ScaI (PF09569) and HpaII (PF09561), as well as Replic_Relax (PF13814), a predicted transcriptional regulator. We studied all newly detected families in detail to look for additional functional clues. COG1395, COG5482 and Replic_Relax (PF13814) usually occur in a fusion with HTH DNA-binding domains, which suggests a role in transcription regulation. DUF4263 (PF14082) and DUF3883 (PF13020) are often present in proteins that also contain an ATPase domain. Additionally, DUF3883 appears in a variety of domain architectures, including fusions with helicases, TF domains, protein kinases and MTases. Details of the identification of new families are summarized in Supplementary Table S3. One should note that only two of them were assigned to the PD-(D/E)XK superfamily with Meta-BASIC scores above the confidence threshold of 40.

Structure analysis

A comprehensive analysis of the identified structures allows us to better understand how the PD-(D/E)XK fold adapts to particular functions. Structural analyses are critical to further detection and classification of PD-(D/E)XK proteins and provide a solid background for rational hypotheses about structurally unstudied families. In the following sections we describe multiple aspects of structural change that blur the commonly recognized image of the restriction endonuclease-like proteins.

Core variability

The structural core of the PD-(D/E)XK phosphodiesterase fold includes only six major elements: four β-strands and two α-helices (Figure 3A). We believe that this minimalism contributes to the structural diversity of the superfamily. The first and the second core β-strands may comprise only a few residues (pdb|1y0k, Figure 3B), hardly forming a well-defined part of the central β-sheet. On the other hand, they can also be very long, forming a hairpin that barely interacts with the rest of the β-sheet and keeps the remaining region bent away from the core structure (RecBCD nuclease, pdb|1w36 chain C, Figure 3C). Even if all core secondary structures are present, their spatial arrangement may still vary significantly. In a canonical PD-(D/E)XK enzyme the α-helices remain in a roughly parallel orientation, whereas in the Pa4535 protein (pdb|1y0k, Figure 3B) they are almost perpendicular. In addition, we also observed circular permutations, e.g. in the HJC resolving enzyme (pdb|1j22), where the first core α-helix is formed by the C-terminal sequence region, while the N-terminal region forms the first core β-strand (Figure 3D). Finally, the repertoire of structural variation within restriction endonuclease-like proteins is further enriched by domain swapping. For instance, bacteriophage T7 endonuclease I (pdb|2pfj) exchanges the first core α-helix and the first core β-strand between separate chains, both forming catalytically active, dimerized domains (Figure 3E).
Figure 3.

Examples of structural diversity in the PD-(D/E)XK phosphodiesterase superfamily. (A) typical PD-(D/E)XK enzyme (Holliday junction resolvase, Pyrococcus furiosus, pdb|1gef); (B) highly diverged structure with short first β-strand and perpendicular orientation of core α-helices (Pa4535 protein, P. aeruginosa, pdb|1y0k); (C) structure deterioration and the loss of active site (RecC, E. coli, pdb|1w36C); (D) circular permutation of the first core α-helix (Hef endonuclease, Pyrococcus furiosus, pdb|1j22); (E) domain swapping (endonuclease I, Enterobacteria phage T7, pdb|2pfj). Active site PD-(D/E)XK signature residues are shown as red sticks.


Insertions to core

To investigate how well the fold accommodates additional structural elements, we studied the structures of known PD-(D/E)XK proteins. The PD-(D/E)XK structural core is often decorated with numerous insertions that tune the substrate-binding capabilities or enable protein-protein interactions (Supplementary Figure S1). The structure of Bacillus subtilis RecU resolvase (pdb|1zp7) is a remarkable example of tweaking the canonical restriction endonuclease core for a specific function. It has a characteristic stalk formed by extensions of the first and second β-strands that fits into the central region of a four-way junction and provides a scaffold for substrate-destabilizing interactions. Interestingly, using topology-based searches we identified the PD-(D/E)XK core fold in many unrelated structures (Supplementary Figure S2). This so-called ‘Russian-doll’ effect is discussed in more detail in Supplementary Materials [PD-(D/E)XK fold in other unrelated structures].

Active site variation

The fingerprint of PD-(D/E)XK active site residues varies between families (Figure 4). For instance, the signature motif proline can be replaced by any residue (mainly hydrophobic). Having a vast collection of PD-(D/E)XK proteins, we analyzed possible alterations to the archetypical active site architecture. Such information is fundamental for further effective searches for new, putative PD-(D/E)XK enzymes within uncharacterized protein families. The canonical active site is formed by an aspartic acid placed at the N-terminus of the second core β-strand and a glutamic acid followed by a lysine from the third β-strand, which places the carboxyl and amino groups in a suitable spatial arrangement. Interestingly, the glutamic acid and lysine may be shifted into nearby structural elements while still pointing their chemical groups towards the active site and preserving its catalytic functionality (10). We observed such migration in several structures: (i) Cfr10I restriction endonuclease (pdb|1cfr), where glutamic acid migrates from the third β-strand to the adjacent, second core α-helix resulting in the PD-XXK-E motif; (ii) EcoO109I restriction enzyme (pdb|1wtd), where glutamic acid moves from the expected position 124 to position 108 and now precedes aspartic acid from the PD motif (motif EPD-XXK); (iii) Pa4535 structural genomics hypothetical protein (pdb|1y0k), where lysine migrates from the expected position 70 to position 125 in the adjacent second core α-helix (motif PD-EXX-K). Interestingly, tRNA splicing endonucleases acquired a different active site within the restriction endonuclease-like fold. These enzymes conserve three catalytic residues: tyrosine, histidine and lysine (Y115, H125, K156 in a Methanococcus jannaschii endonuclease) that form an active site located on the opposite edge of the central β-sheet. Even though tRNA-splicing endonucleases share the common PD-(D/E)XK fold, they recognize a different substrate and possess a distinct catalytic mechanism.
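To make the canonical arrangement concrete, the following Python sketch scans a sequence for a P-D dipeptide followed, after a spacer, by a (D/E)-X-K triplet. It is only an illustration: the function name, the spacer bounds (`min_gap`, `max_gap`) and the toy sequence are assumptions and are not taken from this study. Because the catalytic residues can migrate between structural elements and the proline is often substituted, a plain pattern match of this kind would miss many genuine family members, which is why profile-based homology detection is needed in practice.

```python
import re

def find_canonical_motifs(seq, min_gap=5, max_gap=40):
    """Return (start, matched_text) for candidate PD...(D/E)XK hits.

    Pattern: P-D, a spacer of min_gap..max_gap residues, then D or E,
    any residue, then K. The spacer bounds are hypothetical defaults.
    """
    # The lookahead allows one candidate to be reported per start position;
    # the lazy quantifier picks the shortest spacer that completes the motif.
    pattern = re.compile(
        r"(?=(PD[A-Z]{%d,%d}?[DE][A-Z]K))" % (min_gap, max_gap)
    )
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]

# Toy sequence (not a real protein): PD, seven alanines, then E-V-K.
toy = "MST" + "PD" + "A" * 7 + "EVK" + "GG"
hits = find_canonical_motifs(toy)
```

Variants such as PD-XXK-E or EPD-XXK described above would each need their own pattern, and none of them captures the tRNA splicing endonuclease active site, which uses different residues altogether.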
Figure 4.

Active site variations observed in the PD-(D/E)XK phosphodiesterase superfamily structures. Observed variant of ‘PD-(D/E)XK’ signature motif is given below each structure with residue migration denoted in blue. (A) archaeal HJC resolvase (P. furiosus, pdb|1gef); (B) BamHI restriction endonuclease (Oceanobacter kriegii, pdb|3odh); (C) BstYI restriction endonuclease (Geobacillus stearothermophilus, pdb|1sdo); (D) EcoO109I restriction endonuclease (E. coli, pdb|1wtd); (E) Bse634I restriction endonuclease (Geobacillus stearothermophilus, pdb|1knv); (F) tRNA splicing endonuclease (Methanocaldococcus jannaschii, pdb|1a79); (G) Vsr repair endonuclease (E. coli, pdb|1cw0); (H) a putative endonuclease-like protein (Neisseria gonorrhoeae, pdb|3hrl); (I) Pa4535 protein (P. aeruginosa, pdb|1y0k).


Sequence analyses

Although most PD-(D/E)XK proteins have nuclease activity, they may also perform other diverse functions. Adaptation to a particular functional niche may involve the presence of additional protein domains encoded separately or together with the PD-(D/E)XK domain. Some functions are restricted to a certain taxonomic unit while others are widely distributed across the tree of life. To gain a general overview of sequence similarities, all 21 911 protein sequences were clustered with CLANS. The resulting clustering was colored according to both the taxonomic distribution of the sequences and protein function (Figure 5). Note that restriction endonucleases exhibit high sequence divergence, whereas housekeeping genes form tight clusters. Bacterial sequences are present all over the sequence space, in contrast to viral sequences, which appear only in a handful of sequence groups. Our analysis of the taxonomic distribution, genomic context and domain architecture of PD-(D/E)XK proteins should help to understand their biological relevance.
Figure 5.

CLANS clustering of 21 911 sequences belonging to 121 clades of the PD-(D/E)XK superfamily. The image was drawn with an in-house script based on CLANS run files. (A) illustrates the taxonomic distribution of analyzed sequences and (B) summarizes their functional annotation.
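
To make the idea of similarity-based grouping concrete, the sketch below links sequences whose pairwise P-value falls under a cutoff and reports the resulting connected groups (single-linkage grouping with a union-find structure). It is not CLANS and does not reproduce the layout shown in Figure 5; the function name, the input format and the `max_pvalue` cutoff are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_similarity(names, pairs, max_pvalue=1e-5):
    """Single-linkage grouping of sequences.

    names: iterable of sequence identifiers.
    pairs: iterable of (name_a, name_b, pvalue) pairwise comparisons.
    Two sequences are linked when their P-value is <= max_pvalue
    (an assumed, illustrative cutoff).
    """
    names = list(names)
    parent = {n: n for n in names}

    def find(x):
        # Walk up to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, pvalue in pairs:
        if pvalue <= max_pvalue:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Hypothetical identifiers and P-values, for illustration only.
    toy_names = ["s1", "s2", "s3", "s4"]
    toy_pairs = [("s1", "s2", 1e-20), ("s2", "s3", 1e-8), ("s3", "s4", 0.5)]
    print(group_by_similarity(toy_names, toy_pairs))
```

With a strict cutoff, divergent families such as restriction endonucleases would tend to fall apart into many small groups, whereas tightly conserved families would stay together, which is consistent with the pattern described above.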

Domain architecture

We extensively studied the domain organization of all collected PD-(D/E)XK proteins, which may provide a broader view of the diversity of functional associations in this superfamily and also hint at specific functions for uncharacterized and poorly annotated proteins. In particular, we identified fused protein domains, internal repeat regions, coiled-coils and transmembrane elements. We observed various interesting domain arrangements that adjust the PD-(D/E)XK protein function to a specific role (Supplementary Figure S3), although most of the analyzed proteins harbor a single PD-(D/E)XK domain. Altogether, we identified 535 fused protein domains of distinct functions in 79 PD-(D/E)XK groups (Supplementary Table S4). Some of the most interesting and newly observed domain architectures are described in Supplementary Materials [Domain architecture], whereas a complete list of domain arrangements is included as Supplementary Figure S3.
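
A tally of fused domains per group, of the kind summarized above, can be sketched as follows. This is a minimal illustration under stated assumptions: each protein is represented as a group label plus an ordered list of domain names, the label `PD-(D/E)XK` marks the core domain, and the group and partner-domain names in the usage example are hypothetical placeholders rather than data from this study.

```python
from collections import defaultdict

CORE = "PD-(D/E)XK"  # label assumed for the core domain in each architecture


def fused_domains_by_group(proteins):
    """Collect the distinct non-core domains seen in each group.

    proteins: iterable of (group, architecture), where architecture is an
    ordered list of domain names from N- to C-terminus.
    Groups whose members carry only the core domain are omitted.
    """
    partners = defaultdict(set)
    for group, architecture in proteins:
        partners[group].update(d for d in architecture if d != CORE)
    return {g: sorted(p) for g, p in partners.items() if p}


if __name__ == "__main__":
    # Hypothetical architectures, for illustration only.
    toy = [
        ("groupA", ["Helicase", CORE]),
        ("groupA", [CORE]),
        ("groupA", [CORE, "Methylase"]),
        ("groupB", [CORE]),
    ]
    summary = fused_domains_by_group(toy)
    print(summary)
    print(sum(len(v) for v in summary.values()), "fused domains in",
          len(summary), "groups")
```

Keeping the architecture as an ordered list preserves the N- to C-terminal arrangement, so the same representation could also be used to distinguish, for example, a partner domain that precedes the core from one that follows it.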

Taxonomic distribution and horizontal gene transfers

The abundance of possible functions within PD-(D/E)XK phosphodiesterase proteins raises the question of the origin of these enzymes. To gain some insight into the evolutionary history of these proteins, we looked at the taxonomic distribution of the 121 PD-(D/E)XK groups (Table 1 and Supplementary Dataset S2). Housekeeping genes such as HJC resolvase, RecBCD or tRNA intron endonuclease exhibit a broad taxonomic distribution. On the other hand, restriction endonucleases are usually unevenly distributed among a handful of specific orders of Prokaryota. Some PD-(D/E)XK proteins display a special taxonomic distribution. For example, the occurrence of Sen15, a subunit of the tRNA splicing endonuclease, is limited to Amoebozoa and Opisthokonta. Notably, in plants only two pre-tRNA molecules undergo splicing (tRNATyr and tRNAMet) (55) and the observed introns are significantly related in structure. The remaining Eukaryotic lineages could display alternative modes of tRNA intron endonuclease action. NARG2 and Pet127 proteins, also absent in plants, are known to participate in vital processes (32,56), but their molecular function is unknown. Pet127 is a mitochondrial protein involved in mtRNA polymerase regulation and mitochondrial mRNA maturation (32). The absence of Pet127 in plants raises the question of how mtRNA polymerase performance and mtRNA maturation differ in these organisms. Initial studies on NARG2 claimed that it is restricted to higher vertebrates and is involved in development (56). For example, human NARG2 protein is involved in the regulation of brain cortical gray matter thickness (57). Importantly, we found NARG2-like proteins to be present also in Amoebozoa and Metazoa.

We observed Horizontal Gene Transfers (HGTs) in the majority of the PD-(D/E)XK groups. In families that comprise multiple proteins originating from one taxon together with a protein from an evolutionarily distant species, HGT is the most parsimonious hypothesis explaining such an uneven distribution. Derived tree topologies are often obscured by long and deep, unresolved branches. The distorted clades occasionally encompass sequences of mixed taxonomic origin and may intriguingly group together archaeal and bacterial sequences. In 1996 Jeltsch and Pingoud (22) hypothesized that HGT affected the distribution and evolution of type II restriction enzymes. Our results corroborate their hypothesis. The patchy taxonomic distribution of restriction enzymes usually covers many unrelated taxonomic ranges, but is limited to a handful of representatives of each taxon. Housekeeping genes such as HJC and Vsr do not transfer laterally. The possible events for each of the 121 PD-(D/E)XK clades are summarized in Table 1. In Supplementary Materials (Taxonomic distribution and HGTs), we describe some of the most interesting HGT events, with special attention paid to human pathogenic bacteria and to Prokaryota to Eukaryota transfers. In summary, the patchy, narrow or wide taxonomic distribution, along with multiple HGT events, greatly contributes to the complexity of the world of PD-(D/E)XK proteins, which vary significantly in their structural features and display a wide range of domain architectures.
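
The parsimony argument above (many members from one taxon plus a member from a distant taxon) can be expressed as a simple compositional filter. The sketch below only flags candidates for closer inspection and is not the procedure used in this study, which also relied on tree topologies; the function name, the input format and the `dominant_fraction` threshold are assumptions.

```python
from collections import Counter


def flag_hgt_candidates(members, dominant_fraction=0.8):
    """Flag proteins that lie outside the dominant taxon of a family.

    members: list of (protein_id, taxon) pairs for one family.
    A family is considered only if a single taxon accounts for at least
    dominant_fraction of its members (an assumed, illustrative threshold)
    but not for all of them.
    """
    if not members:
        return []
    counts = Counter(taxon for _, taxon in members)
    top_taxon, top_count = counts.most_common(1)[0]
    share = top_count / len(members)
    if share < dominant_fraction or top_count == len(members):
        return []
    return [pid for pid, taxon in members if taxon != top_taxon]


if __name__ == "__main__":
    # Hypothetical family: nine bacterial members and one eukaryotic member.
    family = [(f"p{i}", "Bacteria") for i in range(1, 10)]
    family.append(("p10", "Eukaryota"))
    print(flag_hgt_candidates(family))
```

A flagged protein is not proof of transfer. Contamination, uneven genome sampling and differential gene loss can produce the same pattern, so each candidate would still need phylogenetic support.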

DISCUSSION

The PD-(D/E)XK proteins play important roles in many vital processes, including nucleic acid maintenance. Probably for this reason they are found in all living organisms. Across the superfamily, these proteins display a broad collection of general scaffold alterations which tweak their basic function to perform more specialized actions. The abundance of functions and the large evolutionary distances between particular PD-(D/E)XK families encouraged us to split the whole set of identified proteins into groups of sequences displaying obvious homology in terms of sequence comparison (Table 1). We expected such grouping to reflect the differences between functions and taxonomic distributions. Indeed, most of the defined groups show very coherent functions. The restriction enzymes and tRNA splicing endonucleases may be among the most prominent examples here. However, some of the groups are blurred in terms of sequence similarity and cover many, yet connected, functions including helicases, repair endonucleases, exonucleases and others. The difficulty of reproducing the functional partition in our grouping procedure is two-fold. First, the consensus sequence definitions that were used in our search included COG and KOG sequences, which tend to cover multiple domains. This might lead to extended sequence alignments and inflate the sequence similarity measure between distinct protein families. Second, the biological context of the analyzed proteins is complex, especially for housekeeping enzymes such as structure-specific repair nucleases. Alternative functions may emerge relatively fast, because homologous proteins may easily gain a new activity by fusing or interacting with unconventional protein domains. In our opinion, the precision of the grouping also strongly depends on the protein family concept, which remains unclear.

PD-(D/E)XK phosphodiesterases exhibit great variability in sequence and structure. There are potentially two major reasons for that. First, these enzymes are involved in a variety of biological processes which require a very diverse range of substrates to be recognized in both a sequence- and structure-specific manner. Second, high sequence dissimilarity, especially between restriction endonucleases, is the result of an evolutionary arms race between phages and bacteria (160). A detailed analysis of insertions into the common conserved core observed in the existing structures across multiple PD-(D/E)XK families suggests that most of the structural diversity is concentrated on the substrate-binding side (Supplementary Figure S1). The side opposite to the active site remains relatively unchanged. The PD-(D/E)XK fold can be described as gregarious (161), referring to its presence in several evolutionarily unrelated protein structures. N-acetyltransferases, lipases and dehydrogenases containing the PD-(D/E)XK domain as a substructure represent different folds (even fold classes) according to the SCOP database. This finding poses novel challenges to protein structure classification, which should probably describe the structural space of the α/β sandwich architecture as a continuum rather than as distinct folds. It also sheds new light on the possible mechanisms of fold change in the evolution of protein structure through structural drift (162), and may provide some hints about the evolutionary history of these proteins, suggesting that some of them might have evolved from a common ancestor.

We observed many rare multidomain architectures, which is a general feature of sequence space (163). We identified PD-(D/E)XK domains that co-occur with domains acting on nucleic acids, including methylases, helicases, resolvases, RNAse H, excision repair endonucleases, chromatin remodelers and DNA ligases. These domain architectures follow the main functional niche occupied by nucleases. However, proteins with the PD-(D/E)XK domain can also be involved in protein structure maintenance. Interesting examples are a hypothetical protein from Vitis vinifera discussed above (gi|147821195), which may be involved in the upkeep of both nucleic acid and histone protein structure, and Rai 1 from Polysphondylium pallidum (gi|281203778), followed by a COBRA domain, a BRCA1-related protein that contributes to chromatin remodeling. Another intriguing domain association is that of nucleases co-occurring with kinases. This might suggest that such proteins are somehow involved in triggering the response to nucleic acid aberrations.

We observed PD-(D/E)XK groups limited to one group of Archaea (ThaI REase in Thermoplasmata), present in a few unrelated taxa (BgII REase) or conserved and essential in all domains of life (Dem1/EXO5). This phenomenon might be explained by the different roles played by conserved and patchily distributed proteins. The former are rarely transferred and are inherited vertically; mutations in them are strongly deleterious. In consequence, they appear in broad taxonomic groups in a fixed number of copies per genome and in all representatives of a taxon. The latter offer additional adaptive advantages that are useful in a defined ecological niche and are frequently transferred laterally rather than inherited. The reported cases of HGT between human pathogenic bacteria or from bacteria to Eukaryotes additionally exemplify the complex evolution of the PD-(D/E)XK proteins.

CONCLUDING REMARKS

The aim of this project was to identify the most complete set of proteins retaining the PD-(D/E)XK fold. Such a collection is indispensable for a comprehensive view of this fold and enables further insight into the detailed biological functions, exact substrates and molecular mechanisms underlying the processes connected with nucleic acid cleavage. The large and extremely diverse PD-(D/E)XK superfamily covers both specialized and multifunctional enzymes, as well as proteins that lost their enzymatic activity and now serve as structural or nucleic acid-binding units. Some of the PD-(D/E)XK fold families are restricted to a single bacterial family while others are present in all living organisms. The PD-(D/E)XK domains may occur alone, with one additional protein domain or in elaborate domain contexts. Moreover, some of the PD-(D/E)XK families harbor proteins appearing once per genome, whereas others can display an increased number of copies. In humans the PD-(D/E)XK proteins can be linked to severe neurological diseases and may increase the probability of cancer.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–4, Supplementary Figures 1–4, Supplementary Materials, Supplementary Datasets 1–2 and Supplementary References [164-199].

FUNDING

EMBO Installation, Foundation for Polish Science (Team, Focus), Ministry of Science and Higher Education [N N301 159435, 0376/IP1/2011/71]; European Regional Development Fund under Innovative Economy Programme [POIG.01.01.02-14-054/09-00]; European Social Fund [UDA-POKL.04.01.01-00-072/09-00] grants. Funding for open access charge: Waived by Oxford University Press. Conflict of interest statement. None declared.