Origins and Molecular Evolution of the NusG Paralog RfaH.

Bing Wang^1,2, Vadim M Gumerov^1,3, Ekaterina P Andrianova^1, Igor B Zhulin^4,3, Irina Artsimovitch^4,2.

Abstract

The only universally conserved family of transcription factors comprises housekeeping regulators and their specialized paralogs, represented by the well-studied NusG and RfaH. Despite their ubiquity, little information is available on the evolutionary origins, functions, and gene targets of the NusG family members. We built a hidden Markov model profile of RfaH and identified its homologs in sequenced genomes. While NusG is widespread among bacterial phyla and coresides with genes encoding RNA polymerase and ribosomal proteins in all except extremely reduced genomes, RfaH is mostly limited to Proteobacteria and lacks common gene neighbors. RfaH activates only a few xenogeneic operons that are otherwise silenced by NusG and Rho. Phylogenetic reconstructions reveal extensive duplications and horizontal transfer of rfaH genes, including those borne by plasmids, and the molecular evolution pathway of RfaH, from "early" exclusion of the Rho terminator and tightened RNA polymerase binding to "late" interactions with the ops DNA element and autoinhibition, which together define the RfaH regulon. Remarkably, NusG is not only ubiquitous in Bacteria but also common in plants, where it likely modulates the transcription of plastid genes.

IMPORTANCE

In all domains of life, NusG-like proteins make similar contacts with RNA polymerase and promote pause-free transcription yet may play different roles, defined by their divergent interactions with nucleic acids and accessory proteins, in the same cell. This duality is illustrated by Escherichia coli NusG and RfaH, which silence and activate xenogenes, respectively. We combined sequence analysis and recent functional and structural insights to envision the evolutionary transformation of NusG, a core regulator that we show is present in all cells using bacterial RNA polymerase, into a virulence factor, RfaH. Our results suggest a stepwise conversion of a NusG duplicate copy into a sequence-specific regulator which excludes NusG from its targets but does not compromise the regulation of housekeeping genes. We find that gene duplication and lateral transfer give rise to a surprising diversity within the only ubiquitous family of transcription factors.

Keywords: NusG; RfaH; Spt5; antitermination; transcription

Year: 2020 PMID: 33109766 PMCID: PMC7593976 DOI: 10.1128/mBio.02717-20

Source DB: PubMed Journal: mBio Impact factor: 7.867

INTRODUCTION

RNA synthesis by RNA polymerase (RNAP) must be elaborately controlled in response to diverse intracellular and environmental cues. However, cellular RNAPs bind to DNA largely nonspecifically and depend on numerous accessory proteins to determine when and where to start, pause, and stop RNA synthesis. Among hundreds of transcription factor families, only NusG-like regulators are present in all domains of life (1). These proteins have similar structural cores (Fig. 1) consisting of a NusG N-terminal (NGN) domain and a C-terminal domain with a Kyprides-Ouzounis-Woese (KOW) motif (2); eukaryotic Spt5 proteins have several KOW domains and additional regulatory regions (3). Consistently with their common evolutionary origin and function, NGNs of NusG homologs from archaea, bacteria, and eukaryotes bind to the same sites on the elongating RNAP (4–6), composed of the clamp helix (CH) domain in the largest RNAP subunit (β′ in Bacteria) and the gate loop in the second-largest subunit (β in Bacteria). Once bound, NusG proteins (or their NGNs alone) promote processive, pause-free RNA synthesis (7), a function thought to be particularly important for the synthesis of very long RNAs. Recent structural studies revealed a common molecular basis for antipausing activity among all NusG-like proteins (4, 5, 8).

FIG 1

RfaH and NusG interactions with the transcription machinery. Autoinhibited RfaH interacts with the ops DNA hairpin formed on the RNAP surface, transforms into an active NusG-like state, and binds to the β′ clamp helices (CHs); NusG makes similar but weaker contacts with RNAP (see Fig. S1 in the supplemental material). The NusG-KOW domain binds to Rho and promotes termination. Residues that make important functionally validated contacts are shown as sticks. PDB accession numbers are as follows: NusG-Rho binary complex, 6DUQ; autoinhibited RfaH, 5OND; RfaH bound to ops-paused transcription elongation complex, 6C6S.

FIG S1

Structural comparison of NusG- and Spt5-NGN. (A) Structure of E. coli NusG-RNAP complex (PDB accession no. 6C6U). The β′ clamp helix (CH)-binding residues are indicated in blue. The position of the β-hairpin loop was determined by superposition over the NusG structure (PDB accession no. 2K06). (B) Structure of Pyrococcus furiosus Spt5-NGN-RpoE′′-RNAP complex (PDB accession no. 3QQC). RPB1 CH-binding residues are shown in blue. RpoE′′-binding residues are in magenta, with the invariant E49 involved in the acid-dipole interaction with Spt4 shown in yellow. (C) Alignment of E. coli NusG-NGN (NCBI accession no. NP_418409.1) and P. furiosus Spt5-NGN (WP_011013134.1). NusG (top) and Spt5 (bottom) logos were generated from sequences collected from GTDB_reps (sequences were reduced at 90% identity by CD-hit (88); a total of 4,220 NusG and 154 Spt5 sequences were included in the final alignment). CH-binding residues are indicated with blue circles; E49 is indicated with a yellow circle. The β hairpin loop and RpoE′′-binding region are shown in green and magenta boxes, respectively.

NusG homologs comprise two distinct families, which are correlated with the architecture of their respective target RNAPs (Fig. S1).
In bacteria, NusG binds to a “minimal” RNAP typically composed of five subunits and promotes uninterrupted RNA synthesis (9). Although NusG can interact with other proteins as part of specialized antitermination complexes (10), it does not require any accessory factors for binding to RNAP. In contrast, in eukaryotes and archaea, which have more complex 12+ subunit RNAPs, Spt5 has an obligatory partner, a small zinc finger protein, Spt4 (called RpoE′′ in archaea). Spt4 and Spt5 form an extensive interface with several conserved residues (11, 12); among them, a universally conserved Glu residue is essential for Spt4/5 binding, and its replacement with Gln (the corresponding residue in NusG) abolishes their interactions (13). Together, Spt4/5 (DSIF in metazoans) promote transcription elongation similarly to NusG (8, 14). Spt4 was long thought to simply buttress Spt5 stability (11, 14), but recent structural data suggest that it also contributes to maintaining RNAP processivity, for example, during transcription through nucleosomes (15). Spt4 binds to Spt5-NGN opposite the RNAP interaction surface, and several conserved basic residues in Spt4 form a part of the upstream DNA channel (4). In NusG, a positively charged β-hairpin loop is positioned similarly to Spt4 (5, 16) and may interact with the upstream DNA duplex (17); large modulatory domains present in place of the β-hairpin in some NusG proteins may contribute to DNA interactions (2, 13). The presence of the β-hairpin is incompatible with an auxiliary protein binding to NusG in the manner in which Spt4 binds to Spt5 (11); accordingly, Spt5 proteins do not have insertions at this position (Fig. S1). Within a given cell, NusG and its paralogs can be viewed as alternative transcription elongation factors which compete for binding to RNAP, similarly to σ initiation factors (7). This analogy is strengthened by the fact that NusG and σ (or Spt5 and TFE in Archaea) share the binding site on RNAP (18, 19).
However, in stark contrast to σ factors, which perform the same function at their cognate promoters, NusG-like proteins play surprisingly multifaceted roles, as can be illustrated in Escherichia coli, which encodes two of the best-characterized members of this family: an abundant and essential housekeeping NusG protein and its scarce nonessential specialized paralog RfaH (20). NusG promotes productive RNA synthesis as part of antitermination complexes (10) or by coupling transcription to translation via direct contacts with the ribosome (21, 22). Yet if RNA is useless or potentially harmful, as is the case with many xenogenes, the NusG-KOW domain interacts with the termination factor Rho to induce its early release from RNAP (23); in fact, silencing of xenogenes constitutes an essential function of E. coli NusG (24). RfaH plays an opposite role; it activates expression of xenogenes (7), many of which encode virulence factors, and is required for virulence in enteric pathogens (7). While NusG associates with RNAP transcribing all operons (20), RfaH is recruited to its targets only at operon polarity suppressor (ops) elements in the nontemplate DNA strand in the transcription bubble (20, 25). The ops signal halts RNAP to provide more time for RfaH recruitment and forms a short DNA hairpin that interacts with the RfaH-NGN to induce RfaH transformation from an autoinhibited state to an activated state (26) (Fig. 1). Once bound, RfaH excludes NusG from the transcribing RNAP, thereby insulating it from Rho, and activates translation by recruiting the ribosome (20, 27). Extensive genetic, biochemical, and structural data available for RfaH and NusG provide a detailed molecular context for understanding their effects on gene expression. While both proteins interact with similar regions on RNAP, RfaH binds much more tightly (5), giving RfaH advantage to compete with 100-fold more abundant NusG (28), and only NusG interacts with Rho (23). 
These proteins make similar contacts with the ribosomal protein S10 (21, 27), but in the case of RfaH, a dramatic metamorphosis (in which the entire RfaH-KOW motif refolds from an α-helical hairpin observed in free, autoinhibited RfaH [29] to a β-barrel) is required to expose the residues that interact with S10 (27). This switch is triggered when RfaH binds to the ops-paused RNAP (30). In contrast, relatively little is known about NusG homologs present in diverse bacteria (31). An emerging view is that specialized NusG paralogs (NusGSPs) function as dedicated antiterminators of long, difficult-to-express gene clusters required for adaptation to diverse environments, including human hosts. Bacterial genes shown to be dependent on NusGSPs for expression encode adhesins, capsular polysaccharides, conjugation machinery, polyketide antibiotics, and toxins (7). While RfaH is recruited to ops sites in the leader regions of several unlinked chromosomal targets (20), some NusGSPs are encoded within the operons that they regulate (32, 33) and their modes of recruitment are unknown. In this work, we set out to reconstruct the origins and evolutionary history of RfaH and its relationship to NusG, expanding previous phylogenetic analysis (31) to incorporate the growing number of sequences in public databases and recent experimental insights into the functions of these proteins. Using sensitive profile searches, including those with a newly constructed profile model for RfaH, we revealed the phyletic distribution of NusG and RfaH across the tree of life. Our results show that ancient and recent gene duplication, horizontal gene transfer, and rapid functional divergence of paralogs underlie the evolution of the NusG family. One of these NusG duplications, which occurred in Proteobacteria, led to the emergence of RfaH. 
Changes within the key functional regions of NusG paralogs suggest that nascent NusG duplicates have gradually morphed into fully specialized RfaH-like regulators by losing contacts with Rho first and acquiring sequence-specific DNA contacts last. We found that NusG homologs are encoded in most plants and photosynthetic protists and in all except severely reduced bacterial genomes. These results support a notion that NusG modulates transcription in nearly every cell that utilizes RNAP of the bacterial type.

RESULTS AND DISCUSSION

In addition to housekeeping NusG/Spt5 proteins, their specialized paralogs are known in bacteria and eukaryotes (31, 34). These paralogs are assumed to have arisen by gene duplication, followed by adaptation to unique regulatory demands, e.g., upregulation of virulence genes during bacterial pathogenesis, a key function of several NusG paralogs in Gram-negative bacteria. Among many bacterial NusG paralogs (31), only a handful have been characterized, but even cursory analyses revealed a surprising diversity in their primary sequence, function, and even structure. NusG-like proteins modulate gene expression through a network of contacts with RNAP, nucleic acid signals, and ribosome (7). In-depth studies of E. coli NusG and RfaH provided atomic-level details of these interactions and identified dramatic conformational changes that underlie their differential recruitment mechanisms (Fig. 1).

New RfaH model.

NusG homologs are widely distributed across all three domains of life (Fig. 2A), but they are very diverse, likely reflecting adaptation to very different niches. This diversity necessitates the use of robust models to investigate the evolution of the NusG family. We needed a model that can reliably distinguish RfaH proteins from the rest of the NusG family. Pfam (35), the leading protein domain database, does not have a specific RfaH model, and its NusG model (PF02357) cannot distinguish NusG from its paralogs. An RfaH-specific model is available in TIGRfam, but this model (TIGR01955) was constructed using only five sequences and was last modified in 2011. Using Pfam guidelines, we built a new hidden Markov model (HMM) profile for RfaH based on 260 seed sequences (see Methods in Fig. S2 in the supplemental material). The new RfaH model detected 4,173 sequences in the UniProtKB database (36), while the TIGRfam RfaH model detected only 2,955. The new RfaH model has been deposited in the MiST database (37) and will be available in its next release.

FIG 2

The distribution of NusG-like factors. (A) NusG/Spt5 factors were identified using NusG and Spt5-NGN Pfam models, respectively, in Aquerium (93; http://aquerium.zhulinlab.org/). The outer ring shows the number of hits; the darker the color, the more hits it represents. The inner rings represent the major taxonomic ranks and supergroups for eukaryotes (93). E, Eukaryota; A, Archaea; B, Bacteria. Plantae are green. (B) RfaH distribution in bacteria on the phylum level. The genome tree was downloaded from AnnoTree (77; http://annotree.uwaterloo.ca/). Phyla with representatives that contain RfaH (based on hits with our new model) are highlighted in purple. Numbers appended after taxons indicate the number of genome hits divided by the total number of genomes. (C) RfaH distribution in Proteobacteria. The percentages of genome hits were calculated for RfaH-containing families with ≥10 genomes. Families with >50% hits are shown in red, and those with <50% hits are shown in blue. A genome tree of representative Gammaproteobacteria is shown. This and other genome trees are maximum-likelihood trees inferred from the alignment of 120 ubiquitous single-copy proteins (53).

FIG S2

Example of reciprocal best BLAST hits. Protein sequences were used as queries in BLAST against each other’s genome. If two sequences find each other as the best-scoring match in each other’s genome, they will be called reciprocal best BLAST hits and connected with a line. The number of mutual connections is indicated inside circles. Black circles, chromosomal RfaH; purple circles, plasmid RfaH. RfaH is presented as a genus name and protein NCBI accession number. A high number of mutual connections was observed for every representative, indicating that these representatives belong to the same orthologous groups.
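The reciprocal-best-hit criterion described in the legend above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the authors' pipeline: the function names, the input layout (a dictionary of pairwise bit scores plus a protein-to-genome map), and the toy identifiers and scores are all assumptions made for the example.

```python
from collections import defaultdict


def best_hits(scores, genome_of):
    """Map (query, target_genome) -> best-scoring subject in that genome.

    scores: dict {(query, subject): bit_score} (hypothetical input layout).
    genome_of: dict {protein_id: genome_id}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if genome_of[query] == genome_of[subject]:
            continue  # only hits in *other* genomes are considered
        key = (query, genome_of[subject])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: subject for key, (subject, _) in best.items()}


def reciprocal_best_hits(scores, genome_of):
    """Return the set of protein pairs that are each other's best hit."""
    best = best_hits(scores, genome_of)
    pairs = set()
    for (query, _genome), subject in best.items():
        # The pair is reciprocal if the subject's best hit in the
        # query's genome is the query itself.
        if best.get((subject, genome_of[query])) == query:
            pairs.add(frozenset((query, subject)))
    return pairs


def connection_counts(pairs):
    """Number of mutual connections per protein (the circled numbers)."""
    counts = defaultdict(int)
    for pair in pairs:
        for protein in pair:
            counts[protein] += 1
    return dict(counts)


# Toy data: identifiers and scores are invented for illustration only.
genome_of = {"a1": "A", "b1": "B", "b2": "B", "c1": "C"}
scores = {
    ("a1", "b1"): 200, ("a1", "b2"): 90, ("b1", "a1"): 195, ("b2", "a1"): 80,
    ("a1", "c1"): 150, ("c1", "a1"): 140, ("b1", "c1"): 160, ("c1", "b1"): 120,
    ("b2", "c1"): 70, ("c1", "b2"): 60,
}
pairs = reciprocal_best_hits(scores, genome_of)
print(sorted(sorted(p) for p in pairs))
print(connection_counts(pairs))
```

In this toy set, the second copy in genome B (`b2`) finds `a1` and `c1` as its best hits, but neither finds it back, so it receives no connections; this is the behavior one would expect for a divergent paralog rather than an ortholog.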

Distribution of housekeeping NusG.

Although presumed to be ubiquitous, NusG was absent in a few (7 out of 711) representatives of COG0250 (38; https://www.ncbi.nlm.nih.gov/COG/). We extended this analysis to a data set of nearly 20,000 representative bacterial and archaeal genomes from the Genome Taxonomy Database (39), to which we refer here as GTDB_reps (see Materials and Methods). In Archaea, Spt5 is widespread (Fig. 2A) but not ubiquitous: using Spt5-NGN as a model, we identified Spt5 in only 789 out of 847 archaeal genomes. A similar trend was observed in bacteria, where 6% of bacterial GTDB_reps genomes had no identifiable NusG proteins (Data Set S1A and -B). The lack of NusG/Spt5 may be due to (i) incomplete genome assemblies or sequencing errors, (ii) gene loss, or (iii) the low sensitivity of the search model. To evaluate these scenarios, we analyzed NusG homolog distribution in ∼130,000 bacterial genomes from the NCBI nonredundant database. Among them, 1,879 appeared to lack NusG homologs (Data Set S1C), but no clear pattern emerged. Moreover, approximately the same fraction of genomes lacked SecE, RecA, and essential ribosomal proteins L5, L6, S2, and S7 (Data Set S1C). The absence of essential core genes in a significant fraction of genomes is most likely due to technical issues arising during genome sequencing/assembly and exposes limitations of this broad-stroke approach, necessitating in-depth analysis. By analyzing 13,140 NusGs (Data Set S1A) using TREND (40; http://trend.zhulinlab.org), we found that nusG is invariably present within a highly conserved operon that encodes the protein translocase SecE and 50S ribosomal proteins. Thus, we further investigated secE-nusG?-rplK-rplA genomic loci in 183 genomes that appear to lack NusG but contain SecE and ribosomal protein L1 (rplA), as well as RecA and L5, L6, S2, and S7 (Data Set S1D).

DATA SET S1

Data sets generated in this study. (A) NusG collected by the NusG TIGRfam model in GTDB_reps. (B) Statistics of genome hits of NusG TIGRfam in GTDB_reps. (C) NusG homolog distribution in 129,663 bacterial genomes. In model hits, 0 indicates no hits and 1 indicates hits. The total genome hits of different models are presented. (D) Genomes that lack NusG Pfam hits but have other selected core proteins. The NusG family absence was manually checked for complete genomes (those that have a Complete Genome assembly level). (E) NusG distribution in bacterial endosymbionts. The absence of NusA/NusG/RpoZ (ω) is highlighted in yellow; in contrast, some core genes, such as those encoding an essential GroELS chaperone system, are present in all genomes. One representative of each species is shown except for “Candidatus Sulcia muelleri,” whose patterns differ among 39 sequenced representatives. Only a small number of genomes larger than 0.5 Mb are shown. (F) NusG homolog detection in Plantae and Chromista. The presence of a chloroplast transit signal (CTS) was predicted by the ChloroP 1.1 server (http://www.cbs.dtu.dk/services/ChloroP/). Sequences with a CTS are highlighted in green. (G) Comparative genome analysis of Plantae and Chromista NusG homologs. (H) RfaHs collected by the new RfaH model in GTDB_reps. (I) Statistics of genome hits of a new RfaH model in GTDB_reps. (J) Representatives for topology analysis (Fig. 3B to D). (K) RfaH distribution in subspecies of Pseudomonadaceae. (L) Representatives used for building the phylogenetic trees of UpxY, RfaH, and NusG (Fig. S7). (M) Representatives of the molecular evolution study (Fig. 5). (N) The eight RfaH clusters (CLs) generated by Markov clustering.

FIG 3

Maximum-likelihood phylogenetic trees. (A) NusG-like proteins are widespread. (B to D) Topology of bacterial trees, with monophyletic groups colored in the genome tree (B). The two clades of Alphaproteobacteria (Alpha) are red and purple; one clade of Zetaproteobacteria (Zeta) is gray. The remaining clades belong to Gammaproteobacteria (Gamma). The branches of NusG (C) and RfaH (D) trees are colored according to the genome tree. Black dots indicate bootstrap values of >50% (A) or >70% (B to D).

FIG 5

Molecular evolution of NusG and RfaH. (A) Spt5 (black), NusG (gray), unknown NusGSP (light pink), and RfaH (hot pink) are marked on the maximum-likelihood phylogenetic tree. Archaeal Spt5 is used as an outgroup. NusGs with the same pattern of functional sites are collapsed. (Top) Selected functional residues in RfaH and NusG are color coded and numbered as in E. coli RfaH/NusG (NCBI accession no. NP_418284.1/NP_418409.1). Lighter colors indicate conservative substitutions. CL1 to -8 denote RfaH clusters. (B) A stepwise conversion of NusG into RfaH.

To ensure genome completeness, we selected only those NusG-less representatives that have a “complete genome” assembly level (12 total). Analysis of the secE-nusG?-rplK-rplA operons identified 1-nt frameshifts in the nusG open reading frames (ORFs) in 11 genomes. Among these, 9 belong to species for which other sequenced genomes carry an intact nusG, whereas the other two are the only sequenced genomes of their species, although genomes of close relatives that encode NusG are available (Data Set S1D). The nusG gene was deleted from “Candidatus Evansia muelleri,” an endosymbiont with a severely reduced 0.36-Mbp genome. Consistently, six out of seven NusG-less COG0250 representatives have genomes smaller than 0.28 Mbp, whereas the remaining genome is incomplete. These findings suggest that endosymbionts with reduced genomes may function with reduced transcription machinery. In E. coli, a transcribing five-subunit core RNAP (α2ββ′ω) associates with NusA and NusG across the entire genome (20); both Nus factors are essential in wild-type E. coli. We wondered if NusA and ω, which acts as a chaperone and is not essential (41), could also be absent in endosymbionts. We analyzed complete genomes ranging from 0.11 to 5+ Mbp (Data Set S1E). We found that none of the genomes smaller than 0.2 Mbp encoded NusG or NusA, whereas genomes larger than 0.36 Mbp encoded both proteins. In genomes bridging these groups, all possible NusA/NusG distribution patterns were observed, sometimes varying between genomes of the same species. Interestingly, ω is absent from many endosymbionts (Data Set S1E), as well as from some free-living bacteria (COG1758). We conclude that all bacterial genomes with the exception of severely reduced genomes encode NusA and at least one NusG family protein. While this conclusion may appear trivial in the case of the “ubiquitous” regulator, nusG has been shown to be dispensable in some model organisms grown under laboratory conditions, such as Bacillus subtilis (42), and can even be deleted in E. coli lacking toxic prophages (43), albeit at a marked fitness cost. Clearly, bacterial survival and adaptation to complex environmental conditions impose requirements different from those of growth in rich medium at an optimal temperature.
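The genome-size comparison above amounts to a simple cross-tabulation, which the following Python sketch illustrates. The 0.2- and 0.36-Mbp cutoffs are taken from the text; the function names, the tuple layout, and every genome entry in `toy_genomes` are hypothetical placeholders, not values from Data Set S1E.

```python
from collections import Counter, defaultdict


def size_bin(size_mbp):
    """Assign a genome to one of the three size classes discussed in the text."""
    if size_mbp < 0.2:
        return "<0.2 Mbp"
    if size_mbp > 0.36:
        return ">0.36 Mbp"
    return "0.2-0.36 Mbp"


def tally_patterns(genomes):
    """Count NusA/NusG presence patterns per size class.

    genomes: iterable of (size_mbp, has_nusA, has_nusG) tuples
    (an assumed layout chosen for this example).
    """
    table = defaultdict(Counter)
    for size_mbp, has_nusa, has_nusg in genomes:
        pattern = "{}/{}".format(
            "NusA+" if has_nusa else "NusA-",
            "NusG+" if has_nusg else "NusG-",
        )
        table[size_bin(size_mbp)][pattern] += 1
    return {size_class: dict(counts) for size_class, counts in table.items()}


# Invented entries for illustration only.
toy_genomes = [
    (0.11, False, False),
    (0.16, False, False),
    (0.25, True, False),
    (0.30, False, True),
    (0.36, True, False),
    (0.64, True, True),
    (4.60, True, True),
]

for size_class, counts in tally_patterns(toy_genomes).items():
    print(size_class, counts)
```

A table of this kind makes the intermediate size class easy to inspect, since that is where mixed NusA/NusG patterns are reported.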

Expansion of NusG taxonomic presence.

Realizing that NusG is not restricted to prokaryotes (Fig. 2A), we investigated its distribution further. Using phylogenetic profiling with the most recent Archaeplastida taxonomy (44), we established that, in addition to Spt5, NusG homologs are encoded in the genomes of all major land plant and algal lineages except for some green algal species (Data Set S1F). In addition to identifying NusG homologs in Archaeplastida, we identified them in the genomes of various phyla of photosynthetic chromists (Fig. 3A and Data Set S1F). All genomes in which we could not identify NusG were of poor quality and only partial. All identified NusG homologs in Plantae and Chromista are encoded in the nuclear genomes, except in the Paulinella genus. We hypothesize that these “bacterial” regulators have been retained to assist RNA synthesis by plastid-encoded RNA polymerase (PEP) of the bacterial type. Several lines of evidence support this hypothesis. First, a NusG homolog of a model organism, Arabidopsis thaliana, annotated as “plastid transcriptionally active 13” protein (pTAC13), has been identified as a component of the active transcriptional machinery in chloroplasts (45). Second, a Rho ortholog has been shown to terminate transcription by Arabidopsis PEP (46). Finally, ChloroP 1.1 (47) predicted the presence of a chloroplast transit signal in several newly identified NusG-like proteins (Data Set S1F). Pervasive plastid transcription has been documented in protists (48, 49).
In rhizarian amoebas of the Paulinella genus, nusG is carried in the remnants of a bacterial genome: a photosynthetic organelle called chromatophore. Paulinella representatives formed an evolutionarily recent symbiotic relationship with a photosynthetic cyanobacterium independently from the primary endosymbiosis that gave rise to plastids in Archaeplastida (50, 51). Our phylogenetic analyses revealed that Paulinella NusG is nested within the bacterial NusG cluster in the branch with Synechococcus (Fig. 3A), which is considered to be the ancestor of chromatophores (52). Phylogenetic analysis showed that eukaryotic NusG sequences from Plantae and Chromista formed clusters separate from bacterial and archaeal NusGs (Fig. 3A). Comparative genome analysis using plant and Chromista NusG proteins did not identify any single bacterial group to which all eukaryotic NusG proteins would be most similar (Data Set S1G). These data strongly suggest the presence of a progenitor NusG-like protein in the last universal common ancestor (LUCA).

RfaH evolution events.

A total of 1,922 RfaH proteins were found in 23 out of 117 phyla of Bacteria (Fig. 2B; Data Set S1H and -I), with ∼95% of RfaHs being found in Proteobacteria. Seventy percent and 18% of rfaH genes are found in Gammaproteobacteria and Alphaproteobacteria, respectively (Fig. 2C; Fig. S3). Further analysis revealed that families with a high percentage of hits for RfaH are clustered around the Enterobacteriaceae (Fig. 2C; Fig. S4). Although in the majority of lineages, the rfaH gene is likely a result of vertical evolution, the presence of rfaH-like genes on plasmids and prophages suggests that some RfaHs were acquired via horizontal gene transfer (HGT). To evaluate this possibility, we compared the topologies of phylogenetic trees (Fig. 3B to D; Data Set S1J). The three classes of Proteobacteria on the NusG tree were well separated, and the clades inside each class showed a topology nearly identical to that of the genome tree built using 120 ubiquitous marker genes for microbial classification, bac120 (53). In contrast, the RfaH tree topology was different from that of the genome tree, suggesting that while the evolution of NusG was vertical, HGT events contributed substantially to the evolution of RfaH.

FIG S3

New RfaH model hits at the family level of Bacteria. The maximum-likelihood phylogenetic tree was downloaded from AnnoTree (77; http://annotree.uwaterloo.ca/). The percentage of RfaH hits was calculated for families with ≥10 genomes and is shown as bars on the outer ring. The percentages of RfaH genome hits are high in the Gamma- and Alphaproteobacteria.

FIG S4

New RfaH model hits in families of Alphaproteobacteria (A) and Gammaproteobacteria (B). The maximum-likelihood phylogenetic tree was downloaded from AnnoTree (77; http://annotree.uwaterloo.ca/). “Untitled” indicates that there is currently no corresponding taxonomy. The percentages of genome hits were calculated for RfaH-containing families with ≥10 genomes. Families with >50% hits are in red and those with <50% hits in blue.

To study RfaH evolution in more detail, we analyzed RfaH distribution in two well-studied families of Gammaproteobacteria: Enterobacteriaceae and Pseudomonadaceae. Among 486 genomes of Enterobacteriaceae, ∼84% have RfaH. A previously defined representative genome data set of Enterobacteriaceae (54) was used for closer examination of RfaH distribution (Fig. 4). Among these genomes, three contained rfaH genes on plasmids, but the best BLAST hits of these plasmid-borne rfaH genes were to chromosomal genes from different strains, suggesting that RfaH can spread between strains on plasmids (Fig. 4). The plasmid RfaH formed a separate branch on a phylogenetic tree (Fig. S5). On the other hand, we observed similar topologies of the RfaH and ribosomal trees within Enterobacteriaceae (Fig. 4; Fig. S5). Thus, we conclude that both vertical inheritance and HGT events shape RfaH evolution.
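The per-family summary used in the figures (percentage of genomes with an RfaH hit, restricted to RfaH-containing families with ≥10 genomes, colored by a 50% cutoff) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the input layout, and the family names and counts are hypothetical, and because the legends do not say how a family at exactly 50% is colored, the sketch arbitrarily treats it as "blue".

```python
def classify_families(family_counts, min_genomes=10, threshold=50.0):
    """Compute percent RfaH hits per family and assign a display color.

    family_counts: dict {family: (genomes_with_rfah, total_genomes)}
    (an assumed layout). Families with fewer than `min_genomes` genomes
    or with no RfaH hits are skipped, mirroring the figure legends.
    """
    result = {}
    for family, (hits, total) in family_counts.items():
        if total < min_genomes or hits == 0:
            continue
        percent = 100.0 * hits / total
        # Assumption: exactly 50% falls in the "blue" class.
        color = "red" if percent > threshold else "blue"
        result[family] = (round(percent, 1), color)
    return result


# Hypothetical counts, not taken from the data sets of the study.
toy_counts = {
    "FamilyA": (42, 50),  # 84% of genomes have a hit
    "FamilyB": (8, 20),   # 40%
    "FamilyC": (3, 5),    # fewer than 10 genomes: skipped
    "FamilyD": (0, 30),   # no RfaH hits: skipped
}

for family, (percent, color) in classify_families(toy_counts).items():
    print(family, percent, color)
```

Filtering out small families first avoids reporting percentages that rest on only a handful of genomes.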

FIG 4

Distribution of RfaH proteins in Enterobacteriaceae. The maximum-likelihood phylogenetic tree was built based on sequences of the 16S rRNA genes. Chromosomal RfaH (pink) and plasmid RfaH (purple) are indicated. Plasmid-borne RfaH genes (purple dots) are connected to their best BLASTP hits among the chromosomal genes.

FIG S5

Maximum-likelihood phylogenetic tree of Enterobacteriaceae RfaH. The phylogenetic tree was inferred from RfaH sequences by FastTree (85) with the JTT model. The NCBI accession number of RfaH and organisms’ names are shown. The branch of plasmid RfaH is in purple. The topology of the Enterobacteriaceae RfaH tree is similar to that of the 16S rRNA tree of Enterobacteriaceae (Fig. 4).

Unlike in Enterobacteriaceae, in which RfaH thrives, ∼60% of Pseudomonadaceae lack RfaH (Fig. 2C). To reveal the origins of this different distribution, we expanded our analysis to include 617 representatives of Pseudomonadaceae. Most species containing RfaH are found around the root, suggesting that RfaH was present in the common ancestor and was subsequently lost in some lineages (Fig. S6A); observations that strains within the same species occasionally lose rfaH genes suggest that this process is ongoing (Data Set S1K). Conversely, we also observed rfaH duplications on the chromosome, which occurred mainly in three clades (Fig. S6B). The species of these three clades were isolated from very different environments, including sputum of a cystic fibrosis patient, cocoon mucus of an earthworm, hyperthermic compost, permafrost, plant roots, and marine sediment. These findings indicate that RfaH is actively evolving in Pseudomonadaceae through gene loss and duplication, perhaps to enable adaptation to unique ecological niches.

FIG S6

Distribution of RfaH in Pseudomonadaceae. (A) The midpoint-rooted maximum-likelihood phylogenetic tree was built based on 120 ubiquitous marker genes for microbial classification (53). Genomes with RfaH are highlighted in purple and those with RfaH duplications in cyan. (B) Three clades with extensive gene duplications are shown. P., Pseudomonas. The number of rfaH genes is indicated.

FIG S7

Phylogenetic comparison of NusG, RfaH, and UpxY. The numbers above colored arrows indicate the branch length sum of the longest tree path of NusG homologs. Gene duplications on the same genome are indicated with same-colored circles.

While RfaH is widespread in Proteobacteria, we identified only one genome that encodes RfaH among 1,908 available genomes of Bacteroidota (Bacteroidetes) (Fig. 2B; Data Set S1H and -I). Instead, divergent NusGSP is present in approximately half of Bacteroidota. In Bacteroides fragilis NCTC 9343, eight UpxY proteins are encoded within different capsular polysaccharide operons (32). Each UpxY protein activates the expression of its resident operon, while the product of an adjacent upxZ gene interferes with the expression of heterologous upx operons. However, two uncharacterized UpxYs in the NCTC 9343 genome are not accompanied by UpxZ (Data Set S1L) and may act similarly to RfaH. Both the upxY and rfaH genes are present in bacteria isolated from different niches, including marine and terrestrial environments and animal hosts (Data Set S1L), and may be under pressure to rapidly adapt to changing environments. Phylogenetic comparison of NusG, RfaH, and UpxY reveals that, as judged by the average branch length, UpxY and RfaH evolve faster than NusG (Fig. S7), and both genes show extensive duplication. Thus, we conclude that NusG paralogs rapidly evolve by gene duplication and subfunctionalization.

Steps in the molecular evolution of RfaH.

In E. coli, NusG and RfaH bind to the same site on RNAP yet have opposite effects on gene expression. NusG is abundant, essential, and acts genome-wide to aid Rho silencing of xenogenes, whereas RfaH inhibits Rho in just a few horizontally acquired operons that are dispensable for survival but necessary for virulence. Transformation of a NusG duplicate into a fully specialized RfaH protein requires several key events: (i) loss of binding to Rho, which is an essential function of NusG (43); (ii) an increased affinity for RNAP (5), which enables RfaH to compete with 100-fold more abundant NusG (28); and (iii) target-specific recruitment, which limits RfaH action to a subset of operons, thereby preventing dysregulation of NusG-controlled genes (20). Recent structural and functional analyses of E. coli NusG and RfaH identified individual residues responsible for their differences, allowing us to investigate the molecular evolution of this family (Fig. 5; Data Set S1M).

FIG 5

Molecular evolution of NusG and RfaH. (A) Spt5 (black), NusG (gray), unknown NusGSP (light pink), and RfaH (hot pink) are marked on the maximum-likelihood phylogenetic tree. Archaeal Spt5 is used as an outgroup. NusGs with the same pattern of functional sites are collapsed. (Top) Selected functional residues in RfaH and NusG are color coded and numbered as in E. coli RfaH/NusG (NCBI accession no. NP_418284.1/NP_418409.1). Lighter colors indicate conservative substitutions. CL1 to -8 denote RfaH clusters. (B) A stepwise conversion of NusG into RfaH.

Our analysis identified a group of uncharacterized proteins homologous to RfaH. Phylogenetic reconstruction using Spt5 as an outgroup showed that this group of proteins and RfaH sequences are in two separate branches and that they both have NusG from Desulfurobacterium sp. strain TC5-1 as their common ancestor (Fig. 5A). Desulfurobacterium sp. TC5-1 belongs to Aquificae, which are thought to be among the most deeply diverging bacterial lineages, along with Thermotogae and Thermodesulfobacteria (55).

We previously proposed that the NusG paralog first lost its ability to bind Rho (Fig. 5B), most likely by altering the Rho contact residues in the NusG-KOW motif (20). Our current data support this scenario. We recently found that a conserved 5-residue loop of NusG, including residues I164, F165, and G166, makes key contacts with Rho (23); furthermore, when this loop replaces the corresponding loop in RfaH, which contains residues L145-I146-N147, it enables RfaH binding to Rho (23). Our analysis reveals that the Rho-binding residues were lost by RfaH early on (Fig. 5A), which might be expected given that the opposite effects on Rho termination underlie cellular functions of NusG and RfaH.

Next, we envisioned that increased hydrophobicity of the NGN led to a protein with a high affinity for RNAP, which was able to compete with NusG. The RNAP β′ CH domain interacts with a hydrophobic patch on the NGNs of NusG and RfaH (5). RfaH NGN is more hydrophobic, and RfaH outcompetes NusG in vitro and in vivo (5, 20), even though NusG outnumbers RfaH 100:1 (28). RfaH residue F56 is required for binding to RNAP, and its replacement with Leu, the corresponding residue in NusG, confers binding defects (56). F56 is present in RfaH, unknown proteins, and NusG of Desulfurobacterium sp. TC5-1 (Fig. 5A), suggesting that stable interactions with RNAP are important for keeping RfaH in the game of evolution by preventing its displacement by a more abundant NusG. In contrast, F81 in RfaH or the corresponding G95 in NusG makes contact with RNAP in both proteins and is not highly conserved.

Finally, NusGSP had to become soluble and to evolve a sequence-specific recruitment mechanism to control several targets in trans. In autoinhibited RfaH, the KOW domain, which is folded as an α-helical hairpin, unlike KOW domains of all other NusGs, shields a hydrophobic surface on the NGN that serves as an RNAP-binding site (29). An opposite side of the NGN contains a patch of residues that recognize the ops DNA (Fig. 1), which folds into a small hairpin on the RNAP surface (26). In addition to making direct contacts with the NGN, ops halts RNAP to facilitate RfaH recruitment (26); ops-like sequences induce pausing of phylogenetically diverse RNAPs (57). Nearly all ops bases are required for RfaH function, and several RfaH residues directly contact the ops DNA hairpin (5, 26). We reason that such a complex mechanism must have evolved incrementally, perhaps with NusGSP initially binding to a paused RNAP and then learning to recognize DNA. Mapping of the RfaH DNA-binding determinants on the phylogenetic tree (Fig. 5A) is consistent with a sequential acquisition of residues that bind DNA: K10 (F in NusG) acquisition preceded the emergence of RfaH, whereas R73 arose later.

We believe that autoinhibition controls RfaH recruitment indirectly, by making RfaH binding to RNAP dependent on the presence of the ops signal. RfaH residues E48, I93, and F130 are required for autoinhibition; their replacement allows sequence-independent, NusG-like recruitment of RfaH (27, 58). RfaH contacts with the ops-paused complex relieve autoinhibition, exposing the RNAP-binding site on the NGN (30). The acquisition of residues that mediate interdomain interactions coincides with that of the DNA-binding residues (Fig. 5A), consistent with autoinhibition and ops contacts acting in concert.

In summary, our analysis supports a sequential transformation of NusG into RfaH in which the exclusion of Rho binding and increased binding to RNAP precede sequence-specific recruitment to the elongation complex (Fig. 5B).

RfaH targets and gene neighbors.

While E. coli RfaH is monocistronic and acts in trans, other NusGSP proteins, such as Myxococcus xanthus TaA (33) and UpxY (32), are encoded within their target operons. We wondered whether RfaH-like proteins, which display significant variations in their functional regions (Fig. 5A), could fall into different groups, perhaps associated with particular regulatory contexts. Markov clustering of all RfaH sequences identified in this study revealed eight distinct clusters, CL1 to CL8 (Fig. 6A; Fig. S8; Data Set S1N). Using TREND (40), we found that, unlike with the invariant gene neighborhood of nusG (see above), the gene neighbors of rfaH were highly diverse; they encoded polysaccharide biosynthesis enzymes, nucleoid-associated protein H-NS, toxin-antitoxin systems, secondary metabolites, Tat protein secretion system, etc.

FIG 6

RfaH clusters, genomic contexts, and targets. (A) The eight clusters. Footnote a, RfaHs found in GTDB_reps were clustered into eight clusters (Data Set S1H and N). The total number of sequences in each cluster is shown. Footnote b, a subset of different CLs containing NCBI reference sequences only. The number of sequences is shown. (B) Heatmap showing distribution of COG functional categories (represented by A to W) of RfaH neighbor genes; there are five genes on each side. The number of genes in every COG category was normalized by the number of RfaH reference sequences. (C) Operons activated by enterobacterial RfaHs and other NusGSP proteins; positions of ops sites (green) and NusGSP genes (orange) are shown. COG categories can be accessed at https://www.ncbi.nlm.nih.gov/COG/.

FIG S8

Sequence logos of eight RfaH clusters (CL). The logos were created based on multiple-sequence alignments of RfaH sequences from each CL using WebLogo (G. E. Crooks, G. Hon, J. M. Chandonia, and S. E. Brenner, Genome Res 14:1188–1190, 2004; https://weblogo.berkeley.edu/). The logos were trimmed to begin with the conserved residue tryptophan in all CLs. Functional sites discussed in Fig. 5A are marked with colored squares. Green, ops binding sites; orange, sites responsible for autoinhibition; blue, RNAP β′ CH-binding sites; purple, Rho-binding sites.

To assess whether each cluster could be associated with a subset of genes, we assigned their gene neighbors to clusters of orthologous groups (COG) categories (Fig. 6B) (38). Similarly to E. coli RfaH, which is included in CL1, RfaHs of CL1 were not strongly associated with a particular COG category, although H (coenzyme metabolism) and U (secretion) genes were frequent. These diffuse-pattern proteins act in trans on distant targets. In contrast, genes involved in cell envelope biogenesis (M), which are known targets of NusGSP regulators, were overrepresented among neighbors of CL2 to CL8; glycosyltransferases, nucleoside-diphosphate-sugar epimerases, and exopolysaccharide biosynthesis functions were most common (Fig. 6B; Fig. S9A). Notable differences exist among these clusters (Fig. 6B; Fig. S9A). CL1 is frequently adjacent to Sec-independent protein secretion pathway functions (U). CL4 is associated with a helix-turn-helix (HTH) transcriptional regulator (K). CL6 neighbors encode undecaprenyl pyrophosphate synthase, involved in terpenoid biosynthesis (I), and nucleoid-associated protein H-NS (R), whereas CL7 comprises a group of diverse RfaHs from Shewanella that are encoded within putative exopolysaccharide operons (Fig. S9B), an arrangement resembling B. fragilis operons controlled by diverse UpxY proteins (32). Many CL7 genes are adjacent to signal transduction (CheY) and envelope biogenesis (ABC transporter) genes, but their relative orientations differ among CL7 members.

FIG S9

Enrichment of neighboring genes of RfaH clusters (CL). (A) A total of 1,054 COGs were assigned to neighbor genes of rfaH (five genes on both sides of rfaH genes). Every circle represents a COG. The percentage of COG was calculated as (raw count of COG)/(total neighbor genes of one CL). Each COG was then assigned a unique integer in the range of 1 to 1,054, with the same COG receiving the same integer in all CLs. These integers were used to build the x axis. The identities of highly abundant COGs (38) are indicated. (B) Example of the conserved CL7 rfaH location. From left to right, white arrows indicate chemotaxis signaling genes, light brown arrows indicate a relationship to type III secretion systems, black arrows indicate a transport system, brown arrows indicate a major facilitator superfamily (MFS) transporter, hot pink arrows indicate rfaH, green bars indicate the ops site, and gray arrows indicate an exopolysaccharide operon.

In addition to activating several chromosomal targets, RfaH activates an F plasmid tra operon, which encodes a type IV secretion system (Fig. 6C) and is required for conjugation (59). Other plasmids encode resident NusGSPs in their tra operons. As we await experimental assessment of their functions, this genetic syntax suggests that plasmid NusGSP acts as an antiterminator of tra operons, which are among the longest bacterial operons and are thus expected to be prone to premature termination. Carrying a resident antiterminator confers a significant advantage to plasmids that, unlike F, are transferred between different species. Conjugative plasmids are major contributors toward the clinical dissemination of antibiotic resistance, and some of these plasmids encode NusGSPs (60, 61).

RfaH and other NusGSPs are required for the expression of very diverse macromolecules, including adhesins, antibiotics, capsular polysaccharides, toxins, etc. The most obvious common feature of NusGSP targets is their length (Fig. 6C). A shared ability of all NusG-like proteins to make RNA synthesis more efficient suggests a mechanism in which NusGSP-bound RNAP ignores intragenic termination signals; consistent with this, NusGSP is annotated as an antiterminator. However, while RfaH increases gene expression several hundredfold, its antitermination activity makes only a minor contribution to its effects in vivo (62). Instead, RfaH excludes NusG from RNAP and promotes ribosome recruitment, thereby inhibiting premature RNA release by Rho (27). Furthermore, by coupling RNAP to the ribosome (27), RfaH may enable the complete synthesis of long polypeptides, such as a giant 5,559-amino-acid-long nonfimbrial adhesin encoded by Salmonella pathogenicity island IV (63) (Fig. 6C). Similarly, LoaP-like regulators (31) may promote translation of 4,200- and 5,200-amino-acid-long polyketide synthases in the Bacillus amyloliquefaciens dfn operon.

The marked diversity of their gene neighborhoods supports a view that RfaH-like regulators act on any operon, once recruited; indeed, E. coli and Klebsiella pneumoniae RfaH activate expression of the Photorhabdus luminescens lux operon, as long as the ops element is present in the leader region (64). However, in this work, we show that different types of RfaH-like proteins are associated with different classes of neighbors (Fig. 6B), a correlation that may reflect their evolutionary history or distinct mechanisms of recruitment. E. coli RfaH is the only representative for which a detailed mode of recruitment is known, and future studies are required to address this question.

Concluding remarks.

The only ubiquitous family of transcription factors comprises two very different classes of regulators. One class includes essential general elongation factors that coevolved with RNAP since the LUCA (1). These NusG-like core regulators are recruited to RNAP once it escapes from a promoter, replacing transcription initiation factors that bind to the same site (18, 19), and remain associated with RNAP transcribing all genes (20, 65). Here, we show that the bacterial NusG protein is present in genomes of all cells that utilize bacterial RNAPs, except a few endosymbionts and some algae. What makes NusG indispensable? Although their sequences have diverged considerably, bacterial, archaeal, and eukaryal factors make remarkably similar interactions with RNAP that are thought to increase the enzyme’s processivity, acting akin to replicative clamps (66); the NGNs are necessary and sufficient for RNAP modifications (14, 29, 67). This antitermination function of NusG, reflected in genome annotations, has long been thought to be its signature activity. However, NusG alone has only modest effects on RNA synthesis (9). Instead, antitermination is achieved through the assembly of large nucleoprotein complexes, e.g., on bacteriophage λ RNA, in which the NusG-KOW domain makes contact with diverse protein partners (10). In fact, it is through alternative contacts with Rho (23) or ribosome (21) that the NusG-KOW domain determines the fate of the nascent RNA. Multiple Spt5 KOW domains play analogous functions in eukaryotes, coupling RNA synthesis to splicing, polyadenylation, and other cotranscriptional processes (3). Transcription of chloroplast genomes by PEP depends on its binding to several accessory proteins (68), including NusG (45). We speculate that the NusG-KOW domain acts as a hub for PEP complex assembly. Despite its ubiquity, NusG is a dissociable factor rather than an RNAP subunit, a property exploited by the second class of NusG proteins exemplified by RfaH. 
These regulators outcompete NusG for binding to RNAP and exert much stronger antitermination effects (5) but must be selectively recruited to only a few targets to avoid misregulation of housekeeping genes (20). In the case of RfaH, targeted recruitment is achieved through a complex DNA-dependent mechanism (26). Here, we show that RfaH-like proteins are rapidly evolving through a combination of HGT and vertical inheritance. We identified eight distinct groups of RfaH that we propose control different sets of genes, sometimes coevolving with their targets. While the RfaH-NGN mediates recruitment to RNAP and DNA, we hypothesize that the RfaH-KOW domain plays key regulatory roles. The KOW domain controls RfaH recruitment indirectly, through autoinhibition (58), is thought to load the ribosome onto mRNA lacking ribosome-binding sites (27), and may interact with some membrane components during secretion of proteins whose expression it activates (69). While RfaH is not strictly essential for growth in the lab, it is critical for expression of the cell wall, capsules, adhesins, siderophores, and conjugative pili, whereas other NusGSPs are essential for the synthesis of capsules and antibiotics (7), molecules that determine bacterial success in natural environments. Eukaryotes also encode multiple copies of Spt5 (Fig. 2A), and specialized paralogs have been implicated in the regulation of RNA silencing and meiosis (34, 70). Thus, all life depends on the NusG-like regulators to balance the expression of housekeeping genes with niche-specific demands. The mechanisms by which this balance is maintained remain to be elucidated.

MATERIALS AND METHODS

Taxonomy information used in this study was derived from the Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org/) (39). Archaeplastida, Chromista, and Plantae are artificial groups (71–73) and are used solely for brevity in this paper.

Construction of a new RfaH model.

RfaH (NCBI accession no. NP_418284.1) from Escherichia coli strain K-12 substrain MG1655 was used as a query in BLAST searches against genomes of selected representatives to find potential RfaH homologs. One species from each family of Proteobacteria was selected as a representative. All potential RfaH sequences were verified using a reciprocal best BLAST hit approach (74) (see Fig. S2 in the supplemental material for an example). The final set of 103 RfaH sequences was used to construct an initial multiple-sequence alignment (MSA). Based on the MSA, an initial HMM profile was generated and used to query the UniProt Reference Proteomes database (v. 2019-09). The hits were filtered based on known conserved positions in RfaH and structural information to collect an extended set of RfaH protein sequences. The redundancy of the set was reduced to the 80% identity level by CD-HIT, and a new MSA was generated based on the reduced sequence set. This set was used to generate a final HMM profile. The final profile was used to query the UniProt reference proteome database and to set the trusted and noise cutoffs of the profile.
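The reciprocal best BLAST hit check mentioned above can be illustrated with a minimal Python sketch. It is not the authors' script: the tuple layout `(query, subject, bit_score)`, the function names, and the protein identifiers in the usage note are assumptions made for illustration, and parsing of real BLAST output is left out.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed layout of one parsed BLAST hit: (query_id, subject_id, bit_score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map every query to its highest-scoring subject (first one wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit],
                         reverse: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return (query, subject) pairs that are each other's best hit.

    forward: hits of the reference protein(s) against a target genome.
    reverse: hits of the target candidates back against the reference genome.
    """
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {(q, s) for q, s in fwd.items() if rev.get(s) == q}
```

With hypothetical identifiers, a candidate `protA` would be kept only if the RfaH query finds `protA` as its top hit and `protA`, searched back against the reference genome, returns RfaH rather than NusG as its top hit.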

Database of species representatives (GTDB_reps).

The list of species representatives of bacteria and archaea (release 89.0) was downloaded from the GTDB (39). The genome files (file type: protein FASTA) were retrieved from NCBI using Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez). A total of 18,436 bacterial genome files and 847 archaeal genome files were downloaded and used as a database of species representatives in this study, which was named GTDB_reps.

Distribution of NusG and RfaH.

NusG TIGRfam and the newly built RfaH HMM were used to search against GTDB_reps by HMMER (75). Taxonomy assignment of the collected protein sequences was done using a custom Python script. The percentage of genome hits was calculated using a custom Python script. The results were visualized on phylogenetic trees by FigTree (76). The maximum-likelihood genome trees were downloaded from AnnoTree (77; http://annotree.uwaterloo.ca/).
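The custom scripts are not shown in the source, so the following is only a sketch of how a percentage of genome hits per taxon could be computed. The input shapes (a genome-to-taxon mapping and a collection of genome identifiers with at least one hit) and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable


def percent_genomes_with_hit(genome_to_taxon: Dict[str, str],
                             genomes_with_hit: Iterable[str]) -> Dict[str, float]:
    """Percentage of genomes in each taxon that contain at least one hit."""
    hit_set = set(genomes_with_hit)
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for genome, taxon in genome_to_taxon.items():
        totals[taxon] += 1
        if genome in hit_set:
            positives[taxon] += 1
    # Every taxon in totals has at least one genome, so no division by zero.
    return {taxon: 100.0 * positives[taxon] / totals[taxon] for taxon in totals}
```

Counting genomes rather than hits keeps duplicated genes on one chromosome from inflating the percentage for a taxon.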

Identification of NusG in Eukaryota.

We used the NusG protein sequence (NCBI accession no. WP_012415655.1) from Elusimicrobium minutum to search eukaryotic protein databases. We used BLASTP and PSI-BLAST against the nonredundant database at the NCBI and a BLASTP search against the oneKP database (78), with default parameters (May 2020). Domain identification was carried out using the TREND (40) and HHpred (79) servers. Multiple-sequence alignments were constructed using the L-INS-I algorithm of MAFFT (80) and edited in Jalview (81). A maximum-likelihood phylogenetic tree was constructed using the MEGA X package (82) and edited in the Interactive Tree of Life (iTOL) v4 tool (83).

To study the topology of NusG and RfaH phylogenetic trees, representatives were selected from GTDB_reps (Data Set S1J). One representative genome containing both NusG and RfaH was selected from each family. A total of 82 family representatives of Proteobacteria were selected. A maximum-likelihood bacterial genome tree of family representatives was inferred from a concatenated alignment of 120 ubiquitous single-copy proteins, also known as the bac120 data set (53) using RAxML (84). Maximum-likelihood phylogenetic trees of NusG and RfaH were constructed using FastTree (85) and RAxML (84). The trees constructed by the two methods showed similar topologies.

To show examples of evolution events, two families, Enterobacteriaceae and Pseudomonadaceae, were investigated. The maximum-likelihood phylogenetic tree of 16S rRNA sequences of Enterobacteriaceae was from a previous study (54), whereas a maximum-likelihood genome tree of Pseudomonadaceae was inferred from the bac120 data set. The presence of RfaH was determined using the new RfaH model. The maximum-likelihood RfaH tree of Enterobacteriaceae was inferred using FastTree (85).

Phylogenetic tree for molecular evolution study.

To study the molecular evolution of RfaH, a data set was compiled with three parts (Data Set S1M). The first part was representative genomes containing both RfaH and NusG. To select these representatives, a maximum-likelihood phylogenetic tree was inferred from 1,922 RfaH sequences (Data Set S1H) by FastTree (85). Then representatives were selected from this phylogenetic tree according to tree depth. The second part was representative genomes containing proteins which have bit scores between trusted and noise cutoffs of the new RfaH model (referred to as unknown NusGSPs). The third part was representative archaeal genomes containing Spt5, which served as an outgroup. The structural alignment was performed with MAFFT-DASH (86). The maximum-likelihood phylogenetic tree was inferred using FastTree with the JTT model (85) and RAxML with the LG4X model (84). The two programs produced nearly identical phylogenetic trees.
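The second part of the data set relies on where a bit score falls relative to the trusted and noise cutoffs of the new RfaH model. A small sketch of that binning is given below; the cutoff values are not stated in this section, so the numbers in the usage note are placeholders, and treating a score equal to a cutoff as passing it is an assumption.

```python
def classify_hit(bit_score: float, trusted_cutoff: float, noise_cutoff: float) -> str:
    """Bin an HMM hit by its bit score relative to the two profile cutoffs."""
    if trusted_cutoff < noise_cutoff:
        raise ValueError("trusted cutoff must not be lower than noise cutoff")
    if bit_score >= trusted_cutoff:
        return "RfaH"            # confidently assigned to the family
    if bit_score >= noise_cutoff:
        return "unknown NusGSP"  # between the noise and trusted cutoffs
    return "below noise"         # discarded
```

For example, with placeholder cutoffs of 100 (trusted) and 60 (noise), a hit scoring 75 would be labeled as an unknown NusGSP.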

Clustering of RfaH protein sequences.

RfaH protein sequences collected by running the new RfaH HMM profile against GTDB_reps were clustered in a stepwise fashion. In step 1, the redundancy of the sequences was reduced at a 95% identity level, giving a final set of 1,481 sequences. In step 2, a reciprocal all-vs-all BLASTP search was run on the final set, and an undirected graph was built from the results. The following cutoffs were used to construct the graph edges: an E value less than or equal to 5e-30 and a coverage of ≥80%. The edge weights were initialized using the average of the two E values of each reciprocal BLASTP hit pair. Markov clustering was then performed on this graph. An inflation value of 5 was used, as it gave the most efficient clustering. The majority of the sequences ended up in eight coherent clusters.
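The graph construction and Markov clustering can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the source does not say which MCL implementation was run, the input dictionary layout is hypothetical, and converting the averaged E value to a similarity weight with -log10 is an assumption, since the text only says that weights were initialized from the averaged E values.

```python
import math
from typing import Dict, FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]


def build_edges(hits: Dict[Pair, Tuple[float, float]],
                max_evalue: float = 5e-30,
                min_coverage: float = 0.80) -> Dict[Pair, float]:
    """Keep reciprocal pairs passing both cutoffs in both directions.

    hits maps (query, subject) -> (E value, coverage fraction).
    """
    edges: Dict[Pair, float] = {}
    for (a, b), (e_ab, cov_ab) in hits.items():
        if a >= b or (b, a) not in hits:
            continue  # visit each unordered pair once; require reciprocity
        e_ba, cov_ba = hits[(b, a)]
        if max(e_ab, e_ba) <= max_evalue and min(cov_ab, cov_ba) >= min_coverage:
            mean_e = (e_ab + e_ba) / 2.0
            # Assumption: express the averaged E value as a similarity weight.
            edges[(a, b)] = -math.log10(max(mean_e, 1e-300))
    return edges


def _normalize(m: List[List[float]]) -> List[List[float]]:
    """Scale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(nodes: List[str], edges: Dict[Pair, float],
                   inflation: float = 5.0, rounds: int = 30) -> Set[FrozenSet[str]]:
    """Alternate expansion and inflation, then read clusters from attractor rows."""
    n = len(nodes)
    idx = {name: i for i, name in enumerate(nodes)}
    m = [[0.0] * n for _ in range(n)]
    for (a, b), w in edges.items():
        m[idx[a]][idx[b]] = m[idx[b]][idx[a]] = w
    for i in range(n):
        m[i][i] = max(m[i]) or 1.0  # self-loop; 1.0 for isolated nodes
    m = _normalize(m)
    for _ in range(rounds):
        # Expansion: square the matrix (two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize columns.
        m = _normalize([[v ** inflation for v in row] for row in m])
    clusters: Set[FrozenSet[str]] = set()
    for i in range(n):
        if m[i][i] > 1e-6:  # row i belongs to an attractor
            clusters.add(frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6))
    return clusters
```

Higher inflation values sharpen the contrast between strong and weak edges and therefore tend to give more, smaller clusters, which is why the choice of 5 matters for the granularity of the result. The dense-matrix loop here is only suitable for toy inputs; a set of 1,481 sequences would call for a sparse implementation.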

Neighbor genes of RfaH.

Gene neighborhoods of 1,122 reference rfaH genes (Fig. 6A) were determined using TREND (40); each neighbor gene was assigned to clusters of orthologous groups (COGs) (38, 87). The distribution of COGs in the eight RfaH clusters was presented as a heatmap using the R package (http://www.R-project.org/).
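The values behind such a heatmap can be sketched as below, following the normalization described for Fig. 6B (number of neighbor genes in each COG category divided by the number of RfaH reference sequences in the cluster). The study used R for plotting; this Python sketch covers only the table, and its input layout and function name are assumptions.

```python
from collections import Counter
from typing import Dict, List


def normalized_cog_counts(neighbor_cogs: Dict[str, List[str]],
                          n_reference: Dict[str, int]) -> Dict[str, Dict[str, float]]:
    """Per-cluster COG category counts divided by the number of reference sequences.

    neighbor_cogs: cluster name -> COG category letter of every neighbor gene.
    n_reference:   cluster name -> number of RfaH reference sequences.
    """
    table: Dict[str, Dict[str, float]] = {}
    for cluster, categories in neighbor_cogs.items():
        counts = Counter(categories)
        table[cluster] = {cat: counts[cat] / n_reference[cluster]
                          for cat in sorted(counts)}
    return table
```

Dividing by the number of reference sequences makes clusters of very different sizes comparable in one heatmap.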

UpxY search.

BLASTP with the E value threshold of <1e-10 was used to query GTDB_reps with eight UpxY protein sequences from B. fragilis NCTC 9343 (32). Representatives were selected to build a maximum-likelihood phylogenetic tree with RfaH and NusG (Data Set S1L). The structural alignment computed by MAFFT-DASH (86) was used to build the phylogenetic tree. The phylogenetic tree was inferred using FastTree with the JTT model (85).

NusG family detection.

An entire list of GTDB genome identifiers (release 89.0) was downloaded. Based on the list, 129,663 genomes were fetched from the NCBI and compiled into a complete database. The database was searched using profile HMMs of eight ubiquitous vertically inherited proteins: NusG, SecE, RecA, L1, L5, L6, S2, and S7.

Software.

We used the following software: AnnoTree v1.2.0 (77), CD-HIT v4.7 (88), FastTree v2.1.10 (85), FigTree v1.4.4 (76), HMMER Web server v2.40.0 (36), HMMER package v3.3 (75), Jalview v2.11.0 (81), MAFFT v7.450 (89), NCBI BLAST 2.9.0+ (90), Python 3.8.2 (91), RAxML v8.2.12 (84), and R 3.6.2 (92). Python code used in this study is available upon request.

Models.

Models were the new RfaH HMM (this study; to be included in the MiST database [37]), RfaH TIGRfam (TIGR01955), NusG Pfam (PF02357), NusG TIGRfam (TIGR00922), Spt5-NGN Pfam (PF03439), NusA_N Pfam (PF08529), SecE Pfam (PF00584), RecA Pfam (PF00154), L1 Pfam (PF00687), L5 Pfam (PF00281), L6 Pfam (PF00347), S2 Pfam (PF00318), and S7 Pfam (PF00177).

88 in total

1. The new higher level classification of eukaryotes with emphasis on the taxonomy of protists.

Authors: Sina M Adl; Alastair G B Simpson; Mark A Farmer; Robert A Andersen; O Roger Anderson; John R Barta; Samuel S Bowser; Guy Brugerolle; Robert A Fensome; Suzanne Fredericq; Timothy Y James; Sergei Karpov; Paul Kugrens; John Krug; Christopher E Lane; Louise A Lewis; Jean Lodge; Denis H Lynn; David G Mann; Richard M McCourt; Leonel Mendoza; Ojvind Moestrup; Sharon E Mozley-Standridge; Thomas A Nerad; Carol A Shearer; Alexey V Smirnov; Frederick W Spiegel; Max F J R Taylor
Journal: J Eukaryot Microbiol Date: 2005 Sep-Oct Impact factor: 3.346

2. In silico discovery of small molecules that inhibit RfaH recruitment to RNA polymerase.

Authors: Dmitri Svetlov; Da Shi; Joy Twentyman; Yuri Nedialkov; David A Rosen; Ruben Abagyan; Irina Artsimovitch
Journal: Mol Microbiol Date: 2018-10-02 Impact factor: 3.501

3. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life.

Authors: Donovan H Parks; Christian Rinke; Maria Chuvochina; Pierre-Alain Chaumeil; Ben J Woodcroft; Paul N Evans; Philip Hugenholtz; Gene W Tyson
Journal: Nat Microbiol Date: 2017-09-11 Impact factor: 17.745

4. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms.

Authors: Sudhir Kumar; Glen Stecher; Michael Li; Christina Knyaz; Koichiro Tamura
Journal: Mol Biol Evol Date: 2018-06-01 Impact factor: 16.240

5. Trans locus inhibitors limit concomitant polysaccharide synthesis in the human gut symbiont Bacteroides fragilis.

Authors: Maria Chatzidaki-Livanis; Katja G Weinacht; Laurie E Comstock
Journal: Proc Natl Acad Sci U S A Date: 2010-06-14 Impact factor: 11.205

6. sfrA and sfrB products of Escherichia coli K-12 are transcriptional control factors.

Authors: L Beutin; P A Manning; M Achtman; N Willetts
Journal: J Bacteriol Date: 1981-02 Impact factor: 3.490

7. Mechanism for the Regulated Control of Bacterial Transcription Termination by a Universal Adaptor Protein.

Authors: Michael R Lawson; Wen Ma; Michael J Bellecourt; Irina Artsimovitch; Andreas Martin; Robert Landick; Klaus Schulten; James M Berger
Journal: Mol Cell Date: 2018-08-16 Impact factor: 17.970

8. MAFFT-DASH: integrated protein sequence and structural alignment.

Authors: John Rozewicki; Songling Li; Karlou Mar Amada; Daron M Standley; Kazutaka Katoh
Journal: Nucleic Acids Res Date: 2019-07-02 Impact factor: 16.971

9. The quantitative and condition-dependent Escherichia coli proteome.

Authors: Alexander Schmidt; Karl Kochanowski; Silke Vedelaar; Erik Ahrné; Benjamin Volkmer; Luciano Callipo; Kèvin Knoops; Manuel Bauer; Ruedi Aebersold; Matthias Heinemann
Journal: Nat Biotechnol Date: 2015-12-07 Impact factor: 54.908

10. The Pfam protein families database in 2019.

Authors: Sara El-Gebali; Jaina Mistry; Alex Bateman; Sean R Eddy; Aurélien Luciani; Simon C Potter; Matloob Qureshi; Lorna J Richardson; Gustavo A Salazar; Alfredo Smart; Erik L L Sonnhammer; Layla Hirsh; Lisanna Paladin; Damiano Piovesan; Silvio C E Tosatto; Robert D Finn
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
