Hfq and other Sm proteins are central in RNA metabolism, forming an evolutionarily conserved family that plays key roles in RNA processing in organisms ranging from archaea to bacteria to human. Sm-based cellular pathways vary in scope from eukaryotic mRNA splicing to bacterial quorum sensing, with at least one step in each of these pathways being mediated by an RNA-associated molecular assembly built upon Sm proteins. Though the first structures of Sm assemblies were from archaeal systems, the functions of Sm-like archaeal proteins (SmAPs) remain murky. Our ignorance about SmAP biology, particularly vis-à-vis the eukaryotic and bacterial Sm homologs, can be partly reduced by leveraging the homology between these lineages to make phylogenetic inferences about Sm functions in archaea. Nevertheless, whether SmAPs are more eukaryotic (RNP scaffold) or bacterial (RNA chaperone) in character remains unclear. Thus, the archaeal domain of life is a missing link, and an opportunity, in Sm-based RNA biology.
Introduction: The Sm Family, its Biology and an Archaeal Lineage
A history of the Sm/Lsm-SmAP-Hfq family
Human Sm proteins were discovered over 30 y ago as a group of small antigens involved in the autoimmune disease systemic lupus erythematosus. The ≈80-residue proteins were identified in association with ribonucleoprotein (RNP) complexes from eukaryotic cellular extracts. Other early work uncovered vital roles for Sm proteins in forming the cores of the uracil-rich small nuclear RNPs (U snRNPs) that further assemble into spliceosomes and excise introns in eukaryotic pre-mRNAs (reviewed in ref. 5). Over the ensuing decades, great strides in elucidating the physiological and biochemical properties of Sm proteins, as well as the three-dimensional (3D) structures and assembly behavior of these RNA-associated proteins, led to our current view that eukaryotic Sm proteins function as molecular scaffolds for RNP assembly. As depicted in Figure 1A, eukaryotic Sm assemblies act in a vast array of RNA-related pathways; for recent reviews of this work, see for instance references 7–10. Paralleling this work on the canonical eukaryotic Sm proteins of the spliceosomal snRNPs, early biochemical and bioinformatic analyses, along with biophysical and crystallographic studies, expanded our view of the phylogenetic distribution of the Sm family to include an Sm-like (Lsm) subfamily, and revealed Sm proteins in the archaeal domain of life. (Sm systems resembling those of eukaryotes were not necessarily expected in the archaea, given the lack of introns in their protein-coding genes and their presumably more primitive RNA-processing machineries.) Finally, in a third line of seemingly unrelated discoveries in bacteria, dating to the late 1960s, an Escherichia coli “host factor I” (HF-I) protein was found to be necessary for replication of the bacteriophage Qβ.
Biochemical characterization of this host factor for bacteriophage Qβ replication, dubbed “Hfq,” revealed that the protein (1) forms thermostable hexamers, (2) occurs at high intracellular concentrations, and (3) preferentially binds A/U-rich single-stranded RNA (ssRNA) via multiple sites on the protein. Hfq was also found to interact with DNA, such as in the E. coli nucleoid.
Figure 1. A bottom-up approach to Sm function in RNA metabolism. Placing the Sm protein family in a biochemical context underscores its central role in myriad RNA processing pathways in the eukaryotic (A) and bacterial (B) domains of life, highlighting the gaps in our knowledge for the archaea. The diagram indicates how RNA processing events (top layer) hierarchically build upon Sm proteins (bottom layer). One of the most extensively characterized Sm-based pathways is the excision of introns from pre-mRNA, which can be dissected (A) as intron splicing←spliceosome←U1, U2, U4/U6 and U5 snRNPs←Sm core of snRNPs. While this eukaryotic example demonstrates a functional niche of Sm proteins as scaffolds, Hfq acts instead as a chaperone (B), mediating interactions between regulatory ncRNAs (red) and their targets (blue). This schematic is not comprehensive (for clarity, not all known connections are shown) and new examples of Sm function are being discovered continuously, particularly in the bacterial context of Hfq; the pace of discoveries of new Sm functions will likely increase as new interactions and functional linkages are uncovered by genome- and proteome-wide studies.
These three lines of Sm research (eukarya, archaea and bacteria) were unified by the realization ca. 2002 that Hfq is the bacterial branch of the Sm family. The Hfq↔Sm homology was first suggested by weak sequence similarities between the N-terminal regions of the ≈80–120 residue Hfq and Sm proteins, was further corroborated by phylogenetic, biophysical, fold recognition and homology modeling studies of E. coli Hfq, and was firmly established by the first crystal structure of Hfq, which revealed a hexamer composed of Hfq subunits that adopt the Sm fold. A surge of biochemical, biophysical and genetic/RNomic studies of Hfq over the past decade has revealed much about the roles of this Sm protein in bacterial RNA metabolism, as well as structure/function relationships in the Hfq branch of the Sm family.
Whereas eukaryotic Sm proteins serve more “passive” functions as structural scaffolds, Hfq acts as an RNA chaperone, mediating antisense interactions between small regulatory, noncoding RNAs (ncRNAs) and their targets (Fig. 1B) and directly influencing the structures of some RNAs. Relatively recent reviews are available on Hfq-based RNA biology from microbiological and structural perspectives, including the other reviews in this Special Focus issue.
The in vivo functions of archaeal Sm proteins remain unknown, in contrast to the eukaryotic and bacterial homologs, and despite the fact that the first atomic-resolution structures of intact Sm rings were from archaeal systems. SmAP function can be approached by using the SmAP↔Hfq and SmAP↔Sm/Lsm homologies to make phylogenetic inferences about likely Sm functions in the archaea. Thus, the remainder of this introductory section summarizes eukaryotic (Sm/Lsm) and bacterial (Hfq) biology (Fig. 1), as well as the evidence for authentic Sm proteins in the archaea. The next section reviews SmAP 3D structures at the levels of monomers and oligomers, and as regards modularity of the Sm fold; Sm sequence/structure relationships are also described, and some nomenclature issues are raised from a bioinformatic perspective. Throughout, a major question is whether the cellular roles of SmAPs are more eukaryotic (RNP scaffold) or bacterial (RNA chaperone). Thus, the final third of this review examines the possible biochemical roles of SmAPs, starting with what is already known about (potentially Sm-linked) RNA processing in the archaea; this final section also considers the genomic context of Sm genes and offers an exploratory discussion of what may be expected for archaeal Sm function. As suggested by the absence of an archaeal panel in Figure 1, the main motivation for this review is that SmAPs represent a significant opportunity in Sm-based RNA biology.
A synopsis of eukaryotic and bacterial Sm biology
Eukaryotic Sm proteins serve as RNP scaffolds.
A modular approach to eukaryotic Sm-based RNA biology is shown in Figure 1A. Various forms of RNA processing occupy the top level of this hierarchy, including rRNA processing by small nucleolar RNPs (snoRNPs), RNase P-based splicing and maturation of tRNA, processing of the 3′ ends of histone mRNA by U7 snRNP, mRNA decapping and decay, and chromosome end maintenance by telomerase. Each of these pathways employs Sm or Lsm proteins. Indeed, a central theme of Figure 1 is that a great diversity of RNA processing events (on a cellular scale) can be traced back to the Sm proteins (on a molecular scale). Because Sm proteins were first identified in connection with RNA splicing, the most thorough biochemical and structural picture available for the molecular basis of Sm function concerns their roles in snRNP-mediated intron excision; snRNPs also provide a useful starting point in considering the potential cellular niches of Sm proteins in the archaea.
To simplify our understanding of the architectural role of Sm proteins, each U snRNP can be viewed as an RNP composed of two parts: the respective U snRNAs (U1, U2, etc.) and up to dozens of proteins. The U snRNPs are dissected in the two bottom layers of Figure 1A. The protein components fall into two classes: snRNP-specific proteins, such as U2A′ and U2B″ of the U2 snRNP, and core proteins that are common to each snRNP. The snRNP-specific proteins mediate specific RNA∙∙∙RNA, protein∙∙∙RNA, and protein∙∙∙protein interactions and function in ways unique to each snRNP (e.g., DEAD-/DxxH-box helicases). In contrast, the molecular functions of the shared core snRNP proteins (the Sm/Lsm proteins) are presumably more generic.
The scaffolding functionality of eukaryotic Sm proteins is exemplified by their roles in snRNP biogenesis. Sm proteins nucleate the early stages of snRNP assembly by binding single-stranded regions of snRNA.
The consensus Sm-binding site is a short uracil-rich sequence PuAU≈4–6GPu (Pu = purine) flanked by RNA stem-loops. However, the Lsm ring binds at the single-stranded 3′ end of U6 snRNA, thus demonstrating the variation that is possible in local RNA secondary structures for different Sm- or Lsm-binding sites. Consistent with a shared ancestry, both eukaryotic and bacterial Sm proteins appear to bind U-rich RNAs, such as the snRNA Sm motif, in the central pore toward the same face of the ring (corresponding to the proximal face of Hfq). Also of interest as regards SmAP function and oligomeric plasticity (detailed below), eukaryotic Sm proteins form stable sub-complexes, such as Sm D1•D2 and F•E•G heteromers, that can then associate into a pentameric “subcore”; notably, these assembly intermediates are of functional relevance. The Sm-templated assembly of snRNPs is guided by interactions between specific Sm proteins and the survival of motor neurons (SMN) protein complex. These and other biochemical features of Sm function have been reviewed in great detail, and an atomic-resolution picture has begun to emerge via recent structural work.
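As a rough illustration of the consensus Sm-binding site described above, the following sketch scans an RNA string for the uracil-rich pattern. It assumes the consensus is read as a purine, an A, a run of four to six uridines, a G and a final purine; the function name `find_sm_sites` and the toy sequence are hypothetical, and the flanking stem-loops are not modeled at all.

```python
import re

# Assumed reading of the consensus: Pu-A-U(4 to 6)-G-Pu, with Pu = A or G.
# The lookahead lets overlapping candidate sites be reported as well.
SM_SITE = re.compile(r"(?=([AG]AU{4,6}G[AG]))")

def find_sm_sites(rna):
    """Return (start_index, matched_text) for each candidate Sm site."""
    rna = rna.upper().replace("T", "U")  # tolerate DNA-style input
    return [(m.start(), m.group(1)) for m in SM_SITE.finditer(rna)]

# Hypothetical toy sequence, not taken from any real snRNA.
toy_rna = "GGCAAUUUUUGACCGG"
print(find_sm_sites(toy_rna))
```

A sequence-only scan of this kind can only nominate candidate sites; whether a candidate is single-stranded and flanked by stem-loops would need a separate secondary-structure analysis.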
Structural enlightenment.
Decades of genetic, biochemical and electron microscopic (EM) studies have now culminated in three lines of structural work that substantially advance our understanding of Sm function. First, recent structures of the U1 and U4 snRNPs expose Sm rings in their final assembly state, bound to snRNAs and snRNP-specific proteins. These new structures establish that an snRNA threads through the eukaryotic Sm pore, unlike what is thought to be the case in RNA1•Hfq•RNA2 ternary complexes (wherein RNAs bind to distinct regions of an Hfq ring). These structures also show the Sm surface to be a versatile platform for protein∙∙∙protein and protein∙∙∙RNA interactions. In a second line of work, the structure of a late intermediate in the snRNP assembly pathway, an Sm D1•D2∙F•E•G pentamer bound to part of the Gemin/SMN complex, elucidates the mechanistic basis for SMN-chaperoned snRNA∙∙∙Sm associations. Rather than thread through a preformed ring, snRNA is sequentially bound by Sm subunits via discrete, metastable intermediates. Finally, a third line of work offers structures of early intermediates in snRNP assembly and unveils a fascinating case of molecular mimicry: A β-sheet-rich assembly chaperone (pICln) that resembles the overall shape of roughly two Sm subunits “wedges” into a crescent-shaped Sm pentamer and stabilizes the partially assembled Sm ring. Beyond these purely structural/scaffolding roles, Sm proteins also serve as regulatory points in RNA pathways via post-translational modifications. Sm RNP biogenesis can be modulated by dimethylation of arginines in the C-terminal RG dipeptides of Sm and Lsm proteins (the C-terminal RG in Fig. 2A is not one of these tandem RG methylation substrates); these and other cellular roles of Sm methylation have been reviewed. While it remains to be seen if archaeal Sm complexes match the functional intricacy found in these recent structural studies of eukaryotic Sm RNPs, the elaborate web of Sm-mediated RNA∙∙∙RNA, RNA∙∙∙protein and protein∙∙∙protein interactions underscores the roles of Sm proteins in RNP biogenesis and function. Because Sm/Lsm homologs occur in the archaea and likely existed in the eukaryotic ancestor, we do not dismiss the possibility that SmAPs serve similar scaffolding functions in as yet undiscovered archaeal RNPs.
Figure 2. SmAP monomers: Sequence profiles and a 3D structure of versatile functionality. A probabilistic model of sequence variation across the Sm family is shown (A) as a profile hidden Markov model (pHMM). This visual display of pHMMs using logos is roughly analogous to the more familiar sequence logos used in representing multiple sequence alignments. In this pHMM, the vertical axis corresponds to the information content, measured in bits relative to the profile’s background distribution; positions that contain more information correspond to higher stacks, and amino acid letter heights within a stack are scaled by that residue’s relative contribution to the position. The horizontal axis can be considered as the position “s” along the Sm sequence profile. (Technically it is the sequential chain of HMM states, with the hitting probability of visiting state “s” along the HMM chain colored dark gray and the contribution of a (match or insertion) state “s” to the overall Markov chain shown as the sum of the widths of light- + dark-gray regions.) The ≈70-residue Sm core is split across two rows for clarity, and the chief SSEs are depicted near the top of each row. Note that the loop L4 variation is captured by this pHMM, as are other important sites in Sm sequences. The SSEs of the Sm fold arrange into a five-stranded antiparallel β-sheet that interacts in an antiparallel configuration with strands of flanking subunits in an oligomer (B). The highly bent β-sheet of the Sm fold is shown as a Cα trace in (C). Loops L2, L3 and L5 lie toward the lumen of the ring (L3 is nearest the proximal face); loop L4 lies toward the distal face. Important residues from the profile HMM are marked in the 3D structure (C): the β2 strand Gly is shown as a green sphere and the loop L3 Asp is shown in ball-and-stick (lower-left). The backbone traces in (C) of two Sm homologs of < 35% sequence similarity—E. 
coli Hfq (orange) and a putative cyanophage Sm (blue)—illustrate the persistence of the Sm fold at low sequence similarity. Representing even greater sequence divergence, the Sm fold of S. aureus Hfq (PDB 1KQ1; blue) and an Sm-like fold in the membrane channel MscS (PDB 2VV5; red) are shown superimposed in (D).
Bacterial Sm proteins act as RNA chaperones.
Hfq functions as an RNA chaperone, viz., a single-stranded nucleic acid-binding protein with flexible sequence recognition capacity, such that it can facilitate base-pairing interactions between diverse ncRNAs (regulatory sRNAs) and protein-coding mRNA targets. These antisense sRNA∙∙∙RNA interactions, shown schematically in Figure 1B, often exhibit only partial base-pairing complementarity. By binding the two RNAs independently, Hfq increases their local effective concentration, thereby enhancing their binding affinity. Structural and mechanistic aspects of the “cycling” of RNA on the surface of the Hfq ring are reviewed by Sauer and by Wagner in this issue. Hfq-mediated RNA∙∙∙RNA interactions typically have repressive physiological effects, downregulating either mRNA stability or translational activity. However, recent studies indicate that Hfq can also guide RNA∙∙∙RNA interactions that exert positive regulatory effects. Hfq has also been shown to modulate mRNA stability by promoting polyadenylation, a process that is often perceived as eukaryote-specific but that also occurs in bacteria and may be intricately linked to Hfq function. A rapidly growing body of work has established pleiotropic roles for Hfq in physiological processes ranging from oxidative stress response and metal homeostasis to regulation of pathogenicity. The discovery that Hfq mediates a fundamental regulatory step in quorum sensing further expands the scope of Sm function to include microbial cell∙∙∙cell communication networks and intercellular signaling, which enables the emergence of population-wide behaviors.
Compared with the substantial progress on eukaryotic and bacterial Sm proteins, little is known about Sm-related RNA biology in the archaea. There are more questions than answers and, therefore, portions of this review should be taken as speculative and interrogative rather than conclusive.
Sm-like archaeal proteins: Suggested by sequence, confirmed by structure
Sm sequences, often described as Sm1 and Sm2 signature motifs joined by a variable linker (Fig. 2A and B and nomenclature note below), are conserved in many species across the tree of life. Stimulated by the flood of sequences at the dawn of the genomic era, early database searches revealed that Sm proteins are not exclusive to metazoans or other higher eukaryotes with elaborate mRNA splicing; indeed, several Sm homologs have been found in eukaryotes as divergent from humans as yeast and trypanosomes, and Sm proteins likely existed in the ancestor of eukaryotes. Sm homologs also have been found in several archaeal species. The discovery of SmAPs was not entirely expected, as Sm proteins were thought to act in snRNP biogenesis and splicing, not general-purpose RNA processing; an archaeal RNP complex homologous to the sophisticated eukaryotic splicing apparatus was (and remains) unknown. The discovery of SmAPs raises several questions about the role of these proteins in archaeal RNA metabolism. In short, what are the archaea doing with Sm proteins?
The finding that Hfq is the bacterial Sm completes our modern understanding, showing that Sm proteins occur in each domain of life and making their existence in the archaea less startling. Also fascinating from a phylogenetic perspective are the emerging links between host Sm proteins and exogenously encoded (e.g., viral) RNAs: Herpesvirus saimiri produces viral RNA transcripts that recruit host Sm proteins, and the yeast Brome mosaic virus encodes two distinct RNA elements that directly interact with host Lsm1-7 rings in a manner resembling that of Hfq-RNA interactions. Somewhat similarly, a novel pentameric Sm-like protein of putative cyanophage origin was recently uncovered in an ocean metagenomics sampling expedition.
These results suggest that the phylogenetic diversity of Sm proteins is far broader than previously thought, including virtually every known form of life, and also expand the realm of possible Sm functions well beyond splicing and other familiar forms of RNA processing.
The first putative SmAPs were detected by sequence analysis. Since then, the existence of a distinct, albeit phylogenetically dispersed, Sm family has been substantiated via biophysical, biochemical, ultrastructural and crystallographic studies of SmAP orthologs and paralogs. Such work has also uncovered several sets of Lsm paralogs in organisms already known to have homologs of the canonical Sm proteins. Biochemical and structural studies of Sm homologs verify their similarity to canonical Sm proteins. All known Sm structures, from eukaryotes, bacteria and archaea, are markedly similar to one another in terms of coordinate root-mean-squared deviation (RMSD) of monomer backbones. As shown in Figure 2C, the Sm fold is conserved even for highly divergent pairs lying near the “twilight zone” of similarity scores for the alignment of two random sequences. (A basic premise of structural bioinformatics is that function arises from structure; function is the level at which evolutionary pressure applies and, therefore, biomolecular structure persists more strongly over deep evolutionary timescales than does sequence conservation.) Thus, the existence of Sm homologs in most species across all three domains of life implies an ancient evolutionary origin for the Sm family, predating the archaeal/eukaryotic divergence.
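To make the coordinate RMSD comparison mentioned above concrete, the sketch below computes an RMSD over paired backbone (Cα) positions. It is a minimal illustration under two loudly stated assumptions: the two monomers have already been superimposed, and the residues have already been paired by a structural alignment. Neither step is shown, and the function name `ca_rmsd` and the toy coordinates are hypothetical.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """RMSD over paired, pre-superimposed Calpha coordinates (same units as input)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between equivalent atoms.
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy coordinates in angstroms (not from any PDB entry): the second
# chain is the first one shifted by 1 A along z.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(ca_rmsd(chain_a, chain_b))
```

In practice the superposition itself is usually obtained by a least-squares fit, so the value returned here is only meaningful once that step has been done with a structural biology toolkit.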
SmAP Structure: Monomers, Assemblies, Modularity
For a protein of only ≈70 residues, Sm monomers are exceptionally multifunctional (Fig. 2): one part of the Sm fold mediates interatomic contacts between subunits in a ring (β4∙∙∙β5+1 interface), while other portions of the fold help create not one but possibly three distinct RNA-recognition regions, including (1) a U-rich ssRNA-binding region near the often cationic pore of Sm/Lsm, SmAP and Hfq rings, (2) an A-rich binding surface defined by the L4 face of Hfq and (3) a newly recognized RNA-contacting region around the lateral periphery of Hfq rings. Other structural landmarks also have variation as the theme: extensive variation in loop L4 length, variation in the termini (some Sm domains are fused to other domains at the N- and/or C-termini) and variable oligomeric states. Much of our knowledge of Sm structure and assembly originated from studies of SmAPs. This section reviews the 3D structures of Sm homologs. The assembly behavior of SmAPs and related homologs is also examined, as is the possibility that the main functional/evolutionary niche of the Sm domain is a generic structural module for protein∙∙∙protein and protein∙∙∙RNA interactions, akin to the activity of Hfq as a generic facilitator of RNA∙∙∙RNA interactions.
SmAPs and Sm monomers
The first Sm structures were of the human Sm D1•D2 and D3•B heterodimers. Soon thereafter, the crystal structures of three SmAPs were reported concurrently: a Methanobacterium thermautotrophicum (Mth) SmAP, Pyrobaculum aerophilum (Pae) SmAP1 and an Archaeoglobus fulgidus (Afu) SmAP, providing the first atomic-resolution glimpse of Sm monomers in an intact ring. All three of these SmAP orthologs assemble as homoheptamers composed of subunits that adopt the same Sm structure found in the human D1•D2 and D3•B dimers. In the subsequent decade, dozens of Sm crystal structures have been determined for orthologs and paralogs from eukaryotic, bacterial and archaeal lineages [see refs. 10, 36, 38 and 39 and Sauer (this issue) for reviews]. All structural studies, including solution-state NMR spectroscopy of Sm and Sm-like domains, show that Sm monomers adopt a unique fold: a strongly bent, five-stranded antiparallel β-sheet often capped by an N-terminal α-helix. This N-terminal helix has been used as a structural marker to define the proximal face of Hfq rings (e.g., the Hfq hexamer in Fig. 3A), but the helix is an inessential feature of the Sm fold and is likely absent from many Sm sequences. Also, at least one Sm structure (the pentamer in Fig. 3A) features no N-terminal helix but rather a C-terminal helix that occurs on the distal face; thus, the presence of a particular helix (or any SSE beyond the Sm core sheet) is of limited utility as a landmark for distinguishing the faces of Hfq and other Sm rings.
Figure 3. Oligomeric plasticity of SmAPs and other Sm assemblies. Despite substantial similarity at the level of monomers, SmAPs and their Sm homologs exhibit profound variability at the levels of single-ring (A), multi-ring (B, left), and higher-order (B, right) assemblies. Each subunit in these ribbon cartoons is colored individually, the n-fold rotational symmetry axis is indicated, and each ring is viewed onto the L4 (distal) face; the N′- and C′-termini of one subunit are indicated for n = 5 and 6 but are not marked for each subunit so as to minimize clutter. A speculative model for the potential roles of multi-ring and higher-order assemblies is shown in (B).
As gauged by sequence analysis, the Sm core is ≈60–70 residues in length (Fig. 2A). The Sm β-sheet is highly curved, and the degree of curvature can be approximated by comparing the distance between the two termini of a segment of β-strand in a given conformation (the chord length, lc) with the corresponding distance for that segment in a fully extended conformation (the arc length, la); the lc/la ratio, which is unity for a straight line, can be taken as a crude estimate of curvature. For example, the distance between the Pae SmAP1 β2-strand termini is 24 Å, vs. a value of 40 Å for this pair of residues in an unbent, fully extended conformation (lc/la = 0.6). Such curvature is a hallmark of the trough-shaped Sm fold, making Sm proteins nearly elliptical or U-shaped in cross-section (see the perspective in Fig. 2D). The polypeptide backbone can adopt this bent conformation because of specific glycine residues that serve as pivot points, particularly in strand β2 (Fig. 2A and the green sphere in Fig. 2C) but also in strands β3, β4 and the loops.
The phylogenetically conserved glycines are among the most characteristic features of the Sm sequence family, in the information theoretic “profile” sense shown in Figure 2A; less strictly conserved glycines also serve structural roles, as can be found in SmAP-specific multiple sequence alignments. In addition to the 3D conformational pliability that enables the β-sheet to bend upon itself, a hallmark of the Sm fold is its resilience to sequence variation, such that two randomly selected Sm structures typically feature backbone (Cα) RMSDs of only ≈1–2 Å (Fig. 2C and D).
Variation in loop L4 is another characteristic feature of Sm monomer structure. This loop links strands β3 and β4 (Fig. 2A and B) and varies more than other Sm loops in length and amino acid sequence, from just a few residues in bacterial homologs (Hfq) to potentially dozens of residues in eukaryotic homologs (e.g., human SmB). Within Sm rings, the geometric orientation of individual subunits positions L4 “outward,” making these loops the most prominent structural feature on the L4 (distal) face of the rings. This is an important factor in considering structure/function relationships because the L4 face of Hfq is the primary region of interaction with A-rich RNAs; amino acid variation in L4 modulates the electrostatic potential, and therefore the RNA-binding properties, across that face of the Sm ring (see e.g., ref. 88 for a discussion of this effect). The L4 loop can also lead one astray in purely sequence-based bioinformatics: Multiple sequence alignments of SmAPs exceeding ≈100 residues, such as Pae SmAP3, erroneously assigned the “extra” (non-Sm) residues to two regions: some residues were flagged as L4 loop insertions while the remainder were predicted to form a C-terminal extension. However, the Pae SmAP3 crystal structure (see below) shows that the extra ≈60 residues actually comprise a unique, autonomous C-terminal domain.
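The chord-to-arc estimate of β-strand curvature introduced earlier in this section reduces to a simple ratio, sketched below with the 24 Å and 40 Å values quoted for the Pae SmAP1 β2 strand. The function names are hypothetical, and the optional coordinate helper assumes Cartesian positions in ångströms for the two strand termini.

```python
import math

def chord_length(start_xyz, end_xyz):
    """Straight-line distance between the two strand termini."""
    return math.dist(start_xyz, end_xyz)

def curvature_ratio(chord, arc):
    """lc/la: 1.0 for a straight segment, smaller for a more strongly bent one."""
    if arc <= 0 or chord < 0 or chord > arc:
        raise ValueError("expected 0 <= chord <= arc and arc > 0")
    return chord / arc

# Values quoted in the text for the Pae SmAP1 beta2 strand (angstroms).
ratio = curvature_ratio(24.0, 40.0)
print(round(ratio, 2))
```

The ratio is deliberately crude: it says how far the termini have been drawn together, not where along the strand the bending occurs.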
Sm sequence/structure relationships and bioinformatic nomenclature
Sm subunits have been described as consisting of “Sm1+Sm2” motifs, a view of Sm structure that dates to early sequence analyses. A probabilistic model of sequence variation across the entire Sm family is shown in Figure 2A as a Pfam-generated profile hidden Markov model (pHMM). Profile HMMs can effectively capture such features of sequence variation as amino acid insertions, thus making them a potentially effective approach for quantitatively modeling Sm loop variation. The profile HMM shown in Figure 2A captures known features of Sm sequence/structure relationships. For instance, one site in the Sm sequence profile (site 23) encodes more information than most other sites, and a Gly dominates the uneven distribution of letters at this site. Overlaying the structural elements of the Sm fold (Fig. 2A) shows that this site corresponds to the strictly conserved Gly near the middle of the highly bent strand β2 (Fig. 2C). Also, the pHMM recapitulates the variability known to occur between strands β3 and β4, i.e., the variable-length loop L4 (Fig. 2A). However, to our knowledge there is no evidence for distinct Sm1 and Sm2 motifs, in a structural or evolutionary sense (for instance, the Sm2 motif would resemble a partially opened β-hairpin); thus, we avoid this terminology. We also adopt this convention as a practical matter, to avoid confusion, as paralogous SmAP genes have occasionally been referred to as Sm1 and Sm2 (e.g., Afu, Sulfolobus solfataricus, Pyrococcus abyssi). Other issues of terminology also arise.
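The statement that one profile site “encodes more information” than others can be illustrated with a simple per-column calculation. The Python sketch below is a simplified stand-in, not the procedure used by Pfam: it assumes a uniform background over 20 amino acids, applies no small-sample correction, and uses made-up alignment columns (the function name and the example columns are hypothetical).

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, assuming a uniform background.

    Computed as log2(alphabet_size) minus the Shannon entropy of the observed
    residue frequencies; gap characters ('-') are ignored.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        raise ValueError("column contains no residues")
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy

# Hypothetical columns: an invariant glycine vs. a variable loop position.
conserved = "GGGGGGGGGG"
variable = "KRSTNDEQAG"

print(f"{column_information(conserved):.2f} bits")
print(f"{column_information(variable):.2f} bits")
```

An invariant column reaches the maximum of log2(20) bits under these assumptions, whereas a column with ten equally frequent residues carries only one bit; a dominant Gly at a site such as the one in strand β2 would fall near the upper end of this range.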
Nomenclature issues, from a structural bioinformatics perspective.
Considered as a complete set of all homologs, the Sm family exhibits immense complexity: in terms of cellular pathways and functional roles (splicing, telomere maintenance, quorum sensing, etc.); in terms of sequence motifs and other sequence-level properties (e.g., domain fusions); in terms of oligomerization (homomeric and heteromeric assemblies, multiplicity of oligomeric states); in terms of structural and physicochemical properties (e.g., multiple RNA-binding regions of Sm rings) and so on. Thus, it is perhaps unsurprising that some ambiguities have arisen in the Sm literature with respect to nomenclature.
For clarity, the following terminological conventions are used in this review. (1) In terms of protein classification, Sm proteins comprise a superfamily; nonetheless, in this review we refer to the Sm family for simplicity. (2) In terms of sequence and function, Sm proteins go by many names: archaeal homologs have been termed SmAPs, the bacterial branch of this family is known for historical reasons as Hfq, and eukaryotic homologs are referred to as Sm (the archetypal Sm core of spliceosomal snRNPs). In addition, the term Lsm (Like-Sm) was introduced early on to refer to eukaryotic Sm-like proteins, such as the paralogous Lsm1-7 (cytosolic, mRNA decay) and Lsm2-8 rings (nuclear, pre-mRNA maturation). Though generally used in the context of eukaryotes, “Lsm” also has been used to label non-eukaryotic homologs, such as those of archaeal origin. Here, we attempt to use the labels Sm, Hfq, etc. only as precisely as is justified by our current knowledge and intended meaning. For example, an occurrence of Sm, rather than SmAP, means a statement applies to all members of the Sm family (to our knowledge), whereas usage of Hfq would indicate that we intend the statement to be limited in scope to the bacterial lineages of Sm. (3) For reasons described above, we avoid describing Sm proteins as consisting of “Sm1+Sm2” motifs. 
(4) We adopt the labeling of 2° structural elements (SSEs) shown in Figure 2A; note that many structurally and biochemically important regions (e.g., RNA-contacting amino acids) lie near loops L2, L3 and L4. (5) The terms proximal and distal are often used to refer to RNA-contacting surfaces of some Sm rings, such as in Hfq•RNA co-crystal structures. For reasons elaborated below, we instead refer to these surfaces as the L4 (distal) and L3 (proximal) faces.
The Sm domain as a module: Lessons from Pae SmAP3 and MscS
The post-genomic era affords new insights about Sm protein structure, archaeal and otherwise. With myriad open reading frames (ORFs) and bona fide proteins now known, and with increasingly sensitive bioinformatic methods, the Sm fold can be detected as a structural module in many multi-domain proteins. Notably, modularity of Sm domains is consistent with a scaffolding role for some eukaryotic (and perhaps archaeal?) Sm homologs. These Sm-containing (Sm-like?) proteins feature a wide range of pairwise similarities to one another, even below the level of significant sequence homology. Based on the properties of most known Sm proteins, Sm-containing ORFs may be expected to assemble into homo- or hetero-mers. However, Sm-containing proteins may also act as monomers, as seen with some eukaryotic Sm orthologs that exhibit highly divergent structures and functions (e.g., the enhancer of RNA decapping protein EDC3 features an N-term Sm module that does not oligomerize in solution). A recent discovery is remarkable because it links Hfq, Sm modularity and SmAPs: In sequencing studies aimed at examining plasmid-encoded mobile genetic elements in the Thermococcus lineage of archaea, Krupovic et al. discovered “Hfq-like” genes in four distinct archaeal plasmids. In three of these plasmids the putative archaeal Hfq is fused to an N-terminal C2H2-type zinc-finger domain, suggesting a potential role in DNA binding.
Also striking, Sm-containing “homologs” can be found in pathways entirely unrelated to RNA or DNA metabolism. For example, an Sm domain was unexpectedly found in the crystal structure of a voltage-gated mechanosensitive channel of small conductance, MscS, and can be seen in other structures too (e.g., a biotin ligase; Mura, unpublished data). Least-squares structural superimposition of a SmAP and the MscS domain demonstrates their 3D similarity (Fig. 2D). 
Analogous to SmAPs, the MscS membrane protein also forms homoheptamers; however, that superficial resemblance seems to be the only shared feature between these otherwise unrelated, non-homologous proteins (the Sm domain in MscS does not mediate subunit∙∙∙subunit contacts in the heptamer). This degree of structural conservation, yet functional divergence, challenges our grasp of Sm structure/function relationships, and may imply a heretical view: That Sm proteins do not, in fact, comprise a homologous superfamily, but rather the Sm fold arose in multiple independent instances over the course of protein structural evolution.
An “augmented” Sm protein can be defined as one that consists of an Sm module and at least one additional structural domain. All three possibilities (N-term Sm, C-term Sm and middle-Sm) have been found. MscS is an example of a middle-Sm domain, and the aforementioned thermococcal plasmid ORFs illustrate C-term Sm(/Hfq) domains. A Pae SmAP3 paralog provides the only known structure of an N-term Sm module fused to another domain (Fig. 4). SmAPs with similarly augmented C-term domains (CTD) can be detected by sequence analysis, particularly for SmAP3s in the Sulfolobus genus of the crenarchaea. The novelty of the mixed α/β fold of the Pae SmAP3 CTD limited what could be inferred about its function via comparative sequence or structural analysis, though weak structural similarity was found with a CTD of yeast TATA-box binding protein. In addition to providing a structure of an Sm protein fused to a new fold, Pae SmAP3 illuminated (1) the assembly of stable 14-mers both in crystals and in solution, (2) a peculiar form of differential divalent cation-binding by Sm proteins, in a manner coupled to its self-assembly and (3) the large-scale conformational heterogeneity that can occur as a possible feature of augmented Sm proteins. 
Involvement of the SmAP3 CTD both in metal-binding and in shaping the SmAP3 heptamer interface suggests that the main purpose of this auxiliary domain could be either biochemical or structural (the CTD adds over 15,000 Å2 of solvent-inaccessible surface area to the ≈4,300 Å2 heptamer∙∙∙heptamer interface formed by the Sm domains alone).
Figure 4. New structural insights from a SmAP3 paralog. A schematic tree of life (A) shows the approximate phylogenetic location of P. aerophilum SmAP3, which supplies the only known structure of an extended Sm protein. The structure of this paralog (B) reveals a core Sm domain (dark hues) decorated with a C-terminal domain (CTD; light hues) that adopts a novel fold; for clarity, a single chain of the tetradecamer is demarcated with a broken line and colored red (Sm domain) and yellow (CTD). This augmented SmAP forms 14-mers and higher-order assemblies both in solution and in crystals, and exhibits intriguing conformational heterogeneity: The CTDs of subunits in the apical ring (orange hues) are hinged ‘down’ (below the plane of the Sm ring) whereas the CTDs of the equatorial ring (blue hues) splay-out laterally, nearer the plane of the Sm ring. Assembly of the 14-mer is modulated by differential divalent cation-binding in the apical and equatorial subunits (Cd2+ ions are shown as green spheres).
Cyclic oligomers and higher-order assemblies
Sm proteins tend to assemble into cyclic oligomers (Figs. 3, 4). Single- and double-ring assemblies occur, as do higher-order polymers. The single-ring oligomers are generally considered to be the biologically functional units. Early EM studies of eukaryotic snRNP particles suggested that the Sm and Lsm cores assemble as “doughnut-shaped heteromers.” The gradual realization that Sm/Lsm genes occur in groups of at least seven subtypes within eukaryotic genomes supported an oligomeric structural model; notably, a differential tagging/pull-down experiment established the stoichiometry of the yeast Sm heptamer in vivo and confirmed the sequential order of subunits in the eukaryotic ring. The homo-heptameric nature of an A. fulgidus SmAP bound to oligo(U) RNA was established by multivariate statistical analysis of electron micrographs and, concurrently, the first Sm ring structures were reported from a crenarchaeote (Pae) and two euryarchaeotes (Afu, Mth). Each of these SmAPs is homoheptameric. The hetero-heptameric nature of eukaryotic Sm cores was established in a relatively native environment (intact U1 snRNPs) as part of a single-particle cryo-EM reconstruction. Shortly thereafter the first non-heptameric Sm structures were discovered: a second Afu SmAP paralog was found to form hexamers (SmAP2), and EM and crystallography revealed hexamers of Hfq. Many lines of genetic, biochemical, biophysical, ultrastructural, NMR and crystallographic data now provide a complex picture of homomeric and heteromeric Sm assemblies. In many cases, an interesting pattern has emerged wherein modern/high-resolution studies are presaged by earlier/lower-resolution results. For instance, an Sm(F∙E∙G)2 hexamer was detected in pioneering transmission EM studies of Sm assembly intermediates, and recent crystallographic and NMR studies of the paralogous Lsm triplet revealed an Lsm(6∙5∙7)2 hexamer at atomic resolution. The gallery of Sm oligomers in Figure 3A includes a trimer (N-terminal fragment of a Schizosaccharomyces pombe Lsm4), pentamer (an Lsm of putative cyanophage origin), hexamer (E. coli Hfq), heptamer (Pae SmAP1) and octamer (S. cerevisiae Lsm3). The Pae SmAP3 tetradecamer is an example of a well-defined higher-order Sm assembly: this double-ring SmAP features an intricate, > 20,000 Å2 heptamer-heptamer interface (Fig. 4B).
Despite the substantial variation in Sm oligomers, the structural basis of subunit interactions in an Sm ring is fairly clear. In virtually every known structure (canonical Sm, Lsm, SmAP, Hfq), the Sm∙∙∙Sm interface forms via hydrogen bonds, van der Waals interactions and other interatomic contacts between strand β4 of subunit i and strand β5 of subunit i+1 (denoted β4∙∙∙β5+1). This interface is marked for n = 6 in Figure 3A. The antiparallel association of neighboring β-strands extends the sheet of the central subunit (Fig. 2B) across the entire Sm ring. Consistent with this model of Sm interactions, any Sm dimer (a homodimer excised from a homomeric ring, a heterodimer from a heteromeric ring) can be structurally superimposed on any other dimer with reasonably low RMSD values, demonstrating the structural conservation of the Sm•Sm interface. The greater RMSDs for alignment of dimers vs. monomers (and heptamers vs. dimers) imply that much of the structural variation in an Sm ring is a result of rigid-body displacements of subunits. The only exception that we are aware of to the general β4∙∙∙β5+1 assembly model for bona fide Sm proteins is the recent structure of a truncated construct of S. pombe Lsm4 (Fig. 3A, n = 3); though the atypical β•β interface in this trimer could be an artifact of truncation or crystallization, a similar β4∙∙∙β4+1 interface also occurs between Sm-like domains in a biotin ligase (Mura, unpublished results). In typical Sm, Lsm, SmAP and Hfq rings, the head-tail assembly of subunits that propagates the β-sheet across the Sm ring is enabled by the unique geometric orientation of Sm subunits: the U-shaped Sm monomers are oriented like the blades in a turbine, resulting in the β4 and β5+1 edge strands being optimally positioned for interaction. 
The edge strands often contain apolar amino acids that can engage in energetically favorable packing interactions; the standard hydrogen bonding pattern between the β-strand backbones from adjacent subunits can be supplemented by other contacts that further sculpt the β•β interface (e.g., sulfur∙∙∙π aromatic interactions in Pae SmAP1).
In terms of RNA binding, the most salient features of an Sm ring are the topography and physicochemical properties of its surface (binding grooves, electrostatic potential, etc.). The RNA-binding properties of SmAPs have not been thoroughly characterized, though U-rich ssRNAs are known to bind to the face of the ring that corresponds to Hfq’s proximal surface. Consistent with its RNA chaperone activity, Hfq features a more complex RNA-binding profile: U-rich RNAs primarily contact the proximal side of the ring, A-rich RNAs [e.g., poly(A) tails] bind across the distal surface and a third RNA interaction site was recently identified by Sauer et al. along the lateral rim of the disc-shaped hexamer. The N-terminal α-helices found in many Sm structures lie on the L3(/proximal) face, opposite the L4(/distal) face. However, the sole helix in a pentameric Sm is C-terminal and, thus, is not structurally analogous to the N-term helix (Fig. 3A). We raise these points because the proximal/distal labels, which were defined relative to the N-term helix face (proximal to the helix), can be structurally ambiguous: proximal and distal are relative geometric terms that require an external reference frame (an arbitrary point is proximal to some fixed reference point). The terms L4 face and L3 face, vs. distal and proximal (respectively), avoid this difficulty, as they refer to fixed structural features of the Sm fold/ring. 
The L4/L3 labeling scheme also draws attention to the most prominent structural features on the respective face of the Sm ring: the L4 loops appear as turret-like projections, particularly in Sm homologs with longer L4 loops, such as human SmD2 and B (Fig. 1 and Fig. S7 in ref. 56) and the yeast Lsm3 octamer. With respect to the orientation of an Sm monomer subunit, the proximal face is toward loop L3 and the distal face is toward loop L4 in Figure 2C.
The spontaneous assembly of Sm monomers into functional rings in the presence or absence of RNA is another key, yet enigmatic feature of Sm oligomerization. As a case in point, consider the Afu SmAP2 paralog. Crystallographic and in vitro biophysical characterization of this SmAP shows that it can adopt both hexameric and heptameric states, in a manner coupled to both solution pH and RNA-binding.
Afu SmAP2 hexamers occur at acidic pHs and in the absence of RNA, whereas the addition of U-rich RNA induces the formation of heptamers. Perhaps the coupling between snRNA-binding and SMN-mediated assembly of the canonical eukaryotic Sm snRNP ring, via discrete oligomeric intermediates (discussed above), is an evolutionary echo of the remarkable oligomeric plasticity exemplified by Afu SmAP2? Whereas snRNP Sm core assembly is chaperoned by SMN and occurs on an RNA site, eukaryotic Lsm complexes autonomously self-assemble into stable rings that then associate with RNA; examples include the nuclear Lsm2-8 complex that binds the 3′ terminus of U6 snRNA and the cytosolic Lsm1-7, which associates with P-bodies and is involved in mRNA degradation. Similarly to the eukaryotic Lsm rings, Hfq likely exists in the bacterial cell primarily as pre-formed rings; this is especially likely given Hfq’s high intracellular concentration. SmAPs that have been characterized thus far seem more Lsm- and Hfq-like, insofar as they spontaneously self-assemble into rings in solution and in the absence of RNA binding. This distinction between RNA-templated assembly of Sm rings, vs. Sm rings that are stable in the absence of RNA, is related to Scofield and Lynch’s functional classification of Sm rings as either fixed (specific function, such as in the snRNP Sm core) or flexible (generic/multi-functional, such as Hfq or Lsm).
Beyond their self-assembly into cyclic oligomers at the single- and double-ring levels, Sm homologs can also polymerize into fibrillar ultrastructures. Well-defined, finite Sm double-rings, such as Hfq 12-mers and SmAP 14-mers, are often found as head-head (L3-L3) associations of rings in crystal lattices. Though higher-order SmAP complexes (double-ring and beyond) can be detected by in vitro biophysical characterization (e.g., ref. 82), the existence and potential significance of Hfq dodecamers in solution have not been easily resolved; as discussed in reference 110, the detection of Hfq6 and (Hfq6)2 species, and the apparent Hfq:RNA stoichiometry, are influenced by the mode of analysis (gel shifts, analytical ultracentrifugation, etc.). In addition to the single- and double-ring oligomers, SmAPs from at least two archaeal lineages (Pae, Mth SmAP1) undergo head-tail polymerization into well-ordered fibrils. In an intriguing parallel to SmAPs, E. coli Hfq also polymerizes into well-ordered fibers with morphologies resembling those of SmAPs, albeit with a different assembly architecture.
The oligomerization/RNA-binding question.
The potential biological significance of SmAP and Hfq assemblies remains unclear at the double-ring level and in terms of the various fibrillar polymers. A speculative model for the potential roles of higher-order SmAP assemblies is shown in Figure 3B. Here, (SmAP) single-rings are indicated as being functional with respect to RNA chaperoning activity (middle panel), while putative (SmAP)2 double-rings (left panel) would exhibit only a subset of interactions (e.g., putative binding of A-rich RNAs to the L4 face, such as occurs with Hfq, is denoted by “?” marks); multi-ring polymers would be effectively RNA-silent (right panel). As suggested in this simple model, the oligomerization and RNA-binding properties of SmAPs are likely to be intricately coupled. In the model shown in Figure 3B, particular oligomeric states of the Sm ring can be viewed as an RNA-coupled molecular switch or as an “RNA-o-stat” (functionally analogous to a thermostat or rheostat, facilitating the cellular pool of RNA∙∙∙RNA interactions).
The oligomeric plasticity challenge.
Viewed across the entire family, Sm complexes exhibit a degree of oligomeric plasticity that outstrips many protein families, despite conservation at the levels of amino acid sequence and 3D fold. Whereas a geometric theory is available to account for the 7-fold symmetry of β-propellers, no general principle has emerged that relates the order of an Sm oligomeric state (n = 3, 5, …) to whether it is homo- or heteromeric, whether the Sm serves a generic or specific functional niche, and so on. Sm subunits assemble into homo-heptamers (often archaeal), hetero-heptamers (generally eukaryotic) and homo-hexamers (often bacterial, Hfq), though all four possible combinations of ring types, {homomeric, heteromeric} × {hexamers, heptamers}, have been found. Beyond the common heptamer and hexamer states, trimers, pentamers and octamers also exist (Fig. 3A). The closest analog to such large-scale quaternary structural variability may be the quasi-equivalent n = 5/6 states adopted by coat proteins in icosahedral virus capsids. What is the physicochemical and stereochemical basis of such immense plasticity? Are Sm ring assembly/disassembly and RNA binding coupled to Sm protein dynamics and allostery? (If so, how?) Pursuit of these and related questions would advance our understanding of the molecular basis of Sm structure and function.
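The combinatorics of ring types, and the cyclic neighbor relationships that underlie the β4∙∙∙β5+1 interface, can be written out explicitly. The Python sketch below is purely a bookkeeping aid under simple assumptions: subunits are indexed 0 to n−1, the ring is treated as ideally Cn-symmetric, and all names are ours rather than part of any established tool.

```python
from itertools import product

ASSEMBLY_MODES = ("homomeric", "heteromeric")
RING_SIZES = {"hexamer": 6, "heptamer": 7}

def ring_types():
    """Enumerate the {homomeric, heteromeric} x {hexamer, heptamer} combinations."""
    return list(product(ASSEMBLY_MODES, RING_SIZES))

def neighbor_pairs(n):
    """Subunit pairs (i, i+1) around an n-membered ring, with the last pair closing the ring.

    In the assembly model described in the text, strand β4 of subunit i
    pairs with strand β5 of subunit i+1.
    """
    if n < 2:
        raise ValueError("a ring needs at least two subunits")
    return [(i, (i + 1) % n) for i in range(n)]

def rotation_step_degrees(n):
    """Rotation relating adjacent subunits in an idealized n-fold symmetric ring."""
    return 360.0 / n

for mode, name in ring_types():
    print(mode, name, RING_SIZES[name])

# Oligomeric states shown in the gallery of Figure 3A.
for n in (3, 5, 6, 7, 8):
    print(n, len(neighbor_pairs(n)), round(rotation_step_degrees(n), 1))
```

Each n-membered ring contains exactly n such interfaces, so the same pairwise contact is reused whether the ring is a trimer or an octamer; what changes is the rotation between neighbors, from 120° for n = 3 to 45° for n = 8.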
Sm Functional Roles in the Archaea: Scaffolds or Chaperones (or Both, or Neither)?
Despite the availability of data on SmAP 3D structures, oligomerization, ligand-binding and other biophysical and biochemical properties, little is known about the physiological functions of Sm proteins in the archaea (Fig. 5). This dearth of knowledge stands in stark contrast to the well-characterized eukaryotic Sm proteins and the recently amassed knowledge of bacterial Hfq function (reviewed in refs. 36, 38 and 113). SmAP function remains opaque, both in terms of broad functional niches/cellular contexts (splicing, telomere maintenance, etc.) as well as specific biochemical properties and detailed molecular interactions in vivo. Do SmAPs act as sRNA chaperones, like Hfq, or do they function primarily as scaffolds for the assembly of complex RNPs, akin to the molecular activities of the eukaryotic Sm proteins? One plausible scenario is that the single Sm ortholog present in essentially all archaeal species serves as a single-stranded nucleic acid-binding chaperone (like Hfq), while the paralogous SmAPs found in some (though not all) archaea serve other, more specific, functional roles. One cannot exclude a fourth possibility: that SmAPs act via altogether different sets of mechanisms, which resemble neither Hfq nor eukaryotic Sm proteins.
Figure 5. Functional repertoire of the Sm fold, from a phylogenomic perspective. This phylogenetic tree shows Sm protein functional roles mapped onto the three domains of life (boxes). The typical number of Sm paralogs/species is indicated for each domain: one Sm per bacterial genome (i.e., Hfq), many Sm per eukaryotic genome, and an intermediate number (1→3) per archaeal genome. Sm oligomerization properties are also indicated. Note that the eukaryotic ring schematics are drawn in correct rotational “register” – i.e., SmF↔Lsm6, SmE↔Lsm5, etc. are the most closely matching pairs of sequences, and are presumably paralogous.
What is (definitively) known about RNA processing in the archaea?
Like bacteria, and unlike eukarya, archaea generally lack introns in protein coding genes. However, many introns do occur in archaeal tRNA and rRNA genes. Archaeal tRNA introns are typically in the anticodon loop, while rRNA introns occur at diverse locations. Whereas bacterial introns are usually self-splicing (e.g., group I introns), several forms of archaeal intron removal resemble their eukaryotic counterparts in terms of a protein requirement, e.g., endonuclease-mediated splicing of archaeal tRNA introns or rRNA processing. The occurrence of archaeal homologs of U3 snoRNP proteins suggests that snoRNP-based rRNA processing may be a shared feature between archaea and eukaryotes. Archaeal RNA processing other than intron removal is also beginning to be characterized, e.g., tRNA 5′- and 3′-end processing. Another RNA processing pathway that appears to be conserved between archaea and eukaryotes is the exosome, a large complex of RNA exonucleases, RNA-binding proteins and RNA helicases that mediates the 3′→5′ degradation of mRNA and other RNAs. Intriguingly, exosome evolution mirrors that of Sm assemblies insofar as eukaryotic exosomes feature greater compositional complexity (greater number of heteromer subunits) than their archaeal counterparts (Fig. 6).
Figure 6. An evolutionary parallel in another RNA-associated system. The archaeal exosome core ring has a relatively simple subunit composition, consisting of a 3 × 2 arrangement of Rrp41 (blue) and Rrp42 (orange) homologs. The eukaryotic ring is a more elaborate hetero-hexamer of Rrp41, Rrp46 and Mtr3 subunits (all three of which are Rrp41 homologs) along with Rrp42, Rrp43 and Rrp45 (all of which are Rrp42 homologs). The transition from a more primitive (archaeal) to a more sophisticated (eukaryotic) architecture presumably occurred via gene duplication, neutral drift and subsequent subfunctionalization among the paralogs that comprise this RNA-processing machine. This trend is mirrored in the evolution of Sm-based systems from homomeric rings with relatively generic functions (single-stranded nucleic acid-binding) to more sophisticated/specialized heteromeric assemblies.
It is generally assumed that archaea do not have spliceosomal U snRNP-like particles, as their pre-mRNAs are not generally viewed as containing introns. However, there is some precedent for archaeal mRNA introns: the gene for a tRNA- and rRNA-modifying pseudouridine synthase (an archaeal homolog of eukaryotic centromere-binding factor 5, Cbf5p) was found to contain an intron that is spliced in vivo. The intron/exon boundaries in this gene are predicted to adopt bulge-helix-bulge (BHB) motifs, which are the motifs recognized by the splicing endonucleases involved in processing of archaeal pre-tRNAs and rRNAs. It is striking that the intron-containing protein targets tRNAs and rRNAs as substrates, suggesting the potential for co-regulation via modulation of the BHB splicing and ligation apparatus. Although the regulation and diversity of RNA metabolism in archaea may not be as sophisticated as in eukaryotes, these examples suggest that many intricate features of archaeal RNA processing remain to be discovered. 
The central role of the highly conserved Sm proteins in eukaryotic mRNA processing suggests that archaeal RNA processing may utilize SmAPs in similar RNP assemblies (snRNP-like or otherwise).
Driven by the need for more specific and concrete experimental data about SmAP function in vivo, the past 3 years have seen progress on the cellular functions of these proteins, chiefly via proteomic and RNomic detection of interactions between SmAPs and other proteins or RNAs. The basic strategy has been a “guilt by association” approach (e.g., the CLIP-Seq method), wherein the relevant cellular pathways for a protein or RNA of unknown function are inferred based on the co-precipitated binding partners, a subset of which are presumed to have been functionally characterized. Much of this work has been pioneered by Marchfelder and coworkers in the euryarchaeote Haloferax volcanii.
A bacteria-like (Hfq-like) function in the archaea?
Most, if not all, archaeal genomes encode one SmAP, many encode two and some species (primarily among the crenarchaea) encode three Sm paralogs (Fig. 5). In terms of sequence similarity, assembly mode (homomeric, heteromeric) and oligomeric states (hexamer, heptamers, etc.), the SmAPs more closely resemble their eukaryotic Lsm counterparts than the bacterial (Hfq) branch of the family. Thus, the discovery of an “Hfq-like” protein in Methanococcus jannaschii (Mja), a euryarchaeal methanogen, came as a surprise. Sequence comparisons of this Mja Hfq with homologs from E. coli (Eco) and Staphylococcus aureus (Sau) suggested conservation of the Sm core of these proteins; the C-terminal tail of Mja Hfq is quite abbreviated relative to Eco Hfq and somewhat shorter than the Sau ortholog. Crystallographic work revealed that differences between Mja Hfq and the bacterial Hfqs localize near the N-terminal α-helix, the loop L4 variable region and the C-termini. These differences include a shorter N-terminal α-helix in Mja Hfq, which correlates with a smaller diameter of the hexamer ring in Mja (~54 Å) vs. Eco Hfq (~62 Å). The charge distribution on the L4 (distal) face also differs between Mja and Eco Hfq, which are predominantly negative and positive, respectively. Though some bacterial Hfqs also feature an acidic L4 face, the predominantly negative charge on the L4 face of Mja Hfq suggests that this archaeal Hfq may deviate from the poly(A)-binding site that is characteristic of many bacterial Hfq rings.
Despite these structural and biophysical differences, in vivo studies show that Mja Hfq can partially complement the pleiotropic phenotypes of Hfq-knockout mutants in both E. coli and Salmonella enterica. Specifically, Mja Hfq was shown to interact with and stabilize sRNAs, and participate in sRNA-mediated mRNA turnover. 
Furthermore, Mja Hfq can form a ternary complex with an mRNA (sucC) and the sRNA (Spot42) in vitro; interestingly, gel-shift assays suggest that sucC may compete with Spot42 at the Hfq-binding site. Competitive binding is not typically observed between sRNAs and their mRNA targets, suggesting that the detailed molecular interactions that underlie the formation of ternary RNA1-Hfq-RNA2 complexes may fundamentally differ between this Mja Hfq and more extensively studied bacterial homologs such as Eco Hfq. Regardless, the Mja Hfq work suggests some degree of functional interchangeability between archaeal and bacterial Hfq orthologs.
In addition to the genomically encoded Mja Hfq, archaeal “Hfq-like” proteins were recently discovered in four Thermococcus plasmids and three unrelated Methanococcal plasmids. As described above, these presumptive Hfq homologs contain an N-term C2H2-type zinc finger domain fused to a C-term Hfq domain. These novel homologs represent an exciting new group of augmented Sm proteins that may be directed specifically to DNA; intriguingly, both Hfq and SmAPs interact fairly non-specifically with DNA. Functional and structural studies of these new zinc finger-Hfq fusion proteins could greatly illuminate our understanding of both Hfq function and the Hfq/SmAP relationship.
A eukaryote-like (Sm- or Lsm-like) function in the archaea?
Although archaea, like bacteria, are unicellular organisms that lack nuclei and other well-defined organelles, many key features of RNA-based cellular metabolism in archaea are more similar to those of eukarya than bacteria. Homologies between rRNAs helped establish that the archaea and eukarya have a shared ancestor that diverged from early bacterial lineages. Other important similarities include archaeal and eukaryal RNA polymerases, and the usage of a specific class of ncRNAs (small nucleolar RNAs, snoRNAs) in both archaea and eukarya to direct modifications to other RNA molecules. However, archaea lack many of the sophisticated RNA-processing pathways in which eukaryotic Sm proteins play central and essential roles, including the major and minor spliceosomes and telomere maintenance. Deciphering SmAP function may shed light on the evolution of these key RNA processing features in eukaryotes.

One plausible role for archaeal Sm proteins is in the general biogenesis of abundant and often essential ncRNAs, including tRNAs, rRNAs and snoRNAs; these are all pathways in which eukaryotic Sm proteins are known to play key roles. However, this functional theme of “ncRNA biogenesis” (where “ncRNA” is a placeholder for t/r/sno/etc-RNAs) shows no clearly continuous line extending to the bacteria. In bacteria, the best established Sm function is as a general purpose chaperone for antisense-mediated hybridization of regulatory ncRNAs and their targets, with the Hfq-mediated ncRNA∙∙∙RNA interaction typically encoded in trans. Hfq also has high affinity for tRNAs, suggesting a direct, but as yet unresolved, role in bacterial tRNA processing or maturation. Intriguingly, Sm∙∙∙tRNA interactions can also occur in eukaryotes: in studies of SMN-mediated snRNP assembly, Pellizzoni et al. noted an association between the canonical eukaryotic Sm proteins and tRNA, suggesting that Sm∙∙∙tRNA interactions may not be limited to Hfq.
The pleiotropic effects of Hfq inactivation in many bacteria also suggest a potential role in biogenesis of housekeeping RNAs, perhaps independent of its role as a chaperone for regulatory RNAs. However, the fact that Hfq is not strictly required for growth in many species is inconsistent with a vital function in the biogenesis of essential ncRNAs. As described below, an H. volcanii SmAP deletion strain was found to be viable, and exhibited a permissive, pleiotropic phenotype similar to that of Hfq-knockout strains in some bacteria.

A potential twist on SmAP function is provided by the eukaryotic “Tudor” domain, which is a five-stranded antiparallel β-sheet that bears a striking resemblance to the Sm fold. Tudor domains occur in many proteins involved in RNA metabolism, including the SMN complex that chaperones the assembly of Sm proteins onto snRNA. Tudor domains bind methylated residues on substrate proteins, such as the dimethylated arginines of eukaryotic Sm proteins. The functional linkage and physical interactions between Tudor domains and Sm heteromers occur in the early stages of snRNP biogenesis. Intriguingly, the Tudor domain is not found in archaeal sequences in the standard protein family databases (Pfam, Superfamily, InterPro, etc.; Mura, unpublished). Thus, the likely absence of a Tudor/SMN system in the archaea implies either that SmAPs differ from eukaryotic Sm proteins in not being methylated (although archaeal methyltransferase homologs can be detected by sequence analysis) or, if SmAPs are methylated, that such modifications occur via alternative (non-Tudor) pathways.

Recent investigations using high-throughput sequencing methods have yielded new knowledge about the diversity and abundance of archaeal sRNAs. Many of these sRNAs represent promising partners for functional associations with SmAPs.
These include cis- and trans-encoded antisense RNAs that may modulate post-transcriptional processing of target mRNAs, as well as tRNA-derived fragments that may modulate translational efficiency in response to stress (the latter resembles the functional role of tRNA-derived fragments in eukaryotes). Among the first trans-acting regulatory RNAs discovered in archaea, there appeared to be a particularly promising candidate for interactions with SmAPs in the euryarchaeal methanogen Methanosarcina mazei, but the sRNA showed no particular affinity for either M. mazei SmAP paralog in vitro. Though the myriad roles for Hfq and Sm/Lsm proteins suggest highly general functions as RNA chaperones and RNP scaffolds, respectively, eukaryotic Sm proteins have not yet been found to play a role in RNA interference. Similarly, neither Hfq nor SmAPs have been implicated in the processing or targeting of CRISPR-derived RNAs, which function analogously in antisense-mediated defense against phage and other infectious genetic elements and account for a substantial fraction of archaeal sRNAs discovered via high-throughput sequencing studies. The C/D box and H/ACA snoRNAs are frequently among the most abundant sRNAs in archaeal and eukaryotic cells, but SmAPs have not been linked to snoRNA-guided modification of target RNAs.

The functions of SmAPs are unlikely to emerge from obscurity without studies specifically directed at experimental discovery of new interactions in vivo. To date, few such studies have been reported. In a key study of SmAP structure and function, co-immunoprecipitation (co-IP) experiments found both SmAP paralogs in the euryarchaeote Afu associated with RNase P RNA (which trims the 5′ ends of pre-tRNAs) and a longer precursor, suggesting a role in the maturation of this ubiquitous and essential ribozyme.
That work also found that antibodies specific for one AfuSmAP could co-precipitate the other paralog; a similar result was found in preliminary co-IP experiments with Pae SmAP paralogs (Mura, unpublished data), suggesting the potential interaction of SmAP paralogs in vivo. With respect to Figure 5, such an association would represent a small step away from the homomeric complexes of SmAPs and Hfqs, toward the heteromeric Sm complexes of eukaryotes. A recent co-IP study with the SmAP in the euryarchaeal halophile H. volcanii recovered a diverse array of RNA and protein-binding partners, but no particularly clear functional themes emerged from the population of sRNAs. Intriguingly, this work found the single H. volcanii SmAP ortholog to be inessential for growth; as with bacterial Hfq, genetic inactivation of the SmAP yielded pleiotropic phenotypes and growth defects that were more pronounced under some growth conditions than others. Whether paralogous Sm genes are similarly dispensable in species encoding more than one SmAP remains to be determined. As suggested by Figure 5, it is possible that SmAP paralogs became more ingrained in essential cellular pathways as they increased in copy number, and biochemical diversification, along individual lineages of euryarchaea, crenarchaea and thaumarchaea.
What can be inferred from genomic context?
Patterns of conservation among gene neighbors provide a way to infer phylogenetic and functional relationships among SmAPs, and between SmAPs and other gene families. Such an approach is potentially useful because, despite their conserved β-barrel 3D structures, the short length and great sequence variation across most Sm proteins limit the utility of sequence-based analysis as a means of function inference. Nearly all sequenced archaeal genomes contain at least one Sm homolog, situated directly adjacent to a gene for ribosomal protein L37e (rpl37); this association was first documented when only a few complete archaeal genomes were available. L37e is a zinc finger motif protein. In the euryarchaeote Haloarcula marismortui, L37e contacts conserved A-rich patches in 50S rRNA via long N- and C-terminal extensions. A SmAP gene is virtually always located immediately upstream of L37e and transcribed in the same direction, suggesting co-transcription as part of a conserved operon (and possibly association of the encoded proteins following translation). In the euryarchaeal halophile H. volcanii, L37e was shown to be co-transcribed with the upstream SmAP gene, but the L37e protein was not found among the proteins co-immunoprecipitated using anti-SmAP antibodies. Nevertheless, the near universality of the genomic association between SmAP and L37e genes in all major archaeal clades suggests a conserved role in processing or stabilization of rRNA; such a function would make SmAPs functionally most akin to the eukaryotic Lsm proteins, some of which are known to be involved in pre-rRNA maturation. It is also possible that SmAPs and L37e associate in evolutionarily conserved processing of other well-structured ncRNAs, such as tRNAs (tRNA genes often occur adjacent to the Sm-L37e pair) or the RNA component of the tRNA-processing RNase P complex, which was shown to associate with both SmAP paralogs in the euryarchaeon Afu.
Nanoarchaeum equitans is among the few exceptions to this Sm-L37e genomic association; instead, this archaeon’s SmAP gene is adjacent to (and convergently transcribed, relative to) the gene for an alternative ribosomal zinc-finger motif protein known as “L37ae.” N. equitans is an obligate endosymbiont with a highly reduced genome that is notable for the absence of a detectable RNase P RNA gene; a corresponding biochemical activity has not been found in N. equitans, which may support a role for the Sm-L37e gene tandem in maturation of the RNA component of RNase P.

Whereas most euryarchaea have one or two Sm genes, other archaeal phyla typically encode at least two, and often three, SmAP paralogs. We are unaware of species with four, five or six Sm genes. This pattern in the paralog count, both within archaeal clades and between the archaea and eukarya (Fig. 5), implies that Sm proteins evolved via gene duplication and neutral drift, subject to the geometric constraint that the paralogs assemble into functional homo/heteromeric rings. A gene duplication model, along with gene dosage effects, accounts for the pattern of Sm diversification/subfunctionalization across the tree of life (Fig. 5); an analogous evolutionary path is thought to have led to the modern exosome (Fig. 6). Eukaryotic Sm/Lsm genes likely underwent two waves of duplication, although lateral gene transfer, which pervades the microbial world, has not been excluded as a possible source of multiple Sm genes per species. The conserved genomic context of the second Sm paralog in the euryarchaeal Archaeoglobaceae, and a number of methanogens, suggests co-transcription with a homolog of the RNA polymerase III subunit RPC34; in eukaryotes this zinc-finger protein is involved in transcription of ncRNAs, including tRNAs and 5S rRNA.
We also note that a SmAP2 paralog in most crenarchaea and thaumarchaea is directly upstream of, and transcribed in the same direction as, a methionine adenosyltransferase (MAT), which is potentially involved in methylation of DNA or RNA.

Other gene context relationships also exist but with some variation among the archaeal clades. Irrespective of this variation, the genomic neighborhood of each SmAP typically includes multiple genes predicted to operate in specific RNA processing pathways. For example, crenarchaeal species in the family Thermoproteaceae (which includes P. aerophilum) are notable for an abundance and diversity of tRNA introns; in these species, we find that the Sm-L37e gene tandem is often adjacent to a divergently transcribed tRNA splicing endonuclease, again suggesting a role in tRNA splicing and maturation. In contrast, the Sm-MAT gene pair in the Thermoproteaceae clade is downstream of a large, well-conserved cluster of genes that includes RNA polymerase subunits and ribosomal proteins; this is a general contextual feature for at least one SmAP gene in archaeal species with more than one Sm homolog (AE Cozen, unpublished). Other genes that co-occur in the same regions as many SmAP genes include (1) cdc6-type genes possibly involved in cell cycle regulation (in Sulfolobaceae), (2) type II/IV secretion genes that may be linked to conjugation (in Thermoproteaceae), (3) RecA/RadA homologs potentially involved in DNA recombination (in thaumarchaea) and (4) β-lactamase-type nucleases potentially involved in 3′ polyadenylation of mRNAs (these can be found in most crenarchaea). Again, involvement in RNA-related pathways is a recurring theme from these genomic inferences of SmAP functional roles.
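The kind of gene-neighborhood check described in this section can be illustrated with a minimal Python sketch. It is not the procedure used for the analyses above: the `Gene` record, the function names, the gene labels and all coordinates are hypothetical, and the genome is treated as linear for simplicity. The sketch finds the gene immediately downstream of a query gene and classifies the pair as tandem (same direction), convergent or divergent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str    # hypothetical annotation label, e.g. "smap1" or "rpl37e"
    start: int   # leftmost coordinate on the replicon
    end: int     # rightmost coordinate on the replicon
    strand: str  # "+" or "-"


def orientation(left, right):
    """Classify two adjacent genes; `left` has the smaller coordinate."""
    if left.strand == right.strand:
        return "tandem"
    # "+" then "-" point toward each other; "-" then "+" point apart.
    return "convergent" if left.strand == "+" else "divergent"


def downstream_neighbor(genes, query):
    """Return (neighbor, orientation) for the gene just downstream of `query`.

    Assumes a linear replicon (no wrap-around); returns None when the
    query gene sits at the end of the gene list in its own direction.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    i = next(k for k, g in enumerate(ordered) if g.name == query)
    q = ordered[i]
    # Downstream means higher coordinates on "+", lower coordinates on "-".
    j = i + 1 if q.strand == "+" else i - 1
    if j < 0 or j >= len(ordered):
        return None
    nb = ordered[j]
    left, right = (q, nb) if j > i else (nb, q)
    return nb, orientation(left, right)


def has_tandem_pair(genes, query, partner):
    """True if `partner` lies immediately downstream of `query`, same strand."""
    hit = downstream_neighbor(genes, query)
    return hit is not None and hit[0].name == partner and hit[1] == "tandem"


# Toy loci with invented coordinates, for illustration only.
typical_locus = [
    Gene("trna", 100, 180, "+"),
    Gene("smap1", 300, 530, "+"),
    Gene("rpl37e", 540, 730, "+"),
]
convergent_locus = [
    Gene("smap", 300, 530, "+"),
    Gene("rpl37ae", 560, 800, "-"),
]

print(has_tandem_pair(typical_locus, "smap1", "rpl37e"))
print(downstream_neighbor(convergent_locus, "smap")[1])
```

Applied across many annotated genomes, a tally of such calls would give the fraction of species in which a given gene pair is conserved. A realistic analysis would also need to handle circular replicons, overlapping genes and inconsistent annotation labels, none of which are addressed here.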
Conclusion, Outlook
Sm proteins exhibit a phenomenal range of RNA-related functionality, from Hfq’s activity as an RNA chaperone to the scaffolding roles of eukaryotic Sm proteins. In contrast, the functions of SmAPs remain unknown. Further motivation for studying archaeal Sm systems is at least twofold: (1) practically, Sm RNPs from thermophilic archaea may prove to be more amenable to structural analysis, as was the case for the ribosome and (2) conceptually, SmAP-based systems may offer a window into the evolution of modern RNP assemblies (e.g., snRNPs), as well as the origins of Hfq-mediated riboregulation.

Hfq and other Sm proteins seem to achieve their great functional breadth by virtue of their ability to interact with myriad proteins and nucleic acids, whether alone, in complex with other Sm proteins, or as a structural domain in a larger polypeptide. This versatility can be attributed to at least four factors: (1) Though small, the Sm fold is a flexible platform (evolutionarily, physiologically) for displaying amino acid side-chains that can interact with proteins (e.g., the two Sm neighbors in a ring, other proteins) as well as nucleic acids (e.g., the multiple RNA-binding sites of Hfq). (2) In higher-order complexes built upon Sm rings, a “complex” function (such as splicing) can be regulated by a “simpler” upstream function (such as the assembly state of the ring); that is, Sm protein activity can be toggled by modulating the assembly state (Fig. 3B). Finally, a toroidal architecture enables two types of flexibility: (3) Biochemical modularity, wherein exchange of a single subunit within a heteromeric ring can alter the cellular role (e.g., Lsm1-7 and Lsm2-8, which differ by a single subunit). (4) Oligomeric plasticity at the level of single rings (e.g., a SmAP that forms homohexamers at low pH without RNA, but heptamers when bound to U-rich RNA) and higher-order ring assemblies.
As the missing link between all that we know about bacterial Hfq function and eukaryotic Sm function (Fig. 5), Sm-like archaeal proteins occupy a unique and promising evolutionary niche in RNA biology.

1. Many ncRNAs are referred to as small RNAs (sRNAs), which operationally can be considered to be ssRNA species of less than ≈70-80 nucleotides; however, these terms are not strictly synonymous. For instance, “long ncRNAs” (lncRNAs) are being discovered at an increasing pace and recognized as an important class of regulatory RNAs.