Genetic and biochemical analyses of RNA interference (RNAi) and microRNA (miRNA) pathways have revealed proteins such as Argonaute and Dicer as essential cofactors that process and present small RNAs to their targets. Well-validated small RNA pathway cofactors such as these show distinctive patterns of conservation or divergence in particular animal, plant, fungal and protist species. We compared 86 divergent eukaryotic genome sequences to discern sets of proteins that show similar phylogenetic profiles with known small RNA cofactors. A large set of additional candidate small RNA cofactors have emerged from functional genomic screens for defects in miRNA- or short interfering RNA (siRNA)-mediated repression in Caenorhabditis elegans and Drosophila melanogaster, and from proteomic analyses of proteins co-purifying with validated small RNA pathway proteins. The phylogenetic profiles of many of these candidate small RNA pathway proteins are similar to those of known small RNA cofactor proteins. We used a Bayesian approach to integrate the phylogenetic profile analysis with predictions from diverse transcriptional coregulation and proteome interaction data sets to assign a probability for each protein for a role in a small RNA pathway. Testing high-confidence candidates from this analysis for defects in RNAi silencing, we found that about one-half of the predicted small RNA cofactors are required for RNAi silencing. Many of the newly identified small RNA pathway proteins are orthologues of proteins implicated in RNA splicing. In support of a deep connection between the mechanism of RNA splicing and small-RNA-mediated gene silencing, the presence of the Argonaute proteins and other small RNA components in the many species analysed strongly correlates with the number of introns in those species.
Proteins with similar patterns of conservation or divergence across phylogeny are more likely to act in the same pathways[5]. To identify proteins that share an evolutionary history with validated small RNA pathway proteins, we determined the phylogenetic profiles of approximately 20,000 proteins encoded by C. elegans genes across 86 genomes (C. elegans and 85 others) representing diverse taxa of the eukaryotic tree of life: 33 animals, 6 land plants, 1 alga, 31 Ascomycota fungi, 3 Basidiomycota fungi, and 12 protists. Of the ∼20,000 C. elegans proteins, 10,054 show homologues in non-nematode eukaryotic genomes (Supplementary Table 1). Following correlation and clustering, this analysis sorts genes into clades of conservation and relative divergence or loss in the various organisms, as suites of genes are maintained from common ancestors or diverge in select lineages[6]. Protein divergence or loss in particular taxonomic clades is not random; entire suites of proteins can diverge or be lost as particular taxa specialize to no longer require ancestral functions. The correlated loss of proteins has been used to assign roles for nuclear-encoded mitochondrial proteins[7] and eukaryotic cilia-associated proteins[8].
We developed a non-binary method of phylogenetic profiling to cluster all protein sequences encoded by C. elegans genes. BLAST scores were normalized to the length of the query sequence and for relative phylogenetic distance between C. elegans and the queried organism[9]. The matrix of 864,644 conservation scores for the 10,054 C. elegans proteins in the 86 genomes was queried either with a single protein, to generate a ranking of other C. elegans proteins with the most similar pattern of conservation values, or using a more global hierarchical clustering method (Figure 1A). Members of the same protein families exhibit similar patterns of phylogenetic conservation and therefore tend to group together in the hierarchical clustering. However, many phylogenetic clusters include members with no sequence similarity; only their conservation or divergence in genomes is correlated. The ability of this non-binary method of phylogenetic profiling to cluster proteins based on function is exemplified by the clustering of proteins known to act as members of complexes: for example, the protein components of the ciliated sensory ending cluster together because they are retained in organisms with cilia and lost in those without, whereas ribosomal and translation factor proteins cluster together because of their extraordinarily high and universal conservation (Supplementary Figure 1A,B).
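The two normalization steps described above can be illustrated with a minimal sketch. The exact procedure is given in ref. [9] and is not reproduced in this text, so both steps below are assumptions: dividing by the query's self-hit score stands in for length normalization, and z-scoring each genome column stands in for the phylogenetic-distance correction. All function names and toy values are hypothetical.

```python
from statistics import mean, pstdev

def length_normalize(bitscores, self_scores):
    """Divide each best-hit score by the query's self-hit score (ASSUMED proxy
    for query-length normalization) and cap the ratio to the range [0, 1]."""
    return [
        [min(max(score / self_scores[i], 0.0), 1.0) for score in row]
        for i, row in enumerate(bitscores)
    ]

def column_zscore(matrix):
    """Z-score each genome column (ASSUMED stand-in for the phylogenetic-distance
    correction), so that distant genomes are comparable to close ones."""
    out = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        mu, sd = mean(column), pstdev(column)
        for i in range(len(matrix)):
            out[i][j] = (matrix[i][j] - mu) / sd if sd > 0 else 0.0
    return out

# Toy data: 3 query proteins (rows) x 3 genomes (columns); values are made up.
bitscores = [[500.0, 250.0, 0.0], [200.0, 150.0, 50.0], [300.0, 30.0, 0.0]]
self_scores = [500.0, 200.0, 300.0]

ratios = length_normalize(bitscores, self_scores)
zscores = column_zscore(ratios)
```

Keeping the ratio continuous rather than thresholding it to present/absent is what makes the profile non-binary: partial divergence in a lineage still contributes to the correlation between two proteins.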
Figure 1
A. Phylogenetic profiles of 10,054 conserved C. elegans proteins across 85 other eukaryotic genomes. For each C. elegans query protein, the normalized ratio of the blastp score for the top scoring protein sequence similarity is listed in the column corresponding to each genome. Values range from 0 (white, no similarity) to 1 (blue, 100% similarity). B. Phylogenetic profiles of validated RNAi factor RDE-1 and the 49 most correlated proteins in rank order.
With a simple query of one of the central proteins in RNAi, the RDE-1 Argonaute protein, we generated a rank-ordered list of proteins with phylogenetic profiles most similar to that of RDE-1 (Figure 1B). The 26 other C. elegans Argonautes represented the top correlated proteins, a trivial consequence of protein sequence similarity within the Argonaute family. The signature phylogenetic profile of the Argonaute proteins is that they are absent in 9 of 31 Ascomycota species, 1 of 3 Basidiomycota species, and 6 of 14 protist species, but have not been lost in any of the 33 animal or 6 land plant species compared. The retention of Argonaute proteins correlates with the ability to inactivate genes by RNA interference[10], and the loss of RNAi in about half of the sequenced Ascomycota fungi is correlated with the presence of the killer RNA virus[11]. Additional proteins that cluster with the Argonautes but show no sequence similarity to them include an asparaginase/threonine aspartase/taspase encoded by K01G5.9, the CAND-1 elongation factor, and another elongation factor, the THO complex protein THOC-1. THO complex members have emerged from genetic screens for defective transgene and RNAi silencing in Arabidopsis thaliana[12].
Another validated RNAi protein is MUT-2, a polyA polymerase implicated in a step downstream of the production of primary siRNAs by Dicer[13]. Of the 50 proteins with phylogenetic profiles most closely correlated with MUT-2 (Supplementary Figure 1C), 10 are Argonautes, which bear no sequence similarity to MUT-2, demonstrating the efficacy of this approach to detect validated small RNA pathway proteins. Also scoring with a similar phylogenetic profile are the splicing components MAG-1, RSP-8, RNP-4, RSP-5, and DDB-1 and the translation factors EIF-3.D and EIF-3.E, many of which score in the validation tests below. Similarly, of the proteins most correlated with the C. elegans orthologue of Dicer (DCR-1), a cofactor for processing of siRNAs and miRNAs, 3 Argonaute proteins emerge among the top 50 positions (Supplementary Figure 1D, Supplementary Table 2).
The RNA-dependent RNA polymerases (RdRPs)[14], siRNA-amplifying cofactors, are present in only 5 of 27 animals (the nematodes and, surprisingly, the tick), in all of the land plants but not in green algae, and in 2 of 4 Basidiomycota fungi, 18 of 27 Ascomycota fungi, and 4 of 14 protists. A query of the RdRP RRF-3 (Supplementary Figure 1E) revealed the cofactor-independent phosphoglycerate mutase, F57B10.3, as a dramatically correlated non-homologous protein (R = 0.93). Inactivation of this phosphoglycerate mutase gene caused defects in the endogenous siRNA response as well as transgene silencing, validating its role in RNA silencing (Supplementary Table 2). It is possible that either the biochemical substrate or product of this glycolysis pathway protein, or its actual enzymatic activity as a phosphatase, couples it to small RNA pathways.
To identify candidate small RNA pathway proteins more comprehensively, we globally ranked proteins based on phylogenetic profile correlation with multiple validated siRNA and miRNA cofactors. After assigning all conserved C. elegans proteins into hierarchical clusters, we defined for each protein a score reflecting its phylogenetic clustering with the validated set of small RNA proteins (Supplementary Figure 2). The phylogenetic profiles of 101 proteins cluster most closely with validated siRNA and miRNA pathway proteins (Figure 2), 61 of which have not yet been implicated in small RNA pathways.
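The single-protein queries described above (RDE-1, MUT-2, DCR-1, RRF-3) amount to ranking every other profile by its correlation with the query profile. The following is a minimal sketch assuming Pearson correlation, the similarity measure named in the Methods summary; the protein names and profile values are hypothetical toy data, not values from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def rank_by_profile(query, profiles, top=50):
    """Return up to `top` (name, r) pairs, most correlated with the query first."""
    q = profiles[query]
    scored = [(name, pearson(q, p)) for name, p in profiles.items() if name != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]

# Hypothetical profiles: one conservation score per genome (5 genomes).
profiles = {
    "query":     [1.0, 0.9, 0.0, 0.8, 0.0],  # lost in genomes 3 and 5
    "co_lost":   [0.7, 0.6, 0.1, 0.6, 0.0],  # lost in the same genomes
    "universal": [1.0, 1.0, 1.0, 1.0, 1.0],  # conserved everywhere
    "anti":      [0.0, 0.1, 0.9, 0.1, 0.8],  # opposite pattern
}

ranked = rank_by_profile("query", profiles)
```

A universally conserved protein carries no information about co-loss, so its correlation with any query is treated as zero here; this is one reason proteins with lineage-specific losses, such as the Argonautes, make informative queries.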
Figure 2
Phylogenetic clusters of candidate small RNA pathway proteins
Validated miRNA and siRNA pathway proteins map non-randomly on the phylogenetic profile; proteins that map to the same clusters are likely to function in small RNA pathways. Left panel: Clusters enriched for validated miRNA and siRNA pathway proteins, black boxes. Darker blue represents higher protein sequence similarity. Right panel: Pairwise local protein sequence alignment of all pairs of proteins in the cluster; black represents significant similarity and white no similarity.
The validated siRNA and miRNA protein cofactors identified to date likely constitute a small fraction of the total number of proteins that mediate small RNA function. Full genome RNAi screens for defects in siRNA or miRNA pathway function have identified hundreds of additional candidate small RNA pathway proteins. We integrated ten genome-scale studies into the phylogenetic cluster analysis: five C. elegans gene inactivation screens for defects in RNAi or miRNA function[1,15,16], C. elegans orthologues of Drosophila genes identified in two full-genome RNAi screens for impaired siRNA or miRNA response[2], and three proteomic studies of complexes containing the known RNAi proteins DCR-1[4], ERI-1[17], and AIN-2[18]. Candidate genes identified in these studies show little overlap (Supplementary Table 3; Supplementary Figure 3A,B). However, the candidates from the different studies have similar phylogenetic profiles to each other and to validated small RNA cofactors (Figure 3, Supplementary Figure 3C,D; Supplementary Table 4).
Figure 3
Select phylogenetic clusters enriched with hits from proteomic and functional genomic small RNA screens
A. The phylogenetic profile matrix was clustered and a Max Ratio score (MRS) was calculated for every protein in each screen; 117 proteins scored significantly in miRNA (56 genes) or siRNA (75 genes) functional genomic screens, or both (14 genes). Middle panel, black tick, hit in screens; gray tick, significant MRS. B. Blue boxes, the 23 known small RNA pathway genes identified. C. From the 117 genes predicted by the phylogenetic profile, 28 genes (blue bars) show defects in siRNA silencing (p-value < 3 × 10⁻¹⁵).
We used a Naïve Bayesian Classifier to assign predictive values to six genome-scale studies of RNAi cofactors and five of miRNA cofactors (see Supplementary Methods)[19,20]. To the phylogenetic profiles, we added a score for each C. elegans gene that is co-expressed on microarrays[21] or whose encoded gene product interacts with validated small RNA pathway proteins[22]. The top 105 genes identified by this analysis were enriched with 41 well-validated siRNA pathway genes (Supplementary Figure 7, Supplementary Table 2). The other genes on this list are excellent candidates to mediate siRNA or related small RNA functions. More than 20 of these genes encode proteins with RNA recognition motifs, including RNP (p-value 2 × 10⁻⁶) and helicase (p-value 1.4 × 10⁻⁵) domains, a ∼20-fold enrichment relative to the entire dataset. Nine proteins from this list constitute components of the spliceosome (Supplementary Figure 3).
We tested a set of the top predictions from phylogenetic profiling (Figures 1-3) and Bayesian analysis using two different tests for defects in RNAi. Transgene silencing in the somatic cells of the enhanced RNAi mutant eri-1(mg366) is mediated by an RNAi mechanism[1]. We tested a set of 87 predicted small RNA pathway genes in this strain, and 43 scored as significantly RNAi defective (Supplementary Table 2, Figure 4A). We also tested candidate genes using a GFP-based sensor for the abundant C. elegans endogenous siRNA 22G siR-1[23] to monitor whether any of the gene inactivations affect the production of, or response to, this endogenous siRNA. Thirty-three out of 87 genes tested scored in this assay (Supplementary Table 2, Figure 4B). Eight of the nine predicted splicing components scored strongly in these validation screens.
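The Naïve Bayesian integration described above can be sketched as follows. This is an illustration of the general technique only, not the classifier from the Supplementary Methods: each data set is reduced to a binary hit list, likelihood ratios are estimated from hypothetical positive and negative training sets with add-one smoothing, and evidence is combined under the naïve independence assumption. Gene names, set sizes, and the prior are all made up.

```python
from math import exp, log

def likelihood_ratios(hits, positives, negatives):
    """Return (LR if gene is a hit, LR if gene is not a hit) for one data set,
    estimated from training sets with add-one smoothing."""
    p_hit_pos = (len(hits & positives) + 1) / (len(positives) + 2)
    p_hit_neg = (len(hits & negatives) + 1) / (len(negatives) + 2)
    return p_hit_pos / p_hit_neg, (1 - p_hit_pos) / (1 - p_hit_neg)

def posterior(gene, datasets, positives, negatives, prior):
    """Combine all data sets assuming conditional independence (the 'naive' step)."""
    log_odds = log(prior / (1 - prior))
    for hits in datasets:
        lr_hit, lr_miss = likelihood_ratios(hits, positives, negatives)
        log_odds += log(lr_hit if gene in hits else lr_miss)
    return 1 / (1 + exp(-log_odds))

# Hypothetical training sets and two hypothetical hit lists.
positives = {"p1", "p2", "p3", "p4"}                  # validated cofactors
negatives = {f"n{i}" for i in range(1, 9)}            # presumed non-cofactors
datasets = [
    {"p1", "p2", "p3", "n1", "c1"},                   # e.g. a functional screen
    {"p1", "p2", "n2", "c1"},                         # e.g. a phylogenetic cluster
]
prior = 0.05

p_c1 = posterior("c1", datasets, positives, negatives, prior)  # hit in both
p_c2 = posterior("c2", datasets, positives, negatives, prior)  # hit in neither
```

Because the log likelihood ratios simply add, a data set that recovers many validated cofactors and few negatives moves a candidate's probability far more than a noisy one, which is how screens with little direct overlap can still reinforce one another.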
Figure 4
Inactivation of genes implicated in RNAi pathways reanimates transgenes that are silenced by RNAi
A. Expression of scm::gfp in the seam cells of an eri-1(mg366) mutant, where it is normally silenced by RNAi. Animals shown were treated with control, dcr-1, arp-6, or B0336.3 RNAi. B. GFP expression from the ubl-1::gfp-siR-1 sensor transgene, which is normally silenced by the siR-1 endogenous siRNA. Animals shown were treated with control, dcr-1, arp-6, or mes-4 RNAi.
The enrichment for RNA splicing components (Supplementary Figure 4) points to a close mechanistic connection between splicing and small RNA regulation. Among the Ascomycota and protist species that have lost the Argonaute proteins, most exhibit an extreme loss of introns, from 10⁴ to 10⁵ introns in species with Argonautes to 10² or fewer introns in most species without Argonautes (Supplementary Figure 5). We screened a cherry-picked gene inactivation sublibrary of C. elegans orthologues of known splicing factors, which have emerged from biochemical and genetic screens for splicing components in other systems, for defects in RNAi. From a set of 46 C. elegans genes annotated in KEGG to encode the orthologues of known splicing proteins that could be tested for roles in RNAi in our assays, 16 and 22 of these splicing factor genes scored strongly in the eri-1 transgene desilencing assay and the endogenous 22G siR-1 sensor assay, respectively. Many of the splicing components that scored strongly in these screens show a phylogenetic profile similar to the Argonaute proteins (Supplementary Figure 6, Supplementary Table 6). However, a subset of splicing factors that are well conserved across phylogeny also scored strongly in these assays.
We used the eri-1 transgene desilencing system to conduct a full genome screen for gene inactivations that disable transgene silencing and identified 855 genes required for transgene silencing, with more than 200 scoring above 3 on a scale of 0 to 4 for desilencing (Supplementary Table 7). Among gene inactivations that caused the greatest desilencing, 11% correspond to the highest ranked predictions from the siRNA Naïve Bayesian analysis, a 30-fold enrichment for positives (p-value = 4.7 × 10⁻¹³ using the hypergeometric test). Of the 84 splicing factors that have been assigned to specific splicing steps, 49 scored in the full genome screen as required for transgene silencing, and 32 showed phylogenetic profiles clustering with known small RNA factors. The splicing factors that couple to small RNA pathways were not isolated to any particular step of RNA splicing. Splicing factor mutations in S. pombe disrupt RNAi-based centromeric silencing[24]. Both splicing proteins and siRNA/miRNA pathway proteins co-localize to cytoplasmic processing bodies (P-bodies) and nuclear Cajal bodies[25], further supporting the possibility of functional crosstalk between splicing and RNAi.
Early genome sequence comparisons of S. pombe, S. cerevisiae, and a small set of eukaryotes suggested that loss of introns and splicing components is highly correlated with loss of Argonaute proteins[26]. One interpretation was that the loss of RNAi in S. cerevisiae allowed viral invasion and a subsequent loss of introns via reverse transcription of genes by the invading viral replication enzymes. However, such a scenario would not predict that inactivation of splicing components in a species bearing the RNAi apparatus would cause an RNAi-deficient phenotype. One model is that splicing could regulate RNAi indirectly by modulating spliced isoforms of key RNAi factors. However, the observation that only a subset of splicing cofactors is required for RNAi, together with the co-immunoprecipitation of splicing factors with DCR-1, ERI-1 and AIN-2, disfavors this indirect model. Rather, a mechanistic coupling between RNAi and RNA splicing explains these new data better. RNAi factors also affect splicing: Dicer is required for efficient spliceosomal RNA maturation in C. albicans[27]. If RNAi engages introns intimately by, for example, engaging nascent transcripts through the Argonaute NRDE-3 before splicing[28], then the selective advantage of introns may fade once the RNAi pathway is lost.
Our data suggest that a large subset of the proteins that mediate steps in the maturation of mRNAs bearing introns are also required for RNAi, and that genomes that have lost most of their introns no longer require the RNAi pathway. Superimposed on the mRNA splicing pathway is an RNA surveillance system that eliminates aberrantly processed or mutant pre-mRNAs and mRNAs. It is possible that RNAi constitutes another level of mRNA surveillance that acts in parallel to, and uses many of the same components as, the splicing quality control surveillance pathways.
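The enrichment of Bayesian predictions among the strongest hits of the full genome screen was assessed with a hypergeometric test. A minimal sketch of that kind of calculation is given below; the underlying gene counts are not stated in this text, so the numbers used are hypothetical and the function names are illustrative.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k): probability of seeing at least k 'successes' when `draws`
    items are sampled without replacement from `population` items,
    `successes` of which are successes."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(population, draws)

def fold_enrichment(k, population, successes, draws):
    """Observed success fraction among the draws over the background fraction."""
    return (k / draws) / (successes / population)

# Hypothetical counts: 10,000 genes screened, 100 top-ranked predictions,
# 200 strongest screen hits, 20 of which are top-ranked predictions.
p_value = hypergeom_sf(20, 10_000, 100, 200)
fold = fold_enrichment(20, 10_000, 100, 200)
```

Exact integer arithmetic with `math.comb` avoids the underflow that floating-point factorials would cause at genome scale; the single final division converts the exact ratio to a float.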
Methods summary
Informatics
The Normalized Phylogenetic Profile (NPP) data matrix was clustered with the MATLAB statistical toolbox, using the average linkage method and the Pearson correlation coefficient as the similarity measure. Clustering was performed on the rows of the matrix. To identify C. elegans proteins with phylogenetic profiles similar to published small RNA co-factors (Supplementary Table 9), the fraction of the validated genes in each phylogenetic cluster was calculated and optimized to define a Max Ratio Score (MRS) (Supplementary Figure 2).
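The clustering and scoring steps can be sketched in a few lines. The clustering below is a naive pure-Python average-linkage procedure on the distance 1 − Pearson r, written in Python rather than MATLAB. The scoring is an assumed reading of the MRS, whose precise optimization is described only in Supplementary Figure 2: for each protein, the maximum fraction of validated proteins over all clusters in the hierarchy that contain it. Names, profiles, and the `min_size` parameter are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def average_linkage(profiles):
    """Agglomerate rows by average linkage on 1 - r; return every merged cluster."""
    dist = {
        frozenset((a, b)): 1 - pearson(profiles[a], profiles[b])
        for a, b in combinations(profiles, 2)
    }

    def between(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [frozenset([name]) for name in profiles]
    formed = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: between(*pair))
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        formed.append(merged)
    return formed

def max_ratio_score(formed, validated, min_size=2):
    """ASSUMED MRS: best validated fraction over all clusters containing a protein."""
    scores = {}
    for cluster in formed:
        if len(cluster) < min_size:
            continue
        ratio = len(cluster & validated) / len(cluster)
        for name in cluster:
            scores[name] = max(scores.get(name, 0.0), ratio)
    return scores

# Hypothetical toy matrix: v1, v2 are validated; c1 shares their loss pattern.
profiles = {
    "v1": [1.0, 0.9, 0.0, 0.8, 0.0],
    "v2": [0.9, 1.0, 0.1, 0.7, 0.0],
    "c1": [0.8, 0.8, 0.0, 0.9, 0.1],
    "u1": [0.1, 0.0, 0.9, 0.2, 1.0],
    "u2": [0.0, 0.2, 1.0, 0.1, 0.8],
}
scores = max_ratio_score(average_linkage(profiles), validated={"v1", "v2"})
```

In this toy example the candidate c1 inherits a high score from the cluster it shares with v1 and v2, while u1 and u2 score only at the level of the root cluster. A real analysis would need a minimum cluster size and a significance threshold to avoid rewarding very small clusters.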