Simona Panni1, Ananth Prakash2, Alex Bateman2, Sandra Orchard2. 1. Università della Calabria, Dipartimento di Biologia, Ecologia e Scienze della Terra, Rende 87036, Italy. 2. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.
Abstract
This article describes the creation of the first expert manually curated noncoding RNA interaction networks for S. cerevisiae The RNA-RNA and RNA-protein interaction networks have been carefully extracted from the experimental literature and made available through the IntAct database (www.ebi.ac.uk/intact). We provide an initial network analysis and compare their properties to the much larger protein-protein interaction network. We find that the proteins that bind to ncRNAs in the network contain only a small proportion of classical RNA binding domains. We also see an enrichment of WD40 domains suggesting their direct involvement in ncRNA interactions. We discuss the challenges in collecting noncoding RNA interaction data and the opportunities for worldwide collaboration to fill the unmet need for this data.
This article describes the creation of the first expert manually curated noncoding RNA interaction networks for S. cerevisiae The RNA-RNA and RNA-protein interaction networks have been carefully extracted from the experimental literature and made available through the IntAct database (www.ebi.ac.uk/intact). We provide an initial network analysis and compare their properties to the much larger protein-protein interaction network. We find that the proteins that bind to ncRNAs in the network contain only a small proportion of classical RNA binding domains. We also see an enrichment of WD40 domains suggesting their direct involvement in ncRNA interactions. We discuss the challenges in collecting noncoding RNA interaction data and the opportunities for worldwide collaboration to fill the unmet need for this data.
Macromolecules within an organism perform many different roles, achieved in a large part through their associations with multiple interacting partners. Many differing types of molecular interaction networks have been studied, including protein–protein interaction (PPI) networks, transcriptional regulatory networks (DNA–protein), signal transduction networks, and metabolic networks. These networks are an essential step in developing a systems-level understanding of an organism. Comparisons of different biological networks can give interesting insights into the nature of cellular evolution, function, and organization. Large-scale RNA sequencing and transcriptomic analyses have revealed that, in contrast with previous hypotheses, most of the genome is transcribed into coding or noncoding RNAs (ncRNAs) (Storz 2002; Clark et al. 2011). Functional studies have shown that ncRNAs are involved in complex regulatory networks, which they modulate by interacting with proteins, nucleic acids, and small molecules.Any RNA molecule that functions without being translated into a protein is, by definition, a ncRNA, and they have been organized into families with markedly different structural or functional roles (Nawrocki et al. 2015). In addition to the well-characterized ribosomal RNAs and transfer RNAs, several other broad classes have been defined, including microRNAs (miRNA), long noncoding RNAs (lncRNA), and small nuclear and nucleolar RNAs (snRNA and snoRNA) (Amaral et al. 2011; Yoshihama et al. 2013; Kozomara and Griffiths-Jones 2014). These classes are still being continuously added to and enlarged by newly discovered molecules. The whole spectrum of ncRNA function is far from being elucidated. It is likely that they participate in signaling networks, modulating the main cellular processes and regulating gene expression, and also play additional roles via specific RNA–DNA, RNA–RNA, or RNA–protein recognition events (Woolford and Baserga 2013; Yamashita et al. 2016). The sheer quantity of reported data leaves no doubt as to the relevance of ncRNAs in a broad spectrum of signaling pathways, and on the multitude of interactions in which they are involved.Several high-throughput techniques based on UV cross-linking such as CLASH, CRAC, iCLIP, and PAR-CLIP, have been developed for the in vivo identification of RNA–protein as well as RNA–RNA interactions (Granneman et al. 2009; Hafner et al. 2010; Kudla et al. 2011; Modic et al. 2013). With the help of quantitative mass spectrometry, hundreds of new RNA binding proteins have been recently discovered, gaining novel insights into the nature of RNA–protein recognition specificity and the complexity of the RNA–protein networks (Castello et al. 2012; Mitchell et al. 2013; Beckmann et al. 2015).The curation of these interactions into the public domain, making them available for query or for network analyses, is therefore of critical importance. However, unlike protein–protein interaction data, no systematic manual curation of the published literature in this area has yet been undertaken. Recently, a few resources have gathered RNA interactions and made them available on the web, such as Clipdb (Yang et al. 2015), NPinter (Hao et al. 2016), RAIN (Junge et al. 2017), and RAID (Yi et al. 2017). An important issue that previously made it difficult for interaction databases to capture and organize ncRNA interactions was the lack of a single central resource providing unambiguous identifiers for each RNA molecule, and linking that identifier to the sequence of that RNA. Over the years, particular ncRNA categories have been collected and classified in specialized databases (miRbase, gtRNAdb, tRNAdb, lncRNAdb, etc.), while other RNAs were still only identified by the gene entry name (Chan and Lowe 2009; Jühling et al. 2009; Amaral et al. 2011; Kozomara and Griffiths-Jones 2014; Li et al. 2014). To consistently identify all ncRNAs, the RNAcentral platform was developed (http://rnacentral.org/), with the aim of incorporating and combining the numerous ncRNA databases to create a unified reference resource that allows the search or comparison of any ncRNA of interest (The RNAcentral Consortium 2015, 2017). RNAcentral is now available and acts as an external reference resource for all ncRNA molecules, as UniProtKB (The UniProt Consortium 2015) does for proteins and ChEBI (Hastings et al. 2016) for small molecules. The linking of identifier to sequence enables the precise mapping of point mutation data, which researchers generate to experimentally demonstrate the effect of these mutants on the interactions of ncRNA with other molecules.Now that an external reference resource exists for ncRNA molecules, the capture of functional annotations for those molecules has become feasible, and it is possible that information held in disparate databases can be combined. Protein–protein interaction data has been made publicly available in a consistent data format thanks to the combined work of multiple interaction databases and the molecular interaction workgroup of the Human Proteome Organisation Proteomics Standards Initiative (HUPO PSI-MI) (Hermjakob et al. 2004). The community standard data formats enable users to download interaction from multiple resources and readily combine these into a single data set. Members of the IMEx Consortium (Orchard et al. 2012) have gone further than this, managing curation across multiple separately funded resources to give the user a single, consistently annotated data set and make optimum use of public funds by avoiding repetitive capture of the same papers and reducing parallel database and tool development. Members of the IMEx Consortium have already extended their activities into the capture of protein–small molecule data, and the IntAct database also collates transcription factor/transcribed gene interactions.We present here a study focused on the curation of ncRNA interactions in the IntAct molecular interaction database, and on the analysis of the network subsequently produced. We have selected Saccharomyces cerevisiae—a well-studied model organism—for this work, in order to demonstrate that we can identify those processes and functions we would expect to see in such a network. We have curated over 120 articles and built yeast RNA–protein and RNA–RNA networks. Details concerning the experimental methodology, the interacting molecules’ detection methods, the features of the protein and RNA molecules (tags, point mutations, deletion mutations, etc.), and the organism used to produce or express the molecules, can all be captured by the IntAct data model, and the curation rules and controlled vocabulary (CV) terms have been expanded to encompass this new data type. The use of standard formats and CVs has allowed us to use the existing MIscore algorithm to calculate a confidence score for each interaction, and also has enabled a more detailed characterization of the resulting interaction network (Orchard et al. 2014).The results of this pilot study demonstrate the value of the expert curation of ncRNA networks, allowing the subsequent analysis both of overall network properties and the detailed interactions between individual molecules. As stated above, Saccharomyces cerevisiae was chosen as an appropriate model organism for this initial work because it contains a limited number of genes that encode ncRNAs. While we are not claiming this network is complete, we have achieved good coverage of the existing literature and keeping the network up to date with new findings will be a relatively easy task. The real value of this work is to highlight the importance of starting to tackle the collection of such data in more complex organisms, such as human. The required data resources and data exchange formats exist; the analysis tools originally developed for PPIs have been shown to be applicable to other molecule types, but funding and expert input from the ncRNA community will be required to tackle this important data collection effort.
RESULTS
General network properties
Protein–protein interaction (PPI) networks have long been studied in detail in various organisms, either as complete proteomes or as specific modules (proteins that belong to a certain pathway or complex). RNA–protein interaction (RPI) networks have been described in detail for very few organisms, with the exception of human and yeast (Nishtala et al. 2016), and little information is available regarding network analysis of RNA–RNA interactions. One of the reasons noncoding RNA–RNA interaction (RRI) networks are less well studied could be the limited availability of experimentally validated large-scale RNA–RNA interaction data.Here we have taken a comparative approach to studying various network properties of PPI, RPI, and RRI networks to identify any similarities or differences between them.The yeast PPI network modeled in this study comprises 77,620 unique interactions among 6091 proteins. The RPI network consists of 596 interactions between 105 ncRNAs and 153 proteins, and the RRI network comprises 195 interactions among 102 ncRNAs.We computed the degree distribution to measure the probability of nodes to interact with k, other nodes. As seen in Figure 1, the degree distribution of the PPI network follows a power law P(k) ∼ k−γ, wherein a large number of nodes interact with a few partners, and a small number of nodes, called hubs, interact with a large number of partners. We observe a similar trend in RPI and RRI networks. Hub proteins are critical for the functioning of a network and their removal can result in failure of the system (Jeong et al. 2001). On average there are 12.74 interactions per node in the PPI network and 651 hubs identified that interact with more than 50 proteins; the major hubs include heat shock proteins Ssb1 (UniProt: P11484), Ssa1 (UniProt: P10591), and Ssa2 (UniProt: P10592) with 3493, 2751, and 2444 interactors, respectively. In comparison, the RPI and RRI networks are sparsely connected with an average 2.13 and 1.90 interactions per node, respectively. The Gar1 protein, a subunit of the H/ACA ribonucleoprotein complex (UniProt: P28007), and the small nucleolar RNA U3a (IntAct: EBI-10821792, RNAcentral: URS0000444F9B) form the major protein and RNA hubs in the RPI network with 32 and 51 edges, respectively. On the other hand, the 18S and 25S ribosomal RNAs (rRNAs) dominate the RRI network (Fig. 2). Comparison of yeast networks with human networks show similar trends of degree distribution (Supplemental Fig. S1). The nodes in the three networks are closely linked with a mean shortest path length of 2.58, 3.27, and 2.35 between any two nodes in PPI, RPI, and RRI networks, respectively.
FIGURE 1.
Degree distribution of a random interaction network (nodes: 6091, edges 77620), protein–protein interaction network (PPI), RNA–protein interaction network (RPI), and RNA–RNA interaction network (RRI). Degree distribution of all three biological networks exhibits a power law. The degree exponent γ is shown in red text.
FIGURE 2.
The yeast noncoding RNA–RNA interaction network (RRI) shows ribosomal RNA subunits as the major hubs interacting with tRNAs and snoRNAs. The spliceosomal module comprising spliceosomal RNAs is circled in red. The ribosomal RNAs are highlighted in pink and tRNAs are highlighted in yellow. The snoRNA nodes SNR8 and SNR81, with high betweenness but low centrality, are shown in orange.
Degree distribution of a random interaction network (nodes: 6091, edges 77620), protein–protein interaction network (PPI), RNA–protein interaction network (RPI), and RNA–RNA interaction network (RRI). Degree distribution of all three biological networks exhibits a power law. The degree exponent γ is shown in red text.The yeast noncoding RNA–RNA interaction network (RRI) shows ribosomal RNA subunits as the major hubs interacting with tRNAs and snoRNAs. The spliceosomal module comprising spliceosomal RNAs is circled in red. The ribosomal RNAs are highlighted in pink and tRNAs are highlighted in yellow. The snoRNA nodes SNR8 and SNR81, with high betweenness but low centrality, are shown in orange.Other network properties, such as clustering coefficient and neighborhood connectivity, show similar trends between the PPI, RPI, and RRI networks (Table 1). The betweenness and closeness centrality of PPI, RPI, and RRI networks were also found to show similar distributions (not shown).
TABLE 1.
Summary of network properties
Summary of network propertiesNode centrality measures such as betweenness and closeness centrality evaluate the crucial role of a node as a mediator of interactions within the network. A node is central to the network if the shortest paths of a large number of nodes pass through it. Hubs cover a large number of paths that connect nodes within a network and therefore tend to show high betweenness and closeness centralities. Interestingly, both the PPI and RRI networks contain nodes that have low connectivity (degree) but relatively higher betweenness centralities. In the RRI network these nodes are represented by snoRNAs SNR8 (RNAcentral: URS00001F7017) and SNR81 (RNAcentral: URS00003377F1). SnoRNAs guide chemical modifications of ribosomal, transfer, and other small nuclear RNAs. In the RRI network (Fig. 2), snoRNAs SNR8 and SNR81 interact with nodes from both the ribosomal and spliceosomal RNA subnetworks. The H/ACA box SNR81 guides pseudouridylation of the 25S rRNA subunit and U2 (LSR1) snRNA (Schattner et al. 2004; Wu et al. 2011). Two distinct hairpin sites of SNR81 modify 25S rRNA and U2; the 3′ pseudouridylation pocket guides the pseudouridylation of 25S rRNA at position 1051 (Schattner et al. 2004), whereas the 5′ pseudouridylation pocket guides pseudouridylation of U2 snRNA at positions 42 (Ma et al. 2005) and 93 (Wu et al. 2011). SnoRNA SNR8 modifies the large ribosomal subunit and the actin pre-mRNA. The two snoRNAs form a structural link between the ribosomal subnetwork and the spliceosome module of the RRI network. The five snRNAs U1, U2, U4, U5, and U6 comprise the spliceosome module.“High betweenness but low connectivity” nodes were identified in a previous study of the yeast proteome (Joy et al. 2005). It was shown that nodes in the PPI network with “high betweenness but low connectivity” tend to be important connectors that link various modules (or clusters) within the network and are essential proteins of recent evolutionary origin (Joy et al. 2005). It has been proposed that “high betweenness but low connectivity” nodes in PPI networks evolve by the addition of nodes with edges and random rewiring of these edges, as a result of gene duplication and point mutations (Joy et al. 2005). Although node addition (gene duplications) and random rewiring (mutations) may explain the presence of “high betweenness low connectivity” nodes in PPI networks, the same evolutionary model cannot be extended to RPI and RRI networks. Protein-coding genes and noncoding genes evolve by different mechanisms and under different evolutionary constraints; while duplication and divergence are suggested as the major mechanism for expanding the protein-coding gene repertoire (Guan et al. 2007; Dujon 2010), mechanisms such as retroposition (Weber 2006; Schmitz et al. 2008), intragenic duplications (Shao et al. 2009), and de novo emergence (Meunier et al. 2013) are suggested to drive the expansion of short ncRNA genes. Moreover, mutation rates of protein-coding genes and noncoding RNAs, which potentially can rewire interactions in the network, are different, thereby affecting edge dynamics of the networks.
Network features
The IMEx curation process captures the full contextual details of experiments generating molecular interactions data, and this resource has previously been successfully used for computational analyses of PPI interactomes (Porras et al. 2015). Most of the controlled vocabulary (CV) terms that describe PPIs were also applicable to RPI and RRI interactions; however, new CV terms were introduced, when required, to enable the curation of specific interaction detection methods or participant detection methods. As shown in Figure 3B, more than 50% of the curated RNA–protein interactions were detected by affinity chromatography, anti-tag or anti-bait coimmunoprecipitation, and tandem affinity purification; however, new approaches developed to specifically capture protein-binding RNAs, such as CRAC and PAR-CLIP (Granneman et al. 2009; Hafner et al. 2010), were included to describe 19% and 1% of the RPI interactions, respectively. The yeast RNA–RNA interactions were demonstrated by cosedimentation (19%), EMSA (5%), pull down (5%), and cross-linking (5%), but the majority (55%) of these interactions were shown with RNA–RNA specific techniques, such as CLASH (Kudla et al. 2011). RNA–RNA interactions are often derived from sequence complementarity that can be predicted with bioinformatics analyses and confirmed by genetic interference. This approach was used to demonstrate most of the yeast snoRNA–rRNA interactions, using methodology that leads to a chemical modification of the ribosomal RNA. To formalize the representation of this approach, the term “chemical RNA modification plus base pair prediction” (Fig. 3A) was introduced into the CV. Information concerning the techniques used to identify the molecules participating in each interaction are annotated into the PSI-MI data exchange format as “participant identification methods.” Figure 3C shows that most of the RNAs were identified by sequence analysis (17%), Northern blot (29%), or primer extension assay (42%).
FIGURE 3.
Distribution of experimental methods used to demonstrate the yeast RRI and RPI interactions (A,B) or the identity of the RNA molecules (C). In D the percentage of curated interactions for each MIscore interval is reported for RPI (orange bar) and RRI (blue bar).
Distribution of experimental methods used to demonstrate the yeast RRI and RPI interactions (A,B) or the identity of the RNA molecules (C). In D the percentage of curated interactions for each MIscore interval is reported for RPI (orange bar) and RRI (blue bar).Moreover, to assess the validity of each interaction pair, a confidence score is calculated in IntAct (MIscore) that is reliant on the annotation of experimental, predicted, or inferred data, from which each interacting binary pair is supported (Villaveces et al. 2015). MIscore, as implemented in IntAct, provides a normalized score between 0 and 1 calculated from the weighted sum of the three different subscores: the number of publications, the experimental detection methods, and the interaction types found for the interaction. The higher the score, the more reliable the interaction is seen as being. According to this scoring system, more than 20% of the RNA–protein interactions have a score greater than 0.5, while most of the RNA–RNA interactions scores are in the range of 0.2–0.5, and only 8% of the interactions have a score higher than 0.5 (Fig. 3D). These low values are presumably due to the limited characterization of the RNA interactomes.Using the Cytoscape software (Shannon et al. 2003), interactions can be filtered according to confidence values to limit the visualization only to the most reliable interactions or to those validated with a specific method. Figure 4 shows a subgraph of the RPI network, including only high-confidence interactions with a score higher than 0.5.
FIGURE 4.
RPI subgraph showing interactions with a MIscore value equal or superior to 0.5. Proteins are colored in yellow and ncRNAs in blue. Edge line width is proportional to the score value.
RPI subgraph showing interactions with a MIscore value equal or superior to 0.5. Proteins are colored in yellow and ncRNAs in blue. Edge line width is proportional to the score value.
Gene Ontology analysis of ncRNA interactome
One important use of interaction data is to enable producers of large-scale Omics data to perform network analysis, looking for proteins that cluster in the interactome of a particular organism, or a given cell type, and that are all up- or down-regulated. Clusters of interacting proteins are often involved in the same biological process or are present in the same macromolecular complex. Gene Ontology is commonly used to identify commonalities of function and process. This relies on the proteins in the network having first been annotated to the appropriate GO terms, an ongoing process that has been the work of many curation teams for almost 20 years (The Gene Ontology Consortium 2015). The GO was designed for the annotation of gene products, and the inclusion of RNA within the scope of the GO project was always intended and predates our current understanding of the size and complexity of such an undertaking. The annotation of ncRNAs into interaction networks, and the ability to include their functional annotations into analyses will improve the analytical power of our current algorithms (Huntley et al. 2016); and the existence of RNAcentral to provide reference molecular identifiers for GO term curation of ncRNAs now makes this possible. We have evaluated the occurrence of GO terms of both proteins and ncRNAs of the yeast RPI interactome. Unfortunately, most analytical tools do not perform yeast RNA GO terms enrichment analyses as yet, so for ncRNAs we have used GO SLIM terms associated with the molecules based on curation undertaken by SGD (Cherry et al. 2012), and have analyzed the frequencies of the most represented terms. Protein GO terms enrichment analysis was carried out using BiNGO (Maere et al. 2005). As expected, nucleic acid/RNA binding molecular function was largely overrepresented in the RPI in respect to the yeast proteome, as well as enzymatic activities involved in RNA metabolism (see Fig. 5), while RNA processing, ribosome biogenesis, and maturation of RNAs are the most common biological processes terms.
FIGURE 5.
Gene Ontology enrichment analysis of yeast RPI proteins. Statistically significant overrepresented Gene Ontology terms in the yeast ncRNA–protein interactome were calculated with the BinGO tool.
Gene Ontology enrichment analysis of yeast RPI proteins. Statistically significant overrepresented Gene Ontology terms in the yeastncRNA–protein interactome were calculated with the BinGO tool.Notably, a significant enrichment of metabolic enzymes was observed in yeast mRNA binding proteins (Matia-Gonzalez et al. 2015), including glycolytic enzymes not directly involved with RNA metabolism, which may affect RNA stability. Most of the ncRNAs in RPI function as guides for methylation and pseudouridylation of rRNA during its maturation process, or have triplet codon adaptor activity, so that rRNA methylation and pseudouridylation are among the most represented biological processes terms (see Fig. 6). These terms are also the most represented in the RRI interactome (not shown), probably due to the lack in Saccharomyces of a relevant set of regulatory ncRNAs.
FIGURE 6.
Yeast RPI ncRNAs Gene Ontology terms. Slim ncRNAs Gene Ontologies were downloaded from SGD (Cherry et al. 2012) and frequencies were calculated as the occurrence of each term in RPI with respect to the number of considered ncRNAs.
Yeast RPI ncRNAs Gene Ontology terms. Slim ncRNAs Gene Ontologies were downloaded from SGD (Cherry et al. 2012) and frequencies were calculated as the occurrence of each term in RPI with respect to the number of considered ncRNAs.
RNA binding proteins analysis
RNA–protein interactions are often mediated by a small number of protein domains (RBD) that specifically recognize RNA structures and/or sequences (Lunde et al. 2007; Cook et al. 2015). We analyzed the proteins associated with ncRNAs in the yeast RPI to retrieve the occurrence of RBDs. We included in the list both “direct interactions” and “association” edges, which means that the proteins have been found in complexes with ncRNAs, but for some of them direct binding to ncRNA has not been demonstrated. We queried both Pfam and SMART databases with the list (Letunic et al. 2015; Finn et al. 2016). As shown in Figure 7, some canonical RBDs, such as RRM, DEXDC, and S1, were enriched in comparison with the yeast proteome; however, most of the proteins do not contain any of them, but contain specific regions selectively found in single proteins (such as Nop10 domain in Nop10p, UniProtKB: Q6Q547; or Utp8 domain in Utp8p, UniProtKB: P53276; or Mpp10 domain in Mpp10p, UniProtKB: P47083; and so forth), or are composed of disordered regions only (such as SF3B1, UniProtKB: P49955 or UTP9, UniProtKB: P38882, or Nsa1, UniProtKB: P53136), suggesting the evolution of multiple RNA binding strategies. It has been recently reported that yeast mRNA binding proteins often lack typical RBDs (Beckmann et al. 2015; Matia-Gonzalez et al. 2015; Brannan et al. 2016; Castello et al. 2016a). We compared the occurrence of the most common RNA binding domains in yeastncRNA binding proteins (BP) versus a list of yeast mRNA BP (Mitchell et al. 2013), and we observed that ncRNAs are recognized by unconventional modules similarly to or even more often than mRNAs (Fig. 7C).
FIGURE 7.
Putative RNA binding domains in yeast ncRNA binding proteins (BP). Frequencies were calculated as the percentage of proteins that contain a certain domain in the yeast proteome/6000 (occurrence in the yeast proteome A,D), proteins that contain a domain in the ncRNA interactome/173 protein nodes (occurrence in yeast RPI B,D), or proteins that contain a domain/120 mRNA BP from Mitchell et al. (2013) (C). (D) Frequencies of the most represented protein domains in the yeast ncRNA binding proteins. (Note that in A, KH, DSRM, PUMILIO, C3H1, and S1 are barely visible, due to their frequencies being 0.1% or less, while in B, C3H1 and Pumilio are absent, and in C, C2H2 is absent.)
Putative RNA binding domains in yeastncRNA binding proteins (BP). Frequencies were calculated as the percentage of proteins that contain a certain domain in the yeast proteome/6000 (occurrence in the yeast proteome A,D), proteins that contain a domain in the ncRNA interactome/173 protein nodes (occurrence in yeast RPI B,D), or proteins that contain a domain/120 mRNA BP from Mitchell et al. (2013) (C). (D) Frequencies of the most represented protein domains in the yeastncRNA binding proteins. (Note that in A, KH, DSRM, PUMILIO, C3H1, and S1 are barely visible, due to their frequencies being 0.1% or less, while in B, C3H1 and Pumilio are absent, and in C, C2H2 is absent.)Interestingly, the two most enriched domains in RPI are SM and WD40. SM domains (http://www.ebi.ac.uk/interpro/entry/IPR010920) are known to be associated with RNA binding proteins; however, WD40 repeat domains (http://www.ebi.ac.uk/interpro/entry/IPR017986) are mainly described as protein–protein interaction modules. This domain has also been reported to be enriched in mRNA binding proteins (Castello et al. 2012), which, together with our observation and with the characteristics of the domain, suggest that it may also in some cases mediate ncRNA–protein interactions.
DISCUSSION
Expert manual curation of the literature and high-quality annotation of described interactions has become a challenge for interaction databases, made more difficult by the increase of experimental results generated by high-throughput techniques. In the last decade, standardized formats and quality controls have been developed to help data sharing in proteomics and protein networks, in order to allow the user to score the reliability of each interaction (Orchard and Hermjakob 2015). In contrast, no specific standards have yet been developed for the capture of RNA interactions. This has meant that databases have collected RNA interactions using subjective rules for reliability and scores that do not allow a direct comparison of the data, while sometimes the source of the experiment and the conditions it was carried out with are not made clear to the reader. As recalled in the Introduction, the main issue previously preventing the biocuration community from collecting ncRNA data was the problems in assigning an unambiguous identifier to the molecules, recently resolved by the creation of RNAcentral. As a result, however, RNA molecules have been extensively annotated in only a few model organisms, and in some cases it can be hard to associate the transcript referred to in the article with a specific identifier, with the result that, for example, different mRNAs transcribed from the same gene cannot always be distinguished. This is also true for some precursors of ncRNAs. Another issue has been the classification of different experimental approaches as to their suitability, or otherwise, for building physical (as opposed to predicted or genetic) networks. Members of the PSI-MI community have defined a controlled vocabulary capable of describing experimental approaches, transcript metadata, point mutations, and expression levels that are extensively used to produce PPI networks and are crucial to the assembly of consistent and reproducible models. This framework is now being extended to encompass RNA–RNA and RNA–protein interactions, allowing the curation of the yeast ncRNAs interactome according to the curation principles developed by the IMEx Consortium. The IntAct data model has proven able to hold these additional data via a web-based editing tool to which one of the authors (S.P.) was given an account and training. IntAct curators additionally provided quality control checks on the resulting annotations. The interaction data collected in the present study are publicly available and represent a bona fide gold standard that can be utilized in prediction analyses.We have chosen Saccharomyces cerevisiae as a model system, which does not express many ncRNAs with transcription regulation functions compared to higher eukaryotes, so the ncRNA network we built consists mainly of interactions involved in the RNA maturation process or in the assembly of functional complexes such as the ribosome, spliceosome, and telomerase, etc. The RRI interactome in yeast is relatively small and most of the ncRNA interactions concern the ribosomal RNA processing. In comparison, mammalian cells express a very large number of ncRNAs that interact with both mRNAs and ncRNAs. While PPI networks have been extensively studied in the last decade, little is known about RPI and RRI network characteristics, and this pilot study is one of the first to allow for a direct comparison of the three networks. Interestingly, they show similar parameters, with a few hubs dominating the networks (Table 1). We also analyzed ontology frequencies for ncRNA involved in RRI and RPI (see Fig. 6). The ontological representation of networks often facilitates functional analyses; however, comprehensive annotation of Gene Ontology terms for ncRNAs is still largely incomplete, and common tools to determine statistically overrepresented GO categories have not yet been implemented for all ncRNAs (Huntley et al. 2016).The analysis of ncRNA–protein interactions in yeast reveals that the limited number of canonical RNA binding domains classified in 2007 (Lunde et al. 2007) mediates only a small fraction of them, while in some processes the protein–ncRNA-sequence recognition is helped by the binding of a second RNA carrying a complementary region. For example, some snoRNAs help Cbf5 in recognizing specific sites of the ribosomal RNA and modifying them (Reichow et al. 2007). Notably, the occurrence of the WD40 binding domain in yeastncRNA binding proteins is clearly overrepresented in respect to the yeast proteome (see Fig. 7), while no other protein-binding domain is. The WD40 repeats usually act as a scaffold for assemblies of complexes, which can explain their presence in some of the associations. However, the strong enrichment of WD40 domains suggests that they may also have a more direct role in ncRNA binding, which is also in agreement with its versatility in binding very different classes of ligands. Recently, the crystal structure of gemin5 WD40 domain binding an RNA oligomer was reported (Xu et al. 2016).Macromolecular structures provide a growing source of ncRNA interaction data. To date, the PDB contains 1808 entries that represent complexes between an RNA and a protein. Of those, 137 have S. cerevisiae listed as one of its source organisms. Most of these structures are large macromolecular complexes: More than 40 structures are variants of the 80S and 60S ribosome, 10 structures concern the spliceosome complex, five the preinitiation 40S and 48S complexes, and a few others the exosome complex, while only a small number involve tRNA–protein interactions. With the advances in cryoEM techniques, we hope that an increasing number of RNA–protein interactions will be defined at atomic resolution.This study presents the first attempt at creating a noncoding RNA network for yeast in IntAct. Other resources have started to collect RNA interactions (Li et al. 2014; Hao et al. 2016; Junge et al. 2017; Yi et al. 2017). However, new interactions are continuously published for yeast and mammalian mRNAs (Baltz et al. 2012; Baejen et al. 2014; Farazi et al. 2014; Gerstberger et al. 2014; Castello et al. 2016b) and for less characterized classes of ncRNAs such as SUT, CUT, and XUT (Tuck and Tollervey 2013). We believe that it is extremely timely to begin to collate RNA interaction data from the literature and high-throughput experimental screens. This will be greatly aided through collaboration between the world's noncoding RNA interaction databases and the adoption of standards by them. The widespread availability of protein–protein interaction data is critical for scientific research and the same must happen for RNA interaction data to fill the gap in scientific data provision.
MATERIALS AND METHODS
Data collection and curation: yeast noncoding RNAs
The yeast Saccharomyces cerevisiae cell represents an excellent model system to perform analyses on ncRNA interaction networks, since it contains a limited number of ncRNAs, most of which are well-characterized in literature and annotated in databases. As reported in Table 2, the majority of the yeast ncRNAs belong to the small nucleolar/nuclear classes of RNAs, which regulate ribosomal and messenger RNA metabolism. Other well-known classes include the ribosomal and transfer RNAs, the telomerase RNA component, the SRP complex RNA 7S, and the catalytic introns of groups 1 and 2. No microRNAs have been reported in S. cerevisiae to date, and few lncRNA have been described (Martens et al. 2004; Bumgarner et al. 2009; van Werven et al. 2012). Numerous cryptic unstable transcripts (CUT), Xrn1-sensitive unstable transcripts (XUT), and other noncoding transcripts are synthesized in yeast and immediately targeted for degradation (Neil et al. 2009; van Dijk et al. 2011). They are still poorly characterized and we did not consider them in our study (Tisseur et al. 2011). To select the RNAcentral identifier for each ncRNA, the database was searched with the SGD identifier or, for tRNAs, with the GTRNAdb identifier. For a few tRNA precursors, the RNA sequence search was used.
TABLE 2.
Summary of yeast noncoding RNAs and their interactions
Summary of yeast noncoding RNAs and their interactionsBoth transfer and ribosomal RNAs are transcribed as precursor molecules that undergo a series of post-transcriptional events required for the biogenesis of the functional molecules, and these precursors are sometimes not annotated as independent entries in RNA databases. During the curation process, we considered precursor and mature RNAs as different interactors whenever it was clearly specified which molecular form was involved in an interaction (Table 2).Noncoding RNA interactions were manually curated from the literature, following the curation standards established by the IMEx Consortium. First we consulted the Rfam, SGD, the LncRNAdb, and the yeast snoRNA databases (Piekna-Przybylska et al. 2007; Cherry et al. 2012; Nawrocki et al. 2015; Quek et al. 2015) in addition to the published literature, to draw up a complete list of S. cerevisiae ncRNAs. We then retrieved relevant articles by searching PubMed for abstracts containing at least one ncRNA name and “yeast” or “S. cerevisiae” terms. We also searched PubMed with other keywords, such as “CLIP,” “CLIP-seq,” “CLASH,” “rna rna interaction” “rna protein interaction,” etc. We collected several hundreds of papers that were manually filtered to a final total of approximately 120 actually containing interactions that were manually curated. We annotated interactions for each type of yeast ncRNAs, although there were a few exceptions (such as regulatory RNAs Pwr1, Icr1, and Irt 1) for which we have not found any physical interactions to date. RNA–RNA and RNA–protein interactions were manually curated using the IntAct editor, according to the IMEx standard. All data are available for download from the IntAct home page in both PSI-MI XML and MITAB tabular format, including all relevant information about the constructs used, the detection methods, and the host organism.To download the RNA interactomes, the IntAct database can be queried with (ptypeA:RNA AND ptypeB:RNA) or [(ptypeA:RNA AND ptypeB:protein) or (ptypeA:protein AND ptypeB:RNA)] (to limit the search to yeast, add Saccharomyces cerevisiae in the advanced search), or it can be queried with the list of ncRNAs (Supplemental Table S1), using EBI identifiers or URS identifiers followed by _559292.
Network analysis
Protein–protein, ncRNA–protein, and noncoding RNA–RNA interaction data for Saccharomyces cerevisiae (taxonomy ID: 559292) were downloaded from IntAct database (November 2016). Only a single edge was kept for examples of duplicate edges and self-interaction (loops) edges were removed before computing network properties. The complete yeast proteome PPI network was inferred from 77,620 unique interactions among 6091 proteins, the RPI network was inferred from 596 interactions between 105 ncRNAs and 153 proteins, and the RRI network from 195 interactions among 102 ncRNAs (tRNA–rRNA interactions were added to the RNA–RNA interactome). To compare network properties between the RRI network and the PPI network, the PPI network was downsampled such that the downsampled network comprised the same number of edges as the RRI network, and mean values from 100 downsampled networks were considered. To compare network properties of biological networks with a random network, a random network was generated using the Erdös–Renyi model with the same number of nodes and edges as in the PPI network (nodes: 6091, edges: 77620). Network analyses were carried out using the igraph package (Csardi and Nepusz 2006) in R.
Degree distribution
The degree of a node n is the number of edges linked to it. Number of nodes ordered by their increasing degree gives the degree distribution of a network.
Clustering coefficient
The clustering coefficient of a node is n defined as C(n) = 2e/[k(k−1)], where e is the number of edges between neighbors of node n, and k is the number of neighbors of n. The value of the clustering coefficient lies between values 0 and 1 and is highest if all the neighbors of the node directly interact with each other (i.e., edges form triangles) and 0 when none of the neighbors are connected with each other.
Betweenness centrality
Betweenness centrality of a node n is defined as B(n) = Σa≠ [σab(n)/σab], where σab is the shortest number of paths between nodes a, b; and σab(n) is the shortest number of paths between nodes a, b through node n. Betweenness centrality is divided by normalizing factor (N−1)(N−2)/2, where N is the total number of nodes in the connected network.
Closeness centrality
Closeness centrality of a node n is defined as the reciprocal of the average shortest path length, K(n) = 1/average [L(n,a)], where L(n,a) is the shortest path between two nodes n and a. The closeness centrality measure was computed on subgraphs with the highest number of interconnected nodes.
Neighborhood connectivity
Neighborhood connectivity of a node n is defined as the average connectivity (degree) of all its neighbor nodes.
Gene ontology enrichment analysis
The RPI network was downloaded from IntAct database (November 2016) and the BiNGO tool was used to search ontology files. For proteins, the hypergeometric test with 0.01 significance level was used to evaluate the statistical significance of enrichment of each term in comparison to the yeast proteome. No data are available for the whole-yeast transcriptome to date, so that for ncRNAs we calculated the occurrence of GO terms in the RPI network with respect to the set of yeast ncRNAs.
SUPPLEMENTAL MATERIAL
Supplemental material is available for this article.
Authors: E L van Dijk; C L Chen; Y d'Aubenton-Carafa; S Gourvennec; M Kwapisz; V Roche; C Bertrand; M Silvain; P Legoix-Né; S Loeillet; A Nicolas; C Thermes; A Morillon Journal: Nature Date: 2011-06-22 Impact factor: 49.962
Authors: Michael B Clark; Paulo P Amaral; Felix J Schlesinger; Marcel E Dinger; Ryan J Taft; John L Rinn; Chris P Ponting; Peter F Stadler; Kevin V Morris; Antonin Morillon; Joel S Rozowsky; Mark B Gerstein; Claes Wahlestedt; Yoshihide Hayashizaki; Piero Carninci; Thomas R Gingeras; John S Mattick Journal: PLoS Biol Date: 2011-07-12 Impact factor: 8.029
Authors: Eric P Nawrocki; Sarah W Burge; Alex Bateman; Jennifer Daub; Ruth Y Eberhardt; Sean R Eddy; Evan W Floden; Paul P Gardner; Thomas A Jones; John Tate; Robert D Finn Journal: Nucleic Acids Res Date: 2014-11-11 Impact factor: 19.160