Vladimir Reinharz1,2, Tsvi Tlusty1,3. 1. Center for Soft and Living Matter, Institute for Basic Science, Ulsan 44919, Republic of Korea. 2. Department of Computer Science, Université du Québec à Montréal, Montréal, H2X 3Y7, Canada. 3. Department of Physics, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea.
Abstract
Chaperone proteins-the most disordered among all protein groups-help RNAs fold into their functional structure by destabilizing misfolded configurations or stabilizing the functional ones. But disentangling the mechanism underlying RNA chaperoning is challenging, mostly because of inherent disorder of the chaperones and the transient nature of their interactions with RNA. In particular, it is unclear how specific the interactions are and what role is played by amino acid charge and polarity patterns. Here, we address these questions in the RNA chaperone StpA. We adapted direct coupling analysis (DCA) into the αβDCA method that can treat in tandem sequences written in two alphabets, nucleotides and amino acids. With αβDCA, we could analyze StpA-RNA interactions and show consistency with a previously proposed two-pronged mechanism: StpA disrupts specific positions in the group I intron while globally and loosely binding to the entire structure. Moreover, the interactions are strongly associated with the charge pattern: Negatively charged regions in the destabilizing StpA amino-terminal affect a few specific positions in the RNA, located in stems and in the pseudoknot. In contrast, positive regions in the carboxy-terminal contain strongly coupled amino acids that promote nonspecific or weakly specific binding to the RNA. The present study opens new avenues to examine the functions of disordered proteins and to design disruptive proteins based on their charge patterns.
Chaperone proteins-the most disordered among all protein groups-help RNAs fold into their functional structure by destabilizing misfolded configurations or stabilizing the functional ones. But disentangling the mechanism underlying RNA chaperoning is challenging, mostly because of inherent disorder of the chaperones and the transient nature of their interactions with RNA. In particular, it is unclear how specific the interactions are and what role is played by amino acid charge and polarity patterns. Here, we address these questions in the RNA chaperone StpA. We adapted direct coupling analysis (DCA) into the αβDCA method that can treat in tandem sequences written in two alphabets, nucleotides and amino acids. With αβDCA, we could analyze StpA-RNA interactions and show consistency with a previously proposed two-pronged mechanism: StpA disrupts specific positions in the group I intron while globally and loosely binding to the entire structure. Moreover, the interactions are strongly associated with the charge pattern: Negatively charged regions in the destabilizing StpA amino-terminal affect a few specific positions in the RNA, located in stems and in the pseudoknot. In contrast, positive regions in the carboxy-terminal contain strongly coupled amino acids that promote nonspecific or weakly specific binding to the RNA. The present study opens new avenues to examine the functions of disordered proteins and to design disruptive proteins based on their charge patterns.
There is mounting evidence for the existence of intrinsically disordered proteins (IDPs) that lack specific structures (Babu et al. 2012). These proteins do not fold into a well-defined conformation (Wright and Dyson 2015), although some may acquire a specific structure given the right context. IDPs are at the core of key biological assemblies and processes, such as membrane-less organelles (Nott et al. 2015), cell signaling (Wright and Dyson 2015), and cell division (Buske and Levin 2013). Disordered regions may exert entropic forces on the proteins they bind and thereby shift the ensemble of protein structures toward one with higher binding affinity (Keul et al. 2018). Although our repertoire of IDPs is steadily growing (Varadi and Tompa 2015; Piovesan et al. 2017; Schad et al. 2017), the function of most is yet to be discovered (Van Der Lee et al. 2014; Papaleo et al. 2016). Nevertheless, analysis suggests that a crucial determinant of the global shape and function of IDPs is their charge pattern (Das et al. 2015).A prominent class of IDPs is that of chaperones whose fraction of disordered residues, 54% on average, is the highest among all functional classes of proteins (Tompa and Csermely 2004). A particularly important subclass is those that chaperone RNA folding: To perform their functions, noncoding RNAs rely on well-conserved structures, which have been used for sequence alignment and putative RNA prediction (Nawrocki and Eddy 2013). Although some noncoding RNAs are able to attain those structures by themselves, chaperone proteins are essential in stabilizing correct conformations or in destabilizing, and thus rescuing, misfolded RNAs (Bhaskaran and Russell 2007; Woodson 2010; Papasaikas and Valcárcel 2016).A prime example of chaperone-dependent RNA is the group I intron (GII), which has an elaborate functional structure (Michel and Westhof 1990). Two chaperones take part in the folding of this RNA. One is the Cyt-18 protein that stabilizes the active structure (Guo and Lambowitz 1992; Mohr et al. 1992). The second chaperone is the StpA protein, which is known to destabilize misfolded GII structure (Waldsich et al. 2002; Mayer et al. 2007). The structures of Cyt-18 and its complex with the GII are well-determined (Paukstelis et al. 2008). In contrast, most of the StpA protein, 73% of the residues, is known to be disordered. StpA consists of two domains, the amino-terminal and carboxy-terminal. Excising the carboxy-terminal from the sequence increases the efficacy of the chaperone, whereas mutations in the carboxy-terminal hinder its binding capacity (Mayer et al. 2007). An entropy transfer model has been proposed, in which rapid and transient binding disturbs the structure, thus allowing it to refold (Tompa and Csermely 2004). But many questions regarding the specifics of the destabilization function remain open. An inherent obstacle in understanding the mechanisms of disordered proteins, such as StpA, is the lack of functional structure. The StpA–GII problem is even more challenging because the other partner in the interaction, the GII RNA, is misfolded and therefore lacks a specific structure as well.To overcome the lack of structures, one may leverage the accelerated growth in the number of known sequences and use them for multiple sequence alignments (MSAs). As of August 2018, GenBank (Sayers et al. 2019) had sequences totaling more than 3.7 × 1012 nucleotides from 420,000 species, an increase of 40% from the previous year. Techniques such as direct coupling analysis (DCA) extract from the MSAs amino acid contacts and 3D structures (Burger and Van Nimwegen 2010; Marks et al. 2011; Ovchinnikov et al. 2015), protein–protein interaction sites (Morcos et al. 2011; Ovchinnikov et al. 2014), RNA ligand binding pockets (Reinharz et al. 2016), RNA tertiary contacts (De Leonardis et al. 2015), and RNA–protein interaction sites (Weinreb et al. 2016). These studies have also demonstrated that many IDPs have strong correlations, hinting at context-dependent structures (Toth-Petroczy et al. 2016), although in the last study StpA did not exhibit any particular structure. So far, however, IDP–RNA interactions—which are essential in many molecular systems, in particular chaperones—have not been examined, perhaps because of the difficulty of analyzing the interaction of two objects that lack defined structures and whose sequences are written in different alphabets.All this motivates the present study in which we adapt the DCA method to concurrently process proteins and RNAs, which not only differ in the size of their alphabets but, on top of that, have high variability in sequence conservation. The adaptive method, termed αβDCA, produces the first analysis of the interaction of a disordered protein, StpA, with a noncoding RNA, the group I intron. Our method identifies 90 strongly coupled pairs between StpA and GII. The inferred locations of those pairs are consistent with the results of Mayer et al. (2007).We find that the charge pattern is strongly associated with the type of interactions: The amino terminal of StpA, which is known to destabilize the RNA, exhibits a few specific interactions among negatively charged regions of the protein and regions of the GII, which are critically misfolded in the structure's ensemble or impede functional loops from forming. In the carboxyl terminal, strongly coupled amino acids are mostly in positively charged regions, and their interaction of these amino acids with the RNA is weakly specific and almost uniformly distributed over the entire GII sequence. Moreover, although both terminals are of roughly the same length, only 21% of the top DCA scores are in the amino terminal. These findings propose a charge-dependent two-pronged mechanism of unspecific binding but specific disruption by chaperone IDPs.
RESULTS
We extended the classic mean-field approximation DCA (mfDCA) method for treating paired sequences that are written in different alphabets and have different levels of sequence conservation (for details see Materials and Methods). First, we tested this simple DCA variant—which we call αβDCA (for treating varying alphabets)—against two other DCA implementations: Gremlin, an implementation of Markov random-field DCA (Ovchinnikov et al. 2014), and EVcouplings (Hopf et al. 2018), an implementation of pseudolikelihood DCA (plmDCA). For the benchmark of the 5S–RL18 ribosomal complex, the adaptive αβDCA method predicts more contacts in its top scores (see section “5S RNA–RL18 protein interactions” in the Materials and Methods). Additionally, we observe that the mfDCA method outperforms Gremlin in the GII alignment, most probably owing to the correct pseudocount for a five-letter alphabet, rather than that of the 21-letter alphabet of proteins used in Gremlin. In the following, we apply the αβDCA method to analyze the StpA–GII alignment (code and alignment are available at https://gitlab.info.uqam.ca/cbe/abDCA).
αβDCA exhibits significant scores for strongly coupled StpA–RNA contacts
The DCA method identifies strong couplings, indicating significant physical interactions. These significant scores emerge as outliers departing from the bulk distribution of the DCA scores. Sequence conservation is a critical factor, as too high a conservation level prohibits coevolution analysis. Figure 1A shows the secondary structure of the GII RNA together with its long-range interactions and sequence conservation values (the overall maximal conservation is shown in Fig. 2). To test whether the alignment contains more information than an ensemble of random sequences, we compare the distribution of αβDCA scores from the StpA–GII alignment with those obtained from the same alignment but with randomly shuffled sequences. The scores of the original alignment spread over a much wider range then that of the shuffled alignment, thus confirming that the DCA analysis extracts information from the alignment (Fig. 3).
FIGURE 1.
(A) GII secondary structure and its sequence conservation. The last 25 positions have no conservation levels because they are excluded from the alignment (see section “Group I intron” in Materials and Methods). (B) Positions of significant DCA scores (≥4σ above average) between the StpA protein (y-axis) and the GII RNA (x-axis). The RNA axis is labeled with the secondary structure in parentheses notation. The blue region is the amino terminal of StpA and the orange region its carboxyl terminal. The pale gray regions are the pseudoknot (PK) of GII. The last 25 positions of GII are omitted (see section “Group I intron” in Materials and Methods).
FIGURE 2.
Global conservation. The most conserved nucleotide for each position, with its percentage of conservation. Each nucleotide shown is the most frequent one. If only A or G are present in that position, an R is shown for purine. If only C or U are present in that position, a Y is shown for pyrimidine.
FIGURE 3.
The distribution of APC values obtained by our method on the StpA–GII alignment compared to the same values after shuffling the sequences. There are 100 blue and 100 orange bins. The orange bins are therefore narrower.
(A) GII secondary structure and its sequence conservation. The last 25 positions have no conservation levels because they are excluded from the alignment (see section “Group I intron” in Materials and Methods). (B) Positions of significant DCA scores (≥4σ above average) between the StpA protein (y-axis) and the GII RNA (x-axis). The RNA axis is labeled with the secondary structure in parentheses notation. The blue region is the amino terminal of StpA and the orange region its carboxyl terminal. The pale gray regions are the pseudoknot (PK) of GII. The last 25 positions of GII are omitted (see section “Group I intron” in Materials and Methods).Global conservation. The most conserved nucleotide for each position, with its percentage of conservation. Each nucleotide shown is the most frequent one. If only A or G are present in that position, an R is shown for purine. If only C or U are present in that position, a Y is shown for pyrimidine.The distribution of APC values obtained by our method on the StpA–GII alignment compared to the same values after shuffling the sequences. There are 100 blue and 100 orange bins. The orange bins are therefore narrower.The StpA–GII amino acid–nucleotide pairs with the strongest DCA couplings are shown in Figure 1B. The distribution of scores is assumed to be normal, with its average and standard deviation computed from the empirical data. Scores that are four standard deviations (4σ) above the average are deemed significant. The αβDCA identifies 90 significant pairs, 15% less than those extracted by the standard DCA, which disregards the difference in alphabet and sequence similarity between the RNAs and the proteins. As shown below, in agreement with previous studies (Ovchinnikov et al. 2015; Toth-Petroczy et al. 2016), the number of false positives increases with the number of selected pairs. One can therefore expect that the analysis that shows fewer significant scores will yield fewer errors.
Inferred protein–RNA interactions are selective in the amino terminal and global in the carboxyl terminal of StpA
The amino and carboxyl terminals of StpA are known to interact differently with the RNA (Waldsich et al. 2002). This motivates us to characterize the number and distribution of high αβDCA scores, which indicate strong physical couplings, in each of these two regions. Because RNA structures fluctuate within a dynamic ensemble (McCaskill 1990), we examine the interactions in light of the two main structure ensembles and the functional structure, and in particular link the distribution of strong couplings along the RNA.To this end, from the RNA sequence, RNAstructure (Reuter and Mathews 2010) computes, in the McCaskill thermodynamic framework (McCaskill 1990), the probability of each possible base pair. Those pairing probabilities can be divided into two main structural ensembles to ease the visualization (Aalberts and Jannen 2013). We plot in Figure 4 the net charge distribution along the StpA protein (averaged over a window of five amino acids), above the two main clusters of the GII RNA structure ensemble as predicted by RNAstructure (Reuter and Mathews 2010). The arcs in the upper part depict bonds in the main cluster, whose probability is 68.2%, and the arcs in the lower part show bonds in the second main cluster, of probability 31.8% (Aalberts and Jannen 2013). The red discs represent stems in the functional structure that are absent from both ensembles, in particular the pseudoknot, as annotated by Waldsich et al. (2002). Note that although pseudoknots cannot be predicted with RNAstructure, they could not be inferred even with RNAPKplex, which was designed for this purpose (Lorenz et al. 2011).
FIGURE 4.
Significant scores between the protein and the RNA. (Top) The protein charge distribution. (Bottom) The RNA sequence with the two main structure clusters. Arcs represent base pairs: yellow only in the main, most probable cluster, blue only in the secondary, least probable cluster, and black in the functional structure. Red discs are base pairs in the functional structure absent from both clusters. Brown arcs are the P3 stem. Positions highlighted in green in the protein and RNA had > 50% of gaps in the alignment and were therefore omitted from the analysis. Significant DCA scores are denoted by lines: dark between the amino-terminal and the RNA, light gray lines between the carboxy-terminal and the RNA.
Significant scores between the protein and the RNA. (Top) The protein charge distribution. (Bottom) The RNA sequence with the two main structure clusters. Arcs represent base pairs: yellow only in the main, most probable cluster, blue only in the secondary, least probable cluster, and black in the functional structure. Red discs are base pairs in the functional structure absent from both clusters. Brown arcs are the P3 stem. Positions highlighted in green in the protein and RNA had > 50% of gaps in the alignment and were therefore omitted from the analysis. Significant DCA scores are denoted by lines: dark between the amino-terminal and the RNA, light gray lines between the carboxy-terminal and the RNA.The significant scores between the RNA and the protein are denoted by lines. There are 90 significant scores (≥4σ) between the protein and the RNA: 19 in negative regions of the amino-terminal (dark lines), 69 in mostly positive regions of the carboxyl terminal (light gray lines), and two in the linker between the amino and carboxyl terminals (dashed lines).More than 61% of carboxy-terminal significant amino acids have many globally distributed partners, on average 4.3 nt. In contrast, the N-amino acids show a more selective evolutionary signature with 67% of them exhibiting significant covariation with only 1 nt.
High scores correspond to close nucleotides in the 3D structure of GII
To check whether the RNA alignment is informative by itself, we examine the DCA scores among all pairs of RNA positions. To validate the quality of the RNA alignment, we compared the physical contacts predicted by DCA to the 3D structure of the td GII RNA (available at http://www-ibmc.u-strasbg.fr/spip-arn/spip.php?rubrique136). We computed DCA scores using two methods, the mean-field approximation (mfDCA) and Gremlin (Ovchinnikov et al. 2014). We note that αβDCA is identical to mfDCA when treating a single alphabet. We consider as a good prediction a pair of nucleotides closer than 8 Å in the 3D structure. Figure 5 shows the number of these true positives (distance < 8 Å) for the hundred top scores. Although the first 40 top scores are well predicted by both methods, the Gremlin method is outperformed by mfDCA in the next 60 scores.
FIGURE 5.
Evaluating the predictive power of the group I intron RNA sequence alignment. Fraction of nucleotide pairs closer than 8 Å for the top DCA values using mfDCA and Gremlin.
Evaluating the predictive power of the group I intron RNA sequence alignment. Fraction of nucleotide pairs closer than 8 Å for the top DCA values using mfDCA and Gremlin.
DISCUSSION
The StpA protein destabilizes the misfolded GII RNA, allowing it to achieve its functional structure. Experiments have shown that the binding is transient and weak, with little specificity (Waldsich et al. 2002; Doetsch et al. 2011). Mutation studies provide evidence for GII–StpA interactions: Mutations in the StpA carboxyl terminal reduce the binding affinity between StpA and the group I intron, whereas complete deletion of the carboxyl terminal increases the efficiency of StpA as a chaperone (Waldsich et al. 2002). A carboxy-terminal mutation, glycine 126 changed to valine, weakens the binding and increases the efficiency of StpA (Mayer et al. 2007). In the following, we further expand the understanding of the GII–StpA mechanism, based on the αβDCA results. We show that the αβDCA results are consistent with previous experimental studies. Moreover, they put forward a detailed picture of coupled amino acids and nucleotides responsible for both binding and destabilizing interactions.
Binding is mediated by positively charged regions of StpA
Binding of StpA to GII is driven by electrostatic forces mediated by positively charged amino acids (Mayer et al. 2007). This is confirmed by the αβDCA showing that the vast majority of high scores in the carboxyl terminal are in positively charged regions (Fig. 4). Out of the 69 pairs, 44 (64%) are in positively charged regions, 18 (26%) in neutral regions, and seven (10%) in negatively charged regions. This also implies that most of the binding energy comes from amino acids in the carboxyl terminal. It was conjectured that binding is only weakly specific and prefers unstructured RNAs (Mayer et al. 2007). Our analysis is consistent with this conjecture, showing a spread of top αβDCA scores all over the RNA. Figure 6 shows the cumulative number of top scores of nucleotides with the carboxy-terminal along the RNA, demonstrating the roughly uniform spread (with gaps excluded), with notable enrichment before position 200.
FIGURE 6.
Cumulative distribution of top DCA scores of amino acids in the carboxyl terminal of StpA coupled with nucleotides along the RNA. Positions with >50% of gaps omitted from the analysis are in gray. The black curve is the cumulative uniform distribution with the same gaps.
Cumulative distribution of top DCA scores of amino acids in the carboxyl terminal of StpA coupled with nucleotides along the RNA. Positions with >50% of gaps omitted from the analysis are in gray. The black curve is the cumulative uniform distribution with the same gaps.In a fine-grained examination, one notices several interactions of special interest. The glycine at position 126 of the protein, which is known to strongly reduce binding affinity when mutated, takes part in three different pairs. Position 113 of the protein—which participates in 14 different pairs, more than any other amino acid—is strongly coupled to positions 125 and 162 in the RNA, which themselves are also coupled with glycine 126. Position 125 of the RNA resides in the 5′-end of the pseudoknot and position 162 in the 3′-end of the P3 stem. The two RNA regions with the strongest coupling to the carboxyl terminal are both ends of the pseudoknot, which are involved in erroneous base pairs in the two dominant structures. This may explain why the isolated carboxyl terminal is a much inferior chaperone than the whole protein. Our analysis is also consistent with the theory that although important misfolded regions are disrupted, strong electrostatic binding slows the release of StpA, thereby impeding the correct folding of the RNA.
Destabilization is mediated by negatively charged regions targeting specific RNA positions
Removing the linker and carboxyl terminal increases by 50% the efficiency of StpA, implying that the amino terminal drives the destabilization (Mayer et al. 2007). Although the amino-terminal composes 48.5% of StpA, it contains only 21% of the strong couplings, 19 out of 90. The black lines in Figure 4 show the coupled pairs between StpA and GII. Out of 19 significant pairs, one is in a positively charged region, five (26%) are in a neutral region, and the remaining 13 (68%) are in negatively charged regions.Of those 19 pairs, seven are coupled with regions that determine the functional RNA conformation in both structure's ensembles, in particular the one position paired with the positively charged amino acid at position 39 in the amino terminal. The other 12 pairs correlate with four different regions that are expected to be destabilized as the probable structure conflicts with the functional one.In both ensembles, the functional short stems at the beginning and end of the GII sequence are blocked by a stem linking those two parts together. We find three interactions that target this region: First, the 3′-end of the pseudoknot, in both ensembles, is blocked by misfolded stems that are strongly correlated with a position of the amino terminal. Second, following the 5′-end of the P3 stem, the region between positions 60 and 85 of the RNA has the right conformation in the less probable ensemble and is targeted by three couples. Finally, the 3′-end of the P3 stem, present only in the least probable ensemble, is involved in one coupling. The last three coupled positions are in a hairpin stem preceding the P3 3′-end. This stem is missing functional base pairs in both ensembles, two of the coupled pairs are in positions lacking a base pair, the third in the unpaired region of the hairpin.Without StpA, ∼55% of the RNA is able to fold into its functional self-splicing form, and this folding fraction rises to ∼80% in the presence of the chaperone (Mayer et al. 2007). The strong correlations we observe manifest an interplay between the two main structure ensembles of the RNA, with the less probable one presenting most of the correct base pairs. Regions that contain functional stems in the least probable ensemble are all targeted by couplings with the destabilizing amino terminal. In both ensembles, the functional but energetically unfavorable pseudoknot has stems in its 3′-end impeding its formation. Our analysis proposes that the stems are also destabilized by the amino-terminal.
Conclusion
DCA methods have been applied to infer protein structure and protein–protein or protein–RNA interactions (Weinreb et al. 2016). DCA demonstrated high correlations among amino acids in IDPs, suggesting that many IDPs do exhibit structure in a particular context (Toth-Petroczy et al. 2016). In the present study, we expanded DCA to account for the different alphabets and different levels of sequence diversity in the concatenated sequences of protein and RNA used for the alignment. We used this adapted αβDCA method to infer the strong couplings between a noncoding RNA, GII, and its disordered protein chaperone, StpA. Understanding the StpA–GII is particularly challenging, because on top of the inherent disorder of the protein, the misfolded RNA also lacks a well-defined structure.The present αβDCA method produces 15% less significant contacts than the traditional mfDCA. In cases in which the structure is unknown, a rather arbitrary significance threshold must be chosen. Having fewer scores departing from the distribution indicates better discrimination of important coevolving pairs. Our findings are consistent with experiments and a proposed mechanism in which the binding, mediated by electrostatic forces of positively charged amino acids, is nonspecific or only weakly specific. These strong couplings, observed in the positively charged regions in the carboxyl terminal of StpA, are paired with evenly distributed nucleotides along the RNA sequence. In contrast, the αβDCA suggests that the structural disruption driven by the amino terminal is mediated by negatively charged amino acids that target specific regions of the RNA sequence. In particular, regions in the two main structure ensembles of the RNA impeding the formation of the first and last stem, as the pseudoknot, are strongly coupled with the amino terminal. Stems in the more probable structure ensemble—which are conflicting with the functional stems present in the lower probability ensemble—are also targeted.The present study is the first direct coupling analysis of the coupling between a disordered chaperone and its RNA target. Charge patterns have been known to be crucial for the global structure of disordered proteins, and here we shed some light on how they can affect destabilization mechanisms involved in RNA chaperoning. The analysis suggests several concrete experimental tests—for example, mutations at positions 99 and 113 in the carboxy-terminal are expected to significantly decrease binding affinity. The αβDCA variant used in the study is simple and general enough to be easily applied for investigating other IDP–RNA mechanisms. An interesting application of the present analysis is the identification of chaperone IDPs from their charge pattern. Those patterns could also be used to design novel destabilizing proteins.
MATERIALS AND METHODS
We first present a modified DCA algorithm, termed αβDCA, adapted for treating paired sequences that are written in different alphabets and have different sequence conservation levels. The different nature of the paired sequences influences the normalization factors that are crucial to predict the disentangled covariations. To illustrate the method, we show how the data for the StpA protein and the group I intron RNA were gathered, and how the alignment was built. The code is freely available at: https://gitlab.info.uqam.ca/cbe/abDCA.
αβDCA: direct coupling analysis for varying alphabets and sequence conservation
DCA has proved extremely useful for disentangling covariations between noninteracting residues in MSA (Weigt et al. 2009; Morcos et al. 2011). It aims to find the Potts model that maximizes the entropy in order to infer the most likely probability having the given dinucleotide marginals without any additional constraints (Weigt et al. 2009). The original method was constructed to treat alignments of sequences written in the same alphabet—namely, the protein amino acids written in the language of the genetic code. We modify this method to treat in tandem two alphabets, of sizes r and s. Given a sequence of n characters, we assume that the first ζ elements are from the alphabet of size r, and the last n − ζ from the alphabet of size s. In this study, the first alphabet is of the protein amino acids and a gap, hence r = 21, and the second is of the RNA nucleotides and a gap (i.e., s = 5).The MSA of M sequences of length n is recorded as its sequence of columns , where p ∈ [1, …, M] are the M sequences and 1, …, n are the columns. Because the proteins and RNAs have different sequence similarity and alphabets, we define two values for calibrating the pseudocount:
where similarity is true if sequences C and C are identical in >80% of the positions between a and b. We note that the values of and are at least 1 because each sequence is identical to itself. We additionally defineThe parameter λ is a pseudocount set to the appropriate value of or , as in previous studies.The frequencies of each letter in each column, and of each pair of letters for each pair of positions, need to be reweighted as following. We define the frequency count of a letter at column i, given the indicator function 1, as
Similarly, the frequency count of a pair of letters (, ) at positions (i, j) is defined asThe rest of the equations follow closely the formulation in Morcos et al. (2011). The coupling value e(, ), between two letters (, ) at positions (i, j), is calculated through the set of n(n − 1)/2 matrices ∂, the connected correlation matrix. For each pair of positions i, j, one defines a matrix ∂ij, whose dimension is (r − 1)2 if i < j ≤ ζ, (r − 1)(s − 1) if i ≤ ζ < j, and (s − 1)2 if ζ < i < j. For all i ∈ [1, …, n], j ∈ [1, …, n] the entries of ∂ij are
where and take all possible r − 1 or s − 1 values, depending on the index i and j. Finally, the coupling between positions i, j is obtained by inverting ∂:
where that block matrix is extended with 0s so that the dimension of e is r2 if i < j ≤ ζ, rs if i ≤ ζ < j, and s2 if ζ < i < j. The inverse of the connected correlation matrix returns the negative coupling term; we correct it by taking minus its value (Morcos et al. 2011).We can now define a pseudoprobability, P(, ), of observing (, ) at positions (i, j), given auxiliary residue fields for each position:
where is the normalization factor. The values of the fields are determined by the observed single residue count and must satisfy the system of equations:Note that we must assume that if (respectively, if ).At this point, we can compute the directed information between two positions, D, as
Finally, the distortion of the scores due to the undersampling effect is corrected using an average product correction (APC) method (Dunn et al. 2007).
StpA homologs
The StpA protein from the Escherichia coli (strain K12) sequence isMSVMLQSLNNIRTLRAMAREFSIDVLEEMLEKFRVVTKERREEEEQQQRELAERQEKISTWLELMKADGINPEELLGNSSAAAPRAGKKRQPRPAKYKFTDVNGETKTWTGQGRTPKPIAQALAEGKSLDDFLI.The distribution of charges along the sequence is a known indicator of the global conformation of disordered proteins (Holehouse et al. 2017). The Das–Pappu phase diagram shows that the StpA protein belongs to the ensemble of “Janus sequences.” Those are collapsed or expanded depending on context, and most functional disordered proteins belong to that group. This region of Janus sequences contains 40% of known disordered proteins (Das et al. 2015), whereas another 25% reside in the strong polyampholyte region, and 30% are classified as weak polyampholyte.The jackhmmer method (Potter et al. 2018) was run iteratively 13 times, until the number of sequences added to the matches was <1% of the already identified ones. We identified 21,593 matches, 5749 of them unique. jackhmmer provides a sequence alignment of all the hits, which belong to 7539 different taxa. Every sequence in GenBank (Sayers et al. 2019) associated with those taxa was downloaded, a total of 633 GB of data.
Group I intron
The td group I intron (GII) sequence from phage T4 thymidylate-synthase isgguUAAUUGAGGCCUGAGUAUAAGGUGACUUAUACUUGUAAUCUAUCUAAACGGGGAACCUCUCUAGUAGACAAUCCCGUGCUAAAUUGUAGGACUGCCCGGGUUCUACAUAAAUGCCUAACGACUAUCCCUUUGGGGAGUAGGGUCAAGUGACUCGAAACGAUAGACAACUUGCUUUAACAAGUUGGAGAUAUAGUCUGCUCUGCAUGGUGACAUGCAGCUGGAUAUAAUUCCGGGGUAAGAUUAACGACCUUAUCUGAACAUAAUGcuacand its functional secondary structure, from Waldsich et al. (2002), is((((……))))((((((((((….))))))))))…((((((…((((((….(((…..)))…))))))((…..(((((((((…..)))))))))…..))…[.[[[[[.(((….)))(.(((((((….))))))))..))))))(((((((……)))))))……]]]]]]…(((((((….))))))).((((……))))(((((((………)))))))………….. where the pseudoknot is indicated with square brackets, “[‘ and ’].”The GII has 14 different subgroups, which have been cataloged in the GISSD database (Zhou et al. 2008). Identification and alignment of GII sequences are highly dependent on the subgroup they belong to (Nawrocki et al. 2018). Therefore, for each subgroup, we generated a covariance model using Infernal (Nawrocki and Eddy 2013). The IA2 subgroup is the most compatible with GII. With GII, Infernal reports an e-value of 1.7 × 10−36, and 63% of the base pairs are well predicted. In particular, the complete P3 stem (brown in Fig. 4) is perfectly aligned with the consensus structure. We note that although the sequence has 273 nt, only the first 248 were matched. The rest of the analysis is performed on those 248 nt.A search of matches to the IA2 subgroup was then computed with the cmsearch routine of Infernal, on all sequences from the 7359 taxa gathered previously. A total of 7542 sequences were identified as significant—e-value <0.01—with default parameters, 471 of them unique. The cmsearch tool returns an alignment of those sequences.
Protein–RNA alignment
Duplicate proteins and RNAs were removed from each taxon. Every possible protein–RNA pair inside a taxon was concatenated together. This yielded a total of 13,230 couples, of which 10,013 were unique.Only columns in which StpA and the GII have <50% of gaps were kept. In total, 39 positions of the proteins were removed, the amino terminal's first 30 positions, six in the carboxyl terminal and two in the linker. In the RNA, 64 positions were removed. The resulting protein alignment is composed of 95 columns and the RNA alignment of 184.
5S RNA–RL18 protein interactions
We compare four DCA methods for the benchmark of inferring the interactions between the 5S RNA and the RL18 protein. The four methods are (i) standard mfDCA, in which the pseudocount is kept at 21 for every position in our alignment, (ii) our αβDCA implementation of mfDCA with adaptive pseudocount, (iii) the implementation EVcouplings (Hopf et al. 2018) of pseudolikelihood DCA (plmDCA), and (vi) the Markov-random field DCA as implemented in Gremlin (Ovchinnikov et al. 2014).We used the protein alignment of RL18 provided in Weinreb et al. (2016). The RNA sequences were recovered from the Rfam family RF00001 (Kalvari et al. 2017). We followed the protocol of the previous section. Because of the large amount of sequences, we selected randomly one pair of protein–RNA per taxonomic family, as in Weinreb et al. (2016). The alignment before removing columns with >50% of gaps is available at https://gitlab.info.uqam.ca/cbe/abDCA. We computed amino acid–nucleotide distances in the 4V4Q protein structure (Schuwirth et al. 2005). Pairs with distance shorter than 10 Å are considered to be in contact.We show in Figure 7 the results of the first top 100 scores for each method. Only mfDCA and αβDCA (mfDCA adaptive) have their highest scores correctly predicting a contact. Although mfDCA's fourth hit is correct but not the one in the αβDCA method, the opposite occurs at their sixth top score. Both methods outperform Gremlin and EVcouplings on the top 20 scores. Although the true positives of mfDCA and αβDCA steadily decline as more top scores are taken into account, Gremlin sees an increase to up to 50% at its 30th score. All methods then converge to ∼22% true positive when the first 100 scores are taken into account.
FIGURE 7.
Comparing four DCA methods for the benchmark of inferring the 5S–RL18 complex from PDB 4V4Q. The graphs show fraction of pairs with a distance below 10 Å for the top 100 DCA values for each method. The circles indicate the last score over 4σ from the bulk distribution.
Comparing four DCA methods for the benchmark of inferring the 5S–RL18 complex from PDB 4V4Q. The graphs show fraction of pairs with a distance below 10 Å for the top 100 DCA values for each method. The circles indicate the last score over 4σ from the bulk distribution.The overlap of scores over 4σ from each bulk distribution is shown in Figure 8 (Heberle et al. 2015). Although 95% of those overlap between mfDCA and αβDCA, they are almost completely exclusive from Gremlin and EVcouplings top results. None of the top pairs is identified by all of the four methods and only two are shared by mfDCA, αβDCA, and Gremlin. This is the only overlap between any three methods.
FIGURE 8.
Intersection of the scores over 4σ for each of the four DCA methods on the 5s–RL18 complex.
Intersection of the scores over 4σ for each of the four DCA methods on the 5s–RL18 complex.
Authors: Faruck Morcos; Andrea Pagnani; Bryan Lunt; Arianna Bertolino; Debora S Marks; Chris Sander; Riccardo Zecchina; José N Onuchic; Terence Hwa; Martin Weigt Journal: Proc Natl Acad Sci U S A Date: 2011-11-21 Impact factor: 11.205
Authors: Debora S Marks; Lucy J Colwell; Robert Sheridan; Thomas A Hopf; Andrea Pagnani; Riccardo Zecchina; Chris Sander Journal: PLoS One Date: 2011-12-07 Impact factor: 3.240
Authors: Oliver Mayer; Lukas Rajkowitsch; Christina Lorenz; Robert Konrat; Renée Schroeder Journal: Nucleic Acids Res Date: 2007-01-31 Impact factor: 16.971
Authors: Eric W Sayers; Mark Cavanaugh; Karen Clark; James Ostell; Kim D Pruitt; Ilene Karsch-Mizrachi Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971