Jérôme Waldispühl, Srinivas Devadas, Bonnie Berger, Peter Clote.
Abstract
The diversity and importance of the role played by RNAs in the regulation and development of the cell are now well-known and well-documented. This broad range of functions is achieved through specific structures that have been (presumably) optimized through evolution. State-of-the-art methods, such as McCaskill's algorithm, use a statistical mechanics framework based on the computation of the partition function over the canonical ensemble of all possible secondary structures on a given sequence. Although secondary structure predictions from thermodynamics-based algorithms are not as accurate as methods employing comparative genomics, the former methods are the only available tools to investigate novel RNAs, such as the many RNAs of unknown function recently reported by the ENCODE consortium. In this paper, we generalize the McCaskill partition function algorithm to sum over the grand canonical ensemble of all secondary structures of all mutants of the given sequence. Specifically, our new program, RNAmutants, simultaneously computes for each integer k the minimum free energy structure MFE(k) and the partition function Z(k) over all secondary structures of all k-point mutants, even allowing the user to specify certain positions required not to mutate and certain positions required to base-pair or remain unpaired. This technically important extension allows us to study the resilience of an RNA molecule to pointwise mutations. By computing the mutation profile of a sequence, a novel graphical representation of the mutational tendency of nucleotide positions, we analyze the deleterious nature of mutating specific nucleotide positions or groups of positions. We have successfully applied RNAmutants to investigate deleterious mutations (mutations that radically modify the secondary structure) in the Hepatitis C virus cis-acting replication element and to evaluate the evolutionary pressure applied on different regions of the HIV trans-activation response element. 
In particular, we show qualitative agreement between published Hepatitis C and HIV experimental mutagenesis studies and our analysis of deleterious mutations using RNAmutants. Our work also predicts other deleterious mutations, which could be verified experimentally. Finally, we provide evidence that the 3' UTR of the GB RNA virus C has been optimized to preserve evolutionarily conserved stem regions from a deleterious effect of pointwise mutations. We hope that there will be long-term potential applications of RNAmutants in de novo RNA design and drug design against RNA viruses. This work also suggests potential applications for large-scale exploration of the RNA sequence-structure network. Binary distributions are available at http://RNAmutants.csail.mit.edu/.
Year: 2008 PMID: 18688270 PMCID: PMC2475669 DOI: 10.1371/journal.pcbi.1000124
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Schematic representation of the k-mutant Boltzmann ensemble sampled by RNAmutants.
The input RNA sequence is represented at the center, while the k-neighbourhoods (here k = 1, 2) are represented by concentric rings. Each individual RNA sequence is associated with a set of secondary structures that can be mapped onto it (the boxed structures). These comprise the set of structures that have to be enumerated to compute the Boltzmann partition function.
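The k-neighbourhoods in this figure can be made concrete with a short enumeration. The following is a minimal Python sketch, not part of RNAmutants; the function names `k_mutants` and `neighbourhood_size` are hypothetical. It lists every sequence at Hamming distance exactly k from the input and checks the count against the closed form C(n, k)·3^k.

```python
from itertools import combinations, product
from math import comb

NUCLEOTIDES = "ACGU"

def k_mutants(seq, k):
    """Yield every RNA sequence that differs from seq at exactly k positions."""
    for positions in combinations(range(len(seq)), k):
        # At each chosen position, the three nucleotides other than the original.
        choices = [[n for n in NUCLEOTIDES if n != seq[p]] for p in positions]
        for replacement in product(*choices):
            mutant = list(seq)
            for p, n in zip(positions, replacement):
                mutant[p] = n
            yield "".join(mutant)

def neighbourhood_size(length, k):
    """Number of k-point mutants of a sequence of the given length."""
    return comb(length, k) * 3 ** k

if __name__ == "__main__":
    for k in range(3):
        print(k, len(list(k_mutants("GACU", k))), neighbourhood_size(4, k))
```

The neighbourhood grows exponentially with k, which is why explicit enumeration is only practical for very short sequences.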
Figure 2. Feynman diagram of original recursions from McCaskill's algorithm [20] to compute the partition function and extension (in red) to RNAmutants recursions.
Sequence indices are given below the diagram. Shaded half-disks represent secondary structures with at least one base pair and correspond to recursive calls of the partition function computations. The labels give the type of the recursion. The dashed arc lines represent base pairs. The extensions brought by RNAmutants to the McCaskill recursions are highlighted in red and address the labeling of the mutant sequence. The distribution of the mutations is determined using the recursive equations described in the section Partition Function for Mutant RNA in Methods. Wavy lines represent ensembles of sequences with a fixed number of mutations and an empty secondary structure, while dashed lines are mutant sequences to be recursively determined.
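The idea of tracking the number of mutations inside a partition function recursion can be illustrated on a much simpler energy model. The sketch below is an assumption-laden toy, not the RNAmutants recursions: every base pair contributes the same energy `pair_energy` (a Nussinov-style model rather than the nearest-neighbor model), hairpin loops must contain at least `min_loop` unpaired bases, and the function name `mutant_partition` is hypothetical. It returns Z(k) for k = 0..kmax, summing over all k-point mutants and all of their secondary structures.

```python
import math
from functools import lru_cache

NUCS = "ACGU"
# Watson-Crick and GU wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def mutant_partition(seq, kmax, pair_energy=-1.0, rt=0.616, min_loop=3):
    """Toy Z(k): sum of Boltzmann factors over all k-mutants and structures."""
    n = len(seq)
    w = math.exp(-pair_energy / rt)  # Boltzmann weight of one base pair

    @lru_cache(maxsize=None)
    def z(i, j, k):
        # Partition function of seq[i..j] (inclusive) with exactly k mutations.
        if k < 0:
            return 0.0
        if j < i:
            return 1.0 if k == 0 else 0.0
        # Case 1: position j is unpaired, either kept or replaced by
        # one of the three other nucleotides (one mutation).
        total = z(i, j - 1, k) + 3.0 * z(i, j - 1, k - 1)
        # Case 2: position j pairs with r; choose the nucleotides a, b
        # at r and j, then split the remaining mutations between the
        # region left of r and the region enclosed by the pair.
        for r in range(i, j - min_loop):
            for a, b in PAIRS:
                rest = k - (a != seq[r]) - (b != seq[j])
                for k1 in range(rest + 1):
                    total += z(i, r - 1, k1) * w * z(r + 1, j - 1, rest - k1)
        return total

    return [z(0, n - 1, k) for k in range(kmax + 1)]

if __name__ == "__main__":
    print(mutant_partition("GAAAC", 2))
```

Each (mutant sequence, structure) pair is counted exactly once because the rightmost position is either unpaired or paired with a unique partner, and the nucleotides at mutated positions are fixed at the moment the position is consumed.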
Figure 3. Time complexity measured for all Hepatitis C virus (HCV) stem-loop IV (SLIV) sequences from the Rfam seed alignment.
(A) The x-axis represents the maximum number k of mutations, while the y-axis represents the time (in seconds) required by RNAmutants to compute the partition function Z(i) for each 0≤i≤k, and to sample 10 sequences and structures from the corresponding Boltzmann ensemble. Input length of HCV SLIV sequences is 37 nt. The average time over all 110 seed sequences of HCV SLIV is indicated by tick marks, while error bars represent ±1 standard deviation. (B) The x-axis represents the length of the input sequence, while the y-axis represents the time (in seconds) required by RNAmutants to compute the complete partition function Z for all mutants (i.e., all possible sequences of a given length). A logarithmic scale is used for both axes. For each length, the average time over five random sequences is indicated by tick marks, while error bars represent ±1 standard deviation. For comparison, a curve y = K·x^5 representing the theoretical bound of the time complexity is also plotted.
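On a log-log plot a power law y = K·x^5 appears as a straight line of slope 5, so the empirical exponent can be read off by a least-squares fit in log space. The sketch below shows one way to do this in plain Python; the function name `loglog_slope` is hypothetical, and the timings are synthetic values generated from the formula with an arbitrary constant, not measurements from the paper.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Synthetic data only: times generated as K * n**5 with an arbitrary K.
lengths = [20, 40, 80, 160]
times = [2e-9 * n ** 5 for n in lengths]

if __name__ == "__main__":
    print(round(loglog_slope(lengths, times), 6))
```

With measured timings, a fitted slope at or below 5 would be consistent with the plotted theoretical bound.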
Figure 4. Overview of the sampling procedure.
Dashed lines represent the regions which must be recursively sampled. The recursive calls are indicated by an arrow, and labeled when multiple recursive calls are performed. Wavy lines show the base pairs created during the execution of the algorithm. Dots indicate nucleotides sampled in the function and are never involved in a recursive call. The number of mutations is determined using the recursive equations of the section Partition Function for Mutant RNA in Methods.
Figure 5. Complete mutation landscape of Hepatitis C virus stem-loop IV (HCV SLIV).
(A) Mutation profile of HCV SLIV, averaged over all 110 seed sequences from Rfam, which depicts the probability of mutation of a residue at a level k (i.e. among all k-point mutants). This profile corresponds to a 37×37 matrix $M = (m_{x,y})$, where x denotes the position within the input HCV SLIV sequence (x-axis) and y denotes the mutation level or number of mutations (y-axis). Mutation frequency computed from sampled structures is represented as a gray level: probability of 1 is depicted as black while probability of 0 is depicted as white, and values of y increase from bottom to top. Sequence logo and the consensus secondary structure from the Rfam seed alignment appear below the mutation profile. (B) Superposition of k-superoptimal free energy and k-mutant ensemble free energy, as computed by RNAmutants; the x-axis represents the number of mutations and the y-axis represents free energy in kcal/mol. Note that the k-mutant ensemble free energy $-RT \ln Z$ is lower than the k-superoptimal free energy, a situation analogous to the fact that the ensemble free energy $-RT \ln Z$ is lower than the minimum free energy in the output of RNAfold. This may seem paradoxical, unless one realizes that ensemble free energy is not the same as the mean free energy $\mu = \sum_{S} E(S) \cdot \exp(-E(S)/RT)/Z$, which can be computed by the method of [53] or by the classical statistical mechanics formula [33].
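The relation between the three quantities in panel (B) can be checked numerically on a toy ensemble. The sketch below is only an illustration: the list of structure energies is hypothetical, RT = 0.616 kcal/mol is an assumed value for roughly 37 °C, and `free_energies` is not a function of RNAmutants or RNAfold. It computes the ensemble free energy −RT ln Z, the minimum free energy, and the Boltzmann-weighted mean free energy.

```python
import math

RT = 0.616  # kcal/mol, assumed value near 37 degrees Celsius

def free_energies(energies, rt=RT):
    """Return (ensemble free energy, minimum free energy, mean free energy)."""
    weights = [math.exp(-e / rt) for e in energies]
    z = sum(weights)                      # partition function Z
    ensemble = -rt * math.log(z)          # -RT ln Z
    mfe = min(energies)                   # minimum free energy
    mean = sum(e * w for e, w in zip(energies, weights)) / z
    return ensemble, mfe, mean

# Hypothetical energies (kcal/mol) of four structures on one sequence.
toy = [-5.0, -4.2, -3.0, 0.0]
ensemble, mfe, mean = free_energies(toy)

if __name__ == "__main__":
    print(ensemble, mfe, mean)
```

Because Z is at least as large as the Boltzmann factor of the minimum free energy structure, −RT ln Z can never exceed the minimum free energy, while the mean free energy can never fall below it.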
Base pair distance between the sampled and native structures for cis-regulatory elements from Hepatitis C virus and HIV
| RNA | #seq | length | #bp | k = 0 | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 |
| HCV CRE | 52 | 51.0 | 14 | 1.8/0 | 2.7/0 | 6.1/2 | 8.6/1 | 10.8/9 | 12.4/10 |
| HCV SLIV | 110 | 35.0 | 15 | 0.3/2 | 0.3/1 | 0.3/1 | 0.3/1 | 0.3/1 | 0.3/1 |
| HIV PBS | 388 | 94.8 | 17 | 12.3/4 | 14.9/5 | 16.7/1 | 17.6/1 | 18.3/1 | - |
| HIV FE | 853 | 51.9 | 10 | 7.6/1 | 7.7/2 | 7.7/2 | 7.6/2 | 7.4/2 | 7.2/2 |
| HIV GSL3 | 1403 | 81.1 | 8 | 9.3/0 | 9.1/0 | 8.9/0 | 9.2/0 | 9.6/0 | - |
Native structure is here taken as the Rfam consensus structure from the seed alignments of these elements of HCV and HIV. Two measures are given for each number of mutations, listed as average/centroid. The average distance represents the average base pair distance between sampled structures and the native secondary structure S0. The centroid represents the average base pair distance between sampled structures and the sample centroid c, where the latter is defined to consist of those base pairs occurring in strictly more than half the sampled structures. The number of sequences in the Rfam seed alignment, the average length and the number of base pairs in the native structure are given before the average and centroid distance values.
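The two measures defined in this caption can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: structures are assumed to be given in dot-bracket notation without pseudoknots, base pair distance is taken to be the size of the symmetric difference of the two base pair sets, and the function names are hypothetical.

```python
from collections import Counter

def parse_pairs(dot_bracket):
    """Convert a dot-bracket string into a set of base pairs (i, j), i < j."""
    stack, pairs = [], set()
    for i, c in enumerate(dot_bracket):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.add((stack.pop(), i))
    return pairs

def bp_distance(a, b):
    """Number of base pairs present in exactly one of the two structures."""
    return len(a ^ b)

def centroid(samples):
    """Base pairs occurring in strictly more than half of the samples."""
    counts = Counter(p for s in samples for p in s)
    return {p for p, c in counts.items() if 2 * c > len(samples)}

def average_distance(samples, reference):
    """Average base pair distance from each sample to a reference structure."""
    return sum(bp_distance(s, reference) for s in samples) / len(samples)

if __name__ == "__main__":
    native = parse_pairs("((...))")
    samples = [parse_pairs(s) for s in ["((...))", "(.....)", "......."]]
    print(average_distance(samples, native))
    print(average_distance(samples, centroid(samples)))
```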
Figure 6. Rfam [9] consensus secondary structure of Hepatitis C cis-acting replication element (HCV CRE) and the trans-activation response hairpin of the human immunodeficiency virus (HIV1 TAR).
Figure 7. Deleterious mutations identified in the ensemble sampled by RNAmutants on the input of 47 nt Hepatitis C virus cis-acting replication element (HCV CRE), known to be essential for viral replication.
Pointwise mutants are listed by decreasing order of break number (a measure of structural distortion, defined as the number of native base pairs that must be removed for a given structure to be compatible with the wild type structure). For each secondary structure listed, we display the base pair associated with the mutation, the mutation type (index and nucleotide substitution), the index and type of the nucleotide that can be associated with the concerned base pair, the frequency of this mutation and the break number.
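One possible reading of the break number can be sketched as follows. The interpretation is an assumption, since the caption does not spell out what "compatible" means: here a native base pair counts as broken when some base pair of the sampled structure either shares a nucleotide with it (with a different partner) or crosses it. The function names are hypothetical and the dot-bracket inputs are toy examples.

```python
def parse_pairs(dot_bracket):
    """Convert a dot-bracket string into a set of base pairs (i, j), i < j."""
    stack, pairs = [], set()
    for i, c in enumerate(dot_bracket):
        if c == "(":
            stack.append(i)
        elif c == ")":
            pairs.add((stack.pop(), i))
    return pairs

def conflicts(p, q):
    """True if base pairs p and q cannot coexist in one nested structure."""
    if p == q:
        return False
    (i, j), (k, l) = p, q
    if {i, j} & {k, l}:
        return True  # a nucleotide would need two different partners
    return i < k < j < l or k < i < l < j  # the two pairs cross

def break_number(native, structure):
    """Count native pairs that conflict with at least one pair of structure."""
    return sum(1 for p in native if any(conflicts(p, q) for q in structure))

if __name__ == "__main__":
    native = parse_pairs("((...))")
    print(break_number(native, parse_pairs(".(...).")))
    print(break_number(native, parse_pairs("(...)..")))
```

Under this reading, a structure that merely loses native base pairs has break number 0, whereas a structure that forms competing pairs is penalized once for every native pair it displaces.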
Mutants with mutations A11G, C35U, and C36U in the full alignment of the 47 nt Hepatitis C virus cis-acting replication (HCV CRE) element
| A11G | |
| AF054264.1/326-376 | (A1G),(A11G) |
| C35U | |
| D14853.1/9264-9314 | (A15U),(C25G),(G32A),(C35U),(A39G) |
| D16190.1/986-1036 | (A15U),(C25G),(G32A),(C35U),(A39G),(C45U) |
| D16192.1/986-1036 | (G2A),(A15U),(C25G),(C35U),(A39G),(C45U) |
| C36U | |
| D87352.1/983-1033 | (A1G),(A9G),(A13G),(C25G),(U26C),(U30C),(U33G),(C36U) |
| D37862.1/983-1033 | (A9G),(A13G),(A15U),(C25G),(U26C),(U30C),(U33G),(C36U),(U46A) |
| D49769.1/983-1033 | (A1G),(A9C),(C25G),(U30G),(U33A),(C36U) |
| D37859.1/983-1033 | (A9G),(A17U),(C25G),(U30C),(G32A),(U33G),(C36U),(A39G),(U46A) |
| D31973.1/986-1036 | (A1G),(A9C),(C25G),(U30G),(U33A),(C36U),(U46C) |
| D87358.1/983-1033 | (A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(U46A) |
| D87356.1/983-1033 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(C45U),(U46C) |
| D84263.2/9267-9317 | (A9G),(A17U),(C25G),(U30C),(U33G),(C36U),(A39G),(U46A) |
| AY973865.1/1663-1713 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(G32A),(U33G),(C36U),(C45U),(U46C) |
| D86543.1/983-1033 | (A1G),(A9G),(A15U),(A20G),(C25A),(U30C),(U33C),(C36U) |
| D87360.1/983-1033 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(G32A),(U33G),(C36U),(C45U),(U46C) |
| AY878650.1/9259-9309 | (A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(U46A) |
| D87354.1/983-1033 | (A9G),(A17U),(C25G),(U30C),(U33G),(C36U),(A39G),(U46A) |
| D49777.1/983-1033 | (A1G),(A9C),(C25G),(U30G),(U33A),(C36U) |
| D84264.2/9276-9326 | (A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(U46A) |
| D87357.1/983-1033 | (A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(U46A) |
| D38079.1/983-1033 | (A9G),(A17U),(C25G),(U30C),(G32A),(U33G),(C36U),(A39G),(U46A) |
| D84398.1/983-1033 | (A1G),(A9C),(C25G),(U30G),(U33A),(C36U),(U46C) |
| AY859526.1/9242-9292 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(G32A),(U33G),(C36U),(C45U),(U46C) |
| D87355.1/983-1033 | (A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(U46A) |
| AY973866.1/1663-1713 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(G32A),(U33G),(C36U),(C45U),(U46C) |
| D37855.1/983-1033 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(C45U),(U46C) |
| D84262.2/9289-9339 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(C45U),(U46C) |
| D84265.2/9273-9323 | (A1G),(A9G),(A13G),(C25G),(U26C),(U30C),(U33G),(C36U) |
| D50409.1/9341-9391 | (A1G),(A9C),(C25G),(U30G),(U33A),(C36U),(A39G),(U46C) |
| D87353.1/983-1033 | (A1G),(A9G),(A13G),(C25G),(U26C),(U30C),(U33G),(C36U) |
| D87359.1/983-1033 | (A9G),(A15U),(C25G),(U30C),(U33G),(C36U),(U46A) |
| D87363.1/983-1033 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(G32A),(U33G),(C36U),(C45U),(U46C) |
| D37860.1/983-1033 | (A9G),(A17U),(C25G),(U30C),(G32A),(U33G),(C36U),(A39G),(U46A) |
| D87362.1/983-1033 | (A1G),(G2A),(A9G),(A15U),(C25G),(U30C),(G32A),(U33G),(C36U),(C45U),(U46C) |
See text for a comparison of this table produced by RNAmutants with the experimental mutagenesis study of You et al. [39].
Figure 8. Scan of 57 nt human immunodeficiency virus trans-activation response elements (HIV-1 TAR) from the HIV-1 genome.
By sliding a window forward, for each 3 nt window in the Rfam seed alignment of HIV-1 TAR elements, we allow mutations only within this window, and subsequently compute the centroid and average distances. The starting position of the 3 nt window is given on the x-axis and the centroid (resp. average) distance is given on the y-axis. (See Results/Discussion for the definition of centroid and average distance.) Each curve shows the results computed with a fixed number of mutations in the frame: 1 mutation (A), 2 mutations (B), and 3 mutations (C).
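The window scan can be expressed as a sequence of position masks, one per window start. The sketch below only generates those masks; the function name `window_masks` is hypothetical, and how such a constraint is passed to RNAmutants is deliberately left out because the program's options are not described here.

```python
def window_masks(length, width=3):
    """Yield (start, mask) where mask[p] is True if position p may mutate."""
    for start in range(length - width + 1):
        yield start, [start <= p < start + width for p in range(length)]

if __name__ == "__main__":
    # Toy length; the scanned HIV-1 TAR elements are 57 nt long.
    for start, mask in window_masks(6):
        print(start, "".join("x" if m else "." for m in mask))
```

For a 57 nt element and a 3 nt window this yields 55 window positions, each of which would be analysed with 1, 2, or 3 mutations confined to the window.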
Figure 9. Base pair density in k-mutants (0≤k≤8).
The x-axis represents the number of mutations, while the y-axis represents the (normalized) frequency of base pairs (i, j) (i
Figure 10. Relative propensity of mutations occurring inside (A) and outside (B) of stem regions to base pair inside or outside the same region.
The statistics have been computed using a scanning window of size 50 with up to 8 mutations. When a mutation occurs in a stem region (panel A), we distinguish three cases: when the base pair is created inside the same stem region, when the base pair links another stem region, and when the mutation base pairs outside any stem region. In the case of a mutation happening outside the stem regions (panel B), we only need to distinguish whether the base pair links a stem region or not.
Figure 11. Differential probability of mutation associated with a base pair increasing mutation (A) or a base pair decreasing mutation (B).
The x-axis represents the number k of mutations, while the y-axis represents the differential probability $p^{+}(i, j) - p(i, j)$.
Figure 12. Average mutation rates computed from a scan of the 3′ UTR of GB virus C (GBV-C) with frames of 50 (A and B), 100 (C and D), and 150 nucleotides (E and F).
Evolutionarily conserved stem loops identified in [48] are indicated with shaded regions. Profiles with no restriction on the length j−i of the base pair (i, j) associated with the mutations are given in the left column, while those for medium and long range base pairing (length ≥25 nt) are shown in the right column.
Distribution of the mutations inside versus outside the evolutionarily conserved RNA stem loops SLI to SLVII corresponding to the profiles of Figure 12
| Frame size | 50 nt | | 100 nt | | 150 nt | |
| Location w.r.t. RNA regions | In | Out | In | Out | In | Out |
| All mutations | 48% | 52% | 39% | 61% | 38% | 62% |
| In a base pair of size ≥25 nt | 41% | 59% | 27% | 73% | 24% | 76% |
The first row presents statistics computed for all mutations, while the second row presents statistics for mutations involved in a base pair (i, j) of length |j−i|≥25.
Figure 13. Probability of mutations occurring in a base pair (i, j), whose length j−i exceeds a certain threshold.
The x-axis represents the threshold value for base pair length. Results are reported for frame sizes of 50 (A), 100 (B), and 150 (C). The fractions of mutations satisfying the criteria in the sample set are given using the dashed line.