Michael Golden, Benjamin Murrell, Darren Martin, Oliver G Pybus, Jotun Hein.
Abstract
Pairs of nucleotides within functional nucleic acid secondary structures often display evidence of coevolution that is consistent with the maintenance of base-pairing. Here, we introduce a sequence evolution model, MESSI (Modeling the Evolution of Secondary Structure Interactions), that infers coevolution associated with base-paired sites in DNA or RNA sequence alignments. MESSI can estimate coevolution while accounting for an unknown secondary structure. MESSI can also use graphics processing unit parallelism to increase computational speed. We used MESSI to infer coevolution associated with GC, AU (AT in DNA), and GU (GT in DNA) pairs in noncoding RNA alignments, and in single-stranded RNA and DNA virus alignments. Estimates of GU pair coevolution were found to be higher at base-paired sites in single-stranded RNA viruses and noncoding RNAs than estimates of GT pair coevolution in single-stranded DNA viruses. A potential biophysical explanation is that GT pairs do not stabilize DNA secondary structures to the same extent that GU pairs do in RNA. Additionally, MESSI estimates the degrees of coevolution at individual base-paired sites in an alignment. These estimates were computed for a SHAPE-MaP-determined HIV-1 NL4-3 RNA secondary structure. We found that estimates of coevolution were more strongly correlated with experimentally determined SHAPE-MaP pairing scores than three nonevolutionary measures of base-pairing covariation. To assist researchers in prioritizing substructures with potential functionality, MESSI automatically ranks substructures by degrees of coevolution at base-paired sites within them. Such a ranking was created for an HIV-1 subtype B alignment, revealing an excess of top-ranking substructures that have been previously identified as having structure-related functional importance, among several uncharacterized top-ranking substructures.
Keywords: coevolution; evolution; nucleic acid structure; probabilistic model
Year: 2020 PMID: 31665393 PMCID: PMC6993869 DOI: 10.1093/molbev/msz243
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Summary of secondary structure prediction benchmarks. Structure predictions were performed on 99 RFAM data sets using three different comparative structure prediction methods (MESSI, RNAalifold, and PPFold).
Fig. 2. Paired site likelihoods calculation timings in seconds (log10 axis) as a function of the number of unique paired partial site patterns (log10 axis). Numbers above the GPU timings indicate the fold speed-up over the CPU version.
Fig. 3. Inside algorithm timings in seconds (log10 axis) as a function of the number of alignment sites. Numbers above the GPU timings indicate the fold speed-up over the CPU version.
Tests of the GU/GT Neutral Hypothesis across 15 Data Sets: Five Noncoding RNA Alignments from the RFAM Database (denoted by the prefix “RF”), Five ssRNA Virus Alignments (foot-and-mouth disease, human poliovirus 1, tobamovirus, rhinovirus A, and hepatitis A virus), and Five ssDNA Virus Alignments (maize streak virus, tomato yellow leaf curl virus, beet curly top virus, wheat dwarf virus, and bocavirus).
| Data Set | Type | Number of Sites | Potentially Recombinant: LRT (M1-M0) | Potentially Recombinant: Significance | Potentially Recombinant: Estimated GU/GT Rate | Recombinant Regions Separated: LRT (M1-M0) | Recombinant Regions Separated: Significance | Recombinant Regions Separated: Bootstrap | Recombinant Regions Separated: Estimated GU/GT Rate |
|---|---|---|---|---|---|---|---|---|---|
| RF00001 | ncRNA | 230 | 226.80 | | 2.15 | 211.41 | | | 2.17 |
| RF00003 | ncRNA | 203 | 46.57 | | 2.57 | 43.51 | | | 2.56 |
| RF00010 | ncRNA | 996 | 2,964.36 | | 2.35 | 797.29 | | | 2.32 |
| RF00379 | ncRNA | 335 | 38.78 | | 1.97 | 35.31 | | | 1.96 |
| RF01846 | ncRNA | 624 | 101.72 | | 2.18 | 71.74 | | | 2.10 |
| FMDV | ssRNA | 8,349 | 336.64 | | 2.75 | 211.53 | | n.c. | 2.50 |
| Hepatitis A | ssRNA | 7,572 | 2.33 | | 1.28 | 3.27 | | n.c. | 1.30 |
| H. poliovirus 1 | ssRNA | 7,668 | 132.18 | | 3.52 | 140.32 | | n.c. | 3.38 |
| Rhinovirus A | ssRNA | 7,308 | 2,255.91 | | 10.63 | 2,188.94 | | n.c. | 8.70 |
| Tobamovirus | ssRNA | 6,849 | 90.81 | | 2.17 | 86.96 | | n.c. | 2.23 |
| BCTV | ssDNA | 3,215 | 0.18 | n.s. | 1.08 | 0.10 | n.s. | n.s. | 1.06 |
| Bocavirus | ssDNA | 5,577 | 0.00 | n.s. | 1.00 | 0.00 | n.s. | n.c. | 1.00 |
| MSV | ssDNA | 2,755 | 3.77 | | 1.35 | 0.00 | n.s. | n.s. | 1.00 |
| TYLCV | ssDNA | 2,925 | 4.12 | | 1.50 | 0.00 | n.s. | n.s. | 1.00 |
| WDV | ssDNA | 2,755 | 0.04 | n.s. | 1.04 | 0.00 | n.s. | n.s. | 1.04 |
Note: n.s., not significant; n.c., not computable. Significance levels: P < 0.05; P < 0.005; P < 0.0005.
Fig. 4. Estimated posterior probabilities for all six orderings of the three base coevolution rates across 15 data sets.
Spearman’s Correlations (ρ) and 95% Confidence Intervals (ρ 95% CI) between Five Different Measures of Covariation/Coevolution and (i) Base-Pair Averaged SHAPE-MaP Reactivities and (ii) Base-Pair Averaged SHAPE-MaP Pairing Probabilities.
| Data Set | Measure | SHAPE-MaP Reactivities: ρ | SHAPE-MaP Reactivities: ρ 95% CI | SHAPE-MaP Reactivities: Significance | SHAPE-MaP Pairing Probabilities: ρ | SHAPE-MaP Pairing Probabilities: ρ 95% CI | SHAPE-MaP Pairing Probabilities: Significance |
|---|---|---|---|---|---|---|---|
| HIV-1 Subtype B | A. Mutual information (MI) | −0.01 | [−0.051, 0.035] | n.s. | | [0.069, 0.154] | |
| HIV-1 Subtype B | B. RNAAlifold MI | 0.01 | [−0.033, 0.053] | n.s. | 0.01 | [−0.030, 0.056] | n.s. |
| HIV-1 Subtype B | C. MI with stacking | −0.02 | [−0.060, 0.026] | n.s. | | [0.059, 0.144] | |
| HIV-1 Subtype B | D. | | [−0.202, −0.118] | | | [0.147, 0.230] | |
| HIV-1 Subtype B | E. Posterior mean | | [−0.180, −0.095] | | | [0.162, 0.244] | |
| HIV-1 Group 1M | A. Mutual information (MI) | 0.03 | [−0.016, 0.070] | n.s. | | [0.047, 0.132] | |
| HIV-1 Group 1M | B. RNAAlifold MI | 0.03 | [−0.013, 0.073] | n.s. | | [0.033, 0.119] | |
| HIV-1 Group 1M | C. MI with stacking | 0.00 | [−0.043, 0.043] | n.s. | | [0.081, 0.166] | |
| HIV-1 Group 1M | D. | | [−0.225, −0.142] | | | [0.227, 0.307] | |
| HIV-1 Group 1M | E. Posterior mean | | [−0.197, −0.113] | | | [0.251, 0.330] | |
| SIVmac239 | A. Mutual information (MI) | 0.09 | [0.053, 0.132] | | −0.04 | [−0.084, −0.005] | |
| SIVmac239 | B. RNAAlifold MI | 0.12 | [0.086, 0.164] | | −0.07 | [−0.114, −0.035] | |
| SIVmac239 | C. MI with stacking | 0.10 | [0.057, 0.135] | | −0.01 | [−0.046, 0.033] | n.s. |
| SIVmac239 | D. | | [−0.160, −0.082] | | | [0.153, 0.229] | |
| SIVmac239 | E. Posterior mean | | [−0.137, −0.058] | | | [0.164, 0.240] | |
Note: Underlined values indicate correlations that are statistically significant and in the expected direction. n.s., not significant. Significance levels: P < 0.05; P < 0.005; P < 0.0005.
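The correlations in the table above are Spearman rank correlations reported with 95% confidence intervals. As a rough illustration of how such a pair of numbers can be obtained, the sketch below computes Spearman's ρ from average ranks and an approximate interval using the Fisher z-transform. The interval method is an assumption made for illustration (the source does not state how its intervals were computed), and the function names `average_ranks` and `spearman_with_ci` are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_with_ci(x, y, z_crit=1.959964):
    """Spearman's rho plus an approximate 95% CI (Fisher z; needs n > 3)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    rho = cov / math.sqrt(var_x * var_y)
    # Clamp so atanh stays finite when the ranks agree perfectly.
    clamped = max(min(rho, 1 - 1e-12), -1 + 1e-12)
    z = math.atanh(clamped)
    se = 1 / math.sqrt(n - 3)  # assumed large-sample standard error
    return rho, (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))


# Toy data only; not values from the study.
rho, (lo, hi) = spearman_with_ci([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
print(round(rho, 3), round(lo, 3), round(hi, 3))
```

With only a handful of points the interval is very wide; the narrow intervals in the table reflect the much larger number of base-paired positions being compared.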
SHAPE Structure Ranking.
| Rank | Alignment Position | NL4-3 Position | Length | Name | Median Degree of Coevolution | z-score |
|---|---|---|---|---|---|---|
| 1 | 8233–8582 | 7249–7595 | 350 | Rev response element (RRE) | 5.38 | 5.02 |
| 2 | 2608–2943 | 1991–2326 | 336 | Longest continuous helix | 5.17 | 2.92 |
| 3 | 10155–10383 | 8982–9170 | 229 | 3′-Untranslated region (3′-UTR) | 5.27 | 2.69 |
| 4 | 588–838 | 105–344 | 251 | 5′-Untranslated region (5′-UTR) | 5.65 | 2.61 |
| 5 | 9570–9584 | 8440–8454 | 15 | | 5.91 | 2.29 |
| 6 | 860–979 | 366–485 | 120 | 5′-Untranslated region (5′-UTR) | 5.54 | 2.28 |
| 7 | 1710–1845 | 1177–1312 | 136 | | 5.17 | 2.28 |
| 8 | 2115–2301 | 1561–1711 | 187 | Gag-pol frameshift | 5.31 | 2.21 |
| 9 | 1479–1490 | 946–957 | 12 | | 5.85 | 2.04 |
| 10 | 3886–3907 | 3269–3290 | 22 | | 5.80 | 2.01 |
Note: The top 10 of 86 nonoverlapping HIV NL4-3 substructures, ranked from highest to lowest z-score based on the estimated degrees of coevolution within an alignment of HIV-1 subtype B sequences. The HIV NL4-3 SHAPE-MaP secondary structure was used as the canonical structure.
Consensus Structure Ranking.
| Rank | Alignment Position | NL4-3 Position | Length | Name | Median Degree of Coevolution | z-score |
|---|---|---|---|---|---|---|
| 1 | 8240–8577 | 7256–7590 | 338 | Rev response element (RRE) | 5.64 | 6.53 |
| 2 | 2202–2229 | 1645–1672 | 28 | Gag-pol frameshift | 8.17 | 4.56 |
| 3 | 1710–1845 | 1177–1312 | 136 | | 6.44 | 4.50 |
| 4 | 4751–4833 | 4134–4216 | 83 | | 6.47 | 3.97 |
| 5 | 4505–4709 | 3888–4092 | 205 | | 5.22 | 3.21 |
| 6 | 591–939 | 108–445 | 349 | 5′-Untranslated region (5′-UTR) | 5.38 | 3.16 |
| 7 | 133–151 | NA | 19 | | 6.85 | 2.94 |
| 8 | 2564–2890 | 1947–2273 | 327 | Longest continuous helix | 4.44 | 2.62 |
| 9 | 9782–9800 | 8645–8663 | 19 | | 6.92 | 2.55 |
| 10 | 3612–3623 | 2995–3006 | 12 | | 6.74 | 2.50 |
Note: The top 10 of 118 nonoverlapping HIV consensus substructures, ranked from highest to lowest z-score based on their degrees of coevolution within an alignment of HIV-1 subtype B sequences. The canonical structure was treated as unknown, and a consensus structure was predicted by MESSI.
Fig. 5. Visualization of several top-ranking substructures in the SHAPE-MaP structure and consensus structure rankings. NL4-3 SHAPE-MaP experimental reactivities are mapped and visually overlaid using the same color scheme as in Watts et al. (2009). Depicted within each nucleotide is a sequence logo summarizing the nucleotide composition at the corresponding alignment position. Mean degrees of coevolution inferred using MESSI are depicted for each base-pair using colored links (blue–green–yellow gradient).
Parameters of the Unconstrained Model and Their Distributions.
| Parameter and Distribution | Marginalized or Estimated | Description |
|---|---|---|
| | Estimated | Probability of neutral coevolution. |
| | Marginalized | Indicates neutral coevolution at position |
| | Estimated | Shape and rate parameters of prior over coevolution rates. |
| | Marginalized | The rate of coevolution at each paired position. |
| | Estimated | Shape and rate parameters of prior over substitution rates. |
| | Marginalized | Substitution rate at each unpaired position. |
| | Estimated | GTR equilibrium frequencies of the four nucleotides. |
| | Estimated | GTR rate matrix entry AC. |
| | Estimated | GTR rate matrix entry AG. |
| | Estimated | GTR rate matrix entry AT. |
| | Estimated | GTR rate matrix entry CG. |
| | Estimated | GTR rate matrix entry CT. |
| | Estimated | GTR rate matrix entry GT. |
| | Estimated | GC coevolution rate. |
| | Estimated | AT coevolution rate. |
| | Estimated | GT coevolution rate. |
| | Marginalized | The secondary structure is drawn from the KH99 SCFG prior. |
Fig. 6. Examples of secondary structure representations. Above (A) is a dot-bracket representation of a secondary structure, with the corresponding VARNA and circular visualizations (B and C, respectively) produced by VARNA (Darty et al. 2009). Below (D) is an extended dot-bracket notation format with an additional bracket type, <>, that allows a pseudoknotted structure to be represented unambiguously. (E) and (F) are the corresponding VARNA visualizations for (D). Note how the overlapping bonds in the circular visualization (F) demonstrate that the secondary structure is pseudoknotted.
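To make the notation in this caption concrete, the following minimal sketch parses a dot-bracket string that may use both `()` and `<>` bracket types and checks whether any two base pairs cross, which is what the overlapping arcs in a circular drawing indicate. The function names `parse_pairs` and `is_pseudoknotted` and the example strings are hypothetical and are not taken from MESSI or VARNA.

```python
def parse_pairs(dot_bracket):
    """Return sorted (i, j) base pairs (0-based) from extended dot-bracket text."""
    closers = {")": "(", ">": "<"}
    stacks = {"(": [], "<": []}  # one stack per bracket type
    pairs = []
    for pos, symbol in enumerate(dot_bracket):
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in closers:
            opener = closers[symbol]
            if not stacks[opener]:
                raise ValueError(f"unmatched '{symbol}' at position {pos}")
            pairs.append((stacks[opener].pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol '{symbol}' at position {pos}")
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{opener}' at position {stack[-1]}")
    return sorted(pairs)


def is_pseudoknotted(pairs):
    """True if two pairs (i, j) and (k, l) cross, that is, i < k < j < l."""
    return any(i < k < j < l for (i, j) in pairs for (k, l) in pairs)


nested = parse_pairs("((..))")
knotted = parse_pairs("((..<<..))..>>")
print(nested, is_pseudoknotted(nested))
print(knotted, is_pseudoknotted(knotted))
```

Keeping one stack per bracket type is what lets the second bracket pair close after the first without ambiguity; with a single bracket type the same structure could not be written down.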
Fig. 7. Illustrations of the inside algorithm showing CPU and GPU parallelism schemes. The light to dark blue gradient starting at the central diagonal and finishing in the top right-hand corner indicates the order in which each diagonal is computed. The light red elements indicate the data dependencies required to compute the single bright red entry of the inside matrix. The lower half of each matrix with each cell crossed out is not computed and can be ignored. Note that the top-right element corresponds to the structure-integrated likelihood term and is therefore always the last element to be calculated, as it depends on all other elements having been computed first.
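The diagonal ordering described in this caption can be sketched with a small inside algorithm over the KH99 grammar (S → LS | L, F → dFd | LS, L → s | dFd). This is a simplified illustration rather than MESSI's implementation: the rule probabilities are hypothetical placeholders, every emission is given weight 1 instead of a phylogenetic site likelihood, and the function name `kh99_inside` is an assumption. Cells on the same diagonal depend only on earlier diagonals, which is the property a parallel scheme can exploit.

```python
def kh99_inside(n, rules=None):
    """Fill the inside matrix for S over spans of an n-site alignment."""
    # Hypothetical rule probabilities, chosen only for illustration.
    p = rules or {"S->LS": 0.9, "S->L": 0.1,
                  "F->dFd": 0.8, "F->LS": 0.2,
                  "L->s": 0.9, "L->dFd": 0.1}
    S = [[0.0] * n for _ in range(n)]
    L = [[0.0] * n for _ in range(n)]
    F = [[0.0] * n for _ in range(n)]
    for d in range(n):            # d = j - i: main diagonal first
        for i in range(n - d):    # cells on one diagonal are independent
            j = i + d
            # Bifurcation term shared by S->LS and F->LS (shorter spans only).
            split = sum(L[i][k] * S[k + 1][j] for k in range(i, j))
            # Pairing i with j encloses F on the span i+1..j-1.
            inner = F[i + 1][j - 1] if d >= 2 else 0.0
            L[i][j] = (p["L->s"] if d == 0 else 0.0) + p["L->dFd"] * inner
            F[i][j] = p["F->dFd"] * inner + p["F->LS"] * split
            S[i][j] = p["S->LS"] * split + p["S->L"] * L[i][j]
    return S


inside = kh99_inside(6)
# The top-right cell is filled last and sums over all structures.
print(inside[0][5])
```

Within a cell, L is computed before S because S → L reads the L value for the same span; everything else reads strictly shorter spans, so the only ordering constraint between cells is diagonal by diagonal.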