Joshua Thody1, Leighton Folkes2, Zahara Medina-Calzada2, Ping Xu2, Tamas Dalmay2, Vincent Moulton1.
Abstract
Small RNAs (sRNAs) are short, non-coding RNAs that play critical roles in many important biological pathways. They suppress the translation of messenger RNAs (mRNAs) by directing the RNA-induced silencing complex to their sequence-specific mRNA target(s). In plants, this typically results in mRNA cleavage and subsequent degradation of the mRNA. The resulting mRNA fragments, or degradome, provide evidence for these interactions, and thus degradome analysis has become an important tool for sRNA target prediction. Even so, with the continuing advances in sequencing technologies, not only are larger and more complex genomes being sequenced, but degradome and associated datasets are also growing in both number and read count. As a result, existing degradome analysis tools are unable to process the volume of data being produced without imposing huge resource and time requirements. Moreover, these tools use stringent, non-configurable targeting rules, which reduces their flexibility. Here, we present a new, user-configurable software tool for degradome analysis, which employs a novel search algorithm and sequence encoding technique to reduce the search space during analysis. The tool significantly reduces the time and resources required to perform degradome analysis, in some cases providing more than two orders of magnitude speed-up over current methods.
Year: 2018 PMID: 30007348 PMCID: PMC6158750 DOI: 10.1093/nar/gky609
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. An overview of different stages of the PAREsnip2 algorithm. (A) Shows the inputs and processing steps performed to predict sRNA targets evidenced through degradome sequencing. (B) Shows the process of encoding sequence data into a number system. (C) Visual representation of the three-stage candidate filtering process. Regions are labelled R and target regions are labelled TR.
The 2-bit binary encoding of nucleotides within sequence data
| Nucleotide base | sRNA Encoding | mRNA Encoding |
|---|---|---|
| A | 0 0 | 1 1 |
| C | 0 1 | 1 0 |
| T/U | 1 1 | 0 0 |
| G | 1 0 | 0 1 |
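In the table above, each sRNA base shares its 2-bit code with its Watson-Crick partner in the mRNA column (for example, sRNA A and mRNA T/U are both `00`). A minimal Python sketch of how such an encoding could be used follows. The function names, the packing of a sequence into a single integer, and the reversal of the mRNA site are assumptions made for illustration, not details taken from the tool itself.

```python
# 2-bit codes copied from the encoding table above.
SRNA_CODE = {"A": 0b00, "C": 0b01, "U": 0b11, "T": 0b11, "G": 0b10}
MRNA_CODE = {"A": 0b11, "C": 0b10, "U": 0b00, "T": 0b00, "G": 0b01}


def encode(seq, table):
    """Pack a nucleotide string into one integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | table[base]
    return value


def is_perfect_complement(srna, mrna_site):
    """True if every sRNA base forms a Watson-Crick pair with the site.

    Assumption: both sequences are written 5'->3', so the mRNA site is
    reversed before encoding to align it antiparallel to the sRNA.
    """
    if len(srna) != len(mrna_site):
        return False
    return encode(srna, SRNA_CODE) == encode(mrna_site[::-1], MRNA_CODE)
```

Under these assumptions, a perfectly complementary duplex reduces to an integer equality check, which suggests why a numeric encoding can shrink the search space compared with character-by-character comparison.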
Features of an sRNA–mRNA alignment that are used during the duplex alignment process and can be configured by the user
| Configurable Search Parameters | |
|---|---|
| Maximum score | Maximum adjacent mismatches |
| Maximum G/U Wobble Pairs | Maximum Mismatches |
| Mismatch Score | G/U Wobble Score |
| Gap Score | Permissible Mismatch Positions |
| Core Region Start Position | Core Region End Position |
| Maximum Mismatches Core Region | Maximum Adjacent Mismatches Core Region |
| Allow Mismatch Position 10 | Position 10 Mismatch Score |
| Allow Mismatch Position 11 | Position 11 Mismatch Score |
| Core Region Multiplier | Non-permissible Mismatch Positions |
| Max Gaps Allowed | G/U Wobble Counts as Mismatch |
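To illustrate how a handful of these parameters might interact, here is a hedged Python sketch of a position-weighted duplex score. The field names mirror entries in the table above, but the `TargetingRules` class, the `score_duplex` function, and every numeric value are hypothetical placeholders, not the tool's defaults or its implementation. Gaps, adjacent-mismatch limits, and the position 10 and 11 rules are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class TargetingRules:
    # Names follow the table above; values are illustrative
    # placeholders, NOT the tool's defaults.
    max_score: float = 4.0
    mismatch_score: float = 1.0
    gu_wobble_score: float = 0.5
    core_region_start: int = 2
    core_region_end: int = 13
    core_region_multiplier: float = 2.0
    max_mismatches: int = 4


WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def score_duplex(srna, target, rules):
    """Score an ungapped duplex; return None if a limit is exceeded.

    srna is read 5'->3' in uppercase RNA letters; target[i] is the
    mRNA base sitting opposite srna[i] (already aligned).
    """
    score = 0.0
    mismatches = 0
    for pos, pair in enumerate(zip(srna, target), start=1):
        if pair in WATSON_CRICK:
            continue
        if pair in WOBBLE:
            penalty = rules.gu_wobble_score
        else:
            penalty = rules.mismatch_score
            mismatches += 1
        # Penalties inside the core region are scaled by the multiplier.
        if rules.core_region_start <= pos <= rules.core_region_end:
            penalty *= rules.core_region_multiplier
        score += penalty
    if score > rules.max_score or mismatches > rules.max_mismatches:
        return None
    return score
```

Keeping the rules in one object means a stricter or looser rule set can be swapped in without touching the scoring loop, which is the kind of flexibility the configurable parameters are meant to provide.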
Summary statistics from the sequencing of three Arabidopsis thaliana degradome replicates (NR = non-redundant)
| Replicate | Untrimmed Reads | Untrimmed Reads (NR) | Trimmed Reads (NR) | Invalid Sequences Filtered (NR) | Genome Matched Reads | Genome Matched Reads (NR) |
|---|---|---|---|---|---|---|
| D2A | 45 581 525 | 15 267 190 | 11 114 679 | 21 004 | 41 144 941 | 9 009 977 |
| D2B | 34 915 085 | 13 385 729 | 10 103 828 | 17 049 | 31 426 832 | 8 316 470 |
| D2C | 26 067 832 | 10 199 905 | 7 715 372 | 12 140 | 23 303 530 | 6 337 667 |
Benchmarking results for run time and memory usage in gigabytes (GB) from running each tool on the generated small RNA datasets. DNF indicates that the tool did not complete the analysis within the 10-day cut-off. A ‘-’ indicates that we did not attempt to run the tool.
| # Seqs | CleaveLand4 (time) | CleaveLand4 (GB) | PAREsnip (time) | PAREsnip (GB) | sPARTA (time) | sPARTA (GB) | PAREsnip2 (time) | PAREsnip2 (GB) |
|---|---|---|---|---|---|---|---|---|
| 1 | 19m 23s | 1 | 9m 30s | 58 | 12m 48s | 25 | 5m 38s | 5 |
| 10 | 27m 32s | 1 | 9m 50s | 58 | 12m 53s | 25 | 5m 36s | 5 |
| 100 | 1h 52m | 1 | 12m 35s | 58 | 13m 55s | 25 | 5m 44s | 5 |
| 1,000 | 15h 8m | 1 | 44m 51s | 58 | 1h 11m | 26 | 6m 15s | 6 |
| 10,000 | 6d 6h 48m | 8 | 6h 25m | 64 | 4d 6h 59m | 37 | 6m 32s | 6 |
| 100,000 | DNF | - | 2d 15h 16m | 66 | DNF | - | 15m 1s | 6 |
| 250,000 | - | - | 6d 10h 49m | 68 | - | - | 29m 6s | 7 |
| 500,000 | - | - | DNF | - | - | - | 53m 11s | 8 |
| 1,000,000 | - | - | - | - | - | - | 1h 44m | 8 |
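The speed-up implied by the table above can be checked with a little arithmetic. The Python sketch below converts the run times from the 10,000-sequence row into seconds and divides by the PAREsnip2 time; the helper name `to_seconds` and the choice of row are ours, made only for illustration.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(duration):
    """Convert a table entry such as '6d 6h 48m' to seconds."""
    parts = re.findall(r"(\d+)([dhms])", duration)
    if not parts:
        raise ValueError(f"not a duration: {duration!r}")
    return sum(int(n) * UNIT_SECONDS[u] for n, u in parts)


# Run times for 10,000 sequences, copied from the table above.
runtimes = {
    "CleaveLand4": "6d 6h 48m",
    "PAREsnip": "6h 25m",
    "sPARTA": "4d 6h 59m",
}
paresnip2 = to_seconds("6m 32s")

# Ratio of each tool's run time to the PAREsnip2 run time.
speedup = {tool: to_seconds(t) / paresnip2 for tool, t in runtimes.items()}
for tool, ratio in speedup.items():
    print(f"PAREsnip2 vs {tool}: about {ratio:.0f}x faster")
```

For this row the ratio against CleaveLand4 exceeds 1,000, which is consistent with the abstract's claim of more than two orders of magnitude in some cases.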
The results from the accuracy performance benchmarking of each tool over the three biological replicates. V = validated targets, NV = non-validated and %PV = percentage of possible validated targets that could be found
| Tool Name | D2A V | D2A NV | D2A %PV | D2B V | D2B NV | D2B %PV | D2C V | D2C NV | D2C %PV |
|---|---|---|---|---|---|---|---|---|---|
| sPARTA | 171 | 120 | 70% | 169 | 121 | 70% | 162 | 127 | 72% |
| PAREsnip | 177 | 48 | 73% | 179 | 50 | 75% | 167 | 57 | 75% |
| CleaveLand4 | 88 | 20 | 36% | 95 | 26 | 40% | 87 | 25 | 39% |
| PAREsnip2 Allen | 193 | 41 | 79% | 191 | 39 | 80% | 181 | 33 | 80% |
| PAREsnip2 Fahlgren & Carrington | 219 | 48 | 90% | 219 | 43 | 91% | 205 | 37 | 91% |
Figure 2. The number of interactions reported when using MFE as a filter. As the MFE filter ratio increases, there is a reduction in the number of captured sRNA–mRNA interactions. A cut-off score of 0.70 captures 98% of the possible validated interactions.
Figure 3. The number of interactions reported when using P-value as a filter. As the cut-off decreases, there is a reduction in the number of captured sRNA–mRNA interactions. The default cut-off score of 0.05 captures 85.6% of the possible validated interactions.