Jimmy Bell1, Maureen Larson2, Michelle Kutzler3, Massimo Bionaz3, Christiane V Löhr2, David Hendrix1,4.
Abstract
MicroRNAs are conserved, endogenous small RNAs with critical post-transcriptional regulatory functions throughout eukaryota, including prominent roles in development and disease. Despite much effort, microRNA annotations still contain errors and are incomplete due especially to challenges related to identifying valid miRs that have small numbers of reads, to properly locating hairpin precursors and to balancing precision and recall. Here, we present miRWoods, which solves these challenges using a duplex-focused precursor detection method and stacked random forests with specialized layers to detect mature and precursor microRNAs, and has been tuned to optimize the harmonic mean of precision and recall. We trained and tuned our discovery pipeline on data sets from the well-annotated human genome, and evaluated its performance on data from mouse. Compared to existing approaches, miRWoods better identifies precursor spans, and can balance sensitivity and specificity for an overall greater prediction accuracy, recalling an average of 10% more annotated microRNAs, and correctly predicts substantially more microRNAs with only one read. We apply this method to the under-annotated genomes of Felis catus (domestic cat) and Bos taurus (cow). We identified hundreds of novel microRNAs in small RNA sequencing data sets from muscle and skin from cat, from 10 tissues from cow and also from human and mouse cells. Our novel predictions include a microRNA in an intron of tyrosine kinase 2 (TYK2) that is present in both cat and cow, as well as a family of mirtrons with two instances in the human genome. Our predictions support a more expanded miR-2284 family in the bovine genome, a larger mir-548 family in the human genome, and a larger let-7 family in the feline genome.
Year: 2019 PMID: 31596843 PMCID: PMC6785219 DOI: 10.1371/journal.pcbi.1007309
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Outline of miRWoods Pipeline.
After alignment to the genome, overlapping reads are grouped together to form read stacks. Read stacks are scored using the Mature Product Random Forest (MPRF) to predict a set of putative mature microRNAs. Products that meet the minimum threshold score for the MPRF are combined with the surrounding region to form hairpins, and each hairpin is folded. Hairpins are scored using the Hairpin Random Forest (HPRF), and the final predictions are those that meet the minimum threshold for the HPRF score.
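The first step of the caption, grouping overlapping reads into read stacks, can be sketched as follows. This is a minimal illustration and not the miRWoods implementation: the function name `group_read_stacks`, the half-open `(start, end)` read representation, and the assumption that all reads share one chromosome and strand are hypothetical choices made for the example.

```python
from typing import List, Tuple

# Hypothetical read representation: half-open (start, end) coordinates,
# all on the same chromosome and strand.
Read = Tuple[int, int]


def group_read_stacks(reads: List[Read]) -> List[List[Read]]:
    """Group reads into stacks so that every read overlaps its stack's span."""
    stacks: List[List[Read]] = []
    current_end = None  # right-most end of the stack being built
    for start, end in sorted(reads):
        if current_end is None or start >= current_end:
            # No overlap with the current stack: open a new one.
            stacks.append([(start, end)])
            current_end = end
        else:
            # Overlaps the current stack: extend it.
            stacks[-1].append((start, end))
            current_end = max(current_end, end)
    return stacks


if __name__ == "__main__":
    print(group_read_stacks([(0, 22), (60, 82), (1, 23)]))
```

Each resulting stack would then be scored by the MPRF, and only stacks at or above the threshold would proceed to hairpin construction and HPRF scoring.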
Features used in the mature product random forest (MPRF).
| Name | Description |
|---|---|
| fivePrimeHet | 5′-heterogeneity of product reads |
| medianLength | Median length of product reads |
| gcContent | GC content of product sequence |
| aa, ac, ag, at, ca, cc, cg, ct, ga, gc, gg, gt, ta, tc, tg, tt (16 features) | Product dinucleotide frequencies |
| r7, r6, r5, r4, r3, r2, r1, s0, f1, f2, f3, f4, f5, f6, and f7 (15 features) | Read abundance from 7 nt downstream to 7 nt upstream of the product start position |
| WFC | Wootton-Federhen complexity of product sequence |
| Duplex Energy | Minimum free energy of product duplex with surrounding genomic region |
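Two of the sequence-composition features above, `gcContent` and the 16 dinucleotide frequencies, are simple to compute. The sketch below is an illustration under stated assumptions, not the miRWoods code: the function names are hypothetical, the input is assumed to be a DNA or RNA string over A/C/G/T/U, and frequencies are normalized by the number of overlapping dinucleotide positions.

```python
from collections import Counter
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def dinucleotide_frequencies(seq: str) -> dict:
    """Frequencies of all 16 overlapping dinucleotides, keyed 'AA' .. 'TT'."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA input alike
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)  # avoid division by zero on short input
    return {a + b: counts[a + b] / total for a, b in product("ACGT", repeat=2)}


if __name__ == "__main__":
    product_seq = "TGAGGTAGTAGGTTGTATAGTT"
    print(gc_content(product_seq))
    print(dinucleotide_frequencies(product_seq)["TA"])
```

The dictionary keys are upper case here, whereas the table lists the features in lower case (`aa`, `ac`, ...); the mapping between the two is one-to-one.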
Features used in the hairpin random forest (HPRF).
| Name | Description | Reference |
|---|---|---|
| mfe | minimum free energy of hairpin fold | 14, 11 |
| pbp | frequency of paired bases of miR | 11, 4 |
| urf | fraction of unique reads to total adjusted reads for locus | 34, 8 |
| gcContent | GC content of locus sequence | 11, 10 |
| totalSenseRPM | Adjusted reads per million (ARPM) in the sense strand | 12 |
| loopSize | length of the loop in nucleotides. | 10, 12 |
| maxBulge | longest bulge appearing in the region of the hairpin spanning the miR and miR* | 10 |
| tapd | total displacement of sense to anti-sense products | 8 |
| aapd | average displacement of sense to anti-sense products | 8 |
| ahc | average number of hits to the genome for the major product | 8 |
| afh | average 5′-heterogeneity of major product reads | 8 |
| sameShift | Amount of offset between products on the same arm | 8 |
| bothShift | maximum amount two products are offset on opposite arms | 8 |
| Dinucleotide frequencies (16 features) | precursor dinucleotide frequencies | 12 |
| maxInteriorLoop | Length of largest interior loop spanning the miR and miR* | 12 |
| intLoopSideDiff | Difference in length of interior loop branches in miR/miR* | 12 |
| OPA | Frequency of the most abundant overlapping product | |
| Duplex Energy | Duplex energy of major product and surrounding region. | |
| foldDupCmp | Similarity between dotbracket sequences from RNAduplex and RNAfold | |
| dupPBP | base pairing density of region duplexing the major product | |
| dupLoopLength | Length of biggest bulge or interior loop in region duplexing the major product | |
| APV | The average variance of read counts for distinct reads for all products | |
| wAPV | The average variance of read counts for distinct reads weighted across products | |
| ARV | The average variance of start positions for reads on each product | |
| wARV | The average variance of start positions for reads weighted by product size | |
| mpLoopDistance | distance of the miR from the loop | |
| dupLoopDistance | distance of the miR | |
| totalOverlap | The sum of the amounts of overlap between each pair of overlapping reads. | |
| totalRelativeOverlapAmount | sum of each overlap multiplied by the abundance ratio of the smaller to larger product | |
| averageOverlapAmount | sum of each overlapping product multiplied by the frequency of reads of the smaller product within the hairpin | |
| innerLoopGapCount | number of times 3 or more unbound nucleotides appears in the loop region | |
| totalAntisenseRPM | Adjusted reads per million (ARPM) in the anti-sense strand | |
| maxUnboundOverhang | The largest length of unpaired nucleotides on either side of the miR | |
| numOffshoots | number of additional hairpins formed on or across from the miR or miR* | |
| dupSize | The size of the region duplexed by the miR product | |
| neighborCount | The number of regions of contiguous read loci within 1000 nucleotides of the precursor | |
| RFProductAvg | Decision value returned by the random forest in the product phase | |
| Product counts (8 features) | The fraction of the product relative to the total for the hairpin | |
| Product overlaps (11 features) | Overlapping lengths for individual products within the locus (e.g. “miRmoR5pOverlap”, the overlap between miR and moR on the 5′ arm). | |
*References with an asterisk use a variant of the described feature.
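Several structural features in the table (for example `pbp` and `loopSize`) can be read off a dot-bracket string such as the one produced by RNAfold. The sketch below illustrates the idea only; the exact definitions used by miRWoods are not given here, the function names are hypothetical, and `loop_size` assumes a single unbranched hairpin.

```python
def paired_fraction(dot_bracket: str) -> float:
    """Fraction of positions that are base-paired in a dot-bracket string.

    To restrict the value to the miR, pass only the slice of the structure
    that covers the miR (an assumption about how such a feature is scoped).
    """
    if not dot_bracket:
        return 0.0
    paired = sum(ch in "()" for ch in dot_bracket)
    return paired / len(dot_bracket)


def loop_size(dot_bracket: str) -> int:
    """Number of unpaired nucleotides in the terminal loop.

    Assumes one unbranched hairpin, so the loop lies between the last '('
    and the first ')'. Returns 0 if that assumption does not hold.
    """
    last_open = dot_bracket.rfind("(")
    first_close = dot_bracket.find(")")
    if last_open == -1 or first_close == -1 or first_close < last_open:
        return 0
    return first_close - last_open - 1


if __name__ == "__main__":
    structure = "((((.((....)).))))"
    print(paired_fraction(structure), loop_size(structure))
```

A multi-branched structure would need a proper parse of the bracket nesting; the early return of 0 in `loop_size` only flags that case rather than handling it.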
Fig 2. Improved hairpin precursor span identification.
a miRWoods generates several potential hairpin precursor spans from each product that passes through the MPRF. Duplex-focused spans take the region between the product and the optimal duplex, and product-focused spans take the region between the product and other products greater than 4 nt away. Hairpins are selected based on HPRF score. b The miRBase annotation for hsa-mir-4721 crosses over an intron boundary. miRWoods corrects the annotation by recognizing a second read stack and producing a precursor span that perfectly matches an intron, suggesting mir-4721 is a mirtron. c The miRWoods prediction for mmu-let-7c-2 in mouse is consistent with the miRBase annotation, while the best miRDeep2 prediction, albeit below the default signal-to-noise threshold, only partially overlaps with the miRBase annotation.
Percentage of predicted hairpin spans matching miRBase annotation.
The method with the highest percent for a particular sample is presented in bold.
| library | miRWoods total | miRWoods percent (%) | miRDeep2 total | miRDeep2 percent (%) | miReap total | miReap percent (%) |
|---|---|---|---|---|---|---|
| human MCF-7 | 450 | | 318 | 98.452 | 430 | 98.398 |
| human MCF-7 | 452 | 98.69 | 314 | | 428 | 97.717 |
| human liver | 385 | | 318 | 99.375 | 413 | 98.804 |
| human cell lines | 736 | | 532 | 98.519 | 228 | 97.854 |
| mouse brain | 405 | | 330 | 98.214 | 370 | 98.143 |
| mouse embryo | 486 | | 412 | 98.329 | 398 | 97.073 |
| mouse newborn | 419 | | 335 | 97.384 | 179 | 97.283 |
| mouse ovary | 282 | | 243 | 97.2 | 237 | 98.75 |
| mouse testes | 293 | | 269 | 97.464 | 260 | 97.744 |
Performance of miRWoods compared to miRDeep2 and miReap.
The method with the highest F-score for a particular sample is presented in bold.
| library | miRWoods precision | miRWoods recall | miRWoods F-score | miRDeep2 precision | miRDeep2 recall | miRDeep2 F-score | miReap precision | miReap recall | miReap F-score |
|---|---|---|---|---|---|---|---|---|---|
| human MCF-7 (total cell) | 0.727 | 0.501 | | 0.839 | 0.354 | 0.249 | 0.42 | 0.478 | 0.223 |
| human MCF-7 (cytoplasm) | 0.7 | 0.511 | | 0.86 | 0.355 | 0.251 | 0.476 | 0.484 | 0.24 |
| human liver | 0.871 | 0.447 | | 0.898 | 0.369 | 0.262 | 0.446 | 0.48 | 0.231 |
| human cell lines | 0.627 | 0.586 | | 0.834 | 0.424 | 0.281 | 0.264 | 0.182 | 0.108 |
| mouse brain | 0.849 | 0.569 | | 0.951 | 0.463 | 0.312 | 0.397 | 0.52 | 0.225 |
| mouse embryo | 0.694 | 0.567 | 0.312 | 0.898 | 0.481 | | 0.205 | 0.464 | 0.142 |
| mouse newborn | 0.836 | 0.559 | | 0.931 | 0.447 | 0.302 | 0.312 | 0.239 | 0.135 |
| mouse ovary | 0.953 | 0.603 | | 0.96 | 0.519 | 0.337 | 0.798 | 0.506 | 0.31 |
| mouse testes | 0.91 | 0.579 | | 0.944 | 0.532 | 0.34 | 0.324 | 0.514 | 0.199 |
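The abstract describes tuning toward the harmonic mean of precision and recall, which is the conventional F-score. The sketch below computes it; the function name `f_score` is hypothetical. One observation, offered with caution: the tabulated F-score values appear consistent with P·R/(P+R), that is, half of the conventional harmonic mean. For example, a precision of 0.694 and a recall of 0.567 give a harmonic mean of about 0.624, and half of that rounds to the tabulated 0.312. The ranking of methods is the same under either convention.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (conventional F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    p, r = 0.694, 0.567  # precision and recall taken from one table row
    print(round(f_score(p, r), 3))       # conventional harmonic mean
    print(round(f_score(p, r) / 2, 3))   # P*R/(P+R), half the harmonic mean
```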
Fig 3. Evaluation of miRWoods performance.
a Euler diagrams comparing predictions from miRWoods and miRDeep2 with annotations from miRBase for human MCF-7 cytoplasmic extract. b A scatterplot comparing the miRWoods decision value to the log fold change in Dicer knockdown cells compared to wild-type cells. c Scatter-boxplot comparing the log fold change for Dicer knockout to wild type for unprocessed read regions, miRBase annotations, and predictions from miRWoods, miRDeep2, and miReap for MCF-7 (cytoplasmic fraction). Black dots indicate predictions that are unique to this method. d Precision-recall (PR) curve and AUPRC of miRWoods predictions for human sets including MCF-7 (total cell content), MCF-7 (cytoplasmic fraction), cell lines, and liver. e Euler diagrams comparing predictions from miRWoods and miRDeep2 with annotations from miRBase for human liver. f Precision-recall curve and AUPRC of miRWoods predictions for mouse tissues including brain, embryo, newborn, testes, and ovary sets. g Euler diagrams comparing predictions from miRWoods and miRDeep2 with annotations from miRBase for mouse ovary.
Fig 4. miRWoods predictions in the feline genome.
a Euler diagram of the predictions from miRWoods with predictions from Sun et al. (2014) and Lagana et al. (2017). b The expression in skin and muscle for miR-133-Novel-3p. c Hairpin for mir-133-Novel precursor. d The expression in skin and muscle for Novel110-3p. e Hairpin for Novel110 precursor. f Scatterplot of average muscle expression vs average skin expression for each mature microRNA.
Fig 5. Novel let-7 microRNAs in the feline genome.
a RNA-seq of cluster containing fca-let7-Novel2, fca-let7f, and fca-let7-Novel3 for each skin and muscle sample from Felis catus. b Hairpin structures for fca-let7-Novel2, c fca-let7-Novel3, and d fca-let7f. e Phylogenetic tree of let-7 miRs including those previously found by Lagana et al. (2017).
Fig 6. Novel microRNA predictions in the bovine genome.
a Euler diagram comparing miRWoods predictions in the cow genome with miRBase annotations. b Scatterplot and best fit line comparing the normalized RT-qPCR expression and RNA-seq for the control miR bta-miR-7. c Scatterplot and best fit line comparing the normalized RT-qPCR expression and RNA-seq for a novel predicted miR with enriched expression in brain stem. d Scatterplot and best fit line comparing the normalized RT-qPCR expression and RNA-seq for a novel predicted miR with enriched expression in corium feet. e Heat map of RT-qPCR expression values over tissues examined.
Fig 7. mir-2284/mir-2285 family miRs in Bos taurus.
a A heat map for the expression of annotated and novel mir-2284/mir-2285 family miRs. b A phylogenetic tree for the bta-2284/bta-2285 family. Variants of bta-mir-2284 appear in red and variants of bta-mir-2285 appear in blue. Colors for novel predictions appear lighter than those for annotated predictions. c Abundance of miRs for the 5′ and 3′ sides of the mir-2284/mir-2285 family. The 5′ product tends to show greater expression in the mir-2284 family while the 3′ product shows greater expression in the mir-2285 family.
Fig 8. Novel microRNA families identified by miRWoods.
a hsa-novel-8 is a mirtron predicted for both MCF-7 sets where expression was decreased in the Dicer knockdown sets. b hsa-Novel-185 is a mirtron predicted within the human cell lines set and the MCF-7 (cytoplasmic fraction) set. It also shows reduced expression in the Dicer knockdown version of the MCF-7 set. c The structure of hsa-novel-8. d The structure of hsa-Novel-185. e Phylogeny comparing the LAMA5 intron and CHD3 intron for several mammals. f Novel miR predicted in the bovine genome in an intron of TYK2. g Novel predicted miR in the feline genome in the same intron of TYK2. h Structure of the novel feline miR. i Structure of the novel bovine miR. Eight nucleotides were removed from the 5′ end, and two were added to the 3′ end to match the feline hairpin precursor boundaries. j A phylogeny comparing the TYK2 intron in several mammals.