| Literature DB >> 25888698 |
Shirley Tam, Ming-Sound Tsao, John D McPherson.
Abstract
The past two decades of microRNA (miRNA) research has solidified the role of these small non-coding RNAs as key regulators of many biological processes and promising biomarkers for disease. The concurrent development in high-throughput profiling technology has further advanced our understanding of the impact of their dysregulation on a global scale. Currently, next-generation sequencing is the platform of choice for the discovery and quantification of miRNAs. Despite this, there is no clear consensus on how the data should be preprocessed before conducting downstream analyses. Often overlooked, data preprocessing is an essential step in data analysis: the presence of unreliable features and noise can affect the conclusions drawn from downstream analyses. Using a spike-in dilution study, we evaluated the effects of several general-purpose aligners (BWA, Bowtie, Bowtie 2 and Novoalign), and normalization methods (counts-per-million, total count scaling, upper quartile scaling, Trimmed Mean of M, DESeq, linear regression, cyclic loess and quantile) with respect to the final miRNA count data distribution, variance, bias and accuracy of differential expression analysis. We make practical recommendations on the optimal preprocessing methods for the extraction and interpretation of miRNA count data from small RNA-sequencing experiments.Entities:
Keywords: data preprocessing; miRNA sequencing; miRNA-seq normalization; small RNA sequence alignment
Mesh:
Substances:
Year: 2015 PMID: 25888698 PMCID: PMC4652620 DOI: 10.1093/bib/bbv019
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Software packages and methods for miRNA-sequencing data preprocessing and alignment
| Preprocessing | Annotation | ||||
|---|---|---|---|---|---|
| Adapters | Filtering | Output | Algorithm | Reference | |
| Morin et al., (2008) |
Trim all reads to 30 nt |
Count file | MegaBlast |
Genome Annotate genomic positions to miRNA, tRNA, rRNA, scaRNA… | |
| SeqBuster (2009) |
• Modification of Needleman-Wunsch -3′ adapter removal | MegaBlast |
User-supplied databases | ||
| miRExpress (2009) |
• 3′ adapter trimming |
Reads with 3′ adapters in the middle or beginning |
Count file | Smith-Waterman |
miRBase |
| E-miR (2010) |
• Regular expression matching - 3′ adapter removal |
Reads <15 nt Trim reads >32 nt |
Count file | Eland |
Genome Annotate genomic positions to non-coding RNA transcripts |
| mirTools (2010) |
• Custom Script - 5′/3′ trimming |
PolyA Quality |
Collapsed fasta | SOAP |
Genome miRBase, repeats, coding genes… |
| DSAP (2010) |
• Supermatcher - 5′/3′ clipping |
Homopolymers Reads <16 nt Read with no 3′ adapter |
Collapsed tags | BLAST |
ncRNAs: rRNA, tRNAs, snRNAs, snoRNAs… miRNA |
| miRNAkey (2010) |
• 3′ adapter clipping |
Read length post-clipping | BWA |
miRBase | |
| Flicker (2011) |
• 3′ adapter trimming |
Fasta | Eland |
Contaminants (rRNA, primers…) Genome miRNA miRNA precursors | |
| miRanalyzer (2009, 2011) |
• Optional adapter removal |
‘N’ containing reads Reads <17 nt Trim reads >26 nt |
Count file Multi-fasta | Bowtie |
miRBase: mature, precursor RefSeq, Rfam Genome |
| miRDeep2 (2011) |
• Custom script - Clip 3′ adapter |
Reads <18 nt |
Collapsed reads | Bowtie |
Genome Map miRNA and reads to precursors; find intersection |
| shortran (2012) |
• FASTX trimming |
Abundance Inter-library variation Sequence (Bowtie) |
Multi-fasta | Bowtie |
Genome User-supplied sequence files |
| Farazi |
• Custom script -3′ adapter trimming |
Reads <16 nt and >25 nt Mono-, di-, tri-nucleotide repeats Reaction by-products (Needleman–Wunsch algorithm) |
Collapsed fasta | BWA |
Genome Map unique sequences to rRNA, tRNA, sn/snoRNA, repeats, miRNA… |
Figure 1.(A) Outline of the experimental design and workflow for evaluating different combinations of alignment tools and normalization procedures. (B) Comparison of alignment tools. miRNA counts recovered from the different aligners were averaged across the samples and plotted for each pairwise comparison. Only miRNAs called present by all the aligners were considered. The aligners being compared are shown on the diagonal, along with the distinct number of miRNAs identified after filtering for species with low counts. Spearman’s correlation coefficients are indicated in the upper left corner for each comparison.
Figure 2.Comparison of data distribution. (A) Density plots of log count distribution. Distributions of the samples before and after normalization are shown in different panels for each normalization method. Comparing between samples, none appears to have abnormal distributions. (B) Boxplots of RLE. The RLE distributions of comparable samples should be centered at zero and similar to each other. Boxplots of the raw data clearly indicate the need for normalization. Values beyond the 1.5 interquartile range (IQR) are not shown. The samples are grouped according to the normalization method, with the order of the samples consistent across each group. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
Figure 3.Variance comparison. (A) Boxplots of the variance distribution. The variance of the log2 counts of all non-spike-in miRNAs was computed across the samples, and visualized using boxplots. The data are grouped according to the normalization method. Values beyond the 1.5 interquartile range (IQR) are not shown. A clear increase in variance is observed in the data normalized by cpm or total count scaling, while a decrease is seen in UQ or TMM normalized data. (B) Mean-variance dependency. The relationship between variance and abundance level is visualized by plotting the ratio of the variance between the normalized and unnormalized data versus the average counts. Data points below the y = 0 line have decreased variance compared with the raw data. The Lowess smoother line reveals the presence of any trends in the data. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
Change in variance and assessment of bias post-normalization
| % with lower variance | Median slope (β1) | Median | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Novoalign | BWA seed | BWA | Bowtie seed | Bowtie | Bowtie2 | novoalign | BWA seed | BWA | Bowtie seed | Bowtie | Bowtie2 | novoalign | BWA seed | BWA | Bowtie seed | Bowtie | Bowtie2 | |
| Raw | – | – | – | – | – | 0.96 | 0.96 | 0.98 | 0.96 | 0.97 | 0.96 | 0.96 | 0.96 | 0.98 | 0.96 | 0.97 | 0.96 | |
| Cpm | 15.4 | 15.0 | 10.5 | 15.5 | 10.4 | 24.8 | 0.87 | 0.87 | 0.88 | 0.87 | 0.87 | 0.86 | 0.87 | 0.87 | 0.88 | 0.87 | 0.87 | 0.86 |
| total count scaling | 1.1 | 1.0 | 0.6 | 1.0 | 0.6 | 1.2 | 0.92 | 0.92 | 0.91 | 0.92 | 0.91 | 0.92 | 0.92 | 0.92 | 0.91 | 0.92 | 0.91 | 0.92 |
| upper quartile | 96.4 | 96.5 | 96.6 | 96.5 | 96.7 | 97.2 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 |
| TMM | 95.4 | 95.4 | 95.6 | 95.4 | 95.7 | 97.2 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 |
| DESeq | 86.3 | 85.4 | 88.9 | 85.0 | 88.6 | 88.7 | 0.98 | 0.98 | 0.99 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.99 | 0.98 |
| linear regression | 95.1 | 95.6 | 95.8 | 95.4 | 95.9 | 94.9 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.98 |
| cyclic loess | 88.0 | 87.7 | 89.3 | 87.9 | 89.2 | 88.7 | 0.98 | 0.98 | 1.00 | 0.98 | 0.99 | 0.98 | 0.98 | 0.98 | 1.00 | 0.98 | 0.99 | 0.98 |
| Quantile | 88.8 | 88.1 | 89.5 | 87.5 | 89.2 | 88.6 | 0.95 | 0.96 | 0.96 | 0.96 | 0.96 | 0.97 | 0.95 | 0.96 | 0.96 | 0.96 | 0.96 | 0.97 |
*The ath-miR405a dilution series was excluded from the analysis of bias.
Figure 4.Bias assessment. (A) Assessment of bias. Linear regression was performed on the spike-in dilution series, and the resulting slopes (β1) were visualized using boxplots, with the asterisk symbol (*) representing the R2 values. The data are grouped according to the normalization method. (B) Comparison of miRNA-seq fold-changes and nominal fold-changes. The difference between experimentally determined fold-changes and nominal fold-changes was determined for all miRNAs. Boxplot show the distribution of the deviation from the expected fold-changes. The data are grouped according to the normalization method. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.
Figure 5Accuracy in predicting differential expression. (A) PR curves. The effect of normalization on the accuracy of differential expression analysis is visualized using PR curves. miRNAs with a fold-change ≥2 were considered to be differentially expressed. The spike-in sequences and the background reference were used as the true-positive results and true-negative results, respectively. Colors represent different normalization methods, while lines represent aligners. Please see online version for coloured image. (B) Comparison of fold-change estimates. Spearman’s correlation coefficients between fold-change estimates obtained from the differently processed data were calculated and subjected to unsupervised hierarchical clustering using Euclidean distance as the distance metric and complete linkage. (C) M versus A plots. MA plots were generated by comparing all samples to one chosen reference sample. The black points represent log2 fold-changes from the background, while the red points represent the non-spike-in miRNAs that have a fold-change ≥2. The colored numbers represent the spike-in sequences, with the precise number representing the expected fold-change. A colour version of this figure is available at BIB online: http://bib.oxfordjournals.org.