Maria D Giraldez1, Ryan M Spengler1, Alton Etheridge2, Paula M Godoy3, Andrea J Barczak3, Srimeenakshi Srinivasan4, Peter L De Hoff4, Kahraman Tanriverdi5, Amanda Courtright6, Shulin Lu7, Joseph Khoory7, Renee Rubio8, David Baxter9, Tom A P Driedonks10, Henk P J Buermans11, Esther N M Nolte-'t Hoen10, Hui Jiang12,13, Kai Wang9, Ionita Ghiran7, Yaoyu E Wang8, Kendall Van Keuren-Jensen6, Jane E Freedman5, Prescott G Woodruff14, Louise C Laurent4, David J Erle3, David J Galas2, Muneesh Tewari1,12,15,16. 1. Department of Internal Medicine, Hematology/Oncology Division, University of Michigan, Ann Arbor, Michigan, USA. 2. Pacific Northwest Research Institute, Seattle, Washington, USA. 3. Lung Biology Center, Department of Medicine, University of California San Francisco, San Francisco, California, USA. 4. Department of Obstetrics, Gynecology, and Reproductive Sciences and Sanford Consortium for Regenerative Medicine, University of California, San Diego, La Jolla, California, USA. 5. Department of Medicine, Division of Cardiovascular Medicine, University of Massachusetts Medical School, Worcester, Massachusetts, USA. 6. Neurogenomics, The Translational Genomics Research Institute (TGen), Phoenix, Arizona, USA. 7. Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA. 8. Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA. 9. Institute for Systems Biology, Seattle, Washington, USA. 10. Department of Biochemistry & Cell Biology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, the Netherlands. 11. Leiden Genome Technology Center, Department of Human Genetics, Leiden University Medical Center, Leiden, the Netherlands. 12. Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA. 13. Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA. 14. Cardiovascular Research Institute and the Department of Medicine, Division of Pulmonary, Critical Care, Sleep, and Allergy, University of California San Francisco, San Francisco, California, USA. 15. Department of Biomedical Engineering, University of Michigan, Ann Arbor, Michigan, USA. 16. Biointerfaces Institute, University of Michigan, Ann Arbor, Michigan, USA.
Abstract
RNA-seq is increasingly used for quantitative profiling of small RNAs (for example, microRNAs, piRNAs and snoRNAs) in diverse sample types, including isolated cells, tissues and cell-free biofluids. The accuracy and reproducibility of the currently used small RNA-seq library preparation methods have not been systematically tested. Here we report results obtained by a consortium of nine labs that independently sequenced reference, 'ground truth' samples of synthetic small RNAs and human plasma-derived RNA. We assessed three commercially available library preparation methods that use adapters of defined sequence and six methods using adapters with degenerate bases. Both protocol- and sequence-specific biases were identified, including biases that reduced the ability of small RNA-seq to accurately measure adenosine-to-inosine editing in microRNAs. We found that these biases were mitigated by library preparation methods that incorporate adapters with degenerate bases. MicroRNA relative quantification between samples using small RNA-seq was accurate and reproducible across laboratories and methods.
RNA-seq is increasingly used for quantitative profiling of small RNAs (for example, microRNAs, piRNAs and snoRNAs) in diverse sample types, including isolated cells, tissues and cell-free biofluids. The accuracy and reproducibility of the currently used small RNA-seq library preparation methods have not been systematically tested. Here we report results obtained by a consortium of nine labs that independently sequenced reference, 'ground truth' samples of synthetic small RNAs and human plasma-derived RNA. We assessed three commercially available library preparation methods that use adapters of defined sequence and six methods using adapters with degenerate bases. Both protocol- and sequence-specific biases were identified, including biases that reduced the ability of small RNA-seq to accurately measure adenosine-to-inosine editing in microRNAs. We found that these biases were mitigated by library preparation methods that incorporate adapters with degenerate bases. MicroRNA relative quantification between samples using small RNA-seq was accurate and reproducible across laboratories and methods.
RNA-seq has transformed transcriptome characterization in a wide range of biological contexts [1,2]. RNA-seq can be used to sequence long reads (long RNA-seq; e.g., messenger RNAs and long non-coding RNAs) and short RNAs (small RNA-seq; e.g. small non-coding RNAs such as microRNAs). These applications differ in terms of the size of the targeted RNAs, but also by the technical methods used and the resulting biases in the quantitative data produced[3]. For example, preparation of libraries for long RNA-seq, by virtue of having sufficiently long target RNA lengths, commonly utilizes primers for direct generation of cDNA from RNA. In contrast, small RNA-seq library preparation methods typically require RNA ligation or poly-A tailing steps to overcome the challenge of performing reverse transcription and subsequent PCR amplification from extremely short (e.g., 16-30 nt) target RNA sequences.Multiple approaches have been developed to overcome the challenge of uniformly and robustly generating cDNA from small RNAs for the purpose of small RNA-seq[4-9]. Protocols in use for small RNA-seq therefore vary more widely than those used for long RNA-seq, creating greater potential for variation from different library preparation protocols and different labs. In addition, small RNA-seq is increasingly used to study samples with very low RNA concentration, such as biofluids containing exosomes, other extracellular vesicles (EV)[10-16] and RNA-protein complexes[17-21]. Normalization methods[22-24] developed to correct for variation in long RNA-seq data are typically not well suited for small RNA-seq data. Whereas performance characteristics such as reproducibility and quantitative accuracy have been well studied for long RNA-seq[25,26], only the reproducibility of a single library preparation protocol has been evaluated[25].Furthermore, the performance of different small RNA-seq methods for quantifying single nucleotide changes in RNA sequence, such as those seen with microRNA (miRNA) editing, for example, has not been systematically examined. Yet, with the rapid accumulation of small RNA-seq data (e.g., NIH short-reads archive[27,28], EV-associated small RNA sequencing databases[29-31], TCGA[32], the exRNA Atlas,[33] etc.), meaningful, quantitative interpretation of results, especially across studies, would benefit from a systematic examination of technical bias, its effects on accuracy and of the reproducibility of small RNA-seq.Here, we report a study led by investigators from the NIH-funded Extracellular RNA Communication Consortium[34] involving nine laboratories, which performed a systematic multi-protocol, multi-institution assessment of the accuracy, reproducibility and technical bias of small RNA-seq using standardized, synthetic reference reagents. We evaluated the performance of different protocols with respect to characterizing miRNA editing and identified a library preparation approach that reduces technical bias, improving the accuracy and comparability of small RNA-seq results.
Results
Study design and standard reference materials for miRNA quantification
In order to evaluate the performance of small RNA-seq library preparation protocols across multiple laboratories, we developed standard reference samples as well as a standardized study design (shown in Figure 1). We distributed detailed instructions for library preparation and sequencing to each lab, along with four reference RNA samples (Figure 1 and Supplementary Tables 1 and 2): i) an equimolar pool comprising 1,152 synthetic RNAoligonucleotides, corresponding predominantly to human miRNA sequences, as well as a small set of non-miRNA oligonucleotides of varied sequence and length (15-90 nt); ii) two synthetic small RNA pools, called ratiometric pools SynthA and SynthB, each containing the same 334 synthetic RNAs, but in which subsets of RNAs vary in relative amount between pools A and B by 15 different ratios, ranging from 10:1 to 1:10; and iii) RNA isolated from human blood plasma pooled from 11 individuals.
Figure 1
Overview of study design
(Top) The four primary RNA pools used as common reference samples in the study are shown. Equimolar and ratiometric pools were prepared using chemically-synthesized RNA oligonucleotides to establish “ground-truth” knowledge of absolute and relative abundances, respectively. The equimolar pool consisted of 1,152 synthetic RNAs (15-90 nt) mixed at equal concentration. Downstream analyses focused on the subset of 977 RNAs 16-25 nt in length and with 5’-phosphate modifications. The two ratiometric pools, A and B, consisted of 334 synthetic RNAs, in which subsets of RNAs were varied in relative abundance between the two pools. The RNAs were divided into 15 ratiometric subgroups. The subset of 290 RNAs, 16-25 nt in length, was used for downstream analyses, and the number of RNAs in each ratiometric subgroup is indicated. The plasma RNA pool comprised RNA from 11 healthy males that was centrally isolated and distributed to the participating labs. (Middle) Nine different library preparation protocols were tested. Three commercially available kits with “invariant” adapters and six 4N Random-End protocols were tested. (Bottom) Common reference RNA pools were distributed to 9 participating labs for sequencing in quadruplicate, using a standardized common protocol (TruSeq) and at least one additional method of their choice. The breakdown of the resulting libraries is depicted in the colored grid, with the Lab IDs indicated by columns, and the replicates and pools shown in rows. Solid grey blocks indicate libraries that were not attempted. Grey blocks with a diagonal red line indicate samples where library preparation and sequencing was attempted but was unsuccessful. Sequencing was performed by each lab independently.
The common materials were distributed to nine participating research groups (Laurent lab, UCSD; Erle lab, UCSF, Ghiran lab, BIDMC/DFCI; Nolte-’t Hoen lab, UUTR; Freedman lab, UMass; Wang lab, ISB; Galas lab, PNRI; Van Keuren-Jensen lab, TGen and Tewari lab, U. of Michigan). Nine library preparation protocols were evaluated (, wherein at least one group prepared and sequenced quadruplicate libraries from each of the reference samples. Three of the protocols–TruSeq (Illumina), NEBNext (New England Biolabs) and CleanTag (Trilink Biotech)–are commercial kits that employ adapters with invariant sequences. The remaining protocols make use of adapters with 4 degenerate nucleotides at the ligation end as a strategy to reduce the bias, and we collectively call these “4N” protocols in this study. These six 4N protocols included: (i) a commercial kit, NEXTflex (Bioo Scientific), (ii) a recently published protocol[35], and (iii) four variants of a protocol developed by members of the consortium (protocols 4N_A, B, C and D) that we collective refer to as “in-house” 4N methods. The TruSeq kit served as the common reference kit for this study, and was evaluated by all the groups using Illumina sequencing platforms (8 out of 9 groups). In addition, multiple labs generated libraries using the NEBNext kit (6 labs) and the in-house protocol 4N_B (4 labs), thereby allowing for standardized cross-lab comparisons for these two protocols in addition to the Illumina TruSeq protocol.In all, the nine participating groups prepared 384 libraries for miRNA quantification analysis, of which 377 (98%) were successfully sequenced and submitted for central analysis. The 7 libraries that were not successfully prepared and sequenced included four plasma pool libraries (Lab8 4N_NEXTflex), two equimolar pool libraries (Lab7 NEXTflex) and one SynthB library (Lab8 TruSeq). Together, the nine participating groups collectively contributed 5.45 billion small RNA-seq reads to the analysis (Figure 1). These sequencing data were centrally analyzed using the Genboree Workbench and its implementation of the Extracellular RNA Communication Consortium’s ExceRpt Small RNA-seq pipeline, which is specifically designed for the analysis of small RNA-seq data and uses its own alignment and quantification engine to map and quantify a range of RNAs represented in small RNA-seq data (see Supplementary Table 3 for pipeline QC metrics). Of the 377 samples analyzed, 364 (>96%) satisfied minimum quality criteria (Online Methods) and were included in the analyses.
Characterization of sequence-specific bias of small RNA-seq protocols
Out of the 1,152 synthetic RNAs, we focused on 977 5’-phosphorylated RNAs 16-25 nt in length, which can be captured with standard small RNA-seq protocols. The efficiency of recovery of RNA sequences varied by multiple orders of magnitude depending on the protocol, confirming that small RNA-seq protocols are associated with prominent sequence-dependent bias [4,25,36-38] (Figure 2a, 2b) and that the bias is greater as compared to that in long RNA-seq[26]. This was highly reproducible within a given protocol, both across technical replicates and laboratories using the same protocol (Figure 2a). Libraries prepared by different labs clustered first into two groups, corresponding to methods with invariant (TruSeq, NEBNext and Cleantag) or degenerate (4N) adapters. Within each of these two larger groups, the libraries then formed distinct clusters corresponding to the different protocols included in the study, indicating that the impact of the protocol bias is potentially greater than that of lab-to-lab variation. The ten most overrepresented and underrepresented sequences varied widely between protocols (Supplementary Figure 1).
Figure 2
Equimolar pool sequencing results across multiple labs and protocols
(a) The heatmap shows expression levels for each synthetic RNA sequence (rows) across all replicate equimolar pool libraries (columns). Expression levels represent log2-scaled counts-per-million (CPM) calculated for 977 equimolar pool sequences which are 16-25 nt in length and 5’-phosphorylated. Hierarchical clustering for rows and columns represents complete linkage clustering on Euclidean distances (the default setting for the R package, pheatmap, used for plotting). Columns are labeled at the bottom to identify replicate samples. Library Size indicates the sequencing depth for each library (log2-scaled). (b) Violin plots summarize the mean CPM observed for each of the 16-25 nt, 5’-phosphorylated equimolar pool sequences (y-axis; log10-scaled), as measured from equimolar pool libraries prepared by different institutions and using different library preparation protocols (x-axis). The width of the violins is proportional to the density of data points at each position. The horizontal lines within each violin represent the 25, 50 and 75th percentiles. The dashed horizontal line shows the expected CPM for each sequence in the equimolar pool (106/977 miRNAs = 1023.5 CPM). Each violin plot and corresponding quantile lines summarize mean CPM values for n=977 distinct equimolar pool sequences. The mean CPM values were calculated from n=4 technical replicate libraries for each lab/library preparation method shown, with the exception of 4N_NEXTflex. Lab7 (n=2). (c) The percentage of equimolar pool sequences sequenced at levels 10x higher (>10,235 CPM) or 10x lower (<102.35 CPM) than expected (y-axis) are plotted for each lab. The dots and whiskers indicate the median and range of values, respectively, measured across the technical replicates for each lab. N=4 technical replicate libraries per lab/library preparation method, with the exception of 4N_NEXTflex (n=2).
Although all protocols exhibited some bias, it was reduced in those using degenerate adapters (Figure 2b and Supplementary Figure 2). As one measure of this, we calculated the median percentage of sequences with a number of reads (i.e., counts per million) more than ten times above or below the expected value, for each protocol. This ranged from 41.6% to 61.5% for protocols using adapters with defined sequences (TruSeq: 41.6%; CleanTag: 53.9% and NEBNext: 61.5%), and from 2.8% to 22.4% for protocols using adapters with degenerate nucleotides (4N_A: 8.9%; 4N_B: 2.8%; 4N_C: 12.1%; 4N_D: 22.4%, 4N_Xu: 7.1% and 4N_NEXTflex: 17%) (Figure 2c). The 4N in-house protocols showed fewer missing sequences from the equimolar pool (Supplementary Table 4) and when downsampling to compare the same number of total mapped reads across protocols, at varying sequencing depths (Supplementary Figure 3). We found that with the in-house 4N_B protocol, even when downsampling to 10,000 total mapped reads, >90% of the miRNAs had a high probability of detection (median: 92%; range: 78-95%). In contrast, even with the best-performing invariant adapter protocol, TruSeq, <50% of miRNAs had a high probability of detection (median: 46%; range: 40-55%) at the same depth, indicating that the 4N_B protocol may require lower read depth to yield similar coverage as other library protocols.We also assessed the reproducibility of small RNA cloning biases across labs by examining the rank-order of RNA sequence abundance. To do so, we calculated Spearman rank correlations for the equimolar synthetic pool counts between labs and protocols. As expected, the strongest correlations were found between technical replicates from the same lab and method (Supplementary Figure 4). Correlations were also strong between samples generated by different labs using the same protocol (Supplementary Table 5). The somewhat lower correlation value observed for 4N_B can be attributed to the overall lower variation in read counts across miRNAs due to less cloning bias with this protocol. The reduced spread in the data limits the maximum absolute correlation coefficient values that can be obtained. This limitation notwithstanding, comparison across labs using different protocols showed much weaker correlations (Supplementary Table 5)To dissect the source of observed bias, we evaluated the effect of several variables (5’ or 3’ terminal bases, %GC of the four 5’ or 3’end bases, overall %GC, dG [free energy], dH [enthalpy], dS [entropy], and Tm) on the number of obtained reads with different library preparation protocols (Supplementary Figures 5-13). However, none of these variables substantially explained the observed bias.
Accuracy and cross-protocol concordance for relative quantification
To investigate the accuracy of relative quantification of the same small RNAs between different samples, we designed two ratiometric pools, SynthA and SynthB, each containing the same 334 synthetic RNA sequences, but varying the relative abundance of sequences between the two pools for fifteen expression ratios (Figure 1; Supplementary Table 2). All of the protocols tested showed close concordance between observed and expected ratios (Figure 3a). We also analyzed the data using standard differential expression workflows from three commonly used R packages (EdgeR[39,40], DESeq2[41] and limma/voom[42]) to determine the smallest difference in abundance that could be distinguished using small RNA-seq methods. We observed that for most protocols and for the majority of miRNAs, a difference in levels as little as 1.5-fold between the two samples could be detected (Supplementary Figure 14). As shown in Supplementary Table 6, all the evaluated protocols performed relatively well in detecting miRNAs abundances.
Figure 3
Small RNA-seq accuracy and cross-protocol concordance in measuring relative expression levels between samples
(a) Boxplots show the observed ratio (y-axis; log2-scale) vs. expected ratio (x-axis) for miRNAs present in each of the SynthA and SynthB synthetic RNA subpools. Observed ratios for each miRNA were calculated as mean CPM of SynthA/mean CPM of SynthB across technical replicates for each lab + library prep method. Boxes show the median + IQR; upper/lower whiskers indicate the smallest/largest observation less than or equal to 1st/3rd quartile -/+ 1.5 * IQR; outliers are calculated as being < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR. Mean CPM ratios were calculated from n=4 SynthA and n=4 SynthB technical replicate libraries for each lab and library preparation method shown, with the exception of TruSeq Lab8 SynthB (n=3). Those miRNAs with a mean CPM of 0 in SynthA or SynthB are not plotted. The numbers of miRNA not plotted are as follows: Truseq Labs 1,2,3,4,5,6,8: 1; Lab9: 0; CleanTag Lab5: 0; NEBNext Labs 1,3,5,9: 0; Lab4: 1; Lab2:3; 4N_NEXTflex Lab7: 1; other 4N: 0. The number of sequences represented in each boxplot is provided in Supplementary Table 10
(b) Heatmaps show the pairwise, squared Spearman rank correlation coefficients from sequencing the SynthA and SynthB pools. Pairwise correlation coefficients were calculated based on the mean CPM across technical replicates for SynthA samples (left), SynthB samples (middle) and the ratio of SynthA : SynthB (right). The mean CPM value for each ratiometric pool sequence was calculated from n=4 technical replicate libraries per lab, library preparation method and pool. Mean CPM values for n=290 ratiometric pool RNAs were used for calculating each pairwise correlation coefficient. Hierarchical clustering for rows and columns is the same for all heatmaps, and is based on the average pairwise Euclidean distances calculated from the SynthA CPM and SynthB CPM correlation matrices. Column labels indicate the lab ID and library prep method; row labels indicate only lab ID, but are presented in the same order (top to bottom) as columns (left to right).
We examining the rank-order of RNA sequence abundance and found that in general, the Spearman rank correlations results obtained for the SynthA and SynthB samples were similar to those obtained for the equimolar pool: the correlation was strong when using the same protocol, but weaker across different protocols (see left and middle heatmaps in Figure 3b). In contrast, when we analyzed the concordance of the obtained SynthA/SynthB ratios (Figure 3b), we found a very strong correlation between labs not only when using the same protocol, but also across different protocols, confirming that relative quantification is resilient to variation in protocol used (Supplementary Table 5)
Reproducibility of small RNA-seq protocols
In order to quantify intra-lab variation for each sequence, we used two metrics, (i) coefficient of variation -CV- (standard deviation/mean) and, (ii) quartile coefficient of dispersion –QCD- (interquartile range/average of the first and third quartile). The median CV for the equimolar pool libraries ranged from 6.18% (TruSeq) to 23.92% (CleanTag) for the different library preparation methods (Figure 4a and Supplementary Table 5). In addition, the median QCD was <0.1 for all the protocols/labs (Figure 4a and Supplementary Table 5). We also evaluated the intra-lab variation from technical replicates of sequencing the SynthA and SynthB libraries. The calculated CV and QCD values were similar to those observed for the equimolar libraries (Supplementary Figure 15).
Figure 4
Reproducibility of small RNA-seq within- and between-labs
(a) Violin plots summarize the technical reproducibility of quantification for all equimolar pool sequences, as calculated from each lab and library preparation method. Reproducibility measurements, percent coefficient of variation (%CV; top) and quartile coefficient of dispersion (QCD; bottom), were calculated from CPM values. Horizontal lines within each violin indicate the 25, 50 and 75th percentiles, calculated from the mean CPM values of n=977 equimolar pool RNA sequences. Mean CPM values were calculated from n=4 technical replicate libraries for each of the lab/library preparation methods shown, with the exception of TruSeq Lab1 (n=3). (b) Boxplots summarize the sequence-specific reproducibility of quantification measured in equimolar pool libraries generated by different labs using the same protocol. Percent CV (top) and QCD (bottom) values were calculated for each equimolar pool sequence across TruSeq (n=8 labs), NEBNext (n=4 labs) and 4N_B (n=4 labs) library preparation protocols. The mean CPM for each sequence across technical replicates (n=4 technical replicates per lab/library preparation method) was used to calculate the between-lab %CV and QCD plotted here. Boxplot statistics and outliers were calculated from %CV or QCD values for n=977 equimolar pool sequences. The overlaid boxes indicate the median and IQR. Whiskers represent the 1st/3rd quartile +/- 1.5 * IQR. Outliers are < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR.
To characterize the reproducibility of small RNA-seq libraries across laboratories, we focused on the three protocols (TruSeq, NEBNext and 4N_B) for which libraries were generated by at least three groups. In addition, of the six labs that generated libraries using the NEBNext protocol, two of the labs used somewhat modified conditions based on options provided by the manufacturer and were excluded from the analysis ().Using the results for the equimolar pool and treating each laboratory’s results as one trial of the experiment, we calculated the CV and QCD for the mean CPM values for each RNA sequence across laboratories. The median CV across labs ranged from 30.42% (4N_B) to 35.28% (NEBNext) and the median QCD from 0.13 (4N_B) to 0.18 (Truseq and NEBNext) (Figure 4b and Supplementary Table 5). We confirmed that the choice of pseudo-counts for calculating CPM does not appreciably alter the %CV and QCD distribution (Supplementary Figure 16). In addition, repeating the inter-lab CV and QCD calculations using all combinations of n=3 labs from the TruSeq, NEBNext and 4N_B equimolar pool libraries showed that results from analysis of subsets of the data were comparable to those from analysis of the full datasets (Supplementary Figure 17a-b). We also calculated across lab variation for the SynthA and SynthB pools individually and obtained median CV and QCD values that were comparable to those described for the equimolar libraries (Supplementary Table 5):
Performance of small RNA-seq protocols using biological samples
We also sought to characterize the performance of small RNA-seq protocols across labs using standard reference RNA derived from biological material to assess the reproducibility and the diversity of miRNA sequences recovered.In order to perform this analysis, aliquots of RNA extracted from a pool of human blood plasma from 11 donors were shipped to the participating labs for sequencing in quadruplicate (Figure 1). We focused our analysis on miRNAs because they are well characterized and have been extensively studied in human plasma[43]. Hierarchical clustering generally mirrored that from the synthetic pools, with technical replicates of the same protocol clustering most-closely together (Figure 5a) and with samples also broadly clustering according to library preparation protocol.
Figure 5
Small RNA-seq of reference plasma RNA by multiple laboratories using multiple library preparation protocols
(a) The heatmap shows CPM (log2 scale) for each sequence (rows) across plasma pool libraries (columns). Only mature miRNAs with a high confidence of detection are shown, requiring a minimum of 100 CPM in 90 percent of samples from at least one protocol (TruSeq, CleanTag, NEBNext or 4N). Hierarchical clustering for rows and columns represents complete linkage clustering on Euclidean distances. “Library Size” indicates the sum of the mature miRNA-mapped read counts prior to filtering for the individual libraries (log2-scaled). (b) Violin plots summarize the technical reproducibility of quantification for miRNAs expressed in plasma pool libraries, as calculated from each lab and library preparation method. Reproducibility measurements, percent coefficient of variation (%CV; top) and quartile coefficient of dispersion (QCD; bottom), were calculated from CPM values. Horizontal lines within each violin indicate the25, 50 and 75th percentiles, calculated from the mean CPM values of n=977 equimolar pool RNA sequences. . For TruSeq Lab1 n=3 technical replicates; for all other lab/protocols n=4. (c) Boxplots summarize the between-lab reproducibility for miRNAs expressed in the plasma pool libraries using TruSeq (n=6 labs), NEBNext (n=4 labs) and 4N_B (n=4 labs) library preparation protocols. Each dot represents %CV or QCD calculated across labs for a single miRNA. The between-lab %CV and QCD were calculated using the mean CPMs for each sequence across technical replicates for each lab. The underlying boxes show the median and IQR. Whiskers represent the 1st/3rd quartile +/- 1.5 * IQR. Outliers are < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR. (d) Boxplots show the number of mature miRNAs detected by each protocol based on downsampling of datasets to the indicated sequencing depths (see Methods for more details). Each box summarizes number of miRNAs detected by each lab for the indicated protocol. The probability of each miRNA being detected was estimated for every sample randomly downsampled to 104, 104.5, 105, or 105.5 total read counts. A miRNA was only counted as detected if the probability of detection was at least 90%. Libraries with total counts less than the indicated sample size were excluded. Boxplots for 4N includes only in-house 4N protocols (4N_A, B, C and D). The number of libraries summarized by each boxplot is as follows: 10: TruSeq=19; CleanTag=4; NEBNext=12; 4N=28; 10 : TruSeq=23; CleanTag=4; NEBNext=16; 4N=28. The underlying boxes show the median and IQR. Whiskers represent the 1st/3rd quartile +/- 1.5 * IQR. Outliers are < 1st quartile – 1.5 * IQR or > 3rd quartile + 1.5 * IQR.
To evaluate the intra-lab reproducibility of plasma small RNA-seq, we calculated the CV and QCD for individual miRNA sequences across technical replicates in each lab (Figure 5b). After applying the same minimum CPM filtering criteria as before, in order to focus on reliably detected miRNAs, we found that the median CV across the miRNAs analyzed ranged from 7.7% (TruSeq) to 24.9% (CleanTag) for different protocols. Although this degree of reproducibility seems comparable to that observed with the synthetic reference pool RNA (Figure 4 and Supplementary Table 5), it is important to note that the filtering criteria used for plasma sequencing data are different (and generally more stringent) than for the synthetic RNA sequencing data. In addition, the median QCD was ≤0.1 for all the protocols (Supplementary Table 5)Unsupervised clustering of the plasma miRNA expression data revealed clear groups separating by preparation protocol (TruSeq, NEBNext and 4N_B), with results obtained from different labs using the same protocol clustering together (Figure 5a). The median variability across labs measured using CV ranged from 25.7% (4N_B) to 32.9% (TruSeq) and using QCD was < 0.3 for all protocols. (Figure 5c and Supplementary Table 5). The overall reproducibility of small RNA-seq using RNA isolated from biological samples was therefore comparable to that observed using the synthetic reference RNA samples.In order to assess differences between protocols in the diversity of miRNA sequences recovered from the standard reference plasma RNA, we performed an analysis of the number of miRNAs detected by each protocol, in which we plotted data from in-house 4N protocols as one group for the sake of comparison. This was done using downsampled datasets so the same total number of mature miRNA-mapping reads could be compared across protocols, at varying sequencing depths. The in-house 4N protocols recovered a larger number of miRNAs than those using defined adapter sequences (Figure 5d). In addition, an indirect assessment of miRNA diversity (i.e., percent of total reads accounted for by the 10 most abundant miRNAs) was consistent with the conclusion that 4N protocols generate a more diverse profile of miRNAs (Supplementary Figure 18).
Evaluation of small RNA-seq in miRNA A-to-I editing
We extended our study to evaluate performance of different protocols for quantifying sequences exhibiting adenosine to inosine (A-to-I) miRNA editing. This naturally occurring RNA modification can alter both miRNA biogenesis and regulatory functions[44,45]. We designed six synthetic RNA pools, each of which contained ten miRNAs that have previously been reported to undergo A-to-I editing[46,47]. Each pool combined the unedited (A) and edited (I) miRNA variants in different ratios (i.e. 0%, 0.1%, 0.5%, 5%, 50% and 100% edited). Each of these mixtures was then combined with a background of 277 different, unedited human miRNAs in order to increase complexity in the pools (Figure 6a; Supplementary Table 7). The six pools were sequenced by three different labs, each in triplicate, using TruSeq, NEBNext and in-house 4N_B protocols (Figure 6a). The resulting 162 libraries yielded 1.42×109 reads aligned to editing pool sequences in total, with a median library size of 8.22×106 (range: 1.74×106 − 29.01×106). All 162 libraries satisfied minimum quality criteria (Online Methods and Supplementary Table 3).
Figure 6
Library protocol performance in measuring miRNA A-to-I editing events
(a) A schematic depicting the experimental design for the miRNA A-to-I editing experiments. (left) 10 miRNAs were synthesized with either an Adenosine or Inosine at a single position previously shown to be edited in human cells. The position, relative to the 5’ end of the mature miRNA, is indicated to the right of the respective miRNA IDs, along with the identity of the nucleotide. 277 other unedited human miRNAs were added at a fixed concentration to increase the background complexity of the pools. (middle) Six different editing subpools were generated, using a constant amount of the background (Bk) pool and varying percentages of unedited (Un; adenosine) and edited (Ed; inosine) oligos in each pool. (right) The color-coded grid depicts the library design used in the A-to-I editing experiment. Specifically, the six editing pools were sequenced by three participating labs, using three different library preparation protocols, with each lab generating libraries in triplicate. (b) The observed percent editing (y-axis) is shown for each miRNA in the six A-to-I editing pools, as measured by each of the three labs, using TruSeq, NEBNext and 4N_B protocols. The expected editing percent in each pool is both indicated to the right of each plot group and indicated by the horizontal dotted line within each plot. The library prep method is indicated to the right of each plot. The dots and whiskers represent the median and range of percent editing for each miRNA (x-axis) as measured by the three labs. Individual miRNA % editing is shown for n=3 technical replicate libraries for each lab and library preparation method.
To determine accuracy of quantifying miRNA editing in our six synthetic pools, we compared the number of reads observed for the A and I variant oligos in each library to the expected abundance based on the known composition of the pools. Inaccurate and widely varying estimates of editing levels were apparent for many miRNAs using the NEBNext and TruSeq protocols, especially for the 1%, 5% and 50% editing pools (Figure 6b; Supplementary Table 8). In contrast, the in-house protocol, 4N_B, proved more accurate for detecting editing levels ≥ 1%. For example, in the 50% editing pool where the edited and unedited forms of each miRNA are present at equivalent levels, the mean percent editing observed ranged from 19% - 98% and 5% - 95% for TruSeq and NEBNext libraries, respectively, whereas for the 4N_B protocol, estimates were all within 10% of the expected value (43% - 53%).Aside from accuracy, we calculated across-lab reproducibility (i.e., precision) of the measured edited fraction in each pool for each of the evaluated protocols, using CV and QCD, which are most meaningful where there are reads in both edited and unedited categories (Supplementary Table 8). We found that precision varied as a function of known percent editing, with greater precision observed in the 5% and 50% edited pools compared to the 0.1% and 1% pools, as expected from the higher number of edited read counts in the former pools. Across all protocols tested, for the majority of miRNAs, the precision of percent editing measurements was (i) CV: <5% in the 50% edited pool, <20% in the 5% edited pool and <25% in the 1% edited pool; and (ii) QCD: <0.3 in the 50% edited pool, <0.4 in the 5% edited pool and <0.6 in the 1% edited pool.We evaluated the specificity and limit of detection for identifying miRNA editing by downsampling each library to 106 reads to allow standardized comparisons across libraries. To calculate specificity, we first estimated the false positive frequency for each protocol by evaluating: (i) the average percent edited reads observed in the 0% edited pool (i.e., false positive edited read frequency), and; (ii) the average percent unedited reads observed in the 100% edited pool (i.e., false positive unedited read frequency). The overall median false positive rate was 0.10% across all protocols, all miRNAs and both edited and unedited false positive calls (median false positive frequencies for individual protocols: TruSeq 0.05% (edited) and 0.06% (unedited); NEBNext 0.30% (edited) and 0.14% (unedited); 4N_B 0.10% (edited) and 0.08% (unedited) (Supplementary Table 8). This corresponds to an overall specificity across all three protocols of 99.88% for calling unedited sequences and 99.91% for calling edited sequences (Supplementary Table 8). To calculate the limit of detection (LOD), we defined detection of editing as an observed edited count that is more than three standard deviations above the observed edited count in the 0% edited synthetic pool. For all three protocols, the majority of miRNAs had a limit of detection at or below the 1% edited fraction, with a few miRNAs detectable in the 0.1% edited pool (Supplementary Table 8). It is worth noting, however, that the LOD is expected to vary based on sequencing depth, sample complexity, relative abundance of the miRNA being studied, and the pipeline used for analysis.
Discussion
Our results have quantitatively confirmed that small RNA-seq is highly affected by sequence-related bias[4,36-38,48], which is largely protocol-dependent. The observed biases are as large as 104-fold with some commonly used commercial library preparation protocols. This sequence-dependent bias is more severe than that previously reported for long RNA-seq [26], highlighting differences between the technologies and unique challenges involved in small RNA sequencing. Additionally, this bias can be particularly vexing when working with low RNA input samples such as biofluids, preventing the reliable detection of some low-abundance small RNAs. The in-house 4N protocols evaluated here, which employ adapters containing degenerate bases in the ligating ends, reduced the bias on the order of 100-fold and achieved better coverage at a lower sequencing depth than the widely used commercial library preparation kits with invariant adapter sequences. The magnitude of the bias observed for some sequences when using fixed adapter protocols was so high that it is likely impractical to overcome simply with increased sequencing depth. There were, however, differences in the results of different 4N methods, suggesting that not only the use of adapters with degenerate bases but also other factors in the protocols, such as the concentration of polyethylene glycol in ligation reactions, the time and temperature of ligations, etc., may also affect the bias. Our computational analyses of a range of sequence-related variables (e.g., 5’ or 3’ terminal nucleotides, %GC of the four 5’ or 3’end nucleotides, overall %GC, dG [free energy], dH [enthalpy], dS [entropy] and Tm) did not reveal strong associations, suggesting that the mechanistic basis of the bias may be complex.Even using the best-performing 4N protocols, there is still considerable sequence-related bias, which precludes the use of read counts alone for accurate quantification of different small RNAs within a given sample. However, despite the observed biases, we found that small RNA-seq is consistently accurate for relative quantification of a given miRNA between samples, as long as the same library preparation protocol is used for the two samples being compared, which is in agreement with previous observations for mRNA sequencing[26]. In this sense, all of the evaluated protocols were able to distinguish samples with as little as 1.5-fold difference in relative abundance of most sequences examined, although the design of our ratiometric pools is such that differences smaller than 1.5-fold could not be assessed.Reproducibility across laboratories is a crucial requirement for any experimental method used for research or clinical applications[49,50]. We found that for common commercial protocols as well as for our in house 4N protocol, results are reproducible between labs with a CV ≤ 20% for most sequences. Moreover, when comparing relative quantification measurements obtained by small RNA-seq across labs, the results were highly concordant even when the centers were using different protocols.We believe that the datasets generated in this study can also serve as a valuable resource for benchmarking computational tools designed to facilitate and improve upon RNA-seq analysis. This is important both for developing new software, and also for evaluating the suitability of using existing mRNA-seq algorithms for the analysis of small RNA-seq datasets. This could be particularly important for benchmarking software developed to account for various technical biases found in mRNA-seq data[51-54], as our data suggest that such biases may be different in small RNA-seq data.We also hope that our data may facilitate development of computational approaches for normalization of datasets generated using different library preparation protocols. While normalization algorithms are generally not intended to account for cross-platform variation, our preliminary analysis suggests that small RNA-seq protocol-specific biases largely correlate across samples. This suggests that one may be able to account for the protocol-specific differences in sequencing bias individually for each sequence, raising the possibility of cross-protocol data normalization. We performed an initial exploration of this concept using a simple approach for calculating correction factors (Supplementary Results, Supplementary Table 9 and Supplementary Figures 19 and 20). Although this approach was able to make overall profiles from different protocols appear more similar to each other, its performance is not sufficient to be practically relevant yet. We propose that synthetic RNA reference data such as that generated here can provide a foundation for future development of more advanced computational approaches to enable accurate cross-protocol comparisons.We also assessed the ability of library protocols to measure miRNA A-to-I editing. Our results demonstrate that low bias protocols (i.e. in house 4N_B) quantify editing more accurately than protocols using defined adapter sequences (i.e. TruSeq and NEBNext). It worth noting than accuracy of editing estimates can also be affected by low sequencing coverage. Indeed, some miRNAs had very low coverage by at least one of the protocols, which contributed to the inaccuracy and variation in editing estimates. However, this lack of coverage is a consequence of technical biases in small RNA-seq, since 4N_B libraries all had sufficient coverage of each sequence and because, at a minimum, all libraries had depth enough for ~6000x coverage of each sequence in the pool. Thus, protocols with a higher degree of sequencing bias also have a greater potential for inaccurate estimates of editing levels, because of lower read coverage for some miRNAs and/or differential preferences based on a single base difference. This is relevant to miRNA editing estimates reported in the literature, given that prior studies have commonly used the more biased protocols with fixed sequence adapters.[55]
Online Methods. Experimental methods
Note: a Life Sciences Reporting Summary has been submitted with this paper.
Reference samples
A synthetic equimolar pool containing 1,152 synthetic RNA oligos was prepared in an RNase-free environment and working on ice to minimize degradation. The pool was prepared by combining (i) the miRXplore Universal Reference from Miltenyi Biotec Inc (Auburn, CA), which comprises 962 RNA oligonucleotides with sequences matching human and other miRNAs, and (ii) a set of 190 additional, custom-synthesized RNA oligonucleotides, to generate the pool in which each of the 1,152 RNA oligonucleotides is present at equimolar concentration. This latter set comprises miRNAs and non-miRNA sequences of varied length from 15 to 90 nt, which were synthesized, HPLC-purified and quantified spectrophotometrically by IDT, Inc. (Coralville, IA). This latter set of RNA oligonucleotides is available to qualified investigators seeking to reproduce the synthetic equimolar for non-commercial purposes, by request of the corresponding authors (as long as supplies last). The resulting equimolar pool was aliquoted in prelabeled DNA-, DNase-, RNase-, and pyrogen-free screw cap tubes with low adhesion surface and stored immediately at −80C. Aliquots were distributed to the participant laboratories in overnight shipments with an abundant supply of dry ice. The complete list of RNA sequences comprising the equimolar pool is provided in Supplementary Table 1.Two ratiometric pools, SynthA and SynthB, containing 334 synthetic RNAoligonucleotides were designed in the coordinating lab (see computational methods) and synthesized by IDT. Subsets of these oligos were present in 15 different ratios between the two mixtures. These pools were also prepared, aliquoted and distributed to the participant centers following the same previously mentioned precautions to avoid RNA degradation. The complete list of sequences in the SynthA and SynthB pools, as well as their ratios, are provided in Supplementary Table 2.Plasma samples from eleven healthy male donors with age ranging from 21-45 years were collected and pooled in one of the participating labs (Supplementary protocol 1). The Beth Israel Deaconess Medical Center approved the study protocol to consent participants and collect samples. Informed consent was obtained from all subjects, and the samples were subsequently anonymized before distributing to participating research groups. RNA was isolated from this plasma pool (Supplementary protocol 2) in a single lab and aliquots of the purified RNA were mixed and distributed to the rest of the participant centers.
Library preparation and small RNA-seq of reference samples
A written guideline for library preparation and sequencing was distributed to all the participant centers. The input for library preparation was 10 femtomoles of RNA for synthetic pools and 2.1 ul of RNA for the plasma pool. Each group prepared four replicate libraries from each sample using the following small RNA library preparation protocols: Lab1 (TruSeq, NEBNext and in-house 4N_D), Lab2 (TruSeq, NEBNext and in-house 4N_B), Lab3 (TruSeq and NEBNext), Lab4 (TruSeq, NEBNext and in-house 4N_B), Lab5 (TruSeq, CleanTag, NEBNext, in-house 4N_A, in-house 4N_B and 4N_Xu), Lab6 (TruSeq, in-house 4N_B and in-house 4N_C), Lab 7* (NEXTflex), Lab 8* (TruSeq and NEXTflex) and Lab 9* (TruSeq and NEBNext). The labs marked with an asterisk did not contribute plasma libraries.The protocols for TruSeq, CleanTag, NEBNext and NEXTflex for Illumina were performed according to the manufacturer’s instructions in all labs except for NEBNext in Lab9 that performed 3’ overnight ligation. Note that some manufacturers recommended dilution of the adapters when working with low input RNA (for NEBNext, adapters were diluted, 1:2 in Lab3 and Lab9 and 1:6 in Lab1, Lab2, Lab4 and Lab5; for CleanTag 1:20 dilution of the adapters was performed). NEXTflex for Ion Torrent sequencing was performed as described in Supplementary protocol 3. In-house 4N protocols A, B, C and D were performed as described in Supplementary protocol 4-7. 4N_Xu protocol was performed as previously described[4835]. Size selection was performed using Pippin Prep instruments (in Lab1, Lab2, Lab4 and Lab6 for all protocols and Lab5 for in-house 4N_B only), 6% acrylamide gels (in Lab3, and Lab9 for all protocols, Lab 8 for TruSeq and Lab5 for TruSeq, NEBNext, CleanTag, 4N_A and 4N_Xu) or Ampure XP beads (in Lab7 and Lab8 for NEXTflex).Single-end libraries were sequenced using the Illumina HiSeq 2500 (Lab8 and Lab9 for all the protocols and Lab5 for TruSeq, CleanTag, 4N-Xu and 4N_A), Illumina HiSeq 4000 (Lab4 for all the protocols and Lab1 for TruSeq, 4N_D and equimolar NEBNext), Illumina NextSeq 500 (Lab2, Lab3 and Lab6 for all the protocols, Lab5 for NEBNext and in-house 4N_B and Lab1 for ratiometric and plasma NEBNext) or Ion Torrent (Lab7) platforms (see Supplementary Table 3 which also includes information on miRNA editing libraries). All labs using Illumina sequencing performed runs specifying ≥ 50 bp single-end reads. Details on read lengths for each library are included in Supplementary Table 3. Each laboratory was free to choose the number of samples to pool per lane, with a target of at least 8 million reads per library. FASTQ files were uploaded to the Genboree Workbench for central data analysis.
Evaluation of miRNA editing
Ten human miRNAs previously shown in the literature to undergo adenosine-to-inosine (A-to-I) RNA editing were selected to evaluate the performance of small RNA-seq in the detecting miRNA editing. To this end, we designed six pools containing different ratios of the selected synthetic edited miRNAs and their unedited counterparts (i.e. 0%, 0.1%, 0.5%, 5%, 50% and 100% edited) plus 277 unrelated human miRNAs. All RNA oligonucleotides were synthesized by IDT (the complete list of sequences included in these pools is provided in Supplementary Table 7). The pools were prepared and aliquoted in the coordinating center and distributed to two additional labs following the same previously mentioned precautions to avoid RNA degradation. Each lab prepared three replicate libraries from 10 femtomoles of each pool using the three different small RNA library preparation protocols: TruSeq, NEBNext and in-house 4N_B. The protocols for TruSeq and NEBNext were performed according to the manufacturer’s instructions (note that for NEBNext, adapters were diluted 1:2). In-house 4N_B was performed as described in Supplementary protocol 5. Size selection was performed using the Pippin Prep. ≥ 50 bp single-end libraries were sequenced using the Illumina NextSeq 500.
Online methods. Computational methods
Designing ratiometric pools
290 artificial sequences were assigned at random to 8 ratiometric groups (1, 1.5, 2, 3, 4, 5, 8 and 10x) and to either ratiometric pool SynthA or SynthB. The ratio indicates the concentration in the assigned pool relative to the other pool. For example, a sequence in the 10x pool assigned to SynthA would be present at the base concentration in SynthB and at 10x the base concentration in SynthA. To make groups of approximately equal size, assigned 8 sequences to the 8 ratiometric groups randomly, without replacement. To ensure the total amount of oligonucleotide was approximately equal in SynthA and SynthB, an even number of sequences was assigned to each ratiometric group and were distributed equally between pools, using a similar method of equally distributed random assignment. The random assignment was performed in Excel and the complete pool composition and ratios are shown in Supplementary Table 2.
Barcode splitting, FASTQ generation and data coordination
High-throughput sequencing, demultiplexing and FASTQ file generation was performed by each participating group independently. FASTQ files were uploaded to the Genboree Workbench for centralized analysis using the ExceRpt small RNA analysis pipeline. (http://genboree.org/java-bin/workbench.jsp).
Preprocessing, mapping and read counting
FASTQ files for the equimolar, ratiometric and plasma pools were initially processed through the exceRpt small RNA-seq Pipeline (Version 4.6.2), using the batch submission tool. For details on the exceRpt pipeline and the associated processing steps, see the Genboree Workbench documentation (http://genboree.org/theCommons/projects/exrna-tools-may2014/wiki/Small%20RNA-seq%20Pipeline). A brief description of parameters changed from the default settings or that differed between libraries is included below.The exceRpt pipeline was used at the default settings whenever possible. The default for adapter trimming is “auto-detect” which identifies and trims the adapter sequence for multiple library types, and all samples were initially submitted using this functionality. For 4N libraries (A, B, C, D, Xu and NEXTflex), an additional parameter was selected to indicate the degenerate sequence at the end of each adapter. The default random barcode settings were used, indicating that random 4nt sequences are present immediately 5’ and 3’ of the insert sequence. The sequence and identity of the adapter identified by the exceRpt pipeline was confirmed in the output files. Any library with a missing or incorrect adapter identified was re-submitted to the pipeline with the adapter sequence chosen manually, and a note was added to Supplementary Table 3.For plasma pool libraries, sequences shorter than 18 nt after adapter trimming were removed and not used for downstream analysis. For synthetic pools, the minimum length was changed to 15 nt, which corresponds to the length of the shortest sequences in the equimolar and ratiometric pools.To quantify alignments to the full set of synthetic pool sequences, equimolar, ratiometric SynthA and ratiometric SynthB libraries were mapped to a “Spike-In” sequence library uploaded to the Genboree Workbench. This spike-in library FASTA file contains a non-redundant set of sequences from the ratiometric and equimolar pools (Supplementary Table 7). Adapter-trimmed and filtered reads were mapped to the spike-in index with bowtie2 using the default Genboree Workbench alignment parameters, except that the minimum read length was reduced to 15. The number of reads aligning to each sequence was obtained from the “calibratormapped” output files. At the time of writing, reads mapped to the spike-in sequences are removed prior to genomic alignment, so any endogenous alignment information from these samples was ignored. To quantify alignments to endogenous miRNAs, equimolar and plasma pool libraries were also run through the exceRpt pipeline without mapping to spike-in sequences. The default minimum read length of 18 nt was used, along with all default alignment parameters. Reads were mapped to hg19 using the STAR alignment algorithm. Multi-mapping-adjusted read counts corresponding to mature miRNAs were used for all plasma pool analyses and for the equimolar pool correction factor analyses. For all other analyses with the equimolar and ratiometric pools, the spike-in read counts from the “calibratormapped” files were used.
Sample filtering
Unless specifically noted in the text, only libraries meeting minimum read count requirements were considered for analysis. For the synthetic pools, an average of one million reads mapping to the “spike-in” sequences (the unique set of sequences present in the equimolar and ratiometric pools), were required across all replicate libraries. The average was taken after filtering, such that the totals were based only on 5’-phosphorylated sequences 16-25 nt in length. For the plasma pool samples, replicate libraries with fewer than 100,000 miRNA-mapping reads were removed. The entire sample was removed if more than one of the replicate libraries failed to pass the minimum count threshold.
Equimolar pool analysis
Read counts for the equimolar (and likewise for the ratiometric pools) were obtained from “calibratormapped.counts” files included in the exceRpt pipeline output for each sample file. Sample-specific information, including the contributing lab, library preparation method and replicate number were associated with the corresponding calibrator count file, and were loaded into R for analysis. A full list of equimolar and ratiometric sequences with additional sequence information was used as a reference to merge all input files and add zero counts, where needed. Unless specifically mentioned in the text, analysis of ratiometric and equimolar libraries was limited to sequences with a 5’-phosphate modification, 16-25nt in length. Read counts were scaled to counts-per-million (CPM) using the total counts from the filtered sequences.For plots and calculations using log-transformed values, an arbitrarily small count was added to avoid taking the log of zero. To confirm that the extent of sequencing bias and reproducibility we observed was not influenced by the choice of pseudo-count for calculating CPM, we repeated our calculations in different ways with pseudo-counts ranging from 1 to 0.0001 (see Supplementary Figure 16). The adjusted CPM values were calculated using the method employed by the R package, EdgeR[39,40]. This scales the user-supplied prior count (0.25; the default setting) to be proportional to the library size. The scaled prior count is calculated by multiplying the raw prior count (0.25) by the sample library size divided by the mean library size across all equimolar samples and then adding this value to the raw counts for each miRNA. Library sizes are adjusted by adding 2 x the scaled prior count value. Adjusted CPM values are finally calculated as (raw.count + adjusted.prior) * 106/adjusted.library.size.
Determining overrepresented and under-represented sequences
Sequences in each equimolar pool replicate library were ranked by abundance, assigning the minimum rank value in case of ties. The 10 top and bottom-ranked sequences were determined by arranging counts in descending and ascending order, respectively. TruSeq, NEBNext, CleanTag and 4N libraries were each queried for sequences consistently found in the top or bottom 10, as defined by at least 75% agreement among the libraries of at least one method.
Dissecting the source of bias
The CPM obtained for each sequence of the equimolar pool was calculated using pseudo-counts, as in the equimolar pool analysis described above, except that the library sizes were calculated from all equimolar pool sequences prior to filtering for length and end modifications. Sequence length, 5’and 3’ terminal bases, %GC of the four 5’or 3’end base, overall %GC, dG [free energy], dH [enthalpy], dS [entropy] and Tm [melting temperature] were calculated from the annotated sequence. UNAFold [56] (http://unafold.rna.albany.edu/) was used to obtain the dG, dH, dS and Tm values of each of the sequences comprising the equimolar pool.
Ratiometric pools analysis
Ratiometric pool counts were initially processed as described above for equimolar pools, considering only counts for 16-25 nt sequences. The ratio of SynthA:SynthB was calculated as the ratio of the mean CPM across technical replicates in SynthA/SynthB.
Ratiometric Pools: Differential Expression
Independent differential expression workflows were run for each lab and library prep method, following a standard two-group comparison between “A” and “B” ratiometric pools. Normalization, dispersion estimation and differential expression testing was performed using three different R packages: EdgeR[39,40], DESeq2[41] and limma/voom[42]. For EdgeR, normalization factors were calculated using the Relative Log Expression (RLE) method, and significance was calculated (after calculating common, trended and tagwise dispersion estimates) using the default settings, based on a likelihood ratio test on the null hypothesis that ratiometric sample B - A = 0. Default settings were used for DESeq2, and significance was calculated based on a Wald Test. Significance for the limma/voom workflow was based on an empirical Bayes moderated t-test.
Plasma pool analysis
Comparison of plasma pool libraries was limited to mature miRNAs. Read counts for mature miRNAs were taken from “readCounts_miRNAmature_sense.txt” files provided in the exceRpt pipeline output. The read count files and associated metadata for all samples were loaded and merged in R for further analysis. Multi-mapping-adjusted read counts are calculated as part of the exceRpt pipeline and were used for all comparisons. The total number of unique reads mapping to miRNAs was taken from “.stats” files provided in the exceRpt pipeline output.
Downsampling read counts
The R package Vegan, was used to simulate random downsampling of equimolar and plasma pool count matrices. The Vegan function, drarefy, was used to estimate the probability of detection for each sequence based on random simulations of downsampling to specified levels. For the plasma pools, downsampling was performed to four different levels (104, 104.5, 105 and 105.5). Equimolar pools were downsampled to six different levels (106.5 down to 104 at half-log intervals). Libraries with read counts below the specified threshold were removed. A minimum probability of 0.9 was used as the threshold for detection.
Equimolar pool samples were processed through the exceRpt pipeline using the same input parameters as the plasma pool libraries, in order to obtain multi-mapping, scaled read counts for mature miRNAs that were directly comparable to the plasma pool counts. Differential expression analyses were performed using the mature miRNA read counts for the equimolar pool samples, and scaling factors were calculated for each miRNA, and were taken from the resulting log2 fold-change estimates. Key assumptions used in these calculations were: that the median mapped read level calculated for a given protocol should match the median for the 4N results given the same RNA input; that the comparisons of the results from a given protocol and the 4N protocol are performed on data processed in the same way that biological samples are processed (i.e., using the exceRpt pipeline and its mapped read outputs). For details on limma and voom functionality and the parameters used, see the documentation for the limma package.To summarize: correction factors were calculated for each pair of library preparation methods using the following workflow:Filter out miRNAs with 0 counts in any library: For the subset of samples being tested, scaling factors are only calculated for miRNAs having at least one count in every sample of the two methods being tested.Prepare miRNA count matrices for linear modeling: Use the R package, voom, to calculate precision weight estimates and normalize data to allow count data to be analyzed appropriately using the limma package. Normalization is also performed between samples such that the median miRNA expression value is the same in all samples.Fit miRNA-wise linear models to account for batch (lab) effect: The lmFit function from the limma package is used to fit linear models for each miRNA, estimating coefficients for each lab+library prep method. The coefficients represent the differences in expression for each miRNA between each lab+library prep method.Estimate the log fold change between the two methods for each miRNA: Use the fitted model to calculate for each miRNA the average expression estimated using the method A coefficients - the average expression estimate for the method B coefficients.The R packages limma/voom were used for read count normalization and differential expression estimates, using standard workflows suggested for RNA-seq data to account for batch (lab) effects, and then testing for the main effect of the library prep methods. For each pairwise comparison of library preparation methods, equimolar pool counts matrices were extracted and only miRNAs with ≥ 1 read count in all samples of both methods were kept for analysis. After filtering, voom was used to normalize the count data and calculate precision weight estimates that allow count data to be appropriately tested with the linear modeling schema used in the limma package. Voom was run with the default parameters, except that read counts were additionally normalized between arrays using the “scale” method, which adjusts read counts such that the median miRNA expression value is the same in all labs. The voom-transformed data was supplied to the limma lmFit function, along with a design matrix indicating the coefficients to be estimated. Initially, coefficients were estimated for each lab+library prep method in order to model batch/lab-specific effects. The main effect of the library prep method was then calculated as the average effect of method 1 - method 2. A contrasts matrix was generated and supplied, with the fitted model, to the contrasts.fit function, followed by an empirical Bayes function to estimate the resulting statistics for each miRNA. Log2 fold-change estimates, along with 95% CI were obtained from these estimates.
The equimolar pool-derived, inter-protocol bias correction factors were applied to the corresponding plasma pool samples for testing. To apply the correction factors, count matrices for the subset of plasma pool libraries being compared were selected and were then pre-filtered and normalized in the same way as the equimolar pools in generating the correction factors, described above. MiRNAs were filtered to include only those with a) a correction factor estimated from the equimolar pool and b) at least 5 counts in every library in the subset of methods being compared. Correction factors were applied to the appropriate samples. For example, if correction factors were calculated as the log2 fold-change between TruSeq and 4N samples (TruSeq - 4N), then the correction factors would be applied to the log2-transformed TruSeq plasma pool samples by subtracting the correction factor. For the heatmaps and density plots, corrected values were added to the original count matrix of untransformed values, and unless specifically noted in the text, normalized using quantile normalization.
miRNA editing analysis
Editing libraries were trimmed of 5’ and 3’ adapters using cutadapt (version 1.9.1). Trimmed reads ≥ 16 nt were aligned to editing pool sequences using bowtie2 (version 2.3.2) in local alignment mode. The first (5’) 4 nt were removed from 4N library reads during the alignment stage by adding the optional parameter “–trim5p 4”. Read counts were calculated from alignments filtered to have a minimum MAPQ of 20 and 0 mismatches to the reference sequence within the locally-aligned region. The sum totals of the filtered read counts for each library were used to calculate read Counts Per Million (CPM). Down-sampling was performed using the R package Vegan.
Data availability and accession code availability statements
Sequencing data for all experiments can be obtained from the GEO Superseries, GSE94586. Accession numbers for the four subseries are: GSE94584 (Equimolar), GSE94585 (Ratiometric A/B), GSE94582 (Human Plasma Pool) and GSE108138 (A-to-I Editing). GEO records include raw FASTQ files and processed counts from the ExceRpt pipeline.All code, metadata and processed data files required for reproducing the figures, tables and in-text statistical summaries are freely available on GitHub (https://github.com/rspengle/CrossU01_exRNA_Manuscript2017). The repository also includes a Packrat library with a snapshot of R package versions used.
Authors: Peter A C 't Hoen; Marc R Friedländer; Jonas Almlöf; Michael Sammeth; Irina Pulyakhina; Seyed Yahya Anvar; Jeroen F J Laros; Henk P J Buermans; Olof Karlberg; Mathias Brännvall; Johan T den Dunnen; Gert-Jan B van Ommen; Ivo G Gut; Roderic Guigó; Xavier Estivill; Ann-Christine Syvänen; Emmanouil T Dermitzakis; Tuuli Lappalainen Journal: Nat Biotechnol Date: 2013-09-15 Impact factor: 54.908
Authors: Yanzhu Lin; Kseniya Golovnina; Zhen-Xia Chen; Hang Noh Lee; Yazmin L Serrano Negron; Hina Sultana; Brian Oliver; Susan T Harbison Journal: BMC Genomics Date: 2016-01-05 Impact factor: 3.969
Authors: Sheng Li; Paweł P Łabaj; Paul Zumbo; Peter Sykacek; Wei Shi; Leming Shi; John Phan; Po-Yen Wu; May Wang; Charles Wang; Danielle Thierry-Mieg; Jean Thierry-Mieg; David P Kreil; Christopher E Mason Journal: Nat Biotechnol Date: 2014-08-24 Impact factor: 54.908
Authors: Alexandra M Ainsztein; Philip J Brooks; Vivien G Dugan; Aniruddha Ganguly; Max Guo; T Kevin Howcroft; Christine A Kelley; Lillian S Kuo; Patricia A Labosky; Rebecca Lenzi; George A McKie; Suresh Mohla; Dena Procaccini; Matthew Reilly; John S Satterlee; Pothur R Srinivas; Elizabeth Stansell Church; Margaret Sutherland; Danilo A Tagle; Jessica M Tucker; Sundar Venkatachalam Journal: J Extracell Vesicles Date: 2015-08-28
Authors: Jeanette Baran-Gale; C Lisa Kurtz; Michael R Erdos; Christina Sison; Alice Young; Emily E Fannin; Peter S Chines; Praveen Sethupathy Journal: Front Genet Date: 2015-12-22 Impact factor: 4.599
Authors: Kemal M Akat; Youngmin A Lee; Arlene Hurley; Pavel Morozov; Klaas Ea Max; Miguel Brown; Kimberly Bogardus; Anuoluwapo Sopeyin; Kai Hildner; Thomas G Diacovo; Markus F Neurath; Martin Borggrefe; Thomas Tuschl Journal: JCI Insight Date: 2019-04-11
Authors: Fabiana Gámbaro; Marco Li Calzi; Pablo Fagúndez; Bruno Costa; Gonzalo Greif; Emily Mallick; Shawn Lyons; Pavel Ivanov; Kenneth Witwer; Alfonso Cayota; Juan Pablo Tosar Journal: RNA Biol Date: 2019-12-29 Impact factor: 4.652
Authors: José Roberto Bermúdez-Barrientos; Obed Ramírez-Sánchez; Franklin Wang-Ngai Chow; Amy H Buck; Cei Abreu-Goodger Journal: Nucleic Acids Res Date: 2020-02-28 Impact factor: 16.971
Authors: Zachary T Herbert; Jyothi Thimmapuram; Shaojun Xie; Jamie P Kershner; Fred W Kolling; Carol S Ringelberg; Ashley LeClerc; Yuriy O Alekseyev; Jun Fan; Jessica W Podnar; Holly S Stevenson; Gary Sommerville; Shipra Gupta; Maura Berkeley; Julie Koeman; Anoja Perera; Allison R Scott; Jennifer K Grenier; Jeffrey Malik; John M Ashton; Kara L Pivarski; Xinkun Wang; Gina Kuffel; Tania E Mesa; Andrew T Smith; Jianjun Shen; Yoko Takata; Thomas L Volkert; Jennifer A Love; Yanping Zhang; Jun Wang; Xiaoling Xuei; Marie Adams; Stuart S Levine Journal: J Biomol Tech Date: 2020-07
Authors: Thomas Desvignes; Phillipe Loher; Karen Eilbeck; Jeffery Ma; Gianvito Urgese; Bastian Fromm; Jason Sydes; Ernesto Aparicio-Puerta; Victor Barrera; Roderic Espín; Florian Thibord; Xavier Bofill-De Ros; Eric Londin; Aristeidis G Telonis; Elisa Ficarra; Marc R Friedländer; John H Postlethwait; Isidore Rigoutsos; Michael Hackenberg; Ioannis S Vlachos; Marc K Halushka; Lorena Pantano Journal: Bioinformatics Date: 2020-02-01 Impact factor: 6.937
Authors: Oscar D Murillo; William Thistlethwaite; Joel Rozowsky; Sai Lakshmi Subramanian; Rocco Lucero; Neethu Shah; Andrew R Jackson; Srimeenakshi Srinivasan; Allen Chung; Clara D Laurent; Robert R Kitchen; Timur Galeev; Jonathan Warrell; James A Diao; Joshua A Welsh; Kristina Hanspers; Anders Riutta; Sebastian Burgstaller-Muehlbacher; Ravi V Shah; Ashish Yeri; Lisa M Jenkins; Mehmet E Ahsen; Carlos Cordon-Cardo; Navneet Dogra; Stacey M Gifford; Joshua T Smith; Gustavo Stolovitzky; Ashutosh K Tewari; Benjamin H Wunsch; Kamlesh K Yadav; Kirsty M Danielson; Justyna Filant; Courtney Moeller; Parham Nejad; Anu Paul; Bridget Simonson; David K Wong; Xuan Zhang; Leonora Balaj; Roopali Gandhi; Anil K Sood; Roger P Alexander; Liang Wang; Chunlei Wu; David T W Wong; David J Galas; Kendall Van Keuren-Jensen; Tushar Patel; Jennifer C Jones; Saumya Das; Kei-Hoi Cheung; Alexander R Pico; Andrew I Su; Robert L Raffai; Louise C Laurent; Matthew E Roth; Mark B Gerstein; Aleksandar Milosavljevic Journal: Cell Date: 2019-04-04 Impact factor: 41.582
Authors: Srimeenakshi Srinivasan; Ashish Yeri; Pike See Cheah; Allen Chung; Kirsty Danielson; Peter De Hoff; Justyna Filant; Clara D Laurent; Lucie D Laurent; Rogan Magee; Courtney Moeller; Venkatesh L Murthy; Parham Nejad; Anu Paul; Isidore Rigoutsos; Rodosthenis Rodosthenous; Ravi V Shah; Bridget Simonson; Cuong To; David Wong; Irene K Yan; Xuan Zhang; Leonora Balaj; Xandra O Breakefield; George Daaboul; Roopali Gandhi; Jodi Lapidus; Eric Londin; Tushar Patel; Robert L Raffai; Anil K Sood; Roger P Alexander; Saumya Das; Louise C Laurent Journal: Cell Date: 2019-04-04 Impact factor: 41.582