Dilip A Durai, Marcel H Schulz.
Abstract
Specialized de novo assemblers for diverse data types have been developed and are in widespread use for the analyses of single-cell genomics, metagenomics and RNA-seq data. However, assembly of large sequencing datasets produced by modern technologies is challenging and computationally intensive. In-silico read normalization has been suggested as a computational strategy to reduce redundancy in read datasets, which leads to significant speedups and memory savings in assembly pipelines. Previously, we presented a set multi-cover optimization based approach, ORNA, in which reads are reduced without losing important k-mer connectivity information, as used in assembly graphs. Here we propose extensions to ORNA, named ORNA-Q and ORNA-K, which consider a weighted set multi-cover optimization formulation of the in-silico read normalization problem. These novel formulations make use of the base quality scores obtained from sequencers (ORNA-Q) or the k-mer abundances of reads (ORNA-K) to further improve normalization. We devise efficient heuristic algorithms for solving both formulations. In applications to human RNA-seq data, ORNA-Q and ORNA-K are shown to assemble more or equally many full-length transcripts compared to other normalization methods at similar or higher read reduction values. The algorithms are implemented in the latest version of ORNA (v2.0, https://github.com/SchulzLab/ORNA).
Year: 2019 PMID: 30914698 PMCID: PMC6435659 DOI: 10.1038/s41598-019-41502-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
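To make the two weighting schemes in the abstract concrete, below is a minimal sketch of how a per-read weight could be derived from base qualities (the ORNA-Q idea) or from k-mer abundances (the ORNA-K idea), followed by a greedy selection loop in the spirit of the weighted set multi-cover heuristic. The function names, the Phred+33 assumption, and the log-base default are illustrative, not ORNA v2.0's actual implementation.

```python
import math
from collections import Counter

def quality_weight(qual: str) -> float:
    """Mean base quality of a read, assuming Phred+33 ASCII encoding
    (an illustrative stand-in for the ORNA-Q read weight)."""
    return sum(ord(c) - 33 for c in qual) / len(qual)

def abundance_weight(seq: str, kmer_counts: Counter, k: int = 21) -> float:
    """Mean dataset abundance of a read's k-mers
    (an illustrative stand-in for the ORNA-K read weight)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return sum(kmer_counts[km] for km in kmers) / len(kmers)

def greedy_normalize(reads: list[str], weights: dict[str, float],
                     kmer_counts: Counter, k: int = 21,
                     b: float = 1.7) -> list[str]:
    """Greedy weighted set multi-cover sketch: visit reads in decreasing
    weight order and keep a read only if one of its k-mers is covered
    fewer than ceil(log_b(abundance)) times in the reduced set so far.
    `kmer_counts` must contain every k-mer occurring in `reads`."""
    covered: Counter = Counter()
    kept = []
    for r in sorted(reads, key=weights.__getitem__, reverse=True):
        kmers = [r[i:i + k] for i in range(len(r) - k + 1)]
        if any(covered[km] < max(1, math.ceil(math.log(kmer_counts[km], b)))
               for km in kmers):
            kept.append(r)
            for km in kmers:
                covered[km] += 1
    return kept
```

Sorting by weight is what distinguishes the weighted formulations from plain ORNA here: high-quality (or high-abundance) reads get the first chance to cover each k-mer, so the retained set is biased toward more reliable reads at the same reduction level.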
Figure 1. Average read quality score (a) and average read abundance score (b) distributions in the brain RNA-seq data, and the position-wise distributions of the average read quality score (c) and average read abundance score (d) in the brain dataset. Reads were divided into bins of 1 million reads (x-axis). Each bin was then treated as a partial dataset, and its average quality and abundance scores were calculated (y-axis).
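A short sketch of the binning behind Figure 1, assuming per-read scores (quality or abundance) have already been computed; the bin size default and function name are illustrative:

```python
def binned_averages(scores: list[float],
                    bin_size: int = 1_000_000) -> list[float]:
    """Split per-read scores into consecutive bins of `bin_size` reads and
    return each bin's mean, treating every bin as a partial dataset."""
    return [sum(scores[i:i + bin_size]) / len(scores[i:i + bin_size])
            for i in range(0, len(scores), bin_size)]
```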
Figure 2. Comparison of ORNA-Q (a) and ORNA-K (b) against ORNA applied to different read orderings (x-axis) of the brain dataset. Order 1 denotes the original dataset ordering; orders 2–4 were obtained by randomly reshuffling the reads. The average scores of the reads in the reduced dataset are shown on the y-axis. All orderings result in a similar amount of reduction.
Figure 3. Effect of varying the log base parameter b (x-axis) on the average read weight (y-axis) of the normalized brain datasets: quality-based weight in (a) and abundance-based weight in (b). The black and grey bars represent normalization using ORNA-Q/-K and ORNA, respectively.
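In the original ORNA formulation, the coverage demanded for each k-mer grows as log_b of its abundance, so increasing the base b lowers the demand and strengthens the reduction. A minimal sketch of that relationship (the function name is illustrative):

```python
import math

def required_multiplicity(abundance: int, b: float) -> int:
    """Copies of a k-mer that the reduced dataset must retain under the
    log_b(abundance) rule; a larger base demands fewer copies."""
    return max(1, math.ceil(math.log(abundance, b)))

# Example: a k-mer seen 100 times demands 14 copies at b = 1.4,
# but only 6 copies at b = 2.5.
print(required_multiplicity(100, 1.4), required_multiplicity(100, 2.5))  # 14 6
```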
Comparison of mean F1 scores, nucleotide precision, and nucleotide recall.
| Dataset | Measure | Unreduced | ORNA-K | ORNA-Q | ORNA | Diginorm | Bignorm |
|---|---|---|---|---|---|---|---|
| Brain | F1 | 0.442 | 0.441 | 0.438 | 0.441 | 0.426 | — |
| HeLa | F1 | 0.280 | 0.279 | 0.278 | 0.273 | 0.272 | — |
| Brain | Recall | 0.347 | 0.347 | 0.345 | 0.349 | 0.331 | — |
| HeLa | Recall | 0.354 | 0.355 | 0.360 | 0.359 | 0.369 | — |
| Brain | Precision | 0.610 | 0.608 | 0.603 | 0.603 | 0.598 | — |
| HeLa | Precision | 0.232 | 0.227 | 0.227 | 0.214 | 0.219 | — |
Brain and HeLa datasets normalized by the five algorithms (ORNA, ORNA-Q, ORNA-K, Diginorm, and Bignorm) were assembled using TransABySS. Several normalized datasets were obtained by varying parameters for each algorithm, and each of these datasets was assembled separately. All assemblies were then evaluated using REF-EVAL, and averages were taken over the results of the different assemblies. The mean F1, precision and recall scores obtained for the original dataset are shown in the Unreduced column. The highest mean obtained by any normalization algorithm is shown in bold.
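The reported scores behave like the harmonic mean of nucleotide precision and recall: although a mean of per-assembly F1 values need not exactly equal the harmonic mean of the mean precision and recall, plugging the unreduced values above into the standard F1 formula reproduces the table. A quick check:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Unreduced assemblies from the table above:
print(f"Brain: {f1(0.610, 0.347):.3f}")  # Brain: 0.442
print(f"HeLa:  {f1(0.232, 0.354):.3f}")  # HeLa:  0.280
```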
Figure 4. Comparison of assemblies generated from normalized datasets. The percentage of reads reduced by a normalization algorithm (x-axis) is plotted against the percentage of complete transcripts assembled (y-axis), an assembly performance measure. Each point on a line corresponds to a different parametrization of the algorithm. Panels (a) and (b) show TransABySS assemblies (k = 21) of the normalized brain and HeLa data, respectively.
Runtime (in minutes) and memory (in GB) required by ORNA-Q/-K, ORNA, Diginorm and Bignorm for normalizing the Brain (147 M reads, 35.1 GB) and HeLa (216 M reads, 60.7 GB) datasets.
| Method | Brain: % reduced | Brain: time [min] | Brain: mem [GB] | HeLa: % reduced | HeLa: time [min] | HeLa: mem [GB] |
|---|---|---|---|---|---|---|
| ORNA | 69.8 | 112 (42) | 6.38 (6.31) | 75.31 | 219 (64) | 9.81 (9.85) |
| ORNA-Q | 70.65 | 116 (50) | 7.10 (7.13) | 72.72 | 223 (70) | 10.01 (10.02) |
| ORNA-K | 70.3 | 130 (52) | 6.41 (6.5) | 73.86 | 279 (75) | 9.98 (10.01) |
| Diginorm | 72.03 | 135 | 12.5 | 72.91 | 198 | 12.51 |
| Diginorm | 70.51 | 112 | 6.26 | 75.10 | 155 | 9.76 |
| *Bignorm | 69.38 | (47) | 41.94 | 71.28 | (58) | 41.94 |
| Bignorm | 69.39 | (41) | 5.23 | 71.26 | (55) | 5.24 |
Notes: The memory required to store the complete dataset in main memory is given after the read count in the table caption. The column "% reduced" states the percentage of reads removed by each method. Times and memory obtained by running the algorithm with 10 threads (where supported) are shown in parentheses. *Bignorm always runs with 4 cores and fixed memory settings.
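The wall-clock and peak-memory figures in the table above can be reproduced for any normalizer with a small wrapper; a sketch assuming a Linux system (where ru_maxrss is reported in kilobytes), with the actual command left as a placeholder:

```python
import resource
import subprocess
import time

def profile_run(cmd: list[str]) -> tuple[float, float]:
    """Run a normalization command and return (wall-clock minutes,
    peak child-process memory in GB), matching the table's units."""
    start = time.time()
    subprocess.run(cmd, check=True)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return (time.time() - start) / 60.0, peak_kb / 1024 ** 2

# Usage (the command and flags are placeholders, not ORNA's actual CLI):
# minutes, gigabytes = profile_run(["orna", "..."])
```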