Simon H Tausch, Bernhard Y Renard, Andreas Nitsche, Piotr Wojciech Dabrowski.
Abstract
BACKGROUND: The assembly of viral or endosymbiont genomes from Next Generation Sequencing (NGS) data is often hampered by the predominant abundance of reads originating from the host organism. These reads increase the memory and CPU time usage of the assembler and can lead to misassemblies.
Year: 2015 PMID: 26379285 PMCID: PMC4574938 DOI: 10.1371/journal.pone.0137896
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Graphical representation of RAMBO-K’s workflow.
Reads are simulated from the reference genomes and used to train a foreground and background Markov chain. The simulated sequences and a subset of the real reads are assigned based on these matrices and a preview of the results is presented to the user. If this preview proves satisfactory, the same parameters are used to assign all reads.
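The idea of training a foreground and a background Markov chain and assigning reads by comparing the two can be sketched as follows. This is a minimal illustration under stated assumptions, not RAMBO-K's actual code: the function names `train_markov_chain` and `score_read`, the additive pseudocount, and the sign convention (positive scores favour the foreground) are all hypothetical choices made for this example, and the toy strings stand in for reads simulated from the reference genomes.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def train_markov_chain(reads, k, pseudocount=1.0):
    """Return a function giving log P(next base | preceding k-mer)."""
    transitions = Counter()  # (k-mer, next base) -> count
    contexts = Counter()     # k-mer -> count
    for read in reads:
        for i in range(len(read) - k):
            context, nxt = read[i:i + k], read[i + k]
            # Skip windows containing ambiguous bases such as N.
            if all(base in ALPHABET for base in context + nxt):
                transitions[(context, nxt)] += 1
                contexts[context] += 1

    def log_prob(context, nxt):
        # Additive smoothing (an assumption of this sketch) keeps unseen
        # transitions from receiving a probability of zero.
        numerator = transitions[(context, nxt)] + pseudocount
        denominator = contexts[context] + pseudocount * len(ALPHABET)
        return math.log(numerator / denominator)

    return log_prob


def score_read(read, k, fg_log_prob, bg_log_prob):
    """Sum of per-base log-likelihood ratios, foreground versus background."""
    score = 0.0
    for i in range(len(read) - k):
        context, nxt = read[i:i + k], read[i + k]
        if all(base in ALPHABET for base in context + nxt):
            score += fg_log_prob(context, nxt) - bg_log_prob(context, nxt)
    return score


# Toy training data standing in for simulated foreground/background reads.
fg_model = train_markov_chain(["ACAC" * 10], k=2)
bg_model = train_markov_chain(["GGTT" * 10], k=2)
print(score_read("ACACACAC", 2, fg_model, bg_model))  # positive: foreground-like
print(score_read("GGTTGGTT", 2, fg_model, bg_model))  # negative: background-like
```

In this sketch a read is assigned by comparing its score against a cutoff; the same trained models would be applied first to a preview subset and then, with unchanged parameters, to all reads.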
Fig 2. Example of the graphical output of RAMBO-K for a dataset containing human and orthopoxvirus sequences.
The score distribution of both simulated and real reads is displayed for two different k-mer lengths (left: 4, right: 10), allowing the user to choose the best k-mer length and cutoff. In this case, a cutoff around -100 at a k-mer length of 10 would allow a clean separation of foreground and background reads, as visualized by the clearly separated peaks. The estimated abundance of foreground and background reads in the dataset is displayed in the figure title.
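RAMBO-K leaves the choice of cutoff to the user, who reads it off the plotted score distributions. For illustration only, the same decision could be approximated numerically on the simulated reads, whose true origin is known. The sketch below picks the cutoff that maximises sensitivity minus false-positive rate (Youden's J statistic); this criterion, the function name `best_cutoff`, and the convention that reads scoring at or above the cutoff count as foreground are assumptions of the example, not part of the tool.

```python
def best_cutoff(fg_scores, bg_scores):
    """Return (cutoff, J) maximising SEN - FPR over the observed scores.

    fg_scores / bg_scores: scores of simulated foreground / background reads.
    Reads with score >= cutoff are treated as foreground (an assumption).
    """
    best, best_j = None, -1.0
    for cutoff in sorted(set(fg_scores) | set(bg_scores)):
        sen = sum(s >= cutoff for s in fg_scores) / len(fg_scores)
        fpr = sum(s >= cutoff for s in bg_scores) / len(bg_scores)
        if sen - fpr > best_j:
            best, best_j = cutoff, sen - fpr
    return best, best_j


# Two cleanly separated toy score populations.
cutoff, j = best_cutoff([-90, -80, -70], [-130, -120, -110])
print(cutoff, j)  # -90 1.0
```

A J value of 1.0 corresponds to the clearly separated peaks described in the caption; values well below 1 would suggest trying a different k-mer length.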
Benchmark results.
| | Cowpox (1.3 M reads, SRS957177) | | | | Bat adenovirus (33 K reads, SRX856705) | | | | Wolbachia (12 M reads, SRR1508956) | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Time [s] | SEN | FPR | F-Score | Time [s] | SEN | FPR | F-Score | Time [s] | SEN | FPR | F-Score |
| RAMBO-K | | 0.87 | 3.00E-04 | | | 0.79 | 0.05 | | | 1 | 4.00E-05 | |
| Kraken | 157 | 0.83 | 2.00E-05 | 0.9 | 4.4 | 1 | 0.42 | 0.8 | 7004 | 0 | 0 | N/A |
| AbundanceBin | 20938 | 0 | 0 | N/A | 73 | 0.99 | 0.88 | 0.65 | 1.10E+06 | 0.5 | 0.48 | 0.07 |
| PhymmBL | 82556 | 0.68 | 1.00E-04 | 0.8 | 1.00E+05 | 0 | 0 | N/A | 1.7E+07 | 0.5 | 2E-03 | 0.64 |
| Bowtie2+ | 146 | 0.85 | | 0.92 | 5.1 | 0.11 | | 0.2 | 419 | 0.99 | | 0.99 |
| Bowtie2- | 550 | | 0.76 | 0.03 | 93 | | 0.91 | 0.65 | 1274 | | 0.97 | 0.07 |
The best value for each dataset is in bold. Bowtie2+ (keeping reads that map to the foreground reference) generally gives the lowest false-positive rate (FPR), and Bowtie2- (discarding reads that map to the background reference) gives the highest sensitivity (SEN). RAMBO-K shows the best balance between high SEN and low FPR, as reflected in the F-Score, and has the consistently lowest run-time. RAMBO-K outperforms the other methods by the largest margin when the nearest known reference has a low identity to the sequenced genome, as in the Bat adenovirus dataset.
a: The values for PhymmBL on the Wolbachia dataset were extrapolated based on the analysis of a subset of 5% of the reads.
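For readers who want to reproduce the kind of metrics reported in the table, the sketch below computes SEN, FPR, and an F-Score from confusion counts. It assumes the standard definitions (SEN = TP / (TP + FN), FPR = FP / (FP + TN), F-Score as the harmonic mean of precision and sensitivity); the excerpt does not spell out the exact formulas used, so the function `classification_metrics` and its handling of undefined values are illustrative assumptions.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (SEN, FPR, F-Score) from confusion counts.

    tp/fn: foreground reads kept/discarded; fp/tn: background reads
    kept/discarded. Undefined values are returned as None, mirroring
    the N/A entries that appear when no read is assigned to the foreground.
    """
    sen = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    # F1 = 2*precision*SEN / (precision + SEN) = 2*TP / (2*TP + FP + FN);
    # it is undefined when there are no true positives.
    f_score = 2 * tp / (2 * tp + fp + fn) if tp else None
    return sen, fpr, f_score


# Hypothetical counts, not taken from the benchmark.
sen, fpr, f_score = classification_metrics(tp=80, fp=10, tn=890, fn=20)
print(round(sen, 2), round(fpr, 4), round(f_score, 4))  # 0.8 0.0111 0.8421
```

Because the F-Score depends on precision, it is sensitive to the relative abundance of foreground and background reads, which is one reason a method with the same SEN and FPR can score differently across datasets.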