Marc Sturm, Christopher Schroeder, Peter Bauer.
Abstract
BACKGROUND: Trimming adapter sequences from short-read data is a common preprocessing step in NGS data analysis. In paired-end sequencing, the overlap between the forward and reverse reads can be used to identify excess adapter sequences. Several previously published adapter trimming tools exploit this. However, our evaluation on amplicon-based data shows that most current tools fail to remove all adapter sequences, and that adapter contamination may even lead to spurious variant calls.
Year: 2016 PMID: 27161244 PMCID: PMC4862148 DOI: 10.1186/s12859-016-1069-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Read layout with adapter contamination (a) and insert match algorithm examples with different offsets (b). Inserts are colored grey, adapter remains are colored black. Reverse reads are displayed as their reverse-complementary sequence to facilitate visual comparison of the sequences.
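The insert match idea in the figure can be illustrated with a minimal sketch. It is not the SeqPurge implementation: the function names, the `min_overlap` and `max_mismatch_rate` parameters, and their default values are assumptions made for illustration. The sketch tries candidate insert lengths from longest to shortest and accepts the first one at which the start of the forward read matches the reverse complement of the start of the reverse read. Everything beyond that length is treated as adapter.

```python
# Hypothetical sketch of overlap-based adapter detection for one read pair.
# Names, parameters and thresholds are illustrative assumptions.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (upper-case ACGTN)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_insert_length(fwd, rev, min_overlap=15, max_mismatch_rate=0.1):
    """Return the inferred insert length, or None if no insert match is found.

    For a candidate insert length n, the first n bases of the forward read
    should equal the reverse complement of the first n bases of the reverse
    read, up to the tolerated fraction of mismatches.
    """
    longest = min(len(fwd), len(rev))
    for n in range(longest, min_overlap - 1, -1):
        a = fwd[:n]
        b = reverse_complement(rev[:n])
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch_rate * n:
            return n
    return None


def trim_pair(fwd, rev, **kwargs):
    """Cut both reads down to the inferred insert length, if one is found."""
    n = find_insert_length(fwd, rev, **kwargs)
    if n is None:
        return fwd, rev
    return fwd[:n], rev[:n]


# Example: a 30 bp insert sequenced with 40 bp reads, so each read ends
# with 10 adapter bases (an Illumina-style adapter prefix as a stand-in).
insert = "ACGTTGCAAGGCTTAACCGGTATCGATCGA"
adapter = "AGATCGGAAG"
fwd = insert + adapter
rev = reverse_complement(insert) + adapter
print(find_insert_length(fwd, rev))  # 30
```

A production tool would typically also compare the bases beyond the insert against the known adapter sequences before trimming, since an insert match alone can occur by chance in repetitive regions.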
Adapter trimming benchmark results on real data (raw reads)
| Tool | Adapters left | Bases left |
|---|---|---|
| no trimming | 414254 | 168509790 |
| SeqPurge 0.1-270 | 0 | 142570393 |
| AdapterRemoval 1.5.4 | | 142864048 |
| Flexbar 2.5 | 7 | 142323263 |
| PEAT 1.2.2 | | |
| SeqPrep 1.2 | 0 | |
| Skewer 0.1.123 | | 142664746 |
| Trimmomatic 0.32 | 5 | 142717579 |
The number of adapter 20-mers and the number of bases left in the raw read data after adapter trimming. The most notable entries are highlighted (bold)
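One way to obtain an "adapter 20-mers left" figure is sketched below. The exact counting rule behind the table is not stated here, so this version simply assumes that every occurrence of the first 20 adapter bases in a trimmed read counts once. The function name `count_adapter_kmers` and the example adapter sequence are illustrative assumptions.

```python
# Hypothetical sketch: count how often the leading k-mer of an adapter
# still occurs in a collection of trimmed reads.

def count_adapter_kmers(reads, adapter, k=20):
    """Count occurrences of the first k adapter bases across all reads."""
    if len(adapter) < k:
        raise ValueError("adapter is shorter than k")
    probe = adapter[:k]
    # str.count reports non-overlapping occurrences, which is sufficient
    # for a 20-mer probe in short reads.
    return sum(read.count(probe) for read in reads)


# Example adapter (an Illumina-style sequence used only as a stand-in).
adapter = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
reads = [
    "ACGT" + adapter[:20] + "TT",  # adapter 20-mer still present
    "ACGTACGTACGT",                # clean read
    adapter[:19],                  # too short to contain the 20-mer
]
print(count_adapter_kmers(reads, adapter))  # 1
```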
Adapter trimming benchmark results on real data (mapping)
| Tool | Reads paired | Bases overtrimmed | Bases undertrimmed |
|---|---|---|---|
| no trimming | | 0 | 21918793 |
| SeqPurge 0.1-270 | 1021315 | 1736 | 33650 |
| AdapterRemoval 1.5.4 | 1021290 | 25 | |
| Flexbar 2.5 | 1021224 | | 62901 |
| PEAT 1.2.2 | | | |
| SeqPrep 1.2 | | 53 | 34949 |
| Skewer 0.1.123 | 1021323 | 238 | |
| Trimmomatic 0.32 | 1021316 | 1580 | |
Benchmark results after mapping: properly-paired reads, erroneously trimmed insert bases, untrimmed adapter bases. The most notable entries are highlighted (bold)
Adapter trimming benchmark results on real data (variant calling)
| Tool | Variants | Uncalled TPs | Called FPs |
|---|---|---|---|
| no trimming | | | |
| SeqPurge 0.1-270 | 155 | 0 | 0 |
| AdapterRemoval 1.5.4 | 155 | 0 | 0 |
| Flexbar 2.5 | | 0 | |
| PEAT 1.2.2 | 155 | 0 | 0 |
| SeqPrep 1.2 | 155 | 0 | 0 |
| Skewer 0.1.123 | 155 | 0 | 0 |
| Trimmomatic 0.32 | | 0 | |
Benchmark results after variant calling: overall variant count, number of true-positive variants that were not called, number of false-positive variants that were called. The most notable entries are highlighted (bold)
Resources benchmark results on real data
| Tool | Trimming time [s] | Mapping time [s] | Variant calling time [s] | Memory usage [MB] |
|---|---|---|---|---|
| no trimming | n/a | | 69 | n/a |
| SeqPurge 0.1-270 | 38 | 182 | 65 | 28.8 |
| AdapterRemoval 1.5.4 | | 198 | 66 | 12.2 |
| Flexbar 2.5 | | 204 | 65 | 13.3 |
| PEAT 1.2.2 | 43 | 241 | 73 | |
| SeqPrep 1.2 | | 180 | 65 | 6.2 |
| Skewer 0.1.123 | 24 | 200 | 66 | 5.7 |
| Trimmomatic 0.32 | 58 | 177 | 66 | |
Benchmark results of single-thread processing times (in seconds) and peak memory usage (in MB). The most notable entries are highlighted (bold)
Trimming benchmark results on simulated data without errors
| Tool | Time [s] | Bases overtrimmed | Bases undertrimmed |
|---|---|---|---|
| SeqPurge 0.1-270 | 224 | 14488 | 0 |
| AdapterRemoval 1.5.4 | | 434 | 0 |
| Flexbar 2.5 | | | |
| PEAT 1.2.2 | 342 | | |
| SeqPrep 1.2 | | 1848 | 0 |
| Skewer 0.1.123 | 185 | 16 | 0 |
| Trimmomatic 0.32 | 348 | 0 | |
Benchmark results on simulated data (5 million read pairs of 100 bp) without sequencing errors. The most notable entries are highlighted (bold)
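On simulated data the true insert portion of every read is known, so overtrimmed and undertrimmed bases can be tallied directly. The sketch below assumes a plausible definition rather than the benchmark's documented one: a read trimmed shorter than its true insert portion contributes overtrimmed bases, and a read left longer contributes undertrimmed bases. The function name `trimming_errors` and its inputs are illustrative.

```python
# Hypothetical sketch: tally trimming errors against known ground truth.

def trimming_errors(true_lengths, trimmed_lengths):
    """Return (overtrimmed, undertrimmed) base totals over all reads.

    true_lengths[i]    -- number of insert bases in read i (ground truth)
    trimmed_lengths[i] -- length of read i after adapter trimming
    """
    if len(true_lengths) != len(trimmed_lengths):
        raise ValueError("both inputs must describe the same reads")
    over = under = 0
    for expected, observed in zip(true_lengths, trimmed_lengths):
        if observed < expected:
            over += expected - observed   # insert bases wrongly removed
        else:
            under += observed - expected  # adapter bases left behind
    return over, under


# Three reads: trimmed correctly, cut 5 bases too short, left 6 bases too long.
print(trimming_errors([100, 80, 60], [100, 75, 66]))  # (5, 6)
```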
Undertrimmed base counts on simulated data
| Error rate | SeqPurge | AdapterRemoval | SeqPrep | Skewer |
|---|---|---|---|---|
| 0.00 % | 0 | 0 | 0 | 0 |
| 0.50 % | 0 | 0 | | |
| 1.00 % | 0 | 212 | | |
| 2.00 % | 122 | 48190 | | |
| 4.00 % | 4312 | | | |
Undertrimmed base counts on simulated data (5 million read pairs of 100 bp) with different rates of sequencing errors. The most notable entries are highlighted (bold)