Abstract
BACKGROUND: With the advent of next-generation sequencing there is an increased demand for tools to pre-process and handle the vast amounts of data generated. One recurring problem is adapter contamination in the reads, i.e. the partial or complete sequencing of adapter sequences. These adapter sequences have to be removed as they can hinder correct mapping of the reads and influence SNP calling and other downstream analyses.
Year: 2012 PMID: 22748135 PMCID: PMC3532080 DOI: 10.1186/1756-0500-5-337
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Comparison of various tools for trimming adapters
| Method | 5’ | 3’ | SE | PE | Merge | Multi | Ns | Q | Barcode | Refs. |
| AdapterRemoval | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | [ |
| Btrim | Yes | Yes | Yes | Yes | No | No | No | Yes | Yes | [ |
| CANGS¹,² | No | Yes | Yes | No | No | Yes | (Yes) | (Yes) | Yes | [ |
| Cutadapt | Yes | Yes | Yes | No | No | Yes | No | Yes | No | [ |
| EA-Tools | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | [ |
| FAR³ | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | [ |
| FASTX¹ | No | Yes | Yes | No | No | No | (Yes) | No | Yes | [ |
| Scythe | No | Yes | Yes | No | No | No | No | No | No | [ |
| SeqPrep | No | Yes | No | Yes | Yes | No | No | No | No | [ |
| SeqTrim | No | Yes | Yes | No | No | Yes | Yes | Yes | No | [ |
| TagCleaner | Yes | Yes | Yes | No | No | Yes | No | No | No | [ |
| TagDust⁴ | (Yes) | Yes | Yes | (Yes) | No | Yes | No | No | No | [ |
| Trim Galore!³ | No | Yes | Yes | (Yes) | No | No | No | Yes | No | [ |
| trimLRPatterns | Yes | Yes | Yes | No | No | No | No | No | No | [ |
| Trimmomatic | No | Yes | Yes | Yes | No | Yes | No | Yes | No | [ |
The table summarizes the features of different programs that are aimed at removing adapter sequences from next-generation sequencing reads. For each method, the table shows whether it can i) identify adapters in the 5’ end of reads, ii) identify adapters in the 3’ end of reads, iii) process single-end reads, iv) process paired-end reads, v) collapse overlapping pairs, vi) search for multiple different adapters, vii) trim subsequences of Ns, viii) trim low-quality nucleotides, ix) sort multiplexed reads based on barcodes. The last feature is not strictly related to adapter trimming, but it is included because many of these programs offer it. The table also lists references for each program. Notes: 1) If chosen, reads with one or more Ns are discarded (i.e. Ns are not trimmed). 2) If chosen, low-quality reads are discarded (i.e. low-quality nucleotides are not trimmed). 3) Discards remaining read if the other read in a pair is removed due to trimming. 4) The aim of this program is slightly different as it compares the reads to a library of sequences and checks for significant overlap.
Figure 1. Illustration of different constructs and the reads produced. Single-end data on top, paired-end below. Inserts are denoted I, single-end reads R, and paired-end reads R1 and R2. Read length is denoted L_R and insert length L_I. A) L_I ≥ L_R: no adapter contamination. B) L_I < L_R: adapter contamination occurs in the 3’ end. C) L_I ≥ 2·L_R: no adapter contamination and no overlap between reads. D) L_R < L_I < 2·L_R: no adapter contamination, but the two reads overlap. E) L_I < L_R: adapter contamination in the 3’ ends of both reads, overlap between the 5’ ends of the reads. This information can be used to perform the pairwise alignment needed (after reverse complementing mate 2 from the pair) to locate adapter contamination and/or overlap between reads.
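The length relations in the Figure 1 caption can be written as a small classifier. This is a minimal sketch and not code from any of the tools discussed; the function names and the returned tuple are illustrative assumptions. The paired-end boundary case where insert and read length are equal is not listed in the caption and is grouped with case D here, as an assumption.

```python
def classify_single_end(read_len: int, insert_len: int) -> tuple:
    """Return (case label, bases sequenced past the insert) for a single-end read."""
    if insert_len >= read_len:
        # Case A: the read ends inside the insert, so no adapter is sequenced.
        return ("A", 0)
    # Case B: the read runs past the insert into the 3' adapter.
    return ("B", read_len - insert_len)


def classify_paired_end(read_len: int, insert_len: int) -> tuple:
    """Return (case label, bases sequenced past the insert per mate) for a read pair."""
    if insert_len >= 2 * read_len:
        # Case C: no adapter and the two mates do not overlap.
        return ("C", 0)
    if insert_len >= read_len:
        # Case D: no adapter, but the mates overlap.
        # Assumption: insert_len == read_len is treated as D (complete overlap).
        return ("D", 0)
    # Case E: both mates run into adapter at their 3' ends.
    return ("E", read_len - insert_len)


if __name__ == "__main__":
    for insert_len in (250, 150, 60):
        print(insert_len, classify_paired_end(100, insert_len))
```

The second element of the tuple only counts bases sequenced beyond the insert; whether all of them are adapter depends on the adapter length, which this sketch does not model.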
Figure 2. The need for shifting the alignment due to missing nucleotides. If the read is missing a few nucleotides in the 5’ end, the proper alignment will not be recoverable if the procedure stops at the first position. As shown in 1), this leads to multiple mismatches and possibly missed adapter contamination. If the alignment is shifted by S nucleotides as shown in 2), the correct alignment can be found. The dynamic programming matrix in 3) shows which entries in the matrix lead to the two solutions shown here. The light grey part is the upper half of the matrix that is calculated by default; the two dark grey entries illustrate the two alignments shown in 1) and 2).
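The effect of shifting can be illustrated with a simplified, ungapped comparison. The sketch below is an assumption-laden toy and not the dynamic programming procedure shown in panel 3): the helper names `count_mismatches` and `best_shift`, the `max_shift` parameter, and the example sequences are all hypothetical. It tries every offset from 0 to `max_shift` and keeps the one with the fewest mismatches.

```python
def count_mismatches(a: str, b: str) -> int:
    """Count positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_shift(read: str, reference: str, max_shift: int) -> tuple:
    """Return (shift, mismatches) for the best ungapped placement of `read`.

    `read` is compared against `reference` starting at offsets 0..max_shift.
    Only the overlapping part is compared at each offset.
    """
    best = None
    for shift in range(max_shift + 1):
        overlap = min(len(read), len(reference) - shift)
        if overlap <= 0:
            break
        mm = count_mismatches(read[:overlap], reference[shift:shift + overlap])
        if best is None or mm < best[1]:
            best = (shift, mm)
    return best


if __name__ == "__main__":
    reference = "ACGTTGCAAGCT"
    read = reference[2:]  # the read lacks two nucleotides at its 5' end
    print(best_shift(read, reference, max_shift=0))  # no shifting allowed
    print(best_shift(read, reference, max_shift=3))  # shifting allowed
```

With `max_shift=0` the comparison is forced to start at the first position and reports many mismatches, which corresponds to situation 1) in the figure; allowing a shift recovers the mismatch-free placement of situation 2).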
Performance of AdapterRemoval and Trimmomatic on simulated test set
| | | | | |
| Run time (s) | 139.42 | 36.07 | 39.9 | 20.1 |
| Memory (kB) | 7,664 | 6,512 | 387,488 | 611,536 |
| Trimmed, no adapter | 95 (0.02%) | 93,909 (15.00%) | 0 (0.00%) | 0 (0.00%) |
| Trimmed exact | 346,806 (92.74%) | 327,925 (87.69%) | 120,284 (32.16%) | 120,284 (32.16%) |
| Trimmed more | 0 (0.00%) | 298 (0.08%) | 0 (0.00%) | 0 (0.00%) |
| Trimmed less | 0 (0.00%) | 1,038 (0.28%) | 0 (0.00%) | 0 (0.00%) |
| Missed adapter | 27,157 (7.26%) | 44,702 (11.95%) | 253,679 (67.84%) | 253,679 (67.84%) |
| PPV | 1.00 | 0.78 | 1.00 | 1.00 |
| SEN | 0.93 | 0.88 | 0.32 | 0.32 |
| SPEC | 1.00 | 0.85 | 1.00 | 1.00 |
| MCC | 0.94 | 0.71 | 0.48 | 0.48 |
Statistics for AdapterRemoval and Trimmomatic tested on both single-end and paired-end data. The test set contains 1,000,000 read pairs of which 373,963 contain adapter fragments and the remaining 626,037 do not. The paired-end test was performed on the 1,000,000 pairs, and the single-end test was performed on the first read from each pair. The reported numbers are: run time (seconds); maximum memory usage (kilobytes); number of reads containing no adapter that were trimmed; number of reads with adapter where just the adapter was removed; number of reads with adapter where more than the adapter was trimmed; number of trimmed reads with adapter where the full adapter was not removed; number of reads with adapter where no trimming was performed; positive predictive value; sensitivity; specificity; Matthews correlation coefficient. For “Trimmed, no adapter”, the percentage of the 626,037 reads with no adapter that were trimmed is shown. For the following four rows, the percentages are of the 373,963 read pairs with adapter, and these numbers add up to 373,963 and 100%, respectively.
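The four summary statistics can be recomputed from the counts in the table. The sketch below uses the standard definitions of PPV, sensitivity, specificity and the Matthews correlation coefficient. How the counts map to true and false positives is an assumption, since the text does not spell it out: only "Trimmed exact" reads are counted as true positives, all other reads with adapter as false negatives, and "Trimmed, no adapter" reads as false positives. The function names are illustrative.

```python
from math import sqrt

# Test-set composition stated in the caption.
WITH_ADAPTER = 373_963
WITHOUT_ADAPTER = 626_037


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard PPV, sensitivity, specificity and MCC from a confusion matrix.

    Raises ZeroDivisionError if any marginal total is zero.
    """
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "PPV": tp / (tp + fp),
        "SEN": tp / (tp + fn),
        "SPEC": tn / (tn + fp),
        "MCC": (tp * tn - fp * fn) / mcc_denominator,
    }


def metrics_from_table(trimmed_no_adapter: int, trimmed_exact: int) -> dict:
    """Derive the confusion matrix from two table rows (assumed mapping, see lead-in)."""
    tp = trimmed_exact
    fp = trimmed_no_adapter
    fn = WITH_ADAPTER - tp
    tn = WITHOUT_ADAPTER - fp
    return classification_metrics(tp, fp, tn, fn)


if __name__ == "__main__":
    # First and second data columns of the table.
    for counts in ((95, 346_806), (93_909, 327_925)):
        result = metrics_from_table(*counts)
        print({name: round(value, 2) for name, value in result.items()})
```

Under this mapping, the values rounded to two decimals agree with the PPV, SEN, SPEC and MCC rows for the first two data columns of the table.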