Xiaolong Zhang, Yanyan Shao, Jichao Tian, Yuwei Liao, Peiying Li, Yu Zhang, Jun Chen, Zhiguang Li.
Abstract
BACKGROUND: With the widespread use of multiplex amplicon sequencing (MAS) for detecting genetic variation, an efficient tool is needed to remove primer sequences from short reads so that downstream analysis remains reliable. Several tools are available, but their efficiency and accuracy need improvement when trimming large numbers of primers in high-throughput targeted genome sequencing. The issue is becoming more urgent given the potential clinical use of MAS for processing patient samples. We developed pTrimmer, which can handle thousands of primers simultaneously with greatly improved accuracy and performance.
RESULT: pTrimmer combines a k-mer algorithm with the Needleman-Wunsch algorithm, which keeps it accurate even in the presence of sequencing errors. Compared with similar tools, pTrimmer improves sensitivity by 28.59% and accuracy by 11.87%. In simulation, pTrimmer reached a sensitivity of 99.96% and an accuracy of 97.38%, compared with 70.85% sensitivity and 58.73% accuracy for cutPrimers. pTrimmer is also markedly faster: about 370 times faster than cutPrimers and about 17,000 times faster than cutadapt per thread. Trimming 2158 primer pairs from 11 million reads (Illumina PE 150 bp) takes only 37 s and uses no more than 100 MB of memory.
Keywords: Multiplex amplicon sequencing; Primer trimming; Target sequencing
Year: 2019 PMID: 31077131 PMCID: PMC6511130 DOI: 10.1186/s12859-019-2854-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Workflow of pTrimmer. The flowchart shows the details of the program. The program takes a primer sequence file and either two FASTQ files (paired-end sequencing) or one (single-end sequencing) as input. The primer sequence file contains three required fields: the forward primer, the reverse primer and the insert length of the amplicon. pTrimmer first extracts every k-mer of both the forward and reverse primers, shifting 1 base at a time, to build the ‘hash table’. K-mers are then extracted from each FASTQ read, again with a 1-bp shift each time, and looked up in the ‘hash table’ to find the best candidate primer. The 12-bp buffer shift length gives the program 12 attempts to find the best primer sequence with the ‘k-mers model’. If this fails, the program automatically switches to the ‘dynamic model’, which extracts all possible 1-bp-shift k-mers and selects the candidate primer with the most k-mer hits. The program then performs dynamic programming between the candidate primer sequence and the given read.
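
The lookup-then-align idea in the caption can be illustrated with a minimal Python sketch. It is not pTrimmer's implementation: it collapses the ‘k-mers model’ and the ‘dynamic model’ into a single hit-counting lookup, uses a plain unit-cost global alignment for the dynamic programming step, and ignores indels when deciding where the primer ends. The function names, the 40-base search `window` and the `max_err` default are assumptions made for illustration.

```python
from collections import Counter, defaultdict

K = 9  # k-mer length; 9 is the K value quoted in the benchmark parameters


def kmers(seq, k=K):
    """Return every k-mer of seq, shifting one base at a time."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(primers, k=K):
    """Hash table: k-mer -> names of the primers that contain it."""
    index = defaultdict(set)
    for name, seq in primers.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def best_candidate(read, index, k=K, window=40):
    """Primer with the most k-mer hits in the first `window` bases, or None."""
    hits = Counter()
    for kmer in kmers(read[:window], k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return hits.most_common(1)[0][0] if hits else None


def edit_distance(a, b):
    """Global-alignment cost with unit mismatch and gap penalties."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # gap in b
                            curr[j - 1] + 1,            # gap in a
                            prev[j - 1] + (ca != cb)))  # match or mismatch
        prev = curr
    return prev[-1]


def trim_primer(read, primers, index, max_err=3):
    """Remove the 5' primer if the best candidate aligns within max_err edits."""
    name = best_candidate(read, index)
    if name is None:
        return read  # no k-mer hit: leave the read untouched
    primer = primers[name]
    # Simplification: compare against a prefix of the same length as the primer.
    if edit_distance(primer, read[:len(primer)]) <= max_err:
        return read[len(primer):]
    return read


# Hypothetical primers and a read carrying P1 with one sequencing error.
primers = {"P1": "ACGTACGTTAGCCTAGGATC", "P2": "TTGACCGATAGGCTTACGAA"}
index = build_index(primers)
read = "ACGTACGTTATCCTAGGATC" + "GGGGCCCCAAAATTTT"
print(trim_primer(read, primers, index))
```

Because the k-mers on either side of the mismatch still match P1, the lookup finds the right candidate despite the error, and the alignment step then confirms it is within the allowed number of edits.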
Fig. 2 Two conditions of sequencing reads. a In the ‘read-through condition’, both read1 and read2 carry a forward primer at the 5′ end and a reverse-complemented primer at the 3′ end. This situation is common in liquid biopsy studies, where the amplicon length is around 140 bp because of the limited cfDNA fragment length, while the read length can be 150 bp. b In the ‘normal condition’, read1 and read2 each have only a forward primer at the 5′ end. This occurs when the amplicon length is larger than the read length.
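
As a rough illustration of the two conditions, the Python sketch below decides which primers to expect in a read from the amplicon length and the read length. It is a simplification under stated assumptions: the function names are hypothetical, an amplicon exactly as long as the read is treated as read-through, and reads that only partially overlap the 3′ primer are ignored.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def expected_primers(fwd, rev, amplicon_len, read_len):
    """Return (5' primer, 3' primer or None) expected in a read.

    Read-through condition: the read spans the whole amplicon, so the
    reverse-complemented opposite primer appears at the 3' end.
    Normal condition: the amplicon is longer than the read, so only the
    5' primer is present.
    """
    if amplicon_len <= read_len:
        return fwd, reverse_complement(rev)
    return fwd, None


# Hypothetical short primers; 140 bp amplicon read with 150 bp reads.
print(expected_primers("ACGT", "AACG", amplicon_len=140, read_len=150))
```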
Benchmarks of four programs with three multiplex amplicon sequencing datasets. ‘Trimmed Reads’ are reads that had any number of bases removed by the program. ‘Precisely Trimmed Reads’ are reads whose primer sequences were precisely trimmed. K: the k-mer length (number of nucleotides) used by the k-mer algorithm. Err: the number or percentage of mismatches allowed during primer searching. See Additional file 1 for the detailed parameters used to run each program
| Metric | Sample (No. of reads) | Alientrimmer | Cutadapt | cutPrimers | pTrimmer |
|---|---|---|---|---|---|
| Parameters | | 1 thread, K = 9, err = 3 | 8 threads, err = 10% | 8 threads, err = 3 | 2 threads, K = 9, err = 3 |
| Running time (s) | CfDNA1 (11,498,560) | 1137 | 108,918 | 2224 | 35 |
| Running time (s) | CfDNA2 (16,694,386) | 2511 | 170,920 | 3646 | 47 |
| Running time (s) | CfDNA3 (36,119,764) | 6099 | 349,789 | 7772 | 62 |
| No. of Trimmed Reads (%) | CfDNA1 (11,498,560) | 2,228,698 (19.28) | 11,392,006 (99.07) | 8,733,142 (75.95) | 10,973,548 (95.43) |
| No. of Trimmed Reads (%) | CfDNA2 (16,694,386) | 3,193,584 (19.13) | 16,633,850 (99.64) | 12,481,890 (74.77) | 16,023,158 (95.98) |
| No. of Trimmed Reads (%) | CfDNA3 (36,119,764) | 5,680,392 (15.73) | 35,921,534 (99.45) | 24,460,516 (67.72) | 32,314,848 (89.47) |
| No. of Precisely Trimmed Reads (%) | CfDNA1 (11,498,560) | 81,143 (0.71) | 389,526 (3.39) | 7,752,109 (67.42) | 10,908,534 (94.87) |
| No. of Precisely Trimmed Reads (%) | CfDNA2 (16,694,386) | 134,647 (0.81) | 516,377 (3.09) | 11,130,147 (66.67) | 15,905,037 (95.27) |
| No. of Precisely Trimmed Reads (%) | CfDNA3 (36,119,764) | 236,252 (0.65) | 1,003,057 (2.78) | 21,581,726 (59.75) | 32,024,840 (88.66) |
Fig. 3 Time consumption of Alientrimmer, cutPrimers, pTrimmer and BAMClipper with sequencing depth increasing from 1000× to 10,000×. The results of Cutadapt were excluded because of its very long running time. The results for BAMClipper are the running time for processing the BAM files. For clearer display, we ran Alientrimmer, BAMClipper and pTrimmer with one thread and cutPrimers with 8 threads.
Effects of mismatches on the performance of Alientrimmer, Cutadapt, cutPrimers and pTrimmer with a simulation dataset (100×). Sensitivity (True Positive Rate, TPR) is the proportion of reads whose primers were removed, in part or in full, by the program among the reads that had primers introduced during simulation. Specificity (True Negative Rate, TNR) is the proportion of reads in which the program identified no primers among the reads that had no primers introduced during simulation. Accuracy (ACC) is the proportion of reads whose primer sequences were precisely trimmed. Testing was performed with 1 to 5 mismatches allowed
| Mismatches | Alientrimmer | Cutadapt | cutPrimers | pTrimmer |
|---|---|---|---|---|
| | TPR (%)/TNR (%)/ACC (%) | TPR (%)/TNR (%)/ACC (%) | TPR (%)/TNR (%)/ACC (%) | TPR (%)/TNR (%)/ACC (%) |
| 1 | 100.00/0.00/2.48 | 100.00/98.42/0.00 | 56.56/100.00/56.56 | 99.22/100.00/98.87 |
| 2 | 100.00/0.00/0.91 | 100.00/98.42/0.00 | 64.08/100.00/57.77 | 99.96/100.00/99.38 |
| 3 | 100.00/0.00/0.40 | 100.00/45.79/0.00 | 67.44/100.00/58.32 | 99.94/100.00/99.13 |
| 4 | 100.00/0.00/0.00 | 100.00/44.74/0.00 | 70.85/100.00/58.73 | 99.67/100.00/98.60 |
| 5 | 100.00/0.00/0.00 | 100.00/23.16/0.00 | 70.77/100.00/57.46 | 98.88/100.00/97.63 |
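
The three rates defined in the caption can be computed from per-read outcomes along the lines of the Python sketch below. The `ReadResult` record and `rates` function are hypothetical, and the denominator for ACC (reads that had primers introduced) is an assumption, since the caption does not state it explicitly.

```python
from dataclasses import dataclass


@dataclass
class ReadResult:
    has_primer: bool  # a primer was introduced into this read during simulation
    trimmed: bool     # the program removed a primer, in part or in full
    precise: bool     # the program removed exactly the primer bases


def rates(results):
    """Return (TPR, TNR, ACC) as percentages rounded to two decimals.

    Assumes at least one read with a primer and one without; ACC is taken
    over reads with primers, which is an assumption of this sketch.
    """
    with_primer = [r for r in results if r.has_primer]
    without_primer = [r for r in results if not r.has_primer]
    tpr = sum(r.trimmed for r in with_primer) / len(with_primer)
    tnr = sum(not r.trimmed for r in without_primer) / len(without_primer)
    acc = sum(r.precise for r in with_primer) / len(with_primer)
    return tuple(round(100 * x, 2) for x in (tpr, tnr, acc))


# Toy data: four reads with primers, two without.
toy = [
    ReadResult(True, True, True),
    ReadResult(True, True, True),
    ReadResult(True, True, False),    # trimmed, but not at the primer boundary
    ReadResult(True, False, False),   # primer missed
    ReadResult(False, False, False),  # correctly left untouched
    ReadResult(False, True, False),   # bases removed from a primer-free read
]
print(rates(toy))
```

A tool that trims every read indiscriminately scores a TPR of 100% but a TNR of 0% under these definitions, which is why the three rates are reported together.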