Paul Igor Costea, Joakim Lundeberg, Pelin Akan.
Abstract
Multiplexing is of vital importance for utilizing the full potential of next generation sequencing technologies. Here we report TagGD (DNA-based Tag Generator and Demultiplexor), a fully customisable, fast and accurate software package that can generate thousands of barcodes satisfying user-defined constraints and can guarantee full demultiplexing accuracy. The barcodes are designed to minimise their interference with the experiment. Insertion, deletion and substitution events are considered when designing and demultiplexing barcodes. 20,000 barcodes of length 18 were designed in 5 minutes, and 2 million barcoded Illumina HiSeq-like reads generated with an error rate of 2% were demultiplexed with full accuracy in 5 minutes. We believe that our software meets a central demand in current high-throughput biology and can be utilised in any field with ample sample abundance. The software is available on GitHub (https://github.com/pelinakan/UBD.git).
Year: 2013 PMID: 23469199 PMCID: PMC3587622 DOI: 10.1371/journal.pone.0057521
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The generation of barcodes, along with all the filtering steps, is performed concurrently with the execution of the main application.
Each barcode that passes filtering is added to a pool, and generation of more barcodes continues. Over time, the main application drains this pool and inserts the barcodes into the solution set.
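The pool described in Figure 1 resembles a producer-consumer arrangement, which can be sketched roughly as follows. This is only an illustration, not TagGD's implementation: the filters in `passes_filters` (GC content, homopolymer runs), the function names, and the pool size are all hypothetical, and Hamming distance stands in for the edit distance to keep the sketch short.

```python
import queue
import random
import threading


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions (a stand-in for edit distance)."""
    return sum(x != y for x, y in zip(a, b))


def passes_filters(barcode: str) -> bool:
    """Hypothetical filters: balanced GC content, no homopolymer run of 3+."""
    gc = sum(base in "GC" for base in barcode) / len(barcode)
    has_run = any(base * 3 in barcode for base in "ACGT")
    return 0.4 <= gc <= 0.6 and not has_run


def producer(pool, stop, length, seed):
    """Generate and filter random candidates until asked to stop."""
    rng = random.Random(seed)
    while not stop.is_set():
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if not passes_filters(candidate):
            continue
        try:
            pool.put(candidate, timeout=0.1)
        except queue.Full:
            pass  # pool is full: drop this candidate and re-check the stop flag


def design_barcodes(count, length=18, min_distance=3, seed=0):
    """Drain the pool, keeping candidates far enough from all accepted ones."""
    pool = queue.Queue(maxsize=256)
    stop = threading.Event()
    worker = threading.Thread(
        target=producer, args=(pool, stop, length, seed), daemon=True
    )
    worker.start()
    accepted = []
    try:
        while len(accepted) < count:
            candidate = pool.get()
            if all(hamming(candidate, b) >= min_distance for b in accepted):
                accepted.append(candidate)
    finally:
        stop.set()
        worker.join()
    return accepted


if __name__ == "__main__":
    for barcode in design_barcodes(5):
        print(barcode)
```

Because every new candidate is compared against the whole solution set, the cost of each insertion grows with the number of accepted barcodes, which fits the steep growth in generation time reported for larger sets.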
Running times for generating different numbers of barcodes of length 18 with different edit distances and padding.
| Padding | Edit Distance | Number of Barcodes | Generation Time |
| 0 | 3 | 5,000 | 18 seconds |
| 0 | 4 | 5,000 | 19 seconds |
| 0 | 3 | 20,000 | 4 minutes |
| 0 | 4 | 20,000 | 5 minutes |
| 0 | 3 | 100,000 | 1.5 hours |
| 0 | 4 | 100,000 | 7 hours |
| 1 | 3 | 1,000 | 1.4 seconds |
| 1 | 4 | 1,000 | 2 seconds |
| 1 | 3 | 5,000 | 16 seconds |
| 1 | 4 | 5,000 | 46 seconds |
| 1 | 3 | 10,000 | 1 minute |
| 1 | 4 | 10,000 | 14 minutes |
Benchmarking was performed on an 8-core machine with 24 GB of memory.
Figure 2. Let barcodes 1 and 2 be barcodes within the designed unique barcode set and barcode 3 be another barcode containing errors introduced during the experiment; the edges of the triangles represent the Levenshtein edit distance between them.
A) Barcode 1 can be converted to barcode 3 with three operations. However, two errors introduced in either barcode 1 or 2 can result in a new sequence that requires the same number of operations to transform into either barcode 1 or 2. Therefore, it cannot be classified. B) Barcode 1 is incorrectly synthesised or sequenced in such a way that it now has a smaller edit distance to barcode 2, which leads to its misclassification.
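The two failure modes in Figure 2 can be reproduced with a small sketch. The barcodes and reads here are toy sequences chosen for illustration, and `classify` is a hypothetical nearest-barcode rule, not TagGD's classifier; `edit_distance` is the standard dynamic-programming Levenshtein distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def classify(read, barcodes):
    """Return the unique nearest barcode, or None when the nearest is tied."""
    ranked = sorted((edit_distance(read, b), b) for b in barcodes)
    if len(ranked) > 1 and ranked[0][0] == ranked[1][0]:
        return None
    return ranked[0][1]


if __name__ == "__main__":
    barcode_1, barcode_2 = "AAAAAA", "AAATTT"  # edit distance 3 apart
    barcodes = [barcode_1, barcode_2]
    print(classify("AAATAA", barcodes))  # one error in barcode 1: still barcode 1
    print(classify("AAATAC", barcodes))  # two errors: tie, cannot be classified
    print(classify("AAATTA", barcodes))  # two errors: now nearer to barcode 2
```

With a minimum pairwise distance of 3, a single error is always resolved correctly, while two errors can produce either the tie of panel A or the misclassification of panel B.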
20,000 unique barcodes of length 18 were generated with no padding option.
| k-mer Length | % of reads that are wrongly classified | % of reads that cannot be uniquely classified | Running time (seconds) |
| 9 | 0.27% (537) | 1.7% (3423) | 1.5 |
| 8 | 0.14% (276) | 0.38% (760) | 2.1 |
| 7 | 0.11% (223) | 0.13% (267) | 5.9 |
| 6 | 0% (0) | 0% (0) | 27.8 |
Then 200,000 barcoded reads of length 18 were generated and errors were introduced with 2% probability; only substitution errors were allowed. With a k-mer length of 6, all reads could be demultiplexed correctly in less than half a minute. Benchmarking was performed on an 8-core machine with 24 GB of memory.
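The trade-off between k-mer length, accuracy and running time can be illustrated with a generic k-mer index. This is a sketch of the general technique under stated assumptions, not a description of TagGD's internals. The names `build_kmer_index` and `demultiplex` are hypothetical, and Hamming distance is used because only substitutions occur in this benchmark. A read is compared only against barcodes that share at least one k-mer with it, so a long k-mer yields few candidates (fast) but can miss the true barcode when errors disrupt every k-mer.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def build_kmer_index(barcodes, k):
    """Map every k-mer to the set of barcodes that contain it."""
    index = defaultdict(set)
    for barcode in barcodes:
        for start in range(len(barcode) - k + 1):
            index[barcode[start:start + k]].add(barcode)
    return index


def demultiplex(read, index, k):
    """Pick the unique nearest barcode among those sharing a k-mer with the read."""
    candidates = set()
    for start in range(len(read) - k + 1):
        candidates |= index.get(read[start:start + k], set())
    if not candidates:
        return None  # no shared k-mer: the read is left unclassified
    ranked = sorted((hamming(read, b), b) for b in candidates)
    if len(ranked) > 1 and ranked[0][0] == ranked[1][0]:
        return None  # tie between nearest candidates
    return ranked[0][1]


if __name__ == "__main__":
    barcodes = ["ACGTACGT", "TTGCAAGC", "GGATCCTA"]
    read = "ACGACCGT"  # first barcode with substitutions at positions 3 and 4
    for k in (4, 3):
        print(k, demultiplex(read, build_kmer_index(barcodes, k), k))
```

In this toy case the two central substitutions break every 4-mer of the read, so the lookup with k = 4 finds no candidate, whereas k = 3 still recovers the correct barcode at the cost of a larger index and more candidate comparisons.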
10,000 unique barcodes of length 18 were generated with padding option set to 1.
| | Indexes are mapped back using positional inaccuracy set to 1 | | | Indexes are mapped back with no positional inaccuracy | | |
| k-mer Length | % of reads that are wrongly classified | % of reads that cannot be uniquely classified | Running time (seconds) | % of reads that are wrongly classified | % of reads that cannot be uniquely classified | Running time (seconds) |
| 9 | 0.26% (525) | 1.64% (3280) | 2 | 0.13% (256) | 1.84% (3684) | 1.8 |
| 8 | 0.13% (266) | 0.33% (659) | 2.7 | 0.07% (134) | 0.49% (981) | 1.6 |
| 7 | 0.05% (105) | 0.05% (96) | 6.9 | 0.06% (123) | 0.20% (400) | 3.7 |
| 6 | 0% (0) | 0% (0) | 27.3 | 0% (7) | 0% (8) | 15.5 |
Then 200,000 barcoded reads of length 18 were generated and errors were introduced with 2% probability; substitutions, insertions and deletions were allowed. With a k-mer length of 6, all reads could be demultiplexed correctly in less than half a minute, provided that the positional inaccuracy was set to 1 (equal to the padding option used in designing the barcodes). Benchmarking was performed on an 8-core machine with 24 GB of memory.
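One way to see why a positional tolerance matters once insertions and deletions are allowed is to count position-aware k-mer matches. The function `shared_kmers` and its `positional_inaccuracy` parameter are hypothetical names modelled on the option described above; the sketch assumes that a read k-mer counts as a hit only if the same k-mer occurs in the barcode within the given number of positions.

```python
def shared_kmers(read, barcode, k, positional_inaccuracy=0):
    """Count read k-mers found in the barcode within +/- the allowed offset."""
    hits = 0
    last_start = len(barcode) - k
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        lo = max(0, i - positional_inaccuracy)
        hi = min(last_start, i + positional_inaccuracy)
        if any(barcode[j:j + k] == kmer for j in range(lo, hi + 1)):
            hits += 1
    return hits


if __name__ == "__main__":
    barcode = "ACGTTGCA"
    shifted = "CGTTGCAT"  # first base deleted, so every k-mer moves left by one
    print(shared_kmers(shifted, barcode, k=4, positional_inaccuracy=0))
    print(shared_kmers(shifted, barcode, k=4, positional_inaccuracy=1))
```

In this toy example a single deletion at the start of the read leaves no k-mer at its original position, so strict positional matching finds nothing, while a tolerance of one position recovers four of the five k-mers.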