Elizabeth Tapia, Flavio Spetale, Flavia Krsticevic, Laura Angelone, Pilar Bulacio.
Abstract
Many parallel applications of Next-Generation Sequencing (NGS) technologies demand short barcodes that can accurately multiplex a large number of samples. To address these competing requirements, the use of error-correcting codes is advised. Current barcoding systems are mostly built from short random error-correcting codes, a feature that strongly limits their multiplexing accuracy and experimental scalability. To overcome these problems on sequencing systems impaired by mismatch errors, the alternative use of binary BCH and pseudo-quaternary Hamming codes has been proposed. However, these codes either fail to offer a fine-grained choice of barcode sizes (BCH) or have intrinsically poor error-correcting abilities (Hamming). Here, the design of barcodes from shortened binary BCH codes and quaternary Low Density Parity Check (LDPC) codes is introduced. Simulation results show that although accurate barcoding systems of high multiplexing capacity can be obtained with any of these codes, quaternary LDPC codes may be particularly advantageous due to their lower rates of read losses and undetected sample misidentification errors. Even at mismatch error rates of 10^−2 per base, 24-nt LDPC barcodes can be used to multiplex roughly 2000 samples with a sample misidentification error rate on the order of 10^−9, at the expense of a rate of read losses on the order of only 10^−6.
Year: 2015 PMID: 26492348 PMCID: PMC4619643 DOI: 10.1371/journal.pone.0140459
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The performance of BCH barcodes.

| N | M | B | (n, k, t, s) | Identification error (estimate) | Identification error (upper error bar) | Undetected error (estimate) | Undetected error (upper error bar) |
|---|---|---|---|---|---|---|---|
| 21 | 86 | 0.153 | (63, 30, 6, 21) | 6.94 × 10^−6 | 6.99 × 10^−6 | 1.00 × 10^−8 | 1.02 × 10^−8 |
| 22 | 384 | 0.195 | (63, 30, 6, 19) | 8.33 × 10^−6 | 8.34 × 10^−6 | 1.00 × 10^−8 | 1.02 × 10^−8 |
| 24 | 73 | 0.128 | (63, 24, 7, 15) | 1.84 × 10^−6 | 1.85 × 10^−6 | 0 | 2.00 × 10^−9 |
| 25 | 295 | 0.165 | (63, 24, 7, 13) | 2.68 × 10^−6 | 2.69 × 10^−6 | 0 | 2.00 × 10^−9 |

BCH barcodes of size N ≤ 25, constrained to achieve a multiplexing capacity M ≥ 24 and a probability of undetected multiplexing errors ≤ 10^−8, over a QSC model where mismatch errors occur with probability 10^−2. M and B are the empirical estimates of the multiplexing capacity and the barcoding rate. The last four columns give the empirical estimates of the probability of barcode identification error and of the probability of undetected multiplexing errors, each followed by its upper error bar. Underlying codes are binary BCH codes of size n, shortened to n − s, able to carry k − s informative bits and to correct at least t binary errors.
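The barcoding rate B is not defined in this excerpt. The tabulated values appear consistent, to within about 10^−3, with the information carried per base, log4(M) / N. The sketch below checks that reading against the BCH rows; the formula and the function name `barcoding_rate` are assumptions inferred from the numbers, not a definition taken from the paper.

```python
import math


def barcoding_rate(num_samples: int, barcode_len: int) -> float:
    """Assumed definition: quaternary information per base, log4(M) / N."""
    return math.log(num_samples, 4) / barcode_len


# (N, M, B) triples copied from the BCH table above.
bch_rows = [(21, 86, 0.153), (22, 384, 0.195), (24, 73, 0.128), (25, 295, 0.165)]

for n_bases, m_samples, b_reported in bch_rows:
    b_computed = barcoding_rate(m_samples, n_bases)
    # Print both values side by side; small gaps may come from rounding
    # or truncation in the published table.
    print(f"N={n_bases} M={m_samples} reported={b_reported:.3f} computed={b_computed:.4f}")
```

Under this reading, a larger M at a fixed N raises B, which matches the trend across the tables.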
Fig 1. The empirical probability of decoding error achieved by BCH barcodes of size N, multiplexing capacity M and barcoding rate B.
Sequencing errors follow a QSC model with mismatch probability p. Binary BCH codes of size n, shortened with parameter s, able to induce 2^(k−s) candidate barcode sequences and to correct at least t binary errors at a coding rate r, are used.
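The effect of shortening can be made concrete with a little arithmetic. The sketch below derives the shortened length n − s, the number of informative bits k − s, the number of candidate barcodes 2^(k−s) and the coding rate (k − s) / (n − s) for one (n, k, t, s) tuple. The mapping of two bits to one base is an assumption: it is not stated in this excerpt, but it reproduces the barcode sizes N listed for the BCH barcodes, for example (63 − 21) / 2 = 21. The names `ShortenedBCH` and `shortened_params` are hypothetical.

```python
from typing import NamedTuple


class ShortenedBCH(NamedTuple):
    n: int  # length of the parent binary BCH code, in bits
    k: int  # informative bits of the parent code
    t: int  # guaranteed number of correctable binary errors
    s: int  # shortening parameter


def shortened_params(code: ShortenedBCH) -> dict:
    """Derived quantities of the shortened code (two bits per base is assumed)."""
    length_bits = code.n - code.s
    info_bits = code.k - code.s
    return {
        "length_bits": length_bits,
        "info_bits": info_bits,
        "candidates": 2 ** info_bits,      # candidate barcode sequences
        "bases": length_bits // 2,         # assumed 2 bits per nucleotide
        "rate": info_bits / length_bits,   # coding rate of the shortened code
    }


for params in [(63, 30, 6, 21), (63, 30, 6, 19), (63, 24, 7, 15), (63, 24, 7, 13)]:
    print(params, shortened_params(ShortenedBCH(*params)))
```

For (63, 30, 6, 21) this gives 512 candidate sequences of 21 bases; the reported multiplexing capacity M is smaller than the number of candidates in every BCH row.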
The performance of LDPC barcodes.

| N | M | B | (n, k) | Identification error (estimate) | Identification error (upper error bar) | Undetected error (estimate) | Undetected error (upper error bar) |
|---|---|---|---|---|---|---|---|
| 19 | 65 | 0.158 | (19, 4) | 5.43 × 10^−6 | 5.44 × 10^−6 | 0 | 2.00 × 10^−9 |
| 21 | 210 | 0.183 | (21, 5) | 5.70 × 10^−7 | 5.72 × 10^−7 | 0 | 2.00 × 10^−9 |
| 23 | 648 | 0.203 | (23, 6) | 5.10 × 10^−7 | 5.11 × 10^−7 | 0 | 2.00 × 10^−9 |
| 24 | 1911 | 0.227 | (24, 7) | 1.66 × 10^−6 | 1.67 × 10^−6 | 0 | 2.00 × 10^−9 |
| 23 | 56 | 0.126 | (23, 4) | 9.10 × 10^−8 | 9.11 × 10^−8 | 0 | 2.00 × 10^−9 |
| 25 | 118 | 0.137 | (25, 5) | 1.10 × 10^−7 | 1.11 × 10^−7 | 0 | 2.00 × 10^−9 |

LDPC barcodes of size N ≤ 25, constrained to achieve a multiplexing capacity M ≥ 24 and a probability of undetected multiplexing errors ≤ 10^−8, over a QSC model where mismatch errors occur with probability 10^−2. M and B are the empirical estimates of the multiplexing capacity and the barcoding rate. The last four columns give the empirical estimates of the probability of barcode identification error and of the probability of undetected multiplexing errors, each followed by its upper error bar. Underlying codes are quaternary LDPC codes of size n able to carry k informative quads.
Fig 2. The empirical probability of decoding error achieved by LDPC barcodes of size N, multiplexing capacity M and barcoding rate B.
Sequencing errors follow a QSC model with mismatch probability p. Quaternary LDPC codes of size n = N, able to induce 4^k candidate barcode sequences at a coding rate r, are used.
The performance of BY barcodes.

| N | M | B | (n, k, t) | Identification error (estimate) | Identification error (upper error bar) | Undetected error (estimate) | Undetected error (upper error bar) |
|---|---|---|---|---|---|---|---|
| 7 | 117 | 0.491 | (7, 4, 1) | 2.20 × 10^−3 | 2.21 × 10^−3 | 2.12 × 10^−4 | 2.13 × 10^−4 |
| 8 | 111 | 0.424 | (8, 4, 1) | 2.87 × 10^−3 | 2.88 × 10^−3 | 2.37 × 10^−6 | 2.38 × 10^−6 |
| 15 | 2880 | 0.383 | (15, 11, 1) | 9.94 × 10^−3 | 9.95 × 10^−3 | 0 | 2.00 × 10^−9 |

BY barcodes of size N over a QSC model where mismatch errors occur with probability p. M and B are the empirical estimates of the multiplexing capacity and the barcoding rate. The last four columns give the empirical estimates of the probability of barcode identification error and of the probability of undetected multiplexing errors, each followed by its upper error bar. Underlying codes are quaternary extensions of binary Hamming codes of size n able to carry k informative bits and to correct at least t binary errors.
The performance of Random barcodes.

| N | d | M | B | Identification error (estimate) | Identification error (upper error bar) | Undetected error (estimate) | Undetected error (upper error bar) |
|---|---|---|---|---|---|---|---|
| 8 | 5 | 24 | 0.286 | 2.75 × 10^−2 | 2.76 × 10^−2 | 0 | 2.00 × 10^−9 |
| 8 | 3 | 531 | 0.565 | 7.72 × 10^−2 | 7.73 × 10^−2 | 5.80 × 10^−7 | 5.81 × 10^−7 |
| 9 | 7 | 6 | 0.143 | 1.94 × 10^−3 | 1.95 × 10^−3 | 0 | 2.00 × 10^−9 |
| 9 | 5 | 62 | 0.330 | 3.11 × 10^−2 | 3.12 × 10^−2 | 0 | 2.00 × 10^−9 |
| 9 | 3 | 1936 | 0.606 | 8.64 × 10^−2 | 8.65 × 10^−2 | 8.80 × 10^−7 | 8.82 × 10^−7 |
| 10 | 7 | 13 | 0.185 | 2.42 × 10^−3 | 2.43 × 10^−3 | 0 | 2.00 × 10^−9 |
| 10 | 5 | 164 | 0.367 | 3.47 × 10^−2 | 3.48 × 10^−2 | 1.01 × 10^−8 | 1.02 × 10^−8 |
| 10 | 3 | 7198 | 0.640 | 9.56 × 10^−2 | 9.57 × 10^−2 | 1.13 × 10^−6 | 1.14 × 10^−6 |

Random barcodes of size N, with minimum edit distance [56] equal to their minimum Hamming distance d, over a QSC model where mismatch errors occur with probability p. M and B are the empirical estimates of the multiplexing capacity and the barcoding rate. The last four columns give the empirical estimates of the probability of barcode identification error and of the probability of undetected multiplexing errors, each followed by its upper error bar.
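As a reminder of what the distance d measures, the sketch below computes the minimum pairwise Hamming distance of a barcode set and the number of mismatches that bounded-distance decoding is guaranteed to correct, ⌊(d − 1) / 2⌋, which is a standard coding-theory bound. The three barcodes are a hypothetical toy set, not sequences from the paper, and only the Hamming distance is computed here, not the edit distance.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_hamming_distance(barcodes: list[str]) -> int:
    """Smallest pairwise Hamming distance in the set (needs >= 2 barcodes)."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def correctable_mismatches(d: int) -> int:
    """Mismatches guaranteed correctable at minimum distance d."""
    return (d - 1) // 2


# Hypothetical toy set of 8-nt barcodes, for illustration only.
toy_barcodes = ["AAAAAAAA", "CCCCCAAA", "AAACCCCC"]
d = min_hamming_distance(toy_barcodes)
print(d, correctable_mismatches(d))
```

With d = 3, 5 and 7, as in the table, the guaranteed correction capability is 1, 2 and 3 mismatches respectively.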
Fig 3. The factor graph of an LDPC barcoding system built from a 4-ary LDPC code of size n = 12 able to carry k = 3 informative quads and thus to induce 4^3 = 64 candidate barcode sequences.
Each codeword symbol c_i, i = 1, …, 12, is constrained by exactly j = 3 parity subcodes. The LDPC code is built from m = 9 parity subcodes; for example, c_1 + c_8 + c_11 + 2c_12 = 0 holds. A QSC generates mismatch sequencing errors with the same probability p at every position, so corrupted barcode bases r_i, i = 1, …, 12, are observed after sequencing. At p = 0.01 this system can multiplex up to M = 15 samples with a barcode identification error probability of 10^−4 and an undetected multiplexing error probability of approximately 0.
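The QSC error model used throughout can be sketched as a simple simulation. The code below assumes the standard quaternary symmetric channel: each base is kept with probability 1 − p and otherwise replaced by one of the three other bases, chosen uniformly. The uniform choice among the other bases and the function names are assumptions; the excerpt only states that mismatches occur with probability p.

```python
import random

BASES = "ACGT"


def qsc(barcode: str, p: float, rng: random.Random) -> str:
    """Pass a barcode through an assumed quaternary symmetric channel."""
    observed = []
    for base in barcode:
        if rng.random() < p:
            # Mismatch: substitute one of the three other bases uniformly.
            observed.append(rng.choice([b for b in BASES if b != base]))
        else:
            observed.append(base)
    return "".join(observed)


def prob_any_mismatch(n_bases: int, p: float) -> float:
    """Probability that an n-base barcode is read with at least one mismatch."""
    return 1.0 - (1.0 - p) ** n_bases


rng = random.Random(0)  # fixed seed so the run is repeatable
sent = "ACGT" * 3       # hypothetical 12-nt barcode, not a codeword from the paper
print(sent, qsc(sent, 0.01, rng))
print(prob_any_mismatch(12, 0.01), prob_any_mismatch(24, 0.01))
```

At p = 0.01, roughly one in nine 12-nt barcodes and one in five 24-nt barcodes would contain at least one mismatch under this model, which is why uncoded barcodes of these lengths would be unreliable.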