Abstract
The diversity and scope of multiplex parallel sequencing applications are steadily increasing. Critically, multiplex parallel sequencing methods rely on barcoded primers for sample identification, and the quality of the barcodes directly affects the quality of the resulting sequence data. Inspection of recent publications reveals surprisingly variable barcode quality. Some barcodes are designed in a semi-empirical fashion, without quantitative consideration of error correction or minimal distance properties. After a systematic comparison of published barcode sets, including commercially distributed barcoded primers from Illumina and Epicentre, methods for improved, Hamming code-based sequences are suggested and illustrated. Hamming barcodes can be employed for DNA tag designs in many different ways while preserving minimal distance and error-correcting properties. In addition, Hamming barcodes remain flexible with regard to essential biological parameters such as sequence redundancy and GC content. Wider adoption of improved Hamming barcodes is encouraged in multiplex parallel sequencing applications.
Year: 2012 PMID: 22615825 PMCID: PMC3355179 DOI: 10.1371/journal.pone.0036852
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Linear conversion of the Hamming(16,11) code into DNA sequence.
| Decimal counter | Binary data counter | Hamming code | Linear translation into DNA sequence |
| 0 | 00000000000 | 0000000000000000 | AAAAAAAA (00,00,00,00,00,00,00,00,) |
| 1 | 00000000001 | 1101000100000011 | TCACAAAT (11,01,00,01,00,00,00,11,) |
| 2 | 00000000010 | 0101000100000100 | CCACAACA (01,01,00,01,00,00,01,00,) |
| 3 | 00000000011 | 1000000000000111 | GAAAAACT (10,00,00,00,00,00,01,11,) |
| 4 | 00000000100 | 1001000100001000 | GCACAAGA (10,01,00,01,00,00,10,00,) |
| 5 | 00000000101 | 0100000000001011 | CAAAAAGT (01,00,00,00,00,00,10,11,) |
| 6 | 00000000110 | 1100000000001100 | TAAAAATA (11,00,00,00,00,00,11,00,) |
| 7 | 00000000111 | 0001000100001111 | ACACAATT (00,01,00,01,00,00,11,11,) |
| 8 | 00000001000 | 0001000100010001 | ACACACAC (00,01,00,01,00,01,00,01,) |
| 9 | 00000001001 | 1100000000010010 | TAAAACAG (11,00,00,00,00,01,00,10,) |
| 10 | 00000001010 | 0100000000010101 | CAAAACCC (01,00,00,00,00,01,01,01,) |
| 11 | 00000001011 | 1001000100010110 | GCACACCG (10,01,00,01,00,01,01,10,) |
| 12 | 00000001100 | 1000000000011001 | GAAAACGC (10,00,00,00,00,01,10,01,) |
| 13 | 00000001101 | 0101000100011010 | CCACACGG (01,01,00,01,00,01,10,10,) |
| 14 | 00000001110 | 1101000100011101 | TCACACTC (11,01,00,01,00,01,11,01,) |
| 15 | 00000001111 | 0000000000011110 | AAAAACTG (00,00,00,00,00,01,11,10,) |
| 16 | 00000010000 | 1100000100100000 | TAACAGAA (11,00,00,01,00,10,00,00,) |
Note: Data bits are interleaved with parity bits, which occupy every 2^n position (1, 2, 4, 8 and 16). The full list of codes is provided in supplementary File S1.
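The conversion shown in the table can be sketched in Python. This is a minimal sketch under stated assumptions, not the author's code: it assumes parity bits at positions 1, 2, 4 and 8 computed with even parity, an overall even-parity bit at position 16, and the pair mapping 00 = A, 01 = C, 10 = G, 11 = T, all of which agree with the rows listed above. The function names `hamming_16_11` and `bits_to_dna` are illustrative.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def hamming_16_11(counter: int) -> str:
    """Encode an 11-bit counter as a 16-bit extended Hamming code word."""
    data = format(counter, "011b")
    code = [0] * 17  # index 0 is unused so that indices match bit positions
    # Data bits fill every position from 1 to 15 that is not a power of two.
    data_positions = [p for p in range(1, 16) if p & (p - 1)]
    for pos, bit in zip(data_positions, data):
        code[pos] = int(bit)
    # Each parity bit makes the positions it covers sum to an even number.
    for parity in (1, 2, 4, 8):
        code[parity] = sum(code[p] for p in range(1, 16) if p & parity) % 2
    # Position 16 holds the overall parity of positions 1 to 15.
    code[16] = sum(code[1:16]) % 2
    return "".join(map(str, code[1:]))


def bits_to_dna(bits: str) -> str:
    """Translate consecutive, non-overlapping bit pairs into bases."""
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


if __name__ == "__main__":
    for counter in (0, 1, 2, 16):
        word = hamming_16_11(counter)
        print(counter, word, bits_to_dna(word))
```

For counter 1 this gives 1101000100000011 and TCACAAAT, matching the second row of the table.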
Hamming(7,4) codes read with overlap or randomizer.
| Decimal counter | Binary data counter | Hamming(7,4) | Conversion | Randomizer | Conversion with randomizer | Hamming(8,4) | Converted |
| 0 | 0000 | 0000000 | AAAAAA | 0101010 | ACACACA | 00000000 | AAAAAAA |
| 1 | 0001 | 1101001 | TGCGAC | 0111100 | GTCTCAG | 11010010 | TGCGACG |
| 2 | 0010 | 0101010 | CGCGCG | 0000000 | AGAGAGA | 01010101 | CGCGCGC |
| 3 | 0011 | 1000011 | GAAACT | 0010110 | GACACTG | 10000111 | GAAACTT |
| 4 | 0100 | 1001100 | GACTGA | 0011001 | GACTGAC | 10011001 | GACTGAC |
| 5 | 0101 | 0100101 | CGACGC | 0001111 | AGACTCT | 01001011 | CGACGCT |
| 6 | 0110 | 1100110 | TGACTG | 0110011 | GTCAGTC | 11001100 | TGACTGA |
| 7 | 0111 | 0001111 | AACTTT | 0100101 | ACAGTGT | 00011110 | AACTTTG |
| 8 | 1000 | 1110000 | TTGAAA | 0100101 | GTGACAC | 11100001 | TTGAAAC |
| 9 | 1001 | 0011001 | ACTGAC | 0110011 | ACTGACT | 00110011 | ACTGACT |
| 10 | 1010 | 1011010 | GCTGCG | 0001111 | GAGTCTC | 10110100 | GCTGCGA |
| 11 | 1011 | 0110011 | CTGACT | 0011001 | AGTCAGT | 01100110 | CTGACTG |
| 12 | 1100 | 0111100 | CTTTGA | 0010110 | AGTGTCA | 01111000 | CTTTGAA |
| 13 | 1101 | 1010101 | GCGCGC | 0000000 | GAGAGAG | 10101010 | GCGCGCG |
| 14 | 1110 | 0010110 | ACGCTG | 0111100 | ACTCTGA | 00101101 | ACGCTGC |
| 15 | 1111 | 1111111 | TTTTTT | 0101010 | GTGTGTG | 11111111 | TTTTTTT |
Each base is read from a 2-bit code word. Overlap: two consecutive bits are read from the Hamming code with a 1-bit step. Randomizer: one bit is read from the Hamming code and one bit from the randomizer. Both versions have D = 3. Hamming(8,4) has D = 4.
Note: the randomizer flips its bit value whenever two consecutive bits in the Hamming code are identical.
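The two reading schemes can be illustrated with a short Python sketch. It is written under assumptions rather than taken from the paper: the randomizer is assumed to start at 0 (as every randomizer value in the table does), the Hamming bit is assumed to be the high bit of each pair, and the mapping 00 = A, 01 = C, 10 = G, 11 = T is assumed. The function names are illustrative.

```python
BASES = "ACGT"  # assumed mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T


def overlap_read(bits: str) -> str:
    """Read two consecutive bits per base, moving one bit at a time."""
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(len(bits) - 1))


def randomizer(bits: str) -> str:
    """Start at 0 and flip whenever two consecutive code bits are identical."""
    out = [0]
    for prev, cur in zip(bits, bits[1:]):
        out.append(out[-1] ^ 1 if prev == cur else out[-1])
    return "".join(map(str, out))


def randomized_read(bits: str) -> str:
    """Pair each code bit (high bit) with its randomizer bit (low bit)."""
    return "".join(BASES[int(h + r, 2)] for h, r in zip(bits, randomizer(bits)))


if __name__ == "__main__":
    word = "1101001"  # Hamming(7,4) code word for decimal counter 1
    print(overlap_read(word))       # 6 bases
    print(randomizer(word))         # 7 randomizer bits
    print(randomized_read(word))    # 7 bases
    print(overlap_read(word + "0")) # Hamming(8,4) word 11010010, 7 bases
```

For the code word 1101001 the sketch yields TGCGAC, 0111100 and GTCTCAG, matching the table row for counter 1. The overlap read shortens the barcode by one base, whereas the randomized read keeps one base per code bit.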
Figure 1. A concept of Hamming error correction in quaternary format.
A 7-base sequence is indexed by position, and the value of each base is given. Checksums are calculated from these values, and a possible error is detected (in the given example, “T” is the error). Max(Ch) = 2 gives the type of the error. The sequence Ch3,Ch2,Ch1 = 202 is transformed to binary 101 (with the rule: if Ch > 0 then Ch = 1), which equals decimal 5 and defines the position of the error. Since the value at the erroneous position is 3 (for C = “T”, S = 3), the correct value is 3 − 2 = 1. For S = 1, C = “C”. Thus the barcode should be corrected at position 5, where the correct base is “C”. Note on calculating the correct base: if S < 0, use “the wheel rule” (−3 is 1, −1 is 3, −2 is 2), which can often (but not always!) be replaced by a mod 4 operation. In short: S = (erroneous base value − error type) mod 4.
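The correction steps described in the legend can be sketched in Python. This is a minimal sketch, not the author's implementation. It assumes the base values A = 0, C = 1, G = 2, T = 3, and it assumes the usual Hamming layout in which Ch1 covers positions 1, 3, 5, 7, Ch2 covers positions 2, 3, 6, 7 and Ch3 covers positions 4, 5, 6, 7, with every checksum of a valid barcode equal to 0 mod 4. The example input is TGACCG, a 6-base barcode from the Hamming(6,3) table, with its fifth base changed to T; it is not the 7-base sequence drawn in the figure. The function names are illustrative.

```python
BASES = "ACGT"  # assumed base values S: A = 0, C = 1, G = 2, T = 3


def checksums(barcode: str) -> list:
    """Return [Ch1, Ch2, Ch3], each a sum of base values mod 4."""
    values = [BASES.index(base) for base in barcode]
    result = []
    for bit in (1, 2, 4):
        # A checksum covers every position whose index has this bit set.
        total = sum(v for pos, v in enumerate(values, start=1) if pos & bit)
        result.append(total % 4)
    return result


def correct(barcode: str):
    """Correct a single substitution; return (barcode, error position or None)."""
    ch = checksums(barcode)
    error_type = max(ch)
    if error_type == 0:
        return barcode, None  # all checksums are zero: no error detected
    if any(c not in (0, error_type) for c in ch):
        raise ValueError("checksums disagree: more than one error")
    # Nonzero checksums become 1-bits; read as binary they give the position.
    position = sum(bit for bit, c in zip((1, 2, 4), ch) if c)
    if position > len(barcode):
        raise ValueError("error position lies outside the barcode")
    value = (BASES.index(barcode[position - 1]) - error_type) % 4
    fixed = barcode[:position - 1] + BASES[value] + barcode[position:]
    return fixed, position


if __name__ == "__main__":
    print(checksums("TGACTG"))  # Ch1, Ch2, Ch3
    print(correct("TGACTG"))
```

With the input TGACTG the checksums are Ch3,Ch2,Ch1 = 202, the error position is 5 and the corrected barcode is TGACCG, the same pattern as in the legend. Python's `%` with a positive modulus never returns a negative number, so `% 4` reproduces the wheel rule here; in languages where the remainder takes the sign of the dividend, such as C or Java, the wheel rule has to be applied explicitly.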
Examples of quaternary Hamming encoded barcode sequences.
| Decimal counter | Quaternary data counter | Hamming(6,2) | Conversion | Hamming (6,3) | Conversion |
| 0 | 000 | 000000 | AAAAAA | 000000 | AAAAAA |
| 1 | 001 | 300311 | TAATCC | 030301 | ATATAC |
| 2 | 002 | 200222 | GAAGGG | 020202 | AGAGAG |
| 3 | 003 | 100133 | CAACTT | 010103 | ACACAT |
| 4 | 010 | 331001 | TTCAAC | 300310 | TAATCA |
| 5 | 011 | 231312 | GTCTCG | 330211 | TTAGCC |
| 6 | 012 | 131223 | CTCGGT | 320112 | TGACCG |
| 7 | 013 | 031130 | ATCCTA | 310013 | TCAACT |
| 8 | 020 | 222002 | GGGAAG | 200220 | GAAGGA |
| 9 | 021 | 122313 | CGGTCT | 230121 | GTACGC |
| 10 | 022 | 022220 | AGGGGA | 220022 | GGAAGG |
| 11 | 023 | 322131 | TGGCTC | 210323 | GCATGT |
| 12 | 030 | 113003 | CCTAAT | 100130 | CAACTA |
| 13 | 031 | 013310 | ACTTCA | 130031 | CTAATC |
| 14 | 032 | 313221 | TCTGGC | 120332 | CGATTG |
| 15 | 033 | 213132 | GCTCTG | 110233 | CCAGTT |
Note: The Hamming(6,3) list shown here is incomplete; the full set can be found in supplementary File S2.
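The Hamming(6,3) column can be reproduced with a short Python sketch. The encoding rule below is inferred from the listed rows and is an assumption, not a statement of the author's procedure: data digits sit at positions 3, 5 and 6, and the check digits at positions 1, 2 and 4 are chosen so that each checksum (positions 1, 3, 5; positions 2, 3, 6; positions 4, 5, 6) is 0 mod 4. Because every listed row has a leading data digit of 0, the placement of that digit at position 3 is not confirmed by the table. The function names are illustrative.

```python
from itertools import combinations, product

BASES = "ACGT"  # assumed mapping of quaternary digits: 0 = A, 1 = C, 2 = G, 3 = T


def h4_6_3(data: str) -> str:
    """Encode three quaternary digits as a 6-digit quaternary code word."""
    d1, d2, d3 = (int(c) for c in data)
    p1 = -(d1 + d2) % 4  # makes positions 1, 3, 5 sum to 0 mod 4
    p2 = -(d1 + d3) % 4  # makes positions 2, 3, 6 sum to 0 mod 4
    p4 = -(d2 + d3) % 4  # makes positions 4, 5, 6 sum to 0 mod 4
    return "".join(map(str, (p1, p2, d1, p4, d2, d3)))


def to_dna(digits: str) -> str:
    """Translate quaternary digits into bases."""
    return "".join(BASES[int(d)] for d in digits)


def min_distance(words) -> int:
    """Smallest number of differing positions over all pairs of words."""
    return min(sum(a != b for a, b in zip(x, y)) for x, y in combinations(words, 2))


if __name__ == "__main__":
    full_set = [h4_6_3("".join(map(str, d))) for d in product(range(4), repeat=3)]
    print(h4_6_3("012"), to_dna(h4_6_3("012")))
    print(len(full_set), min_distance(full_set))
```

Under these assumptions the data counter 012 gives 320112 and TGACCG, as in the table, and the full set contains 64 code words with a minimal distance of 3.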
Comparison of commercially available and quaternary Hamming based barcodes.
| Barcode set name | Set size | Dmin | GC frequencies | | | | | | | | | Sequence redundancy | | | | | | | |
| | | | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| 6-mers | | | | | | | | | | | | | | | | | | | |
| Epicentre set | 12 | 3–4 | 0 | 0 | 3 | 8 | 1 | 0 | 0 | | | 5 | 7 | 0 | 0 | 0 | 0 | | |
| H4(6,2) | 16 | 4 | 1 | 0 | 6 | 0 | 9 | 0 | 0 | | | 2 | 10 | 2 | 1 | 0 | 1 | | |
| H4(6,2) filtered | 12 | 4 | 0 | 0 | 6 | 0 | 6 | 0 | 0 | | | 2 | 10 | 0 | 0 | 0 | 0 | | |
| H4(6,2) filtered GT transposed | 12 | 4 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | | | 2 | 10 | 0 | 0 | 0 | 0 | | |
| H4(6,2) filtered AC transposed | 12 | 4 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | | | 2 | 10 | 0 | 0 | 0 | 0 | | |
| Illumina set | 48 | 2–3 | 1 | 1 | 13 | 20 | 8 | 5 | 0 | | | 17 | 26 | 3 | 2 | 0 | 0 | | |
| Craig et al., 2008 | 48 | 2 | 4 | 0 | 20 | 16 | 0 | 8 | 0 | | | 0 | 32 | 13 | 0 | 3 | 0 | | |
| H4(6,3) | 64 | 3 | 1 | 3 | 18 | 26 | 9 | 3 | 4 | | | 21 | 31 | 8 | 3 | 0 | 1 | | |
| H4(6,3) filtered | 57 | 3 | 0 | 3 | 17 | 26 | 8 | 3 | 0 | | | 21 | 29 | 7 | 0 | 0 | 0 | | |
| 7-mers | | | | | | | | | | | | | | | | | | | |
| H(7,4) | 256 | 3 | 2 | 14 | 42 | 70 | 70 | 42 | 14 | 2 | | 76 | 60 | 68 | 32 | 12 | 4 | 4 | |
| H(7,4) AC transposed | 256 | 3 | 16 | 0 | 0 | 112 | 112 | 0 | 0 | 16 | | 76 | 60 | 68 | 32 | 12 | 4 | 4 | |
| H(7,4) AC transposed filtered | 224 | 3 | 0 | 0 | 0 | 112 | 112 | 0 | 0 | 0 | | 72 | 56 | 64 | 24 | 8 | 0 | 0 | |
| 8-mers | | | | | | | | | | | | | | | | | | | |
| H(8,4) | 256 | 4 | 2 | 0 | 56 | 0 | 140 | 0 | 56 | 0 | 2 | 76 | 32 | 88 | 20 | 28 | 8 | 0 | 4 |
| H(8,4) AG transposed | 256 | 4 | 16 | 0 | 0 | 0 | 224 | 0 | 0 | 0 | 16 | 76 | 32 | 88 | 20 | 28 | 8 | 0 | 4 |
| H(8,4) AG transposed filtered | 224 | 4 | 0 | 0 | 0 | 0 | 224 | 0 | 0 | 0 | 0 | 72 | 32 | 80 | 16 | 24 | 0 | 0 | 0 |
| H(16,11) | 2048 | 4 | 31 | 0 | 383 | 0 | 1216 | 0 | 384 | 0 | 32 | 325 | 1166 | 447 | 73 | 30 | 1 | 1 | 3 |
| Hamady et al., 2007 | 1544 | 2 | 0 | 0 | 0 | 0 | 1544 | 0 | 0 | 0 | 0 | 536 | 1008 | 0 | 0 | 0 | 0 | 0 | 0 |
| Erlich et al., 2009 | 385 | 2 | 0 | 0 | 0 | 0 | 385 | 0 | 0 | 0 | 0 | 122 | 263 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: H stands for binary Hamming codes, H4 stands for quaternary Hamming codes.
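The quantities compared in this table can be computed for any barcode set with a few lines of Python. The sketch below uses the 16 Hamming(6,2) barcodes listed in the quaternary Hamming table. It rests on two assumptions: that "GC frequencies" counts the barcodes with a given number of G or C bases, and that "sequence redundancy" is the length of the longest run of identical bases in a barcode. Both readings reproduce the H4(6,2) row but are not spelled out in the text. The filter that keeps barcodes with runs of at most 2 bases is likewise an assumption, and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations, groupby

# The 16 Hamming(6,2) barcodes from the quaternary Hamming table.
H4_6_2 = [
    "AAAAAA", "TAATCC", "GAAGGG", "CAACTT", "TTCAAC", "GTCTCG",
    "CTCGGT", "ATCCTA", "GGGAAG", "CGGTCT", "AGGGGA", "TGGCTC",
    "CCTAAT", "ACTTCA", "TCTGGC", "GCTCTG",
]


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))


def min_distance(barcodes) -> int:
    """Dmin: the smallest pairwise distance within the set."""
    return min(hamming_distance(a, b) for a, b in combinations(barcodes, 2))


def gc_count(barcode: str) -> int:
    """Number of G or C bases in the barcode."""
    return sum(base in "GC" for base in barcode)


def longest_run(barcode: str) -> int:
    """Length of the longest run of identical bases (assumed redundancy)."""
    return max(len(list(group)) for _, group in groupby(barcode))


if __name__ == "__main__":
    print("Dmin:", min_distance(H4_6_2))
    print("GC:", sorted(Counter(map(gc_count, H4_6_2)).items()))
    print("Redundancy:", sorted(Counter(map(longest_run, H4_6_2)).items()))
    # Assumed filter: keep barcodes whose longest run is at most 2 bases.
    filtered = [b for b in H4_6_2 if longest_run(b) <= 2]
    print("Filtered set size:", len(filtered))
```

For this set the sketch reports Dmin = 4, GC counts of 1, 6 and 9 barcodes with 0, 2 and 4 G or C bases, and 12 barcodes after filtering, in agreement with the H4(6,2) and H4(6,2) filtered rows.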