Brant C Faircloth, Travis C Glenn.
Abstract
Ligating adapters with unique synthetic oligonucleotide sequences (sequence tags) onto individual DNA samples before massively parallel sequencing is a popular and efficient way to obtain sequence data from many individual samples. Tag sequences should be numerous and sufficiently different to ensure sequencing, replication, and oligonucleotide synthesis errors do not cause tags to be unrecoverable or confused. However, many design approaches only protect against substitution errors during sequencing and extant tag sets contain too few tag sequences. We developed an open-source software package to validate sequence tags for conformance to two distance metrics and design sequence tags robust to indel and substitution errors. We use this software package to evaluate several commercial and non-commercial sequence tag sets, design several large sets (max(count) = 7,198) of edit metric sequence tags having different lengths and degrees of error correction, and integrate a subset of these edit metric tags to polymerase chain reaction (PCR) primers and sequencing adapters. We validate a subset of these edit metric tagged PCR primers and sequencing adapters by sequencing on several platforms and subsequent comparison to commercially available alternatives. We find that several commonly used sets of sequence tags or design methodologies used to produce sequence tags do not meet the minimum expectations of their underlying distance metric, and we find that PCR primers and sequencing adapters incorporating edit metric sequence tags designed by our software package perform as well as their commercial counterparts. We suggest that researchers evaluate sequence tags prior to use or evaluate tags that they have been using. 
The sequence tag sets we design improve on extant sets because they are large, valid across the set, and robust to the suite of substitution, insertion, and deletion errors affecting massively parallel sequencing workflows on all currently used platforms.
Year: 2012 PMID: 22900027 PMCID: PMC3416851 DOI: 10.1371/journal.pone.0042543
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Insertion and deletion errors violate the codeword scheme and reduce the utility of Hamming-based tags.
Panel (A) shows two sequence tags that are different from one another by seven substitutions (Hamming distance = 7) – a distance more than sufficient to differentiate tags in the presence of substitution errors. However, these same two tags have an edit distance of two (B) – meaning that a total of two insertions, substitutions, or deletions can turn Tag 1 into Tag 2 and confuse samples. Although it seems improbable that two indels or substitutions would occur in a sequence tag, consider the third case (C) in which a single deletion event at the 5′ end of a sequence tag adjoining DNA template beginning with 5′ guanine confuses Tag 1 with Tag 2. Edit metric sequence tags of distance three or greater would mitigate this mistake.
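The contrast between the two metrics in Figure 1 can be reproduced with a minimal Python sketch. The tag sequences and the template fragment below are hypothetical illustrations, not the tags shown in the figure, and `hamming` and `edit_distance` are helper names introduced here rather than functions from the authors' software.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical tags: TAG_2 is TAG_1 shifted left by one base with a G appended.
TAG_1 = "ACGTACGT"
TAG_2 = "CGTACGTG"
TEMPLATE = "GATTACA"  # hypothetical template beginning with 5' guanine

print(hamming(TAG_1, TAG_2))        # 8: every position differs
print(edit_distance(TAG_1, TAG_2))  # 2: one deletion plus one insertion

# Analogue of panel (C): a single 5' deletion in TAG_1 pulls the first
# template base into the tag window, so the read looks like TAG_2.
read = (TAG_1[1:] + TEMPLATE)[:len(TAG_1)]
print(read == TAG_2)                # True
```

A large Hamming distance therefore says nothing about robustness to indels, which is why the edit metric is the relevant check for these workflows.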
Figure 2. Using Hamming codes to design binary encoded sequence tags when synthesis, replication, or sequencing errors mutate the nucleotide sequence reduces the number of single-base errors that are correctable during downstream demultiplexing.
Here, we show two sequence tags (Tag 1 and Tag 2) and both their nucleotide and binary encodings. Tag 1 and Tag 2 have a Hamming distance of four between their binary representations and a Hamming distance of two between their nucleotide representations. Error 1 is correctable to Tag 2, because a single nucleotide substitution (in purple) results in a single, binary difference (11 versus 01) between Error 1 and Tag 2, and single binary errors are correctable when tags are at least three binary differences from each other. Error 2 and Error 3 tags also exhibit a single nucleotide substitution (in purple) but two binary differences from Tag 1 and two binary differences from Tag 2. Because there is more than a single binary difference, we cannot determine whether the source tag was originally Tag 1 or Tag 2, we cannot correct the error, and we must discard the read. More generally, because of the binary encoding and the Hamming distance between tags (Hamming distance four between binary representations, Hamming distance two between nucleotide representations), we can correct single binary errors seen in the substitutions around the perimeter of inset (B), but we cannot correct double binary errors across the diagonals of inset (B). Because these single nucleotide, double binary substitutions (i.e., across the diagonals) comprise two of six potential substitution mutations, we cannot correct 33% (2/6) of single nucleotide substitution errors.
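The 2/6 figure follows from how any two-bit code maps the four nucleotides. The sketch below counts the single-nucleotide substitutions that flip both bits. The specific encoding (A=00, C=01, G=10, T=11) is an assumption chosen for illustration and may differ from the one used in the figure; the count is the same for any one-to-one two-bit encoding.

```python
from itertools import combinations

# Assumed two-bit encoding, for illustration only.
ENCODING = {"A": "00", "C": "01", "G": "10", "T": "11"}


def bit_differences(x: str, y: str) -> int:
    """Hamming distance between the binary encodings of two nucleotides."""
    return sum(p != q for p, q in zip(ENCODING[x], ENCODING[y]))


# The six unordered nucleotide substitutions.
pairs = list(combinations("ACGT", 2))

# Substitutions that change both bits (the "diagonals" in the caption).
double_bit = [pair for pair in pairs if bit_differences(*pair) == 2]

print(len(pairs))                    # 6
print(double_bit)                    # [('A', 'T'), ('C', 'G')]
print(len(double_bit) / len(pairs))  # one third of substitutions
```

With this encoding the uncorrectable substitutions are A↔T and C↔G; a different encoding would move the diagonals to other pairs but leave the proportion unchanged.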
Commercial and non-commercial sequence tag sets and the conformance of each to the stated or assumed distance metric (edit or Hamming).
| Class | Set Name | Length (nt) | Ntags | Design Algorithm | Minimum Distance (exp) | Minimum Distance (obs) | Pair Violations | Tags≥Dexpected | Comments |
|---|---|---|---|---|---|---|---|---|---|
| Contain Violations | Illumina TruSeq sRNA | 6 | 48 | Hamming | 3 | 2 | 2 | 47 | Some tags violate expected Hamming distance |
| Contain Violations | Hamady et al. 2007 | 8 | 1544 | Hamming | 3 | 4/2 | - | 1544 | Only corrects 66% of errors |
| Contain Violations | Meyer et al. 2010 | 6 | 75 | Edit | 3 | 2 | 40 | 49 | Some tags violate expected edit distance |
| Contain Violations | Meyer et al. 2010 | 8 | 711 | Edit | 3 | 2 | 551 | 429 | Some tags violate expected edit distance |
| Contain Violations | Adey et al. 2010 | 9 | 96 | Edit | 4 | 2 | 58 | 64 | Some tags violate expected edit distance |
| Correct Hamming distance | Illumina TruSeq RNA and DNA | 6 | 27 | Hamming | 3 | 3 | - | 27 | |
| Correct Hamming distance | Meyer et al. 2008 | 7 | 52 | Hamming | 3 | 3 | - | 52 | |
| Correct Hamming distance | Meyer et al. 2008 | 8 | 130 | Hamming | 3 | 3 | - | 130 | |
| Correct edit distance | Qiu et al. 2003 | 6 | 21 | Edit | 3 | 3 | - | 21 | |
| Correct edit distance | Frank 2009 | 6 | 81 | Other | 2 | 2 | - | 81 | Design algorithm similar to edit distance 2 |
| Correct edit distance | Illumina Nextera DNA | 8 | 8/12 | Edit | 3 | 3 | - | 8 or 12 | |
| Correct edit distance | Frank 2009 | 8 | 760 | Other | 2 | 2 | - | 760 | Design algorithm similar to edit distance 2 |
| Correct edit distance | Roche 454 MID Extended | 10 | 151 | Edit | 4 | 4 | - | 151 | |
| Correct edit distance | Roche 454 RL-MID Extended | 10 | 132 | Edit | 4 | 4 | - | 132 | |
| Designed for this publication | EDITTAG | 6 | 61 | Edit | 3 | 3 | - | 61 | |
| Designed for this publication | EDITTAG | 7 | 211 | Edit | 3 | 3 | - | 211 | |
| Designed for this publication | EDITTAG | 8 | 531 | Edit | 3 | 3 | - | 531 | |
| Designed for this publication | EDITTAG | 9 | 1,936 | Edit | 3 | 3 | - | 1,936 | |
| Designed for this publication | EDITTAG | 10 | 7,198 | Edit | 3 | 3 | - | 7,198 | |
Hamady et al. [31] tags are from the nmeth.1184-S1.pdf supplementary file.
Hamady et al. [31] tags are Hamming distance 4 from one another in binary encoding but Hamming distance 2 from one another in nucleotide encoding.
We generated Meyer et al. [29] tags using: ‘python create_index_sequences.py -l
Adey et al. [41] tags are from the gb-2010-11-12-r119-s3.pdf supplementary file.
Meyer et al. [3] tags are from the nprot.2007.520-S1.doc supplementary file.
We generated Frank [42] tags using: ‘barcrawl -l
This is similar to an expected edit distance of two.
Illumina Nextera tags are incorporated to either end of the template strand in combinatorial fashion to identify up to 96 samples.
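A conformance check of the kind summarized in the table can be sketched as follows: compute every pairwise distance in a tag set, report the observed minimum, and list the pairs that fall below the expected distance. This is a minimal illustration, not the authors' validation code; `check_tag_set` is a hypothetical helper and the three toy tags are invented so that the result is easy to verify by hand.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def check_tag_set(tags, expected, distance=edit_distance):
    """Return (observed minimum distance, list of violating pairs).

    A pair violates the set when its distance is below `expected`.
    """
    scored = [(a, b, distance(a, b)) for a, b in combinations(tags, 2)]
    observed = min(d for _, _, d in scored)
    violations = [(a, b, d) for a, b, d in scored if d < expected]
    return observed, violations


# Toy tags (hypothetical, chosen only for easy hand-checking).
toy_tags = ["AAAA", "CCCC", "AACC"]
observed, violations = check_tag_set(toy_tags, expected=3)

print(observed)    # 2
print(violations)  # [('AAAA', 'AACC', 2), ('CCCC', 'AACC', 2)]
```

Passing a Hamming-distance function as `distance` would check a set against the Hamming metric instead. The all-pairs comparison is quadratic in the number of tags, which is acceptable for sets of the sizes listed in the table.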
Counts of sequence tags in sets of four to 10 nucleotides in length and edit distance ≥3, designed using EDITTAG.
| Tag Length (nt) | Edit Distance 3 | Edit Distance 4 | Edit Distance 5 | Edit Distance 6 | Edit Distance 7 | Edit Distance 8 | Edit Distance 9 |
|---|---|---|---|---|---|---|---|
| 4 | 7 | - | - | - | - | - | - |
| 5 | 25 | 7 | - | - | - | - | - |
| 6 | 61 | 15 | 5 | - | - | - | - |
| 7 | 211 | 41 | 11 | 4 | - | - | - |
| 8 | 531 | 103 | 24 | 8 | 3 | - | - |
| 9 | 1936 | 301 | 62 | 18 | 6 | 3 | - |
| 10 | 7198 | 971 | 164 | 40 | 14 | 5 | 3 |
We did not include, in any set, sequence tags having >2 homopolymers, GC content outside the range 40%
Figure 3. Pairwise edit distance between 25 tags of five nucleotides in length and edit distance three designed using EDITTAG.
Figure 4. Number of HiSeq reads returned for libraries prepared using Illumina TruSeq adapters versus libraries prepared using adapters integrating edit metric sequence tags designed using EDITTAG.