Abstract
BACKGROUND: Advances in automated DNA sequencing technology have greatly increased the scale of genomic and metagenomic studies. An increasingly popular way to raise project throughput is to multiplex samples during the sequencing phase. This can be achieved by covalently linking short, unique "barcode" DNA segments to genomic DNA samples, for instance by incorporating barcode sequences into PCR primers. Although several strategies have been described to ensure that barcode sequences are unique and robust to sequencing errors, these have not been integrated into the overall primer design process, which can introduce bias into the PCR amplification and/or sequencing steps.
Year: 2009 PMID: 19874596 PMCID: PMC2777893 DOI: 10.1186/1471-2105-10-362
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Barcoded primer structures. Design of barcoded oligonucleotides. "Barcode" indicates the location of sequences unique to each PCR primer. "Template-specificity" indicates sequences required for PCR amplification. In this example, template-specificity regions are broadly specific for bacterial large-subunit rRNA genes (A: LSU559R, B: LSU130F) and are joined to the rest of the primer through a two-nucleotide, randomized linker. Sequences labeled "Sequencing" refer to the primer A and B sequences required by the GS-FLX instrument. Although this example is based on specifications required for operation of the 454 LifeSciences Inc. GS-FLX system, the use of barcrawl can be abstracted to other platforms capable of sequencing PCR amplicons. The "sequencing" segments may not be required by other platforms and are not required for barcrawl to function.
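The primer layout described in the caption can be illustrated with a minimal sketch. It assumes the segments are joined 5' to 3' in the order sequencing segment, barcode, two-nucleotide randomized linker ("NN"), template-specific region; that order, the function name `build_barcoded_primer`, and all example sequences are illustrative placeholders, not the actual GS-FLX or LSU primer sequences.

```python
# Sketch only: segment order and all sequences are assumptions for illustration.
IUPAC = set("ACGTRYSWKMBDHVN")


def build_barcoded_primer(sequencing, barcode, template, linker="NN"):
    """Join the primer segments 5'->3' into one oligonucleotide string.

    `sequencing` may be an empty string for platforms that do not
    need a platform-specific sequencing segment.
    """
    segments = [sequencing, barcode, linker, template]
    for segment in segments:
        unknown = set(segment.upper()) - IUPAC
        if unknown:
            raise ValueError(f"non-IUPAC characters in segment: {sorted(unknown)}")
    return "".join(segment.upper() for segment in segments)


if __name__ == "__main__":
    # Placeholder sequences, not real primer sequences.
    print(build_barcoded_primer("AAAA", "CCGT", "GGTT"))
```

Passing an empty string for `sequencing` reflects the caption's point that the sequencing segment may not be needed on other platforms.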
Summary of barcrawl command-line options.
| Option | Default | Minimum (1) | Maximum (2) | Description |
| -l <int> | = 8 | 1 | 20 | Set length of barcodes |
| -m <int> | >= 3 | 1 | barcode length | Minimum base substitutions between barcode pairs |
| -p <int> | >= 3 | 2 | N | Exclude homopolymers of specified length or greater |
| -a <int> | >= 5 | 2 | N | Exclude hairpins of specified length or greater |
| -b <int> | >= 5 | 2 | N | Exclude heteroduplexes of specified length or greater |
| -f5 <string> | - | - | - | Specify 5' addition to barcode sequence |
| -f3 <string> | - | - | - | Specify 3' addition to barcode sequence |
| -j5 | off | - | - | Exclude barcodes with 5' base same as 3' end of upstream primer |
| -j3 | off | - | - | Exclude barcodes with 3' base same as 5' end of downstream primer |
| -r <string> | - | - | - | Specify reverse primer sequence |
| -g <int> | > 70 | 50 | 100 | Exclude barcodes with % GC content greater than value |
| -c <int> | < 30 | 0 | 50 | Exclude barcodes with % GC content less than value |
| -d | on | - | - | Exclude barcodes that are converted to other barcodes by deletion |
| -w | off | - | - | Order output by number of 454 GS-FLX nucleotide flows |
| -o <string> | out.txt | - | - | Specify output file |
| -v | off | - | - | Set verbose output to terminal |
| -h | off | - | - | Display help |
1 Minimum value for numerical options.
2 Maximum value for numerical options. N = any non-negative integer.
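Several of the filters listed above (GC content, homopolymer length, and minimum substitutions between barcode pairs) can be sketched in a few lines. This is a simple greedy enumeration under stated assumptions, not barcrawl's actual algorithm; the function names are hypothetical, and the hairpin, heteroduplex, and deletion filters are omitted.

```python
from itertools import product


def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def hamming(a, b):
    """Number of base substitutions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def select_barcodes(length=8, min_subs=3, homopolymer=3, gc_min=0.30, gc_max=0.70):
    """Greedily keep barcodes that pass every filter (illustrative only)."""
    chosen = []
    for bases in product("ACGT", repeat=length):
        seq = "".join(bases)
        if longest_homopolymer(seq) >= homopolymer:  # runs of this length or greater
            continue
        if not gc_min <= gc_fraction(seq) <= gc_max:
            continue
        if all(hamming(seq, kept) >= min_subs for kept in chosen):
            chosen.append(seq)
    return chosen


if __name__ == "__main__":
    print(len(select_barcodes(length=6)))
```

Because the greedy pass depends on enumeration order, the number of barcodes it returns need not match the counts barcrawl reports.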
Figure 2. Rectangles specify input and output. Diamonds designate filters applied to data. Barcodes (barcrawl) or sequences (bartab) that do not follow the outlined rules are discarded. Arrows diverging from the central flow chart represent analytic steps that the user can choose to skip.
Summary of bartab command-line options.
| Option | Default | Minimum (1) | Maximum (2) | Description |
| -in <string> | - | - | - | Fasta sequence file to process |
| -qin <string> | <fasta_file_name.qual> | - | - | Quality scores associated with sequences |
| -map <string> | <fasta_file_name.bar> | - | - | Tab delimited file listing barcodes and associated metadata |
| -out <string> | <fasta_file_name> | - | - | Base name for output files |
| -for <string> | - | - | - | Forward primer sequence |
| -rev <string> | - | - | - | Reverse primer sequence |
| -rnm <string> | off | - | - | Toggle on renaming of sequences based on column of barcode file named by specified string |
| -spl <string> | off | - | - | Toggle on splitting sequences into individual files based on column of barcode file named by specified string |
| -rep | off | - | - | Toggle on dereplication of output sequence file(s) |
| -st <int> | 1 | 1 | N | Position of barcode in sequence |
| -qu <int> | 20 | 0 | N | Minimum acceptable quality score, averaged over window |
| -win <int> | 5 | 1 | N | Window for calculation of mean quality score |
| -min <int> | 200 | 1 | N | Minimum acceptable sequence length |
| -amb <int> | 0 | 0 | N | Maximum acceptable number of ambiguous bases |
| -xbar | - | - | - | Toggle off removal of barcodes from sequences |
| -v | off | - | - | Verbose output to stdout |
| -dry | off | - | - | Dry run - report sequence statistics then quit |
| -h | off | - | - | Display help |
1 Minimum value for numerical options. N = any non-negative integer.
2 Maximum value for numerical options.
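The quality, length, and ambiguity options above can be illustrated with a short sketch. It assumes that a read is truncated at the first window whose mean quality falls below the threshold; the table does not say whether bartab trims or discards such reads, so that behaviour and both function names are assumptions. The defaults mirror the `-qu`, `-win`, `-min`, and `-amb` values listed in the table.

```python
def trim_by_quality(seq, quals, min_mean=20, window=5):
    """Return the part of seq that precedes the first low-quality window.

    Assumption: a window whose mean score is below min_mean ends the read.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(seq) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean:
            return seq[:start]
    return seq


def passes_filters(seq, min_length=200, max_ambiguous=0):
    """Apply the minimum-length and ambiguous-base criteria."""
    ambiguous = sum(base not in "ACGT" for base in seq.upper())
    return len(seq) >= min_length and ambiguous <= max_ambiguous


if __name__ == "__main__":
    read = "ACGTACGTAC"
    scores = [30] * 6 + [5] * 4
    trimmed = trim_by_quality(read, scores)
    print(trimmed, passes_filters(trimmed, min_length=4))
```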
Figure 3. Addition of Metadata through Barcode File. A. Format of barcode file. Column headings act as keys for values in column cells. Strings specifying keys and values must consist of ASCII characters without whitespace. Values in a row are separated by tabs. B. Output of sequences annotated through metadata associated with barcodes. C. Output following sequence renaming and dereplication on the basis of metadata (in this case, by sample name). The "xsrep_count" metadata indicates the number of identical sequences recovered from a sample.
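A minimal sketch of reading such a barcode file and counting identical sequences per sample follows. The column names `barcode` and `sample`, the fixed barcode length, and the function names are hypothetical; only the tab-delimited layout with a header row of keys and the idea of a per-sample replicate count come from the caption.

```python
import csv
import io
from collections import Counter


def read_barcode_map(handle, barcode_column="barcode"):
    """Map each barcode to its row of metadata (header row supplies the keys)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[barcode_column]: row for row in reader}


def dereplicate(reads, barcode_map, key="sample", barcode_length=8):
    """Count identical barcode-stripped sequences within each metadata value."""
    counts = Counter()
    for read in reads:
        meta = barcode_map.get(read[:barcode_length])
        if meta is None:  # unknown barcode: skip the read
            continue
        counts[(meta[key], read[barcode_length:])] += 1
    return counts


if __name__ == "__main__":
    # Made-up example file contents, for illustration only.
    text = "barcode\tsample\tsite\nACGTACGT\tS1\tgut\nTGCATGCA\tS2\tskin\n"
    mapping = read_barcode_map(io.StringIO(text))
    reads = ["ACGTACGTGGGG", "ACGTACGTGGGG", "TGCATGCAGGGG", "AAAAAAAACCCC"]
    for (sample, seq), n in dereplicate(reads, mapping).items():
        print(sample, seq, f"xsrep_count={n}")
```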
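Barcode sequence space(s) defined by program parameters.
| Run | Barcode length | Min. substitutions | Min. GC | Max. GC | Homopolymer | Hairpin | Heteroduplex | Deletion filter | Possible barcodes | Barcodes selected | Time, MacBook Pro (1, 2) | Time, workstation (1, 3) |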
| 1 | 4 | 3 | 0.3 | 0.7 | 3 | 5 | 5 | 1 | 256 | 7 | 0.0 (0.0) | 0.0 (0.0) |
| 2 | 5 | 3 | 0.3 | 0.7 | 3 | 5 | 5 | 1 | 1024 | 26 | 0.0 (0.0) | 0.0 (0.0) |
| 3 | 6 | 3 | 0.3 | 0.7 | 3 | 5 | 5 | 1 | 4096 | 82 | 0.4 (0.5) | 0.2 (0.4) |
| 4 | 7 | 3 | 0.3 | 0.7 | 3 | 5 | 5 | 1 | 16384 | 218 | 3.0 (0.0) | 1.2 (0.4) |
| 5 | 8 | 3 | 0.3 | 0.7 | 3 | 5 | 5 | 1 | 65536 | 760 | 54 (0.0) | 24 (0.4) |
| 7 | 9 | 3 | 0.3 | 0.7 | 3 | 5 | 5 | 0 | 262144 | 3113 | 12 (0.0) | 6.0 (0.0) |
| 8 | 9 | 4 | 0.3 | 0.7 | 3 | 5 | 5 | 1 | 262144 | 593 | 159 (0.4) | 68 (0.8) |
| 9 | 9 | 3 | 0.3 | 0.7 | 4 | 5 | 5 | 1 | 262144 | 3294 | 1536 (0.8) | 655 (5.8) |
| 10 | 9 | 3 | 0.3 | 0.7 | 3 | 4 | 4 | 1 | 262144 | 2774 | 979 (0.7) | 416 (1.2) |
| 11 | 9 | 3 | 0.3 | 0.7 | 3 | 6 | 6 | 1 | 262144 | 2774 | 977 (1.1) | 419 (9.3) |
| 12 | 9 | 3 | 0.4 | 0.6 | 3 | 5 | 5 | 1 | 262144 | 2155 | 493 (0.5) | 211 (0.8) |
| 13 | 9 | 3 | 0.3 | 0.7 | 2 | 5 | 5 | 1 | 262144 | 779 | 42 (0.4) | 18 (0.4) |
| 14 | 9 | 3 | 0.2 | 0.8 | 3 | 5 | 5 | 1 | 262144 | 2914 | 1153 (0.8) | 491 (4.9) |
| 15 | 9 | 3 | 0.1 | 0.9 | 3 | 5 | 5 | 1 | 262144 | 2910 | 1177 (0.0) | 505 (5.5) |
| 16 | 10 | 3 | 0.3 | 0.7 | 3 | 5 | 5 | 1 | 1048576 | 9375 | 16507 (1857) | 6650 (23) |
1 Elapsed time of execution: Mean (St. Dev) seconds.
2 Executed on a 2 GHz Intel Core Duo MacBook Pro. 1 GB 667 MHz DDR2 SDRAM. Mac OSX version 10.5.4.
3 Executed on a workstation with 2 × 3 GHz Dual-Core Intel Xeon 5160 processors. 12 GB 800 MHz DDR2 FB-DIMM. Fedora Core 5 operating system.
4 Default parameters
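The tenth column of the table appears to be the total number of possible barcodes of the given length, 4^length, and the eleventh the number that survive the filters. A small sketch of that arithmetic follows, using three (length, selected) pairs taken from the rows above; the function names are illustrative.

```python
def sequence_space(length):
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** length


def retained_fraction(length, selected):
    """Share of the full sequence space that passed the filters."""
    return selected / sequence_space(length)


if __name__ == "__main__":
    # (barcode length, barcodes selected) pairs copied from the table rows.
    for length, selected in [(4, 7), (8, 760), (10, 9375)]:
        total = sequence_space(length)
        print(f"{length}-mers: {selected} of {total} "
              f"({100 * retained_fraction(length, selected):.2f}%)")
```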
Figure 4. Cumulative distribution of barcodes as a function of pyrosequencing nucleotide flows. Barcrawl analyses were performed for a range of barcode lengths. The cumulative sums of selected barcodes are plotted vs. pyrosequencing nucleotide flows. The inset box shows the region of this plot bounded by flows 10-13 and illustrates that, at about flow #11, the sequence space of 10-mers > 9-mers > 8-mers.
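The number of nucleotide flows needed to read a barcode can be estimated with a short sketch. It assumes a cyclic flow order of T, A, C, G and that each flow incorporates an entire homopolymer run of the matching base; the flow order and the function name `flows_required` are assumptions, not details given in the caption.

```python
def flows_required(seq, flow_order="TACG"):
    """Count the flows needed to sequence seq under a cyclic flow order."""
    seq = seq.upper()
    if set(seq) - set(flow_order):
        raise ValueError("sequence contains bases absent from the flow order")
    flows = 0
    pos = 0
    while pos < len(seq):
        base = flow_order[flows % len(flow_order)]
        flows += 1
        # One flow extends through every consecutive copy of its base.
        while pos < len(seq) and seq[pos] == base:
            pos += 1
    return flows


if __name__ == "__main__":
    for barcode in ["TACG", "TTAA", "GCAT"]:
        print(barcode, flows_required(barcode))
```

Under these assumptions, barcodes whose base order follows the flow order finish in fewer flows, which is the property the `-w` option sorts on.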
Figure 5. Functional assessment of barcoded primers by quantitative PCR and pyrosequencing analysis. A. Distribution of Ct scores for 96 barcoded LSU559 primers tested under identical reaction conditions. Bars represent relative frequency of occurrence of Ct scores. B. Q-Q plot of sample Ct scores versus those drawn from a theoretical normal distribution. Line denotes expected values for data obtained from a normal distribution. C. Scatterplot of Ct scores for each barcoded primer assayed in two separate PCR experiments and using different primer aliquots. The Pearson correlation coefficient of the two data sets is 0.48. D. Distribution of trimmed pyrosequencing reads for each of 47 barcoded LSU559 primers tested under identical reaction conditions. Bars represent relative frequency of occurrence of sequence lengths (nts). E. Q-Q plot of pyrosequencing read lengths versus those drawn from a theoretical normal distribution. Line denotes expected values for data obtained from a normal distribution. F. Scatterplot of pyrosequencing read lengths vs. Ct scores for each barcoded primer assayed by pyrosequencing. The Pearson correlation coefficient of the two data sets is -0.01.
Performance measures for bartab:
| 400,000 | - | - | 411 (13) | 65 (0.82) | 145 (8.6) | 40 (0.0) |
| 400,000 | + | - | 993 (193) | 71 (2.07) | 178 (3.1) | 41 (0.0) |
| 400,000 | - | + | 434 (5.5) | 76 (0.75) | 151 (0.8) | 47 (1.2) |
| 400,000 | + | + | 927 (65) | 78 (0.75) | 197 (9.1) | 48 (0.6) |
| 800,000 | - | - | 900 (4.8) | 133 (0.98) | 289 (14) | 81 (0.0) |
| 800,000 | + | - | 1909 (225) | 142 (3.9) | 462 (9.0) | 89 (6.8) |
| 800,000 | - | + | 917 (29) | 148 (2.3) | 299 (1.4) | 97 (2.6) |
| 800,000 | + | + | 2023 (109) | 154 (1.6) | 465 (0.5) | 106 (10) |
| 1,600,000 | - | - | 1923 (69) | 265 (2.2) | 573 (11) | 164 (2.0) |
| 1,600,000 | + | - | 3764 (74) | 285 (4.2) | 1475 (278) | 255 (64) |
| 1,600,000 | - | + | 1909 (88) | 304 (1.7) | 623 (15) | 197 (7.6) |
| 1,600,000 | + | + | 4938 (968) | 313 (4.3) | 1482 (129) | 281 (62) |
| 3,200,000 | - | - | n.d. | 549 (8.4) | 1161 (35) | 415 (34) |
| 3,200,000 | + | - | n.d. | 587 (13) | 3729 (469) | 566 (156) |
| 3,200,000 | - | + | n.d. | 622 (3.6) | 1266 (21) | 457 (29) |
| 3,200,000 | + | + | n.d. | 813 (5.8) | 3494 (231) | 691 (232) |
1 Elapsed time of execution: Mean (St. Dev) seconds.
2 Executed on a 2 GHz Intel Core Duo MacBook Pro. 1 GB 667 MHz DDR2 SDRAM. Mac OSX version 10.5.6.
3 Executed on a 2.6 GHz AMD Phenom 9950 Agena Quad-Core Processor. 4 GB 800 MHz DDR2 FB-DIMM. Ubuntu 8.1 operating system.
4 Executed on a Macintosh workstation with 2 × 3 GHz Quad-Core Intel Xeon processors. 8 GB 800 MHz DDR2 FB-DIMM. Mac OS X Server version 10.5.4.
5 Executed on a workstation with 2 × 3 GHz Dual-Core Intel Xeon 5160 processors. 12 GB 800 MHz DDR2 FB-DIMM. Fedora Core 5 operating system.