Panu Somervuo, Patrik Koskinen, Peng Mei, Liisa Holm, Petri Auvinen, Lars Paulin.
Abstract
BACKGROUND: Current high-throughput sequencing platforms provide the capacity to sequence multiple samples in parallel. Different samples are labeled by attaching a short, sample-specific nucleotide sequence, a barcode, to each DNA molecule prior to pooling them into a mix containing a number of libraries to be sequenced simultaneously. After sequencing, the samples are binned by identifying the barcode sequence within each sequence read. In order to tolerate sequencing errors, barcodes should be sufficiently far apart from each other in sequence space. An additional constraint, due to both nucleotide usage and basecalling accuracy, is that the proportions of the different nucleotides should be in balance at each barcode position. The number of samples to be mixed in each sequencing run may vary, which raises the problem of how to select the best subset of the barcodes available at a sequencing core facility for each run. Plenty of tools are available for de novo barcode design, but they are not suitable for subset selection.
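The binning step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the barcode occupies the first bases of the read; real library layouts differ, and `assign_sample` and its `max_mismatches` parameter are hypothetical names, not part of BARCOSEL. If every pair of barcodes is at least three mismatches apart, a read with one substitution error is within distance 1 of its true barcode and at least distance 2 from every other barcode, so the assignment stays unique.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read: str, barcodes: Sequence[str],
                  max_mismatches: int = 1) -> Optional[str]:
    """Return the single barcode matching the read prefix, else None.

    Assumption: the barcode sits at the start of the read.
    A read is left unassigned if no barcode, or more than one barcode,
    lies within max_mismatches of its prefix.
    """
    hits = []
    for bc in barcodes:
        prefix = read[:len(bc)]
        # Skip reads that are shorter than the barcode itself.
        if len(prefix) == len(bc) and hamming(prefix, bc) <= max_mismatches:
            hits.append(bc)
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    barcodes = ["AACACATC", "AGAGTGCG"]
    # Prefix AACAGATC carries one substitution relative to AACACATC.
    print(assign_sample("AACAGATCGGGTTT", barcodes))
```

With barcodes that are too close together, for example `AAAA` and `AAAT`, a read starting with `AAAA` matches both within one mismatch and is left unassigned, which is the failure mode a minimum-distance requirement avoids.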
Keywords: Barcode; DNA; Integer programming; Multiplexing; Optimization; Sequencing
Year: 2018 PMID: 29976145 PMCID: PMC6034344 DOI: 10.1186/s12859-018-2262-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Applications of BARCOSEL. BARCOSEL can be used in three different modes, depicted in panels: a) selecting an optimal barcode set from candidates, b) checking a user-defined set of barcodes, and c) augmenting an existing set of barcodes
Fig. 2 Web interface to BARCOSEL. The user needs to give only two inputs: (1) candidate barcodes in FASTA format and (2) the number of barcodes to be selected. Sequences can be imported either by copy-pasting them into a text box or by uploading a FASTA file. After the user presses the submit button, a web page is returned containing an optimal set of barcodes and an image showing the nucleotide balance at each barcode position. If no solution can be found, the user gets a report. Optional input parameters become available when pressing the Advanced button. These include (3) an initial set of barcodes, if BARCOSEL is used to augment an existing set, (4) the distance type (Hamming or Levenshtein) and the minimum barcode distance required (default is 3, which tolerates one sequencing error), and (5) parameters related to lpsolve: maximum computation time (default is 10 seconds), branch-and-bound search depth (0 means no restrictions), and a basis crash parameter related to initialization. There is no need to change the lpsolve-related parameters unless no acceptable solution is found with the default values
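To make the inputs in Fig. 2 concrete, the following sketch parses FASTA text, implements both distance types, and picks a subset by brute force. This is not BARCOSEL's method: the tool uses integer programming via lpsolve, whereas this illustration enumerates every subset and is only feasible for a handful of candidates. The helper names (`parse_fasta`, `levenshtein`, `imbalance`, `select_subset`) and the imbalance score (per-position gap between the most and least frequent nucleotide) are assumptions made for this example.

```python
from itertools import combinations
from typing import Callable, Dict, Optional, Sequence, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {name: sequence}; sequences are upper-cased."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def hamming(a: str, b: str) -> int:
    """Substitutions only; both strings must have the same length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance allowing substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def imbalance(barcodes: Sequence[str]) -> int:
    """Hypothetical score: 0 means all four nucleotides are equally
    frequent at every position; larger values mean worse balance."""
    score = 0
    for column in zip(*barcodes):
        counts = [column.count(n) for n in "ACGT"]
        score += max(counts) - min(counts)
    return score


def select_subset(candidates: Sequence[str], k: int, min_dist: int = 3,
                  dist: Callable[[str, str], int] = hamming
                  ) -> Optional[Tuple[str, ...]]:
    """Exhaustively find the k-subset with pairwise distance >= min_dist
    and the lowest imbalance score; return None if no subset qualifies."""
    best, best_score = None, None
    for subset in combinations(candidates, k):
        if all(dist(a, b) >= min_dist for a, b in combinations(subset, 2)):
            score = imbalance(subset)
            if best_score is None or score < best_score:
                best, best_score = subset, score
    return best


if __name__ == "__main__":
    fasta = ">bc1\nAAAT\n>bc2\nAAAA\n>bc3\nCCCC\n>bc4\nGGGG\n>bc5\nTTTT\n"
    candidates = list(parse_fasta(fasta).values())
    print(select_subset(candidates, k=4, min_dist=3))
```

The two distance types can disagree: `ACGT` and `CGTA` differ at every position under Hamming distance, but are only two edits apart under Levenshtein distance (one deletion plus one insertion), which matters when insertion and deletion errors are expected.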
Examples of optimal 8 bp barcode sets with 8, 12, 16, and 24 barcodes
| Set A (8) | Set B (12) | Set C (16) | Set D (24) |
|---|---|---|---|
| AACACATC | AGTTGCTG | ACACAGGC | ACACAGGC |
| AGAGTGCG | ATAGAGTC | ACTGTTAG | AGCCTACT |
| CCGTATAT | ATCATTGC | ATAGAGTC | AGTTCCGC |
| CTTGGTTG | CAGTTCCA | ATTAGCTG | ATACGGAT |
| GAGATAAC | CATGGAAT | CAGTTCCA | ATGACGAA |
| GTACAGGA | CGCAAGCT | CCGTATAT | ATGGTCTC |
| TCTCGCCT | GAGATAAC | CGCAAGCT | CAGTTCCA |
| TGCTCCGA | GCAGATAA | CTGGCACA | CATGTTGA |
| | GGCTCTTG | GACACTAA | CCTGAACC |
| | TCGCCAGA | GGAGTAGA | CGAACTTC |
| | TCTCGCCT | GGCTCTTG | CTCCGGTT |
| | TTACCGGG | GGGAGATC | CTGAATCA |
| | | TACTGCAG | GAAAGAAG |
| | | TATCCAGT | GAGATTGT |
| | | TCTCGCCT | GCATCACG |
| | | TTACTGGC | GCCGAATG |
| | | | GCTGAAGA |
| | | | GGCTCTTG |
| | | | TACTGCAG |
| | | | TATCTGTG |
| | | | TCTCGCCT |
| | | | TGAGAGAT |
| | | | TGCTCCGA |
| | | | TTGAGTAC |
Within each set, every pair of barcodes differs by at least three mismatches (Hamming distance), which allows one nucleotide error in sequencing to be corrected. The proportions of all four nucleotides (A, C, G, T) are in balance at each barcode position
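Both properties stated in the table note can be checked directly. The sketch below does so for the eight barcodes of Set A, copied from the table; the function names are illustrative and not taken from BARCOSEL. It also encodes the standard coding-theory relation that a minimum distance of d allows floor((d - 1) / 2) substitution errors to be corrected, which gives one error for d = 3.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence

# Set A from the table above (8 barcodes of length 8).
SET_A = [
    "AACACATC", "AGAGTGCG", "CCGTATAT", "CTTGGTTG",
    "GAGATAAC", "GTACAGGA", "TCTCGCCT", "TGCTCCGA",
]


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: Sequence[str]) -> int:
    """Smallest Hamming distance over all pairs in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def correctable_errors(min_dist: int) -> int:
    """Substitution errors that can be corrected at a given minimum distance."""
    return (min_dist - 1) // 2


def position_counts(barcodes: Sequence[str]) -> List[Counter]:
    """Nucleotide counts for each barcode position (column)."""
    return [Counter(column) for column in zip(*barcodes)]


if __name__ == "__main__":
    d = min_pairwise_distance(SET_A)
    print("minimum pairwise distance:", d)
    print("correctable errors:", correctable_errors(d))
    for pos, counts in enumerate(position_counts(SET_A), start=1):
        print(pos, {n: counts[n] for n in "ACGT"})
```

The same check applies to the other sets by swapping in their barcode lists.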
Fig. 3 Nucleotide balances of optimal sets with varying numbers of barcodes. The horizontal axis is the barcode position; "total" indicates the nucleotide balance over the entire barcode length. In Illumina sequencing, the nucleotide groups A/C and G/T should be in balance at each position for optimal detection. Total balance is important for equal consumption of nucleotides during sequencing. a) 10 barcodes, b) 11 barcodes, c) 12 barcodes, d) 75 barcodes
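The two balance notions in this caption can be computed as follows. This is a sketch only: the A/C versus G/T grouping comes from the caption, while expressing balance as a fraction and the function names are assumptions made for illustration. A per-position A/C fraction of 0.5 means the two groups are evenly represented at that position.

```python
from collections import Counter
from typing import Dict, List, Sequence


def ac_fraction_per_position(barcodes: Sequence[str]) -> List[float]:
    """Fraction of barcodes carrying A or C at each position.

    0.5 means the A/C and G/T groups are perfectly balanced there.
    """
    return [
        sum(base in "AC" for base in column) / len(column)
        for column in zip(*barcodes)
    ]


def total_composition(barcodes: Sequence[str]) -> Dict[str, float]:
    """Share of each nucleotide over all positions of all barcodes."""
    counts = Counter("".join(barcodes))
    total = sum(counts.values())
    return {n: counts[n] / total for n in "ACGT"}


if __name__ == "__main__":
    # Toy set (not from the paper): the two groups are balanced at both
    # positions, although C and T never occur.
    toy = ["AA", "AG", "GA", "GG"]
    print(ac_fraction_per_position(toy))
    print(total_composition(toy))
```

The toy set shows that the two criteria are distinct: group balance can hold at every position while the total nucleotide composition is far from even.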