Abstract
BACKGROUND: Single-cell sequencing experiments use short DNA barcode 'tags' to identify reads that originate from the same cell. In order to recover single-cell information from such experiments, reads must be grouped based on their barcode tag, a crucial processing step that precedes other computations. However, this step can be difficult due to high rates of mismatch and deletion errors that can afflict barcodes.
Keywords: Barcode identification; Barcodes; Circularization; K-mer counting; Single-cell; de Bruijn graph
Year: 2019 PMID: 30654736 PMCID: PMC6337828 DOI: 10.1186/s12859-019-2612-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A strategy to use k-mer counting to identify sequence barcodes. a Circularizing barcodes ensures robustness against single mismatches. An example sequence ‘BARCODE’ contains an error (highlighted in red). When the barcode sequence is short relative to k, all k-mers from this sequence will contain the mutated base. Circularizing the sequence (bottom) ensures that there will be some error-free k-mers from a sequence, independent of the position of the error. b An example circular k-mer graph containing one barcode. Error-containing reads were simulated from a ground-truth barcode. Reads were circularized and k-mers were counted. The resultant k-mer graph is plotted here. Nodes in this graph are represented as gray dots, and edges as blue lines. Edge weights are represented by shading (dark = high edge weight). Despite a fairly high rate of error (Poisson, 3 errors per 12-nucleotide barcode), the true barcode path is visually discernible with a modest number of reads. c An example circular k-mer graph containing three barcodes, plotted as in b.
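The effect described in panel a can be illustrated with a minimal sketch. The barcode sequences, the value of k, and the function names below are hypothetical choices made for illustration; they are not taken from the paper or its software.

```python
def linear_kmers(barcode: str, k: int) -> list[str]:
    """All k-mers of the barcode read as an ordinary linear string."""
    return [barcode[i:i + k] for i in range(len(barcode) - k + 1)]


def circular_kmers(barcode: str, k: int) -> list[str]:
    """All k-mers of the barcode read as a circle (one per start position)."""
    if not 0 < k <= len(barcode):
        raise ValueError("k must be between 1 and the barcode length")
    # Wrap the first k-1 bases onto the end so k-mers can span the junction.
    extended = barcode + barcode[:k - 1]
    return [extended[i:i + k] for i in range(len(barcode))]


# Hypothetical 12-nt barcode and a read of it with one mismatch (C -> G).
truth = "ACGTTGCAAGCT"
read = "ACGTTGGAAGCT"
k = 8

# k-mers that the erroneous read still has in common with the true barcode.
shared_linear = set(linear_kmers(truth, k)) & set(linear_kmers(read, k))
shared_circular = set(circular_kmers(truth, k)) & set(circular_kmers(read, k))

print(len(shared_linear), len(shared_circular))
```

In this example every linear 8-mer overlaps the mismatch, so the read shares no linear k-mers with the true barcode, while the circular k-mers that start after the mismatch and wrap around the junction remain error-free.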
Run time for downsampled Macosko et al. datasets
| Number of reads in dataset | Number of cells detected | Time |
|---|---|---|
| 1,000,000 | 562 | 6 m 39 s |
| 10,000,000 | 575 | 51 m 08 s |
| 100,000,000 | 574 | 360 m 31 s |
Fig. 2 Identifying barcodes and splitting reads from the Macosko et al. species-mixing experiment. a Circular paths were identified in the circular barcode k-mer graph from a published Drop-seq dataset. The distribution of circular path weights versus path rank clearly shows an inflection point. Paths with weight higher than this inflection point are deemed to be true barcodes. b This inflection point can be identified as a local maximum in the first derivative of the path-weight distribution. A Savitzky-Golay filter facilitates this identification by smoothing the data. c Reads were grouped into cells by assigning them to thresholded paths based on k-mer compatibility alone. This assignment results in a flat distribution in the number of pseudoalignments per cell. d Reads that were split based on barcode k-mer compatibility alone also segregate by their number of pseudoalignments to different transcriptomes. This indicates that assigning reads based on k-mer compatibility produces distinct and biologically relevant groupings.
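The thresholding and assignment steps in panels a to c can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: a simple [1, 2, 1] weighted average stands in for the Savitzky-Golay filter, the derivative is taken on log10 path weights, a read goes to the barcode with which it shares the most circular k-mers, and all path weights, barcode sequences, and function names are hypothetical.

```python
import math


def circular_kmers(barcode, k):
    """Set of k-mers of a barcode read as a circle."""
    extended = barcode + barcode[:k - 1]
    return {extended[i:i + k] for i in range(len(barcode))}


def count_true_barcodes(path_weights):
    """Number of paths ranked above the steepest smoothed drop in log10 weight.

    Assumes at least two paths and strictly positive weights.
    """
    log_w = [math.log10(w) for w in sorted(path_weights, reverse=True)]
    # Fall in log weight between consecutive ranks (negated first difference).
    drop = [a - b for a, b in zip(log_w, log_w[1:])]
    # [1, 2, 1] smoothing with repeated end values; this is a stand-in for
    # the Savitzky-Golay filter, not an implementation of it.
    padded = [drop[0]] + drop + [drop[-1]]
    smooth = [(padded[i] + 2 * padded[i + 1] + padded[i + 2]) / 4
              for i in range(len(drop))]
    return smooth.index(max(smooth)) + 1


def assign_read(read_barcode, true_barcodes, k):
    """Return the barcode sharing the most circular k-mers, or None if none do."""
    read_kmers = circular_kmers(read_barcode, k)
    scores = {b: len(read_kmers & circular_kmers(b, k)) for b in true_barcodes}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical circular-path weights: three strong paths and five weak ones.
paths = {
    "ACGTTGCAAGCT": 1000, "TTGACCGTAAGC": 900, "GGATCCTTAGCA": 800,
    "ACGTTGCAAGCA": 12, "TTGACCGTAAGG": 10, "GGATCCTTAGCC": 9,
    "CATGCATGCATG": 8, "TGCATTGCAAGT": 7,
}

ranked = sorted(paths, key=paths.get, reverse=True)
n_true = count_true_barcodes(list(paths.values()))
true_barcodes = ranked[:n_true]

# A read barcode carrying one mismatch relative to the first true barcode.
assigned = assign_read("ACGTTGGAAGCT", true_barcodes, k=8)
print(n_true, assigned)
```

On real data the smoothing window and the choice of scale would need tuning. SciPy's `scipy.signal.savgol_filter` is one readily available Savitzky-Golay implementation that could replace the simple smoothing used here.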