Nik Tavakolian1, João Guilherme Frazão2, Devin Bendixsen2, Rike Stelkens2, Chun-Biu Li1.
Abstract
MOTIVATION: DNA barcodes are short, random nucleotide sequences introduced into cell populations to track the relative counts of hundreds of thousands of individual lineages over time. Lineage tracking is widely applied, e.g. to understand evolutionary dynamics in microbial populations and the progression of breast cancer in humans. Barcode sequences are unknown upon insertion and must be identified using next-generation sequencing technology, which is error prone. In this study, we frame the barcode error correction task as a clustering problem with the aim of identifying true barcode sequences from noisy sequencing data. We present Shepherd, a novel clustering method that is based on an indexing system of barcode sequences using k-mers, and a Bayesian statistical test incorporating a substitution error rate to distinguish true from error sequences.
Year: 2022 PMID: 35708611 PMCID: PMC9344852 DOI: 10.1093/bioinformatics/btac395
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. (a) Illustration of the pigeonhole principle for l = 8 and k = 2 (i.e. ). Sequences 2 and 3 are within Hamming distance 2 of the true barcode, so by the pigeonhole principle these error sequences share two or more 2-mers with it. Since Sequence 4 has Hamming distance 3 to the true barcode, the pigeonhole principle only guarantees that it shares one 2-mer with it. Nevertheless, Sequence 4 still shares two 2-mers with the true barcode because two of its errors fall in the same 2-mer.
(b) A given sequence (orange square) surrounded by its neighbors (dots) in sequence space. The orange dots are the k-mer neighbors of the given sequence, i.e. all sequences that share at least -mers with it. The blue dots are sequences not included in the k-mer neighborhood. The dashed circle is the ϵ-neighborhood of the sequence and the solid circle is the boundary for the k-mer neighbors, i.e. no k-mer neighbor appears outside the solid circle. Note that l is a multiple of k in this case and that all ϵ-neighbors of the sequence are also k-mer neighbors; this is guaranteed by the pigeonhole principle.
(c) Illustration of how a pair of 2-mers is converted into a combination ID. First, a pair of 2-mers is selected. Each 2-mer has a location in the sequence, specified by the orange numbers. The 2-mer pair is then converted to an ID by mapping each of its nucleotides to a number according to the conversion table on the right.
(d) The k-mer Index for the set of sequences from panel a, including only the 2-mer pairs shared by at least two sequences in the dataset. The blue numbers correspond to the sequence numbers specified in panel a. The k-mer Index itself contains only the combination IDs and their corresponding sets; the 2-mer pairs (leftmost column) are included here for illustrative purposes only.
(e) A schematic showing the process of finding the k-mer neighborhood of Sequence 1 from panel a using the k-mer Index from panel d. First, all combination IDs of Sequence 1 are found and the corresponding sets are obtained from the k-mer Index. The union of these sets yields the set of all sequences that share at least one combination ID with Sequence 1. Excluding Sequence 1 from this set gives its k-mer neighborhood. (A color version of this figure appears in the online version of this article.)
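The indexing idea described in panels a, d and e can be sketched in a few lines of Python. This is a minimal illustration and not Shepherd's implementation: the five toy sequences are invented (the sequences drawn in the figure are not reproduced here), the function names are hypothetical, and a tuple of position-tagged k-mers stands in for the numeric combination ID, since the conversion table of panel c is not available in this text.

```python
from collections import defaultdict
from itertools import combinations


def split_kmers(seq, k):
    """Split seq into consecutive, non-overlapping k-mers (assumes len(seq) % k == 0)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def shared_kmers(a, b, k):
    """Number of k-mers that match at the same location in both sequences."""
    return sum(x == y for x, y in zip(split_kmers(a, k), split_kmers(b, k)))


def combination_ids(seq, k):
    """All pairs of (location, k-mer); an assumed stand-in for the numeric combination ID."""
    return list(combinations(enumerate(split_kmers(seq, k)), 2))


def build_index(seqs, k):
    """Map each combination ID to the set of (1-based) sequence numbers containing it."""
    index = defaultdict(set)
    for number, seq in enumerate(seqs, start=1):
        for cid in combination_ids(seq, k):
            index[cid].add(number)
    # As in panel d, keep only IDs shared by at least two sequences.
    return {cid: members for cid, members in index.items() if len(members) >= 2}


def kmer_neighborhood(number, seqs, index, k):
    """Union of the index sets for every combination ID of a sequence, minus itself."""
    found = set()
    for cid in combination_ids(seqs[number - 1], k):
        found |= index.get(cid, set())
    found.discard(number)
    return found


# Invented toy data with l = 8 and k = 2; sequence 1 plays the role of the true barcode.
seqs = ["ACGTACGT", "ACGTACGA", "TCGTACGA", "TTGTACGA", "GGGGGGGG"]
k = 2

# Pigeonhole bound: d substitutions can destroy at most d of the l/k k-mers.
for s in seqs[1:]:
    assert shared_kmers(seqs[0], s, k) >= len(s) // k - hamming(seqs[0], s)

index = build_index(seqs, k)
neighbors = kmer_neighborhood(1, seqs, index, k)
print(sorted(neighbors))  # [2, 3, 4]
```

In this toy set, sequence 4 has Hamming distance 3 to sequence 1 but still shares two 2-mers with it because two of its substitutions fall in the same 2-mer, mirroring the situation described for Sequence 4 in panel a.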
Summary of synthetic datasets
| Dataset | A | B | C |
|---|---|---|---|
| Unique sequence count | | | |
| True barcode count | | | |
| Error rate | 0.33% | 0.66% | 2% |
Note: All datasets have barcodes with a total barcode length of 26, with 20 random nucleotides and 6 constant nucleotides.
The false positive count (FPC) and false negative count (FNC) for each method on each synthetic dataset
| Method | A (FPC) | A (FNC) | B (FPC) | B (FNC) | C (FPC) | C (FNC) |
|---|---|---|---|---|---|---|
| Shepherd | 45 | 66 | 482 | 82 | 461 | 50 |
| Bartender | 1979 | 47 | 14 100 | 62 | 59 554 | 6 |
| Starcode | 7045 | 91 | 26 289 | 99 | 78 956 | 4 |
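For readers unfamiliar with the two measures, the sketch below shows one plausible way to compute them when the true barcodes are known. It assumes exact string matching, counting a reported barcode absent from the true set as a false positive and a true barcode that is never reported as a false negative. The precise matching rule used in the study is not stated in this record, and the function name and toy barcodes are hypothetical.

```python
def fpc_fnc(true_barcodes, called_barcodes):
    """Return (false positive count, false negative count) under exact matching."""
    true_set = set(true_barcodes)
    called_set = set(called_barcodes)
    false_positives = called_set - true_set   # reported but not a true barcode
    false_negatives = true_set - called_set   # true barcode that was never reported
    return len(false_positives), len(false_negatives)


# Invented toy data, not taken from the synthetic datasets.
true_barcodes = ["ACGT", "TTGA", "CCAG"]
called_barcodes = ["ACGT", "TTGA", "ACGA"]

fpc, fnc = fpc_fnc(true_barcodes, called_barcodes)
print(fpc, fnc)  # 1 1
```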
Fig. 2. (a) The number of clusters with low read counts (<6) for each method compared to the ground truth on Dataset B. (b) Estimated barcode counts compared to the true counts for each method on Dataset B. Only true barcodes identified by all three methods are displayed. True barcodes for which all three methods estimated the same count are excluded to emphasize differences in the estimated counts. (c) The mean absolute difference between the true barcode counts and the estimated counts of Shepherd and Bartender at each time point. For each time point, only true barcodes identified by both methods are included in the comparison.
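The comparison in panel c can be illustrated as follows. This is a sketch for a single time point, assuming that counts are stored as dictionaries keyed by barcode sequence. The function name and the toy counts are hypothetical and do not come from the study.

```python
def mean_abs_count_error(true_counts, est_a, est_b):
    """Mean absolute count difference for two methods, over barcodes found by both."""
    # Restrict the comparison to true barcodes identified by both methods.
    shared = [bc for bc in true_counts if bc in est_a and bc in est_b]
    if not shared:
        return None
    mad_a = sum(abs(est_a[bc] - true_counts[bc]) for bc in shared) / len(shared)
    mad_b = sum(abs(est_b[bc] - true_counts[bc]) for bc in shared) / len(shared)
    return mad_a, mad_b


# Invented toy counts for one time point.
true_counts = {"AAAA": 10, "CCCC": 5, "GGGG": 8}
method_a = {"AAAA": 9, "CCCC": 6, "GGGG": 8}
method_b = {"AAAA": 12, "CCCC": 3}  # "GGGG" was not identified by this method

result = mean_abs_count_error(true_counts, method_a, method_b)
print(result)  # (1.0, 2.0)
```

Restricting both averages to the shared barcodes keeps the two methods comparable: neither is penalized or rewarded for barcodes the other failed to identify.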
Fig. 3. (a) Distribution of the effective cluster radius r for each method for all clusters containing at least two sequences. There are and such clusters for Shepherd and Bartender, respectively. (b) A 2D histogram of the cluster read counts estimated by Shepherd and Bartender including the barcodes identified by both methods. The colorbar indicates the number of barcodes in each bin.