David C Danko, Dmitry Meleshko, Daniela Bezdan, Christopher Mason, Iman Hajirasouliha.
Abstract
Emerging Linked-Read technologies (aka read cloud or barcoded short-reads) have revived interest in short-read technology as a viable approach to understand large-scale structures in genomes and metagenomes. Linked-Read technologies, such as the 10x Chromium system, use a microfluidic system and a specialized set of 3′ barcodes (aka UIDs) to tag short DNA reads sourced from the same long fragment of DNA; subsequently, the tagged reads are sequenced on standard short-read platforms. This approach results in interesting compromises. Each long fragment of DNA is only sparsely covered by reads, no information about the ordering of reads from the same fragment is preserved, and 3′ barcodes match reads from roughly 2-20 long fragments of DNA. However, compared to long-read technologies, the cost per base to sequence is far lower, far less input DNA is required, and the per base error rate is that of Illumina short-reads. In this paper, we formally describe a particular algorithmic issue common to Linked-Read technology: the deconvolution of reads with a single 3′ barcode into clusters that represent single long fragments of DNA. We introduce Minerva, a graph-based algorithm that approximately solves the barcode deconvolution problem for metagenomic data (where reference genomes may be incomplete or unavailable). Additionally, we develop two demonstrations where the deconvolution of barcoded reads improves downstream results, improving the specificity of taxonomic assignments and of k-mer-based clustering. To the best of our knowledge, we are the first to address the problem of barcode deconvolution in metagenomics.
Year: 2018 PMID: 30523036 PMCID: PMC6314158 DOI: 10.1101/gr.235499.118
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Taxa detail: Relative abundance is based on read counts and is not adjusted for genome size
Data set properties
Runtime performance
Figure 1. Clockwise from top left: (1) purity in Data set 1 for enhanced and 3′ barcodes; (2) Shannon index in Data set 1 for enhanced and 3′ barcodes; (3) Shannon index in Data set 2 for enhanced and 3′ barcodes; (4) purity in Data set 2 for enhanced and 3′ barcodes.
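The two cluster-quality measures named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the textbook definitions: purity as the fraction of reads in a cluster that carry the most common fragment label, and the Shannon index as H = -Σ p_i ln p_i over the label proportions. The function names and the toy labels are hypothetical, and the exact definitions used for the figure may differ.

```python
import math
from collections import Counter


def purity(labels):
    """Fraction of reads in one cluster that carry the most common label."""
    if not labels:
        raise ValueError("cluster must contain at least one read")
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def shannon_index(labels):
    """Shannon index H = -sum(p * ln p) over label proportions in one cluster."""
    if not labels:
        raise ValueError("cluster must contain at least one read")
    total = len(labels)
    proportions = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log(p) for p in proportions)


# Hypothetical cluster: three reads from fragment "a", one from fragment "b".
cluster = ["a", "a", "a", "b"]
print(purity(cluster), shannon_index(cluster))
```

Under these definitions a perfectly deconvolved cluster has purity 1 and Shannon index 0, so higher purity and a lower Shannon index both indicate cleaner clusters.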
Figure 2. Abundance of different chromosomes across clusters as assigned by Latent Dirichlet allocation (LDA). Enhanced read clouds dramatically improve LDA's ability to distinguish structure in Data set 1. This figure uses the same deconvolution as Figure 1.
Taxonomic promotion
Figure 3. Processing steps for a single read cloud. From top: (1) Fragments are sequenced and tagged with 3′ barcodes. (2) Reads in a given read cloud are mapped to reads in other read clouds using minimizing k-mers. (3) A bipartite graph between reads and other read clouds is constructed. (4) A graph between reads that map to the same read clouds is constructed. (5) Reads are clustered into groups.
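Steps (2) through (5) of the Figure 3 caption can be sketched roughly as follows. This is an illustrative toy, not the Minerva implementation: the function names, the parameters `k`, `w`, and `min_shared`, the use of lexicographic minimizers, and the use of connected components as the clustering step are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def minimizers(seq, k=5, w=4):
    """Set of lexicographically smallest k-mers over windows of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) <= w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def deconvolve(clouds, target, k=5, w=4, min_shared=1):
    """Cluster the reads of one read cloud; returns lists of read indices.

    clouds maps a barcode to its list of read sequences (hypothetical layout).
    """
    # Step 2: index minimizing k-mers found in every *other* read cloud.
    index = defaultdict(set)
    for barcode, reads in clouds.items():
        if barcode == target:
            continue
        for read in reads:
            for m in minimizers(read, k, w):
                index[m].add(barcode)

    # Step 3: bipartite graph, stored as read index -> set of other read clouds.
    reads = clouds[target]
    read_to_clouds = []
    for read in reads:
        hits = set()
        for m in minimizers(read, k, w):
            hits |= index.get(m, set())
        read_to_clouds.append(hits)

    # Step 4: connect two reads if they map to at least min_shared common clouds.
    adjacency = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if len(read_to_clouds[i] & read_to_clouds[j]) >= min_shared:
            adjacency[i].add(j)
            adjacency[j].add(i)

    # Step 5: cluster reads; connected components stand in for the clustering step.
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Toy input: barcode "t" mixes reads from two fragments; "x" and "y" each saw one.
toy_clouds = {
    "t": ["ACACCACCAAC", "CACCACCAACA", "GTTGTGGTTGT"],
    "x": ["ACACCACCAAC", "CACCACCAACA"],
    "y": ["GTTGTGGTTGT"],
}
print(deconvolve(toy_clouds, "t"))
```

In the toy input the first two reads of barcode "t" both map to read cloud "x" and the third maps only to "y", so the reads separate into two groups. Raising `min_shared` makes the read graph sparser and the clusters more conservative.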
Figure 4. Top: Hamming distance between windows that share minimizing k-mers, using various parameters. Bottom: number of representative minimizing k-mers per read.
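The quantity in the top panel of Figure 4 can be illustrated with a small sketch that pairs windows from two sequences whenever they share a minimizing k-mer and reports the Hamming distance of each pair. The function names, the choice of lexicographic order, and the definition of a window as `w` consecutive bases are assumptions made for this example, as are the default values of `k` and `w`.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def windows_by_minimizer(seq, k=3, w=6):
    """Map each minimizing k-mer to the set of w-base windows it represents."""
    groups = defaultdict(set)
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        kmers = [window[j:j + k] for j in range(w - k + 1)]
        groups[min(kmers)].add(window)
    return groups


def shared_window_distances(seq_a, seq_b, k=3, w=6):
    """Sorted Hamming distances between windows that share a minimizing k-mer."""
    a = windows_by_minimizer(seq_a, k, w)
    b = windows_by_minimizer(seq_b, k, w)
    return sorted(
        hamming(x, y)
        for m in a.keys() & b.keys()
        for x in a[m]
        for y in b[m]
    )


print(shared_window_distances("AACGTT", "AACGTA"))
```

Small distances between windows that share a minimizer suggest that the minimizer is a reasonable proxy for overlap between reads.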