Jamison M McCorrison, Pratap Venepally, Indresh Singh, Derrick E Fouts, Roger S Lasken, Barbara A Methé.
Abstract
BACKGROUND: Deep shotgun sequencing on next-generation sequencing (NGS) platforms has contributed significant amounts of data to enrich our understanding of genomes, transcriptomes, amplified single-cell genomes, and metagenomes. However, deep coverage variations in short-read data sets and the high sequencing error rates of modern sequencers present new computational challenges in data interpretation, including mapping and de novo assembly. New lab techniques such as multiple displacement amplification (MDA) of single cells and sequence-independent single primer amplification (SISPA) allow for sequencing of organisms that cannot be cultured, but generate highly variable coverage due to amplification biases.
Year: 2014 PMID: 25407910 PMCID: PMC4245761 DOI: 10.1186/s12859-014-0357-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic illustration of sequence selection by the NeatFreq pipeline. A) Blue blocks represent fragment-only reads. The left side of the figure shows two retention bins (20% and 85% retention), created from the ratio between the RMKF of the read and the cutoff input by the user. The random bin selection method extracts a random subset of reads from each bin, up to the count set by the bin's retention level and the number of reads available. B) The targeted bin selection method, run on fragment-only sequences, is illustrated on a 20% retention bin (left block). Within each retention bin, reads were clustered by similarity using the cd-hit-est alignment program [22] and sorted by uniqueness, that is, by the population of the sub-bin cluster (middle block). As illustrated here, reads were selected randomly from within each intra-bin sub-cluster. C) Green blocks represent forward mates and red blocks reverse mates, with dark green brackets indicating 2-sided mate relationships. When the targeted bin selection method is applied to the same bin containing mate pairs (paired ends), the analysis is identical to that described for fragments, except that the retrieval of 2-sided mates across all bins and their sub-bin clusters is prioritized. Note that highly unique clusters containing only fragments are still given priority in selection.
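The random bin selection described in panel A can be sketched in a few lines of Python. This is a minimal illustration, not the NeatFreq implementation: the function names are hypothetical, the mapping from RMKF to a retention level (cutoff divided by RMKF, capped at 100% and rounded to a whole percent) is an assumption based on the caption's "ratio between the RMKF of the read and the cutoff", and rounding each bin's quota up is likewise an assumption.

```python
import math
import random
from collections import defaultdict


def retention_percent(rmkf, cutoff):
    """Assumed mapping: retention = cutoff / RMKF, capped at 100%,
    rounded to a whole percent (never below 1%)."""
    if rmkf <= 0:
        raise ValueError("rmkf must be positive")
    return min(100, max(1, round(100 * cutoff / rmkf)))


def random_bin_selection(read_rmkf, cutoff, seed=0):
    """Group reads into retention bins, then draw a random subset from
    each bin sized by that bin's retention level.

    read_rmkf: dict mapping read id -> RMKF value (hypothetical input format).
    """
    rng = random.Random(seed)

    # Place every read in the bin for its retention level.
    bins = defaultdict(list)
    for read_id, rmkf in read_rmkf.items():
        bins[retention_percent(rmkf, cutoff)].append(read_id)

    selected = []
    for pct, reads in sorted(bins.items()):
        # Quota is limited by both the retention level and the reads available.
        quota = min(len(reads), math.ceil(len(reads) * pct / 100))
        selected.extend(rng.sample(reads, quota))
    return selected


if __name__ == "__main__":
    # Ten deeply covered reads (RMKF 200) and twenty moderate ones (RMKF 47),
    # which land in 20% and 85% retention bins at cutoff 40.
    reads = {f"hi{i}": 200 for i in range(10)}
    reads.update({f"lo{i}": 47 for i in range(20)})
    kept = random_bin_selection(reads, cutoff=40)
    print(len(kept), "reads kept out of", len(reads))
```

With these toy inputs the two bins match the 20% and 85% levels shown in the figure. The targeted method would replace the `rng.sample` step with cluster-aware selection inside each bin.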
Figure 2. Comparison of random and targeted NeatFreq selection methods on sequence coverage. A) Elevated kmer counts within repetitive regions cause over-reduction when an RMKF cutoff is used. Genomic regions labeled with stars were identified as repeats by RepeatFinder [30]. Reads from repetitive regions are placed in low-selectivity bins because of the high frequency of similar mers within the data set. Over-reduction therefore occurs at multiples directly related to the count of repetitive regions. B) This histogram shows the retrieval of sequences at different RMKF cutoff levels with each of the bin selection methods. The aligned sequence coverage distribution is shown for the first 40,000 bp of the S. aureus genome using query sequences selected by the random (top) and targeted (bottom) methods. The targeted method is more effective at recruiting low-coverage regions that result from single-cell amplification bias in variable-coverage regions, including 0-fold regions. The X-axis shows the genomic coordinate on the reference used for mapping the extracted reads, and the Y-axis shows the level of coverage at each genomic position. C) The histogram gives a zoomed view of the low-coverage area highlighted by an arrow in Figure 2A (region 278 kbp – 292 kbp). The alignment histograms show that the targeted algorithm, in contrast to random selection, retains the low-coverage areas in the variable dataset, resulting in an increased sequencing span. D) Coverage histogram of reads aligned to the largest H1N1 Influenza genomic reference segment (log scale). Random selection from the entire dataset (without retention bins) was performed to a read count equal to that used by targeted selection at RMKF cutoff = 40. This random selection from all reads is the most subject to input coverage variability and fails to reduce deep spikes enough to generate coverage levels compatible with OLC assembly.
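The over-reduction effect in panel A can be reproduced on toy data. The sketch below assumes that RMKF denotes the median kmer frequency of a read, which this record does not spell out; the function names and the toy reads are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping kmers of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_median_kmer_frequency(reads, k):
    """For each read, the median count (over the whole read set) of its kmers.
    Assumed here to be what RMKF measures."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [median(counts[km] for km in kmers(read, k)) for read in reads]


if __name__ == "__main__":
    unique_read = "ACGTTGCA"   # from a single-copy region, sampled 3 times
    repeat_read = "GGATCCAT"   # from a region present twice in the genome,
                               # so it is sampled 2 x 3 = 6 times
    reads = [unique_read] * 3 + [repeat_read] * 6
    rmkf = read_median_kmer_frequency(reads, k=4)
    print("unique region RMKF:", rmkf[0])
    print("repeat region RMKF:", rmkf[-1])
```

Both regions have the same per-copy coverage, yet the repeat reads carry twice the kmer frequency. Under an assumed retention of cutoff divided by RMKF with the cutoff set to the single-copy level, the repeat reads would be retained at half the rate and then spread over two genomic copies, leaving each copy at roughly half the target coverage. This is the multiple-of-copy-count reduction the caption describes.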
Figure 3. Comparison of MDA assemblies to the expected (ideal) genomic span. Newbler assembly with all reads (ALL) fails on all samples during consensus resolution (not shown). SPAdes is able to complete an assembly using all reads with few reference bases missing, but its output also contains excessive redundant and spurious contigs. Pre-processing reduces the number of sequences, allowing OLC (Newbler) assembly to complete, and yields genomic spans from both Newbler and SPAdes that are most similar to the ideal (reference) assembly. The Newbler assembly from coverage-reduced data is also closer to the reference span and verifies (is consistent with) the assembly produced by SPAdes following preprocessing alone.