Daniel Mapleson, Gonzalo Garcia Accinelli, George Kettleborough, Jonathan Wright, Bernardo J Clavijo.
Abstract
Motivation: De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies.
Year: 2017 PMID: 27797770 PMCID: PMC5408915 DOI: 10.1093/bioinformatics/btw663
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (a) and (b), generated using KAT comp, show stacked histograms of read k-mer frequency versus assembly copy number for two different assemblies of a heterozygous Fraxinus excelsior genome (http://ftp-oadb.tsl.ac.uk/fraxinus_excelsior). Read content in black is absent from the assembly, red occurs once, purple twice, etc. Both k-mer spectra show an error distribution under 25×, heterozygous content around 50× and homozygous content around 100×. (a) contains most (but not all) of the heterozygous content, and introduces more duplications of homozygous content. (b) is more collapsed, including mostly a single copy of the homozygous content and less of the heterozygous content. (c) and (d), generated using KAT sect, show k-mer coverage across example assembled loci. The assembly k-mer coverage (black line) of assembly (a) in plot (c) shows that the assembly has two copies of this locus, whereas the read k-mer coverage (red line) implies there should be only a single copy. This incorrect duplication has been corrected in assembly (b), where the read and assembly k-mer coverage agree in plot (d). The increased read and assembly k-mer coverage at positions 100 and 400 indicates small regions of repetitive sequence in the genome. The halved read k-mer coverage after position 400 indicates a heterozygous locus, which likely caused the duplication of this locus in assembly (a). See Supplementary Section 5 for a more extensive analysis of all sequences from this locus and their impact on (a) and (b).
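The two kinds of plot described in the caption can be illustrated with a toy Python sketch. This is not KAT's implementation: the function names (`count_kmers`, `spectra_cn`, `kmer_coverage_along`) are hypothetical, sequences are held in memory as plain strings, and reverse complements are ignored for simplicity. `spectra_cn` tabulates, for each read k-mer frequency, how many distinct k-mers appear 0, 1, 2, ... times in the assembly (the data behind a stacked histogram like panels a and b). `kmer_coverage_along` reports the count of each successive k-mer along a sequence (the kind of per-position profile shown in panels c and d).

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer across a list of sequences (forward strand only)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def spectra_cn(reads, assembly, k):
    """Map read k-mer frequency -> Counter(assembly copy number -> distinct k-mers).

    Copy number 0 corresponds to read content absent from the assembly.
    """
    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    matrix = {}
    for kmer, freq in read_counts.items():
        copy_number = asm_counts.get(kmer, 0)
        matrix.setdefault(freq, Counter())[copy_number] += 1
    return matrix


def kmer_coverage_along(seq, counts, k):
    """Return the count (from `counts`) of each k-mer along `seq`, by position."""
    seq = seq.upper()
    return [counts.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Tiny made-up inputs, purely for illustration.
    reads = ["ACGTA", "ACGTA", "CGTAC"]
    assembly = ["ACGTA"]
    k = 3

    for freq, by_cn in sorted(spectra_cn(reads, assembly, k).items()):
        print(freq, dict(by_cn))

    read_counts = count_kmers(reads, k)
    asm_counts = count_kmers(assembly, k)
    print(kmer_coverage_along(assembly[0], read_counts, k))
    print(kmer_coverage_along(assembly[0], asm_counts, k))
```

In such a profile, a locus where the assembly coverage is double what the read coverage suggests would point to the kind of incorrect duplication the caption describes for assembly (a).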