KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies.

Daniel Mapleson, Gonzalo Garcia Accinelli, George Kettleborough, Jonathan Wright, Bernardo J Clavijo.

Abstract

Motivation: De novo assembly of whole genome shotgun (WGS) next-generation sequencing (NGS) data benefits from high-quality input with high coverage. However, in practice, determining the quality and quantity of useful reads quickly and in a reference-free manner is not trivial. Gaining a better understanding of the WGS data, and how that data is utilized by assemblers, provides useful insights that can inform the assembly process and result in better assemblies.
Results: We present the K-mer Analysis Toolkit (KAT): a multi-purpose software toolkit for reference-free quality control (QC) of WGS reads and de novo genome assemblies, primarily via their k-mer frequencies and GC composition. KAT enables users to assess levels of errors, bias and contamination at various stages of the assembly process. In this paper we highlight KAT's ability to provide valuable insights into assembly composition and quality of genome assemblies through pairwise comparison of k-mers present in both input reads and the assemblies.
Availability and Implementation: KAT is available under the GPLv3 license at https://github.com/TGAC/KAT.
Contact: bernardo.clavijo@earlham.ac.uk.
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2017        PMID: 27797770      PMCID: PMC5408915          DOI: 10.1093/bioinformatics/btw663

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Rapid analysis of high-throughput whole genome shotgun (WGS) datasets is challenging due to their large size (Metzker, 2010), with genome size and complexity creating additional challenges (Schatz et al., 2012). Reference-free approaches for analyzing WGS data typically involve examining base calling quality, read length, GC content (Yang et al., 2013) and exploring k-mer (words of size k) spectra (Chor et al., 2009; Lo and Chain, 2014). A frequently used reference-free quality control tool is FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). K-mer spectra reveal information not only about data quality (level of errors, sequencing biases, completeness of sequencing coverage and potential contamination) but also about genomic complexity (size, karyotype, levels of heterozygosity and repeat content; Simpson, 2014). Additional information can be extracted through pairwise comparisons of WGS datasets (Anvar et al., 2014), which can identify problematic samples by highlighting differences between spectra. KAT, the K-mer Analysis Toolkit, is a suite of tools for rapidly counting, comparing and analysing spectra for k-mers of arbitrary length directly from sequence data (see Supplementary section 2 for a discussion on choice of k and Supplementary section 3 for a comparison of k-mer tools).
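To make the notion of a k-mer spectrum concrete, the following minimal Python sketch counts every word of size k in a small set of sequences and then tabulates how many distinct k-mers occur at each frequency. It is an illustration only and not KAT's implementation: the function names and toy reads are hypothetical, and the sketch does not merge a k-mer with its reverse complement, which production k-mer counters commonly do.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer (word of size k) across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip words containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map each frequency to the number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Toy reads, invented purely for illustration.
reads = ["ACGTACGT", "CGTACGTA", "ACGTTTTT"]
counts = count_kmers(reads, k=4)
spectrum = kmer_spectrum(counts)

for freq in sorted(spectrum):
    print(freq, spectrum[freq])
```

On real WGS data the same table, plotted as a histogram, is what typically shows a low-frequency error peak followed by peaks for heterozygous and homozygous content.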

2 The K-mer analysis toolkit

KAT is a C++11 application containing multiple tools, each of which exploits multi-core machines via multi-threading where possible. Core functionality is contained in a library designed to promote rapid development of new tools. Runtime and memory requirements depend on input data size, error and bias levels, and properties of the biological sample, but as a rule of thumb, machines capable of de novo assembly of a dataset will be sufficient to run KAT on that dataset (see Supplementary section 4 for details). K-mer counting in KAT is performed by an integrated and modified version of Jellyfish2 (Marçais and Kingsford, 2011), which supports large k values and is among the fastest k-mer counters available (Zhang et al., 2014).

2.1 Assembly validation by comparison of read spectrum and assembly copy number

The KAT comp tool generates a matrix with one sequence set’s k-mer frequency on one axis and another set’s frequency on the other, where each cell holds the number of distinct k-mers found at that pair of frequencies. When reads are compared against an assembly, this matrix highlights properties of assembly composition and quality. When the matrix is represented as a stacked histogram, the read k-mer spectrum is split by copy number in the assembly (see Supplementary section 5 for a primer on how to interpret KAT’s stacked histograms). In addition, KAT provides the sect tool to study specific assembled sequences and track k-mer coverage across both the read and the assembly spectra. This can help identify assembly artefacts such as collapsing or expanding events, or detect repeat regions. Figure 1 shows plots relating to two Fraxinus excelsior assemblies created from the same dataset, generated using the comp and sect tools. The plots highlight different strategies taken by the assembler: in (a) and (c) some homozygous content is duplicated, and in (b) and (d) some heterozygous content is eliminated.
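The comparison matrix described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the comp tool itself: the function names and toy sequences are hypothetical, strands are not merged, and the matrix is stored sparsely as a dictionary keyed by (read frequency, assembly copy number).

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer across a list of sequences (no strand merging)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def comp_matrix(counts_a, counts_b):
    """Return {(freq_in_a, freq_in_b): number of distinct k-mers}."""
    matrix = Counter()
    for kmer in counts_a.keys() | counts_b.keys():
        cell = (counts_a.get(kmer, 0), counts_b.get(kmer, 0))
        matrix[cell] += 1
    return matrix


# Toy data, invented for illustration only.
K = 3
reads = ["ACGTA", "ACGTA", "CGTAG"]
assembly = ["ACGTAACG"]

matrix = comp_matrix(count_kmers(reads, K), count_kmers(assembly, K))

# Each row is one cell: read frequency, assembly copy number, distinct k-mers.
for (read_freq, copies), n_distinct in sorted(matrix.items()):
    print(read_freq, copies, n_distinct)
```

Cells with an assembly copy number of 0 correspond to read content absent from the assembly, and cells with a read frequency of 0 correspond to assembly content with no read support. Grouping the cells by read frequency and stacking them by copy number gives the kind of stacked histogram discussed in the text.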
Fig. 1.

(a) and (b), generated using KAT comp, show read k-mer frequency versus assembly copy number stacked histograms for two different assemblies of a heterozygous Fraxinus excelsior genome (http://ftp-oadb.tsl.ac.uk/fraxinus_excelsior). Read content in black is absent from the assembly, red occurs once, purple twice, etc. Both k-mer spectra show an error distribution under 25×, heterozygous content around 50× and homozygous content around 100×. (a) contains most (but not all) of the heterozygous content, and introduces more duplications of homozygous content. (b) is more collapsed, including mostly a single copy of the homozygous content and less of the heterozygous content. (c) and (d), generated using KAT sect, show k-mer coverage across example assembled loci. The assembly k-mer coverage (black line) of assembly (a) in plot (c) shows that the assembly has two copies of this locus, whereas the read k-mer coverage (red line) implies there should be only a single copy. This incorrect duplication has been corrected in assembly (b), with the read and assembly k-mer coverage agreeing in plot (d). The increased read and assembly k-mer coverage at positions 100 and 400 indicates small regions of repetitive sequence in the genome. The halved read k-mer coverage after position 400 indicates a heterozygous locus, which likely caused the duplication of this locus in assembly (a). See Supplementary Section 5 for a more extensive analysis of all sequences from this locus and their impact on (a) and (b).

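The per-position coverage tracks shown in panels (c) and (d) can be approximated with the following Python sketch. It is an illustration of the idea rather than the sect tool: all names and sequences are hypothetical, and the k-mer starting at each position of a locus is simply looked up in the read counts and in the assembly counts.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer across a list of sequences (no strand merging)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_coverage(sequence, counts, k):
    """For each position, return the count of the k-mer starting there."""
    sequence = sequence.upper()
    return [counts.get(sequence[i:i + k], 0)
            for i in range(len(sequence) - k + 1)]


# Toy data: the assembly contains the same locus twice.
K = 3
reads = ["ACGTA", "ACGTA", "CGTAG"]
assembly = ["ACGTAG", "ACGTAG"]
locus = assembly[0]

read_cov = kmer_coverage(locus, count_kmers(reads, K), K)
asm_cov = kmer_coverage(locus, count_kmers(assembly, K), K)

# Columns: position, read k-mer coverage, assembly k-mer coverage.
for pos, (r, a) in enumerate(zip(read_cov, asm_cov)):
    print(pos, r, a)
```

In this toy case the assembly coverage is 2 at every position because the locus appears twice in the assembly. On real data, an assembly track at twice the expected level next to a read track at single-copy depth is the pattern the caption describes for an incorrect duplication.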

2.2 Other KAT tools

KAT also includes the hist tool for computing the k-mer spectrum of a single sequence set and the gcp tool for analysing GC content against k-mer frequency. The filter tool can be used to isolate sequences from a set according to their k-mer coverage or GC content in a given spectrum (see Supplementary section 1 for details on all the tools). These tools can be used for various tasks, including contaminant detection and extraction in both raw reads and assemblies, analysis of GC bias, and checking consistency between paired-end reads and other types of libraries.
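As a rough illustration of analysing GC content against k-mer frequency, the sketch below bins each distinct k-mer by the number of G or C bases it contains and by its frequency. It is a simplified assumption of how such a matrix could be built, not the gcp tool's actual code; the names and toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer across a list of sequences (no strand merging)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def gc_count(kmer):
    """Number of G or C bases in a k-mer."""
    return sum(base in "GC" for base in kmer)


def gc_vs_frequency(counts):
    """Return {(gc_count, frequency): number of distinct k-mers}."""
    matrix = Counter()
    for kmer, freq in counts.items():
        matrix[(gc_count(kmer), freq)] += 1
    return matrix


# Toy reads, invented for illustration only.
reads = ["ACGTACGT", "CGTACGTA", "ACGTTTTT"]
gc_matrix = gc_vs_frequency(count_kmers(reads, k=4))

# Columns: GC count, k-mer frequency, number of distinct k-mers.
for (gc, freq), n_distinct in sorted(gc_matrix.items()):
    print(gc, freq, n_distinct)
```

In a matrix of this kind, a separate cluster of k-mers with an unusual GC count and frequency relative to the main cluster is one signal that may point to contamination.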

3 Summary

KAT is a user-friendly, scalable toolkit for rapidly counting, comparing and analyzing k-mers from various data sources. The tools in KAT assist the user with a wide range of tasks, including error profiling, assessing sequencing bias, identifying contaminants, and de novo genome assembly QC and validation.
References

1.  A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Authors:  Guillaume Marçais; Carl Kingsford
Journal:  Bioinformatics       Date:  2011-01-07       Impact factor: 6.937

2.  Sequencing technologies - the next generation.

Authors:  Michael L Metzker
Journal:  Nat Rev Genet       Date:  2009-12-08       Impact factor: 53.242

3.  Current challenges in de novo plant genome sequencing and assembly.

Authors:  Michael C Schatz; Jan Witkowski; W Richard McCombie
Journal:  Genome Biol       Date:  2012       Impact factor: 13.583

4.  HTQC: a fast quality control toolkit for Illumina sequencing data.

Authors:  Xi Yang; Di Liu; Fei Liu; Jun Wu; Jing Zou; Xue Xiao; Fangqing Zhao; Baoli Zhu
Journal:  BMC Bioinformatics       Date:  2013-01-31       Impact factor: 3.169

5.  Exploring genome characteristics and sequence quality without a reference.

Authors:  Jared T Simpson
Journal:  Bioinformatics       Date:  2014-01-17       Impact factor: 6.937

6.  These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.

Authors:  Qingpeng Zhang; Jason Pell; Rosangela Canino-Koning; Adina Chuang Howe; C Titus Brown
Journal:  PLoS One       Date:  2014-07-25       Impact factor: 3.240

7.  Rapid evaluation and quality control of next generation sequencing data with FaQCs.

Authors:  Chien-Chi Lo; Patrick S G Chain
Journal:  BMC Bioinformatics       Date:  2014-11-19       Impact factor: 3.169

8.  Determining the quality and complexity of next-generation sequencing data without a reference genome.

Authors:  Seyed Yahya Anvar; Lusine Khachatryan; Martijn Vermaat; Michiel van Galen; Irina Pulyakhina; Yavuz Ariyurek; Ken Kraaijeveld; Johan T den Dunnen; Peter de Knijff; Peter A C 't Hoen; Jeroen F J Laros
Journal:  Genome Biol       Date:  2014       Impact factor: 13.583

9.  Genomic DNA k-mer spectra: models and modalities.

Authors:  Benny Chor; David Horn; Nick Goldman; Yaron Levy; Tim Massingham
Journal:  Genome Biol       Date:  2009-10-08       Impact factor: 13.583
