BamToCov, an efficient toolkit for sequence coverage calculations.

Giovanni Birolo, Andrea Telatin.

Abstract

MOTIVATION: Many genomics applications require the computation of nucleotide coverage of a reference genome or the ability to determine how many reads map to a reference region.
RESULTS: BamToCov is a toolkit for rapid and flexible coverage computation that relies on the most memory efficient algorithm and is designed for integration in pipelines, given its ability to read alignment files from streams. The tools in the suite can process sorted BAM or CRAM files, allowing the user to extract coverage information via different filtering approaches and to save the output in different formats (BED, Wig or counts). The BamToCov algorithm can also handle strand-specific and/or physical coverage analyses.
AVAILABILITY: This program, accessory utilities, and their documentation are freely available at https://github.com/telatin/BamToCov.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2022. Published by Oxford University Press.

Year:  2022        PMID: 35199151      PMCID: PMC9048650          DOI: 10.1093/bioinformatics/btac125

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931


1 Introduction

Sequencing coverage calculations have been performed since the dawn of genomics (Lander and Waterman, 1988), commonly as a priori theoretical calculations aimed at estimating the effort required to produce sufficient DNA reads with capillary sequencers. With the advent of massively parallel sequencing (also referred to as ‘next generation sequencing’), those a priori calculations began to be matched by a posteriori calculations made by mapping the DNA reads against a reference sequence (either a pre-existing reference, or the de novo assembly of the sequencing output itself). In this context, some sequenced bases are not accounted for (e.g. adaptor sequences, contaminants or unmappable reads). When using paired libraries, it is also possible to evaluate the physical coverage, i.e. the number of times a base is spanned by a read pair. Several tools already extract coverage information from alignment files (in BAM format): Samtools (Li et al., 2009), Bedtools (Quinlan, 2014), Sambamba (Tarasov et al., 2015) and the newer and more feature-rich Mosdepth (Pedersen and Quinlan, 2018b) and MegaDepth (Wilks et al., 2021). A common limitation of the existing tools is the inability to calculate physical coverage, which is important when determining the integrity of assemblies using mate-pair libraries. In addition, these tools cannot report coverage separately per strand; a position covered only by forward reads, or only by reverse reads, is probably the result of misalignment. To address these limitations, we developed Covtobed (Birolo and Telatin, 2020), a simple yet efficient C++ program which, inspired by the UNIX philosophy of computer programming, focuses on a single task and supports input and output streams. Here we introduce the BamToCov suite of programs, implemented in the Nim language.
BamToCov performs coverage calculations using an optimized implementation of the algorithm of Covtobed with new features to support interval targets, new output formats, coverage statistics and multiple BAM files, while retaining the ability to read input streams, thereby achieving an overall performance improvement (i.e., a smaller memory footprint and an increase in speed of up to 3×).
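The difference between sequence coverage and physical coverage mentioned above can be illustrated with a minimal Python sketch. It is not taken from BamToCov: the function names, the half-open 0-based interval convention and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based (assumed convention)


def fragment_interval(mate1: Interval, mate2: Interval) -> Interval:
    """Span from the leftmost to the rightmost base of a read pair."""
    return (min(mate1[0], mate2[0]), max(mate1[1], mate2[1]))


def depth_at(pos: int, intervals: List[Interval]) -> int:
    """Number of intervals that contain position pos."""
    return sum(1 for start, end in intervals if start <= pos < end)


# Two hypothetical read pairs mapped to the same contig.
pairs = [((100, 150), (300, 350)), ((120, 170), (280, 330))]

# Sequence coverage counts the individual mates ...
reads = [mate for pair in pairs for mate in pair]
# ... while physical coverage counts the whole span of each pair.
fragments = [fragment_interval(m1, m2) for m1, m2 in pairs]

print(depth_at(130, reads), depth_at(130, fragments))  # 2 2
print(depth_at(200, reads), depth_at(200, fragments))  # 0 2
```

Position 200 lies in the unsequenced gap between the mates, so it has no sequence coverage but is still spanned by both pairs.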

2 Materials and methods

BamToCov reimplements the coverage calculation algorithm of Covtobed, a C++ program, with further optimizations. It uses HTSlib (Bonfield et al., 2021) for BAM parsing, via the hts-nim wrapper (Pedersen and Quinlan, 2018a), instead of libbamtools, and hence natively supports CRAM files. The program and its companion utilities are written in Nim and tested using three compiler versions (1.2, 1.4 and 1.6). Unlike other programs that allocate an integer array with the length of the reference sequence, BamToCov uses a streaming approach that takes full advantage of sorted input alignments; as a result, its memory usage depends only on the maximum coverage and not on the reference size. The basic premise is that coverage is computed starting from zero at the leftmost base in each contig and updated on-the-fly while reading alignments and moving toward the right. Coverage is incremented at the start of each alignment and decremented at its end, with the pending end positions kept in a priority queue. The suite of tools is automatically tested and is available via the BioConda project (The Bioconda Team et al., 2018). The scripts used to benchmark the execution times and the peak memory usage are available in the software repository.
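The streaming approach described above can be sketched in a few lines of Python. This is an illustrative reimplementation of the general idea, not the Nim code of BamToCov: the function name, the half-open 0-based (start, end) tuples and the output layout are assumptions, and filtering, strands and multiple contigs are left out.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Run = Tuple[int, int, int]  # (start, end, coverage), half-open, 0-based


def coverage_runs(alignments: Iterable[Tuple[int, int]],
                  contig_length: int) -> Iterator[Run]:
    """Yield runs of constant coverage for one contig.

    `alignments` must be sorted by start position, with start < end.
    Only the end positions of alignments that are still open are kept,
    so memory grows with the maximum coverage, not with the contig length.
    """
    ends: List[int] = []  # min-heap of pending end positions
    cov = 0               # coverage of the run currently being built
    pos = 0               # start of the run currently being built

    def close_until(limit: int) -> Iterator[Run]:
        nonlocal cov, pos
        # Pop every alignment that ends at or before `limit`.
        while ends and ends[0] <= limit:
            end = heapq.heappop(ends)
            if end > pos:
                yield (pos, end, cov)
                pos = end
            cov -= 1

    for start, end in alignments:
        yield from close_until(start)
        if start > pos:
            yield (pos, start, cov)
            pos = start
        heapq.heappush(ends, end)
        cov += 1

    yield from close_until(contig_length)
    while ends:           # alignments running past the contig end
        heapq.heappop(ends)
        cov -= 1
    if pos < contig_length:
        yield (pos, contig_length, cov)


for run in coverage_runs([(2, 6), (4, 8), (4, 6), (10, 12)], 15):
    print(*run, sep="\t")
```

Each printed line is a BED-like triple of start, end and coverage. At most three end positions are held in the queue for this input, which is its maximum coverage.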

3 Results

BamToCov, which can be used as a drop-in replacement for Covtobed, is faster and more memory efficient than Covtobed, while implementing a wide range of new features. BamToCov is designed to support input streams and produce physical coverage and per-strand coverage calculations. BamToCov also enables the user to supply a set of intervals of interest (target) and use them both to restrict the output coverage to those regions and generate a table of statistics per interval. BamToCov supports targets in BED, GFF3 and GTF formats. Supplementary utilities provide support for computing read counts (rather than nucleotide coverage) and statistics over the whole chromosomes. An overview of the features is reported in Supplementary Note S1.
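As an illustration of per-interval statistics, the sketch below summarises coverage over a single target interval. It assumes coverage is available as runs of (start, end, coverage) with half-open 0-based coordinates that tile the target completely. The function name and the reported fields are hypothetical and do not reproduce the actual columns of the BamToCov report.

```python
from typing import Dict, Iterable, Tuple

Run = Tuple[int, int, int]  # (start, end, coverage), half-open, 0-based


def interval_stats(runs: Iterable[Run], start: int, end: int) -> Dict[str, float]:
    """Summarise coverage over the target [start, end), with start < end."""
    total = 0   # sum of per-base coverage inside the target
    lo = None   # minimum coverage seen inside the target
    hi = 0      # maximum coverage seen inside the target
    for run_start, run_end, cov in runs:
        # Clip the run to the target and skip runs that do not overlap it.
        s, e = max(run_start, start), min(run_end, end)
        if s >= e:
            continue
        total += (e - s) * cov
        lo = cov if lo is None else min(lo, cov)
        hi = max(hi, cov)
    length = end - start
    return {
        "length": length,
        "mean": total / length,
        "min": 0 if lo is None else lo,
        "max": hi,
    }


runs = [(0, 2, 0), (2, 4, 1), (4, 6, 3), (6, 8, 1), (8, 10, 0)]
print(interval_stats(runs, 3, 7))
# {'length': 4, 'mean': 2.0, 'min': 1, 'max': 3}
```

Because the runs are already sorted and non-overlapping, one pass over them is enough, which fits a streaming design.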

3.1 BamToCov performance

To evaluate the performance of BamToCov we adopted four test datasets: the genome sequencing of a yeast using short reads (SR) and long reads (LR), and two human datasets (an exome and a small targeted gene panel); see Supplementary Note S2 for a complete description and their availability. BamToCov is reasonably fast, especially for datasets with few reads or long contigs (e.g. targeted gene panels); it is up to 2× faster than its predecessor (Covtobed), and very promising when evaluating the coverage of long reads (see Table 1 and Supplementary Note S3).
Table 1. Execution times and peak memory usage

Program      Fungus, SR       Fungus, LR       Exome            Gene Panel
             Mem    Time      Mem    Time      Mem    Time      Mem    Time
BamToCov     2.7    9.77      4.3    85.45     2      4.09      2      0.35
Covtobed     4.0    16.45     5.0    162.76    3      3.81      4      0.53
Mosdepth     13.9   6.51      19.1   96.41     9833.02642526.01
MegaDepth    11.6   5.97      11.6   107.99    9510.149809.24

Note: Peak memory usage (in megabytes) and average execution time out of 10 runs per program (seconds). Mosdepth was executed in fast mode.

BamToCov is the most memory efficient program for coverage calculations, with an improvement up to 50% when compared with the C++ implementation (Covtobed). Other programs relying on chromosome-sized vectors require up to 400× more memory to analyze a typical human exome, while BamToCov memory usage is not affected by the genome size (see Table 1 and Supplementary Note S4).
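A back-of-the-envelope calculation shows why chromosome-sized vectors dominate memory usage on a human reference. The byte sizes below are assumptions (one 32-bit counter per base for the array approach, one 64-bit end position per open alignment for the queue approach), and the maximum coverage of 10 000 is a hypothetical value. The figures are illustrative and were not measured on any of the benchmarked programs.

```python
BYTES_PER_COUNTER = 4      # assumption: one 32-bit integer per reference base
BYTES_PER_QUEUE_ENTRY = 8  # assumption: one 64-bit end position per open alignment

CHR1_LENGTH = 248_956_422  # length of human chromosome 1 in GRCh38


def array_bytes(contig_length: int) -> int:
    """Memory for a per-base coverage array spanning a whole contig."""
    return contig_length * BYTES_PER_COUNTER


def queue_bytes(max_coverage: int) -> int:
    """Memory for a queue holding one end position per overlapping alignment."""
    return max_coverage * BYTES_PER_QUEUE_ENTRY


print(f"array: {array_bytes(CHR1_LENGTH) / 1e6:.1f} MB")  # array: 995.8 MB
print(f"queue: {queue_bytes(10_000) / 1e6:.2f} MB")       # queue: 0.08 MB
```

Under these assumptions the array cost is fixed by the longest chromosome, whatever the number of reads, while the queue cost follows the deepest pile-up in the data.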

4 Conclusion

BamToCov is a program and suite of utilities engineered for easy integration into bioinformatics pipelines requiring coverage calculations. It is designed to allow for flexible prototyping of bespoke pipelines, where the support for input and output streams and the low memory footprint can be valuable. For example, TraDIS-Xpress experiments (Turner et al., 2020) rely on detecting uncovered regions across a large set of samples, and benefit from the availability of stranded reports. In terms of performance, BamToCov proves to be a suitable alternative for gene panels and long-read datasets. The algorithm adopted is by far the most memory efficient among the tools compared, and the new implementation in Nim yields further performance benefits in terms of both execution time and memory footprint.

Funding

This work was supported by the Biotechnology and Biological Sciences Research Council (BBSRC; BB/R012490/1 and BBS/E/F/000PR10353, BB/R506552/1), the Medical Research Council (MRC; MR/T030062/1) and the European Union’s Horizon 2020 (GA101017598).

Conflict of Interest: none declared.

1.  Sambamba: fast processing of NGS alignment formats.

Authors:  Artem Tarasov; Albert J Vilella; Edwin Cuppen; Isaac J Nijman; Pjotr Prins
Journal:  Bioinformatics       Date:  2015-02-19       Impact factor: 6.937

2.  Bioconda: sustainable and comprehensive software distribution for the life sciences.

Authors:  Björn Grüning; Ryan Dale; Andreas Sjödin; Brad A Chapman; Jillian Rowe; Christopher H Tomkins-Tinch; Renan Valieris; Johannes Köster
Journal:  Nat Methods       Date:  2018-07       Impact factor: 28.547

3.  Genomic mapping by fingerprinting random clones: a mathematical analysis.

Authors:  E S Lander; M S Waterman
Journal:  Genomics       Date:  1988-04       Impact factor: 5.736

4.  hts-nim: scripting high-performance genomic analyses.

Authors:  Brent S Pedersen; Aaron R Quinlan
Journal:  Bioinformatics       Date:  2018-10-01       Impact factor: 6.937

5.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

6.  BEDTools: The Swiss-Army Tool for Genome Feature Analysis.

Authors:  Aaron R Quinlan
Journal:  Curr Protoc Bioinformatics       Date:  2014-09-08

7.  Mosdepth: quick coverage calculation for genomes and exomes.

Authors:  Brent S Pedersen; Aaron R Quinlan
Journal:  Bioinformatics       Date:  2018-03-01       Impact factor: 6.937

8.  HTSlib: C library for reading/writing high-throughput sequencing data.

Authors:  James K Bonfield; John Marshall; Petr Danecek; Heng Li; Valeriu Ohan; Andrew Whitwham; Thomas Keane; Robert M Davies
Journal:  Gigascience       Date:  2021-02-16       Impact factor: 6.524

9.  Megadepth: efficient coverage quantification for BigWigs and BAMs.

Authors:  Christopher Wilks; Omar Ahmed; Daniel N Baker; David Zhang; Leonardo Collado-Torres; Ben Langmead
Journal:  Bioinformatics       Date:  2021-03-08       Impact factor: 6.937

10.  A genome-wide analysis of Escherichia coli responses to fosfomycin using TraDIS-Xpress reveals novel roles for phosphonate degradation and phosphate transport systems.

Authors:  A Keith Turner; Muhammad Yasir; Sarah Bastkowski; Andrea Telatin; Andrew J Page; Ian G Charles; Mark A Webber
Journal:  J Antimicrob Chemother       Date:  2020-11-01       Impact factor: 5.790
