Mosdepth: quick coverage calculation for genomes and exomes.

Brent S Pedersen^1,2,3, Aaron R Quinlan^1,2,3.

Abstract

Summary: Mosdepth is a new command-line tool for rapidly calculating genome-wide sequencing coverage. It measures depth from BAM or CRAM files at either each nucleotide position in a genome or for sets of genomic regions. Genomic regions may be specified as either a BED file to evaluate coverage across capture regions, or as a fixed-size window as required for copy-number calling. Mosdepth uses a simple algorithm that is computationally efficient and enables it to quickly produce coverage summaries. We demonstrate that mosdepth is faster than existing tools and provides flexibility in the types of coverage profiles produced.

Availability and implementation: mosdepth is available from https://github.com/brentp/mosdepth under the MIT license.

Contact: bpederse@gmail.com.

Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2018 PMID: 29096012 PMCID: PMC6030888 DOI: 10.1093/bioinformatics/btx699

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Measuring the depth of sequencing coverage is critical for genomic analyses such as calling copy-number variation (CNV), e.g. by cn.mops (Klambauer et al., 2012), quality control (Pedersen et al., 2017), and determining which genomic regions have coverage that is too low or too high (Li, 2014) for reliable variant calling. Given the scope of applications for coverage profiles, several existing tools calculate genome-wide coverage. Samtools depth (Li et al., 2009) outputs per-base coverage; BEDTools genomecov (Quinlan and Hall, 2010; Quinlan, 2014) can output per-region or per-base coverage; Sambamba (Tarasov et al., 2015) also provides per-base and per-window depth calculations. The need for efficient coverage calculation increases with the number and depth of whole-genome sequences, and existing methods require roughly an hour or more of computation for a typical human genome with 30× coverage. Here, we introduce mosdepth and show that it is faster than existing methods and has additional utility.

2 Materials and methods

Mosdepth uses HTSLib (http://www.htslib.org/) via the nim programming language (https://nim-lang.org); it expects the input BAM or CRAM file to be sorted by position. In contrast to samtools, which uses a ‘pileup’ engine that tracks each nucleotide in every read, mosdepth only tracks chunks of read alignments. Only the start and end position of each chunk of an alignment (each alignment may have multiple chunks if it is split by a deletion or other event) are tracked in an array of 32-bit integers whose size is the length of the chromosome. For each chunk of an alignment to the reference genome, mosdepth increments the array value at the index of the chunk's start position and decrements the value at the index of its end position (Fig. 1). It avoids double-counting coverage when the ends of a paired-end sequencing fragment have overlapping alignments (Fig. 1, black alignment). Once the coverage array has tracked all alignment starts and ends in a BAM or CRAM file, the depth at a particular position is calculated as the cumulative sum of all array positions preceding it (a similar algorithm is used in BEDTools, which tracks starts and ends separately).
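The start/end bookkeeping described above can be illustrated with a minimal sketch. This is Python rather than nim, it uses a plain list instead of a 32-bit integer array, and it assumes half-open (start, end) chunks. The function names `per_base_depth` and `mate_overlaps` and the toy coordinates are hypothetical and are not taken from the mosdepth source.

```python
from itertools import accumulate


def mate_overlaps(chunks_a, chunks_b):
    """Return half-open intervals covered by both mates of one fragment."""
    shared = []
    for start_a, end_a in chunks_a:
        for start_b, end_b in chunks_b:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                shared.append((start, end))
    return shared


def per_base_depth(chrom_len, chunks, overlaps=()):
    """Per-base depth from half-open (start, end) alignment chunks."""
    # One extra slot lets a chunk that ends at chrom_len record its decrement.
    delta = [0] * (chrom_len + 1)
    for start, end in chunks:
        delta[start] += 1  # coverage begins here
        delta[end] -= 1    # coverage stops here (end is exclusive)
    for start, end in overlaps:
        # Undo one unit of depth where two mates cover the same bases.
        delta[start] -= 1
        delta[end] += 1
    # The running sum turns start/end marks into depth at each position.
    return list(accumulate(delta))[:chrom_len]


# Hypothetical toy data on a 12-base chromosome.
read_with_deletion = [(1, 4), (6, 9)]  # one alignment split by a deletion
mate_1, mate_2 = [(2, 7)], [(5, 10)]   # overlapping ends of one fragment
chunks = read_with_deletion + mate_1 + mate_2

uncorrected = per_base_depth(12, chunks)
corrected = per_base_depth(12, chunks, mate_overlaps(mate_1, mate_2))
print(uncorrected)
print(corrected)
```

In this sketch the work per alignment is a constant number of array updates per chunk, regardless of read length, which is the property that distinguishes the approach from a per-nucleotide pileup.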

Fig. 1

Mosdepth coverage calculation algorithm. An array the size of the current chromosome is allocated. As each alignment is read from a position-sorted BAM or CRAM file, the value at each start is incremented and the value at each stop is decremented. As illustrated by the alignment with a deletion (D) CIGAR operation, each alignment may have multiple starts and ends. If the leftmost read (the one seen first) of a paired-end alignment has an end that overlaps the position of its mate (which is given as a field in the BAM record) then it is stored in a hash-table until its mate is seen. At that time, the overlap between the mates is calculated, the regions of overlap are decremented and the item is removed from the hash. This prevents double counting coverage from two ends of the same paired-end DNA fragment (black alignment, ‘*’ operation means no coverage increment or decrement is made). Once all reads for a chromosome are consumed, the per-base coverage is simply the cumulative sum of the preceding positions.

The coverage along a chromosome is calculated in place by replacing the composite start and end counts with the cumulative sum up to each element in the array. Once complete, the coverage of a region is simply the mean of the elements in the array spanning from start to end. This makes it possible to calculate coverage extremely quickly, even for millions of small regions. This setup is also amenable to rapid calculation of a genome’s coverage distribution: that is, the number of bases covered by a given number of reads across the genome or in the given regions. The distribution calculation requires an extra iteration through the array that counts the occurrence of each coverage value. The mosdepth method does require more memory: for the 249 megabase chromosome 1 in the human genome, it will require about 1 GB of memory. However, that number is not dependent on the depth of coverage or number of alignments.
Despite its flexibility, mosdepth is easy to use and understand (see Supplementary Material for example uses).
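As an illustration of the region, distribution and memory statements above, the following Python sketch operates on an already computed per-base depth list. The helper names (`region_mean`, `depth_distribution`, `array_bytes`) and the toy depth values are hypothetical, and the dictionary of counts is not meant to reproduce the format of mosdepth's own distribution output.

```python
from collections import Counter


def region_mean(depth, start, end):
    """Mean depth over the half-open region [start, end)."""
    return sum(depth[start:end]) / (end - start)


def depth_distribution(depth):
    """Count how many bases are covered at each depth value."""
    # One extra pass over the array, as described in the text.
    return Counter(depth)


def array_bytes(chrom_len, bytes_per_value=4):
    """Memory for one 32-bit integer per base of a chromosome."""
    return chrom_len * bytes_per_value


# Hypothetical per-base depth for a 12-base chromosome.
depth = [0, 1, 2, 2, 1, 1, 2, 2, 2, 1, 0, 0]

mean_2_6 = region_mean(depth, 2, 6)
distribution = depth_distribution(depth)
chr1_bytes = array_bytes(249_000_000)  # the 249 megabase chromosome 1

print(mean_2_6)
print(dict(distribution))
print(chr1_bytes / 1e9, "GB")
```

Because the array length depends only on chromosome length, the memory estimate (249 million positions at 4 bytes each, just under 1 GB) is unchanged by sequencing depth or the number of alignments.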

3 Results

We compared the time and memory requirements of mosdepth (v0.1.6) to samtools (v1.5), BEDTools (v2.26.0) and sambamba (v0.6.6) on a BAM with about 30× coverage from the Simons Genome Diversity Panel (Mallick et al., 2016) (Supplementary Material). With a single CPU, mosdepth is faster than existing tools, and can be even faster with multiple decompression threads (Table 1). Results for CRAM and for other options such as window-based depth calculations are shown in Supplementary Table S1. At four threads, there is no additional benefit to adding more decompression threads as shown in Supplementary Figure S1.

Table 1.

Comparison of depth tools for time and memory use on a 30× BAM

Tool	Threads	Relative time	Time (hh:mm:ss)	Memory (MiB)
Mosdepth	1	1	25:23	1196
Mosdepth	3	0.57	14:27	1196
Samtools	1	1.98	50:12	27
Sambamba	1	5.71	2:24:53	166
BEDTools	1	5.31	2:14:44	1908

Note: Mosdepth and BEDTools use much more memory, but mosdepth is nearly twice as fast as the next fastest tool, samtools. The threads column reflects the number of threads for BAM/CRAM decompression.
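The relative-time column of Table 1 can be reproduced from the wall-clock times with a few lines of Python. This is only an arithmetic check of the published values; the helper `to_seconds` and the dictionary labels are hypothetical names introduced for the illustration.

```python
def to_seconds(clock):
    """Convert 'mm:ss' or 'h:mm:ss' to a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Wall-clock times copied from Table 1.
table_1 = {
    "mosdepth (1 thread)": "25:23",
    "mosdepth (3 threads)": "14:27",
    "samtools": "50:12",
    "sambamba": "2:24:53",
    "BEDTools": "2:14:44",
}

# Single-threaded mosdepth is the baseline (relative time 1).
baseline = to_seconds(table_1["mosdepth (1 thread)"])
relative = {
    tool: round(to_seconds(clock) / baseline, 2)
    for tool, clock in table_1.items()
}

for tool, ratio in relative.items():
    print(f"{tool}: {ratio}")
```

The samtools ratio of 1.98 is the basis for the statement that mosdepth is nearly twice as fast as the next fastest tool.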

To evaluate consistency between the tools, we compared the output to samtools depth. Mosdepth cannot include or exclude individual bases because of low base-quality (BQ) as can samtools depth. In contrast, samtools depth cannot avoid double-counting overlapping regions unless the BQ cutoff is set to a value > 0. Therefore, we compared mosdepth without mate overlap correction to samtools depth with a BQ cutoff of 0 for chromosome 22 of the dataset used for Table 1. With this comparison set up to evaluate differences, we found no discrepancies in reported depth among the tools for the entire chromosome.

4 Discussion

Mosdepth is a quick, convenient tool for genome-wide depth calculation. The optional coverage distribution is useful for quality control, and the depth output can be used without further processing as input to many CNV detection tools. While the method it employs requires greater memory use, it makes the implementation simple and fast, enables a straightforward coverage distribution calculation, and expedites the depth calculations for even millions of regions. Mosdepth is useful for exome, whole-genome, and targeted sequencing projects. It is available as source code, as a binary, and from bioconda (https://bioconda.github.io/).

Funding

This research was supported by awards to A.R.Q. from the US National Human Genome Research Institute (NIH R01HG006693 and NIH R01HG009141), the US National Institute of General Medical Sciences (NIH R01GM124355) and the US National Cancer Institute (NIH U24CA209999).

Conflict of Interest: none declared.

References

1. Sambamba: fast processing of NGS alignment formats.

Authors: Artem Tarasov; Albert J Vilella; Edwin Cuppen; Isaac J Nijman; Pjotr Prins
Journal: Bioinformatics Date: 2015-02-19 Impact factor: 6.937

2. Toward better understanding of artifacts in variant calling from high-coverage samples.

Authors: Heng Li
Journal: Bioinformatics Date: 2014-06-27 Impact factor: 6.937

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. BEDTools: a flexible suite of utilities for comparing genomic features.

Authors: Aaron R Quinlan; Ira M Hall
Journal: Bioinformatics Date: 2010-01-28 Impact factor: 6.937

5. BEDTools: The Swiss-Army Tool for Genome Feature Analysis.

Authors: Aaron R Quinlan
Journal: Curr Protoc Bioinformatics Date: 2014-09-08

6. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate.

Authors: Günter Klambauer; Karin Schwarzbauer; Andreas Mayr; Djork-Arné Clevert; Andreas Mitterecker; Ulrich Bodenhofer; Sepp Hochreiter
Journal: Nucleic Acids Res Date: 2012-02-01 Impact factor: 16.971

7. Indexcov: fast coverage quality control for whole-genome sequencing.

Authors: Brent S Pedersen; Ryan L Collins; Michael E Talkowski; Aaron R Quinlan
Journal: Gigascience Date: 2017-11-01 Impact factor: 6.524

8. The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.

Authors: Swapan Mallick; Heng Li; Mark Lipson; Iain Mathieson; Melissa Gymrek; Fernando Racimo; Mengyao Zhao; Niru Chennagiri; Susanne Nordenfelt; Arti Tandon; Pontus Skoglund; Iosif Lazaridis; Sriram Sankararaman; Qiaomei Fu; Nadin Rohland; Gabriel Renaud; Yaniv Erlich; Thomas Willems; Carla Gallo; Jeffrey P Spence; Yun S Song; Giovanni Poletti; Francois Balloux; George van Driem; Peter de Knijff; Irene Gallego Romero; Aashish R Jha; Doron M Behar; Claudio M Bravi; Cristian Capelli; Tor Hervig; Andres Moreno-Estrada; Olga L Posukh; Elena Balanovska; Oleg Balanovsky; Sena Karachanak-Yankova; Hovhannes Sahakyan; Draga Toncheva; Levon Yepiskoposyan; Chris Tyler-Smith; Yali Xue; M Syafiq Abdullah; Andres Ruiz-Linares; Cynthia M Beall; Anna Di Rienzo; Choongwon Jeong; Elena B Starikovskaya; Ene Metspalu; Jüri Parik; Richard Villems; Brenna M Henn; Ugur Hodoglugil; Robert Mahley; Antti Sajantila; George Stamatoyannopoulos; Joseph T S Wee; Rita Khusainova; Elza Khusnutdinova; Sergey Litvinov; George Ayodo; David Comas; Michael F Hammer; Toomas Kivisild; William Klitz; Cheryl A Winkler; Damian Labuda; Michael Bamshad; Lynn B Jorde; Sarah A Tishkoff; W Scott Watkins; Mait Metspalu; Stanislav Dryomov; Rem Sukernik; Lalji Singh; Kumarasamy Thangaraj; Svante Pääbo; Janet Kelso; Nick Patterson; David Reich
Journal: Nature Date: 2016-09-21 Impact factor: 49.962
