
Bam-readcount -- rapid generation of basepair-resolution sequence metrics.

Ajay Khanna, David E Larson, Sridhar Nonavinkere Srivatsan, Matthew Mosior, Travis E Abbott, Susanna Kiwala, Timothy J Ley, Eric J Duncavage, Matthew J Walter, Jason R Walker, Obi L Griffith, Malachi Griffith, Christopher A Miller.

Abstract

Bam-readcount is a utility for generating low-level information about sequencing data at specific nucleotide positions. Originally designed to help filter genomic mutation calls, the metrics it outputs are useful as input for variant detection tools and for resolving ambiguity between variant callers. In addition, it has found broad applicability in diverse fields including tumor evolution, single-cell genomics, climate change ecology, and tracking community spread of SARS-CoV-2. Here we report on the release of version 1.0 of this tool, which adds CRAM support, among other improvements. It is released under a permissive MIT license and available at https://github.com/genome/bam-readcount.


Year: 2021 PMID: 34341766 PMCID: PMC8328062

Source DB: PubMed Journal: ArXiv ISSN: 2331-8422

Introduction

Though many tools exist that can call simple genotypes from sequence data, there is frequently a need for rapid and comprehensive reporting of sequencing metrics at specific genomic locations. The bam-readcount tool reports 15 metrics chosen specifically because they are known to be associated with the quality of sequence reads and individual base calls. These include summarized mapping and base qualities, strandedness information, mismatch counts, and position within the reads. This information is useful in many contexts. One frequent application is variant filtering and ensemble variant calling, where consistent, tool-agnostic metrics are needed[2,7,8].
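As an illustration of how these per-site metrics might be consumed downstream, the following Python sketch parses one line of output. It assumes the layout described in the project README: tab-separated chromosome, position, reference base, and depth, followed by one colon-delimited record per allele whose first two values are the base and its read count. The names `AlleleMetrics`, `SiteCounts`, `parse_site`, and `allele_fraction` are hypothetical, and the example line is made up and shortened rather than real output.

```python
from typing import Dict, List, NamedTuple


class AlleleMetrics(NamedTuple):
    count: int            # number of reads supporting this allele
    metrics: List[float]  # remaining per-allele values, kept in file order


class SiteCounts(NamedTuple):
    chrom: str
    position: int
    ref_base: str
    depth: int
    alleles: Dict[str, AlleleMetrics]


def parse_site(line: str) -> SiteCounts:
    """Parse one tab-separated line (assumed layout, see the text above)."""
    fields = line.rstrip("\n").split("\t")
    chrom, position, ref_base, depth = fields[:4]
    alleles: Dict[str, AlleleMetrics] = {}
    for record in fields[4:]:
        if not record:
            continue
        parts = record.split(":")
        # First value is the allele, second is its read count.
        alleles[parts[0]] = AlleleMetrics(int(parts[1]),
                                          [float(v) for v in parts[2:]])
    return SiteCounts(chrom, int(position), ref_base, int(depth), alleles)


def allele_fraction(site: SiteCounts, base: str) -> float:
    """Fraction of the reported depth that supports `base`."""
    if site.depth == 0 or base not in site.alleles:
        return 0.0
    return site.alleles[base].count / site.depth


# Made-up, shortened example line (only two metrics per allele shown).
example = "chr1\t100\tA\t10\tA:8:60.0:35.0\tC:2:55.0:30.0"
site = parse_site(example)
print(site.ref_base, allele_fraction(site, "C"))
```

Keeping the trailing metrics as an ordered list avoids hard-coding field names; a caller that knows the column order for its bam-readcount version can map them to names.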

Implementation and results

The ongoing adoption of compressed data formats has necessitated additions to the code, and the version 1.0 release that we report on here uses an updated version of HTSlib to support rapid CRAM file access[9]. This has also improved performance, and bam-readcount can report on 100,000 randomly selected sites from a 30x whole-genome sequencing (WGS) BAM in around 5 minutes[10]. Its performance scales nearly linearly with the number of genomic sites queried and with average sequencing depth (Figure 1). Querying the same 100,000 sites from a BAM with 300x WGS takes 48 minutes, roughly 10x as long.
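The near-linear scaling can be turned into a rough back-of-the-envelope runtime estimate. The Python sketch below calibrates a purely linear model on the 30x figure quoted above (100,000 sites in about 5 minutes); the function name `estimate_minutes` and the assumption of exact linearity are ours, and real runtimes will depend on hardware and disk I/O.

```python
# Calibration point taken from the text: 100,000 sites at 30x in ~5 minutes.
REF_SITES = 100_000
REF_DEPTH = 30
REF_MINUTES = 5.0


def estimate_minutes(n_sites: int, depth: float) -> float:
    """Estimate runtime assuming cost is proportional to sites x depth."""
    return REF_MINUTES * (n_sites / REF_SITES) * (depth / REF_DEPTH)


# The linear model predicts 50 minutes for 100,000 sites at 300x,
# close to the 48 minutes reported above.
print(estimate_minutes(100_000, 300))
```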

Figure 1: Performance of bam-readcount when querying randomly selected genomic positions from BAMs (left) or corresponding CRAMs (right) of varying sequencing depth. Colors correspond to average sequencing depth of the downsampled BAM/CRAM file.

Memory usage likewise depends on sequencing depth, but remains below 1 GB of RAM for a 300x WGS BAM. Processing small CRAM files is somewhat slower than processing BAMs with comparable amounts of data, due to the increased CPU usage for decompression, but as depth increases, retrieval from disk becomes the bottleneck and operations on CRAMs become faster than on BAMs. In our testing, on a fast SSD tier of networked disk, this transition occurs at a depth of about 180x. The problem is also embarrassingly parallel, so assuming adequate disk I/O, a roughly linear increase in speed can be achieved with a scatter/gather approach. To lower barriers to adoption, we provide Docker images for containerized workflows, and have developed a Python wrapper that annotates a VCF file with read counts produced by this tool (available as part of the VAtools package: https://github.com/griffithlab/VAtools).
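One way the scatter/gather approach could be organized is sketched below in Python. This is a minimal illustration, not the authors' pipeline: the site list is split into contiguous chunks, each chunk is handed to a caller-supplied `run_chunk` function (hypothetical; in practice it would wrap one bam-readcount invocation on that chunk), and the per-chunk output lines are concatenated in the original order. The names `scatter` and `scatter_gather` are also our own.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def scatter(sites: Sequence[Site], n_chunks: int) -> List[List[Site]]:
    """Split sites into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(sites)))
    size, extra = divmod(len(sites), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional site.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(sites[start:end]))
        start = end
    return chunks


def scatter_gather(sites: Sequence[Site],
                   run_chunk: Callable[[List[Site]], List[str]],
                   n_workers: int = 4) -> List[str]:
    """Run run_chunk on each chunk in parallel and concatenate the results."""
    chunks = scatter(sites, n_workers)
    # Threads suffice when run_chunk mostly waits on an external process.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(run_chunk, chunks)  # map preserves chunk order
        return [line for chunk_lines in results for line in chunk_lines]


def fake_run_chunk(chunk: List[Site]) -> List[str]:
    """Stand-in for a real worker; returns one placeholder line per site."""
    return [f"{chrom}\t{pos}" for chrom, pos in chunk]


sites = [("chr1", p) for p in range(1, 11)]
lines = scatter_gather(sites, fake_run_chunk, n_workers=3)
print(len(lines))
```

Because each chunk is independent, the number of workers can be tuned to the available disk throughput rather than to CPU count alone.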

Conclusions

Bam-readcount plays a central role in many genomic pipelines and there is a rich ecosystem of tools built on top of it that enable discovery. It has many uses in benchmarking and variant discovery, and its feature-rich output has enabled deep learning approaches to variant calling and filtering[7,11]. In cancer genomics, it has been used for understanding pre-leukemic phenotypes and for detecting therapy-altering mutations from cell-free DNA[12,13]. Viral researchers have applied it to understand diversity in Varicella Zoster Virus Encephalitis and to perform epidemiological surveillance of SARS-CoV-2 in wastewater[14,15]. Those with RNA-sequencing data have found it useful for identifying allele-specific expression in cancer, or for enabling copy-number detection in single-cell RNA sequencing[5,16]. It also serves as core infrastructure that supports genomics pipelines of all sizes, from bespoke workflows produced by small research groups to the NCI’s Genomic Data Commons pipelines, where it has been run on tens of thousands of genomes[17-19]. Looking forward, we anticipate that as machine learning makes deeper inroads into genomics, the ability to rapidly extract highly informative features from large cohorts will continue to make bam-readcount useful for the next generation of genomics research. The bam-readcount tool is available at https://github.com/genome/bam-readcount and is shared under an MIT license to enable broad re-use.

18 in total

1. Using VarScan 2 for Germline Variant Calling and Somatic Mutation Detection.

Authors: Daniel C Koboldt; David E Larson; Richard K Wilson
Journal: Curr Protoc Bioinformatics Date: 2013-12

2. The NCI Genomic Data Commons as an engine for precision medicine.

Authors: Mark A Jensen; Vincent Ferretti; Robert L Grossman; Louis M Staudt
Journal: Blood Date: 2017-06-09 Impact factor: 22.113

3. Genome Modeling System: A Knowledge Management Platform for Genomics.

Authors: Malachi Griffith; Obi L Griffith; Scott M Smith; Avinash Ramu; Matthew B Callaway; Anthony M Brummett; Michael J Kiwala; Adam C Coffman; Allison A Regier; Ben J Oberkfell; Gabriel E Sanderson; Thomas P Mooney; Nathaniel G Nutter; Edward A Belter; Feiyu Du; Robert L Long; Travis E Abbott; Ian T Ferguson; David L Morton; Mark M Burnett; James V Weible; Joshua B Peck; Adam Dukes; Joshua F McMichael; Justin T Lolofie; Brian R Derickson; Jasreet Hundal; Zachary L Skidmore; Benjamin J Ainscough; Nathan D Dees; William S Schierding; Cyriac Kandoth; Kyung H Kim; Charles Lu; Christopher C Harris; Nicole Maher; Christopher A Maher; Vincent J Magrini; Benjamin S Abbott; Ken Chen; Eric Clark; Indraniel Das; Xian Fan; Amy E Hawkins; Todd G Hepler; Todd N Wylie; Shawn M Leonard; William E Schroeder; Xiaoqi Shi; Lynn K Carmichael; Matthew R Weil; Richard W Wohlstadter; Gary Stiehr; Michael D McLellan; Craig S Pohl; Christopher A Miller; Daniel C Koboldt; Jason R Walker; James M Eldred; David E Larson; David J Dooling; Li Ding; Elaine R Mardis; Richard K Wilson
Journal: PLoS Comput Biol Date: 2015-07-09 Impact factor: 4.475

4. Age-related mutations associated with clonal hematopoietic expansion and malignancies.

Authors: Mingchao Xie; Charles Lu; Jiayin Wang; Michael D McLellan; Kimberly J Johnson; Michael C Wendl; Joshua F McMichael; Heather K Schmidt; Venkata Yellapantula; Christopher A Miller; Bradley A Ozenberger; John S Welch; Daniel C Link; Matthew J Walter; Elaine R Mardis; John F Dipersio; Feng Chen; Richard K Wilson; Timothy J Ley; Li Ding
Journal: Nat Med Date: 2014-10-19 Impact factor: 53.440

5. appreci8: a pipeline for precise variant calling integrating 8 tools.

Authors: Sarah Sandmann; Mohsen Karimi; Aniek O de Graaf; Christian Rohde; Stefanie Göllner; Julian Varghese; Jan Ernsting; Gunilla Walldin; Bert A van der Reijden; Carsten Müller-Tidow; Luca Malcovati; Eva Hellström-Lindberg; Joop H Jansen; Martin Dugas
Journal: Bioinformatics Date: 2018-12-15 Impact factor: 6.937

6. NeoMutate: an ensemble machine learning framework for the prediction of somatic mutations in cancer.

Authors: Irantzu Anzar; Angelina Sverchkova; Richard Stratford; Trevor Clancy
Journal: BMC Med Genomics Date: 2019-05-16 Impact factor: 3.063

7. Multiple Introductions Followed by Ongoing Community Spread of SARS-CoV-2 at One of the Largest Metropolitan Areas of Northeast Brazil.

Authors: Marcelo Henrique Santos Paiva; Duschinka Ribeiro Duarte Guedes; Cássia Docena; Matheus Filgueira Bezerra; Filipe Zimmer Dezordi; Laís Ceschini Machado; Larissa Krokovsky; Elisama Helvecio; Alexandre Freitas da Silva; Luydson Richardson Silva Vasconcelos; Antonio Mauro Rezende; Severino Jefferson Ribeiro da Silva; Kamila Gaudêncio da Silva Sales; Bruna Santos Lima Figueiredo de Sá; Derciliano Lopes da Cruz; Claudio Eduardo Cavalcanti; Armando de Menezes Neto; Caroline Targino Alves da Silva; Renata Pessôa Germano Mendes; Maria Almerice Lopes da Silva; Tiago Gräf; Paola Cristina Resende; Gonzalo Bello; Michelle da Silva Barros; Wheverton Ricardo Correia do Nascimento; Rodrigo Moraes Loyo Arcoverde; Luciane Caroline Albuquerque Bezerra; Sinval Pinto Brandão Filho; Constância Flávia Junqueira Ayres; Gabriel Luz Wallau
Journal: Viruses Date: 2020-12-09 Impact factor: 5.048

8. HTSlib: C library for reading/writing high-throughput sequencing data.

Authors: James K Bonfield; John Marshall; Petr Danecek; Heng Li; Valeriu Ohan; Andrew Whitwham; Thomas Keane; Robert M Davies
Journal: Gigascience Date: 2021-02-16 Impact factor: 6.524

9. A deep learning approach to automate refinement of somatic variant calling from cancer sequencing data.

Authors: Benjamin J Ainscough; Erica K Barnell; Peter Ronning; Katie M Campbell; Alex H Wagner; Todd A Fehniger; Gavin P Dunn; Ravindra Uppaluri; Ramaswamy Govindan; Thomas E Rohan; Malachi Griffith; Elaine R Mardis; S Joshua Swamidass; Obi L Griffith
Journal: Nat Genet Date: 2018-11-05 Impact factor: 38.330

10. A direct capture method for purification and detection of viral nucleic acid enables epidemiological surveillance of SARS-CoV-2.

Authors: Subhanjan Mondal; Nathan Feirer; Michael Brockman; Melanie A Preston; Sarah J Teter; Dongping Ma; Said A Goueli; Sameer Moorji; Brigitta Saul; James J Cali
Journal: Sci Total Environ Date: 2021-07-07 Impact factor: 7.963