MuCor: mutation aggregation and correlation.

Karl W Kroll¹, Ann-Katherin Eisfeld², Gerard Lozanski³, Clara D Bloomfield⁴, John C Byrd⁴, James S Blachly⁴.

Abstract

MOTIVATION: There are many tools for variant calling and effect prediction, but little to tie together large sample groups. Aggregating, sorting and summarizing variants and effects across a cohort is often done with ad hoc scripts that must be re-written for every new project. In response, we have written MuCor, a tool to gather variants from a variety of input formats (including multiple files per sample), perform database lookups and frequency calculations, and write many types of reports. In addition to use in large studies with numerous samples, MuCor can also be employed to directly compare variant calls from the same sample across two or more platforms, parameters or pipelines. A companion utility, DepthGauge, measures coverage at regions of interest to increase confidence in calls.
AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://github.com/blachlylab/mucor and a Docker image is available at https://hub.docker.com/r/blachlylab/mucor/
CONTACT: james.blachly@osumc.edu
SUPPLEMENTARY DATA: Supplementary data are available at Bioinformatics online.

Year: 2016 PMID: 26803155 PMCID: PMC4866525 DOI: 10.1093/bioinformatics/btw028

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Examination of genomic variants from multiple samples is a common procedure in bioinformatics. Whether to use existing tools or custom scripts depends on many factors, including the number of samples, source(s) of data and scope of the project. A typical workflow may involve variant annotation, extraction of variants from VCF files, binning loci into features, calculating summary statistics, filtering, limiting by region and writing to a tab-separated values table. Finally, this table is imported into a statistical package for calculations or to Microsoft Excel for use by end-users or inclusion as a data supplement to a paper.

Existing tools provide only parts of this workflow. The Genome Analysis Toolkit (McKenna et al., 2010; DePristo et al., 2011) function CombineVariants, JoinX (http://gmt.genome.wustl.edu/packages/joinx/) and vcftools can all merge records from VCF files, but each produces only another VCF file. The resultant VCF is necessarily focused on individual genomic locations, and a study of features (e.g. genes) requires additional work on the part of the bioinformatician. Further, a multi-sample VCF file may be viewed interactively in a genome browser but is not suitable as a locus-wise frequency table for the biologist or statistician. Transformation of such a VCF to a detailed report again requires additional processing. Some commercial products purport to be end-to-end solutions, but these are not available to all. Variant ToolChest (Ebbert et al., 2014) is open-source, but is designed around subsetting and has only preliminary support for non-VCF reporting.

Motivated by these factors, we developed MuCor, a flexible tool for variant aggregation and summarization. MuCor collects variants from an arbitrary number of sources, assigns them to user-defined features (typically but not necessarily genes), looks them up in variant databases, calculates metrics and generates reports, all within a configurable and extensible framework.

2 Implementation and usage

2.1 Implementation

MuCor is written in Python 2 (≥2.7.0) and uses the following libraries not included with the Python standard library: numpy (Van Der Walt et al., 2011), pandas and HTSeq (Anders et al., 2014). It can optionally make use of pytabix and xlsxwriter to enable additional functionality.

MuCor is used in two stages: setup and run. Project setup can be as simple as running the configure script, yielding a JSON settings file. In ‘autodetect’ mode, the script scans the project directory recursively for all supported variant call files belonging to a supplied list of sample IDs; a single sample may have more than one associated VCF file (e.g. if SNV and indel detection are performed separately). The setup script also establishes links to variant databases against which the study samples will be checked. The final configuration can be adjusted prior to the run phase.

In the run phase, MuCor parses the JSON configuration file and reads samples, databases, annotation and region files into the analysis core. It then groups variants according to features specified in the annotation, optionally limited to region(s) of interest, calculates summary metrics and reports the information about each variant, feature and sample in output reports. The key vehicle by which this is accomplished is the pandas DataFrame (df). By leveraging the grouping, aggregation and pivoting functions native to pandas, the variant data can be swiftly manipulated in myriad ways. For example, all variants can be condensed to the gene level using df.groupby, while data could be reformatted into a sample × gene matrix with df.pivot or df.stack.

The three steps of the run phase (input, aggregation and analysis, and reporting) are distinctly separated to make MuCor easy to extend. New input and output formats are easily coded (see Supplementary material), while any run can be limited to regions of interest and cross-referenced against an arbitrary number of variant databases with merely a configuration change.
Figure 1 summarizes MuCor’s modular approach.
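
As a rough illustration of the grouping and pivoting described above, the following sketch condenses a small long-format variant table to the feature level and then reshapes it into a sample × gene matrix. The column names, gene labels and rows are assumptions made up for this example; they are not MuCor's internal schema or real data. Because one sample may carry several variants in the same feature, the sketch uses pivot_table, which aggregates duplicate sample/feature pairs, where a plain df.pivot would raise an error.

```python
import pandas as pd

# Hypothetical long-format table: one row per (sample, variant) observation.
# Column names and values are illustrative only, not MuCor's actual schema.
variants = pd.DataFrame(
    {
        "sample": ["S1", "S1", "S2", "S2", "S3"],
        "feature": ["GENE_A", "GENE_B", "GENE_A", "GENE_A", "GENE_B"],
        "chrom": ["chr1", "chr2", "chr1", "chr1", "chr2"],
        "pos": [100, 500, 100, 180, 500],
        "ref": ["C", "G", "C", "A", "G"],
        "alt": ["T", "A", "T", "G", "A"],
    }
)

# Condense to the feature (gene) level: number of variant calls and
# number of distinct samples with at least one call in each feature.
gene_summary = (
    variants.groupby("feature")
    .agg(n_variant_calls=("pos", "size"), n_samples=("sample", "nunique"))
    .reset_index()
)

# Fraction of the cohort with at least one variant in each feature.
n_cohort = variants["sample"].nunique()
gene_summary["sample_frequency"] = gene_summary["n_samples"] / n_cohort

# Reshape to a sample x gene matrix of variant counts; absent pairs become 0.
matrix = variants.pivot_table(
    index="sample",
    columns="feature",
    values="pos",
    aggfunc="count",
    fill_value=0,
)

print(gene_summary)
print(matrix)
```

A matrix of this shape is the kind of table that can be handed to a statistical package or spreadsheet, which is the end point of the workflow described in the Introduction.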

Fig. 1. Schematic of MuCor’s inputs, engine and outputs. Many more input and output types are available. See Supplementary materials.

MuCor is not a variant annotator: functional effect prediction is better left to specialized tools. In support of this, MuCor accepts as input data that have already been decorated with a functional effect prediction, retains this information through processing and passes it through to the output. Furthermore, some MuCor reports are in a form suitable for further processing by variant effect prediction software, a more efficient approach for large cohorts (i.e. hundreds to thousands of samples) than per-sample effect prediction.

2.2 Usage

Detailed instructions and sample workflows are given in the Supplementary material. Configuration begins by specifying an annotation in GTF/GFF3 format, a feature type for grouping, a list of samples and one or more report types. The user may optionally specify reference databases (e.g. dbSNP, 1000 Genomes, COSMIC) and limit analysis to regions of interest. Variant files are automatically detected. The resultant JSON file may be hand-edited prior to execution, or retained for reproducibility. A corresponding run requires only passing the configuration file; output is written to a prespecified directory.

Reports range in scope from summary-level counts across broad regions of interest to variant-specific metrics on a per-sample basis. The Supplementary material contains a list of all report types and some example reports. Finally, DepthGauge, a companion program, queries source BAM files for total reads at each defined region in all samples to improve confidence in the veracity of wild-type calls.

The generality of MuCor makes it useful for additional applications beyond comparing variants across samples. We have also used it to compare sequencing platforms, variant calling tools and pipelines within a single sample. In this case, the comparisons reveal concordance and discordance between platforms or tools, rather than between samples as in the canonical usage. See example workflow 3 in the Supplementary material.
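
To make the notion of concordance between pipelines concrete, the sketch below compares two call sets from one sample by joining them on the variant key (chromosome, position, reference and alternate allele). This is a minimal pandas illustration under stated assumptions, not MuCor's own comparison report: the caller names, column names and variant rows are hypothetical.

```python
import pandas as pd

# Columns that together identify a variant; names are illustrative.
KEY = ["chrom", "pos", "ref", "alt"]

# Hypothetical call sets for the same sample from two pipelines.
caller_a = pd.DataFrame(
    [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")],
    columns=KEY,
)
caller_b = pd.DataFrame(
    [("chr1", 100, "A", "G"), ("chr2", 40, "G", "A"), ("chr2", 900, "T", "C")],
    columns=KEY,
)

# An outer join keeps every variant seen by either pipeline; the indicator
# column records whether each row came from the left, right or both inputs.
merged = caller_a.merge(caller_b, on=KEY, how="outer", indicator=True)

labels = {
    "both": "concordant",
    "left_only": "caller_a_only",
    "right_only": "caller_b_only",
}
merged["status"] = merged["_merge"].astype(str).map(labels)

# Tally each category and compute the fraction of variants found by both.
counts = merged["status"].value_counts().to_dict()
concordance = counts.get("concordant", 0) / len(merged)

print(merged[KEY + ["status"]])
print(counts, concordance)
```

Matching on the exact allele representation is a simplification: in practice the same indel can be written differently by different callers, so call sets are usually normalized before a comparison like this one.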

3 Conclusions

We have created MuCor, a tool to aggregate and report genomic variants in configurable, meaningful groups. A key goal is generality, so that MuCor may replace the ad hoc scripting often performed in the course of routine bioinformatic analyses of multiple samples. A flexible runtime and modular architecture ensure broad applicability: we have applied MuCor for its intended use by aggregating hundreds to thousands of cases, but also for novel uses, such as comparing different variant calling software pipelines or different sequencing platforms. By separating input parsers, the annotation and analysis core, and output reporting into distinct components, MuCor is modular and expandable through new input plugins, new annotations or auxiliary databases, and definition of new output report formats.

Funding

This work was supported by the National Institutes of Health [P30 CA016058] and in part by an allocation of computing resources from The Ohio Supercomputer Center.

Conflict of Interest: none declared.

References

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19

2. A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors: Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal: Nat Genet Date: 2011-04-10

3. Variant Tool Chest: an improved tool to analyze and manipulate variant call format (VCF) files.

Authors: Mark T W Ebbert; Mark E Wadsworth; Kevin L Boehme; Kaitlyn L Hoyt; Aaron R Sharp; Brendan D O'Fallon; John S K Kauwe; Perry G Ridge
Journal: BMC Bioinformatics Date: 2014-05-28

4. HTSeq--a Python framework to work with high-throughput sequencing data.

Authors: Simon Anders; Paul Theodor Pyl; Wolfgang Huber
Journal: Bioinformatics Date: 2014-09-25
