Alfred: interactive multi-sample BAM alignment statistics, feature counting and feature annotation for long- and short-read sequencing.

Tobias Rausch¹,², Markus Hsi-Yang Fritz², Jan O Korbel², Vladimir Benes¹.

Abstract

SUMMARY: Harmonizing quality control (QC) of large-scale second- and third-generation sequencing datasets is key for enabling downstream computational and biological analyses. We present Alfred, an efficient and versatile command-line application that computes multi-sample QC metrics in a read-group aware manner, across a wide variety of sequencing assays and technologies. In addition to standard QC metrics such as GC bias, base composition, insert size and sequencing coverage distributions, it supports haplotype-aware and allele-specific feature counting and feature annotation. The versatility of Alfred allows for easy pipeline integration in high-throughput settings, including DNA sequencing facilities and large-scale research initiatives, enabling continuous monitoring of sequence data quality and characteristics across samples. Alfred supports haplo-tagging of BAM/CRAM files to conduct haplotype-resolved analyses in conjunction with a variety of next-generation sequencing based assays. Alfred's companion web application enables interactive exploration of results and comparison to public datasets.
AVAILABILITY AND IMPLEMENTATION: Alfred is open-source and freely available at https://tobiasrausch.com/alfred/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019; PMID: 30520945; PMCID: PMC6612896; DOI: 10.1093/bioinformatics/bty1007

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Many methods have been developed to perform quality control (QC) on specific types of sequencing assays (Endrullat et al., 2016), such as RNA-SeQC (DeLuca et al., 2012) for RNA-Seq data, CHANCE (Diaz et al., 2012) for ChIP-seq data or Poretools (Loman and Quinlan, 2014) for Oxford Nanopore sequencing data. Popular general-purpose alignment QC methods are, for instance, Qualimap 2 (Okonechnikov et al., 2015) and NGS QC Toolkit (Patel and Jain, 2012). Despite these developments, it remains challenging for DNA sequencing facilities and large genomics projects to find a versatile QC method that is computationally efficient, easy to use and install, robust to diverse sequencing assays and protocols, and accessible to bioinformaticians, experimentalists and wet-lab technicians alike. An installation-free, easy-to-use, interactive interface that allows exploration of the QC data across samples and data types is often not available, and a direct comparison to public data frequently entails downloading gigabytes of alignment files to compute background distributions and to estimate acceptable ranges for each quality parameter. Alfred fills this gap by computing key quality statistics rapidly across hundreds of samples and by providing a versatile web front end for data exploration and comparison to publicly available data resources. For each sample, the method generates an extensible JSON file with the key QC metrics that can be merged across samples and then explored interactively at https://gear.embl.de/alfred. The Alfred command-line interface is available as a statically linked binary on GitHub, via Bioconda or as a minimal Docker container. Alfred provides a wide range of QC metrics, some of general relevance such as the insert size distribution and others more targeted to specific sequencing assays, such as the on-target rate for capture assays (Supplementary Table S1).
Charts can be grouped to facilitate comparison of relevant distributions: the insert size distribution, for instance, is stratified by paired-end orientation to distinguish paired-end and mate-pair libraries frequently employed in structural variant calling (Rausch et al., 2012). Alfred additionally provides fast methods for feature counting and feature annotation, generation of browser tracks and methods to analyze alignments in a haplotype-resolved manner. All of these functions are competitive with respect to runtime and memory usage compared to commonly used tools in each of these application areas (Supplementary Table S2). As input, Alfred supports BAM and CRAM files (Hsi-Yang Fritz et al., 2011).

2 Materials and methods

Alfred uses sub-commands for BAM/CRAM statistics, feature counting and feature annotation. The main methods are outlined below. All backend code is open-source and written in C++ using HTSlib (Li et al., 2009) and Boost (Schaeling, 2011).

2.1 BAM QC metrics

Alfred parses the BAM file only once and pre-allocates data structures for counting primary, secondary, supplementary and spliced alignments. Paired-end orientations are counted by type (F-, F+, R-, R+) and sequencing error rates are computed separately for mismatch, insertion and deletion errors (Supplementary Fig. S1). InDel sizes and potential homopolymer sequence regions are cataloged, and a fragment-based GC bias curve is estimated from the reference context. If a BED file of target regions is provided, Alfred computes the on-target rate and the target coverage distribution as well as the overall enrichment of targeted regions. For tagged BAM files that utilize unique molecular identifiers (UMIs), the number of UMIs and the fraction of tagged molecules are computed. For haplotype-tagged files, the number of phased blocks and the N50 phased block length are computed. All QC output files are gzip-compressed to save space. Two output formats are available: first, a block-structured tab-delimited file format, as in samtools stats, to efficiently filter (‘grep’) desired statistics in pipelines and computational workflows; and second, a succinct and extensible JSON format that can be visualized in our companion web application.
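As an illustration of the phased-block summary mentioned above, the following minimal Python sketch computes the number of blocks and their N50 length using the standard N50 definition (the largest length L such that blocks of length at least L cover half of the total phased length). The block coordinates and the function name `n50` are hypothetical and chosen only for this example; the source does not describe how Alfred implements this internally.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of lengths (0 if there are none)."""
    ordered = sorted((length for length in lengths if length > 0), reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half the total.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical phased blocks as half-open (start, end) coordinates.
blocks = [(0, 500), (1000, 1300), (2000, 2100), (3000, 3100)]
block_lengths = [end - start for start, end in blocks]
print("phased blocks:", len(blocks))
print("N50 block length:", n50(block_lengths))
```

Because the N50 weights blocks by their length, a few long blocks dominate the statistic, which is why it is commonly reported alongside the raw block count.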

2.2 Feature counting and feature annotation

Alfred supports counting reads in overlapping or non-overlapping windows, at pre-defined intervals in BED format or as gene and transcript counting for RNA-Seq in stranded or unstranded mode using a GTF or GFF3 gene annotation file. Expression values can be reported as raw counts or normalized as FPKM or FPKM-UQ values. Additionally, browser tracks in UCSC bedGraph format can be computed with configurable resolution. Alfred also supports annotation of ChIP-Seq and ATAC-Seq peaks for neighboring genes or transcription factor binding sites based on motif alignments.
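The normalizations named above can be sketched in a few lines of Python. FPKM divides a gene's fragment count by its length in kilobases and by the total number of counted fragments in millions; FPKM-UQ replaces the total with an upper-quartile count. The exact upper-quartile convention used by Alfred is not given in the source, so the version below (nearest-rank 75th percentile over genes with non-zero counts) is an assumption, as are the gene names, counts and lengths.

```python
import math
from typing import Dict, List


def fpkm(counts: Dict[str, int], lengths: Dict[str, int]) -> Dict[str, float]:
    """FPKM = count * 1e9 / (gene length in bp * total counted fragments)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: counts[gene] * 1e9 / (lengths[gene] * total) for gene in counts}


def upper_quartile(values: List[int]) -> int:
    """Nearest-rank 75th percentile of the non-zero values (assumed convention)."""
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        return 0
    rank = math.ceil(0.75 * len(nonzero))
    return nonzero[rank - 1]


def fpkm_uq(counts: Dict[str, int], lengths: Dict[str, int]) -> Dict[str, float]:
    """FPKM-UQ = count * 1e9 / (gene length in bp * upper-quartile count)."""
    uq = upper_quartile(list(counts.values()))
    if uq == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: counts[gene] * 1e9 / (lengths[gene] * uq) for gene in counts}


# Hypothetical raw counts and gene lengths (bp).
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}
for gene in counts:
    print(gene, fpkm(counts, lengths)[gene], fpkm_uq(counts, lengths)[gene])
```

Scaling by the upper quartile rather than the total makes the values less sensitive to a handful of very highly expressed genes, which is the usual motivation for the FPKM-UQ variant.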

2.3 Haplotype tagging and allele-specific applications

With the advent of long reads and haplotype-resolved sequencing protocols such as 10X Genomics or Strand-Seq (Porubsky et al., 2017), there is an increasing need to split BAM files by haplotype and perform haplotype-aware downstream analyses. Alfred provides basic functions to haplo-tag BAM files based on phased VCF files and generates allele-specific count tables for subsequent analyses. For error-prone long-read datasets, haplotype-resolved BAM files in conjunction with Alfred’s alignment methods can also be used to generate highly accurate haplotype-specific consensus sequences (Supplementary Material).
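The idea behind haplo-tagging can be illustrated with a minimal Python sketch: each read is assigned to the haplotype whose alleles it matches at more phased heterozygous sites, and reads with no informative sites or a tie stay unassigned. The majority-vote rule, the `min_margin` parameter, the function name and the toy data are assumptions made for illustration; the source does not specify the assignment rule Alfred applies.

```python
from collections import Counter
from typing import Dict, Optional, Tuple

# Position -> (allele on haplotype 1, allele on haplotype 2), as would be
# read from the phased genotypes of a VCF file (hypothetical toy data here).
PhasedSites = Dict[int, Tuple[str, str]]


def assign_haplotype(
    read_bases: Dict[int, str], phased: PhasedSites, min_margin: int = 1
) -> Optional[int]:
    """Return 1 or 2 by majority vote over phased sites, or None if undecided."""
    votes: Counter = Counter()
    for pos, base in read_bases.items():
        alleles = phased.get(pos)
        if alleles is None:
            continue  # the read covers a position that is not a phased site
        hap1_allele, hap2_allele = alleles
        if base == hap1_allele:
            votes[1] += 1
        elif base == hap2_allele:
            votes[2] += 1
    if abs(votes[1] - votes[2]) < min_margin:
        return None  # tie or no informative site
    return 1 if votes[1] > votes[2] else 2


phased = {100: ("A", "G"), 200: ("C", "T"), 300: ("G", "A")}
reads = {
    "read1": {100: "A", 200: "C", 300: "A"},  # two votes for haplotype 1, one for 2
    "read2": {100: "G", 200: "T"},            # both votes for haplotype 2
    "read3": {100: "A", 200: "T"},            # tie, left unassigned
    "read4": {150: "A"},                      # no phased site covered
}
tags = {name: assign_haplotype(bases, phased) for name, bases in reads.items()}

# Allele-specific count table: number of reads assigned to each haplotype.
table = Counter(tag for tag in tags.values() if tag is not None)
print(tags)
print(dict(table))
```

Raising `min_margin` trades sensitivity for specificity, which may be useful for error-prone long reads where single mismatching bases are common.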

2.4 Multi-sample web application

Alfred’s JSON files can be visualized with the companion web application, which is built with standard web technologies (HTML, CSS, JavaScript and SVG) and thus can readily be used with common web browsers. Importantly, this allows using the application on different operating systems and without any installation procedure. All charts are interactive, supporting panning and zooming, and all charts and tables can be downloaded as PNG and CSV files, respectively. Due to its client-only design (i.e. no server is involved), the application can also be installed easily and used offline or embedded in other websites, for example paper companion sites, to provide QC statistics transparently across all samples analyzed in a given study. The application adapts to different sequencing protocols, and several features are geared towards specific applications, such as the on-target rate measurement available for capture-based sequencing assays. The web application also hosts a set of JSON QC files that span a wide range of sequencing assays (DNA-Seq, RNA-Seq, ATAC-Seq, ChIP-Seq, HiC) and sequencing technologies (PacBio, Oxford Nanopore Technologies, Illumina), providing a reference resource for researchers who need to compare QC statistics.

3 Discussion

Alfred is a comprehensive alignment QC, feature counting and feature annotation method that complements specialized QC packages available for specific sequencing assays by providing an easy-to-use, cross-platform interface that allows read-group aware multi-sample comparisons. Alfred supports third-generation sequencing technologies and can handle 10X Genomics datasets, Strand-Seq data and reads derived from nanopore sequencing, where it readily enables haplotype-resolved analyses.

References

1. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data.

Authors: Ravi K Patel; Mukesh Jain
Journal: PLoS One Date: 2012-02-01 Impact factor: 3.240

2. Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors: Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal: Genome Res Date: 2011-01-18 Impact factor: 9.043

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. RNA-SeQC: RNA-seq metrics for quality control and process optimization.

Authors: David S DeLuca; Joshua Z Levin; Andrey Sivachenko; Timothy Fennell; Marc-Danie Nazaire; Chris Williams; Michael Reich; Wendy Winckler; Gad Getz
Journal: Bioinformatics Date: 2012-04-25 Impact factor: 6.937

5. Poretools: a toolkit for analyzing nanopore sequence data.

Authors: Nicholas J Loman; Aaron R Quinlan
Journal: Bioinformatics Date: 2014-08-20 Impact factor: 6.937

6. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data.

Authors: Konstantin Okonechnikov; Ana Conesa; Fernando García-Alcalde
Journal: Bioinformatics Date: 2015-10-01 Impact factor: 6.937

7. DELLY: structural variant discovery by integrated paired-end and split-read analysis.

Authors: Tobias Rausch; Thomas Zichner; Andreas Schlattl; Adrian M Stütz; Vladimir Benes; Jan O Korbel
Journal: Bioinformatics Date: 2012-09-15 Impact factor: 6.937

8. CHANCE: comprehensive software for quality control and validation of ChIP-seq data.

Authors: Aaron Diaz; Abhinav Nellore; Jun S Song
Journal: Genome Biol Date: 2012-10-15 Impact factor: 13.583

9. Standardization and quality management in next-generation sequencing.

Authors: Christoph Endrullat; Jörn Glökler; Philipp Franke; Marcus Frohme
Journal: Appl Transl Genom Date: 2016-07-01

10. Dense and accurate whole-chromosome haplotyping of individual genomes.

Authors: David Porubsky; Shilpa Garg; Ashley D Sanders; Jan O Korbel; Victor Guryev; Peter M Lansdorp; Tobias Marschall
Journal: Nat Commun Date: 2017-11-03 Impact factor: 14.919
