
Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data.

Konstantin Okonechnikov, Ana Conesa, Fernando García-Alcalde.

Abstract

MOTIVATION: Detection of random errors and systematic biases is a crucial step of a robust pipeline for processing high-throughput sequencing (HTS) data. Bioinformatics software tools capable of performing this task are available, either for general analysis of HTS data or targeted to a specific sequencing technology. However, most of the existing QC instruments only allow processing of one sample at a time.
RESULTS: Qualimap 2 represents a next step in the QC analysis of HTS data. Along with comprehensive single-sample analysis of alignment data, it includes new modes that allow simultaneous processing and comparison of multiple samples. As with the first version, the new features are available via both graphical and command line interface. Additionally, it includes a large number of improvements proposed by the user community.
AVAILABILITY AND IMPLEMENTATION: The implementation of the software along with documentation is freely available at http://www.qualimap.org.
CONTACT: meyer@mpiib-berlin.mpg.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 26428292      PMCID: PMC4708105          DOI: 10.1093/bioinformatics/btv566

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

High-throughput sequencing (HTS) is a powerful discovery method applied in genomics, transcriptomics and other omics disciplines. Projects such as ENCODE (Rosenbloom et al.) or BLUEPRINT (Adams et al.) have generated terabytes of sequencing data, providing new insights into the molecular mechanisms of the cell. The sequencing technology itself has been improving continuously, allowing longer reads and deeper coverage at lower cost (Sims et al.). However, despite these advantages, HTS is prone to random errors and systematic biases, including polymerase chain reaction amplification problems, GC-content shift and read contamination (Ross et al.). To generate reliable conclusions from HTS data, these biases have to be detected and addressed accordingly. In this regard, several bioinformatics tools have been developed to perform quality control (QC) of HTS data by analyzing raw reads and their derivatives in the form of sequencing alignments and other quantitative data (Patel and Jain, 2012).
In the context of large sequencing projects, it is also crucial to have a global overview of all samples in the experiment. Comparing analysis results across samples enables examination of data clustering and detection of possible outliers. There are special toolkits such as StatsDB (Ramirez-Gonzalez et al.) that allow creating detailed multi-sample analysis workflows; however, they require careful construction of custom pipelines. Several existing NGS QC software tools, including RNA-SeQC (DeLuca et al.) and RSeQC (Wang et al.), have only a few options for working with multiple samples. This is a major limitation, since sequencing experiments are often conducted using biological replicates and can include multiple conditions.
Here, we present the second version of Qualimap (García-Alcalde et al.), a toolkit for QC of HTS alignment data. In Qualimap 2, we provide new analysis capabilities that allow multi-sample comparison of sequencing datasets. Additionally, we have added a novel mode for discovery of biases and problems specific to RNA-seq technology, redesigned the read counts QC mode and implemented numerous improvements.

2 Software description

Qualimap is a multiplatform, user-friendly application with both graphical and command-line interfaces. It includes four analysis modes: BAM QC, Counts QC, RNA-seq QC and Multi-sample BAM QC. The latter two modes are introduced for the first time in version 2. Based on the selected type of analysis, users provide input data in the form of a BAM/SAM alignment, GTF/GFF/BED annotation and/or read counts table. The results of the QC analysis are presented as an interactive report within the graphical user interface, as a static report in HTML, as a PDF, or as a plain text file suitable for parsing and further processing. Typically, the report contains summary statistics of the dataset, a description of the input data, and exploratory plots and histograms that visualize certain aspects of the data and help to detect potential problems.
One of the major new developments in Qualimap 2 is the analysis mode called Multi-sample BAM QC, which allows combined QC estimation of multiple alignment files. For this purpose, Qualimap uses the metrics computed during the single-sample BAM QC procedure as input. The program loads the QC analysis results from each sample and creates a number of combined and normalized plots comparing specific properties. The types of generated plots correspond to the single-sample BAM QC analysis plots. Analyzed samples can have different coverage depth or experiment type, or even derive from different organisms. The simultaneous comparison of multiple samples allows examination of consistency between samples and visual detection of outliers (Fig. 1A). To estimate the variability between analyzed datasets, Qualimap performs a principal component analysis based on specific features derived from the alignment, including coverage, GC content, insert size and mapping quality (Fig. 1B).
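To make the idea of a PCA over per-sample alignment features concrete, the following is a minimal sketch in plain Python. It is not Qualimap's implementation: the sample names, the metric values and the helper functions (`standardize`, `first_component`, `project`) are all hypothetical, and the first principal component is obtained by simple power iteration rather than by whatever method the tool actually uses. The invented values mimic a situation in which one batch shows a shifted GC content.

```python
import math

# Hypothetical per-sample summary metrics (values invented for illustration):
# mean coverage, GC content (%), median insert size, mean mapping quality.
SAMPLES = {
    "batch1_a": [30.0, 41.0, 200.0, 55.0],
    "batch1_b": [32.0, 42.0, 205.0, 56.0],
    "batch2_a": [29.0, 48.0, 180.0, 50.0],
    "batch2_b": [31.0, 49.0, 178.0, 49.0],
    "batch3_a": [30.0, 41.5, 202.0, 55.5],
    "batch3_b": [31.0, 42.5, 203.0, 56.5],
}


def standardize(rows):
    """Scale every feature column to zero mean and unit sample variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
            for col, m in zip(columns, means)]
    # A constant column would give std 0; leave it unscaled in that case.
    stds = [s if s > 0 else 1.0 for s in stds]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def first_component(data, iterations=200):
    """Approximate the leading eigenvector of the covariance matrix
    of already-centered data by power iteration."""
    n, p = len(data), len(data[0])
    cov = [[sum(row[i] * row[j] for row in data) / (n - 1) for j in range(p)]
           for i in range(p)]
    vec = [1.0 / math.sqrt(p)] * p  # arbitrary starting direction
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def project(data, vec):
    """Score of each sample along the given component."""
    return [sum(v * w for v, w in zip(row, vec)) for row in data]


names = list(SAMPLES)
data = standardize(list(SAMPLES.values()))
pc1 = first_component(data)
scores = dict(zip(names, project(data, pc1)))

for name, score in scores.items():
    print(f"{name}\tPC1 = {score:+.2f}")
```

In this toy dataset the two `batch2` samples end up on one side of the first component and all other samples on the other side, which is the kind of grouping a PCA biplot makes visible. Standardizing the features first matters because coverage, insert size and mapping quality are measured on very different scales.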
Fig. 1. Multi-sample BAM QC analysis of a γH2AX ChIP-seq experiment in human cells comparing four different conditions (Koeppel et al.). The sequencing was performed in three batches. A single batch included samples in all conditions. (A) The GC-content distribution indicates a problem with the samples from the second batch. (B) The PCA biplot also demonstrates the second batch grouped together, despite different biological treatments.

Qualimap 2 also introduces a novel analysis mode called RNA-seq QC. This mode allows computation of metrics specific to RNA-seq data, including per-transcript coverage, junction sequence distribution, genomic localization of reads, 5′–3′ bias and consistency of the library protocol. A detailed comparison of Qualimap with RSeQC and RNA-SeQC, tools that are focused on a similar goal, can be found in Supplementary Table S1. The most significant difference from other tools is the subsequent RNA-seq QC analysis step that Qualimap performs after computation of read counts.
The Counts QC mode was completely redesigned to allow processing of multiple samples. This mode estimates the quality of the read counts that are derived from intersecting sequencing alignments with genomic features. Such counts are typically used for analysis of differential gene expression from RNA-seq data (Anders et al.). Having multiple biological replicates per condition is common in RNA-seq experiments; therefore, it is beneficial to be able to analyze counts data from all generated datasets simultaneously. Multi-sample analysis of read counts allows inspection of sample grouping, as well as discovery of outliers and batch effects. Similar to the previous version, the Counts QC mode estimates the saturation of sequencing depth, read count densities, correlation of samples and distribution of counts among classes of selected features (Supplementary Figs. S1–S4).
Additionally, new plots that explore the relationship between expression values and GC content or transcript length are available to users. Counts QC is based on the NOISeq package for gene expression estimation (Tarazona et al.). The analysis results include a combined overview of the counts from all samples along with a QC report for each individual sample. Moreover, the analyzed datasets can belong to different conditions, e.g. treated and untreated. In this case, plots comparing groups of sample counts corresponding to particular conditions are generated (Supplementary Fig. S5).
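As an illustration of how sample correlation on read counts can expose an outlier, here is a small sketch in plain Python. It is only an assumption-laden example and does not reproduce the Counts QC or NOISeq computations: the count table, the sample names and the `log_counts`, `pearson` and `mean_correlations` helpers are hypothetical, and the log2(count + 1) transform is just one common choice for stabilizing count data before correlating.

```python
import math

# Hypothetical read counts for six genes in four samples (invented values).
COUNTS = {
    "ctrl_1": [120, 0, 35, 980, 15, 410],
    "ctrl_2": [110, 2, 40, 1010, 12, 395],
    "treated_1": [130, 1, 30, 950, 18, 430],
    "odd_sample": [5, 300, 900, 20, 700, 3],
}


def log_counts(counts):
    """log2(count + 1) transform; the +1 keeps zero counts defined."""
    return [math.log2(c + 1) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlations(table):
    """Average correlation of every sample with all other samples."""
    logged = {name: log_counts(values) for name, values in table.items()}
    result = {}
    for name, values in logged.items():
        others = [pearson(values, other)
                  for other_name, other in logged.items()
                  if other_name != name]
        result[name] = sum(others) / len(others)
    return result


mean_corr = mean_correlations(COUNTS)
for name, value in sorted(mean_corr.items(), key=lambda item: item[1]):
    print(f"{name}\tmean correlation = {value:+.2f}")
```

A sample whose mean correlation with the rest is far below that of its peers, like `odd_sample` here, is a candidate outlier worth inspecting before any differential expression analysis.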

3 Results and conclusion

Qualimap 2 is an application for exploratory analysis and QC of HTS alignment data written in Java and R. The major enhancement over the previous version lies in the ability to perform multi-sample analyses. Additionally, a large number of bug fixes and enhancements have been implemented since the initial release. An overview of novel features can be found in Table 1 and Supplementary Materials. In the present version, we have kept the concept of a simple, user-friendly application that follows an ‘open-source’ path. Qualimap 2 has gathered a community of users who frequently suggest new features and contribute their code. Notably, most of the novel features in BAM QC mode were proposed and tested by users. The public repository of Qualimap is hosted at bitbucket.org/kokonech/qualimap.
Table 1. Qualimap 2: overview of novel features

Mode | Novel features and improvements
BAM QC | Advanced statistics of coverage, insert size, mismatch rate, etc.; duplicates extraction; homopolymer size control; performance and output data adaption
Multi-sample BAM QC | Comparison of coverage, GC-content, insert size etc. from multiple samples along with PCA-based summary
RNA-seq QC | Transcript coverage, 5′–3′ bias, alignment distribution, junction, strand-specificity analysis; counts computation
Counts QC | Multi-sample analysis (expression level, biotype, etc.) and condition comparison (expression level, GC bias, etc.)
References

1.  RSeQC: quality control of RNA-seq experiments.

Authors:  Liguo Wang; Shengqin Wang; Wei Li
Journal:  Bioinformatics       Date:  2012-06-27       Impact factor: 6.937

2.  NGS QC Toolkit: a toolkit for quality control of next generation sequencing data.

Authors:  Ravi K Patel; Mukesh Jain
Journal:  PLoS One       Date:  2012-02-01       Impact factor: 3.240

3.  Count-based differential expression analysis of RNA sequencing data using R and Bioconductor.

Authors:  Simon Anders; Davis J McCarthy; Yunshun Chen; Michal Okoniewski; Gordon K Smyth; Wolfgang Huber; Mark D Robinson
Journal:  Nat Protoc       Date:  2013-08-22       Impact factor: 13.491

4.  Qualimap: evaluating next-generation sequencing alignment data.

Authors:  Fernando García-Alcalde; Konstantin Okonechnikov; José Carbonell; Luis M Cruz; Stefan Götz; Sonia Tarazona; Joaquín Dopazo; Thomas F Meyer; Ana Conesa
Journal:  Bioinformatics       Date:  2012-08-22       Impact factor: 6.937

5.  Helicobacter pylori Infection Causes Characteristic DNA Damage Patterns in Human Cells.

Authors:  Max Koeppel; Fernando Garcia-Alcalde; Frithjof Glowinski; Philipp Schlaermann; Thomas F Meyer
Journal:  Cell Rep       Date:  2015-06-11       Impact factor: 9.423

6.  BLUEPRINT to decode the epigenetic signature written in blood.

Authors:  David Adams; Lucia Altucci; Stylianos E Antonarakis; Juan Ballesteros; Stephan Beck; Adrian Bird; Christoph Bock; Bernhard Boehm; Elias Campo; Andrea Caricasole; Fredrik Dahl; Emmanouil T Dermitzakis; Tariq Enver; Manel Esteller; Xavier Estivill; Anne Ferguson-Smith; Jude Fitzgibbon; Paul Flicek; Claudia Giehl; Thomas Graf; Frank Grosveld; Roderic Guigo; Ivo Gut; Kristian Helin; Jonas Jarvius; Ralf Küppers; Hans Lehrach; Thomas Lengauer; Åke Lernmark; David Leslie; Markus Loeffler; Elizabeth Macintyre; Antonello Mai; Joost H A Martens; Saverio Minucci; Willem H Ouwehand; Pier Giuseppe Pelicci; Hèléne Pendeville; Bo Porse; Vardhman Rakyan; Wolf Reik; Martin Schrappe; Dirk Schübeler; Martin Seifert; Reiner Siebert; David Simmons; Nicole Soranzo; Salvatore Spicuglia; Michael Stratton; Hendrik G Stunnenberg; Amos Tanay; David Torrents; Alfonso Valencia; Edo Vellenga; Martin Vingron; Jörn Walter; Spike Willcocks
Journal:  Nat Biotechnol       Date:  2012-03-07       Impact factor: 54.908

7.  Sequencing depth and coverage: key considerations in genomic analyses.

Authors:  David Sims; Ian Sudbery; Nicholas E Ilott; Andreas Heger; Chris P Ponting
Journal:  Nat Rev Genet       Date:  2014-02       Impact factor: 53.242

8.  RNA-SeQC: RNA-seq metrics for quality control and process optimization.

Authors:  David S DeLuca; Joshua Z Levin; Andrey Sivachenko; Timothy Fennell; Marc-Danie Nazaire; Chris Williams; Michael Reich; Wendy Winckler; Gad Getz
Journal:  Bioinformatics       Date:  2012-04-25       Impact factor: 6.937

9.  ENCODE data in the UCSC Genome Browser: year 5 update.

Authors:  Kate R Rosenbloom; Cricket A Sloan; Venkat S Malladi; Timothy R Dreszer; Katrina Learned; Vanessa M Kirkup; Matthew C Wong; Morgan Maddren; Ruihua Fang; Steven G Heitner; Brian T Lee; Galt P Barber; Rachel A Harte; Mark Diekhans; Jeffrey C Long; Steven P Wilder; Ann S Zweig; Donna Karolchik; Robert M Kuhn; David Haussler; W James Kent
Journal:  Nucleic Acids Res       Date:  2012-11-27       Impact factor: 16.971

10.  Characterizing and measuring bias in sequence data.

Authors:  Michael G Ross; Carsten Russ; Maura Costello; Andrew Hollinger; Niall J Lennon; Ryan Hegarty; Chad Nusbaum; David B Jaffe
Journal:  Genome Biol       Date:  2013-05-29       Impact factor: 13.583
