
QPLOT: a quality assessment tool for next generation sequencing data.

Bingshan Li, Xiaowei Zhan, Mary-Kate Wing, Paul Anderson, Hyun Min Kang, Goncalo R Abecasis.

Abstract

BACKGROUND: Next generation sequencing (NGS) is being widely used to identify genetic variants associated with human disease. Although the approach is cost effective, the underlying data is susceptible to many types of error. Because NGS technologies and protocols are rapidly evolving, with constantly changing steps ranging from sample preparation to data processing software updates, it is important to enable researchers to routinely assess the quality of sequencing and alignment data prior to downstream analyses.
RESULTS: Here we describe QPLOT, an automated tool that can facilitate the quality assessment of sequencing run performance. Taking standard sequence alignments as input, QPLOT generates a series of diagnostic metrics summarizing run quality and produces convenient graphical summaries for these metrics. QPLOT is computationally efficient, generates webpages for interactive exploration of detailed results, and can handle the joint output of many sequencing runs.
CONCLUSION: QPLOT is an automated tool that facilitates assessment of sequence run quality. We routinely apply QPLOT to ensure quick detection and diagnosis of sequencing run problems. We hope that QPLOT will be useful to the community as well.

Year: 2013 PMID: 24319692 PMCID: PMC3844194 DOI: 10.1155/2013/865181

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

Next generation sequencing (NGS) is a revolutionary technology for biomedical research and is being deployed in a variety of applications, ranging from the identification of rare variants, de novo mutations, and somatic mutations in human disease studies to assessments of transcriptome and epigenome states in cultured cells. Since NGS provides more complete results than traditional array technologies and is rapidly decreasing in cost, it is becoming more widely used for genomics studies. Whole exome sequencing, which is the targeted sequencing of the entire collection of protein coding regions in the genome, has already led to great advances in Mendelian disorder genetics [1, 2], complex traits [3, 4], and cancer genomics [5, 6]. The 1000 Genomes Project [7, 8] is leading an effort to provide a comprehensive catalog of human variation across the world through whole genome sequencing. Several ongoing studies are now deploying whole genome and whole exome sequencing to study large collections of human disease samples. The success of NGS studies depends on appropriately understanding the quality of underlying data. However, analysis of sequencing data is much more complex than analysis of data from traditional array platforms, which makes real-time monitoring of data quality more challenging. NGS technologies and the associated protocols are constantly evolving, and updates to several different components of the process (including, e.g., software, sample preparation, and/or reagents) can result in important and sometimes unexpected changes in data quality. We believe that the ability to generate automated visual summaries that help identify common problems is critical. To achieve this, we developed QPLOT, a tool for quick quality assessment in NGS data. QPLOT calculates and graphs summary statistics describing sequence and alignment quality.
Data quality is assessed both through reported base quality scores and through empirical metrics obtained by comparing aligned bases to the reference genome. In this way, it is possible to track the number of high quality bases along the length of a read (to choose a read length that maximizes the yield of high quality bases and to compare run quality over time) or to identify the presence of adaptor sequence and other problems in alignment. In RNA-sequencing reads, for example, difficulties in correctly placing splice junctions can result in high empirical mismatch rates near the ends of reads, a problem that can be ameliorated by excluding these bases from variant calling and RNA editing analyses after alignment. We constantly interact with our sequencing core and other collaborators generating sequence data to improve QPLOT and facilitate efforts to drive up the quality of next generation sequence data. QPLOT differs from tools that only inspect unaligned sequence reads (such as FastQC [9] and SolexaQA [10]), because it can identify common problems in alignment and provide diagnostic descriptions of read mapping. For example, it generates empirically calibrated base quality scores and insert size distributions, two features that have substantial impact on variant calling and other downstream analyses. QPLOT also tries to improve on packages designed specifically for handling aligned data (such as SAMStat [11] and Picard [12]) through its computational efficiency (QPLOT can sample regions of the genome randomly so as to rapidly evaluate very large alignments) and its ability to handle many samples (which helps to identify batch effects and other transient data processing problems). Importantly, we note that genome-wide summary statistics can be extrapolated from randomly sampled regions with little loss of accuracy.
When the number of input files is very large, QPLOT can generate XML and text files with raw summary data and an interactive webpage that allows users to explore available quality metrics and graphs. XML and text output can be conveniently stored in a tracking database. In addition to graphical representation, key features are also summarized to generate a concise representation of the quality measurement (e.g., a mean squared difference is used to summarize concordance of empirical and reported base quality scores, and the impact of GC content is summarized in a similar fashion based on the deviation of the depth for each GC bin from uniform coverage).
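The two one-number summaries mentioned above can be sketched as follows. This is a minimal illustration rather than QPLOT's actual code: the function names are hypothetical, the averages are unweighted (an assumption; the weighting used by QPLOT is not stated here), and "uniform coverage" is taken to mean a normalized depth of 1.0 in every GC bin.

```python
def mean_squared_difference(reported, empirical):
    """Mean squared difference between paired reported and empirical quality scores.

    Both arguments are equal-length sequences; element i holds the reported
    score and the empirically estimated score for the same quality bin.
    Unweighted averaging is an assumption made for this sketch.
    """
    if len(reported) != len(empirical) or len(reported) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(r - e) ** 2 for r, e in zip(reported, empirical)]
    return sum(squared) / len(squared)


def gc_bias_score(normalized_depth):
    """Mean squared deviation of per-GC-bin normalized depth from 1.0.

    A normalized depth of 1.0 in a bin means coverage matches expectation,
    so a score of 0.0 corresponds to perfectly uniform coverage.
    """
    if len(normalized_depth) == 0:
        raise ValueError("need at least one GC bin")
    squared = [(d - 1.0) ** 2 for d in normalized_depth]
    return sum(squared) / len(squared)
```

Smaller values indicate better agreement in both cases, which makes the two scores convenient to store per run and compare over time.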

2. Materials and Methods

QPLOT is implemented in C++ and invokes R to generate figures. Available statistics include summaries of base quality, both overall and along each position in a read, comparisons of reported and empirical base quality scores, summaries of insert size for paired end libraries, and global evaluations of coverage, as well as more detailed evaluations of coverage as a function of GC content and of the regions targeted for enrichment. Empirical base quality scores are calculated as Phred scaled mismatch rates, that is, −10 × log10 (number of mismatches/(number of matches + number of mismatches)), where number of matches and number of mismatches are the counts of aligned sequence bases that are concordant or discordant with the expected base in the reference genome, respectively. Known variant sites are excluded, so the remaining mismatches are dominated by genuine sequencing errors and provide a basis for base quality recalibration. To describe potential GC bias in sequencing runs, we calculate the mean depth of coverage for each GC content bin (0–100 representing 0–100% GC composition) for a series of windows along the genome (or, in the case of targeted sequencing experiments, within targeted regions). After normalization by the expected depth based on total mapped reads, the normalized depth for each GC content bin reflects biases of each experiment and can be compared across sequenced samples. Details of other summary statistics are available on the QPLOT website (http://genome.sph.umich.edu/wiki/QPLOT). QPLOT can be run as a stand-alone tool or incorporated into automated data processing pipelines.
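As a rough illustration of the two calculations described in this section, the sketch below computes a Phred scaled mismatch rate, taking the fraction of mismatching bases as the error rate, and a normalized depth per GC bin. It is not taken from the QPLOT source: the function names, the input layout (one `(gc_bin, mean_depth)` pair per window), and the handling of zero mismatches are all assumptions made for the example, and the expected depth is passed in rather than derived from total mapped reads.

```python
import math
from collections import defaultdict


def empirical_phred(matches, mismatches):
    """Phred scaled mismatch rate: -10 * log10(mismatches / (matches + mismatches))."""
    total = matches + mismatches
    if total <= 0:
        raise ValueError("no aligned bases observed")
    if mismatches == 0:
        # Assumption for this sketch: report infinity when no mismatch was seen.
        return float("inf")
    return -10.0 * math.log10(mismatches / total)


def normalized_depth_by_gc(windows, expected_depth):
    """Mean depth per GC bin divided by the expected depth.

    windows: iterable of (gc_bin, mean_depth) pairs, one per genomic window,
             where gc_bin is an integer from 0 to 100 (percent GC).
    expected_depth: depth expected under uniform coverage (supplied by caller).
    Returns a dict mapping each observed GC bin to its normalized depth.
    """
    if expected_depth <= 0:
        raise ValueError("expected_depth must be positive")
    depth_sum = defaultdict(float)
    window_count = defaultdict(int)
    for gc_bin, depth in windows:
        if not 0 <= gc_bin <= 100:
            raise ValueError("gc_bin must lie between 0 and 100")
        depth_sum[gc_bin] += depth
        window_count[gc_bin] += 1
    return {
        gc_bin: depth_sum[gc_bin] / window_count[gc_bin] / expected_depth
        for gc_bin in sorted(depth_sum)
    }
```

For instance, 10 mismatches among 1,000 aligned bases is a mismatch rate of 1%, which corresponds to an empirical quality of about 20. A normalized depth below 1 in a GC bin indicates that windows with that GC content are underrepresented.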

3. Results and Discussion

We regularly use QPLOT in our sequencing projects including whole genome sequencing, RNA-seq, and targeted sequencing. Results for one Illumina run in a whole genome low pass sequencing study are shown in Figure 1. In this run, the reported base quality scores deviate from empirically assessed quality, indicating that base quality recalibration is recommended (Figure 1(a)). As expected, empirical base quality scores decrease with increasing position along reads (Figure 1(b)), which is typical of Illumina sequencing. However, at position 36 empirical quality scores appear to increase, an artifact of the −q 15 option used in BWA [13] when mapping these data. The −q 15 option trims portions of reads with base quality <15, but always leaves at least 36 bases in each read (in our experience, this option increases the fraction of mapped reads and the number of mapped high quality bases). In this run, sequences with very high or very low GC content are underrepresented (below 1 in the relative depth curve, Figure 1(c)). Assessment of paired reads shows a distribution of insert sizes with peaks ranging from ~240 bp to ~300 bp (Figure 1(d)). In this case, since reads are 120 bases long, many paired reads overlap (particularly in lanes 1, 3, 5, and 7); these overlaps, if ignored, can result in PCR artifacts that look like sequence variants, which suggests that the protocol might be tweaked to increase library insert sizes. When we compared metrics generated by evaluating the complete data with those extrapolated from random 5 Mb segments of the genome, the two sets of summary statistics were remarkably similar (see the QPLOT webpage for examples), but computing time was reduced from 38 minutes to 13 minutes.
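The relationship between read length, insert size, and read overlap can be made concrete with a small sketch. It assumes that insert size means the full length of the sequenced fragment (outer distance between the two read ends) and that both reads have the same length; the function name is hypothetical and this is not part of QPLOT.

```python
def overlapping_bases(read_length, insert_size):
    """Number of fragment bases covered by both reads of a pair.

    Assumes insert_size is the full fragment length and both reads are
    read_length bases long. When the fragment is shorter than a single
    read, the overlap is capped at the fragment length.
    """
    if read_length <= 0 or insert_size <= 0:
        raise ValueError("read_length and insert_size must be positive")
    return max(0, min(2 * read_length - insert_size, insert_size))


if __name__ == "__main__":
    # With 120-base reads, pairs overlap only when the insert is below 240 bp.
    for size in (200, 240, 300):
        print(size, overlapping_bases(120, size))
```

Under these assumptions, a 200 bp insert sequenced with 120-base reads has 40 bases covered by both reads, whereas inserts of 240 bp or more have none, which is why inserts at the lower end of the observed distribution lead to overlapping pairs.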

Figure 1

A subset of figures generated by QPLOT on an Illumina run. (a) Empirical base quality scores versus the scores stored in the BAM files. (b) Empirical base quality scores by cycles. (c) Bias of depth by GC content. (d) Insert size distribution.

In a second example, Figure 2 summarizes the results of an RNA-sequencing run. Here, empirical base quality scores are unexpectedly low near the beginning of each read (Figure 2(a)). When we remapped all reads after trimming the first several bases, the same pattern was repeated, suggesting that the observation is not due to high sequencing error rates or residual adapter sequences (trimming and remapping usually solve problems with residual adapter sequences, in our experience). Instead, the observation is the result of alignment artifacts when exon boundaries fall near the beginning or end of reads, a common problem in RNA-sequencing analyses. To avoid artifacts in downstream analyses, we suggest trimming the beginning and end bases of each read after mapping. Figure 2(b) shows that lane 7 has a GC content pattern that is dramatically different from the others, which calls for great caution before comparing gene expression levels estimated for that sample with those of the others [14].

Figure 2

Exemplar diagnosis plots of RNA-sequencing data. (a) Empirical base quality scores by cycles. (b) Differential GC biases across multiple samples.

4. Conclusions

NGS has revolutionized the way genomics and biomedical studies are conducted. However, the technologies are still rapidly evolving, and analysis of NGS data is challenging. Simple and convenient tools are important to help monitor data production and processing. Here we describe QPLOT, a computationally efficient tool that we hope will be helpful in quality assessment and diagnosis of NGS performance. We hope that the information conveyed in these plots and statistics will facilitate the understanding of sequencing data and enable improved downstream processing and constant quality improvements.

12 in total

1. Evolution and functional impact of rare coding variation from deep sequencing of human exomes.

Authors: Jacob A Tennessen; Abigail W Bigham; Timothy D O'Connor; Wenqing Fu; Eimear E Kenny; Simon Gravel; Sean McGee; Ron Do; Xiaoming Liu; Goo Jun; Hyun Min Kang; Daniel Jordan; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; Goncalo Abecasis; David Altshuler; Deborah A Nickerson; Eric Boerwinkle; Shamil Sunyaev; Carlos D Bustamante; Michael J Bamshad; Joshua M Akey
Journal: Science Date: 2012-05-17 Impact factor: 47.728

2. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome.

Authors: Sarah B Ng; Abigail W Bigham; Kati J Buckingham; Mark C Hannibal; Margaret J McMillin; Heidi I Gildersleeve; Anita E Beck; Holly K Tabor; Gregory M Cooper; Heather C Mefford; Choli Lee; Emily H Turner; Joshua D Smith; Mark J Rieder; Koh-Ichiro Yoshiura; Naomichi Matsumoto; Tohru Ohta; Norio Niikawa; Deborah A Nickerson; Michael J Bamshad; Jay Shendure
Journal: Nat Genet Date: 2010-08-15 Impact factor: 38.330

3. Understanding mechanisms underlying human gene expression variation with RNA sequencing.

Authors: Joseph K Pickrell; John C Marioni; Athma A Pai; Jacob F Degner; Barbara E Engelhardt; Everlyne Nkadori; Jean-Baptiste Veyrieras; Matthew Stephens; Yoav Gilad; Jonathan K Pritchard
Journal: Nature Date: 2010-03-10 Impact factor: 49.962

4. Comprehensive molecular characterization of human colon and rectal cancer.

Authors:
Journal: Nature Date: 2012-07-18 Impact factor: 49.962

5. Exome sequencing of extreme phenotypes identifies DCTN4 as a modifier of chronic Pseudomonas aeruginosa infection in cystic fibrosis.

Authors: Mary J Emond; Tin Louie; Julia Emerson; Wei Zhao; Rasika A Mathias; Michael R Knowles; Fred A Wright; Mark J Rieder; Holly K Tabor; Deborah A Nickerson; Kathleen C Barnes; Ronald L Gibson; Michael J Bamshad
Journal: Nat Genet Date: 2012-07-08 Impact factor: 38.330

6. SAMStat: monitoring biases in next generation sequencing data.

Authors: Timo Lassmann; Yoshihide Hayashizaki; Carsten O Daub
Journal: Bioinformatics Date: 2010-11-18 Impact factor: 6.937

7. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data.

Authors: Murray P Cox; Daniel A Peterson; Patrick J Biggs
Journal: BMC Bioinformatics Date: 2010-09-27 Impact factor: 3.169

8. Exome sequencing identifies the cause of a mendelian disorder.

Authors: Sarah B Ng; Kati J Buckingham; Choli Lee; Abigail W Bigham; Holly K Tabor; Karin M Dent; Chad D Huff; Paul T Shannon; Ethylin Wang Jabs; Deborah A Nickerson; Jay Shendure; Michael J Bamshad
Journal: Nat Genet Date: 2009-11-13 Impact factor: 38.330

9. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

10. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962
