
RNA-SeQC: RNA-seq metrics for quality control and process optimization.

David S DeLuca, Joshua Z Levin, Andrey Sivachenko, Timothy Fennell, Marc-Danie Nazaire, Chris Williams, Michael Reich, Wendy Winckler, Gad Getz.

Abstract

RNA-seq, the application of next-generation sequencing to RNA, provides transcriptome-wide characterization of cellular activity. Assessment of sequencing performance and library quality is critical to the interpretation of RNA-seq data, yet few tools exist to address this issue. We introduce RNA-SeQC, a program which provides key measures of data quality. These metrics include yield, alignment and duplication rates; GC bias, rRNA content, regions of alignment (exon, intron and intragenic), continuity of coverage, 3'/5' bias and count of detectable transcripts, among others. The software provides multi-sample evaluation of library construction protocols, input materials and other experimental parameters. The modularity of the software enables pipeline integration and the routine monitoring of key measures of data quality such as the number of alignable reads, duplication rates and rRNA contamination. RNA-SeQC allows investigators to make informed decisions about sample inclusion in downstream analysis. In summary, RNA-SeQC provides quality control measures critical to experiment design, process optimization and downstream computational analysis.
AVAILABILITY AND IMPLEMENTATION: See www.genepattern.org to run online, or www.broadinstitute.org/rna-seqc/ for a command line tool.

Substances:
RNA, Ribosomal
RNA

Year: 2012 PMID: 22539670 PMCID: PMC3356847 DOI: 10.1093/bioinformatics/bts196

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

RNA-seq is a highly parallelized sequencing technology that allows for comprehensive transcriptome characterization and quantification (Wang et al., 2009). As with all forms of parallelized sequencing, significant computational processing is required to derive transcript abundance levels and other measures for biological interpretation (Garber et al., 2011). However, before calculating biologically relevant data such as transcript abundance, presence of novel isoforms and genotype identity, it is necessary to evaluate the performance of the RNA-seq experiment itself. Summary statistics and quality control scores provide insight into inherently complex data prior to downstream analysis. Here we present RNA-SeQC, a metrics tool with application to two domains: (i) experiment design and process optimization; and (ii) quality control prior to computational analysis. Metrics such as duplication rate, rRNA abundance, alignment rates, coverage continuity and correlation to reference expression profiles are highly informative during selection of experiment conditions and library construction methods (Levin et al., 2010). RNA-SeQC's multi-sample input feature allows for direct comparison across samples (Fig. 1). Additionally, a single-sample mode can be used to monitor samples on an ongoing basis, to rapidly assess the quality of a particular sequencing run, and to track and optimize these measures in production over time and prior to downstream analysis. RNA-SeQC provides a suite of experiment quality measures, many of which are currently not provided by other available tools (Supplementary Material).

Fig. 1.

Overview of the RNA-SeQC process. (a) RNA-SeQC will work with one or more input samples to produce both a comparative summary across samples as well as a more detailed report for each sample. (b) The comparative summary report includes an extensive range of metrics (in addition to those shown) as well as coverage plots. (c) For each sample, additional reports quantify the coverage profile (variation, gaps, etc.) for individual transcripts.

2 METRICS

RNA-SeQC provides three types of quality control metrics: Read Counts, Coverage and Correlation. These metrics are listed and described below. RNA-SeQC is compatible with any alignment method that produces a specification-conforming BAM file (Li et al., 2009), with flags properly set. For additional information, usage and software requirements, see the GenePattern help document provided as Supplementary Material 1. Metrics reports are provided in HTML for human consumption, as well as tab-delimited text files for pipeline integration.

2.1 Read counts

The following metrics are generated by counting reads with particular characteristics. Rates are also provided, and are calculated as either per total reads or per aligned reads. Since the BAM format does support multiple alignments per read, this implementation ignores any read flagged as not being a primary alignment.

Total, unique and duplicate reads.
Mapped reads and mapped unique reads.
rRNA reads: counted in one of two modes: (i) interval mode, where an interval file defines the location in the given alignment to which rRNA reads map; and (ii) BWA mode, where an independent Burrows–Wheeler Aligner (Li and Durbin, 2009) alignment to reference rRNA sequences is performed.
Transcript-annotated reads: intragenic (within genes), intergenic (regions between genes), exonic and intronic. These regions are defined in a user-specified GTF file (Supplementary Material). GENCODE annotations (Harrow et al., 2006) are used by default.
Expression profile efficiency: the ratio of exon-mapped reads to the total reads sequenced.
Expressed transcripts: count of transcripts with reads ≥1.
Strand specificity: to assess the performance of strand-specific library construction methods, the percentage of sense-derived reads is given for each end of the read pair. Whereas a non-strand-specific protocol would give values of 50%/50%, strand-specific protocols typically yield 99%/1% or 1%/99% for this metric.
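As an illustration of how such counts can be derived from alignment flags, the following is a minimal Python sketch and not RNA-SeQC's own code (which is written in Java). The flag bits (0x4 unmapped, 0x100 not primary, 0x400 duplicate) come from the SAM/BAM specification. The function names `count_reads` and `rate` are hypothetical, and the sketch assumes that "unique" means "not flagged as duplicate" and that duplicates have already been marked by an upstream tool.

```python
# SAM flag bits defined by the SAM/BAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100   # "not primary alignment"
FLAG_DUPLICATE = 0x400


def count_reads(flags):
    """Tally read-count metrics from an iterable of integer SAM flags.

    Hypothetical helper: assumes one flag per alignment record and that
    "unique" means "not flagged as a duplicate".
    """
    counts = {"total": 0, "mapped": 0, "duplicate": 0, "mapped_unique": 0}
    for flag in flags:
        if flag & FLAG_SECONDARY:
            continue  # ignore non-primary alignments so each read counts once
        counts["total"] += 1
        mapped = not (flag & FLAG_UNMAPPED)
        duplicate = bool(flag & FLAG_DUPLICATE)
        counts["mapped"] += mapped
        counts["duplicate"] += duplicate
        counts["mapped_unique"] += mapped and not duplicate
    counts["unique"] = counts["total"] - counts["duplicate"]
    return counts


def rate(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0
```

For example, `rate(c["mapped"], c["total"])` gives a mapping rate per total reads, while `rate(c["duplicate"], c["mapped"])` gives a duplication rate per aligned reads, mirroring the two denominators mentioned above.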

2.2 Coverage

The following metrics are based on coverage: the number of reads that cover a given genomic position (in units of reads per base). RNA-SeQC quantifies the uniformity of coverage with several different metrics. To reflect the effect of expression level on these metrics, we select genes from three categories: low, middle and high expression genes (see Supplementary Material) and also report the average of these metrics for each gene set.

Mean coverage: the mean number of reads per base.
Mean coefficient of variation: the mean coefficient of variation across all transcripts.
5′/3′ Coverage: the mean per-base coverage for end regions of RNA transcripts. The length of the end region has a default value of 100 base pairs.
Gaps in coverage: a gap is a stretch of sequence of at least 5 base pairs having zero coverage. Both the number of gaps and the summed gap length across all transcripts in the set are reported.
Cumulative gap length: sum of gap lengths of all transcripts.
Downsampling: to normalize data to a specific total read count, we enable an on-the-fly random reduction of reads to reach a user-defined number. This is useful for comparing certain statistics across datasets, e.g. gap metrics, which are not otherwise adjusted for depth.
GC bias: to assess effects of GC content on sequencing performance, all coverage metrics are additionally reported for three levels of transcript GC content: high, low and moderate (see Supplementary Material for default threshold settings).
Coverage plots: plots of coverage level versus base index, either for a single transcript or a set of transcripts.
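The per-transcript coverage metrics can be illustrated with a short Python sketch. It is not the tool's implementation: the function name `coverage_metrics` and its parameters are hypothetical, the input is assumed to be a non-empty list of per-base read depths ordered 5′ to 3′, and the coefficient of variation uses the population standard deviation, which is an assumption rather than something the text specifies. The defaults of 100 bases for the end regions and 5 bases for the minimum gap follow the values stated above.

```python
from statistics import mean, pstdev


def coverage_metrics(depth, end_length=100, min_gap=5):
    """Summarise a per-base coverage profile for one transcript.

    depth: non-empty list of read depths, ordered 5' to 3'.
    Hypothetical helper; population standard deviation is an assumption.
    """
    avg = mean(depth)
    cv = pstdev(depth) / avg if avg else 0.0

    # Collect runs of zero coverage that are at least min_gap bases long.
    gaps = []
    run = 0
    for d in list(depth) + [1]:  # non-zero sentinel closes a trailing run
        if d == 0:
            run += 1
        else:
            if run >= min_gap:
                gaps.append(run)
            run = 0

    return {
        "mean_coverage": avg,
        "cv": cv,
        "five_prime": mean(depth[:end_length]),
        "three_prime": mean(depth[-end_length:]),
        "gap_count": len(gaps),
        "cumulative_gap_length": sum(gaps),
    }
```

Averaging these per-transcript values over a gene set (for example the low, middle and high expression sets) would give set-level summaries of the kind described above.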

2.3 Expression correlation

One of the most valuable ways to interpret the performance of an RNA-seq run is to compare the measured expression levels to a reference (Levin et al., 2010). RNA-SeQC provides RPKM-based estimation of expression levels (Mortazavi et al., 2008). When run with multiple samples, RNA-SeQC creates a matrix of correlations among all combinations, reporting the Spearman (rank based) and Pearson (quantity based) correlation coefficients. Optionally, an array-based or RNA-seq reference expression profile can be provided for the correlation analysis. Correlation metrics are also provided for the different GC content stratifications to measure GC bias.
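To make the two correlation measures concrete, here is a minimal, dependency-free Python sketch. It is illustrative only and does not reproduce RNA-SeQC's code. RPKM is computed by its standard definition (reads per kilobase of transcript per million mapped reads); the function names are hypothetical, and tied values are assumed to share their average rank in the Spearman calculation.

```python
from math import sqrt


def rpkm(read_count, transcript_length, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length * total_mapped_reads)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (quantity based)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (rank based)."""
    return pearson(ranks(x), ranks(y))
```

Two expression profiles that are monotonically but not linearly related, such as `[1, 2, 3, 4]` and `[1, 4, 9, 16]`, have a Spearman coefficient of 1 while their Pearson coefficient is below 1, which is one reason reporting both is informative.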

3 IMPLEMENTATION

Implemented in Java, RNA-SeQC is platform independent and requires no installation. For investigators who prefer a web interface to a command-line tool, this software can be run using the GenePattern web interface found at http://www.GenePattern.org (Reich et al., 2006). Within the RNA-SeQC software package, Read Count metrics were implemented by inheriting from the ReadWalker class of the GATK software package (McKenna et al., 2010). Transcript annotations are bound to the walker in the RefGen format. This format is created on-the-fly from a user-provided GTF file. The program is designed to support the minimal GTF specification, but the GTF format used by GENCODE (Harrow et al., 2006) is recommended. For continuity of coverage calculations, the GATK's Depth of Coverage walker was used to calculate the number of bases at a given position in the genomic alignment. Finally, ribosomal RNA quantification is performed by realigning all reads to rRNA reference sequences using the Burrows–Wheeler Aligner (Li and Durbin, 2009).

Funding: Funded in part with Federal funds from the National Human Genome Research Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN268201000029C.

Conflict of Interest: none declared.

REFERENCES

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. GenePattern 2.0.

Authors: Michael Reich; Ted Liefeld; Joshua Gould; Jim Lerner; Pablo Tamayo; Jill P Mesirov
Journal: Nat Genet Date: 2006-05 Impact factor: 38.330

3. Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Authors: Ali Mortazavi; Brian A Williams; Kenneth McCue; Lorian Schaeffer; Barbara Wold
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

Review 4. Computational methods for transcriptome annotation and quantification using RNA-seq.

Authors: Manuel Garber; Manfred G Grabherr; Mitchell Guttman; Cole Trapnell
Journal: Nat Methods Date: 2011-05-27 Impact factor: 28.547

5. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

Review 6. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

7. GENCODE: producing a reference annotation for ENCODE.

Authors: Jennifer Harrow; France Denoeud; Adam Frankish; Alexandre Reymond; Chao-Kung Chen; Jacqueline Chrast; Julien Lagarde; James G R Gilbert; Roy Storey; David Swarbreck; Colette Rossier; Catherine Ucla; Tim Hubbard; Stylianos E Antonarakis; Roderic Guigo
Journal: Genome Biol Date: 2006-08-07 Impact factor: 13.583

8. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

9. Comprehensive comparative analysis of strand-specific RNA sequencing methods.

Authors: Joshua Z Levin; Moran Yassour; Xian Adiconis; Chad Nusbaum; Dawn Anne Thompson; Nir Friedman; Andreas Gnirke; Aviv Regev
Journal: Nat Methods Date: 2010-08-15 Impact factor: 28.547
