ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data.

Martin Morgan, Simon Anders, Michael Lawrence, Patrick Aboyoun, Hervé Pagès, Robert Gentleman.

Abstract

ShortRead is a package for input, quality assessment, manipulation and output of high-throughput sequencing data. ShortRead is provided in the R and Bioconductor environments, allowing ready access to additional facilities for advanced statistical analysis, data transformation, visualization and integration with diverse genomic resources.
AVAILABILITY AND IMPLEMENTATION: This package is implemented in R and available at the Bioconductor web site; the package contains a 'vignette' outlining typical work flows.

Year: 2009 PMID: 19654119 PMCID: PMC2752612 DOI: 10.1093/bioinformatics/btp450

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

High-throughput DNA sequencing technologies include Illumina (Solexa) (Bentley et al., 2008), Roche 454 (Torres et al., 2008) and other platforms. These technologies produce millions of DNA sequences of tens to hundreds of nucleotides each. Applications of these data include SNP calling, ChIP-seq (Mardis, 2007) and RNA-seq (Mortazavi et al., 2008). We introduce the ShortRead package, part of the Bioconductor (Gentleman et al., 2004) project. ShortRead extends Bioconductor with tools for the initial stages of short-read DNA sequence analysis. Its main functions are data input, quality assessment, data transformation and access to downstream analysis. ShortRead thus serves as a gateway to Bioconductor for processing high-throughput DNA sequence data.

1 AVAILABLE FUNCTIONALITY

1.1 Input and output

ShortRead provides mechanisms for input of diverse high-throughput sequence data. A major starting point is reads aligned to references, as produced by manufacturer software or by aligners such as MAQ (Li et al., 2008) and Bowtie (Langmead et al., 2009). ShortRead also parses additional formats (e.g. fasta, fastq and arbitrary column-oriented text files). The resulting R data structures allow manipulation of sequence, quality, alignment and other information. Input functions transparently parse compressed (.gz) files; most file types can be read in ‘chunks’, to allow processing of data subsets. ShortRead inputs but does not specially represent fine-grained alignment descriptions (e.g. in Stockholm format). Facilities for data output include fasta and fastq text formats, arbitrary column-oriented output of reads and auxiliary information, and serialization of objects in native R format. Through additional R packages such as rtracklayer, data can also be exported to common genome browser formats such as wiggle, bed and gff (Kuhn et al., 2008).
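The idea of transparent decompression combined with chunk-wise input can be illustrated outside R. The following is a minimal Python sketch, not ShortRead's implementation: the function names `read_fastq` and `chunks` are hypothetical, and the code assumes the common four-line fastq record layout.

```python
import gzip
from itertools import islice

def read_fastq(path):
    """Yield (identifier, sequence, quality) records from a fastq file.

    Assumes four lines per record; files whose name ends in .gz are
    decompressed on the fly.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'

def chunks(records, size):
    """Group an iterator of records into lists of at most `size` records."""
    records = iter(records)
    while True:
        chunk = list(islice(records, size))
        if not chunk:
            return
        yield chunk
```

Because both functions are generators, only one chunk is held in memory at a time, e.g. `for chunk in chunks(read_fastq("lane1.fastq.gz"), 100000): ...` (the file name here is a placeholder).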

1.2 Quality assessment

ShortRead includes facilities for quality assessment (QA) covering read quality, sample processing and sequencing artifacts, and alignment characteristics. The QA pipeline can start with base calls and their quality scores (e.g. fastq or qseq files), or with aligned data formats from special-purpose aligners. The result is an HTML report with embedded narrative to facilitate interpretation; a sample report is included with the package. Illustrative results are shown in Figure 1. Highlights include: (i) the number of raw, filtered and aligned reads; (ii) base call frequencies; (iii) cycle-specific base calls and read qualities (e.g. Fig. 1A); (iv) tabulation of read occurrences (how often reads are represented once, twice, …, n times); for instance, reads occurring once or a few times (to the left in Fig. 1B) may be unique due to base call errors, whereas reads occurring very frequently (at the extreme right in Fig. 1B) typically reflect PCR or resequencing issues; and (v) preliminary alignment quality score summaries. Technology-specific quality measures are also generated, especially for Illumina's Genome Analyzer (e.g. tile-specific read counts and qualities).
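The tabulation of read occurrences in item (iv) amounts to a count of counts. As a rough Python illustration (the function name `occurrence_table` is hypothetical and this is not the package's code):

```python
from collections import Counter

def occurrence_table(reads):
    """Map k -> number of distinct sequences that occur exactly k times."""
    copies_per_sequence = Counter(reads)  # sequence -> number of copies
    return dict(sorted(Counter(copies_per_sequence.values()).items()))

reads = ["ACGT", "ACGT", "TTTT", "GGCC", "ACGT", "TTTT"]
print(occurrence_table(reads))  # {1: 1, 2: 1, 3: 1}
```

Small keys correspond to the left tail of such a distribution (reads seen once or a few times) and large keys to the right tail (highly repeated reads).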

Fig. 1.

Quality assessment. (A) Unlikely directional nucleotide change and base calls (cycle 26) from a Short Read Archive accession. (B) Left and right ‘tails’ correspond to infrequently and highly repeated reads, respectively, in a φX174 control lane.

1.3 Transformation and downstream analysis

ShortRead provides facilities for data exploration, transformation and downstream analysis. For example, alphabetByCycle summarizes cycle-specific nucleotide counts (Fig. 1A) and base qualities. The alphabetFrequency function summarizes nucleotide use over all cycles, on a per-read basis or over the entire set of reads. The tables function summarizes commonly occurring sequences, as illustrated in Figure 1B. ShortRead contains facilities for sorting reads, finding duplicates, trimming left and right ends and for exploiting the extensive string pattern matching functions of Biostrings. The features described here are generally fast, operating on tens of millions of short reads in a few seconds; input of large text files can be slow, taking 3–5 min for 50 million 36mers. 64-bit platforms with 4–8 GB of memory are typically sufficient.

ShortRead provides extensible ‘filter’ functions for removing short reads satisfying predefined or ad hoc criteria. For instance, the dustyFilter identifies and removes low-complexity reads. Filters can be composed to formulate complex criteria.

Additional ShortRead functionality is a starting point for downstream analysis. The function coverage summarizes [possibly ‘extended’, see Kharchenko et al. (2008)] alignments as vectors tallying the number of reads over each nucleotide in the reference.

ShortRead is one of several Bioconductor packages for sequence analysis. Biostrings has flexible tools for pattern matching, sequence alignment and manipulation. BSgenome provides facilities for representing and efficiently manipulating whole genomes. IRanges provides range-based and other expressive representations. rtracklayer provides an interface to genome browsers from within R sessions.
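Three of the operations described in this section (per-cycle nucleotide counts, composable filters and per-nucleotide coverage) can be sketched in a few lines of Python. This is a conceptual analogue under stated assumptions, not the ShortRead API: all function names are hypothetical, `min_distinct_bases` is a crude stand-in that does not reproduce the dustyFilter score, and `coverage` assumes 0-based start positions inside the reference with a single fixed read width (which may be set larger than the sequenced length to mimic ‘extended’ reads).

```python
def alphabet_by_cycle(reads, alphabet="ACGTN"):
    """Return {letter: [count at cycle 0, count at cycle 1, ...]}."""
    width = max((len(r) for r in reads), default=0)
    counts = {letter: [0] * width for letter in alphabet}
    for read in reads:
        for cycle, base in enumerate(read):
            if base in counts:
                counts[base][cycle] += 1
    return counts

def compose(*filters):
    """Combine predicates: a read is kept only if every filter accepts it."""
    return lambda read: all(f(read) for f in filters)

def no_n(read):
    """Accept reads without ambiguous base calls."""
    return "N" not in read

def min_distinct_bases(k):
    """Accept reads using at least k different letters (toy complexity test)."""
    return lambda read: len(set(read)) >= k

def coverage(starts, read_width, reference_length):
    """Tally the number of reads over each reference position.

    Assumes 0 <= start < reference_length; reads running past the end of
    the reference are truncated.
    """
    diff = [0] * (reference_length + 1)
    for start in starts:
        end = min(start + read_width, reference_length)
        diff[start] += 1   # a read begins here
        diff[end] -= 1     # and stops covering from here on
    depth, running = [], 0
    for delta in diff[:reference_length]:
        running += delta
        depth.append(running)
    return depth
```

For example, `keep = compose(no_n, min_distinct_bases(2))` followed by `[r for r in reads if keep(r)]` applies both criteria at once, and `coverage([0, 2, 2], 3, 6)` yields `[1, 1, 3, 2, 2, 0]`.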

1.4 Advanced features

The ShortRead package includes advanced features for handling large resequencing data. In particular, the large volume of data and its generation in ‘lanes’ encourage a ‘block’ processing style. For instance, much of the QA functionality of ShortRead can be conducted on a per-lane basis. The srapply function exploits this work flow. A typical use takes a list of file names and a function; srapply applies the function to each file. Usually the function reduces the data volume in the file, e.g. from a large collection of reads to a compact summary of lane quality. The distinguishing feature of srapply is that the calculation is distributed across processors or nodes in a computer cluster, if such resources exist.
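The work flow described for srapply (map a data-reducing function over files, in parallel when resources are available and serially otherwise) can be sketched as follows. This Python analogue is an assumption-laden illustration rather than the R function: `sr_apply` and `count_lines` are hypothetical names, and the optional `executor` is assumed to be any object with a `map(fun, iterable)` method, such as the executors in Python's `concurrent.futures` module.

```python
def sr_apply(files, fun, executor=None):
    """Apply `fun` to each file name and return {file name: result}.

    With no executor the files are processed one after another; otherwise
    the calls are handed to the executor and may run concurrently.
    """
    files = list(files)
    if executor is None:
        results = [fun(f) for f in files]          # serial fallback
    else:
        results = list(executor.map(fun, files))   # order is preserved
    return dict(zip(files, results))

def count_lines(path):
    """Toy per-lane summary: reduce a whole file to a single number."""
    with open(path) as handle:
        return sum(1 for _ in handle)
```

Only the compact per-file summaries are returned to the caller, so the full read data never needs to be held in one process.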

2 CONCLUSIONS

This note introduces the Bioconductor ShortRead package for analysis of resequencing data. The package allows input into R of diverse sequence-related files, and output of common data formats. It provides quality assessment tools and HTML-based report-generating functionality. ShortRead data structures allow convenient manipulation of data, such as filtering reads based on sequence characteristics. The package work flow represents an entry point for downstream analysis using Bioconductor or other software. Future plans include improved support for longer and paired-end reads, and development of additional quantitative measures of quality. The challenge of incorporating the SOLiD color space model into standard work flows precludes support for this format beyond that available from data transformed to sequence and Phred-like quality scores.

9 in total

1. Gene expression profiling by massively parallel sequencing.

Authors: Tatiana Teixeira Torres; Muralidhar Metta; Birgit Ottenwälder; Christian Schlötterer
Journal: Genome Res Date: 2007-11-21 Impact factor: 9.043

2. ChIP-seq: welcome to the new frontier.

Authors: Elaine R Mardis
Journal: Nat Methods Date: 2007-08 Impact factor: 28.547

3. Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors: Heng Li; Jue Ruan; Richard Durbin
Journal: Genome Res Date: 2008-08-19 Impact factor: 9.043

4. Mapping and quantifying mammalian transcriptomes by RNA-Seq.

Authors: Ali Mortazavi; Brian A Williams; Kenneth McCue; Lorian Schaeffer; Barbara Wold
Journal: Nat Methods Date: 2008-05-30 Impact factor: 28.547

5. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors: Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-03-04 Impact factor: 13.583

6. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

7. The UCSC Genome Browser Database: update 2009.

Authors: R M Kuhn; D Karolchik; A S Zweig; T Wang; K E Smith; K R Rosenbloom; B Rhead; B J Raney; A Pohl; M Pheasant; L Meyer; F Hsu; A S Hinrichs; R A Harte; B Giardine; P Fujita; M Diekhans; T Dreszer; H Clawson; G P Barber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2008-11-07 Impact factor: 16.971

8. Design and analysis of ChIP-seq experiments for DNA-binding proteins.

Authors: Peter V Kharchenko; Michael Y Tolstorukov; Peter J Park
Journal: Nat Biotechnol Date: 2008-11-16 Impact factor: 54.908

9. Accurate whole human genome sequencing using reversible terminator chemistry.

Authors: David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal: Nature Date: 2008-11-06 Impact factor: 49.962
