
The Sequence Read Archive: explosive growth of sequencing data.

Yuichi Kodama, Martin Shumway, Rasko Leinonen.

Abstract

New generation sequencing platforms are producing data with significantly higher throughput and lower cost. A portion of this capacity is devoted to individual and community scientific projects. As these projects reach publication, raw sequencing datasets are submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). Archiving experimental data is the key to the progress of reproducible science. The SRA was established as a public repository for next-generation sequence data as a part of the International Nucleotide Sequence Database Collaboration (INSDC). INSDC is composed of the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). The SRA is accessible at www.ncbi.nlm.nih.gov/sra from NCBI, at www.ebi.ac.uk/ena from EBI and at trace.ddbj.nig.ac.jp from DDBJ. In this article, we present the content and structure of the SRA and report on updated metadata structures, submission file formats and supported sequencing platforms. We also briefly outline our various responses to the challenge of explosive data growth.


Year: 2011 PMID: 22009675 PMCID: PMC3245110 DOI: 10.1093/nar/gkr854

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

THE SEQUENCE READ ARCHIVE

Massively parallel next-generation sequencing platforms are revolutionizing life sciences. These instruments are producing vastly more sequence data than was ever possible with capillary technology. The National Center for Biotechnology Information (NCBI) started archiving raw sequencing data from next-generation platforms in 2007, followed by the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ) in 2008. In 2009, an international public archival resource for next-generation sequencing data, the ‘Sequence Read Archive (SRA)’, was established as a part of the International Nucleotide Sequence Database Collaboration (INSDC) (1–3). The mission of the SRA is to help the wider research community gain access to the next-generation sequencing data emanating from scientific research. The SRA works as a core infrastructure for sharing pre-publication sequence data as required by several large-scale international projects, including the Human Microbiome project (https://commonfund.nih.gov/hmp) and the 1000 Genomes project (http://www.1000genomes.org). Note that data requiring authorized access, such as human genome data sequenced under ethical consent agreements, should be submitted to the database of Genotypes and Phenotypes at NCBI (dbGaP, http://www.ncbi.nlm.nih.gov/gap) or to the European Genome-phenome Archive at EBI (EGA, http://www.ebi.ac.uk/ega). Data submitted to dbGaP or EGA is not part of the public SRA, although summary-level metadata is made available through the SRA.

CONTENT

In 2011, the SRA surpassed 100 Terabases of open-access genetic sequence reads from next-generation sequencing technologies. The Illumina™ platform accounts for 84% of sequenced bases, with the SOLiD™ and Roche/454™ platforms accounting for 12% and 2%, respectively. The most active SRA submitters in terms of submitted bases are the Broad Institute, the Wellcome Trust Sanger Institute and Baylor College of Medicine, with 31, 13 and 11%, respectively. The largest individual global project generating next-generation sequence data is the 1000 Genomes project, which has contributed nearly one third of all bases. The most sequenced organisms are Homo sapiens with 61%, human metagenome with 6% and Mus musculus with 5% of all bases. The most common study types in terms of sequenced bases are Whole Genome Sequencing and Re-sequencing, Population Genomics, Metagenomics and Epigenetics, with 57, 12, 11 and 8% of all bases, respectively.

ACCEPTED DATA

The SRA is a repository of raw sequence data that aims to balance the cost of long-term archival against the requirement to store sufficient information to support re-use of the submitted data. At minimum, data submitted to the SRA must include base calls (or SOLiD color calls) and their qualities. To limit the archival cost, and guided by community consultation, the SRA also sets maximum levels for accepted raw data. For example, since the end of 2010, signal data from the Illumina and SOLiD platforms are no longer archived by the SRA. In addition to base calls (or SOLiD color calls) and quality scores, the SRA also accepts alignment submissions in BAM (4) format. Other data may be accepted as well; full details are available for submitters from NCBI, DDBJ or EBI. Interactive and pipeline submission routes to the SRA archives are available. Functional genomics studies using next-generation sequencing (e.g. ChIP-seq and RNA-seq) can be submitted via the Gene Expression Omnibus at NCBI (http://www.ncbi.nlm.nih.gov/geo) (5), ArrayExpress at EBI (http://www.ebi.ac.uk/arrayexpress) (6) and the DDBJ Omics Archive (http://trace.ddbj.nig.ac.jp/dor) (7).

SUPPORTED PLATFORMS AND FILE FORMATS

The SRA aims to support all established and emerging sequencing platforms and the most commonly used data file formats. Supported platforms include Roche/454 (Roche Diagnostics Corp.), Illumina (Illumina Inc.), SOLiD (Life Technologies Corp.), HeliScope™ Single Molecule Sequencer (Helicos Biosciences Corp.), Complete Genomics™ (Complete Genomics Inc.), SMRT™ (Pacific Biosciences Inc.) and Ion Torrent PGM™ (Life Technologies Corp.). Depending on the data file format, submissions for emerging platforms may at first be supported only provisionally, in which case the submitted data is made available only in the original submitted format. This procedure guarantees early access to data generated by new platforms. Data submitted in any of the widely used data formats is rigorously validated and made available to the public in a variety of formats. For example, NCBI makes data available in the NCBI SRA toolkit format, which can be converted into many other file formats, while EBI and DDBJ make data available in the FASTQ format. Recommended data submission formats may vary slightly between DDBJ, EBI and NCBI, but all widely used formats, such as BAM and Standard Flowgram Format (SFF), are universally accepted.
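As an illustration of what a FASTQ file carries (base calls plus one quality score per base), the following minimal Python sketch reads four-line FASTQ records. It assumes unwrapped records and the common Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are hypothetical and not part of any SRA tool.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    bases: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    Assumptions: records are not line-wrapped, and qualities use the
    Phred+33 ASCII encoding unless another offset is given.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line) stops parsing
        bases = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(quality):
            raise ValueError(f"base/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in quality])


# Made-up single-read example.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(parse_fastq(example))
```

Each record pairs the base string with a list of integer quality scores of the same length, which is the minimum content the archive requires of a submission.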

METADATA MODEL

Data submitted to the SRA is organized using a metadata model consisting of six objects: study, sample, experiment, run, analysis and submission. The SRA study contains high-level information, including the goals of the study and literature references, and may be linked to the INSDC BioProject database. Similarly, the SRA sample object contains detailed sample information, and may be linked to the BioSample databases of NCBI (http://www.ncbi.nlm.nih.gov/biosample) and EBI (http://www.ebi.ac.uk/biosamples). The SRA experiment and run objects contain library and instrument information and are directly associated with the sequence data. The SRA analysis object is used for the deposition of a variety of analysis results, including alignments and assemblies. The SRA submission object groups the other objects for submission into the SRA. These metadata objects are all accessioned with unique permanent identifiers that are shared by the INSDC partners. The SRA has updated the metadata model to better represent new sequencing technologies and applications. Schema version 1.3, introduced in 2011, added a new structure called GapDescriptor that encodes the placement of spot subsequences (tags) against a reference or assembly substrate. This structure encodes mate pair gaps and tandem read gaps. Introduction of the GapDescriptor element was motivated by the need to describe Complete Genomics platform sequencing. The next planned metadata version, 2.0, will largely simplify the model by removing redundant and deprecated fields. While this new model will be incompatible with the previous version, the SRA archives will transform all existing metadata documents to conform to the new model. The SRA metadata model is largely shared by all three archives; however, small differences have been introduced to support archive-specific local requirements.
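The six-object model described above can be sketched as plain Python dataclasses. This is only an illustration of how the objects relate, not the SRA XML schema: all field names are hypothetical, the links from experiment to study and sample are an assumption about how the objects reference each other, and the accession strings are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Study:
    accession: str
    title: str                        # high-level goals of the study
    bioproject: Optional[str] = None  # optional link to BioProject


@dataclass
class Sample:
    accession: str
    description: str
    biosample: Optional[str] = None   # optional link to BioSample


@dataclass
class Run:
    accession: str
    data_files: List[str] = field(default_factory=list)  # sequence data


@dataclass
class Experiment:
    accession: str
    study: Study      # assumed reference to the parent study
    sample: Sample    # assumed reference to the sequenced sample
    platform: str     # instrument information
    library_name: str  # library information
    runs: List[Run] = field(default_factory=list)


@dataclass
class Analysis:
    accession: str
    study: Study
    kind: str  # e.g. "alignment" or "assembly"


@dataclass
class Submission:
    """Groups the other objects for submission."""
    accession: str
    studies: List[Study] = field(default_factory=list)
    samples: List[Sample] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)
    analyses: List[Analysis] = field(default_factory=list)


# Placeholder accessions, not real identifiers.
study = Study("STUDY-0001", "Example resequencing study")
sample = Sample("SAMPLE-0001", "Example sample")
experiment = Experiment("EXP-0001", study, sample,
                        platform="Illumina", library_name="lib1")
experiment.runs.append(Run("RUN-0001", ["reads.fastq"]))
submission = Submission("SUB-0001", studies=[study], samples=[sample],
                        experiments=[experiment])
```

Keeping the sequence data on the run while the experiment holds library and platform details mirrors the description above, where experiment and run are the objects directly associated with the data.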

SEQUENCE DATA EXCHANGE

The public sequence data are exchanged between the INSDC partners, allowing all public data to be accessed at each site regardless of the point of original submission (‘submit locally, share globally’ model). The data is currently exchanged in the NCBI SRA toolkit format (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software). The SRA toolkit provides a configurable storage and compression architecture, and its format can be converted to other formats, such as the widely used FASTQ format, through its standard API. The SRA data exchange model follows the long-established INSDC policy of exchanging GenBank, EMBL-Bank and DDBJ entries.

CHALLENGE OF DATA GROWTH

The explosive growth of next-generation sequencing data submitted into the SRA exceeds the growth rate of storage capacity. This trend poses the greatest challenge in handling raw sequence data, both for the SRA archives and for users of the data. The SRA partners actively discuss and pursue approaches together with user communities to maximize the benefit gained from archiving next-generation sequencing data while minimizing the infrastructure costs. Possible approaches discussed include reference-based compression of sequencing data, quantization of base quality values, selective storage of base quality values, reducing the metadata stored for individual reads (e.g. read names), federation of data in place of data submission and exchange, and consolidation of catastrophe back-up storage across SRA archives. Among these possibilities, the SRA is exploring approaches based on reference alignment and compression of reads, and on the preservation of only the most valuable base quality information (8), and is also actively participating in experiments assessing the effect of quality score quantization. The SRA partners continue to work actively with the research community to explore appropriate data reduction approaches.
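Two of the approaches listed above, quality score quantization and reference-based compression, can be illustrated with a short Python sketch. It is a toy example under stated assumptions, not the method of (8) or anything the SRA archives implement: the bin boundaries and representative values are arbitrary, the read encoding handles only ungapped alignments, and all function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

# Hypothetical quantization bins: a score in [lower bound, next lower bound)
# is replaced by that bin's representative value. Values are illustrative.
BIN_LOWER_BOUNDS = [0, 10, 20, 30]
BIN_VALUES = [5, 15, 25, 35]

Encoded = Tuple[int, int, List[Tuple[int, str]]]


def quantize_qualities(qualities: Sequence[int]) -> List[int]:
    """Map each Phred score to its bin's representative value (lossy)."""
    quantized = []
    for q in qualities:
        if q < 0:
            raise ValueError("quality scores must be non-negative")
        quantized.append(BIN_VALUES[bisect_right(BIN_LOWER_BOUNDS, q) - 1])
    return quantized


def encode_read(reference: str, position: int, read: str) -> Encoded:
    """Store an ungapped aligned read as (position, length, mismatches)."""
    if position < 0 or position + len(read) > len(reference):
        raise ValueError("read does not fit within the reference")
    segment = reference[position:position + len(read)]
    mismatches = [(i, base) for i, (ref_base, base)
                  in enumerate(zip(segment, read)) if ref_base != base]
    return position, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Rebuild the read from the reference and its stored differences."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Made-up reference and read with one mismatch at read offset 3.
reference = "ACGTACGTAC"
read = "GTAGGT"
encoded = encode_read(reference, 2, read)
restored = decode_read(reference, encoded)
binned = quantize_qualities([40, 40, 20, 0])
```

The read encoding is lossless and saves space in proportion to how closely a read matches the reference, whereas quantization discards resolution in the quality scores, which is why its downstream effect needs to be assessed experimentally.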

FUNDING

DNA Data Bank of Japan, Ministry of Education, Culture, Sports, Science and Technology of Japan; European Molecular Biology Laboratory, European Commission and the Wellcome Trust; National Library of Medicine; Intramural Research Program of the NIH. Funding for open access charge: Ministry of Education, Culture, Sports, Science and Technology of Japan (management expense grant).

Conflict of interest statement. None declared.

REFERENCES

1. Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors: Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal: Genome Res Date: 2011-01-18 Impact factor: 9.043

2. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

3. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments.

Authors: Helen Parkinson; Ugis Sarkans; Nikolay Kolesnikov; Niran Abeygunawardena; Tony Burdett; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Emma Hastings; Ele Holloway; Natalja Kurbatova; Margus Lukk; James Malone; Roby Mani; Ekaterina Pilicheva; Gabriella Rustici; Anjan Sharma; Eleanor Williams; Tomasz Adamusiak; Marco Brandizi; Nataliya Sklyar; Alvis Brazma
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

4. NCBI GEO: archive for functional genomics data sets--10 years on.

Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Carlos Evangelista; Irene F Kim; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Michelle Holko; Oluwabukunmi Ayanbule; Andrey Yefanov; Alexandra Soboleva
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

5. The sequence read archive.

Authors: Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

6. The DNA Data Bank of Japan launches a new resource, the DDBJ Omics Archive of functional genomics experiments.

Authors: Yuichi Kodama; Jun Mashima; Eli Kaminuma; Takashi Gojobori; Osamu Ogasawara; Toshihisa Takagi; Kousaku Okubo; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2011-11-22 Impact factor: 16.971

7. The International Nucleotide Sequence Database Collaboration.

Authors: Ilene Karsch-Mizrachi; Yasukazu Nakamura; Guy Cochrane
Journal: Nucleic Acids Res Date: 2011-11-12 Impact factor: 16.971

8. Archiving next generation sequencing data.

Authors: Martin Shumway; Guy Cochrane; Hideaki Sugawara
Journal: Nucleic Acids Res Date: 2009-12-03 Impact factor: 16.971
