
The sequence read archive.

Rasko Leinonen, Hideaki Sugawara, Martin Shumway.

Abstract

The combination of significantly lower cost and increased speed of sequencing has resulted in an explosive growth of data submitted into the primary next-generation sequence data archive, the Sequence Read Archive (SRA). The preservation of experimental data is an important part of the scientific record, and increasing numbers of journals and funding agencies require that next-generation sequence data are deposited into the SRA. The SRA was established as a public repository for the next-generation sequence data and is operated by the International Nucleotide Sequence Database Collaboration (INSDC). INSDC partners include the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). The SRA is accessible at http://www.ncbi.nlm.nih.gov/Traces/sra from NCBI, at http://www.ebi.ac.uk/ena from EBI and at http://trace.ddbj.nig.ac.jp from DDBJ. In this article, we present the content and structure of the SRA, detail our support for sequencing platforms and provide recommended data submission levels and formats. We also briefly outline our response to the challenge of data growth.

Year: 2010 PMID: 21062823 PMCID: PMC3013647 DOI: 10.1093/nar/gkq1019

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

THE SEQUENCE READ ARCHIVE

The Sequence Read Archive (SRA) is an international public archival resource for next-generation sequence data established under the guidance of the International Nucleotide Sequence Database Collaboration (INSDC) (1). Instances of the SRA are operated by the National Center for Biotechnology Information (NCBI) (2), the European Bioinformatics Institute (EBI) (3) and the DNA Data Bank of Japan (DDBJ) (4). The mission of INSDC is to preserve public-domain sequencing data and to provide free, unrestricted and permanent access to the data. For INSDC policy details, please refer to http://www.insdc.org/policy.html. Authorized-access data submissions, such as human samples sequenced under ethical consent agreements, should be submitted to dbGaP (http://www.ncbi.nlm.nih.gov/gap) at NCBI or to the European Genome-phenome Archive (EGA) (http://www.ebi.ac.uk/ega) at EBI. Data submitted to dbGaP or EGA are not part of the public SRA; however, high-level metadata are made available through the SRA. For a brief history of the SRA, please refer to (5).

CONTENT

In mid-September 2010, the SRA contained >500 billion reads consisting of 60 trillion base pairs available for download, including authorized-access data submitted to dbGaP. Almost 80% of the sequencing data are derived from the Illumina GA platform. The SOLiD™ and Roche/454 platforms account for 15% and 5% of submitted base pairs, respectively. In terms of submitted base pairs, the most active SRA submitters include the Broad Institute, Washington University in St Louis, the Wellcome Trust Sanger Institute and Baylor College of Medicine, with 34, 15, 13 and 12% shares of sequenced bases, respectively. The largest individual global project generating next-generation sequence data is the 1000 Genomes project (http://www.1000genomes.org), which has generated nearly half of all data submitted into the SRA. The most sequenced organisms are Homo sapiens, with 65%, and Mus musculus, with 4% of all bases. Human metagenome sequencing accounted for 16% of submitted bases.
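As a rough illustration of what these figures imply, the sketch below derives an upper bound on the mean read length and an approximate base count per platform. It uses only the numbers quoted above; the function and variable names are ours, and because the read count is given as ">500 billion" and the Illumina share as "almost 80%", the results are order-of-magnitude estimates only.

```python
# Figures quoted in the text for mid-September 2010.
TOTAL_READS = 500e9   # ">500 billion reads": treated here as a lower bound
TOTAL_BASES = 60e12   # "60 trillion base pairs"

# Approximate platform shares of submitted base pairs, as stated in the text.
PLATFORM_SHARE = {"Illumina GA": 0.80, "SOLiD": 0.15, "Roche/454": 0.05}


def mean_read_length(total_bases, total_reads):
    """Average number of bases per read."""
    return total_bases / total_reads


def bases_per_platform(total_bases, shares):
    """Split a total base count according to fractional platform shares."""
    return {name: total_bases * share for name, share in shares.items()}


if __name__ == "__main__":
    # Because the read count is a lower bound, this is an upper bound.
    print("mean read length (bases) <=", mean_read_length(TOTAL_BASES, TOTAL_READS))
    for name, bases in bases_per_platform(TOTAL_BASES, PLATFORM_SHARE).items():
        print(f"{name}: about {bases / 1e12:.0f} trillion bases")
```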

PLATFORM SUPPORT

At present, support is offered for widely used sequencing platforms: Roche/454 (Roche Diagnostics Corp.), Illumina Genome Analyzer (Illumina Inc.) and SOLiD™ (Life Technologies Corp.). Support for HeliScope™ Single Molecule Sequencer (Helicos Biosciences Corp.), Complete Genomics Inc., SMRT™ (Pacific Biosciences Inc.) and Ion Torrent Systems Inc. will be available shortly.

RECOMMENDED DATA SUBMISSION LEVELS AND FORMATS

The SRA is intended as a repository of data from the primary analysis phase of sequencing. Operating the SRA over the last 3 years has allowed us to refine the levels to which we archive data. Storing early raw forms of data, such as images and signals, gives users the greatest theoretical precision, but at significant cost. Data submitted to the SRA archives must always include base or SOLiD™ color calls and qualities. This is now also the recommended data submission level for the Illumina Genome Analyzer (GA) and SOLiD™ platforms. Signal data for the Illumina GA and SOLiD™ platforms should no longer be submitted into the SRA archives, as the cost of storing signal data for these platforms is considered to be significantly higher than the value of making these data available for any further analysis. For the 454 platform, the submitted data should still include the signal information. The recommended submission format for data from the Illumina GA and SOLiD™ platforms is Sequence Read Format (SRF); SRF files for the Illumina GA platform should be prepared using the DNA Sequence Read Toolkit (http://sourceforge.net/projects/sequenceread/files), and for the SOLiD™ platform using the SOLiD™ SRF conversion utility (http://solidsoftwaretools.com/gf/project/srf). For the 454 platform, the recommended submission format is Standard Flowgram Format (SFF). Both SRF and SFF files are highly compressed and should be submitted to the SRA without applying any further compression.
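The recommendations above can be summarized as a small lookup table. The following is a minimal sketch, not an SRA tool: the platform keys, the SubmissionRule class and the check_submission function are hypothetical names introduced here for illustration, and only the rules themselves (formats, signal data, mandatory calls and qualities) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmissionRule:
    data_format: str      # recommended submission format
    include_signal: bool  # whether signal data should be submitted


# Recommendations as stated in the text; the dictionary keys are our own labels.
RECOMMENDED = {
    "illumina_ga": SubmissionRule("SRF", include_signal=False),
    "solid": SubmissionRule("SRF", include_signal=False),
    "454": SubmissionRule("SFF", include_signal=True),
}


def check_submission(platform, data_format, has_signal, has_calls_and_qualities):
    """Return a list of problems; an empty list means the recommendations are met."""
    problems = []
    if not has_calls_and_qualities:
        problems.append("base (or color) calls and qualities are always required")
    rule = RECOMMENDED.get(platform)
    if rule is None:
        problems.append(f"no recommendation recorded for platform {platform!r}")
        return problems
    if data_format != rule.data_format:
        problems.append(f"recommended format is {rule.data_format}, got {data_format}")
    if has_signal and not rule.include_signal:
        problems.append("signal data should no longer be submitted for this platform")
    if not has_signal and rule.include_signal:
        problems.append("signal data should still be included for this platform")
    return problems


if __name__ == "__main__":
    print(check_submission("454", "SFF", True, True))
    print(check_submission("illumina_ga", "SRF", True, True))
```

Under these assumptions, a 454 submission in SFF with signal data passes, while an Illumina GA submission that still carries signal data is flagged.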

METADATA STORAGE AND EXCHANGE

The SRA metadata model consists of six XML objects, each constrained by a schema. All metadata are exchanged on a daily basis between the SRA archives and can be accessed and retrieved from all three sites. The current versions (1.1) of the SRA XML Schemas are available at http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=schema&m=software&s=schema from NCBI and at ftp://ftp.sra.ebi.ac.uk/meta/xsd/sra_1_1 from EBI. While the SRA stores and presents metadata using SRA XML documents, submissions may be prepared using a variety of tools and pipelines. The SRA XML objects are study, sample, experiment, run, analysis and submission. The SRA study object contains high-level project information including literature references, and may be linked to the INSDC projects database. Similarly, the SRA sample object contains detailed sample information. The SRA experiment and run objects contain instrument and library information and are directly associated with the sequence data. The SRA analysis object is used for the deposition of a variety of analysis results including reference alignments, multiple alignments and assemblies. The SRA submission object groups the other objects for submission into the SRA. Metadata XML objects are all accessioned with unique permanent identifiers that are used by all partners in the collaboration.
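To make the object model more concrete, here is a minimal sketch of the six object types as plain Python records, with a check for references that point at objects that do not exist. It is not the SRA XML schema. The MetadataObject class, the refs field, the particular links between objects (for example, a run pointing at an experiment) and the placeholder accessions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The six object types named in the text.
OBJECT_TYPES = ("study", "sample", "experiment", "run", "analysis", "submission")


@dataclass
class MetadataObject:
    object_type: str
    accession: str  # stands in for the unique permanent identifier
    # Assumed links to other objects: referenced object type -> accession.
    refs: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if self.object_type not in OBJECT_TYPES:
            raise ValueError(f"unknown SRA object type: {self.object_type}")


def dangling_refs(objects: List[MetadataObject]) -> List[str]:
    """List references that point at an object missing from `objects`."""
    known = {(o.object_type, o.accession) for o in objects}
    missing = []
    for o in objects:
        for ref_type, ref_acc in o.refs.items():
            if (ref_type, ref_acc) not in known:
                missing.append(f"{o.accession} -> {ref_type} {ref_acc}")
    return missing


if __name__ == "__main__":
    # Placeholder accessions, not real SRA identifiers.
    objects = [
        MetadataObject("study", "STUDY_1"),
        MetadataObject("sample", "SAMPLE_1"),
        MetadataObject("experiment", "EXP_1", {"study": "STUDY_1", "sample": "SAMPLE_1"}),
        MetadataObject("run", "RUN_1", {"experiment": "EXP_2"}),
    ]
    print(dangling_refs(objects))
```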

SEQUENCE DATA STORAGE AND EXCHANGE

The SRA follows the established INSDC data-exchange convention under which public data are exchanged between the INSDC partners on a daily basis. This allows all public data to be accessed at each site regardless of the point of the original submission. Before next-generation sequencing platforms existed, the most commonly used format for the representation of base calls and quality scores was the Sanger Fastq format (6). In 2001, a new community format was created which also supported the inclusion of the signal information: the ZTR format (7). SRF, a further development of ZTR, became the first widely used cross-platform format for storing next-generation sequence data. The SRF format gained a substantial user base from Illumina GA and SOLiD™ users, while the earlier SFF format (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=format#sff) became the standard for the 454 platform. In 2009, SAM and BAM (8) were introduced as generic formats for storing read alignments against reference sequences. Sequence alignments are increasingly generated as a primary analysis intermediate, and BAM is expected to replace SRF as the preferred submission format to the SRA; importantly, BAM supports not only aligned but also unaligned reads, which are also recommended for submission to the SRA. The SRA archives are currently working together with community experts to define an archival BAM format with the goal of making submission and exchange of BAM files as easy as possible. Efficient storage and compression of next-generation sequence data has always been one of the main objectives of the SRA. Internally, the SRA uses the NCBI SRA Toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software) for storing and exchanging all next-generation sequence data. Critically, the toolkit contains a configurable storage and compression architecture allowing current best practices to co-exist with future ones.
The NCBI SRA Toolkit has established itself as an important part of the SRA operations at NCBI, EBI and DDBJ, which now routinely validate and convert submitted data into the SRA Toolkit format. This format is used for data exchange by the SRA partners, is converted to other formats such as Fastq, and is made available to other applications through its standard API. For example, NCBI BLAST has been extended to perform sequence similarity searches using the files generated by the NCBI SRA Toolkit.
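For readers unfamiliar with Fastq, the sketch below parses records in the Sanger variant and decodes the quality string into Phred scores (ASCII code minus 33). It does not use the SRA Toolkit or its API. The parse_fastq function is a hypothetical helper, and it assumes the simple four-line layout with unwrapped sequence and quality lines.

```python
def parse_fastq(text):
    """Parse four-line Sanger Fastq records into (name, sequence, qualities).

    Assumes each record is exactly four lines (no wrapped sequence lines)
    and that qualities use the Sanger encoding (Phred score + 33).
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        scores = [ord(ch) - 33 for ch in qual]  # Sanger offset is 33
        records.append((header[1:], seq, scores))
    return records


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nII!I\n"
    for name, seq, scores in parse_fastq(example):
        print(name, seq, scores)
```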

CHALLENGE OF DATA GROWTH

With the growth of next-generation sequence data surpassing the growth of disk-storage capacity, the value of storing different types of data is being evaluated. Experience from the last 3 years has allowed the SRA to define the recommended platform-specific data submission levels. The cost of archiving Illumina GA and SOLiD™ signal data is now considered to significantly exceed the value of making these data available for any subsequent analysis. At NCBI, these signal data are now stored on a less accessible secondary-storage system and are no longer guaranteed to be permanently available as part of the SRA archives. A complementary approach to limiting the cost of archival storage is to implement more efficient compression strategies. Different types of data vary in their compressibility, and some can be compressed significantly more efficiently than others. Currently, one of the most promising compression strategies for next-generation sequences involves reference-based compression (9). The SRA is actively exploring better compression methods, including approaches based on reference alignment of reads and on the preservation of only the most valuable base quality information (Fritz,M.H. et al., submitted for publication). The SRA strategy, then, is to balance data reduction and compression in light of infrastructure costs and usage patterns.
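The idea behind reference-based compression can be shown with a toy example: an aligned read is stored as its position on the reference plus the bases that differ, rather than as the full sequence. The sketch below is not the method of (9) or of the work cited as submitted. The function names are ours, and it assumes ungapped alignments (no insertions or deletions) and ignores quality values.

```python
def encode_read(reference, read, position):
    """Encode an ungapped aligned read as (position, length, mismatches).

    `mismatches` holds (offset, base) pairs where the read differs
    from the reference.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return position, len(read), mismatches


def decode_read(reference, encoded):
    """Rebuild the original read from the reference and its encoding."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    read = "GTAGGT"  # aligns at position 2 with one mismatch
    encoded = encode_read(reference, read, 2)
    print(encoded)
    assert decode_read(reference, encoded) == read
```

A read that matches the reference exactly reduces to a position and a length, which is where the saving comes from when most reads differ from the reference at only a few bases.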

FUNDING

European Molecular Biology Laboratory, European Commission and the Wellcome Trust; Ministry of Education, Culture, Sports, Science and Technology of Japan (to DDBJ’s work on SRA and Trace Archive); Intramural Research Program of the NIH, National Library of Medicine (to NCBI’s SRA work). Funding for open access charge: European Molecular Biology Laboratory.

Conflict of interest statement. None declared.

REFERENCES

1. The International Nucleotide Sequence Database Collaboration.

Authors: Guy Cochrane; Ilene Karsch-Mizrachi; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2010-11-23 Impact factor: 16.971

2. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2009-11-12 Impact factor: 16.971

3. Improvements to services at the European Nucleotide Archive.

Authors: Rasko Leinonen; Ruth Akhtar; Ewan Birney; James Bonfield; Lawrence Bower; Matt Corbett; Ying Cheng; Fehmi Demiralp; Nadeem Faruque; Neil Goodgame; Richard Gibson; Gemma Hoad; Christopher Hunter; Mikyung Jang; Steven Leonard; Quan Lin; Rodrigo Lopez; Michael Maguire; Hamish McWilliam; Sheila Plaister; Rajesh Radhakrishnan; Siamak Sobhany; Guy Slater; Petra Ten Hoopen; Franck Valentin; Robert Vaughan; Vadim Zalunin; Daniel Zerbino; Guy Cochrane
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

4. DDBJ launches a new archive database with analytical tools for next-generation sequence data.

Authors: Eli Kaminuma; Jun Mashima; Yuichi Kodama; Takashi Gojobori; Osamu Ogasawara; Kousaku Okubo; Toshihisa Takagi; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2009-10-22 Impact factor: 16.971

5. Archiving next generation sequencing data.

Authors: Martin Shumway; Guy Cochrane; Hideaki Sugawara
Journal: Nucleic Acids Res Date: 2009-12-03 Impact factor: 16.971

6. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.

Authors: Peter J A Cock; Christopher J Fields; Naohisa Goto; Michael L Heuer; Peter M Rice
Journal: Nucleic Acids Res Date: 2009-12-16 Impact factor: 16.971

7. ZTR: a new format for DNA sequence trace data.

Authors: James K Bonfield; Rodger Staden
Journal: Bioinformatics Date: 2002-01 Impact factor: 6.937

8. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

9. Human genomes as email attachments.

Authors: Scott Christley; Yiming Lu; Chen Li; Xiaohui Xie
Journal: Bioinformatics Date: 2008-11-07 Impact factor: 6.937
