Archiving next generation sequencing data.

Martin Shumway, Guy Cochrane, Hideaki Sugawara.

Abstract

Next generation sequencing platforms are producing biological sequencing data in unprecedented amounts. The partners of the International Nucleotide Sequence Database Collaboration, which include the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ), have established the Sequence Read Archive (SRA) to provide the scientific community with an archival destination for next generation data sets. The SRA is now accessible at http://www.ncbi.nlm.nih.gov/Traces/sra from NCBI, at http://www.ebi.ac.uk/ena from EBI and at http://www.ddbj.nig.ac.jp/sub/trace_sra-e.html from DDBJ. Users of these resources can obtain data sets deposited in any of the three SRA instances. Links and submission instructions are provided.

Year: 2009 PMID: 19965774 PMCID: PMC2808927 DOI: 10.1093/nar/gkp1078

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

TEXT

Next generation sequencing platforms are revolutionizing genomics and genome science. These instruments are producing vastly more sequencing data than was ever possible with capillary technology, providing more power for resolution of genomic variation, reducing clonal bias in amplification and making practicable new assays such as full-length cDNA sequencing on a large scale. In addition, the shift from microarrays to next generation sequencing platforms for gene expression and epigenomics investigations has resulted in much greater resolving power and accuracy for those experiments. The new technologies offer tremendous promise for advancing fundamental knowledge about biology, particularly if the data are made widely available to researchers. Based on the experience with the Trace Archive (established at NCBI and Wellcome Trust Sanger Institute in 2001 to archive and distribute capillary sequences to the scientific community) (1), NCBI set out in 2007 to design a successor archive to accommodate the next generation sequencing platforms (2). These platforms now include 454 (Roche Diagnostics Corporation, Branford, CT, USA), Illumina Genome Analyzer (Illumina, Inc., San Diego, CA, USA), SOLiD™ (Life Technologies Corporation, Carlsbad, CA, USA), HeliScope (Helicos Biosciences Corporation, Cambridge, MA, USA), Complete Genomics (Complete Genomics Inc., Mountain View, CA, USA) and SMRT™ (Pacific Biosciences Inc., Menlo Park, CA, USA). The resulting Sequence Read Archive (SRA) is now accessible at http://www.ncbi.nlm.nih.gov/Traces/sra from NCBI, at http://www.ebi.ac.uk/ena from European Bioinformatics Institute (EBI) and at http://www.ddbj.nig.ac.jp/sub/trace_sra-e.html from DNA Data Bank of Japan (DDBJ).
In order to adapt to the much greater output from next generation sequencing platforms, the SRA incorporates several improvements over the Trace Archive, including separation of metadata from the content, institution of a ‘run’ concept to cover the production unit (plate or flowcell) and the creation of a sequencing ‘experiment’ object to describe the sequencing library that the runs belong to. The SRA data model was designed in collaboration with the EBI and the DDBJ under the auspices of the International Nucleotide Sequence Database Collaboration (INSDC) (http://www.insdc.org). The INSDC’s DDBJ/EMBL/GenBank database has been a critical resource in biomedicine. As new technologies have arisen, be they ESTs or whole genome shotgun records, DDBJ/EMBL/GenBank have adapted and expanded to maintain this valuable international shared resource. The expansion of Trace/SRA into the international collaboration continues the support for a uniform, international path to critical data sharing in biomedicine. The three SRAs will mirror data and share an accession space, essentially providing a world-wide archive. The EBI’s SRA implementation is described in (3) and DDBJ’s in (4).
In November 2009, the SRAs collectively hosted about 11 Terabases of biological sequence data. This included 170 full-length human genomes, over 900 bacterial genomes, and ∼100 expression and epigenomics studies. Over 90 published studies have been linked to SRA deposits. Most of the human genomes were produced by the 1000 Genomes Project, which is using sequencing data to perform a deep analysis of ordinary human variation in three healthy populations with the expectation of detecting common human genetic variants (defined as frequency 1% or higher) (www.1000genomes.org). The Project is submitting reads to the SRAs in real time as they are produced, giving investigators not associated with the project direct access to its output.
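The relationship between the ‘experiment’ and ‘run’ objects described earlier in this section can be illustrated with a minimal sketch. It is not the SRA schema: the class names, field names, accession strings and file names below are all hypothetical and chosen only to show one experiment (a sequencing library) owning several runs (production units), with the bulk read data referenced by file name rather than embedded in the metadata.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One production unit (e.g. a plate or flowcell).

    Holds only a reference to the read data, so metadata stays
    separate from content. All field names are hypothetical.
    """
    accession: str   # hypothetical identifier, not a real SRA accession
    data_file: str   # hypothetical name of the file holding the reads


@dataclass
class Experiment:
    """Describes one sequencing library; its runs belong to it."""
    accession: str
    platform: str
    library_name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    runs: List[Run] = field(default_factory=list)

    def add_run(self, run: Run) -> None:
        """Attach a run to this experiment."""
        self.runs.append(run)


# Usage: one library sequenced on two production units.
exp = Experiment(
    accession="EXP0001",
    platform="Illumina Genome Analyzer",
    library_name="lib-A",
)
exp.add_run(Run(accession="RUN0001", data_file="run0001.srf"))
exp.add_run(Run(accession="RUN0002", data_file="run0002.srf"))
print(len(exp.runs), [r.accession for r in exp.runs])
```

Keeping the library description on the experiment means it is recorded once, however many runs are later added to it.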
The value of the SRAs to the scientific community will depend on the degree to which data from investigations are deposited. Accordingly, NCBI, EBI and DDBJ encourage researchers to consider depositing their data in one of the SRAs. We have tried to ease the burden of sequence submission in several ways: first-time and occasional submitters can use an interactive interface and upload smaller data sets through a web browser; high-throughput users can submit data via an automated submission pipeline that uses XML to describe metadata and the community-developed Sequence Read Format (SRF) as a common container file format; and all three SRAs use a high-speed file transfer protocol called fasp (Aspera, Inc., Emeryville, CA, USA) that allows users to transfer files at speeds up to 400 Mbps, many times faster than ftp. For information about submitting to SRA, see http://www.ncbi.nlm.nih.gov/Traces/sra/static/SRA_Submission_Guidelines.pdf at NCBI, http://www.ebi.ac.uk/embl/Documentation/ENA-Reads.html at EBI and http://trace.ddbj.nig.ac.jp/dra/submission_e.shtml at DDBJ. Functional genomics studies utilizing short reads (e.g. ChIP-Seq and mRNA-Seq) can be submitted via the Gene Expression Omnibus and ArrayExpress resources; see instructions at http://www.ncbi.nlm.nih.gov/geo/info/seq.html and http://www.ebi.ac.uk/microarray/submissions_overview.html, respectively. Finally, NCBI and EBI are working on developing SRA instances specially designed for the archiving of human sequencing data sets under privacy control, usage restrictions or ethical constraints.
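To give a sense of what the quoted 400 Mbps ceiling means in practice, the following sketch computes an idealized transfer time. The 100 GB submission size is a hypothetical example, the function name is illustrative, and the calculation assumes the full rate is sustained with no protocol overhead, so real transfers would take longer.

```python
def transfer_seconds(size_bytes: float, rate_mbps: float) -> float:
    """Idealized transfer time in seconds.

    size_bytes: payload size in bytes.
    rate_mbps:  sustained rate in megabits per second (10**6 bits/s).
    """
    if rate_mbps <= 0:
        raise ValueError("rate_mbps must be positive")
    return size_bytes * 8 / (rate_mbps * 1_000_000)


# Hypothetical 100 GB data set at the stated 400 Mbps maximum.
size = 100 * 10**9
seconds = transfer_seconds(size, 400)
print(f"{seconds:.0f} s (about {seconds / 60:.1f} min)")
```

Under these assumptions the transfer takes 2000 seconds, roughly 33 minutes.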

FUNDING

The EBI’s next generation sequence archiving activities are supported by the Wellcome Trust, the European Commission and the European Molecular Biology Laboratory. DDBJ’s work on SRA and Trace Archive is supported by the Ministry of Education, Culture, Sports, Science and Technology of Japan. Funding for open access charge: NCBI’s SRA work was supported by the Intramural Research Program of the NIH, National Library of Medicine. Conflict of interest statement. None declared.

5 in total

1. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Lewis Y Geer; Yuri Kapustin; Oleg Khovayko; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Vadim Miller; Kim D Pruitt; Gregory D Schuler; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Roman L Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2006-12-14 Impact factor: 16.971

2. Petabyte-scale innovations at the European Nucleotide Archive.

Authors: Guy Cochrane; Ruth Akhtar; James Bonfield; Lawrence Bower; Fehmi Demiralp; Nadeem Faruque; Richard Gibson; Gemma Hoad; Tim Hubbard; Christopher Hunter; Mikyung Jang; Szilveszter Juhos; Rasko Leinonen; Steven Leonard; Quan Lin; Rodrigo Lopez; Dariusz Lorenc; Hamish McWilliam; Gaurab Mukherjee; Sheila Plaister; Rajesh Radhakrishnan; Stephen Robinson; Siamak Sobhany; Petra Ten Hoopen; Robert Vaughan; Vadim Zalunin; Ewan Birney
Journal: Nucleic Acids Res Date: 2008-10-31 Impact factor: 16.971

3. DDBJ dealing with mass data produced by the second generation sequencer.

Authors: Hideaki Sugawara; Kazuho Ikeo; Satoshi Fukuchi; Takashi Gojobori; Yoshio Tateno
Journal: Nucleic Acids Res Date: 2008-10-16 Impact factor: 16.971

4. Database resources of the National Center for Biotechnology Information.

Authors: Eric W Sayers; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Michael Feolo; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; Vadim Miller; Ilene Mizrachi; James Ostell; Kim D Pruitt; Gregory D Schuler; Edwin Sequeira; Stephen T Sherry; Martin Shumway; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko; Jian Ye
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971

5. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Michael Feolo; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; Oleg Khovayko; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; Vadim Miller; James Ostell; Kim D Pruitt; Gregory D Schuler; Martin Shumway; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Roman L Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2007-11-27 Impact factor: 16.971

58 in total

Review 1. Next-generation sequencing techniques for eukaryotic microorganisms: sequencing-based solutions to biological problems.

Authors: Minou Nowrousian
Journal: Eukaryot Cell Date: 2010-07-02

2. Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors: Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal: Genome Res Date: 2011-01-18 Impact factor: 9.043

Review 3. Analytical tools and current challenges in the modern era of neuroepigenomics.

Authors: Ian Maze; Li Shen; Bin Zhang; Benjamin A Garcia; Ningyi Shao; Amanda Mitchell; HaoSheng Sun; Schahram Akbarian; C David Allis; Eric J Nestler
Journal: Nat Neurosci Date: 2014-10-28 Impact factor: 24.884

Review 4. Protein-DNA binding in high-resolution.

Authors: Shaun Mahony; B Franklin Pugh
Journal: Crit Rev Biochem Mol Biol Date: 2015-06-03 Impact factor: 8.250

5. The case for cloud computing in genome informatics.

Authors: Lincoln D Stein
Journal: Genome Biol Date: 2010-05-05 Impact factor: 13.583

6. Identifying wrong assemblies in de novo short read primary sequence assembly contigs.

Authors: Vandna Chawla; Rajnish Kumar; Ravi Shankar
Journal: J Biosci Date: 2016-09 Impact factor: 1.826

7. Geoseq: a tool for dissecting deep-sequencing datasets.

Authors: James Gurtowski; Anthony Cancio; Hardik Shah; Chaya Levovitz; Ajish George; Robert Homann; Ravi Sachidanandam
Journal: BMC Bioinformatics Date: 2010-10-12 Impact factor: 3.169

8. ISA software suite: supporting standards-compliant experimental annotation and enabling curation at the community level.

Authors: Philippe Rocca-Serra; Marco Brandizi; Eamonn Maguire; Nataliya Sklyar; Chris Taylor; Kimberly Begley; Dawn Field; Stephen Harris; Winston Hide; Oliver Hofmann; Steffen Neumann; Peter Sterk; Weida Tong; Susanna-Assunta Sansone
Journal: Bioinformatics Date: 2010-08-02 Impact factor: 6.937

Review 9. Proteomics data repositories: providing a safe haven for your data and acting as a springboard for further research.

Authors: Juan Antonio Vizcaíno; Joseph M Foster; Lennart Martens
Journal: J Proteomics Date: 2010-07-06 Impact factor: 4.044

10. Gene Fusion Markup Language: a prototype for exchanging gene fusion data.

Authors: Shanker Kalyana-Sundaram; Achiraman Shanmugam; Arul M Chinnaiyan
Journal: BMC Bioinformatics Date: 2012-10-16 Impact factor: 3.169