
The European Nucleotide Archive.

Rasko Leinonen, Ruth Akhtar, Ewan Birney, Lawrence Bower, Ana Cerdeno-Tárraga, Ying Cheng, Iain Cleland, Nadeem Faruque, Neil Goodgame, Richard Gibson, Gemma Hoad, Mikyung Jang, Nima Pakseresht, Sheila Plaister, Rajesh Radhakrishnan, Kethi Reddy, Siamak Sobhany, Petra Ten Hoopen, Robert Vaughan, Vadim Zalunin, Guy Cochrane.

Abstract

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe's primary nucleotide-sequence repository. The ENA consists of three main databases: the Sequence Read Archive (SRA), the Trace Archive and EMBL-Bank. The objective of ENA is to support and promote the use of nucleotide sequencing as an experimental research platform by providing data submission, archive, search and download services. In this article, we outline these services and describe major changes and improvements introduced during 2010. These include extended EMBL-Bank and SRA-data submission services, extended ENA Browser functionality, support for submitting data to the European Genome-phenome Archive (EGA) through SRA, and the launch of a new sequence similarity search service.


Year: 2010 PMID: 20972220 PMCID: PMC3013801 DOI: 10.1093/nar/gkq967

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

THE EUROPEAN NUCLEOTIDE ARCHIVE

The European Nucleotide Archive (ENA) operates as a public archive for nucleotide sequence data. By bringing together databases for raw sequence data, assembly information and functional annotation, the ENA provides a comprehensive and integrated resource for this fundamental source of biological information. Central to the ENA is the provision of submission services, including interactive and programmatic submission tools; search services, including text and sequence similarity search tools; and data presentation and retrieval services. The ENA works closely with NCBI (1) and DDBJ (2) as partners in the International Nucleotide Sequence Database Collaboration (INSDC) (3). The principal policy of the INSDC is to provide free and unrestricted permanent access to all archived nucleotide data. All primary data in the INSDC belong to the submitters and can only be updated with submitter consent. For full policy details, please refer to http://www.insdc.org/policy.html.

CONTENT

In October 2010, the ENA contained ∼500 billion raw and assembled sequences consisting of ∼50 trillion base pairs. Over the last 3 years, the next-generation sequence reads stored in the Sequence Read Archive (SRA) have become the largest and fastest-growing source of new data, now accounting for ∼95% of all base pairs made available by the ENA. At the same time, the number of completed genome sequences has risen to over 1400 for cellular organisms and 3000 for viruses and phages (http://www.ebi.ac.uk/genomes/).
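As a rough consistency check on these figures, the short sketch below derives the implied mean sequence length and the SRA's share in base pairs. The derived values are back-of-the-envelope estimates from the rounded numbers above, not figures reported in the article, and the function names are illustrative only.

```python
# Approximate figures quoted in the text (October 2010).
TOTAL_SEQUENCES = 500 * 10**9    # ~500 billion raw and assembled sequences
TOTAL_BASE_PAIRS = 50 * 10**12   # ~50 trillion base pairs
SRA_SHARE_PERCENT = 95           # ~95% of base pairs are SRA reads


def mean_sequence_length(base_pairs: int, sequences: int) -> float:
    """Average number of base pairs per archived sequence."""
    return base_pairs / sequences


def sra_base_pairs(base_pairs: int, share_percent: int) -> int:
    """Base pairs attributable to the SRA (integer arithmetic, no rounding drift)."""
    return base_pairs * share_percent // 100


if __name__ == "__main__":
    print(mean_sequence_length(TOTAL_BASE_PAIRS, TOTAL_SEQUENCES))
    print(sra_base_pairs(TOTAL_BASE_PAIRS, SRA_SHARE_PERCENT))
```

A mean of about 100 base pairs per sequence is consistent with an archive dominated by short next-generation reads.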

SUBMISSIONS OF RAW DATA FROM NEXT GENERATION PLATFORMS

The SRA accepts sequence submissions from next-generation sequencing platforms. New submitters should contact datasubs@ebi.ac.uk to request a submission account and a secure data-upload area. Submitters first upload data files to the secure data-upload area in one of the supported data formats, then prepare and submit study, sample, experiment, run and submission XML files to the SRA. Detailed submission instructions are available at http://www.ebi.ac.uk/ena/about/page.php?page=sra_submissions.

We have extended the SRA submission service to support submissions of authorized access data, typically clinical samples that have been sequenced under a confidentiality and consent agreement. Authorized access data can now be submitted through the SRA submission service into the European Genome-phenome Archive (EGA; http://www.ebi.ac.uk/ega). Data submitted to the EGA are not part of the public SRA database and are excluded from the INSDC data exchange. Permission to view and retrieve authorized access data can only be granted by the external data access committee (DAC) responsible for the data concerned. Please contact ega-helpdesk@ebi.ac.uk for more information about EGA policies. A secure data-upload area is required to submit authorized access data through the SRA submission service. It is also possible to submit the EGA's policy, dataset and DAC objects through the SRA.

The SRA will shortly accept sequence read submissions in Binary Alignment/Map (BAM) format (4). A BAM file is a compressed binary representation of the Sequence Alignment/Map (SAM) format. With sequence read alignments becoming an increasingly common intermediate in primary analysis, BAM is emerging as a popular choice for storing sequence reads together with their alignments. The SRA is currently finalizing an archive BAM specification that will standardize the use of BAM files for primary data archival. Once it is completed, BAM submissions to SRA archives will be required to follow this specification.
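Because a BAM file is a compressed binary container, a submitter may want a quick sanity check before uploading. The following is a minimal sketch, assuming only two properties of the general BAM format: the file is gzip-compatible (BGZF), and its decompressed content starts with the four bytes `BAM\x01`. It does not implement the archive BAM specification mentioned above, whose details are not given here, and the function name `looks_like_bam` is hypothetical.

```python
import gzip
import zlib

BAM_MAGIC = b"BAM\x01"  # first four bytes of a decompressed BAM stream


def looks_like_bam(path: str) -> bool:
    """Return True if the file is gzip-readable and starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError, zlib.error):
        # Not gzip-compatible, truncated, or corrupt: cannot be a valid BAM file.
        return False
```

A check of this kind only confirms the container type; it says nothing about whether the reads, headers or alignments inside meet any archival requirement.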

SUBMISSIONS OF ASSEMBLED AND ANNOTATED SEQUENCES

EMBL-Bank is a comprehensive public database of nucleotide sequences, associated biological annotation and bibliographic information. It contains a large diversity of data, from patent, expressed sequence tag, whole genome shotgun (WGS) and other high-throughput sequences, through genomic assemblies and richly annotated sequence fragments, to whole replicons (5). Submitters should navigate to http://www.ebi.ac.uk/ena/about/page.php?page=submissions for access to all submission services. Advice regarding EMBL-Bank submissions is available from datasubs@ebi.ac.uk.

We have extended the web-based EMBL-Bank submission service in a number of ways. For providers of genome-scale data, we have added functionality that allows data submissions in EMBL-Bank flat file format. For smaller scale submissions, we have added new templates to the EMBL-Bank submission service. Each template focuses on a particular commonly occurring type of sequence and annotation data and collects the required information from submitters through a web form or spreadsheet upload. New templates are available for unannotated WGS submissions with only source organism annotation, and for protein-coding and phylogenetic-marker regions. The template mechanism, introduced in 2009, has been well received and now attracts up to half of all web-based EMBL-Bank submissions.

DATA SEARCH, BROWSING AND RETRIEVAL

ENA data can be browsed and retrieved in XML, HTML, FASTA, FASTQ and flat file formats using the ENA Browser, which can be used both interactively and programmatically through REST URLs. In 2010, we extended the ENA Browser to cover EMBL-Bank and Trace Archive records and introduced several improvements, including a graphical EMBL-Bank annotation and assembly viewer and intuitive navigation between different ENA data classes. For full details of the ENA Browser URL syntax, please refer to http://www.ebi.ac.uk/ena/about/page.php?page=browser. For example, the following URL returns the complete mitochondrial genome for ‘Ursus spelaeus’ (cave bear) (6): http://www.ebi.ac.uk/ena/data/view/FM177760 (Figure 1).

Data can be queried using the EB-eye free text search functionality available in the header section of all EBI web pages (7). ENA results are available under the ‘Nucleotide Sequences’ category and are linked to the ENA Browser. Free text search is also available from the ENA home page: http://www.ebi.ac.uk/ena.
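Programmatic use of the ENA Browser amounts to building a REST URL from an accession and retrieving it. The sketch below shows one way this could look, using the base address from the cave bear example. The `&display=<format>` suffix used to select an output format is an assumption not stated in the text (the full URL syntax is documented on the ENA Browser help page cited above), and the helper names `ena_view_url` and `fetch_record` are hypothetical.

```python
from typing import Optional
from urllib.request import urlopen

# Base address taken from the example URL in the text.
ENA_VIEW_BASE = "http://www.ebi.ac.uk/ena/data/view/"


def ena_view_url(accession: str, display: Optional[str] = None) -> str:
    """Build an ENA Browser URL for one accession."""
    url = ENA_VIEW_BASE + accession
    if display is not None:
        # ASSUMPTION: format selection via "&display=..." is not given in the text.
        url += "&display=" + display
    return url


def fetch_record(accession: str, display: Optional[str] = None,
                 timeout: float = 30.0) -> bytes:
    """Retrieve the raw response body for an accession (performs a network request)."""
    with urlopen(ena_view_url(accession, display), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Builds the URL only; no request is made here.
    print(ena_view_url("FM177760"))
```

Keeping URL construction separate from retrieval makes the URL logic easy to check without contacting the service.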

Figure 1.

The complete mitochondrial genome for Ursus spelaeus (cave bear) from the Max Planck Institute for Evolutionary Anthropology submitted to EMBL-Bank in 2010.

Rapid and comprehensive sequence similarity searches against ENA data are supported through a new service based on Exonerate (8) technology: http://www.ebi.ac.uk/ena/search/ (Goodgame, N., manuscript in preparation). All nucleotide sequences archived by the INSDC and made available as part of EMBL-Bank are covered by our service. This includes all ENA sequences except raw reads from the Trace Archive and SRA. Experimental search support for a limited number of raw reads is provided through de Bruijn servers based on Velvet (9), which use the Exonerate client-server protocol and are fully integrated with our search service. This search is available by selecting the ‘Experimental De Bruijn search’ option from the search page. The EMBL-Bank sequence search service is currently being expanded for more specific purposes in response to community requests.

Bulk download of EMBL-Bank data is supported through FTP at ftp://ftp.ebi.ac.uk/pub/databases/embl/. SRA and Trace Archive data are available through FTP at ftp://ftp.sra.ebi.ac.uk/ and through Aspera at fasp.sra.ebi.ac.uk.
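For bulk retrieval, the FTP addresses above can be split into a host and a directory and passed to a standard FTP client. The following is a minimal sketch using Python's standard library. It assumes anonymous FTP access, which the text does not state explicitly, and the helper names `ftp_target` and `list_directory` are hypothetical.

```python
from ftplib import FTP
from typing import List, Tuple
from urllib.parse import urlparse

# FTP locations quoted in the text.
EMBL_BANK_FTP = "ftp://ftp.ebi.ac.uk/pub/databases/embl/"
SRA_FTP = "ftp://ftp.sra.ebi.ac.uk/"


def ftp_target(url: str) -> Tuple[str, str]:
    """Split an ftp:// URL into (host, directory path)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path or "/"


def list_directory(url: str, timeout: float = 30.0) -> List[str]:
    """List entries in an FTP directory (performs a network request)."""
    host, path = ftp_target(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()   # ASSUMPTION: anonymous login is permitted
        ftp.cwd(path)
        return ftp.nlst()


if __name__ == "__main__":
    # Parses the URL only; no connection is opened here.
    print(ftp_target(EMBL_BANK_FTP))
```

Aspera transfers use a separate client and protocol and are not covered by this sketch.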

ENA COMMUNITY

The ENA team welcomes feedback and suggestions relating to all of our services at datasubs@ebi.ac.uk. We are always interested in hearing from potential collaborators who have an interest in working with and integrating our services.

FUNDING

The ENA is funded by the European Molecular Biology Laboratory, European Commission and the Wellcome Trust. Funding for open access charge: European Molecular Biology Laboratory. Conflict of interest statement. None declared.

REFERENCES

1. Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors: Daniel R Zerbino; Ewan Birney
Journal: Genome Res Date: 2008-03-18 Impact factor: 9.043

2. Fast and efficient searching of biological data resources--using EB-eye.

Authors: Franck Valentin; Silvano Squizzato; Mickael Goujon; Hamish McWilliam; Juri Paern; Rodrigo Lopez
Journal: Brief Bioinform Date: 2010-02-11 Impact factor: 11.622

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. Improvements to services at the European Nucleotide Archive.

Authors: Rasko Leinonen; Ruth Akhtar; Ewan Birney; James Bonfield; Lawrence Bower; Matt Corbett; Ying Cheng; Fehmi Demiralp; Nadeem Faruque; Neil Goodgame; Richard Gibson; Gemma Hoad; Christopher Hunter; Mikyung Jang; Steven Leonard; Quan Lin; Rodrigo Lopez; Michael Maguire; Hamish McWilliam; Sheila Plaister; Rajesh Radhakrishnan; Siamak Sobhany; Guy Slater; Petra Ten Hoopen; Franck Valentin; Robert Vaughan; Vadim Zalunin; Daniel Zerbino; Guy Cochrane
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

5. Automated generation of heuristics for biological sequence comparison.

Authors: Guy St C Slater; Ewan Birney
Journal: BMC Bioinformatics Date: 2005-02-15 Impact factor: 3.169

6. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2009-11-12 Impact factor: 16.971

7. DDBJ launches a new archive database with analytical tools for next-generation sequence data.

Authors: Eli Kaminuma; Jun Mashima; Yuichi Kodama; Takashi Gojobori; Osamu Ogasawara; Kousaku Okubo; Toshihisa Takagi; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2009-10-22 Impact factor: 16.971

8. Mitochondrial genomes reveal an explosive radiation of extinct and extant bears near the Miocene-Pliocene boundary.

Authors: Johannes Krause; Tina Unger; Aline Noçon; Anna-Sapfo Malaspinas; Sergios-Orestis Kolokotronis; Mathias Stiller; Leopoldo Soibelzon; Helen Spriggs; Paul H Dear; Adrian W Briggs; Sarah C E Bray; Stephen J O'Brien; Gernot Rabeder; Paul Matheus; Alan Cooper; Montgomery Slatkin; Svante Pääbo; Michael Hofreiter
Journal: BMC Evol Biol Date: 2008-07-28 Impact factor: 3.260
