Major submissions tool developments at the European Nucleotide Archive.

Clara Amid, Ewan Birney, Lawrence Bower, Ana Cerdeño-Tárraga, Ying Cheng, Iain Cleland, Nadeem Faruque, Richard Gibson, Neil Goodgame, Christopher Hunter, Mikyung Jang, Rasko Leinonen, Xin Liu, Arnaud Oisel, Nima Pakseresht, Sheila Plaister, Rajesh Radhakrishnan, Kethi Reddy, Stephane Rivière, Marc Rossello, Alexander Senf, Dimitriy Smirnov, Petra Ten Hoopen, Daniel Vaughan, Robert Vaughan, Vadim Zalunin, Guy Cochrane.

Abstract

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena), Europe's primary nucleotide sequence resource, captures and presents globally comprehensive nucleic acid sequence and associated information. Covering the spectrum from raw data to assembled and functionally annotated genomes, the ENA has witnessed a dramatic growth resulting from advances in sequencing technology and ever broadening application of the methodology. During 2011, we have continued to operate and extend the broad range of ENA services. In particular, we have released major new functionality in our interactive web submission system, Webin, through developments in template-based submissions for annotated sequences and support for raw next-generation sequence read submissions.

Year: 2011 PMID: 22080548 PMCID: PMC3245037 DOI: 10.1093/nar/gkr946

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The European Nucleotide Archive (ENA) is maintained and developed at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and serves as Europe's primary repository for nucleotide sequence and associated information. Content spans raw sequence reads from all sequencing platforms, read alignments, assembly information and submitted functional annotation. Providing both the permanent scientific record as a complement to the literature publication process and a forum for early sharing of pre-publication data, the ENA serves as a critical foundation for the global bioinformatics data infrastructure. Globally comprehensive coverage is assured through long-standing data exchange agreements with the DNA Data Bank of Japan (DDBJ) (1) and the United States National Institutes of Health National Center for Biotechnology Information (NCBI) (2) under the International Nucleotide Sequence Database Collaboration (3; http://www.insdc.org/). Underlying ENA are a number of core databases, including the Sequence Read Archive for raw reads and read alignments from next generation sequencing platforms (4) and EMBL-Bank for high level assembly information, assembled sequences and functional annotation. ENA services are numerous: we provide submission tools, both the web-based Webin system and programmatic interfaces; we offer search technologies, such as the newly developed rapid ENA sequence similarity search (http://www.ebi.ac.uk/ena/search) and text-based search tools (http://www.ebi.ac.uk/ena); we present integrated access to all ENA content through the ENA Browser, which offers both web browsing and REST access (http://www.ebi.ac.uk/ena/about/browser).
We are highly responsive in the development of new technologies and services to adapt to changes in sequencing technology and user requirements: we are leading a community-facing sequence read compression initiative, CRAM (5; http://www.ebi.ac.uk/ena/about/cram_toolkit); we are developing an encrypted BAM read alignment server that supports reference coordinate-based lookups of controlled access reads by region; we are active in the development of data warehousing methodologies to provide real-time access to the massive data sets that we store (e.g. the ENA Taxon Portal; http://www.ebi.ac.uk/ena/data/view/Taxon:Eukaryota). In this article, we comment on content and report briefly on means by which ENA data can be accessed. We then focus on major developments in our Webin submission system in the areas of template-based submissions of annotated and assembled sequences and raw next generation sequence read submission. We also announce the introduction of a sequence length limit for submission of assembled sequences.

ENA CONTENT

At the time of going to press, ENA contains 346 598 699 035 nt of assembled sequence in 220 504 007 assembled sequence entries (see EMBL-Bank release notes at http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html) and more than 100 terabases of raw next generation sequence reads (Figure 1A and B).

Figure 1.

(A) Growth of assembled sequences (ENA: EMBL-Bank); see http://www.ebi.ac.uk/ena/about/statistics#embl_growth for dynamically updated growth chart. (B) Growth of raw data from next generation sequencing platforms (ENA: SRA); see http://www.ebi.ac.uk/ena/about/statistics#sra_growth for dynamically updated growth chart.

Notable datasets submitted to ENA during 2011 include assemblies of Gorilla gorilla (FR853080-FR853106), Atlantic cod, Gadus morhua (Project:41391), vine, Vitis vinifera (Project:18785), Takifugu rubripes (Project:1434), Macaca fascicularis (FR874244-FR874264), medieval mitochondria and Yersinia plasmids (6; HE576978-HE576987), raw genomic reads from 18 lines of Arabidopsis thaliana (7; ERP000565), Staphylococcus aureus (8; ERP000528) and Mus musculus ES cells (9; ERP000570) and transcriptomic reads from multiple Silene species (10; ERP000371).

ENA DATA ACCESS

Full ENA content is made available through an integrated platform, the ENA Browser, that supports discovery (text search, sequence similarity search, taxon lookup, etc.) and retrieval of records both interactively (through web browsing) and programmatically (under RESTful URLs). Full details are available from http://www.ebi.ac.uk/ena/about/browser. Records are made available in a selection of appropriate formats that include EMBL-Bank flat file, Fasta and XML for assembled and annotated sequences, Fastq for sequence reads and Darwin Core for taxon records (http://www.ebi.ac.uk/ena/about/formats). In addition, we support both ftp and Aspera protocols for network transfers of large raw data sets (ftp://ftp.sra.ebi.ac.uk) and offer a variety of data products over ftp for other areas of ENA content (ftp://ftp.ebi.ac.uk/pub/databases/embl and ftp://ftp.ebi.ac.uk/pub/databases/ena).
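As an illustration of programmatic retrieval under RESTful URLs, the following minimal Python sketch builds record URLs on the `/ena/data/view/` pattern quoted in this article (e.g. `Taxon:Eukaryota`). The function name `build_view_url` and the `display` format parameter are assumptions made for illustration; the browser documentation should be consulted for the parameters that are actually supported. The sketch only constructs URLs and performs no network access.

```python
from urllib.parse import quote

# Base of the record-view URLs quoted in the article
# (e.g. http://www.ebi.ac.uk/ena/data/view/Taxon:Eukaryota).
ENA_VIEW_BASE = "http://www.ebi.ac.uk/ena/data/view/"


def build_view_url(record_id, display=None):
    """Return a view URL for one record (hypothetical helper).

    `display` is an ASSUMED format selector (e.g. "fasta", "xml");
    it is not confirmed by the article.
    """
    # Keep ':' and '-' unescaped so identifiers such as 'Taxon:Eukaryota'
    # or accession ranges stay readable.
    url = ENA_VIEW_BASE + quote(record_id, safe=":-")
    if display is not None:
        # Assumption: the format is appended as '&display=<name>'.
        url += "&display=" + quote(display)
    return url


if __name__ == "__main__":
    print(build_view_url("Taxon:Eukaryota"))
    print(build_view_url("FR853080", display="fasta"))
```

Keeping URL construction separate from the HTTP request makes the pattern easy to adjust should the real parameter names differ.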

ANNOTATED AND ASSEMBLED SEQUENCE SUBMISSIONS

A pre-tailored template system was introduced in our Webin submission framework in 2009 for annotated sequence submissions and has been expanded during 2011 with the release of nine new templates. These templates have been designed for the most frequent types of sequence submissions and reached 15 in number in September 2011. When using the templates, submitters provide nucleotide sequences with associated annotation through spreadsheets or Fasta files with pre-defined mandatory and optional fields, a process that significantly reduces the overall complexity of the submissions process for both the submitter and the ENA curator. Advantages of the new system include the ability to choose from a small number of variables, functionality that removes the need for repetitive entry of information that is constant across all records in a data set, and straightforward validation before data submission. The template concept has shown growing popularity since its launch versus the traditional system (which remains available for a limited time). Under the traditional system, submitters were able to annotate their entries with the full INSDC-approved features and qualifiers either one entry at a time or by defining with an ENA curator a specific template for each submission. This was useful for annotating small submissions in great detail but did not cater efficiently for larger-scale submissions of same-type data. Figure 2 shows the usage of the available submission systems between 2009 and 2011 and Table 1 shows the currently available templates.
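The template idea described above (pre-defined mandatory and optional fields, values entered once for a whole data set, and validation before submission) can be sketched in a few lines of Python. This is an illustrative sketch only and not Webin's implementation: the template definition, the field names (`ENTRYNUMBER`, `ORGANISM`, `SEQUENCE`, `ISOLATE`, `COUNTRY`) and the function `validate_rows` are all hypothetical.

```python
import csv
import io

# Hypothetical template definition; the field names are illustrative only
# and are not taken from any real Webin template.
TEMPLATE = {
    "mandatory": ["ENTRYNUMBER", "ORGANISM", "SEQUENCE"],
    "optional": ["ISOLATE", "COUNTRY"],
}


def validate_rows(tsv_text, template, constants=None):
    """Check a tab-separated sheet against a template.

    Returns (records, errors): records that passed, and one message
    per problem found.
    """
    constants = constants or {}
    allowed = set(template["mandatory"]) | set(template["optional"])
    records, errors = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    unknown = set(reader.fieldnames or []) - allowed
    if unknown:
        errors.append("unknown columns: " + ", ".join(sorted(unknown)))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # Values constant across the data set are supplied once and merged
        # into every record; non-empty per-row values take precedence.
        record = {**constants, **{k: v for k, v in row.items() if v}}
        missing = [f for f in template["mandatory"] if not record.get(f)]
        if missing:
            errors.append(f"line {line_no}: missing {', '.join(missing)}")
        else:
            records.append(record)
    return records, errors


if __name__ == "__main__":
    sheet = "ENTRYNUMBER\tSEQUENCE\tISOLATE\n1\tACGT\tA1\n2\t\tA2\n"
    ok, problems = validate_rows(sheet, TEMPLATE, {"ORGANISM": "Vitis vinifera"})
    print(ok)
    print(problems)
```

In this sketch the organism is given once as a data-set constant rather than repeated in every row, and the second row is rejected because its mandatory sequence field is empty.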

Figure 2.

Usage of the different web-based interactive submission systems for annotated sequences at ENA between 2009 and 2011.

Table 1.

Names and definitions of templates currently available for sequence submissions to EMBL-Bank

Template name	Definition
Intergenic Spacer, IGS	For intergenic spacer (IGS) sequences between neighbouring genes (e.g. psbA-trnH IGS, 16S-23S rRNA IGS). Inclusion of the flanking genes is allowed
ITS region	For the 18S rRNA, ITS1, 5.8S rRNA, ITS2, 28S rRNA region, where the locations of the boundaries are not known
D-Loop	For mitochondrial D-loop (control region) sequences. All D-loops are considered partial
trnK-matK locus	For complete or partial matK gene within the chloroplast trnK gene
COI gene	For mitochondrial cytochrome oxidase subunit 1 genes
MHC gene 1 exon	For partial MHC class I or II antigens containing one exon
MHC gene 2 exons	For partial MHC class I or II antigens containing two exons
Single CDS genomic DNA	For complete or partial single non-segmented coding sequence (CDS) derived from genomic DNA
Single viral CDS genomic RNA	For complete or partial single coding sequence (CDS) derived from viral genomic RNA. Please do not use for viral DNA, peptides processed from polyproteins, viral cRNAs, or proviral sequences, as these are all annotated differently
Single CDS mRNA	For complete or partial single coding sequence (CDS) derived from mRNA (via cDNA)
rRNA gene	For ribosomal RNA genes from prokaryotic, nuclear or mitochondrial DNA. All rRNAs are considered partial
EST	For EST (expressed sequence tag) submissions
WGS (unannotated)	For unannotated Whole Genome Shotgun (WGS) sequences
MIMARKS-Survey 16S rRNA sequences	For the submission of 16S rRNA sequence compliant with the MIMARKS Minimal Information about a MARKer gene Sequence Standard
Soil sample MIMARKS-Survey using 16S rRNA sequences	For the submission of 16S rRNA sequence compliant with the MIMARKS Minimal Information about a MARKer gene Sequence Standard, specific to soil metagenomes

As part of these developments, ENA is also facilitating the submission of marker gene sequences compliant with a community standard that has been developed by the Genomic Standards Consortium (GSC), called the Minimal Information about a MARKer gene Sequence Standard (MIMARKS) (11, 12). MIMARKS provides a minimal set of required information fields essential for downstream reuse of the data. The last two templates in Table 1 have been designed for submissions of MIMARKS-compliant data. Further improvements to the submissions system for annotated sequences will continue in 2012 and beyond.

NEXT GENERATION SEQUENCE DATA SUBMISSIONS

To complement the existing programmatic SRA REST submission interface, we have recently extended the Webin system to support submissions of raw next generation sequencing reads to the SRA. Unlike the SRA REST interface, which is aimed at large-scale sequence submitters and allows direct programmatic interaction between external LIMS systems and the SRA database at EBI, this new component of Webin is designed for interactive use. Users work through a web interface to create studies, samples and experiments, to update submitted metadata and to release previously submitted data to the public. Importantly, all metadata are submitted either by uploading or editing spreadsheets. While SRA REST submitters are fully exposed to the underlying SRA XML data model, the SRA submission functionality in Webin completely hides this complexity. For example, during a raw sequence submission process, users are asked to define their raw data file format and are then presented with a spreadsheet, which can be either uploaded or filled with the required additional information (Figure 3).

Figure 3.

Screenshot of raw data definition page in SRA Webin.

The SRA submission component of Webin is under active development and new improvements are deployed weekly. Forthcoming improvements include support for European Genome–Phenome Archive submissions for controlled access raw sequence data, support for checklists for the provision of community-standard-compliant metadata and numerous usability additions.

INTRODUCTION OF SEQUENCE LENGTH LIMIT FOR ASSEMBLED SEQUENCES

ENA will introduce a sequence length limit for submissions of assembled sequences. From January 2012, ENA will accept sequences <100 bp only if they fall into one of the following sequence categories: ‘Ancient DNA’, ‘non-coding-RNA’, ‘Microsatellites’ or ‘Complete Exons’. Exceptions require the submitter to demonstrate that a peer-reviewed journal has accepted a manuscript by the submitter, confirming the relevance of the short sequences to the scientific community. A validation step will be implemented in Webin to facilitate implementation of this requirement. We encourage submitters to check our website for announcements of further forthcoming changes (http://www.ebi.ac.uk/ena/about/forthcoming_changes).
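The length rule stated above can be expressed as a small check. The Python sketch below is one reading of the rule and not the Webin validation step itself; the function name `is_length_acceptable` and the boolean flag `accepted_manuscript` are hypothetical, and the sequence is assumed to be a plain string of bases without whitespace.

```python
# Threshold and exempt categories as stated in the text above.
MIN_LENGTH_BP = 100
SHORT_SEQUENCE_CATEGORIES = {
    "Ancient DNA",
    "non-coding-RNA",
    "Microsatellites",
    "Complete Exons",
}


def is_length_acceptable(sequence, category=None, accepted_manuscript=False):
    """Apply the short-sequence rule to one assembled sequence.

    `accepted_manuscript` is a hypothetical flag standing for evidence that
    a peer-reviewed journal has accepted a manuscript covering the sequence.
    """
    if len(sequence) >= MIN_LENGTH_BP:
        # Sequences of 100 bp or more are not affected by the limit.
        return True
    if category in SHORT_SEQUENCE_CATEGORIES:
        # Short sequences in an exempt category are accepted.
        return True
    # Otherwise only the documented exception applies.
    return accepted_manuscript


if __name__ == "__main__":
    print(is_length_acceptable("ACGT" * 25))                    # 100 bp
    print(is_length_acceptable("ACGT" * 10))                    # 40 bp, no category
    print(is_length_acceptable("ACGT" * 10, "Microsatellites"))  # 40 bp, exempt
```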

HELPDESK AND TRAINING

The ENA team provides advice and guidance regarding ENA services by email through datasubs@ebi.ac.uk. Feedback and suggestions related to all of our services are very welcome at the same email address. We also operate a variety of hands-on training programmes, for which details are available at http://www.ebi.ac.uk/training. We strongly encourage submitters to take our survey (http://www.surveymonkey.com/s/ENA_User_Survey_2011) and help us to improve our service.

FUNDING

European Molecular Biology Laboratory; FP7 Programme of the European Commission; Wellcome Trust and Biotechnology and Biological Sciences Research Council (BBSRC). Funding for open access charge: European Molecular Biology Laboratory.

Conflict of interest statement. None declared.

12 in total

1. Efficient storage of high throughput DNA sequencing data using reference-based compression.

Authors: Markus Hsi-Yang Fritz; Rasko Leinonen; Guy Cochrane; Ewan Birney
Journal: Genome Res Date: 2011-01-18 Impact factor: 9.043

2. Minimum information about a marker gene sequence (MIMARKS) and minimum information about any (x) sequence (MIxS) specifications.

Authors: Pelin Yilmaz; Renzo Kottmann; Dawn Field; Rob Knight; James R Cole; Linda Amaral-Zettler; Jack A Gilbert; Ilene Karsch-Mizrachi; Anjanette Johnston; Guy Cochrane; Robert Vaughan; Christopher Hunter; Joonhong Park; Norman Morrison; Philippe Rocca-Serra; Peter Sterk; Manimozhiyan Arumugam; Mark Bailey; Laura Baumgartner; Bruce W Birren; Martin J Blaser; Vivien Bonazzi; Tim Booth; Peer Bork; Frederic D Bushman; Pier Luigi Buttigieg; Patrick S G Chain; Emily Charlson; Elizabeth K Costello; Heather Huot-Creasy; Peter Dawyndt; Todd DeSantis; Noah Fierer; Jed A Fuhrman; Rachel E Gallery; Dirk Gevers; Richard A Gibbs; Inigo San Gil; Antonio Gonzalez; Jeffrey I Gordon; Robert Guralnick; Wolfgang Hankeln; Sarah Highlander; Philip Hugenholtz; Janet Jansson; Andrew L Kau; Scott T Kelley; Jerry Kennedy; Dan Knights; Omry Koren; Justin Kuczynski; Nikos Kyrpides; Robert Larsen; Christian L Lauber; Teresa Legg; Ruth E Ley; Catherine A Lozupone; Wolfgang Ludwig; Donna Lyons; Eamonn Maguire; Barbara A Methé; Folker Meyer; Brian Muegge; Sara Nakielny; Karen E Nelson; Diana Nemergut; Josh D Neufeld; Lindsay K Newbold; Anna E Oliver; Norman R Pace; Giriprakash Palanisamy; Jörg Peplies; Joseph Petrosino; Lita Proctor; Elmar Pruesse; Christian Quast; Jeroen Raes; Sujeevan Ratnasingham; Jacques Ravel; David A Relman; Susanna Assunta-Sansone; Patrick D Schloss; Lynn Schriml; Rohini Sinha; Michelle I Smith; Erica Sodergren; Aymé Spo; Jesse Stombaugh; James M Tiedje; Doyle V Ward; George M Weinstock; Doug Wendel; Owen White; Andrew Whiteley; Andreas Wilke; Jennifer R Wortman; Tanya Yatsunenko; Frank Oliver Glöckner
Journal: Nat Biotechnol Date: 2011-05 Impact factor: 54.908

3. DDBJ progress report.

Authors: Eli Kaminuma; Takehide Kosuge; Yuichi Kodama; Hideo Aono; Jun Mashima; Takashi Gojobori; Hideaki Sugawara; Osamu Ogasawara; Toshihisa Takagi; Kousaku Okubo; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

4. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

5. c-di-AMP is a new second messenger in Staphylococcus aureus with a role in controlling cell size and envelope stress.

Authors: Rebecca M Corrigan; James C Abbott; Heike Burhenne; Volkhard Kaever; Angelika Gründling
Journal: PLoS Pathog Date: 2011-09-01 Impact factor: 6.823

6. The International Nucleotide Sequence Database Collaboration.

Authors: Guy Cochrane; Ilene Karsch-Mizrachi; Yasukazu Nakamura
Journal: Nucleic Acids Res Date: 2010-11-23 Impact factor: 16.971

7. Comparative high-throughput transcriptome sequencing and development of SiESTa, the Silene EST annotation database.

Authors: Nicolas Blavet; Delphine Charif; Christine Oger-Desfeux; Gabriel A B Marais; Alex Widmer
Journal: BMC Genomics Date: 2011-07-26 Impact factor: 3.969

8. The sequence read archive.

Authors: Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

9. The genomic standards consortium: bringing standards to life for microbial ecology.

Authors: Pelin Yilmaz; Jack A Gilbert; Rob Knight; Linda Amaral-Zettler; Ilene Karsch-Mizrachi; Guy Cochrane; Yasukazu Nakamura; Susanna-Assunta Sansone; Frank Oliver Glöckner; Dawn Field
Journal: ISME J Date: 2011-04-07 Impact factor: 10.302

10. Multiple reference genomes and transcriptomes for Arabidopsis thaliana.

Authors: Xiangchao Gan; Oliver Stegle; Jonas Behr; Joshua G Steffen; Philipp Drewe; Katie L Hildebrand; Rune Lyngsoe; Sebastian J Schultheiss; Edward J Osborne; Vipin T Sreedharan; André Kahles; Regina Bohnert; Géraldine Jean; Paul Derwent; Paul Kersey; Eric J Belfield; Nicholas P Harberd; Eric Kemen; Christopher Toomajian; Paula X Kover; Richard M Clark; Gunnar Rätsch; Richard Mott
Journal: Nature Date: 2011-08-28 Impact factor: 49.962
