Guy Cochrane, Ruth Akhtar, James Bonfield, Lawrence Bower, Fehmi Demiralp, Nadeem Faruque, Richard Gibson, Gemma Hoad, Tim Hubbard, Christopher Hunter, Mikyung Jang, Szilveszter Juhos, Rasko Leinonen, Steven Leonard, Quan Lin, Rodrigo Lopez, Dariusz Lorenc, Hamish McWilliam, Gaurab Mukherjee, Sheila Plaister, Rajesh Radhakrishnan, Stephen Robinson, Siamak Sobhany, Petra Ten Hoopen, Robert Vaughan, Vadim Zalunin, Ewan Birney.
Abstract
Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data and a brief summary of the contents of the ENA, and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.
Year: 2008 PMID: 18978013 PMCID: PMC2686451 DOI: 10.1093/nar/gkn765
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. ENA structure. The figure shows how nucleotide sequencing information is partitioned according to class; ENA-Reads treats raw sequencing information, ENA-Assembly treats information on how fragmented sequences have been assembled into higher order structures and ENA-Annotation treats functional annotation based on assembled sequence. The three components are integrated in the ENA.
Points of entry to the ENA: submissions, retrieval and support
| Submissions | |
|---|---|
| Submission of new data | Submissions of annotated sequence data to ENA-Annotation and capillary traces to ENA-Reads |
| Updates to existing data | Updates to existing ENA-Annotation data |
| Next generation sequence, project accounts and WGS submissions | To establish next generation sequence submissions, new project accounts and pipelines for WGS projects. |
| Retrieval | |
| SRS | Data retrieval by term search and through links to/from other databases |
| Sequence similarity search | Data retrieval by sequence similarity |
| Sequence Version Archive | Access to current and previous versions of entries by accession number |
| ENA-Annotation and ENA-Assembly FTP | Access to complete data sets in flatfile format, for both release and updated data |
| ENA-Reads FTP | Access to ENA-Reads data for capillary and next generation reads |
| Genomes | Completed genomes and proteomes |
| Dbfetch/Wsdbfetch | Retrieval by accession number through web browser, or via webservice, respectively |
| Custom Datasets | For data retrieval requirements not supported by existing tools |
| Support | |
| General Information | Documentation including user manual, INSDC Feature Table Definition, news and forthcoming changes |
| Specific Help | For help with all ENA services |
| Educational Information | Background bioinformatics educational resources |
Figure 2. Webin. The figure shows a selection of screenshots from Webin: (a) launcher page, (b) submissions page, (c) source feature page and (d) new feature addition page.
Figure 3. Throughput for validated ENA-Annotation entries. The figure shows cumulative counts of ENA-Annotation entries that have been processed by ENA biologists.
Sample validation rules
| Condition 1 | Condition 2 |
|---|---|
| QE(/environmental_sample) | QE(/isolation_source) |
| QE(/map) | QE(/chromosome) OR QE(/segment) OR QE(/organelle) |
| QE(/proviral) OR QE(/virion) | NOT (QE(/proviral) AND QE(/virion)) |
| ME(‘BARCODE’) | QE(/pcr_primers) |
| QC(/organism, ‘Bacteria’) AND NOT QE(/environmental_sample) | QE(/strain) |
| NOT (QC(/organism, ‘Deltavirus’) OR QC(/organism, ‘Retro-transcribing viruses’) OR QC(/organism, ‘ssRNA viruses’) OR QC(/organism, ‘dsRNA viruses’)) | NOT QV(/mol_type, ‘genomic RNA’) |
Sample validation rules are shown. A grammar has been developed that allows non-technical curation of combinations of conditions that must be satisfied together for the rule to pass. The QE function expresses the existence of a source qualifier of the specified name; the ME function expresses the existence of the specified methodological keyword; the QC function expresses equality with, or a child relationship to, the specified value within the taxonomic hierarchy; and the QV function expresses a specified value for the specified source qualifier.
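To make the rule grammar concrete, the following is a minimal sketch of how the four functions might be evaluated against an entry. It is not the ENA implementation. The `Entry` class, the `AND`/`OR`/`NOT` combinators and `rule_passes` are hypothetical names introduced here, and the sketch assumes that a rule is read as "if Condition 1 holds, then Condition 2 must hold", which the table rows suggest but the text does not state outright.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Entry:
    """Hypothetical, simplified view of the source feature of one entry."""
    qualifiers: Dict[str, str] = field(default_factory=dict)  # e.g. {"/strain": "K-12"}
    keywords: Set[str] = field(default_factory=set)           # methodological keywords
    lineage: List[str] = field(default_factory=list)          # taxonomic ancestors


Condition = Callable[[Entry], bool]


def QE(name: str) -> Condition:
    """True if a source qualifier with this name exists."""
    return lambda e: name in e.qualifiers


def QV(name: str, value: str) -> Condition:
    """True if the named qualifier has exactly this value."""
    return lambda e: e.qualifiers.get(name) == value


def ME(keyword: str) -> Condition:
    """True if the methodological keyword is present."""
    return lambda e: keyword in e.keywords


def QC(name: str, taxon: str) -> Condition:
    """True if the qualifier value equals the taxon or descends from it."""
    return lambda e: name in e.qualifiers and (
        e.qualifiers[name] == taxon or taxon in e.lineage
    )


def AND(*conds: Condition) -> Condition:
    return lambda e: all(c(e) for c in conds)


def OR(*conds: Condition) -> Condition:
    return lambda e: any(c(e) for c in conds)


def NOT(cond: Condition) -> Condition:
    return lambda e: not cond(e)


def rule_passes(cond1: Condition, cond2: Condition, entry: Entry) -> bool:
    """Assumed reading: the rule fails only when cond1 holds and cond2 does not."""
    return (not cond1(entry)) or cond2(entry)


# Fifth row of the table: bacterial, non-environmental entries need /strain.
bacteria_need_strain = (
    AND(QC("/organism", "Bacteria"), NOT(QE("/environmental_sample"))),
    QE("/strain"),
)

entry = Entry(qualifiers={"/organism": "Escherichia coli"}, lineage=["Bacteria"])
print(rule_passes(*bacteria_need_strain, entry))  # False: /strain is missing
entry.qualifiers["/strain"] = "K-12"
print(rule_passes(*bacteria_need_strain, entry))  # True
```

Representing each condition as a plain predicate keeps the rules composable, so a curator-facing grammar could be parsed into the same combinators without changing the evaluator.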
Figure 4. Structure of ENA-Reads. A relational data model has been developed for next generation sequencing data. It relates a study to the samples used in the study and to the runs executed as part of the experiments that make up the study, and it describes how samples have been configured in runs. Underlying this data model is an API that abstracts away the nature of the data file system: it returns read data on request based on read identifiers (and groupings of these identifiers), rather than on specified files within the file system.
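The relationships in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the ENA-Reads schema or API: every class, field and method name (`Study`, `Experiment`, `Run`, `Sample`, `ReadStore`, `get_reads`, `get_group`) is hypothetical, and an in-memory dictionary stands in for the underlying file system.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Sample:
    sample_id: str


@dataclass
class Run:
    run_id: str
    sample_ids: List[str]  # how samples were configured in this run


@dataclass
class Experiment:
    experiment_id: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    study_id: str
    samples: List[Sample] = field(default_factory=list)
    experiments: List[Experiment] = field(default_factory=list)


class ReadStore:
    """Hypothetical read-level API: callers ask for read identifiers or
    named groups of identifiers and never see file paths."""

    def __init__(self) -> None:
        self._reads: Dict[str, str] = {}         # read id -> bases (stand-in for files)
        self._groups: Dict[str, List[str]] = {}  # group name -> read ids

    def add(self, read_id: str, bases: str, group: str) -> None:
        self._reads[read_id] = bases
        self._groups.setdefault(group, []).append(read_id)

    def get_reads(self, read_ids: Iterable[str]) -> Dict[str, str]:
        """Return read data for the requested identifiers."""
        return {rid: self._reads[rid] for rid in read_ids}

    def get_group(self, group: str) -> Dict[str, str]:
        """Return all reads belonging to a named grouping, such as a run."""
        return self.get_reads(self._groups.get(group, []))


# Illustrative identifiers only.
study = Study(
    study_id="STUDY1",
    samples=[Sample("SAMPLE1")],
    experiments=[Experiment("EXP1", runs=[Run("RUN1", sample_ids=["SAMPLE1"])])],
)
store = ReadStore()
store.add("RUN1.1", "ACGT", group="RUN1")
store.add("RUN1.2", "TTGA", group="RUN1")

print(store.get_reads(["RUN1.2"]))  # {'RUN1.2': 'TTGA'}
print(store.get_group("RUN1"))      # {'RUN1.1': 'ACGT', 'RUN1.2': 'TTGA'}
```

Because callers depend only on identifiers, the storage layout behind `ReadStore` could change without affecting them, which is the kind of abstraction the caption describes.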