
Updates to BioSamples database at European Bioinformatics Institute.

Adam Faulconbridge, Tony Burdett, Marco Brandizi, Mikhail Gostev, Rui Pereira, Drashtti Vasant, Ugis Sarkans, Alvis Brazma, Helen Parkinson.

Abstract

The BioSamples database at the EBI (http://www.ebi.ac.uk/biosamples) provides an integration point for sample information between technology-specific databases at the EBI, projects such as ENCODE and reference collections such as cell lines. The database delivers a unified query interface and API to query sample information across EBI's databases and provides links back to assay databases. Sample groups are used to manage related samples, e.g. those from an experimental submission or a single reference collection. Infrastructural improvements include a new user interface with ontological and keyword queries, a new query API, a new data submission API, complete RDF data download and a supporting SPARQL endpoint, accessioning at the point of submission to the European Nucleotide Archive and the European Genome-phenome Archive, and improved query response times.

Year: 2013 PMID: 24265224 PMCID: PMC3965081 DOI: 10.1093/nar/gkt1081

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The EBI’s BioSamples database provides a single point of entry to sample data stored in EBI assay databases and delivers a dedicated query interface and API for accessing sample data. Samples are arranged into sample groups for ease of query and submission, and to allow attributes to be added at the group level rather than at the sample-specific level. This supports cases where values or attributes must be expressed as binned values across samples, which happens when the information is not available or cannot be provided at the sample level for ethical reasons (1). When querying BioSamples, users are offered links to assay databases, for example, to sequence information in the European Nucleotide Archive or ENA (2), gene expression microarrays in ArrayExpress (3) or proteomics data in the PRoteomics IDEntifications database or PRIDE (4). The EBI’s BioSamples database is developed in parallel with the NCBI’s BioSample database (5), which fulfils a similar function at the NCBI. This article describes data growth and new features implemented since our previous publication in 2011 (6). The EBI BioSamples database has more than doubled in size since January 2012, when it described 1 million samples; as of October 2013, 2 846 137 samples are available in 80 232 groups. Data growth is attributed to new data sources and an increased volume of data from existing sources. New data sources include 22 288 samples from The Cancer Genome Atlas (http://cancergenome.nih.gov/) and 920 441 samples in 10 737 groups from the Catalogue of Somatic Mutations in Cancer, COSMIC (7). Adding samples from these sources provides interoperability with other resources that include their identifiers; COSMIC identifiers, for example, are included in Ensembl (8).

INFRASTRUCTURAL IMPROVEMENTS

An updated web interface delivers new search functionality, an improved tabular layout, ontology-supported queries and access to revised help documentation (see Figure 1). The Experimental Factor Ontology (EFO) (9) is now used for query expansion using synonyms and child terms, so that a search also retrieves samples annotated with more specific terms. Users may select indexed keywords or an EFO term for their queries, in combination with Boolean operators, to refine their searches. EFO has been expanded in parallel to support BioSamples use cases, including an import of a ‘SLIM’ of the Uber Anatomy Ontology Uberon (10) and a genetic disease classification from Orphanet (11). These were selected based on analysis of common user queries and provide enhanced queries for samples based on anatomical and disease characteristics.
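As a rough illustration of ontology-based query expansion, the following sketch expands a query term with its synonyms and all of its descendant terms. The toy ontology fragment and the function name `expand_query` are assumptions made for illustration only; they are not taken from EFO or from the BioSamples implementation.

```python
from collections import deque

# Toy ontology fragment (hypothetical; not real EFO content).
CHILDREN = {
    "leukemia": ["acute myeloid leukemia", "chronic myeloid leukemia"],
    "acute myeloid leukemia": ["acute promyelocytic leukemia"],
}
SYNONYMS = {
    "leukemia": ["leukaemia"],
    "acute myeloid leukemia": ["AML"],
}


def expand_query(term):
    """Return the term, its synonyms, and all descendant terms with their synonyms."""
    expanded = []
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        expanded.append(current)
        # Synonyms are alternative labels for the same concept.
        for synonym in SYNONYMS.get(current, []):
            if synonym not in seen:
                seen.add(synonym)
                expanded.append(synonym)
        # Child terms are more specific concepts; visit them breadth-first.
        queue.extend(CHILDREN.get(current, []))
    return expanded


if __name__ == "__main__":
    print(expand_query("acute myeloid leukemia"))
```

A search backend could then match any of the expanded strings against indexed sample attributes, which is one plausible way to obtain the exact, synonym and more-specific matches shown in Figure 1.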

Figure 1.

The new BioSamples web interface showing search results for a query of ‘acute myeloid leukemia’. Auto-complete on the search box suggests appropriate ontology and keyword terms as the query is entered by the user, including more specific terms from EFO such as subtypes of the disease. Highlighted results are colour coded for clarity; exact matches (yellow), synonyms (green) and more specific terms (red).

New programmatic interfaces are available for both retrieving and submitting data. The API supports queries by sample or sample group accession, and queries of samples or sample groups by their attributes. For example, the URL http://www.ebi.ac.uk/biosamples/xml/sample/SAME42581 returns all information about a sample in XML format. The APIs are documented on the BioSamples database help pages, and example queries are provided. Sample information can also be submitted to the BioSamples database through an API via a JSON representation of the SampleTab tabular submission format. Part of the submission API is a SampleTab validation service, which also has a web page interface (http://www.ebi.ac.uk/biosamples/sampletab). This service is used for pre-submission validation of both manual and programmatic submissions. To maintain quality for directly submitted sample information, use of the submission web service is restricted by API keys, which can be obtained from biosamples@ebi.ac.uk.
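Retrieval by sample accession, as described above, can be sketched in a few lines of Python. The URL pattern is generalized from the single example URL given in the text, and the helper names `sample_xml_url` and `fetch_sample_xml` are hypothetical; the help pages remain the authoritative description of the API, and the service may not respond at this address indefinitely.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the example in the text; treating it as a general
# pattern for any sample accession is an assumption.
BASE_URL = "http://www.ebi.ac.uk/biosamples/xml/sample/"


def sample_xml_url(accession):
    """Build the XML retrieval URL for one sample accession."""
    return BASE_URL + quote(accession, safe="")


def fetch_sample_xml(accession, timeout=30):
    """Download the XML record for a sample and return its parsed root element."""
    with urlopen(sample_xml_url(accession), timeout=timeout) as response:
        return ET.fromstring(response.read())


if __name__ == "__main__":
    # Only the URL is built here; calling fetch_sample_xml requires network access.
    print(sample_xml_url("SAME42581"))
```

Keeping URL construction separate from the network call makes the first function easy to check offline.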

DATA INTEGRATION

The BioSamples database has improved interoperability with other EBI resources and with external groups. All submissions of sample information to ENA and the European Genome-phenome Archive or EGA (http://www.ebi.ac.uk/ega) are now assigned a BioSamples accession, which is returned to the submitter immediately as part of the submission process. Users can also pre-register samples and re-use those accessions when submitting to resources such as ENA and EGA. When samples are pre-registered, BioSamples staff curate the submitted data, and the BioSamples database becomes a single source of sample information across multiple experimental technologies and databases. This, in turn, encourages the use of BioSamples identifiers in other repositories to identify and link equivalent samples.

Several major research projects have established links with the BioSamples database. For example, the HipSci project (http://www.hipsci.org/) pre-registers information about donors and cell lines, including the relationships between them, with the BioSamples database. The ENCODE (12) data coordination centre is working with the BioSamples database to ensure that its existing sample records are updated and annotated with ontology terms, and to specify relationships between samples in ENCODE datasets. To date, sample information from users is submitted directly to the BioSamples database through both manual and automatic processes, both of which are supported by the curatorial staff.

Other groups around the world have also established repositories of sample information, including the NCBI BioSample database. The BioSamples database at EBI uses a common accessioning scheme previously agreed with NCBI and DDBJ, and we expect that data exchange will be implemented in early 2014.

As sample data can be identified by multiple accessions assigned by EBI and external databases, an identifier and URL resolution service, ‘MyEquivalents’, has been deployed. It provides mappings between different, but equivalent, sample identifiers. For example, human RNA-Seq data deposited at EBI may have identifiers for ArrayExpress, ENA and the BioSamples database, as these resources share records. In time the BioSamples database identifier will be the only sample identifier for new submissions, but until then, and to preserve backwards compatibility for legacy data, the MyEquivalents service provides redirection URLs and web services describing mappings. More information and the source code for the MyEquivalents software are available (https://github.com/EBIBioSamples/myequivalents).

Finally, as a component of the EBI Resource Description Framework (RDF) platform (http://www.ebi.ac.uk/rdf), RDF is now available for the BioSamples database content. The schema is derived from the SampleTab format, supported by integration with existing ontologies such as the Ontology for Biomedical Investigations (13) and EFO. Data are made available as RDF and also for query via a SPARQL endpoint, for which example queries are documented.
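The idea of mapping between different but equivalent identifiers can be pictured with a small union-find structure, in which identifiers declared equivalent end up in the same set. This is a conceptual sketch only: the class `EquivalenceStore`, its methods and the accessions in the usage example are hypothetical and do not reflect the MyEquivalents software or its API.

```python
class EquivalenceStore:
    """Groups (service, accession) pairs into sets of equivalent identifiers."""

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        """Return the representative of the set containing item."""
        self._parent.setdefault(item, item)
        root = item
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while item != root:
            parent = self._parent[item]
            self._parent[item] = root
            item = parent
        return root

    def link(self, a, b):
        """Declare that identifiers a and b refer to the same sample."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def equivalents(self, item):
        """Return all other identifiers known to be equivalent to item."""
        if item not in self._parent:
            return set()
        root = self._find(item)
        return {
            other
            for other in list(self._parent)
            if other != item and self._find(other) == root
        }


if __name__ == "__main__":
    store = EquivalenceStore()
    # Made-up accessions, for illustration only.
    store.link(("biosamples", "S1"), ("ena", "E1"))
    store.link(("ena", "E1"), ("arrayexpress", "A1"))
    print(store.equivalents(("arrayexpress", "A1")))
```

Because equivalence is transitive, linking the BioSamples record to the ENA record and the ENA record to the ArrayExpress record is enough for a lookup from any one of the three to return the other two.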

FUTURE WORK

The development of the process and tools supporting EBI-NCBI data exchange is underway in collaboration with NCBI. EBI has completed a test parse and load of the current NCBI BioSample database content, and we are examining and mapping the attributes used by the NCBI’s and EBI’s databases to deliver a core set of common attributes and the context for these, for example, those required by standards such as The Minimal Information about a MetaGenome (14). The core attribute list will be used to facilitate data exchange, provide improved searches across attributes and drive context-specific displays so that like attributes are displayed together for specific experiment types, e.g. latitude, longitude and depth for ocean samples. We will further improve API and GUI access by adding better support for single-sample queries by technology and assay type.

FUNDING

EMBL core budget (in part) provided by the EMBL member countries with contributions from the European Commission grants CAGEKID [HEALTH-F4-2010-241669]; ENGAGE [HEALTH-F4-2007-201413 from the European Commission FP7 program]; BioMedBridges [European Commission FP7 Capacities Specific Programme 284209]; Gen2Phen [European Commission FP7 200754]; Diachron [European Commission FP7 601043]; a grant from the Wellcome Trust and the Medical Research Council [award WT098503]. Funding for open access charge: EMBL Core. Conflict of interest statement. None declared.

14 in total

1. Modeling sample variables with an Experimental Factor Ontology.

Authors: James Malone; Ele Holloway; Tomasz Adamusiak; Misha Kapushesky; Jie Zheng; Nikolay Kolesnikov; Anna Zhukova; Alvis Brazma; Helen Parkinson
Journal: Bioinformatics Date: 2010-03-03 Impact factor: 6.937

2. Modeling biomedical experimental processes with OBI.

Authors: Ryan R Brinkman; Mélanie Courtot; Dirk Derom; Jennifer M Fostel; Yongqun He; Phillip Lord; James Malone; Helen Parkinson; Bjoern Peters; Philippe Rocca-Serra; Alan Ruttenberg; Susanna-Assunta Sansone; Larisa N Soldatova; Christian J Stoeckert; Jessica A Turner; Jie Zheng
Journal: J Biomed Semantics Date: 2010-06-22

3. The minimum information about a genome sequence (MIGS) specification.

Authors: Dawn Field; George Garrity; Tanya Gray; Norman Morrison; Jeremy Selengut; Peter Sterk; Tatiana Tatusova; Nicholas Thomson; Michael J Allen; Samuel V Angiuoli; Michael Ashburner; Nelson Axelrod; Sandra Baldauf; Stuart Ballard; Jeffrey Boore; Guy Cochrane; James Cole; Peter Dawyndt; Paul De Vos; Claude DePamphilis; Robert Edwards; Nadeem Faruque; Robert Feldman; Jack Gilbert; Paul Gilna; Frank Oliver Glöckner; Philip Goldstein; Robert Guralnick; Dan Haft; David Hancock; Henning Hermjakob; Christiane Hertz-Fowler; Phil Hugenholtz; Ian Joint; Leonid Kagan; Matthew Kane; Jessie Kennedy; George Kowalchuk; Renzo Kottmann; Eugene Kolker; Saul Kravitz; Nikos Kyrpides; Jim Leebens-Mack; Suzanna E Lewis; Kelvin Li; Allyson L Lister; Phillip Lord; Natalia Maltsev; Victor Markowitz; Jennifer Martiny; Barbara Methe; Ilene Mizrachi; Richard Moxon; Karen Nelson; Julian Parkhill; Lita Proctor; Owen White; Susanna-Assunta Sansone; Andrew Spiers; Robert Stevens; Paul Swift; Chris Taylor; Yoshio Tateno; Adrian Tett; Sarah Turner; David Ussery; Bob Vaughan; Naomi Ward; Trish Whetzel; Ingio San Gil; Gareth Wilson; Anil Wipat
Journal: Nat Biotechnol Date: 2008-05 Impact factor: 54.908

4. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments.

Authors: Helen Parkinson; Ugis Sarkans; Nikolay Kolesnikov; Niran Abeygunawardena; Tony Burdett; Miroslaw Dylag; Ibrahim Emam; Anna Farne; Emma Hastings; Ele Holloway; Natalja Kurbatova; Margus Lukk; James Malone; Roby Mani; Ekaterina Pilicheva; Gabriella Rustici; Anjan Sharma; Eleanor Williams; Tomasz Adamusiak; Marco Brandizi; Nataliya Sklyar; Alvis Brazma
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

5. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata.

Authors: Tanya Barrett; Karen Clark; Robert Gevorgyan; Vyacheslav Gorelenkov; Eugene Gribov; Ilene Karsch-Mizrachi; Michael Kimelman; Kim D Pruitt; Sergei Resenchuk; Tatiana Tatusova; Eugene Yaschenko; James Ostell
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

6. Data mining using the Catalogue of Somatic Mutations in Cancer BioMart.

Authors: Rebecca Shepherd; Simon A Forbes; David Beare; S Bamford; Charlotte G Cole; Sari Ward; Nidhi Bindal; Prasad Gunasekaran; Mingming Jia; Chai Yin Kok; Kenric Leung; Andrew Menzies; Adam P Butler; Jon W Teague; Peter J Campbell; Michael R Stratton; P Andrew Futreal
Journal: Database (Oxford) Date: 2011-05-23 Impact factor: 3.451

7. Uberon, an integrative multi-species anatomy ontology.

Authors: Christopher J Mungall; Carlo Torniai; Georgios V Gkoutos; Suzanna E Lewis; Melissa A Haendel
Journal: Genome Biol Date: 2012-01-31 Impact factor: 13.583

8. An integrated encyclopedia of DNA elements in the human genome.

Authors:
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

9. Ensembl 2013.

Authors: Paul Flicek; Ikhlak Ahmed; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Denise Carvalho-Silva; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Laurent Gil; Carlos García-Girón; Leo Gordon; Thibaut Hourlier; Sarah Hunt; Thomas Juettemann; Andreas K Kähäri; Stephen Keenan; Monika Komorowska; Eugene Kulesha; Ian Longden; Thomas Maurel; William M McLaren; Matthieu Muffato; Rishi Nag; Bert Overduin; Miguel Pignatelli; Bethan Pritchard; Emily Pritchard; Harpreet Singh Riat; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sheppard; Daniel Sobral; Kieron Taylor; Anja Thormann; Stephen Trevanion; Simon White; Steven P Wilder; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Jennifer Harrow; Javier Herrero; Tim J P Hubbard; Nathan Johnson; Rhoda Kinsella; Anne Parker; Giulietta Spudich; Andy Yates; Amonida Zadissa; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2012-11-30 Impact factor: 16.971

10. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013.

Authors: Juan Antonio Vizcaíno; Richard G Côté; Attila Csordas; José A Dianes; Antonio Fabregat; Joseph M Foster; Johannes Griss; Emanuele Alpi; Melih Birim; Javier Contell; Gavin O'Kelly; Andreas Schoenegger; David Ovelleiro; Yasset Pérez-Riverol; Florian Reisinger; Daniel Ríos; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971
