
The EBI search engine: EBI search as a service-making biological data accessible for all.

Young M Park¹, Silvano Squizzato¹, Nicola Buso¹, Tamer Gur¹, Rodrigo Lopez¹.

Abstract

We present an update of the EBI Search engine, an easy-to-use fast text search and indexing system with powerful data navigation and retrieval capabilities. The interconnectivity that exists between data resources at EMBL-EBI provides easy, quick and precise navigation and a better understanding of the relationship between different data types that include nucleotide and protein sequences, genes, gene products, proteins, protein domains, protein families, enzymes and macromolecular structures, as well as the life science literature. EBI Search provides a powerful RESTful API that enables its integration into third-party portals, thus providing 'Search as a Service' capabilities, which are the main topic of this article.

Substances: Enzymes; Proteins

Year: 2017 PMID: 28472374 PMCID: PMC5570174 DOI: 10.1093/nar/gkx359

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

During 2016, more than 362 000 unique IPs accessed EBI Search from across the globe. These gave rise to more than 300 million searches, through both the web interface and the RESTful API, and to more than 1 TB of downloaded metadata linking search results with more than 1.3 billion records held in the databases at the EMBL–EBI. At present, the system can re-index all these data in less than 24 h. There is a continuous collaboration between the developers of EBI Search and its users. As a result, the search functionality can now be integrated into third-party portals in novel, intuitive and useful ways. The concept of ‘Search as a Service’ (https://en.wikipedia.org/wiki/Search_as_a_service) is not new, but the implementation presented here is novel and unique, opens up further opportunities for collaboration and removes the need to develop and maintain separate search applications. These collaborations often result in the development of new features in EBI Search, which are potentially useful for other projects and ultimately benefit scientific findability, reproducibility and discoverability.

DATA COVERAGE

EBI Search (1) provides comprehensive access to all data maintained in the public repositories hosted at the European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL–EBI) (2). These are divided into biological themes, as shown in Table 1. Since the last report, new data resources include the Enzyme Portal (3), providing protein- and enzyme-centric views; Rfam (4) RNA families; Ensembl Genome Variants (5); Elixir Tools Registry (6); Omics Discovery Index (OmicsDI) (https://doi.org/10.1101/049205); the EBI Metagenomics portal (7); and annotations from the InterPro protein domain and families consortium (8). Some data resources have also been deprecated, including ENA (9), Whole Genome Shotgun sequences and Transcriptome Shotgun Assemblies; however, master projects remain in order to provide access to these data via the ENA (European Nucleotide Archive) portal.

Table 1.

Data resources available through EBI Search in 2016–2017

Category	Data resources
Genomes and metagenomes	Ensembl Genomes, Ensembl, HGNC, PomBase, DGVa, EGA, LRG, WormBase ParaSite, Metagenomics
Nucleotide sequences	ENA, RNAcentral, NRNL1, NRNL2, IMGT/HLA, IPD-KIR, IPD-MHC
Protein sequences	UniProtKB, UniParc, UniRef, EPO, JPO, KIPO, USPTO, NRPL1, NRPL2
Macromolecular structures	PDBe, EMDB
Small molecules	ChEBI, ChEMBL, Ligands
Gene expression	ArrayExpress, Expression Atlases
Molecular interactions	IntAct
Reactions, pathways and diseases	Rhea, Reactome, BioModels, MetaboLights, OMIM, MetabolomeExpress, Metabolomics Workbench
Protein families	InterPro, TreeFam, Pfam, MEROPS, GPCRDB
Protein expression data	PRIDE, GNPS, GPMdb, MassIVE, PeptideAtlas
Enzymes	IntEnz, Enzyme Portal
Literature	MEDLINE, Patent families, Patents
Samples and ontologies	Taxonomy, GO, EFO, SBO, MESH, BioSamples, Elixir registry

EBI Search retrieves and indexes data in native XML, EBI search XML schema and various types of text files, including flat files. In addition to these, JSON format is now supported using a new schema (http://www.ebi.ac.uk/ebisearch/schemas/data_schema.json) to give data providers more flexibility.

EBI SEARCH AS A SERVICE

The concept of EBI Search as a Service defines a novel paradigm in the use of the EBI Search API (Application Programming Interface) to create complex views of inter-related data, thus allowing users to combine and create different views. One example is the OmicsDI (http://www.omicsdi.org/) portal, which integrates 11 data sources relating to transcriptomics, genomics, proteomics and metabolomics. Another is the EBI Metagenomics portal, which provides fast and uniform access to projects, samples and individual runs, and relates these to taxonomic assignments, allowing users to explore some 80 000 metagenomic datasets from more than 140 biomes. EBI Search as a Service is also used to provide fast lookup of annotations, as in the EBI HMMER service (https://www.ebi.ac.uk/Tools/hmmer).

Yet another example is the EBI Search web interface itself: the traditional graphical user interface (GUI) has now been replaced by a RESTful client developed using Angular (https://angular.io), a popular JavaScript web development platform. The new GUI adopts the EMBL–EBI Visual Framework (https://github.com/ebiwd/EBI-Framework), which helps keep a uniform user experience across the institute's data resources, and implements Responsive Design (http://www.w3schools.com/html/html_responsive.asp), thus providing a better experience across different devices.

To document the RESTful API of EBI Search and help developers design and build web applications that manage complex biological concepts, a Swagger (http://swagger.io) interface compliant with the OpenAPI specification (https://www.openapis.org/specification/repo) is available from https://www.ebi.ac.uk/ebisearch/swagger.ebi. An advantage of using this technology is that it markedly improves the API's accessibility. As mentioned above, there are good, functional examples of portals that use EBI Search as a Service to search, retrieve and display complex data. Table 2 lists 16 portals consuming the service API at the time of writing.

Table 2.

List of resources currently consuming the EBI Search through its RESTful API

Resource	URL/reference	Format used
ENA	https://www.ebi.ac.uk/ena/	XML
Ensembl Genomes	https://www.ensemblgenomes.org/	JSON
InterPro	https://www.ebi.ac.uk/interpro/	XML
Expression Atlas	https://www.ebi.ac.uk/gxa/	XML
LRG	http://www.lrg-sequence.org/	XML
Job Dispatcher	http://europepmc.org/abstract/MED/25845596	JSON/XML
RNAcentral	http://rnacentral.org/	XML
MetaboLights	https://www.ebi.ac.uk/metabolights/	XML
Enzyme Portal	https://www.ebi.ac.uk/enzymeportal/	JSON
OmicsDI	https://www.omicsdi.org	JSON
PomBase	http://www.pombase.org/	XML
Gene & Protein Summaries	https://www.ebi.ac.uk/s4/	XML
WormBase ParaSite	http://www.wormbase.org/	XML
HMMER	https://www.ebi.ac.uk/Tools/hmmer/	JSON/XML
Metagenomics	https://www.ebi.ac.uk/metagenomics/	JSON/XML
identifiers.org	http://www.identifiers.org/	XML
EBI Search web interface	https://www.ebi.ac.uk/ebisearch/	JSON
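
As an illustration of how a portal such as those in Table 2 might call the service, the following minimal Python sketch builds a search URL for one data resource. The base path, the domain identifier `uniprot` and the parameter names (`query`, `format`, `fields`, `start`, `size`) are assumptions made for this example rather than details given in the text; the Swagger page mentioned above is the authoritative description of the routes and parameters.

```python
from urllib.parse import quote, urlencode

# Assumed base path of the EBI Search REST service (not stated in the text).
BASE_URL = "https://www.ebi.ac.uk/ebisearch/ws/rest"


def build_search_url(domain, query, fields=(), result_format="json", start=0, size=15):
    """Return a search URL for one data resource (called a 'domain' here).

    All parameter names are assumptions for illustration; confirm them
    against the service's Swagger documentation before relying on them.
    """
    params = {
        "query": query,
        "format": result_format,
        "start": start,
        "size": size,
    }
    if fields:
        # Ask for specific fields as a comma-separated list.
        params["fields"] = ",".join(fields)
    return f"{BASE_URL}/{quote(domain)}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query against an assumed 'uniprot' domain.
    print(build_search_url("uniprot", "glutathione", fields=("id", "name"), size=10))
```

Because the result is an ordinary URL, it can be fetched with any HTTP client or pasted into a browser, which is what makes the service straightforward to embed in a third-party portal.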

IMPROVED SEARCH FINDABILITY AND USABILITY

Newly added features of the web interface improve findability and discoverability: Query Builder; saving results for later re-use; launching of bioinformatic tools; and RSS feeds for alerts. The new GUI framework also improves the user experience.

Query Builder, which replaces the advanced search page, helps users to build complex queries with an intuitive GUI. It provides a list of available data resources that users can start searching from, allowing them to combine several fields, using Boolean criteria, to build a complex query. Users can run the search and save the query for re-use or generate an RSS feed from it, as discussed later.

The EBI Search GUI now provides the ability to save search results on the client side for later analysis. A ‘Save result’ button downloads a subset or all of the current search results in machine-readable formats such as XML (eXtensible Markup Language), JSON (JavaScript Object Notation), TSV (Tab Separated Values) and CSV (Comma Separated Values). The save function is built using the RESTful API, so the generated URLs can be used in scripts or analysis pipelines. At most 100 entries can be retrieved in a single operation; more can be retrieved by using ‘pagination’ (i.e. specifying successive ranges of results, each covering at most 100 entries).

Launching bioinformatic tools with a selected subset of nucleotide or protein sequence query results is now possible. The list of tools varies depending on the selected database. For example, BLAST+ (10), Clustal Omega (11), UniSave lookup (12) and Dbfetch (13) are available for results pertaining to UniProtKB (14). When launching a tool, users are taken to that tool's web page, with the relevant identifiers pre-filled.

EBI Search provides an alerting mechanism to help users create queries that can be used to find new or updated data. RSS (http://www.rssboard.org/rss-specification) query alerts can be created from the Query Builder page and all data resource results pages. Alert examples can be found on the documentation page (https://www.ebi.ac.uk/ebisearch/documentation.ebi).

The new GUI loads data asynchronously, a behavior that improved user experience during testing. Because JavaScript libraries are required by the client, the browser memory footprint has increased slightly, but this is offset by faster rendering of the search results. Furthermore, a cache server has been added to the framework to provide high levels of query responsiveness. The move toward a JavaScript framework will make it easier to adapt to changing web practices.
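
The pagination rule described in this section (at most 100 entries per request) can be sketched in a few lines of Python. The limit of 100 comes from the text; the parameter names `start` and `size`, and the zero-based offset, are assumptions for illustration and should be checked against the API documentation.

```python
from urllib.parse import urlencode

MAX_PAGE_SIZE = 100  # per-request limit stated in the text


def page_ranges(total_hits, page_size=MAX_PAGE_SIZE):
    """Yield (start, size) pairs that together cover total_hits results."""
    if not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    for start in range(0, total_hits, page_size):
        # The last page may be shorter than page_size.
        yield start, min(page_size, total_hits - start)


def paged_query_strings(query, total_hits, result_format="json"):
    """Build one query string per page ('start' and 'size' are assumed names)."""
    return [
        urlencode({"query": query, "format": result_format, "start": start, "size": size})
        for start, size in page_ranges(total_hits)
    ]
```

A script would typically read the total number of hits from the first response and then request the remaining ranges one after another.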

ENHANCED RESTFUL API

In the previous paper (2) the basic RESTful API was described. Feedback from users has prompted enhancements that include hierarchical facets and ‘more like this’ results. Faceting is an efficient way for users to filter search results. However, some data, e.g. taxonomy (15), are difficult to fit in a single-level structure facet. The Metagenomics portal taxonomy data are now searchable and filterable as a hierarchical facet through the RESTful API, meaning it is now possible to navigate results across this taxonomic classification. Another new feature, called ‘more like this,’ developed in collaboration with the OmicsDI team, has been introduced in the RESTful interface. Starting from an entry, the new API returns a list of similar entries. These are related by terms they have in common and which are extracted from descriptive fields according to specific criteria, such as the expected frequency of a term. The selected terms form a new query that is used to search against the same database as the original entry or against other database(s). In the OmicsDI website these types of result appear under the heading ‘Similar Datasets.’
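
The ‘more like this’ idea described above (select informative terms from an entry's descriptive fields, then search with them) can be illustrated with a small self-contained Python sketch. This is not the EBI Search implementation: the tokenizer, the document-frequency cut-off, the TF-IDF-style score and the `OR`-joined query are all assumptions chosen to show the general technique.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_terms(entry_text, corpus, max_terms=5, max_doc_fraction=0.5):
    """Pick distinctive terms from entry_text, given a corpus of descriptions.

    Terms that occur in more than max_doc_fraction of the corpus are treated
    as too common to be informative (a stand-in for 'expected frequency').
    """
    docs = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(docs)
    term_counts = Counter(tokenize(entry_text))
    scored = []
    for term, count in term_counts.items():
        doc_freq = sum(1 for doc in docs if term in doc)
        if n_docs and doc_freq / n_docs > max_doc_fraction:
            continue
        # Smoothed inverse document frequency; rarer terms score higher.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scored.append((count * idf, term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


def build_similarity_query(terms):
    """Join the selected terms into a single Boolean query string."""
    return " OR ".join(terms)
```

The resulting query string can then be run against the same data resource as the original entry or against a different one, which mirrors the behaviour described for the ‘Similar Datasets’ listing.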

DESCRIBING CONTENTS AND MONITORING (SEARCH ANALYTICS)

As the importance of EBI Search as a Service grows, users need to know the status of the databases. The main information page of the system (https://www.ebi.ac.uk/ebisearch) provides graphical overviews of the most popular terms entered through web search boxes and of the relative size of each data resource. Below these, there is a list of collaborators followed by a table of data resources that shows the current domain classification, its name, category, query on, number of entries and dates of indexing, last update and release. This information is also available over the RESTful Web Services. To better understand how the search engine is used by web and Web Services users, dedicated systems monitor and analyse requests using Elastic Stack (http://www.elastic.co). These measure the volume of incoming and outgoing traffic, the relative use of the RESTful API methods and the source of the requests (Figure 1). Usage patterns can be captured and later used for diagnosing problems, generating usage statistics and, finally, making improvements and enhancements.

Figure 1.

Relative use of the RESTful API methods during first quarter 2017.

FUTURE DIRECTIONS

There is no doubt that the continuous growth of data in EBI Search will present challenges. From the data perspective, there will be more collaboration with data providers, which should bring more content to index and produce enriched views on the data. Future collaboration with ChEMBL (16) will also improve search and results in the cheminformatics resources. The web interface will be reviewed using user-centered design techniques in order to expose more information to end users. Similarly, better error handling is required. In the RESTful API, more result formats will be developed, such as lists of database identifiers that can serve directly as input to biological analysis workflows.

DISCUSSION

The implementation of industry standards, such as RESTful Web Services, has permitted the development of a scalable search infrastructure that provides fast and efficient access to a diverse and complex set of biological data. Keeping up to date with proteomics data generation, the output of high-throughput sequencing technologies, growing literature resources, ontologies and specialized taxonomies is a challenge. Providing direct access to specialist data portals as well as presenting views on combined complex data is high on the development agenda and must happen in close collaboration with the data providers. In this context, the concept of EBI Search as a Service has allowed third parties to integrate and develop powerful search functionality that can be used by all. This has several advantages: sharing of search and result components; avoiding duplication of effort; establishing a common syntax for searching; and, finally, improving scientific findability, reproducibility and discoverability.

16 in total

1. UniSave: the UniProtKB sequence/annotation version database.

Authors: Rasko Leinonen; Francesco Nardone; Weimin Zhu; Rolf Apweiler
Journal: Bioinformatics Date: 2006-03-21 Impact factor: 6.937

2. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors: Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal: Mol Syst Biol Date: 2011-10-11 Impact factor: 11.429

3. The EMBL-EBI bioinformatics web and programmatic tools framework.

Authors: Weizhong Li; Andrew Cowley; Mahmut Uludag; Tamer Gur; Hamish McWilliam; Silvano Squizzato; Young Mi Park; Nicola Buso; Rodrigo Lopez
Journal: Nucleic Acids Res Date: 2015-04-06 Impact factor: 16.971

4. The EBI Search engine: providing search and retrieval functionality for biological data from EMBL-EBI.

Authors: Silvano Squizzato; Young Mi Park; Nicola Buso; Tamer Gur; Andrew Cowley; Weizhong Li; Mahmut Uludag; Sangya Pundir; Jennifer A Cham; Hamish McWilliam; Rodrigo Lopez
Journal: Nucleic Acids Res Date: 2015-04-08 Impact factor: 16.971

5. Rfam 12.0: updates to the RNA families database.

Authors: Eric P Nawrocki; Sarah W Burge; Alex Bateman; Jennifer Daub; Ruth Y Eberhardt; Sean R Eddy; Evan W Floden; Paul P Gardner; Thomas A Jones; John Tate; Robert D Finn
Journal: Nucleic Acids Res Date: 2014-11-11 Impact factor: 19.160

6. InterPro in 2017-beyond protein family and domain annotations.

Authors: Robert D Finn; Teresa K Attwood; Patricia C Babbitt; Alex Bateman; Peer Bork; Alan J Bridge; Hsin-Yu Chang; Zsuzsanna Dosztányi; Sara El-Gebali; Matthew Fraser; Julian Gough; David Haft; Gemma L Holliday; Hongzhan Huang; Xiaosong Huang; Ivica Letunic; Rodrigo Lopez; Shennan Lu; Aron Marchler-Bauer; Huaiyu Mi; Jaina Mistry; Darren A Natale; Marco Necci; Gift Nuka; Christine A Orengo; Youngmi Park; Sebastien Pesseat; Damiano Piovesan; Simon C Potter; Neil D Rawlings; Nicole Redaschi; Lorna Richardson; Catherine Rivoire; Amaia Sangrador-Vegas; Christian Sigrist; Ian Sillitoe; Ben Smithers; Silvano Squizzato; Granger Sutton; Narmada Thanki; Paul D Thomas; Silvio C E Tosatto; Cathy H Wu; Ioannis Xenarios; Lai-Su Yeh; Siew-Yit Young; Alex L Mitchell
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

7. UniProt: the universal protein knowledgebase.

Authors: The UniProt Consortium
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

8. The EBI enzyme portal.

Authors: Rafael Alcántara; Joseph Onwubiko; Hong Cao; Paula de Matos; Jennifer A Cham; Jules Jacobsen; Gemma L Holliday; Julia D Fischer; Syed Asad Rahman; Bijay Jassal; Mikael Goujon; Francis Rowland; Sameer Velankar; Rodrigo López; John P Overington; Gerard J Kleywegt; Henning Hermjakob; Claire O'Donovan; María Jesús Martín; Janet M Thornton; Christoph Steinbeck
Journal: Nucleic Acids Res Date: 2012-11-21 Impact factor: 16.971

9. The European Bioinformatics Institute's data resources 2014.

Authors: Catherine Brooksbank; Mary Todd Bergman; Rolf Apweiler; Ewan Birney; Janet Thornton
Journal: Nucleic Acids Res Date: 2013-11-23 Impact factor: 16.971

10. Ensembl Genomes 2016: more genomes, more complexity.

Authors: Paul Julian Kersey; James E Allen; Irina Armean; Sanjay Boddu; Bruce J Bolt; Denise Carvalho-Silva; Mikkel Christensen; Paul Davis; Lee J Falin; Christoph Grabmueller; Jay Humphrey; Arnaud Kerhornou; Julia Khobova; Naveen K Aranganathan; Nicholas Langridge; Ernesto Lowy; Mark D McDowall; Uma Maheswari; Michael Nuhn; Chuang Kee Ong; Bert Overduin; Michael Paulini; Helder Pedro; Emily Perry; Giulietta Spudich; Electra Tapanari; Brandon Walts; Gareth Williams; Marcela Tello-Ruiz; Joshua Stein; Sharon Wei; Doreen Ware; Daniel M Bolser; Kevin L Howe; Eugene Kulesha; Daniel Lawson; Gareth Maslen; Daniel M Staines
Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971
