
The EBI Search engine: providing search and retrieval functionality for biological data from EMBL-EBI.

Silvano Squizzato¹, Young Mi Park¹, Nicola Buso¹, Tamer Gur¹, Andrew Cowley¹, Weizhong Li¹, Mahmut Uludag¹, Sangya Pundir¹, Jennifer A Cham¹, Hamish McWilliam¹, Rodrigo Lopez².

Abstract

The European Bioinformatics Institute (EMBL-EBI; https://www.ebi.ac.uk) provides free and unrestricted access to data across all major areas of biology and biomedicine. Searching and extracting knowledge across these domains requires a fast and scalable solution that addresses the requirements of domain experts as well as casual users. We present the EBI Search engine, referred to here as 'EBI Search', an easy-to-use, fast text search and indexing system with powerful data navigation and retrieval capabilities. API integration provides access to analytical tools, allowing users to further investigate the results of their search. The interconnectivity that exists between data resources at EMBL-EBI provides easy, quick and precise navigation and a better understanding of the relationships between different data types including sequences, genes, gene products, proteins, protein domains, protein families, enzymes and macromolecular structures, together with relevant life science literature.

Substances:
Enzymes
Proteins

Year: 2015 PMID: 25855807 PMCID: PMC4489232 DOI: 10.1093/nar/gkv316

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The European Bioinformatics Institute (EBI) (1) hosts data from life science experiments comprising assembled genomes; nucleotide sequences; protein sequences; macromolecular structures; small (‘drug-like’) molecules; gene expression; molecular interactions; reactions, pathways and diseases; protein families; enzymes; literature; and samples and ontologies. These represent discrete categories, each containing one or more specialised data resources that are curated and annotated by experts from around the world. Searching for biological information within and across these resources is a challenge. In this article we discuss how EBI Search (previously known as ‘EB-eye’) (2) provides a fast and scalable solution that allows users to query and review results using faceted navigation (3) and filters based on common fields. Users can move seamlessly from the search into specialised web portals where more detail and functionality are available. The search engine is built using the Apache Lucene library (http://lucene.apache.org). It is constantly updated with new data and is under continuous review by scientists as well as specialists in web usability and design. In 2010, EBI Search had 400 million entries. During 2014 it surpassed one billion entries, which are accessible over the web as well as programmatically using SOAP and RESTful Web Services. During 2014 EBI Search was used by more than 393 000 unique Internet Protocol (IP) addresses, which generated 290 million requests.

DATA COVERAGE

EBI Search provides uniform and consistent search and retrieval functionality spanning many individual data resources, split into thematic categories (Table 1). For example, in the ‘Nucleotide sequences’ category all ENA (4) data collections are represented together with data from RNAcentral (5); in the ‘Protein sequences’ category users can find UniProtKB and UniParc (6); ‘Protein families’ contains InterPro (7); ‘Genomes’ includes Ensembl (8) and Ensembl Genomes (9); ‘Gene expression’ includes the baseline and differential Expression Atlases (10); ‘Macromolecular structures’ includes PDBe (11); ‘Small molecules’ contains ChEBI (12) and ChEMBL (13); and ‘Reactions, pathways and diseases’ includes OMIM (http://omim.org), Reactome (14) and Rhea (15). A complete list of data resources is available from https://www.ebi.ac.uk/ebisearch/aboutebisearch.ebi.

Table 1.

Data resources available through EBI Search

Category	Data resources
Genomes	Ensembl Genomes, Ensembl, HGNC, PomBase, DGVa, EGA, LRG, WormBase ParaSite
Nucleotide sequences	ENA, RNAcentral, NRNL1, NRNL2, IMGT/HLA, IPD-KIR, IPD-MHC
Protein sequences	UniProtKB, UniParc, UniRef, EPO, JPO, KIPO, USPTO, NRPL1, NRPL2
Macromolecular structures	PDBe, EMDB
Small molecules	ChEBI, ChEMBL, Ligands
Gene expression	ArrayExpress, Expression Atlases
Molecular interactions	IntAct
Reactions, pathways and diseases	Rhea, Reactome, BioModels, MetaboLights, OMIM
Protein families	InterPro, TreeFam, MEROPS, GPCRDB
Enzymes	IntEnz
Literature	MEDLINE, Patent families, Patents
Samples and ontologies	Taxonomy, GO, EFO, SBO, MESH, BioSamples

EBI Search is automatically updated. This is triggered by a set of scripts that monitor the production cycles of the source data resources, ensuring search results are always up-to-date.

WEB INTERFACE

The main entry points to EBI Search are bespoke search boxes, found throughout the website, into which users can type simple phrases, database identifiers, keywords, gene symbols, species, and molecule and disease names. This is aided by auto-complete, which suggests terms based on indexed content in the system. Search queries can be single or multiple terms combined with Boolean logic (e.g. OR, AND, NOT), and expansion of terms using wildcard characters is supported. The query syntax of the EBI Search engine follows Apache Lucene query parser syntax and its implementation is explained in detail in the help pages: https://www.ebi.ac.uk/ebisearch/documentation.ebi.
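As an illustration of the query forms described above, the following minimal Python sketch composes Boolean and wildcard queries as plain strings. The helper names (`quote_phrase`, `any_of`, `all_of`, `exclude`) and the example terms are hypothetical and are not part of EBI Search; the sketch only assumes the general Lucene-style conventions of quoted phrases, upper-case Boolean operators, parentheses for grouping and `*` as a wildcard. The help pages remain the authoritative description of the supported syntax.

```python
def quote_phrase(text: str) -> str:
    """Wrap a multi-word phrase in double quotes, escaping embedded quotes."""
    return '"' + text.replace('"', '\\"') + '"'


def any_of(*terms: str) -> str:
    """Combine terms with OR and group them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def all_of(*terms: str) -> str:
    """Combine terms with AND."""
    return " AND ".join(terms)


def exclude(query: str, term: str) -> str:
    """Append a NOT clause to an existing query."""
    return f"{query} NOT {term}"


# Hypothetical example: kinases or phosphatases in human, excluding 'putative'.
query = exclude(
    all_of(any_of("kinase", "phosphatase"), quote_phrase("homo sapiens")),
    "putative",
)

# Wildcard expansion: a trailing '*' matches any term starting with 'gluta'.
wildcard_query = "gluta*"

print(query)
print(wildcard_query)
```

The resulting strings can be typed into a search box unchanged, which makes it easy to check a programmatically built query by hand.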

Search results pages

EBI Search executes a query against a vast amount of indexed data, so one challenge is how to present results in a coherent and intuitive way. Search results are organised into the aforementioned biological categories (e.g. ‘Genomes’, ‘Nucleotide sequences’, ‘Gene expression’, etc.). Within each category, the top-ranking results are presented, with the option for the user to expand any category of interest to see all matches. This overview also shows the number of results by category, which helps users identify data resources of interest. Typically, each entry on a query result page displays primary identifiers hyperlinked to the main data resource web portal. Additionally, titles, names and descriptions are shown. Database cross-references are available via a ‘Related data’ button. The relationships between entries in different data resources can be found by navigating through cross-references, which in EBI Search can be implicitly declared by the provider or inferred by the system. A ‘Views’ button provides access to alternative formats of the data (e.g. EMBL format for ENA entries, PDB format, etc.) served via the dbfetch application (16). This button also provides access to analytical tools, such as NCBI BLAST+ (17) and InterProScan 5 (18), which apply to protein sequences in the results. Custom fields are provided for some data resources. For example, UniProtKB query results contain primary and secondary accession numbers, IDs, names, descriptions, species and review status. Results from the ‘Literature’ category contain titles, author lists, journals and publication dates.

Facets help the user filter and narrow down results

The available facets of a search result are presented on the left-hand side when users select a data resource or a category. The text descriptions are followed by check boxes or filtering links for selecting results according to data-specific attributes such as taxonomy, keywords and controlled vocabularies. As an example, search results in UniProtKB can be filtered using the ‘Organisms’ facet, and reviewed entries (i.e. UniProtKB_SwissProt) can be selected by using the ‘Status’ facet (e.g. ‘Reviewed’ or ‘Unreviewed’). Not all facets are keyword-based in the way that, for example, the ‘Type’ facet in InterPro results is; some facets represent ranges, for instance the ‘Publication date’ facet in the ‘Literature’ category. Common and custom facets are shown in Table 2.

Table 2.

Custom and common facets available through EBI Search

Category	Common facets	Custom facets
Genomes	Organisms
Nucleotide sequences	Organisms	Genomic mapping, Expert databases, RNA types (in RNAcentral)
Protein sequences	Organisms	Keywords and Status (in UniProtKB)
Macromolecular structures	Organisms
Gene expression	Organisms	Organism part (in Expression Atlases)
Reactions, pathways and diseases	Organisms	Type, Compartment name and Keywords (in Reactome)
Protein families		Type (in InterPro)
Literature	Publication date
Samples and ontologies		Ontology (in GO)
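
Facet filtering of the kind summarised in Table 2 can also be expressed as a request URL. The Python sketch below is illustrative only: the endpoint root, the parameter names (`query`, `format`, `size`, `facets`), the domain identifier `uniprot` and the facet identifier `status` are assumptions made for this example and should be checked against the Web Services documentation before use. The code only builds the URL and does not contact any server.

```python
from urllib.parse import urlencode

# Assumed endpoint root for the RESTful interface (not stated in this article).
BASE_URL = "https://www.ebi.ac.uk/ebisearch/ws/rest"


def build_search_url(domain, query, facets=None, size=10):
    """Build a search URL, optionally narrowed by facet selections.

    `facets` maps a facet identifier to the selected value; the
    'name:value' encoding joined by commas is an assumption.
    """
    params = {"query": query, "format": "json", "size": size}
    if facets:
        params["facets"] = ",".join(
            f"{name}:{value}" for name, value in sorted(facets.items())
        )
    return f"{BASE_URL}/{domain}?{urlencode(params)}"


# Hypothetical example: reviewed protein entries matching 'glutathione'.
url = build_search_url("uniprot", "glutathione", {"status": "Reviewed"})
print(url)
```

Keeping URL construction separate from the network call makes the filter logic easy to test and lets the same function serve any HTTP client.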

Automatic generation of human-readable ‘Gene & protein summaries’

‘Gene & protein summaries’ are available at the top of the main results page when queries contain established gene names (i.e. HGNC gene nomenclature) or common database identifiers in Ensembl or UniProtKB (i.e. accession numbers). These summaries are organised into five sections presented as tabs, namely: gene, expression, protein, protein structure and literature, and apply to the following model organisms: Homo sapiens, Mus musculus, Drosophila melanogaster, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana and Escherichia coli K-12. These gene-centric summaries are generated by a separate application (https://www.ebi.ac.uk/s4/), which uses EBI Search's SOAP programmatic interface to identify, retrieve and display data from the main resources portals using the DAS protocol (19).

Programmatic access to search functionality via Web Services

In addition to SOAP, a new RESTful Web Service API has been available since June 2014 in response to demand for search services from developers using the latest web technologies and wanting to integrate search functionality into their portals. These Web Services provide methods that can be grouped into four main types: ‘meta-data’ (i.e. retrieving information about searchable data resources), ‘search and retrieval’, ‘navigation’ (i.e. exploring cross-references) and ‘filtering’ (i.e. narrowing down results using facets). Further details and sample clients can be found in the documentation pages at https://www.ebi.ac.uk/Tools/webservices/services/eb-eye_rest. Since the launch of EBI Search in 2007, search functionality has been integrated into projects such as ENA, Ensembl Genomes, InterPro, LRG (20), Rhea, MetaboLights (21), Enzyme Portal (22) and PomBase (23). Novel pipeline processes or analytical workflows can be created by combining methods from EBI Search and other Web Services. For example, entry identifiers from UniProtKB can be sent to the dbfetch Web Service (WSDbfetch), which retrieves the corresponding sequence entries in batch fashion. These sequence entries can in turn be sent to analytical tool Web Services (16) such as Clustal Omega (24). The current version of the APIs covers the existing functionality available in the web interface of EBI Search, including facets and auto-complete. Search result formats include XML, JSON, CSV and TSV, enabling integration of results into third-party frameworks such as AngularJS (https://angularjs.org). An example of such integration is the RNAcentral portal (http://www.rnacentral.org) launched in 2014 (5).
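
The batch retrieval step of such a workflow can be sketched as follows. This is a minimal Python illustration under stated assumptions: the dbfetch URL and its parameters (`db`, `id`, `format`, `style`) are assumed rather than taken from this article, the accessions are placeholders, and the batch size is arbitrary. The code only prepares the request URLs; fetching them and submitting the sequences to an alignment service are left out.

```python
from urllib.parse import urlencode

# Assumed dbfetch endpoint; check the service documentation before use.
DBFETCH_URL = "https://www.ebi.ac.uk/Tools/dbfetch/dbfetch"


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def dbfetch_urls(db, identifiers, fmt="fasta", batch_size=2):
    """Return one retrieval URL per batch of identifiers."""
    urls = []
    for chunk in batches(identifiers, batch_size):
        params = {"db": db, "id": ",".join(chunk), "format": fmt, "style": "raw"}
        urls.append(f"{DBFETCH_URL}?{urlencode(params)}")
    return urls


# Placeholder accessions standing in for identifiers returned by a search.
accessions = ["P12345", "Q9Y6K9", "O15111"]
urls = dbfetch_urls("uniprotkb", accessions)
for u in urls:
    print(u)
```

Splitting identifiers into batches keeps each request short, which matters when a search returns thousands of entries.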

FUTURE DIRECTIONS

As the volume of data and the number of data resources continue to grow, providing continuous search functionality is a major challenge. We will address it by improving code, simplifying configuration and hardware requirements, analysing user queries and exploring novel technologies. Improving the user experience through user-centred design techniques remains a central focus. Users will be able to select results and download these in formats that can be consumed programmatically for further post-processing (e.g. for further analysis in local pipelines and workflows). In addition to the previously mentioned formats, RSS 2 (http://www.rssboard.org/rss-specification) will be available to help users pre-generate queries, which can be repeated over time to check for new or updated data. These bespoke alerting systems can be enacted using widely available RSS clients or built-in browser tools. Methods that simplify the launching of applications such as BLAST from search results, using the tools provided by the Job Dispatcher framework (25), are also being tested. Last but not least, the SOAP API will be phased out during 2016 in order to concentrate resources on the RESTful interface, which is easier to use and more scalable.
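
An alerting client of the kind envisaged here could be sketched as below. The feed content is invented for illustration, and the use of `guid` to recognise entries already seen is an assumption about how such a feed would identify its items; only the general RSS 2 layout (`channel` containing `item` elements) is relied upon.

```python
import xml.etree.ElementTree as ET

# Invented sample feed; a real feed would be downloaded from a saved query.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Saved query: example</title>
    <item><title>Entry A</title><guid>A1</guid></item>
    <item><title>Entry B</title><guid>B2</guid></item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_guids):
    """Return (guid, title) pairs for feed items not present in `seen_guids`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iterfind("./channel/item"):
        guid = item.findtext("guid")
        if guid and guid not in seen_guids:
            fresh.append((guid, item.findtext("title", default="")))
    return fresh


# Entry A was reported on a previous run, so only Entry B is new.
alerts = new_items(SAMPLE_FEED, {"A1"})
print(alerts)
```

Persisting the set of seen identifiers between runs is what turns a repeated query into an alert for new or updated data.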

DISCUSSION

EBI Search is built on top of technologies that allow fast indexing and searching of vast amounts of data. The implementation of a scalable search system relies on the quick and efficient uptake of the latest technologies and on their successful integration into existing compute infrastructures at little or negligible cost. The EBI Search engine has been successful at acting both as a ‘global search’ that presents results across many distinct knowledge domains and as a ‘local search’, thanks to the implementation of industry-standard Web Services. Keeping up-to-date with changes in search technologies as well as with changes in the underlying data is challenging. However, this drives forward the development of simpler, more responsive and more efficient methods of finding and re-using biological information. By providing direct access to the primary data sources (web portals), where biological entities are fully annotated and displayed in the way expected by specialists, the EBI Search engine can provide access to a large range of data sources at a lower cost than multiple search engines. EBI Search must not be confused with an integration platform or a data warehouse. It enables interoperability between distinct underlying data resources and analytical tools, ultimately delivering a powerful and reproducible way to interpret biological search results.

24 in total

1. Updates in Rhea--a manually curated resource of biochemical reactions.

Authors: Anne Morgat; Kristian B Axelsen; Thierry Lombardot; Rafael Alcántara; Lucila Aimo; Mohamed Zerara; Anne Niknejad; Eugeni Belda; Nevila Hyka-Nouspikel; Elisabeth Coudert; Nicole Redaschi; Lydie Bougueleret; Christoph Steinbeck; Ioannis Xenarios; Alan Bridge
Journal: Nucleic Acids Res Date: 2014-10-20 Impact factor: 16.971

2. Content discovery and retrieval services at the European Nucleotide Archive.

Authors: Nicole Silvester; Blaise Alako; Clara Amid; Ana Cerdeño-Tárraga; Iain Cleland; Richard Gibson; Neil Goodgame; Petra Ten Hoopen; Simon Kay; Rasko Leinonen; Weizhong Li; Xin Liu; Rodrigo Lopez; Nima Pakseresht; Swapna Pallreddy; Sheila Plaister; Rajesh Radhakrishnan; Marc Rossello; Alexander Senf; Dmitriy Smirnov; Ana Luisa Toribio; Daniel Vaughan; Vadim Zalunin; Guy Cochrane
Journal: Nucleic Acids Res Date: 2014-11-17 Impact factor: 16.971

3. BLAST+: architecture and applications.

Authors: Christiam Camacho; George Coulouris; Vahram Avagyan; Ning Ma; Jason Papadopoulos; Kevin Bealer; Thomas L Madden
Journal: BMC Bioinformatics Date: 2009-12-15 Impact factor: 3.169

4. Integrating biological data--the Distributed Annotation System.

Authors: Andrew M Jenkinson; Mario Albrecht; Ewan Birney; Hagen Blankenburg; Thomas Down; Robert D Finn; Henning Hermjakob; Tim J P Hubbard; Rafael C Jimenez; Philip Jones; Andreas Kähäri; Eugene Kulesha; José R Macías; Gabrielle A Reeves; Andreas Prlić
Journal: BMC Bioinformatics Date: 2008-07-22 Impact factor: 3.169

5. InterProScan 5: genome-scale protein function classification.

Authors: Philip Jones; David Binns; Hsin-Yu Chang; Matthew Fraser; Weizhong Li; Craig McAnulla; Hamish McWilliam; John Maslen; Alex Mitchell; Gift Nuka; Sebastien Pesseat; Antony F Quinn; Amaia Sangrador-Vegas; Maxim Scheremetjew; Siew-Yit Yong; Rodrigo Lopez; Sarah Hunter
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937

6. RNAcentral: an international database of ncRNA sequences.

Authors: Anton I Petrov; Simon J E Kay; Richard Gibson; Eugene Kulesha; Dan Staines; Elspeth A Bruford; Mathew W Wright; Sarah Burge; Robert D Finn; Paul J Kersey; Guy Cochrane; Alex Bateman; Sam Griffiths-Jones; Jennifer Harrow; Patricia P Chan; Todd M Lowe; Christian W Zwieb; Jacek Wower; Kelly P Williams; Corey M Hudson; Robin Gutell; Michael B Clark; Marcel Dinger; Xiu Cheng Quek; Janusz M Bujnicki; Nam-Hai Chua; Jun Liu; Huan Wang; Geir Skogerbø; Yi Zhao; Runsheng Chen; Weimin Zhu; James R Cole; Benli Chai; Hsien-Da Huang; His-Yuan Huang; J Michael Cherry; Artemis Hatzigeorgiou; Kim D Pruitt
Journal: Nucleic Acids Res Date: 2014-10-28 Impact factor: 16.971

7. Expression Atlas update--a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments.

Authors: Robert Petryszak; Tony Burdett; Benedetto Fiorelli; Nuno A Fonseca; Mar Gonzalez-Porta; Emma Hastings; Wolfgang Huber; Simon Jupp; Maria Keays; Nataliya Kryvych; Julie McMurry; John C Marioni; James Malone; Karine Megy; Gabriella Rustici; Amy Y Tang; Jan Taubert; Eleanor Williams; Oliver Mannion; Helen E Parkinson; Alvis Brazma
Journal: Nucleic Acids Res Date: 2013-12-04 Impact factor: 16.971

8. PDBe: Protein Data Bank in Europe.

Authors: Aleksandras Gutmanas; Younes Alhroub; Gary M Battle; John M Berrisford; Estelle Bochet; Matthew J Conroy; Jose M Dana; Manuel A Fernandez Montecelo; Glen van Ginkel; Swanand P Gore; Pauline Haslam; Rowan Hatherley; Pieter M S Hendrickx; Miriam Hirshberg; Ingvar Lagerstedt; Saqib Mir; Abhik Mukhopadhyay; Thomas J Oldfield; Ardan Patwardhan; Luana Rinaldi; Gaurav Sahni; Eduardo Sanz-García; Sanchayita Sen; Robert A Slowley; Sameer Velankar; Michael E Wainwright; Gerard J Kleywegt
Journal: Nucleic Acids Res Date: 2013-11-27 Impact factor: 16.971

9. Locus Reference Genomic: reference sequences for the reporting of clinically relevant sequence variants.

Authors: Jacqueline A L MacArthur; Joannella Morales; Ray E Tully; Alex Astashyn; Laurent Gil; Elspeth A Bruford; Pontus Larsson; Paul Flicek; Raymond Dalgleish; Donna R Maglott; Fiona Cunningham
Journal: Nucleic Acids Res Date: 2013-11-26 Impact factor: 16.971

10. The InterPro protein families database: the classification resource after 15 years.

Authors: Alex Mitchell; Hsin-Yu Chang; Louise Daugherty; Matthew Fraser; Sarah Hunter; Rodrigo Lopez; Craig McAnulla; Conor McMenamin; Gift Nuka; Sebastien Pesseat; Amaia Sangrador-Vegas; Maxim Scheremetjew; Claudia Rato; Siew-Yit Yong; Alex Bateman; Marco Punta; Teresa K Attwood; Christian J A Sigrist; Nicole Redaschi; Catherine Rivoire; Ioannis Xenarios; Daniel Kahn; Dominique Guyot; Peer Bork; Ivica Letunic; Julian Gough; Matt Oates; Daniel Haft; Hongzhan Huang; Darren A Natale; Cathy H Wu; Christine Orengo; Ian Sillitoe; Huaiyu Mi; Paul D Thomas; Robert D Finn
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 16.971
