
Programmatic access to bioinformatics tools from EMBL-EBI update: 2017.

Szymon Chojnacki, Andrew Cowley, Joon Lee, Anna Foix, Rodrigo Lopez.

Abstract

Since 2009, EMBL-EBI has provided free and unrestricted access to several bioinformatics tools via the user's browser as well as programmatically via Web Services APIs. Programmatic access to these tools, which is fundamental to bioinformatics, is increasingly important as more high-throughput data are generated, e.g. from proteomics and metagenomic experiments. Access is available using both the SOAP and RESTful approaches, and usage is reviewed regularly to ensure that the best, supported tools are available to all users. We present here an update describing the latest enhancements to the Job Dispatcher APIs as well as the governance model that underpins them.
© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2017        PMID: 28431173      PMCID: PMC5570243          DOI: 10.1093/nar/gkx273

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The EMBL-EBI Job Dispatcher (1–3) framework provides an interface between High Performance Compute clusters and command-line applications. It integrates tools and generates uniform interfaces from which the Web, SOAP and RESTful APIs are built. It also produces statistics in a common format for each tool, which makes it possible to analyze detailed usage with common analytic tools. At present, tools include sequence similarity search services (https://www.ebi.ac.uk/Tools/sss/) such as BLAST (4), FASTA (5) and PSI-Search (6), multiple sequence alignment tools (https://www.ebi.ac.uk/Tools/msa/) such as Clustal Omega (7), MAFFT (8) and KAlign (9), and other sequence analysis tools (https://www.ebi.ac.uk/Tools/pfa/) such as InterProScan5 (10). The sequence similarity search tools give access to 45 000 distinct sequence libraries from ENA (11), Ensembl Genomes (12), UniProt (13), InterPro (14) and Pfam (15). These contain sequences from whole genomes and complete proteomes, gene sequence submissions, transcripts, reference proteomes, amplicons, metagenomes, metatranscriptomes and assemblies from metagenomic studies, sequences from patents and specialized collections, such as sequences from immunological studies. During 2016, usage totaled 152 million jobs. Users include academic and industry scientists, and the services are supplemented by training and support activities in collaboration with the EMBL-EBI training program (16).

THE TOOLS FRAMEWORK

The Job Dispatcher framework consists of: a tools configuration module; a cluster scheduling interface that communicates with the queuing system; and results management and rendering modules that coordinate how results are displayed. The framework is developed using the Java JAX-WS APIs for creating XML-based SOAP and RESTful Web Services. Extensive validation routines are built into the web service to ensure that the correct types of data and parameter values are sent to the tools. All outputs are examined in order to verify tool execution, detect errors and produce human-readable reports. Visual representations of tool results are provided to help the user understand the job output, either interactively using a web browser or programmatically using the web services clients. Clients written in C#, Java, Perl, Python, PHP and Ruby are available for many tools; they can be used directly from the command line as part of a workflow or pipeline, or as templates for integrating tool functionality into complex applications. Work is in progress to add Common Workflow Language (CWL) (https://github.com/common-workflow-language/common-workflow-language) descriptions, which will allow the clients to be further integrated into workflow management systems that support CWL, such as Taverna (http://www.taverna.org.uk/), Arvados (https://arvados.org/) and Galaxy (https://galaxyproject.org/). Tools such as HMMER3, with in-memory database support, are in the pipeline, as are new compute resources that scale better with current usage. In response to user demand, a complete collection of clients written in Python is being developed; these will support JSON and allow users to pass results to analytical suites such as R. Already, the Job Dispatcher framework provides a high level of interoperability by allowing users to specify output formats such as XML, JSON, CSV and TSV. Furthermore, these formats significantly ease the effort of integrating tool functionality into third-party portals. In effect, the framework provisions 'software as a service', which, in the context of bioinformatics, fits well with the mission of EMBL-EBI.
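The submit, poll and fetch cycle that a programmatic client follows can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the official client: the base URL, the `/run`, `/status/{jobId}` and `/result/{jobId}/{resultType}` paths and the status strings are assumptions modelled on the usual Job Dispatcher REST pattern and should be checked against the service documentation. The helper names `http_fetch` and `run_job` are hypothetical.

```python
import time
import urllib.parse
import urllib.request

# Assumed base URL; it is not stated in the text above.
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest"


def http_fetch(url, data=None):
    """Default transport: GET when data is None, otherwise a form-encoded POST."""
    body = None
    if data is not None:
        body = urllib.parse.urlencode(data).encode("utf-8")
    with urllib.request.urlopen(url, data=body) as response:
        return response.read().decode("utf-8")


def run_job(tool, params, result_type, fetch=http_fetch,
            poll_seconds=5.0, max_polls=120):
    """Submit a job, poll until it finishes, and return the result as text.

    The transport is injectable so the control flow can be exercised
    without touching the network.
    """
    # Step 1: submit; the service is assumed to answer with a plain-text job id.
    job_id = fetch(f"{BASE_URL}/{tool}/run", params).strip()

    # Step 2: poll the status endpoint until the job leaves the queue.
    for _ in range(max_polls):
        status = fetch(f"{BASE_URL}/{tool}/status/{job_id}").strip()
        if status == "FINISHED":
            # Step 3: fetch the requested output format.
            return fetch(f"{BASE_URL}/{tool}/result/{job_id}/{result_type}")
        if status in ("ERROR", "FAILURE", "NOT_FOUND"):  # assumed status strings
            raise RuntimeError(f"job {job_id} ended with status {status}")
        time.sleep(poll_seconds)

    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A call might look like `run_job("clustalo", {"email": "user@example.org", "sequence": fasta_text}, "aln-clustal_num")`, where the tool name comes from Table 1 but the parameter names and the result type are tool-specific and assumed here. Passing the transport in as an argument also makes it straightforward to embed the same loop in a pipeline with its own retry or logging policy.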

NEW ANALYSIS TOOLS AND DATABASES

New tools include: HMMER3 (17), which uses probabilistic models called hidden Markov models to search sequence databases for sequence homologs; Simple_Phylogeny, which replaces ClustalW2_Phylogeny and provides access to phylogenetic tree generation methods; PredComp (1), which compares a set of predicted annotations against the automated annotations existing in UniProt-TrEMBL and generates a comprehensive graphical report; and PSI-Search2 (18), an improved version of PSI-Search that can reduce the frequency of false-positive alignments more than 20-fold compared with psiblast. A complete list of currently supported categories and tools is shown in Table 1. ChEMBL (19), MEROPS (20) and the ENA Barcode, Geospatial and non-coding sequences (11) have been added to the sequence similarity search libraries. Importantly, new UniProt Reference Proteomes (13) and Enzyme Centric (21) libraries are also now available.
Table 1. Tool services available in the Job Dispatcher framework (2017)

Category: Tools
EMBOSS Programs (https://www.ebi.ac.uk/Tools/emboss/): needle, stretcher, water, matcher, transeq, sixpack, backtranseq, backtranambig, pepinfo, pepstats, pepwindow, cpgplot, newcpgreport, isochore and seqret
Multiple Sequence Alignment (https://www.ebi.ac.uk/Tools/msa/): clustal omega, kalign, mafft, mafft_addseq, muscle, mview, tcoffee and prank
Pairwise Sequence Alignment (https://www.ebi.ac.uk/Tools/psa/): needle, stretcher, water, matcher, lalign, wise2dba, genewise and promoterwise
Phylogeny Analysis (https://www.ebi.ac.uk/Tools/phylogeny/): simple_phylogeny and raxml_epa
Protein Functional Analysis (https://www.ebi.ac.uk/Tools/pfa/): interproscan5, pfamscan, phobius, pratt, prosite scan and radar
RNA Analysis (https://www.ebi.ac.uk/Tools/rna/): infernal_cmscan and mapmi
Sequence Format Conversion (https://www.ebi.ac.uk/Tools/sfc/): seqret and mview
Sequence Operation (https://www.ebi.ac.uk/Tools/so/): seqcksum
Sequence Similarity Search (https://www.ebi.ac.uk/Tools/sss/): ncbiblast+, fasta, ggsearch, glsearch, psiblast, psisearch, psisearch2 and ssearch
Sequence Statistics (https://www.ebi.ac.uk/Tools/seqstats/): pepinfo, pepstats, pepwindow, saps, cpgplot, newcpgplot and isochore
Sequence Translation (https://www.ebi.ac.uk/Tools/st/): transeq, sixpack, backtranseq and backtranambig

TOOLS AND DATABASE RETIREMENTS

Workflow tools such as ps_scan (22), InterProScan (23) and FingerPrintScan (24) have been retired, although some remain part of the InterProScan5 tool. ClustalW2 (25), WU-Blast (26), MaxSprout (27), DaliLite (28), DBClustal (29) and ReadSeq (30) have also been removed.

TOOLS GOVERNANCE

EMBL-EBI is proud to provide free access to data and analytical tools. Many variables need to be taken into account when deciding whether to provide access to a tool or database. These range from operational requirements to the relative size of a tool's user community. Importantly, the scientific suitability of a particular tool to produce up-to-date and relevant results needs to be considered. To manage this process, a governance model has been set up that comprises expert users of bioinformatics tools, developers, infrastructure managers and usability specialists. It is also important that users' opinions count, and these are obtained through annual surveys (please see: http://www.ebi.ac.uk/about/our-impact). The governance model requires detailed analysis of usage statistics. This covers storage, CPU and memory use, number of runs, provenance and interface used (www, SOAP or REST), as well as the availability of support for enhancements and bug fixes, publications and current training.

TOOLS USAGE

The top 10 tools, by job number alone, are as follows. First is InterProScan5, which, as a workflow, currently runs 19 protein domain and structural domain detection methods. This is followed by the BLAST+ programs, which give access to ∼45 000 libraries of sequences from ENA, UniProt and EnsemblGenomes. Clustal Omega and Muscle are the most popular multiple sequence alignment methods. water and needle from the EMBOSS suite give access to local and global pairwise alignment methods. seqret is very popular for sequence reformatting, pfamscan for searching Pfam HMMs and Phobius (31) for predicting transmembrane regions and signal peptides. Finally, simple_phylogeny is used for generating phylogenetic trees using the Neighbor-Joining (32) and UPGMA (33) methods. About 56% of all usage occurs through the RESTful APIs, while 26 and 18% occur through the SOAP and www interfaces, respectively. Users come from all over the world, but predominantly from: Germany with 36%; USA with 28%; Japan with 10%; UK with 6%; France with 5%; China and India with 4%; Portugal, Spain and South Korea with 2%. Uptake by users has been steady since 2009, as can be seen in Figure 1, which shows job submissions to the Job Dispatcher framework during 2009–2016.
Figure 1. Job Dispatcher jobs 2009–2016.
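To give a sense of scale, the interface shares can be turned into rough job counts. The short sketch below assumes that the quoted shares (56% REST, 26% SOAP, 18% www) apply to the 152 million jobs reported for 2016 in the Introduction; the text does not state the period the percentages cover, so the figures are indicative only. The function name `jobs_per_interface` is hypothetical.

```python
# Figures taken from the text; applying the shares to the 2016 total is an assumption.
TOTAL_JOBS_2016 = 152_000_000
INTERFACE_SHARE_PERCENT = {"REST": 56, "SOAP": 26, "www": 18}


def jobs_per_interface(total, shares):
    """Split a job total across interfaces according to integer percentage shares."""
    return {name: total * percent // 100 for name, percent in shares.items()}


estimates = jobs_per_interface(TOTAL_JOBS_2016, INTERFACE_SHARE_PERCENT)
for name, jobs in estimates.items():
    print(f"{name}: ~{jobs / 1e6:.1f} million jobs")
```

Under that assumption, roughly 85 million jobs would have arrived through the RESTful APIs, about 40 million through SOAP and about 27 million through the web interface.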

DISCUSSION

Providing robust and reliable access to bioinformatics tools has been an important focus of this framework since 2009. These tools represent the workhorses of modern bioinformatics, and the continuous improvement of the framework ensures that the APIs interoperate as easily as possible with workflow management systems. Having a governance model ensures that resources are available to run the tools and meet user demand in a measured and optimal way, and that the acquisition of results and data from EMBL-EBI is consistent, uniform and, importantly, as up-to-date as possible. Support and training are important for understanding usage, and users are also encouraged to provide feedback via https://www.ebi.ac.uk/support.
REFERENCES (32 in total)

1.  Improved tools for biological sequence comparison.

Authors:  W R Pearson; D J Lipman
Journal:  Proc Natl Acad Sci U S A       Date:  1988-04       Impact factor: 11.205

2.  The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors:  N Saitou; M Nei
Journal:  Mol Biol Evol       Date:  1987-07       Impact factor: 16.240

3.  MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors:  Kazutaka Katoh; Daron M Standley
Journal:  Mol Biol Evol       Date:  2013-01-16       Impact factor: 16.240

4.  New and continuing developments at PROSITE.

Authors:  Christian J A Sigrist; Edouard de Castro; Lorenzo Cerutti; Béatrice A Cuche; Nicolas Hulo; Alan Bridge; Lydie Bougueleret; Ioannis Xenarios
Journal:  Nucleic Acids Res       Date:  2012-11-17       Impact factor: 16.971

5.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors:  Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal:  Mol Syst Biol       Date:  2011-10-11       Impact factor: 11.429

6.  InterProScan: protein domains identifier.

Authors:  E Quevillon; V Silventoinen; S Pillai; N Harte; N Mulder; R Apweiler; R Lopez
Journal:  Nucleic Acids Res       Date:  2005-07-01       Impact factor: 16.971

7.  HMMER web server: 2015 update.

Authors:  Robert D Finn; Jody Clements; William Arndt; Benjamin L Miller; Travis J Wheeler; Fabian Schreiber; Alex Bateman; Sean R Eddy
Journal:  Nucleic Acids Res       Date:  2015-05-05       Impact factor: 16.971

8.  An update on the Enzyme Portal: an integrative approach for exploring enzyme knowledge.

Authors:  S Pundir; J Onwubiko; R Zaru; S Rosanoff; R Antunes; M Bingley; X Watkins; C O'Donovan; M J Martin
Journal:  Protein Eng Des Sel       Date:  2017-03-01       Impact factor: 1.650

9.  Analysis Tool Web Services from the EMBL-EBI.

Authors:  Hamish McWilliam; Weizhong Li; Mahmut Uludag; Silvano Squizzato; Young Mi Park; Nicola Buso; Andrew Peter Cowley; Rodrigo Lopez
Journal:  Nucleic Acids Res       Date:  2013-05-13       Impact factor: 16.971

10.  Ensembl Genomes 2016: more genomes, more complexity.

Authors:  Paul Julian Kersey; James E Allen; Irina Armean; Sanjay Boddu; Bruce J Bolt; Denise Carvalho-Silva; Mikkel Christensen; Paul Davis; Lee J Falin; Christoph Grabmueller; Jay Humphrey; Arnaud Kerhornou; Julia Khobova; Naveen K Aranganathan; Nicholas Langridge; Ernesto Lowy; Mark D McDowall; Uma Maheswari; Michael Nuhn; Chuang Kee Ong; Bert Overduin; Michael Paulini; Helder Pedro; Emily Perry; Giulietta Spudich; Electra Tapanari; Brandon Walts; Gareth Williams; Marcela Tello-Ruiz; Joshua Stein; Sharon Wei; Doreen Ware; Daniel M Bolser; Kevin L Howe; Eugene Kulesha; Daniel Lawson; Gareth Maslen; Daniel M Staines
Journal:  Nucleic Acids Res       Date:  2015-11-17       Impact factor: 16.971
