
A new bioinformatics analysis tools framework at EMBL-EBI.

Mickael Goujon, Hamish McWilliam, Weizhong Li, Franck Valentin, Silvano Squizzato, Juri Paern, Rodrigo Lopez.

Abstract

The EMBL-EBI provides access to various mainstream sequence analysis applications. These include sequence similarity search services such as BLAST, FASTA and InterProScan, and multiple sequence alignment tools such as ClustalW, T-Coffee and MUSCLE. Through the sequence similarity search services, users can search mainstream sequence databases such as EMBL-Bank and UniProt, and more than 2000 completed genomes and proteomes. We present here a new framework aimed at both novice and expert users that exposes novel methods of obtaining annotations and visualizing sequence analysis results through one uniform and consistent interface. These services are available over the web and via Web Services interfaces for users who require systematic access or want to interface with customized pipelines and workflows using common programming languages. The framework features novel result visualizations and integration of domain and functional predictions for protein database searches. It is available at http://www.ebi.ac.uk/Tools/sss for sequence similarity searches and at http://www.ebi.ac.uk/Tools/msa for multiple sequence alignments.

Year: 2010 PMID: 20439314 PMCID: PMC2896090 DOI: 10.1093/nar/gkq313

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Bioinformatics is a vast and complex multidisciplinary research area where numerous tools have been developed over the years to analyse constantly growing amounts of data. Since 1998, the European Bioinformatics Institute (EMBL–EBI) has provided public access to various mainstream sequence analysis applications (1,2). These include sequence similarity search services (http://www.ebi.ac.uk/Tools/similarity.html), such as FASTA (3), BLAST (4,5) and InterProScan (6), and multiple sequence alignment tools (http://www.ebi.ac.uk/Tools/sequence.html), such as ClustalW (7), T-Coffee (8), MUSCLE (9), Kalign (10) and MAFFT (11). These services are provided via a PERL-CGI job dispatcher framework for managing job submission and result representation. This infrastructure handled more than 16 million jobs during 2009. The popularity of these services has made it necessary to redesign the system in order to minimize maintenance and enhance the integration of features requested by users. A new and modular framework, called JDispatcher, has been developed to improve the accessibility and quality of the services relevant to the biological community.

JDispatcher framework

JDispatcher is aimed at both novice and expert users and exposes novel methods of obtaining annotations and visualizing sequence analysis results through one uniform and consistent interface. These services are available interactively over the web and via SOAP and REST interfaces for systematic or programmatic use. The new framework provides input validation to assure successful job submissions, offers new visualization features to assist in the interpretation of results and uses the EBI search engine, EB-eye (12), to integrate relevant annotations. A user can submit sequences using web forms that contain all supported parameters and their possible values. The different tools have been grouped into categories based on their purpose (Table 1).

Table 1.

Tools available in the JDispatcher framework

Category	Tool
Sequence Similarity Searches (sss)	psisearch, psiblast, ncbiblast, wublast, fasta, ssearch, ggsearch and glsearch
Multiple Sequence Alignments (msa)	clustalw2, tcoffee, kalign, muscle, mafft, and prank

Within a category, the tools share the same interface design, which uses well-established usability patterns, such as wizard-like steps to guide the user through the submission process. It makes use of decision trees to validate all the parameters required to ensure successful job submissions. If the validation fails, the user is notified about which specific parameters or data are invalid, and the job is not submitted. Otherwise, JDispatcher assigns a unique job identifier and sends a request to a workload management system for the job to be executed. The identifier is then used to keep track of the tasks and to retrieve the results when they become available. The results of each job are kept for a maximum of 7 days.
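The submission flow described above (validate, report the invalid parameters, otherwise assign an identifier and hand the job to a workload manager) can be sketched as follows. This is a minimal illustration only: the rule table, the parameter names, the identifier format and the in-memory queue are all hypothetical stand-ins, since the text does not describe JDispatcher's actual decision trees or workload management system in that detail.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical rules for one tool category. JDispatcher's real decision
# trees are not specified in the text; this flat table is an assumption.
RULES = {
    "program": {"required": True, "allowed": {"ssearch", "fasta", "ggsearch", "glsearch"}},
    "stype": {"required": True, "allowed": {"protein", "dna"}},
    "sequence": {"required": True, "allowed": None},  # any non-empty value
}


@dataclass
class SubmissionResult:
    job_id: Optional[str] = None
    errors: Dict[str, str] = field(default_factory=dict)


def validate(params: dict) -> Dict[str, str]:
    """Return parameter name -> problem description; empty means valid."""
    errors = {}
    for name, rule in RULES.items():
        value = params.get(name)
        if value in (None, ""):
            if rule["required"]:
                errors[name] = "missing"
        elif rule["allowed"] is not None and value not in rule["allowed"]:
            errors[name] = f"unsupported value {value!r}"
    return errors


def submit(params: dict, queue: List[Tuple[str, dict]]) -> SubmissionResult:
    """Validate first; only valid jobs get an identifier and are queued."""
    errors = validate(params)
    if errors:
        # The job is not submitted; the caller learns which parameters failed.
        return SubmissionResult(errors=errors)
    job_id = f"{params['program']}-{uuid.uuid4().hex}"  # hypothetical id format
    queue.append((job_id, params))  # stand-in for the workload manager
    return SubmissionResult(job_id=job_id)
```

A caller would keep the returned `job_id` to poll for status and fetch results later, which mirrors how the identifier is used in the paragraph above.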

Results representation

The results of an analysis are made available using various representations (e.g. HTML tables, XML files, images, etc.). In order to produce these representations, each result is converted into a generic category-specific model that is used by a renderer that generates the requested output. The renderers are specific to the model and not to the tool, and thus are available across all the tools in a category. The availability of multiple views of the same data helps the user to interpret and compare results from different tools within a category. Sequence search algorithms produce only limited annotation for hits. With the new framework it is possible to navigate hits and access related information. Figure 1 shows the ‘Summary Table’ of an SSEARCH of mouse glomulin (UniProtKB/Swiss-Prot GLMN_MOUSE), which is essential for the development of the vascular system, against the UniProtKB/Swiss-Prot database (13). Each column heading has clickable arrows that allow the user to sort the results according to the values in the columns [e.g. sequence length, score, percentage identity, positives and E()-value]. Each match is enriched with links to cross-references and related information in various data resources (e.g. gene expression, genomic sequences, structures, function, ontologies and literature citations). Optionally, the alignment from the search and/or the full annotation for the selected matches can be displayed. A selection of hits can also be downloaded in FASTA format.
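The separation between a category-specific model and its renderers can be illustrated with a small sketch. Everything here is hypothetical: the `Hit` fields are taken from the sortable columns named above, while the tab-separated input format, the function names and the XML layout are assumptions made for illustration and are not JDispatcher's own formats.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import quoteattr


@dataclass
class Hit:
    """Generic model for one match in a similarity search result."""
    accession: str
    length: int
    score: float
    identity: float
    evalue: float


def parse_tabular(text: str) -> List[Hit]:
    """Hypothetical tool-specific step: tab-separated lines -> model."""
    hits = []
    for line in text.strip().splitlines():
        accession, length, score, identity, evalue = line.split("\t")
        hits.append(Hit(accession, int(length), float(score),
                        float(identity), float(evalue)))
    return hits


def render_table(hits: List[Hit], sort_by: str = "evalue",
                 descending: bool = False) -> str:
    """Renderer 1: a sortable summary table, independent of the tool."""
    ordered = sorted(hits, key=lambda h: getattr(h, sort_by), reverse=descending)
    rows = ["accession\tlength\tscore\tidentity\tevalue"]
    for h in ordered:
        rows.append(f"{h.accession}\t{h.length}\t{h.score}\t{h.identity}\t{h.evalue}")
    return "\n".join(rows)


def render_xml(hits: List[Hit]) -> str:
    """Renderer 2: the same model emitted as a simple XML document."""
    items = [
        f"  <hit accession={quoteattr(h.accession)} evalue={quoteattr(str(h.evalue))}/>"
        for h in hits
    ]
    return "<hits>\n" + "\n".join(items) + "\n</hits>"
```

Because both renderers consume only `Hit` objects, supporting another search tool in this sketch would need a new parser and nothing else, which is the property the paragraph attributes to the framework.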

Figure 1.

Summary Table view of the results obtained when searching the sequence of mouse glomulin against the UniProtKB/Swiss-Prot database using SSEARCH.

Figure 2 shows the ‘Visual Output’ obtained from searches using SSEARCH and NCBI BLAST of the glomulin sequence against UniProtKB/Swiss-Prot using default parameters. Comparison of the two images reveals notable differences in the sequence matches reported by the two search methods. For example, differences in the aligned regions between glomulin and aberrant root formation protein 4 from Arabidopsis (ALF4_ARATH) are clearly visible in both; SSEARCH identifies two MON2 homologues at E()-values <1 (MON2_XENLA and MON2_HUMAN), which may indicate a structural relationship between GLMN and the C-terminus of the MON2 homologues, although these may not share related functions.

Figure 2.

Comparisons between the Visual output results obtained when searching the sequence of mouse glomulin against the UniProtKB/Swiss-Prot database using SSEARCH and NCBI BLAST, respectively.

Determining which functional domains and families a protein belongs to is critical to understanding the biological processes it may be involved in. This is important for the characterization of existing drug targets as well as the identification of novel ones. Family and domain functional predictions have been built into the framework, using pre-calculated matches from the InterPro Consortium (14) data. This enables users not only to search for sequence similarities when using the UniProt databases, but also to characterize the query sequence in terms of domain architectures that may help elucidate its function. Figure 3 shows ‘Functional Predictions’ for a hypothetical bioactive lysophospholipid that was compared against UniProtKB/Swiss-Prot using NCBI BLAST. The hypothetical sequence has several good homologues, all belonging to the GPCR rhodopsin-like superfamily, which are clearly seen. This indicates the query protein could represent a potential target for receptor-binding studies.

Figure 3.

Functional prediction view of the results obtained when comparing the sequence of a putative bioactive lysophospholipid against UniProtKB/Swiss-Prot using NCBI BLAST.

In both the ‘Visual Output’ and ‘Functional Predictions’ result representations, the matches are coloured, from red to blue, according to E()-value, using a relative scale, from the most to the least significant hits within the result. An absolute scale, which ranges from E() = 0 to E() = 1.0, is also available. These aim to aid the user in deciding whether weak similarities may be biologically significant. These images are available in Scalable Vector Graphics (SVG), Portable Network Graphics (PNG) and JPEG output, providing wide compatibility. The raw result and processed forms, such as the ‘Summary Table’ content and XML formats, are downloadable for further processing by the user. The examples above illustrate how, from a single sequence similarity search, it is possible to access related sources of annotation, determine visually which results are relevant and infer gene and protein functional associations, using the JDispatcher framework.
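The two colour scales can be approximated with a short sketch. The text states only the endpoints (red for the most significant hit, blue for the least) and the range of the absolute scale (E() = 0 to 1.0). The interpolation used here is an assumption: linear in E() for the absolute scale, and linear in log10 E() for the relative scale, with a hypothetical floor to avoid taking the logarithm of zero. The framework's real mapping may differ.

```python
import math
from typing import List, Tuple

RGB = Tuple[int, int, int]
RED: RGB = (255, 0, 0)   # most significant
BLUE: RGB = (0, 0, 255)  # least significant


def _blend(t: float) -> RGB:
    """Interpolate red -> blue for t in [0, 1]; values outside are clamped."""
    t = min(max(t, 0.0), 1.0)
    return tuple(round(r + (b - r) * t) for r, b in zip(RED, BLUE))


def absolute_colour(evalue: float) -> RGB:
    """Absolute scale: E() = 0 maps to red, E() >= 1.0 maps to blue.

    Linear interpolation in E() is an assumption, not a documented rule.
    """
    return _blend(evalue)


def relative_colours(evalues: List[float], floor: float = 1e-300) -> List[RGB]:
    """Relative scale: best hit in this result is red, worst is blue.

    Interpolating on log10(E()) is an assumption; `floor` is a hypothetical
    guard so that E() = 0 does not break the logarithm.
    """
    logs = [math.log10(max(e, floor)) for e in evalues]
    lo, hi = min(logs), max(logs)
    span = hi - lo
    return [_blend((x - lo) / span if span else 0.0) for x in logs]
```

The relative scale always uses the full colour range within one result, whereas the absolute scale lets colours be compared across separate searches.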

Web Services

Web Services technologies have opened up important opportunities for the analysis of life sciences data. It is now well established that sharing resources across geographically distributed networks is advantageous to scientists and bioinformaticians through the re-use of generic services, such as those presented in this article. The new JDispatcher framework provides multiple front-ends: in addition to the web interface, SOAP and REST APIs (http://www.ebi.ac.uk/Tools/webservices/) have been implemented to offer programmatic access using accepted web services standards. The SOAP and REST APIs cater for users requiring systematic access to a wide range of sequence similarity search and multiple sequence alignment services, which can be built into local analytical workflows and pipelines [e.g. Taverna (15), Triana (http://www.trianacode.org/), KNIME (www.knime.org) (16) and Pipeline Pilot (http://accelrys.com/products/scitegic/index.html)]. Typical usage scenarios include the characterization of novel genomes and proteomes and the analysis of data derived from meta-genome experiments. Using the APIs, complex applications can be developed in various programming languages, including C/C++, C#, Java, Perl, PHP, Python and Ruby, or in scripting environments such as Bash, csh, batch and PowerShell. This allows integration of services into existing and/or new applications that require access to fast sequence database searching or multiple sequence alignment methods. To facilitate this type of usage, the services provide extensive meta-information describing the available parameters, including their possible values and descriptions of their purpose. Typical applications of the JDispatcher framework services include: providing an alternative interface for specialist usage targeted at a specific community; integrating a service into an existing data portal to provide analysis services; and enhancing analysis results by directly connecting the result with the data. These are of importance to service providers and users of pipelines who may not have the resources to run and maintain the infrastructure required to support equivalent functionality.
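Programmatic use of a job-based REST service of this kind typically follows a submit, poll, fetch pattern, sketched below using only the Python standard library. The base URL, the `run`/`status`/`result` endpoint layout, the status strings, the `out` result type and the parameter names in the usage note are assumptions made for illustration and are not taken from the text; the service documentation at http://www.ebi.ac.uk/Tools/webservices/ should be consulted for the actual interface. The HTTP layer is passed in as a function so that the polling logic can be exercised without network access.

```python
import time
from typing import Callable, Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# ASSUMPTION: base URL and endpoint layout are illustrative only.
BASE_URL = "http://www.ebi.ac.uk/Tools/services/rest"

Transport = Callable[[str, Optional[bytes]], str]


def http_transport(url: str, data: Optional[bytes] = None) -> str:
    """Default transport: GET when data is None, POST otherwise."""
    with urlopen(Request(url, data=data), timeout=60) as response:
        return response.read().decode("utf-8")


def run_job(tool: str, params: Dict[str, str],
            transport: Transport = http_transport,
            poll_seconds: float = 5.0, max_polls: int = 120,
            result_type: str = "out") -> str:
    """Submit a job, poll until it finishes, then return one result."""
    # 1. Submit: the service is assumed to reply with a job identifier.
    body = urlencode(params).encode("utf-8")
    job_id = transport(f"{BASE_URL}/{tool}/run", body).strip()

    # 2. Poll: status strings below are assumed, not quoted from the text.
    for _ in range(max_polls):
        status = transport(f"{BASE_URL}/{tool}/status/{job_id}", None).strip()
        if status == "FINISHED":
            # 3. Fetch the requested representation of the result.
            return transport(f"{BASE_URL}/{tool}/result/{job_id}/{result_type}", None)
        if status in {"ERROR", "FAILURE", "NOT_FOUND"}:
            raise RuntimeError(f"job {job_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

A call might look like `run_job("ncbiblast", {"email": "user@example.org", "sequence": "MKV"})`, where the tool name comes from Table 1 and the parameter names are hypothetical. In practice the accepted parameters would be discovered from the meta-information the services expose.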

CONCLUSIONS

The modularity of this new framework reduces maintenance overheads and simplifies the addition of tools and features. Keeping the result data model and the renderers separate provides the flexibility to add additional representations to all functionally related tools. This improves the level of usability for both novice and expert users. The presented visualization examples highlight important insights in the understanding of existing and new nucleotide and protein sequences from both genomes and metagenome experiments and suggest novel ways in which these data can be interpreted. Academic and commercial laboratories can integrate the JDispatcher framework services with their local analytical pipelines or workflows. These represent an important contribution to the growing number of available services in bioinformatics and have been submitted to the BioCatalogue (17) (www.biocatalogue.org), a registry of freely available web services in the life sciences.

FUNDING

The European Commission under FELICS [contract number 021902 (RII3), within the Research Infrastructure Action of the FP6 ‘Structuring the European Research Area’ Programme]; core funding from the European Molecular Biology Laboratory; European Patent Office. Funding for open access charge: EMBL. Conflict of interest statement. None declared.

15 in total

1. T-Coffee: A novel method for fast and accurate multiple sequence alignment.

Authors: C Notredame; D G Higgins; J Heringa
Journal: J Mol Biol Date: 2000-09-08 Impact factor: 5.469

2. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

3. Clustal W and Clustal X version 2.0.

Authors: M A Larkin; G Blackshields; N P Brown; R Chenna; P A McGettigan; H McWilliam; F Valentin; I M Wallace; A Wilm; R Lopez; J D Thompson; T J Gibson; D G Higgins
Journal: Bioinformatics Date: 2007-09-10 Impact factor: 6.937

4. Fast and efficient searching of biological data resources--using EB-eye.

Authors: Franck Valentin; Silvano Squizzato; Mickael Goujon; Hamish McWilliam; Juri Paern; Rodrigo Lopez
Journal: Brief Bioinform Date: 2010-02-11 Impact factor: 11.622

5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6. Improved tools for biological sequence comparison.

Authors: W R Pearson; D J Lipman
Journal: Proc Natl Acad Sci U S A Date: 1988-04 Impact factor: 11.205

7. Kalign--an accurate and fast multiple sequence alignment algorithm.

Authors: Timo Lassmann; Erik L L Sonnhammer
Journal: BMC Bioinformatics Date: 2005-12-12 Impact factor: 3.169

8. Taverna: a tool for building and running workflows of services.

Authors: Duncan Hull; Katy Wolstencroft; Robert Stevens; Carole Goble; Mathew R Pocock; Peter Li; Tom Oinn
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

9. InterProScan: protein domains identifier.

Authors: E Quevillon; V Silventoinen; S Pillai; N Harte; N Mulder; R Apweiler; R Lopez
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

10. InterPro: the integrative protein signature database.

Authors: Sarah Hunter; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Peer Bork; Ujjwal Das; Louise Daugherty; Lauranne Duquenne; Robert D Finn; Julian Gough; Daniel Haft; Nicolas Hulo; Daniel Kahn; Elizabeth Kelly; Aurélie Laugraud; Ivica Letunic; David Lonsdale; Rodrigo Lopez; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Jaina Mistry; Alex Mitchell; Nicola Mulder; Darren Natale; Christine Orengo; Antony F Quinn; Jeremy D Selengut; Christian J A Sigrist; Manjula Thimma; Paul D Thomas; Franck Valentin; Derek Wilson; Cathy H Wu; Corin Yeats
Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971
