
Integrative genomic analysis by interoperation of bioinformatics tools in GenomeSpace.

Kun Qu¹, Sara Garamszegi², Felix Wu², Helga Thorvaldsdottir², Ted Liefeld²,³, Marco Ocana²,³, Diego Borges-Rivera⁴, Nathalie Pochet²,⁵, James T Robinson²,³, Barry Demchak³, Tim Hull³, Gil Ben-Artzi⁶,⁷, Daniel Blankenberg⁸, Galt P Barber⁹, Brian T Lee⁹, Robert M Kuhn⁹, Anton Nekrutenko⁸, Eran Segal⁶, Trey Ideker³, Michael Reich²,³, Aviv Regev²,⁴,¹⁰, Howard Y Chang¹,¹¹, Jill P Mesirov²,³.

Abstract

Complex biomedical analyses require the use of multiple software tools in concert and remain challenging for much of the biomedical research community. We introduce GenomeSpace (http://www.genomespace.org), a cloud-based, cooperative community resource that currently supports the streamlined interaction of 20 bioinformatics tools and data resources. To facilitate integrative analysis by non-programmers, it offers a growing set of 'recipes', short workflows to guide investigators through high-utility analysis tasks.

Year: 2016 PMID: 26780094 PMCID: PMC4767623 DOI: 10.1038/nmeth.3732

Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547

Building on the Human Genome Project[1] and the advent of high-throughput genomic technologies, the past two decades of biomedical research have yielded a flood of massive and varied biological datasets. As a result, numerous databases and analysis software tools have been developed for researchers to access, visualize, and analyze different data types. However, integrative analysis of diverse data types through multiple analysis tools remains an enormous challenge for many biologists. There is an ever-growing gap between the need to use various analysis and visualization software tools and the difficulty of getting tools from different sources to work together. Moreover, the wealth of existing and emerging analytical methods makes it difficult, even for experts but especially for less computationally oriented biologists, to keep up with all of the available tools and to identify the right recipe to follow, particularly in the absence of an accepted “laboratory manual” for analytic protocols. These difficulties curtail the agility and creativity of researchers and may prevent them from adopting alternative or new methods.

Here, we present GenomeSpace, a cooperative community resource that provides an open-source interoperability platform to enable non-programming scientists to work easily across data types, tools, and analysis methods. GenomeSpace provides a “tool launch pad” into which tools can be seamlessly added, and a “data highway” that handles transfers between tools through format converters, relieving scientists of the burden of identifying and scripting the conversions. The GenomeSpace Recipe Resource is a growing set of high-utility use cases that demonstrate how to leverage multiple tools and also serve as quick guides to analysis tasks using GenomeSpace and its tools.

The GenomeSpace website, http://www.genomespace.org, serves as a knowledge base, newsstand, and point of online contact and help for the GenomeSpace community of users and tool developers. Initially seeded by a consortium of biology research labs and the development teams of six popular bioinformatics tools (Cytoscape[2,3], Galaxy[4], GenePattern[5], Genomica[6], the Integrative Genomics Viewer (IGV)[7], and the UCSC Table Browser[8]), GenomeSpace now connects 20 tools and data resources. Our consortium labs provided real driving biological projects and analytical needs to shape the design and development of the GenomeSpace architecture and software. For example, we recapitulated the steps and results of published analyses[9,10] within GenomeSpace (Supplementary Figs. 1–2), dissecting and visualizing the gene regulatory networks in human cancer stem cells (Supplementary Note 1, Supplementary Figs. 2–5). This illustrates how GenomeSpace enables a non-programming biologist to conduct a rich and involved integrative analysis of a kind that previously led to a novel result. The study required diverse data types, analytical steps, methods, and tools, as well as multiple data transfers between the tools. Although it originally required substantial scripting, this work can now be performed using only the tools and capabilities of GenomeSpace.

From a user’s perspective (Fig. 1, Supplementary Fig. 6), GenomeSpace has several key features that together facilitate integrative analysis with a low barrier to entry: (1) a collection of resident tools spanning a broad range of applications (Table 1); (2) easy dataset management in a variety of cloud storage types, alongside data-sharing capabilities (all GenomeSpace account holders receive an allocation of cloud storage, and GenomeSpace also supports connections to other cloud accounts: Dropbox, Google Drive, and Amazon S3); (3) the ability to launch tools and to move data and analyses between tools, all facilitated by “behind-the-scenes” file format converters; and (4) a lightweight, simple, unifying web interface. In summary, from the web interface a researcher can launch a desired tool and simultaneously feed it input data files, move analysis results into other tools as needed through simple launching operations, and collect additional processed data within their GenomeSpace cloud account, other cloud accounts, or local storage.

Figure 1

The GenomeSpace environment for interoperation of bioinformatics tools

Table 1

GenomeSpace provides access to a diverse set of bioinformatics tools and resources

Tool Name	Organization	Project Website
Analysis and Visualization Tools

Cistrome	Dana-Farber Cancer Institute	http://www.cistrome.org
Cytoscape 3 *	Cytoscape Consortium	http://www.cytoscape.org
Cytoscape 2 *	Cytoscape Consortium	http://www.cytoscape.org
Galaxy	Pennsylvania State University; and Johns Hopkins University	http://www.galaxyproject.org
GenePattern	Broad Institute; and UC San Diego	http://www.genepattern.org
Genomica	Weizmann Institute of Science	http://genomica.weizmann.ac.il
geWorkbench	Columbia University	http://www.geworkbench.org
Gitools	University Pompeu Fabra, Barcelona	http://www.gitools.org
Integrative Genomics Viewer (IGV)	Broad Institute; and UC San Diego	http://www.igv.org
ISAcreator	University of Oxford	http://www.isa-tools.org
Molecular Signatures Database (MSigDB) Online Tools	Broad Institute; and UC San Diego	http://www.msigdb.org

Data Resources

ArrayExpress	European Bioinformatics Institute	http://www.ebi.ac.uk/arrayexpress
InSilicoDB	InSilico Genomics	http://insilicodb.com
Synapse	Sage Bionetworks	http://synapse.org
UCSC Table Browser	University of California Santa Cruz	http://genome.ucsc.edu

Integrated Portals (Data and Analysis)

Achilles Project	Dana-Farber Cancer Institute; and Broad Institute	http://broadinstitute.org/achilles
Cancer Cell Line Encyclopedia (CCLE)	Novartis Institutes for BioMedical Research; and Broad Institute	http://broadinstitute.org/ccle
cBioPortal for Cancer Genomics	Memorial Sloan Kettering Cancer Center	http://www.cbioportal.org
Multiple Myeloma Genomics Portal (MMGP)	Multiple Myeloma Research Consortium; Broad Institute; and Translational Genomics Research Institute (TGen)	http://broadinstitute.org/mmgp
Reactome	Ontario Institute for Cancer Research; European Bioinformatics Institute; and New York University Medical Center	http://www.reactome.org

* Cytoscape 3 and Cytoscape 2 have different underlying architectures and different user interfaces. Both versions are made available through GenomeSpace to accommodate users who may prefer one to the other.

We developed the GenomeSpace Recipe Resource to aid biomedical researchers in identifying the steps required to perform a genomic analysis, a challenging task even for short analyses. Although pre-constructed pipelines can embody the entire workflow of a study, they may be insufficiently open-ended or flexible for exploratory research. We took an alternative approach by providing a collection, or “cookbook”, of recipes, i.e., comprehensive descriptions of cross-tool analysis workflows. Recipes are generally short, involving two or three tools, but commoditize important research tasks that investigators can employ in many ways as part of more complex analyses. The notion of our Recipe Resource is modeled after the classical lab guide “Molecular cloning: A laboratory manual”[11], which used a similar approach to democratize molecular biology three decades ago. Each GenomeSpace recipe contains a motivating biological problem, a relevant example dataset, detailed recipe steps, and one possible interpretation of the results, illustrated on the example data. A variety of media accompany the recipe steps, including screenshot guides and videos that together walk users through the workflow. The recipes are served on our Recipe web page (http://www.genomespace.org/recipes), the most frequently visited section of the GenomeSpace website after the home page. The current recipes cover diverse genomic analyses as well as basic utilities for using GenomeSpace itself (Supplementary Table 1). Since no single lab can supply the expertise or effort required to create a comprehensive recipe collection, we are adding social media vehicles to make recipe collection a crowd-sourced, collaborative effort through community contributions. We encourage suggestions for new and useful multi-tool recipes and ideas to improve existing recipes.

An illustrative example from the GenomeSpace Recipe Resource is “Find subnetworks of differentially expressed genes and identify associated biological functions”. Briefly, given a gene expression dataset, this recipe identifies network interactions between differentially expressed genes and annotates the biological functions within subnetworks via the Gene Ontology (GO) (Supplementary Fig. 7). The example dataset provided with this recipe is gene expression data from a study in which granulocyte-macrophage progenitor cells were transformed into leukemia stem cells by introduction of an oncogene, MLL-AF9[13]. Applying the recipe identifies processes that are correlated with transformation from a normal to a leukemic phenotype (Supplementary Fig. 8), such as SMAD1-dependent signaling, a process associated with the regulation of hematopoietic differentiation by TGF-β and BMP[14]. A second recipe example, “Identify biological functions for genes in copy number variation (CNV) regions”, is described in Supplementary Note 2 (Supplementary Figs. 9–10).

An important GenomeSpace design goal was to facilitate rapid addition of diverse tools contributed by the developer community. This mutually benefits GenomeSpace and independent tool developers by extending the capabilities of GenomeSpace while also giving developers’ tools access to all GenomeSpace-connected tools and data sources, circumventing the need to connect to each one individually. Recent cross-tool interoperability efforts have used one of several approaches: aggregators host a large number of command line tools (Galaxy, GenePattern); plug-in architectures provide a way to extend the functionality of a basic package (Cytoscape, geWorkbench[15], MeV[16]); and messaging systems send data and instructions between tools (MeDICi[17], Gaggle[18]). Our open-source, lightweight, hybrid approach combines aspects of both messaging and aggregating systems. The resulting platform (Supplementary Fig. 11 and Online Methods) provides single sign-on for GenomeSpace tools and data resources; security mechanisms and user-controlled levels of sharing; and a common interface to multiple cloud storage providers. Moreover, this approach supports interoperation among diverse desktop and web-based tools while minimizing the effort required to connect to the platform (Online Methods).

To further facilitate cross-tool interoperability, GenomeSpace offers a range of file converters for directly converting between pairs of file formats (Supplementary Note 3). Our direct conversion approach confers a number of benefits. Notably, it obviates the development burden of defining and supporting central data models for tools connecting to GenomeSpace, especially legacy ones. Moreover, since converters are independent and do not rely on a GenomeSpace-specific data model, we can expand the set of supported formats by leveraging converters that are developed outside of GenomeSpace.

In conclusion, GenomeSpace has several key benefits. First, it allows seamless transition between tools. Automatic inter-tool file format converters speed tasks like launching tools and moving data between them, and obviate the need for custom conversion scripts, an insurmountable barrier for many biologists. Second, the large set of connected tools enriches the interpretation of integrative analyses. Exploring the same data in multiple tools, each designed to highlight distinct features of the data, allows the analysis to be examined in greater depth and diversity than with any single tool. Third, we encourage the inclusion of multiple tools with similar capabilities. Therefore, many analysis steps can be performed with alternative tools from the GenomeSpace suite, allowing investigators to test their findings for robustness and reproducibility. It also permits them to use the tool with which they are most familiar. Fourth, recipes play an important role in making integrative analysis accessible. Conceived of as small analysis components, recipes describe short workflows that guide users to perform analysis tasks. Recipes can be assembled into more complex analysis scenarios and can also introduce investigators to new analysis methods and tools. In this way, GenomeSpace and the Recipe Resource can greatly expand the analytic universe accessible to investigators and help to move their research agenda forward.
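
As a rough illustration of the logic behind the example recipe discussed earlier (“Find subnetworks of differentially expressed genes and identify associated biological functions”), the following Python sketch selects differentially expressed genes and groups them into connected subnetworks. It is not the recipe itself, which is carried out interactively in the GenomeSpace tools; the function name `de_subnetworks`, the log2 fold-change input, the threshold of 1.0, and the placeholder gene identifiers are all assumptions made for illustration, and the GO annotation step is omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def de_subnetworks(
    log2_fold_change: Dict[str, float],
    interactions: Iterable[Tuple[str, str]],
    threshold: float = 1.0,  # assumed cutoff, not taken from the recipe
) -> List[Set[str]]:
    """Group differentially expressed (DE) genes into connected subnetworks.

    A gene counts as DE here if |log2 fold change| >= threshold. Only
    interactions whose two endpoints are both DE are kept.
    """
    de_genes = {g for g, lfc in log2_fold_change.items() if abs(lfc) >= threshold}

    # Build an undirected adjacency map restricted to DE genes.
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for a, b in interactions:
        if a in de_genes and b in de_genes and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Depth-first search to collect connected components.
    seen: Set[str] = set()
    components: List[Set[str]] = []
    for gene in sorted(neighbors):
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        components.append(component)
    return components


if __name__ == "__main__":
    # Placeholder identifiers, not real genes or measured values.
    lfc = {"G1": 2.0, "G2": -1.5, "G3": 0.2, "G4": 3.0, "G5": 1.1}
    edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G4", "G5")]
    print(de_subnetworks(lfc, edges))
```

In this toy input, G3 falls below the threshold, so the chain splits into two subnetworks; each resulting gene set is what one would then pass to a functional annotation step.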

ONLINE METHODS

GenomeSpace Architecture

From a tool developer’s perspective, GenomeSpace presents a “connection layer” that includes a collection of web services with well-defined entry points to the GenomeSpace server, which provides the core system functionality (Supplementary Fig. 11). The GenomeSpace web user interface also interacts with the server through these entry points. The GenomeSpace server currently runs from an Amazon Machine Image (AMI) in the Amazon Elastic Compute Cloud (EC2). It consists of three components: (1) an Identity Service, which manages sign-on credentials, including single sign-on to the GenomeSpace tools, and data access (GenomeSpace leverages the Amazon AWS security mechanisms, which are compliant with the requirements of many standards organizations and government agencies; all data is private by default, but users may share directories or files with other users or groups of users); (2) an Analysis Tools Manager, which maintains information about tool capabilities and dependencies and coordinates tool launches, including the ability to launch other GenomeSpace tools from within a tool; and (3) a Data Manager, which handles data storage, transfer to and from the cloud (including Amazon S3, Google Drive, and Dropbox), data sharing, and the file format conversions that provide a smooth, script-free connection between tools.
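
The direct, pairwise conversion approach handled by the Data Manager can be pictured with a small Python sketch. This is a minimal illustration under stated assumptions, not GenomeSpace code: the registry `CONVERTERS`, the `register` decorator, the `convert` function, and the toy TSV/CSV formats are all hypothetical. The point it shows is that each converter is registered under a (source, target) pair and stands alone, so no shared intermediate data model is needed.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry keyed by (source_format, target_format).
# Each converter is independent; there is no central data model.
Converter = Callable[[str], str]
CONVERTERS: Dict[Tuple[str, str], Converter] = {}


def register(src: str, dst: str) -> Callable[[Converter], Converter]:
    """Decorator that adds a direct src -> dst converter to the registry."""
    def wrap(fn: Converter) -> Converter:
        CONVERTERS[(src, dst)] = fn
        return fn
    return wrap


@register("tsv", "csv")
def tsv_to_csv(text: str) -> str:
    # Toy conversion: assumes no field contains a comma, quote, or tab.
    return text.replace("\t", ",")


@register("csv", "tsv")
def csv_to_tsv(text: str) -> str:
    # Same simplifying assumption as above.
    return text.replace(",", "\t")


def convert(text: str, src: str, dst: str) -> str:
    """Convert text using a single direct converter, if one is registered."""
    if src == dst:
        return text
    try:
        return CONVERTERS[(src, dst)](text)
    except KeyError:
        raise ValueError(f"no direct converter from {src!r} to {dst!r}") from None


if __name__ == "__main__":
    print(convert("gene\tvalue\nG1\t2.5\n", "tsv", "csv"))
```

Because converters are keyed only by format pairs, a converter written elsewhere could be added with one `register` call, which mirrors the benefit the text attributes to independent converters.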

Connecting Tools to GenomeSpace

The GenomeSpace connection layer includes a collection of web services with well-defined entry points to the GenomeSpace server, which provides the core system functionality. It is available as Java and JavaScript client development kits for tools developed in those languages, or as web services with a RESTful application programming interface (API) for any language. Developers can also take advantage of a number of user interface widgets that are available for common user tasks, including file chooser dialogs and authentication panels. Adding a tool to GenomeSpace using these resources typically takes on the order of two programmer-days or less, depending on the type of tool. The most recent tool to join the community was cBioPortal (http://www.cbioportal.org) from Memorial Sloan Kettering Cancer Center, and its development team reported that it took an hour to connect this web-based portal to GenomeSpace as a data source. We note that command line tools that do not have their own user interface can join the GenomeSpace community via either of its current aggregator members, GenePattern and Galaxy.
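
To make the idea of a connected tool more concrete, the sketch below models, in Python, how a launch might be planned once a tool has declared which file formats it accepts. It does not use or reproduce the actual GenomeSpace API or client kits; `ToolDescriptor`, `plan_launch`, and the format names are hypothetical, and only single-step (direct) conversions are considered, in keeping with the pairwise converters described in this paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set, Tuple


@dataclass(frozen=True)
class ToolDescriptor:
    """Hypothetical record of what a connected tool declares about itself."""
    name: str
    accepted_formats: FrozenSet[str]


def plan_launch(
    tool: ToolDescriptor,
    file_format: str,
    converter_pairs: Set[Tuple[str, str]],
) -> Optional[Tuple[str, ...]]:
    """Return the formats a file passes through before the tool receives it.

    (fmt,)          -> the tool accepts the file as-is
    (fmt, target)   -> one direct conversion is needed
    None            -> no direct conversion makes the file acceptable
    """
    if file_format in tool.accepted_formats:
        return (file_format,)
    # Sorted for a deterministic choice when several targets are possible.
    for target in sorted(tool.accepted_formats):
        if (file_format, target) in converter_pairs:
            return (file_format, target)
    return None


if __name__ == "__main__":
    viewer = ToolDescriptor("example-viewer", frozenset({"csv"}))
    available = {("tsv", "csv")}
    print(plan_launch(viewer, "tsv", available))  # one conversion step
    print(plan_launch(viewer, "bed", available))  # not launchable directly
```

Keeping the tool's declaration separate from the converter set means a newly connected tool only has to state its accepted formats to benefit from converters that already exist.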

16 in total

1. TM4: a free, open-source system for microarray data management and analysis.

Authors: A I Saeed; V Sharov; J White; J Li; W Liang; N Bhagabati; J Braisted; M Klapa; T Currier; M Thiagarajan; A Sturn; M Snuffin; A Rezantsev; D Popov; A Ryltsov; E Kostukovich; I Borisovsky; Z Liu; A Vinsavich; V Trush; J Quackenbush
Journal: Biotechniques Date: 2003-02 Impact factor: 1.993

2. The UCSC Table Browser data retrieval tool.

Authors: Donna Karolchik; Angela S Hinrichs; Terrence S Furey; Krishna M Roskin; Charles W Sugnet; David Haussler; W James Kent
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

4. geWorkbench: an open source platform for integrative genomics.

Authors: Aris Floratos; Kenneth Smith; Zhou Ji; John Watkinson; Andrea Califano
Journal: Bioinformatics Date: 2010-05-28 Impact factor: 6.937

5. Galaxy: a platform for interactive large-scale genome analysis.

Authors: Belinda Giardine; Cathy Riemer; Ross C Hardison; Richard Burhans; Laura Elnitski; Prachi Shah; Yi Zhang; Daniel Blankenberg; Istvan Albert; James Taylor; Webb Miller; W James Kent; Anton Nekrutenko
Journal: Genome Res Date: 2005-09-16 Impact factor: 9.043

6. GenePattern 2.0.

Authors: Michael Reich; Ted Liefeld; Joshua Gould; Jim Lerner; Pablo Tamayo; Jill P Mesirov
Journal: Nat Genet Date: 2006-05 Impact factor: 38.330

Review 7. The role of Smad signaling in hematopoiesis.

Authors: Jonas Larsson; Stefan Karlsson
Journal: Oncogene Date: 2005-08-29 Impact factor: 9.867

8. A module map showing conditional activity of expression modules in cancer.

Authors: Eran Segal; Nir Friedman; Daphne Koller; Aviv Regev
Journal: Nat Genet Date: 2004-09-26 Impact factor: 38.330

9. Module map of stem cell genes guides creation of epithelial cancer stem cells.

Authors: David J Wong; Helen Liu; Todd W Ridky; David Cassarino; Eran Segal; Howard Y Chang
Journal: Cell Stem Cell Date: 2008-04-10 Impact factor: 24.633

10. Cytoscape: the network visualization tool for GenomeSpace workflows.

Authors: Barry Demchak; Tim Hull; Michael Reich; Ted Liefeld; Michael Smoot; Trey Ideker; Jill P Mesirov
Journal: F1000Res Date: 2014-07-01
