Harnessing cloud computing with Galaxy Cloud.

Enis Afgan, Dannon Baker, Nate Coraor, Hiroki Goto, Ian M Paul, Kateryna D Makova, Anton Nekrutenko, James Taylor.

Substances: DNA, Mitochondrial

Year: 2011 PMID: 22068528 PMCID: PMC3868438 DOI: 10.1038/nbt.2028

Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908

To the Editor:

The continuing evolution of DNA sequencing has transformed modern biology. Reduced sequencing costs, coupled with novel sequencing-based assays, have led to rapid adoption of next-generation sequencing (NGS) across diverse areas of life science research[1-4]. Sequencing has moved out of the genome centers into core facilities and individual labs, where any investigator can access it at a modest and progressively declining cost. Although easy to generate in tremendous quantities, sequence data is still difficult to manage and analyze. Sophisticated informatics techniques and supporting infrastructure are needed to make sense of even conceptually simple sequencing experiments, let alone the more complex analysis techniques being developed. The most pressing challenge facing the sequencing community today is providing the informatics infrastructure and accessible analysis methods that all investigators need to realize the power of high-throughput sequencing in their research.
A possible solution to this infrastructure challenge comes in the form of “cloud computing”, a model in which computation and storage exist as virtual resources, accessed via the Internet, that can be dynamically allocated and released as needed[5]. Where acquiring large amounts of computing power previously required significant initial and ongoing costs, the cloud model radically alters this by allowing computing resources and services to be acquired and paid for on demand. Importantly, for certain use cases cloud resources can provide storage and computation at far less cost than dedicated resources. For several specific applications, effective use of cloud resources has already been demonstrated[6-8]. In general, however, cloud resources are not provided in a form that a researcher without informatics expertise can use immediately.
Several commercial vendors provide cloud-based sequence analysis services through the web that hide all complexity of the underlying infrastructure. Yet these offer limited sets of analysis tools, and because they are proprietary solutions, users must give up some control over their own data and risk vendor lock-in. All “battle-tested” NGS analysis practices (such as the analysis of human variation exemplified by the 1000 Genomes Project Consortium publication) are open source. One popular open-source platform that has made substantial progress toward making complex analysis available to researchers is Galaxy[9, 10]. Galaxy enables users to perform analysis using nothing more than a web browser. The environment automatically and transparently tracks every detail of the analysis, allows the construction of complex workflows, and allows the results to be documented, shared, and published with complete provenance, guaranteeing transparency and reproducibility. Importantly, Galaxy is an extensible platform; nearly any software tool can easily be integrated into Galaxy, and an active community of developers ensures that the latest tools are wrapped and made available through the Galaxy Tool Shed (http://usegalaxy.org/community).
Galaxy is provided as a free public service (http://usegalaxy.org) on which thousands of users perform hundreds of thousands of analyses each month. However, this free public resource cannot meet increasing demand without imposing limits on data transfer and compute usage, resulting in delays that users may find unacceptable. Fortunately, the Galaxy platform is easily deployed on local resources, and many groups working with large-scale sequence data now run their own Galaxy instances. However, this still requires local compute resources and informatics knowledge. To bring the virtually unlimited resources of cloud computing into the hands of biomedical researchers, we have developed Galaxy Cloud.
Galaxy Cloud allows anyone to run a private Galaxy installation on the cloud that exactly replicates the functionality of the main site (http://usegalaxy.org) but without the need to share compute resources with other users (Fig. 1). Unlike software service solutions, Galaxy Cloud lets users customize their deployment and retain complete control over their instance and associated data; the analysis can also be moved to other cloud providers or local resources, avoiding concerns about vendor lock-in.

Figure 1

Overview of Galaxy instances running on cloud resources. (A) Users in different labs access a dedicated Galaxy instance over the Internet with nothing more than a web browser. (B) These Galaxy instances appear to the users to be dedicated infrastructure with apparently infinite compute and storage resources, but are in fact (C) virtual resources that Galaxy's autoscaling acquires and releases on demand in response to changing workloads.

Currently we provide a public Galaxy Cloud deployment on the popular Amazon Web Services (AWS) cloud; however, it is also compatible with Eucalyptus and other clouds. When starting an instance for the first time, the user configures it (e.g., by specifying the amount of initial storage to allocate; step-by-step instructions are provided at http://usegalaxy.org/cloud). Once the instance is configured, users can access their Galaxy, which functions exactly like the public Galaxy site. Every analysis tool that is available through the public Galaxy instance is installed and available for immediate use, along with all the necessary supporting data (e.g., genome sequences, alignments, indexes). In addition, a number of tools that are too computationally intensive to provide on the public Galaxy are included. This ready-to-use environment is combined with the ability to allocate practically unlimited computing power on demand through cloud computing. When the user has finished the analysis and the instance is no longer needed, all compute resources can be released, while the user's data and instance state are preserved for later use.
Galaxy Cloud's deployment is achieved by coupling the Galaxy framework to CloudMan[11], which automates management of the underlying cloud infrastructure resources (see Supplemental Notes, Supplemental Figure 1). CloudMan handles all aspects of infrastructure management necessary to support the Galaxy application, including resource acquisition, configuration, and data persistence. In the above scenario, CloudMan has allocated dedicated storage for the user's own data, initialized the Galaxy database, and composed additional data volumes containing the required tools and secondary data. As with any instance of Galaxy, additional tools and data can easily be added by the owner of the instance and shared with others.
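The on-demand acquisition and release of compute resources described above can be illustrated with a toy scaling rule. The sketch below is only an illustration of the general idea, not CloudMan's actual policy, which is not specified here; the function name `target_nodes`, the `jobs_per_node` capacity, and the node bounds are all hypothetical.

```python
import math


def target_nodes(queued_jobs, running_jobs, jobs_per_node=4,
                 min_nodes=1, max_nodes=16):
    """Return the number of worker nodes a cluster should hold for the load.

    Hypothetical rule: provision enough nodes to hold every queued or
    running job, then clamp the result to [min_nodes, max_nodes].
    """
    demand = queued_jobs + running_jobs
    needed = math.ceil(demand / jobs_per_node)
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # An idle cluster, a moderate burst, and a load that exceeds the cap.
    for queued, running in [(0, 0), (10, 4), (200, 64)]:
        print(queued, running, target_nodes(queued, running))
```

A rule of this kind releases nodes as soon as the queue drains, which is what keeps an idle instance from accumulating cost.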
As a case study in the use of Galaxy Cloud, we consider the problem of identifying heteroplasmic sites, that is, variation among the multiple copies of the mitochondrial genome (mtDNA) within a cell or individual. Mutations in mtDNA have been implicated in hundreds of diseases, and in many cases the disease-causing variants can be heteroplasmic, with manifestation dependent on the relative proportion of variants[12, 13]. Further, this task highlights many of the key motivations for Galaxy Cloud. 1) It involves clinical samples, which often carry strict privacy requirements and should not be analyzed on a public site, but can be analyzed on a secure public or private cloud resource. 2) It is both a data-intensive problem and one with compute needs that vary over the course of the analysis. 3) It requires different methodology than the related problem of SNP calling in diploid genomes, showing the power of Galaxy's workflow system to construct solutions to non-standard tasks. 4) There is currently no commonly accepted approach, which has led to questions about the validity of published heteroplasmic sites and underscores the need for a system that makes analysis completely transparent and reproducible.
Using mtDNA sequence data from nine individuals across three families[14], we developed Galaxy workflows to identify heteroplasmic sites. These workflows map the sequencing reads, separate them by strand, transform datasets from read-centric to genome-centric form, and perform a number of filtering and thresholding steps before merging the branches and generating a list of sites that contain allelic variants above a certain frequency supported by high-quality reads on both strands. Running the workflows identified four heteroplasmic sites in two of the three examined families.
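The final step of the workflow described above, reporting sites where a variant allele exceeds a frequency threshold on both strands, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published Galaxy workflow: the per-strand count layout, the function names, and the `min_freq` and `min_depth` values are hypothetical, and mapping and read-quality filtering are assumed to have been done already.

```python
def minor_allele(counts):
    """Return (allele, frequency) of the second most common allele."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if total == 0 or len(ranked) < 2:
        return None, 0.0
    allele, n = ranked[1]
    return allele, n / total


def heteroplasmic_sites(pileup, min_freq=0.02, min_depth=100):
    """List (position, allele, frequency) for sites supported on both strands.

    pileup maps position -> {"+": {allele: count}, "-": {allele: count}}.
    min_freq and min_depth are hypothetical thresholds for illustration.
    """
    sites = []
    for pos in sorted(pileup):
        fwd, rev = pileup[pos]["+"], pileup[pos]["-"]
        # Require adequate coverage on each strand separately.
        if sum(fwd.values()) < min_depth or sum(rev.values()) < min_depth:
            continue
        allele_f, freq_f = minor_allele(fwd)
        allele_r, freq_r = minor_allele(rev)
        # The same minor allele must pass the threshold on both strands.
        if (allele_f is not None and allele_f == allele_r
                and freq_f >= min_freq and freq_r >= min_freq):
            sites.append((pos, allele_f, round(min(freq_f, freq_r), 4)))
    return sites


if __name__ == "__main__":
    example = {
        100: {"+": {"A": 900, "G": 100}, "-": {"A": 880, "G": 120}},
        200: {"+": {"C": 950, "T": 50}, "-": {"C": 1000, "T": 0}},
        300: {"+": {"T": 40, "C": 10}, "-": {"T": 45, "C": 5}},
    }
    print(heteroplasmic_sites(example))
```

In the made-up example, only position 100 is reported: position 200 shows the variant on one strand only, and position 300 falls below the depth cutoff. Requiring agreement between strands is what guards against strand-specific sequencing artifacts.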
The heteroplasmy analysis was computed entirely using Galaxy Cloud on AWS and can be replicated exactly by importing the datasets and workflow available at http://s3.amazonaws.com/heteroplasmy/index.html. For a complete description and explanation of the acquired data, as well as how to use, import, and modify the workflows used in the heteroplasmy study, see the Galaxy Page[9] at http://usegalaxy.org/heteroplasmy/. To perform the analysis, we uploaded 45 GB of sequence datasets to S3, which took 9 hours at an average transfer rate of 1.5 MB/s and cost $5. During execution of the analysis workflow, the cluster size was managed by CloudMan's auto-scaling feature and varied between 1 and 16 nodes. The workflow took approximately 6 hours and cost $20 to complete. With auto-scaling disabled, fixed cluster sizes of 5 and 20 nodes gave runtimes of 9 hours at a cost of $20 and 6 hours at a cost of $50, respectively. By adapting the compute resources as the workflow's demands change, auto-scaling provides both the shortest runtime and the lowest cost, matching the runtime of the 20-node cluster at the cost of the 5-node cluster. Once the workflow has executed, the results can be further analyzed directly on the cloud, downloaded, or left on the cloud for future reference. Overall, a complete analysis using a compute cluster and a variety of open-source NGS tools was performed within 15 hours for a cost of $25 using nothing but a web browser.
Cloud computing resources may not be as cost-effective for all usage scenarios. Here the workflow had already been developed, which made it straightforward to execute in its entirety. The interactive analysis and trial and error involved in building and refining workflows is less cost-effective, though auto-scaling helps avoid excessive waste. This particular workflow also has steps that can be executed in parallel, which allowed it to take advantage of cloud elasticity. Cloud instances of Galaxy will be limited by the resources available from a given cloud provider.
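The transfer and cost figures reported for the case study can be cross-checked with simple arithmetic. The sketch below uses only the numbers given in the text; the helper names and the decimal unit convention (1 GB = 1000 MB) are assumptions made for the illustration.

```python
def transfer_hours(size_gb, rate_mb_per_s, mb_per_gb=1000):
    """Hours needed to move size_gb at a sustained rate (decimal units assumed)."""
    return size_gb * mb_per_gb / rate_mb_per_s / 3600


def no_worse(a, b):
    """True if run a is neither slower nor costlier than run b."""
    return a[0] <= b[0] and a[1] <= b[1]


# (runtime in hours, cost in USD), as reported in the text.
runs = {
    "auto-scaling, 1-16 nodes": (6, 20),
    "fixed, 5 nodes": (9, 20),
    "fixed, 20 nodes": (6, 50),
}
upload = (9, 5)  # reported upload time and cost
auto = runs["auto-scaling, 1-16 nodes"]

total_hours = upload[0] + auto[0]
total_cost = upload[1] + auto[1]
auto_is_best = all(no_worse(auto, other) for other in runs.values())

if __name__ == "__main__":
    print(f"ideal upload time: {transfer_hours(45, 1.5):.1f} h")
    print(f"end to end: {total_hours} h, ${total_cost}")
    print(f"auto-scaling no worse than any fixed size: {auto_is_best}")
```

At a sustained 1.5 MB/s, 45 GB needs about 8.3 hours, roughly consistent with the reported 9 hours. The upload and auto-scaled workflow together give the stated totals of 15 hours and $25.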
As one example of such provider limits, the largest-memory instances currently provided by AWS are not sufficient to run certain de novo assemblers. However, these are limitations of the provider used, not of the cloud model. An advantage of the virtualization-based cloud model is the ability to move to a different cloud provider or to local resources. Cloud computing offers a new avenue for accessing computational infrastructure, and Galaxy Cloud helps harness its potential in a very general way, but it may not be appropriate or cost-effective for some workloads.
As NGS becomes an indispensable tool for biomedical research, it is crucial to provide analysis solutions that are both usable by biomedical researchers and cost-effective. Galaxy Cloud addresses this by combining the accessible Galaxy interface with automated management of cloud computing resources. Unlike purpose-built solutions, Galaxy allows users either to use existing, tested best practices in the form of workflows or to construct their own analyses for novel tasks. Galaxy Cloud instances are owned and controlled entirely by the user who created them and can be used effectively in secure private clouds. Thus Galaxy Cloud provides a solution that retains user control and privacy, makes complex analysis accessible, and enables the use of practically limitless on-demand compute resources.

13 in total

1. The many faces of mitochondrial diseases.

Authors: Salvatore DiMauro
Journal: Mitochondrion Date: 2004-09-30 Impact factor: 4.160

Review 2. Computation for ChIP-seq and RNA-seq studies.

Authors: Shirley Pepke; Barbara Wold; Ali Mortazavi
Journal: Nat Methods Date: 2009-11 Impact factor: 28.547

3. The case for cloud computing in genome informatics.

Authors: Lincoln D Stein
Journal: Genome Biol Date: 2010-05-05 Impact factor: 13.583

4. Cloud computing and the DNA data race.

Authors: Michael C Schatz; Ben Langmead; Steven L Salzberg
Journal: Nat Biotechnol Date: 2010-07 Impact factor: 54.908

5. Cloud-scale RNA-sequencing differential expression analysis with Myrna.

Authors: Ben Langmead; Kasper D Hansen; Jeffrey T Leek
Journal: Genome Biol Date: 2010-08-11 Impact factor: 13.583

Review 6. RNA-Seq: a revolutionary tool for transcriptomics.

Authors: Zhong Wang; Mark Gerstein; Michael Snyder
Journal: Nat Rev Genet Date: 2009-01 Impact factor: 53.242

7. Galaxy CloudMan: delivering cloud compute clusters.

Authors: Enis Afgan; Dannon Baker; Nate Coraor; Brad Chapman; Anton Nekrutenko; James Taylor
Journal: BMC Bioinformatics Date: 2010-12-21 Impact factor: 3.169

8. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.

Authors: Jeremy Goecks; Anton Nekrutenko; James Taylor
Journal: Genome Biol Date: 2010-08-25 Impact factor: 13.583

9. Dynamics of mitochondrial heteroplasmy in three families investigated via a repeatable re-sequencing study.

Authors: Hiroki Goto; Benjamin Dickins; Enis Afgan; Ian M Paul; James Taylor; Kateryna D Makova; Anton Nekrutenko
Journal: Genome Biol Date: 2011-06-23 Impact factor: 13.583

10. Searching for SNPs with cloud computing.

Authors: Ben Langmead; Michael C Schatz; Jimmy Lin; Mihai Pop; Steven L Salzberg
Journal: Genome Biol Date: 2009-11-20 Impact factor: 13.583
