Literature DB >> 32719837

Scalability and cost-effectiveness analysis of whole genome-wide association studies on Google Cloud Platform and Amazon Web Services.

Inès Krissaane, Carlos De Niz, Alba Gutiérrez-Sacristán, Gabor Korodi, Nneka Ede, Ranjay Kumar, Jessica Lyons, Arjun Manrai, Chirag Patel, Isaac Kohane, Paul Avillach.

Abstract

OBJECTIVE: Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies.
METHODS: We provide a straightforward and innovative methodology to optimize cloud configuration for conducting genome-wide association studies. We utilized Spark clusters on both Google Cloud Platform and Amazon Web Services, as well as Hail (http://doi.org/10.5281/zenodo.2646680), for the analysis and exploration of genomic variant datasets.
RESULTS: Comparative evaluation of numerous cloud-based cluster configurations demonstrates a successful and unprecedented compromise between speed and cost for performing genome-wide association studies on 4 distinct whole-genome sequencing datasets. Results are consistent across the 2 cloud providers and could be highly useful for accelerating research in genetics.
CONCLUSIONS: We present a timely piece for one of the most frequently asked questions when moving to the cloud: what is the trade-off between speed and cost?
© The Author(s) 2020. Published by Oxford University Press on behalf of the American Medical Informatics Association.


Keywords:  cloud computing; distributed systems; genome-wide association study; whole genome

Year:  2020        PMID: 32719837      PMCID: PMC7534581          DOI: 10.1093/jamia/ocaa068

Source DB:  PubMed          Journal:  J Am Med Inform Assoc        ISSN: 1067-5027            Impact factor:   4.497


INTRODUCTION

As datasets become larger and more abundant, science faces a new challenge: how to overcome the economic and technological barriers that arise when trying to store and analyze the data generated by large sample sizes. Every year, the scale of available genomic variant datasets nearly doubles. This has led to a recent broad interest in genomic analyses using cloud computing. For example, investigators have launched a new large-scale initiative, the Trans-Omics for Precision Medicine (TOPMed) program, as part of the Precision Medicine Initiative. TOPMed focuses on the integration of thousands of whole genomes gathered across several studies. Processing such large amounts of data is unprecedented and requires significant funding for both storage and computation. A solution, perhaps the only sustainable one currently available, is cloud-distributed computing. Because the costs of such a solution remain obscure for common genomic operations, many investigators remain unsure of the suitability of cloud computing for their purposes; we therefore undertook this study to clarify those costs.
We present an adaptable and reproducible method to deploy Spark clusters using Hail, an open-source, scalable framework for exploring and analyzing genetic data, as well as for variant storage. We used the Google Dataproc service on Google Cloud Platform (GCP) and Amazon Elastic MapReduce (EMR) on Amazon Web Services (AWS) to perform genomic variant analysis with whole-genome sequencing (WGS) data. We thus offer a promising strategy to accelerate the functional interpretation of genetic variants and the discovery of their associations with human disease, in particular through genome-wide association study (GWAS) analysis. To estimate the computational infrastructure required, we performed cost analyses of GWAS using 4 different datasets from the 1000 Genomes Project and the TOPMed WGS program.
Our goal was to optimize and customize cloud resources to fit computation and storage needs. We further offered appropriate strategies for using cloud resources by assessing the best cluster configuration for a GWAS analysis based on total cost and runtime.

MATERIALS AND METHODS

Study sample and variant calling format

For this study, we used 4 different WGS datasets: phases 1 and 3 of the 1000 Genomes Project and, from the TOPMed project, the COPDGene Study and the Jackson Heart Study. Phases 1 and 3 of the 1000 Genomes Project are publicly and readily available in cloud buckets (gs://1000-genomes on Google Cloud Storage and s3://1000genomes in Amazon S3). For freeze 4 (COPDGene Study) and freeze 5 (Jackson Heart Study), we obtained variant data from the dbGaP database in variant call format (VCF) files for every sample in each freeze; these correspond to aggregate single nucleotide polymorphisms (SNPs) for each study. We combined the VCF files using the bcftools merge function (see Table 1 and Supplementary Appendix File 1). We imported the VCF files and transformed them into Hail Matrix Table objects (.mt). We found it advantageous to use .mt files in Hail, as they are written and read faster than VCF files (see Supplementary Appendix 1).
Table 1.

Whole-genome sequencing datasets used to conduct the genome-wide association study

| Project release  | VCF file size (GB) | MT file size (GB) | SNPs       | Samples |
|------------------|--------------------|-------------------|------------|---------|
| 1KG Phase 1      | 1231               | 250               | 38 248 779 | 1092    |
| 1KG Phase 3      | 853                | 12                | 77 253 690 | 2535    |
| COPD Freeze 4    | 52a                | 102               | 69 023 355 | 1886    |
| Jackson Freeze 5 | 29a                | 34                | 74 623 050 | 3406    |

1KG: 1000 Genomes Project; GB: gigabytes; MT: Matrix Table; SNP: single nucleotide polymorphism; VCF: variant call format.

a Compressed VCF file size.
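As a sketch of this conversion step, assuming Hail 0.2 and illustrative bucket paths (the function name and paths are ours, not the paper's), the VCF-to-MatrixTable import can be written as:

```python
def vcf_to_matrix_table(vcf_path: str, mt_path: str):
    """Import a block-gzipped VCF and persist it as a Hail MatrixTable (.mt).

    Sketch only: running it requires Hail 0.2 on a Spark cluster, and the
    paths (e.g. 'gs://my-bucket/phase3.vcf.bgz') are illustrative.
    """
    import hail as hl  # deferred import so the sketch can be loaded without Hail
    hl.init()  # attach to the cluster's Spark context
    mt = hl.import_vcf(vcf_path, force_bgz=True)
    mt.write(mt_path, overwrite=True)  # .mt files read back much faster than VCF
    return hl.read_matrix_table(mt_path)
```

Subsequent analyses then read the .mt file directly, avoiding repeated VCF parsing.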

GWAS analysis

We chose gender, a variable present in all 4 datasets, as the phenotype, with female vs male as case and control groups. Using Python scripts in Jupyter notebooks applied to the 4 datasets, we executed the steps one would use to perform a GWAS (see Figure 1). Specifically, we utilized genotype information (GT) for many genetic markers across chromosomes. We applied standard quality-control procedures for genomic data. We then filtered variants on minor allele frequency, retaining SNPs with a frequency above 1%, and checked for missing values. We corrected for population structure by performing a principal component analysis on the Hardy-Weinberg-normalized genotype call matrix. Finally, we conducted a logistic regression predicting gender from the genotype calls (GT), using the first 2 principal components obtained previously as covariates (see Supplementary Appendix 1, part 1).
Figure 1.

Overview of a genome-wide association study using the Hail variant store with the 1000 Genomes Project Phase 3 dataset. EUR, EAS, AMR, SAS, and AFR denote the European, East Asian, American, South Asian, and African populations. Results are plotted in a Manhattan plot (Figure 1), showing the significant SNPs (Wald test per variant). The horizontal line represents the significance threshold after Bonferroni correction (P value ≤5.0 × 10⁻⁸).
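The pipeline above can be sketched in Hail 0.2 as follows; the field name (`is_female`) and the exact filter expressions are our assumptions, not the paper's published script, and running the function requires a Spark cluster:

```python
def gwas_gender(mt):
    """Sketch of the GWAS steps: QC, MAF filter, PCA, Wald logistic regression."""
    import hail as hl  # deferred import: execution requires Hail 0.2 on Spark
    mt = hl.variant_qc(mt)
    # Keep common SNPs: minor allele frequency above 1%
    mt = mt.filter_rows(hl.min(mt.variant_qc.AF) > 0.01)
    # Population structure: PCA on the HWE-normalized genotype call matrix
    _, scores, _ = hl.hwe_normalized_pca(mt.GT, k=2)
    mt = mt.annotate_cols(pcs=scores[mt.s].scores)
    # Logistic regression of gender on genotype, first 2 PCs as covariates
    return hl.logistic_regression_rows(
        test='wald',
        y=mt.is_female,
        x=mt.GT.n_alt_alleles(),
        covariates=[1.0, mt.pcs[0], mt.pcs[1]],
    )

# Bonferroni-corrected threshold for roughly 1 million independent tests,
# matching the genome-wide significance line in Figure 1:
bonferroni_threshold = 0.05 / 1_000_000  # = 5.0e-8
```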

Cloud deployment

In GCP, we executed a shell script to automate the deployment and deletion of our clusters (cluster creation code supported by the Hail team [https://github.com/Nealelab/cloudtools]). For AWS, we used an in-house script to manage EMR cluster generation. We used 2 cloud formation tools to create Spark clusters with a master node and several worker nodes: (1) Google Dataproc with image 1.2-deb9 and Spark version 2.2.1 and (2) Amazon EMR 5.13 with Spark version 2.3.0. The use of 2 different cloud providers created only minor variation in terms of hardware (instance characteristics), cluster configuration, and network connection. However, as shown in the Results, the outcomes are very consistent across both platforms. Both providers offer similar computing environments, including the number of central processing units (CPUs) per instance, storage, memory (random access memory [RAM]), networking, and operating systems. For the worker nodes, we performed GWAS with preemptible instances in GCP and spot instances in AWS. These represented a significant cost reduction while still meeting performance requirements (see Supplementary Appendix 1, part 4).
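A minimal sketch of such automation, here assembling a `gcloud dataproc clusters create` invocation in Python (the cluster name, machine types, and helper function are illustrative; the paper's actual scripts are in the linked repositories):

```python
def dataproc_create_cmd(name, workers, worker_type="n1-highmem-16", preemptible=0):
    """Assemble the argv for creating a Dataproc Spark cluster.

    Illustrative sketch; flags follow the gcloud dataproc CLI, with
    preemptible workers used to cut cost, as described above.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        "--master-machine-type", "n1-standard-4",
        "--worker-machine-type", worker_type,
        "--num-workers", str(workers),
        "--num-preemptible-workers", str(preemptible),
        "--image-version", "1.2-deb9",  # Dataproc image used in this study
    ]

cmd = dataproc_create_cmd("gwas-cluster", workers=2, preemptible=8)
# The same automation would later run `gcloud dataproc clusters delete`
# so that billing stops as soon as the Hail job finishes.
```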

Hail cloud testing and workflow

For the sake of reproducibility, we decided to use standard instance types, preferably the most common and accessible, rather than customizing our own. In GCP, we varied only 2 parameters: the instance type for worker nodes and the number of nodes. These directly determine the total number of CPUs and the total memory (in gigabytes [GB]) of the cluster. We tested 2 instance families, n1-standard and n1-highmem, for a total of 6 different Google Compute Engine (GCE) virtual machine instances. Our GCP clusters included 16 to 64 CPUs and 60 to 416 GB of RAM per worker node. For AWS, we used a cluster whose worker nodes had 16 CPUs and 64 GB of RAM. We calculated the total cost of each cluster for end-to-end processing (from instantiation to deletion), based on the prices applied by Google Cloud, including the price per instance and the Google Cloud Dataproc charge. These rates were those applicable to the North Virginia zone on January 31, 2019. The process described in Figure 2 was performed using a bash script that parallelizes the creation of all clusters and automates their deletion after the Hail operation finishes. The code is available online (https://github.com/hms-dbmi/Hail-on-Google-Cloud/tree/master/Bash_script).
Figure 2.

A distributed computational framework for large genomics analysis. Cloud computing setup for executing Hail jobs on Google Cloud Platform and Amazon Web Services with Spark and Hadoop-distributed systems. CPU: central processing unit; MT: Matrix Table; RAM: random access memory.


Availability and implementation

The workflows to deploy Hail cloud clusters are available online (https://github.com/hms-dbmi/Hail-on-Google-Cloud and https://github.com/hms-dbmi/hail-on-AWS-spot-instances) and the Jupyter notebook to launch analyses with the 1000 Genomes Project can be accessed online (https://github.com/hms-dbmi/Hail-on-Google-Cloud/blob/master/Analysis/GWAS_Gender_Phase1.ipynb).

RESULTS

Large-scale genomic data analyses on GCP

Focusing first on clusters generated in GCP, we analyzed the total cost and runtime necessary to perform a GWAS for each cluster, from creation to deletion. We instantiated more than 100 clusters (see Supplementary Appendix 3 and Figure 3), observing high variability in both the total time and the cost of a GWAS analysis. Overall, the total time for a GWAS on each dataset was <2 hours, evidencing the high-performance capacity of cloud parallel processing at scale (see Supplementary Appendices 2 and 3). In optimizing our method, we tested the current mindset around cloud-based resources: that to be most efficient, one should use the largest clusters. However, our results (Figure 3) showed that each instance type had a breaking point, after which increasing the size of a cluster yielded no further benefit. When cost is the limiting factor, we demonstrated that it was advantageous to select larger clusters (ie, those with a high number of nodes), which reduced the runtime and, consequently, the total cost. However, we also noted a trade-off: once a particular number of nodes (a large enough cluster) was reached, performance plateaued or even decreased. This manifested as a clear inflection point across all 4 datasets (Figure 3), where one gives up speed for cost. We determined the best configuration for the 4 datasets (Table 2) based on our primary goal of achieving the lowest total cost (see Supplementary Appendix 2).
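This trade-off can be illustrated with a toy cost model (all numbers are assumed for illustration, not the paper's measurements): runtime shrinks with the number of workers until a scaling plateau, after which added nodes only add cost.

```python
def runtime_hours(n_workers, serial_hours=8.0, plateau=20):
    """Toy scaling model: near-linear speedup until `plateau` workers, then flat."""
    return serial_hours / min(n_workers, plateau)

def total_cost(n_workers, hours, worker_rate=1.0, master_rate=0.2):
    """Cluster cost in $: master plus workers, billed for the whole run."""
    return hours * (master_rate + n_workers * worker_rate)

costs = {n: round(total_cost(n, runtime_hours(n)), 2) for n in (5, 10, 20, 40)}
# Cost falls slightly up to the plateau, then jumps once extra nodes stop
# shortening the run, mirroring the inflection point seen in Figure 3.
```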
Figure 3.

Total cost and performance (from cluster instantiation to deletion) of a Google Cloud, DataProc computing cluster. Master node was instance type: n1-standard-4. Worker nodes had 2 different instance types: n1-highmem-16 and n1-highmem-64. Analysis of genome-wide association studies were performed on 4 different datasets. The numbers near each point indicate the number of worker nodes per cluster. Lines link clusters with the same instance, with increasing numbers of instances per cluster. 1000G: 1000 Genomes Project; CPU: central processing unit; TOPMed: Trans-Omics for Precision Medicine.

Table 2.

Best cluster configuration based on the total cost to conduct the genome-wide association study across 4 different datasets

| Project release  | Instance type  | Nodes | Total runtime (min) | Total cost ($) |
|------------------|----------------|-------|---------------------|----------------|
| 1KG Phase 1      | n1-standard-16 | 10    | 14                  | 1.1            |
| 1KG Phase 3      | n1-standard-16 | 10    | 32                  | 2.6            |
| COPD Freeze 4    | n1-highmem-16  | 10    | 30                  | 2.9            |
| Jackson Freeze 5 | n1-highmem-16  | 10    | 23                  | 2.2            |

1KG: 1000 Genomes Project.


Validation with AWS

As described previously, cluster setup required more time in AWS. We therefore compared performance with GCP by running the same GWAS script in a Jupyter notebook, excluding cluster preparation and deletion time. With the same configuration on both cloud services, we obtained identical execution runtimes for all GWAS (see Supplementary Appendix 3). Our approach thus worked on both cloud services, with identical computational runtimes and costs (not factoring in cluster setup).

DISCUSSION

Although using distributed computing in research is becoming increasingly common, information concerning the cost and computational power required to perform a specific study (from storage and data loading through computation) is lacking. Moreover, distributed system tools like Spark and Hadoop require specific knowledge that is not yet widespread among bioinformaticians. In this study, we showed that Hail, a cloud-compatible analytic tool, can be harnessed to address the scalability challenges arising from large genomic data analytics. We described a simple and relatively effortless way (with a single command line) to set up a Spark cluster via Hail on both GCP and AWS. Using this framework, we facilitated the downloading and preprocessing of data via an optimized pipeline for large-scale genomic variant analytics. The method is highly scalable and shows that cloud-based distributed systems are indeed an effective and novel way to perform cost-effective computational analysis on datasets of several terabytes or more. The cost of commercial cloud services alone can deter many researchers from transitioning to a cloud infrastructure. We showed that this cost can be reduced by an optimized strategy for choosing cluster size, aligned with the submission of Hail jobs to the cloud. We acknowledge that cloud computing still needs to overcome many challenges (eg, costs that are subject to abrupt change and problems with network speed between components). Given these realities, future work might focus on estimating the best cluster configuration upstream, before launch, optimizing both cost and time. Future studies might also delve more deeply into the complex mechanisms of cloud computation, further enhancing optimization and driving down cost.
Looking toward the future of precision medicine, a daunting challenge lies in the handling of the massive genomics datasets being generated, as well as the ability to perform extensive interrogation of whole-genome sequences. With an eye toward enabling new biological discovery, we champion the performance benefits provided by the cloud, while emphasizing the boundaries of cluster size and utilization of computational resources. We anticipate that researchers will increasingly utilize cloud computing, especially as the challenges mount around prominent initiatives, such as the 100 000 Genomes Project, the Cancer Genomics Cloud, and the Precision Medicine Initiative. We propose that our method and framework will be an applicable and a powerful addition to these and other future large-scale genomic datasets.

FUNDING

This work was supported by the National Institutes of Health through the DataCommons program grant number 1OT3OD025466-0 and by National Heart, Lung, and Blood Institute DataSTAGE program grant number 1OT3HL142480-01. All Google and Amazon benchmarks and applications are funded respectively by a Google cloud grant and an Amazon grant.

AUTHOR CONTRIBUTIONS

IKr led the design and implementation of the framework in Google Cloud Platform and wrote the majority of the manuscript. CDN worked on the implementation of the method in Amazon Web Services and carried out most of the benchmark for Amazon Web Services. AG-S and NE participated in the analysis and understanding of genetic data. GK and RK helped in the implementation of cloud tools needed for the manuscript as well as for the management of the data used. IKo, JL, AM, and CP have made substantial contributions in the conception and initiation of the project. PA was responsible for the conception of the study, overseeing the analysis, revising the manuscript, and approving the final manuscript. All authors reviewed and approved the final version of the manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.