Inès Krissaane, Carlos De Niz, Alba Gutiérrez-Sacristán, Gabor Korodi, Nneka Ede, Ranjay Kumar, Jessica Lyons, Arjun Manrai, Chirag Patel, Isaac Kohane, Paul Avillach.
Abstract
OBJECTIVE: Advancements in human genomics have generated a surge of available data, fueling the growth and accessibility of databases for more comprehensive, in-depth genetic studies.
Keywords: cloud computing; distributed systems; genome-wide association study; whole genome
Year: 2020 PMID: 32719837 PMCID: PMC7534581 DOI: 10.1093/jamia/ocaa068
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Description of the whole-genome sequencing datasets used to conduct the genome-wide association study
| Project releases | VCF file size in GB | MT file size in GB | SNPs | Samples |
|---|---|---|---|---|
| 1KG Phase 1 | 1231 | 250 | 38 248 779 | 1092 |
| 1KG Phase 3 | 853 | 12 | 77 253 690 | 2535 |
| COPD Freeze 4 | 52 | 102 | 69 023 355 | 1886 |
| Jackson Freeze 5 | 29 | 34 | 74 623 050 | 3406 |
1KG: 1000 Genomes Project; GB: gigabytes; MT: Matrix Table; SNP: single nucleotide polymorphism; VCF: variant call format.
VCF file sizes are reported for compressed files.
Figure 1. Overview of a genome-wide association study performed with the Hail variant store on the 1000 Genomes Project Phase 3 dataset. EUR, EAS, AMR, SAS, and AFR designate the European, East Asian, American, South Asian, and African populations, respectively.
Figure 2. A distributed computational framework for large-scale genomic analysis. Cloud computing setup for executing Hail jobs on Google Cloud Platform and Amazon Web Services with the Spark and Hadoop distributed systems. CPU: central processing unit; MT: Matrix Table; RAM: random access memory.
Figure 3. Total cost and performance (from cluster instantiation to deletion) of a Google Cloud Dataproc computing cluster. The master node was an n1-standard-4 instance. Worker nodes used 2 different instance types: n1-highmem-16 and n1-highmem-64. Genome-wide association analyses were performed on 4 different datasets. The number next to each point indicates the number of worker nodes in the cluster. Lines link clusters that use the same instance type, ordered by increasing number of worker nodes. 1000G: 1000 Genomes Project; CPU: central processing unit; TOPMed: Trans-Omics for Precision Medicine.
Best cluster configuration based on the total cost to conduct the genome-wide association study across 4 different datasets
| Project releases | Instance type | Nodes | Total runtime (min) | Total cost ($) |
|---|---|---|---|---|
| 1KG Phase 1 | n1-standard-16 | 10 | 14 | 1.1 |
| 1KG Phase 3 | n1-standard-16 | 10 | 32 | 2.6 |
| COPD Freeze 4 | n1-highmem-16 | 10 | 30 | 2.9 |
| Jackson Freeze 5 | n1-highmem-16 | 10 | 23 | 2.2 |
1KG: 1000 Genomes Project.
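The runtimes and costs in the table above can be put on a common footing by dividing total cost by node-hours. The short Python sketch below does that arithmetic using only the values listed in the table. It is an illustration, not part of the published analysis: the `Run` class and its field names are hypothetical, and the result is an implied blended rate that ignores the master node, storage, and network charges, so it should not be read as a cloud list price.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One row of the best-configuration table (hypothetical container)."""
    dataset: str
    instance_type: str
    nodes: int
    runtime_min: float
    total_cost_usd: float

    @property
    def node_hours(self) -> float:
        # Worker node-hours consumed: nodes x wall-clock hours.
        return self.nodes * self.runtime_min / 60.0

    @property
    def cost_per_node_hour(self) -> float:
        # Implied blended rate; assumes the whole cost is attributable
        # to the worker nodes, which is a simplification.
        return self.total_cost_usd / self.node_hours


# Values copied from the table above.
RUNS = [
    Run("1KG Phase 1", "n1-standard-16", 10, 14, 1.1),
    Run("1KG Phase 3", "n1-standard-16", 10, 32, 2.6),
    Run("COPD Freeze 4", "n1-highmem-16", 10, 30, 2.9),
    Run("Jackson Freeze 5", "n1-highmem-16", 10, 23, 2.2),
]

if __name__ == "__main__":
    for run in RUNS:
        print(
            f"{run.dataset:<18} {run.instance_type:<15} "
            f"{run.node_hours:5.2f} node-h  "
            f"${run.cost_per_node_hour:.2f} per node-h"
        )
```

Under these assumptions, the two n1-standard-16 runs come out slightly cheaper per node-hour than the two n1-highmem-16 runs, which is the direction one would expect given the larger memory of the highmem instances.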