Arash Bayat, Piotr Szul, Aidan R O'Brien, Robert Dunne, Brendan Hosking, Yatish Jain, Cameron Hosking, Oscar J Luo, Natalie Twine, Denis C Bauer.
Abstract
BACKGROUND: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS consider only the additive effects of individual genes, not epistatic interactions or the combination of individual and interacting drivers. Although evidence of epistatic interactions has been found in small datasets, large datasets have not yet been processed owing to the high computational complexity of the search for epistatic interactions.
Year: 2020 PMID: 32761098 PMCID: PMC7407261 DOI: 10.1093/gigascience/giaa077
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Table 1: Nine phenotypes simulated with PEPS
| Phenotype name | Category | No. of 1-way variables | No. of 2-way variables | No. of 3-way variables | No. of 4-way variables | No. of 5-way variables | Total No. of truth-variables | Total No. of truth-variants |
|---|---|---|---|---|---|---|---|---|
| PIL | PI | 5 | 0 | 0 | 0 | 0 | 5 | 5 |
| PIM | PI | 50 | 0 | 0 | 0 | 0 | 50 | 50 |
| PIH | PI | 500 | 0 | 0 | 0 | 0 | 500 | 500 |
| PEL | PE | 0 | 2 | 2 | 2 | 2 | 8 | 28 |
| PEM | PE | 0 | 20 | 20 | 20 | 20 | 80 | 280 |
| PEH | PE | 0 | 50 | 50 | 50 | 50 | 200 | 700 |
| PXL | PX | 5 | 3 | 2 | 1 | 1 | 12 | 26 |
| PXM | PX | 50 | 25 | 17 | 13 | 10 | 115 | 253 |
| PXH | PX | 500 | 250 | 167 | 125 | 100 | 1,142 | 2,501 |
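
The two total columns in the table above can be reproduced from the 1-way to 5-way counts. The following is a minimal sketch, assuming that each k-way variable involves k distinct variants; this assumption is not stated in the table but is consistent with every row. The names `PHENOTYPES` and `totals` are illustrative only.

```python
# Counts of 1-way .. 5-way variables per phenotype, copied from the table above.
PHENOTYPES = {
    "PIL": [5, 0, 0, 0, 0],
    "PIM": [50, 0, 0, 0, 0],
    "PIH": [500, 0, 0, 0, 0],
    "PEL": [0, 2, 2, 2, 2],
    "PEM": [0, 20, 20, 20, 20],
    "PEH": [0, 50, 50, 50, 50],
    "PXL": [5, 3, 2, 1, 1],
    "PXM": [50, 25, 17, 13, 10],
    "PXH": [500, 250, 167, 125, 100],
}


def totals(counts):
    """Return (truth-variables, truth-variants) for counts of 1-way..5-way variables.

    Assumption: a k-way variable is built from k distinct variants.
    """
    n_variables = sum(counts)
    n_variants = sum(order * n for order, n in enumerate(counts, start=1))
    return n_variables, n_variants


for name, counts in PHENOTYPES.items():
    n_variables, n_variants = totals(counts)
    print(f"{name}: {n_variables} truth-variables, {n_variants} truth-variants")
```

For PXH, for example, this gives 500 + 250 + 167 + 125 + 100 = 1,142 truth-variables and 500 + 2 × 250 + 3 × 167 + 4 × 125 + 5 × 100 = 2,501 truth-variants.
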
Table 2: 1000-Genome dataset and its subsets
| Dataset | No. of variants | % of truth-variants included |
|---|---|---|
| 1KG-80M | 81,647,203 | 100 |
| 1KG-5M | 5,000,516 | 6.1 |
| 1KG-500K | 500,446 | 0.6 |
| 1KG-5M-T | 5,016,789 | 100 |
| 1KG-500K-T | 517,729 | 100 |
There are 2,504 samples in these datasets.
Table 3: Synthetic datasets generated by VariantSpark
| Dataset | Size | No. of samples (nS) | No. of variants (nV) | No. of genotypes (nS × nV) |
|---|---|---|---|---|
| 1K-10K | 10M | 1,000 | 10,000 | 1e7 |
| 1K-100K | 100M | 1,000 | 100,000 | 1e8 |
| 1K-1M | 1B | 1,000 | 1,000,000 | 1e9 |
| 1K-10M | 10B | 1,000 | 10,000,000 | 1e10 |
| 1K-100M | 100B | 1,000 | 100,000,000 | 1e11 |
| 10K-10K | 100M | 10,000 | 10,000 | 1e8 |
| 10K-100K | 1B | 10,000 | 100,000 | 1e9 |
| 10K-1M | 10B | 10,000 | 1,000,000 | 1e10 |
| 10K-10M | 100B | 10,000 | 10,000,000 | 1e11 |
| 10K-100M | 1T | 10,000 | 100,000,000 | 1e12 |
| 100K-10K | 1B | 100,000 | 10,000 | 1e9 |
| 100K-100K | 10B | 100,000 | 100,000 | 1e10 |
| 100K-1M | 100B | 100,000 | 1,000,000 | 1e11 |
| 100K-10M | 1T | 100,000 | 10,000,000 | 1e12 |
Table 4: Datasets for high-resolution comparison of the VariantSpark runtime with other implementations of RF
| Dataset | No. of variants (nV) | Dataset | No. of variants (nV) |
|---|---|---|---|
| 1X | 100 | 512X | 51,200 |
| 2X | 200 | 1KX | 102,400 |
| 4X | 400 | 2KX | 204,800 |
| 8X | 800 | 4KX | 409,600 |
| 16X | 1,600 | 8KX | 819,200 |
| 32X | 3,200 | 16KX | 1,638,400 |
| 64X | 6,400 | 32KX | 3,276,800 |
| 128X | 12,800 | 64KX | 6,553,600 |
| 256X | 25,600 | 10M | 10,000,000 |
X denotes 100 variants and KX denotes 102,400 variants (1,024 × 100). The 10M dataset is identical to the 10K-10M dataset. Each dataset includes 10,000 samples.
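
The doubling series in the table above follows a simple rule: each step doubles the number of variants, starting from 100. The sketch below regenerates the labels and variant counts; `dataset_label` and `variant_count` are hypothetical helper names, and the 10M dataset is left out because it does not follow the doubling pattern.

```python
def dataset_label(step):
    """Label for the dataset at a given doubling step (0 -> '1X', 10 -> '1KX')."""
    multiple = 2 ** step
    if multiple < 1024:
        return f"{multiple}X"
    return f"{multiple // 1024}KX"


def variant_count(step):
    """Number of variants at a given doubling step, starting from 100."""
    return 100 * 2 ** step


# Steps 0..16 cover 1X up to 64KX.
series = [(dataset_label(step), variant_count(step)) for step in range(17)]

for label, n_variants in series:
    print(f"{label}: {n_variants:,} variants")
```
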
Table 5: EMR clusters and compute-nodes
| Cluster | Compute-nodes | vCPU (master + compute) | Memory in GB (master + compute) |
|---|---|---|---|
| C16 | 1 × r4.4xlarge | 8 + 16 | 61 + 122 |
| C32 | 2 × r4.4xlarge | 8 + 32 | 61 + 244 |
| C64 | 4 × r4.4xlarge | 8 + 64 | 61 + 488 |
| C128 | 8 × r4.4xlarge | 8 + 128 | 61 + 976 |
| C256 | 16 × r4.4xlarge | 8 + 256 | 61 + 1,952 |
| C512 | 32 × r4.4xlarge | 8 + 512 | 61 + 3,904 |
| C1024 | 64 × r4.4xlarge | 8 + 1,024 | 61 + 7,808 |
| C256-S | 32 × r4.2xlarge | 8 + 256 | 61 + 1,952 |
| C256-L | 8 × r4.8xlarge | 8 + 256 | 61 + 1,952 |
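
The compute-side figures in the table above are the per-node resources multiplied by the number of compute-nodes. The sketch below reproduces them; the per-instance vCPU and memory values are inferred from the table rows themselves (for example, the C16 row gives 16 vCPU and 122 GB for a single r4.4xlarge), and `INSTANCE_SPECS` and `compute_resources` are illustrative names.

```python
# (vCPU, memory in GB) per compute-node, inferred from the table above.
INSTANCE_SPECS = {
    "r4.2xlarge": (8, 61),
    "r4.4xlarge": (16, 122),
    "r4.8xlarge": (32, 244),
}

# Master-node resources as listed in every row of the table.
MASTER_VCPU, MASTER_MEMORY_GB = 8, 61


def compute_resources(n_nodes, instance_type):
    """Return total (vCPU, memory in GB) across all compute-nodes of a cluster."""
    vcpu, memory_gb = INSTANCE_SPECS[instance_type]
    return n_nodes * vcpu, n_nodes * memory_gb


for name, n_nodes, instance_type in [
    ("C1024", 64, "r4.4xlarge"),
    ("C256-S", 32, "r4.2xlarge"),
    ("C256-L", 8, "r4.8xlarge"),
]:
    vcpu, memory_gb = compute_resources(n_nodes, instance_type)
    print(f"{name}: {MASTER_VCPU} + {vcpu} vCPU, {MASTER_MEMORY_GB} + {memory_gb} GB")
```

C256-S and C256-L provide the same total resources as C256 but spread them over more, smaller nodes or fewer, larger nodes, respectively.
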
Figure 1: Comparison of VariantSpark and logistic regression in their ability to detect phenotype-associated variants. Phenotype labels (i.e., PIL, PIM, ...) are described in Table 1.
Figure 2: Comparison of exclusively detected variants and correlation with prediction accuracy.
Figure 3: VariantSpark’s runtime compared with other implementations of Random Forest (RF) and Decision Tree (DT). The RF and DT workloads are different and should not be compared with each other. The number of variants in the dataset is doubled at each step (see Table 4 for the list of datasets used for the comparison). The thin unmarked black line shows how the runtime would grow if it increased linearly, starting from the average runtime of VariantSpark and Reforest for a dataset of 1.6M variants.
Figure 4: VariantSpark runtime as a function of size of (a) the dataset, (b) the cluster, and (c) the compute-nodes. VariantSpark runtime and accuracy as a function of (d) mTry and nTree and (e) maxD and minNS. (f) The effect of maxD and minNS on the average depth and the number of nodes per tree.
Figure 5: Illustration of partitioning strategies for distributed computing implementations of RF. For genomics data, the number of features is larger than the number of samples. Here, vertical partitioning better balances data divisions and makes communication between compute-nodes (C) and the master-node (M) more efficient. Specifically, when each node of each tree is trained with vertical partitioning, each compute-node finds the best local split in its allocated partition and communicates only that split to the master (small green squares). In contrast, with horizontal partitioning, each compute-node must communicate the summary statistics of all allocated samples to the master-node (large orange tables).
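
The vertical partitioning idea described in the Figure 5 caption can be sketched in a few lines of plain Python. This is a single-process illustration, not VariantSpark's implementation: it assumes genotypes coded as 0/1/2, a binary phenotype, and Gini impurity as the split criterion, and all function names are hypothetical. Each simulated compute-node holds a subset of the features for all samples, finds its best local split, and sends only one small tuple to the master.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gain(column, labels, threshold):
    """Impurity decrease when splitting samples on column <= threshold."""
    left = [y for x, y in zip(column, labels) if x <= threshold]
    right = [y for x, y in zip(column, labels) if x > threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


def local_best_split(columns, labels):
    """Compute-node side: best (gain, feature, threshold) among local features."""
    best = (0.0, None, None)
    for feature, column in columns.items():
        for threshold in (0, 1):  # assumed genotype coding 0/1/2
            gain = split_gain(column, labels, threshold)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best


def vertical_best_split(columns_by_feature, labels, n_workers):
    """Master side: partition features, collect one message per worker, pick the best."""
    features = sorted(columns_by_feature)
    partitions = [features[i::n_workers] for i in range(n_workers)]
    messages = [
        local_best_split({f: columns_by_feature[f] for f in part}, labels)
        for part in partitions
    ]
    return max(messages, key=lambda message: message[0]), messages


# Toy data: 6 samples, 4 variants; variant 1 separates the two classes.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    0: [0, 1, 2, 0, 1, 2],
    1: [0, 0, 0, 2, 2, 1],
    2: [1, 1, 0, 1, 0, 1],
    3: [2, 2, 1, 0, 0, 1],
}
best, messages = vertical_best_split(columns, labels, n_workers=2)
print("messages sent to master:", messages)
print("global best split:", best)
```

Each message here is a single (gain, feature, threshold) tuple regardless of how many features a compute-node holds. With horizontal partitioning, a compute-node would instead have to send per-feature summary statistics for its samples, so the message size would grow with the number of features.
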