Alexander Grueneberg, Gustavo de Los Campos.
Abstract
We created a suite of packages to enable the analysis of extremely large genomic data sets (potentially millions of individuals and millions of molecular markers) within the R environment. The suite offers: a matrix-like interface for .bed files (PLINK's binary format for genotype data); a novel class of linked arrays that allows data stored in multiple files to be linked into a single array accessible from the R computing environment; methods that use parallel computing to carry out computations on very large data sets without loading the entire data set into memory; and a basic set of methods for statistical genetic analyses. The packages are accessible through CRAN and GitHub. In this note, we describe the classes and methods implemented in each of the packages that make up the suite and illustrate their use with data from the UK Biobank.
Keywords: big data; biobank; distributed computing; genetic analyses; parallel computing
Year: 2019 PMID: 30894453 PMCID: PMC6505159 DOI: 10.1534/g3.119.400018
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Packages, their purpose and repositories
| Name | Purpose | GitHub Repository¹ |
|---|---|---|
| BEDMatrix | Matrix-like class to extract genotypes from PLINK .bed files | |
| LinkedMatrix | Matrix-like class to link matrix-like objects by rows or by columns | |
| symDMatrix | Matrix-like class to link blocks of matrix-like objects into a partitioned symmetric matrix | |
| BGData | Computational methods for matrix-like objects, a class to represent genotype/phenotype data, and the flagship package of the suite | |

¹ All packages are also available at https://CRAN.R-project.org/.
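
The idea of linking matrix-like objects by columns, and then computing summaries without holding the whole array in memory, can be sketched as follows. This is a minimal illustration in plain Python, not the R implementation of LinkedMatrix or BGData: the class `ColumnLinkedMatrix`, the function `column_means`, and the toy genotype blocks are all hypothetical names and data introduced here for illustration only.

```python
from bisect import bisect_right


class ColumnLinkedMatrix:
    """Hypothetical sketch: present several row-aligned blocks as one matrix."""

    def __init__(self, blocks):
        # Each block is a list of rows; every block must have the same rows.
        n_rows = len(blocks[0])
        if any(len(block) != n_rows for block in blocks):
            raise ValueError("all blocks must have the same number of rows")
        self.blocks = blocks
        self.n_rows = n_rows
        # offsets[k] is the global index of the first column of block k.
        self.offsets = [0]
        for block in blocks:
            self.offsets.append(self.offsets[-1] + len(block[0]))
        self.n_cols = self.offsets[-1]

    def locate(self, j):
        """Map a global column index to (block index, local column index)."""
        if not 0 <= j < self.n_cols:
            raise IndexError("column index out of range")
        k = bisect_right(self.offsets, j) - 1
        return k, j - self.offsets[k]

    def column(self, j):
        """Return global column j, touching only the block that holds it."""
        k, local = self.locate(j)
        return [row[local] for row in self.blocks[k]]


def column_means(matrix, chunk_size=2):
    """Per-column means, reading at most chunk_size columns at a time.

    Missing genotypes are coded as None and skipped.
    """
    means = []
    for start in range(0, matrix.n_cols, chunk_size):
        stop = min(start + chunk_size, matrix.n_cols)
        # Only this chunk of columns is materialized at once.
        chunk = [matrix.column(j) for j in range(start, stop)]
        for col in chunk:
            observed = [g for g in col if g is not None]
            means.append(sum(observed) / len(observed) if observed else None)
    return means


# Two toy blocks (e.g., two files) with 3 individuals each, genotypes 0/1/2.
block_a = [[0, 1], [1, 2], [2, None]]
block_b = [[2, 0, 1], [2, 1, 1], [2, 2, 1]]
genotypes = ColumnLinkedMatrix([block_a, block_b])
print(genotypes.n_rows, genotypes.n_cols)
print(genotypes.locate(3))
print(column_means(genotypes))
```

In this sketch each chunk is processed sequentially; because chunks are independent, they could in principle be dispatched to separate workers, which is the kind of parallelism the abstract describes.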
Summary of the modular pipeline used to analyze data from the UK Biobank
| Task | Data Set: Train | Data Set: Test | R Package Used |
|---|---|---|---|
| 1) Form a linked array with genotypes | ☒ | ☒ | BGData |
| 2) Determine white British cohort | ☒ | ☒ | base |
| 3) Summaries | ☒ | ☒ | BGData |
| 4) SNP filtering (allele frequency & call rate) | ☒ | ☒ | base |
| 5) Genomic relationships (GR) | ☒ | ☒ | BGData |
| 6) Identification of samples with GR < 0.03 | ☒ | ☒ | BGData |
| 7) Computation of 5 PCs | ☒ | ☒ | base |
| 8) Phenotype adjustments | ☒ | ☒ | base |
| 9) Building of training and test sets | ☒ | ☒ | base |
| 10) GWAS (using adjusted phenotypes) | ☒ | | BGData |
| 11) Selection of the top-p variants | ☒ | | base |
| 12) Bayesian Genomic Regression | ☒ | | BGLR |
| 13) Assessment of prediction accuracy | | ☒ | base |
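
Steps 4 and 5 of the pipeline (SNP filtering by allele frequency and call rate, then genomic relationships) can be illustrated with a small sketch. It is written in plain Python rather than R, and it does not reproduce the BGData functions used by the authors: the function names, the thresholds `min_maf=0.1` and `min_call_rate=0.9`, the toy genotypes, and the choice of G = ZZ'/p with centered, scaled, mean-imputed genotypes are all assumptions made here for illustration.

```python
def snp_summaries(column):
    """Return (call_rate, minor_allele_frequency) for one SNP.

    Genotypes are coded 0/1/2 (copies of the counted allele); None = missing.
    """
    observed = [g for g in column if g is not None]
    call_rate = len(observed) / len(column)
    if not observed:
        return call_rate, None
    freq = sum(observed) / (2 * len(observed))
    return call_rate, min(freq, 1 - freq)


def filter_snps(columns, min_maf, min_call_rate):
    """Indices of SNPs passing both thresholds (thresholds are assumptions)."""
    kept = []
    for j, column in enumerate(columns):
        call_rate, maf = snp_summaries(column)
        if maf is not None and maf >= min_maf and call_rate >= min_call_rate:
            kept.append(j)
    return kept


def genomic_relationships(columns):
    """Assumed definition: G = ZZ'/p, Z = centered and scaled genotypes.

    Missing genotypes become 0 after centering (mean imputation).
    SNPs with fewer than two observations or zero variance are skipped.
    """
    n = len(columns[0])
    total = [[0.0] * n for _ in range(n)]
    used = 0
    for column in columns:
        observed = [g for g in column if g is not None]
        if len(observed) < 2:
            continue
        mean = sum(observed) / len(observed)
        var = sum((g - mean) ** 2 for g in observed) / (len(observed) - 1)
        if var == 0:
            continue
        sd = var ** 0.5
        z = [0.0 if g is None else (g - mean) / sd for g in column]
        for i in range(n):
            for k in range(n):
                total[i][k] += z[i] * z[k]
        used += 1
    if used == 0:
        raise ValueError("no usable SNPs")
    return [[value / used for value in row] for row in total]


# Toy data: 4 SNPs (one list per SNP) observed on 4 individuals.
snps = [
    [0, 1, 2, 1],        # common variant, fully called
    [0, 0, 0, 1],        # rarer variant, fully called
    [2, None, None, 0],  # half of the calls are missing
    [2, 2, 2, 2],        # monomorphic
]
kept = filter_snps(snps, min_maf=0.1, min_call_rate=0.9)
G = genomic_relationships([snps[j] for j in kept])
print(kept)  # [0, 1]
```

Looping over SNPs and accumulating their contributions to G means the full genotype matrix never has to be in memory at once, which mirrors the chunked style of computation the suite is designed for.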
Figure 1. Manhattan plot obtained by regressing sex-age adjusted height on variants using data from the training set (n = 222,648, unrelated White British).
Figure 2. Correlation (± SE) between sex-adjusted height and predicted height in the testing set, by the number of SNPs used.
Computational times involved in some of the most demanding tasks of the pipeline, performed on a node with four cores and 350 GB of RAM provided by the MSU High Performance Computing Center (HPCC).
| Task | Dimension | Median Time (SD), in seconds (s) or hours (hr) |
|---|---|---|
| | n = 410K | 0.27 hr (23.94 s) |
| | n = 20K | 1.00 hr (105.32 s) |
| | n = 223K | 0.57 hr (139.9 s) |
| | n = 223K | 0.156 s (0.039 s) |
| | n = 223K | 0.391 s (0.061 s) |
| | n = 223K | 0.612 s (0.042 s) |
| | n = 223K | 1.665 s (0.095 s) |
| | n = 223K | 3.355 s (0.126 s) |
| | n = 223K | 6.858 s (0.174 s) |
| | n = 223K | 10.804 s (0.240 s) |
| | n = 223K | 14.985 s (1.200 s) |
| | n = 223K | 20.901 s (0.536 s) |
Note: 223K and 625K denote n = 222,661 and p = 624,528, respectively.