Zhi Xiong, Qingrun Zhang, Alexander Platt, Wenyuan Liao, Xinghua Shi, Gustavo de Los Campos, Quan Long.
Abstract
Matrices representing genetic relatedness among individuals (i.e., Genomic Relationship Matrices, GRMs) play a central role in genetic analysis. The eigen-decomposition of GRMs (or the alternative of computing a smaller number of top singular values from the genotype matrix) is a necessary step in many analyses, including estimation of SNP-heritability, Principal Component Analysis (PCA), and genomic prediction. However, the GRMs and genotype matrices provided by modern biobanks are too large to be held in active memory. To accommodate current and future "bigger data", we develop a disk-based tool, Out-of-Core Matrices Analyzer (OCMA), which uses state-of-the-art computational techniques to perform eigen-decomposition and Singular Value Decomposition (SVD) efficiently. By integrating memory mapping (mmap) with the latest matrix factorization libraries, OCMA is fast and memory-efficient. To demonstrate its performance, we test it on a personal computer. For full eigen-decomposition, it solves an ordinary GRM (N = 10,000) in 55 s. For SVD, a commonly used faster alternative to full eigen-decomposition in genomic analyses, OCMA computes the top 200 singular values (SVs) in half an hour, the top 2,000 SVs in 0.95 h, and all 5,000 SVs in 1.77 h for a very large genotype matrix (N = 1,000,000, M = 5,000) on the same personal computer. OCMA also supports multi-threading when running on a desktop or an HPC cluster. OCMA can thus alleviate the computing bottleneck of classical analyses on large genomic matrices and make it possible to scale current and emerging analytical methods to big genomic data using lightweight computing resources.
Keywords: Eigen decomposition; Gene mapping; Genetic matrices; Genomic selection; Genotype-based phenotype prediction; Memory virtualization; Singular value decomposition
Year: 2019 PMID: 30482799 PMCID: PMC6325911 DOI: 10.1534/g3.118.200908
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
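The abstract treats SVD of the genotype matrix as an alternative to eigen-decomposing the GRM. The toy sketch below shows why the two are interchangeable. It assumes one common convention: each marker is standardized to mean 0 and variance 1, and the GRM is defined as Z Zᵀ / M. The record does not describe OCMA's own preprocessing, so that convention, the simulated genotypes, and all variable names are illustrative assumptions. Under this convention, the nonzero eigenvalues of the N × N GRM equal the squared singular values of the N × M matrix Z divided by M, and the leading eigenvectors match the left singular vectors up to sign. When M is much smaller than N, the SVD is the cheaper route.

```python
import numpy as np

rng = np.random.default_rng(42)
n_ind, n_snp = 300, 100  # toy N (individuals) and M (markers)

# Simulated 0/1/2 genotypes with marker-specific allele frequencies
freqs = rng.uniform(0.05, 0.5, size=n_snp)
geno = rng.binomial(2, freqs, size=(n_ind, n_snp)).astype(np.float64)

# Standardize each marker (column) to mean 0 and variance 1 (assumed convention)
z = (geno - geno.mean(axis=0)) / geno.std(axis=0)

# GRM under the assumed definition A = Z Z^T / M (N x N)
grm = z @ z.T / n_snp

# Route 1: full eigen-decomposition of the N x N GRM (eigh returns ascending order)
eigvals, eigvecs = np.linalg.eigh(grm)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Route 2: thin SVD of the N x M standardized genotype matrix
u, s, _ = np.linalg.svd(z, full_matrices=False)

# The M nonzero GRM eigenvalues equal s**2 / M ...
print(np.allclose(eigvals[:n_snp], s**2 / n_snp))
# ... and the leading eigenvectors match the left singular vectors up to sign
print(np.allclose(np.abs(np.sum(eigvecs[:, :5] * u[:, :5], axis=0)), 1.0))
```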
Figure 1. Schematic of mmap. The horizontal bar represents physical addresses on the disk, and the vertical bar represents virtual addresses in memory from the perspective of a computing process.
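The mapping in Figure 1 can be tried from Python. `numpy.memmap` maps a file into the process's virtual address space, and the operating system reads pages from disk only when the corresponding part of the array is accessed. The minimal sketch below assumes a raw, row-major float32 file and processes it in row blocks. The file name `grm.f32` and the 128-row block size are hypothetical, and the sketch does not reflect how OCMA itself lays out or traverses its files.

```python
import os
import tempfile

import numpy as np

n = 1000
path = os.path.join(tempfile.mkdtemp(), "grm.f32")  # hypothetical file name

# Write a symmetric float32 matrix to disk as raw row-major bytes
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n)).astype(np.float32)
a = (a + a.T) / 2
a.tofile(path)

# Map the file into virtual memory: no matrix data is read at this point
m = np.memmap(path, dtype=np.float32, mode="r", shape=(n, n))

# Touch the matrix one row block at a time; the OS pages in only what is used
row_sums = np.empty(n, dtype=np.float64)
block = 128
for start in range(0, n, block):
    stop = min(start + block, n)
    row_sums[start:stop] = m[start:stop].sum(axis=1, dtype=np.float64)

print(np.allclose(row_sums, a.sum(axis=1, dtype=np.float64)))
```

The same loop works unchanged if the file is larger than physical memory, because only the current block must be resident.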
Table 1. OCMA computes correct results. Eigenvectors (or singular vectors) calculated by OCMA are compared with those calculated by MATLAB (version 2012b) using the infinity norm (‖·‖∞), the taxicab norm (‖·‖₁), and the Euclidean norm (‖·‖₂). The comparison for eigen-decomposition is presented in the upper table, and the comparison for singular value decomposition is presented in the lower one. In the upper table, N denotes the number of individuals. In the lower table, N = 1,000,000, and M denotes the number of genetic markers. Configuration of the personal computer: Intel Core i7-6700 CPU (4 cores); memory, 24 GB; disk, Samsung SSD 850 EVO 250 GB. The operating system is Windows 7. Time is measured by wall-clock (instead of CPU time).
| N | Infinity norm | Taxicab norm | Euclidean norm | Computation time, OCMA (s) | Computation time, MATLAB (s) |
|---|---|---|---|---|---|
| 1000 | 5.2 × 10⁻⁷ | 7.6 × 10⁻⁸ | 1.1 × 10⁻⁷ | 0.1 | 0.4 |
| 2000 | 2.8 × 10⁻⁷ | 8.1 × 10⁻⁸ | 1.0 × 10⁻⁷ | 0.5 | 2.8 |
| 5000 | 4.3 × 10⁻⁷ | 2.1 × 10⁻⁷ | 2.4 × 10⁻⁷ | 6.4 | 35.5 |
| 10000 | 1.6 × 10⁻⁶ | 5.7 × 10⁻⁷ | 6.3 × 10⁻⁷ | 46.2 | 294.0 |
| 20000 | 3.3 × 10⁻⁶ | 4.4 × 10⁻⁶ | 4.3 × 10⁻⁶ | 246.4 | 2905.1 |
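The caption does not say how the per-vector norms in Table 1 were aggregated or how the sign ambiguity of eigenvectors was handled (an eigenvector v and its negation −v are equally valid). The minimal sketch below makes two assumptions. First, each test eigenvector is sign-aligned with its reference before differencing. Second, the worst case over all eigenvectors is reported. The reference uses double precision. The "test" eigenvectors come from the same matrix rounded to float32, which stands in for a single-precision solver such as ssyevd (the `s` prefix in LAPACK denotes single precision). The helper names `align_signs` and `difference_norms` are hypothetical.

```python
import numpy as np

def align_signs(v_test, v_ref):
    """Flip columns of v_test so each has a non-negative inner product with
    the matching column of v_ref (eigenvectors are unique only up to sign)."""
    signs = np.sign(np.sum(v_test * v_ref, axis=0))
    signs[signs == 0] = 1.0
    return v_test * signs

def difference_norms(v_test, v_ref):
    """Worst case over columns of the infinity, taxicab and Euclidean norms
    of the column-wise differences after sign alignment."""
    d = align_signs(v_test, v_ref) - v_ref
    inf_norm = np.abs(d).max(axis=0).max()
    taxicab = np.abs(d).sum(axis=0).max()
    euclidean = np.sqrt((d * d).sum(axis=0)).max()
    return inf_norm, taxicab, euclidean

# Symmetric test matrix with well-separated eigenvalues 1, 2, ..., n
rng = np.random.default_rng(7)
n = 200
q, _ = np.linalg.qr(rng.standard_normal((n, n)))
a = (q * np.arange(1, n + 1)) @ q.T
a = (a + a.T) / 2

# Double-precision reference vs. eigenvectors of the float32-rounded matrix
_, v_ref = np.linalg.eigh(a)
_, v_test = np.linalg.eigh(a.astype(np.float32).astype(np.float64))
v_test[:, ::2] *= -1.0  # arbitrary sign flips, as two different solvers may return

print(difference_norms(v_test, v_ref))
```

Without the sign alignment, every flipped column would contribute a difference of norm about 2, which would swamp the rounding-level discrepancies the table is meant to show.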
Table 2. Runtime and memory consumption of OCMA for eigen-decomposition on a Linux personal computer. The same hardware as in Table 1 is used. The operating system is CentOS Linux release 7.3.1611. Four threads are used for the test. The three "Computation time" columns give the runtime of three tools: GCTA (Yang et al.), OCMA using memory only, and OCMA using disk (based on the mmap technique). The careful interpretation of the comparison between GCTA and OCMA is explained in the main text and Supplementary Notes II. *The usage is slightly larger than the physical memory because of swapping by the operating system. **GCTA was tested on an HPC cluster when the memory of the personal computer was insufficient. Time is measured by wall-clock (instead of CPU time). Memory consumption is estimated using the formula on the GCTA website and the Intel MKL specification (detailed in the Supplementary Notes). The sign “/” indicates that the calculation could not run because of limited memory and swap space, or that it exceeded the maximum runtime allowed on the HPC cluster.
| N | Computation time, GCTA (GREML) | Computation time, OCMA (Memory) | Computation time, OCMA (Disk) | Memory usage, GCTA | Memory usage, OCMA (Memory) |
|---|---|---|---|---|---|
| 10000 | 286.0 s | 41.6 s | 55.0 s | 1.6 GB | 1.1 GB |
| 20000 | 2988 s | 214.5 s | 231.8 s | 6.4 GB | 4.5 GB |
| 30000 | 14760 s | 690.1 s | 889.0 s | 14.4 GB | 10.1 GB |
| 40000 | 8.4 h | 0.44 h | 1.13 h | 25.6 GB | 17.9 GB |
| 50000 | 14.2 h | 0.94 h | 4.20 h | 40.0 GB** | 27.9 GB* |
| 60000 | 27.3 h | / | 8.61 h | 57.6 GB** | / |
| 70000 | 50.3 h | / | 14.85 h | / | / |
| 80000 | 91.3 h | / | 40.90 h | / | / |
| 90000 | / | / | 84.89 h | / | / |
| 100000 | / | / | 127.91 h | / | / |
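The exact memory formulas used for Table 2 are in the Supplementary Notes, which are not included here. The back-of-envelope reconstruction below reproduces the listed values, but every formula in it is an assumption. For OCMA, it counts the float32 GRM, the eigenvalue vector, and the minimum LAPACK workspace for `ssyevd` when eigenvectors are requested: LWORK = 1 + 6N + 2N² floats and LIWORK = 3 + 5N integers. For GCTA, it assumes 16 bytes per GRM entry, for example two N × N double-precision matrices. The OCMA column matches only if bytes are divided by 2³⁰ (GiB), and the GCTA column only if they are divided by 10⁹ (GB). That difference in units is part of the assumption. The function names are hypothetical.

```python
def ocma_memory_gib(n: int) -> float:
    """Assumed memory (GiB) for single-precision ssyevd with eigenvectors."""
    grm = n * n                    # N x N GRM, overwritten in place by eigenvectors
    eigenvalues = n                # output vector of eigenvalues
    work = 1 + 6 * n + 2 * n * n   # minimum LWORK for ssyevd with JOBZ='V'
    iwork = 3 + 5 * n              # minimum LIWORK, assumed 4-byte integers
    return 4 * (grm + eigenvalues + work + iwork) / 2**30  # 4 bytes per float32

def gcta_memory_gb(n: int) -> float:
    """Assumed 16 bytes per GRM entry, reported in decimal GB."""
    return 16 * n * n / 1e9

for n in (10_000, 20_000, 30_000, 40_000, 50_000):
    print(f"N={n}: GCTA {gcta_memory_gb(n):.1f} GB, OCMA {ocma_memory_gib(n):.1f} GB")
```

The dominant term for OCMA is 3N² single-precision values (the matrix plus the 2N² workspace), which is why memory grows quadratically and the in-memory mode runs out at N = 60,000 on a 24 GB machine.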
Table 3. Comparison of OCMA runtimes for eigen-decomposition under the Windows and Linux operating systems. The same machine described in Tables 1 and 2 was used. Four threads are used in the test. Time is measured by wall-clock (instead of CPU time). The sign “/” indicates that the calculation could not run because of limited memory and swap space.
| N | Windows (Memory) | Windows (Disk) | Linux (Memory) | Linux (Disk) |
|---|---|---|---|---|
| 10000 | 46.2 s | 46.2 s | 41.6 s | 55.0 s |
| 20000 | 246.4 s | 252.6 s | 214.5 s | 231.8 s |
| 30000 | 760.7 s | 775.3 s | 690.1 s | 889.0 s |
| 40000 | 0.50 h | 0.51 h | 0.44 h | 1.13 h |
| 50000 | 1.02 h | 1.05 h | 0.94 h | 4.20 h |
| 60000 | 1.96 h | 2.25 h | / | 8.61 h |
| 70000 | / | 3.43 h | / | 14.85 h |
| 80000 | / | 18.31 h | / | 40.90 h |
| 90000 | / | 52.53 h | / | 84.89 h |
| 100000 | / | 132.91 h | / | 127.91 h |
Table 4. Runtime of OCMA for Singular Value Decomposition (SVD). Sample size N = 1 million. M is the number of selected genetic markers. The same machine described in Tables 1 and 2 is used. Four threads are used in the test. Time is measured by wall-clock (instead of CPU time). The sign “/” indicates that the calculation could not run because of limited memory and swap space.
| M | Windows (Memory) | Windows (Disk) | Linux (Memory) | Linux (Disk) |
|---|---|---|---|---|
| 1000 | 89.5 s | 102.5 s | 93.6 s | 309.9 s |
| 2000 | 226.1 s | 230.9 s | 208.0 s | 3564 s |
| 3000 | 450.8 s | 523.4 s | 391.8 s | 11232 s |
| 4000 | 0.41 h | 0.42 h | / | 4.94 h |
| 5000 | 1.50 h | 1.77 h | / | 6.58 h |
| 6000 | 3.21 h | 3.38 h | / | 9.94 h |
| 7000 | / | 8.95 h | / | 16.09 h |
| 8000 | / | 16.40 h | / | 21.16 h |
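Table 4 concerns very tall matrices (N = 1,000,000 rows, M ≤ 8,000 columns). One standard way to obtain the SVD of such a matrix without ever forming the N × N GRM is to accumulate the M × M cross-product XᵀX in row blocks and then eigen-decompose it. The record does not state which SVD algorithm OCMA uses, so the sketch below illustrates that general technique only. The function name `tall_svd_via_gram` and the `block_rows` parameter are hypothetical. Forming XᵀX squares the condition number of X, so the smallest singular values lose accuracy. This is acceptable for well-conditioned data like the toy matrix here.

```python
import numpy as np

def tall_svd_via_gram(x, block_rows=1024):
    """Singular values and right singular vectors of a tall N x M matrix (N >> M)
    from the eigen-decomposition of the M x M cross-product X^T X."""
    n, m = x.shape
    gram = np.zeros((m, m))
    # One pass over the rows; x could equally be a np.memmap backed by a file
    for start in range(0, n, block_rows):
        chunk = np.asarray(x[start:start + block_rows], dtype=np.float64)
        gram += chunk.T @ chunk
    evals, v = np.linalg.eigh(gram)         # ascending order
    evals, v = evals[::-1], v[:, ::-1]      # descending, as an SVD reports them
    s = np.sqrt(np.clip(evals, 0.0, None))  # clip tiny negative rounding errors
    # Left singular vectors, if needed, require a second pass: U = X @ V / s
    return s, v

rng = np.random.default_rng(1)
x = rng.standard_normal((5000, 40))
s, v = tall_svd_via_gram(x, block_rows=512)
print(np.allclose(s, np.linalg.svd(x, compute_uv=False)))
```

Only the M × M matrix and one row block are held in memory at a time, which is what makes this pattern attractive when N is in the millions.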
Table 5. Runtime of OCMA for a subset of singular values. Sample size N = 1 million. M, the number of selected genetic markers, is 5,000. K is the number of top SVs computed. The same machine described in Tables 1 and 2 is used. Four threads are used in the test. Time is measured by wall-clock (instead of CPU time).
| K | Windows (Memory) | Windows (Disk) | Linux (Memory) | Linux (Disk) |
|---|---|---|---|---|
| 10 | 0.30 h | 0.30 h | 0.206 h | 4.46 h |
| 20 | 0.30 h | 0.30 h | 0.207 h | 4.60 h |
| 50 | 0.30 h | 0.30 h | 0.209 h | 4.79 h |
| 100 | 0.31 h | 0.31 h | 0.211 h | 4.87 h |
| 200 | 0.32 h | 0.32 h | 0.215 h | 4.95 h |
| 500 | 0.32 h | 0.32 h | 0.225 h | 5.01 h |
| 1000 | 0.33 h | 0.33 h | 0.230 h | 5.10 h |
| 2000 | 0.83 h | 0.95 h | / | 5.65 h |
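Table 5 shows that runtime grows only slowly with K up to 1,000, because computing a few top singular values costs much less than computing all of them. The record does not name the subset algorithm OCMA uses. The sketch below therefore swaps in a different, widely used technique: a randomized range finder with power iterations. It is included only to show how a top-K computation can avoid a full decomposition. The function name and its `oversample`, `n_iter`, and `seed` parameters are hypothetical choices for this illustration. The test matrix has known, geometrically decaying singular values, so the result can be checked exactly.

```python
import numpy as np

def top_k_singular_values(x, k, oversample=10, n_iter=2, seed=0):
    """Approximate the k largest singular values of x with a randomized
    range finder plus power iterations."""
    rng = np.random.default_rng(seed)
    m = x.shape[1]
    ell = min(k + oversample, m)
    # Orthonormal basis for the range of x, sampled with a Gaussian test matrix
    q, _ = np.linalg.qr(x @ rng.standard_normal((m, ell)))
    for _ in range(n_iter):
        # Power iterations with re-orthonormalization sharpen the basis
        q, _ = np.linalg.qr(x.T @ q)
        q, _ = np.linalg.qr(x @ q)
    # The small ell x M projection carries the top part of the spectrum
    return np.linalg.svd(q.T @ x, compute_uv=False)[:k]

# Test matrix with known singular values 1, 1/2, 1/4, ...
rng = np.random.default_rng(3)
n, m = 2000, 60
u, _ = np.linalg.qr(rng.standard_normal((n, m)))
w, _ = np.linalg.qr(rng.standard_normal((m, m)))
true_s = 0.5 ** np.arange(m)
x = (u * true_s) @ w.T
print(np.allclose(top_k_singular_values(x, k=5), true_s[:5]))
```

The expensive work is a handful of products with x against thin N × ell or M × ell matrices. The cost therefore scales with k + oversample rather than with M.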
Table 6. Comparison between OCMA and OpenBLAS, a multi-threaded linear algebra library. The same LAPACK function (ssyevd) is used in OpenBLAS. The computing node has two Intel Xeon E5-2690 v4 CPUs with 28 cores in total and 192 GB of available memory. The environment variables MKL_NUM_THREADS and OPENBLAS_NUM_THREADS are used to specify the number of threads in OCMA and OpenBLAS, respectively.
| N | Memory | #Threads | Runtime, OCMA | Runtime, OpenBLAS |
|---|---|---|---|---|
| 10000 | 1.1 GB | 10 | 16.4 s | 29.3 s |
| 10000 | 1.1 GB | 20 | 14.3 s | 24.3 s |
| 20000 | 4.5 GB | 10 | 125.9 s | 187.0 s |
| 20000 | 4.5 GB | 20 | 73.1 s | 156.5 s |
| 50000 | 27.9 GB | 10 | 0.47 h | 0.76 h |
| 50000 | 27.9 GB | 20 | 0.25 h | 0.52 h |
| 100000 | 111.8 GB | 10 | 3.63 h | 4.90 h |
| 100000 | 111.8 GB | 20 | 1.90 h | 3.54 h |
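To quantify the thread scaling in Table 6, the short script below computes, from the table values alone, the speedup from 10 to 20 threads (t₁₀ / t₂₀). It also computes the parallel efficiency (speedup divided by 2, the factor by which the thread count grows) and the OpenBLAS-to-OCMA runtime ratio at 20 threads. It adds no new measurements. Only ratios are computed, so each row can stay in its own time unit.

```python
# Runtimes from Table 6 as (10 threads, 20 threads), in the units used in the table
runtimes = {
    10_000: {"OCMA": (16.4, 14.3), "OpenBLAS": (29.3, 24.3)},     # seconds
    20_000: {"OCMA": (125.9, 73.1), "OpenBLAS": (187.0, 156.5)},  # seconds
    50_000: {"OCMA": (0.47, 0.25), "OpenBLAS": (0.76, 0.52)},     # hours
    100_000: {"OCMA": (3.63, 1.90), "OpenBLAS": (4.90, 3.54)},    # hours
}

def scaling(t10: float, t20: float) -> tuple[float, float]:
    speedup = t10 / t20         # how much faster 20 threads are than 10
    efficiency = speedup / 2.0  # 1.0 would mean perfect scaling when doubling threads
    return speedup, efficiency

for n, tools in runtimes.items():
    for tool, (t10, t20) in tools.items():
        s, e = scaling(t10, t20)
        print(f"N={n:>7} {tool:<8} speedup {s:.2f}  efficiency {e:.0%}")
    ratio = tools["OpenBLAS"][1] / tools["OCMA"][1]
    print(f"N={n:>7} OpenBLAS/OCMA runtime at 20 threads: {ratio:.2f}")
```

For the largest matrix (N = 100,000), OCMA's speedup from doubling threads is about 1.91, compared with about 1.38 for OpenBLAS.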