Gad Abraham, Michael Inouye.
Abstract
Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.
Year: 2014 PMID: 24718290 PMCID: PMC3981753 DOI: 10.1371/journal.pone.0093766
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. (a) The first two principal components from analyzing the HapMap3 dataset. (b) Scatter plots showing near-perfect absolute Pearson correlation (lower left-hand corner) between the 1st PCs estimated by smartpca, flashpca, shellfish, and R’s prcomp (using the standardization from Equation 4). Note that since eigenvectors are only defined up to sign, the correlations may be negative as well as positive. In addition, the scale of the PCs may differ between the methods; however, this has no bearing on the interpretation of the PCs.
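The sign and scale remarks in the Figure 1 caption can be illustrated with a small Python sketch. It is not taken from the paper: the helper name abs_pearson and the simulated score vectors are hypothetical, and NumPy is assumed to be available. It shows that two tools reporting the same axis with opposite sign and different scale still agree perfectly in absolute Pearson correlation.

```python
import numpy as np

def abs_pearson(a, b):
    """Absolute Pearson correlation between two vectors of PC scores."""
    return abs(np.corrcoef(a, b)[0, 1])

# Hypothetical PC1 scores for 100 individuals from one tool (simulated data).
rng = np.random.default_rng(0)
pc1_tool_a = rng.standard_normal(100)

# The same axis as reported by a second tool: flipped sign, different scale.
pc1_tool_b = -3.0 * pc1_tool_a

signed_r = np.corrcoef(pc1_tool_a, pc1_tool_b)[0, 1]  # negative: sign flip
agreement = abs_pearson(pc1_tool_a, pc1_tool_b)       # unaffected by sign/scale
```

Because Pearson correlation is invariant to rescaling, only the sign needs special handling, which is why the figure reports the absolute value.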
Figure 2. Total wall time (seconds) for flashpca versus EIGENSOFT’s smartpca and shellfish on increasing subsets of the celiac disease dataset, employing multi-threaded mode (8 threads), using 43,049 SNPs.
shellfish did not complete PCA for the subsets of 50,000 or more individuals, and smartpca was stopped after 100,000 sec. The results shown are averages over 3 runs. Results for up to 15,000 individuals are based on subsamples of the original dataset of 16,002 individuals (light blue background), whereas results for 50,000 or more individuals are based on duplicating the original samples (light yellow background).
Algorithm 1.
Pseudocode for the eigen-decomposition variant of the fast PCA, based on the randomized algorithm of [5], for the case where [condition missing from this record]. The pseudocode relies on: the standardization in Equation 4; a function generating an iid multivariate normal matrix; the user-defined number of extra dimensions; the QR decomposition; a function that divides each column by its norm; and the eigen-decomposition, which produces the top eigenvectors and the vector of top eigenvalues. The output is the matrix of principal components.
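The operations named in the Algorithm 1 caption can be sketched in Python with NumPy. This is a minimal illustration of a randomized PCA of this general kind, not the authors' pseudocode and not the flashpca implementation. The function name randomized_pca, its parameters, the number of power iterations, and the column z-score (a stand-in for the standardization in Equation 4, which is not reproduced in this record) are all assumptions.

```python
import numpy as np

def randomized_pca(X, n_components=2, n_extra=10, n_iter=2, seed=0):
    """Approximate the top principal components of X (individuals x SNPs).

    Assumes n_components + n_extra <= min(X.shape).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # ASSUMPTION: plain column z-score, standing in for the paper's Equation 4.
    X = X - X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0          # leave constant columns at zero
    X = X / sd

    n, p = X.shape
    k = n_components + n_extra  # requested dimensions plus extra dimensions

    # Project onto an iid standard normal matrix, then scale each column
    # to unit norm.
    Y = X @ rng.standard_normal((p, k))
    Y /= np.linalg.norm(Y, axis=0)

    # Power iterations sharpen the separation between top and remaining axes.
    for _ in range(n_iter):
        Y = X @ (X.T @ Y)
        Y /= np.linalg.norm(Y, axis=0)

    # Orthonormal basis for the sampled subspace via QR decomposition.
    Q, _ = np.linalg.qr(Y)

    # Eigen-decomposition of the small k x k matrix (Q^T X)(Q^T X)^T.
    B = Q.T @ X
    evals, evecs = np.linalg.eigh(B @ B.T)
    top = np.argsort(evals)[::-1][:n_components]
    evals = evals[top]

    # Map eigenvectors back to individual space and scale to get the PCs.
    U = Q @ evecs[:, top]
    P = U * np.sqrt(evals)
    return P, evals
```

Calling randomized_pca(G, n_components=10) on a hypothetical genotype matrix G returns the matrix of principal components and the matching eigenvalues. The expensive eigen-decomposition is performed on a matrix whose side is n_components + n_extra rather than the number of individuals or SNPs, which is where the time saving in this kind of approach comes from.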