Aman Agrawal, Alec M Chiu, Minh Le, Eran Halperin, Sriram Sankararaman.
Abstract
Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.
Year: 2020 PMID: 32469896 PMCID: PMC7286535 DOI: 10.1371/journal.pgen.1008773
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 6.020
ProPCA accurately estimates principal components relative to other methods.
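| F | ProPCA (K = 5) | ProPCA (K = 10) | FlashPCA2 (K = 5) | FlashPCA2 (K = 10) | fastPCA (K = 5) | fastPCA (K = 10) | PLINK2 (K = 5) | PLINK2 (K = 10) | bigsnpr (K = 5) | bigsnpr (K = 10) | TeraPCA (K = 5) | TeraPCA (K = 10) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|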
| 0.001 | 0.987 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.999 |
| 0.002 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.003 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.004 | 0.999 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.005 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.006 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.007 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.008 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.009 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| 0.010 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
The principal components computed by ProPCA are compared to the PCs obtained from a full SVD on a genotype dataset containing 50,000 SNPs and 10,000 individuals. Accuracy was measured by the mean of explained variance (MEV), which measures the overlap between the set of PCs inferred by ProPCA and those from the SVD, across values of F ∈ {0.001, …, 0.01}. We report MEV for K = 5 PCs using 5 populations as well as for K = 10 PCs using 10 populations. All methods shown were run using their default parameters.
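As a rough illustration of how an overlap score such as MEV can be computed, the sketch below assumes MEV is the average, over the estimated PCs, of the norm of each PC's projection onto the subspace spanned by the reference PCs. The caption only says that MEV measures overlap, so this formula, the function name `mean_explained_variance`, and the random stand-in data are assumptions made for illustration.

```python
import numpy as np

def mean_explained_variance(V_est, U_ref):
    """Assumed MEV: mean norm of each estimated PC projected onto the reference subspace.

    Both arguments hold orthonormal PCs as columns, shape (n_features, K).
    """
    proj = U_ref.T @ V_est                      # coordinates of each estimated PC in the reference basis
    return float(np.mean(np.linalg.norm(proj, axis=0)))

rng = np.random.default_rng(0)
G = rng.standard_normal((200, 50))              # random stand-in for a centered genotype matrix
U, _, _ = np.linalg.svd(G, full_matrices=False)
U_ref = U[:, :5]                                # reference PCs from a full SVD

# A rotation of the reference PCs spans the same subspace, so the score stays at 1.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
mev_same = mean_explained_variance(U_ref @ Q, U_ref)

# PCs orthogonal to the reference subspace score 0.
mev_other = mean_explained_variance(U[:, 5:10], U_ref)
```

Under this definition the score depends only on the subspace spanned by the PCs, so it does not penalize a method for returning a rotated or sign-flipped basis.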
Fig 1. ProPCA is computationally efficient.
Comparison of runtimes on simulated genotype data as the numbers of individuals and SNPs vary. Figures 1a and 1b display the total runtime on datasets containing 100,000 SNPs, six subpopulations, F = 0.01, and between 10,000 and 1,000,000 individuals. We report the mean and standard deviation over ten trials. Figure 1b compares the runtimes of all algorithms excluding PLINK_SVD, which ran successfully only up to a sample size of 70,000. Figure 1c displays the total runtime on datasets containing 100,000 individuals, six subpopulations, F = 0.01, and between 10,000 and 1,000,000 SNPs. All methods were capped at a maximum of 100 hours and a maximum memory of 64 GB and were run using default settings. We were unable to include bigstatsr in the SNP benchmark as it does not allow for monomorphic SNPs.
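For readers who want to generate comparable benchmark data, the following is a minimal sketch of simulating genotypes for several subpopulations at a given F. It assumes the Balding-Nichols model, which this caption does not name, and the function name `simulate_genotypes`, the ancestral-frequency range, and the small dataset sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_genotypes(n_individuals, n_snps, n_pops, F, seed=0):
    """Simulate a 0/1/2 genotype matrix under an assumed Balding-Nichols model."""
    rng = np.random.default_rng(seed)
    # Ancestral allele frequencies (the range is an assumption, not taken from the source).
    p_anc = rng.uniform(0.05, 0.95, size=n_snps)
    # Subpopulation frequencies drift around the ancestral ones; smaller F means less drift.
    a = p_anc * (1 - F) / F
    b = (1 - p_anc) * (1 - F) / F
    p_pop = rng.beta(a, b, size=(n_pops, n_snps))
    # Assign each individual to a subpopulation, then draw two alleles per SNP.
    labels = rng.integers(0, n_pops, size=n_individuals)
    G = rng.binomial(2, p_pop[labels]).astype(np.int8)
    return G, labels

# Six subpopulations and F = 0.01 as in the caption; sizes are kept small here.
G, labels = simulate_genotypes(n_individuals=1000, n_snps=2000, n_pops=6, F=0.01)
```

Random draws like these can produce monomorphic SNPs (columns with a single observed genotype), which is the kind of input the caption says bigstatsr does not accept.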
Runtimes of methods on largest simulated datasets for 40 principal components.
| Method | SNPs (hours) | Individuals (hours) |
|---|---|---|
| bigstatsr | - | 103 |
| FastPCA | - | - |
| | 93 | 114 |
| | 74 | 72 |
| | 35 | 28 |
| | 49 | 48 |
We computed 40 PCs with each method on each of our largest simulated datasets. Times are reported in hours. The ‘SNPs’ column contains the runtime on a dataset of 1 million SNPs and 10,000 individuals, while the ‘Individuals’ column contains the runtime on a dataset of 1 million individuals and 10,000 SNPs. FastPCA could not be run to completion on either dataset due to a segmentation fault, while bigstatsr could not run on the SNPs dataset due to the inclusion of monomorphic SNPs. All methods were run with default parameters except TeraPCA, which was run with ‘-rfetched 4000’ for the SNPs dataset and ‘-rfetched 2000’ for the Individuals dataset due to a segmentation fault.
Fig 2. Principal components uncover population and geographic structure in the UK Biobank.
We used ProPCA to compute PCs on the UK Biobank data. Figure 2a plots the first two principal components, which reveal population structure. Figure 2b shows geographic structure by plotting the scores of 276,736 unrelated White British individuals on the first principal component at their birth location coordinates.
Fig 3. Selection scan for the first five principal components in the White British individuals in the UK Biobank.
A Manhattan plot of the −log10 p-values from the test of selection for each of the first five principal components in the unrelated White British subset of the UK Biobank. The red line represents the Bonferroni-adjusted significance level (α = 0.05). Significant loci are labeled. Signals above −log10(p) = 18 were capped at this value for better visualization.
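The position of a Bonferroni line and the capping of extreme signals can be sketched as follows. The number of tests is an assumption: it takes the 146,671 SNPs from the abstract and one test per SNP per PC, which may not match the SNP set or the correction actually used in the scan. The function names are hypothetical.

```python
import math

def bonferroni_neglog10_threshold(alpha, n_tests):
    """Height of the Bonferroni line on a -log10(p) axis."""
    return -math.log10(alpha / n_tests)

def cap_for_plot(neglog10_p, cap=18.0):
    """Clip very strong signals so weaker ones stay readable on the plot."""
    return [min(v, cap) for v in neglog10_p]

n_tests = 146_671 * 5   # assumption: one test per SNP for each of five PCs
threshold = bonferroni_neglog10_threshold(0.05, n_tests)
capped = cap_for_plot([3.2, 25.0, 18.0])
```

Under these assumptions the line falls slightly above 7 on the −log10 scale. Capping changes only how signals are displayed, not which loci pass the threshold.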