Literature DB >> 32469896

Scalable probabilistic PCA for large-scale genetic variation data.

Aman Agrawal¹, Alec M Chiu², Minh Le³, Eran Halperin^3,4,5,6,7, Sriram Sankararaman^3,4,6.

Abstract

Principal component analysis (PCA) is a key tool for understanding population structure and controlling for population stratification in genome-wide association studies (GWAS). With the advent of large-scale datasets of genetic variation, there is a need for methods that can compute principal components (PCs) with scalable computational and memory requirements. We present ProPCA, a highly scalable method based on a probabilistic generative model, which computes the top PCs on genetic variation data efficiently. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in about thirty minutes. To illustrate the utility of computing PCs in large samples, we leveraged the population structure inferred by ProPCA within White British individuals in the UK Biobank to identify several novel genome-wide signals of recent putative selection including missense mutations in RPGRIP1L and TLR4.

Entities: Chemical

Mesh：

Substances：

Year: 2020 PMID： 32469896 PMCID： PMC7286535 DOI： 10.1371/journal.pgen.1008773

Source DB: PubMed Journal: PLoS Genet ISSN： 1553-7390 Impact factor: 6.020

Introduction

Inference of population structure is a key step in population genetic analyses [1] with applications that include understanding genetic ancestry [2-4] and controlling for confounding in genome-wide association studies (GWAS) [5]. While several methods have been proposed to infer population structure (e.g., [6-10]), principal component analysis (PCA) is one of the most widely used [6, 11]. Unfortunately, the naive approach for estimating principal components (PCs) by computing a full singular value decomposition (SVD) scales quadratically with sample size (for datasets where the number of SNPs is larger than sample size), resulting in runtimes unsuitable for large data sets. In light of these challenges, several solutions have been proposed for the efficient computation of PCs. One approach taken by many recent scalable implementations (FastPCA [12], FlashPCA2 [13], bigsnpr [14], TeraPCA [15], PLINK2 [16]) takes advantage of the fact that typical applications of PCA in genetics only require computing a small number of top PCs; e.g. GWAS typically use 5-20 PCs to correct for stratification [17]. These methods can be grouped according to their underlying algorithm: blanczos (FastPCA, PLINK2, TeraPCA) or the implicitly restarted Arnoldi algorithm (FlashPCA2, bigsnpr). An alternative approach for efficient computation of PCs takes advantage of the parallel computation infrastructure of the cloud [18]. However, the cost of cloud usage is roughly proportional to the number of CPU hours used by these algorithms, making them cost-prohibitive. Finally, these scalable implementations lack a full probabilistic model, making them challenging to extend to settings with missing genotypes or linkage disequilibrium (LD) between SNPs. In this work, we describe ProPCA, a scalable method to compute the top PCs on genotype data. ProPCA is based on a previously proposed probabilistic model [19, 20], of which PCA is a special case. While PCA treats the PCs and the PC scores as fixed parameters, probabilistic PCA imposes a prior on the PC scores. This formulation leads to an iterative Expectation Maximization (EM) algorithm for computing the PCs. ProPCA leverages the structure of genotype data to further reduce the computation time in each iteration of the EM algorithm. The EM algorithm requires only a small number of iterations to obtain accurate estimates of the PCs resulting in a highly scalable algorithm. ProPCA obtains a computational speed-up through the integration of the Mailman algorithm [21] into its EM algorithm. The Mailman algorithm allows for fast matrix-vector multiplication when there are a finite number of values (e.g. genotypes) in exchange for additional memory usage. As a result, ProPCA requires more memory than some of the other scalable PCA methods. However, the increased memory consumption is reasonable; often still within the memory available within typical computing environments. In both simulated and real data, ProPCA is able to accurately infer the top PCs while scaling favorably with increasing sample size. We applied ProPCA to compute the top five PCs on genotype data from the UK Biobank, consisting of 488,363 individuals and 146,671 SNPs, in less than thirty minutes. To illustrate how the ability to compute PCs in large samples can lead to biological discovery, we leveraged the population structure inferred by ProPCA within the White British individuals in the UK Biobank [22] to scan for SNPs that are not well-modeled by the top PCs and, consequently, identify several novel genome-wide signals of recent positive selection. Our scan recovers sixteen loci that are highly differentiated across the top five PCs that are likely signals of recent selection. While these loci include previously reported targets of selection [12], the larger sample size that we analyze here allows us to identify eleven novel signals including a missense mutation in RPGRIP1L (p = 2.09 × 10−9) and another in TLR4 (p = 7.60 × 10−12). A number of algorithms that analyze genotype data, including methods for heritability estimation and association testing, can be modeled as iterative procedures where the core computational operation is similar to that solved by ProPCA. Thus, the algorithm that we employ in this work can potentially lead to highly scalable algorithms for a broad set of population genetic analyses.

Results

Accuracy

We first assessed the accuracy of ProPCA using the simulation framework described in the Methods. We generated datasets containing 50, 000 SNPs and 10, 000 individuals across q populations, where q was chosen to be 5 and 10. The populations were simulated with varying levels of population differentiation that are typical of present-day human populations (values of F ranging from 0.001 to 0.01) and were small enough so that we could compute the full SVD thereby allowing us to estimate the accuracy of the PCs computed by ProPCA. To measure accuracy, we computed the mean of explained variances (MEV), a measure of the overlap between the subspaces spanned by the PCs estimated by ProPCA compared to the PCs computed using a full SVD (Methods). ProPCA, All methods are able to estimate highly accurate PCs (values of MEV close to 1) across the range of parameters (Table 1).

Table 1

ProPCA accurately estimates principal components relative to other methods.

F_st	ProPCA		FlashPCA2		fastPCA		PLINK2		bigsnpr		TeraPCA
	K = 5	K = 10	K = 5	K = 10	K = 5	K = 10	K = 5	K = 10	K = 5	K = 10	K = 5	K = 10
0.001	0.987	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	0.999
0.002	0.999	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.003	0.999	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.004	0.999	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.005	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.006	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.007	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.008	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.009	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000
0.010	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000	1.000

The principal components computed by ProPCA are compared to the PCs obtained from a full SVD on a genotype dataset containing 50, 000 SNPs and 10, 000 individuals. Accuracy was measured by the mean of explained variance (MEV) which measures the overlap between the set of PCs inferred from ProPCA and those from SVD across values of F ∈ {0.001, …, 0.01}. We report MEV for K = 5 using 5 populations as well as for K = 10 PCs using 10 populations. Methods shown are run using their default parameters.

Runtime

We assessed the scalability of ProPCA with increasing sample size (Methods). We simulated genotypes from six populations containing 100, 000 SNPs and sample sizes varying from 10, 000 to 1, 000, 000 with F = 0.01. We compared the wall-clock time for running ProPCA, the SVD implementation in PLINK (PLINK_SVD [23]), FastPCA [12], FlashPCA2 [13], bigsnpr [14], PLINK2 [16], TeraPCA [15]). The SVD implementation in PLINK could not run in reasonable time on datasets exceeding 70, 000 individuals (Fig 1a). While all the other methods scale with sample size, ProPCA is faster than the methods compared against (Fig 1b). ProPCA computes PCs in about 30 minutes even on the largest data containing a million individuals and 100, 000 SNPs. We similarly explored how each method scale in terms of the number of variants. We repeated our experiment by varying the number of SNPs from 10, 000 to 1, 000, 000 while keeping the sample size constant at 100, 000 and found similar results (Fig 1c).

Fig 1

ProPCA is computationally efficient.

ProPCA is computationally efficient.

Comparison of runtimes over simulated genotype data varied over individuals and SNPs. Figures 1a and 1b display the total runtime containing 100, 000 SNPs, six subpopulations, F = 0.01 and individuals varying from 10, 000 to 1, 000, 000. We report the mean and standard deviation over ten trials. Figure 1b compares the runtimes of all algorithms excluding PLINK_SVD which could only run successfully up to a sample size of 70, 000. Figure 1c displays the total runtime containing 100, 000 individuals, six subpopulations, F = 0.01, and SNPs varying from 10, 000 to 1, 000, 000. All methods were capped to a maximum of 100 hours and a maximum memory of 64 GB and run using default settings. We were unable to include bigstatsr in the SNP benchmark as it does not allow for monomorphic SNPs. We further tested runtime as a function of the number of PCs on a simulated dataset containing 10,000 individuals, 50,000 SNPs, and 20 latent populations separated at F = 0.01. We find that PLINK2 and ProPCA scale linearly when computing the upto the top 40 PCs (S1 Fig). FlashPCA2 is efficient at computing 2-20 PCs, but increases in runtime when computing a single PC or more than 20 PCs. We found that this trend was both reproducible across different datasets. We also tested bigsnpr/bigstatsr, which uses the same underlying algorithm as FlashPCA2 and found a similar trend, i.e., its performance to be similar to FlashPCA2 in that it is efficient at computing 1-20 PCs, but we see a steady increase in runtime after 20 PCs. We tested a final scenario in which each method computed 40 PCs on our two largest simulated datasets containing one million SNPs and 10,000 individuals dataset as well as the one million individuals and 10,000 SNPs dataset (Table 2). We find that ProPCA can compute the 40 PCs most efficiently under two days for both datasets while other methods required 2-4 days.

Table 2

Runtimes of methods on largest simulated datasets for 40 principal components.

Method	SNPs	Individuals
bigstatsr	-	103
FastPCA	-	-
FlashPCA2	93	114
PLINK2	74	72
ProPCA	35	28
TeraPCA	49	48

We computed 40 PCs from each method on each of our largest simulated datasets. Times are reported in hours. The ‘SNPs’ column contains the runtime on a 1 million SNP and 10,000 individuals dataset while the ‘Individuals’ contains the runtime on a 1 million individual and 10,000 SNP dataset. FastPCA could not be run to completion on either dataset due to a segmentation fault while bigstatsr could not run on the SNPs dataset due to the inclusion of monomorphic SNPs. All methods were run with default parameters except TeraPCA, which was run with ‘-rfetched 4000’ for the SNPs dataset and ‘-rfetched 2000’ for the Individuals dataset due to a segementation fault. Since ProPCA, FastPCA, and FlashPCA2 are all based on iterative algorithms, their runtimes depend on details of convergence criterion. We performed an additional experiment to compare the runtime of ProPCA, FastPCA (for which we could instrument the source code) for a single iteration and found ProPCA to be three to four times faster than FastPCA across the range of sample sizes (S2 Fig). Measuring the accuracy of the PCs (MEV) as a function of runtime (on datasets with a range of F containing 50, 000 SNPs and 10, 000 individuals so that we could compare the estimated PCs to exact PCs), ProPCA attains a given MEV in about half the time as FastPCA and FlashPCA2 (S3 Fig).

Memory

We assessed the memory usage of ProPCA and other methods as a function of individuals and SNPs (S4a and S4b Fig). Due to computations utilized by the Mailman algorithm, ProPCA uses more memory than other methods, but is still relatively efficient requiring about 40 GB on the largest dataset. Memory usage for ProPCA scales linearly with respect both individuals and SNPs.

Application to real genotype data

We applied ProPCA to genotype data from Phase 1 of the 1000 Genomes project [24]. On a dataset of 1092 individuals and 442, 350 SNPs, ProPCA computes the top forty PCs that are qualitatively indistinguishable from running a full SVD (S5 Fig). Furthermore, we tested each method’s ability to compute 5-40 PCs on this dataset. We took a small subset for 450k SNPs and 1,092 individuals for which we could compute the full SVD. We tested all methods at increments of 5 PCs to 40 PCs and ultimately found that all that all methods still performed well across the range tested (MEV ≥ 0.95) (S1 Table). We also applied ProPCA to genotype data from the UK Biobank [22] consisting of 488, 363 individuals and 146, 671 SNPs after QC. ProPCA can compute the top five PCs in about 30 minutes and the resulting PCs reflect population structure within the UK Biobank, consistent with previous studies [22] (Fig 2a).

Fig 2

Principal components uncover population and geographic structure in the UK Biobank.

We used ProPCA to compute PCs on the UK Biobank data. Figure 2a shows the first two principal components to reveal population structure. Figure 2b shows geographic structure by plotting the score of 276, 736 unrelated White British individuals on the first principal component on their birth location coordinates.

Principal components uncover population and geographic structure in the UK Biobank.

Application to scans for selection

Since the PCs in ProPCA are computed as maximum likelihood estimates under a probabilistic model, ProPCA provides a natural framework for applications such as hypothesis testing. By utilizing the statistical assumptions and set up provided by the ProPCA model, we developed a statistical test to search for SNPs that are not well-modeled by the ProPCA model as a means of discovering signals of natural selection (Methods and S1 Text). This statistic relies on the observation that a SNP evolving under positive selection is expected to exhibit differentiation in the frequencies of its alleles that is extreme compared to a typical SNP that is evolving neutrally [25]. Since deviations from the ProPCA model can occur due to reasons unrelated to selection, we filtered out SNPs with high rates of missingness, low minor allele frequency (MAF), and presence in regions of long-range LD [26] (Methods). We ran ProPCA to infer the top five PCs on 276, 736 unrelated White British samples and the UK Biobank SNP set consisting of 146, 671 SNPs obtained by further removing SNPs in high LD (S6 Fig). The Pearson correlation coefficient between birth location coordinates and the PC score for each individual reveals that the estimated PCs capture geographic structure within the UK (Fig 2b, S7 Fig, S2 Table). We used these PCs to perform a selection scan on a larger set of 516, 140 SNPs and we report SNPs that are genome-wide significant after accounting for the number of SNPs as well as PCs tested (p-value ; we use 6 to account for the additional combined test statistic that we describe later). We ensured that the selection statistic for each PC was well-calibrated against a distribution (S8 Fig) and genomic inflation (λ) values for each of the PCs showed no substantial inflation (S3 Table). While our statistic is closely related to a previously proposed statistic to detect selection on PCs (S1 Text), we found that our proposed statistic is better calibrated (S3 Table). Our scan revealed a total of 59 SNPs that were genome-wide significant (S4 Table). Clustering these signals into 1 Mb windows centered around the most significant SNP for each PC, we obtained twelve non-overlapping loci that contain putative targets of selection (Fig 3, S5 Table). These twelve loci include five that were previously reported to be signals of selection in the UK with genome-wide significance: LCT (rs7570971 with p = 8.51 × 10−16), TLR1 (rs5743614, p = 5.65 × 10−25), IRF4 (rs62389423, p = 8.80 × 10−42), HLA (rs9267817, p = ×6.17 × 10−9), and FUT2 (rs492602, p = 7.02 × 10−10) [12]. The larger sample size that we analyze here also reveals novel signals at additional loci. Four of the twelve signals were previously suggested to be signals of selection but were not genome-wide significant: HERC2 (rs12913832, p = 5.21 × 10−10), RPGRIP1L (rs61747071, p = 2.09 × 10−9), SKI (rs79907870, p = 2.58 × 10−9), rs77635680 (p = 2.22 × 10−10) [12] while the remaining three loci: HERC6 (rs112873858, p = 2.68 × 10−11), rs6670894 (p = 4.98 × 10−9), and rs12380860 (p = 8.62 × 10−9) appear to be previously unreported.

Fig 3

Selection scan for the first five principal components in the white British individuals in the UK Biobank.

Selection scan for the first five principal components in the white British individuals in the UK Biobank.

A Manhattan plot with the −log10 p values associated with the test of selection displayed for the first five principal components for the unrelated White British subset of the UK Biobank. The red line represents the Bonferroni adjusted significance level (α = 0.05). Significant loci are labeled. Signals above −log10(p) = 18 were capped at this value for better visualization. To validate our findings, we utilized birth location coordinates for each individual and assigned them to geographical regions in the UK as defined in the Nomenclature of Territorial Units for Statistics level 3 (NUTS3) classification. We performed a test of association between the allele frequency of the top SNP in each of our novel loci with geographical regions and confirmed that SNPs identified in our selection scan show differences in allele frequencies across specific geographical regions (S6 Table). One of the novel genome-wide significant loci is RPGRIP1L. RPGRIP1L is a highly conserved gene that encodes a protein that localizes at primary cilia and is important in development [27]. Mutations in this gene have been implicated with neurological disorders such as Joubert syndrome and Meckel syndrome [28], conditions that sometimes also result in additional symptoms such as eye diseases and kidney disorders [29]. The SNP with the most significant p-value in our scan in RPGRIP1L, rs61747071, is a missense loss-of-function mutation A229T that has been shown to lead to photoreceptor loss in ciliopathies [30]. We created an additional variant of our selection statistic which tests for SNPs that are not well-modeled by a linear combination of the first five PCs by summing the per-PC statistics resulting in a new chi-squared statistic with five degrees of freedom. Combining signals across PCs has been previously shown to boost power in association testing [31]. We verified that the resulting combined statistic is also calibrated (S8 Fig, S3 Table). Under this combined statistic, we recover majority of the loci found on each individual PC, but we also discover four additional novel loci: AMPH (rs118079376, p = 2.64 × 10−10), TLR4 (rs4986790, p = 7.60 × 10−12), rs9856661 (p = 6.46 × 10−9), and rs116352364 (p = 5.24 × 10−11) (S7 Table). TLR4 is a member of the toll-like receptor family. The TLR gene family is known to play a fundamental role in pathogen recognition and activation of innate immunity, but TLR4 in particular is involved with proinflammatory cytokines and has a pro-carcinogenic function [32]. The SNP with the most significant p-value at our TLR4 locus is rs4986790, a missense D299G mutation and D259G mutation on two different transcripts for the TLR4 gene. The D299G mutation is of particular interest as this mutation is strongly correlated with increased infection by Plasmodium falciparum, a parasite that causes malaria [33, 34]. To better understand the signals of selection that the proposed statistic is sensitive to, we compared the time-scale for our selection hits to those from a recent study that is designed to detect recent positive selection [35] (S9 Fig). Using estimates of allelic ages for variants in the 1000 Genomes Project [36], we find that the variants detected by the proposed statistic tend to be older on average than those found to have a singleton-density score > 4 from Field et al. 2016 (average age of 19, 007 generations for our statistic vs 11, 944 generations for the SDS statistic using the combined mutation and recombination clock). We caution that the interpretation of these results is complicate by the considerable uncertainty in the allelic age estimates. Further, the timing of an episode of selection might post-date the age of the mutation—for example, when selection acts on standing variation. Finally, there is substantial variation in the mean age estimates of the hits. While the average age is around 19, 000 generations, 17 of the 42 putatively selected variants have ages less than 5, 000 generations. This suggests that the proposed statistic could be sensitive to both recent and older selection where the resulting allele frequencies are not well-modeled by the PCs. To further illustrate how the ability to compute PCs in large samples is necessary for biological discovery, we analyzed how many selection signals we discover as a function of sample size by randomly subsampling the number of individuals from the White British population and repeating our analyses (S10 Fig). We ultimately find that sample sizes larger than 150, 000 individuals are required to retain over 80% of the total signals of selection we discover.

Discussion

We have presented, ProPCA, a scalable method for PCA on genotype data that relies on performing inference in a probabilistic model. Inference in this model consists of an iterative procedure that uses a fast matrix-vector multiplication algorithm. We have demonstrated its accuracy and efficiency across diverse settings. Further, we have demonstrated that ProPCA can accurately estimate population structure within the UK Biobank dataset and how this structure can be leveraged to identify targets of recent putative selection. The algorithm that we employ here to accelerate the EM updates is of independent interest. Beyond PCA, several algorithms that operate on genotype data perform repeated matrix-vector multiplication on the matrix of genotypes. For example, association tests and permutation tests, can be formulated as computing a matrix-vector product where the matrix is the genotype matrix while the vector consists of phenotype measurements. Indeed, the algorithm has been used to accelerate heritability estimation [37]. The idea that SVD computations can leverage fast matrix-vector multiplication operations to obtain computational efficiency is well known in the numerical linear algebra literature [38]. Indeed, the algorithms [13, 38] implemented in other PCA methods can also utilize these ideas to gain additional computational efficiency. Alternate approaches to improve matrix-vector multiplication in the genetics setting include approaches that rely on sparsity of the genotype matrix. It is important to note that the speedup obtained from the Mailman algorithm does not rely explicitly on sparsity and could be applied even to dense matrices. It would be of interest to contrast the use of sparse multiplication versus the Mailman algorithm and to investigate the potential to combine these two approaches to be able to leverage sparsity as well as the discrete nature of the genotype matrix. It is likely that different algorithms and implementations to compute PCs (and more generally, infer population structure) might be appropriate based on the specific application. The choice of the specific algorithm and implementation involves a number of trade-offs. While ProPCA is computationally efficient, its use of the Mailman algorithm results in a bigger memory footprint relative to other methods. The probabilistic formulation underlying ProPCA allows the algorithm to be generalized in several directions. One direction is the application of PCA in the presence of missing data that often arises when analyzing multiple datasets. We have explored an extension of the ProPCA model to this setting (S1 Text, S11 Fig). While this approach is promising, a limitation of the use of the Mailman algorithm within ProPCA is the requirement of discrete genotypes, which prevents ProPCA from being directly applied to dosages (e.g. imputed genotypes). Another potential future direction is in modeling linkage disequilibrium and in incorporating rare variants which have the potential to reveal structure that is not apparent from the analysis of common SNPs [39, 40]. Current applications of PCA remove correlated SNPs and singletons though this has been shown to discard information [12]. One possible way to incorporate LD would leverage the connection between haplotype copying models [41] and the multivariate normal model of PCA [42], or by a whitening transformation [4]. Further, the observation model can also be modified to account for the discrete nature of genotypes [3, 43]. A number of non-linear dimensionality reduction methods have been recently proposed [44, 45]. A comparison of these methods to ProPCA (in terms of statistical structure that the methods aim to detect, ability to handle missing data, and computational scalability) would be of great interest. Finally, leveraging fine-scale population structure inferred from large-scale data to study recent positive selection in human history is an important direction for future work. While the probabilistic model underlying ProPCA leads to a natural model for testing for selection, other hypotheses about models of selection could lead to other tests of selection. The challenge is to design realistic statistical models of population structure while enabling inference at scale.

Software availability

ProPCA is available at https://github.com/sriramlab/ProPCA.

Materials and methods

Principal components analysis (PCA)

We observe genotypes from n individuals at m SNPs. The genotype vector for individual i is a length m vector denoted by ∈ {0, 1, 2}. The j entry of denotes the number of minor allele carried by individual i at SNP j. Let be the m × n genotype matrix where = [1…]. Let denote the matrix of standardized genotypes obtained by centering and rescaling each row of the genotype matrix so that ∑ y = 0 and for all i ∈ {1, …, m}. Principal components analysis (PCA) [11] attempts to find a low-dimensional linear transformation of the data that maximizes the projected variance or, equivalently, minimizes the reconstruction error. Given the m × n matrix of standardized genotypes and a target dimension k, PCA attempts to find a m × k matrix with orthonormal columns and n × k matrix that minimizes the reconstruction error: ‖ − T‖ where is the Frobenius norm of the matrix . To solve the PCA problem, we perform a singular-value decomposition (SVD) of the standardized genotype matrix = ΣT and set , where is a m × k matrix containing the k columns of corresponding to the k largest singular vectors of .

Probabilistic PCA

PCA can be viewed as a limiting case of the probabilistic PCA model [10, 19, 20]. Probabilistic PCA models the observed data as a linear transformation of a k-dimensional latent random variable (k ≤ m) with additive Gaussian noise. Denoting the linear transformation by the m × k matrix , and the (m-dimensional) noise by (with isotropic covariance matrix σ2), the generative model can be written as The maximum likelihood estimate of the matrix in this model has been shown to span the k-dimensional principal subspace of the data = [1,…] [46].

EM algorithm for PCA

Since probabilistic PCA is a probabilistic model endowed with latent variables, the EM algorithm presents a natural approach to compute the maximum likelihood estimates of the model parameters (, σ2) [19, 20]. The EM algorithm for learning the principal components can be derived as a special case of the EM algorithm for the probabilistic PCA model where the variance of the observation noise σ2 tends to zero leading to these updates: Here = [1…] is a k × n matrix and = [1…] is a m × n matrix. Noting that all matrix inversions require inverting a k × k matrix, the computational complexity of the E-step is while the computational complexity of the M-step is . For small k and large m, n, the per-iteration runtime complexity is . Thus, the EM algorithm provides a computationally efficient estimator of the top k PCs when the number of PCs to be estimated is small.

Sub-linear time EM

The key bottleneck in the EM algorithm is the multiplication of the matrix with matrices = ()−1 and = ()−1. The vectors representing the sample mean and standard deviation of the genotypes at each SNP are denoted and s. Assuming no entry in s is zero (we remove SNPs that have no variation across samples), the matrix of standardized genotypes can be written as: Here diag() is an operator that constructs a diagonal matrix with the entries of along its diagonals, 1 is a length n vector with each entry equal to one, and is a length m vector with , j ∈ {1, …, m}. The EM updates can be written as: Here can be computed in time while and can be computed in time . The key bottleneck in the E-step is the multiplication of the genotype matrix by each of the k rows of the matrix and in the M-step, multiplication of by each of the k columns of the matrix respectively. Leveraging the fact that each element of the genotype matrix takes values in the set {0, 1, 2}, we can improve the complexity of these multiplication operations from to by extending the Mailman Algorithm [21]. For additional implementation details, see S1 Text.

The Mailman algorithm

In the M-step, we need to compute = for an arbitrary real-valued vector and a m × n matrix whose entries take values in {0, 1, 2}. We assume that m = ⌈log3(n)⌉. Naive matrix-vector multiplication takes time. The Mailman algorithm decomposes as = . Here is the m × r matrix whose columns containing all r = 3 possible vectors over {0, 1, 2} of length m. We set an entry P to 1 if column j of matches column i of . The decomposition of any matrix into and can be done in time. Given this decomposition, the desired product is computed in two steps, each of which has time complexity [21]: The Mailman algorithm provides computational savings in a setting where the cost of computing the decomposition of are offset by the gains in repeated multiplication involving . Similarly, in the E-step, we need to compute T in time by computing T and computing a decomposition of T. A drawback of this approach is the need to store both decompositions that would double the memory requirements of the algorithm. Instead, we propose a novel variant of the Mailman algorithm that can compute T in time using the same decomposition as A (S1 Text). Additional details on efficient implementation of the EM and Mailman algorithms can be found in S1 Text.

Simulations

We simulated genotypes at m independent SNPs across n individuals in which a single ancestral population diverged into q sub-populations with drift proportional to the F, a measure of population differentiation. The allele frequency at SNP f, j ∈ {1, …, m} in the ancestral population was sampled from a uniform distribution such that . Allele frequencies in each of the l subpopulations were generated by simulating neutral drift from the ancestral allele frequency, and were set to 0 or 1 if they fell outside the interval [0, 1]. The genotypes of an individual in population l at SNP j was sampled from a Binomial(2, f) distribution.

Benchmarking

To compare estimated PCs to reference PCs, we computed the mean of explained variance (MEV)—a measure of the overlap between the subspaces spanned by the two sets of PCs. Two different sets of K principal components each produce a K-dimensional column space. A metric for the performance of a PCA algorithm against some baseline is to see how much the column spaces overlap. This is done by projecting the eigenvectors of one subspace onto the other and finding the mean lengths of the projected eigenvectors. If we have a reference set of PCs (v1, v2, …, v) against which we wish to evaluate the performance of a set of estimated PCs (u1, u2, …, u), where U is a matrix whose column vectors are the PCs which we are testing. In practice, when attempting to compute the top k PCs, ProPCA was found to converge faster by computing l PCs for l > k PCs and retaining the top k PCs. The reason for this is that in the initial iterations of the EM algorithm, the estimates of the top PCs are noisy. We set l = k in our experiments for an effective 2k. While ProPCA could be run to convergence, we found that running it for k iterations already gave accurate results across the range of parameters considered. Our empirical results are consistent with our theoretical result that the EM algorithm converges exponentially fast in the spectral norm of the error matrix [38, 47] (S1 Text). We compared ProPCA to the current state-of-the-art methods for computing PCs from genotype data: the SVD implementation in PLINK (PLINK_SVD [23]), FastPCA [12], FlashPCA2 [13], bigsnpr [14], PLINK2 [16], and TeraPCA [15]. PLINK_SVD refers to an exact computation of PCs using the full Singular Value Decomposition as implemented in the PLINK package (PLINK_SVD). FastPCA [12] is an implementation of the blanczos method, a randomized subspace iteration method [38] while FlashPCA2 [13] is an implementation of the implicitly restarted Arnoldi method [48]. PLINK2 [16] and TeraPCA [15] are reimplimentations of the FastPCA algorithm while bigsnpr [14] is a reimplimentation of the FlashPCA2 algorithm designed to utilize disk space as a file backend. We used default parameters for all methods unless otherwise stated. For benchmarking of bigsnpr, we included the creation of the file backend in timing as it is required to run any of the computations included in the backend. Furthermore, we excluded bigsnpr from some experiments due to the inability of its PCA function to natively handle missing data and when faced with monomorphic SNPs. All experiments were performed on a Intel(R) Xeon(R) CPU 2.10GHz server with 128 GB RAM, restricted to a single core, capped to a maximum runtime of 100 hours and a maximum memory of 64 GB.

Selection scan

The White British cohort was identified by the UK Biobank as participants who self-identified as ‘British’ within the broader-level group ‘White’ while having similar ancestral background [22]. For our selection scan, we further filtered the 409, 634 individuals in the White British subset to obtain an unrelated White British subset by removing individuals with one other related individual in the data set (individuals with kinship coefficients greater than 0.0625 (third-degree relatedness) to any other individual as determined by KING [49]). After removing these individuals, we obtained an unrelated White British subset containing 276, 736 individuals. We inferred the top five PCs using ProPCA on all 276, 736 unrelated White British individuals and a filtered SNP set containing 146, 671 SNPs (UK Biobank SNP set). SNPs in the UK Biobank SNP set consist of SNPs on the UK Biobank Axiom array from which SNPs were removed if they have missing rates greater than 1.5%, minor allele frequencies (MAF) less than 1%, or if they were in regions of long-range linkage disequilibrium. The remaining SNPs were then pruned for pairwise r2 less than 0.1 using windows of 1000 base pairs (bp) and a step-size of 80 bp. We developed a selection statistic to search for SNPs whose variation is not well-explained by the ProPCA model, closely related to the selection statistic proposed in [12] (S1 Text). Under the probabilistic PCA model, the normalized genotype matrix is modeled by a low rank approximation and Gaussian noise, Y = CX + . Given our low rank approximation of the genotype matrix, , we have the residual: . For a SNP j, the Gaussian noise, . Projecting this residual onto a PC results in a univariate Gaussian with zero mean and constant variance across SNPs. This variance can be estimated as the sample variance of the resulting statistics across SNPs. In summary, we propose the statistic: for SNP j, given the k-th PC. The projection of the residual onto a PC allows the signal of selection to be interpreted in the context of deviations from ancestry captured by the specific PC. Furthermore, a variant of this statistic, which we call the combined statistic, can be generated from the selection statistics computed on each individual PC using the observation that the resulting chi-squared statistics are independent of each other. This allows us to create an additional statistic by summing the individual PC statistics to create a combined statistic that follows a chi-squared distribution with additional degrees of freedom for each PC used. Using the results from the PCA on the UK Biobank SNP set, we performed our selection scan on a different set of 516, 140 SNPs. We generated this set of SNPs by removing SNPs that were multi-allelic, had genotyping rates less than 99%, had minor allele frequencies less than 1%, and were not in Hardy-Weinberg equilibrium (p < 10−6). We performed an allele frequency test for each novel SNP using the Nomenclature of Territorial Units for Statistics level 3 (NUTS3) classification of regions for the UK. The NUTS3 classification defines non-overlapping borders for each region in the UK, allowing us to uniquely map each individual to a region in the UK using their birth location coordinates by checking which NUTS3 regions they fell into. For each of our novel loci, we then performed an two-tailed Z-test between each region’s allele frequency against all other regions. We corrected for multiple testing using the Bonferroni correction.

Supplementary information.

Section S1 contains additional information on the White British analysis. Section S2 compares our selection statistic to other existing selection statistics. Section S3 explores the time-scales of our selection hits. Section S4 explores the application of ProPCA to missing data. Section S5 details the implementation of ProPCA. Section S6 explains our variant of the Mailman algorithm for left multiplication. Section S7 details the convergence of ProPCA in the noiseless setting. Section S8 explores the contribution of the Mailman algorithm to scalability. (PDF) Click here for additional data file.

ProPCA is efficient at computing large numbers of PCs.

Comparison of methods when calculating differing numbers of principal components. We computed principal components ranging from 1-40 on a dataset containing 20 populations separated at F = 0.01, 10, 000 individuals, and 50, 000 SNPs. All methods were run with default settings. (PDF) Click here for additional data file.

ProPCA has faster per-iteration runtimes versus FastPCA.

Comparison of average per-iteration runtimes over simulated genotype data containing 100, 000 SNPs, six subpopulations, F = 0.10 and individuals varying from 10, 000 to 1, 000, 00. We were unable to leverage the source code for FlashPCA2 for this comparison. (PDF) Click here for additional data file.

ProPCA is computationally efficient relative to other methods.

We compute the total time taken to estimate the top five principal components as a function of a measure of accuracy (MEV) for ProPCA compared to FastPCA and FlashPCA2. We performed these comparisons on simulated genotype data containing 50, 000 SNPs, 10, 000 individuals, six subpopulations, and F ∈ {0.001, 0.005, 0.01}. (PDF) Click here for additional data file.

ProPCA memory usage scales linearly.

We display the memory usage in gigabytes of each method when computing the top 5 principal components. S4 Figa show the average memory usage from each method over 10 runs on a dataset containing six populations separated at F = 0.01, 100, 000 SNPs, and individuals varying from 100,000 to 1,000,000. Figure S4 Figb shows a similar result, but with 100, 000 individuals and SNPs varying from 100,000 to 1,000,000 over a single run. All methods were run using default settings. We were unable to run bigstatsr for the SNPs experiment due to a bug that causes the method to crash in the presence of monomorphic SNPs. (PDF) Click here for additional data file.

ProPCA estimates principal components that are qualitatively indistinguishable from a full SVD on 1000 Genomes Phase 1 data.

We applied our method to genotype data from Phase 1 of the 1000 Genomes project. On a dataset of 1, 092 individuals and 442, 350 SNPs, ProPCA computes the top five PCs in about 17 seconds on a single core. The top two PCs computed by ProPCA and by running SVD on this data set are qualitatively indistinguishable. EM refers to ProPCA. (PDF) Click here for additional data file.

Scatterplot pairs between the projections of the first five principal components of the unrelated White British.

Plotting pairs of the first five principal components reveals structure amongst the unrelated White British. This structure diminishes as we increase the number principal components used. (PDF) Click here for additional data file.

Principal component scores of the unrelated White British overlaid on a map of the UK.

Using birth location data available in the UK Biobank, we placed a scatter plot colored by principal component score to reveal geographic variation captured by the principal components. (PDF) Click here for additional data file.

The selection statistic is calibrated in the unrelated White British.

We plot the theoretical quantiles of the distribution against each of the empirical quantiles observed from the first five principal components. All five principal components follow the theoretical distribution well until the upper tail. We additionally show the calibration of the combined statistic against the theoretical quantiles of the distribution. (PDF) Click here for additional data file.

Boxplot of estimated allelic ages of putative signals of selection.

Using allelic age estimates from the Human Genome Dating Atlas of Variant Age, we compared the estimated allelic ages of the significant signals of selection in Field et al. 2016 (SDS score > 4) and signals found by our own selection statistic. The x-axis denotes different clock models used to estimate allelic ages while allelic age estimates are denoted in generations on the y-axis. The joint clock model estimates allelic age using information from both the recombination and mutational clock models. (PDF) Click here for additional data file.

Proportion of total number of selection signal hits as a function of sample size.

To further illustrate the importance of large sample sizes for biological discovery, we analyzed how many selection signals we could discover as a function of sample size. We randomly subsampled 10,000, 50,000, 100,000, and 200,000 individuals from the White British populations and performed our selection scan. The x-axis denotes sample size in thousands and the y-axis denotes the proportions of total hits discovered. (PDF) Click here for additional data file.

ProPCA infers more accurate principal components (PCs) in the presence of missing data compared to imputed genotypes.

We compared the MEV from eigenvectors calculated from both modes of ProPCA with ground truth eigenvectors from performing a full SVD. We evaluated performance at 5% and 20% random missing values at 5 (S11a Fig) and 10 principal components (S11b Fig). We additionally compared ProPCA to mean imputation followed by a full SVD (S11c Fig). The data consists of simulated genotype data of 50, 000 SNPs from 10, 000 individuals from 5 populations for 5 PCs and 10 populations for 10 PCs separated by a range of F values. This process was repeated ten times to measure variability. Error bars denoting one standard deviation are shown for each point. (PDF) Click here for additional data file.

The Mailman matrix-vector multiplication contributes to the scalability of ProPCA.

We compare the time taken to compute the top five principal components by the EM algorithm underlying ProPCA when used in conjunction with the Mailman algorithm and without. We performed these comparisons on simulated genotype data containing 100, 000 SNPs, six subpopulations, F = 0.10 and individuals varying from 10, 000 to 1, 000, 000. S12 Figa compares the runtime of the EM algorithm with the Mailman matrix-vector multiplication to an EM algorithm where the genotypes are represented as a matrix of doubles (EM1). With this representation, the EM algorithm could only be applied to sample sizes of at most 70, 000 individuals due to memory constraints. S12 Figb compares the runtime of the EM algorithm with the Mailman matrix-vector multiplication to an EM algorithm where genotypes are represented in a compact representation (EM2). (PDF) Click here for additional data file.

Comparison of accuracy of methods to estimate principal components on the genotype data from the 1000 Genomes Phase 1 project.

We compared the accuracy of the ProPCA algorithm, bigsnpr, FlashPCA2, PLINK2, TeraPCA, and FastPCA when applied to 1092 individuals in the 1000 Genomes Phase 1 project. We report MEV averaged over ten trials. FastPCA gave us a segmentation fault for estimation of ≥ 35 PCs. We ran all methods using default settings. (PDF) Click here for additional data file.

Pearson correlation between principal components and birth location coordinates in the unrelated White British.

Pearson correlation between the principal components and birth location coordinates reveals that the principal components unveil geographic variation. P-values from Pearson correlation t-test is shown in parentheses on the right. (PDF) Click here for additional data file.

Selection statistics are not substantially inflated.

We calculated the λ values for each principal component and the combined statistic to check for inflation. In the unrelated White British set, the calculated values show that our selection statistics are not substantially inflated (top row). Furthermore, we show that the previously related statistic proposed by Galinsky et al. 2016 does not calibrate as well as our statistics based on λ values (bottom row). (PDF) Click here for additional data file.

Table of significant SNPs found by selection scan on unrelated White British.

Our selection scan on the unrelated White British population resulted in 59 significant SNPs. Our significance threshold was a Bonferroni corrected p < 0.05. Bonferroni corrected p-values are shown. (XLSX) Click here for additional data file.

Principal component selection scan reveals 12 unique loci under selection across the top five principal components.

We obtained 59 selection hits across the first five principal components of the unrelated White British subset of the UK Biobank. We clustered these hits into 12 unique loci by aggregating all significant hits into 1 Mb windows centered around the most significant hits. Other genes with significant hits that are within the 1 Mb window are listed in the last column. (PDF) Click here for additional data file.

Allele frequency tests between NUTS3 regions at novel loci confirms differences between geographic regions.

We performed a two-tailed proportion test for our novel loci between the allele frequency in each individual region from the NUTS3 classification of the United Kingdom against the frequency from every other region. We corrected the p-values using the Bonferroni correction (11 loci × 163 regions). The corrected p-values for regions passing the significance threshold are shown in the table. (PDF) Click here for additional data file.

Combined selection statistic across the top five principal components reveals four additional novel loci.

We discover four additional novel loci using our combined selection statistic from the first five principal components. Loci not found in the individual PC selection statistics are denoted by an asterik in the rsid column. The chi-squared statistic (one degree of freedom) for each principal component is shown in the last five columns of the table. (PDF) Click here for additional data file.

Selection hits are associated with phenotypes in the UK Biobank.

We ran genome-wide association tests for 64 phenotypes in the full release of the UK Biobank for each of our loci. Phenotypes shown reached a p-value of genome-wide significance level (0.05 × 10−6). (PDF) Click here for additional data file.

Selection hits are associated with phenotypes from the Global Biobank Engine.

We queried the Global Biobank Engine for associations from our loci. The Global Biobank Engine contains GWAS results for many more phenotypes than those available in the UK Biobank. Phenotypes shown are significant at genome-wide significance level (0.05 × 10−6). (PDF) Click here for additional data file.

39 in total

1. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

2. TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes.

Authors: Aritra Bose; Vassilis Kalantzis; Eugenia-Maria Kontopoulou; Mai Elkady; Peristera Paschou; Petros Drineas
Journal: Bioinformatics Date: 2019-10-01 Impact factor: 6.937

3. Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia.

Authors: Kevin J Galinsky; Gaurav Bhatia; Po-Ru Loh; Stoyan Georgiev; Sayan Mukherjee; Nick J Patterson; Alkes L Price
Journal: Am J Hum Genet Date: 2016-02-25 Impact factor: 11.025

Review 4. New approaches to population stratification in genome-wide association studies.

Authors: Alkes L Price; Noah A Zaitlen; David Reich; Nick Patterson
Journal: Nat Rev Genet Date: 2010-07 Impact factor: 53.242

Review 5. TLR-signaling and proinflammatory cytokines as drivers of tumorigenesis.

Authors: Kirill V Korneev; Kamar-Sulu N Atretkhany; Marina S Drutskaya; Sergei I Grivennikov; Dmitry V Kuprash; Sergei A Nedospasov
Journal: Cytokine Date: 2016-02-04 Impact factor: 3.861

Review 6. The ciliary protein Rpgrip1l in development and disease.

Authors: Antonia Wiegering; Ulrich Rüther; Christoph Gerhardt
Journal: Dev Biol Date: 2018-08-01 Impact factor: 3.582

7. Detection of human adaptation during the past 2000 years.

Authors: Yair Field; Evan A Boyle; Natalie Telis; Ziyue Gao; Kyle J Gaulton; David Golan; Loic Yengo; Ghislain Rocheleau; Philippe Froguel; Mark I McCarthy; Jonathan K Pritchard
Journal: Science Date: 2016-10-13 Impact factor: 47.728

8. Analysis and application of European genetic substructure using 300 K SNP information.

Authors: Chao Tian; Robert M Plenge; Michael Ransom; Annette Lee; Pablo Villoslada; Carlo Selmi; Lars Klareskog; Ann E Pulver; Lihong Qi; Peter K Gregersen; Michael F Seldin
Journal: PLoS Genet Date: 2008-01 Impact factor: 5.917

9. Inferring human colonization history using a copying model.

Authors: Garrett Hellenthal; Adam Auton; Daniel Falush
Journal: PLoS Genet Date: 2008-05-23 Impact factor: 5.917

10. Dating genomic variants and shared ancestry in population-scale sequencing data.

Authors: Patrick K Albers; Gil McVean
Journal: PLoS Biol Date: 2020-01-17 Impact factor: 8.029

4 in total

1. A geometric relationship of F₂, F₃ and F₄-statistics with principal component analysis.

Authors: Benjamin M Peter
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2022-04-18 Impact factor: 6.671

2. Contribution of Genome-Wide Polygenic Score to Risk of Coronary Artery Disease in Childhood Cancer Survivors.

Authors: Yadav Sapkota; Qi Liu; Nan Li; Neel S Bhatt; Matthew J Ehrhardt; Carmen L Wilson; Zhaoming Wang; John L Jefferies; Jinghui Zhang; Gregory T Armstrong; Melissa M Hudson; Leslie L Robison; Daniel A Mulrooney; Yutaka Yasui
Journal: JACC CardioOncol Date: 2022-06-21

3. Efficient toolkit implementing best practices for principal component analysis of population genetic data.

Authors: Florian Privé; Keurcien Luu; Michael G B Blum; John J McGrath; Bjarni J Vilhjálmsson
Journal: Bioinformatics Date: 2020-08-15 Impact factor: 6.937

Review 4. Statistics or biology: the zero-inflation controversy about scRNA-seq data.

Authors: Ruochen Jiang; Tianyi Sun; Dongyuan Song; Jingyi Jessica Li
Journal: Genome Biol Date: 2022-01-21 Impact factor: 13.583

4 in total