Literature DB >> 29881252

Decomposing the Apoptosis Pathway Into Biologically Interpretable Principal Components.

Min Wang¹, Steven M Kornblau², Kevin R Coombes³.

Abstract

Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 "biological components," 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.

Entities: Chemical Disease Gene Species

Keywords: Auer-Gervini; Bayes rule; Dimension reduction; broken stick; randomization-based procedure

Year: 2018 PMID： 29881252 PMCID： PMC5987987 DOI： 10.1177/1176935118771082

Source DB: PubMed Journal: Cancer Inform ISSN： 1176-9351

Introduction

Since the earliest days of gene expression microarrays, 2-way clustered heatmaps have been a standard feature of papers studying genome-wide biological data sets.[1,2] Such heatmaps remain ubiquitous, despite numerous difficulties in interpretation, in reproducibility, and in assigning statistical significance. Good clustering of the genes in such data sets is critical for understanding the biology. Because many biologists are more interested in signaling pathways than in individual genes, they want to find a source of consistent, robust, and interpretable blocks of genes that drive distinct functional characteristics of the pathways. These blocks of genes form clusters that are relevant to comprehensive understanding of critical biological processes. For example, apoptosis is an important biological process, which is characterized by distinct morphological states and energy-dependent biochemical mechanisms.[3] Understanding how proteins cluster in the apoptotic pathway will help elucidate its underlying molecular mechanisms. Now, clustering can be thought of as a form of dimension reduction, and a natural question is the “true dimension” of the data. Various techniques have been developed to determine the dimensionality, the most common being principal component analysis (PCA). For our purposes, an important problem in PCA is to determine the number of statistically significant components. Numerous methods have already been developed to estimate the number of significant components. There are 4 types of approaches: (1) ad hoc subjective and graphical rules; (2) methods based on distributional assumptions; (3) computationally extensive procedures relying on Monte Carlo, permutation, cross-validation, bootstrap, or jackknife[4,5]; and (4) Bayesian methods based on the probabilistic formulation of Tipping and Bishop[6] with marginalization estimated through Laplace approximation and its variants.[7-9] The scree plot method, which consists of plotting a curve of the eigenvalues of the sample covariance matrix versus their rank and looking for an “elbow” in the curve, is the most famous graphical approach.[10] However, this method relies on the user’s subjective experience to find any possible “elbow.” Even so, other methods are not always superior to the simple scree plot. Legendre and Legendre[11] used the “broken stick” distribution to compare the extra information in a model to one with fewer parameters. Ferre[12] conducted an empirical study of many methods to select the number of PCs, using data simulated from known parameters. He concluded that there is no “ideal” solution to the problem of dimensionality in PCA. He also concluded that Bartlett’s tests[13] are an improvement because they are less subjective but may have a tendency to overestimate the true number of components. Peres-Neto et al[14] conducted an extensive simulation study to evaluate a wider variety of methods. They concluded that several methods, especially those based on randomization and permutation proposed by ter Braak,[15,16] outperform the others and should be applied to study general data sets. More recently, Josse and Husson[5] showed that the generalized cross-validation method performs well. Sobczyk et al[9] proposed a Bayesian approach called PEnalized SEmi-integrated Likelihood (PESEL); by comparing it with other state-of-the-art methods, they found that PESEL-based criteria are more robust against deviations from the assumptions of a probabilistic model than the other methods. In 2008, Auer and Gervini[17] addressed the problem of selecting principal components in the context of Bayesian model selection. Although their method has strong theoretical foundations and appears to work well in practice, it still depends on the subjective evaluation of a graphical display of how the maximum posterior estimate of the number of components depends on a parameter describing the choice of the prior distribution. Moreover, its performance has only been compared with the scree plot and broken stick methods and not to more sophisticated methods that have performed well in comparative studies. In this article, we consider several algorithms to extend and automate the Auer-Gervini method by providing objective rules to select the number of significant principal components. Using an extensive set of simulations, we compare these algorithms with the broken stick model, Bartlett’s test, ter Braak’s randomization methods, generalized cross-validation, and Bayesian probabilistic PCA. The methods chosen for comparison were the “winners” from the previous comparative studies.[7,9,12,14] Our extensions to the Auer-Gervini method are implemented in the PCDimension R package. Because the most promising versions of the randomization algorithms appear not to be readily available, we have also implemented them in the PCDimension package. For convenience, the package also implements the broken stick method. For Bartlett’s test, we rely on an existing implementation in the nFactors package.[18] For generalized cross-validation, we use the implementation in the FactoMineR package.[19] The implementation of Bayesian approximation methods including Minka’s approach and PESEL from probabilistic PCA is available in the pesel package and code on Github.[9] This article is organized as follows. In section “Methods,” we review the theoretical framework of different types of methods. In section “Simulation Study” of the “Results” section, we perform simulation studies to test the performance of the proposed algorithms. In the “Decomposing the Apoptosis Pathway in AML” section, we apply the methods to a study of apoptosis in acute myeloid leukemia (AML) using reverse phase protein arrays (RPPAs). Finally, we conclude the article and make several remarks in section “Conclusions.” A simple example to illustrate the implementation of different methods in the PCDimension package is also provided in the supplementary material.

Methods

Let denote an data matrix, where each row represents an object to be analyzed, and each column represents a measured attribute. In PCA, each principal component is a linear combination of the attributes. Much effort has been expended on investigating many objects using relatively few attributes, but the opposite scenario—few objects and many attributes—is usually ignored. One practical application would be clustering a few genes or proteins from a single biological pathway using their expression values for many patients. Therefore, we are primarily interested in biological applications where . In this section, we briefly review the methods used to estimate the number of statistically significant PCs.

Bartlett’s test

Bartlett[13] proposed a statistical method to conduct a hypothesis test on the significance of the principal components based on the eigenvalues of , the correlation matrix of the objects. This test is designated to check whether the remaining eigenvalues of the correlation structure are equal after removing the well-determined (highly significant) components. Let the eigenvalues of be with . The procedure, for various values of , starting at in decreasing order, is to test the null hypothesis that the “ smallest eigenvalues of the correlation matrix are equal” against the alternative hypothesis that “at least 2 of the eigenvalues are different.” The test statistic is as follows: where Under the null hypothesis, the test statistic follows a distribution with degrees of freedom. The optimal number of principal components is the smallest where is accepted. Both Lawley[20] and Anderson[21] made some modifications to the multiplicative factor for equation (1); these are viewed as improved variants of Bartlett’s test.

Broken stick model

Under the assumption that the total variance of the multivariate data is divided at random among all possible components, the expected distribution of the eigenvalues in the covariance or correlation matrix follows a broken stick distribution.[22] This model says that if we have a stick of unit length, broken at random into segments, then the expected length of the longest piece is as follows: As the expected values under the broken stick model are obtained in decreasing order, it is necessary to rank the relative proportions of the variance that are accounted for by the PCs in the same way. The estimated number of PCs is the maximum index where the observed relative proportion of variance is greater than or equal to the expected value from the broken stick distribution.

Randomization-based procedure

This procedure involves a randomization approach to generate a large number of data sets by scrambling the observed data in a manner of sampling without replacement.[15,16] These randomized data sets are then used to compute empirical P values for the statistics of interest characterize the internal structure of the eigenvalues in the correlation matrix. The test procedure is as follows: (1) randomize the values with all the attribute columns of the data matrix, (2) perform PCA on the scrambled data matrix, and (3) compute the test statistics. All 3 steps are repeated a total of times, where is a large enough integer to guarantee the accuracy of estimating the P value; in practice, is usually set to equal 1000. In each randomization, 2 test statistics are computed: (1) the eigenvalue for the principal component and (2) a pseudo F ratio computed as . Finally, the P value for each and each statistic of interest is estimated to be the proportion of the test statistics in all data sets that are greater than or equal to the one in the observed data matrix.

Bayesian approximation methods

The probabilistic PCA model assumes that each observation arises from the following model: where is the mean vector, is a d-dimensional Gaussian latent vector, is an parameter coefficient matrix, and is Gaussian noise. Tipping and Bishop[6] showed that the principal components of can be retrieved using the maximum likelihood estimator of , which is given as follows: where is the matrix of ordered principal eigenvectors of , is an estimator of the true noise level , is the diagonal matrix with the estimates of the eigenvalues of the covariance matrix, and is an arbitrary orthogonal matrix. Finding the number of principal components is then equivalent to a model selection problem from . The optimal number of components can be obtained by applying the following criteria: where the number of components is implicit in the matrix . With a noninformative prior, the mean vector can be integrated out and the probability of given and is obtained as follows: where is the sample covariance matrix.[7] With a prior , the probability of observing given the signal dimensionality is as follows: Because of the computational burden in calculating equation (7), Minka[7] proposed a conjugate prior for and derived a Laplace approximation of the marginal likelihood in equation (7) which is then used in equation (5) to compute the optimal number of components. This approximation approach has been empirically shown to be efficient in small-sample scenarios. In 2008, Hoyle[8] considered the high-dimensional scenarios when and grow to infinity and proposed another approximation where more terms in the asymptotic expansion of the integrand of equation (7) are used to guarantee more accurate evaluation. However, similar to Minka’s Laplace approximation, it depends on the selection of the prior on . This approach is not considered here because there is no publicly available implementation. Sobczyk et al[9] explored 2 other high-dimensional scenarios: (1) is fixed and and (2) is fixed and , where the second one is out of the scope of this article. For the 2 different scenarios, they modeled either the rows or the columns of the matrix via fixed effects models (with slightly different forms depending on the relative size of and ). The 2 models are similar to equation (3) where either the matrix of PCA scores or the matrix of PCA loadings are used. They imposed a prior on these matrices and presented a methodology for approximating the posterior probability of . Using 2 different prior distributions on terms similar to in equation (3), they proposed 2 corresponding forms of PESEL where either all the singular values in PCA are homogeneous (abbreviation homo) or heterogeneous (abbreviation hete). Therefore, there are 4 criteria to maximize the posterior probability of given the PESEL: and for Scenario 1 and and for Scenario 2. In the rest of this article, we study the same 3 criteria , , and used in the work by Sobczyk et al.

Auer-Gervini model

We briefly review the Auer-Gervini method.[17] Suppose that are all columns of data matrix , and is a random sample with mean vector and covariance matrix . Note that in the work by Auer and Gervini, they assume that the rows are i.i.d. Gaussian. Here, we make a transposition and assume the columns of X are i.i.d. Gaussian to guarantee the consistency of notations with previous sections. Write where is orthonormal and with . Let be the model with significant components or eigenvalues, that is, , and , for . Under , a random sample is as follows: where are uncorrelated random variables with mean 0 and variance 1, and is a random error vector with mean 0 and covariance matrix . Therefore, the problem of selecting the number of PCs is transformed into the problem of choosing the correct model . A prior probability is assigned to of the following form: where is a normalizing constant that satisfies for . Then, under certain approximations, one can use Bayes rule to derive a formula for the maximum posterior estimate of as a function of the prior parameter : where is an estimator of , are the eigenvalues of the sample covariance matrix and also the maximum likelihood estimators of under model , and and are the geometric and the arithmetic means of , respectively. The formula describes as a nonincreasing step function with respect to , with . The step function is then plotted and the “highest dimension for which the step length is significantly large” is selected to be the optimal number of components. In other words, an exponential prior is placed on the number of significant components. The prior depends on a hyperparameter that governs how fast the distribution decays. As goes to , the prior drops off rapidly and the posterior estimate of the number of PCs will go to 0. Auer and Gervini proposed graphing the posterior estimate as a step function of , which can visually help select the highest “nontrivial step length.” A large step length means that the estimated number of PCs is optimal under a wide range of prior model probabilities. However, the notion of “nontrivial step length” remains subjective, which is similar to the situation where one needs to select a recognizable “elbow” in the scree plot. Automating the definition of nontrivial step length is further complicated by the fact that the final step for is theoretically infinite. We will operationalize the final subjective step by putting an upper bound on the largest “reasonable” estimate of and will develop criteria to automatically choose the significantly large step length.

Automating the Auer-Gervini method

As originally described, the Auer-Gervini model is a visual Bayesian approach, and the critical final step is to decide what constitutes a significantly large length of a step. This problem can be thought of as one of classification, in which the set of step lengths must be partitioned into 2 groups (short and long). We propose to test multiple algorithms to solve the problem as follows.

TwiceMean

Use twice the mean of the set of step lengths as a cutoff to separate the long and short steps. Intuitively, the idea is to view the distribution of the step lengths as exponential when the data arise from random noise. As the mean equals the standard deviation for an exponential distribution, twice the mean is the same as the mean plus one standard deviation and provides a plausible cutoff to select “long” step lengths. This simple idea is inspired in part by Chaterjee,[23] who considered recovery of low-rank matrices by thresholding singular values. He proposed that one could have a single universal choice of the threshold parameter which is slightly greater than 2 and gives near-optimal mean square error for singular value hard thresholding.

K-means

Because the goal is to partition the step lengths into 2 groups, a natural solution is to cluster them using the traditional K-means algorithm with . We seed the algorithm with starting centers using the minimum and maximum step lengths. As we will discuss in more detail below, the final step (when ) is theoretically infinite. We will bound this last step but it will ensure that at least one step is “long.”

Kmeans3

Our initial experience (data not shown) using K-means with large found it to be overly conservative when assigning steps to the “long” group. Given its dependence on Euclidean distances, that is exactly how one should expect it to perform when the true mixture distribution is skewed right. To address this problem, we modified the algorithm as follows. If the number of objects is large , we use the K-means algorithm, with and seeds of the minimum, median, and maximum values, to separate the step lengths into 3 groups: Low, Intermediate, and High. We then treat both Intermediate and High groups as long.

Spectral

Use spectral clustering to divide the step lengths into 2 groups. Spectral clustering is one of the most popular modern clustering algorithms. It sometimes outperforms the traditional clustering algorithms such as K-means.[24,25]

CPT

Instead of simply clustering the step lengths into 2 groups, we can instead sort them into increasing order and view the problem as one of change point detection. One existing solution to this problem is provided by the At Most One Change (AMOC) method implemented by the cpt.mean function from the changepoint R package.[26] The detection of the first change point is posed as a hypothesis test and a generalized likelihood ratio–based approach is extended to changes in variance within normally distributed observations (the reader can consult Hinkley[27] and Gupta and Tang[28] for more details).

CpmFun

The cpm R package defines several other “Change Point Models.”[29] These are implemented by the detectChangePointBatch function, which processes the step lengths in one batch and returns information regarding whether the sequence contains a change point. The default is to use the “Exponential” method, which computes a generalized likelihood ratio statistic for the exponential distribution.

Ttest

We also implemented a novel change point detection algorithm based on the t test. We begin by sorting the steps lengths in increasing order. Then, we compute the gaps between successive step lengths. At each (sorted) step, we use the t distribution to determine the likelihood that the next gap is larger than expected from the previously observed gaps. The first time that the next gap is significantly larger than expected, we assert that this next step length is the smallest one that constitutes a “long” step length.

Ttest2

Where the K-means algorithm was found to be conservative, the Ttest algorithm just described was sometimes found to be anticonservative. This can happen when the first few step lengths are all about the same size, which yields a small standard deviation. In this case, a relative short next step will be falsely discovered based on the Ttest criterion. To avoid this problem, it may be appropriate to include the next (test) step length and gap when estimating the mean and standard deviation of the gap distributon. We modified the Ttest algorithm in this way to make it more conservative.

Bounding the last step

All of the proposed methods for separating short from long steps require us to bound the permissible length of the final step when , which would otherwise be infinite. This step is important because it allows the algorithm to conclude in some cases that the only long step is the final one, and the true number of principal components should equal 0. We use the largest of the following 3 quantities: further than the final change point (to ) in the step function. , where is the number of attributes. This procedure selects the value of for which 99% of the prior probability is assigned to . if and , otherwise, where is the number of objects. This formula was derived empirically from a Monte Carlo study on data sets with random noise. We estimate as the maximum of the empirical largest change point in the step function for various values of and in 2 scenarios: and . It can be seen as the value where the maximum point for could be achieved when various d’s are sharing the prior information on under the uninformative noise structure. Note that all simulations and computations in this article were performed using version 3.2.2 of the R statistical software environment with version 1.1.3 of the PCDimension package, which we have developed, version 2.3.3 of the nFactors package, version 1.39 of the FactoMineR package, and version 0.7.1 of the pesel package. The details on how to select the number of PCs for a simple example using the PCDimension package are provided in the supplementary material.

Results

Simulation study

We follow a Monte Carlo procedure to study the robustness of different types of methods described above for estimating the number of PCs. For real data sets, we will never know the “correct” answer. So, we simulate a collection of data sets with different correlation structures to compare the numerical Auer-Gervini model we have implemented with the other types. We start by supplying details about the correlation structures and data sets used in the simulations.

Simulated datatypes

We use a protocol similar to those discussed in recent papers.[5,14,17,30-32] In the simulations, the number of measured attributes is taken to be either or . The range of 100 to 400 is chosen to represent small to moderately large data sets. We also consider data sets with either or objects. We view 24 objects as a small data set and 96 objects as a moderately large one.[33] The number of significant diagonal blocks is either the number shown in Figure 1 or twice that number (with finer correlation structures of double group blocks). By varying both the number of objects and the number of correlated blocks, we can explore the effects of the number of nontrivial components and the number of objects per component. To also explore the effects of different combinations of additional factors, including eigenvalue distributions, signed or unsigned signals, uncorrelated variation, and unskewed normal or skewed distributions, we use the 19 different covariance structures and correlation matrices illustrated in Figure 1. Matrices 1–3 are covariance matrices with different marginal distributions: normal, marginal gamma, and marginal t distribution, where , and with for 1≤i≤5 and for . Matrices 4–19 are correlation matrices and we set the corresponding covariance matrices to be where . Note that the correlation is the standardized version of the covariance, that is, where . We use this form for most of the covariance matrices because we want to control the signal to noise ratio (basically, ) to test the robustness of methods to the noise level . Matrix 4 contains only noise; it is a purely uncorrelated structure. Matrices 5 and 6 represent correlation structures with various homogeneous cross-correlation strengths (unsigned signals) 0.3 and 0.8. Matrices 7–13 are correlation matrices where between-group (0.3, 0.1, or 0) and within-group (0.8 or 0.3) correlations of objects are fixed.[14,31] Matrices 14–19 are correlation structures where negative cross-correlations (−0.8 or −0.3, signed signals) are considered within groups and mixture of signed and unsigned signals are also included.

Figure 1.

The 19 correlation matrices considered in the simulation study for 24 objects. Values of correlations are provided by the colorbar. Numbers in parentheses correspond to the known dimension.

Empirical simulation results

For each of the 19 scenarios, we simulate 1000 data sets. Then, the numbers of components are estimated based on all (19 variants of the) types of methods. That is, for each variant within each type of model, we compute the estimated dimension. We also investigate a “majority rule” procedure for the Auer-Gervini model; that is, the dimension that more than 3 criteria out of 8 selected is the one that is estimated by the majority rule. Then, we calculate the absolute difference between the known dimensions and the estimated ones for each simulated sample and correlation structure. The mean of the absolute differences over both 1000 simulated data sets and 19 correlation matrices is plotted in Figure 2. The corresponding numeric values are also provided in Supplementary Table 1. The values in this table can help assess the quality of each method. The closer to 0 the values are, the better the corresponding variant within the type of method. However, the values do not describe whether a method overestimates or underestimates the number of nontrivial PCs.

Figure 2.

Log-transformed mean values, across the correlation matrices, of the absolute difference between the known dimension and the sample estimates from 15 different methods in 8 simulation scenarios (Scenario 1: 24 × 100, 1X correlated blocks; Scenario 2: 24 × 400, 1X; Scenario 3: 96 × 100, 1X; Scenario 4: 96 × 400, 1X; Scenario 5: 24 × 100, 2X; Scenario 6: 24 × 400, 2X; Scenario 7: 96 × 100, 2X; and Scenario 8: 96 × 400, 2X). In Figure 2 and Supplementary Table 1, one can see that, as anticipated, the results from most algorithms are better with fewer correlated blocks (Scenarios 1-4), probably because there are more objects representing each block. Also, accuracy in almost all methods is much better with fewer objects (Scenarios 1-2 and 5-6). The situation is more complicated when the number of attributes changes. In general, the worst performance occurs when the data matrix is nearly square (Scenarios 3 and 7). Overall, the most accurate methods when averaging across all 8 scenarios are (1) the rnd-Lambda algorithm (mean deviation, ), (2) the PESEL method with criterion , (3) the Auer-Gervini model with criterion “CPT” , and (4) Minka’s Laplace approximation . Interestingly, the rnd-Lambda and Minka-Laplace algorithms each produce the best average performance in only 1 out of 8 scenarios. Moreover, the Auer-Gervini CPT model and the model are not the best performers in any of the 8 individual scenarios. By contrast, 3 other methods (the “TwiceMean” Auer-Gervini model, the “ttest2” Auer-Gervini model, and generalized cross-validation) have the best performance in 1 out of 8 scenarios. The simple broken stick method is the best performer in 3 out of 8 scenarios. This variabililty in the winners suggests that we look more closely at how the simulation parameters affect performance. When there are only 24 objects, the 4 best methods are (1) “TwiceMean” Auer-Gervini , (2) generalized cross-validation , (3) rnd-Lambda , and (4) “kmeans” Auer-Gervini . When there are 96 objects, the 4 best performers are (1) broken stick , (2) rnd-Lambda , (3) , and (4) Minka-Laplace . A slight shuffling of the order of the same set of high-performing methods occurs when we vary the number of blocks (broken stick is best with 1X , Minka-Laplace with 2X ) or the number of attributes (Minka-Laplace is best with 100 , and rnd-Lambda with 400 ). The variants of Bartlett’s test, as implemented in R, have the worst performance of all the stopping rules we have considered (Figure 2). Furthermore, the rnd-F algorithm is not as good as expected because a previous study found it to be one of the most successful rules tested.[14] Even though the CPT criterion is one of the best overall, and the TwiceMean criterion is the best when there are 24 objects, some of the criteria that we use to automate the Auer-Gervini model are not good candidates for computing the dimension. For example, the “Ttest” and “CPM” criteria often result in large deviations from the true dimension. The PESEL criteria and have middle-of-the-road performance, which may by due to considering in this article.

Running time

In addition to the absolute differences between the true dimension and the estimates, we computed the average running time of all types of methods over all correlation matrices per data set (Table 1). All timings were conducted on a computer with “Intel Xeon CPU E5-2690 v3 @ 2.60-GHz 2.59-GHz” processors running Windows Server 2008 R2 Enterprise. Note that the time shown in the table is the total time of all the variants or criteria within each type of model. It is obvious that the computation time becomes longer as the number of objects or attributes increases, and there is almost no change in time usage when the number of blocks is doubled. From the table, we can see that Bartlett’s test and the broken stick method use the least time in computing the number of components. However, the accuracy using the broken stick approach is much better than in Bartlett’s test. The most accurate overall method, rnd-Lambda, is by far the slowest, taking several orders of magnitude more time than the other methods. Of the remaining 3 methods, which achieve nearly the same level of accuracy, Minka’s Laplace approximation is the fastest and the “CPT” Auer-Gervini model is the slowest.

Table 1.

Average running time of all types of methods across correlation matrices (unit: seconds).

Rules	Original blocks (1X)				Twice blocks (2X)
	24 objects		96 objects		24 objects		96 objects
	m = 100	m = 400	m = 100	m = 400	m = 100	m = 400	m = 100	m = 400
Broken stick	0.002	0.003	0.008	0.018	0.002	0.003	0.008	0.018
Bartlett’s test	0.004	0.004		0.015	0.004	0.004	0.013	0.016
GCV	0.004		0.012	0.032	0.003	0.006	0.012	0.032
Minka’s Laplace	0.005		0.024	0.047	0.005	0.007	0.024	0.048
PESEL	0.010		0.026	0.191	0.008	0.104	0.026	0.183
Auer-Gervini	0.035		0.234	0.269	0.033	0.035	0.236	0.275
Rand. based	2.121	2.990	9.253	20.152	1.832	3.131	9.212	20.061

Average running time of all types of methods across correlation matrices (unit: seconds).

High-accuracy methods: rnd-Lambda and Auer-Gervini with CPT

More detailed results on the performance of the rnd-Lambda algorithm are presented in Figure 3 and Supplementary Tables 2 and 3 for the case of the original block structure shown in Figure 1. Similar results for the “CPT” criterion in the Auer-Gervini model are presented in Figure 4 and Supplementary Tables 4 and 5. For each covariance or correlation matrix, we computed the percentage of deviations between the estimates and the known dimension. The results are similar regardless of the number of attributes (100 or 400) or objects (24 or 96). Both methods tend to underestimate the dimension for covariance matrices of unskewed or skewed distributions (matrices 1-3). They are quite accurate for correlation matrices of normal distribution (matrices 4-19). When they make errors with the normal distribution, the rnd-Lambda algorithm is more likely to slightly overestimate the dimension, whereas the Auer-Gervini CPT method is more likely to underestimate.

Figure 3.

Percentage of deviations between the estimate from randomization-based procedure (rnd-Lambda) and known dimension.

Figure 4.

Percentage of deviations between the estimate from the “CPT” criterion in the Auer-Gervini model and known dimension.

Percentage of deviations between the estimate from randomization-based procedure (rnd-Lambda) and known dimension. Percentage of deviations between the estimate from the “CPT” criterion in the Auer-Gervini model and known dimension.

Special-case methods: TwiceMean and broken stick

We present details on the performance of the “TwiceMean” criterion in the Auer-Gervini model and the broken stick method, which were best when restricting to either the 24-object or 96-object simulations, respectively, as shown in Figure 5 and Supplementary Tables 6 and 7. For data sets with 24 objects, the Auer-Gervini method with criterion “TwiceMean” is very accurate for uniform matrices (matrices 5 and 6), correlated matrices (matrices 7-13) and unsigned data with or without signed signals (matrices 14-19). When the sample covariance matrix is either skewed or unskewed (matrices 1-3), the dimension is usually underestimated. Also, the results do not vary too much with different numbers of attibutes. For 96 objects, there is not much difference between the results of the broken stick method in Supplementary Table 7 and that of the criterion “TwiceMean” in Supplementary Table 6. And the results of the broken stick method for 100 attributes are slightly better than those for 400 attributes.

Figure 5.

Percentage of deviations between the estimate from either the “TwiceMean” criterion in the Auer-Gervin model or the broken stick model and known dimension.

Robustness to random noise

We investigated the influence of random noise on the ability of different methods to correctly detect the underlying structure. We conducted additional simulation studies by adding different levels of (i.i.d. normal) noise corresponding to 3 different values of the variance: , , and . Summaries of the absolute difference between the known dimension and the estimates made using various methods under different levels of noise are presented in Supplementary Tables 8 to 10. Although the error rate tends to increase with increasing noise, we found that the relative performance of the methods is consistent regardless of the size of the noise. That is, most algorithms still perform better with fewer objects, and accuracy is almost always worse when the number of blocks doubles. Most importantly, the best method for each scenario does not change in most cases as the noise moves from 0 to 1. The “TwiceMean” criterion in the Auer-Gervini model is better when the number of objects is small relative to the number of objects, whereas the broken stick method is better when the number of objects is close to the number of attributes. Finally, regardless of the noise level, the rnd-Lambda algorithm, the Auer-Gervini model with “CPT” criterion, the PESEL method with criterion , and Minka’s Laplace approximation outperform the others on average across all scenarios.

Decomposing the apoptosis pathway in AML

Since the introduction of gene expression microarrays in the 1990s, most statistical analyses of omics data have treated pathways as second-class objects, in the following sense: primary analyses are performed at the gene level. That is, the data are first analyzed gene by gene to find differences between known groups of patients such as responders and nonresponders. Then, a significance cutoff is chosen and a second statistical test conditional on the gene-by-gene results (such as gene set enrichment analysis[34]) is performed to infer which pathways differ between the 2 groups. One reason analysts give precedence to individual genes is that univariate analyses are easier than the multivariate ones needed for pathways. However, many biologists are more interested in pathways than in individual genes because they give a higher-level functional picture of biological behavior. Informally, biologists talk about pathways as though they are 1-dimensional entities. At a cell level they are “on” or “off”; at a tissue level, they have a simple “degree of activation.” But we hypothesize that most pathways, including the apoptosis signaling pathway, are intrinsically multidimensional. To test this hypothesis, we used a subset of RPPA data on samples collected from 511 patients with AML.[35,36] The subset consists of 33 proteins that are involved in the apoptosis signaling pathway. Apoptosis is known to be an essential component of several processes including normal cell turnover, proper development and functioning of the immune system, and chemical-induced cell death.[3] It is generally characterized by distinct morphological states and energy-dependent biochemical mechanisms. Even though many important apoptotic proteins have been identified, the molecular mechanisms of these proteins still remain to be elucidated. We applied different methods to determine the number of significant components in this RPPA data set; the results are displayed in Table 2 and Supplementary Table 11. The number of components that they find is highly variable, ranging from . Even the best methods from our simulations give different values. However, our simulation studies suggest that the TwiceMean Auer-Gervini method works particularly well when the number of objects (in this case, 33 proteins) is small compared with the number of attributes (in this case, 511 samples). In the top panel of Figure 6, we have plotted the maximum posterior estimate of the number of components as a function of the prior parameter ; this plot gives better support for than for .

Table 2.

Number of principal components (PCs) from different algorithms on reverse phase protein array data.

Rules	Auer-Gervini		Broken stick	Rand. based	Minka	GCV	PESEL
	twicemean	cpt	broken-stick	rnd-Lambda	laplace	gcv	pesel.m.hete
PCs	6	1	1	8	20	12	15

Figure 6.

Analysis of AML RPPA data. Auer-Gervini step function relating the prior hyperparameter to the maximum posterior estimate of the number of significant principal components (top). Projection of proteins on the space of the first 2 components; colors denote different clusters (bottom).

Number of principal components (PCs) from different algorithms on reverse phase protein array data. Analysis of AML RPPA data. Auer-Gervini step function relating the prior hyperparameter to the maximum posterior estimate of the number of significant principal components (top). Projection of proteins on the space of the first 2 components; colors denote different clusters (bottom). To understand the biology driving these mathematical principal components, we then projected the proteins into the 6-dimensional PC space and used their directions to cluster them using a von-Mises Fisher mixture model.[37] Using the Bayesian information criterion for model selection, we found an optimal clustering into 6 groups of proteins, which are displayed in different colors in a 2-dimensional projection in the bottom panel of Figure 6. The 6 protein clusters are as follows: AIFM1, BCL2, and DIABLO, which we interpret as a block corresponding to the mitochondrial release of apoptosis-inducing factor[38]; BID, CASP3, CASP8, and XIAP, which are part of a caspase 3 feedback loop[39]; CASP3.cl175, CASP7.cl198, CASP9.cl315, and MDM2, a cleaved caspase block[40]; ARC, BAD.pS112, and YAP1p; BAD.pS136, BAD.pS155, and BAX; and The core group of 16 apoptosis-related proteins: BAD, BAK1, BCL2L1, BCL2L11, BIRC2, BIRC5, BMI1, CASP9, CASP9.cl330, MCL1, MDM4, PARP1, PARP1.cl214, TP53, TP53.pS15, and YAP1. The fact that 3 of these 6 clusters can be immediately identified from the literature as coherent biological subcomponents of the apoptosis pathway provides strong support for our approach.

Conclusions

Principal component analysis is one of the most popular and important techniques in the multivariate analysis of general data sets. However, because principal components are linear combinations of correlated variables, these components usually lack interpretability when analyzing biological data sets, especially transcriptomic or proteomic data sets from patients with cancer. Our study of the apoptosis pathway using proteomic data from patients with AML shows that we can use mixture model clustering in principal component space to replace the uninterpretable mathematical components with natural collections of related genes that enhance the biological interpetability of the decomposition. We expect that applying these methods to the genes or proteins in other signaling pathways will divide these pathways into 1-dimensional “building blocks” that are interpretable, robust, and can yield new biological insights. It would be of particular interest to apply these ideas to overlapping pathways to better understand the way similar components are reused in different contexts. Our ability to find biologically interpretable components, however, depends in a fundamental way on being able to determine the dimension of the principal component space. To accomplish this task, we introduced the PCDimension R package, which implements 3 types of models—the broken stick method, the randomization-based procedure of ter Braak,[15,16] and our enhancments to the model developed by Auer and Gervini[17]—to compute the number of significant principal components. Through extensive simulations, we have shown that the enhanced Auer-Gervini methods are competitive with the methods that performed best in previous comparative studies. It has been claimed that simulation of multivariate data sets can always be criticized as unrepresentative because they can never explore more than a tiny fraction of the wide range of possible covariance and correlation structures.[4] As with previous simulation studies, our work may have the same limitation. However, Ferre has also pointed out that simulations are the only way to test and compare these methods.[12] It is still valuable to compare methods empirically when the dimension of the data set is known, and factors of interest can be manipulated under simulation. We have endeavored to explore a wide variety of different correlation structures to identify settings where each method is likely to fail. In our simulations, the variants of Bartlett method clearly had the worst performance. This finding may be somewhat surprising: Peres-Neto et al[14] found these methods to be only a little worse than the best performers in their simulations and concluded that they were actually the best for distinguishing from . There are 2 factors that distinguish their simulations from ours. First, we considered a wider variety of correlation structures. The matrices considered by Peres-Neto are represented by our matrices 4 to 13. Second, the matrices we considered were larger. Motivated by problems from ecology, they looked at matrices that were or . Motivated by the larger gene or protein expression data sets currently being produced in biology, we looked at matrices that were . We think that both factors contribute to the different results. Supplementary Figure 3 shows that the errors arise primarily from matrices 1 to 4 and 10 to 12. This subset is comprised precisely of those matrices that include a substantial amount of unstructured noise. We suspect that the underlying difficulty arises because the stepwise hypothesis tests give rise to a classical problem of multiple comparisons. Thus, the likelihood of incorrectly rejecting a null hypothesis (and inflating the dimension) increases. This may also explain why Ferre[12] cautioned that Bartlett’s test can overestimate the number of components. The rnd-F method introduced by ter Brack also performed significantly worse in our simulations than in those of Perres-Neto. The plots in Supplementary Figure 3 reveal 2 things. First, for virtually every correlation matrix we considered, rnd-F is more variable and less accurate than rnd-Lambda. Second, the most serious large errors arise from matrices 10 to 13. These matrices contain a mix of both highly structured data and completely unstructured noise. Similar matrices were considered in the previous study, so we conclude that the rnd-F method simply works poorly with larger matrices. We investigated the graphical Bayesian method of Auer and Gervini[17] in some detail. Specifially, we introduced and tested 8 algorithms to enhance the method by automatically selecting the number of components. Two of these—the novel T test–based changepoint algorithm and the exponential model from the CPM package—were abysmal. In virtually every simulation, the T test seriously overestimates the number of components. The exponential CPM model overestimates the number at least half the time. The remaining 6 methods have acceptable performance most of the time. The overall winner in terms of accuracy from our simulation study is the rnd-Lambda randomization-based procedures. This additional accuracy, however, was obtained at a sustantial cost in computation time. The randomization methods take at least 2 orders of magnitude longer than any other methods that we studied. Three other methods were competitive with rnd-Lamda in terms of overall accuracy, but considerably faster: the Auer-Gervini model with criterion “CPT,” the PESEL method with criterion , and Minka’s Laplace approximation. It is interesting to note, however, that the 4 best overall methods rarely give the absolute best results for any fixed size of data matrix. Their ultimate strength is their consistency: their estimates are always competitive, and the average error in the estimated dimension is always less than 2. Two other methods perform well in complementary settings. The broken stick model is most accurate when there are 96 objects, and the Auer-Gervini method using the TwiceMean criterion is the most accurate when there are 24 objects. The TwiceMean criterion appears to overestimate the number of components when the size of the data matrix increases. By contrast, the broken stick model appears to benefit from having extra data available. MacArthur[22] and De Vita[41] showed that the broken stick model worked well when fitting the relative abundance data of species in ecological populations. It is possible that the distribution of the expected lengths in equation (2) will be better approximated with larger data matrices. This may explain why its performance improves. Our simulation studies uncovered at least 2 (possibly related) contexts where it is particularly difficult to estimate the number of components correctly. Every reasonable method severely underestimates the dimension for correlation matrices 1 to 3. In addition, most methods overestimate the dimension when there are 96 objects and 100 attributes, especially when we doubled the number of correlated blocks. In both cases, there is very little redundancy in the signals we are trying to detect. (The biological contexts where we expect to apply these methods are expected to contain considerable redundancy in the form of highly correlated genes or proteins.) To handle more general data sets, however, new methods will need to be developed to improve performance in these examples without sacrificing it in other examples. Further work will also be needed to clarify what kinds of structural changes occur as the number of objects increases from 24 to 96 and beyond. We do not expect our study to be the final word on how to determine the number of significant principal components; similar to Ferre,[12] we must conclude that there is no ideal solution to the problem. If forced to choose one method for all data sets, we would pick the Auer-Gervini model using the CPT criterion, as it is both reasonably accurate and reasonably fast. This would especially be the case if we were trying to analyze many data sets at once; for example, when performing an analysis similar to the one in our AML data set for a long list of different biological pathways or gene sets. When focused on only one data set, we would compute the estimates from multiple methods (including rnd-Lambda, PESEL with criterion , Minka’s Laplace approximation, broken stick, and Auer-Gervini with TwiceMean), and review them in light of both the traditional scree plot and the Auer-Gervini plot of the maximum posterior dimension as a function of the hyperparameter. When we used this method with the RPPA apoptosis data set, we successfully selected a dimension that produced clusters that made biological sense. We believe that this combination of analytical and graphical methods, as provided in the PCDimension package, will guide researchers to the most reliable results. Click here for additional data file. Supplemental material, uomf_march_5_2018_pcda_supplement for Decomposing the Apoptosis Pathway Into Biologically Interpretable Principal Components by Min Wang, Steven M Kornblau and Kevin R Coombes in Cancer Informatics

11 in total

1. The Scree Test For The Number Of Factors.

Authors: R B Cattell
Journal: Multivariate Behav Res Date: 1966-04-01 Impact factor: 5.923

Review 2. Apoptosis: a review of programmed cell death.

Authors: Susan Elmore
Journal: Toxicol Pathol Date: 2007-06 Impact factor: 1.902

3. Mitochondrial release of AIF and EndoG requires caspase activation downstream of Bax/Bak-mediated permeabilization.

Authors: Damien Arnoult; Brigitte Gaume; Mariusz Karbowski; Juanita C Sharpe; Francesco Cecconi; Richard J Youle
Journal: EMBO J Date: 2003-09-01 Impact factor: 11.598

4. Caspase-3 feeds back on caspase-8, Bid and XIAP in type I Fas signaling in primary mouse hepatocytes.

Authors: Karine Sá Ferreira; Clemens Kreutz; Sabine Macnelly; Karin Neubert; Angelika Haber; Matthew Bogyo; Jens Timmer; Christoph Borner
Journal: Apoptosis Date: 2012-05 Impact factor: 4.677

5. Highly phosphorylated FOXO3A is an adverse prognostic factor in acute myeloid leukemia.

Authors: Steven M Kornblau; Neera Singh; YiHua Qiu; Wenjing Chen; Nianxiang Zhang; Kevin R Coombes
Journal: Clin Cancer Res Date: 2010-03-09 Impact factor: 12.531

6. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

7. An information-intensive approach to the molecular pharmacology of cancer.

Authors: J N Weinstein; T G Myers; P M O'Connor; S H Friend; A J Fornace; K W Kohn; T Fojo; S E Bates; L V Rubinstein; N L Anderson; J K Buolamwini; W W van Osdol; A P Monks; D A Scudiero; E A Sausville; D W Zaharevitz; B Bunow; V N Viswanadhan; G S Johnson; R E Wittes; K D Paull
Journal: Science Date: 1997-01-17 Impact factor: 47.728

8. A 60 kd MDM2 isoform is produced by caspase cleavage in non-apoptotic tumor cells.

Authors: R Pochampally; B Fodera; L Chen; W Shao; E A Levine; J Chen
Journal: Oncogene Date: 1998-11-19 Impact factor: 9.867

9. Cluster analysis and display of genome-wide expression patterns.

Authors: M B Eisen; P T Spellman; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

10. Progeny Clustering: A Method to Identify Biological Phenotypes.

Authors: Chenyue W Hu; Steven M Kornblau; John H Slater; Amina A Qutub
Journal: Sci Rep Date: 2015-08-12 Impact factor: 4.379

7 in total

1. CytoGPS: A large-scale karyotype analysis of CML data.

Authors: Zachary B Abrams; Suli Li; Lin Zhang; Caitlin E Coombes; Philip R O Payne; Nyla A Heerema; Lynne V Abruzzo; Kevin R Coombes
Journal: Cancer Genet Date: 2020-10-02

2. Adaptive shifts underlie the divergence in wing morphology in bombycoid moths.

Authors: Brett R Aiello; Milton Tan; Usama Bin Sikandar; Alexis J Alvey; Burhanuddin Bhinderwala; Katalina C Kimball; Jesse R Barber; Chris A Hamilton; Akito Y Kawahara; Simon Sponberg
Journal: Proc Biol Sci Date: 2021-08-04 Impact factor: 5.530

3. Pattern recognition in lymphoid malignancies using CytoGPS and Mercator.

Authors: Zachary B Abrams; Dwayne G Tally; Lin Zhang; Caitlin E Coombes; Philip R O Payne; Lynne V Abruzzo; Kevin R Coombes
Journal: BMC Bioinformatics Date: 2021-03-01 Impact factor: 3.169

4. DNA methylation epitypes highlight underlying developmental and disease pathways in acute myeloid leukemia.

Authors: Brian Giacopelli; Min Wang; Ada Cleary; Yue-Zhong Wu; Anna Reister Schultz; Maximilian Schmutz; James S Blachly; Ann-Kathrin Eisfeld; Bethany Mundy-Bosse; Sebastian Vosberg; Philipp A Greif; Rainer Claus; Lars Bullinger; Ramiro Garzon; Kevin R Coombes; Clara D Bloomfield; Brian J Druker; Jeffrey W Tyner; John C Byrd; Christopher C Oakes
Journal: Genome Res Date: 2021-03-11 Impact factor: 9.043