Min Wang1, Steven M Kornblau2, Kevin R Coombes3. 1. Mathematical Biosciences Institute, The Ohio State University, Columbus, OH, USA. 2. Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, USA. 3. Department of Biomedical Informatics, The Ohio State University, Columbus, OH, USA.
Abstract
Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 "biological components," 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
Principal component analysis (PCA) is one of the most common techniques in the analysis of biological data sets, but applying PCA raises 2 challenges. First, one must determine the number of significant principal components (PCs). Second, because each PC is a linear combination of genes, it rarely has a biological interpretation. Existing methods to determine the number of PCs are either subjective or computationally extensive. We review several methods and describe a new R package, PCDimension, that implements additional methods, the most important being an algorithm that extends and automates a graphical Bayesian method. Using simulations, we compared the methods. Our newly automated procedure is competitive with the best methods when considering both accuracy and speed and is the most accurate when the number of objects is small compared with the number of attributes. We applied the method to a proteomics data set from patients with acute myeloid leukemia. Proteins in the apoptosis pathway could be explained using 6 PCs. By clustering the proteins in PC space, we were able to replace the PCs by 6 "biological components," 3 of which could be immediately interpreted from the current literature. We expect this approach combining PCA with clustering to be widely applicable.
Since the earliest days of gene expression microarrays, 2-way clustered heatmaps have
been a standard feature of papers studying genome-wide biological data
sets.[1,2] Such heatmaps
remain ubiquitous, despite numerous difficulties in interpretation, in
reproducibility, and in assigning statistical significance. Good clustering of the
genes in such data sets is critical for understanding the biology. Because many
biologists are more interested in signaling pathways than in individual genes, they
want to find a source of consistent, robust, and interpretable blocks of genes that
drive distinct functional characteristics of the pathways. These blocks of genes
form clusters that are relevant to comprehensive understanding of critical
biological processes. For example, apoptosis is an important biological process,
which is characterized by distinct morphological states and energy-dependent
biochemical mechanisms.[3] Understanding how proteins cluster in the apoptotic pathway will help
elucidate its underlying molecular mechanisms. Now, clustering can be thought of as
a form of dimension reduction, and a natural question is the “true dimension” of the
data. Various techniques have been developed to determine the dimensionality, the
most common being principal component analysis (PCA). For our purposes, an important
problem in PCA is to determine the number of statistically significant
components.Numerous methods have already been developed to estimate the number of significant
components. There are 4 types of approaches: (1) ad hoc subjective and graphical
rules; (2) methods based on distributional assumptions; (3) computationally
extensive procedures relying on Monte Carlo, permutation, cross-validation,
bootstrap, or jackknife[4,5];
and (4) Bayesian methods based on the probabilistic formulation of Tipping and Bishop[6] with marginalization estimated through Laplace approximation and its
variants.[7-9] The scree plot
method, which consists of plotting a curve of the eigenvalues of the sample
covariance matrix versus their rank and looking for an “elbow” in the curve, is the
most famous graphical approach.[10] However, this method relies on the user’s subjective experience to find any
possible “elbow.” Even so, other methods are not always superior to the simple scree
plot. Legendre and Legendre[11] used the “broken stick” distribution to compare the extra information in a
model to one with fewer parameters. Ferre[12] conducted an empirical study of many methods to select the number of PCs,
using data simulated from known parameters. He concluded that there is no “ideal”
solution to the problem of dimensionality in PCA. He also concluded that Bartlett’s tests[13] are an improvement because they are less subjective but may have a tendency
to overestimate the true number of components. Peres-Neto et al[14] conducted an extensive simulation study to evaluate a wider variety of
methods. They concluded that several methods, especially those based on
randomization and permutation proposed by ter Braak,[15,16] outperform the others and
should be applied to study general data sets. More recently, Josse and Husson[5] showed that the generalized cross-validation method performs well. Sobczyk et al[9] proposed a Bayesian approach called PEnalized SEmi-integrated Likelihood
(PESEL); by comparing it with other state-of-the-art methods, they found that
PESEL-based criteria are more robust against deviations from the assumptions of a
probabilistic model than the other methods.In 2008, Auer and Gervini[17] addressed the problem of selecting principal components in the context of
Bayesian model selection. Although their method has strong theoretical foundations
and appears to work well in practice, it still depends on the subjective evaluation
of a graphical display of how the maximum posterior estimate of the number of
components depends on a parameter describing the choice of the prior distribution.
Moreover, its performance has only been compared with the scree plot and broken
stick methods and not to more sophisticated methods that have performed well in
comparative studies.In this article, we consider several algorithms to extend and automate the
Auer-Gervini method by providing objective rules to select the number of significant
principal components. Using an extensive set of simulations, we compare these
algorithms with the broken stick model, Bartlett’s test, ter Braak’s randomization
methods, generalized cross-validation, and Bayesian probabilistic PCA. The methods
chosen for comparison were the “winners” from the previous comparative
studies.[7,9,12,14] Our extensions
to the Auer-Gervini method are implemented in the PCDimension R package. Because the
most promising versions of the randomization algorithms appear not to be readily
available, we have also implemented them in the PCDimension package. For
convenience, the package also implements the broken stick method. For Bartlett’s
test, we rely on an existing implementation in the nFactors package.[18] For generalized cross-validation, we use the implementation in the FactoMineR package.[19] The implementation of Bayesian approximation methods including Minka’s
approach and PESEL from probabilistic PCA is available in the pesel package and code
on Github.[9]This article is organized as follows. In section “Methods,” we review the theoretical
framework of different types of methods. In section “Simulation Study” of the
“Results” section, we perform simulation studies to test the performance of the
proposed algorithms. In the “Decomposing the Apoptosis Pathway in AML” section, we
apply the methods to a study of apoptosis in acute myeloid leukemia (AML) using
reverse phase protein arrays (RPPAs). Finally, we conclude the article and make
several remarks in section “Conclusions.” A simple example to illustrate the
implementation of different methods in the PCDimension package is also provided in
the supplementary material.
Methods
Let denote an data matrix, where each row represents an object to be analyzed,
and each column represents a measured attribute. In PCA, each principal component is
a linear combination of the attributes. Much effort has been expended on
investigating many objects using relatively few attributes, but the opposite
scenario—few objects and many attributes—is usually ignored. One practical
application would be clustering a few genes or proteins from a single biological
pathway using their expression values for many patients. Therefore, we are primarily
interested in biological applications where . In this section, we briefly review the methods used to estimate
the number of statistically significant PCs.
Bartlett’s test
Bartlett[13] proposed a statistical method to conduct a hypothesis test on the
significance of the principal components based on the eigenvalues of
, the correlation matrix of the objects. This test is
designated to check whether the remaining eigenvalues of the correlation
structure are equal after removing the well-determined (highly significant)
components. Let the eigenvalues of be with . The procedure, for various values of , starting at in decreasing order, is to test the null hypothesis
that the “ smallest eigenvalues of the correlation matrix are equal”
against the alternative hypothesis that “at least 2 of the eigenvalues are different.” The test statistic is as
follows:whereUnder the null hypothesis, the test statistic follows a distribution with degrees of freedom. The optimal number of principal components
is the smallest where is accepted. Both Lawley[20] and Anderson[21] made some modifications to the multiplicative factor for equation (1); these are viewed
as improved variants of Bartlett’s test.
Broken stick model
Under the assumption that the total variance of the multivariate data is divided
at random among all possible components, the expected distribution of the
eigenvalues in the covariance or correlation matrix follows a broken stick distribution.[22] This model says that if we have a stick of unit length, broken at random
into segments, then the expected length of the longest piece is as follows:As the expected values under the broken stick model are obtained in decreasing
order, it is necessary to rank the relative proportions of the variance that are
accounted for by the PCs in the same way. The estimated number of PCs is the
maximum index where the observed relative proportion of variance is greater than
or equal to the expected value from the broken stick distribution.
Randomization-based procedure
This procedure involves a randomization approach to generate a large number of
data sets by scrambling the observed data in a manner of sampling without
replacement.[15,16] These randomized data sets are then used to compute
empirical P values for the statistics of interest characterize
the internal structure of the eigenvalues in the correlation matrix. The test
procedure is as follows: (1) randomize the values with all the attribute columns
of the data matrix, (2) perform PCA on the scrambled data matrix, and (3)
compute the test statistics. All 3 steps are repeated a total of times, where is a large enough integer to guarantee the accuracy of
estimating the P value; in practice, is usually set to equal 1000. In each randomization, 2 test
statistics are computed: (1) the eigenvalue for the principal component and (2) a pseudo F ratio
computed as . Finally, the P value for each
and each statistic of interest is estimated to be the
proportion of the test statistics in all data sets that are greater than or
equal to the one in the observed data matrix.
Bayesian approximation methods
The probabilistic PCA model assumes that each observation arises from the following model:where is the mean vector, is a d-dimensional Gaussian latent vector,
is an parameter coefficient matrix, and is Gaussian noise. Tipping and Bishop[6] showed that the principal components of can be retrieved using the maximum likelihood estimator of
, which is given as follows:where is the matrix of ordered principal eigenvectors of , is an estimator of the true noise level , is the diagonal matrix with the estimates of the eigenvalues of the
covariance matrix, and is an arbitrary orthogonal matrix.Finding the number of principal components is then equivalent to a model
selection problem from . The optimal number of components can be obtained by applying
the following criteria:where the number of components is implicit in the matrix . With a noninformative prior, the mean vector can be integrated out and the probability of given and is obtained as follows:where is the sample covariance matrix.[7] With a prior , the probability of observing given the signal dimensionality is as follows:Because of the computational burden in calculating equation (7), Minka[7] proposed a conjugate prior for and derived a Laplace approximation of the marginal likelihood
in equation
(7) which is then used in equation (5) to compute the
optimal number of components. This approximation approach has been empirically
shown to be efficient in small-sample scenarios. In 2008, Hoyle[8] considered the high-dimensional scenarios when and grow to infinity and proposed another approximation where more
terms in the asymptotic expansion of the integrand of equation
(7) are used to guarantee more accurate evaluation. However, similar
to Minka’s Laplace approximation, it depends on the selection of the prior on
. This approach is not considered here because there is no
publicly available implementation.Sobczyk et al[9] explored 2 other high-dimensional scenarios: (1) is fixed and and (2) is fixed and , where the second one is out of the scope of this article. For
the 2 different scenarios, they modeled either the rows or the columns of the
matrix via fixed effects models (with slightly different forms depending on the
relative size of and ). The 2 models are similar to equation (3) where either the
matrix of PCA scores or the matrix of PCA loadings are used. They imposed a
prior on these matrices and presented a methodology for approximating the
posterior probability of . Using 2 different prior distributions on terms similar to
in equation (3), they proposed 2
corresponding forms of PESEL where either all the singular values in PCA are
homogeneous (abbreviation homo) or heterogeneous (abbreviation
hete). Therefore, there are 4 criteria to maximize the
posterior probability of given the PESEL: and for Scenario 1 and and for Scenario 2. In the rest of this article, we study the same
3 criteria , , and used in the work by Sobczyk et al.
Auer-Gervini model
We briefly review the Auer-Gervini method.[17] Suppose that are all columns of data matrix , and is a random sample with mean vector and covariance matrix . Note that in the work by Auer and Gervini, they assume that
the rows are i.i.d. Gaussian. Here, we make a transposition and assume the
columns of X are i.i.d. Gaussian to guarantee the consistency of notations with
previous sections. Write where is orthonormal and with . Let be the model with significant components or eigenvalues, that is,
, and , for . Under , a random sample is as follows:where are uncorrelated random variables with mean 0 and variance 1,
and is a random error vector with mean 0 and covariance matrix
. Therefore, the problem of selecting the number of PCs is
transformed into the problem of choosing the correct model .A prior probability is assigned to of the following form:where is a normalizing constant that satisfies for . Then, under certain approximations, one can use Bayes rule to
derive a formula for the maximum posterior estimate of as a function of the prior parameter :where is an estimator of , are the eigenvalues of the sample covariance matrix and also
the maximum likelihood estimators of under model , and and are the geometric and the arithmetic means of , respectively. The formula describes as a nonincreasing step function with respect to
, with . The step function is then plotted and the “highest dimension
for which the step length is significantly large” is selected to be the optimal
number of components.In other words, an exponential prior is placed on the number of significant
components. The prior depends on a hyperparameter that governs how fast the distribution decays. As
goes to , the prior drops off rapidly and the posterior estimate of the
number of PCs will go to 0. Auer and Gervini proposed graphing the posterior
estimate as a step function of , which can visually help select the highest “nontrivial step
length.” A large step length means that the estimated number of PCs is optimal
under a wide range of prior model probabilities. However, the notion of
“nontrivial step length” remains subjective, which is similar to the situation
where one needs to select a recognizable “elbow” in the scree plot. Automating
the definition of nontrivial step length is further complicated by the fact that
the final step for is theoretically infinite. We will operationalize the final
subjective step by putting an upper bound on the largest “reasonable” estimate
of and will develop criteria to automatically choose the
significantly large step length.
Automating the Auer-Gervini method
As originally described, the Auer-Gervini model is a visual Bayesian approach,
and the critical final step is to decide what constitutes a significantly large
length of a step. This problem can be thought of as one of classification, in
which the set of step lengths must be partitioned into 2 groups (short and
long). We propose to test multiple algorithms to solve the problem as
follows.
TwiceMean
Use twice the mean of the set of step lengths as a cutoff to separate the
long and short steps. Intuitively, the idea is to view the distribution of
the step lengths as exponential when the data arise from random noise. As
the mean equals the standard deviation for an exponential distribution,
twice the mean is the same as the mean plus one standard deviation and
provides a plausible cutoff to select “long” step lengths. This simple idea
is inspired in part by Chaterjee,[23] who considered recovery of low-rank matrices by thresholding singular
values. He proposed that one could have a single universal choice of the
threshold parameter which is slightly greater than 2 and gives near-optimal
mean square error for singular value hard thresholding.
K-means
Because the goal is to partition the step lengths into 2 groups, a natural
solution is to cluster them using the traditional K-means
algorithm with . We seed the algorithm with starting centers using the
minimum and maximum step lengths. As we will discuss in more detail below,
the final step (when ) is theoretically infinite. We will bound this last step
but it will ensure that at least one step is “long.”
Kmeans3
Our initial experience (data not shown) using K-means with
large found it to be overly conservative when assigning steps to
the “long” group. Given its dependence on Euclidean distances, that is
exactly how one should expect it to perform when the true mixture
distribution is skewed right. To address this problem, we modified the
algorithm as follows. If the number of objects is large , we use the K-means algorithm, with
and seeds of the minimum, median, and maximum values, to
separate the step lengths into 3 groups: Low, Intermediate, and High. We
then treat both Intermediate and High groups as long.
Spectral
Use spectral clustering to divide the step lengths into 2 groups. Spectral
clustering is one of the most popular modern clustering algorithms. It
sometimes outperforms the traditional clustering algorithms such as
K-means.[24,25]
CPT
Instead of simply clustering the step lengths into 2 groups, we can instead
sort them into increasing order and view the problem as one of change point
detection. One existing solution to this problem is provided by the At Most
One Change (AMOC) method implemented by the cpt.mean function from the
changepoint R package.[26] The detection of the first change point is posed as a hypothesis test
and a generalized likelihood ratio–based approach is extended to changes in
variance within normally distributed observations (the reader can consult Hinkley[27] and Gupta and Tang[28] for more details).
CpmFun
The cpm R package defines several other “Change Point Models.”[29] These are implemented by the detectChangePointBatch function, which
processes the step lengths in one batch and returns information regarding
whether the sequence contains a change point. The default is to use the
“Exponential” method, which computes a generalized likelihood ratio
statistic for the exponential distribution.
Ttest
We also implemented a novel change point detection algorithm based on the
t test. We begin by sorting the steps lengths in
increasing order. Then, we compute the gaps between successive step lengths.
At each (sorted) step, we use the t distribution to
determine the likelihood that the next gap is larger than expected from the
previously observed gaps. The first time that the next gap is significantly
larger than expected, we assert that this next step length is the smallest
one that constitutes a “long” step length.
Ttest2
Where the K-means algorithm was found to be conservative,
the Ttest algorithm just described was sometimes found to be
anticonservative. This can happen when the first few step lengths are all
about the same size, which yields a small standard deviation. In this case,
a relative short next step will be falsely discovered based on the Ttest
criterion. To avoid this problem, it may be appropriate to include the next
(test) step length and gap when estimating the mean and standard deviation
of the gap distributon. We modified the Ttest algorithm in this way to make
it more conservative.
Bounding the last step
All of the proposed methods for separating short from long steps require us to
bound the permissible length of the final step when , which would otherwise be infinite. This step is important
because it allows the algorithm to conclude in some cases that the only long
step is the final one, and the true number of principal components should equal
0. We use the largest of the following 3 quantities:further than the final change point (to
) in the step function., where is the number of attributes. This procedure selects
the value of for which 99% of the prior probability is assigned to
.if and , otherwise, where is the number of objects. This formula was derived
empirically from a Monte Carlo study on data sets with random noise. We
estimate as the maximum of the empirical largest change point
in the step function for various values of and in 2 scenarios: and . It can be seen as the value where the maximum point
for could be achieved when various d’s
are sharing the prior information on under the uninformative noise structure.Note that all simulations and computations in this article were performed using
version 3.2.2 of the R statistical software environment with version 1.1.3 of
the PCDimension package, which we have developed, version 2.3.3 of the nFactors
package, version 1.39 of the FactoMineR package, and version 0.7.1 of the pesel
package. The details on how to select the number of PCs for a simple example
using the PCDimension package are provided in the supplementary material.
Results
Simulation study
We follow a Monte Carlo procedure to study the robustness of different types of
methods described above for estimating the number of PCs. For real data sets, we
will never know the “correct” answer. So, we simulate a collection of data sets
with different correlation structures to compare the numerical Auer-Gervini
model we have implemented with the other types. We start by supplying details
about the correlation structures and data sets used in the simulations.
Simulated datatypes
We use a protocol similar to those discussed in recent papers.[5,14,17,30-32] In the
simulations, the number of measured attributes is taken to be either
or . The range of 100 to 400 is chosen to represent small to
moderately large data sets. We also consider data sets with either
or objects. We view 24 objects as a small data set and 96
objects as a moderately large one.[33] The number of significant diagonal blocks is either the number shown
in Figure 1 or twice
that number (with finer correlation structures of double group blocks). By
varying both the number of objects and the number of correlated blocks, we
can explore the effects of the number of nontrivial components and the
number of objects per component. To also explore the effects of different
combinations of additional factors, including eigenvalue distributions,
signed or unsigned signals, uncorrelated variation, and unskewed normal or
skewed distributions, we use the 19 different covariance structures and
correlation matrices illustrated in Figure 1. Matrices 1–3 are covariance
matrices with different marginal distributions: normal, marginal
gamma, and marginal t distribution, where , and with for 1≤i≤5 and for . Matrices 4–19 are correlation matrices and we set the corresponding covariance matrices to be
where . Note that the correlation is the standardized version of
the covariance, that is, where . We use this form for most of the covariance matrices
because we want to control the signal to noise ratio (basically,
) to test the robustness of methods to the noise level
. Matrix 4 contains only noise; it is a purely uncorrelated
structure. Matrices 5 and 6 represent correlation structures with various
homogeneous cross-correlation strengths (unsigned signals) 0.3 and 0.8.
Matrices 7–13 are correlation matrices where between-group (0.3, 0.1, or 0)
and within-group (0.8 or 0.3) correlations of objects are fixed.[14,31]
Matrices 14–19 are correlation structures where negative cross-correlations
(−0.8 or −0.3, signed signals) are considered within groups and mixture of
signed and unsigned signals are also included.
Figure 1.
The 19 correlation matrices considered in the simulation study for 24
objects. Values of correlations are provided by the colorbar.
Numbers in parentheses correspond to the known dimension.
The 19 correlation matrices considered in the simulation study for 24
objects. Values of correlations are provided by the colorbar.
Numbers in parentheses correspond to the known dimension.
Empirical simulation results
For each of the 19 scenarios, we simulate 1000 data sets. Then, the numbers
of components are estimated based on all (19 variants of the) types of
methods. That is, for each variant within each type of model, we compute the
estimated dimension. We also investigate a “majority rule” procedure for the
Auer-Gervini model; that is, the dimension that more than 3 criteria out of
8 selected is the one that is estimated by the majority rule. Then, we
calculate the absolute difference between the known dimensions and the
estimated ones for each simulated sample and correlation structure. The mean
of the absolute differences over both 1000 simulated data sets and 19
correlation matrices is plotted in Figure 2. The corresponding numeric
values are also provided in Supplementary Table 1. The values in this table can help
assess the quality of each method. The closer to 0 the values are, the
better the corresponding variant within the type of method. However, the
values do not describe whether a method overestimates or underestimates the
number of nontrivial PCs.
Figure 2.
Log-transformed mean values, across the correlation matrices, of the
absolute difference between the known dimension and the sample
estimates from 15 different methods in 8 simulation scenarios
(Scenario 1: 24 × 100, 1X correlated blocks; Scenario 2: 24 × 400,
1X; Scenario 3: 96 × 100, 1X; Scenario 4: 96 × 400, 1X; Scenario 5:
24 × 100, 2X; Scenario 6: 24 × 400, 2X; Scenario 7: 96 × 100, 2X;
and Scenario 8: 96 × 400, 2X).
Log-transformed mean values, across the correlation matrices, of the
absolute difference between the known dimension and the sample
estimates from 15 different methods in 8 simulation scenarios
(Scenario 1: 24 × 100, 1X correlated blocks; Scenario 2: 24 × 400,
1X; Scenario 3: 96 × 100, 1X; Scenario 4: 96 × 400, 1X; Scenario 5:
24 × 100, 2X; Scenario 6: 24 × 400, 2X; Scenario 7: 96 × 100, 2X;
and Scenario 8: 96 × 400, 2X).In Figure 2 and
Supplementary Table 1, one can see that, as anticipated, the
results from most algorithms are better with fewer correlated blocks
(Scenarios 1-4), probably because there are more objects representing each
block. Also, accuracy in almost all methods is much better with fewer
objects (Scenarios 1-2 and 5-6). The situation is more complicated when the
number of attributes changes. In general, the worst performance occurs when
the data matrix is nearly square (Scenarios 3 and 7).Overall, the most accurate methods when averaging across all 8 scenarios are
(1) the rnd-Lambda algorithm (mean deviation, ), (2) the PESEL method with criterion
, (3) the Auer-Gervini model with criterion “CPT”
, and (4) Minka’s Laplace approximation . Interestingly, the rnd-Lambda and Minka-Laplace
algorithms each produce the best average performance in only 1 out of 8
scenarios. Moreover, the Auer-Gervini CPT model and the model are not the best performers in any of the 8
individual scenarios. By contrast, 3 other methods (the “TwiceMean”
Auer-Gervini model, the “ttest2” Auer-Gervini model, and generalized
cross-validation) have the best performance in 1 out of 8 scenarios. The
simple broken stick method is the best performer in 3 out of 8
scenarios.This variabililty in the winners suggests that we look more closely at how
the simulation parameters affect performance. When there are only 24
objects, the 4 best methods are (1) “TwiceMean” Auer-Gervini , (2) generalized cross-validation , (3) rnd-Lambda , and (4) “kmeans” Auer-Gervini . When there are 96 objects, the 4 best performers are (1)
broken stick , (2) rnd-Lambda , (3) , and (4) Minka-Laplace . A slight shuffling of the order of the same set of
high-performing methods occurs when we vary the number of blocks (broken
stick is best with 1X , Minka-Laplace with 2X ) or the number of attributes (Minka-Laplace is best with
100 , and rnd-Lambda with 400 ).The variants of Bartlett’s test, as implemented in R, have the worst
performance of all the stopping rules we have considered (Figure 2).
Furthermore, the rnd-F algorithm is not as good as expected because a
previous study found it to be one of the most successful rules tested.[14] Even though the CPT criterion is one of the best overall, and the
TwiceMean criterion is the best when there are 24 objects, some of the
criteria that we use to automate the Auer-Gervini model are not good
candidates for computing the dimension. For example, the “Ttest” and “CPM”
criteria often result in large deviations from the true dimension. The PESEL
criteria and have middle-of-the-road performance, which may by due to
considering in this article.
Running time
In addition to the absolute differences between the true dimension and the
estimates, we computed the average running time of all types of methods over
all correlation matrices per data set (Table 1). All timings were
conducted on a computer with “Intel Xeon CPU E5-2690 v3 @ 2.60-GHz 2.59-GHz”
processors running Windows Server 2008 R2 Enterprise. Note that the time
shown in the table is the total time of all the variants or criteria within
each type of model. It is obvious that the computation time becomes longer
as the number of objects or attributes increases, and there is almost no
change in time usage when the number of blocks is doubled. From the table,
we can see that Bartlett’s test and the broken stick method use the least
time in computing the number of components. However, the accuracy using the
broken stick approach is much better than in Bartlett’s test. The most
accurate overall method, rnd-Lambda, is by far the slowest, taking several
orders of magnitude more time than the other methods. Of the remaining 3
methods, which achieve nearly the same level of accuracy, Minka’s Laplace
approximation is the fastest and the “CPT” Auer-Gervini model is the
slowest.
Table 1.
Average running time of all types of methods across correlation
matrices (unit: seconds).
Rules
Original blocks
(1X)
Twice blocks (2X)
24 objects
96 objects
24 objects
96 objects
m = 100
m = 400
m = 100
m = 400
m = 100
m = 400
m = 100
m = 400
Broken stick
0.002
0.003
0.008
0.018
0.002
0.003
0.008
0.018
Bartlett’s test
0.004
0.004
0.015
0.004
0.004
0.013
0.016
GCV
0.004
0.012
0.032
0.003
0.006
0.012
0.032
Minka’s Laplace
0.005
0.024
0.047
0.005
0.007
0.024
0.048
PESEL
0.010
0.026
0.191
0.008
0.104
0.026
0.183
Auer-Gervini
0.035
0.234
0.269
0.033
0.035
0.236
0.275
Rand. based
2.121
2.990
9.253
20.152
1.832
3.131
9.212
20.061
Average running time of all types of methods across correlation
matrices (unit: seconds).
High-accuracy methods: rnd-Lambda and Auer-Gervini with CPT
More detailed results on the performance of the rnd-Lambda algorithm are
presented in Figure
3 and Supplementary Tables 2 and 3 for the case of the original block structure shown in
Figure 1.
Similar results for the “CPT” criterion in the Auer-Gervini model are
presented in Figure
4 and Supplementary Tables 4 and 5. For each covariance or correlation matrix, we computed
the percentage of deviations between the estimates and the known dimension.
The results are similar regardless of the number of attributes (100 or 400)
or objects (24 or 96). Both methods tend to underestimate the dimension for
covariance matrices of unskewed or skewed distributions (matrices 1-3). They
are quite accurate for correlation matrices of normal distribution (matrices
4-19). When they make errors with the normal distribution, the rnd-Lambda
algorithm is more likely to slightly overestimate the dimension, whereas the
Auer-Gervini CPT method is more likely to underestimate.
Figure 3.
Percentage of deviations between the estimate from
randomization-based procedure (rnd-Lambda) and known dimension.
Figure 4.
Percentage of deviations between the estimate from the “CPT”
criterion in the Auer-Gervini model and known dimension.
Percentage of deviations between the estimate from
randomization-based procedure (rnd-Lambda) and known dimension.Percentage of deviations between the estimate from the “CPT”
criterion in the Auer-Gervini model and known dimension.
Special-case methods: TwiceMean and broken stick
We present details on the performance of the “TwiceMean” criterion in the
Auer-Gervini model and the broken stick method, which were best when
restricting to either the 24-object or 96-object simulations, respectively,
as shown in Figure 5
and Supplementary Tables 6 and 7. For data sets with 24 objects, the Auer-Gervini method
with criterion “TwiceMean” is very accurate for uniform matrices (matrices 5
and 6), correlated matrices (matrices 7-13) and unsigned data with or
without signed signals (matrices 14-19). When the sample covariance matrix
is either skewed or unskewed (matrices 1-3), the dimension is usually
underestimated. Also, the results do not vary too much with different
numbers of attibutes. For 96 objects, there is not much difference between
the results of the broken stick method in Supplementary Table 7 and that of the criterion “TwiceMean”
in Supplementary Table 6. And the results of the broken stick
method for 100 attributes are slightly better than those for 400
attributes.
Figure 5.
Percentage of deviations between the estimate from either the
“TwiceMean” criterion in the Auer-Gervin model or the broken stick
model and known dimension.
Percentage of deviations between the estimate from either the
“TwiceMean” criterion in the Auer-Gervin model or the broken stick
model and known dimension.
Robustness to random noise
We investigated the influence of random noise on the ability of different
methods to correctly detect the underlying structure. We conducted
additional simulation studies by adding different levels of (i.i.d. normal)
noise corresponding to 3 different values of the variance: , , and . Summaries of the absolute difference between the known
dimension and the estimates made using various methods under different
levels of noise are presented in Supplementary Tables 8 to 10. Although the error rate tends to increase with
increasing noise, we found that the relative performance of the methods is
consistent regardless of the size of the noise. That is, most algorithms
still perform better with fewer objects, and accuracy is almost always worse
when the number of blocks doubles. Most importantly, the best method for
each scenario does not change in most cases as the noise moves from 0 to 1.
The “TwiceMean” criterion in the Auer-Gervini model is better when the
number of objects is small relative to the number of objects, whereas the
broken stick method is better when the number of objects is close to the
number of attributes. Finally, regardless of the noise level, the rnd-Lambda
algorithm, the Auer-Gervini model with “CPT” criterion, the PESEL method
with criterion , and Minka’s Laplace approximation outperform the others
on average across all scenarios.
Decomposing the apoptosis pathway in AML
Since the introduction of gene expression microarrays in the 1990s, most
statistical analyses of omics data have treated pathways as second-class
objects, in the following sense: primary analyses are performed at the gene
level. That is, the data are first analyzed gene by gene to find differences
between known groups of patients such as responders and nonresponders. Then, a
significance cutoff is chosen and a second statistical test conditional on the
gene-by-gene results (such as gene set enrichment analysis[34]) is performed to infer which pathways differ between the 2 groups. One
reason analysts give precedence to individual genes is that univariate analyses
are easier than the multivariate ones needed for pathways. However, many
biologists are more interested in pathways than in individual genes because they
give a higher-level functional picture of biological behavior.Informally, biologists talk about pathways as though they are 1-dimensional
entities. At a cell level they are “on” or “off”; at a tissue level, they have a
simple “degree of activation.” But we hypothesize that most pathways, including
the apoptosis signaling pathway, are intrinsically multidimensional. To test
this hypothesis, we used a subset of RPPA data on samples collected from 511
patients with AML.[35,36] The subset consists of 33 proteins that are involved in the
apoptosis signaling pathway. Apoptosis is known to be an essential component of
several processes including normal cell turnover, proper development and
functioning of the immune system, and chemical-induced cell death.[3] It is generally characterized by distinct morphological states and
energy-dependent biochemical mechanisms. Even though many important apoptotic
proteins have been identified, the molecular mechanisms of these proteins still
remain to be elucidated.We applied different methods to determine the number of significant components in this RPPA data set; the results
are displayed in Table
2 and Supplementary Table 11. The number of components that they find
is highly variable, ranging from . Even the best methods from our simulations give different
values. However, our simulation studies suggest that the TwiceMean Auer-Gervini
method works particularly well when the number of objects (in this case, 33
proteins) is small compared with the number of attributes (in this case, 511
samples). In the top panel of Figure 6, we have plotted the maximum posterior estimate of the
number of components as a function of the prior parameter ; this plot gives better support for than for .
Table 2.
Number of principal components (PCs) from different algorithms on reverse
phase protein array data.
Rules
Auer-Gervini
Broken stick
Rand. based
Minka
GCV
PESEL
twicemean
cpt
broken-stick
rnd-Lambda
laplace
gcv
pesel.m.hete
PCs
6
1
1
8
20
12
15
Figure 6.
Analysis of AML RPPA data. Auer-Gervini step function relating the prior
hyperparameter to the maximum posterior estimate of the number
of significant principal components (top). Projection
of proteins on the space of the first 2 components; colors denote
different clusters (bottom).
Number of principal components (PCs) from different algorithms on reverse
phase protein array data.Analysis of AML RPPA data. Auer-Gervini step function relating the prior
hyperparameter to the maximum posterior estimate of the number
of significant principal components (top). Projection
of proteins on the space of the first 2 components; colors denote
different clusters (bottom).To understand the biology driving these mathematical principal components, we
then projected the proteins into the 6-dimensional PC space and used their
directions to cluster them using a von-Mises Fisher mixture model.[37] Using the Bayesian information criterion for model selection, we found an
optimal clustering into 6 groups of proteins, which are displayed in different
colors in a 2-dimensional projection in the bottom panel of Figure 6. The 6 protein clusters are as
follows:AIFM1, BCL2, and DIABLO, which we interpret as a block corresponding to
the mitochondrial release of apoptosis-inducing factor[38];BID, CASP3, CASP8, and XIAP, which are part of a caspase 3 feedback loop[39];CASP3.cl175, CASP7.cl198, CASP9.cl315, and MDM2, a cleaved caspase block[40];ARC, BAD.pS112, and YAP1p;BAD.pS136, BAD.pS155, and BAX; andThe core group of 16 apoptosis-related proteins: BAD, BAK1, BCL2L1,
BCL2L11, BIRC2, BIRC5, BMI1, CASP9, CASP9.cl330, MCL1, MDM4, PARP1,
PARP1.cl214, TP53, TP53.pS15, and YAP1.The fact that 3 of these 6 clusters can be immediately identified from the
literature as coherent biological subcomponents of the apoptosis pathway
provides strong support for our approach.
Conclusions
Principal component analysis is one of the most popular and important techniques in
the multivariate analysis of general data sets. However, because principal
components are linear combinations of correlated variables, these components usually
lack interpretability when analyzing biological data sets, especially transcriptomic
or proteomic data sets from patients with cancer. Our study of the apoptosis pathway
using proteomic data from patients with AML shows that we can use mixture model
clustering in principal component space to replace the uninterpretable mathematical
components with natural collections of related genes that enhance the biological
interpetability of the decomposition. We expect that applying these methods to the
genes or proteins in other signaling pathways will divide these pathways into
1-dimensional “building blocks” that are interpretable, robust, and can yield new
biological insights. It would be of particular interest to apply these ideas to
overlapping pathways to better understand the way similar components are reused in
different contexts.Our ability to find biologically interpretable components, however, depends in a
fundamental way on being able to determine the dimension of the principal component
space. To accomplish this task, we introduced the PCDimension R package, which
implements 3 types of models—the broken stick method, the randomization-based
procedure of ter Braak,[15,16] and our enhancments to the model developed by Auer and Gervini[17]—to compute the number of significant principal components. Through extensive
simulations, we have shown that the enhanced Auer-Gervini methods are competitive
with the methods that performed best in previous comparative studies.It has been claimed that simulation of multivariate data sets can always be
criticized as unrepresentative because they can never explore more than a tiny
fraction of the wide range of possible covariance and correlation structures.[4] As with previous simulation studies, our work may have the same limitation.
However, Ferre has also pointed out that simulations are the only way to test and
compare these methods.[12] It is still valuable to compare methods empirically when the dimension of the
data set is known, and factors of interest can be manipulated under simulation. We
have endeavored to explore a wide variety of different correlation structures to
identify settings where each method is likely to fail.In our simulations, the variants of Bartlett method clearly had the worst
performance. This finding may be somewhat surprising: Peres-Neto et al[14] found these methods to be only a little worse than the best performers in
their simulations and concluded that they were actually the best for distinguishing
from . There are 2 factors that distinguish their simulations from ours.
First, we considered a wider variety of correlation structures. The matrices
considered by Peres-Neto are represented by our matrices 4 to 13. Second, the
matrices we considered were larger. Motivated by problems from ecology, they looked
at matrices that were or . Motivated by the larger gene or protein expression data sets
currently being produced in biology, we looked at matrices that were . We think that both factors contribute to the different results.
Supplementary Figure 3 shows that the errors arise primarily from
matrices 1 to 4 and 10 to 12. This subset is comprised precisely of those matrices
that include a substantial amount of unstructured noise. We suspect that the
underlying difficulty arises because the stepwise hypothesis tests give rise to a
classical problem of multiple comparisons. Thus, the likelihood of incorrectly
rejecting a null hypothesis (and inflating the dimension) increases. This may also
explain why Ferre[12] cautioned that Bartlett’s test can overestimate the number of components.The rnd-F method introduced by ter Brack also performed significantly worse in our
simulations than in those of Perres-Neto. The plots in Supplementary Figure 3 reveal 2 things. First, for virtually every
correlation matrix we considered, rnd-F is more variable and less accurate than
rnd-Lambda. Second, the most serious large errors arise from matrices 10 to 13.
These matrices contain a mix of both highly structured data and completely
unstructured noise. Similar matrices were considered in the previous study, so we
conclude that the rnd-F method simply works poorly with larger matrices.We investigated the graphical Bayesian method of Auer and Gervini[17] in some detail. Specifially, we introduced and tested 8 algorithms to enhance
the method by automatically selecting the number of components. Two of these—the
novel T test–based changepoint algorithm and the exponential model
from the CPM package—were abysmal. In virtually every simulation, the
T test seriously overestimates the number of components. The
exponential CPM model overestimates the number at least half the time. The remaining
6 methods have acceptable performance most of the time.The overall winner in terms of accuracy from our simulation study is the rnd-Lambda
randomization-based procedures. This additional accuracy, however, was obtained at a
sustantial cost in computation time. The randomization methods take at least 2
orders of magnitude longer than any other methods that we studied. Three other
methods were competitive with rnd-Lamda in terms of overall accuracy, but
considerably faster: the Auer-Gervini model with criterion “CPT,” the PESEL method
with criterion , and Minka’s Laplace approximation. It is interesting to note,
however, that the 4 best overall methods rarely give the absolute best results for
any fixed size of data matrix. Their ultimate strength is their consistency: their
estimates are always competitive, and the average error in the estimated dimension
is always less than 2.Two other methods perform well in complementary settings. The broken stick model is
most accurate when there are 96 objects, and the Auer-Gervini method using the
TwiceMean criterion is the most accurate when there are 24 objects. The TwiceMean
criterion appears to overestimate the number of components when the size of the data
matrix increases. By contrast, the broken stick model appears to benefit from having
extra data available. MacArthur[22] and De Vita[41] showed that the broken stick model worked well when fitting the relative
abundance data of species in ecological populations. It is possible that the
distribution of the expected lengths in equation (2) will be better
approximated with larger data matrices. This may explain why its performance
improves.Our simulation studies uncovered at least 2 (possibly related) contexts where it is
particularly difficult to estimate the number of components correctly. Every
reasonable method severely underestimates the dimension for correlation matrices 1
to 3. In addition, most methods overestimate the dimension when there are 96 objects
and 100 attributes, especially when we doubled the number of correlated blocks. In
both cases, there is very little redundancy in the signals we are trying to detect.
(The biological contexts where we expect to apply these methods are expected to
contain considerable redundancy in the form of highly correlated genes or proteins.)
To handle more general data sets, however, new methods will need to be developed to
improve performance in these examples without sacrificing it in other examples.
Further work will also be needed to clarify what kinds of structural changes occur
as the number of objects increases from 24 to 96 and beyond.We do not expect our study to be the final word on how to determine the number of
significant principal components; similar to Ferre,[12] we must conclude that there is no ideal solution to the problem. If forced to
choose one method for all data sets, we would pick the Auer-Gervini model using the
CPT criterion, as it is both reasonably accurate and reasonably fast. This would
especially be the case if we were trying to analyze many data sets at once; for
example, when performing an analysis similar to the one in our AML data set for a
long list of different biological pathways or gene sets. When focused on only one
data set, we would compute the estimates from multiple methods (including
rnd-Lambda, PESEL with criterion , Minka’s Laplace approximation, broken stick, and Auer-Gervini
with TwiceMean), and review them in light of both the traditional scree plot and the
Auer-Gervini plot of the maximum posterior dimension as a function of the
hyperparameter. When we used this method with the RPPA apoptosis data set, we
successfully selected a dimension that produced clusters that made biological sense.
We believe that this combination of analytical and graphical methods, as provided in
the PCDimension package, will guide researchers to the most reliable results.Click here for additional data file.Supplemental material, uomf_march_5_2018_pcda_supplement for Decomposing the
Apoptosis Pathway Into Biologically Interpretable Principal Components by Min
Wang, Steven M Kornblau and Kevin R Coombes in Cancer Informatics
Authors: Karine Sá Ferreira; Clemens Kreutz; Sabine Macnelly; Karin Neubert; Angelika Haber; Matthew Bogyo; Jens Timmer; Christoph Borner Journal: Apoptosis Date: 2012-05 Impact factor: 4.677
Authors: Steven M Kornblau; Neera Singh; YiHua Qiu; Wenjing Chen; Nianxiang Zhang; Kevin R Coombes Journal: Clin Cancer Res Date: 2010-03-09 Impact factor: 12.531
Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205
Authors: J N Weinstein; T G Myers; P M O'Connor; S H Friend; A J Fornace; K W Kohn; T Fojo; S E Bates; L V Rubinstein; N L Anderson; J K Buolamwini; W W van Osdol; A P Monks; D A Scudiero; E A Sausville; D W Zaharevitz; B Bunow; V N Viswanadhan; G S Johnson; R E Wittes; K D Paull Journal: Science Date: 1997-01-17 Impact factor: 47.728
Authors: Zachary B Abrams; Suli Li; Lin Zhang; Caitlin E Coombes; Philip R O Payne; Nyla A Heerema; Lynne V Abruzzo; Kevin R Coombes Journal: Cancer Genet Date: 2020-10-02
Authors: Brett R Aiello; Milton Tan; Usama Bin Sikandar; Alexis J Alvey; Burhanuddin Bhinderwala; Katalina C Kimball; Jesse R Barber; Chris A Hamilton; Akito Y Kawahara; Simon Sponberg Journal: Proc Biol Sci Date: 2021-08-04 Impact factor: 5.530
Authors: Zachary B Abrams; Dwayne G Tally; Lin Zhang; Caitlin E Coombes; Philip R O Payne; Lynne V Abruzzo; Kevin R Coombes Journal: BMC Bioinformatics Date: 2021-03-01 Impact factor: 3.169
Authors: Brian Giacopelli; Min Wang; Ada Cleary; Yue-Zhong Wu; Anna Reister Schultz; Maximilian Schmutz; James S Blachly; Ann-Kathrin Eisfeld; Bethany Mundy-Bosse; Sebastian Vosberg; Philipp A Greif; Rainer Claus; Lars Bullinger; Ramiro Garzon; Kevin R Coombes; Clara D Bloomfield; Brian J Druker; Jeffrey W Tyner; John C Byrd; Christopher C Oakes Journal: Genome Res Date: 2021-03-11 Impact factor: 9.043
Authors: Zachary B Abrams; Mark Zucker; Min Wang; Amir Asiaee Taheri; Lynne V Abruzzo; Kevin R Coombes Journal: BMC Genomics Date: 2018-10-11 Impact factor: 3.969