Mazen Ahmad1, Volkhard Helms2, Olga V Kalinina1, Thomas Lengauer1. 1. Computational Biology Research Group , Max Planck Institute for Informatics , Saarland Informatics Campus, Campus E1 4 , 66123 Saarbrücken , Germany. 2. Center for Bioinformatics , Saarland University , 66123 Saarbrücken , Germany.
Abstract
A new method termed "Relative Principal Components Analysis" (RPCA) is introduced that extracts optimal relevant principal components to describe the change between two data samples representing two macroscopic states. The method is widely applicable in data-driven science. Calculating the components is based on a physical framework that introduces the objective function (the Kullback-Leibler divergence) appropriate for quantifying the change of the macroscopic state affected by the changes in the microscopic features. To demonstrate the applicability of RPCA, we analyze the thermodynamically relevant conformational changes of the protein HIV-1 protease upon binding to different drug molecules. In this case, the RPCA method provides a sound thermodynamic foundation for analyzing the binding process and thus characterizing both the collective and the locally relevant conformational changes. Moreover, the relevant collective conformational changes can be reconstructed from the informative latent variables to exhibit both the enhanced and the restricted conformational fluctuations upon ligand association.
A new method termed "Relative Principal Components Analysis" (RPCA) is introduced that extracts optimal relevant principal components to describe the change between two data samples representing two macroscopic states. The method is widely applicable in data-driven science. Calculating the components is based on a physical framework that introduces the objective function (the Kullback-Leibler divergence) appropriate for quantifying the change of the macroscopic state affected by the changes in the microscopic features. To demonstrate the applicability of RPCA, we analyze the thermodynamically relevant conformational changes of the protein HIV-1 protease upon binding to different drug molecules. In this case, the RPCA method provides a sound thermodynamic foundation for analyzing the binding process and thus characterizing both the collective and the locally relevant conformational changes. Moreover, the relevant collective conformational changes can be reconstructed from the informative latent variables to exhibit both the enhanced and the restricted conformational fluctuations upon ligand association.
Studying
the transitions and differences between multiple states
populated by a dynamic system is a central topic in different fields
including chemistry, physics, biology, machine learning, and all of
data-driven science. A typical task is to uncover how macroscopic
changes of the dynamic system are related to the features (variables)
that describe its microscopic individuals (instances). Two examples
of such microscopic features would be the genetic sequences of a virus
taken from snapshots during the course of evolution or the spatial
conformations of two biomolecules when they bind to each other. The
relationship between the “microscopic” factors of a
system and the change of its macroscopic states requires the definition
of an appropriate objective function for quantifying the change of
the “macroscopic” state of the system. Such a rigorous
definition of changes of the macroscopic state of a system in terms
of its microscopic features is available for physical systems whose
thermodynamic quantities can be measured or computed. For example,
the change of free energy (a scalar value) is a suitable quantity
to characterize macroscopic changes in physical, chemical, and biochemical
systems. However, in other areas of data-driven science, such a rigorous
definition and quantification of macroscopic changes generally does
not exist. Instead, various heuristic objective functions are used
in practice. Examples include divergence measures from information
theory[1] and the wide variety of objective
functions that are used for prediction and feature extraction in pattern
recognition.[2] Mining the factors informative
for the change between two samples is of high importance and of general
interest in all areas of data-driven science and is generally performed
in a high-dimensional feature space. In fact, mining informative features
is the central theme in a large domain of machine learning and includes
methods such as dimensionality reduction,[3,4] feature
extraction,[2] and latent variable models.[5] However, one needs to select an objective function
that is appropriate for quantifying the change before applying a multivariate
method to extract the informative features.Analyzing the conformational
changes taking place during biomolecular
reactions is one of the most important tasks in structural biology.
Unfortunately, analyzing and mechanistically understanding biochemical
interactions is quite tricky due to the complex conformational dynamics
in the high-dimensional space where the interactions take place. The
macroscopic changes in biochemical systems, on the other hand, are
quantified using the change of free energy (a scalar quantity). Molecular
dynamics (MD) simulations are becoming a more and more attractive
tool for analyzing conformational changes of biomolecules. An important
technique for postanalysis of MD trajectories is provided by Markov
state models.[6,7] These models focus on characterizing
the kinetic transitions between representative conformations. The
analysis is performed in a data space where the points are the (clustered)
conformations. Multivariate methods can be suitably applied to elucidating
the spatial characteristics of conformational ensembles. For example,
principal component analysis (PCA)[8,9] can be used
to find the directions in the feature space with maximum variation.
Furthermore, partial least-squares analysis[10] aims at finding the directions in the feature space that maximize
the covariance between the features and a response variable. However,
we argue that methods adapted from multivariate analysis and their
objective functions are often not adequately reflecting the thermodynamics
and the physical (asymmetric) nature of the change.In this
work, we introduce a unified framework rooted in statistical
information theory and statistical mechanics[11−14] for studying the change between
two data sets representing two states. This physics-based framework
is used to introduce a new method termed “Relative Principal
Components Analysis”. This method extracts directions in the
feature space termed “relative principal components”,
which are most relevant for describing the change between two data
samples (two states). The informative directions of the change are
represented in a latent variable space that is shared between the
two unpaired data samples of the observed variable. RPCA provides
an optimal and disentangled representation[15−17] of the change
in the latent space where the directions in the latent space are selected
in a way that the objective function for quantifying the change (the
Kullback–Leibler (KL) divergence) is maximized and additively
factorized along the different directions. Besides the mapping from
the original (observed) feature space to the latent variable space,
RPCA provides mapping (reconstruction) from the latent variable space
to the (observed) feature space. The RPCA method is applicable in
all areas of data-driven science and is introduced in its generality
in sections and 3. As a special but important example, starting in section , we apply RPCA
for analyzing the energetically relevant conformational changes of
a biomolecule (protease of the human immunodeficiency virus, HIV-1)
upon binding to various ligands.
General
Formalism for Analyzing Changes in Dynamic
Systems Based on Information Theory
Before going into the
technical details of finding the directions
in feature space that are informative of the change between two states,
we first introduce a physical framework for defining and quantifying
the change of dynamic systems in all areas of data-driven sciences
and justify the objective function used for quantifying the macroscopic
change.Let be a random vector
encoding
the microscopic features (variables) necessary to identify the difference
between the instances of a system of interest. By a system we mean a collection of individuals or instances, e.g., viruses or molecules, where each instance is defined via
a set of (microscopic) features. A macroscopic state
(a) of the system is defined by the probability density
distribution P(|a) = P() of all possible instances () when the system is at equilibrium in this macroscopic
state a. The macroscopic state a can be changed into a new macroscopic state b by
perturbing the probability distribution of the microscopic instances to P(|b) = P(). Here, a Bayesian approach
is used whereby the macroscopic state is viewed as a random variable,
and the conditional probability P(|i) = P() is naturally interpreted
as the equilibrium distribution of state i that can
be taken from a finite or discrete set. A relationship between the
two macroscopic states can be obtained by applying Bayes’s
theorem, yielding the following equation for the probability density
ratio P()/P()Here, P(b) and P(a) are
conditional probabilities under the (implicit) condition of the perturbing
factors. Averaging the logarithm of the density ratio equation over
the probability distribution of state b (P()) yieldsIndeed, this is a general
derivation of the Perturbation Divergence Formalism (PDF), which was
previously derived for physical systems for the purpose of decomposing
the change of free energy (ΔF) between two
macroscopic states a and b(13,14)For physical
systems, we
use here the natural unit of the energy (kBT)−1 = 1. DKL is the KL divergence[18] (also
termed the relative entropy[19] or the discriminant
information[11]) between the probability
distributions of states a and b.
Interestingly, eq provides a purely probabilistic formalism of free energy change
via a new probabilistic definition of the perturbation U of the microscopic instances as the
logarithm of the ratio of the posterior probabilities of two macroscopic
states given a particular microscopic configuration
Bridging the Macroscopic
Change and the Microscopic Features
The concept of the perturbation
of a microscopic instance was originally
introduced in statistical thermodynamics as an energetic quantity
(the difference between the potential energies of the states; U() = U() – U()) in order to formalize the
relationship between the microscopic (atomistic) description of a
physical system and its macroscopic changes between states (free energy
change).[20,21] When considering the change between two
macroscopic states, the perturbation U() is a one-dimensional
(microscopic) variable that has the same KL divergence (discriminant
information) as the high-dimensional feature (13)In statistical inference
theory,[22,23] such a variable is termed a sufficient statistic.
A sound framework for the relationship between the change of macroscopic
states of a dynamic system and its microscopic elements can be inherited
from the formalism of exponential families[12,24] in statistical estimation theory.Let us assume we have a
dynamic system at equilibrium in a macroscopic
reference state labeled by the so-called natural parametersλ. The system can populate a new macroscopic
state Pλ() upon perturbing its original state Pλ(). The principle
of minimum discrimination information[25] (equivalent to the principle of maximum entropy[26]) is applied to find the new distribution Pλ() by minimizing
the KL divergence from the reference distribution Pλ() to
the new distribution Pλ() under the constraint of a finite expected value
of the sufficient statistic ⟨()⟩. The probability density distribution Pλ() forms an exponential
family of distributions in terms of the reference distributionHere, the sufficient statistic () is a vector
(or a scalar) function of the microscopic configuration of the system. The cumulant generating function
ψ(λ) depends on only the natural parameters λ. Besides being used to formulate a theoretical framework
for studying the error bound of parameter estimation,[25] the formalism of exponential families plays a central role
in different fields of machine learning such as generalized linear
models and variational inference.[3] In statistical
thermodynamics, Kirkwood introduced his thermodynamic integration
(TI) equation[20] using the exponential family
to “alchemically” interpolate two macroscopic states.[20] Indeed, the one-dimensional sufficient statistic,
“the perturbation”, is the appropriate tool for interpolating
between two macroscopic states in free energy calculations. Generally,
a higher-dimensional sufficient statistic is required when studying
multiple macroscopic states.[22] The sufficient
statistic and the corresponding cumulant generating function in eq are not unique. Notably,
comparing eqs and 2.7 shows that the sufficient statistic can be selected
such that the corresponding cumulant generating function is a function
of the change of free energy between the macroscopic states of dynamical
systems. For example, when selecting the negative value of the perturbation
(−U()) as a sufficient statistic for studying the change
between two macroscopic states, the corresponding cumulant generating
function is the negative of the free energy change and eq turns into the free energy perturbation
equation that is well-known in statistical thermodynamics.[21,27] Another example is the logarithm of the density ratio, which is
a well-known sufficient statistic in statistical inference, and its
corresponding cumulant generating function is the relative change
of free energy (see Supporting Information section S4). An interesting known relationship of an exponential family
is the functional dependence of the change of ψ(λ) (the function of the macroscopic state) on the microscopic features that is given by the average and the covariance
of the sufficient statisticThe
free energy TI equation[20] is a special
case of eq . However,
quantification of the macroscopic
change is not of general interest in the field of data-driven sciences.
The important task here is identifying the unknown sufficient statistic
that explains the influence of the microscopic features of the macroscopic
change. Unlike the free energy change, KL divergence is a quantity
familiar in machine learning and can be computed using parametric
or nonparametric models.[28] Indeed, eq shows that KL divergence
is a Legendre transformation[12,29] of the cumulant generating
function and can be used to quantify the change of the macroscopic
state. The PDF given by eq is a special case of eq where we use the perturbation as a sufficient statistic
for studying the change between two macroscopic states (λ =
0,1). Due to the Legendre duality[12,29] between the
KL divergence and the change of the free energy, the relevance of
the PDF goes beyond being decomposition of the change of free energy.
In fact, the terms of eq are the perturbations (the features informative of the change)
and their fingerprints (the configurational changes), which are quantified
by the KL divergence. In this view, significant perturbations are
reflected by significant changes of KL divergence.
Relative Principal Components Analysis
This section presents
a new method for analyzing the change between
two states (data sets) using multivariate analysis methods.[4] Studying the change between multiple states is
beyond the scope of this work and will be presented in a future publication.
The newly introduced method termed “relative principal components
analysis” (RPCA) computes collective canonical variables (linear
combinations of the original features) termed the relative principal
components (RPCAs) to which the KL divergence factorizes additively.
Indeed, factorization of the KL divergence is equivalent to factorization
of the logarithm of the density ratio, which is a sufficient statistic
of interest in machine learning[28] (see
above). Factorization of KL divergence was introduced for multivariate
normal distributions in the seminal work of Kullback.[11] However, the theoretical approach of factorizing KL divergence
as introduced by Kullback was not accessible in practice. Specifically,
a solution is needed around the singularity of the covariance matrices,
and the resulting features have to be optimal with respect to maximizing
their KL divergences (see below).Let = (x1...x) be a d-dimensional continuous
random variable with two samples from two macroscopic states that
will be labeled (a) and (b). RPCA
aims at finding k latent canonical variables = (y1, y2, ..., y) = f() that satisfy the following two
conditions: (i) their marginal distributions are independent in states a and b, meaning that their KL divergences
are additive, and (ii) they are optimal in terms of maximizing the
KL divergence, such that we can use m (m ≪ d) latent variables to represent the significant
directions informative of the change. The KL divergence of a new variable = f() is always non-negative[11,19] and is bounded
from above by the KL divergence of the original variable (11,14)The new variable = f() is
“sufficient” if the equality holds in eq . When studying the
change between two macroscopic states, a sufficient one-dimensional
variable always exists (e.g., the perturbation[14]) regardless of the dimensionality of . Although the existence of a one-dimensional sufficient feature
appears promising, it is not practically useful for two reasons: (i)
the analytical nature of the sufficient one-dimensional variable is
generally unknown and (ii) the complexity and nonlinearity of a sufficient
one-dimensional variable, if it is known, will hinder its interpretability
in terms of the original features .
In fact, its simple interpretability and analytical traceability is
one reason for the widespread use of the Gaussian linear parametric
model in latent variable models[5] (e.g.,
PCA). We will adopt this model here as well. Then the latent variables
are linear combinations of the original variable and are normally
distributedHere, is a d × k transformation
matrix with columns . Thus, and μ; l = a,b are the averages of the two distributions.
The covariance matrices of the original
variables are related to the covariance matrices of the latent variables
by Λ = ; l = a,b. Under the model assumption of normality, the independence
of the variables y requires Λ to be diagonal in both
states a and bHere, the diagonal
matrix
of state b, Λ =diag (λ1...λ), contains the variances
λ1...λ of the
latent variables at state b, while the covariance
matrix of the latent variables at state a is arbitrarily
selected to be an identity matrix (). In the case of a multivariate normal distribution ( is nonsingular), Kullback[11] formulated the solution for as the generalized eigenvectors corresponding to the generalized
eigenvalue problem | – λ|, which can be solved using Wilkinson’s algorithm[30] involving the Cholesky decomposition of .[31] Practically, the covariance matrices of real
data are mostly singular or ill-conditioned, and the generalized eigenproblem
is not solvable. Singularity arises due to the fact that the real
dimensionality of the probability distributions is smaller than the
apparent dimensionality of .[32]
RPCA via Simultaneous Diagonalization of
Two Matrices
Here, we present a general algorithm for the
simultaneous diagonalization
of the matrices in eq , which can be used even if is singular.A transformation matrix that simultaneously diagonalizes the symmetric
matrices (of rank k) and (eq ) can be found by a combination of two transformation
matricesThe so-called whitening transformation(2) matrix of the matrix is computed from its eigendecomposition = = Here, the k eigenvectors
in the columns of correspond to the k nonzero
eigenvalues in the diagonal matrix . Clearly, reduces to an identity matrix corresponding to the covariance matrix
of the whitened data (). The algorithm above is well-known
in case is nonsingular (d = k).[2]The matrix is formed using the eigenvectors
from the symmetric matrix It is
straightforward to
see from eqs and 3.6 that = simultaneously diagonalizes and and satisfies eq .The k columns of are the generalized eigenvectors with the corresponding generalized
eigenvalues in the diagonal matrix Λ. The relative
principal components y = can be reordered based on their KL divergences.
The additive KL divergences of the independent variables y can be analytically computed based
on the model assumption of normality[11]Here Δ = μ – μ is the change of the average of the distributions
of in states a and b. However, it is important to keep in mind that the value
of DKL of the components from eq is computed based
on the model assumption (normality). Fortunately, a significant KL
divergence of a variable y has to be reflected in a significant change between its distributions
in states a and b, which can be
used to assess the accuracy of the model-based KL divergences (see
the example below).
Optimal RPCAs via Average-Covariance Subspacing
Although
the simultaneous diagonalization algorithm above returns independent
latent variables into which the KL divergence factorizes additively,
the obtained latent variables are not optimal in terms of maximizing
the KL divergence. Optimal latent variables are required for reconstructing
the most informative approximation (maximizing KL divergence) of the
original variable; see below. The KL divergences of the relative principal
components in eq can be decomposed into the terms holding the change of the variances and the terms holding the change of the
average . Indeed, the latent variables from eq are optimal with respect
to maximizing the KL divergences due to a change of the variances.[11] Unfortunately, this optimality is violated by
the contributions to the KL divergences due to the change of the average . Therefore, the following average-covariance
subspacing algorithm is introduced to achieve the optimality of RPCAs
with respect to maximizing their KL divergences. The detailed derivation
is presented in Supporting Information section S1. The idea is to find a transformation matrix = [] such that the KL divergence of the variable = summarizes the total KL divergence
due to changes of the averages while the KL divergences of the variables = are purely due to
the change of covariance. The required solution for is given byHere, the generalized pseudoinverse – is used for the general case
(e.g., singular ) and is normalized with respect to the covariance matrix such that = 1. The KL divergence
of the variable = includes
the total KL divergence due to the change of the average, which in
turn equals half of the Mahalanobis distance between the averages Here, the KL divergence
of the new variable = also
includes a contribution due to the change of its variance (λ).The remaining generalized eigenvectors , which do
not contain contributions arising from the change of the average (Δ = 0), are computed after deflating the contributions
to the KL divergence due to vector from the matrices and . Given the vector , the restricted
simultaneous diagonalization problem can be presented in the equationssubject toHere, Λ is a diagonal matrix,
and 0 denotes a matrix or a vector of zeros of suitable dimensionality.
To fulfill the conditions in eqs and 3.11, the generalized eigenvectors can be constructed
using a combination of two transformation matrices (for details, see Supporting Information section S1)(i) The whitening transformation
matrix , similarly to eq , is constructed here from the eigendecomposition of the covariance
matrix of the projection of on the
subspace, which is orthogonal to the vector (). (ii) The second transformation is obtained from the eigenvectors of the covariance matrix of the
projection of the whitened data () onto the subspace that
is orthogonal to the vector . Here, the number of generalized
eigenvectors from the optimal RPCA subspacing is one less than the
number of generalized eigenvectors from the nonoptimal RPCA algorithm
of eq .
Data Reconstruction
from the Latent Variable
The influence
of the m-dimensional (m < d) latent variable = x on the original d-dimensional variable can be presented via
the reconstruction
(projection) in the
subspace that is spanned by the
corresponding m generalized eigenvectors = [1...] and is given by the relationship[33]Here, is the projection matrix[33] on the d-dimensional subspace that is
spanned by columns of . is the reconstruction matrix facilitating
the transformation from the latent variable. It is straightforward
(see Supporting Information section S2)
to show that the KL divergence between the distributions of the reconstructed
variable equals the KL divergence
between the distributions of its corresponding latent variable . In other
words, the most informative approximation (the one maximizing KL divergence)
of the original variable using a restricted
number (m < d) of variables is
obtained using the m-dimensional latent variables with the highest
KL divergences.
Change Hotspots from RPCA
Besides
representing the
changes collectively (incorporating all x), RPCA provides the possibility to map the feature-wise
(local) contributions to the change from the individual elements x of such that the elements of with larger contribution to the change (KL divergence) can be interpreted
as the hotspots of the change. The contributions to the divergence
in eq can be approximated
as a sum of two quadratic termsHere, () denotes element (i,j) of matrix . The
contributions to the quadratic terms are due to the local contributions
from the elements and their cooperative (cross) terms, taking into
account that each element of a generalized eigenvector (g) corresponds to an element x. However, it is clear that
the contributions to the quadratic terms in eq can be collected via arbitrary grouping
of the elements into subgroups. The computation of the (local) group-wise
contributions and their cooperative contributions is equivalent to
the computation of the quadratic terms using the corresponding submatrices
of and
ΔΔ.
Asymmetric
Nature of RPCA
Multivariate analysis methods
can be grouped into methods handling either one state or multiple
states. The methods that study one state include methods handling
one multivariate variable, such as PCA, factor analysis, and independent
components analysis, and methods handling two multivariate variables
with concurrent measurement (a joint distribution) include canonical
correlation, regression, and partial least-squares. Discriminant analysis
(classification) methods, on the other hand, aim at finding the variation
between several states (classes). Although RPCA is similar to feature
extraction for classification,[2] there are
fundamental differences between pattern classification and the physical
definition of the change, which is introduced above. For example,
the discriminant features in Fisher discriminant analysis (FDA) are
related to the change of the averages of distributions, and the information
from the covariance matrices is whitened using a unified pooled (average)
covariance matrix.[2] Therefore, the use
of FDA for dimensionality reduction[34] is
known to be restricted by a limit on the number of dimensions, which
is equal to the number of classes minus one. While the objective functions
in discriminant analysis are required to be symmetric,[2] the objective function (KL divergence) for RPCA is asymmetric,
which reflects the directed character of changes.[35] When studying the backward change from state b to state a, the generalized eigenvectors pertaining to the reverse direction are the scaled eigenvectors
pertaining to the forward direction (), and the generalized eigenvalues, which present the change of the
variance, are inverted (1/λ)Taking into account that
the KL divergence due to the change of the variance (−ln λ + λ –
1) in eq is a convex
function
with a minimum of 0 at λ = 1, we
can divide the relative principal components into two groups. The
first group includes the components with λ > 1 where the variance (quantified by λ) along the direction increases as we transit from state a to b. The second group includes the components
with λ < 1 because the respective
variance decreases as we transit from state a to b. When the backward change takes from b to a, the roles of the components of these groups
(quantified by 1/λ) are exchanged.
The same role of these groups is also observed when taking the KL
divergence due to the change of the average where the directions with λ > 1 have diminished contributions () to the backward direction
from b to a in comparison with their
contribution
() to the transition from state a to b and vice versa.
RPCA and the Distance Metric
Even
though the task of
RPCA (analyzing the change between states) differs from the task of
PCA (finding the variation within one state; see Figure ), a similarity exists regarding
the applied distance metric. However, it is important to notice that
the generalized eigenvectors are orthonormal with respect to the matrix[33] ( = δ), orthogonal with respect
to ( = λ; = 0), and not necessarily orthonormal to each other
( ≠ δ). RPCA analysis can be interpreted as using
the distance metric[36] of state a to analyze state b. Indeed, applying
the whitening transformation[2] in eq removes the “information”
within state a ( = ). The eigendecomposition of in eq is performed in the whitened space in which
the squared Euclidean distance between two points 1 and 2 equals their Mahalanobis distance in the original space and the
projections on the generalized eigenvectors factorize the Mahalanobis
distance (see Supporting Information section S3)Moreover, the average Mahalanobis
distance of points in state b to the average of state a can be written as the sum of the generalized eigenvalues
and the Mahalanobis distance between the averages of the states (see Supporting Information section S3)
Figure 1
Schematic representation
of the conceptual difference between PCA,
which finds the largest variation within each state, and RPCA, which
finds the change from the initial to the final state.
Schematic representation
of the conceptual difference between PCA,
which finds the largest variation within each state, and RPCA, which
finds the change from the initial to the final state.
Application of RPCA to Reveal
the Thermodynamically
Important Molecular Conformational Changes
Although our RPCA
method is not limited to biomolecular data, the
starting point for its development was based on the new insight stemming
from our recently introduced PDF.[13,14] PDF reveals
that the KL divergence (DKL) equals the
work that is spent to change the conformational ensemble and hereby
is the thermodynamically relevant objective function for analyzing
conformational changes.[13] Indeed, PDF provides
a flexible framework for studying conformational changes involving
either all atoms of the structure, a part of them, or aspects of the
structure based on phenotypical features (e.g., open/closed, pocket
volume, domain movement, etc.); see the discussion in ref (14). An important benefit
of using the KL divergence for quantifying the energetics related
to conformational changes is avoiding the misleading entropic terms
that are subject to enthalpy–entropy compensation when using
the changes in conformational entropy to estimate the importance of
conformational changes, as was previously shown.[37,38,14]In the following, we present use cases
for applying RPCA to analyze
the molecular conformational changes of the protein HIV-1 protease
upon binding to several inhibitor molecules. The conformations of
the protein are sampled via MD simulations for both the initial (free,
unbound) state and the final (bound) state. Technical details on these
simulations are provided in Supporting Information section S5. Figure shows an assessment of the relationship between the relevant
components of the dynamic change within one state, which are analyzed
using traditional PCA of the data points of the final state, and the
thermodynamic importance of the corresponding conformational changes.
The thermodynamic importance of the conformational changes along a
principal component to the association process is quantified by the
KL divergence of the distributions of projections of both the free
and the bound state conformations on the component.[37]Figure a shows that the eigenvalues of the principal components are not
related to their thermodynamic importance for the association process.
To illustrate this more clearly, Figure b shows that principal component 3654 has
more thermodynamic importance (KL divergence) than the first principal
component, which is mostly irrelevant for the association process.
Figure 2
Assessing
the thermodynamic contribution of the conformational
changes along the directions of the principal components from PCA
to the association process. (a) Top panel: PCA eigenvalues derived
from the covariance matrix of the conformational fluctuation (heavy
atoms) in a MD simulation of HIV-1 protease bound to its inhibitor
Tipranavir. Bottom panel: Importance (KL divergence) along PCAs computed
from the projections of the data points (conformations) of the final
state (bound) and the initial state (free). (b) Divergence of the
largest eigenvalue PCA#1 (DKL ≈
0.1) compared to that of the PCA with highest divergence #3654 (DKL ≈ 7.3). Two thousand data points are
projected on the PCA vectors of the final (bound; blue) and the initial
(free; red) states. The marginal probability distribution densities
are shown on the side panels. The PCA analysis is performed on the
conformations from the bound state using the heavy atoms of the protein.
The superposition of the conformations of both the bound and the free
state ensembles and the superposition of the two ensembles to each
other are performed in a way similar to that for RPCA (see below).
The KL divergences between the distributions of the projections of
the two states are computed using the KLIEP method.[39]
Assessing
the thermodynamic contribution of the conformational
changes along the directions of the principal components from PCA
to the association process. (a) Top panel: PCA eigenvalues derived
from the covariance matrix of the conformational fluctuation (heavy
atoms) in a MD simulation of HIV-1 protease bound to its inhibitor
Tipranavir. Bottom panel: Importance (KL divergence) along PCAs computed
from the projections of the data points (conformations) of the final
state (bound) and the initial state (free). (b) Divergence of the
largest eigenvalue PCA#1 (DKL ≈
0.1) compared to that of the PCA with highest divergence #3654 (DKL ≈ 7.3). Two thousand data points are
projected on the PCA vectors of the final (bound; blue) and the initial
(free; red) states. The marginal probability distribution densities
are shown on the side panels. The PCA analysis is performed on the
conformations from the bound state using the heavy atoms of the protein.
The superposition of the conformations of both the bound and the free
state ensembles and the superposition of the two ensembles to each
other are performed in a way similar to that for RPCA (see below).
The KL divergences between the distributions of the projections of
the two states are computed using the KLIEP method.[39]Figure shows the
RPCA of the conformational changes of wild-type HIV-1 protease upon
binding to a drug molecule that has high affinity (Tipranavir). Presented
are both the nonoptimal RPCA (Figure a) and the optimal RPCA calculated with the subspacing
method (Figure b).
The scores (KL divergence) of the importance of the components show
that the first few components account for most of the KL divergence.
The data points (conformations) of both the free, unbound state and
the bound state are projected on selected components to illustrate
the correspondence between the score (model based) and the real divergence,
which can be extracted from the difference between the distributions
of the projections of the states. There are clear benefits of the
optimal subspacing RPCA (Figure b). Namely, the first component has a significant divergence
because it collects the total contribution to the KL divergence arising
from the change of the average. In contrast, one obtains identical
averages when projecting the conformations sampled in the two states
onto the remaining components. Thus, the remaining components do not
contribute to the KL divergence that is due to the change of the averages. Figure shows a comparison
of the KL divergence (discriminant information) of the top-scoring
component versus the lowest-scoring and the principal component with
largest eigenvalue from PCA of the bound state.
Figure 3
RPCA of the conformational
changes of HIV-1 protease upon its association
with Tipranavir. Shown are the nonoptimal RPCA (a) from eq and optimal RPCA with
subspacing (b) from eq . The right panels show the KL divergences of the components
(blue colored) and their corresponding contributions due to the change
of the average (red colored). Left panels show projections of the
data points (conformations) of both the initial (free, unbound) and
the final bound state on selected components. The scores (KL divergences)
of the components are displayed after their corresponding number.
The projections show that the components (components 1–8) with
the highest rank (KL divergence) distinguish the change between the
free and bound states, while the components with the lowest rank do
not distinguish the change (similar projections). The analysis is
performed using the heavy atoms of the protein. The plots are generated
using R,[40] and the densities are smoothed
using the kernel density estimation.
Figure 4
KL divergences of the highest scored relative principal component.
The initial state (red) and the bound state (blue) are mapped on the
top-scoring RPCA#1 against the lowest-scoring RPCA#4541 (a) and on
the top-scoring RPCA#1 against the top-scoring PCA#1 (highest eigenvalue)
(b). RPCA is successful in extracting the components holding the most
significant information on the change (binding process from free to
bound states). The marginal probability distribution densities are
shown on the side panels. Samples of 2000 data points were used for
the projections.
RPCA of the conformational
changes of HIV-1 protease upon its association
with Tipranavir. Shown are the nonoptimal RPCA (a) from eq and optimal RPCA with
subspacing (b) from eq . The right panels show the KL divergences of the components
(blue colored) and their corresponding contributions due to the change
of the average (red colored). Left panels show projections of the
data points (conformations) of both the initial (free, unbound) and
the final bound state on selected components. The scores (KL divergences)
of the components are displayed after their corresponding number.
The projections show that the components (components 1–8) with
the highest rank (KL divergence) distinguish the change between the
free and bound states, while the components with the lowest rank do
not distinguish the change (similar projections). The analysis is
performed using the heavy atoms of the protein. The plots are generated
using R,[40] and the densities are smoothed
using the kernel density estimation.KL divergences of the highest scored relative principal component.
The initial state (red) and the bound state (blue) are mapped on the
top-scoring RPCA#1 against the lowest-scoring RPCA#4541 (a) and on
the top-scoring RPCA#1 against the top-scoring PCA#1 (highest eigenvalue)
(b). RPCA is successful in extracting the components holding the most
significant information on the change (binding process from free to
bound states). The marginal probability distribution densities are
shown on the side panels. Samples of 2000 data points were used for
the projections.RPCA facilitates studying
the conformational changes from both
the collective and the local points of view. The relevant collective
conformational changes (with respect to their KL divergence) can be
presented by reconstructing the conformations in the subspace that
is spanned by the relevant generalized eigenvectors; see eq . Collective conformational
changes can also be represented by reconstructing the conformations
from the (normally distributed) latent variable using eq . Interestingly, RPCA provides a clear
distinction between the directions (generalized eigenvectors) along
which the fluctuation of the conformations increases upon the change
(corresponding to λ > 1) and
the
directions along which the fluctuation of the conformations decreases
(corresponding to λ < 1). Figure a shows the collective
conformational changes around the average conformation of the bound
state along the 33 eigenvectors with the highest generalized eigenvalues
(λ > 10), which represent the
enhanced
motions of the wild-type HIV-1 protease upon binding Tipranavir. Figure b, on the other hand,
shows the collective conformational changes around the average conformation
of the free state along the 33 eigenvectors with the smallest generalized
eigenvalues (λ < 0.009), which,
in turn, represent the most strongly restricted motions of the wild-type
HIV-1 protease upon binding Tipranavir. It should be stressed that
the importance of the local conformational changes (e.g., at residues)
cannot be inferred from the collective conformational changes. In
other words, a large fluctuation in a site, when presenting the collective
conformational changes, does not imply larger thermodynamic importance
than the associated less strongly fluctuating sites. Alternatively,
hotspot analysis from RPCA in eq can be used to rank the thermodynamic contribution
of the local conformational changes (conformational hotspots) of the
biological building blocks (residues) to the association process and
the importance of the cooperative (correlation) interactions between
them (see Supporting Information Figure SF1). In Figure , the
relative local residue contributions to the divergence are mapped
on the structure of the protein, and the importance of the conformational
changes is indicated by radii of varying size in the cartoon representation,
in addition to using the color code. Interestingly, most of the marked
hotspots are residues known to affect the binding affinity upon mutating
them. Examples of the defined conformational hotspots are the residues
in the active pocket (e.g., D25, V82) and residues of the flap region
(e.g., I50, I54). Figure a shows analysis of the conformational hotspots from RPCA
of the binding of a ligand (Saquinavir) to the HIV-1 protease mutant
with a resistance-related mutation (I50V), which is located on the
flap region outside of the binding pocket. The conformational hotspots
of the mutant are located at the flap region around the mutation V50.
The same flap region does not show important conformational changes
when applying the same analysis to the association of Saquinavir to
the wild-type (I50) in Figure b.
Figure 5
Representation of the enhanced and restricted conformational fluctuations
of HIV-1 protease upon binding the inhibitor Tipranavir. Conformations
around the average conformation are reconstructed from the latent
variable after interpolation around its average along selected generalized
eigenvectors; see eq . (a) Enhanced conformational fluctuations around the average
structure of the bound state along the 33 eigenvectors with the highest
generalized eigenvalues (λ >
10).
These conformational fluctuations increase the affinity of binding
by optimizing the local conformations in the ligand–receptor
interface. (b) Conformational fluctuations around the average structure
of the free state along the 33 eigenvectors with the smallest generalized
eigenvalues. These latter fluctuations are highly restricted upon
association (λ < 0.009) because
they decrease the affinity of binding via adverse local movements
in the binding pocket and opening of the flap regions; see the Supporting Information movies. The cartoon representation
is generated using Pymol.[41] The conformation
of the ligand is taken from the experimental structure. Movies of
these conformational changes are presented as Supporting Information, and their descriptions are given in Supporting Information section S6.
Figure 6
Conformational hotspots corresponding to the association
of Tipranavir
to the wild-type HIV-1 protease protein. The conformational hotspots
represent the local conformational changes that are computed for each
residue of the protein and mapped on the structure of the protein.
The radius and the color of the cartoon indicate the importance of
the conformational changes. The computed importance values are the
local terms (the first term) in eq . The values are normalized relative to the highest
value that is set to 1. The analysis is performed using the coordinates
of the heavy atoms of the protein, which are taken from snapshots
from MD simulations of the unbound protein and the complex (see the
details in Supporting Information section S5). The structure of the ligand is taken from the experimental structure
of the complex (PDB code 1D4Y).
Figure 7
Conformational hotspots
from RPCA analysis recognize the importance
of the resistance-related mutation (I50 V) of HIV-1 protease when
bound to Saquinavir. The conformational hotspots of the mutant (a)
are located at the flap region around the mutation V50. The corresponding
flap residues of the wild-type (I50) in (b) do not show energetically
important (expensive) conformational changes. The radius and the color
of the cartoon indicate the importance of the conformational changes.
The importance is normalized relative to the highest value. The structures
of the ligands are taken from the experimental structure of the complexes
(PDB codes 3OXC and 3CYX for
the wild-type and the I50 V mutant, respectively). The analysis is
performed using the coordinates of the heavy atoms of the proteins,
which are taken from snapshots from MD simulations of the unbound
proteins and the complexes (see the details in Supporting Information section S5).
Representation of the enhanced and restricted conformational fluctuations
of HIV-1 protease upon binding the inhibitor Tipranavir. Conformations
around the average conformation are reconstructed from the latent
variable after interpolation around its average along selected generalized
eigenvectors; see eq . (a) Enhanced conformational fluctuations around the average
structure of the bound state along the 33 eigenvectors with the highest
generalized eigenvalues (λ >
10).
These conformational fluctuations increase the affinity of binding
by optimizing the local conformations in the ligand–receptor
interface. (b) Conformational fluctuations around the average structure
of the free state along the 33 eigenvectors with the smallest generalized
eigenvalues. These latter fluctuations are highly restricted upon
association (λ < 0.009) because
they decrease the affinity of binding via adverse local movements
in the binding pocket and opening of the flap regions; see the Supporting Information movies. The cartoon representation
is generated using Pymol.[41] The conformation
of the ligand is taken from the experimental structure. Movies of
these conformational changes are presented as Supporting Information, and their descriptions are given in Supporting Information section S6.Conformational hotspots corresponding to the association
of Tipranavir
to the wild-type HIV-1 protease protein. The conformational hotspots
represent the local conformational changes that are computed for each
residue of the protein and mapped on the structure of the protein.
The radius and the color of the cartoon indicate the importance of
the conformational changes. The computed importance values are the
local terms (the first term) in eq . The values are normalized relative to the highest
value that is set to 1. The analysis is performed using the coordinates
of the heavy atoms of the protein, which are taken from snapshots
from MD simulations of the unbound protein and the complex (see the
details in Supporting Information section S5). The structure of the ligand is taken from the experimental structure
of the complex (PDB code 1D4Y).Conformational hotspots
from RPCA analysis recognize the importance
of the resistance-related mutation (I50 V) of HIV-1 protease when
bound to Saquinavir. The conformational hotspots of the mutant (a)
are located at the flap region around the mutation V50. The corresponding
flap residues of the wild-type (I50) in (b) do not show energetically
important (expensive) conformational changes. The radius and the color
of the cartoon indicate the importance of the conformational changes.
The importance is normalized relative to the highest value. The structures
of the ligands are taken from the experimental structure of the complexes
(PDB codes 3OXC and 3CYX for
the wild-type and the I50 V mutant, respectively). The analysis is
performed using the coordinates of the heavy atoms of the proteins,
which are taken from snapshots from MD simulations of the unbound
proteins and the complexes (see the details in Supporting Information section S5).
Technical Aspects of Using RPCA of Molecular Conformational
Changes
Optimal Superposition of Conformations of the Ensembles
The first step in analyzing an ensemble of conformations is removing
the six external rotational and translational degrees of freedom by
superimposing the conformations. Although the internal coordinates
are not altered due to these similarity transformations (rotations
and translations), the average conformation and the covariance matrix
are highly affected by the way the conformations are superimposed
to remove these external degrees of freedom. Traditionally, the conformations
are superimposed on an arbitrary reference structure (e.g., the starting
structure of the MD simulation). However, the resulting ensemble is
dependent on the reference structure used for superimposing the conformations.
A popular approach to superimposing multiple conformations is known
in statistical shape analysis as Generalized Procrustes Analysis[36,42] (GPA). GPA treats the conformations in the ensemble as rigid and
rotates and translates them to a suitable reference structure (the
average conformation) such as to minimize the squared distance between
the conformation and the reference (the average conformation), which
is equivalent to minimizing the sum of squared distances over all
pairs of conformations.[36] Initially, a
random conformation from the ensemble is used as the reference. Then
GPA iterates between superposing the conformations of the ensemble
to the reference, as described above, and computing the average structure
of the resulting ensemble as a new reference. The result is an ensemble
approximating the minimum sum of squared distances over all pairs
of conformations in the ensemble. Fortunately, a few iterations are
usually enough for convergence (see an example of the performance
of GPA in Supporting Information section S7).
Superposition of Two Ensembles
The Mahalanobis distance
term (Δ–Δ) of the KL divergence
in eq is affected
by the fashion in which we superimpose the average conformations (μ,μ). However, superimposition via minimizing
the Euclidean distance may overestimate the KL divergence via an artificial
contribution to the Mahalanobis distance Δ–Δ; Δ = μ – μ. Therefore, superimposition of the average conformations should
aim at minimization of the Mahalanobis distance. In contrast to the
minimization of the Euclidean distance, there is no analytical solution
known for minimizing the Mahalanobis distance. This type of nonlinear
minimization is known as the Covariance Weighted Procrustes Analysis[43] (CWPA), and numerical methods can be used to
find the optimal superposition (rotations and translations) for minimizing
the quadratic term Δ–Δ.
Steps for Performing RPCA Analysis of Two Molecular States
GPA fitting
of the conformations sampled
in the simulation of the first state. The covariance matrix of this
state is also computed at this step. A successful superimposition
of the ensemble will lead to a singular covariance matrix with at
least six eigenvalues of zero value accounting for removing the external
degrees of freedom.GPA fitting of the conformations sampled
in the simulation of the second state to obtain the average conformation.Covariance-weighted fitting
of the
average conformation of the second state on the average conformation
of the first state by minimizing their Mahalanobis distance. This
unconstrained nonlinear optimization is numerically performed using
the line-search algorithm and the BFGS factored method to update the
Hessian.[44]The new average conformation of the
second state is used as a reference to refit the conformations of
the second state and to compute the covariance matrix of the second
state.Simultaneous
diagonalization of the
covariance matrices is performed. Optionally, the subspacing optimal
algorithm can be used.KL divergences of the relative principal
components are computed and the components are reordered based on
their scores (KL divergences).We have
developed computational tools to perform these
steps. The tools are written in the C programming language, and the
numerical linear algebra operations are performed using the BLAS and
LAPACK routines.[45] When computing the whitening
transformation matrix in eq , the limit of the zero value of the eigenvalues is defined
using the square root of the “machine epsilon” in double
precision multiplied by the largest eigenvalue. The covariance-weighted
superimposition is performed using the nonlinear minimization algorithm
by Dennis and Schnabel.[44]
Conclusions
Here, we introduced the RPCA method, which
extracts the relevant
principal components describing the change between two macroscopic
states of a dynamic system represented by two data sets. The definition
of the macroscopic change of a dynamic system and its quantification
are based on previous work where we derived a generalized quantification
of the change of a physical system based on statistical mechanics.
We presented use cases for conformational changes taking place upon
ligand binding to HIV-1 protease that clearly illustrate the power
of RPCA to characterize the relevant changes between two ensembles
in a high-dimensional space. Moreover, software solutions were introduced
to ensure the removal of the similarity transformations (rotations
and translations) by superposing the conformations using GPA and CWPA.
These procedures may also be beneficial for preparing the conformations
for other analysis methods.Although RPCA is currently limited
to handling only continuous
variables and two macroscopic states, the introduced framework for
quantifying changes of dynamic systems using the exponential family
of distributions is flexible regarding the nature of probability distributions
and the nature of the microscopic variables (continuous and categorical).
Therefore, the presented theoretical formalism opens the door for
developing improved new methods for mining the factors underlying
changes in dynamic systems in the directions of (i) handling both
continuous and categorical data (e.g., the effect of sequence changes
(mutations) on the binding affinity) and of (ii) handling multiple
macroscopic states (e.g., study the binding of a series of ligands
to a receptor).