
Computing the Partial Correlation of ICA Models for Non-Gaussian Graph Signal Processing.

Jordi Belda, Luis Vergara, Gonzalo Safont, Addisson Salazar

Abstract

Conventional partial correlation coefficients (PCC) were extended to the non-Gaussian case, in particular to independent component analysis (ICA) models of the observed multivariate samples. Thus, the usual methods that define the pairwise connections of a graph from the precision matrix were correspondingly extended. The basic concept involved replacing the implicit linear estimation of conventional PCC with a nonlinear estimation (conditional mean) assuming ICA. Thus, the correlation between a given pair of nodes induced by the rest of the nodes is better eliminated, and hence the specific connectivity weights can be better estimated. Some synthetic and real data examples illustrate the approach in a graph signal processing context.

Keywords:  graph signal processing; independent component analysis; partial correlation

Year:  2018        PMID: 33266738      PMCID: PMC7514127          DOI: 10.3390/e21010022

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

1.1. Background

The partial correlation coefficient (PCC) [1] is a classical concept that has relevance in a variety of statistical signal processing problems. Essentially, PCC measures the correlation between two random variables conditioned on other observed random variables. Interest in it has recently increased due to the emergence of graph signal processing (GSP) [2,3,4]. One key aspect of GSP is defining the graph connectivity. Although this can be done considering the natural interactions from the context where the graph signal is defined (e.g., time or space proximity between two nodes), it is desirable to develop formal statistical methods; that is, given a set of multivariate samples where every sample component is assigned to a node of the graph, the graph connectivity which best describes the implicit dependencies between any two nodes can be learned. Thus, PCCs are appropriate candidates to define the connectivity, as the effect of the rest of the nodes is removed from the pairwise correlation. Actually, the PCC is formally interpreted as the correlation between the residuals obtained after optimal estimation of the values of the two involved nodes from the rest of the nodes. Optimality is in the sense of minimum linear mean square error (LMSE). Fortunately, it is not necessary to make an explicit estimation, as the PCCs can be computed from the so-called precision matrix (the inverse of the covariance matrix). Thus, many efforts have focused on estimating the precision matrix, both in GSP [5,6] and in statistics [7,8,9,10,11,12]. However, the minimum LMSE estimator is optimum only if Gaussianity can be assumed. Accordingly, we can say that methods based on the precision matrix are suboptimal in non-Gaussian scenarios. The concept of PCC is extended to a Gaussian mixture model (GMM) in [13]. Apart from this work, and to our knowledge, there have been no other attempts to consider non-Gaussian models in graph connectivity learning.
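As a concrete illustration of the precision-matrix shortcut described above, the following sketch (a hypothetical 4-node example of our own; variable names are ours) estimates the covariance from Gaussian samples, inverts it, and applies the usual PCC normalization:

```python
import numpy as np

# Hypothetical 4-node example: Gaussian data with chain-like linear mixing.
rng = np.random.default_rng(0)
N, T = 4, 50000
A_mix = np.eye(N) + 0.5 * np.diag(np.ones(N - 1), 1)   # x = A_mix @ s
x = rng.standard_normal((T, N)) @ A_mix.T

C = np.cov(x, rowvar=False)      # sample covariance matrix
Q = np.linalg.inv(C)             # precision matrix

# PCC between nodes n and m: rho_nm = -q_nm / sqrt(q_nn * q_mm)
d = np.sqrt(np.diag(Q))
P = -Q / np.outer(d, d)
np.fill_diagonal(P, 1.0)
```

Under Gaussianity these values coincide with the correlations of the LMSE residuals; the point developed in this paper is that this shortcut is only optimal in that case.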

1.2. New Contributions and Paper Organization

In this work we consider the partial correlation computation under a non-Gaussian model, in particular, a model of independent component analysis (ICA). ICA [14,15,16,17] is a consolidated technique which has found a myriad of applications in statistical signal processing (e.g., blind source separation [18,19,20,21,22,23,24,25]) and pattern recognition (see [26,27,28,29,30,31] and references therein). From the perspective of this work, ICA is a model which incorporates non-Gaussianity through some independent variables (sources), which are linearly mixed to create the observed samples. This makes it highly versatile and allows for the modeling of non-Gaussian multivariate densities. In the next section we define a new partial correlation coefficient: ICA-PCC. The basic concept is to replace the implicit linear estimation of conventional PCC by a nonlinear estimation (conditional mean) assuming an underlying ICA model. Then, a general formula is presented to compute the residual covariance matrix from which the ICA-PCCs are to be computed. An essential part of this formula is a diagonal matrix having entries equal to the mean-square errors of estimating the sources of the ICA model. Then, in Section 3, a practical method is presented to estimate such a matrix from the ICA model parameters. Finally, Section 4 includes some simulations to illustrate the improved estimation of the partial correlation by ICA-PCC in non-Gaussian scenarios. A real data example with highly non-Gaussian multichannel EEG signals is also included to quantify changes in brain connectivity between normal and abnormal states of a patient during sleep.

2. The Partial Correlation of ICA Models

2.1. Statement of the Problem

Let $\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_N]^T$ be the observation vector having covariance matrix $\mathbf{C} = E[\mathbf{x}\mathbf{x}^T]$. We assume that $\mathbf{x}$ obeys an ICA model, then

$\mathbf{x} = \mathbf{A}\,\mathbf{s}, \quad (1)$

where $\mathbf{s} = [s_1 \; s_2 \; \cdots \; s_N]^T$ is a vector of independent sources and $\mathbf{A}$ is a square and invertible mixing matrix ($\mathbf{W} = \mathbf{A}^{-1}$ is the de-mixing matrix). The sources are considered standardized (zero mean and unit variance); otherwise, they may have different non-Gaussian marginal densities $p_i(s_i)$, which factorize the joint probability density function (pdf) $p(\mathbf{s}) = \prod_{i=1}^{N} p_i(s_i)$. Notice that

$\mathbf{C} = E[\mathbf{A}\mathbf{s}\mathbf{s}^T\mathbf{A}^T] = \mathbf{A}\mathbf{A}^T. \quad (2)$

Every component $x_n$ of $\mathbf{x}$ is assigned to a node of a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathbf{B})$, where $\mathcal{V}$ is the set of N nodes, $\mathcal{E}$ is the set of edges connecting the nodes and $\mathbf{B}$ is the adjacency matrix. The generic element $b_{nm}$ is the weight (assumed real and nonnegative) corresponding to the edge connecting node m to node n. We will consider undirected graphs, so $b_{nm} = b_{mn}$. The problem is to learn $\mathbf{B}$ from an available set of observation vectors. PCCs are reasonable candidates, as they can measure the correlation between two nodes removing the effect of the rest of the nodes. Moreover, PCCs can be computed from the precision matrix $\mathbf{Q} = \mathbf{C}^{-1}$ in the form

$\rho_{nm} = \dfrac{-q_{nm}}{\sqrt{q_{nn}\,q_{mm}}}, \quad (3)$

where $q_{nm}$ is the nm element of matrix $\mathbf{Q}$ and $\rho_{nm}$ is the PCC of nodes n and m. Equation (3) could be used for any underlying joint probability density $p(\mathbf{x})$; however, it is optimal only for the Gaussian case. This is because the formal definition of $\rho_{nm}$ is given by

$\rho_{nm} = \dfrac{E[(x_n - \hat{x}_n)(x_m - \hat{x}_m)]}{\sqrt{E[(x_n - \hat{x}_n)^2]\,E[(x_m - \hat{x}_m)^2]}}, \quad (4)$

where $\mathbf{x}_{\overline{nm}}$ is the vector formed by all the components of $\mathbf{x}$ except $x_n$ and $x_m$, and $\hat{x}_n$, $\hat{x}_m$ are respectively the minimum LMSE estimates of $x_n$ and $x_m$ from $\mathbf{x}_{\overline{nm}}$. However, optimum removal of the effect of $\mathbf{x}_{\overline{nm}}$ implies the use of the conditional means $E[x_n \mid \mathbf{x}_{\overline{nm}}]$ and $E[x_m \mid \mathbf{x}_{\overline{nm}}]$, which respectively coincide with $\hat{x}_n$ and $\hat{x}_m$ only when $p(\mathbf{x})$ is multivariate Gaussian. Thus, in the non-Gaussian case, conventional PCC does not precisely capture the partial correlation, and so the graph connectivity. In [13] a generalized PCC (GPCC) is defined in the form

$\rho^{g}_{nm} = \dfrac{E\big[(x_n - E[x_n \mid \mathbf{x}_{\overline{nm}}])(x_m - E[x_m \mid \mathbf{x}_{\overline{nm}}])\big]}{\sqrt{E\big[(x_n - E[x_n \mid \mathbf{x}_{\overline{nm}}])^2\big]\,E\big[(x_m - E[x_m \mid \mathbf{x}_{\overline{nm}}])^2\big]}}, \quad (5)$

where the conditional mean depends on the specific model assumed for $p(\mathbf{x})$. In this paper we consider the ICA model (1). Then, the corresponding partial correlation coefficient is called ICA-PCC, and it is represented by $\rho^{ICA}_{nm}$ to be specific with respect to the general definition (5).
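To see why replacing the LMSE estimate of Equation (4) by the conditional mean of Equation (5) matters outside Gaussianity, consider this small synthetic sketch (our own toy example, not from the paper): one variable depends quadratically on the conditioning variable, so the linear estimate removes essentially nothing while the conditional mean does.

```python
import numpy as np

rng = np.random.default_rng(7)
z = rng.uniform(-1, 1, 100_000)                  # conditioning variable
x1 = z**2 + 0.1 * rng.standard_normal(z.size)    # nonlinear dependence on z

# Linear (LMSE) estimate of x1 from z: here E[x1 * z] ~ E[z^3] = 0, so the
# linear estimate is essentially just the mean and removes almost nothing.
a = np.cov(x1, z)[0, 1] / np.var(z)
lin_residual = x1 - (x1.mean() + a * (z - z.mean()))

# Conditional-mean estimate (the model is known in this toy): E[x1 | z] = z**2.
cm_residual = x1 - z**2
```

The conditional-mean residual variance stays near the noise floor (0.01), while the linear residual retains the full variance of the quadratic term, illustrating the "better elimination" that the ICA-PCC pursues.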

2.2. A General Formula for the Residual Covariance

Let us define the vector $\mathbf{x}_{nm} = [x_n \; x_m]^T$. We can express $\mathbf{x}_{\overline{nm}}$ and $\mathbf{x}_{nm}$ in the form

$\mathbf{x}_{\overline{nm}} = \mathbf{S}_{\overline{nm}}^T\,\mathbf{x}, \qquad \mathbf{x}_{nm} = \mathbf{S}_{nm}^T\,\mathbf{x}, \quad (6)$

where $\mathbf{S}_{\overline{nm}}$ is a matrix of dimension $N \times (N-2)$ obtained from an identity matrix by dropping the n-th and m-th columns. Similarly, $\mathbf{S}_{nm}$ is a matrix of dimension $N \times 2$ obtained from an identity matrix dropping all but the n-th and m-th columns. Let us also define the residual vector

$\mathbf{e}_{nm} = \mathbf{x}_{nm} - E[\mathbf{x}_{nm} \mid \mathbf{x}_{\overline{nm}}]. \quad (7)$

Notice that the conditional mean is an unbiased estimator, hence the residuals are zero mean and the residual covariance matrix will be

$\mathbf{C}_{\mathbf{e}_{nm}} = E[\mathbf{e}_{nm}\mathbf{e}_{nm}^T]. \quad (8)$

We want to compute the residual covariance matrix so that (5) can be applied. We assume an ICA model. First notice that $\mathbf{s} = \mathbf{W}\mathbf{x}$, but considering (1) and (6), we may write $\mathbf{s} = \mathbf{W}\mathbf{S}_{nm}\,\mathbf{x}_{nm} + \mathbf{W}\mathbf{S}_{\overline{nm}}\,\mathbf{x}_{\overline{nm}}$. Then, we can solve for $\mathbf{x}_{nm}$ and take the conditional mean:

$E[\mathbf{x}_{nm} \mid \mathbf{x}_{\overline{nm}}] = \mathbf{U}\big(E[\mathbf{s} \mid \mathbf{x}_{\overline{nm}}] - \mathbf{W}\mathbf{S}_{\overline{nm}}\,\mathbf{x}_{\overline{nm}}\big), \qquad \mathbf{U} = (\mathbf{W}\mathbf{S}_{nm})^{+}, \quad (9)$

where $(\cdot)^{+}$ is the Moore-Penrose (left) pseudoinverse. Thus, in (9) we are expressing the conditional mean of $\mathbf{x}_{nm}$ in terms of the conditional mean of the sources and the ICA model parameters. This allows us to derive the following general formula, which in spite of its simplicity requires a rather tedious derivation that can be found in Appendix A:

$\mathbf{C}_{\mathbf{e}_{nm}} = \mathbf{U}\,\mathbf{D}\,\mathbf{U}^T, \quad (10)$

where $\mathbf{D}$ is an $N \times N$ diagonal matrix having in its main diagonal the MSEs of optimally estimating the sources from $\mathbf{x}_{\overline{nm}}$, that is,

$d_{ii} = E\big[(s_i - E[s_i \mid \mathbf{x}_{\overline{nm}}])^2\big]. \quad (11)$

Moreover, from (A7) (see Appendix) we know that $d_{ii} = 1 - \operatorname{var}\big[E[s_i \mid \mathbf{x}_{\overline{nm}}]\big]$; hence, considering that, by definition, $d_{ii}$ and $\operatorname{var}[\cdot]$ are positive quantities, we conclude that $0 \le d_{ii} \le 1$. Notice that (10) is a combination of the contributions of every source to the residual covariance matrix. This can be better seen by expressing (10) in the alternative form

$\mathbf{C}_{\mathbf{e}_{nm}} = \sum_{i=1}^{N} d_{ii}\,\mathbf{u}_i\mathbf{u}_i^T, \quad (12)$

where $\mathbf{u}_i$ is the i-th column of $\mathbf{U}$. Notice that $\mathbf{W}\mathbf{S}_{nm}$ is a matrix formed by the n-th and m-th columns of $\mathbf{W}$, i.e., by the coefficients that define the (demixing) contributions of $x_n$ and $x_m$ to $\mathbf{s}$. Thus, $\mathbf{U}$ is a $2 \times N$ matrix, $\mathbf{u}_i$ is a $2 \times 1$ vector and $\mathbf{u}_i\mathbf{u}_i^T$ is a $2 \times 2$ matrix that can be interpreted as the contribution of source $s_i$ to $\mathbf{C}_{\mathbf{e}_{nm}}$. This contribution is weighted by $d_{ii}$. Thus, $d_{ii} = 0$ indicates that source $s_i$ is perfectly estimated by $\mathbf{x}_{\overline{nm}}$, hence it does not contribute to the partial correlation between $x_n$ and $x_m$. At the other extreme, $d_{ii} = 1$ indicates that $s_i$ is independent of $\mathbf{x}_{\overline{nm}}$, so it has maximum contribution to the partial correlation between $x_n$ and $x_m$.
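The computation of the ICA-PCC from the residual covariance can be sketched directly (a minimal sketch assuming the form $\mathbf{C}_{\mathbf{e}_{nm}} = \mathbf{U}\mathbf{D}\mathbf{U}^T$, with $\mathbf{U}$ the pseudoinverse of the two retained de-mixing columns and $\mathbf{D}$ the diagonal matrix of source MSEs; the function name is ours and $\mathbf{D}$ is assumed already known, e.g., from the method of Section 3):

```python
import numpy as np

def ica_pcc(A, D, n, m):
    """ICA-PCC of nodes n and m: U = pinv(W S_nm) with W = A^{-1},
    C_e = U D U^T, then normalize the off-diagonal entry."""
    N = A.shape[0]
    W = np.linalg.inv(A)                 # de-mixing matrix
    S_nm = np.eye(N)[:, [n, m]]          # identity keeping only columns n, m
    U = np.linalg.pinv(W @ S_nm)         # 2 x N Moore-Penrose pseudoinverse
    Ce = U @ D @ U.T                     # residual covariance
    return Ce[0, 1] / np.sqrt(Ce[0, 0] * Ce[1, 1])

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))          # toy mixing matrix
D = np.diag(rng.uniform(0.1, 1.0, 5))    # toy source MSEs d_ii in (0, 1]
rho = ica_pcc(A, D, 0, 1)
```

Since $\mathbf{D}$ is positive semidefinite, so is the resulting residual covariance, and the normalized coefficient stays in $[-1, 1]$ and is symmetric in the two nodes.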

3. Estimating the ICA Partial Correlation Coefficients

We want to estimate $\rho^{ICA}_{nm}$ by

$\hat{\rho}^{ICA}_{nm} = \dfrac{\hat{c}_{12}}{\sqrt{\hat{c}_{11}\,\hat{c}_{22}}}, \quad (13)$

where $\hat{c}_{kl}$ are the entries of an estimate $\hat{\mathbf{C}}_{\mathbf{e}_{nm}}$. So, according to (7) and (8), we have to estimate $\mathbf{C}_{\mathbf{e}_{nm}}$. Considering (10), we need estimates of $\mathbf{W}$ and $\mathbf{D}$. Estimates of $\mathbf{W}$, the ICA model parameters, can be obtained using a variety of algorithms [14,15,16,17,26,27,28,29,30,31]; so, in the following, we concentrate on the estimation of $\mathbf{D}$, i.e., on estimating $E[s_i \mid \mathbf{x}_{\overline{nm}}]$. To compute $E[s_i \mid \mathbf{x}_{\overline{nm}}]$ we will consider a particular form of the Wiener structure, which was proposed in [32], namely:

$E[s_i \mid \mathbf{x}_{\overline{nm}}] \cong E[s_i \mid \hat{s}_i], \quad (14)$

where $\hat{s}_i$ is the LMSE estimator of $s_i$ from $\mathbf{x}_{\overline{nm}}$ (we dropped the dependence on nm to ease the notation) and the uni-dimensional conditional mean can be approximated by [32,33]

$E[s_i \mid \hat{s}_i] \cong \sum_{k} \dfrac{E[s_i\,h_k(\hat{z}_i)]}{k!}\, h_k(\hat{z}_i), \quad (15)$

where $h_k$ is the k-th Hermite polynomial and $\hat{z}_i = \hat{s}_i/\sigma_i$, with $\sigma_i^2 = E[\hat{s}_i^2]$, is a standardized Gaussian random variable (this is justified in [32] by using the central limit theorem). Let us approximate (15) by the first two terms. Taking into account that $h_1(\hat{z}_i) = \hat{z}_i$ and $h_2(\hat{z}_i) = \hat{z}_i^2 - 1$, we can write

$E[s_i \mid \mathbf{x}_{\overline{nm}}] \cong E[s_i \hat{z}_i]\,\hat{z}_i + \dfrac{E[s_i\,h_2(\hat{z}_i)]}{2}\, h_2(\hat{z}_i), \quad (16)$

but

$E[s_i \hat{z}_i] = E[\hat{s}_i \hat{z}_i] + E[(s_i - \hat{s}_i)\hat{z}_i] = \sigma_i, \quad (17)$

where we consider that $E[(s_i - \hat{s}_i)\hat{z}_i] = 0$ (due to the orthogonality between the estimation error and the linear estimate). As any LMSE estimator is unbiased, we know that $E[\hat{s}_i] = 0$. Then, we can express the conditional mean in (16) as the combination of a linear term $\sigma_i \hat{z}_i = \hat{s}_i$ plus a nonlinear term $\nu_i = \frac{1}{2} E[s_i\,h_2(\hat{z}_i)]\, h_2(\hat{z}_i)$. Let us now compactly express the estimation of $\mathbf{s}$ from $\mathbf{x}_{\overline{nm}}$ in the form

$E[\mathbf{s} \mid \mathbf{x}_{\overline{nm}}] \cong \hat{\mathbf{s}} + \boldsymbol{\nu}, \qquad \hat{\mathbf{s}} = \mathbf{G}\,\mathbf{x}_{\overline{nm}}, \qquad \boldsymbol{\nu} = [\nu_1 \cdots \nu_N]^T. \quad (18)$

We can write

$d_{ii} = E\big[(s_i - \hat{s}_i - \nu_i)^2\big] = d^{L}_{ii} - 2E[(s_i - \hat{s}_i)\nu_i] + E[\nu_i^2], \quad (19)$

where $\mathbf{D}_L$ is a diagonal matrix whose elements $d^{L}_{ii}$ are the MSEs corresponding to the linear estimation of $\mathbf{s}$ from $\mathbf{x}_{\overline{nm}}$, that is,

$d^{L}_{ii} = E\big[(s_i - \hat{s}_i)^2\big] = 1 - E[\hat{s}_i^2]. \quad (20)$

In (20), we have considered the orthogonality between the error vector and the estimate vector. However, $\hat{\mathbf{s}}$ is the minimum LMSE estimate, so it can be obtained from the Wiener-Hopf equations

$\mathbf{G} = E[\mathbf{s}\,\mathbf{x}_{\overline{nm}}^T]\; E[\mathbf{x}_{\overline{nm}}\mathbf{x}_{\overline{nm}}^T]^{-1}. \quad (21)$

Hence, we can write:

$E[\hat{\mathbf{s}}\hat{\mathbf{s}}^T] = \mathbf{G}\, E[\mathbf{x}_{\overline{nm}}\mathbf{x}_{\overline{nm}}^T]\,\mathbf{G}^T = E[\mathbf{s}\,\mathbf{x}_{\overline{nm}}^T]\; E[\mathbf{x}_{\overline{nm}}\mathbf{x}_{\overline{nm}}^T]^{-1}\, E[\mathbf{x}_{\overline{nm}}\,\mathbf{s}^T]. \quad (22)$

Taking into account $E[\mathbf{s}\,\mathbf{x}_{\overline{nm}}^T] = \mathbf{A}^T\mathbf{S}_{\overline{nm}}$, $E[\mathbf{x}_{\overline{nm}}\mathbf{x}_{\overline{nm}}^T] = \mathbf{S}_{\overline{nm}}^T\mathbf{A}\mathbf{A}^T\mathbf{S}_{\overline{nm}}$, and $\mathbf{W} = \mathbf{A}^{-1}$, we can finally express $\mathbf{D}_L$ in terms of the ICA model parameters:

$\mathbf{D}_L = \mathbf{I} - \operatorname{diag}\big(\mathbf{A}^T\mathbf{S}_{\overline{nm}}\,(\mathbf{S}_{\overline{nm}}^T\mathbf{A}\mathbf{A}^T\mathbf{S}_{\overline{nm}})^{-1}\,\mathbf{S}_{\overline{nm}}^T\mathbf{A}\big). \quad (23)$

Let us now consider the other two terms in (19). First, notice that $\nu_i$ can be interpreted as a linear estimate of $s_i - \hat{s}_i$ from $h_2(\hat{z}_i)$, because assuming that $\hat{z}_i$ is a standardized Gaussian random variable, then $\hat{z}_i^2$ is having a mean equal to 1 and variance equal to 2. Hence, we can apply orthogonality again:

$E\big[(s_i - \hat{s}_i - \nu_i)\,\nu_i\big] = 0. \quad (24)$

Consequently, we have

$d_{ii} = d^{L}_{ii} - E[\nu_i^2] = d^{L}_{ii} - \dfrac{1}{2}\big(E[s_i\,h_2(\hat{z}_i)] - E[\hat{s}_i\,h_2(\hat{z}_i)]\big)^2. \quad (25)$

The second term in (25) is zero, because

$E[\hat{s}_i\,h_2(\hat{z}_i)] = \sigma_i\big(E[\hat{z}_i^3] - E[\hat{z}_i]\big) = 0, \quad (26)$

where $\hat{s}_i = \sigma_i \hat{z}_i$ and $E[\hat{z}_i^3] = E[\hat{z}_i] = 0$, because we assume that $\hat{z}_i$ is Gaussian so its odd moments are zero.
Regarding the first term in (25),

$E[s_i\,h_2(\hat{z}_i)] = E[s_i \hat{z}_i^2] - E[s_i] = E[s_i \hat{z}_i^2]. \quad (27)$

Defining the vector $\hat{\mathbf{z}} = [\hat{z}_1 \cdots \hat{z}_N]^T = \boldsymbol{\Sigma}^{-1}\hat{\mathbf{s}}$, with $\boldsymbol{\Sigma} = \operatorname{diag}(\sigma_1, \ldots, \sigma_N)$, and taking into account that $\mathbf{s} = \mathbf{W}\mathbf{x}$, we can write

$d^{NL}_{ii} = E[\nu_i^2] = \dfrac{1}{2}\big(E[s_i \hat{z}_i^2]\big)^2, \quad (28)$

and considering (21) and (23), (28) can be expressed in terms of the ICA model parameters

$d^{NL}_{ii} = \dfrac{1}{2}\Big(E\big[(\mathbf{W}\mathbf{x})_i\,\big((\boldsymbol{\Sigma}^{-1}\mathbf{G}\,\mathbf{S}_{\overline{nm}}^T\mathbf{x})_i\big)^2\big]\Big)^2, \qquad \sigma_i^2 = 1 - d^{L}_{ii}. \quad (29)$

So, in conclusion, we can express the matrix $\mathbf{D}$ as

$\mathbf{D} = \mathbf{D}_L - \mathbf{D}_{NL}, \quad (30)$

where $\mathbf{D}_L$ and $\mathbf{D}_{NL}$ can be obtained from (23) and (29), respectively, using estimates of the model parameters and a sample mean to evaluate the expectation required in (29). Algorithm 1 below describes the estimation procedure. Equation (30) provides an interesting decomposition of $\mathbf{D}$. Let $d_{ii}$ be called the entries of the diagonal matrix $\mathbf{D}$; then $d_{ii}$ can be expressed as $d^{L}_{ii}$ minus a nonnegative term (see Equation (29)), so that $d_{ii} \le d^{L}_{ii}$. The condition $d_{ii} = d^{L}_{ii}$ holds for the Gaussian case, because then $E[s_i \hat{z}_i^2]$ (see Equation (27)) becomes zero (it is an odd higher-order moment of a multivariate Gaussian variable). In such a case, $E[\mathbf{s} \mid \mathbf{x}_{\overline{nm}}]$ becomes a linear function of $\mathbf{x}_{\overline{nm}}$ and so the same happens with $E[\mathbf{x}_{nm} \mid \mathbf{x}_{\overline{nm}}]$ in (9). Hence, the second term in (30) is responsible for the improved reduction in the influence of $\mathbf{x}_{\overline{nm}}$ in the estimation of the partial correlation between $x_n$ and $x_m$ in the non-Gaussian case. Moreover, we should expect similar results for ICA-PCC and PCC in the Gaussian case.
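The linear part of the source MSEs can be computed from the mixing matrix alone, since the observation covariance is $\mathbf{A}\mathbf{A}^T$. Below is a minimal sketch (function name ours) of the diagonal of $\mathbf{D}_L$, i.e., $d^L_{ii} = 1 - \operatorname{var}[\hat{s}_i]$, omitting the nonlinear correction $\mathbf{D}_{NL}$, which additionally requires a sample mean over training data:

```python
import numpy as np

def linear_source_mse(A, n, m):
    """Diagonal of D_L: MSEs of the LMSE estimates of the (unit-variance)
    sources from the observations excluding x_n and x_m."""
    N = A.shape[0]
    keep = [k for k in range(N) if k not in (n, m)]
    S_bar = np.eye(N)[:, keep]            # N x (N-2) selector matrix
    B = A.T @ S_bar                       # cross-covariance E[s x_bar^T]
    C_bar = S_bar.T @ A @ A.T @ S_bar     # covariance of x_bar
    gain = B @ np.linalg.solve(C_bar, B.T)
    return 1.0 - np.diag(gain)            # d^L_ii = 1 - var[s_hat_i]
```

As a sanity check: for $\mathbf{A} = \mathbf{I}$, removing nodes 0 and 1 makes sources $s_0$, $s_1$ unobservable (MSE 1) and leaves the rest perfectly estimated (MSE 0).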

4. Experiments

4.1. Synthetic Data Experiments

In this experiment, we evaluated the influence of the training set size on the estimation of the ICA-PCCs, as well as comparing the quality of the estimates with the ones obtained from the precision matrix. To this aim, we generated synthetic data corresponding to three different ICA models. In the first one, the sources were independent and identically distributed (i.i.d.) random variables having a unit-variance zero-mean uniform pdf. This corresponds to an example of a sub-Gaussian distribution, as the excess kurtosis is negative, $\kappa - 3 = -1.2$, where $\kappa$ is the kurtosis and $3$ is the kurtosis of a Gaussian pdf. In the second model, the sources were i.i.d. random variables having a unit-variance zero-mean Laplacian pdf. This is an example of a super-Gaussian distribution, as the excess kurtosis is positive, $\kappa - 3 = 3$. In the third model, some sources are uniform and the rest are Laplacian. Finally, we also considered the Gaussian case by generating sources having a standard Gaussian pdf. Figure 1 shows the errors corresponding to the estimation of the partial correlation coefficients for the four models.
Figure 1

Estimation error of the ICA-PCCs (blue: Extended-Infomax; yellow: JADE) and of the conventional PCCs; (a) sub-Gaussian case, (b) super-Gaussian case, (c) mixed (15/5) sub/super-Gaussian case, (d) Gaussian case.
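The non-Gaussian source classes used in this experiment can be generated and checked as follows (a sketch of the source generation only; the exact experiment settings, such as N = 20 and the two ICA algorithms, are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200_000

# Unit-variance, zero-mean sources of each class.
sub_gauss = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), T)   # uniform pdf
super_gauss = rng.laplace(0.0, 1.0 / np.sqrt(2.0), T)     # Laplacian pdf

def excess_kurtosis(s):
    z = (s - s.mean()) / s.std()
    return np.mean(z**4) - 3.0
```

For the uniform pdf the excess kurtosis is -1.2 (sub-Gaussian); for the unit-variance Laplacian it is +3 (super-Gaussian).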

Every curve is an average of 10 curves corresponding to 10 different runs. In every run, an ICA mixing matrix $\mathbf{A}$ was randomly selected; every entry was obtained by sampling a standard Gaussian pdf. Then, a varying number of training vectors was generated from source vectors having independent components sampled from the mentioned marginal pdfs: sub-Gaussian (Figure 1a), super-Gaussian (Figure 1b), mixed sub/super-Gaussian (Figure 1c) and Gaussian (Figure 1d). The error was computed as

$\varepsilon = \sqrt{\dfrac{\sum_{n \ne m}\big(\hat{\rho}_{nm} - \rho^{ICA}_{nm}\big)^2}{\sum_{n \ne m}\big(\rho^{ICA}_{nm}\big)^2}} \quad (31)$

and averaged over the 10 runs for every training set size. Notice that $\varepsilon = 0$ when $\hat{\rho}_{nm} = \rho^{ICA}_{nm}$ for all $n, m$, and $\varepsilon = 1$ when $\hat{\rho}_{nm} = 0$ for all $n, m$. In (31), $\rho^{ICA}_{nm}$ was obtained from Algorithm 1 using the true matrix $\mathbf{A}$ and an extremely large number of instances for the sample mean required to compute the expectation in (29). On the other hand, $\hat{\rho}_{nm}$ was computed from Algorithm 1 using estimates of $\mathbf{A}$ obtained with the corresponding finite training set. We used the Extended Infomax algorithm described in [34] and the JADE algorithm [35]. Extended Infomax is an extension of the Infomax algorithm [36] used to deal with mixed sub/super-Gaussian sources. It is representative of algorithms that iteratively optimize some defined cost function, like FastICA. JADE is based on matrix computation and diagonalization, so it is not sensitive to initialization or optimization path problems. The same finite training set was also used to evaluate the expectation in (29). In all cases we considered N = 20. For comparison, we also computed the error corresponding to the PCCs obtained from empirical estimates of the precision matrix, as indicated in (3); this error is normalized in the same way as (31). Several conclusions may be drawn from Figure 1. First, we can see that in the non-Gaussian cases (a), (b) and (c), PCC cannot decrease the error with an increased training set size. This demonstrates the model mismatch due to the implicit Gaussianity of PCC.
In these three cases, ICA-PCC methods improve on PCC after a sufficient number of training samples and maintain a decreasing error as the training set size increases. The minimum training set size required to improve on PCC depends on the case and on the ICA-PCC method. Thus, this minimum number is smaller in Figure 1a (sub-Gaussian) than in Figure 1b (super-Gaussian), and has an intermediate value in Figure 1c (mixed sub/super-Gaussian). However, notice that the non-decreasing error of PCC is much higher in Figure 1a,c, which suggests that the ICA-PCC improvement begins from a smaller training set size. On the other hand, this minimum value is smaller for JADE than for Extended-Infomax. This is due to the different nature of both algorithms: JADE requires a matrix diagonalization, while Extended-Infomax requires iterative learning. However, the computational complexity of JADE is much larger, especially as N increases. Regarding convergence for large training set sizes, we can see that the error level of the mixed case is clearly above the others. This is because, in general, all ICA algorithms have more difficulty estimating the model in the mixed case. Actually, Extended-Infomax was conceived in an effort to deal with the mixed case by incorporating a procedure to estimate the class (sub/super) of every source. This explains the smaller error of Extended-Infomax in Figure 1c with respect to JADE after a given training set size. For the Gaussian case, PCC yields a very small error, which decreases with increasing training set size. In this case, ICA-PCC is worse than PCC, although the error is reasonably small. Remember that, in the Gaussian case, we expected similar results for both methods; however, the estimation path followed is different: in PCC the precision matrix is directly estimated, while in ICA-PCC the matrices $\mathbf{W}$ and $\mathbf{D}$ are estimated in Algorithm 1. This could explain the separation observed in the error curves of Figure 1d.
Finally, most ICA algorithms decompose the estimation of $\mathbf{W}$ into two steps: first, a decorrelation matrix is estimated, and then, a rotation matrix. When the independent components are Gaussian, any rotation matrix is valid, as all of them are compatible with Gaussianity. However, in the non-Gaussian cases the rotation matrix must be properly estimated for the corresponding non-Gaussian model. This can be interpreted as if a smaller number of model parameters (the entries of the decorrelation matrix) actually had to be estimated in the Gaussian case. This explains the faster convergence of the curves in Figure 1d.

4.2. A Real Data Application

We applied the proposed method to quantify the significance of changes in brain connectivity during the sleep of patients having disorders like apnea or epilepsy [37]. These disorders are characterized by regular arousals, which are stages of abnormally degraded sleep. The frequency of arousals in a given period of time is related to the seriousness of the pathology. However, the intensity of the arousals may also be relevant for an appropriate diagnosis. Assuming that an arousal is associated with changes in brain connectivity [38], a measure related to the change magnitude may be useful to quantify the significance of the pathology. To this aim, the patient was monitored during sleep by 19 channels of EEG recordings. Every signal channel was segmented into intervals of 2 s, and a given feature was computed in every interval and averaged in epochs of 26 s. Each epoch was manually or automatically [22] labelled in two possible states: normal sleep (state 0) or arousal (state 1). Then, associated to every epoch, an observation vector $\mathbf{x}$ was built with one feature extracted from all the channels (the same type of feature for all of them), thus N = 19. In this experiment, a total of 2000 epochs were available in every state. Given these data sets, an average measure related to brain connectivity was computed to quantify the importance of brain changes between the two states. There are many possible definitions of overall connectivity; here, we considered the so-called algebraic connectivity [39], which can be computed as the second smallest eigenvalue $\lambda_2$ of the graph Laplacian matrix [40] $\mathbf{L} = \mathbf{G} - \mathbf{B}$, with $\mathbf{G}$ a diagonal matrix with entries $g_{nn} = \sum_m b_{nm}$ and $\mathbf{B}$ the adjacency matrix with entries $b_{nm}$. The Laplacian matrix is positive semidefinite with the smallest eigenvalue equal to zero, then $\lambda_2 \ge 0$. Moreover, it is demonstrated in [39] that $\lambda_2 = N$ for a complete graph (a graph with $b_{nm} = 1$ for all $n \ne m$).
It is also demonstrated in [39] that

$\lambda_2 \le \dfrac{N}{N-1}\,\min_n g_{nn}. \quad (33)$

Hence, assuming that $b_{nm} \le 1$ (as it will be in our case), the greatest upper bound for $\lambda_2$ in (33) corresponds to the complete graph ($g_{nn} = N - 1$), therefore $\lambda_2 \le N$. Consequently, we proposed a normalized version of $\lambda_2$ to measure the connectivity

$\varsigma = \dfrac{\lambda_2}{N}, \qquad 0 \le \varsigma \le 1. \quad (34)$

The lower bound $\varsigma = 0$ corresponds to a disconnected graph, as it implies an order of multiplicity greater than 1 of the smallest eigenvalue. The upper bound $\varsigma = 1$ corresponds to a complete graph, which is the one having maximum connectivity under the constraint $b_{nm} \le 1$. We obtained connectivity estimates for every state (0 or 1) and method (ICA-PCC or PCC): $\hat{\varsigma}_0^{ICAPCC}$, $\hat{\varsigma}_1^{ICAPCC}$, $\hat{\varsigma}_0^{PCC}$, $\hat{\varsigma}_1^{PCC}$. This was made from (34) with N = 19, after computing the second smallest eigenvalue of the Laplacian matrix, considering that the entries of the associated adjacency matrix are the respective magnitudes of the partial correlation estimates obtained from the training set of state 0 or 1:

$b_{nm} = |\hat{\rho}_{nm}|. \quad (35)$

Two different features were considered separately. The first is the "amplitude" (Amp), which is the maximum amplitude in the corresponding 2 s interval; the second is the "alfa-slow-index" (Asi), which is the ratio of the power in the alpha band (8.0–11 Hz) to the combined power in the delta (0.5–3.5 Hz) and theta (3.5–8.0 Hz) bands. Table 1 and Table 2 show the results corresponding to the Amp and the Asi features, respectively, for 6 different patients. Together with the normalized connectivity, we included the connectivity variation between states, defined as $\Delta^{ICAPCC} = |\hat{\varsigma}_1^{ICAPCC} - \hat{\varsigma}_0^{ICAPCC}|$ and $\Delta^{PCC} = |\hat{\varsigma}_1^{PCC} - \hat{\varsigma}_0^{PCC}|$. We also included a kurtosis estimate for every patient and state ($\kappa_0$, $\kappa_1$). This estimate was obtained as the mean of the 19 empirical kurtosis values separately calculated for every component of vector $\mathbf{x}$, i.e., the empirical kurtosis of the marginal distributions of $\mathbf{x}$. Notice that the estimated kurtosis is clearly above the Gaussian reference $\kappa = 3$, so the Gaussianity assumption does not hold in this case.
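The normalized algebraic connectivity of Equation (34) is straightforward to compute from an adjacency matrix. A minimal sketch (function and variable names ours):

```python
import numpy as np

def normalized_connectivity(B):
    """Normalized algebraic connectivity: second-smallest eigenvalue of the
    Laplacian L = G - B divided by N (valid when all weights b_nm <= 1)."""
    N = B.shape[0]
    L = np.diag(B.sum(axis=1)) - B          # graph Laplacian
    eigenvalues = np.sort(np.linalg.eigvalsh(L))
    return eigenvalues[1] / N

# Complete graph with unit weights: maximum connectivity.
complete = np.ones((5, 5)) - np.eye(5)

# Two disconnected triangles: zero algebraic connectivity.
tri = np.ones((3, 3)) - np.eye(3)
disconnected = np.block([[tri, np.zeros((3, 3))], [np.zeros((3, 3)), tri]])
```

For the complete graph the Laplacian eigenvalues are 0 and N (with multiplicity N - 1), so the normalized value is exactly 1; for a disconnected graph the second-smallest eigenvalue is 0.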
Table 1

Results corresponding to the amplitude (Amp).

Subj.   κ0      κ1      ς^0ICAPCC   ς^1ICAPCC   ΔICAPCC   ς^0PCC   ς^1PCC   ΔPCC
S1      6.46    4.58    0.30        0.33        0.03      0.03     0.03     0.00
S2      8.05    5.29    0.74        0.39        0.35      0.04     0.04     0.00
S3      9.84    6.76    0.57        0.28        0.29      0.03     0.02     0.01
S4      9.04    8.87    0.39        0.66        0.27      0.04     0.02     0.02
S5      9.61    15.13   0.31        0.44        0.13      0.02     0.03     0.01
S6      9.14    13.82   0.24        0.36        0.12      0.02     0.02     0.00
Table 2

Results corresponding to the alfa-slow-index (Asi).

Subj.   κ0      κ1      ς^0ICAPCC   ς^1ICAPCC   ΔICAPCC   ς^0PCC   ς^1PCC   ΔPCC
S1      16.32   22.51   0.31        0.51        0.20      0.02     0.02     0.00
S2      10.52   9.09    0.34        0.60        0.26      0.02     0.02     0.00
S3      9.91    7.05    0.68        0.48        0.20      0.02     0.03     0.01
S4      8.39    11.69   0.37        0.74        0.37      0.03     0.02     0.01
S5      7.72    13.15   0.22        0.71        0.49      0.02     0.03     0.01
S6      11.86   9.24    0.43        0.56        0.13      0.02     0.03     0.01
We can see in Table 1 and Table 2 that the PCC method yields very small values of connectivity for all subjects and states; therefore, it is not sensitive to possible changes between states. However, ICA-PCC provides larger values of connectivity and significant changes between states. Figure 2 and Figure 3 show the estimated adjacency matrices corresponding to the different subjects, methods and states. We can see that PCC magnitudes are, in general, much lower than ICA-PCC magnitudes; therefore, PCC has more difficulty revealing the interrelations between the different EEG channels due to brain activity. This may be explained in terms of the residual variances. Notice from (12) that

$E[e_n^2] = \sum_{i=1}^{N} d_{ii}\,u_{1i}^2, \qquad E[e_m^2] = \sum_{i=1}^{N} d_{ii}\,u_{2i}^2, \quad (36)$

where $u_{1i}$ and $u_{2i}$ are the elements of the vectors $[u_{11} \cdots u_{1N}]$ and $[u_{21} \cdots u_{2N}]$, respectively; these are the first and second rows of $\mathbf{U}$, respectively. We showed in Section 3 that PCC should be similar to ICA-PCC for Gaussian observations, but for non-Gaussian observations $d_{ii} \le d^{L}_{ii}$, so it is deduced from (36) that the residual variances implied by the linear estimation of PCC are larger than or equal to those of ICA-PCC, where equality holds in the Gaussian case. So, PCC provides overestimated residuals where the actual partial correlation between $x_n$ and $x_m$ may eventually be hidden. This masking effect should increase with the non-Gaussian character of the observations. In our experiment, the features are highly non-Gaussian, as demonstrated by the kurtosis values of Table 1 and Table 2. So, when using PCC, the "true" residuals are overestimated by rather uncorrelated residuals that provide too low an estimate of the actual interrelation between the different EEG channels.
Figure 2

Adjacency matrices corresponding to Amp.

Figure 3

Adjacency matrices corresponding to Asi.

5. Conclusions and Extensions

Partial correlations may be used to define the weights of an undirected graph for subsequent graph signal processing. Conventionally, partial correlations are obtained from the precision matrix, but this is optimal only under the Gaussianity assumption. Hence, we have proposed a new method for computing the partial correlation, assuming a non-Gaussian model (ICA). The latter is a versatile model which suits a diversity of non-Gaussian pdfs. The proposed method requires the computation of the ICA model parameters, which can be done using any of the many existing algorithms. Two different ICA methods have been considered in the synthetic examples, which may be considered representative of two different kinds of approaches to estimating the ICA model parameters. Both yield similar performance. Computing the mean-square errors corresponding to the optimal estimation of the sources is also required. Hence, we have proposed a second-order approximation of the conditional mean. Higher orders could be tried at the price of increased complexity. We have verified, both by simulations and by real data experiments, that the new method better captures the pairwise and overall connectivity of the graph compared to the precision matrix in non-Gaussian scenarios. The results could be extended to larger values of N, but the training set sizes should be correspondingly increased to keep the quality of the model parameter estimates. Future extensions of this work can be devised. Some kind of regularization is desirable to emphasize the relevant information provided by the graph connectivity and/or to establish more natural relations between the connected nodes. Thus, sparsity is a common requirement of graph learning (see [41] as a representative example). Considering Equation (10), sparsity could be imposed by selecting only those sources that significantly contribute to the partial correlation between $x_n$ and $x_m$, i.e., by soft or hard thresholding on the weighted contributions $d_{ii}\mathbf{u}_i\mathbf{u}_i^T$.
On the other hand, smoothness regularization could be tried in a manner similar to the approach proposed in [42] for the Gaussian case. To this aim, it could be considered that the representation matrix U can be factorized into a correlation matrix multiplied by a rotation (unitary) matrix [43]. Understanding how this rotation relates to the graph connectivity may allow the definition of cost functions which include some smoothness-related terms. Other structural constraints [44] could also be compatible with the non-Gaussian model.
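The hard-thresholding idea suggested above could be sketched as follows (an entirely hypothetical implementation; the function name and the relative threshold are ours): keep only the sources whose weighted contribution to the residual covariance is a significant fraction of the largest one.

```python
import numpy as np

def thresholded_ica_pcc(A, D, n, m, rel_thr=0.1):
    """ICA-PCC keeping only sources with significant contributions
    d_ii * u_i u_i^T to the residual covariance (hard thresholding)."""
    N = A.shape[0]
    W = np.linalg.inv(A)
    U = np.linalg.pinv(W @ np.eye(N)[:, [n, m]])          # 2 x N
    contrib = np.array([D[i, i] * np.outer(U[:, i], U[:, i]) for i in range(N)])
    norms = np.linalg.norm(contrib, axis=(1, 2))
    Ce = contrib[norms >= rel_thr * norms.max()].sum(axis=0)  # hard threshold
    return Ce[0, 1] / np.sqrt(Ce[0, 0] * Ce[1, 1])

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 6))          # toy mixing matrix
D = np.diag(rng.uniform(0.05, 1.0, 6))   # toy source MSEs
```

Since each kept contribution is positive semidefinite, any thresholded sum still yields a coefficient in $[-1, 1]$; with the threshold set to zero, the full ICA-PCC is recovered.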
