
On linear dimension reduction based on diagonalization of scatter matrices for bioinformatics downstream analyses.

Daniel Fischer, Klaus Nordhausen, Hannu Oja.

Abstract

Dimension reduction is often a preliminary step in the analysis of data sets with a large number of variables. Most classical dimension reduction methods, both supervised and unsupervised, such as principal component analysis (PCA), independent component analysis (ICA) or sliced inverse regression (SIR), can be formulated using one, two or several different scatter matrix functionals. Scatter matrices can be seen as different measures of multivariate dispersion; they might highlight different features of the data and, when compared, might reveal interesting structures. Such an analysis then searches for a projection onto an interesting (signal) part of the data, and it is also important to know the correct dimension of the signal subspace. These approaches usually make either no model assumptions or work in wide classes of semiparametric models. Theoretical results in the literature are, however, limited to the case where the sample size exceeds the number of variables, which is hardly ever true for data sets encountered in bioinformatics. In this paper, we briefly review the relevant literature and explore whether the dimension reduction tools can be used to find relevant and interesting subspaces for small-n-large-p data sets. We illustrate the methods with a microarray dataset of prostate cancer patients and healthy controls.
© 2020 Published by Elsevier Ltd.


Keywords:  Bioinformatics; Computer science; Dimension reduction; Genomics; ICA; Mathematics; Microbial genomics; SIR; Statistics; Transcriptomics

Year:  2020        PMID: 33385080      PMCID: PMC7770551          DOI: 10.1016/j.heliyon.2020.e05732

Source DB:  PubMed          Journal:  Heliyon        ISSN: 2405-8440


Introduction

In contemporary data analysis linear dimension reduction is often the first step in reducing the number of variables. For a numeric p-variate random vector $x$ this means that a $q \times p$ transformation matrix $B$ with $q < p$ is searched such that all relevant information for the analysis at hand is contained in $Bx$. The question is then naturally how "information" content is measured, but in general two types of linear dimension reduction can be distinguished: (i) unsupervised reduction, where the part of $x$ not captured by $Bx$ is considered as uninformative noise, and (ii) supervised reduction, where $Bx$ should carry all the information in $x$ about some response variable y of interest. There is an abundance of supervised and unsupervised methods, and depending on the data and problem, they might give quite different results as they adopt different concepts of information. One prevalent supervised method is, e.g., linear discriminant analysis (LDA); more recently developed non-linear dimension reduction methods include t-SNE (van der Maaten and Hinton, 2008), Isomap (Tenenbaum et al., 2000) and UMAP (McInnes et al., 2018). In the context of linear unsupervised dimension reduction, for example, principal component analysis (PCA) understands information as large variation whereas independent component analysis (ICA) usually measures information as a degree of non-gaussianity. The dimension q of the information or signal subspace is preselected visually or by using various information criteria or testing strategies. For a more detailed comparison of the two concepts with practical examples, please see Nordhausen and Oja (2018b). Surprisingly many unsupervised and supervised linear dimension reduction methods can be expressed as a joint diagonalization problem of two scatter matrices, which yields a nice unifying theory in wide semiparametric models. A problem, however, is that the assumption $n > p$ for the sample size n is needed, and the technique cannot be directly applied to bioinformatics data where almost always $n \ll p$. In this paper, we first provide in Section 2 a general but brief overview of linear dimension reduction methods based on the use of two scatter matrices. Section 3 introduces a genetic data set which is used to illustrate the ideas. Section 4 then explores whether there are any ways to apply these methods in the $n \ll p$ case with the example data discussed in Section 3. A GitHub repository that contains the R code and data to reproduce all results and figures is also available (https://github.com/fischuu/LinearDimensionReductionInBioinformatics).

Linear dimension reduction using two scatter matrices

Let $x$ denote a p-variate random vector having cdf $F_x$, and let the joint cdf of a random variable y and $x$ be denoted by $F_{x,y}$. An (unsupervised) scatter functional $S(F_x)$ is then any $p \times p$ matrix-valued functional which is affine equivariant in the sense that $S(F_{Ax+b}) = A\,S(F_x)\,A^\top$ for all nonsingular $p \times p$ matrices $A$ and all p-vectors $b$. Similarly, a supervised scatter functional $S(F_{x,y})$ is defined as any matrix-valued functional which is affine equivariant in the sense that $S(F_{Ax+b,\,y}) = A\,S(F_{x,y})\,A^\top$ with $A$ and $b$ as above. These functionals are in the following also denoted by $S(x)$ and $S(x,y)$. For a random sample $x_1, \ldots, x_n$ (or $(x_1, y_1), \ldots, (x_n, y_n)$), the estimates of the population values $S(x)$ and $S(x,y)$ are obtained as the values of the functionals at the corresponding empirical distributions.

There is a wide literature on scatter functionals. The covariance matrix $S_1(x) = \mathrm{cov}(x) = E[(x-\mu)(x-\mu)^\top]$ serves as the first example, but the scatter matrix may also be based on the fourth moments: $S_2(x) = \mathrm{cov}_4(x) = \frac{1}{p+2} E[r^2 (x-\mu)(x-\mu)^\top]$ with $r^2 = (x-\mu)^\top \mathrm{cov}(x)^{-1} (x-\mu)$. The M-functionals of scatter are implicitly defined by $S(x) = E[w(r^2)\,(x - T(x))(x - T(x))^\top]$, where $T(x)$ is a companion location functional, $w$ is a weight function, and now $r^2 = (x - T(x))^\top S(x)^{-1} (x - T(x))$. Popular weight functions are for example the Huber weights $w(r^2) = \min(1, c^2/r^2)/\sigma^2$, where c is a user-specified tuning constant and $\sigma^2$ a scaling factor. The weight function for Tyler's scatter matrix is $w(r^2) = p/r^2$. For a general discussion on these and other (unsupervised) scatter matrices, see for example Tyler (1987); Rousseeuw and Hubert (2013); Dümbgen et al. (2015). A supervised scatter functional used in sliced inverse regression (SIR) is

$S(x, y) = \mathrm{cov}(E[x \mid y]). \qquad (1)$

For more examples of supervised functionals, see, e.g. Liski et al. (2014).

Initially, scatter functionals were developed to provide robust and efficient competitors to the covariance matrix under multivariate normality or ellipticity assumptions. In elliptic models, the scatter functionals can be shown to be proportional to the covariance matrix (see, e.g. Nordhausen et al., 2011). A property of interest is also the so-called independence property (Nordhausen and Tyler, 2015), which states that if $x$ has independent components, then $S(x)$ is a diagonal matrix. The covariance matrix, the scatter matrix based on the fourth moments as well as the symmetrized versions of Huber's and Tyler's M-estimates all have this property, see for example Dümbgen (1998); Sirkiä et al. (2007).

Surprisingly many linear dimension reduction methods can be seen as a joint diagonalization of two scatter matrices. Let $W = W(x)$ be a transformation matrix (functional) which diagonalizes two scatter functionals $S_1$ and $S_2$ so that $W S_1(x) W^\top = I_p$ and $W S_2(x) W^\top = D$, where $D$ is a diagonal matrix with diagonal elements $d_1 \geq \cdots \geq d_p$. This method is called invariant coordinate selection (ICS) (Tyler et al., 2009) as, for any nonsingular matrix $A$, $W(Ax)\,Ax = W(x)\,x$ up to the signs of the components. The transformed $z = W(x)\,x$ in an invariant coordinate system may reveal intrinsic and hidden structures in the data. For unsupervised $S_1$ and supervised $S_2$, the transformation is known as supervised invariant coordinate selection (SICS) (Liski et al., 2014). ICS and SICS are very general methods with many possible applications - but we now look only at three well-known special cases.
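To make the joint diagonalization concrete, here is a minimal base-R sketch of the ICS transformation for an arbitrary scatter pair. The helper names cov4_hat and ics_transform are our own illustrations, not the API of the ICS package used later in the paper, and numerical safeguards are omitted.

    # Scatter matrix based on fourth moments (sample version):
    # cov4 = E[ r^2 (x - mu)(x - mu)' ] / (p + 2), r^2 the squared
    # Mahalanobis distance with respect to the covariance matrix.
    cov4_hat <- function(X) {
      n  <- nrow(X); p <- ncol(X)
      Xc <- sweep(X, 2, colMeans(X))                 # center the data
      r2 <- rowSums((Xc %*% solve(cov(X))) * Xc)     # squared Mahalanobis distances
      crossprod(sqrt(r2) * Xc) / (n * (p + 2))
    }

    # Joint diagonalization of (S1, S2): returns W with W S1 W' = I_p and
    # W S2 W' = D (diagonal, decreasing), plus the coordinates Z = Xc W'.
    ics_transform <- function(X, S1 = cov(X), S2 = cov4_hat(X)) {
      e1 <- eigen(S1, symmetric = TRUE)
      S1.isqrt <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors)
      e2 <- eigen(S1.isqrt %*% S2 %*% S1.isqrt, symmetric = TRUE)
      W  <- t(e2$vectors) %*% S1.isqrt
      Xc <- sweep(X, 2, colMeans(X))
      list(W = W, D = e2$values, Z = tcrossprod(Xc, W))
    }

One can verify directly that $W S_1 W^\top = I_p$ and $W S_2 W^\top = D$ hold for the returned W, since whitening with $S_1^{-1/2}$ followed by the rotation from the eigendecomposition of $S_1^{-1/2} S_2 S_1^{-1/2}$ yields exactly the two required identities.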

Principal component analysis

PCA is one of the most popular multivariate methods and fits into this framework with $S_1(x) = I_p$ and $S_2(x) = \mathrm{cov}(x)$, so that $W$ is orthogonal and simply diagonalizes the covariance matrix. The principal components in $z = Wx$ are then ordered according to their variances, that is, the diagonal elements of $D$. The number of components q to choose is often based on the cumulative proportion of variation or, visually, on finding the elbow in the scree-plot indicating that the variances of the remaining components are approximately equal. The idea of the scree-plot checking is formalized by assuming that, for some q, the principal values satisfy $d_1 \geq \cdots \geq d_q > d_{q+1} = \cdots = d_p$. The null hypothesis $H_0: d_{q+1} = \cdots = d_p$ can then be tested by assuming ellipticity and using the variance of the $p - q$ smallest estimated eigenvalues $\hat d_{q+1}, \ldots, \hat d_p$, say $s^2$, as a test statistic. The limiting distribution of $s^2$, properly scaled and when $n \to \infty$, is then a chi squared distribution with $(p-q)(p-q+1)/2 - 1$ degrees of freedom. For asymptotic and bootstrap tests and the estimates of q based on these tests, see Schott (2006); Nordhausen et al. (2017a).
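A minimal sketch of the scree-based selection, assuming a generic data matrix X (the function name pca_scree is ours):

    # Eigenvalues of cov and the smallest q reaching a cumulative
    # variation threshold (the informal "80% rule" used later in the paper).
    pca_scree <- function(X, threshold = 0.8) {
      d   <- eigen(cov(X), symmetric = TRUE)$values   # d_1 >= ... >= d_p
      cum <- cumsum(d) / sum(d)                       # cumulative proportions
      list(values = d, cumulative = cum, q = which(cum >= threshold)[1])
    }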

Independent component analysis

The independent component model is a semiparametric model with the assumption that $x = \Omega z + \mu$, where $\Omega$ is a full-rank $p \times p$ mixing matrix, $\mu$ specifies the location center of $x$, and $z$ is a standardized random p-vector of independent components so that $E(z) = 0$ and $\mathrm{cov}(z) = I_p$. The goal in ICA is to estimate an unmixing matrix $W$ (a version of $\Omega^{-1}$) to transform the data to the independent components (Nordhausen and Oja, 2018a). The matrix $W$ that jointly diagonalizes $S_1$ and $S_2$ also solves the ICA problem if both scatter matrices have the independence property. The eigenvalues in $D$ then provide componentwise kurtosis measures (Nordhausen et al., 2011). The most popular choice, providing the FOBI solution, is $S_1 = \mathrm{cov}$ and $S_2 = \mathrm{cov}_4$, with the componentwise Pearson's kurtosis measures as eigenvalues (see Cardoso (1989) and also Nordhausen and Virta (2019)). The q non-gaussian independent components can be estimated in this model if their kurtosis values are distinct. In signal processing, it is generally thought that these q non-gaussian components present the signal part and the $p - q$ gaussian components the noise part of the data. For the FOBI approach, the eigenvalues for the gaussian components are $d_i = p + 2$ and, to find the non-gaussian components, the strategy is to choose the components whose values differ most from $p + 2$. A natural test statistic for $H_0: q = k$ is then the sum of the $p - k$ smallest diagonal entries of $(\hat D - (p+2) I_p)^2$ and, if the eighth moments are finite, then the limiting null distribution of the properly scaled statistic as $n \to \infty$ is that of a linear combination of independent chi squared variables whose coefficients can be consistently estimated from the data. Bootstrap test versions are also easily available. Consistent estimates of q can be found using various sequential testing strategies. Further, the ICA assumptions can be relaxed by allowing the non-gaussian components to be dependent; this is known as non-gaussian component analysis (NGCA). For these and further results, see Nordhausen et al. (2017a) and Nordhausen et al. (2017b).
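Under these choices FOBI is just the ICS transformation from above applied to the pair (cov, cov4). A sketch, reusing the illustrative helpers cov4_hat and ics_transform and omitting the test scaling and null constants:

    # FOBI sketch: rank components by |d_i - (p + 2)|, the deviation from
    # the eigenvalue a gaussian component would have; most non-gaussian first.
    fobi_rank <- function(X) {
      p   <- ncol(X)
      fit <- ics_transform(X, S1 = cov(X), S2 = cov4_hat(X))
      ord <- order(abs(fit$D - (p + 2)), decreasing = TRUE)
      # unscaled statistic for H0: q = k: sum of the p - k smallest
      # squared deviations of the eigenvalues from p + 2
      stat <- function(k) sum(sort((fit$D - (p + 2))^2)[seq_len(p - k)])
      list(kurtosis = fit$D[ord], Z = fit$Z[, ord], stat = stat)
    }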

Sliced inverse regression

SIR is a supervised dimension reduction method originally suggested by Li (1991). It uses $S_1(x) = \mathrm{cov}(x)$ and $S_2(x, y) = \mathrm{cov}(E[x \mid y])$ as defined in equation (1). In practice, one discretizes y and uses $S_2(x, y_s)$, where $y_s = s \Leftrightarrow y \in I_s$, $s = 1, \ldots, S$, with disjoint intervals (slices) $I_1, \ldots, I_S$ which satisfy $I_1 \cup \cdots \cup I_S = \mathbb{R}$. Note that the rank of the sample version of $S_2$ is at most $S - 1$. We then assume again the location-scatter model $x = \Omega z + \mu$ where now the standardized $z = (z_1^\top, z_2^\top)^\top$ satisfies $E(z) = 0$ and $\mathrm{cov}(z) = I_p$, and y is independent of $z_2$. In the partitioning, $z_1$ is chosen to have the smallest possible dimension q. The subvectors $z_1$ and $z_2$ present the signal and noise parts of $z$, respectively. Under this model and using $S_1$ and $S_2$ as above, the diagonal entries of $D$ are $d_1 \geq \cdots \geq d_q > d_{q+1} = \cdots = d_p = 0$. It is further required that $S - 1 \geq q$ so that SIR is assumed to find the full signal. To test the hypothesis $H_0: q = k$, Nordhausen et al. (2017a) used the test statistic $T = \hat d_{k+1} + \cdots + \hat d_p$, which is simply the sum of the $p - k$ smallest diagonal entries of $\hat D$. Then, for the true value q, $nT$ has a limiting chi squared distribution with $(p - q)(S - q - 1)$ degrees of freedom as n goes to infinity. As in the PCA and FOBI cases, different sequential testing strategies can again be used to find a consistent estimate of q. Also, for small n, a bootstrap testing strategy may be used. For these and further results, see again Nordhausen et al. (2017a). Bura and Cook (2001) derived the limiting distribution of the same test statistic under weaker conditions.
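A base-R sketch of the supervised scatter and the resulting SICS transformation, reusing ics_transform from above; y_slice is the discretized response and sir_sketch is again an illustrative name:

    # SIR sketch: S2 = cov(E[x | slice]) estimated from the slice means,
    # then jointly diagonalized with cov(x).
    sir_sketch <- function(X, y_slice) {
      Xc <- sweep(X, 2, colMeans(X))
      # slice means m_s and slice proportions w_s; S2 = sum_s w_s m_s m_s'
      M  <- do.call(rbind, lapply(split(as.data.frame(Xc), y_slice), colMeans))
      w  <- as.numeric(table(y_slice)) / nrow(X)
      S2 <- crossprod(sqrt(w) * M)
      ics_transform(X, S1 = cov(X), S2 = S2)
    }

Since the grand mean is zero after centering, the weighted cross-product of the slice means is exactly the sample version of $\mathrm{cov}(E[x \mid y_s])$, whose rank is at most $S - 1$.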

Comparison to other dimension reduction methods

The linear dimension reduction methods considered here rely on linear transformations of the original data onto a lower dimensional subspace. In contrast, non-linear dimension reduction methods like t-SNE, Isomap and UMAP apply more drastic transformations to the data; these methods do not assume that the original data lie on a linear subspace. Besides linear and non-linear dimension reduction methods, feature selection methods and several classification methods, like e.g. the Random Forest, can be considered as dimension reduction methods. Also, even a simple linear regression can be used for dimension reduction.

The case of $n < p$

The use of the above methodology (PCA, FOBI, SIR) involves problems in the context of genetic data with $n < p$. Tyler (2010) showed that in this scenario, unfortunately, all scatter functionals with the affine equivariance property are proportional to the covariance matrix, which prohibits the subspace estimation. So to apply these methods, one should first reduce the dimension to some $k \leq n$ by using SVD, for example, and then continue working in the resulting subspace. The high-dimensionality problem is often approached in bioinformatics by performing PCA via the singular value decomposition (SVD). This is the quasi-standard first step in bioinformatics data analysis; see tools such as Genomatix, Geneious or CLCbio that "solve" the problem of choosing a working dimension k quite conveniently and simply by setting a fixed default value. Let $X$ be our centered $n \times p$ data matrix; then the singular value decomposition of $X$ with rank $r$ is defined as $X = U \Delta V^\top$, where $U$ is an $n \times r$ matrix and $V$ is a $p \times r$ matrix, both with orthonormal columns, and $\Delta$ is an $r \times r$ diagonal matrix with positive diagonal elements in decreasing order. The number of variables is then reduced via PCA to k with the transformation $X \mapsto X V_k$, where $V_k$ contains the first k columns of $V$. The problem then naturally is how to choose k without losing any (or too much) information.
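A minimal sketch of this preprocessing step in base R (svd_reduce is an illustrative name, not a library function):

    # Reduce an n x p (n << p) data matrix to k working dimensions via SVD;
    # the scores X V_k = U_k Delta_k are the first k principal components.
    svd_reduce <- function(X, k) {
      Xc <- sweep(X, 2, colMeans(X))               # center columns
      s  <- svd(Xc, nu = k, nv = k)
      list(scores = Xc %*% s$v,                    # n x k reduced data
           cum.var = cumsum(s$d^2) / sum(s$d^2))   # cumulative variation
    }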

A microRNA expression data set

Analysis of genetic data sets is one of the driving forces behind developing tools for the $n < p$ problems. For example, the current sample sizes in typical transcriptomic experiments range from just a few to a couple of hundred individuals due to the high costs of sequencing. Also, the storage of massive data is often almost prohibitive. It is therefore not likely that in the foreseeable future the case $n > p$ would be realistic for genetic data. The data set used in this work consists of human microRNA (miRNA) expressions from Agilent microarrays where $p = 2125$ different probes of 813 different miRNAs were used for $n = 189$ subjects coming from three different groups: 76 healthy individuals, 78 patients with a mild form of prostate cancer (PrCa) and 35 with an aggressive type of PrCa. In total, 26 microarrays were used, and the hybridization took place at four different time points. The data set was originally analyzed in Fischer et al., 2014, Fischer et al., 2015, and all relevant data are available from EMBL-EBI ArrayExpress (accession number E-MTAB-3397). Note that here $n = 189$ may be seen as untypically large compared to the relatively "small" dimension $p = 2125$. The goal with this data is to identify miRNAs which are either responsible for the development of PrCa (predisposition) or which could serve as biomarkers for the detection of PrCa (diagnosis) and, further, to distinguish the two different types of PrCa. Also, please note that the findings described here are not only valid for microarray data but also for any other kind of more recent data like, e.g., whole-transcriptome sequencing (WTS). In WTS the problem is even more prominent as, compared to microarray data, usually all possible genes and transcripts are quantified, so that p is of the order of magnitude of more than 20,000 while n is typically even smaller. Hence, dimension reduction methods are essential in WTS as well.

Application

In the analysis of the microRNA data, the problem was to identify those miRNAs which allow us to separate healthy subjects from PrCa cases and, if possible, even to distinguish between the two different types of cancer. The following analysis was done entirely with R (R Core Team, 2017) using the packages ICS (Nordhausen et al., 2008) and ICtest (Nordhausen et al., 2017c). As $n < p$, the SVD was performed first, with the results reported in Table 1 (the first 10 squared singular values, that is, the eigenvalues of cov up to a constant) and the corresponding scree-plot in Fig. 1. The scree-plot suggests that four components might be reasonable, as they already explain more than 80% of the variation of the data (a threshold that has also sometimes been used as a rule). Note that as many as 98 components are needed to explain 99.99% of the variation.
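Assuming the svd_reduce sketch from above and a (hypothetical) preprocessed 189 x 2125 expression matrix X, the two working dimensions used below could be picked along these lines:

    red <- svd_reduce(X, k = 98)
    which(red$cum.var >= 0.80)[1]     # smallest k explaining >= 80%   (here: 4)
    which(red$cum.var >= 0.9999)[1]   # k needed for 99.99%            (here: 98)
    pc4  <- red$scores[, 1:4]         # four-variate working data
    pc98 <- red$scores                # 98-variate working data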
Table 1

The first 10 squared singular values from SVD.

Rank   Squared singular value   Cumulative explained variation (%)
1      49584782.29              50.3
2      17020792.70              67.6
3      10365087.69              78.1
4      7799839.61               86.0
5      3178859.60               89.2
6      2036165.39               91.3
7      1750698.70               93.1
8      1386901.81               94.5
9      937702.68                95.5
10     846675.15                96.3
Figure 1

Scree-plot for the squared singular values.

ICA in the four-variate subspace

We first start with the first four principal components ($k = 4$) plotted in Fig. 2. The pairwise scatter plots for these four principal components reveal one group that is separate from the bulk of the data, but not the three groups with different cancer types as desired. This is because, in searching for subgroups of the data, the kurtosis measures are more natural than the variance as an information criterion. We therefore next try ICS for this four-variate data with two choices of the pairs of scatter matrices $(S_1, S_2)$, namely, (i) FOBI based on cov and cov4, and (ii) robust ICA based on the symmetrized versions of Tyler's and Huber's scatter matrices. Table 2 reports the estimated kurtosis measures, that is, the diagonal entries of $\hat D$.
Figure 2

Pairwise scatter plots for the first four principal components.

Table 2

Ordered kurtosis values from FOBI and robust ICA using the first four principal components.

Component   FOBI     robust ICA
1           9.7428   3.6838
2           7.2991   0.8069
3           5.3417   0.6101
4           6.2365   0.5514
The test results reported in Table 3 suggest that the number of non-gaussian components is $q = 1$. The four ICS components from FOBI and robust ICA are plotted in Figs. 3 and 4, respectively. In both cases, the first component seems to separate two groups which, unfortunately, are not at all connected to the three health status groups. After some detective work, it was found out that the group of 24 subjects, highlighted with red color in the plots, were outliers from three microarrays created by a single person at different time points. Also, the second components seem just to find another small group of outliers in the data. Hence, based on these results, it can be concluded that ICA cannot find the groups of interest if only the first four principal components are used in the analysis. We repeated ICA with 98 principal components as well, but even in this case could not find structures in the data that identify the health status of the subjects.
Table 3

The p-values for testing the hypotheses $H_0: q = k$, $k = 0, 1, 2$, where q is the number of non-gaussian components. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the FOBI approach for the four-variate data.

H0      Asymp     Boot
q = 0   <0.0001   0.0020
q = 1   0.1050    0.0818
q = 2   0.3963    0.4711
Figure 3

Independent components based on FOBI.

Figure 4

Independent components based on robust ICA.
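The tests reported in Table 3 were obtained with the ICtest package; a sketch of the calls on the four-variate scores pc4 might look as follows (argument names from our reading of the package documentation, which should be consulted for the exact signatures):

    library(ICtest)
    # Asymptotic and bootstrap FOBI tests for H0: q = k, k = 0, 1, 2,
    # on the four principal components pc4 (500 bootstrap resamples).
    for (k in 0:2) {
      print(FOBIasymp(pc4, k = k))
      print(FOBIboot(pc4, k = k, n.boot = 500))
    }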

SIR in the 98-variate subspace

We next perform SIR and use the data with the 98 principal components. As we hope to separate the three health groups, it is natural to let y indicate the group memberships. Note that, as $S = 3$, the dimension of the interesting subspace is $q = 1$ or 2. The p-values for the asymptotic and bootstrap tests for $H_0: q = 0$ and $H_0: q = 1$ are reported in Table 4. The tests reject both hypotheses and therefore suggest that $q = 2$. For visualization purposes we plot, in Fig. 5, the first five components and highlight the three response classes with different colors. We also applied SIR to the four-variate data, and the SICS components are plotted in Fig. 6. It is then clear that the information needed to separate the three health groups is lost if only the first four principal components are used.
Table 4

The p-values for testing H0:q = 0 and H0:q = 1. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the SIR approach for the 98-variate data.

H0      Asymp     Boot
q = 0   ≤0.0001   0.0448
q = 1   0.0001    0.0050
Figure 5

Pairwise scatter plots for SIR components obtained from the 98-variate data and with the colors corresponding to the three health groups.

Figure 6

Pairwise scatter plots for SIR components obtained from the four-variate data and with the colors corresponding to the three health groups.
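Analogously, the SIR tests of Table 4 come from ICtest; a sketch with the 98-variate scores pc98 and the health groups coded as a numeric vector groups (one slice per group, so h = 3) might be:

    library(ICtest)
    # Asymptotic and bootstrap SIR tests for H0: q = 0 and H0: q = 1 on the
    # 98-variate principal component scores (500 bootstrap resamples).
    for (k in 0:1) {
      print(SIRasymp(pc98, y = groups, k = k, h = 3))
      print(SIRboot(pc98, y = groups, k = k, h = 3, n.boot = 500))
    }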


Conclusions

In this paper, we discussed the use of two scatter matrices for unsupervised and supervised linear dimension reduction under broad model assumptions. The signal and noise spaces are then easily separated using the scatter matrices, and the eigenvalues of one scatter matrix with respect to the other are listed in $D$. The eigenvalues can also be used to decide what the dimension of the signal space is. The theory is however developed only for large-n-small-p cases, which rules out genetic applications. Following the general practice in bioinformatics, we tried to circumvent this problem by reducing the dimension using SVD (PCA) and hoping that the first few principal components capture all the relevant information. In our case study, we explored this strategy and used ICA and SIR for the four- and 98-variate data sets obtained in this way. ICA and SIR have been used to analyze bioinformatics data earlier also in Liebermeister (2002); Kong et al. (2008); Zhong et al. (2005); Fischer et al. (2017). Here we, however, wanted to explore the impact of the SVD step on the results, and the conclusion is that the number of principal components to retain is crucial. It seems advisable to keep the number of components as high as the further analysis tools allow without suffering from the curse of dimensionality. Based on this study, it would be worthwhile to develop models and techniques which allow for the SVD preprocessing step and can then formalize the rules for the number of PCs to retain. The SVD step could, for example, be incorporated into the bootstrapping procedure to accommodate the variation coming from that step. Note however that, for example, El Karoui and Purdom, 2018, El Karoui and Purdom, 2019 argue that bootstrapping is in general difficult in the $n < p$ setting.

Declarations

Author contribution statement

D. Fischer: Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. K. Nordhausen: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper. H. Oja: Conceived and designed the experiments; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

The work of KN was supported by the Austrian Science Fund (FWF) Grant number P31881-N32.

Data availability statement

Data associated with this study has been deposited at https://github.com/fischuu/LinearDimensionReductionInBioinformatics.

Declaration of interests statement

The authors declare no conflict of interest.

Additional information

No additional information is available for this paper.
References (6 in total)

1.  A global geometric framework for nonlinear dimensionality reduction.

Authors:  J B Tenenbaum; V de Silva; J C Langford
Journal:  Science       Date:  2000-12-22       Impact factor: 47.728

2.  Linear modes of gene expression determined by independent component analysis.

Authors:  Wolfram Liebermeister
Journal:  Bioinformatics       Date:  2002-01       Impact factor: 6.937

3.  RSIR: regularized sliced inverse regression for motif discovery.

Authors:  Wenxuan Zhong; Peng Zeng; Ping Ma; Jun S Liu; Yu Zhu
Journal:  Bioinformatics       Date:  2005-09-15       Impact factor: 6.937

4.  A review of independent component analysis application to microarray gene expression data.

Authors:  Wei Kong; Charles R Vanderburg; Hiromi Gunshin; Jack T Rogers; Xudong Huang
Journal:  Biotechniques       Date:  2008-11       Impact factor: 1.993

5.  MiRNA Profiles in Lymphoblastoid Cell Lines of Finnish Prostate Cancer Families.

Authors:  Daniel Fischer; Tiina Wahlfors; Henna Mattila; Hannu Oja; Teuvo L J Tammela; Johanna Schleutker
Journal:  PLoS One       Date:  2015-05-28       Impact factor: 3.240

6.  Subgroup detection in genotype data using invariant coordinate selection.

Authors:  Daniel Fischer; Mervi Honkatukia; Maria Tuiskula-Haavisto; Klaus Nordhausen; David Cavero; Rudolf Preisinger; Johanna Vilkki
Journal:  BMC Bioinformatics       Date:  2017-03-16       Impact factor: 3.169

