Daniel Fischer, Klaus Nordhausen, Hannu Oja.
Abstract
Dimension reduction is often a preliminary step in the analysis of data sets with a large number of variables. Most classical dimension reduction methods, both supervised and unsupervised, such as principal component analysis (PCA), independent component analysis (ICA) and sliced inverse regression (SIR), can be formulated using one, two or several different scatter matrix functionals. Scatter matrices can be seen as different measures of multivariate dispersion: they may highlight different features of the data, and comparing them can reveal interesting structures. Such an analysis searches for a projection onto an interesting (signal) part of the data, so it is also important to know the correct dimension of the signal subspace. These approaches usually make either no model assumptions or work in wide classes of semiparametric models. Theoretical results in the literature are, however, limited to the case where the sample size exceeds the number of variables, which is hardly ever true for data sets encountered in bioinformatics. In this paper, we briefly review the relevant literature and explore whether these dimension reduction tools can be used to find relevant and interesting subspaces for small-n-large-p data sets. We illustrate the methods with a microarray data set of prostate cancer patients and healthy controls.
Keywords: Bioinformatics; Computer science; Dimension reduction; Genomics; ICA; Mathematics; Microbial genomics; SIR; Statistics; Transcriptomics
Year: 2020 PMID: 33385080 PMCID: PMC7770551 DOI: 10.1016/j.heliyon.2020.e05732
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
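The abstract notes that methods such as ICA can be formulated by comparing two scatter matrix functionals. As one concrete illustration, FOBI (fourth-order blind identification) jointly diagonalizes the ordinary covariance matrix and a fourth-moment scatter matrix. The following is a minimal NumPy sketch of that idea; it is not the authors' implementation, only the textbook two-scatter construction.

```python
import numpy as np

def fobi(X):
    """Minimal sketch of FOBI: compare two scatter matrices.

    First scatter: the ordinary covariance. Second scatter: the
    fourth-moment scatter E[||z||^2 z z^T] of the whitened data.
    The independent components are recovered by an eigendecomposition
    of the second scatter after whitening with the first.
    """
    X = X - X.mean(axis=0)                    # center the columns
    cov = np.cov(X, rowvar=False)             # scatter 1: covariance
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(vals ** -0.5) @ vecs.T # symmetric inverse sqrt
    Z = X @ W                                 # whitened data
    r2 = np.sum(Z ** 2, axis=1)               # squared norms ||z||^2
    S4 = (Z * r2[:, None]).T @ Z / len(Z)     # scatter 2: 4th moments
    _, U = np.linalg.eigh(S4)
    return Z @ U                              # estimated components
```

Because the rotation `U` is orthogonal and the data are whitened first, the recovered components are uncorrelated with unit variances by construction; only the fourth-moment scatter distinguishes the non-Gaussian directions.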
The first 10 squared singular values from the SVD.
| Component | Squared singular value | Cumulative explained variation (%) |
|---|---|---|
| 1 | 49584782.29 | 50.3% |
| 2 | 17020792.70 | 67.6% |
| 3 | 10365087.69 | 78.1% |
| 4 | 7799839.61 | 86.0% |
| 5 | 3178859.60 | 89.2% |
| 6 | 2036165.39 | 91.3% |
| 7 | 1750698.70 | 93.1% |
| 8 | 1386901.81 | 94.5% |
| 9 | 937702.68 | 95.5% |
| 10 | 846675.15 | 96.3% |
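Cumulative percentages like those in the table above are obtained directly from the squared singular values of the column-centered data matrix, which are proportional to the PCA eigenvalues. A minimal sketch (on generic data, not the prostate cancer set):

```python
import numpy as np

def explained_variation(X, k=10):
    """Cumulative explained variation (%) from squared singular values.

    The columns of X are centered first; the squared singular values
    of the centered matrix are proportional to the PCA eigenvalues,
    so their normalized cumulative sums give the explained variation.
    """
    Xc = X - X.mean(axis=0)                       # center columns
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    d2 = s ** 2                                   # squared values
    return np.cumsum(d2[:k]) / d2.sum() * 100.0   # cumulative %
```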
Figure 1. Scree plot for the squared singular values.
Figure 2. Pairwise scatter plots for the first four principal components.
Ordered kurtosis values from FOBI and robust ICA using the first four principal components.
| Component | FOBI | robust ICA |
|---|---|---|
| 1 | 9.7428 | 3.6838 |
| 2 | 7.2991 | 0.8069 |
| 3 | 5.3417 | 0.6101 |
| 4 | 6.2365 | 0.5514 |
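Kurtosis is a standard way to rank estimated components by non-Gaussianity, since a Gaussian component has excess kurtosis zero. The sketch below computes per-column excess kurtosis and orders components by its absolute value; the exact ordering statistic used in the paper may differ, so treat this as one common choice rather than the authors' definition.

```python
import numpy as np

def kurtosis_order(Z):
    """Order components by |excess kurtosis| (largest first).

    Excess kurtosis is m4 / m2^2 - 3 per column; it is 0 for a
    Gaussian, negative for light-tailed and positive for heavy-tailed
    components, so components far from 0 in either direction are the
    most non-Gaussian.
    """
    Zc = Z - Z.mean(axis=0)
    m2 = (Zc ** 2).mean(axis=0)       # second central moments
    m4 = (Zc ** 4).mean(axis=0)       # fourth central moments
    k = m4 / m2 ** 2 - 3.0            # excess kurtosis per column
    order = np.argsort(-np.abs(k))    # most non-Gaussian first
    return k, order
```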
The p-values for testing the hypotheses H0: q = k, k = 0, 1, 2, where q is the number of non-Gaussian components. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the FOBI approach for the four-variate data.
| k | Asymptotic | Bootstrap |
|---|---|---|
| 0 | <0.0001 | 0.0020 |
| 1 | 0.1050 | 0.0818 |
| 2 | 0.3963 | 0.4711 |
Figure 3. Independent components based on FOBI.
Figure 4. Independent components based on robust ICA.
The p-values for testing H0: q = 0 and H0: q = 1. The tests are asymptotic and bootstrap tests (with 500 bootstrap samples) using the SIR approach for the 98-variate data.
| k | Asymptotic | Bootstrap |
|---|---|---|
| 0 | ≤0.0001 | 0.0448 |
| 1 | 0.0001 | 0.0050 |
Figure 5. Pairwise scatter plots for SIR components obtained from the 98-variate data, with colors corresponding to the three health groups.
Figure 6. Pairwise scatter plots for SIR components obtained from the four-variate data, with colors corresponding to the three health groups.