Haileab Hilafu, Sandra E Safo.
Abstract
BACKGROUND: Dimension reduction and variable selection play a critical role in the analysis of contemporary high-dimensional data. The semi-parametric multi-index model often serves as a reasonable model for analysis of such high-dimensional data. The sliced inverse regression (SIR) method, which can be formulated as a generalized eigenvalue decomposition problem, offers a model-free estimation approach for the indices in the semi-parametric multi-index model. Obtaining sparse estimates of the eigenvectors that constitute the basis matrix that is used to construct the indices is desirable to facilitate variable selection, which in turn facilitates interpretability and model parsimony.
Keywords: Generalized eigenvalue decomposition; High-dimensional data; Linear discriminant analysis; Semiparametric model; Sliced inverse regression
Year: 2022 PMID: 35525975 PMCID: PMC9080177 DOI: 10.1186/s12859-022-04700-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
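The abstract notes that SIR can be written as a generalized eigenvalue problem. As a rough illustration only, the sketch below implements classical, non-sparse SIR for the low-dimensional case (more observations than variables); it is not the sparse estimator proposed in the paper. The function name `sir_directions` and its parameters are hypothetical, and the slicing scheme (equal-sized slices of the sorted response) is an assumption.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Classical SIR sketch: solve M v = lambda * Sigma v (assumes n > p)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                  # center the predictors
    Sigma = Xc.T @ Xc / n                    # sample covariance of X
    # Slice the observations by the sorted response (assumed slicing scheme).
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Xc[idx].mean(axis=0)             # within-slice mean of centered X
        M += (len(idx) / n) * np.outer(m, m) # weighted covariance of slice means
    # Reduce the generalized problem to a standard symmetric one with the
    # Cholesky factor Sigma = L L^T: C = L^{-1} M L^{-T}.
    L = np.linalg.cholesky(Sigma)
    A = np.linalg.solve(L, M)
    C = np.linalg.solve(L, A.T)
    C = (C + C.T) / 2                        # remove round-off asymmetry
    evals, U = np.linalg.eigh(C)
    top = np.argsort(evals)[::-1][:n_dirs]   # largest eigenvalues first
    V = np.linalg.solve(L.T, U[:, top])      # map back: v = L^{-T} u
    return V, evals[top]
```

The Cholesky step requires a positive definite sample covariance, which fails when the number of variables exceeds the sample size; that high-dimensional case is where a sparse formulation becomes necessary. With SciPy available, `scipy.linalg.eigh(M, Sigma)` solves the same generalized problem directly.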
Simulation results for Models (7)–(9)
| Method | Metric | Model (7) | Model (8) | Model (9) | Model (7) | Model (8) | Model (9) |
|---|---|---|---|---|---|---|---|
| Proposed method | TPR | 100 (2.3) | 99 (0.1) | 84 (1.8) | 100 (0.0) | 100 (0.0) | 91 (1.4) |
| | FPR | 6.8 (0.5) | 11 (0.1) | 7.5 (8.0) | 0.0 (0.0) | 0.0 (0.0) | 1.0 (0.1) |
| | Corr | 98 (2.3) | 94 (0.1) | 72 (0.2) | 99 (0.0) | 99 (1.0) | 97 (2.0) |
| [ | TPR | 96 (1.0) | 94 (1.2) | 91 (1.1) | 98 (0.5) | 99 (0.5) | 99 (2.5) |
| | FPR | 6.0 (0.9) | 3.6 (0.7) | 7.4 (0.1) | 3.4 (0.4) | 1.1 (0.2) | 2.5 (0.3) |
| | Corr | 88 (0.9) | 86 (1.1) | 74 (1.1) | 91 (0.5) | 92 (0.5) | 79 (0.6) |
| [ | TPR | 95 (0.9) | 100 (0.0) | 100 (0.6) | 100 (0.0) | 100 (0.0) | 100 (0.0) |
| | FPR | 4.9 (0.1) | 4.8 (0.1) | 3.5 (0.1) | 5.9 (0.2) | 6.7 (0.3) | 4.5 (0.2) |
| | Corr | 59 (1.1) | 88 (0.5) | 79 (0.6) | 79 (0.6) | 94 (0.2) | 87 (0.5) |
| [ | TPR | 98 (0.1) | 98 (0.1) | 98 (0.1) | 99 (0.1) | 99 (0.1) | 98 (0.1) |
| | FPR | 8.3 (1.2) | 3.8 (0.8) | 23 (1.1) | 1.2 (0.4) | 0.3 (0.2) | 20 (1.1) |
| | Corr | 84 (0.9) | 89 (0.6) | 63 (0.7) | 94 (0.4) | 96 (0.3) | 70 (0.5) |
| [ | TPR | 89 (1.5) | 94 (1.2) | 80 (1.2) | 98 (1.0) | 99 (0.7) | 96 (0.6) |
| | FPR | 0.6 (0.1) | 0.6 (0.1) | 0.2 (0.1) | 0.3 (0.1) | 0.3 (0.1) | 0.1 (0.1) |
| | Corr | 82 (1.4) | 85 (1.3) | 70 (1.1) | 91 (1.1) | 93 (1.0) | 84 (0.7) |
Corr is the correlation coefficient between the true and estimated sufficient predictors; TPR is the true positive rate; FPR is the false positive rate. Means (standard errors) over 200 independent replications are reported. All entries are multiplied by 100.
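The three metrics in this note can be computed as in the following minimal sketch for a single index vector. The function names are hypothetical. Treating a coefficient as "selected" when its absolute value exceeds `tol`, and taking the absolute value of the correlation (because the sign of an estimated direction is arbitrary), are assumptions rather than details confirmed by the source.

```python
import numpy as np

def selection_rates(beta_true, beta_hat, tol=0.0):
    """TPR and FPR of variable selection for one coefficient vector.

    Assumes at least one truly active and one truly inactive variable.
    """
    true_active = np.abs(beta_true) > 0
    est_active = np.abs(beta_hat) > tol      # assumed selection rule
    tpr = (true_active & est_active).sum() / true_active.sum()
    fpr = (~true_active & est_active).sum() / (~true_active).sum()
    return tpr, fpr

def predictor_correlation(X, beta_true, beta_hat):
    """Absolute correlation between true and estimated sufficient predictors."""
    return abs(np.corrcoef(X @ beta_true, X @ beta_hat)[0, 1])

# Toy example: one of two active variables found, one of three inactive
# variables wrongly selected.
beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
beta_hat = np.array([0.9, 0.0, 0.1, 0.0, 0.0])
tpr, fpr = selection_rates(beta_true, beta_hat)
```

Multiplying the returned values by 100 puts them on the scale used in the table.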
Simulation results for model 1
| ( | | CISESIR | CISELDA | MGSDA | PLDA | MSDA |
|---|---|---|---|---|---|---|
| (0.5, 50) | | 0.761 | 0.821 | 1.346 | 0.251 | 1.328 |
| | MSR | 0.128 | 0.127 | 0.137 | 0.125 | 0.137 |
| | TPR | 0.949 | 0.760 | 0.775 | 1.000 | 0.860 |
| | FPR | 0.100 | 0.227 | 0.101 | 0.225 | 0.264 |
| (0.5, 500) | | 0.797 | 0.888 | 1.374 | 0.455 | 1.315 |
| | MSR | 0.132 | 0.128 | 0.139 | 0.125 | 0.138 |
| | TPR | 0.897 | 0.985 | 0.733 | 1.000 | 0.810 |
| | FPR | 0.052 | 0.076 | 0.010 | 0.011 | 0.011 |
| (0.5, 1000) | | 0.632 | 0.604 | 1.384 | 0.406 | 1.307 |
| | MSR | 0.129 | 0.125 | 0.139 | 0.129 | 0.135 |
| | TPR | 0.932 | 0.999 | 0.739 | 1.000 | 0.794 |
| | FPR | 0.026 | 0.066 | 0.007 | 0.176 | 0.005 |
| (0.9, 50) | | 1.070 | 1.031 | 1.714 | 0.140 | 1.672 |
| | MSR | 0.209 | 0.207 | 0.213 | 0.206 | 0.215 |
| | TPR | 0.835 | 0.925 | 0.368 | 1.000 | 0.481 |
| | FPR | 0.140 | 0.214 | 0.037 | 0.254 | 0.164 |
| (0.9, 500) | | 0.925 | 1.086 | 1.730 | 0.409 | 1.703 |
| | MSR | 0.216 | 0.215 | 0.217 | 0.209 | 0.213 |
| | TPR | 0.828 | 0.998 | 0.376 | 1.000 | 0.399 |
| | FPR | 0.067 | 0.160 | 0.015 | 0.401 | 0.007 |
| (0.9, 1000) | | 0.588 | 0.663 | 1.047 | 0.289 | 1.680 |
| | MSR | 0.210 | 0.209 | 0.214 | 0.206 | 0.213 |
| | TPR | 0.942 | 1.000 | 0.393 | 1.000 | 0.427 |
| | FPR | 0.056 | 0.072 | 0.004 | 0.153 | 0.005 |
The unlabeled first row for each setting reports the measure defined in (10); TPR is the true positive rate; FPR is the false positive rate; MSR is the misclassification rate over a test set of 900 observations. As before, TPR and FPR refer to variable selection. The reported numbers are averages over 50 repetitions.
Simulation results for model 2
| | | CISESIR | CISELDA | MGSDA | PLDA | MSDA |
|---|---|---|---|---|---|---|
| 50 | | 0.632 | 1.032 | 1.701 | 0.142 | 1.108 |
| | MSR | 0.101 | 0.048 | 0.122 | 0.037 | 0.104 |
| | TPR | 0.972 | 0.625 | 0.252 | 0.692 | 0.956 |
| | FPR | 0.075 | 0.200 | 0.072 | 0.231 | 0.248 |
| 500 | | 0.869 | 1.114 | 1.726 | 0.396 | 1.112 |
| | MSR | 0.111 | 0.053 | 0.123 | 0.040 | 0.104 |
| | TPR | 0.844 | 0.675 | 0.240 | 0.7612 | 0.916 |
| | FPR | 0.020 | 0.195 | 0.009 | 0.3824 | 0.021 |
| 1000 | | 0.786 | 0.711 | 1.708 | 0.366 | 1.100 |
| | MSR | 0.114 | 0.040 | 0.121 | 0.037 | 0.102 |
| | TPR | 0.851 | 0.644 | 0.241 | 0.694 | 0.922 |
| | FPR | 0.017 | 0.089 | 0.004 | 0.227 | 0.010 |
The unlabeled first row for each setting reports the measure defined in (10); TPR is the true positive rate; FPR is the false positive rate; MSR is the misclassification rate over a test set of 900 observations. As before, TPR and FPR refer to variable selection. The reported numbers are averages over 50 repetitions.
Simulation results for model 3
| ( | | CISESIR | CISELDA | MGSDA | PLDA | MSDA |
|---|---|---|---|---|---|---|
| (0.5, 50) | | 0.590 | 0.489 | 0.690 | 0.215 | 0.677 |
| | MSR | 0.208 | 0.086 | 0.111 | 0.080 | 0.100 |
| | TPR | 0.960 | 0.994 | 0.956 | 0.998 | 0.950 |
| | FPR | 0.357 | 0.102 | 0.090 | 0.204 | 0.118 |
| (0.5, 500) | | 0.438 | 0.536 | 0.746 | 0.241 | 0.683 |
| | MSR | 0.091 | 0.092 | 0.122 | 0.085 | 0.100 |
| | TPR | 0.978 | 0.990 | 0.924 | 1.000 | 0.942 |
| | FPR | 0.030 | 0.017 | 0.013 | 0.076 | 0.011 |
| (0.5, 1000) | | 0.322 | 0.302 | 0.751 | 0.218 | 0.749 |
| | MSR | 0.080 | 0.081 | 0.119 | 0.076 | 0.096 |
| | TPR | 0.994 | 1.000 | 0.922 | 1.000 | 0.900 |
| | FPR | 0.019 | 0.014 | 0.010 | 0.023 | 0.008 |
| (0.9, 50) | | 0.514 | 0.602 | 1.050 | 0.126 | 0.833 |
| | MSR | 0.070 | 0.069 | 0.097 | 0.067 | 0.077 |
| | TPR | 0.920 | 0.970 | 0.660 | 1.000 | 0.874 |
| | FPR | 0.292 | 0.082 | 0.077 | 0.335 | 0.059 |
| (0.9, 500) | | 0.273 | 0.390 | 1.048 | 0.229 | 0.814 |
| | MSR | 0.064 | 0.069 | 0.095 | 0.067 | 0.073 |
| | TPR | 0.988 | 1.000 | 0.638 | 1.000 | 0.868 |
| | FPR | 0.021 | 0.033 | 0.009 | 0.291 | 0.004 |
| (0.9, 1000) | | 0.492 | 0.228 | 1.047 | 0.185 | 0.772 |
| | MSR | 0.068 | 0.065 | 0.096 | 0.064 | 0.066 |
| | TPR | 1.000 | 1.000 | 0.652 | 1.000 | 0.850 |
| | FPR | 0.033 | 0.017 | 0.003 | 0.131 | 0.006 |
The unlabeled first row for each setting reports the measure defined in (10); TPR is the true positive rate; FPR is the false positive rate; MSR is the misclassification rate over a test set of 900 observations. As before, TPR and FPR refer to variable selection. The reported numbers are averages over 50 repetitions.
Average misclassification rates and number of variables selected for the depression study
| Method | Mean test error | Mean sensitivity (%) | Mean specificity (%) | Selected metabolites (SE) |
|---|---|---|---|---|
| CISESIR | 0.1955 (0.0150) | 84.62 (2.63) | 72.27 (3.71) | 40.540 (2.634) |
| CISELDA | 0.1659 (0.0095) | 80.37 (2.32) | 83.20 (2.32) | 26.280 (1.828) |
| MGSDA | 0.1639 (0.0093) | 84.62 (1.36) | 81.47 (2.02) | 16.129 (0.772) |
| PLDA | 0.2484 (0.0093) | 89.62 (0.97) | 59.73 (2.36) | 78.200 (2.797) |
| MSDA | 0.1458 (0.0068) | 84.75 (1.20) | 86.13 (1.50) | 58.480 (3.770) |
Averages are over 50 repetitions of randomly splitting the data into training (63 observations) and testing (31 observations). Reported average error rates are obtained from the test sets
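The repeated random-split evaluation described in this note can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the nearest-centroid classifier is a hypothetical stand-in for the compared methods, the function names are invented for the example, and computing the standard error as the standard deviation across repetitions divided by the square root of the number of repetitions is an assumption. Labels are assumed to be coded 0/1, with both classes present in every training and test split.

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier: assign each test point to the closer class mean."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = ((X_test - c0) ** 2).sum(axis=1)
    d1 = ((X_test - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def repeated_split_summary(X, y, n_train, n_reps=50, seed=0):
    """Mean and SE of test error, sensitivity, specificity over random splits."""
    rng = np.random.default_rng(seed)
    rows = []
    for _ in range(n_reps):
        perm = rng.permutation(len(y))
        tr, te = perm[:n_train], perm[n_train:]   # e.g. 63 train, 31 test
        pred = nearest_centroid_predict(X[tr], y[tr], X[te])
        pos, neg = y[te] == 1, y[te] == 0
        rows.append([
            np.mean(pred != y[te]),     # misclassification rate
            np.mean(pred[pos] == 1),    # sensitivity
            np.mean(pred[neg] == 0),    # specificity
        ])
    rows = np.array(rows)
    mean = rows.mean(axis=0)
    se = rows.std(axis=0, ddof=1) / np.sqrt(n_reps)  # assumed SE definition
    return mean, se
```

For a data set of 94 observations, calling `repeated_split_summary(X, y, n_train=63)` reproduces the 63/31 split sizes and the 50 repetitions mentioned in the note.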
Fig. 1 Left panel: ROC curves for the depression study data. The average AUCs for CISELDA, CISESIR, MGSDA, MSDA, and PLDA are 0.89, 0.94, 0.91, 0.93, and 0.92, respectively. Right panel: Bar graphs of the log2 intensities for metabolites identified by CISELDA
Average misclassification rates and number of variables selected for the RNA-seq study
| Method | Mean test error | |||
|---|---|---|---|---|
| CISESIR | 0.005 (0.0218) | 352.26 | 352.26 | 352.26 |
| CISELDA | 0.002 (0.0147) | 297.63 | 297.63 | 297.63 |
| MGSDA | 0.007 (0.0024) | 4.30 | 4.30 | 4.30 |
| PLDA | 0.058 (0.0094) | 6774.9 | 5225.3 | 5476.8 |
| MSDA | – | – | ||
Averages are over 50 repetitions of randomly splitting the data into training (99 observations) and testing (48 observations). Reported average error rates are obtained from the test sets
Fig. 2 RNA-seq test set projected onto estimated discriminant vectors. Top-right panel: CISESIR; Top-left panel: CISELDA; Bottom-right panel: MGSDA; Bottom-left panel: PLDA
Fig. 3 Riboflavin production data: the left panel is the sufficient summary plot versus for the 50 training samples; the right panel is the sufficient summary plot versus of the 21 test samples