Xinlei Mi, Baiming Zou, Fei Zou, Jianhua Hu.
Abstract
The study of human disease remains challenging because of convoluted disease etiologies and complex molecular mechanisms at the genetic, genomic, and proteomic levels. Many machine learning-based methods have been developed and widely used to alleviate some of the analytic challenges in complex human disease studies. While these model frameworks offer modeling flexibility and robustness, their sophisticated algorithms make them non-transparent and make individual features difficult to interpret. However, identifying important biomarkers is critical to helping researchers establish novel hypotheses regarding the prevention, diagnosis, and treatment of complex human diseases. Herein, we propose a Permutation-based Feature Importance Test (PermFIT) for estimating and testing feature importance, and for assisting the interpretation of individual features in complex frameworks, including deep neural networks, random forests, and support vector machines. PermFIT (available at https://github.com/SkadiEye/deepTL) is implemented in a computationally efficient manner, without model refitting. We conduct extensive numerical studies under various scenarios and show that PermFIT not only yields valid statistical inference but also improves the prediction accuracy of machine learning models. Applied to the Cancer Genome Atlas kidney tumor data and the HITChip atlas data, PermFIT demonstrates its practical use in identifying important biomarkers and boosting model prediction performance.
Year: 2021 PMID: 34021151 PMCID: PMC8140109 DOI: 10.1038/s41467-021-22756-2
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
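The abstract's central idea, scoring a feature by how much prediction error grows when that feature is permuted while the fitted model is left untouched, can be illustrated with a minimal sketch. This is not the PermFIT implementation from the deepTL package: the least-squares model, the single train/validation split, the function names, and the simulated data are all assumptions made for illustration. The one-sided Z test on the per-sample loss increase follows the testing approach named in the table notes.

```python
import math
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with intercept; returns a predict(X) function.
    (Stand-in model for illustration; any fitted predictor would do.)"""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef


def permutation_importance_test(predict, X_val, y_val, rng):
    """Importance of each feature as the mean increase in squared error on
    held-out data after permuting that feature. The model is never refit."""
    n, p = X_val.shape
    base_err = (y_val - predict(X_val)) ** 2
    importance = np.empty(p)
    p_value = np.empty(p)
    for j in range(p):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_val[:, j])
        # Per-sample increase in squared error caused by the permutation.
        diff = (y_val - predict(X_perm)) ** 2 - base_err
        importance[j] = diff.mean()
        se = diff.std(ddof=1) / math.sqrt(n)
        z = importance[j] / se if se > 0 else 0.0
        # One-sided Z test: upper-tail probability of a standard normal.
        p_value[j] = 0.5 * math.erfc(z / math.sqrt(2))
    return importance, p_value


# Hypothetical data: columns 0 and 1 drive the outcome, columns 2-4 are null.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=600)
predict = fit_linear(X[:400], y[:400])
importance, p_value = permutation_importance_test(predict, X[400:], y[400:], rng)
```

Because only predictions on permuted copies of the validation data are needed, the cost per feature is one extra prediction pass, with no retraining.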
Fig. 1 Simulation results on continuous outcomes.
a Estimated feature importance for the five true causal features: X1, , , , , and two null feature sets: S0 and S1. b Mean squared prediction error (MSPE) for the compared methods. DNN, RF, or SVM: the specified model fitted with all features; PermFIT-DNN, SHAP-DNN, LIME-DNN, HRT-DNN, SNGM-DNN, PermFIT-RF, Vanilla-RF, PermFIT-SVM, or RFE-SVM: the specified model fitted after feature selection. Data are presented as mean values ± s.d. Simulations in each scenario are repeated 100 times. Source data are provided as a Source Data file.
Simulation results on continuous outcomes.
| | Variable | PermFIT-DNN | HRT-DNN | PermFIT-RF | Vanilla-RF | PermFIT-SVM | SHAP-DNNᵃ | LIME-DNNᵃ | SNGM-DNNᵃ | RFE-SVMᵃ | PermFIT-DNN | HRT-DNN | PermFIT-RF | Vanilla-RF | PermFIT-SVM | SHAP-DNNᵃ | LIME-DNNᵃ | SNGM-DNNᵃ | RFE-SVMᵃ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | | 100 | 100 | 100 | 100 | 94 | 100 | 17 | 28 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 13 | 25 | 100 |
| | | 100 | 100 | 100 | 100 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 79 | 100 | 100 |
| | | 100 | 100 | 26 | 64 | 96 | 100 | 9 | 38 | 41 | 100 | 100 | 98 | 100 | 100 | 100 | 19 | 35 | 42 |
| | | 100 | 100 | 39 | 71 | 91 | 100 | 12 | 40 | 41 | 100 | 100 | 98 | 100 | 100 | 100 | 17 | 31 | 36 |
| | S0 | 5.4 | 7.5 | 5.3 | 9.4 | 6.1 | 5.1 | 10.2 | 7.3 | 6.5 | 5.0 | 4.9 | 4.9 | 9.3 | 5.0 | 5.7 | 11.5 | 8.1 | 7.2 |
| | S1 | 5.5 | 8.0 | 5.1 | 8.9 | 5.7 | 5.4 | 6.1 | 7.3 | 6.5 | 4.6 | 4.9 | 5.0 | 9.0 | 5.0 | 4.9 | 5.1 | 6.9 | 6.0 |
| 0.2 | X1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | | 100 | 100 | 100 | 100 | 100 | 100 | 19 | 30 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 25 | 22 | 100 |
| | | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 85 | 100 | 100 |
| | | 100 | 100 | 36 | 68 | 98 | 100 | 13 | 35 | 27 | 100 | 100 | 94 | 100 | 100 | 100 | 9 | 28 | 0 |
| | | 100 | 100 | 35 | 69 | 98 | 100 | 15 | 34 | 22 | 100 | 100 | 93 | 100 | 100 | 100 | 11 | 35 | 0 |
| | S0 | 6.1 | 7.1 | 6.7 | 17.5 | 7.9 | 5.1 | 9.9 | 7.6 | 13.2 | 5.7 | 5.2 | 7.5 | 29.0 | 12.9 | 5.3 | 10.9 | 7.8 | 15.6 |
| | S1 | 5.5 | 8.0 | 5.1 | 10.6 | 5.2 | 5.4 | 6.2 | 7.2 | 1.2 | 4.9 | 5.2 | 4.8 | 11.2 | 4.7 | 5.3 | 5.6 | 7.3 | 0.0 |
| 0.5 | X1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | | 100 | 100 | 100 | 100 | 100 | 100 | 13 | 26 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 17 | 26 | 100 |
| | | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 83 | 100 | 100 |
| | | 100 | 100 | 35 | 69 | 96 | 100 | 12 | 27 | 0 | 100 | 100 | 96 | 100 | 100 | 100 | 10 | 42 | 0 |
| | | 100 | 100 | 28 | 68 | 99 | 100 | 11 | 32 | 0 | 100 | 100 | 98 | 100 | 100 | 100 | 10 | 31 | 0 |
| | S0 | 9.3 | 7.5 | 14.1 | 59.3 | 15.9 | 5.5 | 10.8 | 8.0 | 15.6 | 9.6 | 4.9 | 33.8 | 87.8 | 35.2 | 5.6 | 11.5 | 7.5 | 15.6 |
| | S1 | 6.7 | 7.2 | 4.6 | 16.9 | 6.0 | 5.1 | 5.6 | 7.1 | 0.0 | 7.0 | 5.1 | 4.2 | 39.1 | 11.9 | 5.0 | 5.2 | 7.2 | 0.0 |
| 0.8 | X1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | | 100 | 100 | 100 | 100 | 100 | 100 | 16 | 33 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 22 | 28 | 100 |
| | | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | 100 |
| | | 100 | 100 | 54 | 90 | 96 | 99 | 6 | 24 | 0 | 100 | 100 | 99 | 100 | 100 | 100 | 10 | 30 | 0 |
| | | 100 | 100 | 49 | 88 | 98 | 100 | 10 | 18 | 0 | 100 | 100 | 99 | 100 | 100 | 100 | 14 | 39 | 0 |
| | S0 | 18.8 | 7.2 | 31.6 | 88.3 | 27.7 | 6.2 | 10.7 | 9.0 | 15.6 | 20.7 | 5.7 | 75.3 | 100.0 | 68.2 | 6.6 | 11.9 | 8.4 | 15.6 |
| | S1 | 9.7 | 7.0 | 4.0 | 37.3 | 10.5 | 4.5 | 5.7 | 6.4 | 0.0 | 9.2 | 5.2 | 3.3 | 93.8 | 34.8 | 4.0 | 4.4 | 6.5 | 0.0 |
Reported is the percentage of the important variables detected by each method (p-value cutoff of 0.05), out of 100 repetitions for each simulation scenario, for five true causal features: X1, , , , , and two null feature sets: S0 and S1.
ᵃ Note that SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not perform formal statistical testing, so features can only be ranked, with no associated p values. As a simple illustration, the reported results for each of these four methods are based on the top 10 selected features. For PermFIT methods and Vanilla-RF, p values are calculated from a one-sided Z test.
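The table entries are detection percentages over repeated simulations, computed in one of two ways depending on whether a method yields p values. The sketch below shows both calculations under stated assumptions: the function names and the small demonstration arrays are hypothetical, and only the 0.05 cutoff and the top-10 rule come from the table notes.

```python
import numpy as np


def detection_rate_from_pvalues(p_values, alpha=0.05):
    """Percentage of repetitions in which each feature has p < alpha.
    p_values has shape (n_repetitions, n_features)."""
    return 100.0 * (np.asarray(p_values) < alpha).mean(axis=0)


def detection_rate_from_ranking(scores, top_k=10):
    """Percentage of repetitions in which each feature ranks among the
    top_k importance scores (for methods that give no p value).
    scores has shape (n_repetitions, n_features)."""
    scores = np.asarray(scores)
    # Double argsort converts scores to ranks; rank 0 is the largest score.
    ranks = (-scores).argsort(axis=1).argsort(axis=1)
    return 100.0 * (ranks < top_k).mean(axis=0)


# Hypothetical toy inputs: 4 repetitions x 2 features, and 2 repetitions x 3 features.
p_demo = np.array([[0.01, 0.20], [0.03, 0.04], [0.60, 0.001], [0.02, 0.50]])
score_demo = np.array([[3.0, 1.0, 2.0], [1.0, 3.0, 2.0]])
rate_p = detection_rate_from_pvalues(p_demo)
rate_rank = detection_rate_from_ranking(score_demo, top_k=1)
```

For a null feature, the p value-based rate estimates the type I error and should sit near 5% for a valid test. The top-10 rate has no such calibration, so the two kinds of column are not directly comparable.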
Simulation results on continuous outcomes with smaller sample size and/or larger dimension.
| | Variable | PermFIT-DNN | HRT-DNN | PermFIT-RF | Vanilla-RF | PermFIT-SVM | SHAP-DNNᵃ | LIME-DNNᵃ | SNGM-DNNᵃ | RFE-SVMᵃ | PermFIT-DNN | HRT-DNN | PermFIT-RF | Vanilla-RF | PermFIT-SVM | SHAP-DNNᵃ | LIME-DNNᵃ | SNGM-DNNᵃ | RFE-SVMᵃ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | X1 | 98 | 99 | 99 | 100 | 98 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | | 86 | 91 | 97 | 100 | 22 | 86 | 26 | 27 | 100 | 100 | 100 | 100 | 100 | 15 | 100 | 20 | 34 | 100 |
| | | 96 | 98 | 100 | 100 | 92 | 100 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | 95 | 100 | 100 |
| | | 54 | 71 | 9 | 19 | 14 | 19 | 14 | 18 | 30 | 83 | 87 | 15 | 27 | 10 | 42 | 15 | 20 | 26 |
| | | 46 | 61 | 9 | 26 | 18 | 34 | 16 | 18 | 36 | 89 | 96 | 14 | 13 | 10 | 29 | 8 | 13 | 27 |
| | S0 | 5.0 | 16.5 | 5.3 | 9.1 | 4.7 | 7.4 | 9.4 | 8.1 | 6.9 | 5.5 | 22.6 | 5.3 | 7.4 | 5.5 | 3.0 | 5.4 | 3.5 | 3.1 |
| | S1 | 5.5 | 16.7 | 5.4 | 9.2 | 5.8 | 6.5 | 6.4 | 7.5 | 6.4 | 5.6 | 21.5 | 5.1 | 7.0 | 5.5 | 3.4 | 2.5 | 4.0 | 3.5 |
| 0.2 | X1 | 97 | 100 | 98 | 100 | 95 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| | | 89 | 93 | 100 | 100 | 37 | 83 | 20 | 22 | 100 | 100 | 100 | 100 | 100 | 29 | 100 | 13 | 31 | 100 |
| | | 94 | 97 | 98 | 100 | 84 | 100 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 98 | 100 | 94 | 100 | 100 |
| | | 56 | 70 | 10 | 24 | 24 | 27 | 21 | 22 | 32 | 93 | 96 | 5 | 19 | 13 | 44 | 10 | 16 | 17 |
| | | 53 | 69 | 11 | 25 | 21 | 31 | 16 | 18 | 26 | 98 | 95 | 13 | 25 | 15 | 32 | 3 | 13 | 18 |
| | S0 | 5.7 | 16.7 | 5.6 | 11.8 | 6.3 | 7.2 | 9.8 | 8.2 | 8.7 | 6.1 | 22.1 | 5.8 | 11.5 | 6.3 | 3.6 | 5.8 | 3.9 | 5.5 |
| | S1 | 5.9 | 17.6 | 5.4 | 9.5 | 6.4 | 6.7 | 6.0 | 7.4 | 5.0 | 5.9 | 21.9 | 5.0 | 8.1 | 5.6 | 2.8 | 2.3 | 3.7 | 1.4 |
| 0.5 | X1 | 97 | 96 | 94 | 100 | 90 | 100 | 100 | 100 | 100 | 100 | 100 | 99 | 100 | 99 | 100 | 100 | 100 | 100 |
| | | 96 | 98 | 98 | 100 | 66 | 94 | 15 | 24 | 100 | 100 | 100 | 100 | 100 | 52 | 100 | 11 | 33 | 100 |
| | | 97 | 95 | 99 | 100 | 84 | 100 | 99 | 98 | 100 | 99 | 100 | 100 | 100 | 85 | 100 | 98 | 100 | 100 |
| | | 67 | 64 | 13 | 18 | 40 | 34 | 17 | 17 | 9 | 92 | 91 | 14 | 23 | 32 | 56 | 7 | 22 | 1 |
| | | 65 | 66 | 9 | 31 | 37 | 24 | 11 | 14 | 6 | 93 | 97 | 15 | 24 | 33 | 58 | 5 | 16 | 0 |
| | S0 | 9.1 | 17.4 | 8.7 | 31.8 | 9.4 | 7.1 | 9.3 | 8.5 | 14.9 | 8.3 | 21.1 | 9.4 | 32.1 | 10.5 | 3.0 | 5.7 | 4.2 | 7.4 |
| | S1 | 6.6 | 18.1 | 4.7 | 11.5 | 6.3 | 6.5 | 6.8 | 7.2 | 0.3 | 6.0 | 22.3 | 4.7 | 10.4 | 6.2 | 3.0 | 2.4 | 3.3 | 0.0 |
| 0.8 | X1 | 90 | 87 | 69 | 100 | 83 | 100 | 94 | 96 | 97 | 100 | 92 | 82 | 100 | 95 | 100 | 93 | 100 | 100 |
| | | 85 | 89 | 95 | 100 | 65 | 95 | 11 | 20 | 91 | 100 | 98 | 99 | 100 | 64 | 100 | 8 | 31 | 94 |
| | | 90 | 89 | 73 | 100 | 75 | 100 | 80 | 94 | 90 | 99 | 98 | 90 | 100 | 89 | 100 | 84 | 97 | 83 |
| | | 62 | 58 | 16 | 33 | 44 | 34 | 11 | 20 | 0 | 80 | 85 | 16 | 40 | 48 | 42 | 6 | 21 | 0 |
| | | 63 | 60 | 14 | 39 | 48 | 17 | 7 | 18 | 0 | 81 | 79 | 21 | 42 | 46 | 44 | 2 | 20 | 0 |
| | S0 | 16.4 | 17.7 | 17.8 | 67.7 | 19.1 | 8.5 | 10.3 | 10.2 | 16.0 | 13.4 | 21.4 | 18.9 | 67.7 | 18.8 | 4.2 | 5.6 | 5.0 | 7.6 |
| | S1 | 7.4 | 17.5 | 4.6 | 18.2 | 5.9 | 5.4 | 6.6 | 5.6 | 0.0 | 7.1 | 22.1 | 4.4 | 17.2 | 9.5 | 2.2 | 2.8 | 2.5 | 0.0 |
Reported is the percentage of the important variables detected by each method (p value cutoff of 0.05), out of 100 repetitions for each simulation scenario, for five true causal features: X1, , , , , and two null feature sets: S0 and S1.
ᵃ Note that SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not perform formal statistical testing, so features can only be ranked, with no associated p values. As a simple illustration, the reported results for each of these four methods are based on the top 10 selected features. For PermFIT methods and Vanilla-RF, p values are calculated from a one-sided Z test.
Fig. 2 Negative log10 p values for TCGA kidney cancer data.
Important features selected by each method are marked in red. Since SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not produce p values, their importance scores are presented instead, and the 10 features with the top importance scores are marked. The highly correlated features (see the dendrogram on the right for details) selected by RFE-SVM, Vanilla-RF, and SNGM-DNN, but not by PermFIT methods, are highlighted. Source data are provided as a Source Data file.
Fig. 3 Model performance improvement from feature selection.
a, b Fivefold cross-validated prediction accuracy and AUC for TCGA kidney cancer data. c, d Fivefold cross-validated MSPE and Pearson correlation (between the true outcome and the prediction) for HITChip Atlas data. The fivefold cross-validation evaluation is randomly repeated 100 times. Data are presented as mean values ± s.d. Source data are provided as a Source Data file.
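The evaluation protocol in this caption (fivefold cross-validation repeated with fresh random splits, reporting MSPE and Pearson correlation as mean ± s.d.) can be sketched as follows. The least-squares model, the pooling of out-of-fold predictions within each repetition, the function name, and the simulated data are assumptions for illustration and are not taken from the paper. The two "selected" columns are simply known in advance here; in a real analysis the selection step would have to run inside each training fold to avoid leakage.

```python
import numpy as np


def repeated_kfold_scores(X, y, n_repeats=100, n_folds=5, seed=0):
    """Repeat k-fold cross-validation with new random splits and summarize
    MSPE and Pearson correlation (true outcome vs. out-of-fold prediction)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    mspe, corr = [], []
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(n), n_folds)
        pred = np.empty(n)
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False
            # Stand-in model: ordinary least squares with an intercept.
            A = np.column_stack([np.ones(train.sum()), X[train]])
            coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
            B = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
            pred[test_idx] = B @ coef
        mspe.append(np.mean((y - pred) ** 2))
        corr.append(np.corrcoef(y, pred)[0, 1])
    return {
        "mspe_mean": float(np.mean(mspe)), "mspe_sd": float(np.std(mspe)),
        "corr_mean": float(np.mean(corr)), "corr_sd": float(np.std(corr)),
    }


# Hypothetical data: 2 informative features among 40, with only 100 samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 40))
y = X[:, 0] + X[:, 1] + rng.normal(size=100)
all_features = repeated_kfold_scores(X, y, n_repeats=20)
selected = repeated_kfold_scores(X[:, :2], y, n_repeats=20)
```

In this toy setting, dropping the 38 noise columns lowers the cross-validated MSPE, which mirrors the qualitative pattern the figure reports for models refitted after feature selection.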
Fig. 4 Negative log10 p values for HITChip Atlas data.
Important features selected by each method are marked in red. Since SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not produce p values, their importance scores are presented instead, and the 10 features with the top importance scores are marked. The highly correlated features (see the dendrogram on the right for details) selected by RFE-SVM and Vanilla-RF, but not by PermFIT methods, are highlighted. Source data are provided as a Source Data file.