Stefano Nembrini, Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, Emerging Pathogens Institute, University of Florida, Gainesville, FL, USA.
Abstract
MOTIVATION: In bioinformatics applications, it is currently customary to permute the outcome variable to draw inference on covariates when testing novel methods or statistics whose distributions are poorly known. The seminal publication of Altmann et al. in Bioinformatics uses the same permutation scheme to obtain P-values that can be treated as a corrected measure of feature importance, rectifying the bias of the Gini variable importance in Random Forests. Since then, this method has also been used in applied work to draw statistical conclusions about variable importance measures from the resulting P-values.
RESULTS: In this paper, we show that permuting the outcome may produce unexpected results, including P-values with undesirable properties, and we illustrate how more refined permutation schemes can be appropriate for obtaining desirable results, including high power in discovering relevant variables.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
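To make the outcome-permutation scheme described above concrete, the following is a minimal sketch, not the authors' or Altmann et al.'s implementation. It assumes a stand-in importance statistic (the absolute Pearson correlation between a covariate and the outcome) in place of the Gini importance from a Random Forest, so that it runs with the standard library alone. The function names `abs_correlation` and `outcome_permutation_pvalues`, the toy data, and the add-one correction in the P-value are all illustrative assumptions.

```python
import random


def abs_correlation(x, y):
    """Stand-in importance statistic (assumption): absolute Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def outcome_permutation_pvalues(columns, y, n_perm=200, seed=0):
    """Permute the outcome n_perm times and return one P-value per covariate.

    Each P-value is the share of permutations whose importance is at least
    as large as the importance observed on the unpermuted outcome.
    """
    rng = random.Random(seed)
    observed = [abs_correlation(col, y) for col in columns]
    exceed = [0] * len(columns)
    y_perm = list(y)  # copy, so the caller's outcome is left untouched
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # the same permuted outcome is used for every covariate
        for j, col in enumerate(columns):
            if abs_correlation(col, y_perm) >= observed[j]:
                exceed[j] += 1
    # Add-one correction (assumption) keeps every P-value strictly positive.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data (hypothetical): one covariate drives the outcome, one is pure noise.
data_rng = random.Random(1)
n_samples = 100
x_signal = [data_rng.gauss(0, 1) for _ in range(n_samples)]
x_noise = [data_rng.gauss(0, 1) for _ in range(n_samples)]
y = [2.0 * a + data_rng.gauss(0, 1) for a in x_signal]

pvals = outcome_permutation_pvalues([x_signal, x_noise], y)
print(pvals)
```

Because the outcome is shuffled jointly for all covariates, every permutation breaks the association between the outcome and all covariates at once. The sketch only illustrates the mechanics of that scheme; it does not reproduce the more refined permutation schemes the paper examines.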