Jingyi Jessica Li1, Yiling Elaine Chen1, Xin Tong2. 1. Department of Statistics, University of California, Los Angeles. 2. Department of Data Sciences and Operations, Marshall School of Business, University of Southern California.
Abstract
Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention but is also rooted in scientists' strong interests in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discoveries. In this work, we focus on marginal ranking for binary classification, one of the most common predictive tasks. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions, both criteria achieve sample-level ranking that is consistent with their population-level counterpart with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property endows NPC with good potential in biomedical research where sampling biases are ubiquitous. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.
From scientific research to industrial applications, practitioners often face the challenge of ranking features for a prediction task. Among the ranking tasks performed by scientists and practitioners, a large proportion belongs to marginal ranking; that is, ranking features based on the relation between the response variable and one feature at a time, ignoring other available features. For example, to predict cancer driver genes, biomedical researchers need to first extract predictive features from patients' data. Then they decide whether each extracted feature is informative by examining its marginal distributions in tumor and normal tissues, usually by boxplots and histograms (Davoli et al., 2013; Lyu et al., 2020). Another example is genome-wide association studies, where single nucleotide polymorphisms are ranked by their marginal associations with a phenotype (Buniello et al., 2019).

From a prediction perspective, marginal feature ranking may seem suboptimal, as multiple features are usually dependent and thus jointly predictive of the response variable beyond a simple additive manner. Hence, interpreting all features' importance in a multivariate predictive model is an active area of research; popular criteria include the SHAP value (Lundberg and Lee, 2017) and the feature importance measured by the Gini index in the random forest (RF) algorithm. However, such "joint" feature ranking is too computationally intensive when the number of candidate features is huge, as it requires all candidate features to be input into one multivariate model; also, it does not reflect a feature's marginal predictive power when other highly correlated candidate features are in the model. Moreover, the popularity of marginal feature ranking is rooted in not only researchers' educational backgrounds and discipline conventions but also their strong desire for simple interpretation and visualization in the trial-and-error scientific discovery process.
As such, marginal feature ranking has been an indispensable data-analysis step in the scientific community, and it will likely stay popular.

In practice, statistical tests (e.g., the two-sample t test and the two-sample Wilcoxon rank-sum test) and association measures (e.g., the Pearson correlation) are often used to rank features marginally (Davoli et al., 2013; Lyu et al., 2020). However, these tests and association measures do not reflect the objective of a prediction task. For example, if the classification error is of concern, it is unclear how the significance of these tests or the values of these measures is connected to the classification error. This misalignment of ranking criterion and prediction objective is undesirable: the resulting feature rank list does not reflect the marginal importance of each feature for the prediction objective. Hence, scientists and practitioners call for a marginal ranking approach that meets the prediction objective.

In this work, we focus on marginal ranking for binary prediction, which can be formulated as binary classification in machine learning. Binary classification has multiple prediction objectives, which we refer to as paradigms here. These paradigms include (1) the classical paradigm that minimizes the classification error, i.e., a weighted sum of the type I and type II errors where the weights are the class probabilities (Hastie et al., 2009; James et al., 2013), (2) the cost-sensitive learning paradigm that replaces the two error weights by pre-determined costs (Elkan, 2001; Zadrozny et al., 2003), (3) the Neyman-Pearson (NP) paradigm that minimizes the type II error subject to a type I error upper bound (Cannon et al., 2002; Scott and Nowak, 2005; Tong, 2013; Tong et al., 2018), and (4) the global paradigm that focuses on the overall prediction accuracy under all possible thresholds: the area under the receiver-operating-characteristic curve (AUROC) or the area under the precision-recall curve (AUPRC).
Here we consider marginal ranking of features under the classical and NP paradigms, and we define the corresponding ranking criteria as the classical criterion (CC) and the Neyman-Pearson criterion (NPC). To implement CC and NPC, we take a model-free approach by using nonparametric estimates of class-conditional feature densities. This approach makes CC and NPC more adaptive to diverse feature distributions than existing criteria for marginal feature ranking. The idea behind CC and NPC is easily generalizable to the cost-sensitive learning paradigm and the global paradigm.

It is worth highlighting that NPC is robust to sampling bias; that is, even when the class proportions in a sample deviate from those in the population, NPC still achieves feature ranking consistency between sample and population with high probability. This nice property makes NPC particularly useful for disease diagnosis, where a long-standing obstacle is that the proportions of diseased patients and healthy people in medical records do not reflect the proportions in the population.

The rest of the paper is organized as follows. In Section 2, we define the population-level CC and NPC as the oracle criteria under the classical and NP paradigms, respectively. In Section 3, we define the sample-level CC and NPC, and we develop model-free algorithms to implement them. In Section 4, we derive theoretical results regarding the ranking consistency of the sample-level CC and NPC in relation to their population counterparts. In Section 5, we use simulation studies to demonstrate the performance of sample-level CC and NPC in ranking low-dimensional and high-dimensional features. We also implement variants of sample-level CC and NPC based on the support vector machine (SVM) algorithm and show that they are less robust than our proposed sample-level CC and NPC to sampling bias. In Section 6, we apply sample-level CC and NPC to marginal feature ranking in two real datasets.
Using the first dataset regarding breast cancer diagnosis, we show that both criteria can identify informative features, many of which have been previously reported; we also provide a Supplementary Excel File for literature evidence. Using the second dataset for prostate cancer diagnosis from urine samples, we demonstrate that NPC is robust to sampling bias. In both simulation and real-data studies, we compare sample-level CC and NPC with joint feature ranking criteria—the SHAP value and the feature importance measures in the RF algorithm—and commonly used marginal ranking criteria that may give feature rankings misaligned with the prediction objective, including the Pearson correlation, the distance correlation (Székely et al., 2009),[1] the two-sample t test, and the two-sample Wilcoxon rank-sum test. We conclude with a discussion in Section 7. Additional materials and proofs of lemmas, propositions, and theorems are relegated to the Appendix.

The code for reproducing the numerical results is available at http://doi.org/10.5281/zenodo.4680067. The R package frc is available at https://github.com/JSB-UCLA/frc.
Population-level ranking criteria
In this section, we introduce two objective-based marginal feature ranking criteria, on the population level, under the classical paradigm and the Neyman-Pearson (NP) paradigm. As argued previously, when one has a learning/prediction objective, the feature ranking criterion should be in line with that objective. Concretely, the j-th ranked feature should be the one that achieves the j-th best performance under that objective. This objective-based feature ranking perspective is extendable to ranking feature subsets (e.g., feature pairs). Although we focus on marginal feature ranking in this work, to accommodate this future extension, our notations in the methodology and theory development are compatible with ranking feature subsets.
Notations and classification paradigms
We first introduce essential mathematical notations to facilitate our discussion. Let (X, Y) be a pair of random observations, where X ∈ ℝ^d is a vector of features and Y ∈ {0, 1} indicates the class label of X. A classifier φ: ℝ^d → {0, 1} maps from the feature space to the label space. A loss function assigns a cost to each misclassified instance φ(X) ≠ Y, and the risk is defined as the expectation of this loss function with respect to the joint distribution of (X, Y). We adopt in this work a commonly used loss function, the 0–1 loss: 1(φ(X) ≠ Y), where 1(·) denotes the indicator function. Let ℙ and 𝔼 denote the generic probability distribution and expectation, whose meanings depend on specific contexts. With the choice of the 0–1 loss function, the risk is the classification error: R(φ) = 𝔼[1(φ(X) ≠ Y)] = ℙ(φ(X) ≠ Y), which is aligned with most practitioners' interest in classifier evaluation. Note that in this work, the 0–1 loss is only used as an evaluation criterion in our development of marginal ranking criteria, not as a loss function for training a classifier from data.

In this paper, we call the learning objective of minimizing R(·) the classical paradigm. Under this paradigm, one aims to mimic the classical oracle classifier φ* that minimizes the population-level classification error,

φ* = argmin_φ R(φ).
It is well known in the literature that the classical oracle φ*(x) = 1(η(x) > 1/2), where η(x) = ℙ(Y = 1 | X = x) is the regression function (Koltchinskii, 2011). Equivalently, φ*(x) = 1(p1(x)/p0(x) > π0/π1), where π0 = ℙ(Y = 0), π1 = ℙ(Y = 1), p0 is the probability density function of X | (Y = 0), and p1 is the probability density function of X | (Y = 1). Note that the risk can be decomposed as follows:

R(φ) = ℙ(Y = 0) ℙ(φ(X) ≠ Y | Y = 0) + ℙ(Y = 1) ℙ(φ(X) ≠ Y | Y = 1) = π0 R0(φ) + π1 R1(φ),

where Rj(φ) = ℙ(φ(X) ≠ Y | Y = j), for j = 0 or 1. The notations R0(·) and R1(·) denote the population-level type I and type II errors, respectively. Note that minimizing R(·) implicitly imposes a weighting of R0 and R1 by π0 and π1. This is not always desirable. For example, when practitioners know the explicit costs of making type I and type II errors, c0 and c1, they may want to optimize the criterion c0R0(·) + c1R1(·), which is often referred to as the cost-sensitive learning paradigm.

In parallel to the classical paradigm, we consider the NP paradigm, which aims to mimic the level-α NP oracle classifier φ*_α that minimizes the type II error while constraining the type I error under α, a user-specified type I error upper bound:

φ*_α = argmin_{φ: R0(φ) ≤ α} R1(φ). (1)
Usually, α is a small value (e.g., 5% or 10%), reflecting a conservative attitude towards the type I error. As the development of classifiers under the NP paradigm is relatively new, here we review the NP oracle classifier φ*_α. Motivated by the classic NP Lemma (Appendix G) and a correspondence between classification and statistical hypothesis testing, φ*_α in (1) can be constructed by thresholding p1(·)/p0(·) at a proper level (Tong, 2013):

φ*_α(x) = 1(p1(x)/p0(x) > C*_α),

where C*_α is such that R0(φ*_α) = ℙ(p1(X)/p0(X) > C*_α | Y = 0) = α, and the estimation of C*_α is introduced in Section 3.2.

In addition to the above three paradigms, a common practice is to evaluate a classification algorithm by its AUROC or AUPRC, which we refer to as the global paradigm. In contrast to the above three paradigms, each of which leads to a single classifier with its corresponding type I and type II errors, the global paradigm evaluates a classification algorithm by aggregating all of its possible classifiers, whose type I errors range from zero to one. For example, the oracle AUROC is the area under the curve traced out by the pairs (type I error, 1 − type II error) of the NP oracle classifiers as α ranges over [0, 1].
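To make the NP oracle concrete, consider a univariate example with Gaussian class-conditional densities; the parameters below are illustrative choices, not taken from the paper. With equal variances and μ1 > μ0, the density ratio p1/p0 is increasing in x, so thresholding p1/p0 at C*_α is equivalent to thresholding x at the (1 − α)-quantile of the class 0 distribution. A minimal Python sketch (the paper's own implementation is in R):

```python
from scipy.stats import norm

# Hypothetical univariate Gaussian class-conditional densities:
# X | Y=0 ~ N(0, 1) and X | Y=1 ~ N(2, 1).
mu0, mu1, sigma = 0.0, 2.0, 1.0
alpha = 0.05  # type I error upper bound

# Because mu1 > mu0 with equal variances, p1(x)/p0(x) is increasing in x,
# so the NP oracle classifies to class 1 iff x exceeds the cutoff t_alpha
# with P(X > t_alpha | Y=0) = alpha, i.e., the (1 - alpha)-quantile of N(mu0, sigma).
t_alpha = norm.ppf(1 - alpha, loc=mu0, scale=sigma)

# The achieved type II error of the NP oracle:
type2 = norm.cdf(t_alpha, loc=mu1, scale=sigma)
print(round(t_alpha, 3), round(type2, 3))
```

The cutoff here plays the role of C*_α on the x-scale; for non-monotone density ratios the rejection region is no longer a single tail, which is why the general construction thresholds p1/p0 itself.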
Classical and Neyman-Pearson criteria on the population level
Different learning/prediction objectives in classification induce distinct feature ranking criteria. We first define the population-level CC and NPC. Then we show that these two criteria lead to different rankings of features in general, and that NPC may rank features differently at different α values. We denote by φ*_A and φ*_{α,A}, respectively, the classical oracle classifier and the level-α NP oracle classifier that only use the features indexed by A ⊆ {1, …, d}. This paper focuses on the case where |A| = 1. Concretely, under the classical paradigm, the classical oracle classifier on index set A, φ*_A, achieves

R(φ*_A) = min_{φ_A} R(φ_A), (3)

where φ_A is any mapping that first projects x to its |A|-dimensional sub-vector x_A, which comprises the coordinates of x corresponding to the index set A, and then maps from ℝ^{|A|} to {0, 1}. Analogous to φ*(·), we know

φ*_A(x) = 1(η_A(x_A) > 1/2) = 1(p1A(x_A)/p0A(x_A) > π0/π1),

where η_A(x_A) = ℙ(Y = 1 | X_A = x_A) is the regression function using only the features in the index set A, and p1A and p0A denote the class-conditional probability density functions of the features X_A. Suppose that candidate feature subsets denoted by A1, …, AJ are provided, which may be enumerated by a computational algorithm or curated by domain experts. We define the population-level classical criterion (p-CC) of A as its optimal risk R(φ*_A); i.e., A1, …, AJ will be ranked based on R(φ*_{A1}), …, R(φ*_{AJ}), with the smallest being ranked the top. The prefix "p" in p-CC indicates "population-level." Note that R(φ*_A) represents A's best achievable performance under the classical paradigm and does not depend on any specific models assumed for the distribution of (X, Y).

Under the NP paradigm, the NP oracle classifier defined on the index set A, φ*_{α,A}, achieves

R1(φ*_{α,A}) = min_{φ_A: R0(φ_A) ≤ α} R1(φ_A). (5)
By the NP Lemma,

φ*_{α,A}(x) = 1(p1A(x_A)/p0A(x_A) > C*_{α,A}),

where C*_{α,A} is a constant such that R0(φ*_{α,A}) = α. For a given level α, we define the population-level Neyman-Pearson criterion (p-NPC) of A as its optimal type II error R1(φ*_{α,A}); i.e., A1, …, AJ will be ranked based on R1(φ*_{α,A1}), …, R1(φ*_{α,AJ}), with the smallest being ranked the top.

For a graphical illustration of marginal feature ranking by p-CC and p-NPC, please see Figure H.1 in Appendix H. It is worth noting that p-CC and p-NPC do not always give the same feature ranking. For a toy example, we compare two features X1 and X2,[2] whose class-conditional distributions are the following Gaussians:

and the class priors are equal, i.e., π0 = π1 = .5. It can be calculated that R(φ*_{{1}}) < R(φ*_{{2}}); therefore, p-CC ranks feature 1 higher than feature 2. The comparison is more subtle for p-NPC. If we set α = .01, R1(φ*_{α,{1}}) = .431 is larger than R1(φ*_{α,{2}}) = .299. However, if we set α = .20, R1(φ*_{α,{1}}) = .049 is smaller than R1(φ*_{α,{2}}) = .084. Figure 1 illustrates the NP oracle classifiers for these α's.
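This rank-flip phenomenon can be reproduced numerically for any pair of univariate densities by constructing the NP oracle on a grid: include points in decreasing order of the density ratio p1/p0 until the class 0 probability budget α is spent. The Gaussian parameters below are hypothetical stand-ins (the paper's equation (6) specifies its own), chosen so that p-NPC prefers feature 2 at α = .01 but feature 1 at α = .20:

```python
import numpy as np
from scipy.stats import norm

grid = np.linspace(-10, 10, 200001)
dx = grid[1] - grid[0]

def np_oracle_type2(p0, p1, alpha):
    """Type II error of the NP oracle that thresholds p1/p0: greedily add
    grid points with the largest density ratio until the accumulated
    type I error reaches alpha."""
    order = np.argsort(-(p1 / np.maximum(p0, 1e-300)))
    type1 = np.cumsum(p0[order]) * dx       # accumulated type I error
    region = order[type1 <= alpha]          # rejection region {p1/p0 > C*_alpha}
    return 1.0 - p1[region].sum() * dx      # type II error = 1 - P1(region)

# Hypothetical features: both have X|Y=0 ~ N(0,1);
# feature 1 has X|Y=1 ~ N(2,1), feature 2 has X|Y=1 ~ N(2,2).
p0 = norm.pdf(grid, 0, 1)
p1_f1 = norm.pdf(grid, 2, 1)
p1_f2 = norm.pdf(grid, 2, 2)

for alpha in (0.01, 0.20):
    print(alpha,
          round(np_oracle_type2(p0, p1_f1, alpha), 3),
          round(np_oracle_type2(p0, p1_f2, alpha), 3))
```

For these stand-in parameters, the heavier-tailed feature 2 attains the smaller type II error at α = .01, while feature 1 wins at α = .20, so the p-NPC ranking flips with α just as in the paper's toy example.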
Figure H.1:
An illustration of marginal feature ranking by p-CC (left) and p-NPC (right). In this example, π0 = π1 = .5. The purple areas indicate p-CC values, the green areas indicate p-NPC values, and the yellow areas indicate α (the type I error upper bound in the NP paradigm). For both p-CC and p-NPC, a smaller value gives a better ranking.
Figure 1:
A toy example in which feature ranking under p-NPC changes as α varies. Panel a: α = .01. The NP oracle classifier based on feature 1 (or feature 2) has the type II error .431 (or .299). Panel b: α = .20. The NP oracle classifier based on feature 1 (or feature 2) has the type II error .049 (or .084).
This example illustrates a general phenomenon: feature ranking depends on the user-chosen criterion. For some α values (e.g., α = .20 in the example), p-NPC and p-CC agree on the ranking, while for others (e.g., α = .01 in the example), they disagree.[3] This observation calls for the development of sample-level CC and NPC.
Sample-level ranking criteria
In the following text, we refer to the sample-level CC and NPC as "s-CC" and "s-NPC," respectively. In the same model-free spirit of the p-CC and p-NPC definitions, we use model-free nonparametric techniques to construct s-CC and s-NPC. Admittedly, such construction would be impractical when the feature subsets to be ranked have large cardinalities. However, since we are mainly interested in marginal feature ranking, with an intended extension to small subsets such as feature pairs, model-free nonparametric techniques are appropriate.

In the methodology and theory sections, we assume the following sampling scheme. Suppose we have a training dataset S = S0 ∪ S1, where S0 contains m independent and identically distributed (i.i.d.) class 0 observations, S1 contains n i.i.d. class 1 observations, and S0 is independent of S1. The sample sizes m and n are considered as fixed positive integers. The construction of both s-CC and s-NPC involves splitting the class 0 and class 1 observations. To increase stability, we perform multiple random splits. In detail, we randomly divide S0 for B times into two halves S0_ts^(b) and S0_lo^(b) of sizes m1 and m2, where m1 + m2 = m, the subscripts "ts" and "lo" stand for train-scoring and left-out respectively, and the superscript b ∈ {1, …, B} indicates the b-th random split. We also randomly split S1 for B times into S1_ts^(b) and S1_lo^(b) of sizes n1 and n2, where n1 + n2 = n and b ∈ {1, …, B}. In this work, we make equal-sized splits: m1 = ⌊m/2⌋ and n1 = ⌊n/2⌋. We leave the possibility of data-adaptive splits to future work.

As in the definitions of p-CC and p-NPC, we use notations that allow for the extension to ranking feature subsets. For A ⊆ {1, …, d} with |A| = l, recall the classical oracle classifier restricted to A, φ*_A, defined in (3), and the NP oracle classifier restricted to A, φ*_{α,A}, defined in (5). Although these two oracles have different thresholds, π0/π1 vs. C*_{α,A}, the class-conditional density ratio p1A(·)/p0A(·) is involved in both oracles. The densities p0A and p1A can be estimated respectively from S0_ts^(b) and S1_ts^(b) by the kernel density estimators

p̂0A^(b)(x_A) = (1/(m1 h0^l)) Σ_{X ∈ S0_ts^(b)} K((x_A − X_A)/h0),  p̂1A^(b)(x_A) = (1/(n1 h1^l)) Σ_{X ∈ S1_ts^(b)} K((x_A − X_A)/h1), (7)

where h0 and h1 denote the bandwidths, and K(·) is a kernel in ℝ^l.
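As a concrete univariate sketch of the plug-in scoring function, the Python code below estimates p̂0 and p̂1 with a Gaussian kernel on hypothetical train-scoring halves and forms the density-ratio score; the paper's implementation instead uses the R package ks with a plug-in bandwidth selector, so the rule-of-thumb bandwidths here are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_1d(train, h):
    """Gaussian-kernel density estimator built from a training sample;
    the returned function evaluates the estimate at a 1-D array x."""
    def p_hat(x):
        u = (np.asarray(x, dtype=float)[:, None] - train[None, :]) / h
        return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))
    return p_hat

# Hypothetical train-scoring halves for one feature:
x0_ts = rng.normal(0.0, 1.0, 500)   # class 0 observations
x1_ts = rng.normal(2.0, 1.0, 500)   # class 1 observations

# Rule-of-thumb bandwidths (a stand-in for the plug-in selector):
h0 = 1.06 * x0_ts.std() * len(x0_ts) ** (-0.2)
h1 = 1.06 * x1_ts.std() * len(x1_ts) ** (-0.2)

p0_hat, p1_hat = kde_1d(x0_ts, h0), kde_1d(x1_ts, h1)

def score(x):
    """Estimated density ratio p1_hat / p0_hat, the plug-in scoring function."""
    return p1_hat(x) / p0_hat(x)
```

Both s-CC and s-NPC below reuse this estimated ratio; they differ only in how they threshold it and in which left-out observations they use for evaluation.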
Sample-level classical ranking criterion
To define s-CC, we first construct plug-in classifiers φ̂_A^(b)(x) = 1(p̂1A^(b)(x_A)/p̂0A^(b)(x_A) > m1/n1) on S0_ts^(b) ∪ S1_ts^(b) for b ∈ {1, …, B}. In each classifier, the threshold m1/n1 mimics π0/π1. If classes 0 and 1 are sampled with probabilities π0 and π1, respectively, then each classifier φ̂_A^(b) is a good plug-in estimate of φ*_A defined in (3). However, in the presence of sampling bias, m1/n1 cannot mimic π0/π1, and thus φ̂_A^(b) is not a good estimate of φ*_A. Armed with the classifiers φ̂_A^(1), …, φ̂_A^(B), we define the sample-level CC (s-CC) of A as

CC_A = (1/B) Σ_{b=1}^B (1/(m2 + n2)) [ Σ_{X ∈ S0_lo^(b)} 1(φ̂_A^(b)(X) = 1) + Σ_{X ∈ S1_lo^(b)} 1(φ̂_A^(b)(X) = 0) ]. (8)

In other words, CC_A is the average of the empirical risks of φ̂_A^(1), …, φ̂_A^(B) on the left-out observations for b ∈ {1, …, B}.
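A minimal Python sketch of the s-CC computation, using B random splits and, for brevity, a parametric Gaussian plug-in for the density ratio in place of the paper's kernel estimates (the data and the score factory are hypothetical illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_score_factory(x0_ts, x1_ts):
    """Plug-in density-ratio score from Gaussian fits (a stand-in for KDE)."""
    a0, b0 = x0_ts.mean(), x0_ts.std()
    a1, b1 = x1_ts.mean(), x1_ts.std()
    pdf = lambda x, a, b: np.exp(-0.5 * ((x - a) / b) ** 2) / (b * np.sqrt(2 * np.pi))
    return lambda x: pdf(x, a1, b1) / pdf(x, a0, b0)

def s_cc(x0, x1, score_factory, B=11):
    """Sample-level classical criterion: average misclassification rate of
    plug-in classifiers on the left-out halves over B random splits."""
    m, n = len(x0), len(x1)
    m1, n1 = m // 2, n // 2
    errs = []
    for _ in range(B):
        i0, i1 = rng.permutation(m), rng.permutation(n)
        s = score_factory(x0[i0[:m1]], x1[i1[:n1]])   # train-scoring halves
        thr = m1 / n1                                 # threshold mimicking pi0/pi1
        wrong0 = np.sum(s(x0[i0[m1:]]) > thr)         # left-out class 0 errors
        wrong1 = np.sum(s(x1[i1[n1:]]) <= thr)        # left-out class 1 errors
        errs.append((wrong0 + wrong1) / ((m - m1) + (n - n1)))
    return float(np.mean(errs))

x0 = rng.normal(0, 1, 1000)   # hypothetical class 0 sample
x1 = rng.normal(2, 1, 1000)   # hypothetical class 1 sample
print(s_cc(x0, x1, gauss_score_factory))  # near the Bayes error for these densities
```

Averaging over B splits stabilizes the criterion; with balanced classes as here, the left-out error estimate sits near the best achievable risk of the single feature.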
Sample-level Neyman-Pearson ranking criterion
To define s-NPC, we use the same kernel density estimates to plug in p1(·)/p0(·) as in s-CC. To estimate the oracle threshold C*_{α,A}, we use the NP umbrella algorithm (Tong et al., 2018). Unlike s-CC, in which we use both S0_lo^(b) and S1_lo^(b) to evaluate the constructed classifier, for s-NPC we use S0_lo^(b) to estimate the threshold and only S1_lo^(b) to evaluate the classifier.

The NP umbrella algorithm finds proper thresholds for all scoring-type classification methods (e.g., nonparametric density-ratio plug-in, logistic regression, and RF) so that the resulting classifiers achieve a high-probability control of the type I error under the pre-specified level α. A scoring-type classification method outputs a scoring function that maps the feature space to ℝ, and a classifier is constructed by combining the scoring function with a threshold. To construct an NP classifier given a scoring-type classification method, the NP umbrella algorithm first trains a scoring function on S0_ts^(b) ∪ S1_ts^(b). In this work, we specifically use ŝ_A^(b)(·) = p̂1A^(b)(·)/p̂0A^(b)(·), in which the numerator and the denominator are defined in (7). Second, the algorithm applies ŝ_A^(b) to S0_lo^(b) to obtain scores, which are then sorted in an increasing order and denoted by t_(1)^(b) ≤ ⋯ ≤ t_(m2)^(b). Third, for a user-specified type I error upper bound α ∈ (0, 1) and a violation rate δ1 ∈ (0, 1), which refers to the probability that the type I error of the trained classifier exceeds α, the algorithm chooses the order

k* = min{ k ∈ {1, …, m2} : Σ_{j=k}^{m2} (m2 choose j) (1 − α)^j α^(m2 − j) ≤ δ1 }.

When m2 ≥ log δ1 / log(1 − α), a finite k* exists,[4] and the umbrella algorithm chooses the threshold of the estimated scoring function as t_(k*)^(b). Hence, the resulting NP classifier is

φ̂_{α,A}^(b)(x) = 1(ŝ_A^(b)(x_A) > t_(k*)^(b)).

Proposition 1 in Tong et al. (2018) states that there is no more than δ1 probability for the type I error of φ̂_{α,A}^(b) to exceed α:

ℙ(R0(φ̂_{α,A}^(b)) > α) ≤ δ1,

for every b = 1, …, B. We evaluate the type II errors of the B NP classifiers on the left-out class 1 sets S1_lo^(1), …, S1_lo^(B), respectively. Our sample-level NPC (s-NPC) of A at level α, denoted by NPC_A^α, computes the average of these type II errors:

NPC_A^α = (1/B) Σ_{b=1}^B (1/n2) Σ_{X ∈ S1_lo^(b)} 1(ŝ_A^(b)(X_A) ≤ t_(k*)^(b)), (11)

where ŝ_A^(b) is the kernel density ratio constructed on S0_ts^(b) ∪ S1_ts^(b) using only the features indexed by A, and t_(k*)^(b) is the threshold found by the NP umbrella algorithm.

The time complexity is O((m1 + n1) · (m2 + n2)) for calculating s-CC, and an additional complexity of O(m2 log m2) is needed for calculating s-NPC; both time complexities can be reduced to O(m + n) if approximate kernel density estimation is used and m2 is bounded. We discuss the calculation details of the time complexities in Appendix E, and we illustrate the calculation of s-CC and s-NPC in Figure H.2.
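The order-selection step of the NP umbrella algorithm can be sketched directly from the binomial tail bound of Tong et al. (2018): thresholding at the k-th smallest left-out class 0 score violates the type I error level α with probability at most the binomial tail shown in the comment, so k* is the smallest k whose tail falls below δ1. A Python sketch of this step (the paper's own implementation is in R):

```python
from math import comb

def np_order(m2, alpha, delta1):
    """Smallest order k* such that thresholding the scores at the k-th
    smallest left-out class 0 score keeps the type I error below alpha
    with probability at least 1 - delta1; None if m2 is too small."""
    for k in range(1, m2 + 1):
        # P(type I error > alpha) <= sum_{j=k}^{m2} C(m2,j) (1-a)^j a^(m2-j)
        violation = sum(comb(m2, j) * (1 - alpha) ** j * alpha ** (m2 - j)
                        for j in range(k, m2 + 1))
        if violation <= delta1:
            return k
    return None

# A finite k* exists iff (1 - alpha)^m2 <= delta1,
# i.e., m2 >= log(delta1) / log(1 - alpha): about 59 when alpha = delta1 = .05.
print(np_order(59, 0.05, 0.05), np_order(100, 0.05, 0.05))
```

Note that k* grows with m2 but the implied threshold t_(k*) stays close to the (1 − α)-quantile of the class 0 score distribution; the conservatism (choosing a high order statistic) is exactly what buys the high-probability type I error control.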
Figure H.2:
An illustration of the calculation of s-CC and s-NPC.
For the implementation of s-NPC and s-CC, we use the kde() function with default arguments in the R package ks. By default, the function uses the Gaussian kernel and the bandwidth selected by the univariate plug-in selector of Wand and Jones (1994).
Revisiting the toy example at the sample level
With the above definitions of s-CC and s-NPC, we demonstrate how they rank the two features in the toy example (Figure 1) and show that their rankings are consistent with those of their population-level counterparts, p-CC and p-NPC, respectively, with high probability.

We simulate 1000 samples, each of size N = 2000 (two classes combined), from the two-feature distribution (6) in the toy example (Figure 1). With B = 11, we apply s-CC (8) and s-NPC with δ1 = .05 (11) to each sample to rank the two features, and we calculate the frequency with which each feature is ranked the top among the 1000 ranking results. Table 1 shows that s-NPC (α = .01) ranks feature 2 the top with high probability, while s-CC and s-NPC (α = .20) prefer feature 1 with high probability. This is consistent with our population-level result in Section 2.2: p-NPC (α = .01) prefers feature 2, while p-CC and p-NPC (α = .20) find feature 1 better.
Table 1:
The frequency of each feature being ranked the top by each criterion among 1,000 samples in the toy example (Figure 1).
Criterion        | Feature 1 | Feature 2
s-CC             | 78.0%     | 22.0%
s-NPC (α = .01)  | 1.6%      | 98.4%
s-NPC (α = .20)  | 99.0%     | 1.0%
Theoretical properties
In this section, we investigate the ranking properties of s-CC and s-NPC. Concretely, we will address this question: for J candidate feature index sets A1, …, AJ of size l, is it guaranteed that s-CC and s-NPC agree in ranking with p-CC and p-NPC, respectively, with high probability? In our theory development, we consider J as a fixed number; for simplicity, we assume the number of random splits to be B = 1 in s-CC and s-NPC, thus removing the superscript (b) from all notations in this section and the Appendix proofs.

In addition to investigating ranking consistency, we discover a property unique to s-NPC: robustness against sampling bias. Concretely, as long as the sample sizes m and n are large enough, s-NPC gives a ranking consistent with p-NPC even when the class size ratio m/n in the sample is far from the ratio π0/π1 in the population. In contrast, s-CC is not robust against sampling bias, except when the population class proportion ratio π0/π1 is known and we replace the thresholds m1/n1 in the plug-in classifiers used for s-CC by this ratio.
Definitions and key assumptions
We assume that the candidate index sets A1, …, AJ have a moderate size l (≪ d). Following Audibert and Tsybakov (2007), for any multi-index s = (s1, …, sl) ∈ ℕ^l and any x = (x1, …, xl) ∈ ℝ^l, we define |s| = s1 + ⋯ + sl, s! = s1! ⋯ sl!, x^s = x1^{s1} ⋯ xl^{sl}, and the differential operator D^s = ∂^{|s|} / (∂x1^{s1} ⋯ ∂xl^{sl}). For all the theoretical discussions, we assume the domain of p0A and p1A, i.e., the class-conditional densities of X_A | (Y = 0) and X_A | (Y = 1), is [−1, 1]^l, where l = |A|. We denote the distributions of X_A | (Y = 0) and X_A | (Y = 1) by P0A and P1A, respectively.
Definition 1 (Hölder function class)
Let β > 0, and denote by ⌊β⌋ the largest integer strictly less than β. For a ⌊β⌋-times continuously differentiable function g: [−1, 1]^l → ℝ, we denote by g_x its Taylor polynomial of degree ⌊β⌋ at a value x ∈ [−1, 1]^l:

g_x(x′) = Σ_{|s| ≤ ⌊β⌋} ((x′ − x)^s / s!) D^s g(x).

For L > 0, the (β, L, [−1, 1]^l)-Hölder function class, denoted by Σ(β, L, [−1, 1]^l), is the set of ⌊β⌋-times continuously differentiable functions g: [−1, 1]^l → ℝ that satisfy the following inequality:

|g(x′) − g_x(x′)| ≤ L ‖x′ − x‖^β for all x, x′ ∈ [−1, 1]^l.
Definition 2 (Hölder density class)
The (β, L, [−1, 1]^l)-Hölder density class is defined as

P_Σ(β, L, [−1, 1]^l) = { p : p is a probability density function on [−1, 1]^l and p ∈ Σ(β, L, [−1, 1]^l) }.

The following β-valid kernels are multi-dimensional analogs of univariate higher-order kernels.
Definition 3 (β-valid kernel)
Let K(·) be a real-valued kernel function on ℝ^l with the support [−1, 1]^l. For a fixed β > 0, the function K(·) is a β-valid kernel if it satisfies (1) ∫ |K|^q < ∞ for any q ≥ 1, (2) ∫ K(u) du = 1, and (3) in the case ⌊β⌋ ≥ 1, ∫ u^s K(u) du = 0 for any multi-index s with 1 ≤ |s| ≤ ⌊β⌋.

One example of β-valid kernels is the product kernel whose ingredients are 1-dimensional kernels of order β:

K(u) = K1(u1) K1(u2) ⋯ K1(ul),

where K1(·) is a 1-dimensional β-valid kernel constructed based on Legendre polynomials. Such kernels have been considered in Rigollet and Vert (2009). When a β-valid kernel is constructed out of Legendre polynomials, it is also Lipschitz and bounded. For simplicity, we assume that all the β-valid kernels considered in the theory discussion are constructed from Legendre polynomials.
Definition 4 (Margin assumption)
A function f(·) satisfies the margin assumption of order γ̄ at level C with respect to the probability distribution P of a random vector X, if there exists a positive constant C̄ such that, for all δ ≥ 0,

P(|f(X) − C| ≤ δ) ≤ C̄ δ^{γ̄}.

The above condition for density functions was first introduced in Polonik (1995), and its counterpart in classical binary classification is called the margin condition (Mammen and Tsybakov, 1999), which is a low-noise condition. Recall that the set {x : η(x) = 1/2} is the decision boundary of the classical oracle classifier, and the margin condition in the classical paradigm is a special case of Definition 4 obtained by taking f(·) = η(·) and C = 1/2. Unlike the classical paradigm, where the optimal threshold 1/2 on the regression function η(·) is known, the optimal threshold under the NP paradigm is unknown and needs to be estimated, which suggests the necessity of having sufficient data around the decision boundary to detect it. This concern motivated Tong (2013) to formulate a detection condition that works as an opposite force to the margin assumption, and Zhao et al. (2016) improved the condition and proved its necessity in bounding the excess type II error of an NP classifier. To establish the ranking consistency properties of s-NPC, a bound on the excess type II error is an intermediate result, so we also need this detection condition in the current work.
Definition 5 (Detection condition (Zhao et al., 2016))
A function f(·) satisfies the detection condition of order (γ̱, δ*) at level C with respect to the probability distribution P of a random vector X, if there exists a positive constant C̱ such that, for all δ ∈ (0, δ*),

P(C < f(X) ≤ C + δ) ≥ C̱ δ^{γ̱}.
A uniform deviation result of the scoring function
For A ⊆ {1, …, d} with |A| = l, recall that we estimate p0A(·) and p1A(·) respectively from S0_ts and S1_ts by the kernel density estimators p̂0A and p̂1A defined in (7), where K(·) is a β-valid kernel in ℝ^l. We are interested in deriving a high-probability bound for the uniform deviation of the scoring function ŝ_A(·) = p̂1A(·)/p̂0A(·) from s_A(·) = p1A(·)/p0A(·).
Condition 1
Suppose that for all A ⊂ {1, …, d} satisfying |A| = l,

(i) there exist positive constants μmin and μmax such that μmax ≥ p0A(·) ≥ μmin and μmax ≥ p1A(·) ≥ μmin;

(ii) there is a positive constant L such that p0A(·), p1A(·) ∈ P_Σ(β, L, [−1, 1]^l).
Proposition 6
Assume Condition 1, and let the kernel K(·) be β-valid and L′-Lipschitz. Let A ⊆ {1, …, d} with |A| = l, and let p̂0A and p̂1A be the kernel density estimators defined in (7). Take the bandwidths h0 = (log m1/m1)^{1/(2β+l)} and h1 = (log n1/n1)^{1/(2β+l)}. For any δ3 ∈ (0, 1), if m1 ∧ n1 is large enough, where ∧ denotes the minimum, then there exists a positive constant C0 such that, with probability at least 1 − δ3,

sup_{x_A ∈ [−1, 1]^l} |ŝ_A(x_A) − s_A(x_A)| ≤ C0 [ (log m1/m1)^{β/(2β+l)} + (log n1/n1)^{β/(2β+l)} ].
Ranking property of s-CC
To study the ranking agreement between s-CC and p-CC, an essential step is to develop a concentration result of CC_A around R(φ*_A), where φ*_A was defined in (3), based on Proposition 6.
Theorem 7
Let δ3, δ4, δ5 ∈ (0, 1). In addition to the assumptions of Proposition 6, assume that the density ratio s_A(·) = p1A(·)/p0A(·) satisfies the margin assumption of order γ̄ at level π0/π1 (with constant C̄) with respect to both P0A and P1A (the distributions of X_A | (Y = 0) and X_A | (Y = 1)), that the sample sizes are large enough, and that m/n = m1/n1 = π0/π1. Then, with probability at least 1 − δ3 − δ4 − δ5,

|CC_A − R(φ*_A)| ≤ C ε(m1, n1, m2, n2),

for some positive constant C, where the deviation term ε(m1, n1, m2, n2) combines the uniform estimation error of the scoring function from Proposition 6 with the left-out evaluation errors associated with δ4 and δ5, and vanishes as the sample sizes grow.

Under smoothness, regularity, and sample size conditions, Theorem 7 shows the concentration of CC_A around R(φ*_A) with probability at least 1 − (δ3 + δ4 + δ5). The user-specified violation rate δ3 accounts for the uncertainty in training the scoring function on a finite sample; δ4 represents the uncertainty of using the left-out class 1 observations to estimate R1(φ*_A); δ5 represents the uncertainty of using the left-out class 0 observations to estimate R0(φ*_A) (recall that R(·) = π0R0(·) + π1R1(·)). Like the constant C0 in Proposition 6, the generic constant C in Theorem 7 can be given more explicitly, but it would be too cumbersome to do so. More discussion of Theorem 7 can be found after its counterpart for s-NPC, i.e., Theorem 12.

Theorem 7 leads to the ranking consistency of s-CC.
Theorem 8
Let δ3, δ4, δ5 ∈ (0, 1), A1, …, A ⊆ {1, …, d}, |A1| = |A2| = ⋯ = |A| = l, and . Both J and l are constants that do not diverge with the sample sizes. In addition to the assumptions in
Theorem 7, assume that the p-CC’s of A1, …, A
are separated by some margin g > 0; in other words,
In addition, assume that m1, m2, n1, n2 satisfy the stated lower bound, where the constant is as defined in Theorem 7. Then with probability at least 1 − J(δ3 + δ4 + δ5),
for all i = 1, …, J − 1. That is, s-CC ranks A1, …, AJ
the same as p-CC does.
Remark 9
If the sample size ratio m/n (= m1/n1) is far from π0/π1, we cannot expect the concentration result in Theorem 7 to hold. The rationale is as follows: if we replace the trained scoring function by the density ratio p1A(·)/p0A(·) (think of m and n as extremely large), then s-CC is based on the classifier that thresholds this ratio at the sample proportion ratio m/n. In contrast, p-CC is based on the oracle classifier, which thresholds the ratio at π0/π1. When m1/n1 is far from π0/π1, clearly the classification errors of these two classifiers would not be close, so the desired concentration would fail. As Theorem 7 is a cornerstone of the ranking consistency between s-CC and p-CC, we conclude that the classical criterion is not robust to sampling bias.
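This non-robustness can be seen from a schematic (not the paper's exact display) form of the sample-level criterion. If the left-out class sample sizes are used to weight the two class-specific error estimates, then

```latex
\widehat{\mathrm{CC}} \;=\; \frac{m}{m+n}\,\widehat{R}_0\!\left(\widehat{\phi}\right)
\;+\; \frac{n}{m+n}\,\widehat{R}_1\!\left(\widehat{\phi}\right),
```

which tracks the population risk R(·) = π0R0(·) + π1R1(·) only when m/(m+n) ≈ π0 and n/(m+n) ≈ π1; under sampling bias the two class-specific errors are mixed with the wrong weights, so no concentration around R can occur.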
Ranking property of s-NPC
To establish ranking agreement between s-NPC and p-NPC, an essential step is to develop a concentration result of NPC around its population-level counterpart defined in (4). Recall that s-NPC is the estimated type II error of an NP classifier whose threshold is determined by the NP umbrella classification algorithm. We always assume that the cumulative distribution function of the score of a class 0 observation (i.e., under P0) is continuous.
Lemma 10
Let α, δ1, δ2 ∈ (0, 1). Under the stated sample size condition, the classifier satisfies, with probability at least 1 − δ1 − δ2, the displayed bounds, where ⌈z⌉ denotes the smallest integer larger than or equal to z. Moreover, under the strengthened sample size condition, the corresponding two-sided bound holds.

The next proposition follows from Lemma 10 and a minor modification to the proof of Proposition 2.4 in Zhao et al. (2016); the resulting upper bound is the same as that for the excess type II error in Zhao et al. (2016).
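To make the role of the order statistics concrete, here is a minimal sketch of the threshold-rank selection used by NP umbrella algorithms (in the spirit of Tong et al., 2018; the function name and interface are our own, not the paper's code): given m left-out class-0 scores, it returns the smallest order-statistic rank k whose violation probability (the chance that the resulting type I error exceeds α) is at most δ1.

```python
from math import comb

def np_threshold_rank(m, alpha, delta1):
    """Smallest order-statistic rank k such that thresholding at the k-th
    smallest of m left-out class-0 scores keeps P(type I error > alpha)
    at most delta1. Returns None if no rank is feasible (m too small)."""
    def violation(k):
        # P(type I error > alpha) for the k-th order-statistic threshold.
        return sum(comb(m, j) * (1 - alpha) ** j * alpha ** (m - j)
                   for j in range(k, m + 1))
    for k in range(1, m + 1):
        if violation(k) <= delta1:
            return k
    return None

# With m = 10 left-out class-0 scores, alpha = .5, delta1 = .05, the rule
# picks the 9th smallest score as the threshold.
print(np_threshold_rank(10, 0.5, 0.05))  # → 9
```

The violation probability decreases in k, so larger ranks (more conservative thresholds) are safer; when even k = m is infeasible, the class 0 sample is too small for the requested (α, δ1) pair.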
Proposition 11
Let α, δ1, δ2 ∈ (0, 1). Assume that the density ratio s(·) = p1A(·)/p0A(·) satisfies the margin assumption and the detection condition (each of the stated order and constant), both with respect to the relevant class-conditional distribution. If the sample size condition holds, then the stated bound holds with probability at least 1 − δ1 − δ2. Propositions 6 and 11 lead to the following result.
Theorem 12
Let α, δ1, δ2, δ3, δ4 ∈ (0, 1), and l = |A|. In addition to the assumptions of Propositions 6 and 11, assume the stated sample size conditions. Then, with probability at least 1 − (δ1 + δ2 + δ3 + δ4),
for some positive constant. Under smoothness, regularity, and sample size conditions, Theorem 12 shows the concentration of NPC around its population-level counterpart with probability at least 1 − (δ1 + δ2 + δ3 + δ4). The user-specified violation rate δ1 represents the uncertainty that the type I error of an NP classifier exceeds α, leading to underestimation; δ2 accounts for the possibility of unnecessarily stringent control of the type I error, which results in overestimation; δ3 accounts for the uncertainty in training the scoring function on a finite sample; δ4 represents the uncertainty of using left-out class 1 observations to estimate the type II error. Note that while δ1 is both an input parameter for the construction of s-NPC and a constraint on the sample sizes, the other parameters δ2, δ3, and δ4 only play the latter role. Like the constant C0 in Proposition 6, the generic constant in Theorem 12 can be made more explicit, but doing so would be too cumbersome.

Note that the upper bound in Theorem 12 involves the detection-condition order, while that in Theorem 7 does not. This is expected, as the detection condition is a condition for diminishing excess type II errors under the NP paradigm. Here we make some example simplifications to digest the bounds in Theorems 7 and 12. If we assume that m1, m2, n1, n2 are of order N (the total sample size), β = 2, and l = 1, then both criteria admit explicit high-probability upper bounds, and when there are not many points around the NP oracle decision boundary, so that the boundary is difficult to detect, the convergence rate of the upper bound is slower for NPC.

Although Theorems 7 and 12 both assume bounded supports in their conditions, we regard this as just a way to streamline the proofs. In Appendix F, we conduct a simulation study where features are generated from distributions with unbounded supports, and there is still clear concentration of CC and NPC.
Theorem 13
Let α, δ1, δ2, δ3, δ4 ∈ (0, 1), A1, …, AJ ⊆ {1, …, d}, |A1| = |A2| = ⋯ = |AJ| = l, with A1, …, AJ ordered by p-NPC. Both J and l are constants that do not diverge with the sample sizes. In addition to the assumptions in
Theorem 12, assume that the p-NPC’s of A1, …, AJ
are separated by some margin g > 0; in other words,
In addition, assume that m1, m2, n1, n2 satisfy the stated lower bound, where the constant is as defined in Theorem 12. Then with probability at least 1 − J(δ1 + δ2 + δ3 + δ4),
for all i = 1, …, J − 1. That is, s-NPC ranks A1, …, AJ
the same as p-NPC does.
Remark 14
The conclusion in
Theorem 13
also holds under sampling bias, i.e., when the sample sizes n (of class 1) and m (of class 0) do not reflect the population proportions π0
and π1.

Here we offer some intuition about the robustness of s-NPC against sampling bias. Note that the objective and constraint of the NP paradigm only involve the class-conditional feature distributions, not the class proportions. Hence, p-NPC does not rely on the class proportions. Furthermore, in s-NPC, each class-conditional density is estimated separately within each class, so s-NPC does not depend on the class proportions either. It is also worth noting that the proof of Theorem 13 (in Appendix) does not use the relation between the ratio of class sizes in the sample and that in the population. Moreover, we derive partial consistency results for s-CC and s-NPC in Appendix B, where we show that if the top J feature subsets and the other feature subsets have p-CC or p-NPC values that differ by a margin, then s-CC or s-NPC can distinguish the top J feature subsets.
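This intuition can be illustrated with a toy sketch (ours, using Gaussian scores, not the paper's implementation): an s-NPC-style type II error estimate is computed from class-conditional samples alone, so replicating the class 0 sample, which changes the class ratio drastically, leaves the criterion unchanged, because only a class-0 quantile and a class-1 exceedance fraction enter.

```python
import random

def s_npc_like(scores0, scores1, k):
    # Threshold at the k-th smallest left-out class-0 score; the criterion is
    # the fraction of class-1 scores at or below it (estimated type II error).
    # Class proportions never enter the computation.
    t = sorted(scores0)[k - 1]
    return sum(s <= t for s in scores1) / len(scores1)

rng = random.Random(0)
s0 = [rng.gauss(0, 1) for _ in range(200)]   # class-0 scores
s1 = [rng.gauss(2, 1) for _ in range(200)]   # class-1 scores

a = s_npc_like(s0, s1, 190)        # threshold near the 95th class-0 percentile
b = s_npc_like(s0 * 5, s1, 950)    # class 0 inflated 5x: same quantile, same value
```

Because the 950th smallest of the five-fold replicated class-0 sample equals the 190th smallest of the original sample, the two calls return exactly the same value.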
Simulation studies
This section contains six simulation studies to verify the practical performance of s-CC and s-NPC in ranking features and to compare s-CC and s-NPC against multiple commonly used criteria for marginal and joint feature ranking. Table 2 summarizes the designs and purposes of the six studies.
Table 2:
Designs and purposes of six simulation studies
| Study | Distribution | N | d | Sampling bias | Correlated features | Purpose |
|---|---|---|---|---|---|---|
| S1 | Gaussian | 400, 1000 | 30 | No | No | Verify s-CC and s-NPC |
| S2 | Chi-squared | 400, 1000 | 30 | No | No | Verify s-CC and s-NPC |
| S3 | Gaussian | 1000 | 30 | Yes | No | Compare s-CC and s-NPC with SVM variants |
| S3 | Gaussian | 1000 | 30 | Yes | Yes | Compare s-CC and s-NPC with multivariate feature ranking criteria |
| S4 | Gaussian | 400 | 500 | No | No | Verify s-CC and s-NPC |
| S5 | Gaussian | 200 | 10,000 | No | No | Verify s-CC and s-NPC; compare them with SVM variants and multivariate feature ranking criteria |
| S6 | (Mixture) Gaussian | 400 | 2 | No | No | Compare s-CC and s-NPC with marginal feature ranking criteria |
N: sample size (number of observations)
d: number of features
In detail, we verify the performance of s-CC and s-NPC in ranking features under low-dimensional settings with the class-conditional distributions as Gaussian (studies S1 & S3) or chi-squared (study S2), as well as under high-dimensional settings (studies S4–S5). Furthermore, in studies S3 and S5, we compare s-CC and s-NPC with three commonly used measures of feature importance in a multivariate classifier trained by the RF algorithm—the SHAP value (Lundberg and Lee, 2017) and two feature importance measures (the mean decrease in accuracy and the mean decrease in Gini index). Moreover, we design study S6 to demonstrate the advantages of s-CC and s-NPC over four commonly used measures for marginal feature ranking—the Pearson correlation, the distance correlation (Székely et al., 2009), the two-sample t test, and the two-sample Wilcoxon rank-sum test. In addition, motivated by (Lin, 2002; Guyon et al., 2002), we implement variants of s-CC and s-NPC based on classifiers trained by the support vector machine (SVM) algorithm (Appendix C), and we show in study S3 that these variants, unlike s-NPC, are not robust to sampling bias.

In all the simulation studies, we set the number of random splits B = 11 (which we show in Figure H.6 in Appendix to be a reasonable choice) for s-CC and s-NPC, as well as their SVM variants, so that we can obtain reasonably stable criteria and still finish thousands of simulation runs within a reasonable time. Regarding the RF algorithm, we use the randomForest() function in the R package randomForest; the number of trees is set to ntree=500 by default.
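As a schematic of how a split-based criterion is stabilized (our own simplified sketch; the paper splits each class separately, which we omit here), one can average the criterion over B random splits of the sample into a training half and an evaluation half:

```python
import random
import statistics

def averaged_criterion(data, criterion, B=11, seed=0):
    # Average a sample-splitting criterion over B random splits; B = 11 is
    # the number of splits used in the simulation studies.
    rng = random.Random(seed)
    values = []
    for _ in range(B):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        values.append(criterion(shuffled[:half], shuffled[half:]))
    return statistics.fmean(values)
```

Here `criterion(train, holdout)` is a hypothetical callable that stands in for fitting a per-feature scoring function on the training half and estimating a prediction error on the held-out half.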
Figure H.6:
Distributions of s-CC or s-NPC (with varying α) values of one feature, with varying B (number of random splits) and N (sample size) across 1000 independent simulations. Based on these distributions, B = 11 is a reasonable choice.
Ranking low-dimensional features at the sample level
We first demonstrate the performance of s-CC and s-NPC in ranking features when d, the number of features, is much smaller than N (the total sample size). We design simulation studies S1 and S2 to support our theoretical results in Theorems 8 and 13 in the absence of sampling bias. Using simulation study S3, we demonstrate that s-NPC is robust to sampling bias, while s-CC and the SVM variants of s-CC and s-NPC are not; furthermore, we show that the RF algorithm’s three feature ranking criteria (mean decrease in accuracy, mean decrease in Gini index, and SHAP value) cannot capture features’ marginal ranking in the presence of feature correlations.

There is no sampling bias in simulation studies S1 and S2. In study S1, we generate data from the following two-class Gaussian model with d = 30 features, among which we set the first s = 10 features to be informative (a feature is informative if and only if it has different marginal distributions in the two classes).
where μ11, …, μ30 are independently and identically drawn and then held fixed, and Σ = 4I30. In terms of the population-level criteria p-CC and p-NPC, a clear gap exists between the first 10 informative features and the rest, yet the 10 informative features themselves have increasing criterion values with no obvious gaps. That is, the first 10 features have true ranks going down from 1 to 10, and the rest of the features have a tied true rank of 20.5, i.e., the average of 11, …, 30[5]. We simulate 1000 samples of size N = 400[6] or 1000 from the above model. We apply s-CC (8) and s-NPC with δ1 = .05 and four α levels .05, .10, .20, and .30 (11), five criteria in total, to each sample to rank the 30 features. That is, for each feature, we obtain 1000 ranks by each criterion. We summarize the average rank of each feature by each criterion in Tables H.1 and H.2 in Appendix, and we plot the distribution of ranks of each feature in Figures 2 and 3. The results show that all criteria clearly distinguish the first 10 informative features from the rest. For s-NPC, we observe that its ranking is more variable for a smaller α (e.g., 0.05). This is expected because, when α becomes smaller, the threshold in the NP classifiers has an inevitably larger variance, leading to a more variable type II error estimate, i.e., s-NPC. As the sample size N increases from 400 (Table H.1 and Figure 2) to 1000 (Table H.2 and Figure 3), all criteria achieve greater agreement with the true ranks.
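To see the ranking mechanics end to end, here is a toy version of marginal error-based ranking on S1-like data (our own sketch: the mean shifts are set to an increasing grid rather than drawn as in (15), and a simple midpoint-threshold classifier stands in for the paper's nonparametric scoring):

```python
import random
import statistics

random.seed(1)
d, s, n = 30, 10, 500
# Informative features get increasing mean shifts; the rest are pure noise.
mu = [0.4 * (j + 1) if j < s else 0.0 for j in range(d)]
X0 = [[random.gauss(0, 2) for j in range(d)] for _ in range(n)]
X1 = [[random.gauss(mu[j], 2) for j in range(d)] for _ in range(n)]

def marginal_error(j):
    # Balanced classification error of thresholding feature j at the midpoint
    # of its two class means (a crude stand-in for a CC-type criterion).
    m0 = statistics.fmean(x[j] for x in X0)
    m1 = statistics.fmean(x[j] for x in X1)
    t = (m0 + m1) / 2
    err0 = sum(x[j] > t for x in X0) / n      # class-0 points called class 1
    err1 = sum(x[j] <= t for x in X1) / n     # class-1 points called class 0
    return 0.5 * err0 + 0.5 * err1

ranking = sorted(range(d), key=marginal_error)   # smallest error ranked first
```

Features with large mean shifts rise to the top of `ranking`, while the noise features hover near error .5 and share the bottom ranks, mirroring the tied true rank of the uninformative features.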
Table H.1:
Average ranks of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 400 under the Gaussian setting (15)—simulation study S1.
Features 1–10:

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 2.19 | 2.03 | 3.45 | 4.94 | 5.60 | 6.28 | 5.80 | 7.05 | 8.84 | 8.82 |
| s-NPC (α = .05) | 2.17 | 3.73 | 4.04 | 6.43 | 5.37 | 5.11 | 6.21 | 9.35 | 8.97 | 8.54 |
| s-NPC (α = .10) | 1.91 | 4.43 | 4.34 | 3.26 | 5.99 | 6.93 | 6.39 | 7.17 | 6.89 | 7.85 |
| s-NPC (α = .20) | 2.39 | 3.67 | 3.50 | 3.51 | 6.35 | 4.70 | 5.91 | 7.82 | 8.84 | 8.32 |
| s-NPC (α = .30) | 1.96 | 2.54 | 3.86 | 4.40 | 5.65 | 5.21 | 6.53 | 7.14 | 8.67 | 9.04 |

Features 11–20:

| | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 19.80 | 21.75 | 21.36 | 16.34 | 18.79 | 21.53 | 22.60 | 18.89 | 17.26 | 23.31 |
| s-NPC (α = .05) | 15.38 | 21.58 | 22.65 | 21.47 | 17.09 | 21.30 | 20.79 | 21.65 | 20.96 | 18.15 |
| s-NPC (α = .10) | 20.66 | 23.62 | 18.73 | 23.01 | 21.69 | 19.03 | 23.05 | 18.83 | 20.77 | 20.33 |
| s-NPC (α = .20) | 20.81 | 17.65 | 21.73 | 21.67 | 17.50 | 21.30 | 20.30 | 22.75 | 18.18 | 23.84 |
| s-NPC (α = .30) | 16.72 | 22.23 | 19.93 | 19.27 | 19.80 | 21.97 | 19.29 | 19.92 | 18.95 | 19.75 |

Features 21–30:

| | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 19.34 | 19.63 | 22.10 | 17.26 | 22.18 | 25.03 | 22.69 | 18.33 | 20.51 | 21.31 |
| s-NPC (α = .05) | 21.47 | 16.29 | 21.36 | 20.56 | 19.39 | 20.32 | 19.84 | 21.70 | 23.34 | 19.79 |
| s-NPC (α = .10) | 19.62 | 18.80 | 19.10 | 21.55 | 19.98 | 22.81 | 18.93 | 18.44 | 19.91 | 20.95 |
| s-NPC (α = .20) | 21.59 | 15.89 | 20.22 | 20.88 | 21.44 | 18.87 | 21.03 | 21.77 | 20.49 | 22.06 |
| s-NPC (α = .30) | 20.50 | 21.82 | 22.03 | 18.52 | 20.88 | 21.26 | 21.26 | 21.13 | 23.13 | 21.65 |
Table H.2:
Average ranks of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 1000 under the Gaussian setting (15)—simulation study S1.
Features 1–10:

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 2.21 | 2.28 | 2.73 | 4.09 | 4.64 | 6.14 | 6.93 | 7.93 | 8.71 | 9.34 |
| s-NPC (α = .05) | 2.55 | 2.60 | 4.21 | 4.44 | 4.28 | 6.43 | 6.48 | 6.99 | 8.22 | 8.80 |
| s-NPC (α = .10) | 1.97 | 2.76 | 2.72 | 4.49 | 4.26 | 6.63 | 6.74 | 7.67 | 8.72 | 9.04 |
| s-NPC (α = .20) | 1.36 | 2.35 | 3.23 | 4.19 | 4.67 | 5.93 | 7.02 | 8.24 | 8.75 | 9.24 |
| s-NPC (α = .30) | 1.85 | 2.73 | 2.71 | 3.58 | 5.18 | 6.11 | 6.80 | 8.04 | 9.01 | 8.99 |

Features 11–20:

| | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 18.65 | 18.19 | 20.78 | 19.92 | 23.99 | 18.60 | 19.87 | 22.16 | 21.70 | 21.61 |
| s-NPC (α = .05) | 22.07 | 20.25 | 21.63 | 18.63 | 17.00 | 22.16 | 19.80 | 23.05 | 19.68 | 20.84 |
| s-NPC (α = .10) | 20.37 | 19.67 | 22.67 | 20.15 | 19.31 | 19.58 | 21.61 | 18.53 | 20.51 | 22.49 |
| s-NPC (α = .20) | 19.10 | 20.26 | 18.08 | 20.69 | 22.15 | 22.65 | 18.19 | 21.55 | 23.79 | 20.48 |
| s-NPC (α = .30) | 18.19 | 19.32 | 20.80 | 16.88 | 22.97 | 21.70 | 19.81 | 23.49 | 19.24 | 20.95 |

Features 21–30:

| | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 20.73 | 21.62 | 20.33 | 19.17 | 22.00 | 19.08 | 20.81 | 20.00 | 21.02 | 19.76 |
| s-NPC (α = .05) | 20.55 | 17.53 | 20.00 | 18.30 | 22.76 | 22.20 | 18.84 | 22.09 | 20.14 | 22.47 |
| s-NPC (α = .10) | 19.15 | 20.34 | 20.38 | 20.82 | 21.23 | 21.27 | 21.80 | 21.03 | 18.29 | 20.81 |
| s-NPC (α = .20) | 20.59 | 18.04 | 21.27 | 19.80 | 21.80 | 20.29 | 19.80 | 23.33 | 19.71 | 18.45 |
| s-NPC (α = .30) | 20.39 | 20.00 | 21.47 | 20.23 | 21.91 | 21.71 | 18.88 | 22.39 | 19.32 | 20.35 |
Figure 2:
Rank distributions of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 400 under the Gaussian setting (15)—simulation study S1.
Figure 3:
Rank distributions of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 1000 under the Gaussian setting (15)—simulation study S1.
In study S2, we generate data from the following two-class Chi-squared distributions of d = 30 features, among which we still set the first s = 10 features to be informative.
Similar to the previous Gaussian setting, the first 10 features have true ranks going down from 1 to 10, and the rest of the features are tied with a true rank of 20.5. We simulate 1000 samples of size N = 400 or 1000 from this model, and we apply s-CC (8) and s-NPC with δ1 = .05 and four α levels .05, .10, .20, and .30 (11), five criteria in total, to each sample to rank the 30 features. We summarize the average rank of each feature by each criterion in Tables H.3 and H.4 (in Appendix), and we plot the distribution of ranks of each feature in Figures H.3 and H.4 (in Appendix). The results and conclusions are consistent with those under the Gaussian setting.
Table H.3:
Average ranks of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 400 under the Chi-squared setting (16)—simulation study S2.
Features 1–10:

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 1.10 | 2.14 | 2.85 | 3.97 | 4.98 | 5.99 | 6.99 | 7.98 | 9.00 | 10.01 |
| s-NPC (α = .05) | 1.40 | 2.75 | 3.79 | 3.73 | 3.80 | 5.83 | 7.38 | 7.43 | 9.32 | 10.78 |
| s-NPC (α = .10) | 1.15 | 2.27 | 2.76 | 4.01 | 4.96 | 5.91 | 6.98 | 7.98 | 9.00 | 10.39 |
| s-NPC (α = .20) | 1.15 | 2.01 | 2.93 | 3.97 | 4.99 | 5.98 | 7.00 | 7.99 | 8.99 | 10.10 |
| s-NPC (α = .30) | 1.08 | 2.08 | 2.99 | 4.00 | 4.90 | 5.98 | 6.99 | 7.98 | 9.00 | 10.01 |

Features 11–20:

| | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 20.56 | 19.80 | 22.54 | 21.97 | 20.63 | 20.80 | 18.33 | 18.26 | 20.41 | 21.65 |
| s-NPC (α = .05) | 21.00 | 17.83 | 19.21 | 21.03 | 19.86 | 19.81 | 22.79 | 21.50 | 22.56 | 23.24 |
| s-NPC (α = .10) | 17.67 | 24.46 | 18.36 | 16.72 | 22.22 | 20.05 | 23.21 | 19.76 | 21.00 | 20.20 |
| s-NPC (α = .20) | 19.10 | 25.31 | 21.22 | 19.55 | 20.19 | 20.17 | 20.45 | 19.95 | 21.28 | 19.41 |
| s-NPC (α = .30) | 19.91 | 22.46 | 20.62 | 21.77 | 20.49 | 18.79 | 19.72 | 19.77 | 22.06 | 18.42 |

Features 21–30:

| | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 21.27 | 20.02 | 19.08 | 21.68 | 21.05 | 19.50 | 19.14 | 22.41 | 21.63 | 19.27 |
| s-NPC (α = .05) | 18.84 | 20.55 | 20.00 | 21.93 | 19.52 | 19.11 | 18.81 | 21.81 | 20.97 | 18.44 |
| s-NPC (α = .10) | 21.05 | 20.11 | 23.74 | 20.93 | 16.57 | 20.58 | 22.53 | 19.00 | 23.53 | 17.88 |
| s-NPC (α = .20) | 20.98 | 19.77 | 18.05 | 18.58 | 20.08 | 23.26 | 19.18 | 20.78 | 23.61 | 18.98 |
| s-NPC (α = .30) | 18.93 | 22.03 | 23.31 | 19.47 | 20.75 | 22.04 | 20.49 | 19.23 | 20.35 | 19.37 |
Table H.4:
Average ranks of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 1000 under the Chi-squared setting (16)—simulation study S2.
Features 1–10:

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 1.05 | 2.68 | 2.30 | 3.99 | 4.99 | 5.99 | 7.00 | 8.00 | 9.00 | 10.00 |
| s-NPC (α = .05) | 1.44 | 2.38 | 2.28 | 3.99 | 4.95 | 5.99 | 7.22 | 7.77 | 8.99 | 10.12 |
| s-NPC (α = .10) | 1.05 | 2.70 | 2.28 | 3.99 | 4.99 | 5.99 | 7.01 | 7.99 | 9.00 | 10.01 |
| s-NPC (α = .20) | 1.07 | 2.47 | 3.06 | 3.41 | 5.00 | 6.00 | 7.00 | 8.00 | 9.00 | 10.00 |
| s-NPC (α = .30) | 1.05 | 2.20 | 3.17 | 3.63 | 4.97 | 5.98 | 7.00 | 8.00 | 9.00 | 10.00 |

Features 11–20:

| | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 19.87 | 20.05 | 24.51 | 22.30 | 23.43 | 23.26 | 18.01 | 18.22 | 23.86 | 19.85 |
| s-NPC (α = .05) | 23.85 | 21.45 | 18.73 | 18.45 | 18.49 | 17.36 | 20.65 | 21.48 | 20.10 | 18.70 |
| s-NPC (α = .10) | 24.12 | 17.48 | 20.12 | 19.22 | 21.02 | 22.58 | 21.03 | 17.97 | 17.97 | 20.29 |
| s-NPC (α = .20) | 20.19 | 24.91 | 20.15 | 22.55 | 21.36 | 18.35 | 19.82 | 20.66 | 20.76 | 19.32 |
| s-NPC (α = .30) | 22.50 | 22.12 | 23.15 | 23.31 | 20.11 | 22.51 | 19.61 | 18.88 | 19.52 | 19.91 |

Features 21–30:

| | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 19.54 | 18.13 | 19.57 | 20.90 | 20.24 | 19.89 | 19.30 | 18.04 | 18.10 | 22.94 |
| s-NPC (α = .05) | 19.05 | 20.94 | 18.09 | 21.05 | 19.02 | 18.04 | 23.57 | 22.26 | 23.89 | 24.69 |
| s-NPC (α = .10) | 19.99 | 18.39 | 17.08 | 20.04 | 20.31 | 18.84 | 22.61 | 22.38 | 23.93 | 24.63 |
| s-NPC (α = .20) | 20.29 | 17.02 | 20.71 | 19.12 | 17.70 | 18.30 | 20.29 | 22.33 | 24.30 | 21.90 |
| s-NPC (α = .30) | 20.54 | 19.42 | 16.57 | 18.70 | 20.48 | 20.09 | 22.57 | 17.09 | 19.93 | 22.98 |
Figure H.3:
Rank distributions of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 400 under the Chi-squared setting (16)—simulation study S2.
Figure H.4:
Rank distributions of the d = 30 features by s-CC or s-NPC (with varying α) with sample size N = 1000 under the Chi-squared setting (16)—simulation study S2.
Next, we design study S3 with sampling bias, i.e., the two classes have proportions in the sample different from those in the population. We use the following Gaussian setting:
that is, class 1 has a proportion .5 in the population but is undersampled with probability .1 in the sample. Same as in our first simulation study, μ11, …, μ30 are independently and identically drawn and then held fixed, and we still set the diagonal entries of Σ to 4. What is different here is that we add a scenario with feature correlations: conditional on each class, features i and j have a correlation ρij = .9^|i−j|, i, j = 1, …, 30; that is, Σ is a Toeplitz-type matrix with the (i, j)-th entry equal to .9^|i−j| × 4. Here features 1 to 10 still have their true ranks going down from 1 to 10, while the other features still have a tied true rank of 20.5. We simulate 1000 samples of size N = 1000 from this model, and we apply s-CC (8); s-NPC with δ1 = .05 and four α levels .05, .10, .20, and .30 (11); their corresponding SVM variants (Appendix C); and the RF algorithm’s feature importance measures (mean decrease in accuracy and mean decrease in Gini index) and SHAP value—13 criteria in total—to each sample to rank the 30 features. In Appendix, we summarize the average rank of each feature by each criterion in Table H.5 (for the uncorrelated-feature scenario) and Table H.6 (for the correlated-feature scenario), and we plot the distribution of ranks of each feature in Figure H.5 (for the uncorrelated-feature scenario) and Figure 4 (for the correlated-feature scenario). The results show that, in the presence of sampling bias, only s-NPC can distinguish the first 10 informative features from the rest, while s-CC and the SVM variants of s-CC and s-NPC cannot. These results highlight the unique robustness of s-NPC to sampling bias, an advantage that even its SVM variant does not share. The reason is that sampling bias affects the training of the SVM scoring function. Moreover, s-CC, s-NPC, and their SVM variants—all marginal feature ranking criteria—are unaffected by feature correlations, as expected.
In contrast, the RF algorithm’s three feature ranking criteria, which are based on multivariate classifiers and thus not marginal, cannot accurately capture the features’ marginal ranking in the presence of feature correlations (Figure 4(c)).
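A back-of-the-envelope sketch (ours, with a single Gaussian feature standing in for the fitted score; not the paper's CC or NPC computation) shows why an error-rate criterion inherits the sampling bias while a type-II-error-at-fixed-quantile criterion does not:

```python
import random

rng = random.Random(2)
# Balanced population; each class-1 point is kept with probability .1,
# mimicking the undersampling design of study S3.
scores0 = [rng.gauss(0, 1) for _ in range(5000)]                       # class 0, all kept
scores1 = [rng.gauss(2, 1) for _ in range(5000) if rng.random() < 0.1]  # class 1, thinned

t = 0.5                                               # an arbitrary fixed threshold
err0 = sum(x > t for x in scores0) / len(scores0)     # class-0 error at t
err1 = sum(x <= t for x in scores1) / len(scores1)    # class-1 error at t
p1 = len(scores1) / (len(scores0) + len(scores1))
cc_sample = (1 - p1) * err0 + p1 * err1               # weights from the biased sample
cc_population = 0.5 * err0 + 0.5 * err1               # weights the population would use

# NPC-style: type II error at the ~95th percentile of class-0 scores;
# no class proportions appear, so thinning class 1 does not distort it.
t_np = sorted(scores0)[int(0.95 * len(scores0)) - 1]
npc = sum(x <= t_np for x in scores1) / len(scores1)
```

Because class 0 dominates the biased sample, `cc_sample` overweights the class-0 error relative to `cc_population`, whereas `npc` depends only on a class-0 quantile and a class-1 fraction.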
Table H.5:
Average ranks of the d = 30 features by s-CC, s-NPC (with varying α), or their SVM variants with N = 1000 under the Gaussian setting with sampling bias (17) and without feature correlations—simulation study S3. Undesirable average ranks—i.e., the average ranks of the informative (first 10) features greater than 10 and the average ranks of the uninformative (latter 20) features smaller than 10—are underlined.
Features 1–10:

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 2.10 | 4.05 | 6.06 | 5.08 | 6.75 | 7.89 | 8.38 | 15.85 | 13.04 | 14.30 |
| s-NPC (α = .05) | 3.30 | 3.20 | 3.53 | 3.69 | 5.54 | 6.22 | 7.20 | 8.04 | 6.87 | 7.67 |
| s-NPC (α = .10) | 2.50 | 2.08 | 4.30 | 4.79 | 4.87 | 5.93 | 7.79 | 7.08 | 7.74 | 7.98 |
| s-NPC (α = .20) | 2.17 | 3.04 | 2.82 | 4.75 | 5.41 | 6.52 | 6.55 | 6.42 | 8.32 | 8.99 |
| s-NPC (α = .30) | 1.91 | 2.29 | 3.76 | 4.75 | 5.28 | 5.89 | 7.40 | 8.11 | 8.30 | 7.30 |
| s-CC-SVM | 2.38 | 2.43 | 4.12 | 5.26 | 7.00 | 8.73 | 10.04 | 11.40 | 12.44 | 13.45 |
| s-NPC-SVM (α = .05) | 2.02 | 2.52 | 3.86 | 4.59 | 5.92 | 5.26 | 6.15 | 8.50 | 9.12 | 12.08 |
| s-NPC-SVM (α = .10) | 1.84 | 3.58 | 2.79 | 4.79 | 5.01 | 5.86 | 7.13 | 8.98 | 13.22 | 13.35 |
| s-NPC-SVM (α = .20) | 2.01 | 3.41 | 2.92 | 5.06 | 5.09 | 9.40 | 5.88 | 8.84 | 14.66 | 15.12 |
| s-NPC-SVM (α = .30) | 2.02 | 3.26 | 3.78 | 5.36 | 8.41 | 6.66 | 7.28 | 11.48 | 14.17 | 19.34 |

Features 11–20:

| | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 16.13 | 18.25 | 20.37 | 22.20 | 23.07 | 20.88 | 17.31 | 18.61 | 18.83 | 20.15 |
| s-NPC (α = .05) | 21.19 | 21.63 | 17.19 | 21.28 | 19.35 | 17.98 | 18.37 | 20.21 | 19.73 | 21.66 |
| s-NPC (α = .10) | 22.31 | 21.38 | 19.78 | 22.60 | 18.04 | 17.45 | 16.55 | 19.62 | 17.76 | 21.27 |
| s-NPC (α = .20) | 23.22 | 20.31 | 18.75 | 18.42 | 18.15 | 18.71 | 21.65 | 19.06 | 16.69 | 21.55 |
| s-NPC (α = .30) | 23.23 | 19.66 | 19.66 | 18.36 | 19.32 | 20.16 | 15.93 | 15.84 | 22.74 | 21.39 |
| s-CC-SVM | 14.45 | 15.46 | 16.47 | 17.48 | 18.50 | 19.50 | 20.53 | 21.54 | 22.55 | 23.57 |
| s-NPC-SVM (α = .05) | 17.78 | 20.17 | 20.39 | 18.28 | 19.14 | 20.46 | 18.75 | 20.50 | 21.01 | 19.25 |
| s-NPC-SVM (α = .10) | 18.17 | 14.39 | 19.19 | 17.82 | 19.96 | 22.91 | 21.36 | 16.34 | 20.66 | 19.80 |
| s-NPC-SVM (α = .20) | 16.62 | 17.10 | 19.87 | 20.25 | 22.43 | 17.03 | 16.44 | 21.85 | 20.60 | 20.53 |
| s-NPC-SVM (α = .30) | 21.64 | 12.28 | 19.18 | 23.97 | 18.91 | 17.76 | 17.41 | 17.58 | 20.06 | 22.68 |

Features 21–30:

| | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 17.60 | 17.09 | 21.92 | 18.29 | 17.58 | 20.84 | 22.87 | 14.73 | 18.52 | 16.28 |
| s-NPC (α = .05) | 24.11 | 18.94 | 21.15 | 18.50 | 19.86 | 23.07 | 18.39 | 22.99 | 21.25 | 22.86 |
| s-NPC (α = .10) | 24.67 | 20.20 | 20.06 | 21.97 | 22.92 | 21.94 | 16.97 | 23.18 | 18.36 | 22.93 |
| s-NPC (α = .20) | 17.93 | 22.79 | 21.60 | 24.64 | 21.47 | 21.45 | 22.99 | 19.35 | 21.39 | 19.88 |
| s-NPC (α = .30) | 23.30 | 24.46 | 22.10 | 21.47 | 21.08 | 22.45 | 17.56 | 22.86 | 19.85 | 18.58 |
| s-CC-SVM | 24.57 | 25.56 | 26.52 | 27.16 | 26.87 | 22.30 | 19.29 | 13.19 | 6.00 | 6.24 |
| s-NPC-SVM (α = .05) | 19.82 | 21.72 | 21.85 | 20.42 | 19.86 | 23.08 | 19.36 | 20.20 | 21.09 | 21.84 |
| s-NPC-SVM (α = .10) | 20.73 | 19.62 | 22.84 | 22.10 | 19.65 | 18.14 | 21.00 | 19.09 | 21.36 | 23.34 |
| s-NPC-SVM (α = .20) | 20.11 | 21.57 | 16.15 | 20.37 | 20.96 | 19.04 | 20.47 | 18.85 | 21.67 | 20.69 |
| s-NPC-SVM (α = .30) | 17.97 | 18.64 | 18.00 | 19.46 | 19.88 | 20.83 | 19.39 | 17.25 | 20.64 | 19.71 |
Table H.6:
Average ranks of the d = 30 features by s-CC, s-NPC (with varying α), their SVM variants, or the RF algorithm’s feature importance measures and SHAP value, with sample size N = 1000 under the Gaussian setting with sampling bias (17) and a Toeplitz-type feature covariance matrix: features i and j have a correlation ρij = .9^|i−j|, i, j = 1, …, 30—simulation study S3. Undesirable average ranks—i.e., the average ranks of the informative (first 10) features greater than 10 and the average ranks of the uninformative (latter 20) features smaller than 10—are underlined.
Features 1–10:

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 3.13 | 4.12 | 5.68 | 7.32 | 8.80 | 10.42 | 12.32 | 14.34 | 16.20 | 17.54 |
| s-NPC (α = .05) | 2.13 | 2.77 | 3.55 | 4.27 | 5.10 | 5.94 | 6.75 | 7.51 | 8.41 | 9.24 |
| s-NPC (α = .10) | 1.85 | 2.60 | 3.31 | 4.24 | 5.16 | 6.01 | 6.77 | 7.64 | 8.38 | 9.18 |
| s-NPC (α = .20) | 1.80 | 2.48 | 3.29 | 4.13 | 5.09 | 5.96 | 6.88 | 7.65 | 8.48 | 9.25 |
| s-NPC (α = .30) | 1.80 | 2.57 | 3.27 | 4.11 | 5.05 | 6.01 | 6.79 | 7.70 | 8.52 | 9.20 |
| s-CC-SVM | 2.83 | 3.75 | 4.84 | 6.24 | 7.52 | 8.85 | 10.27 | 11.52 | 12.74 | 13.80 |
| s-NPC-SVM (α = .05) | 2.03 | 2.77 | 3.50 | 4.37 | 5.26 | 6.31 | 7.52 | 8.79 | 10.15 | 12.61 |
| s-NPC-SVM (α = .10) | 1.85 | 2.63 | 3.50 | 4.36 | 5.35 | 6.51 | 7.81 | 9.06 | 10.89 | 13.59 |
| s-NPC-SVM (α = .20) | 1.96 | 2.72 | 3.60 | 4.57 | 5.81 | 7.05 | 8.68 | 10.29 | 11.85 | 14.76 |
| s-NPC-SVM (α = .30) | 2.09 | 3.05 | 3.86 | 5.16 | 6.25 | 8.05 | 9.35 | 11.24 | 13.61 | 15.26 |
| RF-MeanDecreaseAccuracy | 7.61 | 6.73 | 6.12 | 6.00 | 5.94 | 6.05 | 6.56 | 6.82 | 7.06 | 7.43 |
| RF-MeanDecreaseGini | 1.82 | 3.80 | 5.66 | 6.46 | 6.76 | 7.29 | 7.46 | 7.87 | 8.20 | 8.42 |
| RF-SHAP | 2.47 | 6.12 | 6.74 | 7.04 | 7.19 | 7.92 | 8.64 | 9.36 | 10.69 | 10.80 |

Features 11–20:

| | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 18.95 | 19.98 | 20.40 | 20.48 | 20.72 | 20.87 | 20.66 | 20.28 | 19.94 | 19.28 |
| s-NPC (α = .05) | 19.96 | 19.87 | 20.18 | 20.34 | 20.29 | 20.52 | 20.16 | 20.32 | 20.43 | 20.39 |
| s-NPC (α = .10) | 20.34 | 20.13 | 20.28 | 20.73 | 20.09 | 20.53 | 20.38 | 20.47 | 20.74 | 20.09 |
| s-NPC (α = .20) | 20.45 | 20.37 | 20.62 | 20.70 | 20.45 | 20.01 | 20.52 | 20.38 | 20.53 | 20.58 |
| s-NPC (α = .30) | 20.41 | 20.46 | 20.46 | 20.46 | 20.56 | 20.74 | 20.39 | 20.35 | 20.60 | 20.66 |
| s-CC-SVM | 14.86 | 15.86 | 16.88 | 17.89 | 18.91 | 19.93 | 20.92 | 21.96 | 22.96 | 23.93 |
| s-NPC-SVM (α = .05) | 18.19 | 19.27 | 19.53 | 19.71 | 19.39 | 19.98 | 19.79 | 20.07 | 20.28 | 20.28 |
| s-NPC-SVM (α = .10) | 18.50 | 19.23 | 19.21 | 19.30 | 19.64 | 19.69 | 19.88 | 19.78 | 20.05 | 20.09 |
| s-NPC-SVM (α = .20) | 18.02 | 18.87 | 18.82 | 19.14 | 19.88 | 19.77 | 19.23 | 19.96 | 19.68 | 19.63 |
| s-NPC-SVM (α = .30) | 17.66 | 17.92 | 18.76 | 18.79 | 18.63 | 19.46 | 19.45 | 19.52 | 19.66 | 19.32 |
| RF-MeanDecreaseAccuracy | 8.06 | 9.28 | 11.83 | 14.35 | 16.78 | 18.10 | 19.02 | 19.92 | 20.53 | 21.24 |
| RF-MeanDecreaseGini | 8.77 | 9.78 | 12.88 | 15.79 | 17.88 | 19.81 | 20.30 | 20.93 | 21.81 | 21.74 |
| RF-SHAP | 12.20 | 14.53 | 16.04 | 16.66 | 16.04 | 15.20 | 15.12 | 14.88 | 15.68 | 16.48 |

Features 21–30:

| | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 19.11 | 18.56 | 18.73 | 17.88 | 17.96 | 16.66 | 15.66 | 14.71 | 12.79 | 11.49 |
| s-NPC (α = .05) | 20.61 | 20.16 | 20.36 | 20.73 | 20.96 | 20.70 | 20.38 | 20.75 | 21.09 | 21.10 |
| s-NPC (α = .10) | 20.34 | 20.21 | 20.62 | 20.78 | 20.53 | 20.38 | 20.66 | 20.77 | 21.03 | 20.78 |
| s-NPC (α = .20) | 20.39 | 20.52 | 20.62 | 20.63 | 20.35 | 20.48 | 20.40 | 20.79 | 20.32 | 20.88 |
| s-NPC (α = .30) | 20.34 | 20.19 | 20.84 | 20.44 | 20.41 | 20.48 | 20.44 | 20.62 | 20.51 | 20.63 |
| s-CC-SVM | 24.73 | 25.37 | 25.34 | 24.59 | 22.94 | 20.03 | 16.53 | 12.82 | 9.18 | 7.00 |
| s-NPC-SVM (α = .05) | 20.04 | 20.59 | 20.14 | 20.43 | 20.70 | 20.57 | 20.63 | 20.34 | 20.63 | 21.14 |
| s-NPC-SVM (α = .10) | 20.32 | 20.18 | 20.09 | 20.18 | 20.09 | 20.23 | 20.43 | 20.67 | 20.89 | 20.99 |
| s-NPC-SVM (α = .20) | 19.77 | 19.97 | 19.82 | 19.73 | 20.37 | 20.18 | 20.02 | 20.29 | 20.12 | 20.43 |
| s-NPC-SVM (α = .30) | 19.56 | 19.87 | 19.67 | 20.00 | 19.98 | 19.72 | 19.88 | 19.95 | 20.00 | 19.27 |
| RF-MeanDecreaseAccuracy | 21.71 | 22.35 | 22.54 | 23.00 | 23.33 | 23.97 | 24.37 | 25.15 | 25.81 | 27.34 |
| RF-MeanDecreaseGini | 22.21 | 22.29 | 22.89 | 23.20 | 23.20 | 23.37 | 23.39 | 23.48 | 23.70 | 23.87 |
| RF-SHAP | 17.66 | 19.02 | 20.30 | 21.58 | 22.87 | 24.18 | 25.48 | 26.75 | 28.02 | 29.37 |
Figure H.5:
Rank distributions of the features under the Gaussian setting with d = 30, N = 1000, and sampling bias (17), and without feature correlations—simulation study S3.
Figure 4:
Rank distributions of the features under the Gaussian setting with d = 30, N = 1000, sampling bias (17), and a Toeplitz-type feature covariance matrix: features i and j have a correlation ρij = .9^|i−j|, i, j = 1, …, 30—simulation study S3.
Ranking high-dimensional features at the sample level
We next examine the performance of s-CC and s-NPC when d > N, using simulation studies S4 and S5. In study S4, we set d = 500 and N = 400. The generative model is the same as (15), with μ11, …, μ500 independently and identically drawn and then held fixed, and Σ = 4I500. Same as in the low-dimensional settings (Section 5.1), p-CC and p-NPC have a clear gap between the first 10 informative features and the rest of the features. In terms of both p-CC and p-NPC, the first 10 features have true ranks going down from 1 to 10, and the rest of the features are tied with a true rank of 255.5. We simulate 1000 samples of size N = 400 and apply s-CC (8) and s-NPC with δ1 = .05 and four α levels .05, .10, .20, and .30 (11) to each sample to rank the 500 features. We summarize the average rank of each feature by each criterion in Table H.7 (in Appendix), and we plot the distribution of ranks of each feature in Figure 5. The results show that ranking under this high-dimensional setting is more difficult than under the low-dimensional setting. However, s-CC and s-NPC with α = 0.2 or 0.3 still clearly distinguish the first 10 informative features from the rest, while s-NPC with α = 0.05 or 0.1 performs worse on features 8–10, demonstrating again that ranking becomes more difficult for s-NPC when α is smaller, in which case a larger sample size N is required.
Table H.7:
Average ranks of the first 30 features by s-CC or s-NPC (with varying α) with d = 500 and N = 400 under the Gaussian setting (15)—simulation study S4.
Features 1–10:

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 1.51 | 3.39 | 3.25 | 4.82 | 4.43 | 6.47 | 6.59 | 6.80 | 8.53 | 9.86 |
| s-NPC (α = .05) | 2.48 | 3.14 | 3.81 | 4.57 | 4.88 | 33.75 | 87.81 | 177.79 | 136.12 | 183.96 |
| s-NPC (α = .10) | 2.21 | 2.34 | 3.84 | 4.08 | 5.56 | 6.70 | 6.61 | 19.97 | 116.98 | 51.27 |
| s-NPC (α = .20) | 1.87 | 2.55 | 3.60 | 3.76 | 5.41 | 6.35 | 6.67 | 7.51 | 8.61 | 46.10 |
| s-NPC (α = .30) | 1.43 | 3.29 | 3.44 | 4.54 | 5.52 | 6.25 | 6.86 | 5.91 | 8.34 | 11.48 |

Features 11–20:

| | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 234.07 | 244.32 | 213.54 | 213.01 | 183.60 | 249.73 | 292.85 | 269.15 | 328.63 | 240.94 |
| s-NPC (α = .05) | 270.19 | 252.46 | 174.22 | 211.67 | 125.66 | 241.64 | 317.62 | 340.59 | 231.31 | 205.63 |
| s-NPC (α = .10) | 254.37 | 300.12 | 317.98 | 213.02 | 263.69 | 223.81 | 296.64 | 279.72 | 288.77 | 234.69 |
| s-NPC (α = .20) | 223.00 | 253.27 | 287.14 | 205.65 | 249.97 | 187.17 | 312.73 | 224.19 | 265.96 | 238.16 |
| s-NPC (α = .30) | 209.82 | 192.70 | 206.62 | 271.58 | 236.41 | 263.22 | 189.90 | 299.44 | 238.57 | 269.64 |

Features 21–30:

| | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|
| s-CC | 237.12 | 273.42 | 276.32 | 305.19 | 207.06 | 267.22 | 219.78 | 287.79 | 315.43 | 288.44 |
| s-NPC (α = .05) | 272.94 | 259.56 | 231.43 | 253.27 | 209.29 | 238.76 | 339.55 | 301.08 | 214.29 | 286.21 |
| s-NPC (α = .10) | 292.65 | 184.08 | 305.28 | 175.42 | 199.44 | 269.48 | 238.83 | 146.56 | 298.10 | 263.97 |
| s-NPC (α = .20) | 224.42 | 243.56 | 290.61 | 203.90 | 300.27 | 243.51 | 208.30 | 241.41 | 288.29 | 277.29 |
| s-NPC (α = .30) | 276.51 | 224.59 | 223.39 | 270.17 | 208.61 | 248.46 | 236.14 | 253.09 | 295.42 | 198.65 |
Figure 5:
Rank distributions of the first 30 features by s-CC or s-NPC with d = 500 and N = 400 under the Gaussian setting (15)—simulation study S4. The vertical axis is on the log10 scale.
In the even more high-dimensional study S5, we further decrease the N/d ratio by setting d = 10,000 and N = 200, a scenario that resembles many biomedical datasets. The generative model is the same as (15), with μ11, …, μ10,000 independently and identically drawn and then held fixed, and Σ = 4I10,000. Same as in study S4, the first 10 features have true ranks going down from 1 to 10, and the rest of the features are tied with a true rank of 5005.5. We simulate 1000 samples of size N = 200, and we apply s-CC (8); s-NPC with δ1 = .05 and three α levels .10, .20, and .30 (11); their SVM variants (Appendix C); and the RF algorithm’s two feature importance measures (the mean decrease in accuracy and the mean decrease in Gini index)—ten criteria in total—to each sample to rank the 10,000 features. Note that the SHAP value is inapplicable because it requires substantial computational time in this scenario. We summarize the average rank of each feature by each criterion in Table H.8 (in Appendix). The results show that s-NPC outperforms its SVM variant at each α in terms of the ranking accuracy for the top 10 features. Moreover, five criteria, including s-CC, s-NPC (α = .30), their SVM variants, and the RF algorithm’s mean decrease in accuracy, can distinguish the first 10 informative features by assigning them average ranks no greater than 10. The fact that s-NPC and its SVM variant with α = .10 or .20 perform worse is due to the small N/d ratio. This result echoes the importance of using a not-too-small α for s-NPC when N is not too large and N/d is small. Interestingly, the RF algorithm’s mean decrease in Gini index performs worse than its mean decrease in accuracy in ranking the 10th feature.
Table H.8:
Average ranks of the first 20 features by s-CC, s-NPC (with varying α), their SVM variants, or the RF algorithm’s feature importance measures (the mean decrease in accuracy and the mean decrease in Gini index) with d = 10,000 and N = 200 under the Gaussian setting (15)—simulation study S5. The average ranks of the top 10 features, if exceeding 10 (i.e., undesirable), are underlined.
Feature                               1          2          3          4          5
s-CC                               1.99       1.75       4.23       4.93       6.24
s-NPC (α = .10)                    1.36       5.38       3.30       3.74       4.57
s-NPC (α = .20)                    1.86       2.03       5.22       5.02       5.70
s-NPC (α = .30)                    2.22       1.42       5.80       7.22       6.55
s-CC-SVM                           1.88       1.74       4.39       5.31       6.12
s-NPC-SVM (α = .10)                2.83     64.98*    233.79*    435.20*   1231.04*
s-NPC-SVM (α = .20)                2.15       1.69       5.23       5.80       7.43
s-NPC-SVM (α = .30)                2.22       1.47       5.77       7.37       6.69
RF (mean decrease in accuracy)     1.97       2.48       4.73       5.37       5.92
RF (mean decrease in Gini)         1.94       3.02       4.60       5.50       5.75

Feature                               6          7          8          9         10
s-CC                               6.62       6.81       7.29       7.15       7.99
s-NPC (α = .10)                   66.50*    964.71*   2715.94*   2978.16*   3759.18*
s-NPC (α = .20)                    5.27       7.20       7.00       6.68     601.75*
s-NPC (α = .30)                    5.91       6.34       6.10       6.86       6.58
s-CC-SVM                           7.07       6.70       6.79       7.05       7.95
s-NPC-SVM (α = .10)             1416.56*   1621.85*   1660.69*   2508.79*   3242.82*
s-NPC-SVM (α = .20)                7.07       6.51       6.95     109.10*    754.96*
s-NPC-SVM (α = .30)                6.79       6.65       6.19       6.26      15.98*
RF (mean decrease in accuracy)     6.08       6.57       6.66       7.40       7.82
RF (mean decrease in Gini)         5.60       6.75       6.91       7.22      37.34*

Feature                              11         12         13         14         15
s-CC                            5458.50    3926.18    4217.73    4814.75    3882.88
s-NPC (α = .10)                 3852.06    4512.05    4101.91    4421.95    4479.34
s-NPC (α = .20)                 4155.99    4768.53    4547.92    3807.99    4684.78
s-NPC (α = .30)                 4710.67    4276.04    4762.01    4787.34    4649.83
s-CC-SVM                        4706.49    3694.43    4528.69    4490.41    4949.34
s-NPC-SVM (α = .10)             3136.52    3635.93    3730.52    4193.05    3933.72
s-NPC-SVM (α = .20)             4153.33    4368.59    5165.12    4748.92    5557.16
s-NPC-SVM (α = .30)             6235.73    5586.78    6055.84    6029.32    5565.73
RF (mean decrease in accuracy)  4764.65    5284.11    5571.87    5327.75    4956.21
RF (mean decrease in Gini)      4944.51    5281.69    5204.39    5292.28    5313.57

Feature                              16         17         18         19         20
s-CC                            4783.62    4933.65    4910.66    4917.21    4713.67
s-NPC (α = .10)                 4265.04    3983.15    4408.34    4238.35    4422.13
s-NPC (α = .20)                 4275.42    4867.43    4489.88    4537.64    4582.16
s-NPC (α = .30)                 5304.28    4825.68    4861.39    5025.82    4908.58
s-CC-SVM                        5152.39    4960.78    5054.89    5181.85    4810.33
s-NPC-SVM (α = .10)             4481.82    5014.15    4794.49    4096.55    4571.58
s-NPC-SVM (α = .20)             5165.05    5517.70    5070.07    5115.56    4998.20
s-NPC-SVM (α = .30)             5737.40    5872.29    5109.93    5444.06    4901.91
RF (mean decrease in accuracy)  5382.33    4612.39    4546.59    4555.47    4549.88
RF (mean decrease in Gini)      6218.43    5119.48    5264.91    4459.50    5310.33
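The idea behind ranking by s-NPC can be sketched in a few lines, under simplifying assumptions: the feature value itself serves as the classification score (larger values vote for class 1), and the threshold is the plain empirical (1 − α) quantile of the class-0 values rather than the order-statistic rule with violation rate δ1 that the actual s-NPC uses. The toy means and sample sizes below are illustrative, not the paper's simulation settings.

```python
import numpy as np

def np_style_rank(X0, X1, alpha=0.3):
    # For each feature, set the threshold at the empirical (1 - alpha)
    # quantile of the class-0 values (so the empirical type I error is
    # about alpha), then score the feature by its empirical type II
    # error on class 1.  Smaller type II error -> better (smaller) rank.
    d = X0.shape[1]
    type2 = np.empty(d)
    for j in range(d):
        thr = np.quantile(X0[:, j], 1 - alpha)
        type2[j] = np.mean(X1[:, j] <= thr)
    order = np.argsort(type2)
    ranks = np.empty(d, dtype=int)
    ranks[order] = np.arange(1, d + 1)
    return ranks, type2

# Toy data mimicking the Gaussian setting: class-1 means decrease across
# features, so the true feature ranks are 1, 2, ..., d.
rng = np.random.default_rng(1)
m = n = 1000
mu = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
X0 = rng.normal(0.0, 2.0, size=(m, mu.size))
X1 = rng.normal(mu, 2.0, size=(n, mu.size))
ranks, type2 = np_style_rank(X0, X1, alpha=0.3)
```

This sketch is one-sided by construction; the actual s-NPC accommodates general feature distributions by training nonparametric scoring classifiers on one half of the data and evaluating the NP objective on the other half.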
Comparison with other marginal feature ranking criteria
In simulation study S6, we compare s-CC and s-NPC with four criteria that have been widely used to rank features marginally: the Pearson correlation, the distance correlation (Székely et al., 2009), the two-sample t test, and the two-sample Wilcoxon rank-sum test. None of these existing approaches rank features based on a prediction objective; as a result, the feature ranking they give may not reflect the prediction performance of features under a particular objective. Here we use an example to demonstrate this phenomenon. We generate data with d = 2 features from the following model:
To approximate p-CC and p-NPC with δ1 = .05 at four α levels (.05, .10, .20, and .30) for these two features, we use a large sample of size 10^6; the results in Table 3 show that all five population-level criteria rank feature 2 as the top feature.
Table 3:
Values of p-CC and p-NPC of the two features in (18).
Feature    p-CC    p-NPC (α = .05)    p-NPC (α = .10)    p-NPC (α = .20)    p-NPC (α = .30)
1           .31          .74                .61                .44                .32
2           .22          .49                .36                .24                .17
Then we simulate 1000 samples of size N = 400 from the above model and apply nine ranking approaches (s-CC; s-NPC with δ1 = .05 at four α levels .05, .10, .20, and .30; the Pearson correlation; the distance correlation; the two-sample t test; and the two-sample Wilcoxon rank-sum test) to each sample to rank the two features. From the resulting 1000 rank lists per approach, we calculate the frequency with which each approach finds the true rank order. The frequencies, summarized in Table 4, show that the four common criteria almost never identify feature 2 as the better feature for prediction. In other words, if users wish to rank features based on a prediction objective under the classical or NP paradigm, these criteria are not suitable.
Table 4:
The frequency that each ranking approach identifies the true rank order.
s-CC    s-NPC (α = .05)    s-NPC (α = .10)    s-NPC (α = .20)    s-NPC (α = .30)
100%         99.9%              99.3%              99.7%              100%

Pearson cor    distance cor    two-sample t    two-sample Wilcoxon
    0%             0.5%              0%                  0%
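Model (18) is not reproduced above, so here is a hypothetical two-feature Gaussian example in the same spirit (the distributions are chosen purely for illustration): feature 1 has the larger standardized mean difference, yet feature 2 is easier to classify because its class-1 distribution is tightly concentrated. The CC-style score below is the error of a kernel-density plug-in classifier with equal priors, a simplified stand-in for s-CC, not the paper's exact implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000  # points per class

# Feature 1: a pure mean shift.  Feature 2: a smaller mean shift but a
# much tighter class-1 distribution, which makes classification easier.
X0 = np.column_stack([rng.normal(0.0, 1.0, n), rng.normal(0.0, 1.0, n)])
X1 = np.column_stack([rng.normal(0.5, 1.0, n), rng.normal(0.3, 0.1, n)])

def plugin_error(x0, x1):
    # Average error of a nonparametric plug-in classifier (one kernel
    # density per class, equal priors) trained on one half of the data
    # and evaluated on the held-out half.
    h0, h1 = len(x0) // 2, len(x1) // 2
    f0 = stats.gaussian_kde(x0[:h0])
    f1 = stats.gaussian_kde(x1[:h1])
    err0 = np.mean(f1(x0[h0:]) > f0(x0[h0:]))    # class 0 misclassified
    err1 = np.mean(f1(x1[h1:]) <= f0(x1[h1:]))   # class 1 misclassified
    return 0.5 * (err0 + err1)

t = [abs(stats.ttest_ind(X0[:, j], X1[:, j], equal_var=False).statistic)
     for j in range(2)]
err = [plugin_error(X0[:, j], X1[:, j]) for j in range(2)]
# The t test prefers feature 1 (larger |t|), while the error-based
# criterion prefers feature 2 (smaller plug-in classification error).
```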
Real data applications
We apply s-CC and s-NPC to two real datasets to demonstrate their wide application potential in biomedical research. Here we set the number of random splits to B = 1000 for s-CC and s-NPC, as allowed by our computational resources.
Application 1: classification of breast cancer and normal tissues based on genes’ DNA methylation levels
The first dataset contains genome-wide DNA methylation profiles of 285 breast tissues measured by the Illumina HumanMethylation450 microarray technology: 46 normal tissues and 239 breast cancer tissues, with methylation levels measured at 468,424 CpG probes in every tissue (Fleischer et al., 2014). We download the preprocessed and normalized dataset from the Gene Expression Omnibus (GEO) (Edgar et al., 2002) with accession number GSE60185; the preprocessing and normalization steps are described in detail in Fleischer et al. (2014). To facilitate the interpretation of our analysis results, we further process the data as follows. First, we discard a CpG probe if it is mapped to no gene or to more than one gene. Second, if a gene contains multiple CpG probes, we calculate its methylation level as the average methylation level of these probes. This procedure leaves us with 19,363 genes, each with a methylation level in every tissue. We treat the tissues as data points and the genes as features, so we have a sample of size N = 285 with d = 19,363 features. Since misclassifying a patient with cancer as healthy leads to more severe consequences than the other way around, we code the 239 breast cancer tissues as class 0 and the 46 normal tissues as class 1, in line with the NP paradigm. After applying s-CC (8) and s-NPC with δ1 = .05 and four α levels (.05, .10, .20, and .30) (11) to this sample, we summarize the top 10 genes found by each criterion in Table 5. Most of these top-ranked genes have been reported to be associated with breast cancer, suggesting that our proposed criteria can indeed help researchers find meaningful features. Meanwhile, although the other top-ranked genes do not yet have experimental validation, they have weak literature indication and may serve as potentially interesting targets for cancer researchers. For a detailed list of literature evidence, please see the Supplementary Excel File.
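The probe-to-gene processing step can be sketched as follows; the column names and the semicolon-separated gene annotation are hypothetical, standing in for whatever probe-to-gene mapping the array annotation provides.

```python
import pandas as pd

# Hypothetical input: one row per CpG probe, a 'gene' column listing the
# mapped gene(s) as a semicolon-separated string ('' if unmapped), and
# one column of methylation levels per tissue sample.
probes = pd.DataFrame({
    "probe": ["cg01", "cg02", "cg03", "cg04", "cg05"],
    "gene":  ["BRCA1", "BRCA1", "", "TP53;EGFR", "TP53"],
    "s1": [0.10, 0.30, 0.50, 0.70, 0.90],
    "s2": [0.20, 0.40, 0.60, 0.80, 1.00],
})

# Step 1: keep probes mapped to exactly one gene
# (drop unmapped and multi-gene probes).
mapped = probes[probes["gene"].str.fullmatch(r"[^;]+")]

# Step 2: a gene's methylation level in each sample is the average
# over its remaining probes.
gene_level = mapped.drop(columns="probe").groupby("gene").mean()
```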
The fact that these five criteria find distinct sets of top genes is in line with our rationale that feature importance depends on the prediction objective. By exploring the top features found by each criterion, researchers can obtain a comprehensive collection of features that might be scientifically interesting.
Table 5:
Top 10 genes[†] found by each criterion in breast cancer methylation data (Fleischer et al., 2014). Genes with strong literature evidence of breast cancer association are marked in bold in the original table (bolding not reproduced here); see the Supplementary Excel File.
Rank   s-CC      s-NPC (α = .05)   s-NPC (α = .10)   s-NPC (α = .20)   s-NPC (α = .30)
1      HMGB2     HMGB2             HMGB2             ABHD14A           ABHD14A
2      MIR195    MICALCL           ABHD14A           ABL1              ABL1
3      MICALCL   NR1H2             ZFPL1             BAT2              ACTN1
4      AIM2      AGER              AGER              BATF              AKAP8
5      AGER      BATF              RILPL1            CCL8              AP4M1
6      KCNJ14    ZFP106            SKIV2L            COG8              ARHGAP1
7      HYAL1     CTNNAL1           TP53              FAM180B           ATG4B
8      SKIV2L    MIR195            RELA              HMGB2             BAT2
9      RUSC2     AIM2              MIR195            HSF1              BAT5
10     DYNC1H1   ZFPL1             CCL8              KIAA0913          BATF
† Twenty genes have zero s-NPC (α = .20) values, and 119 genes have zero s-NPC (α = .30) values; hence, the top 10 genes listed under either of these two criteria are the first 10 in alphabetical order.
Moreover, we apply the four widely used but non-prediction-based marginal ranking criteria (the Pearson correlation, the distance correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test) to rank the d = 19,363 genes. For demonstration purposes, we check the s-CC and s-NPC (α = .10) values of the genes ranked 1st, 101st, 201st, 301st, 401st, 501st, 601st, 701st, 801st, and 901st by each criterion. Figure 6 shows the cases where the ranks assigned by other criteria differ tremendously from those assigned by s-CC or s-NPC (α = .10): the 301st and 801st genes ranked by the distance correlation, whose ranks by s-CC and s-NPC (α = .10) are much better, and the 601st and 701st genes ranked by the two-sample t test, whose ranks by s-CC and s-NPC (α = .10) are much worse.
Figure 6:
Values of s-CC and s-NPC (α = .10) of the features ranked K = 1, 101, 201, 301, 401, 501, 601, 701, 801, or 901 by each criterion in Application 1. As expected, the ranks by s-CC are monotone in the s-CC values (left), and the ranks by s-NPC (α = .10) are monotone in the s-NPC (α = .10) values (right). We highlight the 301st and 801st genes ranked by the distance correlation, whose ranks by s-CC and s-NPC (α = .10) are much better, and the 601st and 701st genes ranked by the two-sample t test, whose ranks by s-CC and s-NPC (α = .10) are much worse.
Figure 7 illustrates the class-conditional distributions of these four genes. Interestingly, the two genes NXPH1 and TSC22D4 are ranked top by s-CC (with ranks 18 and 77) and s-NPC (α = .10) (with ranks 99 and 146), while they are ranked worse by both the Pearson correlation (with ranks 1061 and 622) and the distance correlation (with ranks 801 and 301). Their class-conditional distributions show distinct, non-overlapping density peaks between classes 0 and 1, confirming their top ranks assigned by s-CC and s-NPC. Literature also suggests the two genes’ potential roles in breast cancer: a methylation study found NXPH1 as a candidate biomarker gene for HER2+ breast cancer (Lindqvist et al., 2014); an RNA-seq study found TSC22D4 up-regulated in a BRCA1 mutated cell line (SL) compared to a BRCA1 wild-type cell line (SB) (Privat et al., 2018); another study identified TSC22D2, an isoform of TSC22D4, as a novel cancer-associated gene in a rare multi-cancer family (Liang et al., 2016).
Figure 7:
Class-conditional distributions of four example genes (out of d = 19,363) in Application 1. Two of the genes (COL16A1 and CUBN) are ranked much better by the two-sample t test than by s-CC or s-NPC (α = .10); the other two (NXPH1 and TSC22D4) are ranked top by s-CC and s-NPC (α = .10) but much worse by the Pearson correlation and the distance correlation.
In Figure 7, the other two genes, COL16A1 and CUBN, are ranked much lower by s-CC (ranks 3793 and 2687) and s-NPC (α = .10) (ranks 3924 and 4792) than by the two-sample t test (ranks 601 and 701). Inspecting their class-conditional distributions, we find that the high-density domains of their class-1 conditional distributions are largely contained in those of their class-0 conditional distributions, an observation consistent with the low rankings assigned by s-CC and s-NPC (α = .10). Neither gene appears to have a direct association with breast cancer: COL16A1 encodes the alpha chain of type XVI collagen, and CUBN encodes a protein called cubilin, which is involved in the uptake of vitamin B12 from food into the body.
Application 2: classification of high-risk and low-risk prostate cancer patients based on microRNA expression levels in urine samples
The second dataset contains microRNA (miRNA) expression levels in urine samples of prostate cancer patients, downloaded from the GEO with accession number GSE86474 (Jeon et al., 2019). This dataset is composed of 78 high-risk and 61 low-risk patients. To align with the NP paradigm, we code the high-risk and low-risk patients as classes 0 and 1, respectively, so m/n = 78/61. In our data pre-processing, we retain miRNAs that have at least 60% non-zero expression levels across the N = 139 patients, resulting in d = 112 features. We use this dataset to demonstrate that s-NPC is robust to sampling bias that results in disproportional training data, i.e., training data whose class proportions differ from those of the population. Since in many biomedical datasets the proportions of diseased patients do not reflect the true proportions in the population, a desirable feature ranking criterion should be robust to such sampling bias so that the selected features maintain good out-of-sample predictive power.

We create two sub-datasets by randomly removing one half of the data points in class 0 or 1, so that one sub-dataset has m/n = 39/61 and the other has m/n = 78/31. We apply s-CC, s-NPC with δ1 = .05, and the RF algorithm's feature importance measures (the mean decrease in accuracy and the mean decrease in Gini index) to the full dataset and each sub-dataset to rank features. To evaluate each criterion's robustness to disproportional data, we compare its rank lists from these three datasets with different m/n ratios. For this comparison, we use the Kuncheva index (Kuncheva, 2007), which quantifies the overlap of top k feature sets and accounts for the overlap expected by chance.
The Kuncheva index of two top-k feature sets A and B, selected from d features in total, is (|A ∩ B| · d − k²) / (k · (d − k)). It has range [−1, 1], its value is monotone increasing in |A ∩ B|, and it is close to 0 when A and B overlap only by chance. For more than two rank lists, the Kuncheva index for the top k features is defined as the average of the pairwise Kuncheva indices for the top k features. In our case, the larger the Kuncheva indices are for varying k, the more robust a criterion is to disproportional data. We illustrate the Kuncheva indices of s-CC, s-NPC, and the two RF feature importance measures in Figure 8, which shows that s-NPC is the most robust criterion.
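A small helper computing the Kuncheva index, (|A ∩ B| · d − k²) / (k(d − k)) for two top-k sets out of d features in total (following the paper's cited Kuncheva, 2007), together with the pairwise averaging used for more than two rank lists:

```python
import numpy as np

def kuncheva(A, B, d):
    # Kuncheva index of two top-k feature sets A and B drawn from d
    # features: equals 1 for identical sets and is about 0 when the
    # overlap is what chance alone would produce.
    k = len(A)
    r = len(set(A) & set(B))
    return (r * d - k * k) / (k * (d - k))

def average_kuncheva(lists, k, d):
    # Average pairwise Kuncheva index of the top-k sets of several rank
    # lists (here, the rank lists from datasets with different m/n).
    tops = [set(l[:k]) for l in lists]
    vals = [kuncheva(tops[i], tops[j], d)
            for i in range(len(tops)) for j in range(i + 1, len(tops))]
    return float(np.mean(vals))
```

For example, two identical top-2 sets out of d = 10 features give an index of 1, while two disjoint top-2 sets give a negative index.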
Figure 8:
Kuncheva indices of s-CC, s-NPC, and the two RF feature importance measures in ranking the top k features in Application 2.
Conclusion and perspectives
This work introduces model-free objective-based marginal feature ranking criteria, s-CC and s-NPC, for the purpose of binary decision-making. The explicit use of a prediction objective to rank features is demonstrated to select more predictive features than existing practices for marginal or multivariate feature ranking do. The reason is that commonly used marginal feature ranking criteria are association measures that reflect neither the prediction objective nor certain data distributional characteristics, and that popular multivariate feature ranking criteria rely on multivariate classification algorithms such as SVM and RF, which are sensitive to sampling bias and feature correlations.

It is worth noting that s-CC and s-NPC are extendable to multi-class classification. Because s-NPC is based on the Neyman-Pearson paradigm for binary classification, its extension to multi-class classification relies on the one-vs-rest approach: we single out the most important class as class 0 and combine the remaining classes into class 1, converting the multi-class problem into a binary classification problem to which s-NPC can be applied. For s-CC with K classes, the classical oracle classifier is φ*(·) = argmax_{k ∈ {1, …, K}} π_k p_k(·), where π_k is the prior probability of class k and p_k(·) denotes the class-k conditional density of the feature(s). We can then define s-CC as the average classification error of plug-in classifiers of this oracle, similar to equations (7) and (8). We have implemented the multi-class extensions of s-CC and s-NPC in our R package frc.

In addition to the illustrated CC and NP paradigms, the same marginal ranking idea extends to other prediction objectives, such as the cost-sensitive learning and global paradigms. Another extension direction is to rank feature pairs in the same model-free fashion. Beyond the biomedical examples in this paper, model-free objective-based marginal feature ranking is also useful in finance applications, among others. For example, suppose a loan company has a successful business in region A and would like to establish new business in region B. To build a loan-eligibility model for region B, which has a much smaller fraction of eligible applicants than region A, the company may use the features top-ranked by s-NPC in region A, thanks to the robustness of s-NPC to sampling bias.

Both s-CC and s-NPC involve sample splitting, with a default half-half split for both class 0 and class 1 observations. It remains an open question whether a refined splitting strategy may lead to better ranking agreement between the sample-level and population-level criteria. Intuitively, there is a trade-off between classifier training and objective evaluation: using more data for training can result in a classifier closer to the oracle, while saving more data to evaluate the objective can lead to a less variable criterion.
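The multi-class plug-in construction sketched above can be written in a few lines; the kernel density estimator and the toy Gaussian classes below are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy import stats

def multiclass_plugin(train_by_class):
    # Plug-in estimate of the classical oracle phi*(x) = argmax_k pi_k p_k(x):
    # estimate each class prior by its sample fraction and each class-conditional
    # density by a kernel density estimate, then classify by the largest product.
    total = sum(len(x) for x in train_by_class)
    priors = [len(x) / total for x in train_by_class]
    kdes = [stats.gaussian_kde(x) for x in train_by_class]
    def classify(x):
        x = np.atleast_1d(x)
        scores = np.vstack([p * kde(x) for p, kde in zip(priors, kdes)])
        return scores.argmax(axis=0)
    return classify

# Three well-separated 1-D Gaussian classes; the plug-in classifier's
# average class-conditional error should be small.
rng = np.random.default_rng(2)
train = [rng.normal(mu, 1.0, 300) for mu in (-4.0, 0.0, 4.0)]
phi = multiclass_plugin(train)
test_x = [rng.normal(mu, 1.0, 300) for mu in (-4.0, 0.0, 4.0)]
err = np.mean([np.mean(phi(x) != k) for k, x in enumerate(test_x)])
```

A multi-class s-CC analogue would average such errors over random sample splits, in the spirit of (7) and (8).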