
On the number of close-to-optimal feature sets.

Edward R Dougherty, Marcel Brun.

Abstract

The issue of wide feature-set variability has recently been raised in the context of expression-based classification using microarray data. This paper addresses this concern by demonstrating the natural manner in which many feature sets of a certain size chosen from a large collection of potential features can be so close to being optimal that they are statistically indistinguishable. Feature-set optimality is inherently related to sample size because it only arises on account of the tendency for diminished classifier accuracy as the number of features grows too large for satisfactory design from the sample data. The paper considers optimal feature sets in the framework of a model in which the features are grouped in such a way that intra-group correlation is substantial whereas inter-group correlation is minimal, the intent being to model the situation in which there are groups of highly correlated co-regulated genes and there is little correlation between the co-regulated groups. This is accomplished by using a block model for the covariance matrix that reflects these conditions. Focusing on linear discriminant analysis, we demonstrate how these assumptions can lead to very large numbers of close-to-optimal feature sets.


Keywords:  classification; covariance model; feature sets; gene expression

Year:  2007        PMID: 19458767      PMCID: PMC2675502     

Source DB:  PubMed          Journal:  Cancer Inform        ISSN: 1176-9351


Introduction

Concern has recently been expressed regarding the fact that different studies reveal different gene sets for microarray-based cancer classifiers intended to predict the same clinical characteristics. In particular, we refer to the concern regarding different gene sets for predicting prognosis in breast cancer (Ioannidis, 2005; Jenssen and Hovig, 2005). In one study, the issue is addressed by repeatedly selecting a random subset of the data, designing a classifier on that subset, estimating the error of the designed classifier on the remaining data points, and considering the distribution of the genes composing the various classifier feature sets (Michiels et al. 2005). Concern is expressed over the wide variability of the feature sets and the large number of genes contributing to them. Reference is made to the sample size. While sample size is a factor, it is not the only one: a wide array of feature sets can occur even with large samples, owing to the relations among the genes. The potential for a large number of "best" feature sets is an inherent property of pattern recognition and is not particular to microarrays. In this paper we explore this potential by taking a model-based approach, specifically, modeling the covariance matrix of the features.

Before proceeding we must clarify the problem. Suppose we have some large number, D, of features from which to choose, say 20,000 gene-expression levels from a microarray. If we know the class-conditional probability distributions, say the distributions of the genes for two phenotypes, then the collection of all D features is optimal. Many may be redundant and can be removed with negligible effect on classification accuracy; nonetheless, all can be kept without harming the classification accuracy. This, however, is abstract.
In practice, feature sets and classifiers must be derived from sample data, and here there is the common situation of the error declining as the number of features grows to a certain point and then beginning to increase as the number of features increases beyond that point, the peaking phenomenon (Hughes, 1969; Kanal and Chandrasekaran, 1971; Jain and Waller, 1978). Peaking tends to occur later for larger samples; it occurs because too many features overfit the data. Hence, the issue of an optimal feature set is inherently one of classifier design from sample data. Given a sample, among all possible feature sets, which one results in a classifier with minimal error? Or, more generally, what is the optimal number of features, in the sense that the expected error for feature sets of the optimal size is minimized relative to all possible sizes? The answer depends on the type of classifier being designed (Hua et al. 2005). Since the actual distributions are unknown, evaluating feature sets requires error estimation, and different error estimators will yield different rankings of feature-set performance. Even for large samples, error estimation suffers some imprecision. Thus, if there is a large number of close-to-optimal feature sets arising from a sample, then many of these may be statistically indistinguishable. Moreover, and this seems to be the issue concerning some investigators, different samples will yield different top-performing feature sets. But this is to be expected if each sample yields a list of feature sets (and classifiers) whose error rates are indistinguishable. Putting aside error estimation for the moment, and focusing solely on classifier design, is it even conceivable to find an optimal feature set for a given sample? In the abstract, one could design a classifier for every possible feature set according to the desired classification rule, find the errors of these, and choose the best-classifying feature set.
But this program cannot be carried out because the number of possible feature sets to check is astronomical. Moreover, in the absence of prior distributional knowledge, which is the common situation in applications, a full search cannot be avoided if we wish to be assured of finding an optimal feature set (Cover and van Campenhout, 1977). This leads to the feature-selection problem, which is to apply some algorithm to select a less-than-optimal feature set that provides good classification. Depending on the feature-selection algorithm, or possible variant choices within the algorithm, different feature sets will be obtained for the same sample. There is the expectation that the more data one has, the better will be the selected feature set; nonetheless, outcome variability remains. Owing to the instability of feature-selection algorithms, one can surely expect slight changes in the data to result in different outputs. It is important to recognize that feature selection is part of classifier design. It determines the features that serve as variables for the classifier, which is a function on those variables. In doing so, it is part of the classification rule by which the classifier is constructed from the data. If we now incorporate error estimation into our reasoning, not only does it create imprecision in obtaining errors, but it also affects the feature-selection algorithm (Sima et al. 2005). Given the preceding considerations, we will approach the optimal-feature-set problem in the following way: given the class conditional distributions and a positive integer d, find the family of all feature sets of size d from among the potential features that possess minimal error, and hence are optimal. As posed, feature selection and error estimation do not play a role. 
Practically, error estimation is a must, so that, regarding the number of "best" feature sets, the issue is not so much one of simply finding the number of optimal feature sets, but one of showing that there can be a very large number of feature sets of size d whose (true) errors are close to minimal, since these will be statistically indistinguishable. We note in this regard that increasing the sample size does not necessarily mitigate the problem of a large number of statistically indistinguishable feature sets, because as the sample size grows, so too does the size of the feature set producing minimal error, which means extraordinary growth in the number of feature sets of that size. In the context of gene-expression classification, reflection on biological regulation indicates the enormous gene-set variability that can be expected from a microarray study. A set of genes is useful for discriminating pathologies if its behavior, as a collection, varies according to the phenotypes. It may be that the genes involved have some primary regulatory function relating to the phenotype or that they are related, within the overall genome regulation, to other genes governing the phenotype. Given the complex regulatory connectivity within the genome (intricate feedback, massive redundancy, regulatory cascades, and tightly controlled co-regulation of gene cohorts), one should expect a large collection of gene sets having essentially equivalent capability for phenotype discrimination. Add to this the vast number of genes involved. Even if we throw away 80% of the genes on a 20,000-gene microarray, this leaves C(4000, d) feature sets of size d, where C(D, d) denotes the number of subsets of size d that can be formed from D elements. Even for a modest size d this number is enormous. For instance, C(4000, 40) > 10^80.
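The scale of these counts is easy to verify; a quick Python check using only the standard library (4000 and 40 are the paper's illustrative values):

```python
import math

# C(D, d): number of feature sets of size d drawn from D candidate features
D, d = 4000, 40
n_sets = math.comb(D, d)

print(n_sets > 10**80)   # True: far more sets than could ever be searched
print(len(str(n_sets)))  # number of decimal digits in the count
```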
Consequently, the errors corresponding to the feature sets can be expected to be extremely dense in the interval [ε, 1], where ε is the Bayes error (minimal achievable error) for the classification problem. This observation alone should make it apparent that we can expect to find a large number of feature sets whose errors are statistically indistinguishable from the optimal one. This paper considers a model in which the features are grouped in such a way that intra-group correlation is substantial whereas inter-group correlation is minimal, the intent being to model the situation in which there are groups of highly correlated co-regulated genes and there is little correlation between the co-regulated groups. This will be accomplished by using a block model for the covariance matrix that reflects these conditions. Focusing on linear discriminant analysis, we demonstrate how these assumptions can lead to large numbers of close-to-optimal feature sets.

A Covariance Structure Leading to Many Close-to-Optimal Feature Sets

We consider classification between two Gaussian classes. Thus, the classes 0 and 1 possess normal conditional distributions, with mean vectors μk and covariance matrices Kk, for k = 0, 1. If we assume that the two classes are equally likely and the covariance matrices are equal, K0 = K1 = K, then the optimal classifier is determined by the discriminant d1(x) − d0(x), where

dk(x) = xᵀK⁻¹μk − (1/2)μkᵀK⁻¹μk,  k = 0, 1.   (1)

The discriminant is a linear function of x, produces hyperplane decision boundaries, and characterizes linear discriminant analysis (LDA). While it is unlikely that the covariance matrices are actually equal, it is this assumption that underlies the optimality of LDA. Moreover, in situations where LDA is applied as a classification rule and sample sizes are limited, as is often the case with gene-expression data, the assumption of equal covariance matrices is a form of regularization, and is the simplest (and most used) of the various regularized covariance assumptions (Titterington, 1985; Friedman, 1988). The decision boundary and error for LDA depend on the means, variances, and correlation coefficients. Obtaining direct insight concerning feature selection is not possible in the general case. A standard approach is to adopt a tractable covariance model that represents some assumptions on the relations among the features. For instance, a classic paper on finding the optimal number of features for LDA considers three covariance models (Jain and Waller, 1978). All three assume a common variance σ² and a single base value ρ from which the correlation coefficients ρij are obtained. The models are defined by their correlation assumptions: (I) ρij = ρ; (II) ρij = ρ^|i−j|; and (III) ρij = ρ if |i−j| = 1 and ρij = 0 if |i−j| > 1. The 3 × 3 forms of these models are

(I) σ²[1 ρ ρ; ρ 1 ρ; ρ ρ 1],  (II) σ²[1 ρ ρ²; ρ 1 ρ; ρ² ρ 1],  (III) σ²[1 ρ 0; ρ 1 ρ; 0 ρ 1].   (2)

In this paper we adopt a model suited to the independent co-regulation of gene groups: genes in different groups are uncorrelated and those within the same group are correlated.
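For concreteness, the LDA discriminant can be coded directly. The sketch below uses a diagonal common covariance matrix and invented means (none of these numbers come from the paper); with equal priors, dk(x) = xᵀK⁻¹μk − (1/2)μkᵀK⁻¹μk and x is assigned to class 1 when d1(x) − d0(x) > 0:

```python
def lda_discriminant(x, mu0, mu1, K):
    """d1(x) - d0(x) for equal-prior Gaussian classes sharing covariance K.

    K is taken diagonal here so that K^{-1} is elementwise; the general case
    only needs a matrix inverse in its place.
    """
    inv = [1.0 / K[i][i] for i in range(len(K))]

    def d(mu):
        # d_k(x) = x^T K^{-1} mu_k - (1/2) mu_k^T K^{-1} mu_k
        return (sum(x[i] * inv[i] * mu[i] for i in range(len(x)))
                - 0.5 * sum(mu[i] * inv[i] * mu[i] for i in range(len(x))))

    return d(mu1) - d(mu0)

# hypothetical two-feature setting
mu0, mu1 = [0.0, 0.0], [1.0, 1.0]
K = [[0.5, 0.0], [0.0, 0.5]]
print(lda_discriminant([0.9, 0.9], mu0, mu1, K) > 0)  # near mu1: class 1
print(lda_discriminant([0.1, 0.0], mu0, mu1, K) > 0)  # near mu0: class 0
```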
If we label these groups G1, G2,…, Gm, then the covariance matrix is block diagonal,

K = diag(K1, K2,…, Km),   (3)

where, for k = 1, 2,..., m, block Kk has entries Kk(i, j) = ρkij σki σkj, with σki² the variance of the ith feature in Gk and ρkij the correlation coefficient for the ith and jth features in Gk. If K has d features split into m equal-sized blocks, we write K ∼ Cov[d, m]. Special cases of the model result if we assume the blocks take the form of one of the models I, II, or III. What is of interest to us here is that, depending on ρ and the variances, it will often be the case that the best feature set of size m has one feature from each block, namely, the best feature set is of the form A = {g1, g2,…, gm}, where gk ∈ Gk for k = 1, 2, ..., m. We define a feature set to be separated if its features come from different blocks. We illustrate feature-set separation with several examples. Because the simulations required to construct the examples are computationally intensive, we limit ourselves to relatively small covariance matrices. We first consider K consisting of three blocks with three features each, K ∼ Cov[9, 3]. Fixing the value of ρ throughout K, for each simulation the variances are selected from a uniform distribution. We consider two cases: σ ∼ U[0.3, 0.47] and σ ∼ U[0.3, 1]; the latter allows a broader range of variances across the features. Fixing ρ throughout K gives a blockwise model corresponding to model I in (Jain and Waller, 1978), except that it is more realistic in allowing the variances to vary. In all cases 100,000 data points, 50,000 per class, are used to find the best feature set containing 3 features, and another 100,000 data points are used for error estimation. The overall process is repeated 10,000 times at values of ρ uniformly distributed between 0 and 1. Figure 1(a) shows the percentage of separated optimal feature sets as a function of ρ for σ ∼ U[0.3, 0.47].
Except for extreme values of ρ, the majority of optimal feature sets are separated, and for 0.3 ≤ ρ ≤ 0.9, essentially 100% of the optimal feature sets are separated, indicating that separation holds whenever there is at least modest correlation among the features within a block and the correlation is not extreme. Note the steep fall-off in separation percentage when ρ exceeds 0.9. Figure 1(b), which corresponds to σ ∼ U[0.3, 1] for the same covariance structure, shows that increased variability results in smaller percentages of separated optimal feature sets; nonetheless, for most of the range of ρ there is a significant percentage of separated feature sets. For 0.2 ≤ ρ ≤ 0.8, more than 60% of the optimal feature sets are separated.
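The experiment can be sketched in miniature with exact errors in place of trained classifiers. The code below (standard library only) builds one Cov[9, 3] model-I covariance matrix with fixed illustrative variances and unit mean differences (not the paper's random draws) and exhaustively checks all C(9, 3) = 84 subsets of size 3; for these values the optimal subset is separated:

```python
import itertools
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                for j in range(c, n + 1):
                    M[r][j] -= f * M[c][j]
    return [M[i][n] / M[i][i] for i in range(n)]

def lda_error(K, dmu, idx):
    """Exact LDA error Phi(-Delta/2) on the feature subset idx (equal priors)."""
    sub = [[K[i][j] for j in idx] for i in idx]
    d = [dmu[i] for i in idx]
    x = solve(sub, d)                      # K_sub^{-1} d
    delta = math.sqrt(sum(di * xi for di, xi in zip(d, x)))
    return 0.5 * math.erfc(delta / (2 * math.sqrt(2)))

# Cov[9, 3] under model I: three blocks of three features, rho fixed within blocks
rho = 0.5
var = [0.3 + 0.01 * i for i in range(9)]   # illustrative variances, not random
block = [i // 3 for i in range(9)]         # block label of each feature
K = [[var[i] if i == j
      else rho * math.sqrt(var[i] * var[j]) if block[i] == block[j]
      else 0.0
      for j in range(9)] for i in range(9)]
dmu = [1.0] * 9                            # unit mean difference for every feature

subsets = list(itertools.combinations(range(9), 3))
best = min(subsets, key=lambda s: lda_error(K, dmu, s))
separated = len({block[i] for i in best}) == 3
print(best, separated)                     # the lowest-variance feature per block wins
```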
Figure 1.

Percentage of separated optimal feature sets as a function of the correlation coefficient ρ. First row: K ∼ Cov[9, 3], (a) ρij = ρ, σ ∼ U[0.3, 0.47]; (b) ρij = ρ, σ ∼ U[0.3, 1]; (c) ρij = ρ^|i−j|, σ ∼ U[0.3, 0.47]; (d) ρij = ρ^|i−j|, σ ∼ U[0.3, 1]. Second row: K ∼ Cov[10, 2], (e) ρij = ρ, σ ∼ U[0.3, 0.47]; (f) ρij = ρ, σ ∼ U[0.3, 1]; (g) ρij = ρ^|i−j|, σ ∼ U[0.3, 0.47]; (h) ρij = ρ^|i−j|, σ ∼ U[0.3, 1]. Third row: K ∼ Cov[10, 5], (i) ρij = ρ, σ ∼ U[0.3, 0.47]; (j) ρij = ρ, σ ∼ U[0.3, 1]; (k) ρij = ρ^|i−j|, σ ∼ U[0.3, 0.47]; (l) ρij = ρ^|i−j|, σ ∼ U[0.3, 1].

To demonstrate the commonplace nature of the preceding results, as well as how they can vary depending on the model and the variability of the variances, we have carried out similar analyses in a number of circumstances. The results are shown in the remaining parts of Fig. 1: (c) K ∼ Cov[9, 3], ρij = ρ^|i−j| (corresponding to model II in (Jain and Waller, 1978)), and σ ∼ U[0.3, 0.47]; (d) K ∼ Cov[9, 3], ρij = ρ^|i−j|, and σ ∼ U[0.3, 1]; (e) K ∼ Cov[10, 2], ρij = ρ, and σ ∼ U[0.3, 0.47]; (f) K ∼ Cov[10, 2], ρij = ρ, and σ ∼ U[0.3, 1]; (g) K ∼ Cov[10, 2], ρij = ρ^|i−j|, and σ ∼ U[0.3, 0.47]; (h) K ∼ Cov[10, 2], ρij = ρ^|i−j|, and σ ∼ U[0.3, 1]; (i) K ∼ Cov[10, 5], ρij = ρ, and σ ∼ U[0.3, 0.47]; (j) K ∼ Cov[10, 5], ρij = ρ, and σ ∼ U[0.3, 1]; (k) K ∼ Cov[10, 5], ρij = ρ^|i−j|, and σ ∼ U[0.3, 0.47]; (l) K ∼ Cov[10, 5], ρij = ρ^|i−j|, and σ ∼ U[0.3, 1]. Even though we have carried out very large simulations, there is still some wobble in the curves. To look more closely at the low separation percentage when ρ ≈ 1, we consider two examples for the model having K ∼ Cov[10, 2], ρij = ρ, σ ∼ U[0.3, 0.65]. Let X1, X2,…, X10 denote the features, with X1, X2,…, X5 and X6, X7,..., X10 corresponding to the first and second block of K, respectively. In the first example, we let ρ = 0.999 and generate one realization of K based on the randomly chosen variances. For the resulting K the best separated 2-feature set is {X1, X7}, with an estimated error of 0.0253. There are nine non-separated feature sets possessing 0 error: {X1, X3}, {X1, X4}, {X1, X5}, {X6, X8}, {X6, X9}, {X6, X10}, {X7, X8}, {X7, X9}, and {X7, X10}, where all error estimates have been obtained using 1,000,000 independent test sample points. Figures 2(a) and 2(b) show 1,000 sample points projected onto features X1 and X7, and onto features X6 and X8, respectively. The advantage of {X6, X8} over {X1, X7} is clear.
Figure 2.

Scatter plots for the model K ∼ Cov[10, 2], ρij = ρ, σ ∼ U[0.3, 0.65], with ρ = 0.999: (a) feature set {X1, X7}; (b) feature set {X6, X8}.

In the second example, we let ρ = 0.5 and generate one realization of K based on the randomly chosen variances. For the resulting K the best overall 2-feature set is the separated feature set {X1, X7}, with an estimated error of 0.0257 (the difference from the 0.0253 of the previous case resulting from variability in error estimation). Figures 3(a) and 3(b) show 1,000 sample points projected onto features X1 and X7, and onto features X6 and X8, respectively. The advantage of {X1, X7} over {X6, X8}, whose error is 0.0625, is clear.
Figure 3.

Scatter plots for the model K ∼ Cov[10, 2], ρij = ρ, σ ∼ U[0.3, 0.65], with ρ = 0.5: (a) feature set {X1, X7}; (b) feature set {X6, X8}.

Comparing Figures 2(b) and 3(b) for the feature set {X6, X8} shows what is happening: ρ is the common correlation coefficient for the class-conditional densities fX|0 and fX|1. Since ρ ≈ 1 in Fig. 2(b), the sample points for each class lie almost on a line, the lines being distinct because the class means are not equal. These two almost-linear point masses are easily discriminated by a straight line between them. In Fig. 3(b), where ρ = 0.5, the point masses are scattered around the two lines and extensively intermingle; therefore they are not well discriminated.
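This geometry can be checked with the closed-form two-feature Mahalanobis distance: for standardized mean differences a = (μ11 − μ01)/σ1 and b = (μ12 − μ02)/σ2 and correlation ρ, Δ² = (a² − 2ρab + b²)/(1 − ρ²), so the exact LDA error is Φ(−Δ/2). A short sketch with invented values (not the paper's realizations):

```python
import math

def lda_error_2d(a, b, rho):
    """Exact equal-prior LDA error Phi(-Delta/2) for two Gaussian features with
    standardized mean differences a, b and common class-conditional correlation rho."""
    delta2 = (a * a - 2 * rho * a * b + b * b) / (1 - rho * rho)
    return 0.5 * math.erfc(math.sqrt(delta2) / (2 * math.sqrt(2)))

# moderate positive correlation hurts a same-block pair relative to a separated one
print(lda_error_2d(1.0, 1.0, 0.0))    # independent pair (separated)
print(lda_error_2d(1.0, 1.0, 0.5))    # correlated pair: larger error

# near-singular correlation with unequal standardized shifts drives the error
# toward 0, mirroring the zero-error non-separated pairs observed at rho = 0.999
print(lda_error_2d(1.0, 1.5, 0.999))
```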

Analysis of Feature-Set Redundancy

Now that we know there are many separated feature sets resulting from the covariance model of Eq. 3, let A = {g1, g2,…, gm} be such a feature set, let μ01, μ02,…, μ0m and μ11, μ12,…, μ1m denote the means of the features in A for classes 0 and 1, respectively, and let σ1², σ2²,…, σm² denote their common variances (the same for both classes). By a well-known formula (Duda et al. 2001), the LDA error for A is given by

ε(A) = Φ(−Δ(A)/2),   (4)

where Φ is the standard normal cumulative distribution function and Δ(A) is the Mahalanobis distance between the class means. Since g1, g2,..., gm are independent, this distance is determined by

Δ(A)² = (μ11 − μ01)²/σ1² + (μ12 − μ02)²/σ2² + ⋯ + (μ1m − μ0m)²/σm².   (5)

Each term of the sum gives the contribution of the corresponding feature in reducing the error. The contribution increases with greater separation of the means and with smaller variance. The error tends to 0 as Δ(A) tends to infinity. If, for any feature g with variance σ² and class means μ0 and μ1, we let

δ(g) = (μ1 − μ0)²/σ²,   (6)

then Eq. 5 can be rewritten as

Δ(A)² = δ(g1) + δ(g2) + ⋯ + δ(gm).   (7)

To consider feature-set redundancy, suppose t is a very small positive number and that, for k = 1, 2,..., m, the set Gk has a subset G̃k = {gk1, gk2,…, gk,mk} such that

δ(gki) ≥ δ(gk) − t   (8)

for i = 1, 2,..., mk. We form a feature set B by replacing q features in A by elements in q of the sets G̃k, q ≤ m. Then, according to Eq. 4, the increase in error resulting from using feature set B instead of feature set A is

ε(B) − ε(A) = Φ(−Δ(B)/2) − Φ(−Δ(A)/2).

Equations 7 and 8 yield

Δ(B)² ≥ Δ(A)² − qt.

Some simple algebra shows that

Δ(A) − Δ(B) ≤ qt/(Δ(A) + Δ(B)).

Denoting the fraction on the right by η(q, t), we conclude that

ε(B) − ε(A) ≤ Φ((η(q, t) − Δ(A))/2) − Φ(−Δ(A)/2).

For very small t, this difference is negligible, in which case the performances of the two feature sets are negligibly different. Since there are mk features in G̃k satisfying Eq. 8, a replacement within block k can be made in mk ways, so that the q replacement features corresponding to blocks k1, k2,…, kq can be chosen in mk1 mk2 ⋯ mkq ways. If we consider all feature sets formed by replacing 1, 2,..., m elements from different blocks, then the total number of close-to-optimal feature sets formed in this manner is

(m1 + 1)(m2 + 1)⋯(mm + 1) − 1.

If the blocks are large and there are several genes in some or all of the blocks satisfying Eq. 8, then this number can be very large.
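The bound can be checked numerically. The sketch below uses made-up per-feature contributions δ(gk) (not values from the paper) and swaps q = 2 features for alternatives whose contributions are lower by t:

```python
import math

def lda_error(contribs):
    """Phi(-Delta/2), where Delta^2 is the sum of the per-feature contributions
    delta(g) = (mu1 - mu0)^2 / sigma^2 (independent features, equal priors)."""
    delta = math.sqrt(sum(contribs))
    return 0.5 * math.erfc(delta / (2 * math.sqrt(2)))

# feature set A and a set B obtained by replacing q = 2 features with
# alternatives whose contributions are smaller by at most t
A = [1.0, 1.2, 0.9, 1.1]                   # invented contributions delta(g_k)
t, q = 0.01, 2
B = [1.0 - t, 1.2, 0.9 - t, 1.1]

err_A, err_B = lda_error(A), lda_error(B)

# the bound from the text: eta = q t / (Delta(A) + Delta(B)), and
# err_B - err_A <= Phi((eta - Delta(A)) / 2) - Phi(-Delta(A) / 2)
dA, dB = math.sqrt(sum(A)), math.sqrt(sum(B))
eta = q * t / (dA + dB)
bound = 0.5 * math.erfc((dA - eta) / (2 * math.sqrt(2))) - err_A

print(err_B - err_A, bound)                # tiny increase, within the bound
```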
The point is clear: if the blocks are large, which they may well be when they represent co-regulation of independent cohorts of genes, then there may be many features in a block possessing contributions close to the maximum. These can be interchanged with little effect on the error, thereby leading to close-to-optimal feature sets. In practice, feature sets whose errors are close to optimal will be indistinguishable owing to the need for error estimation. We now illustrate feature-set redundancy with two examples generated from the model of Eq. 3 with K ∼ Cov[60, 5] and ρij = ρ. In the first example, ρ = 0.8 and σ ∼ U[0.3, 1]. The optimal 4-feature set is separated and has Bayes error 0.0485. Figure 4(a) shows the cumulative histogram of the errors for the superior 4-feature sets. Note that there are almost 700 feature sets whose errors are within 0.007 of the best. This is with only 60 possible features; the number would likely be astronomically greater with 40,000 features. In the second example, again with ρ = 0.8 and σ ∼ U[0.3, 1], a different realization of the model yields a separated optimal feature set with Bayes error 0.1642. Figure 4(b) shows the cumulative histogram of the errors for the superior 4-feature sets. With the larger error, feature-set errors are more spread out, but even in this case there are more than 80 feature sets whose errors are within 0.02 of the best.
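The interchange count can be made concrete by brute force. The sketch below invents per-gene contributions δ(g) in three blocks (not the paper's model realizations); the blocks contain 2, 1, and 3 near-maximal alternatives within t = 0.05 of their maxima, and exactly (2+1)(1+1)(3+1) = 24 separated feature sets (counting the best itself) fall within 0.005 of the best error:

```python
import itertools
import math

def lda_error(contribs):
    """Phi(-Delta/2) with Delta^2 the sum of independent per-feature contributions."""
    delta = math.sqrt(sum(contribs))
    return 0.5 * math.erfc(delta / (2 * math.sqrt(2)))

# invented contributions delta(g) for the genes in each of three blocks;
# the leading entries of each block sit within t = 0.05 of the block maximum
blocks = [
    [1.00, 0.98, 0.97, 0.60],
    [1.20, 1.18, 0.50],
    [0.90, 0.88, 0.87, 0.86, 0.40],
]

# every separated feature set: one gene per block
errors = [lda_error(c) for c in itertools.product(*blocks)]
best = min(errors)
close = sum(1 for e in errors if e - best <= 0.005)
print(len(errors), close)   # 60 separated sets; 24 within 0.005 of the best
```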
Figure 4.

Cumulative histograms of superior feature-set errors: (a) low Bayes error; (b) high Bayes error.

Conclusion

Feature-set variability should not be of concern in the context of finding expression-based diagnostic panels. It arises naturally from the relationships among the features, the vast number of feature sets when there is a large number of potential features, and the inherent natures of both classifier design and error estimation from sample data. Owing to the peaking phenomenon, larger sample sizes may not reduce the number of close-to-optimal feature sets, because the number of feature sets rises dramatically as their size increases. The important issue is finding good feature sets that can achieve desirable classification rates. Based on what we have seen, it is not unlikely that two studies using comparable technologies and classification procedures will arrive at equally performing feature sets possessing small intersection or, for that matter, no intersection at all. Rather than perceive this as a negative, one should see it as a positive: given the incertitude of feature selection and the imprecision of error estimation, it can be beneficial to have many close-to-optimal feature sets. Let us close by citing two previous empirical studies that have come to similar conclusions. In one, the authors consider a specific breast-cancer data set and classification with regard to good and bad prognosis (Ein-Dor et al. 2005). Their approach is to rank genes based on their correlation with survival, take feature sets consisting of the first through eighth sets of 70 genes in the list, and show that classifiers based on each of these feature sets perform essentially the same. In another, it was shown that an exhaustive search finds large numbers of indistinguishably performing small feature sets (Grate, 2005). By taking a model-based approach we have characterized this phenomenon for linear classification in terms of the discriminant and the actual error, and shown how intra- and inter-group correlation affect the phenomenon.
References (6 in total)

1.  Outcome signature genes in breast cancer: is there a unique set?

Authors:  Liat Ein-Dor; Itai Kela; Gad Getz; David Givol; Eytan Domany
Journal:  Bioinformatics       Date:  2004-08-12       Impact factor: 6.937

2.  Optimal number of features as a function of sample size for various classification rules.

Authors:  Jianping Hua; Zixiang Xiong; James Lowey; Edward Suh; Edward R Dougherty
Journal:  Bioinformatics       Date:  2004-11-30       Impact factor: 6.937

3.  Gene-expression profiling in breast cancer.

Authors:  Tor-Kristian Jenssen; Eivind Hovig
Journal:  Lancet       Date:  2005 Feb 19-25       Impact factor: 79.321

4.  Prediction of cancer outcome with microarrays: a multiple random validation strategy.

Authors:  Stefan Michiels; Serge Koscielny; Catherine Hill
Journal:  Lancet       Date:  2005 Feb 5-11       Impact factor: 79.321

5.  Microarrays and molecular research: noise discovery?

Authors:  John P A Ioannidis
Journal:  Lancet       Date:  2005 Feb 5-11       Impact factor: 79.321

6.  Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery.

Authors:  Leslie R Grate
Journal:  BMC Bioinformatics       Date:  2005-04-13       Impact factor: 3.169

