
Unbiased bootstrap error estimation for linear discriminant analysis.

Thang Vu^1, Chao Sima^2, Ulisses M Braga-Neto^1,2, Edward R Dougherty^1,2.

Abstract

Convex bootstrap error estimation is a popular tool for classifier error estimation in gene expression studies. A basic question is how to determine the weight for the convex combination between the basic bootstrap estimator and the resubstitution estimator such that the resulting estimator is unbiased at finite sample sizes. The well-known 0.632 bootstrap error estimator uses asymptotic arguments to propose a fixed 0.632 weight, whereas the more recent 0.632+ bootstrap error estimator attempts to set the weight adaptively. In this paper, we study the finite sample problem in the case of linear discriminant analysis under Gaussian populations. We derive exact expressions for the weight that guarantee unbiasedness of the convex bootstrap error estimator in the univariate and multivariate cases, without making asymptotic simplifications. Using exact computation in the univariate case and an accurate approximation in the multivariate case, we obtain the required weight and show that it can deviate significantly from the constant 0.632 weight, depending on the sample size and Bayes error for the problem. The methodology is illustrated by application on data from a well-known cancer classification study.


Keywords: Bias; Bootstrap; Error estimation; Gene expression classification; Linear discriminant analysis

Year: 2014; PMID: 28194165; PMCID: PMC5270504; DOI: 10.1186/s13637-014-0015-0

Source DB: PubMed; Journal: EURASIP J Bioinform Syst Biol; ISSN: 1687-4145

1 Introduction

The bootstrap method [1]–[7] has been used in a wide range of statistical problems. The asymptotic behavior of bootstrap has been studied [8]–[11], while small-sample properties have been studied under simplifying assumptions, such as considering the estimator based on all possible bootstrap samples (the ‘complete’ bootstrap) [12]–[14]. The small-sample properties of the usual bootstrap are not well understood, in particular when it comes to estimating the error rates of classification rules [15],[16]. There has been, on the other hand, interest in the application of bootstrap to error estimation in classification problems and, in particular, gene expression classification studies [17]–[20]. Of particular interest is the issue of classifier error estimation [21],[22]. Bootstrap methods have generally been shown to outperform more traditional error estimation techniques, such as resubstitution and cross-validation, in terms of root-mean-square (RMS) error [4],[5],[7],[23]–[35].

Bootstrap error estimation is typically performed via a convex combination of the (generally) pessimistic basic bootstrap estimator, known as the zero bootstrap, and the (generally) optimistic resubstitution estimator. A basic problem is how to choose the weight that yields an unbiased estimator. The problem of unbiased convex error estimation was previously considered in [36]–[38] for a convex combination of resubstitution and cross-validation estimators, and in [4],[7],[23] for a combination between resubstitution and the basic bootstrap estimator. In the former case, a fixed suboptimal weight of 0.5 was proposed in [36],[38], while an asymptotic analysis to find the optimal weight was provided in [37]. In the latter case, our case of interest, a fixed suboptimal weight of 0.632 was proposed in [4], leading to the well-known 0.632 bootstrap estimator, while in [7], a suboptimal weight is computed by means of a sample-based procedure, which attempts to counterbalance the effect of overfitting on the bias, leading to the so-called 0.632+ bootstrap error estimator; the problem of finding the optimal weight for finite sample cases was addressed via a numerical approach in [23].

Here, we determine the optimal weight for finite sample cases analytically, in the case of linear discriminant analysis under Gaussian populations. In the univariate case, no other assumptions are made. In the multivariate case, it is assumed that the populations are homoskedastic and that the common covariance matrix is known and used in the discriminant. In either case, no simplifications are introduced to the bootstrap error estimator; it is the usual one, based on a finite number of random bootstrap samples. The analysis in this paper follows in the steps of previous papers that have provided analytical representations for the moments of error-estimator distributions [39],[40].

In the univariate case, exact expressions are given for the expectation of the zero bootstrap error estimator, in the general heteroskedastic (general-variance) Gaussian case. By using similar expressions for the expected true and resubstitution error [39], this allows the exact calculation of the required weight. In the multivariate case, the expectation of the zero bootstrap error estimator is expressed as a probability involving the ratio of two noncentral chi-square variables, in the homoskedastic Gaussian case, assuming that the true common covariance matrix is used in the discriminant. The resulting expression is exact but necessitates approximation for its numerical computation. This is done in this paper via the Imhof-Pearson three-moment method, which is accurate in small-sample cases [41]. Use of similar expressions for the expected true and resubstitution error [40] then allows the exact calculation of the required weight.

In the homoskedastic case, the required weight for unbiasedness is shown to be a function only of the Bayes error and sample size. Accordingly, plots and tables of the required weight for varying values of Bayes error and sample size are presented; if the Bayes error can be estimated for a problem, this provides a way to obtain the optimal weight to use. In the univariate case, it was observed that as the sample size increases, the optimal weight settles on an asymptotic value of around 0.675, thus slightly over the heuristic value 0.632; by contrast, in the multivariate case (d=2), the asymptotic value appears to be strongly dependent on the Bayes error, being as a rule significantly smaller than 0.632, except for very small Bayes error.

This paper is organized as follows. The ‘Bootstrap classification’ section defines linear discriminant analysis as well as its application under bootstrap sampling. The ‘Bootstrap error estimation’ section reviews convex bootstrap error estimation. The ‘Unbiased bootstrap error estimation’ section contains the main theoretical results in the paper, providing the analytical expressions for the computation of the required convex bootstrap weight in the univariate and multivariate cases. The ‘Gene expression classification example’ section contains a demonstration of the usage of the optimal weight in bootstrap error estimation using data from the breast cancer classification study in [42],[43]. Lastly, the ‘Conclusions’ section contains a summary and concluding remarks. All the proofs are presented in the Appendix.

2 Bootstrap classification

Classification involves a predictor vector $X \in \mathbb{R}^d$, also known as a feature vector, which represents an individual from one of two populations $\Pi_0$ and $\Pi_1$ (we consider here only this binary classification problem). The classification problem is to assign $X$ correctly to its population of origin. The populations are coded into a discrete label $Y \in \{0,1\}$. Therefore, given a feature vector $X$, classification attempts to predict the corresponding value of the label $Y$. We assume that there is a joint feature-label distribution $F$ for the pair $(X,Y)$ characterizing the classification problem. In particular, it determines the probabilities $c_0 = P(X \in \Pi_0) = P(Y=0)$ and $c_1 = P(X \in \Pi_1) = P(Y=1)$, which are called the prior probabilities. Given a fixed sample size $n$, the sample data is an i.i.d. sample $S_n = \{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ from $F$. The population-specific sample sizes are given by $n_0 = \sum_{i=1}^{n} I_{Y_i=0}$ and $n_1 = \sum_{i=1}^{n} I_{Y_i=1} = n - n_0$, which are random variables, with $n_0 \sim \mathrm{Binomial}(n,c_0)$ and $n_1 \sim \mathrm{Binomial}(n,c_1)$. When we need to emphasize that $n_0$ and $n_1$ are random variables, we will use capital letters $N_0$ and $N_1$, respectively. This sampling design, which is the most commonly found one in contemporary pattern recognition, is known as mixture sampling [44].

A classification rule $\Psi_n$ is used to map the training data $S_n$ into a designed classifier $\psi_n = \Psi_n(S_n)$, where $\psi_n$ is a function taking on values in the set $\{0,1\}$, such that $X$ is assigned to population $\Pi_0$ or $\Pi_1$ according to whether $\psi_n(X)=0$ or 1, respectively. The classification error rate $\varepsilon_n$ of classifier $\psi_n$ is the probability that the assignment is erroneous:

$$\varepsilon_n = P(\psi_n(X) \neq Y \mid S_n) = c_0\,\varepsilon_n^0 + c_1\,\varepsilon_n^1, \qquad (1)$$

where $(X,Y)$ is an independent test point and $\varepsilon_n^i = P(\psi_n(X) = 1-i \mid Y=i, S_n)$ is the error rate specific to population $\Pi_i$, for i=0,1. Since the training set $S_n$ is random, $\varepsilon_n$ is a random variable, with expected classification error rate $E[\varepsilon_n]$; this gives the average performance over all possible training sets $S_n$, for fixed sample size $n$.

Linear discriminant analysis (LDA) employs Anderson’s W discriminant [45], which is defined as follows:

$$W(X) = \left(X - \frac{\bar X_0 + \bar X_1}{2}\right)^{T} \Sigma^{-1} \left(\bar X_0 - \bar X_1\right), \qquad (2)$$

where

$$\bar X_0 = \frac{1}{n_0}\sum_{i=1}^{n} X_i\, I_{Y_i=0}, \qquad \bar X_1 = \frac{1}{n_1}\sum_{i=1}^{n} X_i\, I_{Y_i=1} \qquad (3)$$

are the sample means relative to each population, and $\Sigma$ is a matrix, which can be either (1) the true common covariance matrix of the populations, assuming it is known (this is the approach followed, for example, in [39],[40],[46]), or (2) the sample covariance matrix based on the pooled sample $S_n$, which leads to the general LDA case. In this paper, we will assume case (1) throughout. The corresponding LDA classifier is given by

$$\psi_n(X) = \begin{cases} 1, & W(X) \le 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (4)$$

that is, the sign of $W(X)$ determines the classification of $X$.

A bootstrap sample $S_n^*$ contains $n$ instances drawn uniformly, with replacement, from $S_n$. Hence, some of the instances in $S_n$ may appear multiple times in $S_n^*$, whereas others may not appear at all. Let $C$ be a vector of size $n$, where the i-th component $C(i)$ equals the number of appearances in $S_n^*$ of the i-th instance in $S_n$. The vector $C$ will be referred to as a bootstrap vector. For a given $S_n$, the vector $C$ uniquely determines a bootstrap sample, which we denote by $S_n^C$. Note that the original sample itself is included: if $C = (1,\ldots,1)$, then $S_n^C = S_n$, since each original instance appears once in the bootstrap sample. Note also that the number of distinct bootstrap samples, i.e., values for $C$, is equal to $\binom{2n-1}{n}$; even for small $n$, this is a large number. For example, the total number of possible bootstrap samples of size n=20 is larger than $6.8\times 10^{10}$. The vector $C$ has a multinomial distribution with parameters $(n, 1/n, \ldots, 1/n)$,

$$P(C = (i_1,\ldots,i_n)) = \frac{1}{n^n}\,\frac{n!}{i_1!\cdots i_n!}, \qquad i_1 + \cdots + i_n = n. \qquad (5)$$

Starting from a classification rule $\Psi_n$, one may design a classifier $\psi_n^C = \Psi_n(S_n^C)$ on a bootstrap training set $S_n^C$. Its classification error is given as in (1), namely,

$$\varepsilon_n^C = P(\psi_n^C(X) \neq Y \mid S_n, C) = c_0\,\varepsilon_n^{C,0} + c_1\,\varepsilon_n^{C,1},$$

where $\varepsilon_n^{C,i} = P(\psi_n^C(X) = 1-i \mid Y=i, S_n, C)$ is the error rate specific to population $\Pi_i$, for i=0,1. In this paper, we apply this scheme to the LDA classification rule defined previously. Notice the distinction between a bootstrap LDA classifier and a ‘bagged’ (bootstrap-aggregated) LDA classifier [47],[48]; these correspond to distinct classification rules. The bootstrap LDA classifier is employed here as an auxiliary tool to analyze the problem of unbiased bootstrap error estimation for the plain LDA classifier.
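The counting argument and the multinomial distribution of the bootstrap vector can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical and not taken from the paper, and the bootstrap vector is drawn by sampling n indices with replacement, which is one standard way to realize a Multinomial(n; 1/n, …, 1/n) draw.

```python
import math
import random


def num_bootstrap_samples(n):
    """Number of distinct bootstrap vectors C: multisets of size n drawn from n items."""
    return math.comb(2 * n - 1, n)


def draw_bootstrap_vector(n, rng):
    """Draw C by sampling n indices uniformly with replacement and counting appearances."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts


def bootstrap_vector_prob(c):
    """Multinomial probability of a given bootstrap vector c (entries must sum to n)."""
    n = sum(c)
    coef = math.factorial(n)
    for k in c:
        coef //= math.factorial(k)  # exact: every partial quotient is an integer
    return coef / n ** n


rng = random.Random(0)  # fixed seed so the sketch is reproducible
c = draw_bootstrap_vector(20, rng)
print(num_bootstrap_samples(20))          # count of distinct bootstrap samples for n = 20
print(sum(c), c.count(0))                 # total count is n; zeros mark left-out points
print(bootstrap_vector_prob([1] * 20))    # probability of recovering the original sample
```

The count for n=20 exceeds 6.8×10^10, which is why exhaustive enumeration of bootstrap vectors is only practical for very small samples.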

3 Bootstrap error estimation

Since the feature-label distribution is typically unknown, the classification error rate $\varepsilon_n$ has to be estimated by a sample-based statistic $\hat\varepsilon_n$, commonly referred to as an error estimator. Data in practice are often limited, and the training sample $S_n$ has to be used both for designing the classifier $\psi_n$ and as the basis for the error estimator $\hat\varepsilon_n$. The simplest and fastest way to estimate the error of a designed classifier $\psi_n$ is to compute its error on the sample data itself:

$$\hat\varepsilon_n^{\,r} = \frac{1}{n}\sum_{i=1}^{n} \left|Y_i - \psi_n(X_i)\right|.$$

This resubstitution estimator, or apparent error, is often optimistically biased, that is, it is often the case that $E[\hat\varepsilon_n^{\,r}] < E[\varepsilon_n]$, though this is not always so. The bias tends to worsen with more complex classification rules [49].

The basic bootstrap error estimator is the zero bootstrap error estimator [4], which is introduced next. Given the training data $S_n$, $B$ bootstrap samples are randomly drawn from it. Denote the corresponding (random) bootstrap vectors by $\{C_1,\ldots,C_B\}$. The zero bootstrap error estimator is defined as the average error committed by the $B$ bootstrap classifiers on sample points that do not appear in the bootstrap samples:

$$\hat\varepsilon_n^{\,boot} = \frac{\sum_{j=1}^{B}\sum_{i=1}^{n} \left|Y_i - \psi_n^{C_j}(X_i)\right| I_{C_j(i)=0}}{\sum_{j=1}^{B} n(C_j)}, \qquad (7)$$

where $n(C)$ is the number of zeros in $C$. The zero bootstrap estimator tends to be pessimistically biased, since the number of distinct training instances available for designing the classifier is on average $(1-e^{-1})n \approx 0.632n$ [23].

In the case of bootstrap error estimation, the standard approach is to form a convex combination of the zero bootstrap with resubstitution,

$$\hat\varepsilon_n^{\,conv} = (1-w)\,\hat\varepsilon_n^{\,r} + w\,\hat\varepsilon_n^{\,boot}, \qquad 0 \le w \le 1. \qquad (8)$$

Selecting the appropriate weight $w=w^*$ leads to an unbiased error estimator, $E[\hat\varepsilon_n^{\,conv}] = E[\varepsilon_n]$. In [4], the weight $w$ is heuristically set to w=0.632 to reflect the average ratio of original training instances that appear in a bootstrap sample. This is known as the 0.632 bootstrap estimator, which has been widely used in machine learning.
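The three estimators reviewed in this section can be sketched for a univariate LDA classifier as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the data are synthetic Gaussian draws, the classifier assigns label 1 when the discriminant is non-positive, and bootstrap samples that miss a class entirely are simply skipped (a convention chosen here for simplicity).

```python
import random


def fit_lda_1d(xs, ys, counts=None):
    """Univariate LDA from class means; counts are bootstrap multiplicities (default all 1)."""
    if counts is None:
        counts = [1] * len(xs)
    w0 = sum(c for c, y in zip(counts, ys) if y == 0)
    w1 = sum(c for c, y in zip(counts, ys) if y == 1)
    if w0 == 0 or w1 == 0:
        return None  # one class is absent from the (bootstrap) sample
    m0 = sum(c * x for c, x, y in zip(counts, xs, ys) if y == 0) / w0
    m1 = sum(c * x for c, x, y in zip(counts, xs, ys) if y == 1) / w1
    mid = (m0 + m1) / 2
    return lambda x: 1 if (x - mid) * (m0 - m1) <= 0 else 0


def resubstitution(xs, ys):
    """Error of the designed classifier on its own training data."""
    psi = fit_lda_1d(xs, ys)
    return sum(abs(y - psi(x)) for x, y in zip(xs, ys)) / len(xs)


def bootstrap_vector(n, rng):
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts


def zero_bootstrap(xs, ys, B, rng):
    """Average error of B bootstrap classifiers on the points left out of each bootstrap sample."""
    errors = 0
    held_out = 0
    for _ in range(B):
        counts = bootstrap_vector(len(xs), rng)
        psi = fit_lda_1d(xs, ys, counts)
        if psi is None:
            continue  # assumption of this sketch: skip degenerate bootstrap samples
        for x, y, c in zip(xs, ys, counts):
            if c == 0:
                errors += abs(y - psi(x))
                held_out += 1
    return errors / held_out


def convex_bootstrap(resub, boot, w=0.632):
    """Convex combination of resubstitution and zero bootstrap with weight w on the latter."""
    return (1 - w) * resub + w * boot


rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(10)] + [rng.gauss(1.0, 1.0) for _ in range(10)]
ys = [0] * 10 + [1] * 10
resub = resubstitution(xs, ys)
boot = zero_bootstrap(xs, ys, B=200, rng=rng)
print(resub, boot, convex_bootstrap(resub, boot))
```

Setting `w=0` recovers resubstitution and `w=1` recovers the zero bootstrap, so the weight interpolates between the optimistic and pessimistic estimators.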

4 Unbiased bootstrap error estimation

The 0.632 bootstrap error estimator reviewed in the previous section is not guaranteed to be unbiased. In this section, we will examine the necessary conditions for setting the weight $w=w^*$ in (8) to achieve unbiasedness. We will then particularize the analysis to the Gaussian linear discriminant case, where exact expressions for $w^*$ will be derived, both in the univariate and multivariate cases.

The bias of the convex estimator in (8) is given by

$$E[\hat\varepsilon_n^{\,conv} - \varepsilon_n] = (1-w)\,E[\hat\varepsilon_n^{\,r}] + w\,E[\hat\varepsilon_n^{\,boot}] - E[\varepsilon_n].$$

Setting this to zero yields the exact weight

$$w^* = \frac{E[\varepsilon_n] - E[\hat\varepsilon_n^{\,r}]}{E[\hat\varepsilon_n^{\,boot}] - E[\hat\varepsilon_n^{\,r}]} \qquad (11)$$

that produces an unbiased error estimator. Now, applying expectation on both sides of (7) produces

$$E[\hat\varepsilon_n^{\,boot}] = \sum_{C} p(C)\, E[\varepsilon_n^{C} \mid C], \qquad (12)$$

where $p(C)$ is given by (5) and the sum is taken over all possible values of $C$ (an efficient procedure for listing all multinomial vectors is provided by the NEXCOM routine given in [50], Chapter 5). Equations (11) and (12) allow the computation of the weight $w^*$ given the knowledge of $E[\varepsilon_n]$, $E[\hat\varepsilon_n^{\,r}]$, and $E[\varepsilon_n^{C} \mid C]$. We will present next exact formulas for these expectations in the case of the LDA classification rule under Gaussian populations.
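Because the bias of the convex estimator is linear in the weight, the unbiasedness condition can be solved directly. The sketch below illustrates this with hypothetical function names and made-up expectation values (0.30, 0.24, 0.33) chosen only for the arithmetic; they are not results from the paper.

```python
def convex_bias(w, e_true, e_resub, e_boot):
    """Bias of the convex estimator: (1 - w) E[resub] + w E[zero bootstrap] - E[true error]."""
    return (1 - w) * e_resub + w * e_boot - e_true


def unbiased_weight(e_true, e_resub, e_boot):
    """Weight that makes the convex estimator unbiased, given the three expectations."""
    denom = e_boot - e_resub
    if denom == 0:
        raise ValueError("zero bootstrap and resubstitution have the same expectation")
    return (e_true - e_resub) / denom


# Hypothetical expectations, for illustration only.
e_true, e_resub, e_boot = 0.30, 0.24, 0.33
w_star = unbiased_weight(e_true, e_resub, e_boot)
print(w_star)                                        # close to 2/3 for these inputs
print(convex_bias(w_star, e_true, e_resub, e_boot))  # essentially zero
print(convex_bias(0.632, e_true, e_resub, e_boot))   # residual bias of the fixed weight
```

The weight lies in [0, 1] only when the expected true error falls between the expected resubstitution and zero bootstrap errors, which is the typical optimistic/pessimistic situation described in the text.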

4.1 Univariate case

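In the univariate case, the common variance term cancels and the W statistic and LDA classifier become greatly simplified, with

$$W(X) = (X - \bar X)(\bar X_0 - \bar X_1), \qquad \psi_n(X) = \begin{cases} 1, & (X - \bar X)(\bar X_0 - \bar X_1) \le 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (13)$$

where $\bar X = (\bar X_0 + \bar X_1)/2$.

The following functions will be useful. Let $\Phi(u) = P(Z \le u)$ and $\Phi(u,v;\rho) = P(Z_1 \le u, Z_2 \le v)$, where $Z$ is a zero-mean, unit-variance Gaussian random variable, and $Z_1$, $Z_2$ are zero-mean, unit-variance random variables that are jointly Gaussian distributed, with correlation coefficient $\rho$.

Assume that population $\Pi_i$ is distributed as $N(\mu_i,\sigma_i^2)$, for i=0,1, where $\sigma_0 \neq \sigma_1$ in general. Under these conditions, John obtained in [39] an exact expression for the expectation of the true classification error for fixed sample sizes $n_0$ and $n_1$ (this is known as separate sampling [44]). John’s result, Equations (14) and (15), gives $E[\varepsilon_n^0 \mid N_0=n_0]$ in terms of the functions $\Phi(u)$ and $\Phi(u,v;\rho)$. The corresponding result for $E[\varepsilon_n^1 \mid N_0=n_0]$ is obtained by simply interchanging all indices 0 and 1 in those expressions. The expected error rate can then be found by using conditioning and Equation (1):

$$E[\varepsilon_n] = \sum_{n_0=0}^{n} P(N_0=n_0)\left(c_0\,E[\varepsilon_n^0 \mid N_0=n_0] + c_1\,E[\varepsilon_n^1 \mid N_0=n_0]\right), \qquad (16)$$

where

$$P(N_0=n_0) = \binom{n}{n_0} c_0^{\,n_0} c_1^{\,n-n_0}. \qquad (17)$$

As for resubstitution, Hills provided in [51] exact expressions for the expected error for fixed $n_0$ and $n_1$. However, his expression applies only to the case $\sigma_0=\sigma_1$. Theorem 3 in [52] provides a generalization of this result to the case of populations of unequal variances. First, note that

$$\hat\varepsilon_n^{\,r} = \frac{n_0}{n}\,\hat\varepsilon_n^{\,r,0} + \frac{n_1}{n}\,\hat\varepsilon_n^{\,r,1}, \qquad (18)$$

where $\hat\varepsilon_n^{\,r,0}$ and $\hat\varepsilon_n^{\,r,1}$ are the apparent error rates specific to class 0 and 1, respectively. The result in [52], Equations (19) to (21), gives $E[\hat\varepsilon_n^{\,r,0} \mid N_0=n_0]$. The corresponding result for $E[\hat\varepsilon_n^{\,r,1} \mid N_0=n_0]$ is obtained by interchanging all indices 0 and 1. The expected resubstitution error rate can then be found by using conditioning and Equation (18):

$$E[\hat\varepsilon_n^{\,r}] = \sum_{n_0=0}^{n} P(N_0=n_0)\left(\frac{n_0}{n}\,E[\hat\varepsilon_n^{\,r,0} \mid N_0=n_0] + \frac{n_1}{n}\,E[\hat\varepsilon_n^{\,r,1} \mid N_0=n_0]\right). \qquad (22)$$

Finally, let us consider the expected bootstrap error. Given $C$, the bootstrap LDA classifier is obtained by replacing $\bar X_i$ by $\bar X_i^C$, i=0,1, in (13):

$$\psi_n^C(X) = \begin{cases} 1, & (X - \bar X^C)(\bar X_0^C - \bar X_1^C) \le 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (23)$$

where $\bar X^C = (\bar X_0^C + \bar X_1^C)/2$ and

$$\bar X_0^C = \frac{\sum_{i=1}^{n} C(i)\,X_i\,I_{Y_i=0}}{\sum_{i=1}^{n} C(i)\,I_{Y_i=0}}, \qquad \bar X_1^C = \frac{\sum_{i=1}^{n} C(i)\,X_i\,I_{Y_i=1}}{\sum_{i=1}^{n} C(i)\,I_{Y_i=1}} \qquad (24)$$

are bootstrap sample means. Now, note that with $N_0=n_0$ fixed, the training data labels $Y_i$, i=1,…,n, are no longer random. Since all classification rules of interest are invariant to reordering of the training data, we can, without loss of generality, reorder the sample points so that $Y_i=0$ for $i=1,\ldots,n_0$, and $Y_i=1$ for $i=n_0+1,\ldots,n$. Let the same reordering be applied to a given bootstrap vector $C$. The next theorem extends John’s result to the classification error of the bootstrapped LDA classification rule defined by (23).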
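To make the structure of such expressions concrete, the following sketch evaluates the expected class-0 error of the univariate LDA classifier from first principles. It is an independent derivation under stated assumptions, not a transcription of the paper's equations, and its parameterization may differ from theirs. With $U = \bar X_0 - \bar X_1$ and $V = X - (\bar X_0 + \bar X_1)/2$ for a class-0 test point $X$, the pair $(U,V)$ is jointly Gaussian and the error event is $UV \le 0$, so the error equals $\Phi(a) + \Phi(b) - 2\Phi(a,b;\rho)$ for suitable $a$, $b$, $\rho$. The function names and the variance-factor arguments `v0`, `v1` (equal to $1/n_0$ and $1/n_1$ for ordinary sample means) are hypothetical. For a count-weighted mean, the analogous factor would be $\sum C(i)^2 / (\sum C(i))^2$ over the class, which is again an assumption of this sketch. The bivariate Gaussian CDF is computed by Simpson integration to stay within the standard library.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def phi2(a, b, rho, steps=4000):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate Gaussian with correlation rho.

    Uses P = integral over x <= a of pdf(x) * cdf((b - rho x) / sqrt(1 - rho^2)),
    evaluated with Simpson's rule on [-8, min(a, 8)].
    """
    if abs(rho) >= 1:
        raise ValueError("this sketch requires |rho| < 1")
    lo, hi = -8.0, min(a, 8.0)
    if hi <= lo:
        return 0.0
    s = math.sqrt(1 - rho * rho)

    def f(x):
        return _STD.pdf(x) * _STD.cdf((b - rho * x) / s)

    h = (hi - lo) / steps  # steps must be even for Simpson's rule
    total = f(lo) + f(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f(lo + k * h)
    return total * h / 3


def expected_error_class0(mu0, mu1, var0, var1, v0, v1):
    """Expected error on class 0 when class means have variances var0*v0 and var1*v1."""
    sd_u = math.sqrt(var0 * v0 + var1 * v1)                  # sd of U = mean0 - mean1
    sd_v = math.sqrt(var0 * (1 + v0 / 4) + var1 * v1 / 4)    # sd of V = X - midpoint
    a = (mu1 - mu0) / sd_u                                   # P(U < 0) = Phi(a)
    b = (mu1 - mu0) / (2 * sd_v)                             # P(V < 0) = Phi(b)
    rho = (var1 * v1 - var0 * v0) / (2 * sd_u * sd_v)        # corr(U, V)
    return _STD.cdf(a) + _STD.cdf(b) - 2 * phi2(a, b, rho)


# Homoskedastic example with unit separation: small versus very large samples.
small = expected_error_class0(0.0, 1.0, 1.0, 1.0, 1 / 5, 1 / 5)
large = expected_error_class0(0.0, 1.0, 1.0, 1.0, 1e-6, 1e-6)
print(small, large, _STD.cdf(-0.5))
```

As the sample sizes grow, the value approaches the Bayes error $\Phi(-\delta/2)$ of the homoskedastic problem, which gives a quick sanity check on the derivation.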

Theorem 1.

Assume that population $\Pi_i$ is distributed as $N(\mu_i,\sigma_i^2)$, for i=0,1. Then the expected error rate $E[\varepsilon_n^{C,0} \mid N_0=n_0, C]$ of the bootstrap LDA classification rule defined by (23) is given by Equations (25) and (26), with the quantities $s_0$ and $s_1$ defined in (27). The corresponding result for $E[\varepsilon_n^{C,1} \mid N_0=n_0, C]$ is obtained by interchanging all indices 0 and 1.

Proof. See the Appendix.

It is easy to check that the result in Theorem 1 reduces to the one in (14) and (15) when $C=(1,\ldots,1)$. Following (16), we can then write

$$E[\varepsilon_n^{C} \mid C] = \sum_{n_0=0}^{n} P(N_0=n_0)\left(c_0\,E[\varepsilon_n^{C,0} \mid N_0=n_0, C] + c_1\,E[\varepsilon_n^{C,1} \mid N_0=n_0, C]\right). \qquad (28)$$

The expected bootstrap error rate $E[\hat\varepsilon_n^{\,boot}]$ can now be computed via (12). The weight $w^*$ for unbiased bootstrap error estimation can now be computed exactly by means of Equations (11), (12), (14) to (17), (20) to (22), and (25) to (28).

In the special case $\sigma_0=\sigma_1=\sigma$ (homoskedasticity), it follows easily from the previous expressions that $E[\varepsilon_n]$, $E[\hat\varepsilon_n^{\,r}]$, and $E[\hat\varepsilon_n^{\,boot}]$ depend only on the sample size $n$ and on the Mahalanobis distance between the populations $\delta = |\mu_1-\mu_0|/\sigma$, and therefore so does the weight $w^*$, through (11). Since the optimal (Bayes) classification error in this case is $\varepsilon^* = \Phi(-\delta/2)$, there is a one-to-one correspondence between Bayes error and the Mahalanobis distance. Therefore, in the homoskedastic case, the weight $w^*$ is a function only of the Bayes error $\varepsilon^*$ and the sample size $n$.

Figure 1 and Table 1 display the value of $w^*$ in the homoskedastic case, for several sample sizes and Bayes errors. In order to extend the plots up to n=200, it is necessary to approximate $E[\hat\varepsilon_n^{\,boot}]$ in (12) by a Monte Carlo procedure; this is done by generating $M = 100\,n^2$ independent random vectors $\{C_i \mid i=1,\ldots,M\}$ and letting $E[\hat\varepsilon_n^{\,boot}] \approx \frac{1}{M}\sum_{i=1}^{M} E[\varepsilon_n^{C_i} \mid C_i]$. We find that this value of $M$ is large enough to obtain an accurate approximation. All other quantities are computed exactly, as described previously.

One can see in Figure 1a that $w^*$ varies widely and can be very far from the heuristic 0.632 weight; however, as the sample size increases, $w^*$ appears to settle around an asymptotic fixed value. This asymptotic value is approximately 0.675, being thus slightly larger than 0.632. In addition, Figure 1b allows one to see that convergence to the asymptotic value is faster for smaller Bayes errors. These facts help explain the good performance of the original convex 0.632 bootstrap error estimator with moderate sample sizes and small Bayes errors.
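The one-to-one correspondence between the Bayes error and the Mahalanobis distance in the homoskedastic case, ε∗=Φ(−δ/2), is easy to invert numerically. The sketch below uses Python's standard `statistics.NormalDist`; the function names are hypothetical. It can be used, for example, to find the separation δ that corresponds to a given row of Table 1.

```python
from statistics import NormalDist

_STD = NormalDist()


def bayes_error(delta):
    """Bayes error of the homoskedastic Gaussian problem with Mahalanobis distance delta."""
    return _STD.cdf(-delta / 2)


def mahalanobis_from_bayes(eps_star):
    """Invert eps* = Phi(-delta / 2); valid for 0 < eps* <= 0.5."""
    if not 0 < eps_star <= 0.5:
        raise ValueError("Bayes error must lie in (0, 0.5]")
    return -2 * _STD.inv_cdf(eps_star)


for eps in (0.025, 0.200, 0.450):
    delta = mahalanobis_from_bayes(eps)
    print(eps, round(delta, 4), round(bayes_error(delta), 6))
```

Small Bayes errors correspond to well-separated populations (large δ), while a Bayes error close to 0.5 corresponds to nearly coincident populations.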

Figure 1

Univariate case. Required weight w∗ for unbiased convex bootstrap estimation plotted against (a) sample size and (b) Bayes error.

Table 1

Univariate case: required weight for unbiased convex bootstrap estimation

	n=10	n=20	n=30	n=40	n=50	n=60	n=70	n=80	n=90	n=100
ε^∗=0.025	0.724	0.687	0.679	0.675	0.674	0.672	0.671	0.671	0.670	0.670
ε^∗=0.050	0.736	0.696	0.685	0.680	0.678	0.676	0.674	0.673	0.672	0.672
ε^∗=0.075	0.738	0.701	0.689	0.683	0.679	0.677	0.676	0.674	0.674	0.673
ε^∗=0.100	0.729	0.704	0.691	0.684	0.681	0.678	0.677	0.675	0.674	0.673
ε^∗=0.125	0.708	0.701	0.692	0.686	0.682	0.679	0.677	0.676	0.675	0.674
ε^∗=0.150	0.681	0.692	0.693	0.687	0.683	0.680	0.678	0.677	0.676	0.675
ε^∗=0.175	0.646	0.670	0.688	0.687	0.683	0.680	0.678	0.677	0.676	0.675
ε^∗=0.200	0.625	0.631	0.673	0.683	0.683	0.681	0.679	0.677	0.676	0.675
ε^∗=0.225	0.614	0.574	0.639	0.671	0.679	0.680	0.679	0.677	0.676	0.675
ε^∗=0.250	0.617	0.516	0.579	0.635	0.663	0.673	0.676	0.677	0.676	0.675
ε^∗=0.275	0.641	0.470	0.498	0.563	0.617	0.648	0.664	0.671	0.673	0.674
ε^∗=0.300	0.676	0.459	0.425	0.464	0.523	0.577	0.616	0.641	0.656	0.665
ε^∗=0.325	0.724	0.487	0.393	0.379	0.405	0.451	0.502	0.548	0.587	0.614
ε^∗=0.350	0.780	0.549	0.422	0.356	0.331	0.334	0.356	0.389	0.428	0.469
ε^∗=0.375	0.837	0.639	0.505	0.412	0.350	0.310	0.288	0.280	0.282	0.295
ε^∗=0.400	0.890	0.741	0.626	0.533	0.458	0.398	0.350	0.312	0.283	0.261
ε^∗=0.425	0.935	0.842	0.761	0.690	0.627	0.570	0.519	0.474	0.434	0.399
ε^∗=0.450	0.971	0.925	0.884	0.845	0.808	0.772	0.739	0.707	0.676	0.647
	n=110	n=120	n=130	n=140	n=150	n=160	n=170	n=180	n=190	n=200
ε^∗=0.025	0.669	0.669	0.669	0.669	0.669	0.669	0.669	0.668	0.668	0.668
ε^∗=0.050	0.671	0.671	0.671	0.671	0.670	0.670	0.670	0.669	0.670	0.669
ε^∗=0.075	0.672	0.672	0.671	0.671	0.671	0.671	0.670	0.670	0.670	0.670
ε^∗=0.100	0.673	0.672	0.672	0.671	0.671	0.671	0.671	0.670	0.670	0.670
ε^∗=0.125	0.673	0.673	0.672	0.672	0.672	0.671	0.671	0.671	0.670	0.670
ε^∗=0.150	0.674	0.673	0.673	0.672	0.672	0.672	0.671	0.671	0.671	0.671
ε^∗=0.175	0.674	0.673	0.673	0.672	0.672	0.672	0.672	0.671	0.671	0.671
ε^∗=0.200	0.674	0.673	0.673	0.673	0.672	0.672	0.672	0.671	0.671	0.671
ε^∗=0.225	0.675	0.674	0.673	0.672	0.672	0.672	0.672	0.672	0.671	0.671
ε^∗=0.250	0.675	0.674	0.673	0.673	0.672	0.672	0.672	0.672	0.671	0.671
ε^∗=0.275	0.674	0.674	0.673	0.673	0.673	0.673	0.672	0.671	0.671	0.671
ε^∗=0.300	0.669	0.671	0.672	0.672	0.672	0.672	0.672	0.672	0.672	0.672
ε^∗=0.325	0.635	0.648	0.657	0.663	0.666	0.668	0.669	0.670	0.671	0.671
ε^∗=0.350	0.508	0.543	0.572	0.597	0.615	0.630	0.642	0.649	0.655	0.660
ε^∗=0.375	0.313	0.337	0.365	0.394	0.425	0.455	0.484	0.511	0.536	0.557
ε^∗=0.400	0.245	0.234	0.229	0.228	0.229	0.235	0.243	0.254	0.268	0.283
ε^∗=0.425	0.367	0.338	0.313	0.290	0.270	0.253	0.238	0.224	0.213	0.203
ε^∗=0.450	0.620	0.594	0.569	0.545	0.522	0.501	0.480	0.461	0.442	0.424


4.2 Multivariate case

Assume that population $\Pi_i$ is distributed as a multivariate Gaussian $N(\mu_i,\Sigma)$, for i=0,1. Under these conditions, John obtained in [39] an exact expression for the expectation of the error of the LDA classification rule, defined by (2) to (4), for the case where $N_0=n_0$ is fixed. This result is stated by Moran in [40] as Equation (29), which gives $E[\varepsilon_n^0 \mid N_0=n_0]$ as a probability involving the ratio of $W_1$ and $W_2$, where $W_1$ and $W_2$ are independently distributed as noncentral chi-square variables with $d$ degrees of freedom ($d$ being the dimensionality) and noncentrality parameters $\lambda_1$ and $\lambda_2$ given in (30), and where $\delta^2 = (\mu_1-\mu_0)^T\Sigma^{-1}(\mu_1-\mu_0)$ is the squared Mahalanobis distance between the populations. The corresponding result for $E[\varepsilon_n^1 \mid N_0=n_0]$ is obtained by interchanging $n_0$ and $n_1$. The expected true error rate can then be found by using (16).

Moran also provided an expression for the expectation of the resubstitution error estimator in the multivariate case, for fixed $N_0=n_0$ [40]. This expression, Equation (31), gives $E[\hat\varepsilon_n^{\,r,0} \mid N_0=n_0]$ as a probability involving $W_3$ and $W_4$, which are independently distributed as noncentral chi-square variables with $d$ degrees of freedom and noncentrality parameters $\lambda_3$ and $\lambda_4$ given in (32). The corresponding result for $E[\hat\varepsilon_n^{\,r,1} \mid N_0=n_0]$ is obtained by interchanging $n_0$ and $n_1$. The expected resubstitution error rate can then be found by using (22).

The bootstrap LDA classifier in the multivariate case is given by

$$\psi_n^C(X) = \begin{cases} 1, & \left(X - \dfrac{\bar X_0^C + \bar X_1^C}{2}\right)^{T}\Sigma^{-1}\left(\bar X_0^C - \bar X_1^C\right) \le 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (33)$$

where $\bar X_0^C$ and $\bar X_1^C$ are defined in (24). The next theorem generalizes John’s result for the multivariate classification error to the case of the bootstrapped LDA classification rule.

Theorem 2.

Assume that population $\Pi_i$ is distributed as $N(\mu_i,\Sigma)$, for i=0,1. Then, the expected error rate $E[\varepsilon_n^{C,0} \mid N_0=n_0, C]$ of the bootstrap LDA classification rule defined by (33) is given by Equation (34), a probability involving $W_5$ and $W_6$, where $W_5$ and $W_6$ are independently distributed as noncentral chi-square variables with $d$ degrees of freedom and noncentrality parameters $\lambda_5$ and $\lambda_6$ given in (35), and where $s_0$ and $s_1$ are defined in (27). The corresponding result for $E[\varepsilon_n^{C,1} \mid N_0=n_0, C]$ is obtained by interchanging $s_0$ and $s_1$.

Proof. See the Appendix.

It is easy to check that the result in Theorem 2 reduces to the one in (29) and (30) when $C=(1,\ldots,1)$. As in the univariate case, Theorem 2 can be used in conjunction with Equations (12) and (28) to compute $E[\hat\varepsilon_n^{\,boot}]$. The weight $w^*$ for unbiased bootstrap error estimation can now be computed exactly by means of Equations (11), (12), (16) to (17), (22), (28), (29) to (32), and (34) to (35).

An issue that arises in the multivariate case is the computation of the probabilities in (29), (31), and (34). This computation is very difficult since it involves the ratio of noncentral chi-square random variables, which has a doubly noncentral F distribution. Computation of this distribution is a hard problem. Moran proposes in [40] a complex procedure, based on work by Price [53], to compute this probability, which only applies to even dimensionality $d$. We employ a simpler procedure, namely, the Imhof-Pearson three-moment method, which is applicable to even and odd dimensionality [41]. This consists of approximating a noncentral random variable with a central random variable, by equating the first three moments of their distributions. This approach was also employed in [52], where it was found to be very accurate. To fix ideas, we consider (29). The Imhof-Pearson three-moment approximation is given by Equation (36), in which $\chi^2_h$ is a central chi-square random variable with $h$ degrees of freedom, with $h$ and $y$ as defined in (37). The approximation is valid only for $c_3>0$ [41]. If $c_3<0$, one uses a modified approximation, where $h$ and $y$ are as in (37). The same approximation method applies to (31) and (34) by substituting the appropriate values.

As in the univariate case, the assumption of a common covariance matrix $\Sigma$ makes the expectations $E[\varepsilon_n]$, $E[\hat\varepsilon_n^{\,r}]$, and $E[\hat\varepsilon_n^{\,boot}]$, and thus also the weight $w^*$, functions only of $n$ and $\delta$. Since $\varepsilon^* = \Phi(-\delta/2)$, this means that the weight $w^*$ is a function only of the Bayes error $\varepsilon^*$ and the sample size $n$.

Figure 2 and Table 2 display the value of $w^*$ computed with the previous expressions in this section, for several sample sizes and Bayes errors. As in the univariate case, $E[\hat\varepsilon_n^{\,boot}]$ in (12) is approximated by a Monte Carlo procedure, with the same number $M = 100\,n^2$ of MC vectors. All other quantities are computed exactly, as described previously, save for the Imhof-Pearson approximation.

We can see in Figure 2 that there is considerable variation in the value of $w^*$ and it can be far from the heuristic 0.632 weight; however, as the sample size increases, $w^*$ appears to settle around an asymptotic fixed value. In contrast to the univariate case, these asymptotic values here appear to be strongly dependent on the Bayes error and are significantly smaller than the heuristic 0.632 except for very small Bayes errors. As in the univariate case, convergence to the apparent asymptotic value is faster for smaller Bayes errors. These facts again help explain the good performance of the original convex 0.632 bootstrap error estimator for moderate sample sizes and small Bayes errors.
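The moment-matching step of the three-moment method can be sketched in a few lines. The code below follows the commonly cited form of Imhof's approximation for a linear combination $Q = \sum_r \lambda_r\,\chi^2_{h_r}(\delta_r^2)$ of independent noncentral chi-square variables, with $c_s = \sum_r \lambda_r^s (h_r + s\,\delta_r^2)$, $h = c_2^3/c_3^2$, and $y = (x - c_1)\sqrt{h/c_2} + h$, so that $P(Q > x) \approx P(\chi^2_h > y)$. It is assumed, not confirmed, that this matches the notation of the paper's equations. A probability on a ratio such as $P(W_1/W_2 < k)$ can be rewritten as $P(W_1 - kW_2 < 0)$, which is a linear combination with coefficients $(1, -k)$. The function name is hypothetical, and the case $c_3 \le 0$ is rejected here rather than handled.

```python
import math


def imhof_pearson_params(x, lambdas, dofs, noncentralities):
    """Match the first three moments of Q = sum_r lambda_r * chi2(dof_r, nc_r).

    Returns (h, y) such that P(Q > x) is approximated by P(chi2_h > y),
    where chi2_h is a central chi-square with h degrees of freedom.
    Only the case c3 > 0 is covered by this sketch.
    """
    c1 = c2 = c3 = 0.0
    for lam, dof, nc in zip(lambdas, dofs, noncentralities):
        c1 += lam * (dof + nc)
        c2 += lam ** 2 * (dof + 2 * nc)
        c3 += lam ** 3 * (dof + 3 * nc)
    if c3 <= 0:
        raise ValueError("c3 <= 0: this sketch covers only the c3 > 0 case")
    h = c2 ** 3 / c3 ** 2
    y = (x - c1) * math.sqrt(h / c2) + h
    return h, y


# Sanity check: a single central chi-square is reproduced exactly (h = dof, y = x).
print(imhof_pearson_params(4.2, [1.0], [3], [0.0]))

# Example: Q = W1 - 0.5 * W2 with d = 2 and illustrative noncentrality parameters.
print(imhof_pearson_params(0.0, [1.0, -0.5], [2, 2], [4.0, 1.0]))
```

The resulting tail probability $P(\chi^2_h > y)$ requires a chi-square survival function with non-integer degrees of freedom; outside the standard library, `scipy.stats.chi2.sf(y, h)` is one option.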

Figure 2

Bivariate case. Required weight w∗ for unbiased convex bootstrap estimation plotted against (a) sample size and (b) Bayes error.

Table 2

Bivariate case: required weight for unbiased convex bootstrap estimation

	n=10	n=20	n=30	n=40	n=50	n=60	n=70	n=80	n=90	n=100
ε^∗=0.025	0.664	0.667	0.679	0.685	0.690	0.693	0.695	0.697	0.698	0.699
ε^∗=0.050	0.666	0.637	0.638	0.639	0.641	0.642	0.642	0.643	0.644	0.644
ε^∗=0.075	0.670	0.617	0.610	0.608	0.606	0.606	0.605	0.605	0.605	0.605
ε^∗=0.100	0.675	0.604	0.590	0.584	0.581	0.578	0.577	0.576	0.575	0.574
ε^∗=0.125	0.682	0.594	0.573	0.564	0.559	0.555	0.553	0.551	0.550	0.548
ε^∗=0.150	0.691	0.588	0.560	0.547	0.539	0.534	0.530	0.528	0.526	0.524
ε^∗=0.175	0.699	0.586	0.554	0.539	0.530	0.524	0.520	0.517	0.515	0.513
ε^∗=0.200	0.718	0.586	0.544	0.524	0.512	0.504	0.498	0.493	0.490	0.487
ε^∗=0.225	0.738	0.592	0.542	0.517	0.502	0.492	0.485	0.479	0.475	0.471
ε^∗=0.250	0.759	0.603	0.545	0.515	0.497	0.485	0.476	0.469	0.464	0.460
ε^∗=0.275	0.784	0.620	0.553	0.518	0.497	0.482	0.471	0.463	0.457	0.452
ε^∗=0.300	0.815	0.647	0.572	0.530	0.503	0.485	0.472	0.462	0.454	0.448
ε^∗=0.325	0.847	0.681	0.598	0.550	0.518	0.496	0.480	0.468	0.458	0.450
ε^∗=0.350	0.882	0.728	0.639	0.584	0.546	0.520	0.500	0.484	0.472	0.462
ε^∗=0.375	0.915	0.784	0.695	0.635	0.592	0.560	0.535	0.516	0.500	0.487
ε^∗=0.400	0.943	0.842	0.763	0.702	0.655	0.619	0.590	0.566	0.546	0.530
ε^∗=0.425	0.971	0.914	0.859	0.811	0.769	0.732	0.701	0.673	0.650	0.629
ε^∗=0.450	0.987	0.960	0.933	0.905	0.879	0.853	0.830	0.807	0.786	0.766
	n=110	n=120	n=130	n=140	n=150	n=160	n=170	n=180	n=190	n=200
ε^∗=0.025	0.700	0.701	0.701	0.702	0.702	0.703	0.703	0.704	0.704	0.704
ε^∗=0.050	0.644	0.645	0.645	0.645	0.645	0.645	0.645	0.646	0.646	0.646
ε^∗=0.075	0.604	0.604	0.604	0.604	0.604	0.604	0.604	0.604	0.604	0.604
ε^∗=0.100	0.574	0.573	0.573	0.573	0.573	0.572	0.572	0.572	0.572	0.572
ε^∗=0.125	0.548	0.547	0.546	0.546	0.545	0.545	0.544	0.544	0.544	0.543
ε^∗=0.150	0.523	0.522	0.521	0.520	0.519	0.518	0.518	0.517	0.517	0.517
ε^∗=0.175	0.511	0.510	0.509	0.508	0.507	0.506	0.506	0.505	0.505	0.504
ε^∗=0.200	0.485	0.483	0.482	0.480	0.479	0.478	0.477	0.477	0.476	0.475
ε^∗=0.225	0.469	0.466	0.464	0.463	0.461	0.460	0.459	0.458	0.457	0.456
ε^∗=0.250	0.457	0.454	0.452	0.449	0.448	0.446	0.445	0.443	0.442	0.441
ε^∗=0.275	0.448	0.444	0.442	0.439	0.437	0.435	0.433	0.432	0.430	0.429
ε^∗=0.300	0.443	0.438	0.435	0.432	0.429	0.426	0.424	0.422	0.420	0.419
ε^∗=0.325	0.444	0.439	0.434	0.430	0.426	0.423	0.421	0.418	0.416	0.414
ε^∗=0.350	0.454	0.447	0.441	0.435	0.431	0.427	0.423	0.420	0.417	0.415
ε^∗=0.375	0.476	0.467	0.459	0.452	0.446	0.441	0.436	0.432	0.428	0.424
ε^∗=0.400	0.516	0.504	0.493	0.484	0.476	0.469	0.462	0.457	0.451	0.447
ε^∗=0.425	0.611	0.594	0.580	0.567	0.555	0.544	0.535	0.526	0.518	0.511
ε^∗=0.450	0.748	0.731	0.715	0.700	0.687	0.674	0.662	0.650	0.640	0.630


5 Gene expression classification example

Here we demonstrate the application of the previous theory by comparing the performance of the bootstrap error estimator using the optimal weight against that of the estimator using the fixed w=0.632 weight. We use gene expression data from the well-known breast cancer classification study in [42], which analyzed expression profiles from 295 tumor specimens, divided into N0=115 specimens belonging to the ‘good-prognosis’ population (class 1 here) and N1=180 specimens belonging to the ‘poor-prognosis’ population (class 0).

Our experiment was set up in the following way. We selected two genes among the previously published 70-gene prognosis profile [43]. These genes were selected for their approximately homoskedastic Gaussian distributions (see Figure 3). Since the real prior probabilities c0 and c1 for the good- and poor-prognosis populations are unknown, we assumed three different scenarios corresponding to c0=1/3, c0=1/2, and c0=2/3 and randomly downsampled one or the other set of specimens to obtain new sample sizes (90,180), (115,115), and (115,68), respectively, so as to reflect the assumed prior probabilities. In each of the three cases, we then drew 2,000 random samples of size n=30 from the pooled data and computed, for each sample, the true error and the resubstitution, basic bootstrap, and convex bootstrap error estimates. Bias and root-mean-square (RMS) error for each estimator were estimated by averaging over the 2,000 repetitions. We considered both the fixed 0.632 weight and the optimal weight prescribed by our analysis. For the latter, we estimated for each value of c0 the Bayes error using the full data set and read from Table 2 the optimal weight corresponding to the estimated Bayes error and sample size n=30.

The results are displayed in Table 3. Despite the approximate nature of the results, given that the simulated training samples are not independent from each other, we can see that the magnitude of the bias and the RMS were always smaller for the estimator using the optimal weight than for the one using the fixed 0.632 weight (all bootstrap estimators vastly outperforming resubstitution).

Figure 3

Table 3

Bias and RMS of estimators considered in the experiment with expression data from genes ‘OXCT’ and ‘WISP1’

c₀	n	ε^∗	E[ε_n]	Resub bias	Resub RMS	Basic boot bias	Basic boot RMS	Opt boot bias	Opt boot RMS	0.632 boot bias	0.632 boot RMS
0.33	30	0.4043	0.4206	−0.0702	0.1061	0.0008	0.0820	−0.0161	0.0803	−0.0253	0.0817
0.50	30	0.3969	0.4266	−0.0719	0.1060	0.0072	0.0830	−0.0116	0.0798	−0.0219	0.0806
0.67	30	0.3893	0.4131	−0.0914	0.1185	−0.0181	0.0878	−0.0355	0.0885	−0.0451	0.0909

Also displayed are the assumed values for the prior probability c0, sample size n, the estimated value of the Bayes error ε∗, and the expected classification error E[ εn].
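Since the bias of a convex combination is the same convex combination of the component biases, the bias columns of Table 3 can be cross-checked against each other. The sketch below does this with the reported values. The helper names are hypothetical, and the "implied weight" is merely what the reported biases suggest after rounding, not a figure stated in the text.

```python
# Reported biases from Table 3, keyed by c0:
# (resubstitution, basic bootstrap, optimal-weight bootstrap, 0.632 bootstrap)
TABLE3_BIAS = {
    0.33: (-0.0702, 0.0008, -0.0161, -0.0253),
    0.50: (-0.0719, 0.0072, -0.0116, -0.0219),
    0.67: (-0.0914, -0.0181, -0.0355, -0.0451),
}


def convex_bias(w, bias_resub, bias_boot):
    """Bias of (1 - w) * resubstitution + w * basic bootstrap."""
    return (1 - w) * bias_resub + w * bias_boot


def implied_weight(bias_conv, bias_resub, bias_boot):
    """Weight w for which the convex bias equals the reported bias_conv."""
    return (bias_conv - bias_resub) / (bias_boot - bias_resub)


for c0, (b_resub, b_boot, b_opt, b_632) in TABLE3_BIAS.items():
    predicted_632 = convex_bias(0.632, b_resub, b_boot)
    w_opt = implied_weight(b_opt, b_resub, b_boot)
    print(c0, round(predicted_632, 4), b_632, round(w_opt, 3))
```

The 0.632 column is reproduced to within rounding, and the implied weights come out close to the Table 2 entry for n=30 and ε∗=0.400.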

Data used in the gene expression experiment. The plot shows the optimal (linear) classifier superimposed on the sample for the genes OXCT and WISP1, from the breast cancer study in [42]. We can see that both populations are approximately Gaussian with equal dispersion. Bad prognosis = red. Good prognosis = blue. Bias and RMS of estimators considered in the experiment with expression data from genes ‘OXCT’ and ‘WISP1’ Also displayed are the assumed values for the prior probability c0, sample size n, the estimated value of the Bayes error ε∗, and the expected classification error E[ εn].
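Because the convex bootstrap estimator is a weighted average of the resubstitution and basic bootstrap estimators, its bias should be the same weighted average of their biases. The short sketch below checks this against the bias entries of Table 3 and back-calculates the weight implied by the optimal-weight column. The back-calculated weight is an inference from rounded table entries, not a value stated in the text, and the helper names are hypothetical.

```python
# Bias entries from Table 3, keyed by c0:
# (resubstitution, basic bootstrap, optimal-weight bootstrap, 0.632 bootstrap)
rows = {
    0.33: (-0.0702, 0.0008, -0.0161, -0.0253),
    0.50: (-0.0719, 0.0072, -0.0116, -0.0219),
    0.67: (-0.0914, -0.0181, -0.0355, -0.0451),
}

def convex_bias(w, resub, boot):
    """Bias of the convex estimator (1 - w) * resub + w * boot."""
    return (1 - w) * resub + w * boot

def implied_weight(target, resub, boot):
    """Weight w for which convex_bias(w, resub, boot) equals target."""
    return (target - resub) / (boot - resub)

checks = {}
for c0, (resub, boot, opt, fixed) in rows.items():
    predicted = convex_bias(0.632, resub, boot)   # should match the 0.632 column
    w_opt = implied_weight(opt, resub, boot)      # weight implied by the Opt column
    checks[c0] = (predicted, fixed, w_opt)
    print(f"c0={c0:.2f}  predicted 0.632 bias={predicted:.4f}  "
          f"reported={fixed:.4f}  implied optimal weight={w_opt:.3f}")
```

With the rounded entries above, the predicted 0.632 biases agree with the reported ones to about four decimal places, and the implied weights come out near 0.76 in all three rows.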

6 Conclusions

Exact expressions were derived for the required weight for unbiased convex bootstrap error estimation in the finite sample case, for linear discriminant analysis of Gaussian populations. The results not only provide the practitioner with a recommendation of what weight to use given the sample size and problem difficulty, but also offer insight into the choice of the 0.632 weight for the classic 0.632 bootstrap error estimator. It was observed that the required weight for unbiasedness can deviate significantly from the 0.632 weight, particularly in the multivariate case, where the required weight for unbiasedness appears to settle on an asymptotic value that is strongly dependent on the Bayes error, being as a rule smaller than 0.632. The results were illustrated by application to gene expression data from a well-known breast cancer study.

7 Appendix

Proof of Theorem 1 Following the same technique used in [40], we write where and . From (303), it is clear that, given C, and are independent Gaussian random variables, such that , for i=0,1, where s1 and s2 are defined in (27). It follows that U and V are jointly Gaussian random variables, with the following parameters: The result then follows after some algebraic manipulation. By symmetry, to obtain , one needs only to interchange all indices 0 and 1. □

Proof of Theorem 2 Following the same technique used in [32], we write where and . It can be readily checked that U+V and U−V are independent Gaussian random vectors, such that where ρ is defined as in (35) and I denotes the identity matrix of dimension d. It follows that are independent noncentral chi-squared random variables with d degrees of freedom and noncentrality parameters λ5 and λ6 defined in (35). The result then follows from (62). Following along the same lines, one can show that is obtained by interchanging s0 and s1 in the result for (the details are omitted for brevity). □
