Quentin Noirhomme (1), Damien Lesenfants (1), Francisco Gomez (2), Andrea Soddu (3), Jessica Schrouff (4), Gaëtan Garraux (5), André Luxen (5), Christophe Phillips (6), Steven Laureys (1)
1. Cyclotron Research Centre, University of Liège, Liège, Belgium; Coma Science Group, Neurology Department, University Hospital of Liège, Liège, Belgium.
2. Complexus Group, Computer Science Department, Universidad Central de Colombia, Bogotá, Colombia.
3. Department of Physics & Astronomy, Brain and Mind Institute, University of Western Ontario, London, ON, Canada.
4. Cyclotron Research Centre, University of Liège, Liège, Belgium; Laboratory of Behavioral and Cognitive Neurology, Stanford University, Palo Alto, USA.
5. Cyclotron Research Centre, University of Liège, Liège, Belgium.
6. Cyclotron Research Centre, University of Liège, Liège, Belgium; Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium.
Abstract
Multivariate classification is used in neuroimaging studies to infer brain activation or in medical applications to infer diagnosis. Classification results are often assessed through either a binomial or a permutation test. Here, we simulated classification results of generated random data to assess the influence of the cross-validation scheme on the significance of results. Distributions built from classification of random data with cross-validation did not follow the binomial distribution. The binomial test is therefore not appropriate. In contrast, the permutation test was unaffected by the cross-validation scheme. The influence of the cross-validation was further illustrated on real data from a brain-computer interface experiment in patients with disorders of consciousness and from an fMRI study on patients with Parkinson's disease. Three out of 16 patients with disorders of consciousness had significant accuracy on binomial testing, but only one showed significant accuracy using permutation testing. In the fMRI experiment, the mental imagery of gait could discriminate significantly between idiopathic Parkinson's disease patients and healthy subjects according to the permutation test but not according to the binomial test. Hence, binomial testing could lead to biased estimation of significance and false positive or negative results. In our view, permutation testing is thus recommended for clinical application of classification with cross-validation.
Keywords:
binomial; classification; cross-validation; permutation test
In the last few years, there has been a growing interest in the statistical assessment of classification results in biomedical applications. Machine learning approaches are now increasingly used to study brain function (Etzel et al., 2009; Pereira et al., 2009; Lemm et al., 2011) and have been proposed as a diagnostic and prognostic tool for patients, e.g., in the field of severe brain injury (Phillips et al., 2011; Galanaud et al., 2012; Luyt et al., 2012; Lule et al., 2013) or Parkinson's disease (Focke et al., 2011; Orru et al., 2012; Schrouff et al., 2012; Garraux et al., 2013; Schrouff et al., 2013). Such classification machines have also been designed for many other applications such as analyzing DNA microarrays and predicting tumor subtype and clinical outcome (Golub et al., 1999; Simon et al., 2003). Limitations and controversies of these approaches have been recently highlighted in a study using brain–computer interfaces (BCIs) to unravel signs of consciousness in patients with disorders of consciousness (Cruse et al., 2011; Goldfine et al., 2013). A statistically significant classification accuracy is one where we can reject the null hypothesis that there is no information about task, patient diagnosis or outcome in the data from which it is being predicted. In a two-class problem with an equal number of elements in each class, e.g., disease vs. no-disease, the theoretical chance level, which is valid in the case of an infinite number of trials, is 50%. In practice, we only have a limited number of trials, which can be on the order of 10 because of patient fatigue. If a specific set of features can classify the data with, for example, 58% accuracy, the question is whether this accuracy is trustworthy. To tackle this issue, several approaches have been proposed in the literature.
A frequently used method is based on the binomial distribution (Müller-Putz et al., 2008; Pereira et al., 2009; Billinger et al., 2013). With a limited number of trials, the results of a classifier are seen as the results of tossing a coin, which under the null hypothesis can be modeled as a Bernoulli trial with probability p = 50% of success. The probability of achieving k successes out of N independent trials is given by the binomial distribution. Knowing the distribution and a given p-value, we can compute a lower bound for any classification accuracy. If the lower bound is higher than the chance level, we can reject the hypothesis that the accuracy was obtained by chance. Here, we are only interested in accuracies higher than the chance level. We are not interested in the chance of coincidental deviations below the expected 0.50 because we would not claim that our features contain information in that case. Another approach is based on the Pearson chi-square coefficient (Kubler and Birbaumer, 2008). However, this approach is not reliable for small numbers of trials, as is often the case in the neuroimaging and electrophysiology literature (Pereira et al., 2009), and it matches the binomial test for larger numbers of trials (Howell, 2012).
Alternatively, permutation test based methods (Good, 2005) have been employed (Mukherjee et al., 2003; Etzel et al., 2009; Pereira et al., 2009; Schrouff et al., 2013b). A permutation test is a non-parametric test that has also been proposed as a substitute to the Student t-test in functional neuroimaging (Nichols and Holmes, 2002) and electrophysiology (Maris and Oostenveld, 2007) experiments. A permutation test estimates the distribution of the null hypothesis from the data. Assuming that there is no class information in the data, the labels are randomly permuted and the accuracy is computed with the new labels. As the new labels are random, the new accuracy estimate is expected to reflect the chance distribution. The permutation is repeated hundreds to thousands of times. Then, the p-value is given by the fraction of the permutation distribution that is larger than or equal to the accuracy actually observed when using the correct labels.
To estimate classification accuracy, ideally, the original data are split into two independent, complementary subsets: a training set (which is used to train the classifier and to define all parameters) and a testing set (which is used to validate the results). In practice, with small datasets, a cross-validation (CV) scheme is often used. The process of splitting the data into two is repeated several times using different partitions. The results obtained from all partitions are then averaged (Lemm et al., 2011). The classification accuracy can then be tested. Following common practice (Pereira et al., 2009; Pereira and Botvinick, 2011), the accuracy estimate obtained through a CV could be treated as if it came from a single classifier. In that case, the binomial test sees all accuracies as independent.
In the following, we will show on simulated and real data that the CV scheme has an effect on the calculation of the chance level and that this influence is accounted for by the permutation test but not by the binomial test. We will first present results from simulated data illustrating the influence of the CV scheme. Next, we will exemplify how this may influence the “diagnosis” of patients with disorders of consciousness on real data from a previous EEG-based brain–computer interface (BCI) study (Lule et al., 2013). We will then further illustrate the influence with an fMRI study on activation patterns in Parkinson's disease (Cremers et al., 2012; Schrouff et al., 2012, 2013a). Finally, we will discuss some hypotheses underlying the observed differences between classification testing methods. Our simulations make simplifying assumptions (e.g., about the type of features), and our examples from real data do not cover all possible data sources and classification approaches, but the issues presented here are quite general and apply to studies employing a cross-validation scheme to estimate classification accuracy.
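The binomial reasoning described above can be illustrated with a minimal Python sketch. It computes the exact one-sided binomial tail probability and the smallest number of correct trials that reaches significance, assuming independent trials. The function names are illustrative only, and this exact test is not the approximated bound used later in the paper, so thresholds obtained this way may differ slightly from the reported ones.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): one-sided p-value for k correct trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_successes(n, p=0.5, alpha=0.05):
    """Smallest number of correct trials whose one-sided p-value is below alpha."""
    for k in range(n + 1):
        if binomial_tail(k, n, p) < alpha:
            return k
    return None  # with very few trials, no count may reach significance


if __name__ == "__main__":
    # Two-class problem with 100 independent trials and a 50% chance level.
    print(binomial_tail(58, 100), critical_successes(100))
    # Four-class problem with 14 trials and a 25% chance level.
    print(critical_successes(14, p=0.25))
```

Under these assumptions, 58 correct trials out of 100 do not reach p < .05, whereas 59 do. The sketch is only valid when the trials are independent, which is precisely the assumption that cross-validation puts into question.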
Material and methods
Simulated data
To test the validity of the binomial and permutation tests to assess classification accuracy, we generated random datasets for a two-class problem. We simulated three cases. First, we tested several scenarios with a low number of features and trials. Second, we tested the influence of the number of repetitions of the CV scheme. Third, we tested scenarios with a high number of features and a low number of trials, as is often the case in the neuroimaging literature. The generation of the random data and the classifiers used built-in MATLAB (The MathWorks, Natick, MA, USA) functions (rand, randperm, classify) and libsvm functions (Chang and Lin, 2011). Datasets were generated with 10,000 simulations. Each simulation included two sets with an equal number of trials. Trial number was 100, 50 or 30. Trials of the 100 trial set (respectively 50 and 30 trial sets) had 40 features (respectively 20 and 10). Features and labels were randomly assigned 0 and 1 (rand function thresholded at .5). We tested four CV schemes. In an ideal CV scheme, all possible partitions of the data should be tested. This is the case for the leave-one-out (LOO) CV, but in practice it is computationally intractable for classical N-fold CV schemes. Nevertheless, repeating the N-fold CV several times with different partitions is recommended and can reduce the variance of the estimator (Efron and Tibshirani, 1997; Etzel et al., 2009; Lemm et al., 2011). The CV schemes were LOO, 10-fold, 5-fold and 2-fold CVs. The first three are the most used and recommended in the literature (e.g., Lemm et al., 2011). The 2-fold CV is an extreme case at the opposite end from the LOO CV. A linear discriminant analysis and a support vector machine (Burges, 1998) with linear kernel classified the data.
To compute the binomial lower bound, the binomial distribution is often approximated by a normal distribution, for example to compute the Wald interval or adjusted Wald interval (Kohavi, 1995; Martin and Hirschberg, 1996; Berrar et al., 2006; Billinger et al., 2013). However, the approximation of the binomial distribution by the normal distribution is only valid when the number of trials N and the accuracy p satisfy the following condition: N × p × (1 - p) > 5 (Berrar et al., 2006). In the absence of problem-specific knowledge, the best choice for estimation of the bound is derived from Jeffreys' Beta distribution (Martin and Hirschberg, 1996; Berrar et al., 2006). This approximation is adequate for N ≥ 10 (Martin and Hirschberg, 1996). The binomial lower bound (?) was computed using Jeffreys' Beta distribution (Berrar et al., 2006) as follows:
where N is the number of trials, m is the number of successful trials, a is the estimated accuracy and z is the z-score (1.65 for a one-sided test with p < .05; 2.33 for p < .01).
The permutation test (Good, 2005) was based on 999 permutations plus the original accuracy (Ojala and Garriga, 2010). Only accuracies higher than 0.5 were assessed using permutation testing. We did not compute the permutation test for accuracies smaller than or equal to 0.50 because we would not claim that our classifications contain information in that case. The permutation test consisted of randomly exchanging the labels and classifying the data with the CV scheme. The p-value was calculated as the number of values of the permutation distribution equal to or higher than the result from the original data, divided by the number of permutations.
In a first experiment, 12 datasets were built, three for each of the four CV schemes with 100, 50 or 30 trials, and with 10,000 simulations each. Every simulation involved two subsets with an equal number of trials and features. First, the classification accuracy of the trials from the first subset obtained with linear discriminant analysis was assessed with a chosen CV scheme (Fig. 1A). The distribution of accuracies obtained from all simulations was called the CV distribution. Second, to build an empirical binomial distribution, all trials from the first subset were used to train a classification algorithm which was applied to the second, independent, subset (Fig. 1B). A third distribution, the CV-independent distribution, was built by applying a mixed CV scheme where the N-1 training folds came from the first subset and the test fold came from the second subset (Fig. 1C). At each step of the CV, the classifier trained on N-1 folds from the first subset was applied to a fold from the second subset. Differences between computed distributions and the binomial distribution were assessed with a chi-square goodness-of-fit test (Howell, 2012). Results were considered significant at p < .05 with a Bonferroni correction for multiple comparisons.
In a second experiment, we further tested the influence of the number of repetitions of the CV scheme on the binomial test. Datasets with 10,000 simulations, each containing 100 trials with 40 features, were generated as explained above. The CV schemes were tested without repetition and with 5, 10 and 20 repetitions. A linear discriminant analysis classified the data.
In a third experiment, we tested the influence of the number of features. To evaluate the binomial test, datasets with 10,000 simulations, each containing 100 trials, were generated as explained above. We tested the classification accuracy with 40, 100, 400, 1000 and 4000 features. Configurations with more features than trials are common in neuroimaging studies. To better accommodate the increasing number of features, a support vector machine with linear kernel classified the data. Classification accuracy was estimated with LOO CV. To evaluate the permutation test, we generated datasets with 1000 simulations, each containing 100 trials. Classification accuracy was estimated with a support vector machine and a LOO CV. The difference in the number of simulations is due to the computation time of the permutation test. “1000 simulations” with the permutation test means about fifty million classifications. Each simulation generates 100 classifications with the LOO CV. On average, half of the simulations have a classification accuracy above 0.5 and are tested for significance with a permutation test (500 simulations × (1 + 999 permutations) × 100 classifications with the LOO CV). The other half are not tested for significance (500 simulations × 100 classifications). In contrast, “10,000 simulations” with the binomial test means only 1 million classifications (10,000 simulations × 100 classifications with the LOO CV).
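The computational cost quoted above can be checked with a few lines of arithmetic. The following Python sketch reproduces the counts under the stated assumption that, on average, half of the simulations exceed an accuracy of 0.5 and are therefore submitted to the permutation test. The function and parameter names are illustrative.

```python
def classification_counts(n_sim_perm=1000, n_sim_binom=10_000,
                          n_perm=999, n_loo=100):
    """Count classifier runs needed for the permutation and binomial evaluations."""
    tested = n_sim_perm // 2            # simulations with accuracy above 0.5
    untested = n_sim_perm - tested      # simulations not tested for significance
    # Each tested simulation needs the original run plus n_perm permuted runs,
    # and each run is a leave-one-out CV with n_loo classifications.
    permutation_total = tested * (1 + n_perm) * n_loo + untested * n_loo
    binomial_total = n_sim_binom * n_loo
    return permutation_total, binomial_total


if __name__ == "__main__":
    perm_total, binom_total = classification_counts()
    print(perm_total)   # 50050000, i.e. about fifty million
    print(binom_total)  # 1000000
```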
Fig. 1
For each simulation, three distributions of accuracies were computed. The CV distribution (A) was computed through the estimation of accuracy with a CV scheme. Here a 5-fold CV with repetition is used as an example. The empirical binomial distribution (B) was computed by training the classification algorithm on the first subset and testing on the second, independent, subset. In the CV-independent distribution (C), the classification algorithm was trained on N-1 folds from the first subset and the accuracy estimated on one fold from the second subset.
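The construction of a CV distribution (panel A) can be sketched in Python as follows. This is a simplified illustration and not the simulation code of the study: it uses a nearest-class-mean rule as a stand-in for linear discriminant analysis, a leave-one-out CV, and the standard library random generator instead of the MATLAB functions. All function names are hypothetical, and the numbers it produces are not expected to reproduce the reported tables.

```python
import random


def nearest_mean_predict(train_x, train_y, x):
    """Assign x to the class whose feature mean is closest (stand-in for LDA)."""
    dists = {}
    for label in (0, 1):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        if not rows:
            continue  # a class may be absent from a very small training set
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dists[label] = sum((a - b) ** 2 for a, b in zip(x, mean))
    return min(dists, key=dists.get)


def loo_accuracy(xs, ys):
    """Leave-one-out CV accuracy: each trial is predicted from all the others."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_mean_predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


def cv_distribution(n_sim, n_trials, n_features, seed=0):
    """Accuracies obtained on random binary features with random binary labels."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_sim):
        xs = [[rng.randint(0, 1) for _ in range(n_features)]
              for _ in range(n_trials)]
        ys = [rng.randint(0, 1) for _ in range(n_trials)]
        accuracies.append(loo_accuracy(xs, ys))
    return accuracies


def fraction_above(accuracies, threshold):
    """Share of simulated accuracies at or above a significance threshold."""
    return sum(a >= threshold for a in accuracies) / len(accuracies)


if __name__ == "__main__":
    accs = cv_distribution(n_sim=200, n_trials=30, n_features=10)
    print(fraction_above(accs, 0.65))
```

Comparing `fraction_above` with the nominal significance level indicates whether a given threshold is too liberal or too conservative for the chosen CV scheme and classifier.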
Brain–computer interface diagnostic application
In a recent study (Lule et al., 2013), we used a stepwise linear discriminant analysis (LDA) to classify data from an EEG-based brain–computer interface (BCI) experiment with severely brain-damaged patients who had survived a coma. The experiment aimed at correctly diagnosing non-responding patients by determining if they were able to respond to command using a motor-independent BCI method. Response to command differentiates patients in a minimally conscious state from patients in a vegetative state/unresponsive wakefulness syndrome (Laureys and Schiff, 2012). We studied 16 severely brain-damaged patients who had survived a coma. Thirteen were diagnosed with minimally conscious state (aged 42 ± 21 years, 9 males, 5 of traumatic etiology, mean time postinjury 70 ± 109 months) and three patients were in a vegetative state/unresponsive wakefulness syndrome (aged 61 ± 17 years, 2 males, 2 with anoxic etiology, time postinjury 10 ± 15 months). An auditory P3 four-choice speller paradigm was used (Sellers and Donchin, 2006; Furdea et al., 2009). Patients were presented with four stimuli (“yes”, “no”, “stop”, “go”) in a random sequence. Each trial encompassed 15 presentations of four words (60 words in total). The order of presentation was pseudo-randomized (sound duration: 400 ms; inter-stimulus interval: 600 ms, a trial lasting about 1 min). The participants' task was to count the number of times a target, either “yes” or “no”, was presented. Stimulus presentation and data collection were controlled by the BCI2000 software (Schalk et al., 2004). The EEG was recorded using an Ag/AgCl electrode cap with 16 channels (F3, Fz, F4, T7, T8, C3, Cz, C4, Cp3, Cp4, P3, Pz, P4, PO7, PO8, and Oz) based on the international 10–20 system (Sharbrough et al., 1991). Each channel was referenced to the right and grounded to the left mastoid. The recordings were divided into a training session and a question session. The training session lasted 4 trials, and participants were instructed to concentrate on either the “yes” or the “no” word. During the question session, participants had to respond to 10 questions with known answers using the BCI. Amplitude values from particular channel locations and time samples were classified with a stepwise linear discriminant analysis method (Farwell and Donchin, 1988; Donchin et al., 2000; Krusienski et al., 2006). Offline, all data were pooled together and a LOO scheme was used to determine the classification accuracy of each participant. Of the 16 patients, 3 obtained an accuracy above chance level following the binomial test (accuracy equal to or above 50% for a theoretical chance level at 25% and 14 trials). Two patients obtained an accuracy of 50% (7/14 questions) and one reached 57% (8/14 questions). These 3 patients were in a minimally conscious state. Here, we reassessed the previously published data with a permutation test (999 permutations) with a LOO CV and a 2-fold CV with 10 repetitions. We used a 2-fold CV scheme as it was one of the few possible partitions of 14 trials and quite different from the LOO CV. The labels of the data were randomly exchanged within each trial. Results were considered significant at p < .05.
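The permutation procedure used for this reassessment can be sketched generically in Python. The sketch below is illustrative and makes several assumptions: it shuffles labels across trials (whereas in the BCI data labels were exchanged within each trial), it uses a one-dimensional nearest-class-mean rule with LOO CV instead of the stepwise LDA, and it counts the original accuracy as one member of the permutation distribution, giving p = (number of values ≥ observed) / (permutations + 1). Function names are hypothetical.

```python
import random


def loo_nearest_mean_accuracy(xs, ys):
    """LOO accuracy of a 1-D nearest-class-mean rule (illustrative classifier)."""
    hits = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        means = {}
        for label in set(ys):
            vals = [v for j, (v, l) in enumerate(zip(xs, ys))
                    if j != i and l == label]
            if vals:
                means[label] = sum(vals) / len(vals)
        pred = min(means, key=lambda l: abs(x - means[l]))
        hits += pred == y
    return hits / len(xs)


def permutation_p_value(score_fn, xs, ys, n_perm=999, seed=0):
    """Return the observed score and its permutation p-value.

    The whole CV procedure (score_fn) is rerun for every permuted labelling,
    so the null distribution reflects the CV scheme that was actually used.
    """
    rng = random.Random(seed)
    observed = score_fn(xs, ys)
    count = 1  # the original labelling is part of the distribution
    for _ in range(n_perm):
        shuffled = list(ys)
        rng.shuffle(shuffled)
        if score_fn(xs, shuffled) >= observed:
            count += 1
    return observed, count / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical, clearly separable toy data: two groups of five trials.
    xs = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
    ys = [0] * 5 + [1] * 5
    print(permutation_p_value(loo_nearest_mean_accuracy, xs, ys))
```

Because the scoring function wraps the complete cross-validation, swapping in a 2-fold CV with repetitions only requires passing a different `score_fn`.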
Discriminant BOLD activation patterns in Parkinson's disease
Recently, we used BOLD fMRI to study the brain activation pattern underlying mental imagery of walking in idiopathic Parkinson's disease as compared with healthy controls (Cremers et al., 2012; Schrouff et al., 2012; Schrouff et al., 2013). Behavioral and brain imaging data acquisition and processing have been described in Cremers et al. (2012). In brief, participants enrolled in this study were 14 patients (8 males; aged 65.1 ± 9.5 years) diagnosed with idiopathic Parkinson's disease (Hughes et al., 1992) with different degrees of severity of gait disturbances and 15 controls matched for age (63.8 ± 8.1 years) and gender (7 males). Before fMRI, all participants were trained to walk comfortably and then briskly on a 25 m path and to mentally rehearse themselves walking on the path. Brain activity changes were recorded using BOLD fMRI during three main experimental conditions: mental imagery of standing (STAND), walking at a comfortable pace (COMF) and walking briskly (BRISK). Eight trials of each condition (12 for BRISK to account for shorter trial duration) were randomly presented within and between subjects. The COMF and BRISK conditions were self-paced, subjects indicating when they had completed each trial by a key press, while each trial of the STAND condition was constrained by the duration of the previous COMF trial. fMRI data preprocessing and first-level univariate analyses were performed using SPM8 as previously reported (Cremers et al., 2012). Three images per subject were generated from these first-level fMRI analyses representing BOLD signal changes associated with STAND, COMF and BRISK conditions, respectively.
We aimed to assess whether the multivariate analysis of these images using binary SVM (Burges, 1998) as implemented in PRoNTo could be used to accurately discriminate patients from controls. A leave-one-subject-per-group-out CV was performed to compute model performance, its significance being assessed by a permutation test using 1000 permutations. Either all voxels within the brain served as features (140,305 voxels), or only voxels from the areas involved in gait (both in healthy subjects and in patients), as described in Table 1 of Maillet et al. (2012) (“motor mask”, 45,825 voxels). The between-group classification was based on either an individual task (e.g., STAND in controls vs. STAND in patients) or a combination of tasks (e.g., BRISK + COMF in controls vs. BRISK + COMF in patients). Here, we reassessed the previously published data with a binomial test. Results were considered significant at p < .05.
Results
Results of the first experiment on evaluating binomial and permutation tests with a two-class problem and a low number of features and trials are shown in Fig. 2 for the 10-fold CV with 100 trials and 40 features, and in Table 1 for the LOO, 10-fold, 5-fold and 2-fold CVs with 100, 50 and 30 trials. The binomial lower bound is 59% accuracy for 100 trials with significance level p < .05 (62% accuracy at p < .01) independently of the CV scheme. For the simulations with 50 and 30 trials, the lower bound at p < .05 was, respectively, 62% and 65% (67% and 71% at p < .01). All CV distributions differed significantly from the binomial distribution (chi-square goodness-of-fit test, p < .05). LOO CV produced the widest distribution. More than 8% of the accuracy values from random data were above the binomial accuracy lower bound at p < .05, and 3% at p < .01. The 10 × 10-fold CV also produced a wider distribution than the binomial distribution. The 10 × 5-fold CV distribution was closest to the binomial distribution. The 10 × 2-fold CV produced a distribution narrower than the binomial distribution with 0–1% of the random data above the binomial accuracy lower bound at p < .05 and 0% above the lower bound at p < .01. For the permutation test, the percentage of p-values below .05 and .01 was at most 5% and 1%, respectively, for all CV schemes. For all datasets, the empirical distribution matched the binomial distribution. The CV-independent distribution matched the binomial distribution with the LOO scheme. For all other schemes, the CV-independent distribution differed significantly from the binomial distribution (Fig. 3).
Fig. 2
Histogram of the distribution of the classification accuracy (bars; left axis) and p-values from the permutation test (for accuracy > .5; dots; right axis) for 10,000 simulations with 100 trials, 40 features, 10 × 10-fold cross-validation. The vertical thick line illustrates the binomial test lower bound and the horizontal thick line shows the permutation test accuracy level at p < .05.
Table 1
Percentage of the 10,000 simulations with results thresholded for significance at p < .05 and p < .01 for the binomial and permutation tests. The simulations included either 100, 50 or 30 trials with, respectively, 40, 20 or 10 random features and randomly assigned binary labels. Lower bound thresholds for binomial test were computed using Jeffreys' priors. Permutation tests used 999 permutations. Cross validation schemes included leave-one-out (LOO), 10-fold, 5-fold and 2-fold cross validations. Folding and computing classification was repeated 10 times with different folds.
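| CV scheme | # of trials | Binomial, p < .05 | Binomial, p < .01 | Permutation, p < .05 | Permutation, p < .01 |
|---|---|---|---|---|---|
| LOO | 100 | 8% | 3% | 4% | 1% |
| LOO | 50 | 10% | 3% | 4% | 1% |
| LOO | 30 | 9% | 3% | 4% | 1% |
| 10 × 10-fold | 100 | 7% | 2% | 5% | 1% |
| 10 × 10-fold | 50 | 7% | 2% | 5% | 1% |
| 10 × 10-fold | 30 | 7% | 2% | 5% | 1% |
| 10 × 5-fold | 100 | 5% | 1% | 5% | 1% |
| 10 × 5-fold | 50 | 4% | 1% | 5% | 1% |
| 10 × 5-fold | 30 | 5% | 1% | 5% | 1% |
| 10 × 2-fold | 100 | 0% | 0% | 5% | 1% |
| 10 × 2-fold | 50 | 1% | 0% | 5% | 1% |
| 10 × 2-fold | 30 | 1% | 0% | 5% | 1% |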
Fig. 3
Cumulative distribution functions (CDFs) of classification accuracy values obtained using a classifier trained on N-1 folds of one subset and applied to a fold from an independent subset. Classification accuracy values were obtained from 10,000 simulations with 100 trials and 40 features. The leave-one-out independent CDF overlaps with the binomial CDF. Note that the 10-, 5- and 2-fold independent CVs show a narrower distribution.
In the second experiment, the distributions built from the 4 CV schemes without repetition were wider than the binomial distribution (Fig. 4), with the LOO CV showing the largest deviation. Repeating the CV narrowed the cumulative distribution function (CDF) of the 10-fold (Fig. 5), 5-fold and 2-fold CVs, resulting in a mixed effect. The number of repetitions had an influence up to 10 repetitions; increasing the number of repetitions to 20 changed the distribution only slightly. In the third experiment, the distributions estimated with LOO CV and 100 trials narrowed as the number of features increased. The binomial test evolved from being not conservative enough to being too conservative (Table 2). For the permutation test, the percentage of p-values below .05 and .01 was at most 5% and 1%, respectively, for all numbers of features (Table 3).
Fig. 4
Cumulative distribution functions for the binomial, leave-one-out, 10-fold, 5-fold and 2-fold cross-validations for 10,000 simulations of 100 trials with 40 features without repetition.
Fig. 5
Cumulative distribution functions for the binomial and the 10-fold cross-validated data with 1, 5, 10 and 20 repetitions for 10,000 simulations of 100 trials with 40 features.
Table 2
Percentage of the 10,000 simulations with results thresholded for significance at p < .05 and p < .01 for the binomial tests. The simulations included 100 trials with random features and randomly assigned binary labels. Lower bound thresholds for binomial test were computed using Jeffreys' priors. Classification accuracy was estimated with a support vector machine with linear kernel and a leave-one-out cross validation.
| # of features | p < .05 | p < .01 |
|---|---|---|
| 40 | 13% | 7% |
| 100 | 7% | 3% |
| 400 | 8% | 3% |
| 1000 | 6% | 2% |
| 4000 | 5% | 2% |
| 10,000 | 2% | 1% |
Table 3
Percentage of the 1000 simulations with results thresholded for significance at p < .05 and p < .01 for permutation tests. The simulations included 100 trials with random features and randomly assigned binary labels. Permutation tests used 999 permutations. Classification accuracy was estimated with a support vector machine with linear kernel and a leave-one-out cross validation.
| # of features | p < .05 | p < .01 |
|---|---|---|
| 40 | 4% | 1% |
| 100 | 4% | 1% |
| 400 | 5% | 1% |
| 1000 | 4% | 1% |
| 4000 | 4% | 1% |
| 10,000 | 4% | 1% |
As presented in the original paper, three patients had accuracies of 50%, 50% and 57% with the LOO CV. These three accuracies are above the binomial lower bound (above or equal to 7/14 compared to a theoretical chance level at 25%). Their permutation p-values were .06, .08 and .03, respectively. When the three patients' data were reanalyzed with the 2-fold CV, they obtained accuracies of 6%, 31% and 39%. All accuracies were below the binomial lower bound but the permutation p-values were .94, .17 and .046, respectively. The histograms of permuted accuracy for the patient with the highest accuracy for the LOO and 2-fold CV are reported in Fig. 6. Both histograms peak at 0.25, which is the theoretical chance level. The use of a 2-fold CV narrowed the histogram. The binomial significance threshold at p < .05 (50%) was too wide for the LOO CV (8% of the data above the limit) and too narrow for the 10 × 2-fold CV (less than 1% of the data above the limit) as compared to the accuracies obtained by permutation testing.
Fig. 6
Clinical data obtained from a patient with significant diagnostic accuracy using a BCI. Histogram of accuracies obtained with permutation testing for the leave-one-out cross-validation (black) and the 2-fold cross-validation (gray). The vertical line shows the binomial lower bound (50%) for significant accuracy at p = .05.
Overall, using all brain voxels led to a poor discrimination between idiopathic Parkinson's disease patients and controls. The binomial lower bound for 29 trials with equal probability of both classes is 66%. The estimated balanced accuracies with the different configurations of features were all below the binomial lower bound (Table 4). However, using the permutation test, one combination of features (BRISK + COMF) with accuracy reaching 62% was significant. The normalized weights from the classifier had a good overlap with the results from the univariate analysis (Schrouff et al., 2013). Slightly better results were obtained while decreasing the number of features with the motor mask, as shown by a higher balanced accuracy for the BRISK–COMF combination, as well as for the BRISK condition both significant at p < 0.05 with the permutation and binomial tests.
Table 4
Balanced accuracy for the idiopathic Parkinson's disease patient vs. control classification for each combination of the three tasks. Significant results with the permutation test are displayed with an *. No result was significant with the binomial test.
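| Condition | Balanced accuracy (%), whole brain | Balanced accuracy (%), motor mask |
|---|---|---|
| STAND | 14 | 35 |
| COMF | 58 | 62 |
| BRISK | 59 | 66* |
| STAND + BRISK | 37 | 36 |
| STAND + COMF | 36 | 40 |
| COMF + BRISK | 62* | 66* |
| STAND + COMF + BRISK | 43 | 48 |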
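Table 4 reports balanced accuracies. The source does not spell out the formula, so the following Python sketch assumes the usual definition, namely the mean of the per-class accuracies, which prevents the larger group from dominating the estimate when group sizes differ (here 14 patients and 15 controls). The predictions in the usage example are hypothetical and serve only to show the computation.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies (assumed definition of balanced accuracy)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # 14 patients (label 1) and 15 controls (label 0), as in the fMRI study.
    y_true = [1] * 14 + [0] * 15
    # Hypothetical predictions: 7 patients and all 15 controls correct.
    y_pred = [1] * 7 + [0] * 7 + [0] * 15
    print(balanced_accuracy(y_true, y_pred))  # 0.75
```

With these hypothetical predictions the plain accuracy would be 22/29, whereas the balanced accuracy weights the patient group (50% correct) and the control group (100% correct) equally.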
Discussion
Our results on artificially generated random data and real clinical data illustrate that the CV scheme has an influence on the statistical significance of obtained classification accuracies. This influence seems to bias results from binomial testing. The permutation test took the cross-validation scheme into account and was therefore not biased. We hypothesize that the observed differences between CV distribution and binomial distribution are due to counterbalancing factors. A first factor is the decreased independence among trials, a key assumption of the binomial testing, in CV scheme. The influence of this factor is well illustrated in the extreme case of the LOO scheme or using CV without repetition. A second factor is that the repetition of the CV scheme virtually increases the number of test examples. This is illustrated through the change in the CV-independent distributions. The number of repetitions and the CV-scheme both influence the size of the test set. In turn, the size of the test set influences the significance of the test, as a random classifier is less likely to maintain the same level of accuracy on an extended test set. This has been previously shown for the permutation test (Mukherjee et al., 2003) and is reproduced here using a real dataset. The 2-fold CV with 10 repetitions had a narrower distribution than the LOO CV (Fig. 6); therefore smaller accuracy could be significant. The reported simulated data here also illustrate this effect for the distribution of classification accuracies. A third factor is the number features. Increasing the number of features narrowed the CV distribution in our third simulation study. This effect was also illustrated in the Parkinson disease dataset where, despite the use of a LOO CV, the permutation distribution was narrower than the binomial distribution. High number of features makes the classifier more prone to generalization problem. 
The classifier has more chance to pick features that correlate well with training data but not with test data. The final accuracy is therefore less likely to be high. A feature selection method to reduce dimensionality (Lemm et al., 2011), a priori knowledge, or a regularization method may help reduce over-fitting. In our fMRI dataset, physiological a priori information helped reduce the feature set and improved the classification. The feature selection or regularization method must be included in the CV loop and may also influence the CV distribution. Another factor which may influence the distribution of classification accuracies is the classifier. We show that the distributions built with LOO CV and with LDA and SVM classifiers yielded slightly different results for simulated data with 100 trials and 40 features.
It is important to stress that the results and conclusions presented here were obtained on small datasets, but with numbers of trials often found in neuroimaging or brain–computer interface studies. These results are not in line with current common practice (Pereira et al., 2009; Pereira and Botvinick, 2011), which treats the accuracy obtained through cross-validation as if it came from an independent dataset and then tests it in exactly the same way. One more point to take into consideration with small datasets is the stability of the classifier. The independence of accuracies obtained through cross-validation holds as long as the classifier is stable under the perturbation induced by deleting one of the folds from the data in a cross-validation scheme (Kohavi, 1995). A classifier is stable for a given dataset and set of perturbations if it makes the same prediction with the perturbed datasets. This is most probably not the case for small datasets.
How large a dataset should be to prevent these issues should be the subject of further studies.
Using a permutation test is more demanding than binomial testing, as the classification must be repeated hundreds of times. The number of permutations has an influence on the shape of the distribution. However, the p-value can be monitored to limit the number of permutations, computing all permutations only for a value around or below the level of significance and stopping the test much earlier for the others (Mukherjee et al., 2003; Ojala and Garriga, 2010). In the case of the two real datasets presented here with linear discriminant analysis and support vector machine classifiers, the permutation test took only a few seconds. With other classifiers, e.g., Gaussian processes, the computation time may be much longer. If the permutation test has to be applied independently on all voxels of an image, this could take a considerable time (thousands of voxels times a few seconds). Furthermore, it has been mentioned that a large number of permutations may be required to get p-values in a range that would survive multiple comparison correction (Pereira and Botvinick, 2011). Building a unique distribution for all voxels (Nichols and Holmes, 2002) or using a cluster-based permutation test (Maris and Oostenveld, 2007) may circumvent that problem.
In the BCI dataset (Lule et al., 2013), one patient had significant accuracy with the permutation test. In ‘clinical’ settings, with a predefined and validated threshold of accuracy, this would mean that the patient demonstrated command following, an important landmark for a diagnosis related to consciousness. In a scientific ‘study’, where the aim is to validate the approach, which is the case in the original and the present papers, we would protect ourselves against false claims, i.e., stating that the patient followed the command when he did not.
If we test 20 patients with a threshold based on a p-value < .05, one patient may have positive results just by chance. In the original study, 16 patients were included. We therefore corrected for multiple comparisons via the false-discovery rate (Benjamini and Hochberg, 1995) with significance at p < .05, and no patient survived the corrected threshold (Goldfine et al., 2013). The final threshold for clinical application should not depend on the number of patients tested, as this number would keep increasing, changing the threshold continuously (Cruse et al., Brain Injury). This threshold should be based on the accuracies obtained on an extended cohort of patients and healthy controls and should balance the sensitivity and specificity of the method. This threshold should depend on the number of trials and the obtained accuracy. The quality of the data may also be taken into account but must be checked prior to any classification. The threshold may be adapted if the test is repeated or combined with results from other tests.
Here, we tested only a limited number of validation schemes. We have not tested bootstrapping (Efron and Tibshirani, 1997) or Monte-Carlo CV (Picard and Cook, 1984). These approaches should be tested in future studies, even if the latter most probably has the same properties as the k-fold CV with repetitions. Furthermore, our results do not extend to validation on an independent dataset, which is still the gold standard for validating classification accuracy and is recommended whenever possible; unfortunately, this is not practical in the two diagnostic cases presented here: brain–computer interface applied to the detection of consciousness and the mental imagery of gait in idiopathic Parkinson's disease patients. With an independent validation set, the binomial test is perfectly valid. A small test set may also be tested multiple times with classifiers trained on slightly different subsets of the training set.
The repetition of testing should virtually increase the size of the test set, as illustrated by our CV-independent distribution. Regarding the selection of a CV scheme, the first priority should be to decrease the variance and the bias of the estimated classification accuracies. For a good compromise, the use of 10-fold or 5-fold CVs is often recommended (Lemm et al., 2011).
Here, we tested only a limited number of parameters (number of trials and features) and presented results for two classifiers. However, we believe that one example is enough to demonstrate that the distribution of accuracies obtained by classifying random data with a CV scheme does not follow a binomial distribution.
To conclude, the CV scheme has an influence on the distribution of classification accuracies. This influence biases the binomial testing. Therefore, a permutation test is recommended, especially when dealing with small sample sizes and non-independent CV schemes, as is often the case in clinical datasets.
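As an illustration of the recommendation above, the following minimal sketch contrasts a one-sided binomial test with a label-permutation test in which the whole cross-validation is rerun for every permutation. It is not the code used in this study: the function names (`binomial_p_value`, `cv_accuracy`, `permutation_p_value`) are hypothetical, the toy nearest-mean classifier merely stands in for the LDA and SVM classifiers, and the random data, number of folds, and number of permutations are arbitrary assumptions. Only the Python standard library is used.

```python
import math
import random


def binomial_p_value(n_correct, n_trials, chance=0.5):
    """One-sided P(X >= n_correct) for X ~ Binomial(n_trials, chance)."""
    return sum(
        math.comb(n_trials, k) * chance**k * (1 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def nearest_mean_predict(train_X, train_y, test_X):
    """Toy stand-in classifier: assign each test point to the closest class mean."""
    means = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Ties are broken in favour of the smallest label.
    return [min(sorted(means), key=lambda lab: dist(x, means[lab])) for x in test_X]


def cv_accuracy(X, y, n_folds, rng):
    """Accuracy pooled over the folds of one k-fold CV run with a random split."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        pred = nearest_mean_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in fold]
        )
        correct += sum(p == y[i] for p, i in zip(pred, fold))
    return correct / len(y)


def permutation_p_value(X, y, n_folds=5, n_permutations=200, seed=0):
    """Permutation test: shuffle the labels, rerun the full CV, compare accuracies."""
    rng = random.Random(seed)
    observed = cv_accuracy(X, y, n_folds, rng)
    null = []
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        null.append(cv_accuracy(X, shuffled, n_folds, rng))
    # The observed labelling counts as one permutation (+1 in both terms).
    p = (1 + sum(a >= observed for a in null)) / (n_permutations + 1)
    return observed, p, null


# Usage on pure-noise data (assumed sizes: 40 trials, 5 features, 2 balanced classes).
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [0, 1] * 20
observed, p_perm, null = permutation_p_value(X, y)
p_binom = binomial_p_value(round(observed * len(y)), len(y))
print(f"CV accuracy: {observed:.2f}")
print(f"permutation p-value: {p_perm:.3f}  binomial p-value: {p_binom:.3f}")
print(f"null accuracies range from {min(null):.2f} to {max(null):.2f}")
```

Because the null distribution is built by passing permuted labels through the same CV procedure as the observed labels, any dependence among folds is present in both, which is the property the permutation test relies on. The binomial p-value is computed only for comparison and assumes independent test trials.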