
An Efficient P300-based BCI Using Wavelet Features and IBPSO-based Channel Selection.

Bahram Perseh, Ahmad R. Sharafat.

Abstract

We present a novel and efficient scheme that selects a minimal set of effective features and channels for detecting the P300 component of the event-related potential in the brain-computer interface (BCI) paradigm. For obtaining a minimal set of effective features, we take the truncated coefficients of discrete Daubechies 4 wavelet, and for selecting the effective electroencephalogram channels, we utilize an improved binary particle swarm optimization algorithm together with the Bhattacharyya criterion. We tested our proposed scheme on dataset IIb of BCI competition 2005 and achieved 97.5% and 74.5% accuracy in 15 and 5 trials, respectively, using a simple classification algorithm based on Bayesian linear discriminant analysis. We also tested our proposed scheme on Hoffmann's dataset for eight subjects, and achieved similar results.


Keywords:  Bayesian linear discriminant analysis; Bhattacharyya distance; brain–computer interface; discrete wavelet; event-related potentials; improved binary particle swarm optimization algorithm

Year:  2012        PMID: 23717804      PMCID: PMC3660708     

Source DB:  PubMed          Journal:  J Med Signals Sens        ISSN: 2228-7477


INTRODUCTION

Brain–computer interface (BCI) provides a direct communication channel between a subject's brain and a computer by using electroencephalogram (EEG) signals.[1] It improves the quality of life for patients who suffer from locked-in syndrome caused by neurological disorders such as amyotrophic lateral sclerosis (ALS). Many existing BCI implementations are based on the P300 wave, which was first shown in [2] to be an event-related potential (ERP) and was later utilized in [3] as a control signal in BCI systems. The P300 wave is a positive deflection in the EEG occurring around 300 ms after a visual or auditory stimulus in normal young adults. The visual P300-BCI is a synchronous device that enables subjects to spell words or request an object by focusing their attention on symbols or images in a matrix displayed on a computer screen. In this BCI protocol, the symbols or images are flashed in random order, and the subject tries to discriminate a desired symbol or image (target) during a random sequence of target and non-target stimuli (the oddball paradigm).[13] In the oddball paradigm, the subject focuses on detecting target events and ignores non-target events. Target events, on average, produce larger P300 potentials than non-target events.[4] Thus, by detecting the P300-ERP pertaining to a target image, the subject's intention can be recognized, and a sequence of such detections can lead to, for instance, spelling a word intended by the subject. Extracting P300-ERPs from the background EEG and environmental noise is the main challenge in ERP analysis: the ERP is a transient signal with a low signal-to-noise ratio (SNR), making it difficult to detect. In spite of this, it is very desirable to correctly detect ERPs while utilizing a minimal number of EEG channels to reduce calculations.
Typically, a P300-based BCI system has four components, namely preprocessing, feature extraction, channel selection, and classification. Although improving any one of these parts can improve the performance of the system as a whole, in this paper we focus on feature extraction and channel selection. In many existing P300-BCI systems, the large number of electrodes and long durations of recorded EEG signals produce extensive data streams and hence a large number of features, which in turn causes over-fitting in the classifier. Using a minimal set of effective features and channels prevents the over-fitting problem and reduces calculations. In existing schemes, either a set of effective features is extracted for a given channel set, as in [5-7], or a set of effective channels is selected for a given feature set, as in [8-11]. However, optimal choices for features and channels are subject-dependent, and may depend on the BCI protocol as well. In this regard, we propose a scheme for joint selection of features and channels for each subject.

Feature Extraction

The discriminating features in P300-ERPs may be time-dependent, frequency-dependent, or time–frequency-dependent. In [8,11], pre-processed signal samples, and in [12], frequency-domain features (Fourier transforms of segmented ERPs) are fed to the classification algorithm. However, since the ERP is a transient signal, time–frequency features are more appropriate. Such features can be obtained by the wavelet transform, an efficient tool for multi-resolution analysis of non-stationary and transient signals. In [13], the continuous wavelet transform (CWT) is used for extracting time–frequency features of the EEG, and Student's t-statistic is applied for choosing the more effective and discriminant features, resulting in significant improvements. One obvious drawback of the CWT is that it requires excessive calculations. The discrete wavelet transform (DWT) is a powerful denoising and feature extraction tool for detecting P300-ERPs in EEG epochs. In [14,15], a Daubechies 4 wavelet is used for removing noise and unwanted frequency components from the EEG of adults and young people. In [9], the DWT is applied to dataset IIb of BCI competition 2005; although the results are relatively accurate, the numbers of channels and features are excessive. In [16], the discriminating features are the coefficients of the DWT of the signal, and a weighted feature vector is used for further improvement; it was noted that the effective features lie in the 1–8 Hz frequency band. In this paper, we take the coefficients in the effective sub-bands of the DWT of EEG signals as their discriminating features, where the effective sub-bands are identified via a five-fold cross-validation procedure.
The mother wavelet is Daubechies 4 (db4), which is suitable for detecting changes in EEG signals.[17] In the MATLAB wavelet toolbox, the beginning of the impulse response of the db4 decomposition low-pass filter and the end of the impulse response of the db4 decomposition high-pass filter are near zero. We force such small values to zero by truncating the corresponding DWT coefficients, which yields a 12% to 30% reduction in the number of features while still producing satisfactory results.
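As a sanity check on the near-zero taps described above, the sketch below counts the small leading taps of the db4 decomposition low-pass filter and the small trailing taps of the high-pass filter. The coefficient values are the standard 8-tap Daubechies-4 filter as tabulated in common wavelet toolboxes (rounded to six decimals here), and the 0.05 "near zero" threshold is an illustrative assumption, not a value from the paper.

```python
# Sketch: db4 decomposition filters and their near-zero taps.
# Coefficients are the standard 8-tap Daubechies-4 low-pass filter
# (rounded); the threshold below is an illustrative assumption.
DEC_LO = [-0.010597, 0.032883, 0.030841, -0.187035,
          -0.027984, 0.630881, 0.714847, 0.230378]
# High-pass filter via the quadrature-mirror relation g[k] = (-1)^k h[N-1-k]
DEC_HI = [((-1) ** k) * DEC_LO[len(DEC_LO) - 1 - k] for k in range(len(DEC_LO))]

THRESH = 0.05  # "near zero" threshold, chosen for illustration

leading_small_lo = 0
for c in DEC_LO:           # count small taps at the start of h_low
    if abs(c) < THRESH:
        leading_small_lo += 1
    else:
        break

trailing_small_hi = 0
for c in reversed(DEC_HI):  # count small taps at the end of h_high
    if abs(c) < THRESH:
        trailing_small_hi += 1
    else:
        break

print(leading_small_lo, trailing_small_hi)  # -> 3 3
```

Both counts come out to three taps, matching the three leading/trailing coefficients that the truncation step discards.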

Channel Selection

In [12], all EEG electrodes (64 channels) are used for signal classification; although this involves a significant amount of calculation, the accuracy of the BCI results is not very satisfactory. To address such shortcomings, various methods have been proposed in the literature to identify the more effective channels. In [8,9], the training data is divided into several partitions (17 partitions in [8] and 10 in [9]), and for each partition, effective channels are obtained by recursively eliminating the less effective channels. The classifier is then applied to each partition, and voting over the classifier outputs detects the P300-ERPs. Although partitioning the training data and using a separate classifier for each partition reduces calculations, further improvements are possible, as we will show later. Another approach is to use the Fisher criterion score (FCS)[18,19] to identify the effective channels: a channel is deemed effective for signal classification if the sum of the FCSs over all features in that channel is high. The main drawback of such methods is that correlated channels that could produce better results may not be selected because of their low individual scores. In contrast, the Bhattacharyya criterion is simpler and is calculated directly from the feature vector of each channel individually. In [20], a binary version of the PSO algorithm is used to select channels from among all EEG channels, which may include correlated channels. Although that method was shown to outperform the sequential floating forward search algorithm, selecting from all channels (without first eliminating the less effective ones) increases calculations with no apparent benefit. We present a two-stage approach for identifying a minimal subset of effective channels.
We begin by sorting the channels by their Bhattacharyya distances in decreasing order and eliminating the 50% of channels with smaller distances. We then identify the more effective channels among the remaining ones using the improved binary particle swarm optimization (IBPSO) algorithm. In this way, we limit the search space and the processing time of the IBPSO algorithm. The rest of this paper is organized as follows. The two P300-BCI datasets that we use are described in Section 2. In Section 3, we present our proposed scheme, which includes preprocessing, feature extraction and minimal feature selection, classification, and the two-step channel selection. Section 4 contains experimental results. Discussion and conclusions are given in Sections 5 and 6, respectively.

P300-BCI Datasets

In order to benchmark our proposed scheme, we use two different P300-BCI datasets, namely the dataset IIb from the third edition of BCI competition 2005 for two subjects,[21] and data recorded in a P300 environment control paradigm by Hoffmann et al.[11] for eight subjects. The protocol of each dataset is briefly explained below.

Dataset 1

The P300 speller paradigm[3] of BCI competition 2005 displays a 6×6 matrix of characters [Figure 1a] to each subject. Each row and each column in the display are flashed at random, and the subject's task is to focus on characters in a given word, one character at a time. Two out of 12 illuminated rows or columns contain the desired letter (in one row and in one column). Thus, one P300-ERP is produced when the row/column of the expected letter is illuminated.[21]
Figure 1

(a) The matrix used in the P300 speller paradigm (b) the position of electrodes

This dataset was recorded for two subjects, A and B. For each subject, 64 channels are sampled at 240 samples per second for 15 trials per character. Figure 1b shows the positions of the EEG electrodes. The recorded EEG is band-pass filtered from 0.1 to 60 Hz. As the 60 Hz cut-off is well above the highest frequency components of the P300, we low-pass filter the dataset signals to further reduce additive noise. The training and testing datasets consist of 85 and 100 characters, respectively. As such, the numbers of corresponding epochs for each subject are 85 × 12 × 15 = 15,300 and 100 × 12 × 15 = 18,000, respectively.

Dataset 2

In this dataset, as shown in Figure 2a, six images (a television, a telephone, a lamp, a door, a window, and a radio) are shown on a laptop screen to eight subjects (four disabled and four healthy).[11] The disabled subjects were all wheelchair-bound but had varying communication and limb muscle control abilities. The images are flashed in a random sequence, one image at a time; one image is the target and the rest are non-targets. A block consists of six images, each flashed once. As in the P300 speller paradigm, a P300-ERP is produced when the target image is flashed. For each subject, the dataset consists of four sessions, each having six runs. The number of blocks per run is randomly chosen between 20 and 25, i.e., on average, 22.5 blocks of six flashes are displayed in one run. Hence, on average, each subject generates 540 target trials (4 sessions × 6 runs × 1 target × 22.5 blocks = 540) and 2700 non-target trials (4 sessions × 6 runs × 5 non-targets × 22.5 blocks = 2700). The EEG signals are sampled at 2048 samples per second from 32 electrodes, positioned as shown in Figure 2b.
Figure 2

(a) Six images used in[11], (b) the position of electrodes


MATERIALS AND METHODS

Figure 3a and b show the block diagrams for training and testing of our proposed scheme, respectively. For training, the preprocessing module includes filtering, artifact reduction, and data segmentation. Features are extracted by the discrete wavelet transform and truncated to remove near-zero coefficients. A five-fold cross-validation procedure[22] is utilized to select the best sub-bands by applying the BLDA classifier to the first eight channels selected by the Bhattacharyya criterion. As in [8], the extracted features are normalized to zero mean and unit variance. To select the best channels, we disregard the 50% of channels whose Bhattacharyya distances are smaller than those of the remaining channels (keeping 32 channels for Dataset 1 and 16 channels for Dataset 2), and apply the remaining channels, together with their selected sub-bands, to the IBPSO module. In the sequel, the main modules in each block diagram in Figure 3a and b are described.
Figure 3

(a) Block diagram of the proposed scheme for training, and (b) for testing


Preprocessing

In general, ERP epochs are heavily contaminated by noise and are difficult to detect in a few trials. As in [5,6], signals from each channel are band-pass filtered (0.1–30.0 Hz) using a 6th-order forward–backward Butterworth filter. The 0.1–30.0 Hz bandwidth covers the frequency range of the important EEG rhythms: delta (0.5–4.0 Hz), theta (4.0–7.5 Hz), alpha (8.0–13.0 Hz), and beta (14.0–26.0 Hz). The winsorizing method described in [11] is used to reduce the effects of large-amplitude outliers caused by eye movements, blinking, or the subject's movements: signal amplitudes above the 90th and below the 10th percentiles are clipped. After each flash, we use the first 700 ms of recorded signal in both datasets. This window is long enough to capture all the time features required for efficient classification, although the P300 component is expected to occur around 300 ms after the stimulus.[8]
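The clipping step above can be sketched in a few lines. This is a pure-Python illustration of percentile clipping on a single channel, assuming a simple nearest-rank percentile; the helper names and the sample values are hypothetical, and a real pipeline would apply this per channel after filtering.

```python
# Sketch of the amplitude-clipping ("winsorizing") step: values above
# the 90th percentile or below the 10th percentile are clipped.
def percentile(sorted_vals, q):
    """Nearest-rank percentile on an already-sorted list (illustrative)."""
    idx = int(round(q / 100.0 * (len(sorted_vals) - 1)))
    idx = min(len(sorted_vals) - 1, max(0, idx))
    return sorted_vals[idx]

def winsorize(signal, lo_q=10, hi_q=90):
    s = sorted(signal)
    lo, hi = percentile(s, lo_q), percentile(s, hi_q)
    # clamp every sample into [lo, hi]
    return [min(max(x, lo), hi) for x in signal]

# Hypothetical channel segment with two large-amplitude outliers.
samples = [0.1, -0.2, 0.3, 5.0, 0.0, -4.0, 0.2, 0.1, -0.1, 0.05]
clipped = winsorize(samples)
print(max(clipped), min(clipped))  # outliers 5.0 and -4.0 are clipped
```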

Sorting Channels by Bhattacharyya Distance

The efficiency of each channel can be measured by its ability to discriminate between signals pertaining to target and non-target patterns in the training dataset. To do so, we use a statistical measure, the Bhattacharyya distance (BD), that quantifies the degree of difference between the two patterns as a real-valued scalar,[23,24] defined by

BD = (1/8)(m1 − m2)^T [(C1 + C2)/2]^(−1) (m1 − m2) + (1/2) ln( |(C1 + C2)/2| / sqrt(|C1| |C2|) ),

where |·| denotes the determinant of a matrix, m1 is the mean vector of the target pattern signals, m2 is the mean vector of the non-target pattern signals, and C1 and C2 are the corresponding covariance matrices. The value of BD provides a quantitative measure for sorting channels based on their preprocessed signal samples in the training datasets. To obtain the target and non-target preprocessed signal samples, each segment of the preprocessed signal is downsampled by a factor of 4, which still satisfies the Nyquist rate for the preprocessing band-pass filter. For example, Figure 4 shows the BD values for Subject A in the P300 speller dataset IIb, obtained by extracting 42 preprocessed signal samples from a single channel. We use the sorted channels for two purposes: for selecting the eight initial channels that will be utilized for finding the best sub-bands of wavelet coefficients, and for identifying the channels that can be used by the IBPSO algorithm.
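In the univariate special case (scalar means and variances), the matrix expression for BD reduces to a closed form, sketched below. A real channel ranking would use the full multivariate form on each channel's sample vector; this one-dimensional version only illustrates the behavior of the distance.

```python
import math

# Univariate special case of the Bhattacharyya distance: with scalar
# means m1, m2 and variances v1, v2 the matrix expression reduces to
# the closed form below.
def bhattacharyya_1d(m1, v1, m2, v2):
    term1 = 0.25 * (m1 - m2) ** 2 / (v1 + v2)          # mean-separation term
    term2 = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))  # variance term
    return term1 + term2

# Identical distributions -> distance 0; separated means -> positive distance.
print(bhattacharyya_1d(0.0, 1.0, 0.0, 1.0))      # -> 0.0
print(bhattacharyya_1d(0.0, 1.0, 2.0, 1.0) > 0)  # -> True
```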
Figure 4

The values of Bhattacharyya distance for each channel for subject A in the P300 speller dataset IIb

The wavelet transform (WT) has been used extensively in ERP analysis due to its ability to effectively explore both the time-domain and the frequency-domain features of the ERP.[22] It is also superior to the short-time Fourier transform (STFT): the STFT's window is fixed, which can lose information on fast-changing signals, whereas the WT estimates the low-frequency information of the signal using expanded windows and the high-frequency information using short windows. As such, the WT provides an efficient analysis of non-stationary and transient signals. Wavelet analysis can be performed either in the continuous mode (CWT) or in the discrete mode (DWT). The DWT involves less computation, is simpler than the CWT, and can be implemented via digital filtering techniques. The DWT decomposes a signal x[n] into different frequency sub-bands with different resolutions using dilated and shifted versions of the scaling function ɸ[n] and the wavelet function ψ[n],

ɸ_{j,k}[n] = 2^(j/2) ɸ[2^j n − k]  and  ψ_{j,k}[n] = 2^(j/2) ψ[2^j n − k],

where j and k are integers. The DWT projects the original signal onto a set of basis functions built from translations and scalings of the wavelet function (also called the mother wavelet); the DWT coefficients are obtained by convolving x[n] with ψ[n]. The DWT employs a discrete-time mother wavelet whose dilation and translation parameters are integers. The contracted and dilated versions of the wavelet function match the high-frequency and low-frequency components of the original signal, respectively. The DWT can be implemented by multi-resolution analysis (MRA) through the application of digital filter banks.[25] The procedure for MRA via dyadic filter banks for decomposing a signal x[n] is shown schematically in Figure 5.
Each stage consists of a high-pass filter hHigh[n] corresponding to ψ[n], a low-pass filter hLow[n] corresponding to ɸ[n], and two down-samplers. The decomposition of x[n] via dyadic filter banks is described below.
Figure 5

Decomposing x[n] using filter banks

A mother wavelet is chosen to obtain the filters' impulse responses hLow[n] and hHigh[n]. The values of A1[n] and D1[n] are obtained by convolving x[n] with hLow[n] and hHigh[n], respectively. A1[n] and D1[n] are then downsampled by 2 to obtain the approximation coefficients CA1[n] (the low-frequency part of the signal) and the detail coefficients CD1[n] (the high-frequency part of the signal), respectively. This is the first level of wavelet decomposition. The decomposition then continues in the same way along the low-pass branch in Figure 5: CA1[n] is further decomposed into CA2[n] and CD2[n] using hLow[n], hHigh[n], and the two down-samplers. By continuing the wavelet decomposition up to level j, the output of the dyadic wavelet transform is the detail coefficients CD1, …, CDj and the approximation coefficients CAj (the approximation coefficients of the last decomposition level). Each of these j + 1 sets of wavelet coefficients corresponds to the signal information within a specific frequency sub-band. Obtaining the coefficients at level j can be summarized by

CAj[n] = (CA(j−1) * hLow)[2n]  and  CDj[n] = (CA(j−1) * hHigh)[2n],  with CA0[n] = x[n],

where * denotes convolution. Note that because of the down-sampling in the dyadic structure in Figure 5, the DWT is a shift-varying transform.[26] In contrast, the stationary wavelet transform (SWT) is shift-invariant.[27] In the SWT, the scales are dyadic but the time steps at each level are not. Moreover, the SWT is a non-orthogonal transform with temporal redundancies.[28] In our case, using the shift-invariant SWT, which entails more calculations, does not significantly improve the classification accuracy compared to using the DWT. The selection of a mother wavelet and a proper decomposition level are very important in the DWT.
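One level of the filter-bank decomposition can be sketched as a convolution followed by downsampling by 2. For brevity, the sketch below uses the two-tap Haar filters rather than the paper's db4; the structure (convolve with hLow/hHigh, then keep every second sample) is the same.

```python
import math

# One level of the dyadic filter bank in Figure 5: convolve x[n] with
# the low-/high-pass filters, then downsample by 2. Haar filters are
# used here purely to keep the illustration short; the paper uses db4.
H_LO = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # Haar low-pass
H_HI = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # Haar high-pass

def convolve(x, h):
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def dwt_level(x):
    a = convolve(x, H_LO)[1::2]  # approximation coefficients CA1
    d = convolve(x, H_HI)[1::2]  # detail coefficients CD1
    return a, d

# A signal made of constant pairs: all detail coefficients are zero.
ca1, cd1 = dwt_level([4.0, 4.0, 2.0, 2.0])
print(ca1)  # low-frequency part (pairwise averages, scaled)
print(cd1)  # high-frequency part
```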
Choosing the mother wavelet for detecting P300-ERPs can be difficult because many wavelet properties cannot be jointly optimized.[29] The Daubechies family of wavelets is smooth, orthogonal, and easy to implement. In [4,17,30], the Daubechies order-4 (db4) wavelet is employed for decomposing EEG signals. We also choose the db4 mother wavelet, as it resembles the P300 component in ERPs.[17] The effective frequency components in ERPs specify the number of decomposition levels, which are chosen such that the segments of the signal that are highly correlated with the frequencies required for classification are retained in the wavelet coefficients.[31] To have a sufficient number of low-frequency components, we decompose the signal into six levels. Since the bandwidth of the signal is limited to 0.1–30 Hz, we focus on the sub-bands, and their corresponding coefficients, that pertain to 0.1–30 Hz. To select the best DWT sub-bands for each subject, we compute all DWT coefficients within 0–30 Hz for the first eight channels selected by the Bhattacharyya criterion in the training dataset. We then truncate the DWT coefficients as explained in Section 3.4, and obtain all possible combinations of the truncated DWT coefficients of those sub-bands that do not overlap in frequency. For performance evaluation, the training set is randomly partitioned into five subsets using the five-fold cross-validation procedure,[22] where a single subset is reserved for validation and the remaining four are used for training. The cross-validation process is repeated five times, so that each of the five subsets is used exactly once as the validation data, and the results are averaged to obtain a single estimate. The performance on each validation set is measured by the channel classification score (CCS) in (8) below, taken from [8], where fp, tp, and fn are the numbers of false positives, true positives, and false negatives, respectively.
The reason for using this criterion is that the CCS does not include the number of true negatives, which matters for unbalanced datasets: it forces the feature selection to focus on the positive decisions (true positives and false positives), which are far fewer in number than the negatives. For feature selection, classifier performance is evaluated on target versus non-target features (binary classification), not on character or image recognition performance.
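The exact CCS formula is Eq. (8) of the paper (taken from [8]); the essential property stated above is only that it uses fp, tp, and fn but not tn. A common score with exactly that property is the critical-success (Jaccard) index used below, which is a hypothetical stand-in for illustration, not necessarily the paper's Eq. (8).

```python
# Hypothetical stand-in for the CCS: a score built from tp, fp, fn only
# (no tn), here the Jaccard / critical-success index.
def ccs_like(tp, fp, fn):
    denom = tp + fp + fn
    return tp / float(denom) if denom else 0.0

# On a heavily unbalanced validation fold, a classifier that labels
# everything "non-target" gets tp = 0 and therefore a score of 0, even
# though its plain accuracy would be high.
print(ccs_like(tp=0, fp=0, fn=30))   # -> 0.0
print(ccs_like(tp=25, fp=10, fn=5))  # -> 0.625
```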

Minimal Feature Selection

By using suitable feature extraction and selection processes, the computation cost decreases and the classification performance improves. In general, not all extracted features are useful for classification; some are irrelevant or redundant and reduce the classification accuracy. We now show that using all wavelet coefficients at each level results in an expanded feature set and may reduce the classification accuracy. Figure 6 shows the impulse responses of the decomposition low-pass and high-pass filters corresponding to the db4 mother wavelet, in which the first 3 coefficients of hLow[n] and the last 3 coefficients of hHigh[n] are near zero. We use this property of the db4 decomposition filters to reduce the number of features. The values of A1[n] and D1[n] in Figure 5 are obtained by convolving x[n] with hLow[n] and with hHigh[n], respectively. Hence, the first 3 values of A1[n] and the last 3 values of D1[n] are near zero. The downsampled values of A1[n] and D1[n] provide the CA1[n] and CD1[n] coefficients, respectively. Thus, the first two values of CA1[n] and at least the last value of CD1[n] are near zero. Since x[n] is unknown, we have no information on the number of near-zero values at the beginning of CD1[n].
Figure 6

The values of decomposition low-pass and high-pass coefficients for the db4 mother wavelet

Since the first two values of CA1[n] and the first three values of hLow[n] are near zero, and A2[n] = CA1[n] * hLow[n], the first five values of A2[n] are near zero, and so the first three values of CA2[n] (the downsampled A2[n]) are near zero. Likewise, since the last three values of hHigh[n] are near zero and D2[n] = CA1[n] * hHigh[n], the first two values and the last three values of D2[n] are near zero; thus, the first value and at least the last value of CD2[n] are near zero. Similarly, the first three values of CA3[n], and the first two values and at least the last value of CD3[n], are near zero. Figure 7 shows the truncated coefficients of a segment of EEG signal for CA3 to CA6 and CD3 to CD6; the eliminated and remaining coefficients are marked separately in the figure. Note that truncating the DWT coefficients reduces the number of features by 12% to 30%.
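The level-by-level argument above follows a simple recursion: convolution with hLow adds its three leading near-zero taps, and downsampling by 2 (keeping even-indexed samples) turns n leading near-zeros into ceil(n/2). The sketch below replays that recursion for the approximation branch; the function names are illustrative.

```python
# Sketch of how leading near-zero counts propagate down the
# approximation branch: convolution adds the filter's 3 leading
# near-zero taps, and downsampling by 2 keeps ceil(n / 2) of them.
def leading_zeros_next_level(prev_ca_zeros, filter_zeros=3):
    after_conv = prev_ca_zeros + filter_zeros   # leading near-zeros of A_j[n]
    return (after_conv + 1) // 2                # of CA_j[n] (ceil division)

zeros = 0  # the raw signal x[n] has no known leading near-zeros
counts = []
for level in range(1, 4):
    zeros = leading_zeros_next_level(zeros)
    counts.append(zeros)

print(counts)  # -> [2, 3, 3]: leading near-zero counts of CA1, CA2, CA3
```

The output reproduces the counts derived in the text: two near-zero values at the start of CA1 and three at the start of CA2 and CA3.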
Figure 7

A segment of EEG signal and its truncated approximation and detail coefficients for different decomposition levels of the db4 mother wavelet


Classification Algorithm

Classification accuracy, simplicity, and fast training are three important factors in choosing a classifier. In the literature, different classification methods have been used in P300-BCI applications, among them Fisher linear discriminant analysis (FLDA),[13] the support vector machine (SVM),[8,12] and Bayesian linear discriminant analysis (BLDA).[10,11] The FLDA is a simple, fast, and easy-to-use classifier, but its performance deteriorates when many electrodes or features are used. This problem is solved by the BLDA, which uses regularization to prevent over-fitting on high-dimensional and noisy datasets. In the Bayesian analysis, the degree of regularization is estimated quickly, robustly, and automatically from the training data, without the complex cross-validation procedures otherwise needed for parameter tuning.[11] In [10,32], it is shown that the BLDA outperforms the SVM and several other classifiers in all tested cases, with low complexity. Hence, we use a two-class BLDA classifier (similar to the one described in [11]) to classify target and non-target EEG signals. The training features comprise a set of d-dimensional feature vectors xj = [x1, x2, …, xd] with corresponding class labels yj ∊ {−1, 1}, where j is the feature vector index. The basic assumption in Bayesian regression is that the feature matrix X = [x1, x2, …, xN] and its corresponding label vector y = [y1, y2, …, yN] are linearly related, i.e.,

y = wTX + n      (9)

where w = [w1, w2, …, wd]T is a projection vector to be optimized, n = [n1, n2, …, nN] is an additive white Gaussian noise vector, and N is the number of feature vectors. The likelihood function for w in the regression is Gaussian, where β is the inverse variance of the noise and l is the number of examples in the training set.
For the Bayesian setting, the prior distribution of the weight vector w is assumed to be Gaussian with zero mean, where α is the inverse variance of the prior distribution for w, and I′(α) is a d × d diagonal matrix with α along its diagonal. When both the prior and the likelihood of w are Gaussian, it is shown in [11] that the posterior distribution is also Gaussian, with covariance C = (β X XT + I′(α))^(−1) and mean m = β C X y. The predictive distribution of the target for an unobserved input vector x̂ is also Gaussian, with mean μ = mT x̂ and variance σ² = 1/β + x̂T C x̂. For both P300-BCI datasets, we only use the mean of the predictive distribution for making decisions.
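In one dimension the posterior mean weight has a simple closed form, which the sketch below computes: with a single feature, C reduces to 1/(β·Σx² + α) and m to β·C·Σxy. The α and β values are set by hand here for illustration, whereas BLDA estimates them automatically from the training data.

```python
# One-dimensional sketch of the Bayesian linear regression behind BLDA:
# with a single feature x, the posterior mean weight is
#   m = beta * C * sum(x*y),  with  C = 1 / (beta * sum(x^2) + alpha).
# alpha, beta are fixed by hand here; BLDA estimates them from the data.
def posterior_weight(xs, ys, alpha, beta):
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return beta * sxy / (beta * sxx + alpha)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]          # noiseless y = 2x (toy data)
w = posterior_weight(xs, ys, alpha=1.0, beta=100.0)
print(round(w, 3))  # close to 2, pulled slightly toward 0 by the prior
```

The prior acts as the regularizer: increasing alpha shrinks w toward zero, which is exactly the over-fitting protection described above.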

Channel Selection Algorithm

Efficiency of our P300-BCI depends on utilizing effective channels. To this end, we apply the following two-step channel selection algorithm.

Step 1: We halve the number of channels (from 64 to 32, or from 32 to 16) using the Bhattacharyya distance: we sort the BD values in decreasing order and select the half of the channels with larger BD values.

Step 2: We employ an optimization algorithm to choose the more effective channels from those selected in Step 1. In [33], five optimization approaches, namely genetic, memetic, ant-colony optimization, shuffled frog leaping, and particle swarm optimization (PSO) algorithms, are compared on two benchmark continuous optimization test problems. The PSO method outperforms the others in terms of convergence speed and accuracy of results, while being the second best in processing time. In [34], statistical analysis and formal hypothesis testing are utilized to show that the PSO algorithm has the same effectiveness (finding the true global optimum) as the genetic algorithm (GA), but with significantly fewer calculations. Moreover, in [35], it is shown that when binary PSO (BPSO) is used for feature selection in the diagnosis of coronary artery disease, it yields better results than the GA. The BPSO has also been used for channel selection in motor imagery-based BCI.[20]

The PSO algorithm is a population-based search scheme based on the movement and flocking of birds, which are called particles. Each particle flies in an n-dimensional search space with a certain velocity, based on its own previously acquired knowledge and the experience of the other particles in the swarm. The position and velocity of the ith particle are denoted by xi = (xi1, …, xin) and vi = (vi1, …, vin), respectively. At each time step, the velocity moves each particle to its next position by

xi(t + Δt) = xi(t) + vi(t) × Δt.      (16)

The step size Δt is usually set to 1, so at each iteration the velocity and position of each particle are updated by

vi(t + 1) = w vi(t) + c1 r1 (pi(t) − xi(t)) + c2 r2 (g(t) − xi(t))      (17)

and

xi(t + 1) = xi(t) + vi(t + 1),      (18)

respectively, where pi is the position of particle i with the highest value of CCS up to iteration t, and g is the pi with the highest value of CCS among all particles. Also, c1 and c2 are positive learning factors, r1 and r2 are random numbers in [0, 1], and w is the inertia weight, representing the confidence of the particle in its current position, obtained from

w = wmax − ((wmax − wmin) / tmax) × t      (19)

in which wmin and wmax are the final and initial weights, respectively, tmax is the last iteration, and t is the current iteration. A large inertia weight facilitates a global search, while a small inertia weight facilitates a local search. From (19), the inertia weight decreases linearly from a relatively large value to a small value over the course of the PSO run; a linearly decreasing weight performs better than a fixed weight setting. The velocity and position of the particles are confined to [−vmax, vmax] and [−xmax, xmax], respectively, to reduce the chances of particles flying out of the search space. Selecting the value of vmax is very important: for very small values of vmax, the step size is also very small, which may trap the algorithm in a local minimum or make convergence too slow; for very large values of vmax, a particle may leave the search space or its acceleration may exceed its limit.[36] Each particle's position is assessed by the CCS score in (8), computed on the validation sets with the BLDA classifier according to the five-fold cross-validation procedure described in Section 3.3. In our problem, each particle is defined as a group of channels from the set of 32 or 16 channels selected by the Bhattacharyya criterion.
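A single PSO iteration (velocity update with linearly decreasing inertia weight, velocity clamping, position update) can be sketched as follows. All parameter values (c1, c2, vmax, wmin, wmax) are illustrative defaults, not the paper's settings.

```python
import random

# Sketch of one PSO velocity/position update with the linearly
# decreasing inertia weight of Eq. (19). Parameter values here are
# illustrative, not taken from the paper.
def inertia(t, t_max, w_min=0.4, w_max=0.9):
    return w_max - (w_max - w_min) * t / float(t_max)   # Eq. (19)

def pso_step(x, v, p_best, g_best, t, t_max, c1=2.0, c2=2.0, v_max=4.0):
    w = inertia(t, t_max)
    new_x, new_v = [], []
    for xd, vd, pd, gd in zip(x, v, p_best, g_best):
        r1, r2 = random.random(), random.random()
        vel = w * vd + c1 * r1 * (pd - xd) + c2 * r2 * (gd - xd)
        vel = max(-v_max, min(v_max, vel))  # confine velocity to [-vmax, vmax]
        new_v.append(vel)
        new_x.append(xd + vel)              # position update, step size 1
    return new_x, new_v

random.seed(0)
x, v = pso_step([0.0, 1.0], [0.1, -0.1], [1.0, 1.0], [2.0, 0.0],
                t=10, t_max=100)
print(x, v)
```

In the full algorithm, p_best and g_best would be refreshed after each fitness evaluation (here, the CCS score of the channel subset a particle encodes).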
We wish to prune the less effective channels and keep the more effective ones in the set of 32 or 16 selected channels (a binary decision). In [37], the BPSO is used to search binary spaces, where the position vector of each particle is binary-valued and the velocity of particle i determines the probability that the dth bit of its position vector, xid, takes the value 1 or 0. The velocity update in the BPSO is the same as in the PSO, but the dth bit of the position is updated by

xid(t + 1) = 1 if rand < sigmoid(vid(t + 1)), and 0 otherwise,      (20)

where rand is a random number generated at iteration t and the sigmoid function maps the velocity to [0, 1]. When vid is very large (positive or negative), the probability that the bit takes the value 1 approaches one or zero, respectively. We apply the BPSO algorithm to the set of Bhattacharyya pre-selected channels to choose the more effective ones, where each channel is an element of the vector representing a particle. The value of each element is either 1 or 0, where 1 means selection and 0 means rejection of the channel. As an example, for the binary values of x1 and x2 at iteration t in Figure 8, the corresponding two particles are x1 = {C, FC1, …, PO} and x2 = {CP1, …, F8}.
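The binary position update above can be sketched directly: each velocity component is squashed through the sigmoid and compared with a uniform random number to decide whether the corresponding channel bit is set.

```python
import math
import random

# Sketch of the BPSO position update: the velocity sets the probability
# that a bit (channel selected = 1 / rejected = 0) becomes 1.
def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def bpso_position_update(velocities):
    return [1 if random.random() < sigmoid(v) else 0 for v in velocities]

random.seed(1)
bits = bpso_position_update([-10.0, 0.0, 10.0])
print(bits)  # first bit almost surely 0, last bit almost surely 1
```

A large negative velocity makes sigmoid(v) nearly 0 (the channel is almost surely rejected), a large positive velocity makes it nearly 1, and v = 0 leaves the bit at a 50/50 chance.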
Figure 8

Binary particles in the IBPSO algorithm, where one means selection and zero means rejection of the channel

The PSO algorithm suffers from the possibility of convergence to a local minimum. In[38], a modified PSO is proposed that addresses this problem by utilizing chaotic sequences for the weights in order to find a global solution better than that obtained by the PSO algorithm. The chaotic sequences are obtained by the logistic map

f(t+1) = μ·f(t)·(1 − f(t))     (21)

where μ is a control parameter that determines whether f tends to a fixed value, oscillates among a limited sequence of values, or behaves chaotically in an unpredictable manner. The behavior of the system is also influenced by the initial value of f. By choosing μ = 4 and f(0) ∉ {0, 0.25, 0.5, 0.75, 1}, the sequence of f values is chaotic. The new inertia weight is then obtained by multiplying (19) by (21):

wnew = w × f     (22)

Unlike the PSO algorithm, in which the weight decreases monotonically from wmax to wmin, in the improved PSO the new weight decreases and oscillates simultaneously, as shown in Figure 9. Inspired by the work in[38], we use the improved weights in the BPSO algorithm and utilize the improved BPSO (IBPSO) to identify the more effective channels.
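The chaotic weight schedule can be generated as follows; the initial value f0 = 0.3 and the weight bounds are illustrative choices, not values from the paper:

```python
def chaotic_weights(w_min=0.4, w_max=0.9, t_max=100, mu=4.0, f0=0.3):
    """Inertia weights for the improved PSO: the linearly decreasing
    weight is modulated by a logistic-map chaotic sequence.
    For mu = 4, f0 must avoid {0, 0.25, 0.5, 0.75, 1}."""
    f = f0
    weights = []
    for t in range(t_max):
        w = w_max - (w_max - w_min) * t / t_max  # linear decrease
        f = mu * f * (1.0 - f)                   # logistic map
        weights.append(w * f)                    # modulated weight
    return weights
```

The resulting sequence decreases on average while oscillating, matching the behavior shown in Figure 9.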
Figure 9

Variations in the conventional weight and in the proposed new weight[38]


RESULTS

Experimental Result of Dataset 1

We now present the results of applying our proposed scheme to dataset IIb of BCI competition III.[21] First, we compute the Bhattacharyya distance (BD) of each channel for subjects A and B by using target and non-target preprocessed signal samples. We sort the BD values in decreasing order and select the first half of the channels, i.e., those with the larger BD values. The selected 32 channels for subjects A and B are listed in Table 1. We use the first eight channels of each subject, i.e., [C] for subject A and [PO] for subject B, to select the best truncated DWT coefficients as explained in Section 3.3. We begin by eliminating the near-zero coefficients from the beginning and the end of the DWT of single-trial training data, as per Section 3.4, and obtain all possible combinations of the truncated DWT coefficients within 0-30 Hz that do not overlap in frequency. The value of the Ccs score in (8) for each combination set is obtained by the five-fold cross-validation procedure and the BLDA classifier. To compare the impact of using these coefficients against using all DWT and SWT coefficients, the mean classification accuracy for Subjects A and B is shown in Figure 10 for different trials, using the first 8 Bhattacharyya-selected channels. As can be seen, the classification accuracy for the SWT coefficients or for the selected sub-bands is not significantly better than that for the DWT coefficients. This also indicates that our results are not sensitive to varying shifts in the DWT. Our proposed scheme reduces the number of effective features by about 20% relative to using all DWT coefficients while maintaining accuracy.
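The channel-ranking step can be sketched as below. This is a simplification: it assumes a univariate Gaussian Bhattacharyya distance computed per channel over pooled samples, which may differ in detail from the criterion in (the paper's) equations.

```python
import numpy as np

def bhattacharyya_gauss(a, b):
    """Bhattacharyya distance between two 1-D samples under a
    Gaussian assumption (illustrative form of the BD criterion)."""
    m1, m2 = a.mean(), b.mean()
    v1, v2 = a.var(), b.var()
    return (0.25 * np.log(0.25 * (v1 / v2 + v2 / v1 + 2.0))
            + 0.25 * (m1 - m2) ** 2 / (v1 + v2))

def rank_channels(target, nontarget):
    """target, nontarget: (n_trials, n_channels, n_samples) arrays.
    Returns channels sorted by decreasing BD and the kept first half."""
    n_ch = target.shape[1]
    bd = [bhattacharyya_gauss(target[:, c, :].ravel(),
                              nontarget[:, c, :].ravel())
          for c in range(n_ch)]
    order = np.argsort(bd)[::-1]        # decreasing BD
    return order, order[: n_ch // 2]    # full ranking, selected half
```

Channels whose class-conditional distributions differ most (in mean or variance) get the largest BD and are retained.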
Table 1

The 32 channels sorted by BD criteria for subjects A and B

Figure 10

The mean classification accuracy over Subjects A and B for all DWT, truncated DWT, and SWT coefficients

As features, we apply the truncated coefficients of the 32 channels selected via the BD criterion [Table 1] for Subjects A and B to the IBPSO algorithm in order to reduce the number of channels even further. We ran the algorithm for 6, 8, 10, 12, and 15 particles (a particle is a subset of the 32 channels selected via the BD criterion) and 200 iterations using the parameter values in Table 2, and observed that the highest Ccs is obtained when the number of particles in the IBPSO algorithm is 10. Figure 11 shows that the Ccs score of g, i.e., Ccs(g), reaches its final value in less than 200 iterations for both subjects. The mean value of Ccs over the particles' best positions pi(t), i = 1, 2, …, 10, is also shown in Figure 11. Note that this mean value does not change after 150 iterations for both subjects, and its final value is the same as the final value of Ccs(g). This means that 200 iterations are sufficient and that all ten position vectors pi are able to follow g.
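Putting the pieces together, a full IBPSO run over binary channel vectors looks roughly like this. It is a toy sketch: the fitness function is user-supplied (the paper uses the cross-validated Ccs score from a BLDA classifier), and all parameter values are placeholders.

```python
import numpy as np

def ibpso(fitness, n_channels=32, n_particles=10, n_iter=200,
          c1=2.0, c2=2.0, w_min=0.4, w_max=0.9, v_max=4.0,
          mu=4.0, f0=0.3, seed=0):
    """Minimal IBPSO loop: binary positions, sigmoid bit update,
    chaotic inertia weight. Returns the best 0/1 vector and score."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, (n_particles, n_channels))
    v = rng.uniform(-1, 1, (n_particles, n_channels))
    pbest = x.copy()
    pscore = np.array([fitness(p) for p in x])
    g = pbest[np.argmax(pscore)].copy()
    gscore = pscore.max()
    f = f0
    for t in range(n_iter):
        f = mu * f * (1.0 - f)                           # chaotic sequence
        w = (w_max - (w_max - w_min) * t / n_iter) * f   # improved weight
        r1 = rng.random(x.shape); r2 = rng.random(x.shape)
        v = np.clip(w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x),
                    -v_max, v_max)
        x = (rng.random(x.shape) < 1.0 / (1.0 + np.exp(-v))).astype(int)
        score = np.array([fitness(p) for p in x])
        better = score > pscore
        pbest[better] = x[better]
        pscore[better] = score[better]
        if pscore.max() > gscore:
            gscore = pscore.max()
            g = pbest[np.argmax(pscore)].copy()
    return g, gscore
```

Tracking gscore per iteration would reproduce convergence curves of the kind shown in Figure 11.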
Table 2

Parameter values for IBPSO

Figure 11

Variations of the Ccs score and the mean values of Ccs over ten particles for (a) subject A and (b) subject B

The IBPSO algorithm was executed 7 times separately to verify the consistency of channel selection. A different channel set was obtained in each run, which shows the existence of local minima in the IBPSO. Note that the channels {FC1,C3,C1,CZ,C6,P3,P1,PZ,PO7,POZ,PO8,O1,OZ} for Subject A and {C3,CZ,CPZ,CP6,P6,P8,PO4,PO3,PO8,OZ,IZ} for Subject B are common among six or seven of the seven sets, suggesting that they are more important than the other channels. Note also that only the {C3,CZ,PO8,OZ} channels are common to both subjects; the rest are subject-dependent, meaning that channel selection should be performed for each subject separately. Table 3 contains the classification accuracy of each channel set for Subjects A and B in 1, 5, and 15 trials. To show that our proposed scheme extracts effective features, we compare the classification accuracies for the down-sampled signal, the DWT features, and the truncated DWT features in Table 4, using the first channel set of each subject in Table 3. As can be seen, the classification accuracy of the truncated DWT features in all trials, except for one case, is equal to or higher than that of the down-sampled signal. Moreover, the results of the DWT and the truncated DWT features are exactly the same for all trials, meaning that truncating the coefficients whose values are near zero does not deteriorate the classification accuracy.
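The consistency check described above amounts to counting how often each channel appears across repeated runs. A small helper, purely illustrative and not part of the paper's algorithm:

```python
from collections import Counter

def common_channels(runs, min_count=6):
    """Channels appearing in at least `min_count` of the channel sets
    returned by repeated IBPSO runs (a post-hoc consistency check)."""
    counts = Counter(ch for run in runs for ch in set(run))
    return sorted(ch for ch, n in counts.items() if n >= min_count)
```

Applied to the seven IBPSO runs per subject, this recovers the stable core of channels while discarding those that appear only sporadically.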
Table 3

Classification accuracy in % for selected channels by IBPSO in 1, 5, and 15 trials for subjects A and B

Table 4

Classification accuracy in % for the down-sampled signal, the DWT coefficients and the truncated DWT coefficients

Classification results for both Subjects A and B in different trials are shown in Table 5. Using the BCI 2005 evaluation criteria, we achieved correct classification rates of 29%, 74.5%, and 97.5% in 1, 5, and 15 trials, respectively, as compared to the three best results of the BCI competition[9,10,21] shown in Table 5. As can be seen, in almost all trials, our results are better than those in[9,10,21], while our aim is accurate classification with fewer calculations.
Table 5

Mean classification accuracy of our scheme in % and the first ranked competitor in BCI competition 2005, dataset IIb, and[910] for subjects A and B

In Table 6, we compare the number of channels in our approach with those of the three best results in the BCI competition. Note that we use fewer channels than the first-ranked competitor.[9,10] Besides, we use the BLDA classifier, which needs fewer calculations than the SVM.
Table 6

No. of channels and classifiers’ types in our scheme and the three best competitors in BCI competition 2005, dataset IIb


Experimental Result of Dataset 2

We use the data recorded in the first three sessions as the training data and the last session as the test data for the disabled subjects (Subjects 1-4) and the able-bodied subjects (Subjects 6-9). Data for Subject 5 is not considered in this paper for the reasons stated in[11]. The EEG signals were down-sampled from 2048 to 256 samples per second by selecting every 8th sample from the bandpass-filtered data, as described in Section 3.1. For each session, the single trials corresponding to the first 20 blocks of flashes were extracted via preprocessing. Hence, a single trial includes 180 samples, as compared to 168 samples per trial for dataset 1. Each block consists of six flashing images, so the training data comprises 360 target trials and 1800 non-target trials. The test data consists of 120 target and 600 non-target trials. For each subject, we reduce the number of channels from 32 to 16 by using the BD values sorted in decreasing order. The first eight channels were used to select the best truncated sub-bands, as described in Sections 3.3 and 3.4. Table 7 shows the best truncated DWT coefficients and their length for each subject, obtained by the five-fold cross-validation procedure with the cost function Ccs and the BLDA classifier. Note that in Figure 12, the mean classification accuracies for the eight subjects, corresponding to the truncated DWT coefficients in Table 7, are exactly the same as those obtained by utilizing all DWT coefficients (no truncation). Besides, note that using a higher number of SWT features is not very beneficial.
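The down-sampling step described above, keeping every 8th sample of an already bandpass-filtered signal, is a one-liner; note that the anti-aliasing filtering is assumed to have been applied beforehand, as in Section 3.1:

```python
import numpy as np

def decimate_by_selection(x, factor=8):
    """Down-sample by keeping every `factor`-th sample of an already
    bandpass-filtered signal (2048 -> 256 Hz for factor = 8)."""
    return np.asarray(x)[..., ::factor]
```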
Table 7

The best selected features (truncated DWT coefficients) and length of the feature vector using the five-fold cross-validation procedure and BLDA classifier for 8 subjects

Figure 12

The mean classification accuracy for 8 subjects for all DWT, truncated DWT, and SWT coefficients

In order to select the final channel sets, we run the IBPSO algorithm using the selected truncated DWT coefficients for the 16 remaining channels identified via the BD criterion. Since the number of input channels to the IBPSO algorithm in this dataset is half of that in the previous dataset, we used 100 iterations instead of 200. The other parameters of the IBPSO algorithm are stated in Table 2. For each subject, we ran the IBPSO algorithm seven times using Ccs, the BLDA classifier, and the five-fold cross-validation procedure. In each run, we observed that the value of Ccs(g) and the mean of Ccs over the particles' best positions do not change after 80 iterations for all subjects, which indicates that 100 iterations are sufficient. Table 8 shows the best selected channel set in the 7 runs of the IBPSO for each subject. For each subject, some channel sets were identical across the 7 runs, which shows better convergence of the IBPSO algorithm as compared to dataset 1, due to the smaller number of input channels.
Table 8

The best selected channel-sets by IBPSO for all 8 subjects

For each subject, the feature vectors are the truncated DWT coefficients in Table 7, and the channel sets are obtained by the IBPSO algorithm. Hence, we obtained seven different feature vectors corresponding to the seven output channel sets of the IBPSO. Feature vectors extracted from single trials (including targets and non-targets) are used to train a BLDA classifier. Classification accuracy is computed using the extracted features of the test data (the data from the fourth session) over different trials and for the seven channel sets. To compare the classification accuracy of our scheme with that of the method proposed in[11], we use the same preprocessed signal samples and the same four channel sets consisting of 4, 8, 16, and 32 electrodes. In both cases, we use the data from the first three sessions of each subject to select features and channels and to train the classifier, and the data from the fourth session to compute the classification accuracy. Note that the four channel sets used in[11] are CHset1 = {F}, CHset2 = {F3,P4,P7,P8}, CHset3 = {F3,P4,P7,P8,FC1,FC2,C3,C4,CP1,CP2,O1,O2}, and CHset4 = {all 32 channels} in Figure 2b. Figure 13 compares the classification accuracies of the best channel set and the average classification accuracies over the seven channel sets in our approach with those in[11] for CHset2 for each subject. For the best channel set, the performance of our method for all subjects and trials, except for one case (the first trial of Subject 2), is significantly better than that in[11] for CHset2. As shown in Figure 13, the average classification accuracy over the seven channel sets, except for very few trials for Subjects 1, 2, 3, 8, and 9, is better than that in[11] for CHset2. The performance of our proposed scheme does not differ much between disabled and able-bodied subjects.
Figure 13

Classification accuracy of the best channel set and the average classification accuracies over 7 channel sets in our approach and those obtained by using the method in[11] for CHset 2, for disabled subjects (subject 1-subject 4) and able-bodied subjects (subject 6-subject 9)

In Figure 14, the average classification accuracy over all subjects in our proposed scheme, using the truncated DWT coefficients and the channels identified by the IBPSO algorithm, is compared with that in[11], which utilizes the down-sampled signal and four different channel sets. As can be seen, compared to the CHset1, CHset2, and CHset3 channel sets, our proposed scheme performs better than or the same as[11]. Moreover, the average classification accuracy over the seven sets of channels obtained by the IBPSO algorithm is approximately the same as that in[11] for CHset4 (with 32 channels), while we use fewer channels (an average of 6.9 channels per subject).
Figure 14

The average classification accuracy for all subjects in our proposed scheme (truncated DWT coefficients for best run and an average of 7 runs for the IBPSO algorithm) and those in[11] for CHset 1, CHset 2, CHset 3, and CHset 4 channel sets

Note that the results of using the down-sampled signal and the four channel sets in Figures 13 and 14 differ from those in[11] because classification accuracy in the latter is obtained by averaging over four sessions, whereas we only use the fourth session to compute classification accuracy. For a better comparison, we repeated our proposed procedure four times; each time, we used three different sessions for selecting features and channels and for training the classifier, and the remaining session for computing the classification accuracy. Figure 15 compares the average classification accuracy of our method over the four sessions and over all subjects with those in[11] for 8 channels and 32 channels. As can be seen, the average classification accuracy of our method (with an average of 7.3 channels per subject) over the four sessions and over all subjects is approximately the same as the best result (with 32 channels) in[11], confirming the results in Figures 13 and 14.
Figure 15

The average classification accuracy of our method over four sessions and over all subjects and those in[11] for CHset2, and CHset4 channel sets


DISCUSSION

Analysis of EEG signals in a BCI system consists of preprocessing, feature extraction, channel selection, and data classification. While the focus in[8-11] is mainly on channel selection, and in[7,13] on feature selection, we focus on both channel and feature selection with a view to improving classification accuracy. The proposed scheme needs fewer features and provides more accurate classification for almost all trials and subjects in real time. However, our method for selecting proper features and channels during training is not as simple as those in[10,11]. We truncated the DWT coefficients to reduce the number of features, whereas in[9] all DWT coefficients in each level are used. Furthermore, the number of features in our scheme is less than the number of preprocessed signal samples in[8,10,11]. Note that we can reduce the number of features by up to 30% while maintaining the same accuracy in different trials for all subjects. We also showed that using a shift-invariant wavelet transform with a large number of features does not produce better results than using the shift-varying DWT [Figures 10 and 12]. In order to improve the accuracy, we removed ineffective channels by applying a two-step channel selection algorithm (Bhattacharyya distance and the IBPSO algorithm). For dataset 1, we used 22 channels for Subject A and 21 channels for Subject B. This is in contrast to[8,9,12], which use almost all 64 channels and more features, resulting in more calculations. In dataset 1, for some trials, the performance of our scheme is below that of the first-ranked competitor and of[9,10]. For Subject B, our proposed algorithm provides better results than[8,10] for all trials. In dataset 2, we achieve approximately the same classification accuracy with an average of 6.9 channels per subject as[10] does with 32 channels and more features. Compared to the three other channel sets (i.e., 4, 8, and 16 channels) in[10], our results are better or equal in all trials.
Another important issue in BCI is choosing a classifier that provides fast discrimination between classes. The SVM is a well-known and powerful classifier used by the first- and second-ranked competitors, but it requires more calculations to tune its parameters, and its performance degrades when the training data is extensive. In this study, we use the BLDA classifier instead of the SVM, as in[10,11]. As can be seen in Table 5, the accuracy of our proposed classifier in almost all trials is higher than that of the first-ranked competitor. In[9], the FLDA classifier (which is slightly simpler than the BLDA classifier) is used for evaluating classification accuracy in a configuration consisting of 10 parallel classifiers. However, our proposed scheme is more accurate than[9], except for Subject A with fewer than five trials. The results show that the selected channels and sub-bands differ among subjects in both datasets, indicating that the sets of optimal electrodes and optimal DWT sub-bands are subject-dependent.

CONCLUSIONS

Three performance indicators, namely computation cost, real-time operation, and accuracy, are essential in BCI applications. To achieve these objectives, we proposed a new scheme for selecting a minimal set of features, by utilizing the DWT with the db4 mother wavelet, and for choosing the more effective channels. In particular, we truncated the wavelet coefficients whose values are small (near zero) and selected optimal DWT sub-bands for each subject. We also used the BD and the IBPSO algorithm to select fewer channels while attaining accurate classification as compared to existing methods. In particular, using the BD to eliminate one half of the channels significantly reduces calculations in the two different P300-BCI datasets, which include 10 disabled and able-bodied subjects. Our method is subject-dependent and uses a two-stage procedure in the training phase to select the best sets of sub-bands and channels, resulting in more accurate classification with fewer features and fewer channels.

BIOGRAPHIES

Bahram Perseh was born in Tehran, Iran in 1970. He received his B.S. in Electrical Engineering from Isfahan University of Technology in 1993 and his M.S. degree in Biomedical Engineering from Amirkabir University of Technology (The Tehran Polytechnic) in 1996. He is currently working towards the Ph.D. degree in Electrical and Computer Engineering at Tarbiat Modares University, Tehran, Iran. His research interests include biomedical signal processing, brain–computer interface (BCI), heart sound analysis, and pattern recognition. E-mail: bahramperse@yahoo.com

Ahmad R. Sharafat is a professor of Electrical and Computer Engineering at Tarbiat Modares University, Tehran, Iran. He received his B.Sc. degree from Sharif University of Technology, Tehran, Iran, and his M.Sc. and Ph.D. degrees from Stanford University, Stanford, California, all in Electrical Engineering, in 1975, 1976, and 1981, respectively. His research interests are advanced signal processing techniques, and communications systems and networks. He is a Senior Member of the IEEE and Sigma Xi. E-mail: sharafat@modares.ac.ir
REFERENCES (18 in total; first 10 shown)

1.  The mental prosthesis: assessing the speed of a P300-based brain-computer interface.

Authors:  E Donchin; K M Spencer; R Wijesinghe
Journal:  IEEE Trans Rehabil Eng       Date:  2000-06

2.  Brain-computer interfaces for communication and control.

Authors:  Jonathan R Wolpaw; Niels Birbaumer; Dennis J McFarland; Gert Pfurtscheller; Theresa M Vaughan
Journal:  Clin Neurophysiol       Date:  2002-06       Impact factor: 3.708

3.  BCI Competition 2003--Data sets Ib and IIb: feature extraction from event-related brain potentials with the continuous wavelet transform and the t-value scalogram.

Authors:  Vladimir Bostanov
Journal:  IEEE Trans Biomed Eng       Date:  2004-06       Impact factor: 4.538

4.  On wavelet analysis of auditory evoked potentials.

Authors:  A P Bradley; W J Wilson
Journal:  Clin Neurophysiol       Date:  2004-05       Impact factor: 3.708

5.  A P300 event-related potential brain-computer interface (BCI): the effects of matrix size and inter stimulus interval on performance.

Authors:  Eric W Sellers; Dean J Krusienski; Dennis J McFarland; Theresa M Vaughan; Jonathan R Wolpaw
Journal:  Biol Psychol       Date:  2006-07-24       Impact factor: 3.251

6.  Auditory and spatial navigation imagery in Brain-Computer Interface using optimized wavelets.

Authors:  Alvaro Fuentes Cabrera; Kim Dremstrup
Journal:  J Neurosci Methods       Date:  2008-07-06       Impact factor: 2.390

7.  An empirical bayesian framework for brain-computer interfaces.

Authors:  Xu Lei; Ping Yang; Dezhong Yao
Journal:  IEEE Trans Neural Syst Rehabil Eng       Date:  2009-07-17       Impact factor: 3.802

8.  Visual modifications on the P300 speller BCI paradigm.

Authors:  M Salvaris; F Sepulveda
Journal:  J Neural Eng       Date:  2009-07-15       Impact factor: 5.379

9.  Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials.

Authors:  L A Farwell; E Donchin
Journal:  Electroencephalogr Clin Neurophysiol       Date:  1988-12

10.  Application of a hybrid wavelet feature selection method in the design of a self-paced brain interface system.

Authors:  Mehrdad Fatourechi; Gary E Birch; Rabab K Ward
Journal:  J Neuroeng Rehabil       Date:  2007-04-30       Impact factor: 4.262
