
A hybrid feature subset selection algorithm for analysis of high correlation proteomic data.

Hussain Montazery Kordy, Mohammad Hossein Miran Baygi, Mohammad Hassan Moradi.

Abstract

Pathological changes within an organ can be reflected as proteomic patterns in biological fluids such as plasma, serum, and urine. Surface-enhanced laser desorption and ionization time-of-flight mass spectrometry (SELDI-TOF MS) has been used to generate proteomic profiles from biological fluids. Mass spectrometry yields redundant, noisy data in which most data points are irrelevant features for differentiating between cancer and normal cases. In this paper, we propose a hybrid feature subset selection algorithm based on maximum-discrimination and minimum-correlation criteria coupled with peak scoring. The algorithm was applied to two independent SELDI-TOF MS datasets of ovarian cancer obtained from the NCI-FDA clinical proteomics databank and was used to extract a set of proteins as potential biomarkers in each dataset. We applied linear discriminant analysis to identify the important biomarkers. The selected biomarkers successfully distinguished ovarian cancer patients from the noncancer control group with an accuracy of 100%, a sensitivity of 100%, and a specificity of 100% in the two datasets. The hybrid algorithm has the advantage of increasing the reproducibility of the selected biomarkers and of finding a small set of proteins with high discriminative power.


Keywords: Biomarker; classification; correlation-based weight function; feature subset selection; peak scoring; proteomics

Year: 2012 PMID: 23717808 PMCID: PMC3660712

Source DB: PubMed Journal: J Med Signals Sens ISSN: 2228-7477

INTRODUCTION

A major problem in the treatment of cancer is the lack of a suitable technique for early diagnosis of the disease. Ovarian cancer is widespread among women, and its early diagnosis can greatly reduce the mortality rate.[1] With current diagnostic tools, the disease is diagnosed at an advanced clinical stage in more than 80% of patients, and the 5-year survival rate after late-stage presentation is only 35%.[2] It is known that pathological changes within an organ can be reflected as proteomic patterns in biological fluids such as plasma, serum, and urine.[3] Surface-enhanced laser desorption and ionization time-of-flight mass spectrometry (SELDI-TOF MS) has been used to obtain proteomic profiles from biological fluids.[4-6] Mass spectrum data analysis is a fast and rather inexpensive procedure for diagnosing the disease, and it may potentially allow cancer screening without any complication at the time of diagnosis. In many screening tasks, the input data are represented by a very large number of features, of which only a few are suited for predicting the disease factor or class labels. Hence, feature extraction or selection methods can significantly facilitate the analysis of the large amount of information within the mass spectra.

In earlier research, Petricoin et al.[7] applied a bioinformatics tool based on a genetic algorithm and a self-organizing neural network to identify proteomic patterns in the serum of ovarian cancer patients. Zhu et al.[8] used a statistical procedure for preselection of m/z values (candidate proteins), from which the potential biomarkers were then selected by stepwise discriminant analysis and a 5-NN classifier. Baggerly et al.[9,10] evaluated the reproducibility of reported biomarkers in ovarian cancer datasets and noted that the results might be an effect of sample preprocessing and of the noisy nature of the data. Vannucci et al.[11] analyzed mass spectrum data with a wavelet-based Bayesian method to obtain relevant features in the context of the classification problem. Whelehan et al.[12] used partial least squares-discriminant analysis (PLS-DA) to identify potential biomarkers from proteomic profiles. Wu et al.[13] and Morris et al.[14] emphasized that, in addition to data preprocessing, relevant feature selection is another major challenge for MS data analysis. Also, because of the large number of variables and the small number of samples, data mining approaches are needed to overcome challenges such as dimensionality reduction, feature selection, and biomarker identification.[15-17] Therefore, preprocessing and relevant feature selection are two major challenges in the analysis of MS data. The reproducibility of biomarker selection under varying training and testing sets is another open problem in the analysis of proteomic profiles.

In this paper, the data preprocessing step is performed in the wavelet domain. We present a hybrid method based on maximum-discrimination and minimum-correlation (MDMC) criteria coupled with peak scoring to preselect a feature subset as candidate proteins. With the peak scoring criteria, peaks have a higher chance of entering the final feature subset vector. In our study, the proposed method selected the most discriminative features between the normal and cancer groups. Using 10-fold cross-validation, our method was shown to be reproducible with regard to biomarker selection in the studied datasets. In addition, our hybrid algorithm found small sets of proteins as potential biomarkers that have higher discriminative power than previously reported biomarkers for these datasets.

Data and Preprocessing

In this research, SELDI-TOF MS data from the serum of ovarian cancer patients were used as the input patterns for our proposed algorithm. First, we performed the preprocessing step according to the procedure described in the “Preprocessing” section. The processed mass spectra were then used to identify a set of candidate proteins as potential biomarkers for discriminating between cancer cases and healthy, noncancer controls.

Data

Two SELDI-TOF MS datasets were used to identify candidate proteins from serum samples. These datasets were obtained from the freely available clinical proteomics databank on the website of the National Cancer Institute and the Food and Drug Administration (NCI-FDA).[18] In both datasets, each mass spectral curve has 15,154 distinct points on the mass-to-charge ratio axis (m/z values) in the range of 0-20,000 Da. For each of these points, the intensity axis gives a measure of the abundance of the corresponding protein. Figure 1 shows the mean spectra of healthy and cancer cases from datasets I and II. The distribution of samples in each dataset is given in Table 1.

Figure 1

A typical mass spectrum from normal and cancer groups: (a and b) dataset I and (c and d) dataset II

Table 1

Distribution of data


Preprocessing

The raw data obtained from the SELDI-TOF mass spectrometer must be preprocessed before the feature selection step. Preprocessing comprises baseline removal, denoising, and normalization to reduce systematic errors. The mass spectral curve can be modeled in a mixed form that includes the chemical and electrical effects of the mass spectrometer.[19,20] The following mathematical expression can be written for the mass spectrum signal:

$$y = B + N S + \varepsilon \quad (1)$$

In this model, y indicates the signal intensity or abundance of a molecule. The baseline, B, denotes a systematic error that is mainly due to the molecules of the energy-absorbing matrix. The true signal, S, represents the peak profile of each molecule in the biological sample and is scaled in each spectrum by the normalization factor N. The last term, ε, is the electrical noise, which is assumed to have a Gaussian distribution. For baseline removal and denoising, the discrete wavelet transform (DWT) is applied to Equation (1). By applying the DWT, the observed signal, y, is decomposed into approximation and detail coefficients, which contain the baseline and the electrical noise, respectively.[21-23] For baseline correction, we applied the robust baseline elimination (RBE) technique to the approximation coefficients.[24] Noise removal was performed by adjusting the detail coefficients with the soft thresholding method, using a threshold selected from higher order statistics.[25,26] After adjusting the approximation and detail coefficients of each mass spectrum, we reconstructed the intensity signal by applying the inverse discrete wavelet transform. The reconstructed mass spectrum was then normalized according to the method described in Ref. [27]. Figure 2 shows a typical mass spectrum preprocessed with the Daubechies 4 mother wavelet, which has previously been reported to have better ℓ2 performance on mass spectrometry data.[28]
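As a rough illustration of the denoising step, the sketch below applies one level of a wavelet transform, soft-thresholds the detail coefficients, and reconstructs the signal. It is a minimal sketch under stated assumptions and not the procedure used in this work: it swaps in the Haar wavelet for Daubechies 4 to stay short, takes the threshold `t` as a given parameter instead of deriving it from higher order statistics, and leaves out the RBE baseline correction and the normalization. All function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_dwt(signal):
    """One-level Haar DWT (a stand-in for Daubechies 4, chosen for brevity).

    Returns (approximation, detail) coefficient lists.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def soft_threshold(coeffs, t):
    """Shrink every coefficient toward zero by t; values within [-t, t] become 0."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(signal, t):
    """Threshold only the detail coefficients (assumed to carry the noise)."""
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, soft_threshold(detail, t))
```

With `t = 0` the signal is reconstructed unchanged, which is a quick way to check the transform pair before choosing a threshold.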

Figure 2

A processed mass spectrum signal: (a) original signal; (b) approximation coefficients; (c) detail coefficients; (d) estimated baseline; (e) estimated noise and (f) preprocessed signal

MATERIALS AND METHODS

Feature extraction (or selection) is necessary when the number of features is large relative to the sample size, because using all features is impractical and can reduce classification performance.[29] Feature selection methods can be divided into filter and wrapper approaches.[30] In our research, we developed a filter approach to select candidate proteins as potential biomarkers from MS data characterized by high dimensionality and high correlation within the spectrum profiles.

Feature Subset Selection

In some previously published works, features were preselected by ranking them individually with a statistical test and applying a threshold value.[31-33] However, a combination of the best individual features does not always yield the best feature subset.[34,35] Class separability measures can be used instead for feature subset selection. Let the input data matrix D consist of N samples and M features, so that the feature set is $X = \{x_i \mid i = 1, \ldots, M\}$. The goal of feature selection is to find a feature subspace, $\Re^d$, of the M-dimensional observation space, $\Re^M$, in which the c classes are optimally separated. The Bhattacharyya distance is a class separability measure that is based on the minimum Bayes classification error. For Gaussian-distributed features, with $\Sigma_k$ and $\mu_k$ as the within-class covariance and class mean of class k, respectively, the Bhattacharyya distance between two classes is expressed as:[36]

$$J_B = \frac{1}{8}(\mu_2 - \mu_1)^T \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}} \quad (2)$$

The feature set S with d features is selected such that it yields maximum discrimination (MD) between classes as measured by the Bhattacharyya distance. Therefore, the aim is to maximize the following criterion:

$$\max_{S} J_B(S) \quad (3)$$

To select the best feature subset S exhaustively, the number of subsets to search would be $\binom{M}{d} = \frac{M!}{d!\,(M-d)!}$, so searching the entire M-dimensional original space is impractical. Therefore, a sequential search procedure such as sequential forward search (SFS) is needed.[37]
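To make the separability measure concrete, the following sketch computes the Bhattacharyya distance for a single Gaussian feature (the univariate special case of the general formula) and ranks features by it. This is a simplified illustration, not the subset criterion itself: it scores features one at a time, assumes two classes with nonzero variance in every feature, and uses hypothetical function names. The last line counts the subsets an exhaustive search would have to visit for M = 15,154 points and d = 30 features.

```python
import math
from statistics import mean, pvariance

def bhattacharyya_1d(x_a, x_b):
    """Bhattacharyya distance between two classes for one Gaussian feature.

    x_a, x_b: lists of feature values for class A and class B.
    Assumes both classes have nonzero variance.
    """
    mu_a, mu_b = mean(x_a), mean(x_b)
    var_a, var_b = pvariance(x_a), pvariance(x_b)
    var_avg = (var_a + var_b) / 2.0
    mean_term = (mu_a - mu_b) ** 2 / (8.0 * var_avg)
    var_term = 0.5 * math.log(var_avg / math.sqrt(var_a * var_b))
    return mean_term + var_term

def rank_features(class_a, class_b):
    """Return feature indices sorted from most to least discriminative.

    class_a, class_b: lists of samples, each sample a list of feature values.
    """
    n_features = len(class_a[0])
    scores = []
    for j in range(n_features):
        col_a = [row[j] for row in class_a]
        col_b = [row[j] for row in class_b]
        scores.append(bhattacharyya_1d(col_a, col_b))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

# Number of candidate subsets for an exhaustive search with M = 15154, d = 30.
n_subsets = math.comb(15154, 30)
```

The size of `n_subsets` is the practical reason a greedy procedure such as SFS is used in place of an exhaustive search.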

Correlation-based Weight Function

A mass spectrum can be viewed as the sum of independent signals generated by distinct proteins and their fragments.[20] In addition, the resulting spectral data often represent a mixture of several components.[38] Therefore, a correlation measure is needed for selecting the pure variables. In our approach, we used the correlation-based weight function that is applied to select pure variables in the SIMPLISMA method.[38] Let $\tilde{D}$ denote the input data matrix normalized by the method described in Ref. [38]. In SIMPLISMA, a correlation matrix C is computed as

$$C = \frac{1}{N}\tilde{D}^T\tilde{D}$$

The C matrix gives all the variables an equal contribution in the calculation and provides a measure of the independence of the variables. With $p_1, \ldots, p_{i-1}$ denoting the indices of the i − 1 previously selected variables, the correlation-based weight function is obtained as follows:

$$w_{ij} = \begin{vmatrix} c_{j,j} & c_{j,p_1} & \cdots & c_{j,p_{i-1}} \\ c_{p_1,j} & c_{p_1,p_1} & \cdots & c_{p_1,p_{i-1}} \\ \vdots & \vdots & \ddots & \vdots \\ c_{p_{i-1},j} & c_{p_{i-1},p_1} & \cdots & c_{p_{i-1},p_{i-1}} \end{vmatrix} \quad (4)$$

The correlation-based weight function $w_{ij}$ is a measure of correlation among the selected variables that determines the linear independence of the jth candidate protein with respect to the i − 1 previously selected proteins. The minimum-correlation (MC) criterion can be expressed as follows:

$$\max_{p_i} J_C(p_i \mid p_1, \ldots, p_{i-1}), \qquad J_C(p_j) = w_{ij} \quad (5)$$
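The idea behind a correlation-based weight can be sketched as follows, assuming the determinant form commonly associated with SIMPLISMA: the weight of a candidate variable is the determinant of the correlation submatrix formed by the candidate and the already selected variables, so it approaches zero when the candidate is nearly a linear combination of them. The normalization here (dividing each column by its root mean square, with no offset term) and all function names are assumptions made for illustration.

```python
import math

def rms_normalize(data):
    """Scale each column of an N x M matrix (list of rows) to unit RMS.

    Assumed normalization; all-zero columns are left untouched.
    """
    n = len(data)
    rms = [math.sqrt(sum(v * v for v in col) / n) or 1.0 for col in zip(*data)]
    return [[v / r for v, r in zip(row, rms)] for row in data]

def correlation_matrix(data):
    """C = (1/N) D^T D for an N x M matrix given as a list of rows."""
    n, m = len(data), len(data[0])
    return [[sum(data[s][a] * data[s][b] for s in range(n)) / n
             for b in range(m)] for a in range(m)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(m[r][k]))
        if abs(m[pivot][k]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != k:
            m[k], m[pivot] = m[pivot], m[k]
            det = -det
        det *= m[k][k]
        for r in range(k + 1, n):
            factor = m[r][k] / m[k][k]
            for c in range(k, n):
                m[r][c] -= factor * m[k][c]
    return det

def weight(corr, candidate, selected):
    """Determinant of the correlation submatrix of candidate + selected indices."""
    idx = [candidate] + list(selected)
    return determinant([[corr[a][b] for b in idx] for a in idx])
```

For a candidate that is an exact multiple of a selected column the weight is zero, while a candidate orthogonal to the selected columns keeps a weight near one.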

Peak Scoring

In the analysis of mass spectra, any m/z ratio could be selected as a potential biomarker, but peaks are of much greater scientific interest.[33,39,40] On the other hand, the mass-to-charge axis is not equally sampled in MS data. Therefore, a point scoring method can be used to assign a score to each m/z ratio such that peaks have a higher chance of entering the final feature subset vector. Let d̄ be the mean vector of D, computed over each column of the data matrix. To score the m/z ratios, a distance measure over an interval of length w is used, which is called the sum of distances function (SDF). The SDF is computed for each point d̄j of the mean vector according to Equation (6), in which w is an even integer; for the datasets we used, w was set to 10 based on the full-width-at-half-maximum (FWHM) approach.[39,40] For a typical mass spectrometer, there is a 0.1% reading error around each m/z ratio, and the mean spectrum is used to reduce this error. The SDF assigns a weight to each point, and a peak receives a higher score than the other points. Figure 3 shows the SDF for dataset II. Certain regions of the SDF indicate parts of dataset II that show apparent differences between the intensities of the mass spectra for healthy and cancer cases.
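One plausible form of such a peak score is sketched below. The exact distance used in Equation (6) is not reproduced in this text, so the definition here is an assumption: each point of the mean spectrum is scored by the sum of signed differences between its intensity and the intensities of its neighbors within w/2 points on either side, which gives local maxima the largest scores. The function names and the handling of the spectrum edges (truncated windows) are likewise hypothetical.

```python
def column_means(data):
    """Mean spectrum: average of each column of an N x M matrix (list of rows)."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]

def sum_of_distances(mean_spectrum, w=10):
    """Assumed SDF: sum of (point - neighbor) over a window of w/2 on each side.

    Windows are truncated at the ends of the spectrum.
    """
    half = w // 2
    n = len(mean_spectrum)
    scores = []
    for j in range(n):
        lo, hi = max(0, j - half), min(n, j + half + 1)
        scores.append(sum(mean_spectrum[j] - mean_spectrum[k] for k in range(lo, hi)))
    return scores
```

Under this definition a point on a flat region scores zero, a peak scores positive, and a point next to a peak scores negative.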

Figure 3

The computed sum of distances function (SDF) for dataset II (top): certain regions of SDF (a-c) are enlarged to show distinguishable differences between intensities of normal cases (solid line) and ovarian cancer patients (dashed line) in the mean spectrum

Hybrid Algorithm

Here, we present a hybrid algorithm based on maximum-discrimination and minimum-correlation (MDMC) criteria for feature subset selection from mass spectrometry datasets. SFS was used as the search procedure to select d features from the M-dimensional data space. Using cross-validation, the appropriate value of d can be selected empirically to minimize the classification error. In the criteria below, $J_B$ denotes the maximum-discrimination (Bhattacharyya distance) criterion, $J_C$ the minimum-correlation criterion, and SDF the peak score. For feature subset selection, our algorithm can be summarized in the following three steps:

Step 1: Select the first relevant feature, d = 1, to constitute $S_1$ (a subset with one member) by maximizing the following criterion:

$$\max\,(J_B \times \mathrm{SDF}), \quad d = 1 \quad (7)$$

Step 2: Select the subsequent features, d ≥ 2, to form $S_d$ by maximizing the following criterion:

$$\max\,(J_B \times J_C \times \mathrm{SDF}), \quad d \ge 2 \quad (8)$$

Step 3: Repeat Step 2 until the specified value of d is reached.
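The three steps above can be sketched as a greedy loop. This is a simplified illustration under assumptions, not the implementation used in this work: the discrimination score is taken as a precomputed per-feature value instead of a subset-level Bhattacharyya distance, and the correlation weight is passed in as a callable. The names `mdmc_select`, `disc`, `sdf`, and `weight_fn` are hypothetical.

```python
def mdmc_select(disc, sdf, weight_fn, d):
    """Greedy forward selection of d feature indices.

    disc[j]      : discrimination score of feature j (assumed precomputed)
    sdf[j]       : peak score of feature j
    weight_fn    : callable (j, selected) -> correlation-based weight of j
    d            : number of features to select
    """
    selected = []
    remaining = set(range(len(disc)))
    while len(selected) < d and remaining:
        if not selected:
            # Step 1: discrimination times peak score.
            def score(j):
                return disc[j] * sdf[j]
        else:
            # Step 2: also weight by independence from the selected features.
            def score(j):
                return disc[j] * weight_fn(j, selected) * sdf[j]
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

def toy_weight(j, selected):
    """Toy weight for the example: features 0 and 1 are nearly collinear."""
    partner = {0: 1, 1: 0}.get(j)
    return 0.01 if partner in selected else 1.0

# Feature 1 discriminates almost as well as feature 0 but is redundant with it,
# so the second pick falls on the weaker but independent feature 2.
chosen = mdmc_select([5.0, 4.9, 3.0], [1.0, 1.0, 1.0], toy_weight, 2)
```

The toy example shows the intended effect of the correlation term: a highly discriminative feature is skipped when it carries nearly the same information as one already selected.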

RESULTS AND DISCUSSIONS

To evaluate the performance of the proposed method for biomarker identification, we analyzed the ovarian cancer mass spectrometry data listed in Table 1. All mass spectra were preprocessed to remove the baseline and electrical noise according to the procedure described in the “Preprocessing” section. For discrimination, training and testing sets were selected randomly from the normal and cancer groups in each dataset. Because of the small number of samples and the large number of features in each dataset, we used 10-fold cross-validation to avoid bias and error during feature selection and sample classification. To determine a suitable value of d, linear discriminant analysis (LDA) was applied to find the classification error in each repetition, and the cross-validation classification error was computed for each candidate value of d. The value d = 30 was selected, corresponding to the minimal error of 3%. Figure 4 shows the recognition rate resulting from classification of the samples in datasets I and II based on the 30 highest-ranked features over 100 iterations. The flat regions in Figure 4 indicate the presence of features that are redundant for classification. As explained in the “Materials and Methods” section, the proposed method selects the best uncorrelated feature subset in the mass spectrometry sense, which can lead to the best candidate proteins with the highest discriminative power.

Figure 4

The percentage of recognition rates using 30 high ranked features by the LDA classifier: (a) accuracy in dataset I and (b) accuracy in dataset II

Another advantage of our method is the increased within-group reproducibility of the selected features with respect to variation of the training set. When the training set is changed randomly, a feature subset selection method is reproducible if the same features are selected repeatedly over iterative runs of the algorithm. Figure 5 shows the histogram of the 30 features selected by the MDMC method. To obtain the histogram, the training set was selected using 10-fold cross-validation, and the histogram was plotted using those features that were selected more than once. The repetition rate of the 30 selected features was 288 and 294 for datasets I and II, respectively.
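A selection histogram of this kind can be tallied with a few lines of code. The sketch below counts in how many folds each feature was selected and, as one possible reading of a repetition count, sums the selections that belong to features chosen in more than one fold. Both the interpretation and the function names are assumptions for illustration.

```python
from collections import Counter

def selection_histogram(selections):
    """Count in how many folds each feature index was selected.

    selections: one list of selected feature indices per fold.
    """
    return Counter(f for fold in selections for f in set(fold))

def repeated_selections(selections):
    """Total selections belonging to features picked in more than one fold."""
    hist = selection_histogram(selections)
    return sum(count for count in hist.values() if count > 1)

# Toy example with three folds of three features each.
folds = [[1, 2, 3], [1, 2, 4], [1, 5, 6]]
hist = selection_histogram(folds)
```

In the toy example, feature 1 appears in all three folds and feature 2 in two, so five of the nine selections are repeats.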

Figure 5

A histogram view of selected masses using the MDMC method: (a) histogram of selected features in dataset I and (b) histogram of selected features in dataset II

We used LDA to select the potential biomarkers in the two datasets. To evaluate the performance and discriminative power of the selected biomarkers, we used the accuracy, sensitivity, and specificity for distinguishing between the healthy and cancer groups. From the 30 features selected by MDMC, we identified 14 and 6 peptides from the proteomic profiles as biomarkers in datasets I and II, respectively. In ascending order of mass, these proteins had m/z values of (80.61, 81.61, 268.57, 341.46, 393.3, 414.3, 445.25, 564.57, 1522.51, 2025.13, 2064.8, 2072.44, 3184.76, and 6598.81) in dataset I and (244.66, 331.87, 459.14, 516.84, 2036.91, and 8362.91) in dataset II. Table 2 lists the results obtained from classification of the samples using the identified biomarkers. To distinguish between the healthy and cancer cases, we used the LDA and support vector machine (SVM) classifiers. To calculate the performance matrix, half of the samples in each dataset were selected randomly as the training set, and all the samples were then used as the test set.
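For reference, the three performance measures follow directly from the counts of true and false positives and negatives. The sketch below assumes binary labels with the cancer class encoded as 1 and the healthy class as 0; the function name and the label encoding are illustrative choices.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Toy example: three cancer cases (one missed) and two healthy cases.
acc, sens, spec = diagnostic_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
```

In the toy example one cancer case is missed, which lowers sensitivity and accuracy but leaves specificity at 1.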

Table 2

Performance results

We compared the accuracy of sample classification using the biomarkers selected by MDMC with that obtained using previously reported biomarkers for the same datasets.[7,8,11,12] The accuracy was computed using 10-fold cross-validation. As shown in Table 3, MDMC yielded a significant improvement in discriminative power relative to the number of biomarkers. This enhancement is particularly noticeable in dataset I, which is of poorer quality than dataset II. In addition, the proposed method reduced the number of selected biomarkers while preserving the discriminative power.

Table 3

Comparison results

The improvement in our results compared with previous works can be attributed to choosing uncorrelated features in the mass spectrometry sense, which enabled us to extract the pure variables from the mass spectrometry datasets. In Figure 6, we compare the biomarkers selected by MDMC with the previously reported proteins for the same datasets[8,10,11] by computing the correlation between biomarkers. We used a cumulative function, denoted the cumulative correlation function (CCF), to calculate this correlation.[33] We plotted the inverse of this function for easier evaluation. Each added biomarker increases the value of the CCF and decreases its inverse. As shown in Figure 6, MDMC selected proteins with lower correlation as potential biomarkers, which is consistent with the improvement in our diagnostic results for the two datasets.

Figure 6

A comparison of the correlation between the biomarkers selected by the MDMC algorithm and the biomarkers reported by other workers: (a) dataset I and (b) dataset II

CONCLUSIONS

Emerging advances in mass spectrometry technology allow the simultaneous analysis of expression patterns for thousands of proteins in biological samples. In the analysis of proteomic profiles, we were faced with high dimensionality and correlation between the elements of the mass data. In addition, appropriate preprocessing of the data has been a major challenge in this field. The goal of this study was to present an appropriate algorithm for the analysis of mass spectra. In this paper, we have presented a hybrid feature subset selection method that determines relevant features based on a class separability measure, minimum correlation, and peak scoring criteria. Our method was applied to the two ovarian cancer datasets to identify biomarkers that distinguish control from cancer samples. Using 10-fold cross-validation, the proposed algorithm succeeded in selecting reproducible biomarkers. The algorithm identified 14 biomarkers with an accuracy of 99.5%, a sensitivity of 99%, and a specificity of 100% in dataset I. In dataset II, it determined six biomarkers that achieved perfect discrimination, with 100% accuracy, 100% sensitivity, and 100% specificity. In analyzing these independent datasets, our method identified a small subset of proteins as potential biomarkers in the training set that could distinguish samples in a blind test set with high discriminatory power. We have shown that feature subset selection plays a key role in obtaining relevant potential biomarkers in the analysis of mass spectrometry data. Preprocessing is also an important step in the analysis of proteomic patterns. Dataset I, as mentioned on the NCI-FDA site, was processed manually for baseline removal, and this reduced the quality of the data compared with dataset II. Our method nevertheless succeeded in selecting significant biomarkers from the poorer quality data, but having an unprocessed dataset has an important effect on achieving better reproducibility of the selected biomarkers. To conclude, our algorithm can be used as a diagnostic tool alongside the mass spectrometer to extract potential biomarkers that differ significantly between healthy and cancer groups.

BIOGRAPHIES

Hussain Montazery Kordy received the B.S. degree in electronic engineering from Mazandaran University, Babol, in 2000, the M.S. degree in biomedical engineering from Sharif University of Technology in 2003, and the Ph.D. degree from Tarbiat Modarres University, Tehran, Iran, in 2009. Since 2010, he has been a member of Electrical and Computer Engineering at Babol Nooshirvani University of Technology (NIT), Babol, Iran, where he is currently an Assistant Professor of Biomedical Engineering. His teaching interests include medical instrumentation, biological system modeling, pattern recognition, and time-frequency signal processing. His research focuses on computer-aided diagnosis, feature selection and extraction, biomedical signal processing, and wavelet-based signal analysis. E-mail: hmontazery@nit.ac.ir

Mohammad Hossein Miran Baygi received the B.S. degree in electronic and control engineering from the University of Birmingham in 1990, and the M.S. degree in Digital System Design and the Ph.D. degree in Biomedical Instrumentation from the University of Manchester Institute of Science and Technology (UMIST), UK, in 1992 and 1995, respectively. He is an Associate Professor and Director of the Biomedical Engineering Department at Tarbiat Modarres University. His main research interests include modeling of biological systems, biomedical instrumentation, and the interaction of lasers with biological tissue. Dr. Miran Baygi is a member of the Institute of Physics and Engineering in Medicine and Biology (UK), the Institute of Electrical Engineers (UK), and the Society of Biomedical Engineering (Iran). E-mail: miranbmh@modaress.ac.ir

Mohammad Hassan Moradi received the B.S. and M.S. degrees in electronic engineering from Tehran University in 1988 and 1990, respectively, and the Ph.D. degree from the University of Tarbiat Modarres, Tehran, Iran, in 1995. He has been with the Faculty of Biomedical Engineering, Amirkabir University of Technology (AUT), since 1995, where he is currently a Professor and Director of the Bio-Electric Department. His primary research and teaching interests involve the theory and application of medical instrumentation, biomedical signal processing, wavelet systems design, time-frequency transforms, and fuzzy neural systems. He has published over 60 technical papers in international journals and over 200 technical papers in international conferences, and he is the translator of one book on wavelet signal processing. E-mail: mhmoradi@aut.ac.ir
