
Identification of antioxidants from sequence information using naïve Bayes.

Peng-Mian Feng, Hao Lin, Wei Chen.

Abstract

Antioxidant proteins are substances that protect cells from the damage caused by free radicals. Accurate identification of new antioxidant proteins is important in understanding their roles in delaying aging. Therefore, it is highly desirable to develop computational methods to identify antioxidant proteins. In this study, a Naïve Bayes-based method was proposed to predict antioxidant proteins using amino acid compositions and dipeptide compositions. In order to remove redundant information, a novel feature selection technique was employed to single out optimized features. In the jackknife test, the proposed method achieved an accuracy of 66.88% for the discrimination between antioxidant and nonantioxidant proteins, which is superior to that of other state-of-the-art classifiers. These results suggest that the proposed method could be an effective and promising high-throughput method for antioxidant protein identification.

Year:  2013        PMID: 24062796      PMCID: PMC3766563          DOI: 10.1155/2013/567529

Source DB:  PubMed          Journal:  Comput Math Methods Med        ISSN: 1748-670X            Impact factor:   2.238


1. Introduction

Oxidation is a chemical reaction that transfers electrons or hydrogen from a substance to an oxidizing agent. Oxidation reactions can produce free radicals, which in turn can start chain reactions. When such a chain reaction occurs in a cell, it can damage or kill the cell. Moreover, oxidative stress is both a cause and a consequence of disease. Antioxidants are molecules that terminate these chain reactions by removing free radical intermediates and that inhibit other oxidation reactions. They do this by being oxidized themselves, so antioxidants are often reducing agents such as thiols, ascorbic acid, or polyphenols [1]. Antioxidants are widely used in dietary supplements and have been investigated for the prevention of diseases such as cancer, coronary heart disease, and even altitude sickness. Plants and animals maintain complex systems of multiple types of antioxidants, such as glutathione, vitamin A, vitamin C, and vitamin E, as well as enzymes such as catalase, superoxide dismutase, and various peroxidases. Insufficient levels of antioxidants or inhibition of the antioxidant enzymes can cause oxidative stress and may damage or kill cells. As oxidative stress appears to be an important part of many human diseases, the use of antioxidants in pharmacology is intensively studied, particularly as a treatment for stroke and neurodegenerative diseases.
Recently, Fernandez-Blanco et al. reported a computational model to identify antioxidant proteins based on star graph topological indices [2]. However, by analyzing Fernandez-Blanco et al.'s dataset, we found that its sequences share high sequence similarity; some sequences even share 100% sequence identity. It has been demonstrated that predictive accuracy is closely related to sequence identity [3, 4], and high sequence similarity can lead to an overestimation of predictive performance. Therefore, the performance reported for their model is likely to be overestimated. There is thus an urgent need to develop efficient computational tools for antioxidant protein identification. In the current study, we propose a Naïve Bayes-based computational model for predicting antioxidant proteins using amino acid compositions and dipeptide compositions. The correlation-based feature subset selection algorithm [5] was introduced to find the optimal feature set. Using the optimized features, the proposed model was evaluated on a benchmark dataset with the jackknife test.
According to some recent comprehensive reviews [6, 7] and a series of recent publications [8-13], to establish a really useful statistical predictor, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) establish a user-friendly webserver for the predictor that is accessible to the public. Below, we describe how to deal with these steps one by one.

2. Materials and Methods

2.1. Dataset

Fernandez-Blanco et al. constructed a dataset containing 324 proteins with antioxidant activity and 1657 proteins without [2]. However, sequences in their dataset share high sequence identity. Predictive accuracy is closely related to sequence identity [3, 4], and high sequence similarity can lead to an overestimation of predictive performance. In order to prepare a reliable benchmark dataset, we first extracted proteins with antioxidant activity from UniProt [14] according to the following steps: (i) only proteins with experimentally confirmed antioxidant activity were included; (ii) proteins that are fragments of other proteins were discarded; and (iii) proteins containing nonstandard letters, that is, “B”, “X”, or “Z”, were excluded because their meanings are ambiguous. After this strict screening procedure, we obtained 686 proteins with antioxidant activity and built a new raw dataset by merging these 686 proteins into Fernandez-Blanco et al.'s dataset [2]. To balance the number of samples against the need for statistically meaningful results, sequences sharing more than 60% sequence identity were removed from the new raw dataset using the CD-HIT program [15]. A more stringent sequence identity cutoff of 25% would make the results more objective and reliable. However, we did not use such a stringent criterion in this study because the currently available data do not allow it: the number of antioxidant proteins would be too low to have statistical significance. Finally, a benchmark dataset containing 254 antioxidant and 1567 nonantioxidant proteins was constructed; it can be found in the online Supporting Information S1, available at http://dx.doi.org/10.1155/2013/567529. To further estimate the performance of the method, we also collected 20 antioxidant proteins (Supporting Information S2) that are independent of the training set.
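
Screening step (iii) above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy sequences are hypothetical, and the fragment filter and the CD-HIT redundancy removal are not reproduced here.

```python
# Letters named in the text as ambiguous; sequences containing any of them are dropped.
AMBIGUOUS_LETTERS = set("BXZ")

def has_ambiguous_residue(sequence: str) -> bool:
    """Return True if the sequence contains 'B', 'X', or 'Z'."""
    return any(residue in AMBIGUOUS_LETTERS for residue in sequence.upper())

def screen_sequences(records: dict) -> dict:
    """Keep only the records (id -> sequence) without ambiguous residues."""
    return {seq_id: seq for seq_id, seq in records.items()
            if not has_ambiguous_residue(seq)}

# Toy input (hypothetical identifiers and sequences, not taken from UniProt).
candidates = {"seq1": "MKTAYIAKQR", "seq2": "MKXAYIAKQR", "seq3": "MBZA"}
kept = screen_sequences(candidates)
print(sorted(kept))
```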

2.2. Feature Vector

One of the most important parts of identifying protein attributes is to generate a set of proper informative parameters to encode protein sequences. The amino acid composition and dipeptide composition are the most important and effective parameters and have been widely applied in the realm of protein prediction [10-13]. Hence, every protein sequence in the benchmark dataset was encoded as a discrete vector as follows:
$$F = [f_1, f_2, \ldots, f_{420}]^{T}, \tag{1}$$
where $f_i$ are the normalized occurrence frequencies of the 20 amino acids ($i = 1, 2, \ldots, 20$) and the 400 dipeptides ($i = 21, 22, \ldots, 420$) in the protein sequence, respectively, and $T$ is the transposing operator.
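
As a minimal sketch of this encoding, the following Python function builds the 420-dimensional vector for one sequence. Two details are assumptions rather than statements from the text: the amino acids are ordered alphabetically by one-letter code, and frequencies are normalized by the sequence length L (amino acids) and by L - 1 (overlapping dipeptides).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering: alphabetical one-letter codes
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs

def encode(sequence: str) -> list:
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    length = len(seq)
    # Amino acid composition: occurrences divided by sequence length.
    aac = [seq.count(residue) / length for residue in AMINO_ACIDS]
    # Dipeptide composition: overlapping pairs divided by the number of pairs.
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(length - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    n_pairs = max(length - 1, 1)
    dpc = [counts[pair] / n_pairs for pair in DIPEPTIDES]
    return aac + dpc

vector = encode("ACAC")
print(len(vector))
```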

2.3. Feature Selection

Including redundant and noisy features in the model building process leads to poor predictive performance and increased computation time. Feature selection is the process of removing irrelevant features; it is extremely useful for reducing the dimensionality of the data and improving the predictive accuracy. For these reasons, the filter method Correlation-based Feature Selection [5], combined with a best-first search strategy, was used for feature selection in the current work. The search starts with an empty set of features and generates all possible single-feature expansions. The subset with the highest accuracy is chosen and expanded in the same way by adding single features. If expanding a subset does not improve the accuracy, the search drops back to the next best unexpanded subset and continues from there until all features have been added. The subset with the highest accuracy is selected as the final optimized feature set [16].
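
The search described above can be illustrated with a simplified sketch. Several points are assumptions and not taken from the text: subsets are scored with the correlation-based merit k·r_cf / sqrt(k + k(k-1)·r_ff) using absolute Pearson correlations, and the search is a plain greedy forward selection that stops at the first non-improving step instead of a full best-first search with backtracking. The function names and the toy data are hypothetical.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Correlation-based merit of a feature subset (absolute Pearson correlations)."""
    k = len(subset)
    # Mean feature-class correlation.
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # Mean feature-feature correlation over all pairs in the subset.
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def forward_select(X, y, tol=1e-12):
    """Greedy forward selection: add the best feature while the merit improves."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best + tol:
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best

# Toy data: columns 0 and 1 are identical (redundant), column 2 is uncorrelated with y.
X = np.array([[0.1, 0.1, 1], [0.2, 0.2, 0], [0.0, 0.0, 1], [0.3, 0.3, 0],
              [0.9, 0.9, 1], [1.0, 1.0, 0], [0.8, 0.8, 1], [1.1, 1.1, 0]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
selected, merit = forward_select(X, y)
print(selected, round(float(merit), 3))
```

In this toy case only one of the two duplicated columns is kept, because adding its copy leaves the merit unchanged, which is the redundancy-removal behaviour the filter is meant to provide.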

2.4. Naïve Bayes

Naïve Bayes is an effective statistical classification algorithm [17] and has been successfully used in the realm of bioinformatics [18-20]. Naïve Bayes assumes that the attribute variables are independent of each other given the outcome. This assumption greatly simplifies the calculation of conditional probabilities. In the Naïve Bayes framework, a classification problem can be seen as the problem of finding the outcome with maximum probability given a set of observed variables. Given a protein example described by its feature vector $F = (f_1, f_2, \ldots, f_n)$, we need to look for a class $C$ that maximizes the likelihood $P(F \mid C) = P(f_1, f_2, \ldots, f_n \mid C)$. Since the current work is intended to classify antioxidant and nonantioxidant proteins, a binary class $C \in \{0, 1\}$ was used, where 1 denotes a sample predicted as an antioxidant protein and 0 denotes a nonantioxidant protein. For the binary classification, the class of a protein sample can be determined by comparing the two posteriors as follows:
$$\frac{P(C = 1 \mid F)}{P(C = 0 \mid F)} = \frac{P(C = 1)\prod_{i=1}^{n} P(f_i \mid C = 1)}{P(C = 0)\prod_{i=1}^{n} P(f_i \mid C = 0)}. \tag{2}$$
Taking the logarithm of (2), we have
$$\log\frac{P(C = 1 \mid F)}{P(C = 0 \mid F)} = \log\frac{P(C = 1)}{P(C = 0)} + \sum_{i=1}^{n}\log\frac{P(f_i \mid C = 1)}{P(f_i \mid C = 0)}. \tag{3}$$
Hence, the sample will be predicted as 1 (antioxidant protein) if
$$\log\frac{P(C = 1 \mid F)}{P(C = 0 \mid F)} > \theta \tag{4}$$
and as 0 (nonantioxidant protein) otherwise. Here $\theta$ is the threshold that determines the tradeoff between sensitivity and specificity; it can be tuned on the training dataset to maximize the prediction performance.
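
The log-odds decision rule with threshold θ described above can be sketched as follows. The per-class Gaussian densities for the continuous features, the small variance floor, and all function names are assumptions made for this illustration; the study itself used the WEKA implementation.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate the prior, per-feature means, and variances for classes 0 and 1."""
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor avoids division by zero for constant features (assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[c] = (n / len(y), means, variances)
    return model

def log_odds(model, x):
    """log P(C=1|F) - log P(C=0|F) under the feature independence assumption."""
    def log_joint(c):
        prior, means, variances = model[c]
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return log_joint(1) - log_joint(0)

def predict(model, x, theta=0.0):
    """Predict 1 (antioxidant) if the log-odds exceed theta, otherwise 0."""
    return 1 if log_odds(model, x) > theta else 0

# Toy two-feature data (hypothetical values).
X_train = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8]]
y_train = [0, 0, 1, 1]
nb_model = fit_gaussian_nb(X_train, y_train)
print(predict(nb_model, [0.85, 0.85]), predict(nb_model, [0.15, 0.15]))
```

Raising `theta` makes the classifier more reluctant to predict the positive class, which trades sensitivity for specificity.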

2.5. Performance Evaluation

The performance of the proposed model was evaluated using sensitivity (Sn), specificity (Sp), and accuracy (Acc), which are expressed as follows:
$$\mathrm{Sn} = \frac{TP}{TP + FN}, \qquad \mathrm{Sp} = \frac{TN}{TN + FP}, \qquad \mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}, \tag{5}$$
where TP, TN, FP, and FN represent the number of correctly recognized antioxidant proteins, the number of correctly recognized nonantioxidant proteins, the number of nonantioxidant proteins recognized as antioxidant proteins, and the number of antioxidant proteins recognized as nonantioxidant proteins, respectively. As the performance of the current classifier depends on the threshold θ given in (4), the receiver operating characteristic (ROC) curve was also employed. The quality of a classifier can then be objectively evaluated by measuring the area under the receiver operating characteristic curve (auROC). The auROC score ranges from 0 to 1, with a score of 0.5 corresponding to a random guess and a score of 1.0 indicating perfect separation.
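
A short sketch of these measures is given below, using the standard definitions of sensitivity, specificity, and accuracy in terms of TP, TN, FP, and FN. The auROC is computed through its rank interpretation (the fraction of positive-negative pairs in which the positive sample scores higher, with ties counted as one half). The function names, counts, and scores are hypothetical.

```python
def sn_sp_acc(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sn, sp, acc

def au_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy counts and scores (not results from this study).
sn, sp, acc = sn_sp_acc(tp=8, tn=6, fp=4, fn=2)
perfect = au_roc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
mixed = au_roc([0.9, 0.2, 0.8, 0.3], [1, 1, 0, 0])
print(sn, sp, acc, perfect, mixed)
```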

3. Results and Discussion

Three cross-validation methods, namely, subsampling test, independent dataset test, and jackknife test, are often employed to evaluate the predictive capability of a predictor. Among the three methods, the jackknife test is deemed the most objective and rigorous one that can always yield a unique outcome as demonstrated by a penetrating analysis in a recent comprehensive review [21], and hence has been widely and increasingly adopted by investigators to examine the quality of various predictors (see, e.g., [8, 22–28]). Accordingly, the jackknife test was used to examine the performance of the model proposed in the current study. In the jackknife test, each sequence in the training dataset is in turn singled out as an independent test sample and all the rule parameters are calculated without including the one being identified.
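
The jackknife (leave-one-out) procedure described above can be sketched generically. The `fit` and `predict` callables are placeholders for any classifier; the nearest-centroid classifier and the toy data below are hypothetical stand-ins used only to keep the example self-contained, not the Naïve Bayes model of this study.

```python
def jackknife(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and predict the held-out sample."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        predictions.append(predict(model, X[i]))
    return predictions

def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(centroids, x):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

# Toy one-feature dataset (hypothetical values).
X_toy = [[0.0], [0.1], [0.2], [0.9], [1.0], [1.1]]
y_toy = [0, 0, 0, 1, 1, 1]
loo_predictions = jackknife(X_toy, y_toy, fit_centroids, predict_centroid)
print(loo_predictions)
```

Because the held-out samples are fixed by the dataset itself, the procedure involves no random splitting, which is why it yields a unique outcome for a given dataset.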

3.1. Prediction of Antioxidant Proteins

We trained the Naïve Bayes classifier using Waikato Environment for Knowledge Analysis (WEKA) [29] on the benchmark dataset. As shown in Table 1, in the jackknife test, an auROC score of 0.68 and an accuracy of 55.85% with an average sensitivity of 75.59% and an average specificity of 52.65% were obtained for the classification of antioxidant and nonantioxidant proteins by using all the 420 features, that is, 20 amino acid compositions and 400 dipeptide compositions.

Table 1

Predictive performance of Naïve Bayes based on different features.

Feature dimensions    Sn (%)    Sp (%)    Acc (%)    auROC
420                   75.59     52.65     55.85      0.680
44                    72.04     66.05     66.88      0.855

To save computing time, cross-validation methods (fivefold or tenfold) are widely used for feature selection in computational proteomics [16, 30]. In order to identify prominent features that can distinguish between antioxidant and nonantioxidant proteins, feature selection was carried out using WEKA with ten-fold cross-validation on the benchmark dataset to eliminate redundant features. In ten-fold cross-validation, the benchmark dataset is split into ten pieces, and each piece is used in turn as the testing set. Thus, the training process is performed ten times, each time on the data that remain after the testing set is removed from the whole dataset. We found that the proposed method achieved a maximum accuracy of 66.89% and an auROC of 0.762 when the feature dimension was reduced to 44 (i.e., C, G, FP, FW, LK, LS, IE, VL, VH, VC, VW, MS, PD, AP, AY, YQ, YE, YR, HE, HG, QA, KA, KH, DF, DK, DR, EF, EM, EY, ER, CP, CN, CG, WC, RT, RD, RW, SV, SD, GV, GY, GK, GC). The jackknife test results of the Naïve Bayes classifier based on the 44 optimized features are listed in Table 1. As can be seen from Table 1, the current method yielded a better auROC score of 0.855 and a predictive accuracy of 66.88%, with an average sensitivity of 72.04% and an average specificity of 66.05%. Both predictive accuracy and auROC are higher than those of the model based on the 420 features. Moreover, to further evaluate the proposed method, we applied it to the 20 experimentally confirmed antioxidant proteins in Supporting Information S2. As a result, 16 of the 20 antioxidant proteins (80%) were correctly predicted; see Table 2. This result indicates that the model generalizes well to antioxidant proteins outside the training set.

Table 2

Predictive results based on the independent dataset.

UniProt ID    Predictive result
Q148E0        Antioxidant
Q7RTV5        Antioxidant
Q9D1A0        Antioxidant
P80239        Antioxidant
P0AE08        Antioxidant
Q7BHK8        Nonantioxidant
P0A251        Antioxidant
P0A5N4        Antioxidant
Q8L5E0        Antioxidant
P06728        Antioxidant
Q03247        Antioxidant
P23529        Nonantioxidant
P30041        Antioxidant
O19097        Antioxidant
P23345        Antioxidant
P23346        Antioxidant
O65198        Nonantioxidant
P93407        Antioxidant
P11964        Nonantioxidant
P10792        Antioxidant
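
The ten-fold cross-validation used for feature selection in Section 3.1 can be sketched as an index-splitting routine. This is an unstratified split with a hypothetical function name and an arbitrary random seed; the fold assignment actually produced by WEKA may differ (for example, through stratification by class).

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# 1821 = 254 antioxidant + 1567 nonantioxidant proteins in the benchmark dataset.
splits = list(k_fold_indices(1821, k=10))
print(len(splits), sorted({len(test) for _, test in splits}))
```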

3.2. Comparison with Other Methods

To further assess the present model, we compared it with models based on other algorithms, namely, BayesNet, J48 tree, and Random Forest. All the classifiers were compared on the benchmark dataset using the optimized features (i.e., C, G, FP, FW, LK, LS, IE, VL, VH, VC, VW, MS, PD, AP, AY, YQ, YE, YR, HE, HG, QA, KA, KH, DF, DK, DR, EF, EM, EY, ER, CP, CN, CG, WC, RT, RD, RW, SV, SD, GV, GY, GK, GC). Their best predictive results from the jackknife test are shown in Table 3.

Table 3

Comparison of Naïve Bayes with other methods by using optimized features.

Classifier       Sn (%)    Sp (%)    Acc (%)    auROC
BayesNet         42.12     92.53     85.50      0.800
J48 tree         26.37     90.81     81.82      0.565
Random Forest    28.35     97.64     87.97      0.797
Naïve Bayes      72.04     66.05     66.88      0.855

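Although the accuracies of BayesNet, J48 tree, and Random Forest are higher than that of Naïve Bayes, their auROC scores and sensitivities are all much lower than those of Naïve Bayes. Because the benchmark dataset is imbalanced (254 antioxidant versus 1567 nonantioxidant proteins), a classifier that labels every protein as nonantioxidant would already reach an accuracy of about 86%, so accuracy alone is a weak criterion here. These results indicate that the proposed Naïve Bayes model can be effectively used to classify antioxidant and nonantioxidant proteins.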

4. Conclusions

In this study, a Naïve Bayes classifier with a feature selection method is presented to identify antioxidant proteins based on primary sequence information. By using the Correlation-based Feature Subset Selection algorithm, the feature dimension was reduced to 44 prominent features, which markedly improved the predictive accuracy. However, detailed analyses of the selected features are still required to provide more information about their roles in biological activity. It is expected that the presented model will provide novel insights into research on antioxidants. Since user-friendly and publicly accessible webservers represent the future direction for developing practically more useful predictors [31], we shall make efforts in our future work to provide a webserver for the method presented in this paper. Supporting Information S1 is the benchmark dataset, which consists of a positive dataset containing 254 antioxidant proteins and a negative dataset containing 1567 nonantioxidant proteins. Supporting Information S2 is the independent dataset containing 20 antioxidant proteins that are independent of those in the benchmark dataset.
  26 in total

1.  iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites.

Authors:  Kuo-Chen Chou; Zhi-Cheng Wu; Xuan Xiao
Journal:  Mol Biosyst       Date:  2011-12-01

2.  Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine.

Authors:  Wei Chen; Hao Lin
Journal:  Comput Biol Med       Date:  2012-01-31       Impact factor: 4.589

3.  MemHyb: predicting membrane protein types by hybridizing SAAC and PSSM.

Authors:  Maqsood Hayat; Asifullah Khan
Journal:  J Theor Biol       Date:  2011-10-06       Impact factor: 2.691

4.  Predicting plant protein subcellular multi-localization by Chou's PseAAC formulation based multi-label homolog knowledge transfer learning.

Authors:  Suyu Mei
Journal:  J Theor Biol       Date:  2012-06-27       Impact factor: 2.691

5.  iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites.

Authors:  Xuan Xiao; Zhi-Cheng Wu; Kuo-Chen Chou
Journal:  J Theor Biol       Date:  2011-06-17       Impact factor: 2.691

6.  Random Forest classification based on star graph topological indices for antioxidant proteins.

Authors:  Enrique Fernández-Blanco; Vanessa Aguiar-Pulido; Cristian Robert Munteanu; Julian Dorado
Journal:  J Theor Biol       Date:  2012-10-29       Impact factor: 2.691

7.  Discriminating outer membrane proteins with Fuzzy K-nearest Neighbor algorithms based on the general form of Chou's PseAAC.

Authors:  Maqsood Hayat; Asifullah Khan
Journal:  Protein Pept Lett       Date:  2012-04       Impact factor: 1.890

8.  Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions.

Authors:  Chen Ding; Lu-Feng Yuan; Shou-Hui Guo; Hao Lin; Wei Chen
Journal:  J Proteomics       Date:  2012-09-20       Impact factor: 4.044

9.  iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties.

Authors:  Wei Chen; Hao Lin; Peng-Mian Feng; Chen Ding; Yong-Chun Zuo; Kuo-Chen Chou
Journal:  PLoS One       Date:  2012-10-29       Impact factor: 3.240

10.  Bag of Naïve Bayes: biomarker selection and classification from genome-wide SNP data.

Authors:  Francesco Sambo; Emanuele Trifoglio; Barbara Di Camillo; Gianna M Toffolo; Claudio Cobelli
Journal:  BMC Bioinformatics       Date:  2012-09-07       Impact factor: 3.169
