Predicting Presynaptic and Postsynaptic Neurotoxins by Developing Feature Selection Technique.

Hua Tang¹, Yunchun Yang², Chunmei Zhang¹, Rong Chen¹, Po Huang¹, Chenggang Duan¹, Ping Zou¹.

Abstract

Presynaptic and postsynaptic neurotoxins are proteins that act at the presynaptic and postsynaptic membranes, respectively. Correctly predicting presynaptic and postsynaptic neurotoxins will provide important clues for drug-target discovery and drug design. In this study, we developed a theoretical method to discriminate presynaptic neurotoxins from postsynaptic neurotoxins. A strict and objective benchmark dataset was constructed to train and test the proposed model. Dipeptide composition was used to formulate neurotoxin samples. Analysis of variance (ANOVA) was used to find the optimal feature set, that is, the one that produces the maximum accuracy. In the jackknife cross-validation test, an overall accuracy of 94.9% was achieved. We believe that the proposed model will provide important information for the study of neurotoxins.

Substances:
Amino Acids
Neurotoxins

Year: 2017 PMID: 28303250 PMCID: PMC5337787 DOI: 10.1155/2017/3267325

Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411

1. Introduction

Neurotoxins typically act on ion channels to block or enhance synaptic transmission. According to their mechanism of action, neurotoxins can be classified into a presynaptic type and a postsynaptic type [1]. Presynaptic neurotoxins act at the presynaptic membrane [2]. They usually block neuromuscular transmission and inhibit neurotransmitter release through their specific enzymatic activities [3]. Postsynaptic neurotoxins bind to the postsynaptic membrane and to acetylcholine receptors [4]. Thus, the study of presynaptic and postsynaptic neurotoxins can give important clues for drug-target discovery and drug design. The function and structure of neurotoxins can be determined reliably by biochemical experiments; however, such experiments are time-consuming and costly. The huge number of protein sequences generated in the postgenomic age provides an important opportunity to design computational methods that predict protein function quickly and accurately. Thus, it is important to develop a machine learning approach to predict presynaptic and postsynaptic neurotoxins. Recently, Yang and Li developed an increment of diversity (ID)-based method to identify presynaptic and postsynaptic neurotoxins [5]. Their benchmark dataset, which included 78 presynaptic neurotoxins and 69 postsynaptic neurotoxins, was downloaded from the Animal Toxin Database (ATDB) [6]. The overall accuracy was 90.39% in jackknife cross-validation, which is far from satisfactory. Subsequently, Song proposed a bilayer support vector machine (SVM) to improve prediction accuracy on a new benchmark dataset [7]. Although the overall accuracy improved dramatically, the sequence identity of that dataset was so high that the results were overestimated. To overcome these shortcomings, in this study we developed a new method based on a feature selection technique to predict presynaptic and postsynaptic neurotoxins.
In the following sections, we describe how to construct a new benchmark dataset, how to formulate neurotoxin samples from their peptide sequences, and how to obtain the result produced by the best feature subset.

2. Materials and Methods

2.1. Benchmark Dataset Construction

A high-quality benchmark dataset is fundamental to building a reliable and accurate model. The Universal Protein Resource (UniProt) provides the scientific community with a single, centralized, authoritative resource for protein sequences and functional information [8]. We therefore downloaded presynaptic and postsynaptic neurotoxins from UniProt. Ambiguous information reduces the quality of a benchmark dataset and makes the prediction model unreliable, so we excluded protein sequences that contain ambiguous residues (such as “X,” “B,” and “Z”) or that are fragments of other proteins. Highly similar sequences in a benchmark dataset lead to overestimated results. Thus, the CD-HIT program was used to remove highly similar sequences, with the sequence identity cutoff set to 80% [9]. After this screening procedure, the final benchmark dataset included 256 neurotoxin samples, which can be formulated as
$$S = S_{\mathrm{Pre}} \cup S_{\mathrm{Pro}}, \tag{1}$$
where the subset $S_{\mathrm{Pre}}$ contains 91 presynaptic neurotoxins and $S_{\mathrm{Pro}}$ contains 165 postsynaptic neurotoxins.
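
The sequence-level screening described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(identifier, sequence, is_fragment)` and the function names are hypothetical, and the filter is slightly stricter than the text (it rejects any character outside the 20 standard residues, not only “X,” “B,” and “Z”). The CD-HIT redundancy removal is a separate step and is not reproduced here.

```python
# Hypothetical pre-filter applied before redundancy removal.
# Assumption: each record is (identifier, sequence, is_fragment).
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def is_unambiguous(sequence):
    """Return True if the sequence is non-empty and uses only standard residues."""
    seq = sequence.upper()
    return len(seq) > 0 and all(residue in STANDARD_RESIDUES for residue in seq)


def screen_records(records):
    """Drop fragments and sequences containing ambiguous residues."""
    return [
        (identifier, sequence)
        for identifier, sequence, is_fragment in records
        if not is_fragment and is_unambiguous(sequence)
    ]
```

For example, a record whose sequence contains “X” or that is flagged as a fragment would be dropped, while a complete sequence made only of standard residues would be kept.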

2.2. The Dipeptide Composition

One of the most important steps in a prediction problem is to formulate neurotoxin sequences with an effective mathematical expression. Generally, a neurotoxin can be formulated by its entire residue sequence as follows:
$$P = R_1 R_2 R_3 \cdots R_L, \tag{2}$$
where $R_i$ denotes the $i$th residue of neurotoxin $P$ and $L$ is the number of residues in $P$. Straightforward and intuitive tools such as BLAST or FASTA can be used to find similar sequences. However, these tools are only suitable for query sequences that have highly similar sequences in the search dataset; if there are no similar sequences in the training dataset, they do not work well. A machine learning approach can overcome this problem and identify presynaptic and postsynaptic neurotoxins. To apply one, neurotoxin sequences must be converted into discrete vectors. The simplest representation of a neurotoxin is its residue composition, a 20-dimension vector. However, all sequence order information is then lost, which limits the prediction quality [10-13]. Thus, the dipeptide composition was used in this study. Accordingly, each neurotoxin sample in our benchmark dataset can be expressed as a 400-dimension vector and formulated as
$$D = [x_1, x_2, \ldots, x_u, \ldots, x_{400}]^{T}, \tag{3}$$
where $x_u$ ($u = 1, 2, \ldots, 400$) is the occurrence frequency of the $u$th dipeptide and is given by
$$x_1 = f(\mathrm{AA}),\ x_2 = f(\mathrm{AC}),\ \ldots,\ x_{399} = f(\mathrm{YW}),\ x_{400} = f(\mathrm{YY}), \tag{4}$$
where A, C, …, W, Y are the single-letter codes of the 20 native amino acids. $x_u$ can be calculated by
$$x_u = \frac{n_u}{L - 1}, \tag{5}$$
where $n_u$ denotes the number of occurrences of the $u$th dipeptide in neurotoxin $P$.
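
As a minimal sketch of the dipeptide composition, the following Python function counts overlapping residue pairs and divides by the number of pairs, $L - 1$. The function and constant names are illustrative, not taken from the paper, and the ordering of the 400 features (alphabetical by single-letter code, AA first and YY last) is an assumption.

```python
from itertools import product

# The 20 native amino acids in alphabetical order of their single-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 dipeptides: "AA", "AC", ..., "YW", "YY" (assumed feature order).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the 400-dimension dipeptide frequency vector of a sequence."""
    seq = sequence.upper()
    n_pairs = len(seq) - 1  # a sequence of length L has L - 1 overlapping pairs
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = seq[i:i + 2]
        if pair not in counts:
            raise ValueError("non-standard residue in dipeptide %r" % pair)
        counts[pair] += 1
    return [counts[d] / n_pairs for d in DIPEPTIDES]
```

For the toy sequence "AAC" there are two overlapping pairs, "AA" and "AC", so the first two components are each 0.5 and all others are 0.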

2.3. Support Vector Machine

SVM is a very popular machine learning method and has been widely used in bioinformatics [7, 14–18]. The basic idea of SVM is to map the input vector into a high-dimensional Hilbert space and to determine a separating hyperplane in that space. In this study, we used the LibSVM package 3.18 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) to implement SVM. Because it is well suited to nonlinear classification, the radial basis function (RBF), defined as
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}\right),$$
was used as the kernel function. In the SVM model construction, a grid search strategy with cross-validation was used to optimize the regularization parameter $C$ and the kernel parameter $\gamma$.
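
To make the kernel concrete, here is a small stand-alone Python sketch of the RBF kernel in its standard form. It is not LibSVM code; the function name is hypothetical, and any value of `gamma` passed to it is purely illustrative.

```python
import math


def rbf_kernel(x, y, gamma):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; a larger `gamma` makes that decay faster, which is why the parameter needs tuning together with `C`.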

2.4. Performance Evaluation

In this study, we used jackknife cross-validation to test the prediction. In the jackknife cross-validation test, each protein sample in the dataset is in turn singled out as an independent test sample, and all the rule parameters are calculated from the remaining proteins, without including the one being identified. The performance of the proposed method was estimated by three indexes, sensitivity (Sn), specificity (Sp), and overall accuracy (Acc), which can be expressed as
$$\mathrm{Sn} = 1 - \frac{N_{\mathrm{Pro}}^{\mathrm{Pre}}}{N^{\mathrm{Pre}}}, \qquad \mathrm{Sp} = 1 - \frac{N_{\mathrm{Pre}}^{\mathrm{Pro}}}{N^{\mathrm{Pro}}}, \qquad \mathrm{Acc} = 1 - \frac{N_{\mathrm{Pro}}^{\mathrm{Pre}} + N_{\mathrm{Pre}}^{\mathrm{Pro}}}{N^{\mathrm{Pre}} + N^{\mathrm{Pro}}},$$
where $N^{\mathrm{Pre}}$ and $N^{\mathrm{Pro}}$ are the total numbers of presynaptic and postsynaptic neurotoxins, respectively; $N_{\mathrm{Pro}}^{\mathrm{Pre}}$ is the number of presynaptic neurotoxins incorrectly predicted as postsynaptic neurotoxins; and $N_{\mathrm{Pre}}^{\mathrm{Pro}}$ is the number of postsynaptic neurotoxins incorrectly predicted as presynaptic neurotoxins.
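
The jackknife loop and the three indexes can be sketched in Python as follows. The function names are hypothetical, and `fit_predict` stands in for whatever classifier is trained on the remaining samples (an SVM in this study); it is an assumption of this sketch, not part of the original method description.

```python
def jackknife_predictions(samples, labels, fit_predict):
    """Leave each sample out in turn and predict it from the remaining ones.

    fit_predict(train_samples, train_labels, test_sample) must return a label.
    """
    predictions = []
    for i in range(len(samples)):
        train_samples = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_samples, train_labels, samples[i]))
    return predictions


def sn_sp_acc(n_pre, n_pro, pre_as_pro, pro_as_pre):
    """Sn, Sp, and Acc from class sizes and the two misclassification counts."""
    sn = 1 - pre_as_pro / n_pre
    sp = 1 - pro_as_pre / n_pro
    acc = 1 - (pre_as_pro + pro_as_pre) / (n_pre + n_pro)
    return sn, sp, acc
```

As a worked check against the numbers reported in this paper: with 91 presynaptic and 165 postsynaptic samples, 5 misclassified presynaptic and 8 misclassified postsynaptic samples give Sn = 86/91 ≈ 94.51%, Sp = 157/165 ≈ 95.15%, and Acc = 243/256 ≈ 94.92%. The counts 5 and 8 are not stated in the text; they are inferred here from the percentages in Table 1.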

3. Results and Discussion

Many published papers have demonstrated that optimized features can improve predictive accuracy [19-25]. In high-dimension data, some features are noise or redundant information that contributes negatively to the prediction. Thus, it is very important to develop a feature selection technique that excludes such uninformative features. The current study introduces a feature selection technique based on the principle of analysis of variance (ANOVA). Two parameters of feature $u$ can be defined as
$$SS_B(u) = \sum_{i} N_i \left( \frac{\sum_{j=1}^{N_i} f_{ij}(u)}{N_i} - \frac{\sum_{i} \sum_{j=1}^{N_i} f_{ij}(u)}{\sum_{i} N_i} \right)^{2}, \qquad SS_W(u) = \sum_{i} \sum_{j=1}^{N_i} \left( f_{ij}(u) - \frac{\sum_{j=1}^{N_i} f_{ij}(u)}{N_i} \right)^{2},$$
where $f_{ij}(u)$ denotes the frequency of the $u$th feature of the $j$th sample in the $i$th group ($i$ = Pre or Pro) and $N_i$ denotes the number of samples in the $i$th group. $SS_B(u)$ and $SS_W(u)$ are called the sum of squares between groups and the sum of squares within groups, respectively. If the samples within each group are close to each other, $SS_W(u)$ will be small. If the sample means of the two groups are close to each other, $SS_B(u)$ will be small. The sample variance between groups $s_B^2(u)$ and the sample variance within groups $s_W^2(u)$ are then given by
$$s_B^2(u) = \frac{SS_B(u)}{df_B}, \qquad s_W^2(u) = \frac{SS_W(u)}{df_W},$$
where $df_B$ and $df_W$ are the corresponding degrees of freedom. In this study, $df_B = 1$ and $df_W = N^{\mathrm{Pre}} + N^{\mathrm{Pro}} - 2 = 254$. According to statistical theory, the ratio of $s_B^2(u)$ to $s_W^2(u)$ follows the $F$ sampling distribution with $df_B$ and $df_W$ degrees of freedom under the null hypothesis. Thus, we used the ratio $F(u)$ to measure the contribution of each feature, defined as follows:
$$F(u) = \frac{s_B^2(u)}{s_W^2(u)}.$$
$F(u)$ reveals how strongly the $u$th feature is related to the group variable. Accordingly, the 400 dipeptides in (3) were ranked by their $F(u)$ values. Subsequently, the incremental feature selection (IFS) strategy was used to find an optimal feature subset. In the IFS procedure, we first examined the performance of the single best feature, the one with the highest $F(u)$, by cross-validation. Next, the feature with the second highest $F(u)$ was added to form a new feature subset, which was also input into the SVM, and the accuracy was calculated. This process was repeated until all 400 feature subsets had been examined. With the number of features as the abscissa and Acc as the ordinate, the IFS curve is plotted in Figure 1. The figure shows that, in the jackknife cross-validation, the maximum Acc of 94.9% is obtained with the top 190 features, which are regarded as the optimal feature subset.
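
The ranking and the IFS loop described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and `accuracy_of` is an assumed callback that returns the cross-validated accuracy for a given list of feature indices (in this study that would involve training an SVM, which is not reproduced here).

```python
def f_score(pre_values, pro_values):
    """One-way ANOVA F ratio of a single feature across the two groups."""
    groups = [pre_values, pro_values]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1       # 1 for two groups
    df_within = n_total - len(groups)  # N_Pre + N_Pro - 2
    if ss_within == 0:
        # Degenerate case (no within-group spread); handled by assumption.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / df_between) / (ss_within / df_within)


def rank_features(pre_samples, pro_samples):
    """Return feature indices sorted by decreasing F ratio."""
    n_features = len(pre_samples[0])
    scores = [
        f_score([s[u] for s in pre_samples], [s[u] for s in pro_samples])
        for u in range(n_features)
    ]
    return sorted(range(n_features), key=lambda u: scores[u], reverse=True)


def incremental_feature_selection(ranking, accuracy_of):
    """Grow the subset one ranked feature at a time; keep the best-scoring prefix."""
    best_size, best_accuracy = 0, -1.0
    for size in range(1, len(ranking) + 1):
        accuracy = accuracy_of(ranking[:size])
        if accuracy > best_accuracy:
            best_size, best_accuracy = size, accuracy
    return ranking[:best_size], best_accuracy
```

With ties in accuracy, this sketch keeps the smallest subset that reaches the maximum; the text does not say how ties were resolved, so that choice is an assumption.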

Figure 1

A plot to show the feature selection results. The maximum accuracy is 94.92% by using the top 190 features.

It is important to compare the performance of different methods. However, a strict comparison is not feasible because the benchmark datasets are different. We therefore made a rough comparison and recorded the results in Table 1. Yang and Li proposed an ID-based method to predict presynaptic and postsynaptic neurotoxins on a benchmark dataset with a sequence identity of <80% [5]. At this comparable level of sequence identity, our method achieves a higher Sn, Sp, and Acc than Yang and Li's method. Song developed a bilayer support vector machine to improve the accuracy [7]. However, the sequence identity of that benchmark dataset reaches 90%, which leads to an overestimation of the method's performance. Thus, our proposed model provides a more objective and realistic estimate.

Table 1

Comparison of prediction performance for presynaptic and postsynaptic neurotoxins.

Method	Sn (%)	Sp (%)	Acc (%)
ID [5]	88.46	91.30	89.80
Bilayer SVM [7]	100.00	98.37	98.93
Our method	94.51	95.15	94.92

4. Conclusions

Knowledge of neurotoxins is conducive to drug design and drug-target discovery. Thus, the aim of this study was to develop a computational method to predict presynaptic and postsynaptic neurotoxins. A new feature selection technique was proposed to optimize the features and to improve prediction accuracy. The feature selection technique can also be used in other areas of bioinformatics.

24 in total

1. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition.

Authors: Hao Lin; En-Ze Deng; Hui Ding; Wei Chen; Kuo-Chen Chou
Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971

2. Decreased synaptic activity shifts the calcium dependence of release at the mammalian neuromuscular junction in vivo.

Authors: Xueyong Wang; Kathrin L Engisch; Yingjie Li; Martin J Pinter; Timothy C Cope; Mark M Rich
Journal: J Neurosci Date: 2004-11-24 Impact factor: 6.167

3. Prediction of protein structural class using tri-gram probabilities of position-specific scoring matrix and recursive feature elimination.

Authors: Peiying Tao; Taigang Liu; Xiaowei Li; Lanming Chen
Journal: Amino Acids Date: 2015-01-13 Impact factor: 3.520

4. PECM: prediction of extracellular matrix proteins using the concept of Chou's pseudo amino acid composition.

Authors: Jian Zhang; Pingping Sun; Xiaowei Zhao; Zhiqiang Ma
Journal: J Theor Biol Date: 2014-08-11 Impact factor: 2.691

5. Prediction of antiepileptic drug treatment outcomes using machine learning.

Authors: Sinisa Colic; Robert G Wither; Min Lang; Liang Zhang; James H Eubanks; Berj L Bardakjian
Journal: J Neural Eng Date: 2016-11-30 Impact factor: 5.379

6. Four new postsynaptic neurotoxins from Naja naja sputatrix venom: cDNA cloning, protein expression, and phylogenetic analysis.

Authors: F Afifiyan; A Armugam; P Gopalakrishnakone; N H Tan; C H Tan; K Jeyaseelan
Journal: Toxicon Date: 1998-12 Impact factor: 3.033

7. Different mechanism of blockade of neuroexocytosis by presynaptic neurotoxins.

Authors: O Rossetto; M Rigoni; C Montecucco
Journal: Toxicol Lett Date: 2004-04-01 Impact factor: 4.372

8. The Universal Protein Resource (UniProt).

Authors: Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
