IonchanPred 2.0: A Tool to Predict Ion Channels and Their Types.

Ya-Wei Zhao¹, Zhen-Dong Su², Wuritu Yang³,⁴, Hao Lin⁵, Wei Chen⁶,⁷, Hua Tang⁸.

Abstract

Ion channels (IC) are ion-permeable protein pores located in the lipid membranes of all cells. Different ion channels have unique functions in different biological processes. Owing to the rapid development of high-throughput mass spectrometry, proteomic data are accumulating rapidly and provide an opportunity to systematically investigate and predict ion channels and their types. In this paper, we constructed a support vector machine (SVM)-based model to quickly predict ion channels and their types. To account for both residue sequence information and physicochemical properties, we employed a novel feature extraction method that combines dipeptide composition with the physicochemical correlation between two residues. A feature selection strategy was used to improve the performance of the model. Comparisons based on jackknife cross-validation demonstrated that our method was superior to other methods for predicting ion channels and their types. Based on the model, we built a web server called IonchanPred 2.0, which can be freely accessed at http://lin.uestc.edu.cn/server/IonchanPredv2.0.

Keywords: ion channels; machine learning method; pseudo-dipeptide composition

Substances: Dipeptides; Ion Channels

Year: 2017 PMID: 28837067 PMCID: PMC5618487 DOI: 10.3390/ijms18091838

Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923

1. Introduction

Ion channels are pore-forming membrane proteins that mediate the transmembrane exchange of inorganic ions (Figure 1). They exist in the membranes of all cells and are required in numerous physiological and pathological processes, such as regulating neuronal and cardiac excitability, muscle contraction, hormone secretion, fluid movement, and immune cell activation [1]. Because of their important role in biological processes, ion channels are often used as targets for disease diagnosis and drug development. There are over 300 types of ion channels in living cells [2], and they differ in structure and function. According to their gating mechanisms, ion channels can be divided into two main categories: voltage-gated ion channels (VGIC) and ligand-gated ion channels (LGIC) [3]. The opening and closing of voltage-gated ion channels depend on changes in the membrane potential, whereas the state of ligand-gated ion channels depends on ligand binding. Voltage-gated ion channels can be further classified into four subclasses: potassium (K⁺), sodium (Na⁺), calcium (Ca²⁺), and anion channels.

Figure 1

Schematic diagram of material exchange through ion channels.

In view of the important role and multiple types of ion channels, their structures and functions have continued to attract the attention of numerous researchers in recent years [4,5,6,7,8,9,10]. With the rapid growth of proteomic data, it is particularly important to develop bioinformatics tools that quickly predict and identify ion channels and their types. Consequently, many computational methods based on machine learning algorithms have been developed in the last 10 years [11,12,13,14,15,16,17]. Liu et al. [11] proposed a method to identify voltage-gated potassium channels and indicated that the local sequence information-based method was better than the global sequence information-based method. Saha et al. [12] developed a support vector machine (SVM)-based method that uses amino acid composition and dipeptide composition to predict voltage-gated ion channels and their subtypes. In 2011, our group [13] developed a more generalized predictive tool, called IonchanPred, which identified ion channels and their types accurately. Recently, Tiwari et al. [16] proposed a random forest-based method, and Gao et al. [17] proposed a model to predict ion channels and their subfamilies by combining an SVM-based model with a BLAST sequence similarity search.

Although many predictors for identifying ion channels are available, three essential issues remain unresolved. Firstly, the use of highly similar sequences may lead to an overestimate of a model's performance. Secondly, the long-range sequence-order effect is lost in most published models. Thirdly, the web servers should be improved. In this paper, a support vector machine-based model was constructed to quickly identify ion channels and their types. In this model, a novel feature extraction method called pseudo-dipeptide composition was employed. Analysis of variance (ANOVA) [18] was introduced to rank the features. Incremental feature selection (IFS) was then employed to find an optimized feature set that produces the maximum accuracy. Finally, a web server called IonchanPred 2.0 was established. The flow chart is shown in Figure 2.

Figure 2

Workflow of the IonchanPred 2.0 model.

2. Results and Discussion

2.1. Parameter Optimization

The proposed model depends on two important parameters, λ and ω. The factor λ denotes the rank of correlation, and a larger λ may capture more global sequence-order information. ω represents the weight of the correlation of the residues' physicochemical properties relative to the traditional dipeptide components. To obtain the optimal values of the two parameters, a series of experiments was performed over a grid of (λ, ω) combinations, and the accuracy of the SVM was evaluated for each combination with the jackknife test. The optimal parameter combinations for the three datasets are shown in Table 1. The highest overall accuracy reaches 87.5% when λ = 21 and ω = 0.20 for the dataset of ion channels and non-ion channels (NIC). For the benchmark dataset VGIC vs. LGIC, the maximum accuracy is 93.9% when λ = 7 and ω = 0.30. The best model for predicting the four types of VGIC produces an overall accuracy of 89.1%. With the parameters optimized, the samples of the three datasets are formulated as follows: a 589-dimensional vector comprising 400 dimensions for the traditional dipeptide composition and 189 dimensions for correlation information for IC vs. NIC prediction; a 463-dimensional vector comprising 400 + 63 dimensions for VGIC vs. LGIC; and a 535-dimensional vector comprising 400 + 135 dimensions for the four types of voltage-gated ion channels.

Table 1

Optimal parameters for the three datasets.

Database	λ	ω	OA (%)
IC vs. NIC	21	0.20	87.5
VGIC vs. LGIC	7	0.30	93.9
four types of VGIC	9	0.15	89.1

IC: ion channels; NIC: non-ion channels; VGIC: voltage-gated ion channels; LGIC: ligand-gated ion channels; OA: overall accuracy.

2.2. Model Establishment

In order to further improve the accuracy, we used ANOVA to exclude noisy or redundant information. The features were sorted in decreasing order of their F values (described in Section 3.3, Feature Selection) to obtain a ranked feature list. Then, we used IFS to determine the optimal number of features, as follows. The first feature subset contains only the top-ranked feature in the list. A new feature subset is formed each time the next feature in the list is added. We repeated this process until all candidate features had been added. In this way, we obtained 589, 463, and 535 feature subsets, respectively, for the three benchmark datasets mentioned above. The performance of each feature subset was examined by using SVM with the jackknife test. The relationship between the overall accuracy and the number of features is plotted in Figure 3. The prediction performance was best when the top-ranked 527, 460, and 147 features were used for the three datasets, respectively.
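
The incremental procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a leave-one-out 1-nearest-neighbour classifier stands in for the SVM used in the paper, and the function names and the toy data are hypothetical.

```python
def jackknife_accuracy(X, y, columns):
    """Leave-one-out accuracy of a 1-nearest-neighbour stand-in classifier.

    Only the feature indices listed in `columns` are used for distances.
    """
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue  # the held-out sample never votes for itself
            dist = sum((X[i][k] - X[j][k]) ** 2 for k in columns)
            if dist < best_dist:
                best_label, best_dist = y[j], dist
        correct += best_label == y[i]
    return correct / len(X)


def incremental_feature_selection(X, y, ranked):
    """Score the top-1, top-2, ... feature subsets of a ranked feature list.

    Returns the size of the best subset and the accuracy of every subset.
    """
    accuracies = [jackknife_accuracy(X, y, ranked[:n])
                  for n in range(1, len(ranked) + 1)]
    best_size = accuracies.index(max(accuracies)) + 1
    return best_size, accuracies


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 2.0], [1.1, 8.0], [1.2, 4.0]]
y = [0, 0, 0, 1, 1, 1]
best_size, accuracies = incremental_feature_selection(X, y, ranked=[0, 1])
```

In this toy case the noisy second feature lowers the accuracy, so the best subset contains only the top-ranked feature. The same effect explains why fewer than all features gave the best results in the paper.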

Figure 3

The feature selection results for three independent datasets. (a) Incremental feature selection (IFS) curve for ion channel (IC) vs. non-ion channel (NIC) prediction; (b) IFS curve for voltage-gated ion channels (VGIC) vs. ligand-gated ion channels (LGIC) prediction; (c) IFS curve for four types of VGIC prediction.

In order to further evaluate the predictive performance of our model, we also calculated the average accuracies for the three datasets. A comparison of the results with the previous model [13] is shown in Table 2. The overall and average accuracies of the proposed model are higher than those of the previous model on all three datasets.

Table 2

Performance evaluation parameters of our proposed model and a previous model.

Dataset	Class	Sn (our model)	OA (our model)	AA (our model)	Sn (previous model [13])	OA (previous model [13])	AA (previous model [13])
IC vs. NIC	IC	80.2	87.8	87.8	85.9	86.6	86.6
IC vs. NIC	NIC	95.3	87.8	87.8	87.3	86.6	86.6
VGIC vs. LGIC	VGIC	94.7	94.0	94.0	94.6	92.6	92.7
VGIC vs. LGIC	LGIC	93.2	94.0	94.0	90.7	92.6	92.7
Types of VGIC	K⁺	97.5	92.6	87.7	92.6	87.8	83.7
Types of VGIC	Ca²⁺	89.7	92.6	87.7	82.8	87.8	83.7
Types of VGIC	Na⁺	75.0	92.6	87.7	75.0	87.8	83.7
Types of VGIC	An⁻	88.5	92.6	87.7	84.6	87.8	83.7

Sn: sensitivity; AA: average accuracy; OA: overall accuracy; IC: ion channels; NIC: non-ion channels; VGIC: voltage-gated ion channels; LGIC: ligand-gated ion channels.

3. Materials and Methods

3.1. Benchmark Databases

The data used to establish the prediction model in this paper were collected from Lin et al. [13]. The sequences of ion channels were collected from the Universal Protein Resource (UniProt) [19] and the Ligand-Gated Ion Channel database [20]. To construct a high-quality benchmark dataset, sequences were removed if they met any of three criteria: (i) the sequence contained ambiguous residues (such as "X", "B", or "Z"); (ii) the sequence was a fragment of another protein; (iii) the sequence was annotated based on homology or prediction. Then, redundant sequences were removed by using the CD-HIT [21] program with a sequence identity threshold of 40%, an approach that has been widely used to filter out redundant samples in genomics and proteomics [22,23,24,25,26]. After this preprocessing, we obtained 298 ion channels, including 148 voltage-gated ion channels and 150 ligand-gated ion channels. The voltage-gated ion channels comprise four subtypes: 81 potassium (K⁺), 29 calcium (Ca²⁺), 12 sodium (Na⁺), and 26 voltage-gated anion channels. In addition, 300 non-ion channel proteins were randomly selected from the membrane proteins that were not marked as ion channels in the UniProt database. The sequence identity between any two of these non-ion channel proteins is less than 40%.
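
As a small illustration of the first filtering criterion, the sketch below drops sequences that contain the ambiguous residue codes named above. The function names and the toy records are hypothetical. The fragment, annotation, and CD-HIT steps are not reproduced here.

```python
AMBIGUOUS = set("XBZ")  # ambiguous residue codes named in the text


def is_clean(sequence):
    """Return True if the sequence contains no ambiguous residue codes."""
    return not (set(sequence.upper()) & AMBIGUOUS)


def filter_sequences(records):
    """Keep the (identifier, sequence) pairs whose sequence passes is_clean."""
    return [(name, seq) for name, seq in records if is_clean(seq)]


# Hypothetical toy records, not taken from the benchmark dataset.
records = [("seq1", "MKTLLVAG"), ("seq2", "MKXLLVAG"), ("seq3", "MBTLZ")]
kept = filter_sequences(records)
```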

3.2. Feature Extraction of Samples

In order to characterize each protein sequence as accurately as possible, sequence-order effects are commonly incorporated when generating feature vectors. Therefore, PseAAC [27,28] incorporating dipeptide composition was selected as the feature extraction method in this paper. Consider a protein sequence P of L amino acid residues:

$$P = R_1 R_2 R_3 \cdots R_L$$

where $R_i$ represents the amino acid residue at the $i$-th sequence position. From any such sequence we can obtain a feature vector of dimension $400 + 9\lambda$, as in Equation (1):

$$P = [x_1, x_2, \ldots, x_{400}, x_{401}, \ldots, x_{400+9\lambda}]^{T} \qquad (1)$$

where the first 400 features represent the classical dipeptide composition, and the remaining $9\lambda$ elements represent the sequence-order effect of the protein sample, namely the first-tier to $\lambda$-th tier correlation factors of the protein sequence. These features can be calculated by:

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{400} f_i + \omega \sum_{j=1}^{9\lambda} \theta_j}, & 1 \le u \le 400 \\[2ex] \dfrac{\omega\,\theta_{u-400}}{\sum_{i=1}^{400} f_i + \omega \sum_{j=1}^{9\lambda} \theta_j}, & 400 < u \le 400 + 9\lambda \end{cases}$$

where $f_u$ is the normalized occurrence frequency of the $u$-th of the 400 dipeptides in protein P; $\omega$ is the weight factor; and $\theta_j$ is the $j$-tier sequence-correlation factor computed by:

$$\theta_{9(k-1)+\xi} = \frac{1}{L-k} \sum_{i=1}^{L-k} H^{\xi}_{i,i+k}, \qquad k = 1, 2, \ldots, \lambda; \quad \xi = 1, 2, \ldots, 9$$

where $H^{\xi}_{i,j}$ is the correlation function of physicochemical properties and can be calculated as:

$$H^{\xi}_{i,j} = h^{\xi}(R_i)\, h^{\xi}(R_j)$$

where $h^{\xi}(R_i)$ denotes the value of the $\xi$-th physicochemical property of $R_i$, and $h^{\xi}(R_j)$ is defined similarly. To obtain a high-quality feature set, all physicochemical property values must be subjected to a standard conversion:

$$h^{\xi}(R_i) = \frac{h_0^{\xi}(R_i) - \frac{1}{20}\sum_{k=1}^{20} h_0^{\xi}(R_k)}{\sqrt{\frac{1}{20}\sum_{u=1}^{20}\left[h_0^{\xi}(R_u) - \frac{1}{20}\sum_{k=1}^{20} h_0^{\xi}(R_k)\right]^{2}}}$$

where $R_k$ ($k = 1, 2, \ldots, 20$) represents the 20 native amino acids in the alphabetical order of their single-letter codes: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, and Y; and $h_0^{\xi}(R_k)$ denotes the original value of the $\xi$-th physicochemical property for residue $R_k$. The values obtained after the standard conversion have two advantages: they have zero mean over the 20 native amino acids, and they remain unchanged if they are subjected to the same conversion procedure again. The values of the nine physicochemical properties used in this paper are taken from previous results [29].
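
The feature construction described in this section can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the correlation function is the product of the two standardized property values, and it uses two made-up property scales in place of the nine scales from [29]. All names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def standardize(raw):
    """Zero-mean, unit-deviation conversion of one property over 20 residues."""
    values = [raw[a] for a in AMINO_ACIDS]
    mean = sum(values) / 20
    sd = (sum((v - mean) ** 2 for v in values) / 20) ** 0.5
    return {a: (raw[a] - mean) / sd for a in AMINO_ACIDS}


def pseudo_dipeptide_composition(seq, properties, lam, omega):
    """Return a vector of 400 + len(properties) * lam features for `seq`."""
    length = len(seq)
    if lam >= length:
        raise ValueError("lam must be smaller than the sequence length")
    # Normalized occurrence frequencies of the 400 dipeptides.
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(length - 1):
        counts[seq[i:i + 2]] += 1
    freqs = [counts[d] / (length - 1) for d in DIPEPTIDES]
    # Correlation factors: one per property for each tier 1..lam.
    thetas = []
    for tier in range(1, lam + 1):
        for h in properties:
            total = sum(h[seq[i]] * h[seq[i + tier]]  # assumed product form
                        for i in range(length - tier))
            thetas.append(total / (length - tier))
    denom = sum(freqs) + omega * sum(thetas)
    return [f / denom for f in freqs] + [omega * t / denom for t in thetas]


# Made-up property scales for illustration only (not the scales from [29]).
raw1 = {a: float(i) for i, a in enumerate(AMINO_ACIDS)}
raw2 = {a: float((7 * i) % 20) for i, a in enumerate(AMINO_ACIDS)}
props = [standardize(raw1), standardize(raw2)]
vec = pseudo_dipeptide_composition("MKTLLVAGACDEFGHIK", props, lam=3, omega=0.2)
```

With nine properties and λ = 21, the same function would return 400 + 9 × 21 = 589 features, which matches the dimension reported for the IC vs. NIC dataset.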

3.3. Feature Selection

Generally, not all features contribute equally to an ion channel prediction system. Some features make key contributions, whereas others make only minor contributions [30,31]. Therefore, feature selection is an important step in establishing an effective prediction model. In this paper, ANOVA was used to choose the optimal feature sets. To assess the contribution of each feature to the predictive system, the F value of the $\xi$-th feature was defined as follows:

$$F(\xi) = \frac{s_B^2(\xi)}{s_W^2(\xi)}$$

where $s_B^2(\xi)$ and $s_W^2(\xi)$ respectively denote the sample variance between groups (also called mean square between, MSB) and the sample variance within groups (also called mean square within, MSW), and are expressed as:

$$s_B^2(\xi) = \frac{1}{K-1} \sum_{i=1}^{K} n_i \left( \bar{f}_\xi(i) - \bar{f}_\xi \right)^2$$

$$s_W^2(\xi) = \frac{1}{N-K} \sum_{i=1}^{K} \sum_{j=1}^{n_i} \left( f_\xi(i,j) - \bar{f}_\xi(i) \right)^2$$

with the group mean $\bar{f}_\xi(i) = \frac{1}{n_i}\sum_{j=1}^{n_i} f_\xi(i,j)$ and the grand mean $\bar{f}_\xi = \frac{1}{N}\sum_{i=1}^{K}\sum_{j=1}^{n_i} f_\xi(i,j)$. Here K and N respectively denote the number of groups and the total number of samples; $f_\xi(i,j)$ represents the frequency of the $\xi$-th feature of the $j$-th sample in the $i$-th group; and $n_i$ denotes the total number of samples in the $i$-th group. Thus, each feature corresponds to an F score. A larger F value means a greater contribution of the corresponding feature to the classification, so all features can be ranked according to their F values.

Subsequently, we used incremental feature selection (IFS) to determine the optimal number of features [32]. Firstly, we examined the accuracy of the first feature subset, which contains only the feature with the highest F value in the ranked feature set. Secondly, we investigated the accuracy of the second feature subset, which was produced by adding the feature with the second highest F value. This process was repeated from the highest to the lowest F value until all candidate features had been added. The performance of every feature subset was evaluated, and the subset that produced the maximum accuracy was taken as the best feature subset.
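
The F value defined in this section is the standard one-way ANOVA statistic, so it can be sketched in plain Python. The function names and toy data are illustrative assumptions. A feature that is constant within every group would give a zero denominator and is not handled here.

```python
def f_score(groups):
    """One-way ANOVA F value of a single feature.

    groups: one list of feature values per class.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Mean square between groups (MSB).
    msb = sum(len(g) * (m - grand_mean) ** 2
              for g, m in zip(groups, means)) / (k - 1)
    # Mean square within groups (MSW).
    msw = sum((v - m) ** 2
              for g, m in zip(groups, means) for v in g) / (n - k)
    return msb / msw


def rank_features(X, y):
    """Return feature indices sorted by decreasing F value."""
    labels = sorted(set(y))
    scores = []
    for col in range(len(X[0])):
        groups = [[row[col] for row, lab in zip(X, y) if lab == c]
                  for c in labels]
        scores.append(f_score(groups))
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)


# Hypothetical toy data: column 1 separates the classes, column 0 does not.
X = [[5, 1], [3, 2], [4, 3], [4, 4], [5, 5], [3, 6]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
```

For the separating column the group means are 2 and 5 around a grand mean of 3.5, so MSB = 13.5 and MSW = 1, giving F = 13.5. The other column has equal group means and F = 0.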

3.4. Support Vector Machine

SVM is a classification algorithm that improves the generalization ability of a learning machine and minimizes both the empirical risk and the confidence interval by minimizing the structural risk. Therefore, good statistical results can usually be achieved even with a small sample. As a powerful supervised learning method, SVM has been widely used in various fields, including bioinformatics [33,34,35,36,37,38]. In this paper, we used LIBSVM 3.21 [39], which can be freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/. The radial basis function (RBF) was selected as the kernel function, and the one vs. one (OVO) strategy was used for multiclass classification. To obtain the optimal model, the penalty constant C and the kernel width parameter γ were tuned by a grid search [39]. The search spaces for C and γ were [] and [] with steps being 2 and, respectively.
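
To make the two ingredients named above concrete, the sketch below shows the RBF kernel and the majority vote of the one vs. one strategy. LIBSVM performs both internally, so this is only an illustration. The pairwise deciders are hypothetical stand-ins for trained binary SVMs.

```python
from collections import Counter
from itertools import combinations
from math import exp


def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def one_vs_one_predict(x, classes, pairwise):
    """Majority vote over the k(k-1)/2 binary deciders stored in `pairwise`.

    pairwise maps a class pair (a, b) to a callable that returns a or b.
    """
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical deciders: "Ca" wins every pair it takes part in, otherwise
# the first class of the pair wins. Real deciders would be trained SVMs.
classes = ["K", "Ca", "Na", "An"]
pairwise = {
    (a, b): (lambda x, a=a, b=b: "Ca" if "Ca" in (a, b) else a)
    for a, b in combinations(classes, 2)
}
label = one_vs_one_predict([0.1, 0.2], classes, pairwise)
```

With four classes there are six pairwise deciders. Here "Ca" collects three votes, "K" two, and "Na" one, so the predicted label is "Ca".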

3.5. Performance Evaluation

A cross-validation technique is generally employed to estimate the accuracy of a predictive model. Three cross-validation methods can be used: the independent dataset test, the subsampling test, and the jackknife test [40,41,42,43]. Among them, the jackknife test is considered the most objective and rigorous. Therefore, the jackknife test was employed to assess the performance of our method. In addition, we used the following criteria to evaluate the effectiveness of our predictive model: sensitivity (Sn), overall accuracy (OA), and average accuracy (AA), defined as follows:

$$Sn(i) = \frac{TP(i)}{TP(i) + FN(i)}$$

$$OA = \frac{1}{N} \sum_{i=1}^{n} TP(i)$$

$$AA = \frac{1}{n} \sum_{i=1}^{n} Sn(i)$$

where $TP(i)$ and $FN(i)$ respectively denote the true positives and false negatives of the $i$-th class, and N and n represent the total number of samples and the number of classes, respectively.
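
As a worked check of these definitions, the sketch below computes Sn, OA, and AA for the four types of VGIC. The per-class counts are an assumption: they are back-calculated from the class sizes in Section 3.1 and the sensitivities reported in Table 2, not taken from the paper. The function name is hypothetical.

```python
def evaluate(tp, fn):
    """Return per-class sensitivity, overall accuracy, and average accuracy.

    tp, fn: dicts mapping each class to its true-positive and
    false-negative counts.
    """
    sn = {c: tp[c] / (tp[c] + fn[c]) for c in tp}
    total = sum(tp[c] + fn[c] for c in tp)
    oa = sum(tp.values()) / total   # correctly predicted samples over all
    aa = sum(sn.values()) / len(sn)  # unweighted mean of the sensitivities
    return sn, oa, aa


# Assumed counts, inferred from class sizes 81, 29, 12, 26 and Table 2.
tp = {"K": 79, "Ca": 26, "Na": 9, "An": 23}
fn = {"K": 2, "Ca": 3, "Na": 3, "An": 3}
sn, oa, aa = evaluate(tp, fn)
percentages = {c: round(100 * v, 1) for c, v in sn.items()}
```

Under these assumed counts the sensitivities round to 97.5, 89.7, 75.0, and 88.5, with OA = 137/148 (92.6%) and AA 87.7%, in agreement with Table 2. OA is pulled toward the large K⁺ class, whereas AA weights the small Na⁺ class equally, which is why the two differ.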

4. Conclusions

We constructed an SVM-based model for the accurate prediction of ion channel proteins and their types. In this model, a pseudo-dipeptide composition was adopted to extract features. The ANOVA was used to exclude noise or redundant information of feature vectors and then IFS was employed to determine the optimal number of features. High accuracies indicated that the proposed method was an effective tool for predicting ion channels and their types. A free web server based on the proposed method presented in this paper has been constructed and is accessible at the website (http://lin.uestc.edu.cn/server/IonchanPredv2.0).

41 in total

1. Prediction of ketoacyl synthase family using reduced amino acid alphabets.

Authors: Wei Chen; Pengmian Feng; Hao Lin
Journal: J Ind Microbiol Biotechnol Date: 2011-10-26 Impact factor: 3.346

2. Identification of voltage-gated potassium channel subfamilies from sequence information using support vector machine.

Authors: Wei Chen; Hao Lin
Journal: Comput Biol Med Date: 2012-01-31 Impact factor: 4.589

3. Identifying Sigma70 Promoters with Novel Pseudo Nucleotide Composition.

Authors: Hao Lin; Zhi-Yong Liang; Hua Tang; Wei Chen
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2017-02-08 Impact factor: 3.710

4. Sequence-based predictive modeling to identify cancerlectins.

Authors: Hong-Yan Lai; Xin-Xin Chen; Wei Chen; Hua Tang; Hao Lin
Journal: Oncotarget Date: 2017-04-25

5. A Dual Role for the OsK5.2 Ion Channel in Stomatal Movements and K⁺ Loading into Xylem Sap.

Authors: Thanh Hao Nguyen; Shouguang Huang; Donaldo Meynard; Christian Chaine; Rémy Michel; M Rob G Roelfsema; Emmanuel Guiderdoni; Hervé Sentenac; Anne-Aliénor Véry
Journal: Plant Physiol Date: 2017-06-16 Impact factor: 8.340

6. Ion channels enable electrical communication in bacterial communities.

Authors: Arthur Prindle; Jintao Liu; Munehiro Asally; San Ly; Jordi Garcia-Ojalvo; Gürol M Süel
Journal: Nature Date: 2015-10-21 Impact factor: 49.962

7. iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences.

Authors: Wei Chen; Pengmian Feng; Hui Yang; Hui Ding; Hao Lin; Kuo-Chen Chou
Journal: Oncotarget Date: 2017-01-17

8. UniProt: the universal protein knowledgebase.

Authors:
Journal: Nucleic Acids Res Date: 2016-11-29 Impact factor: 16.971

9. PSIONplus: Accurate Sequence-Based Predictor of Ion Channels and Their Types.

Authors: Jianzhao Gao; Wei Cui; Yajun Sheng; Jishou Ruan; Lukasz Kurgan
Journal: PLoS One Date: 2016-04-04 Impact factor: 3.240

10. Prediction of phosphothreonine sites in human proteins by fusing different features.

Authors: Ya-Wei Zhao; Hong-Yan Lai; Hua Tang; Wei Chen; Hao Lin
Journal: Sci Rep Date: 2016-10-04 Impact factor: 4.379
