
PSIONplus: Accurate Sequence-Based Predictor of Ion Channels and Their Types.

Jianzhao Gao^1, Wei Cui^2, Yajun Sheng^3, Jishou Ruan^1,4, Lukasz Kurgan^5,6.

Abstract

Ion channels are a class of membrane proteins that attracts a significant amount of basic research, also being potential drug targets. High-throughput identification of these channels is hampered by the low levels of availability of their structures and an observation that use of sequence similarity offers limited predictive quality. Consequently, several machine learning predictors of ion channels from protein sequences that do not rely on high sequence similarity were developed. However, only one of these methods offers a wide scope by predicting ion channels, their types and four major subtypes of the voltage-gated channels. Moreover, this and other existing predictors utilize relatively simple predictive models that limit their accuracy. We propose a novel and accurate predictor of ion channels, their types and the four subtypes of the voltage-gated channels called PSIONplus. Our method combines a support vector machine model and a sequence similarity search with BLAST. The originality of PSIONplus stems from the use of a more sophisticated machine learning model that for the first time in this area utilizes evolutionary profiles and predicted secondary structure, solvent accessibility and intrinsic disorder. We empirically demonstrate that the evolutionary profiles provide the strongest predictive input among new and previously used input types. We also show that all new types of inputs contribute to the prediction. Results on an independent test dataset reveal that PSIONplus obtains relatively good predictive performance and outperforms existing methods. It secures accuracies of 85.4% and 68.3% for the prediction of ion channels and their types, respectively, and the average accuracy of 96.4% for the discrimination of the four ion channel subtypes. Standalone version of PSIONplus is freely available from https://sourceforge.net/projects/psion/.

Substances:
Ion Channels

Year: 2016 PMID: 27044036 PMCID: PMC4820270 DOI: 10.1371/journal.pone.0152964

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Ion channels are membrane proteins that facilitate the flow of ions through lipid membranes [1, 2]. Besides their biological importance, they are of substantial research interest in the context of drug development [3-5]. There are over 300 types of ion channels in living cells [6]. They differ in their structures and cellular functions. Ion channels are gated by a variety of factors including voltage, ligands, membrane tension, temperature and light [7]. Considering their mechanism of activation, ion channels are mainly classified into the voltage-gated and ligand-gated ion channels [8, 9]. The ligand-gated ion channels open and close depending on the interactions with specific ligands while the voltage-gated ion channels function in response to the voltage gradient across the membrane. The voltage-gated ion channels can be further classified into several subtypes including potassium (K), sodium (Na), calcium (Ca), anion ion channels, proton channels, transient receptor potential channels and hyperpolarization-activated cyclic nucleotide-gated channels [9]. Studies of the structure and function of ion channels continue to attract significant research attention [10-16]. As a highlight, recent years have seen strong interest in the role of ion channels as antiviral targets [17]. In the specific case of influenza A, the structure and mechanistic details of the voltage-gated M2 proton channel were recently analyzed [18-20] and a few high-profile articles on the potential therapy that targets this channel were published [21, 22]. The strong research interest and ubiquity of ion channels [23-25] motivate the development of methods that predict them from protein sequences. A naïve approach that finds ion channels based on their sequence similarity to sequences of known channels was found to be flawed [26].
Consequently, more sophisticated machine learning methods were developed that can predict different types and subtypes of ion channels that are dissimilar in their sequences. In one of the first attempts, Liu et al. [27] proposed a method to predict voltage-gated potassium channels and certain families of this subtype of channels based on a simple dipeptide composition extracted from an input sequence and a Support Vector Machine (SVM) predictive model. Using a more advanced design that included an SVM model and dipeptide composition combined with PSI-BLAST-based [28] and HMMER-based [29] similarity searches, Saha et al. [30] developed the VGIchan method that predicts voltage-gated ion channels and their subtypes. More recently, in 2011 Lin et al. [31] proposed a method that offers a much wider scope including the prediction of ion channels, ion channel types, and the four subtypes of the voltage-gated ion channels. However, the design of this method was similar to the method by Liu et al. [27] and involved the use of a subset of amino acid and dipeptide composition values and the SVM model. In 2012, Chen and Lin [32] published an approach that is narrower in scope and predicts subfamilies of the voltage-gated potassium channels, yet again using a similar design that applies SVM and amino acid and dipeptide composition. Finally, in 2014 the same group released a slightly improved method for the prediction of subfamilies of the voltage-gated potassium channels that applies an empirically selected subset of tripeptide composition values and the SVM model [33]. All but one of the existing methods are characterized by a relatively narrow scope, being restricted to either voltage-gated potassium channels or voltage-gated ion channels. The one method that was developed in 2011 by Lin et al. offers a comprehensive scope but utilizes a relatively simple design that is similar to all other methods.
Our aim is to provide a novel method that offers a similarly comprehensive scope, i.e., it predicts whether a given sequence is an ion channel, what type of channel it is, and which subtype of the voltage-gated ion channel it is, while utilizing a more advanced design that should lead to an improved predictive performance. Our method considers an empirically selected collection of inputs that for the first time in this area utilizes physicochemical properties of amino acids derived from the input protein chain, position specific scoring matrix (PSSM) profiles generated by PSI-BLAST, and predicted secondary structure, relative solvent accessibility and intrinsic disorder.

Materials and Methods

Datasets

The data used to build the proposed prediction method are taken from Lin et al. [31]. Protein sequences were downloaded from UniProt [34] and the Ligand-Gated Ion channel database (http://www.ebi.ac.uk/compneur-srv/LGICdb/LGICdb.php) [35]. The chains that include non-standard amino acid types, fragments of proteins, and proteins annotated based on homology or predictions were excluded. The remaining sequences were clustered at 40% identity using CD-HIT [36] to remove similar chains. This resulted in 298 ion channel proteins with 150 ligand-gated and 148 voltage-gated ion channels. The voltage-gated ion channels include 81 potassium (K), 29 calcium (Ca), 12 sodium (Na) and 26 voltage-gated anion channels. To facilitate assessment of the prediction of ion channels vs. non-ion channels, 300 membrane proteins that were randomly selected from UniProt and that share <40% identity to the ion channel proteins were designated as the non-ion channel proteins. These data were used to derive three training datasets (Table 1). TRAIN_ION is used to develop a predictor that discriminates the ion channel and non-ion channel chains. TRAIN_VLG is used to build a predictor of ion channel types, i.e., voltage-gated vs. ligand-gated ion channels. Finally, TRAIN_VGS is the training dataset for prediction of the four subtypes of the voltage-gated ion channels.

Table 1

Datasets used to design and test the proposed method.

Dataset name	Annotations	Number of chains
TRAIN_ION	Ion channel	298
	Non-ion channel	300
TRAIN_VLG	Voltage-gated channel	148
	Ligand-gated channel	150
TRAIN_VGS	Potassium(K)	81
	Calcium(Ca)	29
	Sodium(Na)	12
	Anion	26
TEST30_ION	Ion channel	94
	Non-ion channel	104
TEST30_VLG	Voltage-gated channel	43
	Ligand-gated channel	17
TEST60_VGS	Potassium(K)	120
	Calcium(Ca)	49
	Sodium(Na)	23
	Anion	47

We also developed three new test datasets that include proteins that are dissimilar to proteins in the three training datasets. These test datasets, which were not used to design our predictor, were collected from UniProt two years after the dataset from Lin et al. was established. We followed the protocol from ref. [31]. We collected reviewed chains annotated with the following five Gene Ontology keywords: 1) “ligand-gated channel”; 2) “voltage-gated” and “potassium channel”; 3) “voltage-gated” and “calcium channel”; 4) “voltage-gated” and “sodium channel”; and 5) “voltage-gated” and “anion channel”. Next, we excluded annotations that were inferred from homology, predicted, or uncertain. The non-ion channel proteins were randomly selected from UniProt to match the number of the ion channels. We excluded chains with non-standard amino acid types (X, B and U) and chains that have similarity of over 30% with the proteins in any of the training datasets, based on the clustering with CD-HIT. Consequently, the TEST30_ION and TEST30_VLG datasets, which are used to assess prediction of ion channels and ion channel types, include 198 and 60 proteins, respectively (Table 1). Using the 30% similarity cutoff did not allow us to collect a sufficient number of proteins for the four subtypes of the voltage-gated ion channels to perform tests. Thus, the test set for these subtypes, TEST60_VGS, is based on a 60% similarity threshold to the training proteins and includes a total of 239 proteins (Table 1).

Assessment of the predictive performance

The predictors of the ion channels, their types and subtypes generate either a binary outcome (ion channel vs. non-ion channel and voltage-gated vs. ligand-gated) or four outcomes (potassium, sodium, calcium and anion ion channel). The assessment of these predictions uses the same measures as in the related works, including accuracy [27, 30–33] and Matthews correlation coefficient (MCC) [27, 30, 32, 33]:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. We also compute Fmeasure, which is a weighted average of the precision and recall and has maximal and minimal values of 1 and 0, respectively:

$$F_{measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \quad \mathrm{precision} = \frac{TP}{TP + FP}, \quad \mathrm{recall} = \frac{TP}{TP + FN}$$

The accuracy, MCC and Fmeasure are computed for the two binary predictions and for each of the four outcomes in the prediction of voltage-gated ion subtypes. We also compute average accuracy, MCC, Fmeasure and Q4 accuracy to summarize the overall prediction over the four subtypes:

$$\mathrm{Average\ measure} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{measure}_i, \quad Q_4 = \frac{1}{N} \sum_{i=1}^{n} TP_i$$

where measure_i is the accuracy, MCC or Fmeasure of the i-th subtype, TP_i is the number of correctly predicted sequences of the i-th subtype, N is the total number of sequences and n = 4 is the number of classes. The entire design process, which includes feature selection and parameterization of the predictive model, was run using five-fold cross validation on the training datasets; the same features and parameters are used in all benchmark tests. The resulting design is compared using N-fold cross-validation (jackknife test) on the training datasets with the results in ref. [31] where the same jackknife test was performed. Finally, we computed predictive performance on the test datasets utilizing our model trained on the corresponding training datasets.
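The measures described above can be sketched in a few lines of Python. This is only an illustration that assumes the standard confusion-matrix definitions of accuracy, MCC and F-measure (the harmonic mean of precision and recall) and takes Q4 to be the fraction of correctly classified sequences; the function names are hypothetical and are not part of PSIONplus.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, MCC, F-measure) for one binary confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is set to 0 when the denominator vanishes (an assumed convention).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return accuracy, mcc, f_measure

def q_accuracy(true_labels, predicted_labels):
    """Fraction of sequences assigned to their true class (Q4 for four classes)."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

For the four-class case, the per-subtype values would be obtained by treating each subtype in turn as the positive class and then averaging the results.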

Overall architecture of the predictor

The proposed method, PSIONplus (Predictor from Sequence of ION channels plus BLAST), combines predictions from a machine learning model (PSION) and from sequence alignment with BLAST. PSION consists of three modules: (1) the PSION_ION model that predicts whether a given input sequence is an ion channel; (2) the PSION_VLG model that predicts whether a given ion channel is voltage- or ligand-gated; and (3) the PSION_VGS model that generates predictions of the four subtypes of the voltage-gated channels. The three models share a common architecture where the input protein sequence is first processed to obtain its evolutionary profile and predicted secondary structure (SS), relative solvent accessibility (RSA), and intrinsic disorder (ID). Next, this information is combined with the sequence itself to generate a set of numeric features which are input into a predictive model. We applied SVM to generate the model given its widespread use in the prediction of ion channels [27, 30–33] and results from ref. [31] that empirically demonstrate that this machine learning model is superior when compared to four other classifiers including Naïve Bayes, RBF network, logistic regression and random forest. We used the LIBSVM implementation of SVM [37]. The model outputs a prediction based on the numeric scores generated by SVM (ion channel vs. non-ion channel, voltage-gated vs. ligand-gated channel, one subtype of voltage-gated channels). LIBSVM [37] uses the “one-against-one” approach for the multi-class classification of the subtypes. LIBSVM constructs k*(k-1)/2 binary classifiers to develop a predictor for k classes. In our case, for k = 4 it constructs 6 binary classifiers. For binary classification, LIBSVM estimates the probabilities for each class using a parametric sigmoid function as described in ref. [38]. The output class is the class with the higher probability. For the multi-class classification, LIBSVM collects all pairwise class probabilities that are estimated as in ref. [38], generates one probability for each class based on an optimization described in refs. [39],[40], and outputs the class with the highest probability. We designed the SVM model by considering a large pool of features, performing empirical selection of a subset of relevant and well-performing features, and empirically parameterizing the predictive model.
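To make the one-against-one decomposition concrete, the sketch below enumerates the k*(k-1)/2 class pairs for the four subtypes. The `vote` helper is a deliberate simplification: it picks the class that wins the most pairwise contests, whereas the text describes LIBSVM coupling pairwise probabilities into one probability per class. Both function names are hypothetical.

```python
from itertools import combinations

SUBTYPES = ["potassium", "calcium", "sodium", "anion"]

def pairwise_problems(classes):
    """List the k*(k-1)/2 class pairs; one binary classifier is trained per pair."""
    return list(combinations(classes, 2))

def vote(pair_winners, classes):
    """Return the class that wins the most pairwise contests (simple voting,
    used here in place of the probability coupling described in the text)."""
    counts = {c: 0 for c in classes}
    for winner in pair_winners:
        counts[winner] += 1
    return max(classes, key=lambda c: counts[c])
```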

Considered input features

We considered seven groups of features which are based on (1) amino acid composition of the input sequence; (2) dipeptide composition of the input sequence; (3) physicochemical properties of the amino acids in the input sequence; (4) predicted SS; (5) predicted RSA; (6) predicted ID; and (7) PSSM profile. The amino acid composition is defined as the number of residues of a given amino acid type divided by the sequence length. This type of feature was used by the prior methods [31, 32]. The dipeptide composition is the composition of all 400 pairs of amino acid types and it was also used in the related works [27, 30–32]. The physicochemical properties are a feature type that is new to this area. We considered hydrophilicity [41], hydrophobicity [42], polarity [43], flexibility [44], propensity for beta-turns [45] and transfer free energy [46], which are quantified based on the corresponding amino acid indices from the AAindex database [47]. The selection is motivated by the fact that the same properties have been used in similar works [48, 49]. We computed the average and standard deviation for each of the six properties over all residues in the input sequence. We also utilized new features that are based on several structural properties that were predicted from the input chain. SS and ID are predicted by the standalone version v3.3 of PSIPRED [50] and v2.43 of DISOPRED [51], respectively. RSA is predicted with SPINEX [52] and is defined as the ratio of the solvent accessible surface area of a residue observed in its three-dimensional structure to that observed in an extended Ala-X-Ala tripeptide conformation [53, 54]. The PSSM profiles have been widely used in various related predictive efforts [55-61]. We used the blastpgp implementation of PSI-BLAST with the default three iterations (-j 3) utilizing the nr protein database to calculate the PSSM profiles for the input protein sequence.
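The two composition-based feature groups can be illustrated with a short sketch. The amino acid composition follows the definition given above; for the dipeptide composition the denominator is not stated in the text, so the sketch assumes the number of adjacent residue pairs (sequence length minus one). Function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Number of residues of each amino acid type divided by the sequence length."""
    n = len(seq)
    if n == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Frequency of each of the 400 ordered pairs among adjacent residues.
    Dividing by the number of adjacent pairs is an assumption of this sketch."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = max(len(seq) - 1, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs that contain non-standard residues
            counts[pair] += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: c / total for pair, c in counts.items()}
```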
Altogether, we considered the following 878 features:

AA_j, the composition of j = 1, 2, …, 20 amino acid (AA) types (20 features).
Dipeptide_{AA}_{AA}, the composition of AA pairs (20*20 = 400 features).
AAproperty_{i}_{avg, sd}, the average (avg) or standard deviation (sd) of the i = {1 for hydrophilicity, 2 for hydrophobicity, 3 for polarity, 4 for flexibility, 5 for beta-turns, 6 for transfer free energy} amino acid property over all AAs in the input protein chain. These features quantify the average and variability of the propensity for a given property over the entire input protein (6*2 = 12 features).
Num_SS_Seg, the total number of predicted secondary structure segments in the input protein chain (1 feature).
Num_{C,H,E}_Seg, the number of predicted coil, helix, or strand segments in the input protein chain (3 features).
CV_{C,H,E}_{min, max}, the minimal and maximal length of the predicted coil, helix, or strand segments divided by the protein length (3*2 = 6 features).
Composition_{C, H, E}, the composition of coil, helix, or strand residues, i.e., the number of coil, helix, or strand residues divided by the sequence length (3 features).
Total_DisNonDis_Seg, the total number of predicted disordered and structured (non-disordered) segments in the input protein chain (1 feature).
Num_{Dis, NonDis}, the number of disordered and structured (non-disordered) segments in the input protein chain (2 features).
CV_{Dis, NonDis}_{min, max}, the minimal and maximal length of disordered and structured (non-disordered) segments divided by the sequence length (2*2 = 4 features).
Composition_{Dis, NonDis}, the composition of disordered and structured residues, i.e., the number of disordered and structured residues divided by the sequence length (2 features).
{Bd, Ed}_{0.25, 0.75}, the composition of buried and exposed residues, i.e., the number of buried and exposed residues divided by the sequence length. A given residue is considered to be buried if its predicted RSA is below a cut-off of 0.25 or 0.75; otherwise, it is assumed to be exposed. These features quantify to some degree the overall shape of the input protein (2*2 = 4 features).
RSA_Seg{4,6,8,10,12,14,16,18,20,22}, the minimal or maximal value of the average of the predicted RSA values for segments which are at least 4, 6, 8, 10, 12, 14, 16, 18, 20, or 22 residues long. These features identify long segments of either exposed or buried residues, which again is related to the shape of the protein molecule (10*2 = 20 features).
PSSM_{AA1}_{AA2}, PSSM profile scores where AA1 and AA2 stand for one of the 20 amino acid types in the input protein chain and in the columns of the PSSM profile, respectively. These features quantify evolutionary conservation of individual amino acid types in the input protein chain. We compute the PSSM profile scores by summing up rows in the PSSM profiles for the same AA type. Next, each element in the resulting 400-dimensional vector (20 amino acid types * 20 columns in the PSSM profile) is divided by the length of the sequence and normalized by 1/(1+exp(-x)). An example is shown in Fig 1. A similar PSSM profile score was used to classify transporters [62] (20*20 = 400 features).
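The PSSM profile scores described above (sum the PSSM rows that belong to the same residue type, divide by the sequence length, then apply the logistic function) can be sketched as follows. The sketch assumes the PSSM is supplied as one row of 20 scores per residue and that its columns follow the alphabetical order in `AMINO_ACIDS`; the column order in an actual PSI-BLAST output file may differ, and the function name is hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm_profile_scores(seq, pssm):
    """Return the 400 scores keyed by (residue type in the chain, PSSM column type).

    seq  : protein sequence
    pssm : list with len(seq) rows, each holding 20 scores (assumed column
           order: AMINO_ACIDS)
    """
    length = len(seq)
    sums = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    # Sum the PSSM rows of all positions occupied by the same amino acid type.
    for residue, row in zip(seq, pssm):
        if residue in sums:
            for j, score in enumerate(row):
                sums[residue][j] += score
    features = {}
    for aa1 in AMINO_ACIDS:
        for j, aa2 in enumerate(AMINO_ACIDS):
            x = sums[aa1][j] / length             # divide by the sequence length
            features[(aa1, aa2)] = 1.0 / (1.0 + math.exp(-x))  # logistic scaling
    return features
```

Residue types that do not occur in the chain have a summed score of 0 and therefore map to 0.5 under this scaling.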

Fig 1

Example computation of scores from the PSSM profile.

Each feature was normalized into the [–1, 1] interval based on the min-max normalization: (2*x – x_max – x_min) / (x_max – x_min), where x is a value of a given feature X and x_min and x_max are the minimal and maximal values of X, respectively.
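A minimal sketch of this normalization for a single feature column is given below. Mapping a constant feature (maximum equal to minimum) to 0 is an assumption made here to avoid division by zero; the text does not say how such features are handled.

```python
def min_max_scale(values):
    """Scale one feature column to [-1, 1]: (2*x - max - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: assumed to map to 0 (not specified in the text).
        return [0.0 for _ in values]
    return [(2 * x - hi - lo) / (hi - lo) for x in values]
```

In practice the minimum and maximum would be taken from the training data and reused when scaling test proteins.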

Feature selection and optimization of the predictive model

Given that some of the considered features may not be useful for the prediction of the ion channels and some of the features could be correlated with each other (redundant), we performed empirical selection of a subset of predictive and non-redundant features. The selection was based on the biserial correlation coefficients (BCC) computed between values of a given feature and the binary outcomes; this correlation was also used in related studies [63, 64]. We performed selection for each of the three types of outcomes, i.e., prediction of ion channels, ion channel types, and subtypes of voltage-gated channels. First, a given training dataset was randomly divided into five training and test folds to implement the five-fold cross validation protocol. We ranked the features according to their average BCC over the five training folds. Second, we removed features that are characterized by low predictive power by considering five cut-offs = {0.1, 0.15, 0.2, 0.25 and 0.3}, i.e., features with the average BCC below a given cut-off were excluded. In the third step we removed correlated features. We selected the feature with the highest average BCC and added the next ranked feature into the selected set of features if the Pearson’s correlation coefficient (PCC) of this feature with every feature in the selected feature set was below a given cut-off value = {0.7, 0.75, 0.8, 0.85 and 0.9}. The use of the two cut-offs results in 5*5 = 25 feature sets. In the fourth step, we further reduced the number of features in each of the 25 feature sets using wrapper-based feature selection. This type of feature selection scores a given feature set based on the predictive quality of a prediction model that uses this feature set. We quantified predictive quality with MCC based on predictions using the five-fold cross validation protocol on the corresponding training dataset using the SVM classifier and chose the feature set that gives the highest MCC score.
To clarify, in our cross-validation the training dataset was randomly partitioned into five equally sized subsets. One subset was used as a test dataset and the remaining four subsets were used as a training dataset. This was repeated five times, each time choosing a different subset as the test dataset and using the same features and parameters of the prediction model. We combined predictions from the five test subsets together to produce a single MCC value (Table 2) and we also averaged the five MCCs from the five test subsets (Table A in S1 File). In the wrapper selection we attempted to remove each of the features in the set, measured the MCC of the smaller set, and accepted this removal if the MCC value increased. As an alternative approach, in the fourth step we implemented feature selection with the principal components analysis (PCA) using the SVM classifier and five-fold cross validation on the training dataset. We considered nine values of the cut-off on the variance value that is covered by the PCA = {0.1, 0.2, …, 0.9} to generate the corresponding nine feature sets. Next, like in the wrapper selection, we considered removing one PCA-based feature at a time and we removed it only if this increased the MCC. Finally, in the fifth step, for each resulting reduced feature set we optimized the parameters of the SVM model. Following Lin et al. [31] we used the radial basis function (RBF) kernel and performed grid search over the regularization parameter C = 2^-2, 2^-1, …, 2^4 and the width of the RBF kernel gamma = 2^-11, 2^-10, …, 2^0. We selected the set of parameters that provides the highest value of MCC in the five-fold cross validation on the corresponding training dataset. The results are summarized in Table 2 and Table B in S1 File.
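Steps two and three of the feature selection (the relevance cut-off followed by the greedy redundancy filter) can be sketched as below. Several simplifications are assumed: relevance is measured as the absolute point-biserial correlation, computed as Pearson's correlation against 0/1 labels, which may differ from the exact BCC formula used by the authors; it is computed on the whole dataset rather than averaged over five training folds; and absolute values are used for both cut-offs. The function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient; returns 0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def filter_features(columns, labels, bcc_cutoff, pcc_cutoff):
    """columns: dict mapping feature name -> list of values; labels: list of 0/1.

    Keeps features whose relevance reaches bcc_cutoff, then walks them from most
    to least relevant and adds a feature only if its correlation with every
    already selected feature is below pcc_cutoff."""
    relevance = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    ranked = sorted(
        (name for name in columns if relevance[name] >= bcc_cutoff),
        key=lambda name: relevance[name],
        reverse=True,
    )
    selected = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[s])) < pcc_cutoff for s in selected):
            selected.append(name)
    return selected
```

Running this filter for each pair of cut-offs from the two grids would give the 25 candidate feature sets that are then passed to the wrapper-based selection.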

Table 2

Results of the feature selection and optimization of the three predictive models for ion channels, ion channel types, and subtypes of voltage-gated channels.

BCC	PCC	Maximal MCC over selected feature sets (step 4)			Optimal SVM parameters (C, gamma)			Number of features
		Ion channel	Ion channel type	Voltage-gated channel subtype	Ion channel	Ion channel type	Voltage-gated channel subtype	Ion channel	Ion channel type	Voltage-gated channel subtype
0.1	0.9	0.835	0.927	0.697	8, 0.0625	4, 0.0625	16, 0.0625	190	158	46
	0.85	0.832	0.934	0.664	8, 0.0625	4, 0.03125	8, 0.25	205	122	29
	0.8	0.830	0.921	0.656	16, 0.03125	0.5, 0.0625	4, 0.0625	171	102	48
	0.75	0.836	0.934	0.665	4, 0.03125	2, 0.0625	16,0.015625	172	103	71
	0.7	0.796	0.933	0.614	8, 0.0625	4, 0.0625	16, 0.007812	150	107	63
0.15	0.9	0.798	0.928	0.668	2, 0.125	2, 0.125	4, 0.0625	138	109	53
	0.85	0.788	0.934	0.664	4, 0.125	2, 0.0625	8, 0.25	134	102	29
	0.8	0.777	0.927	0.656	4, 0.125	2, 0.0625	4, 0.0625	92	80	48
	0.75	0.802	0.907	0.665	4, 0.125	4, 0.125	16,0.015625	114	110	71
	0.7	0.787	0.922	0.614	2, 0.0625	0.5, 0.03125	16, 0.007812	99	82	63
0.2	0.9	0.773	0.920	0.715	8, 0.03125	1, 0.125	4, 0.0625	70	77	48
	0.85	0.766	0.908	0.562	8, 0.125	4, 0.125	2, 0.25	69	94	37
	0.8	0.769	0.914	0.619	8, 0.125	0.5, 0.25	16, 0.0625	72	76	68
	0.75	0.763	0.934	0.618	8, 0.03125	2, 0.25	4, 0.25	60	56	28
	0.7	0.776	0.920	0.641	16, 0.125	1, 0.0625	2, 0.25	64	65	32
0.25	0.9	0.743	0.921	0.695	4, 0.25	1, 0.25	16, 0.0625	40	63	32
	0.85	0.756	0.893	0.670	8, 0.25	16, 0.015625	16, 0.0625	38	60	33
	0.8	0.760	0.913	0.682	4, 0.5	2, 0.125	16, 0.25	39	69	26
	0.75	0.759	0.893	0.735	8, 0.5	0.5, 0.25	4, 0.125	29	41	25
	0.7	0.741	0.880	0.589	2, 0.5	1, 0.25	8, 0.125	27	42	26
0.3	0.9	0.686	0.908	0.574	2, 0.5	1, 0.25	16, 0.0625	22	53	31
	0.85	0.700	0.907	0.634	1, 1	2, 0.125	16, 0.25	21	37	25
	0.8	0.700	0.914	0.716	1, 1	1, 0.5	8, 0.25	21	38	31
	0.75	0.700	0.907	0.653	1, 1	1, 0.5	16, 0.125	20	33	25
	0.7	0.675	0.893	0.573	0.5, 1.0	2, 0.015625	8, 0.5	16	33	22
Cutoff on variance in PCA		Maximal MCC over selected feature sets (step 4)			Optimal SVM parameters (C, gamma)			Number of features
		Ion channel	Ion channel type	Voltage-gated channel subtype	Ion channel	Ion channel type	Voltage-gated channel subtype	Ion channel	Ion channel type	Voltage-gated channel subtype
0.1		0.445	0.582	0.168	8, 0.00977	16, 0.001953	8, 0.125000	2	1	1
0.2		0.670	0.582	0.240	4,0.007812	16,0.001953	16,0.12500	4	1	1
0.3		0.670	0.817	0.397	4,0.007812	1,0.015625	32,0.007812	4	5	2
0.4		0.680	0.776	0.486	2,0.03125	1,0.015625	8,0.000488	7	6	6
0.5		0.719	0.850	0.503	16,0.003906	2,0.015625	2,0.015625	13	14	6
0.6		0.803	0.870	0.505	4,0.003906	2,0.007812	4,0.007812	32	21	6
0.7		0.767	0.896	0.669	4,0.001953	4,0.003906	4,0.003906	66	38	26
0.8		0.804	0.935	0.661	8,0.001953	16,0.000977	2,0.007812	116	69	22
0.9		0.810	0.922	0.596	8,0.000977	2,0.001953	8,0.007812	153	65	30

The table shows results for different cut-offs for the minimal biserial correlation coefficients (BCC) computed between values of a given feature and the binary outcomes (step 2 of feature selection) and the maximal Pearson’s correlation coefficient (PCC) between features (step 3), the maximal MCC value obtained via wrapper-based feature selection (step 4) and the optimal SVM parameters (step 5) that were computed via five-fold cross validation on the corresponding training dataset, and the final number of selected features. The lower part of the table shows results for an alternative feature selection based on Principal Component Analysis (PCA) with different cut-offs on the value of variance. Predictions from the five test folds in the cross validations were combined together to produce a single MCC value. The selected setup for each of the three predictors is shown in bold font.

For the prediction of ion channels, the correlation-based feature selection results in a predictor that secures MCC = 0.836, which is higher than the MCC = 0.810 that was obtained with the PCA-based approach. For the ion channel type model, both feature selection approaches lead to models with similar predictive quality (MCC = 0.934 and 0.935) while the correlation-based approach uses fewer features (56 vs. 69). For the prediction of the voltage-gated channel subtypes, correlation- and PCA-based feature selections correspond to MCC = 0.735 and 0.669, respectively (Table 2). We note that averaging the MCC over the five cross validation folds leads to consistent results with the same optimal designs, which secure MCC = 0.836±0.051 for prediction of ion channels, MCC = 0.933±0.041 for the ion channel type, and MCC = 0.740±0.100 for the voltage-gated channel subtypes (Table B in S1 File). Consequently, the PSIONplus predictor is built utilizing the feature sets generated with the correlation-based feature selection, which are shown in bold font in Table 2 and Table B in S1 File. The predictor of ion channels, PSION_ION, uses 172 features and SVM with C = 4 and gamma = 0.03125; the predictor for ion channel types, PSION_VLG, utilizes 56 features and SVM with C = 2 and gamma = 0.25; and for voltage-gated ion channel subtypes, PSION_VGS, we apply 25 features and SVM with C = 4 and gamma = 0.125.

PSIONplus: combination of SVM model and BLAST

PSIONplus is implemented by combining the prediction of the selected SVM model and sequence alignment computed with BLAST against a dataset of annotated proteins. To compute the prediction from BLAST, we query a given test protein sequence against the sequences from the training dataset and transfer the annotation from the most similar hit, provided that it is sufficiently similar. We only use training sequences for which the corresponding e-value is better than a threshold that we establish based on cross validation on the training datasets. We performed grid search over the following set of e-values: 10^-6, 10^-5, …, 10^0, 10^1. We selected the values that provide the highest MCC in the five-fold cross validation on a given training set. Consequently, PSIONplus uses e-value = 0.001 for the prediction of ion channels (based on the TRAIN_ION dataset), e-value = 10 for the ion channel types (based on the TRAIN_VLG dataset), and e-value = 0.001 for the voltage-gated channel subtypes (based on the TRAIN_VGS dataset). Besides the binary prediction, the numeric score generated by BLAST equals the normalized e-value of the first hit: score = threshold/(threshold + e-value); this way the score is higher when similarity is higher, which is when the e-value is smaller. If there is no hit from BLAST (all e-values > threshold) then PSIONplus uses the score from the SVM model. Otherwise, PSIONplus uses the score from BLAST. Fig 2 shows the workflow of PSIONplus.
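The decision rule that combines the two sources can be sketched as follows. The function names and the use of `None` to represent a query with no BLAST hit are assumptions of this illustration; running BLAST itself and transferring the annotation of the top hit are not shown.

```python
def blast_score(e_value, threshold):
    """Normalized e-value of the top hit: higher when the e-value is smaller."""
    return threshold / (threshold + e_value)

def psionplus_score(best_hit_e_value, svm_score, threshold):
    """Return (score, source) following the rule described in the text.

    best_hit_e_value : e-value of the most similar training sequence, or None
                       when BLAST reports no hit (an assumed convention)
    svm_score        : numeric score produced by the SVM model
    threshold        : e-value threshold selected by cross validation
    """
    if best_hit_e_value is None or best_hit_e_value > threshold:
        return svm_score, "SVM"
    return blast_score(best_hit_e_value, threshold), "BLAST"
```

With the threshold of 0.001 reported for ion channel prediction, a top hit with an e-value of 0.001 would give a BLAST score of 0.5, and a hit with an e-value close to zero would give a score close to 1.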

Fig 2

Workflow of the PSIONplus model.

SS: secondary structure, RSA: relative solvent accessibility.


Results

Analysis of predictive model

Table 3 summarizes the selected features for each of the three SVM models: SVM_ION for the prediction of ion channels, SVM_VLG for the prediction of ion channel types, and SVM_VGS for the prediction of voltage-gated channel subtypes. It reveals that the majority of these features are based on amino acid pairs and PSSM profile scores. However, all types of features were selected in at least one predictive model. This demonstrates that the new types of features that we introduce, including PSSM profiles, predicted SS, ID and RSA, and physicochemical properties of AAs, contribute to the predictive performance.

Table 3

Summary of considered and selected features used by the PSION predictor.

Feature group	Number of features	Number of selected features
		SVM_ION	SVM_VLG	SVM_VGS
PSSM profile scores	400	75	29	18
Dipeptide composition	400	82	24	4
Predicted relative solvent accessibility	24	4	0	0
Amino acid composition	20	5	1	0
Predicted secondary structures	13	2	1	1
Properties of amino acid	12	3	1	0
Predicted intrinsic disorder	9	1	0	2
Total	878	172	56	25

To quantify the relative impact of each type of features, we divided the selected features into five groups based on dipeptide composition, predicted intrinsic disorder, predicted relative solvent accessibility, predicted secondary structure, and PSSM-based profiles. Next, using the features from a given group, we optimized an SVM model based on five-fold cross validation on the corresponding training dataset, following the same procedure as described in the Materials and Methods section. The accuracies obtained by each feature group on each of the three training datasets are shown in Table 4; a result is listed as "not used" when none of the features from a given group was included in the corresponding model. We computed a single value of accuracy based on the results that are combined over all test folds (entire test datasets). The best performing feature group is based on the PSSM profiles, which we introduced into the prediction of the ion channels and their types. However, each of the remaining feature groups also obtains relatively strong accuracy. For the prediction of the ion channels the lowest accuracy is 60.3%, while a baseline classifier that always predicts the most frequent outcome has an accuracy of 100%*(300/598) = 50.2% (Table 1). Similarly, for the prediction of ion channel type and voltage-gated channel subtype the lowest accuracies are 68.1% and 62.2%, compared to the baseline accuracies of 100%*(150/298) = 50.3% and 100%*(81/148) = 54.7%, respectively. Most importantly, the PSION model that combines all these features obtains higher predictive performance than the best performing feature group. By using all features together the error rates are reduced by 100%*(91.6–89.6)/(100–89.6) = 19.2% for the prediction of ion channels, by 100%*(96.3–95.6)/(100–95.6) = 15.9% for the prediction of ion channel types, and by 100%*(88.5–81.8)/(100–81.8) = 36.8% for the prediction of voltage-gated channel subtypes (Table 4). This suggests that aggregating the previously used and new feature types leads to improved predictive performance.
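The baseline accuracies and relative error-rate reductions quoted above follow two simple formulas. The short Python sketch below reproduces that arithmetic from the values in Table 4; the function names are illustrative and are not part of the PSIONplus software.

```python
def baseline_accuracy(majority_count: int, total: int) -> float:
    """Accuracy (%) of a classifier that always predicts the most frequent class."""
    return 100.0 * majority_count / total


def relative_error_reduction(acc_new: float, acc_ref: float) -> float:
    """Relative reduction (%) of the error rate when accuracy rises from acc_ref to acc_new."""
    return 100.0 * (acc_new - acc_ref) / (100.0 - acc_ref)


# Accuracies from Table 4: model with all features vs. the PSSM-only model.
table4 = {
    "TRAIN_ION": (91.6, 89.6),
    "TRAIN_VLG": (96.3, 95.6),
    "TRAIN_VGS": (88.5, 81.8),
}
for dataset, (acc_all, acc_pssm) in table4.items():
    reduction = relative_error_reduction(acc_all, acc_pssm)
    print(f"{dataset}: error rate reduced by {reduction:.1f}%")
```

Expressing the gain relative to the remaining error, rather than as a raw difference in accuracy, makes improvements comparable across datasets whose reference accuracies differ.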

Table 4

Accuracy obtained based on the cross validation on the training datasets TRAIN_ION and TRAIN_VLG, and Q₄ based on the cross validation on the TRAIN_VGS dataset, for different groups of input features.

Models	TRAIN_ION (accuracy)	TRAIN_VLG (accuracy)	TRAIN_VGS (Q₄)
Model based on the PSSM profile	89.6	95.6	81.8
Model based on the dipeptide composition	84.5	87.6	65.5
Model based on the predicted relative solvent accessibility	79.8	not used	not used
Model based on the predicted secondary structure	69.9	68.1	62.2
Model based on the predicted intrinsic disorder	60.3	not used	62.2
Model based on all features	91.6	96.3	88.5

We computed a single value of accuracy based on the results that are combined over all test folds (entire test datasets).
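Computing a single accuracy over the combined test folds differs from averaging per-fold accuracies when the folds have unequal sizes. The following Python sketch illustrates the distinction on made-up labels (not data from this study); the helper names are hypothetical.

```python
def pooled_accuracy(folds):
    """Accuracy (%) over the predictions of all test folds combined.

    folds: list of (y_true, y_pred) pairs, one pair of lists per test fold.
    """
    correct = sum(t == p for y_true, y_pred in folds for t, p in zip(y_true, y_pred))
    total = sum(len(y_true) for y_true, _ in folds)
    return 100.0 * correct / total


def mean_fold_accuracy(folds):
    """Average of the accuracies (%) computed separately for each fold."""
    per_fold = [
        100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
        for y_true, y_pred in folds
    ]
    return sum(per_fold) / len(per_fold)


# Two toy folds of unequal size.
toy_folds = [
    ([1, 1, 0, 0], [1, 1, 0, 1]),  # 3 of 4 correct
    ([1, 0], [0, 0]),              # 1 of 2 correct
]
print(f"pooled: {pooled_accuracy(toy_folds):.1f}%")
print(f"mean of folds: {mean_fold_accuracy(toy_folds):.1f}%")
```

Pooling weights every protein equally, so the result corresponds to one prediction per protein in the entire dataset.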

Comparative analysis of results on the training datasets

Table 5 compares the results generated by PSIONplus and its two modules, the SVM model and BLAST, based on the jackknife tests on the training datasets with the equivalent results on the same datasets from the only other method that also predicts ion channels, their types, and subtypes of voltage-gated channels [31]. We compared the accuracies and the numbers of features since the MCC and F-measure values were not provided in the other article; these measures are compared on the test datasets.

Table 5

Summary of results based on the jackknife and 5-fold cross validation (5-cv) tests on the training datasets TRAIN_ION, TRAIN_VLG and TRAIN_VGS.

Evaluation measure	Method	TRAIN_ION	TRAIN_VLG	TRAIN_VGS
		Ion-channel vs. non-ion channel	Voltage-gated vs. ligand-gated	Potassium	Anion	Calcium	Sodium	Q₄	Average of the four subtypes
Accuracy	Lin et al.	86.6	92.6	92.6	84.6	82.8	75.0	87.8	83.8
(Jackknife)	SVM model	91.5	96.3	93.9	97.3	91.9	96.6	89.9	94.9
	BLAST	98.0	99.7	98.6	99.3	98.0	98.6	97.3	98.6
	PSIONplus	97.7	100	99.3	100	98.0	98.6	98.0	99.0
MCC	SVM model	0.830	0.927	0.880	0.905	0.732	0.782	NA	0.825
(Jackknife)	BLAST	0.960	0.993	0.973	0.977	0.935	0.909	NA	0.948
	PSIONplus	0.953	1	0.986	1	0.935	0.909	NA	0.958
MCC	SVM model	0.833	0.934	0.736	0.855	0.441	0.695	NA	0.682
(5-cv)	BLAST	0.944	0.980	0.774	0.831	0.597	0.773	NA	0.744
	PSIONplus	0.940	0.993	0.846	0.929	0.650	0.773	NA	0.799
Sensitivity	SVM model	93.0	98.0	98.8	84.6	72.4	83.3	NA	84.8
(Jackknife)	BLAST	97.0	99.3	100	96.2	93.1	91.7	NA	95.2
	PSIONplus	98.7	100	100	100	93.1	91.7	NA	96.2
Sensitivity	SVM model	90.3	98.6	96.3	80.8	41.4	75.0	NA	73.4
(5-cv)	BLAST	95.0	98.6	100	73.1	44.8	91.7	NA	77.4
	PSIONplus	97.7	100	100	88.5	51.7	91.7	NA	83.0
# of features	Lin et al.	140	159	104	104	104	104	NA	NA
	PSION	172	56	25	25	25	25	NA	NA

Results of PSIONplus and its two modules based on SVM and BLAST are compared with the method by Lin et al. MCC and F-measure were not reported in the article by Lin et al. and thus only accuracy is compared. The best accuracy values for each dataset are shown in bold. For the cross-validation tests we computed a single value of accuracy, MCC and sensitivity based on the results that are combined over all test folds (entire test datasets). NA means “not applicable”.

The accuracy of the SVM model used in the PSIONplus predictor is higher than the accuracy of the method by Lin et al. across all three types of predictions. The corresponding error rates of our SVM are reduced by 100%*(91.5–86.6)/(100–86.6) = 36.6%, 100%*(96.3–92.6)/(100–92.6) = 50%, and 100%*(89.9–87.8)/(100–87.8) = 17.2% for the prediction of ion channels, ion channel types, and voltage-gated channel subtypes, respectively. Since our predictor uses a similar or smaller number of features and a similar type of predictive model compared to the other method, the improved predictive performance stems from the use of novel feature types. Moreover, PSIONplus, which combines this SVM model with sequence alignment, obtains even better predictive quality. The corresponding error rates of PSIONplus are reduced by 100%*(97.7–86.6)/(100–86.6) = 82.8%, 100%*(100–92.6)/(100–92.6) = 100%, and 100%*(98–87.8)/(100–87.8) = 83.6% when compared with the method by Lin et al.

We also compared the predictive performance of PSIONplus and BLAST on the training datasets. In Table 5, PSIONplus achieves accuracies of 97.7 and 100 and Q₄ of 98.0 on the TRAIN_ION, TRAIN_VLG and TRAIN_VGS datasets, respectively, based on the jackknife test. BLAST achieves comparable values of 98.0, 99.7, and 97.3, respectively. A similar conclusion holds when predictive quality is measured with MCC, for both the cross-validation and the jackknife tests. The strong performance of BLAST is due to the relatively high sequence similarity in these training datasets. Moreover, we compared the sensitivity (defined as the fraction of actual positives that are correctly predicted) of PSIONplus and BLAST. PSIONplus achieves higher sensitivity values of 98.7 and 100 (97.7 and 100) on TRAIN_ION and TRAIN_VLG, and a higher average sensitivity of 96.2 (83.0) on TRAIN_VGS, compared to 97.0, 99.3 and 95.2 (95.0, 98.6, and 77.4) for BLAST when using the jackknife (cross-validation) test. These differences indicate that PSIONplus, which combines BLAST with the SVM, can identify more positives than BLAST alone. This means that some of the correct predictions generated by PSIONplus come from the SVM model.
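For reference, sensitivity, accuracy and MCC can all be derived from the four counts of a binary confusion matrix. The Python sketch below uses the standard textbook definitions, which are assumed here to match those in the Materials and Methods section; the counts are invented for illustration.

```python
import math


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = 100.0 * tp / (tp + fn)              # share of actual positives found
    accuracy = 100.0 * (tp + tn) / (tp + fn + fp + tn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"sensitivity": sensitivity, "accuracy": accuracy, "mcc": mcc}


# Invented counts: 50 actual positives and 50 actual negatives.
metrics = binary_metrics(tp=45, fn=5, fp=10, tn=40)
print({name: round(value, 3) for name, value in metrics.items()})
```

Unlike accuracy, MCC accounts for all four counts in a balanced way, which is why it is reported alongside accuracy for datasets with unequal class sizes.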

Comparative analysis of results on the test datasets

We compare the predictive quality of PSIONplus on the test datasets with the method by Lin et al. [31] and with BLAST-based alignment for the prediction of ion channels, their types, and subtypes of the voltage-gated channels, and with VGIchan [30] for the prediction of ion channels (Table 6). The empirical results suggest that PSIONplus offers substantially higher predictive quality than VGIchan, with MCC equal to 0.71 vs. 0.49. Comparison with the predictor by Lin et al. leads to conclusions similar to those drawn from the results on the training datasets. PSIONplus obtains higher values of MCC by 0.08 and 0.29, accuracy by 4.6 and 5.0, and F-measure by 3.7 and 1.0 for the prediction of ion channels and ion channel types, respectively. The results concerning the prediction of the voltage-gated channel subtypes are similar, with the differences in average F-measure, average MCC and average accuracy equal to 7.4, 0.11 and 4.0, respectively. The Q₄ of PSIONplus is 92.9 on TEST60_VGS, which is higher than the Q₄ of 84.9 obtained by Lin et al.
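The subtype results are reported both as Q₄ and as an average of per-subtype accuracies, and the two can differ noticeably. The Python sketch below assumes that Q₄ is the fraction of proteins assigned to the correct one of the four subtypes and that per-subtype accuracy is computed one-vs-rest; both definitions are assumptions for illustration, and the labels are made up.

```python
SUBTYPES = ["potassium", "anion", "calcium", "sodium"]


def q4(y_true, y_pred):
    """Percentage of proteins assigned to the correct subtype (four-class accuracy)."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def one_vs_rest_accuracy(y_true, y_pred, subtype):
    """Accuracy (%) of the binary question 'is this protein of the given subtype?'."""
    hits = sum((t == subtype) == (p == subtype) for t, p in zip(y_true, y_pred))
    return 100.0 * hits / len(y_true)


# Made-up labels for eight proteins; two of them are misclassified.
actual = ["potassium"] * 4 + ["anion"] * 2 + ["calcium", "sodium"]
predicted = ["potassium"] * 3 + ["calcium"] + ["anion"] * 2 + ["calcium", "potassium"]

per_subtype = {s: one_vs_rest_accuracy(actual, predicted, s) for s in SUBTYPES}
average_accuracy = sum(per_subtype.values()) / len(SUBTYPES)
print(f"Q4 = {q4(actual, predicted):.1f}, average accuracy = {average_accuracy:.1f}")
```

Under these definitions each misclassified protein lowers Q₄ once but counts as an error for only two of the four one-vs-rest problems, so the average per-subtype accuracy is typically the higher of the two values.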

Table 6

Summary of results on the test datasets TEST30_ION, TEST30_VLG, and TEST60_VGS.

Dataset	Prediction outcome	Method	F_measure	MCC	Accuracy	Q₄
TEST30_ION	Ion-channel vs. non-ion channel	VGIchan	63.0	0.49	72.7	NA
		Lin et al. 2011	81.7	0.63	80.8	NA
		BLAST	64.3	0.56	74.7	NA
		PSIONplus	85.4	0.71	85.4	NA
		Confidence interval of PSIONplus	86.0(±3.7)	0.73(±0.07)	86.3(±3.3)	NA
TEST30_VLG	Voltage-gated vs. ligand-gated	Lin et al. 2011	76.6	-0.06	63.3	NA
		BLAST	77.6	0.23	68.3	NA
		PSIONplus	77.6	0.23	68.3	NA
		Confidence interval of PSIONplus	78.1(±6.1)	0.22(±0.15)	68.7(±7.4)	NA
TEST60_VGS	Potassium	Lin et al. 2011	87.6	0.74	86.6	NA
		BLAST	91.6	0.83	90.8	NA
		PSIONplus	94.8	0.90	94.6	NA
	Anion	Lin et al. 2011	86.7	0.85	95.4	NA
		BLAST	86.7	0.85	95.4	NA
		PSIONplus	88.1	0.87	95.8	NA
	Calcium	Lin et al. 2011	73.7	0.67	89.5	NA
		BLAST	91.1	0.90	96.7	NA
		PSIONplus	92.0	0.90	96.7	NA
	Sodium	Lin et al. 2011	90.5	0.90	98.3	NA
		BLAST	93.0	0.93	98.7	NA
		PSIONplus	93.0	0.93	98.7	NA
	Average over	Lin et al. 2011	84.6	0.79	92.4	84.9
	all subtypes	BLAST	90.6	0.88	95.4	90.8
		PSIONplus	92.0	0.90	96.4	92.9
		Confidence interval of PSIONplus	91.9(±2.1)	0.90(±0.03)	96.4(±0.9)	92.9(±1.7)

Results of PSIONplus are compared with VGIchan on the TEST30_ION dataset, and with the method by Lin et al. and BLAST on all datasets. Best MCC, F-measure and accuracy values for each dataset are shown in bold. Confidence intervals are obtained by computing averages and standard deviations (shown in brackets) over 10 repetitions of the test, where in each repetition we randomly select 50% of the test dataset. NA means “not applicable”; for the two-class classification the Q₄ equals accuracy.

We compared PSIONplus with BLAST on the three test datasets in Table 6. PSIONplus achieves better accuracy (85.4) than BLAST (74.7) on the TEST30_ION dataset and the same accuracy on the TEST30_VLG dataset. For the prediction of the four voltage-gated subtypes, PSIONplus obtains average accuracy = 96.4 and MCC = 0.90, which are higher than the average accuracy = 95.4 and MCC = 0.88 of BLAST. The Q₄ of PSIONplus is 92.9, which is again higher than the Q₄ of BLAST at 90.8. This shows that PSIONplus improves over the sequence alignment and justifies the use of the SVM model in PSIONplus.

We also computed confidence intervals for PSIONplus. We randomly selected 50% of the test proteins and calculated the corresponding F-measures, MCCs and accuracies. This was repeated 10 times and we computed the averages and standard deviations over these 10 repetitions. Table 6 shows that the standard deviations are relatively low on the TEST30_ION and TEST60_VGS datasets. The standard deviations are larger on the TEST30_VLG dataset; however, the results obtained by the method by Lin et al. are also proportionally lower.

Finally, we estimated the false positive rate of PSIONplus, defined as the number of false positives divided by the number of actual negatives. Our method achieves false positive rates of 19.2%, 52.9%, and 3.2% on the TEST30_ION, TEST30_VLG and TEST60_VGS datasets, respectively (see Table B in S1 File), compared to 1%, 52.9% and 4.6% for BLAST. Although the false positive rate of PSIONplus is higher than that of BLAST on TEST30_ION, the sensitivity (true positive rate) of PSIONplus is 90.4%, which is much higher than BLAST’s sensitivity of 47.9% (see Table B in S1 File). This means that the increase in sensitivity by 42.5 percentage points is traded for a false positive rate that is higher by 18.2 percentage points. However, for the TEST60_VGS dataset PSIONplus secures both a lower average false positive rate and a higher average sensitivity when compared to BLAST.
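The subsampling procedure behind the reported confidence intervals (10 repetitions, each on a random 50% of the test proteins) can be sketched in a few lines of Python. The predictions below are made up, the fixed seed and the use of the sample standard deviation are assumptions, and the function names are hypothetical.

```python
import random
import statistics


def accuracy(pairs):
    """Percentage of (actual, predicted) pairs that agree."""
    return 100.0 * sum(a == p for a, p in pairs) / len(pairs)


def false_positive_rate(pairs):
    """Percentage of actual negatives that are predicted as positives."""
    predictions_for_negatives = [p for a, p in pairs if not a]
    if not predictions_for_negatives:
        return 0.0
    return 100.0 * sum(predictions_for_negatives) / len(predictions_for_negatives)


def subsample_interval(pairs, metric, repetitions=10, fraction=0.5, seed=0):
    """Mean and standard deviation of a metric over random subsamples of the data."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    size = max(2, int(len(pairs) * fraction))
    values = [metric(rng.sample(pairs, size)) for _ in range(repetitions)]
    return statistics.mean(values), statistics.stdev(values)


# Made-up predictions: 40 positives (36 detected) and 60 negatives (6 false alarms).
toy = ([(True, True)] * 36 + [(True, False)] * 4
       + [(False, False)] * 54 + [(False, True)] * 6)
mean_acc, sd_acc = subsample_interval(toy, accuracy)
print(f"accuracy: {mean_acc:.1f} (±{sd_acc:.1f})")
print(f"false positive rate on all data: {false_positive_rate(toy):.1f}%")
```

Because each repetition uses only half of the data, smaller test sets yield wider spreads, which is consistent with the larger standard deviations reported for TEST30_VLG.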

Discussion

We propose the PSIONplus method for accurate prediction of ion channel proteins and their types, and of the subtypes of the voltage-gated ion channels. Empirical results show that combining the results generated by the SVM model with the BLAST alignment, as implemented in PSIONplus, leads to improved predictive performance for the prediction of ion channels and voltage-gated channel subtypes when compared to using BLAST alone. Results on benchmark datasets that are independent of the datasets used to design our predictor reveal that PSIONplus obtains relatively good predictive performance. Its accuracy is 85.4% for the prediction of ion channels and 68.3% for the prediction of ion channel types, and its average accuracy is 96.4% for the prediction of the four subtypes of the voltage-gated channels. PSIONplus outperforms existing methods for the prediction of ion channels, including VGIchan and the method by Lin et al.

PSIONplus is the first method in this area that uses new types of predictive inputs, including PSSM profiles and predicted secondary structure, solvent accessibility and intrinsic disorder. We note that computing the PSSM profiles and structural predictions is relatively expensive and may take up to several minutes for a single sequence on a desktop computer. However, our empirical tests demonstrate that the PSSM profiles provide the strongest predictive input and that all new types of features contribute to the prediction, i.e., prediction using the combined set of all inputs is better than prediction using individual sets of features, and every feature set individually provides good predictive quality. Given the strong predictive value of the PSSM profiles, one interesting extension of our method would be to develop features based on generic (instead of position-specific, like PSSM) sequence similarity, utilizing for instance the BLOSUM matrices. Another potentially impactful extension would be to apply alternative methods to generate alignment profiles, such as HHblits [65], which was shown to be competitive with the currently used PSI-BLAST. A standalone version of PSIONplus can be freely downloaded from https://sourceforge.net/projects/psion/.

S1 File. This file includes Tables A and B. (PDF)
