Jianzhao Gao, Wei Cui, Yajun Sheng, Jishou Ruan, Lukasz Kurgan.
Abstract
Ion channels are a class of membrane proteins that attract a significant amount of basic research and are potential drug targets. High-throughput identification of these channels is hampered by the low availability of their structures and by the observation that sequence similarity offers limited predictive quality. Consequently, several machine learning predictors of ion channels from protein sequences that do not rely on high sequence similarity have been developed. However, only one of these methods offers a wide scope by predicting ion channels, their types, and four major subtypes of the voltage-gated channels. Moreover, this and other existing predictors use relatively simple predictive models that limit their accuracy. We propose PSIONplus, a novel and accurate predictor of ion channels, their types, and the four subtypes of the voltage-gated channels. Our method combines a support vector machine model and a sequence similarity search with BLAST. The originality of PSIONplus stems from the use of a more sophisticated machine learning model that, for the first time in this area, utilizes evolutionary profiles and predicted secondary structure, solvent accessibility, and intrinsic disorder. We empirically demonstrate that the evolutionary profiles provide the strongest predictive input among the new and previously used input types. We also show that all new types of inputs contribute to the prediction. Results on an independent test dataset reveal that PSIONplus obtains relatively good predictive performance and outperforms existing methods. It secures accuracies of 85.4% and 68.3% for the prediction of ion channels and their types, respectively, and an average accuracy of 96.4% for the discrimination of the four ion channel subtypes. A standalone version of PSIONplus is freely available from https://sourceforge.net/projects/psion/.
Year: 2016 PMID: 27044036 PMCID: PMC4820270 DOI: 10.1371/journal.pone.0152964
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Datasets used to design and test the proposed method.
| Dataset name | Annotations | Number of chains |
|---|---|---|
| TRAINION | Ion channel | 298 |
| | Non-ion channel | 300 |
| TRAINVLG | Voltage-gated channel | 148 |
| | Ligand-gated channel | 150 |
| TRAINVGS | Potassium (K) | 81 |
| | Calcium (Ca) | 29 |
| | Sodium (Na) | 12 |
| | Anion | 26 |
| TEST30ION | Ion channel | 94 |
| | Non-ion channel | 104 |
| TEST30VLG | Voltage-gated channel | 43 |
| | Ligand-gated channel | 17 |
| TEST60VGS | Potassium (K) | 120 |
| | Calcium (Ca) | 49 |
| | Sodium (Na) | 23 |
| | Anion | 47 |
Fig 1. Example computation of scores from the PSSM profile.
Results of the feature selection and optimization of the three predictive models for ion channels, ion channel types, and subtypes of voltage-gated channels.
| BCC | PCC | Maximal MCC (step 4): ion channel | Maximal MCC (step 4): ion channel type | Maximal MCC (step 4): voltage-gated channel subtype | Optimal SVM parameters: ion channel | Optimal SVM parameters: ion channel type | Optimal SVM parameters: voltage-gated channel subtype | Number of features: ion channel | Number of features: ion channel type | Number of features: voltage-gated channel subtype |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.9 | 0.835 | 0.927 | 0.697 | 8, 0.0625 | 4, 0.0625 | 16, 0.0625 | 190 | 158 | 46 |
| | 0.85 | 0.832 | 0.934 | 0.664 | 8, 0.0625 | 4, 0.03125 | 8, 0.25 | 205 | 122 | 29 |
| | 0.8 | 0.830 | 0.921 | 0.656 | 16, 0.03125 | 0.5, 0.0625 | 4, 0.0625 | 171 | 102 | 48 |
| | 0.75 | | 0.934 | 0.665 | | 2, 0.0625 | 16, 0.015625 | | 103 | 71 |
| | 0.7 | 0.796 | 0.933 | 0.614 | 8, 0.0625 | 4, 0.0625 | 16, 0.007812 | 150 | 107 | 63 |
| 0.15 | 0.9 | 0.798 | 0.928 | 0.668 | 2, 0.125 | 2, 0.125 | 4, 0.0625 | 138 | 109 | 53 |
| | 0.85 | 0.788 | 0.934 | 0.664 | 4, 0.125 | 2, 0.0625 | 8, 0.25 | 134 | 102 | 29 |
| | 0.8 | 0.777 | 0.927 | 0.656 | 4, 0.125 | 2, 0.0625 | 4, 0.0625 | 92 | 80 | 48 |
| | 0.75 | 0.802 | 0.907 | 0.665 | 4, 0.125 | 4, 0.125 | 16, 0.015625 | 114 | 110 | 71 |
| | 0.7 | 0.787 | 0.922 | 0.614 | 2, 0.0625 | 0.5, 0.03125 | 16, 0.007812 | 99 | 82 | 63 |
| 0.2 | 0.9 | 0.773 | 0.920 | 0.715 | 8, 0.03125 | 1, 0.125 | 4, 0.0625 | 70 | 77 | 48 |
| | 0.85 | 0.766 | 0.908 | 0.562 | 8, 0.125 | 4, 0.125 | 2, 0.25 | 69 | 94 | 37 |
| | 0.8 | 0.769 | 0.914 | 0.619 | 8, 0.125 | 0.5, 0.25 | 16, 0.0625 | 72 | 76 | 68 |
| | 0.75 | 0.763 | | 0.618 | 8, 0.03125 | | 4, 0.25 | 60 | | 28 |
| | 0.7 | 0.776 | 0.920 | 0.641 | 16, 0.125 | 1, 0.0625 | 2, 0.25 | 64 | 65 | 32 |
| 0.25 | 0.9 | 0.743 | 0.921 | 0.695 | 4, 0.25 | 1, 0.25 | 16, 0.0625 | 40 | 63 | 32 |
| | 0.85 | 0.756 | 0.893 | 0.670 | 8, 0.25 | 16, 0.015625 | 16, 0.0625 | 38 | 60 | 33 |
| | 0.8 | 0.760 | 0.913 | 0.682 | 4, 0.5 | 2, 0.125 | 16, 0.25 | 39 | 69 | 26 |
| | 0.75 | 0.759 | 0.893 | | 8, 0.5 | 0.5, 0.25 | | 29 | 41 | |
| | 0.7 | 0.741 | 0.880 | 0.589 | 2, 0.5 | 1, 0.25 | 8, 0.125 | 27 | 42 | 26 |
| 0.3 | 0.9 | 0.686 | 0.908 | 0.574 | 2, 0.5 | 1, 0.25 | 16, 0.0625 | 22 | 53 | 31 |
| | 0.85 | 0.700 | 0.907 | 0.634 | 1, 1 | 2, 0.125 | 16, 0.25 | 21 | 37 | 25 |
| | 0.8 | 0.700 | 0.914 | 0.716 | 1, 1 | 1, 0.5 | 8, 0.25 | 21 | 38 | 31 |
| | 0.75 | 0.700 | 0.907 | 0.653 | 1, 1 | 1, 0.5 | 16, 0.125 | 20 | 33 | 25 |
| | 0.7 | 0.675 | 0.893 | 0.573 | 0.5, 1.0 | 2, 0.015625 | 8, 0.5 | 16 | 33 | 22 |

| Cutoff on variance in PCA | Maximal MCC (step 4): ion channel | Maximal MCC (step 4): ion channel type | Maximal MCC (step 4): voltage-gated channel subtype | Optimal SVM parameters: ion channel | Optimal SVM parameters: ion channel type | Optimal SVM parameters: voltage-gated channel subtype | Number of features: ion channel | Number of features: ion channel type | Number of features: voltage-gated channel subtype |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 0.445 | 0.582 | 0.168 | 8, 0.00977 | 16, 0.001953 | 8, 0.125000 | 2 | 1 | 1 |
| 0.2 | 0.670 | 0.582 | 0.240 | 4, 0.007812 | 16, 0.001953 | 16, 0.12500 | 4 | 1 | 1 |
| 0.3 | 0.670 | 0.817 | 0.397 | 4, 0.007812 | 1, 0.015625 | 32, 0.007812 | 4 | 5 | 2 |
| 0.4 | 0.680 | 0.776 | 0.486 | 2, 0.03125 | 1, 0.015625 | 8, 0.000488 | 7 | 6 | 6 |
| 0.5 | 0.719 | 0.850 | 0.503 | 16, 0.003906 | 2, 0.015625 | 2, 0.015625 | 13 | 14 | 6 |
| 0.6 | 0.803 | 0.870 | 0.505 | 4, 0.003906 | 2, 0.007812 | 4, 0.007812 | 32 | 21 | 6 |
| 0.7 | 0.767 | 0.896 | 0.669 | 4, 0.001953 | 4, 0.003906 | 4, 0.003906 | 66 | 38 | 26 |
| 0.8 | 0.804 | 0.935 | 0.661 | 8, 0.001953 | 16, 0.000977 | 2, 0.007812 | 116 | 69 | 22 |
| 0.9 | 0.810 | 0.922 | 0.596 | 8, 0.000977 | 2, 0.001953 | 8, 0.007812 | 153 | 65 | 30 |
The table shows results for different cut-offs on the minimal biserial correlation coefficient (BCC) computed between the values of a given feature and the binary outcomes (step 2 of feature selection) and on the maximal Pearson's correlation coefficient (PCC) between features (step 3). It also shows the maximal MCC value obtained via wrapper-based feature selection (step 4), the optimal SVM parameters (step 5) computed via five-fold cross validation on the corresponding training dataset, and the final number of selected features. The lower part of the table shows results for an alternative feature selection based on Principal Component Analysis (PCA) with different cut-offs on the value of variance. Predictions from the five test folds in the cross validation were combined to produce a single MCC value. The selected setup for each of the three predictors is shown in bold font.
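The two correlation-based filtering steps named in this note (steps 2 and 3) can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the point-biserial form of the correlation (Pearson's r against a 0/1 outcome), compares absolute values against the cut-offs, and visits features greedily from the most to the least relevant. The function name `filter_features` and its parameters are hypothetical, and the default cut-offs are just one point from the grid in the table. The wrapper-based selection (step 4) and the SVM parameter search (step 5) are not shown.

```python
import numpy as np

def filter_features(X, y, min_bcc=0.1, max_pcc=0.75):
    """Return column indices kept by a two-stage correlation filter.

    X: (n_samples, n_features) array; y: binary labels coded 0/1.
    Assumption: absolute correlations are compared against both cut-offs.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Relevance: point-biserial correlation, i.e. Pearson's r with a 0/1 outcome.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        bcc = np.where(denom > 0, (Xc * yc[:, None]).sum(axis=0) / denom, 0.0)

    # Keep features whose |BCC| reaches the minimum, strongest first.
    order = np.argsort(-np.abs(bcc))
    candidates = [int(j) for j in order if abs(bcc[j]) >= min_bcc]

    # Redundancy: drop a candidate if it correlates too strongly
    # with a feature that has already been kept.
    kept = []
    for j in candidates:
        redundant = any(
            abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > max_pcc for k in kept
        )
        if not redundant:
            kept.append(j)
    return kept
```

Constant features get a relevance of zero and are dropped in the first stage, so the pairwise correlation in the second stage is always defined.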
Fig 2. Workflow of the PSIONplus model.
SS: secondary structure, RSA: relative solvent accessibility.
Summary of considered and selected features used by the PSION predictor.
| Feature group | Number of features | Number of selected features: SVMION | Number of selected features: SVMVLG | Number of selected features: SVMVGS |
|---|---|---|---|---|
| PSSM profile scores | 400 | 75 | 29 | 18 |
| Dipeptide composition | 400 | 82 | 24 | 4 |
| Predicted relative solvent accessibility | 24 | 4 | 0 | 0 |
| Amino acid composition | 20 | 5 | 1 | 0 |
| Predicted secondary structures | 13 | 2 | 1 | 1 |
| Properties of amino acid | 12 | 3 | 1 | 0 |
| Predicted intrinsic disorder | 9 | 1 | 0 | 2 |
| Total | 878 | 172 | 56 | 25 |
Accuracy obtained based on the cross validation on the training datasets TRAINION and TRAINVLG and Q4 based on the cross validation on the TRAINVGS dataset by different groups of input features.
| Models | TRAINION (accuracy) | TRAINVLG (accuracy) | TRAINVGS (Q4) |
|---|---|---|---|
| Model based on the PSSM profile | 89.6 | 95.6 | 81.8 |
| Model based on the dipeptide composition | 84.5 | 87.6 | 65.5 |
| Model based on the predicted relative solvent accessibility | 79.8 | not used | not used |
| Model based on the predicted secondary structure | 69.9 | 68.1 | 62.2 |
| Model based on the predicted intrinsic disorder | 60.3 | not used | 62.2 |
| Model based on all features | 91.6 | 96.3 | 88.5 |
We computed a single value of accuracy based on the results combined over all test folds (entire test datasets).
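For the four-subtype problem, the table reports Q4 rather than a two-class accuracy. The sketch below shows one common reading of these measures, assuming that Q4 is the percentage of chains assigned to the correct subtype among the four and that a per-subtype accuracy treats one subtype against the other three. The function names and the example labels are hypothetical.

```python
def q4(y_true, y_pred):
    """Percentage of chains whose predicted subtype matches the true one."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * hits / len(y_true)

def one_vs_rest_accuracy(y_true, y_pred, label):
    """Two-class accuracy for one subtype against all other subtypes."""
    hits = sum((t == label) == (p == label) for t, p in zip(y_true, y_pred))
    return 100.0 * hits / len(y_true)

# Hypothetical labels for six chains, used only to show the call pattern.
true_subtypes = ["K", "K", "Ca", "Na", "Anion", "K"]
predicted_subtypes = ["K", "Ca", "Ca", "Na", "K", "K"]
overall = q4(true_subtypes, predicted_subtypes)
sodium = one_vs_rest_accuracy(true_subtypes, predicted_subtypes, "Na")
```

With only two classes the two functions return the same value, which is why Q4 and accuracy coincide for the two-class predictors.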
Summary of results based on the jackknife and 5-fold cross validation (5-cv) tests on the training datasets TRAINION, TRAINVLG and TRAINVGS.
| Evaluation measure | Method | TRAINION: ion-channel vs. non-ion channel | TRAINVLG: voltage-gated vs. ligand-gated | TRAINVGS: Potassium | TRAINVGS: Anion | TRAINVGS: Calcium | TRAINVGS: Sodium | TRAINVGS: Q4 | TRAINVGS: average of the four subtypes |
|---|---|---|---|---|---|---|---|---|---|
| | Lin | 86.6 | 92.6 | 92.6 | 84.6 | 82.8 | 75.0 | 87.8 | 83.8 |
| | SVM model | 91.5 | 96.3 | 93.9 | 97.3 | 91.9 | 96.6 | 89.9 | 94.9 |
| | BLAST | 98.0 | 99.7 | 98.6 | 99.3 | 98.0 | 98.6 | 97.3 | 98.6 |
| | PSIONplus | 97.7 | 100 | 99.3 | 100 | 98.0 | 98.6 | 98.0 | 99.0 |
| | SVM model | 0.830 | 0.927 | 0.880 | 0.905 | 0.732 | 0.782 | NA | 0.825 |
| | BLAST | 0.960 | 0.993 | 0.973 | 0.977 | 0.935 | 0.909 | NA | 0.948 |
| | PSIONplus | 0.953 | 1 | 0.986 | 1 | 0.935 | 0.909 | NA | 0.958 |
| | SVM model | 0.833 | 0.934 | 0.736 | 0.855 | 0.441 | 0.695 | NA | 0.682 |
| | BLAST | 0.944 | 0.980 | 0.774 | 0.831 | 0.597 | 0.773 | NA | 0.744 |
| | PSIONplus | 0.940 | 0.993 | 0.846 | 0.929 | 0.650 | 0.773 | NA | 0.799 |
| | SVM model | 93.0 | 98.0 | 98.8 | 84.6 | 72.4 | 83.3 | NA | 84.8 |
| | BLAST | 97.0 | 99.3 | 100 | 96.2 | 93.1 | 91.7 | NA | 95.2 |
| | PSIONplus | 98.7 | 100 | 100 | 100 | 93.1 | 91.7 | NA | 96.2 |
| | SVM model | 90.3 | 98.6 | 96.3 | 80.8 | 41.4 | 75.0 | NA | 73.4 |
| | BLAST | 95.0 | 98.6 | 100 | 73.1 | 44.8 | 91.7 | NA | 77.4 |
| | PSIONplus | 97.7 | 100 | 100 | 88.5 | 51.7 | 91.7 | NA | 83.0 |
| | Lin | 140 | 159 | 104 | 104 | 104 | 104 | NA | NA |
| | PSION | 172 | 56 | 25 | 25 | 25 | 25 | NA | NA |
Results of PSIONplus and its two modules, based on SVM and BLAST, are compared with the method by Lin et al. MCC and Fmeasure were not reported in the article by Lin et al., and thus only accuracy is compared. The best accuracy values for each dataset are shown in bold. For the cross-validation tests we computed a single value of accuracy, MCC and sensitivity based on the results combined over all test folds (entire test datasets). NA means "not applicable".
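The pooling described in this note, where predictions from all test folds are combined before a single value of each measure is computed, can be sketched as below. This is an illustrative sketch under stated assumptions: labels are coded 1 for the positive class and 0 otherwise, accuracy and sensitivity are reported in percent, and MCC uses the standard confusion-matrix formula. The function name `pooled_metrics` and the fold data are hypothetical.

```python
import math

def pooled_metrics(fold_results):
    """Compute accuracy, sensitivity and MCC over predictions pooled from all folds.

    fold_results: iterable of (y_true, y_pred) pairs, one pair per test fold.
    """
    tp = tn = fp = fn = 0
    for y_true, y_pred in fold_results:
        for t, p in zip(y_true, y_pred):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 0:
                tn += 1
            elif t == 0 and p == 1:
                fp += 1
            else:
                fn += 1

    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = 100.0 * tp / (tp + fn) if (tp + fn) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "mcc": mcc}

# Two hypothetical folds, used only to show the call pattern.
folds = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 0, 0, 1], [1, 0, 1, 1]),
]
metrics = pooled_metrics(folds)
```

Pooling first and scoring once avoids averaging per-fold MCC values, which can be unstable when a fold contains few chains of one class.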
Summary of results on the test datasets TEST30ION, TEST30VLG, and TEST60VGS.
| Dataset | Prediction outcome | Method | Fmeasure | MCC | Accuracy | Q4 |
|---|---|---|---|---|---|---|
| TEST30ION | Ion-channel vs. non-ion channel | VGIchan | 63.0 | 0.49 | 72.7 | NA |
| | | Lin | 81.7 | 0.63 | 80.8 | NA |
| | | BLAST | 64.3 | 0.56 | 74.7 | NA |
| | | PSIONplus | | | | NA |
| | | Confidence interval of PSIONplus | 86.0(±3.7) | 0.73(±0.07) | 86.3(±3.3) | NA |
| TEST30VLG | Voltage-gated vs. ligand-gated | Lin | 76.6 | -0.06 | 63.3 | NA |
| | | BLAST | | | | NA |
| | | PSIONplus | | | | NA |
| | | Confidence interval of PSIONplus | 78.1(±6.1) | 0.22(±0.15) | 68.7(±7.4) | NA |
| TEST60VGS | Potassium | Lin | 87.6 | 0.74 | 86.6 | NA |
| | | BLAST | 91.6 | 0.83 | 90.8 | NA |
| | | PSIONplus | | | | NA |
| | Anion | Lin | 86.7 | 0.85 | 95.4 | NA |
| | | BLAST | 86.7 | 0.85 | 95.4 | NA |
| | | PSIONplus | | | | NA |
| | Calcium | Lin | 73.7 | 0.67 | 89.5 | NA |
| | | BLAST | 91.1 | | | NA |
| | | PSIONplus | | | | NA |
| | Sodium | Lin | 90.5 | 0.90 | 98.3 | NA |
| | | BLAST | | | | NA |
| | | PSIONplus | | | | NA |
| | Average over all subtypes | Lin | 84.6 | 0.79 | 92.4 | 84.9 |
| | | BLAST | 90.6 | 0.88 | 95.4 | 90.8 |
| | | PSIONplus | | | | |
| | | Confidence interval of PSIONplus | 91.9(±2.1) | 0.90(±0.03) | 96.4(±0.9) | 92.9(±1.7) |
Results of PSIONplus are compared with VGIchan on the TEST30VLG dataset, and with the method by Lin et al. and BLAST on all datasets. The best MCC, Fmeasure and accuracy values for each dataset are shown in bold. Confidence intervals are obtained by computing the average and standard deviation (shown in brackets) over 10 repetitions of the test, where in each repetition we randomly select 50% of the test dataset. NA means "not applicable"; for the two-class classification the Q4 equals accuracy.
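The repeated-subsampling procedure behind the confidence intervals can be sketched as follows. This is a minimal sketch rather than the authors' code: it assumes that each repetition draws half of the test chains without replacement and that the reported spread is the sample standard deviation. The function names, the `seed` parameter, and the default values are hypothetical.

```python
import random
import statistics

def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def subsample_interval(y_true, y_pred, metric, repetitions=10, fraction=0.5, seed=0):
    """Mean and standard deviation of a metric over random subsets of the test set.

    Assumption: each subset is drawn without replacement.
    """
    rng = random.Random(seed)
    n = len(y_true)
    k = max(1, int(round(n * fraction)))
    scores = []
    for _ in range(repetitions):
        idx = rng.sample(range(n), k)
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    return statistics.mean(scores), statistics.stdev(scores)
```

Any metric with the same two-argument signature can be passed in place of `accuracy`, for example an MCC or Fmeasure function.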