Abstract
The isoelectric point is the pH at which a particular molecule is electrically neutral due to the equilibrium of positive and negative charges. In proteins and peptides, this depends on the dissociation constant (pKa) of charged groups of seven amino acids and NH+ and COO- groups at polypeptide termini. Information regarding isoelectric point and pKa is extensively used in two-dimensional gel electrophoresis (2D-PAGE), capillary isoelectric focusing (cIEF), crystallisation, and mass spectrometry. Therefore, there is a strong need for the in silico prediction of isoelectric point and pKa values. In this paper, I present Isoelectric Point Calculator 2.0 (IPC 2.0), a web server for the prediction of isoelectric points and pKa values using a mixture of deep learning and support vector regression models. The prediction accuracy (RMSD) of IPC 2.0 for proteins and peptides outperforms previous algorithms: 0.848 versus 0.868 and 0.222 versus 0.405, respectively. Moreover, the IPC 2.0 prediction of pKa using sequence information alone was better than the prediction from structure-based methods (0.576 versus 0.826) and several-fold faster. The IPC 2.0 webserver is freely available at www.ipc2-isoelectric-point.org.
Year: 2021 PMID: 33905510 PMCID: PMC8262712 DOI: 10.1093/nar/gkab295
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
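The definition of the isoelectric point given in the abstract can be made concrete with the classical charge-balance calculation used by fixed-pKa methods. The sketch below is not the IPC 2.0 model. It assumes one illustrative pKa set (the constants are placeholders chosen for this example), applies the Henderson-Hasselbalch equation to the seven charged residue types and both termini, and locates the pH of zero net charge by bisection. The function names `net_charge` and `isoelectric_point` are hypothetical.

```python
# Illustrative pKa values: an assumption for this sketch, not the IPC 2.0 parameters.
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA = 8.6
C_TERM_PKA = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of a polypeptide at a given pH (Henderson-Hasselbalch)."""
    # Positively charged groups contribute +1 / (1 + 10^(pH - pKa)).
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERM_PKA))
    # Negatively charged groups contribute -1 / (1 + 10^(pKa - pH)).
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PKA - ph))
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tol: float = 1e-4) -> float:
    """Bisect on pH in [0, 14]; the net charge decreases monotonically with pH."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the pI lies at a higher pH
        else:
            high = mid  # zero or negative: the pI lies at a lower pH
    return (low + high) / 2.0
```

Calling `isoelectric_point("ACDEFGHIKLMNPQRSTVWY")` returns the pH at which `net_charge` crosses zero for that sequence under the assumed pKa set. A different pKa set would shift the result, which is the source of the spread between the fixed-pKa methods compared in this record.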
Figure 1. Overview of the IPC 2.0 architecture. The input (an amino acid sequence in plain format or multiple sequences in FASTA format) is processed by individual machine learning models. Separate models are used depending on the prediction task. Isoelectric point prediction for peptides is based on a separable convolution model (four channels representing the one-hot-encoded sequence, AAindex features, amino acid counts, and predictions from IPC 1.0). The protein pI and pKa prediction models use ensembles of low-level models integrated with a support vector regressor. For more details, see Supplementary Figure S1 and ‘Machine Learning Details’ in the Supplementary Material.
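Two of the four input channels named in the caption, the one-hot-encoded sequence and the amino acid counts, can be illustrated with a short sketch. It is only an assumption about how such features might be built: the residue ordering, the zero-padding to `max_len`, and the handling of unknown symbols are choices made for this example, and the AAindex and IPC 1.0 channels are not shown.

```python
from typing import List

# 20 standard residues ordered by one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence: str, max_len: int) -> List[List[int]]:
    """Return a max_len x 20 matrix; positions past the sequence end stay all-zero."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence.upper()[:max_len]):
        col = AA_INDEX.get(residue)
        if col is not None:  # unknown symbols (e.g. X) are left as zero rows
            matrix[pos][col] = 1
    return matrix


def residue_counts(sequence: str) -> List[int]:
    """Count each of the 20 standard residues in the sequence."""
    seq = sequence.upper()
    return [seq.count(aa) for aa in AMINO_ACIDS]
```

For a peptide such as `"ACD"` with `max_len=5`, the first three rows of the matrix each contain a single 1 and the last two rows are padding.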
Detailed statistics for the datasets used in IPC 2.0.
| Dataset | Entries | Details |
|---|---|---|
| | 2324 | The dataset consists of proteins derived from two databases: PIP-DB and SWISS-2DPAGE ( |
| | 119 092 | The dataset consists of the peptides from HiRIEF high-resolution isoelectric focusing experiments from Branca et al. 2014 ( |
| | 1337 | p |
The full datasets were never used directly. First, the sequences were clustered (to remove duplicates and to average the isoelectric point where multiple experimental values existed), then split randomly into 25% and 75% sets (test and training datasets, respectively). The training sets were used for training and (hyper)parameter optimisation. The test sets were used only once, to assess the final performance of the models. For individual datasets’ sequences and experimental isoelectric points, see Supplementary Data 1.
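The preparation steps described above can be sketched as follows. This is a simplified illustration, not the procedure used for IPC 2.0: exact sequence identity stands in for clustering, and the function names, the `seed` parameter, and the record layout `(sequence, pI)` are assumptions made for the example.

```python
import random
from collections import defaultdict
from statistics import mean


def deduplicate(records):
    """Collapse identical sequences and average their experimental pI values."""
    grouped = defaultdict(list)
    for sequence, pi in records:
        grouped[sequence].append(pi)
    return [(seq, mean(values)) for seq, values in grouped.items()]


def split_train_test(entries, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out test_fraction of the entries."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The test set is set aside here and would be evaluated only once.
    return shuffled[n_test:], shuffled[:n_test]
```

A typical call would be `train, test = split_train_test(deduplicate(records))`, with all model fitting and hyperparameter tuning restricted to `train`.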
Isoelectric point prediction accuracy on the held-out 25% test datasets.
| Method (protein dataset, a) | RMSE | MAE | | Outliers (c) | Method (peptide dataset, b) | RMSE | MAE | | Outliers (c) |
|---|---|---|---|---|---|---|---|---|---|
| | 0.8479 | 0.5906 | 0.5934 | 247 | | 0.2216 | 0.1216 | 0.9761 | 2691 |
| | 0.8608 | 0.6052 | 0.5748 | 251 | | 0.2299 | 0.1155 | 0.9743 | 2490 |
| IPC_protein | 0.8677 | 0.6109 | 0.5760 | 250 | | 0.2482 | 0.1394 | 0.9700 | 3179 |
| ProMoST | 0.9113 | 0.6444 | 0.5183 | 263 | Bjellqvist | 0.4051 | 0.2836 | 0.9204 | 11639 |
| Toseland | 0.9278 | 0.6537 | 0.5095 | 250 | Nozaki | 0.4083 | 0.2673 | 0.9191 | 9837 |
| Dawson | 0.9365 | 0.6586 | 0.4977 | 263 | DTASelect | 0.4235 | 0.2796 | 0.9130 | 10606 |
| Bjellqvist | 0.9369 | 0.6536 | 0.5005 | 260 | Thurlkill | 0.4466 | 0.2535 | 0.9033 | 7182 |
| Wikipedia | 0.9484 | 0.6795 | 0.4860 | 262 | Sillero | 0.4747 | 0.2696 | 0.8907 | 7607 |
| Rodwell | 0.9579 | 0.6762 | 0.4706 | 262 | Dawson | 0.4910 | 0.2642 | 0.8831 | 6698 |
| Grimsley | 0.9588 | 0.6953 | 0.4779 | 265 | Wikipedia | 0.5178 | 0.2974 | 0.8700 | 8326 |
| Lehninger | 0.9617 | 0.6783 | 0.4607 | 266 | Grimsley | 0.5264 | 0.3796 | 0.8656 | 15956 |
| Solomon | 0.9631 | 0.6746 | 0.4606 | 272 | Rodwell | 0.5855 | 0.3429 | 0.8337 | 9857 |
| pIR | 1.0148 | 0.7556 | 0.4161 | 315 | Toseland | 0.5860 | 0.3896 | 0.8335 | 13152 |
| Nozaki | 1.0164 | 0.7219 | 0.3980 | 288 | EMBOSS | 0.5971 | 0.3557 | 0.8271 | 11022 |
| Thurlkill | 1.0250 | 0.7573 | 0.3948 | 302 | PredpI-iTRAQ8 | 0.6302 | 0.3503 | 0.8027 | 12059 |
| DTASelect | 1.0278 | 0.7798 | 0.3947 | 319 | PredpI-TMT6 | 0.6365 | 0.3518 | 0.7988 | 12135 |
| EMBOSS | 1.0498 | 0.7757 | 0.3734 | 308 | PredpI-plain | 0.6480 | 0.3710 | 0.7913 | 12813 |
| Sillero | 1.0519 | 0.7694 | 0.3461 | 308 | IPC_peptide | 0.7459 | 0.4860 | 0.7302 | 13599 |
| Patrickios | 2.3764 | 1.8414 | <0 | 517 | Solomon | 0.7518 | 0.4929 | 0.7259 | 13777 |
| PredpI-TMT6 | NA | NA | NA | NA | Lehninger | 0.7697 | 0.5209 | 0.7127 | 15200 |
| PredpI-plain | NA | NA | NA | NA | pIR | 0.8529 | 0.7303 | 0.6387 | 27158 |
| PredpI-iTRAQ8 | NA | NA | NA | NA | ProMoST | 1.1026 | 0.7562 | 0.4104 | 18513 |
| | | | | | Patrickios | 2.0172 | 1.3927 | <0 | 22818 |
(a) Protein dataset consisting of 581 proteins (25% randomly chosen proteins, not used for training or optimization).
(b) Peptide dataset consisting of 29 774 peptides (25% randomly chosen peptides, not used for training or optimization).
(c) Outliers were defined using a threshold of 0.5 pH units (protein dataset) or 0.25 pH units (peptide dataset) for the difference between the predicted and experimental pI.
NA: The PredpI program was designed for peptides only within the 3.7–4.9 pH range; thus, for proteins, it returned 0 and could not be evaluated on the protein dataset.
New machine learning models developed in this study are in bold. The first version of IPC (12) is underscored. Scores were calculated after 10-fold cross-validation. The table is sorted by RMSD. For individual methods’ predictions, see Supplementary Data 2. For more details about the datasets, see Table 1.
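The error measures reported in the table (RMSE, MAE, and the outlier count) follow standard definitions, sketched below. The function names are hypothetical, and treating a prediction as an outlier only when its absolute error strictly exceeds the threshold is an assumption; the source states the thresholds but not whether the comparison is strict.

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental values."""
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, experimental):
    """Mean absolute error between predicted and experimental values."""
    errors = [abs(p - e) for p, e in zip(predicted, experimental)]
    return sum(errors) / len(errors)


def count_outliers(predicted, experimental, threshold):
    """Number of predictions whose absolute error exceeds the threshold (assumed strict)."""
    return sum(1 for p, e in zip(predicted, experimental) if abs(p - e) > threshold)
```

Under the thresholds given in footnote (c), the protein dataset would be scored with `threshold=0.5` and the peptide dataset with `threshold=0.25`.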
pKa prediction accuracy on the Rosetta pKa dataset (a).
| Method | RMSE | MAE | Outliers (b) | Method | RMSE | MAE | Outliers (b) |
|---|---|---|---|---|---|---|---|
| Rosetta (Site repack) | 0.8193 | 0.5824 | 27 | Rosetta (Neighbor repack) | 0.8370 | 0.6647 | 9 |
| Rosetta (Ensemble average) | 0.8413 | 0.5460 | 25 | Rosetta (Standard) | 0.9579 | 0.8000 | 9 |
| Rosetta (Neighbor repack) | 0.8676 | 0.6378 | 34 | IPC2_pKa | 0.9766 | 0.8261 | 10 |
| Rosetta (Standard) | 1.0651 | 0.8554 | 46 | Rosetta (Ensemble average) | 1.1892 | 0.9529 | 13 |
| | | | | | | | |
| IPC2_pKa | 0.8523 | 0.5105 | 27 | Rosetta (Neighbor repack) | 0.6216 | 0.5091 | 7 |
| Rosetta (Neighbor repack) | 0.8559 | 0.6487 | 32 | Rosetta (Standard) | 0.6498 | 0.5046 | 8 |
| Rosetta (Ensemble average) | 1.0244 | 0.7566 | 39 | Rosetta (Site repack) | 0.6705 | 0.5227 | 7 |
| Rosetta (Standard) | 1.2303 | 0.9961 | 50 | Rosetta (Ensemble average) | 0.7135 | 0.5364 | 6 |
| | | | | | | | |
| Rosetta (Neighbor repack) | 0.8744 | 0.5887 | 29 | Rosetta (Site repack) | 0.8262 | 0.6165 | 102 |
| Rosetta (Standard) | 0.8880 | 0.7324 | 38 | Rosetta (Neighbor repack) | 0.8332 | 0.6185 | 111 |
| Rosetta (Site repack) | 0.9303 | 0.6549 | 30 | Rosetta (Ensemble average) | 0.9207 | 0.6746 | 114 |
| Rosetta (Ensemble average) | 0.9317 | 0.6972 | 34 | Rosetta (Standard) | 1.0300 | 0.8296 | 151 |
(a) For the validation of pKa, the dataset from Kilambi and Gray (2012) was used (260* residues from 34 proteins). The numbers next to the residue type indicate the number of cases and the average pKa value with standard deviation.
(b) Outliers are defined using a threshold of 0.5 pH units for the difference between the predicted and experimental pKa.
* The dataset consists of 260 instead of 264 residues because of parsing problems (four residues could not be mapped to the protein sequence because of a wrong residue register). Scores were calculated after 10-fold cross-validation.