Abstract
BACKGROUND: Accurate estimation of the isoelectric point (pI) based on the amino acid sequence is useful for many analytical biochemistry and proteomics techniques such as 2-D polyacrylamide gel electrophoresis or capillary isoelectric focusing used in combination with high-throughput mass spectrometry. Additionally, pI estimation can be helpful during protein crystallization trials.
Keywords: Isoelectric point; Proteomics; pKa dissociation constant
Year: 2016 PMID: 27769290 PMCID: PMC5075173 DOI: 10.1186/s13062-016-0159-9
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Prediction of isoelectric point on the 25 % testing datasets
| Method (protein dataset) | RMSD | % | Outliers | Method (peptide dataset) | RMSD | % | Outliers |
|---|---|---|---|---|---|---|---|
| IPC_protein | 0.874 | 0 | 46 | IPC_peptide | 0.251 | 0 | 232 |
| Toseland | 0.934 | 14.9 | 52 | Solomons | 0.255 | 0.9 | 235 |
| Bjellqvist | 0.944 | 17.7 | 47 | Lehninger | 0.262 | 2.5 | 236 |
| Dawson | 0.945 | 17.8 | 56 | EMBOSS | 0.325 | 18.5 | 372 |
| Wikipedia | 0.955 | 20.5 | 55 | Wikipedia | 0.421 | 47.9 | 1467 |
| Rodwell | 0.963 | 22.8 | 58 | Toseland | 0.425 | 49.1 | 990 |
| ProMoST | 0.966 | 23.6 | 52 | Sillero | 0.428 | 50.3 | 1223 |
| Grimsley | 0.968 | 24.2 | 60 | Dawson | 0.435 | 52.9 | 1432 |
| Solomons | 0.970 | 24.8 | 58 | Thurlkill | 0.481 | 69.7 | 1361 |
| Lehninger | 0.970 | 25.0 | 59 | Rodwell | 0.502 | 78.4 | 1359 |
| pIR | 1.013 | 38.0 | 58 | DTASelect | 0.550 | 99.1 | 1714 |
| Nozaki | 1.024 | 41.3 | 56 | Nozaki | 0.602 | 124.3 | 1368 |
| Thurlkill | 1.030 | 43.4 | 61 | Grimsley | 0.616 | 131.4 | 1550 |
| DTASelect | 1.032 | 44.1 | 58 | Bjellqvist | 0.669 | 161.5 | 1583 |
| pIPredict | 1.048 | 49.4 | 56 | pIPredict | 1.024 | 493.6 | 2720 |
| EMBOSS | 1.056 | 52.3 | 69 | ProMoST | 1.239 | 873.4 | 2649 |
| Sillero | 1.059 | 53.2 | 63 | pIR | 1.881 | 4159.7 | 3358 |
| Patrickios | 2.392 | 3201.8 | 227 | Patrickios | 1.998 | 5479.1 | 2739 |
| Avg_pIᵃ | 0.960 | 22.1 | 53 | Avg_pI | 0.454 | 59.6 | 1571 |
ᵃ Average from all pKa sets without Patrickios (highly simplified pKa set) and IPC sets. Note that the average pI is calculated at the level of the individual protein or peptide; thus, it does not represent the average of the values presented in the table for individual methods
% - Note that the pH scale is logarithmic with base 10; thus, the percent difference is derived from pow(10, x), where x is the difference between the RMSDs of two methods expressed in pH units, as (pow(10, x) - 1) × 100; for example, the % difference between Toseland and IPC_protein is (pow(10, (0.934-0.874)) - 1) × 100
Protein dataset (IPC_protein was trained on 1,743 proteins with 10-fold cross-validation – data in Table 2, tested on 581 proteins not used for training – data in the table above), peptide dataset (IPC trained on 12,662 peptides with 10-fold cross-validation – data in Table 2, tested on 4,220 peptides not used for training – data in the table above). Outliers correspond to the number of predictions for which the difference between the experimental pI and predicted pI was greater than the threshold of the mean standard error (MSE) of 3 for the protein dataset and MSE of 0.25 for the peptide dataset
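The three columns reported in these tables (RMSD, % and Outliers) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the outlier criterion is read here as "squared error above the threshold", which is one interpretation of the MSE thresholds quoted in the footnote.

```python
import math


def rmsd(experimental, predicted):
    """Root-mean-square deviation between paired pI values (pH units)."""
    if len(experimental) != len(predicted) or not experimental:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(e - p) ** 2 for e, p in zip(experimental, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def percent_difference(rmsd_method, rmsd_reference):
    """Percent difference on the base-10 logarithmic pH scale."""
    return (10 ** (rmsd_method - rmsd_reference) - 1) * 100


def count_outliers(experimental, predicted, threshold):
    """Count predictions whose squared error exceeds the threshold.

    Assumption: the threshold (3 for proteins, 0.25 for peptides)
    is applied to the squared difference of each prediction.
    """
    return sum((e - p) ** 2 > threshold
               for e, p in zip(experimental, predicted))


# Toseland versus IPC_protein, using the rounded RMSDs from the table.
print(round(percent_difference(0.934, 0.874), 1))
```

With the rounded RMSDs this gives about 14.8, close to the 14.9 listed in the table; the small gap presumably comes from the table being computed on unrounded values.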
Prediction of isoelectric point on the 75 % training datasets
| Method (protein dataset) | RMSD | % | Outliers | Method (peptide dataset) | RMSD | % | Outliers |
|---|---|---|---|---|---|---|---|
| IPC_protein | 0.838 | 0 | 114 | IPC_peptide | 0.247 | 0 | 635 |
| Toseland | 0.898 | 15.0 | 131 | Solomons | 0.251 | 0.8 | 638 |
|  |  |  |  | Lehninger | 0.256 | 2.4 | 643 |
|  |  |  |  | EMBOSS | 0.322 | 18.8 | 1088 |
| Wikipedia | 0.930 | 23.8 | 157 | Wikipedia | 0.413 | 46.3 | 4280 |
| Rodwell | 0.938 | 26.1 | 159 | Sillero | 0.426 | 50.9 | 3025 |
| ProMoST | 0.938 | 26.1 | 140 | Toseland | 0.427 | 51.2 | 3618 |
| Grimsley | 0.939 | 26.2 | 147 | Dawson | 0.432 | 52.9 | 4192 |
|  |  |  |  | Thurlkill | 0.480 | 70.8 | 4017 |
|  |  |  |  | Rodwell | 0.506 | 81.2 | 4061 |
|  |  |  |  | DTASelect | 0.541 | 96.8 | 4902 |
|  |  |  |  | Nozaki | 0.599 | 124.8 | 4013 |
|  |  |  |  | Grimsley | 0.611 | 130.9 | 4609 |
|  |  |  |  | Bjellqvist | 0.661 | 159.2 | 4672 |
|  |  |  |  | pIPredict | 1.024 | 497.8 | 8051 |
| EMBOSS | 1.040 | 59.4 | 189 | ProMoST | 1.233 | 867.5 | 7999 |
| Sillero | 1.042 | 60.1 | 188 | pIR | 1.862 | 4020.9 | 9921 |
| Patrickios | 2.237 | 2405.1 | 645 | Patrickios | 1.977 | 5266.8 | 8131 |
| Avg_pIᵃ | 0.940 | 26.6 | 151 | Avg_pI | 0.451 | 59.7 | 4600 |
ᵃ Average from all pKa sets without the Patrickios (highly simplified pKa set) and IPC sets. Note that the average pI is calculated at the level of the individual protein or peptide
Protein dataset (IPC_protein trained on 1,743 proteins with 10-fold cross-validation – data in the table above, tested on 581 proteins not used for training – data in Table 1), peptide dataset (IPC trained on 12,662 peptides with 10-fold cross-validation – data in the table above, tested on 4,220 peptides not used for training – data in Table 1). Changes in method order in comparison to Table 1 are in bold
Outliers correspond to the number of predictions for which the difference between the experimental pI and the predicted pI exceeded the threshold of an MSE of 3 for the protein dataset and an MSE of 0.25 for the peptide dataset
Prediction of isoelectric points for SWISS-2DPAGE and PIP-DB databases
| Method (SWISS-2DPAGE) | RMSD | % | Outliers | Method (PIP-DB) | RMSD | % | Outliers |
|---|---|---|---|---|---|---|---|
| IPC_protein | 0.476 | 0 | 10 | IPC_protein | 1.019 | 0 | 141 |
| Toseland | 0.521 | 10.9 | 18 | Toseland | 1.086 | 16.7 | 153 |
| Bjellqvist | 0.590 | 30.0 | 31 | Bjellqvist | 1.085 | 16.3 | 150 |
| ProMoST | 0.597 | 32.1 | 29 | Dawson | 1.081 | 15.3 | 161 |
| Dawson | 0.599 | 32.5 | 37 | Wikipedia | 1.087 | 16.9 | 163 |
| Wikipedia | 0.619 | 39.0 | 35 | Rodwell | 1.095 | 19.1 | 167 |
| Rodwell | 0.628 | 41.7 | 37 | Grimsley | 1.121 | 26.6 | 170 |
| Grimsley | 0.572 | 24.5 | 21 | Solomons | 1.103 | 21.4 | 159 |
| Solomons | 0.635 | 44.2 | 44 | Lehninger | 1.102 | 21.1 | 161 |
| Lehninger | 0.640 | 45.8 | 44 | ProMoST | 1.111 | 23.5 | 150 |
| Nozaki | 0.679 | 59.4 | 43 | pIR | 1.152 | 35.8 | 184 |
| Thurlkill | 0.691 | 63.9 | 39 | Nozaki | 1.165 | 39.9 | 170 |
| DTASelect | 0.677 | 58.8 | 35 | Thurlkill | 1.180 | 44.9 | 176 |
| EMBOSS | 0.724 | 76.9 | 49 | DTASelect | 1.186 | 47.1 | 173 |
| Sillero | 0.721 | 75.5 | 50 | pIPredict | 1.195 | 50.0 | 182 |
| pIR | 0.761 | 92.4 | 37 | EMBOSS | 1.198 | 51.2 | 191 |
| pIPredict | 0.768 | 95.9 | 33 | Sillero | 1.202 | 52.4 | 187 |
| Patrickios | 1.600 | 1227.9 | 243 | Patrickios | 2.623 | 3918 | 604 |
| Avg_pIᵃ | 0.614 | 37.1 | 32 | Avg_pIᵃ | 1.101 | 20.9 | 160 |
ᵃ Average from all pKa sets without the Patrickios (highly simplified pKa set) and IPC sets. Note that the average pI is calculated at the level of the individual protein or peptide
Both SWISS-2DPAGE and PIP-DB were cleaned of outliers (MSE > 3 between the experimental pI and the average predicted pI) and clustered by CD-HIT with a 99 % sequence identity threshold, as described in the Materials and Methods (982 and 1,307 proteins, respectively), but they were not divided into training and testing datasets. Thus, the results for the IPC sets are slightly overestimated, but the effect is negligible, as shown by the comparison of Tables 1 and 2
Outliers correspond to the number of predictions for which the difference between the experimental pI and the predicted pI exceeded the threshold of an MSE of 3 for the protein dataset
Fig. 1 Correlation of the experimental versus the theoretical isoelectric points for two protein datasets. Data for SWISS-2DPAGE (left panel) and PIP-DB (right panel) are calculated using the EMBOSS pKa set. Outliers are defined as MSE > 3 and are marked in red. Plots correspond to datasets as presented by the authors before cleaning and the removal of duplicates (duplicates are defined as records that have the same sequence but are referred to as separate records in the database). In both databases, the authors reported multiple pI values from different experiments for the same sequences in separate records. In such cases, the average pI was used for the current analysis. The solid line represents the linear regression after removal of the outliers
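The handling of duplicate records described in the caption (one sequence, several experimental pI values, replaced by their average) can be sketched as follows. The function name and the `(sequence, pI)` record layout are assumptions made for illustration; the databases' actual file formats are not shown here.

```python
from collections import defaultdict


def average_duplicate_records(records):
    """Collapse records sharing a sequence into one averaged pI.

    `records` is an iterable of (sequence, experimental_pI) pairs;
    this layout is hypothetical.
    """
    by_sequence = defaultdict(list)
    for sequence, pi in records:
        by_sequence[sequence].append(pi)
    # One entry per unique sequence, holding the mean of its pI values.
    return {seq: sum(values) / len(values)
            for seq, values in by_sequence.items()}


merged = average_duplicate_records(
    [("MKT", 5.0), ("MKT", 6.0), ("AAG", 7.5)]
)
print(merged)
```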
Fig. 2 Correlation of the experimental versus theoretical isoelectric points calculated using different pKa sets. Data for the main protein dataset (merged dataset created from SWISS-2DPAGE and PIP-DB). R² – Pearson correlation before the removal of outliers. R²corr – Pearson correlation after the removal of outliers. Additionally, the linear regression models fitted to predictions with outliers (magenta line) and without outliers (blue line) are shown. Outliers (marked in magenta) are defined as pI predictions with MSE > 3 in comparison to the experimental pI. Other predictions are represented as heat maps according to the density of points. The numbers of outliers for both the training and testing set are shown together. For brevity, only six pKa sets are shown
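The two correlation values in the caption, one computed on all predictions and one after dropping outliers, can be illustrated with a short sketch. The function names are hypothetical, the example values are invented for demonstration only, and the outlier rule (squared error above 3) is an assumed reading of "MSE > 3".

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_before_and_after(experimental, predicted, threshold=3.0):
    """Return (r on all points, r after removing outliers).

    Assumption: a point is an outlier when its squared error
    exceeds `threshold`.
    """
    kept = [(e, p) for e, p in zip(experimental, predicted)
            if (e - p) ** 2 <= threshold]
    r_all = pearson_r(experimental, predicted)
    r_kept = pearson_r([e for e, _ in kept], [p for _, p in kept])
    return r_all, r_kept


# Invented toy values; the last prediction is a clear outlier.
experimental = [4.0, 5.0, 6.0, 7.0, 8.0]
predicted = [4.1, 5.0, 6.2, 6.9, 3.0]
r_all, r_kept = correlation_before_and_after(experimental, predicted)
print(r_all, r_kept)
```

Squaring either value gives the corresponding coefficient of determination for a simple linear fit.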
Fig. 3 Histograms of the isoelectric points of proteins. The top and middle panels are calculated using the IPC_protein pKa set (in 0.25 pH unit intervals) and represent the pI distribution in the SwissProt database, the human proteome, Escherichia coli and the extreme halophilic archaeon Natrialba magadii. The bottom two panels present the isoelectric points of the yeast proteome (6,721 proteins) calculated using the EMBOSS pKa set (as presented in the Saccharomyces Genome Database [40]) and the IPC_protein pKa set for comparison
Most commonly used pKa values for the ionizable groups of proteins. Note that Bjellqvist and ProMoST use different numbers of additional pKa values (not shown), which take into account the relative position of the ionized group (whether it is located on the N- or C-terminus or in the middle). For more details, see References 4 and 5 and the “Theory” section on the IPC web site
| pKa set | NH2 | COOH | C | D | E | H | K | R | Y |
|---|---|---|---|---|---|---|---|---|---|
| EMBOSS | 8.6 | 3.6 | 8.5 | 3.9 | 4.1 | 6.5 | 10.8 | 12.5 | 10.1 |
| DTASelect | 8 | 3.1 | 8.5 | 4.4 | 4.4 | 6.5 | 10 | 12 | 10 |
| Solomons | 9.6 | 2.4 | 8.3 | 3.9 | 4.3 | 6 | 10.5 | 12.5 | 10.1 |
| Sillero | 8.2 | 3.2 | 9 | 4 | 4.5 | 6.4 | 10.4 | 12 | 10 |
| Rodwell | 8 | 3.1 | 8.33 | 3.68 | 4.25 | 6 | 11.5 | 11.5 | 10.07 |
| Patrickios | 11.2 | 4.2 | - | 4.2 | 4.2 | - | 11.2 | 11.2 | - |
| Wikipedia | 8.2 | 3.65 | 8.18 | 3.9 | 4.07 | 6.04 | 10.54 | 12.48 | 10.46 |
| Lehninger | 9.69 | 2.34 | 8.33 | 3.86 | 4.25 | 6 | 10.5 | 12.4 | 10 |
| Grimsley | 7.7 | 3.3 | 6.8 | 3.5 | 4.2 | 6.6 | 10.5 | 12.04ᵃ | 10.3 |
| Toseland | 8.71 | 3.19 | 6.87 | 3.6 | 4.29 | 6.33 | 10.45 | 12 | 9.61 |
| Thurlkill | 8 | 3.67 | 8.55 | 3.67 | 4.25 | 6.54 | 10.4 | 12 | 9.84 |
| Nozaki | 7.5 | 3.8 | 9.5 | 4 | 4.4 | 6.3 | 10.4 | 12 | 9.6 |
| Dawson | 8.2ᵇ | 3.2ᵇ | 8.3 | 3.9 | 4.3 | 6 | 10.5 | 12 | 10.1 |
| Bjellqvist | 7.5 | 3.55 | 9 | 4.05 | 4.45 | 5.98 | 10 | 12 | 10 |
| ProMoST | 7.26 | 3.57 | 8.28 | 4.07 | 4.45 | 6.08 | 9.8 | 12.5 | 9.84 |
| IPC_protein | 9.094 | 2.869 | 7.555 | 3.872 | 4.412 | 5.637 | 9.052 | 11.84 | 10.85 |
| IPC_peptide | 9.564 | 2.383 | 8.297 | 3.887 | 4.317 | 6.018 | 10.517 | 12.503 | 10.071 |
ᵃ Arg was not included in the study, and the average pKa from all other pKa sets was taken
ᵇ NH2 and COOH were not included in the study, and they were arbitrarily taken from the Sillero set
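To show how one row of the pKa table turns into a pI estimate, here is a minimal sketch that sums Henderson-Hasselbalch charges for each ionizable group and finds the pH of zero net charge by bisection. It uses the EMBOSS values from the table; the function names, the 0 to 14 search range and the tolerance are assumptions for illustration, and the position-dependent pKa values that Bjellqvist and ProMoST add are not modelled.

```python
# EMBOSS pKa set, copied from the EMBOSS row of the table.
EMBOSS_PKA = {
    "Nterm": 8.6, "Cterm": 3.6,
    "C": 8.5, "D": 3.9, "E": 4.1, "H": 6.5,
    "K": 10.8, "R": 12.5, "Y": 10.1,
}
POSITIVE = ("H", "K", "R")       # protonated form carries +1
NEGATIVE = ("C", "D", "E", "Y")  # deprotonated form carries -1


def net_charge(sequence, ph, pka=EMBOSS_PKA):
    """Net charge of a sequence at a given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (ph - pka["Nterm"]))   # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (pka["Cterm"] - ph))  # C-terminal carboxyl
    for residue in sequence.upper():
        if residue in POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - pka[residue]))
        elif residue in NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (pka[residue] - ph))
    return charge


def isoelectric_point(sequence, pka=EMBOSS_PKA, tolerance=1e-4):
    """Bisection search for the pH at which the net charge is zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        # Net charge falls monotonically as pH rises.
        if net_charge(sequence, mid, pka) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


print(round(isoelectric_point("ACDEFGHIKLMNPQRSTVWY"), 2))
```

Passing a different dictionary with the same keys swaps in another pKa set from the table; sets with missing entries, such as Patrickios, would need those residues removed from `POSITIVE` and `NEGATIVE` first.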
Detailed statistics for the available datasets
| Dataset | Initial no. entries | No. entries with sequence and pI | No. entries after removing outliers | No. entries after removing duplicates |
|---|---|---|---|---|
| Gauci et al. | 5,758 | 5,758 | NA | NA |
| PHENYX | 7,582 | 7,582 | NA | NA |
| SEQUEST | 7,629 | 7,629 | NA | NA |
| IPC_peptide | - | 20,969 | 20,969 | 16,882 [25] [75] |
| SWISS-2DPAGE | 2,530 | 1,054 | 1,029 | 982 |
| PIP-DB | 4,947 | 2,427 | 2,254 | 1,307 |
| IPC_protein | - | 3,481 | 3,283 | 2,324 [25] [75] |
NA (not available) refers to the situation where the given dataset was not created because a merged version was used
Note: all datasets presented in the table are available as hyperlinks; the final datasets were divided randomly into 75 % training and 25 % testing subsets (denoted as [75] and [25], respectively)
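A random 75 %/25 % split of the kind described in the note can be sketched as below. The function name, the fixed seed and the rounding rule are assumptions; the source does not state how the random split was implemented.

```python
import random


def split_train_test(entries, train_fraction=0.75, seed=0):
    """Shuffle entries and split them into training and testing subsets.

    The seed is arbitrary and only makes this sketch reproducible.
    """
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    # Round half up so that the subset sizes are integers.
    n_train = int(len(shuffled) * train_fraction + 0.5)
    return shuffled[:n_train], shuffled[n_train:]


# Sizes of the final protein and peptide datasets from the table.
for total in (2324, 16882):
    train, test = split_train_test(range(total))
    print(total, len(train), len(test))
```

For 2,324 proteins this gives 1,743 and 581 entries, and for 16,882 peptides 12,662 and 4,220, which matches the subset sizes quoted in the table footnotes.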
Fig. 4 Exemplary output of the IPC calculator for the Mycoplasma genitalium G37 proteome (476 proteins). The scatter plot with the predicted isoelectric points versus molecular weight for all proteins is presented at the top. Then, for individual proteins, pI predictions based on different pKa sets are presented alongside the molecular weight and amino acid composition