Thammakorn Saethang, D Michael Payne, Yingyos Avihingsanon, Trairak Pisitkun.
Abstract
BACKGROUND: One very important functional domain of proteins is the protein-protein interacting region (PPIR), which forms the binding interface between interacting polypeptide chains. Post-translational modifications (PTMs) that occur in the PPIR can either interfere with or facilitate the interaction between proteins. The ability to predict whether sites of protein modifications are inside or outside of PPIRs would be useful in further elucidating the regulatory mechanisms by which modifications of specific proteins regulate their cellular functions.
Keywords: AAindex; Machine learning; Post-translational modification; Protein-protein interacting region
Year: 2016 PMID: 27534850 PMCID: PMC4989344 DOI: 10.1186/s12859-016-1165-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Flow diagram of the data generation and preparation processes. Numerical results are shown for each step of the overall process
Fig. 2 The numbers of modification sites inside and outside of PPIRs
Fig. 3 Results of PhosphoLogo analysis. The x- and y-axes correspond to residue positions and bit scores (×10⁻¹), respectively. For phosphorylated sequences within PPIRs, position-specific sequence analyses revealed favored (a, logo) and disfavored (b, anti-logo) amino acid residues. Similar analyses of phosphorylated sequences outside PPIRs were performed (c, logo, and d, anti-logo). Amino acid types for neighboring positions of central phosphorylated residues (S, T, Y, or H) are indicated by symbols as follows: Φ = nonpolar, Δ = polar, Θ = acidic, Ψ = basic
Classification results for each PTM-specific dataset using conventional features and the SVM as a classifierᵃ
| Dataset | Feature set | F1 | Sn | Sp | PPV | ACC | AUC | MCC |
|---|---|---|---|---|---|---|---|---|
| Acetylation | Hydropathy | 0.51 | 0.51 | 0.51 | 0.51 | 0.51 | 0.50 | 0.01 |
| Acetylation | Secondary structure | 0.49 | 0.49 | 0.50 | 0.49 | 0.49 | 0.50 | −0.01 |
| Acetylation | Conservation | 0.55 | 0.57 | 0.48 | 0.53 | 0.53 | 0.54 | 0.06 |
| Acetylation | Combined features | 0.53 | 0.53 | 0.54 | 0.53 | 0.54 | 0.55 | 0.07 |
| Phosphorylation | Hydropathy | 0.53 | 0.58 | 0.40 | 0.49 | 0.49 | 0.48 | −0.03 |
| Phosphorylation | Secondary structure | 0.48 | 0.45 | 0.58 | 0.52 | 0.52 | 0.53 | 0.03 |
| Phosphorylation | Conservation | 0.53 | 0.55 | 0.45 | 0.50 | 0.50 | 0.51 | 0.01 |
| Phosphorylation | Combined features | 0.55 | 0.56 | 0.52 | 0.54 | 0.54 | 0.55 | 0.08 |
| Ubiquitylation | Hydropathy | 0.52 | 0.53 | 0.49 | 0.51 | 0.51 | 0.52 | 0.02 |
| Ubiquitylation | Secondary structure | 0.48 | 0.46 | 0.52 | 0.49 | 0.49 | 0.48 | −0.02 |
| Ubiquitylation | Conservation | 0.57 | 0.59 | 0.50 | 0.54 | 0.55 | 0.56 | 0.09 |
| Ubiquitylation | Combined features | 0.54 | 0.54 | 0.53 | 0.53 | 0.53 | 0.55 | 0.06 |
ᵃ This table shows results when datasets were balanced (see Methods). The results using unbalanced datasets are shown in Additional file 7: Table S1
Fig. 5 The sampling strategy for balancing interacting and non-interacting sub-datasets. The larger non-interacting sub-dataset was clustered by GibbsCluster into 10 clusters. Each cluster contained sequences representing a different characteristic motif; for the purpose of illustration, shown here are example motifs from the phosphorylation dataset. Equal numbers of sequences from each cluster were randomly selected and combined to create a reduced non-interacting sub-dataset which was similar in size to the interacting sub-dataset
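The sampling step described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the cluster assignments have already been produced elsewhere (for example by GibbsCluster), and the function name `balance_by_cluster`, the `seed` parameter, and the handling of undersized clusters are all hypothetical choices.

```python
import random
from collections import defaultdict


def balance_by_cluster(sequences, cluster_ids, target_size, seed=0):
    """Draw an equal number of sequences from each cluster.

    sequences   -- list of peptide strings (the larger, non-interacting set)
    cluster_ids -- one cluster label per sequence (assumed precomputed)
    target_size -- approximate size of the reduced set, e.g. the size of
                   the interacting sub-dataset
    """
    rng = random.Random(seed)

    # Group sequences by their cluster label.
    clusters = defaultdict(list)
    for seq, cid in zip(sequences, cluster_ids):
        clusters[cid].append(seq)

    # Same quota for every cluster; integer division may undershoot slightly.
    per_cluster = target_size // len(clusters)

    sampled = []
    for cid in sorted(clusters):
        members = clusters[cid]
        # Assumption: a cluster smaller than the quota contributes all members.
        k = min(per_cluster, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled
```

With 10 clusters and a target of 200 sequences, each cluster contributes 20 randomly chosen members, so every motif group is represented equally in the reduced set.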
Averaged performance measures for each PTM-specific dataset using the SVM as a classifier and 102 indices of AAindex1 in the encoding process
| Dataset | F1 | Sn | Sp | PPV | ACC | AUC | MCC |
|---|---|---|---|---|---|---|---|
| Acetylation | 0.86 | 0.81 | 0.94 | 0.93 | 0.87 | 0.89 | 0.75 |
| Phosphorylation | 0.84 | 0.74 | 0.99 | 0.99 | 0.86 | 0.92 | 0.75 |
| Ubiquitylation | 0.86 | 0.81 | 0.92 | 0.92 | 0.86 | 0.90 | 0.73 |
The highest possible value for all measures is 1
Standard deviations for all values were < ±0.005
Resulting performance measures when only sub-datasets corresponding to indices in the optimal sets were used in the classification tasks
| Dataset | F1 | Sn | Sp | PPV | ACC | AUC | MCC | #indices used (of 102 total) | #features used (of 1428 total) |
|---|---|---|---|---|---|---|---|---|---|
| Acetylation | 0.87ᵃ | 0.82 | 0.93 | 0.93 | 0.87 | 0.90ᵃ | 0.76ᵃ | 71 | 994 |
| Phosphorylation | 0.89ᵃ | 0.81ᵃ | 1.00 | 0.99 | 0.90ᵃ | 0.92 | 0.82ᵃ | 31 | 434 |
| Ubiquitylation | 0.86 | 0.81 | 0.92 | 0.93ᵃ | 0.87 | 0.90 | 0.74ᵃ | 86 | 1204 |
The maximum possible value for all measures is 1
ᵃ Significantly increased (t-test, α < 0.05) when compared with the use of all 102 indices
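The feature counts in this table are consistent with each index contributing 14 features (1428 / 102 = 14, and 71 × 14 = 994, 31 × 14 = 434, 86 × 14 = 1204). The sketch below shows one way such an encoding could work, assuming each index maps every residue in a 14-position window to a number. The window layout is inferred from the arithmetic only and is not stated in this section; the function `encode_window` and the example peptide are hypothetical. The Kyte-Doolittle hydropathy scale stands in for a single AAindex1 entry.

```python
# Kyte-Doolittle hydropathy values, used here as one example of a
# per-residue physicochemical index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

N_INDICES = 102    # indices of AAindex1 used in the source tables
N_FEATURES = 1428  # total features reported in the source table

# Inferred, not stated in this section: features contributed by each index.
POSITIONS_PER_INDEX = N_FEATURES // N_INDICES


def encode_window(window, indices):
    """Concatenate one numeric value per (index, position) pair.

    window  -- string of residues surrounding a modification site
    indices -- list of dicts, each mapping residue letter -> numeric value
    """
    features = []
    for index in indices:
        features.extend(index[residue] for residue in window)
    return features


# Hypothetical 14-residue window, for illustration only.
example_window = "MKRSPTLLEDSGAV"
vector = encode_window(example_window, [KYTE_DOOLITTLE] * N_INDICES)
```

Under this layout, dropping an index removes a block of 14 features at once, which matches the way the numbers of indices and features shrink together in the table.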
Fig. 4 Overlap among the three PTM-specific optimized sets of indices. The reference for each index is shown in Additional file 14: Table S6
Resulting performance measures when best-feature sets were used in the classification tasks, using two different feature selection algorithms
| Feature selection | Dataset | ACC | AUC | MCC | PPV | Sn | Sp | #Features used |
|---|---|---|---|---|---|---|---|---|
| Relief-F | Acetylation | 0.88 | 0.92 | 0.78 | 0.95 | 0.82ᵃ | 0.95 | 144 |
| Relief-F | Phosphorylation | 0.91 | 0.93 | 0.83 | 0.99ᵃ | 0.82 | 1.00ᵃ | 73 |
| Relief-F | Ubiquitylation | 0.88 | 0.91 | 0.77 | 0.96 | 0.80ᵃ | 0.96 | 512 |
| Information gain | Acetylation | 0.88 | 0.90ᵃ | 0.78 | 0.97 | 0.80ᵃ | 0.97 | 82 |
| Information gain | Phosphorylation | 0.91 | 0.93 | 0.84 | 0.99ᵃ | 0.83 | 1.00ᵃ | 35 |
| Information gain | Ubiquitylation | 0.88 | 0.91 | 0.77 | 0.96 | 0.80ᵃ | 0.96 | 343 |
ᵃ Except for these values, others were significantly increased (t-test, α < 0.05) when compared with the results using optimized sets of indices shown in Table 3
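For readers unfamiliar with information gain as a feature-ranking criterion, the following is a generic textbook sketch and not the implementation used in the study. It assumes feature values have already been discretized, since continuous physicochemical values need binning before this formula applies. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)

    # Partition the labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)

    # Weighted average entropy of the partitions.
    conditional = sum(
        (len(group) / total) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional
```

A feature that separates inside-PPIR from outside-PPIR sites perfectly on a balanced dataset scores 1 bit, and a feature unrelated to the label scores 0; ranking features by this score and keeping the top ones is one way to arrive at a reduced feature set.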
Resulting performance measures of all six classifiers for all three PTM-specific datasets, using the optimized lists of indices and features as an input
| Classifier | Dataset | Relief-F ACC | Relief-F AUC | Relief-F MCC | Relief-F CPS | Information gain ACC | Information gain AUC | Information gain MCC | Information gain CPS |
|---|---|---|---|---|---|---|---|---|---|
| SVM | Acetylation | 0.88 | 0.92 | 0.78 | 2.58 | 0.88 | 0.90 | 0.78 | 2.56 |
| SVM | Phosphorylation | 0.91 | 0.93 | 0.83 | 2.67 | 0.91 | 0.93 | 0.84 | 2.68 |
| SVM | Ubiquitylation | 0.88 | 0.91 | 0.77 | 2.56 | 0.88 | 0.91 | 0.77 | 2.56 |
| SVM | Summation | | | | 7.81 | | | | 7.80 |
| k-NN | Acetylation | 0.87 | 0.91 | 0.74 | 2.52 | 0.87 | 0.91 | 0.75 | 2.53 |
| k-NN | Phosphorylation | 0.89 | 0.93 | 0.80 | 2.62 | 0.92 | 0.93 | 0.84 | 2.69 |
| k-NN | Ubiquitylation | 0.80 | 0.86 | 0.61 | 2.27 | 0.81 | 0.89 | 0.65 | 2.35 |
| k-NN | Summation | | | | 7.41 | | | | 7.57 |
| RF | Acetylation | 1.00 | 1.00 | 1.00 | 3.00 | 1.00 | 1.00 | 1.00 | 3.00 |
| RF | Phosphorylation | 0.91 | 0.93 | 0.82 | 2.66 | 0.90 | 0.93 | 0.80 | 2.63 |
| RF | Ubiquitylation | 1.00 | 1.00 | 1.00 | 3.00 | 1.00 | 1.00 | 1.00 | 3.00 |
| RF | Summation | | | | 8.66 | | | | 8.63 |
| C4.5 | Acetylation | 0.85 | 0.87 | 0.70 | 2.42 | 0.88 | 0.89 | 0.76 | 2.53 |
| C4.5 | Phosphorylation | 0.89 | 0.90 | 0.78 | 2.57 | 0.92 | 0.93 | 0.85 | 2.70 |
| C4.5 | Ubiquitylation | 0.80 | 0.82 | 0.61 | 2.23 | 0.81 | 0.82 | 0.62 | 2.25 |
| C4.5 | Summation | | | | 7.22 | | | | 7.48 |
| KStar | Acetylation | 0.83 | 0.88 | 0.65 | 2.36 | 0.83 | 0.89 | 0.66 | 2.38 |
| KStar | Phosphorylation | 0.87 | 0.92 | 0.74 | 2.53 | 0.89 | 0.93 | 0.79 | 2.61 |
| KStar | Ubiquitylation | 0.71 | 0.76 | 0.43 | 1.90 | 0.79 | 0.82 | 0.57 | 2.18 |
| KStar | Summation | | | | 6.79 | | | | 7.17 |
| MLP | Acetylation | 0.85 | 0.91 | 0.70 | 2.46 | 0.84 | 0.90 | 0.67 | 2.41 |
| MLP | Phosphorylation | 0.88 | 0.92 | 0.76 | 2.56 | 0.91 | 0.93 | 0.83 | 2.67 |
| MLP | Ubiquitylation | 0.84 | 0.90 | 0.68 | 2.42 | 0.83 | 0.89 | 0.66 | 2.38 |
| MLP | Summation | | | | 7.44 | | | | 7.46 |
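CPS is not defined in this section, but every row of the table is consistent with CPS = ACC + AUC + MCC, and each summation row equals the sum of the three per-dataset CPS values. The sketch below reproduces the SVM / Relief-F column under that assumption. The function name `cps` and the dictionary layout are illustrative only.

```python
def cps(acc, auc, mcc):
    """Combined score, assumed here to be ACC + AUC + MCC (max 3.00)."""
    return round(acc + auc + mcc, 2)


# (ACC, AUC, MCC) for the SVM with Relief-F, copied from the table above.
svm_relief_f = {
    "Acetylation": (0.88, 0.92, 0.78),
    "Phosphorylation": (0.91, 0.93, 0.83),
    "Ubiquitylation": (0.88, 0.91, 0.77),
}

# Per-dataset CPS, then the classifier-level summation across datasets.
scores = {name: cps(*measures) for name, measures in svm_relief_f.items()}
summation = round(sum(scores.values()), 2)
```

Because the published ACC, AUC and MCC values are already rounded to two decimals, recomputing CPS from them could in principle differ from a published value by 0.01; for the rows in this table the recomputed values match.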
Resulting performance measures for the phosphorylation dataset, using k-NN or C4.5 as the classifiers (instead of the SVM) for all feature selection and classification tasks, including identification of optimized sets of indices and features. All values were obtained with information gain feature selection
| Classifier | Dataset | PPV | Sn | Sp | ACC | AUC | MCC | CPS |
|---|---|---|---|---|---|---|---|---|
| SVM | Phosphorylation | 0.99 | 0.83 | 1.00 | 0.91 | 0.93 | 0.84 | 2.68 |
| k-NN | Phosphorylation | 0.95 | 0.85 | 0.95 | 0.89 | 0.92 | 0.79 | 2.60 |
| C4.5 | Phosphorylation | 0.96 | 0.87 | 0.97 | 0.91 | 0.92 | 0.84 | 2.67 |
The validation datasets of sequences collected from dbPTM [5], Huebner et al. [31], and Hou et al. [32]
| Dataset | Inside PPIRs | Outside PPIRs | Total |
|---|---|---|---|
| Acetylation | 14 | 71 | 85 |
| Phosphorylation | 92 | 542 | 634 |
| Ubiquitylation | 33 | 71 | 104 |
Results of model evaluation using the validation dataset
| Dataset | PPV | F1 | Sn | Sp | FPR | FNR | ACC | AUC | MCC |
|---|---|---|---|---|---|---|---|---|---|
| Acetylation | 0.82 | 0.72 | 0.64 | 0.97 | 0.03 | 0.36 | 0.92 | 0.91 | 0.68 |
| Phosphorylation | 0.79 | 0.81 | 0.84 | 0.96 | 0.04 | 0.16 | 0.94 | 0.93 | 0.78 |
| Ubiquitylation | 0.87 | 0.71 | 0.61 | 0.96 | 0.04 | 0.39 | 0.85 | 0.82 | 0.63 |
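The measures in this table follow from a binary confusion matrix in which "inside PPIRs" is the positive class. The sketch below uses the standard definitions. The confusion-matrix counts for the acetylation validation set (TP = 9, FN = 5, TN = 69, FP = 2) are not reported in the source; they are back-calculated here from the 14 inside and 71 outside sequences together with the published Sn and Sp, and should be read as an assumption. AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard measures computed from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                     # positive predictive value
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    f1 = 2 * tp / (2 * tp + fp + fn)         # harmonic mean of PPV and Sn
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                        # Matthews correlation coefficient
    return {
        "PPV": ppv, "F1": f1, "Sn": sn, "Sp": sp,
        "FPR": fp / (fp + tn), "FNR": fn / (fn + tp),
        "ACC": acc, "MCC": mcc,
    }


# Assumed counts for the acetylation validation set (inferred, see above).
acetylation = binary_metrics(tp=9, fp=2, tn=69, fn=5)
rounded = {name: round(value, 2) for name, value in acetylation.items()}
```

The low FPR alongside a comparatively high FNR in the table indicates models that rarely misplace a site inside a PPIR but miss a share of the true inside sites.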
Results of predictions using sequences with unknown PPIR localization (see Additional file 10: Table S11 for the complete list of predicted PPIR localization for these sequences)
| Dataset | Total # of sequences | Predicted inside PPIRs | Percent inside | Predicted outside PPIRs | Percent outside |
|---|---|---|---|---|---|
| Acetylation | 32033 | 359 | 1.1 | 31674 | 98.9 |
| Phosphorylation | 258407 | 8967 | 3.5 | 249440 | 96.5 |
| Ubiquitylation | 49628 | 411 | 0.8 | 49217 | 99.2 |
Relative computational time required for generating predictive models
| Dataset | Balanced dataset (ms) | Imbalanced dataset (ms) |
|---|---|---|
| Acetylation | 4,304 | 38,872 |
| Phosphorylation | 37,840 | 351,856 |
| Ubiquitylation | 12,009 | 124,460 |
All tests were performed on a personal computer (CPU: Intel® Core™ i5-4200U, RAM: 8.00 GB, OS: Windows 7 Ultimate)