Md Mehedi Hasan, Mst Shamima Khatun, Hiroyuki Kurata.
Abstract
Lysine succinylation is a form of posttranslational modification of proteins that plays an essential functional role in every aspect of cell metabolism in both prokaryotes and eukaryotes. Aside from experimental identification of succinylation sites, there has been an intense effort geared towards the development of sequence-based prediction through machine learning, owing to its promise of being highly accurate, robust and cost-effective. In spite of these advantages, several problems need attention in the design and development of succinylation site predictors. Notwithstanding the many studies that employ machine learning approaches, few articles have examined this bioinformatics field in a systematic manner. Thus, we review the advancements regarding the current state-of-the-art prediction models, datasets, and online resources and illustrate the challenges and limitations to present a useful guideline for developing powerful succinylation site prediction tools.
Keywords: feature descriptor; lysine succinylation; machine learning; sequence analysis; tool development
Year: 2019 PMID: 30696115 PMCID: PMC6406724 DOI: 10.3390/cells8020095
Source DB: PubMed Journal: Cells ISSN: 2073-4409 Impact factor: 6.600
Figure 1. An overview of current computational prediction algorithms of succinylation sites.
Summary of the reviewed predictors for lysine succinylation sites.
| Tools | SucPred | iSuc-PseAAC | SuccFind | iSuc-PseOpt | pSuc-Lys | SucStruct | PSSM-Suc | SuccinSite | SuccinSite2.0 | SSEvol-Suc | Success | GPSuc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Species | Generic | Generic | Generic | Generic | Generic | Generic | Generic | Generic | Generic and Species-specific | Generic | Generic | Generic and Species-specific |
| Web-server link | | | | | | | | | | | | |
| Working server | No | Yes | No | No | No | No | No | Yes | Yes | No | No | Yes |
| Machine learning | SVM | SVM | SVM | RF | RF | DT | DT | RF | RF | AdaBoost | SVM | RF and LR |
| Dataset size (Protein/succinylated) | 897/2511 | 896/2521 | 1044/2938 | 896/2521 | 896/2521 | 670/1782 | 670/1782 | 2322/5004 | 2322/5004 | 670/1782 | 670/1782 | 2322/5004 |
| Training (Pos/Neg) | 1436/18,958 | 1167/3553 | 2713/23,598 | 1167/3553 | 1167/3553 | 1782/1872 | 1782/1643 | 4750/9500 | 4750/9500 | 1782/1872 | 1782/1872 | 4750/9500 |
| Independent (Pos/Neg) | 250/- | - | - | - | - | - | - | 254/2977 | 254/2977 | - | - | 254/2977 |
| Homolog redundancy | 35% | 40% | 30% | 40% | 40% | 40% | 40% | 30% | 30% | 40% | 40% | 30% |
| Window size | from −9 to +9 | from −7 to +7 | from −10 to +10 | from −15 to +15 | from −15 to +15 | from −15 to +15 | from −15 to +15 | from −13 to +13 | from −20 to +20 | from −15 to +15 | from −15 to +15 | from −20 to +20 |
| Adjusted batch prediction | No | No | No | No | No | No | No | Yes | Yes | No | No | Yes |
| Processing time for a protein | - | Within 20 s | - | - | - | - | - | Within 20 s | Within 5 min | - | - | Within 5 min |
Figure 2. Window selection procedure for generating positive and negative samples.
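The window selection idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any specific tool reviewed here: the padding symbol `X`, the function name `extract_windows`, and the default flank of 15 residues (one of several window sizes listed in the summary table) are all assumptions made for the example.

```python
from typing import List, Set, Tuple

PAD = "X"  # assumed placeholder for positions beyond the protein termini


def extract_windows(
    sequence: str, succinylated: Set[int], flank: int = 15
) -> Tuple[List[str], List[str]]:
    """Split every lysine-centred window into positive and negative samples.

    `succinylated` holds 1-based positions of experimentally verified sites.
    Lysines not listed there are treated as negative samples.
    """
    padded = PAD * flank + sequence + PAD * flank
    positives: List[str] = []
    negatives: List[str] = []
    for i, residue in enumerate(sequence):
        if residue != "K":
            continue
        # The window spans -flank..+flank; its centre is sequence[i].
        window = padded[i : i + 2 * flank + 1]
        if (i + 1) in succinylated:
            positives.append(window)
        else:
            negatives.append(window)
    return positives, negatives


if __name__ == "__main__":
    pos, neg = extract_windows("MKTAYKAK", succinylated={2}, flank=3)
    print(pos)
    print(neg)
```

Padding keeps every window the same length, so lysines close to either terminus still yield fixed-size feature vectors.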
Statistics of the positive and negative samples of nine species-specific datasets used in this study.
| Species | Datasets | Positive Samples | Negative Samples |
|---|---|---|---|
| | Training | 1351 | 2702 |
| | Independent | 54 | 2004 |
| | Training | 414 | 828 |
| | Independent | 24 | 679 |
| | Training | 1942 | 3884 |
| | Independent | 289 | 1381 |
| | Training | 699 | 1398 |
| | Independent | 61 | 242 |
| | Training | 961 | 1922 |
| | Independent | 90 | 1423 |
| | Training | 282 | 564 |
| | Independent | 26 | 261 |
| | Training | 242 | 484 |
| | Independent | 33 | 274 |
| | Training | 332 | 664 |
| | Independent | 50 | 591 |
| | Training | 113 | 226 |
| | Independent | 32 | 309 |
Figure 3. pLogo graphs (https://plogo.uconn.edu/) of the sequences centered on succinylation sites. Nine species-specific datasets of H. sapiens, H. capsulatum, M. musculus, E. coli, M. tuberculosis, S. cerevisiae, T. gondii, S. lycopersicum and T. aestivum and their combined (generic) dataset are used. Significantly enriched or depleted amino acid residues (Student's t-test, p < 0.05) are shown.
Statistics of feature encoding schemes used in the aforementioned succinylation site prediction tools.
| Encoding Types | Description | References |
|---|---|---|
| AAindex | Encodes residues with indices from the AAindex database, capturing the biochemical properties of the sequences. | [ |
| ACF | Autocorrelation function features of the sequences surrounding succinylation sites. | [ |
| EBGW | Encoding based on grouped weight of the physicochemical properties of the sequences surrounding succinylation sites. | [ |
| VDWV | Van der Waals volume properties of the sequences surrounding succinylation sites. | [ |
| WAAC | Position-weighted amino acid composition of the sequences surrounding succinylation sites. | [ |
| AAC | Amino acid composition of the sequences surrounding succinylation sites. | [ |
| CKSAAP | Short sequence motif information in the sequences surrounding succinylation sites. | [ |
| PseAAC | Pseudo amino acid composition, a vectorized sequence-coupling model of the sequences surrounding succinylation sites. | [ |
| SF | Predicted structural features, reflecting the structural properties of the protein around succinylation sites. | [ |
| Binary | Position-specific information measured by a binary profile of the curated sequences. | [ |
| PSSM | Evolutionary information derived from the sequences. | [ |
| pCKSAAP | Sequence patterns and evolutionary information from the query sequences. | [ |
Data of Table 1 is used.
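Three of the simpler encodings listed above (Binary, AAC and CKSAAP) can be sketched as follows. This is a hedged illustration of the general idea behind each scheme and not the exact implementation of any reviewed tool: the 21-letter alphabet with `X` as padding, the function names, and the default `k_max = 2` are assumptions chosen for the example.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "X"  # "X" is an assumed padding/unknown symbol


def normalise(window: str) -> str:
    """Map any non-standard residue to the padding symbol."""
    return "".join(r if r in AMINO_ACIDS else "X" for r in window)


def binary_encoding(window: str) -> List[int]:
    """One-hot profile: 21 bits for each position in the window."""
    vector: List[int] = []
    for residue in normalise(window):
        vector.extend(1 if residue == symbol else 0 for symbol in ALPHABET)
    return vector


def aac_encoding(window: str) -> List[float]:
    """Frequency of each of the 20 amino acids (padding is ignored)."""
    residues = [r for r in window if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return [residues.count(aa) / total for aa in AMINO_ACIDS]


def cksaap_encoding(window: str, k_max: int = 2) -> List[float]:
    """Composition of residue pairs separated by k = 0..k_max residues."""
    window = normalise(window)
    pairs = ["".join(p) for p in product(ALPHABET, repeat=2)]
    vector: List[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        n_pairs = len(window) - k - 1
        for i in range(n_pairs):
            counts[window[i] + window[i + k + 1]] += 1
        vector.extend(counts[p] / max(n_pairs, 1) for p in pairs)
    return vector
```

For a window of length L, this sketch yields 21 × L binary features, 20 AAC features, and 441 × (k_max + 1) CKSAAP features, which shows how quickly the pair-based encoding grows compared with the others.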
Performance of five major types of features for the training and independent datasets.
| Methods | Training (RF) | Training (SVM) | Independent (RF) | Independent (SVM) |
|---|---|---|---|---|
| | | | | |
| pCKSAAP | 0.856 | 0.838 | 0.695 | 0.691 |
| CKSAAP | 0.816 | 0.831 | 0.677 | 0.663 |
| AAindex | 0.739 | 0.728 | 0.759 | 0.755 |
| Binary | 0.767 | 0.754 | 0.822 | 0.809 |
| PseAAC | 0.819 | 0.822 | 0.658 | 0.649 |
| | | | | |
| pCKSAAP | 0.789 | 0.792 | 0.638 | 0.634 |
| CKSAAP | 0.788 | 0.783 | 0.619 | 0.607 |
| AAindex | 0.712 | 0.722 | 0.658 | 0.666 |
| Binary | 0.713 | 0.698 | 0.665 | 0.647 |
| PseAAC | 0.759 | 0.743 | 0.612 | 0.614 |
| | | | | |
| pCKSAAP | 0.801 | 0.788 | 0.637 | 0.634 |
| CKSAAP | 0.777 | 0.767 | 0.646 | 0.651 |
| AAindex | 0.648 | 0.655 | 0.679 | 0.672 |
| Binary | 0.639 | 0.641 | 0.677 | 0.659 |
| PseAAC | 0.711 | 0.722 | 0.609 | 0.611 |
| | | | | |
| pCKSAAP | 0.769 | 0.761 | 0.679 | 0.684 |
| CKSAAP | 0.773 | 0.782 | 0.646 | 0.631 |
| AAindex | 0.719 | 0.721 | 0.633 | 0.619 |
| Binary | 0.689 | 0.674 | 0.619 | 0.607 |
| PseAAC | 0.733 | 0.734 | 0.608 | 0.603 |
| | | | | |
| pCKSAAP | 0.708 | 0.712 | 0.688 | 0.679 |
| CKSAAP | 0.689 | 0.675 | 0.664 | 0.671 |
| AAindex | 0.667 | 0.658 | 0.656 | 0.655 |
| Binary | 0.629 | 0.617 | 0.639 | 0.634 |
| PseAAC | 0.643 | 0.634 | 0.629 | 0.617 |
| | | | | |
| pCKSAAP | 0.882 | 0.869 | 0.776 | 0.772 |
| CKSAAP | 0.879 | 0.863 | 0.752 | 0.744 |
| AAindex | 0.742 | 0.733 | 0.759 | 0.749 |
| Binary | 0.741 | 0.745 | 0.798 | 0.787 |
| PseAAC | 0.790 | 0.768 | 0.699 | 0.675 |
| | | | | |
| pCKSAAP | 0.834 | 0.836 | 0.657 | 0.666 |
| CKSAAP | 0.826 | 0.822 | 0.655 | 0.638 |
| AAindex | 0.726 | 0.718 | 0.663 | 0.647 |
| Binary | 0.744 | 0.745 | 0.679 | 0.671 |
| PseAAC | 0.801 | 0.788 | 0.678 | 0.664 |
| | | | | |
| pCKSAAP | 0.842 | 0.836 | 0.649 | 0.642 |
| CKSAAP | 0.833 | 0.824 | 0.648 | 0.637 |
| AAindex | 0.753 | 0.765 | 0.644 | 0.629 |
| Binary | 0.729 | 0.722 | 0.637 | 0.631 |
| PseAAC | 0.801 | 0.783 | 0.678 | 0.658 |
| | | | | |
| pCKSAAP | 0.822 | 0.826 | 0.649 | 0.654 |
| CKSAAP | 0.821 | 0.811 | 0.638 | 0.634 |
| AAindex | 0.736 | 0.734 | 0.604 | 0.611 |
| Binary | 0.726 | 0.719 | 0.612 | 0.596 |
| PseAAC | 0.778 | 0.769 | 0.632 | 0.628 |
AUC values are used to assess the prediction performance.
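For readers unfamiliar with the metric, the AUC can be computed directly from labels and prediction scores. The sketch below uses the pairwise (Mann-Whitney) formulation; the function name `auc_score` is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Tied scores count as half a win (Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Three of the four positive/negative pairs are ranked correctly.
    print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Because the AUC depends only on how positives rank against negatives, it remains comparable between the 1:2 training sets and the far more imbalanced independent sets listed in the species statistics table.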
Figure 4. Performance comparison of generic succinylation site prediction models on an independent dataset.