Kai-Yao Huang, Hui-Ju Kao, Justin Bo-Kai Hsu, Shun-Long Weng, Tzong-Yi Lee.
Abstract
BACKGROUND: Glutarylation, the addition of a glutaryl group (five carbons) to a lysine residue of a protein molecule, is an important post-translational modification and plays a regulatory role in a variety of physiological and biological processes. As the number of experimentally identified glutarylated peptides increases, it becomes imperative to investigate substrate motifs to enhance the study of protein glutarylation. We carried out a bioinformatics investigation of glutarylation sites based on amino acid composition using a public database containing information on 430 non-homologous glutarylation sites.
Keywords: Intrinsic interdependence; Maximal dependence decomposition; Protein glutarylation
Year: 2019 PMID: 30717647 PMCID: PMC7394328 DOI: 10.1186/s12859-018-2394-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Data statistics of the training and testing datasets after removal of homologous sequences using the CD-HIT program
| Sequence identity cut-off | Number of glutarylation sites | Number of non-glutarylation sites |
|---|---|---|
| Raw Data | 715 | 4145 |
| 90% | 667 | 3675 |
| 80% | 631 | 3317 |
| 70% | 597 | 3037 |
| 60% | 556 | 2767 |
| 50% | 534 | 2539 |
| 40% | 476 | 1918 |
| Training data | 430 | 860 |
| Independent testing data | 46 | 92 |
Fig. 1 Flowchart of protein glutarylation site prediction in this work. The workflow consists of data collection and preprocessing, feature investigation, determination of intrinsic interdependence between positions of substrate sites, detection of motif signatures, model training and evaluation, and independent testing
Fig. 2 Composition of amino acids around glutarylation sites. a Comparison of AAC between 430 positive and 860 negative sequences. b Position-specific AAC of 430 glutarylated sequences. c Comparison of position-specific AAC between glutarylated and non-glutarylated sequences based on TwoSampleLogo analysis
Five-fold cross-validation results for SVM models trained with various features
| Training features | Sensitivity | Specificity | Accuracy | MCC |
|---|---|---|---|---|
| Amino Acid Composition (AAC) | 62.0% | 61.3% | 61.6% | 0.22 |
| Amino Acid Pair Composition (AAPC) | 61.3% | 48.1% | 52.5% | 0.09 |
| CKSAAPª, k = 1 | 62.0% | 51.7% | 55.1% | 0.13 |
| CKSAAPª, k = 2 | 58.8% | 49.8% | 52.8% | 0.08 |
| CKSAAPª, k = 3 | 66.0% | 41.2% | 49.4% | 0.07 |
ª CKSAAP: composition of k-spaced amino acid pairs
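The composition features named in the table above can be illustrated with a minimal Python sketch. It assumes each site is represented as a fixed-length peptide window over the 20 standard residues; the window length, the example peptide, and the function names `aac` and `cksaap` are hypothetical and not taken from the paper. With `k = 0` the pair feature reduces to adjacent-pair composition (AAPC).

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(window):
    """Amino acid composition: fraction of each standard residue in the window."""
    residues = [r for r in window if r in AMINO_ACIDS]
    total = len(residues) or 1  # avoid division by zero on empty input
    return {a: residues.count(a) / total for a in AMINO_ACIDS}


def cksaap(window, k):
    """Composition of k-spaced pairs: residues separated by exactly k positions."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    n_pairs = 0
    for i in range(len(window) - k - 1):
        pair = window[i] + window[i + k + 1]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
            n_pairs += 1
    total = n_pairs or 1
    return {p: c / total for p, c in counts.items()}


# Hypothetical window centred on a lysine (K); not a sequence from the dataset.
example = "AGLSEKAVTRD"
features = list(aac(example).values()) + list(cksaap(example, 1).values())
print(len(features))  # 20 AAC values followed by 400 pair frequencies
```

Each feature vector of this kind could then be passed to an SVM; the kernel and parameters used in the study are not given in this record.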
Fig. 3 Comparison of ROC curves among the SVM models trained using various features based on five-fold cross-validation
Fig. 4 The intrinsic interdependence between positions around glutarylation sites. The number shown in each circle is a position relative to the glutarylation site. A higher value on an edge indicates a more significant dependence of AAC between the two connected positions
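Maximal dependence decomposition conventionally scores the dependence between two positions with a chi-square test on the residues observed at those positions. The sketch below computes a Pearson chi-square statistic for one pair of columns in a set of aligned windows. It is an assumption that this matches the edge values in Fig. 4: the exact statistic and any grouping of amino acids into classes are not stated in this record, and the function name `chi_square` and the toy sequences are hypothetical.

```python
from collections import Counter


def chi_square(seqs, i, j):
    """Pearson chi-square statistic for residues at columns i and j."""
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)  # observed joint counts
    row = Counter(s[i] for s in seqs)                  # marginal counts at column i
    col = Counter(s[j] for s in seqs)                  # marginal counts at column j
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            observed = pair_counts.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


# Toy aligned windows (hypothetical): columns 0 and 1 vary together.
dependent = ["AK", "AK", "GR", "GR"]
independent = ["AK", "AR", "GK", "GR"]
print(chi_square(dependent, 0, 1))    # 4.0
print(chi_square(independent, 0, 1))  # 0.0
```

A larger statistic indicates a stronger association between the two columns; in practice residues are often grouped into physicochemical classes first to keep the contingency table small.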
Fig. 5 Hierarchical MDD clustering process for detecting motif signatures in 430 glutarylated sequences
Fig. 6 ROC curves of six SVM models trained from MDD-clustered subgroups based on five-fold cross-validation
Five-fold cross-validation results for six SVM models trained from MDD-identified motifs
| Dataset | Number of positive data | Number of negative data | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|
| All Data | 430 | 860 | 62.0% | 61.3% | 61.6% | 0.22 |
| Glutar 1 | 102 | 204 | 80.6% | 63.7% | 69.4% | 0.42 |
| Glutar 2 | 59 | 120 | 64.5% | 67.7% | 66.7% | 0.31 |
| Glutar 3 | 60 | 120 | 60.0% | 59.2% | 59.5% | 0.18 |
| Glutar 4 | 55 | 110 | 66.1% | 60.2% | 62.1% | 0.25 |
| Glutar 5 | 62 | 121 | 75.8% | 59.7% | 65.1% | 0.33 |
| Glutar 6 | 92 | 185 | 72.1% | 62.5% | 65.7% | 0.33 |
| Combined result | 430 | 860 | 67.7% | 61.9% | 63.8% | 0.28 |
Fig. 7 A case study of glutarylation site prediction on mouse aspartate aminotransferase (UniProt ID: AATM_MOUSE)
Performance comparison between the proposed methods and an existing tool (GlutPred) on the independent testing dataset
| Methods | TP | FN | TN | FP | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|---|
| Single SVM | 28 | 18 | 63 | 29 | 60.9% | 68.5% | 65.9% | 0.28 |
| Integrated SVM | 30 | 16 | 68 | 24 | 65.2% | 73.9% | 71.0% | 0.38 |
| GlutPred | 25 | 21 | 84 | 8 | 54.3% | 91.3% | 79.0% | 0.50 |
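The Sn, Sp, Acc, and MCC columns follow from the TP, FN, TN, and FP counts by the standard definitions. As a check, this short Python sketch recomputes them from the counts in the table above; the function name `metrics` is illustrative.

```python
from math import sqrt


def metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy, and Matthews correlation coefficient."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Counts (TP, FN, TN, FP) copied from the independent-testing table.
results = {
    "Single SVM": (28, 18, 63, 29),
    "Integrated SVM": (30, 16, 68, 24),
    "GlutPred": (25, 21, 84, 8),
}

for name, counts in results.items():
    sn, sp, acc, mcc = metrics(*counts)
    print(f"{name}: Sn={sn:.1%} Sp={sp:.1%} Acc={acc:.1%} MCC={mcc:.2f}")
```

The recomputed values agree with the table; for example, the integrated SVM gives Sn = 30/46 = 65.2% and MCC = 0.38.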