Chung-Tsai Su, Chien-Yu Chen, Yu-Yen Ou.
Abstract
BACKGROUND: A growing number of disordered regions have been discovered in protein sequences, and many of them have been found to be functionally significant. Previous studies show that the disordered regions of a protein can be predicted from its primary structure, the amino acid sequence. One widely accepted observation is that ordered regions are usually compositionally biased toward hydrophobic amino acids, whereas disordered regions are biased toward charged amino acids. Recent studies further show that employing evolutionary information, such as position-specific scoring matrices (PSSMs), improves the accuracy of protein disorder prediction. As more machine learning techniques are introduced to protein disorder detection, extracting more useful features with biological insight is attracting increasing attention.
Year: 2006 PMID: 16796745 PMCID: PMC1526762 DOI: 10.1186/1471-2105-7-319
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary of the datasets employed in this study
| | Training data | | Testing data | | |
| | PDB693 | D184 | R80 | U79 | P80 |
| Number of chains | 693 | 184 | 80 | 79 | 80 |
| Number of ordered regions | 1357 | 257 | 151 | 0 | 80 |
| Number of disordered regions | 1739 | 274 | 183 | 79 | 0 |
| Number of residues in the ordered regions | 201937 | 55164 | 29909 | 0 | 16568 |
| Number of residues in the disordered regions | 52663 | 27116 | 3649 | 14462 | 0 |
| Total residues in the dataset | 254600 | 82280 | 33558 | 14462 | 16568 |
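The class balance implied by the table above can be checked with a few lines of Python. This is only an illustrative sketch: the residue counts are copied from the table, while the names `DATASETS` and `disordered_fraction` are hypothetical helpers introduced here, not part of the study.

```python
# Residue counts (ordered, disordered) copied from the dataset summary table.
DATASETS = {
    "PDB693": (201937, 52663),
    "D184": (55164, 27116),
    "R80": (29909, 3649),
    "U79": (0, 14462),
    "P80": (16568, 0),
}


def disordered_fraction(ordered: int, disordered: int) -> float:
    """Fraction of residues that lie in disordered regions."""
    return disordered / (ordered + disordered)


for name, (ordered, disordered) in DATASETS.items():
    total = ordered + disordered
    frac = disordered_fraction(ordered, disordered)
    print(f"{name}: {total} residues, {frac:.1%} disordered")

# U79 holds only disordered residues and P80 only ordered ones, so a
# two-class measure needs both sets pooled together.
pooled = disordered_fraction(DATASETS["P80"][0], DATASETS["U79"][1])
print(f"U79 + P80 pooled: {pooled:.1%} disordered")
```

Because U79 contains no ordered residues and P80 no disordered ones, measures that need both classes cannot be computed on either set alone, which is consistent with the two sets being reported together in the later figures.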
The definition of measures employed in this study
| Measure | Abbreviation | Equation * |
| Sensitivity (recall) | | TP/(TP+FN) |
| Specificity | | TN/(TN+FP) |
| Matthews' correlation coefficient | | (TP×TN-FP×FN)/sqrt((TP+FP)×(TN+FN)×(TP+FN)×(TN+FP)) |
| Probability excess | | (TP×TN-FP×FN)/((TP+FN)×(TN+FP)) |
* The definition of the abbreviations used: TP is the number of correctly classified disordered residues; FP is the number of ordered residues incorrectly classified as disordered; TN is the number of correctly classified ordered residues; and FN is the number of disordered residues incorrectly classified as ordered.
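The four measures defined above can be written out directly from the confusion-matrix counts. The sketch below is a minimal illustration: the function name `disorder_measures` and the example counts are hypothetical, and disordered residues are treated as the positive class, as in the footnote.

```python
from math import sqrt


def disorder_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the four measures from residue-level counts.

    Disordered residues are the positive class. Both classes must be
    present, otherwise sensitivity or specificity is undefined.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one disordered and one ordered residue")

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    numerator = tp * tn - fp * fn
    mcc_denominator = sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    # Convention assumed here: report 0 when the MCC denominator vanishes.
    mcc = numerator / mcc_denominator if mcc_denominator else 0.0

    probability_excess = numerator / ((tp + fn) * (tn + fp))

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "probability_excess": probability_excess,
    }


# Hypothetical counts, chosen only to exercise the formulas.
example = disorder_measures(tp=60, fp=50, tn=150, fn=40)
print(example)
```

Expanding the definition shows that probability excess equals sensitivity + specificity - 1, which gives a quick consistency check on any reported row of results.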
The performance of each property in the univariate analysis on training data
| Property | Sensitivity | Specificity | Matthews' correlation coefficient | Probability excess |
| | 0.633 | 0.717 | 0.309 | 0.350 |
| | 0.519 | 0.723 | 0.217 | 0.241 |
| | 0.603 | 0.703 | 0.269 | 0.306 |
| | 0.604 | 0.731 | 0.299 | 0.335 |
| | 0.553 | 0.742 | 0.268 | 0.295 |
| | 0.555 | 0.688 | 0.214 | 0.243 |
| | 0.604 | 0.720 | 0.288 | 0.324 |
| | 0.538 | 0.660 | 0.173 | 0.198 |
| | 0.573 | 0.662 | 0.204 | 0.235 |
| | 0.583 | 0.667 | 0.218 | 0.250 |
| | 0.571 | 0.664 | 0.204 | 0.235 |
| | 0.603 | 0.706 | 0.272 | 0.309 |
| | 0.528 | 0.732 | 0.234 | 0.259 |
| | 0.577 | 0.675 | 0.220 | 0.252 |
The best performance among each property group is highlighted with bold font.
Figure 1. The relation of the selected properties after the first level of redundancy analysis.
Performance evaluation on Hydrophobic, Aliphatic, and Aromatic
| Property | Sensitivity | Specificity | Matthews' correlation coefficient | Probability excess |
| | 0.640 | 0.751 | 0.350 | 0.391 |
| | 0.601 | 0.748 | 0.314 | 0.349 |
| | 0.602 | 0.732 | 0.298 | 0.334 |
The best performance is highlighted with bold font.
Performance evaluation on Polar, Charged, Positive, and Negative
| Property | Sensitivity | Specificity | Matthews' correlation coefficient | Probability excess |
| | 0.614 | 0.707 | 0.282 | 0.320 |
| | 0.599 | 0.678 | 0.242 | 0.277 |
| | 0.586 | 0.696 | 0.248 | 0.282 |
| | 0.607 | 0.715 | 0.284 | 0.321 |
The best performance is highlighted with bold font.
The result of the stepwise feature selection
| Property | Sensitivity | Specificity | Matthews' correlation coefficient | Probability excess |
| | 0.646 | 0.767 | 0.372 | 0.412 |
| | 0.656 | 0.774 | 0.390 | 0.430 |
| | 0.652 | 0.783 | 0.396 | 0.435 |
The best performance is highlighted with bold font.
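This record does not spell out the selection procedure itself, so the following is only a generic sketch of greedy forward stepwise selection, not the hybrid mechanism used in the study. The function `forward_stepwise`, the `min_gain` parameter, and the toy scoring function are all assumptions made for illustration; in practice the score would come from training and evaluating a predictor on each candidate feature set.

```python
from typing import Callable, List, Sequence, Tuple


def forward_stepwise(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily add the feature that most improves the score.

    Stops when no remaining feature improves the score by more than
    `min_gain`. Returns the selected features and their score.
    """
    selected: List[str] = []
    best = float("-inf")
    remaining = list(candidates)

    while remaining:
        # Evaluate every one-feature extension of the current set.
        trials = [(score(selected + [c]), c) for c in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score

    return selected, best


# Toy score (made-up weights): each feature adds some value, and every
# feature carries a fixed cost, so weak features are not worth adding.
TOY_WEIGHTS = {"A": 0.30, "B": 0.20, "C": 0.05}


def toy_score(features: List[str]) -> float:
    return sum(TOY_WEIGHTS[f] for f in features) - 0.06 * len(features)


selected, best = forward_stepwise(list(TOY_WEIGHTS), toy_score)
print(selected, round(best, 2))
```

With a measure such as the Matthews' correlation coefficient as the score, each row of a stepwise results table would correspond to one accepted step of the loop.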
Figure 2. Comparison of using different feature sets on testing data R80.
Figure 3. Comparison of using different feature sets on testing data U79 and P80.
Figure 4. Comparison of using different feature sets on testing data R80, U79, and P80.
Figure 5. Comparing the performance of thirteen disorder prediction packages on testing data R80.
Figure 6. Comparing the performance of thirteen disorder prediction packages on testing data U79 and P80.
Figure 7. Comparing the performance of thirteen disorder prediction packages on testing data R80, U79, and P80.
The statistics of the property patterns with three identical residues in the ordered and disordered regions
| Patterns | # of matches in # subsequences | | | | Score |
| | In ordered regions | | In disordered regions | | |
| | # matches | # seqs | # matches | # seqs | |
| [ | 71229 | 2087 | 16534 | 1555 | 0.44 |
| [ | 14608 | 1746 | 2297 | 687 | 0.35 |
| [ | 4766 | 1224 | 2649 | 789 | 0.35 |
| [ | 38084 | 1965 | 16390 | 1585 | 0.41 |
| [ | 2530 | 927 | 762 | 372 | 0.50 |
| [ | 8808 | 1459 | 5705 | 1060 | 0.31 |
| [ | 37886 | 1949 | 15577 | 1630 | 0.42 |
| [ | 2050 | 824 | 569 | 284 | 0.48 |
| [ | 9742 | 1504 | 5662 | 1169 | 0.34 |
| [ | 467 | 302 | 54 | 47 | 0.28 |
| [ | 219 | 159 | 20 | 17 | 0.24 |
| [ | 5 | 5 | 4 | 4 | 0.27 |
Improvements in discriminating power are highlighted with bold font.
Figure 8. The procedure of preparing the feature set for training and testing data.
Conventional Amino Acid Properties (Parent Properties)
| Property | I | L | V | C | A | G | M | F | Y | W | H | K | R | E | Q | D | N | S | T | P |
| Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | ||||||||
| Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | ||||||||||
| Y | Y | Y | Y | Y | Y | Y | Y | Y | ||||||||||||
| Y | Y | Y | ||||||||||||||||||
| Y | Y | Y | Y | |||||||||||||||||
| Y | Y | Y | ||||||||||||||||||
| Y | Y | |||||||||||||||||||
| Y | ||||||||||||||||||||
| Y | Y | Y | Y | Y | ||||||||||||||||
| Y | Y | Y | Y |
Figure 9. The propensity for order and the frequency of each amino acid in the training set.
Order/Disorder-based Amino Acid Properties (Child Properties)
| Property | I | L | V | C | A | G | M | F | Y | W | H | K | R | E | Q | D | N | S | T | P |
| Y | Y | Y | Y | Y | Y | Y | Y | |||||||||||||
| Y | Y | Y | Y | Y | ||||||||||||||||
| Y | Y | Y | Y | Y | ||||||||||||||||
| Y | Y | Y | Y | Y | Y | |||||||||||||||
| Y | Y | Y | Y | |||||||||||||||||
| Y | Y | Y | Y | Y | ||||||||||||||||
| Y | Y | Y | ||||||||||||||||||
| Y | Y | Y | ||||||||||||||||||
| Y | ||||||||||||||||||||
| Y | ||||||||||||||||||||
| Y | Y | |||||||||||||||||||
| Y | Y | |||||||||||||||||||
| Y | ||||||||||||||||||||
| Y | ||||||||||||||||||||
| Y | Y | Y | Y | |||||||||||||||||
| Y | ||||||||||||||||||||
| Y | Y | Y |
* Aliphatic, Negative, and Proline are equivalent to Aliphatic, Negative, and Proline in Table 8, respectively.
# Aromatic, Positive, Proline, Charged, and Tiny each comprise only a single type of amino acid.
Figure 10. The flowchart of the hybrid feature selection mechanism.