Bi Zhao, Lukasz Kurgan.
Abstract
Intrinsically disordered regions (IDRs) carry out many cellular functions and vary in length and placement in protein sequences. This diversity leads to variations in the underlying compositional biases, which were demonstrated for short vs. long IDRs. We analyze compositional biases across four classes of disorder: fully disordered proteins; short IDRs; long IDRs; and binding IDRs. We identify three distinct biases: for the fully disordered proteins, for the short IDRs, and for the long and binding IDRs combined. We also investigate the compositional bias of putative disorder produced by leading disorder predictors and find that it is similar to the bias of the native disorder. Interestingly, the accuracy of disorder predictions across different methods is correlated with the correctness of the compositional bias of their predictions, highlighting the importance of the compositional bias. The predictive quality is relatively low for the disorder classes whose compositional bias differs most from the "generic" disorder bias, and much higher for the classes with the most similar bias. We discover that different predictors perform best across different classes of disorder. This suggests that no single predictor is universally best and motivates the development of new architectures that combine models that target specific disorder classes.
Keywords: amino acid bias; amino acids; disorder prediction; disorder propensity; disorder scale; intrinsic disorder; intrinsic disordered regions; intrinsically disordered proteins; predictive performance
Year: 2022 PMID: 35883444 PMCID: PMC9313023 DOI: 10.3390/biom12070888
Source DB: PubMed Journal: Biomolecules ISSN: 2218-273X
Summary of IDPs and IDR data in the CAID dataset.
| Protein Set | No. Proteins | No. IDRs | No. Disordered Residues | Median IDR Length | Average IDR Length |
|---|---|---|---|---|---|
| | 652 | 838 | 54,820 | 34 | 65.5 |
| | 56 | 57 | 9208 | 132 | 157.6 |
| | 124 | 148 | 1810 | 12 | 12.2 |
| | 71 | 77 | 14,935 | 139 | 193.9 |
| | 232 | 256 | 21,389 | 54 | 83.6 |
Figure 1. Compositional bias of intrinsic disorder measured for different collections of disordered proteins and regions. (A) TOP-IDP scale; (B) CAID dataset; (C) fully disordered proteins in CAID; (D) short IDRs in CAID; (E) long IDRs in CAID; and (F) disordered binding regions in CAID. The amino acids on the x axis are sorted according to the TOP-IDP scale, consistent with the original article (data for panel A were adapted from Ref. [15]), from the most order promoting to the most disorder promoting. The propensities are color-coded: green denotes statistically significant depletion; red denotes statistically significant enrichment; and gray denotes that the difference is not statistically significant at the p-value of 0.05. Values of the disorder propensities are shown at the top of the bars.
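The enrichment and depletion values plotted in such a figure can be illustrated with a small sketch. It assumes a commonly used definition of relative propensity, (f_disorder - f_reference) / f_reference, where f is the fraction of a given amino acid; the text above does not state the exact formula, so this definition, the function names, and the toy sequences are assumptions for illustration only. The significance test mentioned in the caption is not reproduced.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(sequences):
    """Return the fraction of each standard amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def relative_propensity(disordered, reference):
    """Assumed definition: (f_disorder - f_reference) / f_reference.

    Positive values indicate enrichment in disorder, negative values
    indicate depletion. Amino acids absent from the reference are skipped
    because the ratio is undefined for them.
    """
    dis = composition(disordered)
    ref = composition(reference)
    return {
        aa: (dis[aa] - ref[aa]) / ref[aa]
        for aa in AMINO_ACIDS
        if ref[aa] > 0
    }


# Toy (hypothetical) sequences, not data from the article.
toy_propensity = relative_propensity(["SSPP"], ["SPWW"])
```

In this toy input, serine and proline make up 50% of the disordered residues but only 25% of the reference, so both come out as enriched, while tryptophan is fully depleted.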
Figure 2. Kendall rank correlation coefficients (KCCs) between the AA biases for disorder in the overall CAID dataset, each of the four categories of IDRs (short, long, fully disordered and binding), and the TOP-IDP scale. The KCC values are color-coded from light blue for low values to dark blue for high values.
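A Kendall rank correlation between two amino acid bias profiles can be computed by counting concordant and discordant pairs. The sketch below implements the tau-b variant, which adjusts for ties; the text does not say which variant the authors used, so that choice is an assumption, and the function name and the example vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1  # the pair is ordered the same way in x and y
        else:
            discordant += 1  # the pair is ordered oppositely in x and y
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    if denom == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return (concordant - discordant) / denom


# Hypothetical propensity values for three amino acids in two collections.
tau = kendall_tau_b([0.1, 0.5, 0.9], [0.2, 0.8, 0.6])
```

With one value per amino acid, each input would hold 20 propensities in the same amino acid order, and the pairwise coefficients would fill a matrix like the one described in the caption.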
Figure 3. Kendall rank correlation coefficients (KCCs) between the AA biases for disorder in the overall CAID dataset and putative disorder generated by the top ten predictors from the CAID experiment. The KCC values are color-coded from light blue for low values to dark blue for high values. Disorder predictors are sorted alphabetically.
Predictive performance measured with AUC for the top ten disorder predictors on the CAID dataset and for the six types of IDPs from the CAID dataset. The bold font identifies the methods that secure the highest AUC for a given collection of IDRs. Predictors are sorted alphabetically. We computed the results in the first row and they reproduce the original results from the CAID article [49].
| Dataset | AUCpreD | AUCpreD-np | DisoMine | flDPlr | flDPnn | Predisorder | RawMSA | SPOT-Disorder1 | SPOT-Disorder2 | SPOT-Disorder-Single |
|---|---|---|---|---|---|---|---|---|---|---|
| CAID dataset | 0.757 | 0.751 | 0.765 | 0.793 | **0.814** | 0.747 | 0.780 | 0.744 | 0.760 | 0.757 |
| Fully disordered proteins | 0.475 | 0.505 | 0.612 | 0.687 | 0.666 | 0.636 | | 0.502 | 0.547 | 0.621 |
| Low disorder content with short IDRs | 0.715 | 0.698 | 0.654 | 0.703 | | 0.708 | 0.651 | 0.675 | 0.687 | 0.678 |
| Low disorder content with binding long IDRs | 0.669 | 0.664 | 0.649 | 0.723 | | 0.661 | 0.711 | 0.635 | 0.693 | 0.658 |
| Low disordered content with non-binding long IDRs | 0.801 | 0.785 | 0.747 | 0.802 | | 0.778 | 0.806 | 0.771 | 0.779 | 0.779 |
| High disordered content with binding IDRs | 0.732 | 0.718 | 0.686 | 0.732 | 0.731 | 0.735 | | 0.716 | 0.732 | 0.726 |
| High disordered content with non-binding IDRs | 0.824 | 0.815 | 0.799 | 0.726 | 0.737 | 0.816 | 0.811 | | 0.808 | 0.824 |
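AUC values of this kind are computed from per-residue disorder scores and binary disorder annotations. As a minimal sketch, the AUC equals the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered residue, with ties counted as one half. The function name and the toy labels and scores below are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for disordered residues, 0 for ordered residues.
    scores: predicted disorder propensities, higher means more disordered.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # disordered residue ranked above ordered one
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Toy (hypothetical) residue-level annotations and predictions.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking, which puts the sub-0.5 entries in the fully disordered row into perspective.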
Predictive performance measured with AUC, AUPR, MCC and F1 for the top ten disorder predictors and the meta-method on the CAID dataset. The bold font identifies the highest value for a given metric. “*” means that the difference between the best-performing meta-method and a given disorder predictor is statistically significant at p-value of 0.05. Methods are sorted by their AUC value.
| Predictors | AUC | AUPR | MCC | F1 |
|---|---|---|---|---|
| Meta-method that selects the best predictor for each disorder class | | | | |
| flDPnn | 0.814 * | 0.475 * | 0.358 * | 0.462 * |
| flDPlr | 0.793 * | 0.422 * | 0.323 * | 0.433 * |
| RawMSA | 0.780 * | 0.414 * | 0.288 * | 0.404 * |
| DisoMine | 0.765 * | 0.388 * | 0.244 * | 0.367 * |
| SPOT-Disorder2 | 0.760 * | 0.340 * | 0.200 * | 0.351 * |
| AUCpreD | 0.757 * | 0.479 * | 0.258 * | 0.399 * |
| SPOT-Disorder-Single | 0.757 * | 0.318 * | 0.221 * | 0.348 * |
| AUCpreD-np | 0.751 * | 0.428 * | 0.226 * | 0.349 * |
| Predisorder | 0.747 * | 0.325 * | 0.227 * | 0.359 * |
| SPOT-Disorder1 | 0.744 * | 0.268 * | 0.143 * | 0.284 * |
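The idea of a meta-method that selects the best predictor for each disorder class can be sketched as a lookup built from per-class AUC values. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the class and predictor names and AUC values are toy placeholders rather than results from the tables, and the sketch assumes the disorder class of each protein is already known, which the text does not specify.

```python
def select_best_per_class(auc_by_class):
    """Map each disorder class to the predictor with the highest AUC."""
    return {
        disorder_class: max(scores, key=scores.get)
        for disorder_class, scores in auc_by_class.items()
    }


def meta_predict(disorder_class, predictions, best_per_class):
    """Return the output of the predictor chosen for the given class.

    predictions: per-residue scores keyed by predictor name.
    """
    return predictions[best_per_class[disorder_class]]


# Toy (hypothetical) per-class AUC values, not taken from the article.
toy_auc_by_class = {
    "class_1": {"predictor_A": 0.70, "predictor_B": 0.65},
    "class_2": {"predictor_A": 0.60, "predictor_B": 0.72},
}
best = select_best_per_class(toy_auc_by_class)

# Toy per-residue scores for one protein assigned to class_2.
toy_predictions = {
    "predictor_A": [0.1, 0.2, 0.3],
    "predictor_B": [0.4, 0.5, 0.6],
}
chosen = meta_predict("class_2", toy_predictions, best)
```

Selecting on the same data used for evaluation gives an optimistic estimate, so a practical version would choose the predictors on a separate dataset.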