Petr Klus, Benedetta Bolognesi, Federico Agostini, Domenica Marchese, Andreas Zanzoni, Gian Gaetano Tartaglia.
Abstract
MOTIVATION: The recent shift towards high-throughput screening is posing new challenges for the interpretation of experimental results. Here we propose the cleverSuite approach for large-scale characterization of protein groups. DESCRIPTION: The central part of the cleverSuite is the cleverMachine (CM), an algorithm that performs statistics on protein sequences by comparing their physico-chemical propensities. The second element, the cleverClassifier (CC), builds on the models generated by the CM to allow classification of new datasets.
Year: 2014 PMID: 24493033 PMCID: PMC4029037 DOI: 10.1093/bioinformatics/btu074
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The cleverSuite algorithm. The CM estimates the ability of physico-chemical properties to discriminate two input datasets. The statistical analysis reports the coverage and strength of each individual property with respect to randomized sets. An exhaustive property-combination search is performed to assess the significance of the dataset separation. The CC uses the models generated by the CM to assign new datasets to either the positive or the negative set. Individual physico-chemical profiles are reported along with the discrimination statistics.
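As a rough illustration of what "strength with respect to randomized sets" could look like for a single property, the sketch below scores each sequence with one physico-chemical scale and compares the separation of the two datasets against label-shuffled sets. This is not the CM procedure, which is not specified in this record: the choice of the Kyte-Doolittle hydropathy scale, the difference-of-means statistic, the shuffling scheme, and the names `mean_propensity` and `separation_z_score` are all assumptions made for illustration.

```python
import random
import statistics

# Kyte-Doolittle hydropathy scale (one of many possible property scales;
# its use here is an assumption, not taken from the cleverSuite).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_propensity(seq, scale=KYTE_DOOLITTLE):
    """Average scale value over the residues of one sequence."""
    values = [scale[aa] for aa in seq if aa in scale]
    return sum(values) / len(values)


def separation_z_score(positive, negative, n_shuffles=200, seed=0):
    """Z-score of the observed difference in mean propensity between
    two sets, relative to sets obtained by shuffling the labels."""
    rng = random.Random(seed)
    scores = [mean_propensity(s) for s in positive + negative]
    n_pos = len(positive)

    def diff(vals):
        return statistics.mean(vals[:n_pos]) - statistics.mean(vals[n_pos:])

    observed = diff(scores)
    shuffled = []
    for _ in range(n_shuffles):
        perm = scores[:]
        rng.shuffle(perm)  # random reassignment of sequences to the two sets
        shuffled.append(diff(perm))
    return (observed - statistics.mean(shuffled)) / statistics.stdev(shuffled)


# Toy sequences (hypothetical): hydrophilic "positive" vs hydrophobic "negative".
toy_positive = ["DEKRDEKR", "NQDEKRST", "KKDDEESS", "RRNNQQDD"]
toy_negative = ["AILVAILV", "FFLLIIVV", "MLIVFAAC", "VVIILLAA"]
z = separation_z_score(toy_positive, toy_negative)
```

A strongly negative `z` here only says that the toy positive set is less hydrophobic than expected by chance under this particular statistic; a real analysis would repeat this over many scales and predictors.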
Fig. 2. Grouped property view. Example of properties grouped by class assignment and color (each property is described by 10 predictors). The E. coli solubility analysis is used as an illustrative case: soluble proteins (positive case) are more disordered and less hydrophobic/aggregation-prone. Low-significance properties (Z-score < Zth; P > 0.01; Section 2) are devoid of color. In the webserver, this view is interactive and shows information about each scale after clicking (see also Supplementary Fig. S1).
Fig. 3. Correlation between coverage and AUC. For the five cases presented in this study, AUC and coverage of individual physico-chemical properties show a correlation r > 0.85. In this example, we use human RNA-binding proteins (compared with lysate; r = 0.95).
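The two quantities in this caption can be sketched generically: an AUC for one property computed from per-protein scores, and a Pearson correlation r across properties. The snippet below uses the standard rank-based (Mann-Whitney) definition of AUC and the textbook formula for r. How the cleverSuite defines coverage is not given in this record, so the coverage values in the example are placeholders, and the function names `auc` and `pearson_r` are hypothetical.

```python
import math


def auc(pos_scores, neg_scores):
    """Probability that a random positive scores higher than a random
    negative (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Placeholder per-property values (not taken from the paper).
example_coverage = [0.55, 0.62, 0.70, 0.81]
example_auc = [0.58, 0.60, 0.74, 0.83]
r = pearson_r(example_coverage, example_auc)
```

The pairwise AUC loop is quadratic in the number of proteins, which is fine for a sketch; a rank-based implementation would be preferable for large datasets.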
Fig. 4. Scale combinations and statistics. (A) Relationship between the number of combined scales and the coverages for both positive (blue bars) and negative (green bars) datasets. (B) Statistics for each scale combination and its individual members. In the webserver, clicking through the combination titles reveals the scales they contain and detailed statistics (a three-scale combination is shown; the E. coli solubility analysis is used as an example). This view is used to summarize the results of both CM and CC.
cleverSuite performances
| Dataset | cleverSuite ACC (a) | cleverSuite TPR (b) | cleverSuite TNR (b) | Reference method (c) | Reference TPR | Reference TNR |
|---|---|---|---|---|---|---|
| Alpha-beta | 97.9 | 90.4 | 93.2 | RePROF | 92.6 | 72.0 |
| Disorder | 86.1 | 84.5 | 73.6 | FoldIndex | 62.9 | 64.7 |
| Solubility | 89.8 | 84.7 | 60.5 | PROSO II | 78.5 | 74.0 |
| Chaperones | 81.6 | 75.4 | 60.0 | Limbo | 100.0 | 22.5 |
| mRNA | 84.3 | 72.9 | 79.2 | RNApred | 82.5 | 52.8 |
(a) A 10-fold cross-validation accuracy for CM models (ACC is accuracy).
(b) Independent validation performances for CC.
(c) Performance comparison with algorithms reported in the literature. TPR (true positive rate) and TNR (true negative rate) are calculated on the same sets used to validate CC. Links to full results are given in Supplementary Table S1.
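For readers less familiar with the metrics in the table, the sketch below computes ACC, TPR and TNR from true and predicted binary labels using their standard definitions. The function name `classification_rates` and the example labels are hypothetical and unrelated to the datasets in the table.

```python
def classification_rates(y_true, y_pred):
    """Return (ACC, TPR, TNR) for binary labels, where 1 = positive set
    and 0 = negative set."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / (tp + tn + fp + fn)  # fraction of correct calls
    tpr = tp / (tp + fn)                   # true positive rate (sensitivity)
    tnr = tn / (tn + fp)                   # true negative rate (specificity)
    return acc, tpr, tnr


# Made-up labels: 4 positives and 6 negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
calls = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
acc, tpr, tnr = classification_rates(labels, calls)
```

Reporting TPR and TNR separately matters when a method favors one class: a classifier that labels everything positive reaches a TPR of 100% with a TNR near zero.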