Sarah A Middleton, Joseph Illuminati, Junhyong Kim.
Abstract
Recognition of protein structural fold is the starting point for many structure prediction tools and protein function inference. Fold prediction is computationally demanding, and recognizing novel folds is difficult, so the majority of proteins have not been annotated for fold classification. Here we describe a new machine learning approach using a novel feature space that can be used for accurate recognition of all 1,221 currently known folds and inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Our method can be applied to de novo fold prediction of entire proteomes and can identify candidate novel fold families.
Year: 2017 PMID: 28406174 PMCID: PMC5390313 DOI: 10.1038/srep46321
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Overview of PESS construction.
Training sequences of known fold are threaded against a set of structure templates, and the resulting threading scores act as coordinates within a structural feature space (the PESS). A classifier can then be trained to recognize the subspace occupied by each fold in the PESS. Different colors indicate the fold of each sequence and are shown here only for visualization.
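The mapping in Figure 1 can be sketched as below: each sequence becomes a vector with one coordinate per template. This is only an illustration under stated assumptions. The threading program is not shown, `toy_score` is a hypothetical stand-in for a real threading score, and the templates are made-up strings rather than structures.

```python
from typing import Callable, List, Sequence


def pess_coordinates(
    sequence: str,
    templates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> List[float]:
    """Return one coordinate per template: the score of `sequence`
    against that template. The template order fixes the axes."""
    return [score_fn(sequence, template) for template in templates]


def toy_score(sequence: str, template: str) -> float:
    """HYPOTHETICAL placeholder, not a real threading score:
    fraction of template residues that also occur in the sequence."""
    if not template:
        return 0.0
    return sum(1 for residue in template if residue in sequence) / len(template)


# Hypothetical templates and training sequences, for illustration only.
templates = ["ACDE", "KLMN", "WWYY"]
training = {"seqA": "ACDEAC", "seqB": "KLMNKL"}

# Each training sequence is now a point in a 3-dimensional feature space.
pess = {
    name: pess_coordinates(seq, templates, toy_score)
    for name, seq in training.items()
}
```

In practice `score_fn` would wrap whatever threading tool produces the scores; the rest of the pipeline only needs the resulting vectors.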
Figure 2. Classification and performance using the PESS.
(a and b) Two different methods of classification using the PESS. Colored circles represent training examples within the PESS and are colored by fold. (a) In 1NN classification, the PESS distance between the query (gray circle) and all training examples is computed and the query is assigned to the fold of the nearest training example (dark gray arrow). (b) In 1-vs-all SVM classification, the PESS distance between the query and each of the fold-level hyperplanes (dotted lines) is computed, and the query is assigned to the fold that gives the best score (dark gray arrow), based on signed distance from the fold’s hyperplane. (c) Precision and (d) recall measures were computed for each fold separately after 1NN classification of the SCOP-25 set using the PESS and plotted against the number of training examples for each fold. Marginal histograms show the distribution of folds along each axis.
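The two decision rules in panels (a) and (b), and the per-fold measures in panels (c) and (d), can be sketched as follows. This is a minimal illustration, not the authors' implementation: Euclidean distance and linear hyperplanes are assumptions, and the hyperplanes are passed in as already-trained `(w, b)` pairs rather than fitted here.

```python
import math
from collections import Counter


def nearest_neighbor_fold(query, train_points, train_folds):
    """1NN: return the fold of the closest training example and its distance.
    Euclidean distance is an assumption of this sketch."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(query, train_points[i]))
    return train_folds[best], math.dist(query, train_points[best])


def one_vs_all_fold(query, hyperplanes):
    """1-vs-all: `hyperplanes` maps fold -> (w, b), assumed already trained.
    The query goes to the fold with the largest signed distance
    (w . x + b) / ||w|| from that fold's hyperplane."""
    def signed_distance(w, b):
        norm = math.sqrt(sum(wi * wi for wi in w))
        return (sum(wi * xi for wi, xi in zip(w, query)) + b) / norm

    return max(hyperplanes, key=lambda fold: signed_distance(*hyperplanes[fold]))


def per_fold_precision_recall(true_folds, predicted_folds):
    """Return {fold: (precision, recall)} for every fold present in the truth.
    Precision is None when the fold was never predicted."""
    true_pos = Counter(t for t, p in zip(true_folds, predicted_folds) if t == p)
    n_true = Counter(true_folds)
    n_pred = Counter(predicted_folds)
    result = {}
    for fold in n_true:
        precision = true_pos[fold] / n_pred[fold] if n_pred[fold] else None
        recall = true_pos[fold] / n_true[fold]
        result[fold] = (precision, recall)
    return result
```

Computing the measures per fold, rather than pooled, is what allows them to be plotted against the number of training examples for each fold.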
Overall % accuracy on three benchmarks using 10-fold cross validation.
| Method | EDD | F95 | F194 |
|---|---|---|---|
| Dehzangi | 88.2 | - | - |
| Saini | 86.6 | - | - |
| Lyons | 93.8 | - | - |
| Zakeri | 88.8/96.9 | - | - |
| Yang and Chen | 90.0 | 82.4 | 79.6 |
| Wei | 92.6 | 83.6 | 78.2 |
| This method (PESS), 1NN – filtered | 89.9 | 84.6 | 82.6 |
| This method (PESS), 1NN – all | 90.6 | 84.6 | 82.5 |
| This method (PESS), SVM – filtered | 95.9 | 92.3 | 90.7 |
| This method (PESS), SVM – all | 95.7 | 91.9 | 90.5 |
a. Using a slightly modified EDD set with 21 additional domains (3418 total) (see Methods).
b. With Interpro functional annotations.
c. Using modified versions of EDD (3625 domains), F95 (6791 domains), and F194 (8525 domains) (see Methods).
d. Using a filtered version of the benchmarks, from which any example with >25% pairwise identity to a template was removed (based on the sequence of the domain on which the template was based).
e. Using the full benchmark sets. Some training or testing sequences may be similar or identical to templates.
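The 10-fold cross-validation protocol behind the accuracies above can be sketched as follows, using a 1NN classifier. The shuffling, the seed, the unstratified split, and Euclidean distance are all assumptions of this sketch; the text here does not state how the partitions were drawn.

```python
import math
import random


def one_nn(query, points, labels):
    """Return the label of the training point closest to `query`."""
    i = min(range(len(points)), key=lambda j: math.dist(query, points[j]))
    return labels[i]


def cross_validated_accuracy(points, labels, k=10, seed=0):
    """Shuffle the examples, split them into k partitions, hold each
    partition out once, and return overall % accuracy.
    Requires at least two examples so every training set is non-empty."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)
    correct = 0
    for part in range(k):
        test_idx = order[part::k]          # every k-th shuffled example
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_pts = [points[i] for i in train_idx]
        train_lab = [labels[i] for i in train_idx]
        correct += sum(
            one_nn(points[i], train_pts, train_lab) == labels[i]
            for i in test_idx
        )
    return 100.0 * correct / len(points)
```

Note that a fold with a single example can never be classified correctly when that example is held out, so how rare folds are handled matters when comparing numbers across benchmarks.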
Figure 3. Fold classification of the human proteome.
(a) Overview of classification process. Full length human protein sequences were split at predicted domain boundaries to create one or more separate domain sequences per protein (Drew et al. 5). Domain sequences were mapped to the PESS and classified by 1NN classification. A threshold was applied to the nearest neighbor distance (dotted circle), whereby only domains with a nearest neighbor closer than the threshold distance were classified. (b) PCA projection of fold centroids within the PESS, scaled by number of human domains predicted to belong to that fold. Centroids were calculated based on the location of each fold’s training examples within the PESS and are colored by SCOP class. (c) Top ten folds by number of human domain predictions. (d) Top ten likely RNA-binding folds, ranked by number of confirmed RNA-binding domains (RBDs). Confirmed RBDs were determined based on matches to a curated list of RNA-binding related Pfam families.
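The thresholded 1NN step in panel (a) and the fold centroids in panel (b) can be sketched as below. This is an illustration under assumptions: Euclidean distance is assumed, and the threshold of 30 in the usage note is inferred from the Figure 4 caption, which describes unclassified domains as those with nearest-neighbor distance ≥30. The PCA projection is not shown.

```python
import math
from collections import defaultdict


def classify_with_threshold(query, train_points, train_folds, threshold):
    """Return (fold, distance). The fold is None when the nearest training
    example is not closer than `threshold`, i.e. the domain stays unclassified."""
    i = min(range(len(train_points)),
            key=lambda j: math.dist(query, train_points[j]))
    distance = math.dist(query, train_points[i])
    fold = train_folds[i] if distance < threshold else None
    return fold, distance


def fold_centroids(train_points, train_folds):
    """Return {fold: centroid}, the coordinate-wise mean of each fold's
    training examples."""
    groups = defaultdict(list)
    for point, fold in zip(train_points, train_folds):
        groups[fold].append(point)
    return {
        fold: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for fold, pts in groups.items()
    }
```

For example, `classify_with_threshold(q, X, y, threshold=30)` returns `(None, d)` for a query whose nearest neighbor lies at distance `d >= 30`; such domains are the ones examined further as candidates for unclassified or novel folds.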
Figure 4. Analysis of unclassified human domains.
(a) t-SNE projection of human domains with nearest-neighbor distance ≥30. Colors indicate cluster assignment by DBSCAN; unclustered domains are shown in black. Dotted lines show related groups of domains. (b) Overview of the EVC2 protein product, Limbin, and its known structure elements. The location of the domain with a putative novel fold is shown in yellow.
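The cluster assignment in panel (a) uses DBSCAN. A minimal pure-Python version of that algorithm is sketched below to show how points end up clustered or unclustered. The `eps` and `min_pts` values, Euclidean distance, and the space in which the clustering is run are assumptions here, since the caption does not give them; the t-SNE projection is not shown.

```python
import math


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point: 0, 1, ... for clusters
    and -1 for unclustered points (noise)."""
    def neighbors(i):
        # Neighborhood includes the point itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)      # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1             # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Illustrative parameters only.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
toy_labels = dbscan(toy_points, eps=1.5, min_pts=3)
```

Points labeled `-1` correspond to the unclustered domains shown in black, while each non-negative label marks a group of domains that may share a fold.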