Hatice U Osmanbeyoglu, Jessica A Wehner, Jaime G Carbonell, Madhavi K Ganapathiraju.
Abstract
BACKGROUND: About 30% of genes code for membrane proteins, which are involved in a wide variety of crucial biological functions. Despite their importance, experimentally determined membrane protein structures account for only about 1.7% of the protein structures deposited in the Protein Data Bank, owing to the difficulty of crystallizing membrane proteins. Algorithms that can identify proteins whose high-resolution structure would aid in predicting the structure of many previously unresolved proteins are therefore of potentially high value. Active machine learning is a supervised machine learning approach well suited to this domain, where there are a large number of sequences but very few with known corresponding structures. In essence, active learning seeks to identify proteins whose structure, if revealed experimentally, is maximally predictive of others.
Year: 2010 PMID: 20122233 PMCID: PMC3009531 DOI: 10.1186/1471-2105-11-S1-S58
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The coverage of the SOM network over the data. Only 1000 data points are shown for clarity.
Table 1. Comparison of TMpro NN using active vs. passive learning algorithms to update the training set: results from the benchmark analysis.
| # | Method | # of proteins in training set | Qok | Qhtm F-score | Qhtm %obs | Qhtm %pred | Q2 |
|---|---|---|---|---|---|---|---|
| 1 | Random | 1 | 14 | 27 | 29 | 25 | 55 |
| | | 2 | 36 | 63 | 67 | 60 | 65 |
| | | 5 | 51 | 82 | 84 | 80 | 70 |
| | | 10 | 54 | 91 | 95 | 88 | 73 |
| 2 | Node-Coverage | 1 | 61 | 94 | 97 | 92 | 75 |
| | | 2 | 61 | 94 | 97 | 91 | 75 |
| | | 5 | 63 | 94 | 97 | 92 | 75 |
| | | 10 | 61 | 94 | 97 | 92 | 75 |
| 3 | Confusion-Rated | 1 | 14 | 27 | 29 | 25 | 55 |
| | | 2 | 52 | 91 | 95 | 87 | 73 |
| | | 5 | 55 | 91 | 95 | 88 | 73 |
| | | 10 | 59 | 93 | 96 | 89 | 74 |
| 4 | Node-Coverage & Confusion-Rated | 1 | 61 | 94 | 97 | 92 | 75 |
| | | 2 | 59 | 92 | 96 | 88 | 73 |
| | | 5 | 58 | 92 | 96 | 89 | 73 |
| | | 10 | 61 | 94 | 96 | 91 | 74 |
TMpro achieves high segment accuracy (F-score) even when the classifier is trained with just one protein selected by the active learning algorithms. The columns from left to right show: the method being evaluated; the number of proteins in the training set; the protein-level accuracy Qok, which is the percentage of proteins in which all experimentally determined segments are predicted correctly and no extra segments are predicted, that is, there is a one-to-one match between predicted and experimentally determined segments; the segment F-score, which is the geometric mean of recall and precision; recall (Qhtm,%obs, the percentage of experimentally determined segments that are predicted correctly); and precision (Qhtm,%pred, the percentage of predicted segments that are correct). Q2 is the residue-level accuracy when all residues in a protein are considered together, and the Q2 value for the entire set of proteins is the average of the per-protein values. See [30] for further details on these metrics.
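The metric definitions in this caption can be illustrated with a minimal sketch for a single protein. It is not the authors' evaluation code: the segment-matching rule (a predicted segment counts as correct if it overlaps an observed segment by at least `min_overlap` residues, with greedy one-to-one pairing), the function names, and the `(start, end)` segment representation are all assumptions made for illustration; the actual criterion is described in [30]. The F-score follows the caption's wording (geometric mean); note that the conventional F1 score is instead the harmonic mean of recall and precision.

```python
import math

def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def count_matches(observed, predicted, min_overlap=3):
    """Greedily pair each observed segment with at most one predicted segment.

    min_overlap is a hypothetical threshold, not a value from the source.
    """
    used = set()
    matches = 0
    for obs in observed:
        for j, prd in enumerate(predicted):
            if j not in used and overlap(obs, prd) >= min_overlap:
                used.add(j)
                matches += 1
                break
    return matches

def segment_metrics(observed, predicted, min_overlap=3):
    """Per-protein recall (%obs), precision (%pred), F-score, and Qok flag."""
    m = count_matches(observed, predicted, min_overlap)
    recall = 100.0 * m / len(observed) if observed else 0.0
    precision = 100.0 * m / len(predicted) if predicted else 0.0
    fscore = math.sqrt(recall * precision)  # geometric mean, per the caption
    # Qok: every observed segment matched and no extra segments predicted.
    qok = m == len(observed) == len(predicted)
    return {"recall": recall, "precision": precision, "fscore": fscore, "qok": qok}

def q2(observed_labels, predicted_labels):
    """Percentage of residues whose two-state label is predicted correctly."""
    correct = sum(o == p for o, p in zip(observed_labels, predicted_labels))
    return 100.0 * correct / len(observed_labels)

# Toy protein: three observed TM segments, two of which are predicted.
observed = [(10, 30), (45, 65), (80, 100)]
predicted = [(12, 31), (47, 66)]
result = segment_metrics(observed, predicted)
residue_accuracy = q2("MMMM----", "MMM-----")  # 'M' = membrane, '-' = non-membrane
```

In this toy case two of three observed segments are recovered and both predictions are correct, so the protein would not count toward Qok because one observed segment is missed. The dataset-level Qok is then the percentage of proteins for which the `qok` flag is true.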
Table 2. Comparison of TMpro NN using active vs. passive learning algorithms to update the training set: results from MPtopo (101 proteins, 443 TM segments).
| # | Method | # of proteins in training set | Qok | Qhtm F-score | Qhtm %obs | Qhtm %pred | Q2 |
|---|---|---|---|---|---|---|---|
| 1 | Random | 1 | 14 | 40 | 41 | 41 | 63 |
| | | 2 | 25 | 60 | 59 | 61 | 67 |
| | | 5 | 36 | 84 | 88 | 81 | 74 |
| | | 10 | 35 | 79 | 81 | 78 | 74 |
| 2 | Node-Coverage | 1 | 44 | 91 | 92 | 90 | 78 |
| | | 2 | 44 | 91 | 91 | 90 | 79 |
| | | 5 | 44 | 91 | 92 | 90 | 79 |
| | | 10 | 46 | 91 | 92 | 90 | 79 |
| 3 | Confusion-Rated | 1 | 26 | 68 | 75 | 65 | 67 |
| | | 2 | 34 | 85 | 90 | 81 | 75 |
| | | 5 | 41 | 89 | 91 | 87 | 78 |
| | | 10 | 45 | 91 | 91 | 90 | 79 |
| 4 | Node-Coverage & Confusion-Rated | 1 | 44 | 91 | 92 | 90 | 78 |
| | | 2 | 44 | 90 | 91 | 89 | 79 |
| | | 5 | 44 | 91 | 91 | 90 | 79 |
| | | 10 | 45 | 90 | 91 | 90 | 78 |
For a description of the columns, see the caption of Table 1. Qhtm,%obs and Qhtm,%pred were computed per protein and averaged over all proteins.
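Computing recall and precision per protein and then averaging can give different numbers from pooling all segments across proteins first. The sketch below contrasts the two on made-up counts; the tuple layout `(n_observed, n_predicted, n_matched)`, the function names, and the choice to score an undefined ratio as 0 are assumptions for illustration, not details from the source.

```python
def pct(numerator, denominator):
    """Percentage, with an undefined ratio scored as 0.0 (an assumption)."""
    return 100.0 * numerator / denominator if denominator else 0.0

def per_protein_average(proteins):
    """Average of per-protein recall and precision, as the note above describes."""
    recalls = [pct(m, o) for o, p, m in proteins]
    precisions = [pct(m, p) for o, p, m in proteins]
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)

def pooled(proteins):
    """Recall and precision over all segments pooled across proteins."""
    total_obs = sum(o for o, p, m in proteins)
    total_prd = sum(p for o, p, m in proteins)
    total_match = sum(m for o, p, m in proteins)
    return pct(total_match, total_obs), pct(total_match, total_prd)

# Hypothetical counts: (observed segments, predicted segments, matched segments)
proteins = [(2, 2, 2), (8, 10, 4)]
averaged = per_protein_average(proteins)
pooled_values = pooled(proteins)
```

With these toy counts, per-protein averaging weights the small, perfectly predicted protein as heavily as the large, poorly predicted one, so the averaged values come out higher than the pooled ones.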
Table 3. Comparison of TMpro NN using active vs. passive learning algorithms to update the training set: results from PDBTM (191 proteins, 789 TM segments).
| # | Method | # of proteins in training set | Qok | Qhtm F-score | Qhtm %obs | Qhtm %pred | Q2 |
|---|---|---|---|---|---|---|---|
| 1 | Random | 1 | 20 | 51 | 54 | 49 | 20 |
| | | 2 | 32 | 69 | 72 | 66 | 32 |
| | | 5 | 35 | 76 | 78 | 73 | 35 |
| | | 10 | 35 | 78 | 81 | 75 | 35 |
| 2 | Node-Coverage | 1 | 50 | 91 | 93 | 90 | 79 |
| | | 2 | 47 | 90 | 91 | 88 | 79 |
| | | 5 | 49 | 91 | 92 | 89 | 79 |
| | | 10 | 50 | 91 | 93 | 90 | 79 |
| 3 | Confusion-Rated | 1 | 20 | 48 | 51 | 46 | 70 |
| | | 2 | 36 | 81 | 84 | 78 | 75 |
| | | 5 | 38 | 85 | 90 | 81 | 74 |
| | | 10 | 46 | 90 | 92 | 87 | 78 |
| 4 | Node-Coverage & Confusion-Rated | 1 | 50 | 91 | 93 | 90 | 79 |
| | | 2 | 49 | 90 | 92 | 88 | 78 |
| | | 5 | 48 | 91 | 93 | 89 | 79 |
| | | 10 | 51 | 91 | 93 | 90 | 79 |
For a description of the columns, see the caption of Table 1. Qhtm,%obs and Qhtm,%pred were computed per protein and averaged over all proteins.
Figure 2. Segment-level TM prediction F-score results for MPtopo. (A) Random, (B) Node-coverage, (C) Confusion-rated, (D) Node-coverage and confusion-rated. TMpro achieves high segment accuracy (F-score) even when the classifier is trained with just one protein selected by the active learning algorithms. Node-coverage shows the best performance.