Huzefa Rangwala, George Karypis.
Abstract
BACKGROUND: Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently among the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems, and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems.
Year: 2006 PMID: 17042943 PMCID: PMC1635067 DOI: 10.1186/1471-2105-7-455
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Zero-one and Balanced error rates for the remote homology detection problem optimized for the balanced loss function.
| Method | Model | sf95 ZE | sf95 BE | sf40 ZE | sf40 BE |
| MaxClassifier | | 14.7 | 30.0 | 21.0 | 29.7 |
| Direct | | 11.5 | 23.1 | 10.9 | 13.0 |
| Two-Level Approaches Without Hierarchy Information | | | | | |
| Ranking Perceptron | Scaling | 9.3 | 16.1 | 10.9 | 13.9 |
| Ranking Perceptron | Scale & Shift | 10.1 | 19.5 | 12.1 | 15.8 |
| Ranking Perceptron | Crammer Singer | 14.7 | 28.9 | 17.6 | 24.1 |
| SVM-Struct | Scaling | 9.0 | 15.9 | 11.8 | 15.7 |
| SVM-Struct | Scale & Shift | 10.7 | 19.9 | 12.1 | 15.1 |
| SVM-Struct | Crammer Singer | 11.6 | 19.4 | 13.0 | 16.3 |
| Two-Level Approaches With Fold-level Nodes | | | | | |
| SVM-Struct | Scaling | 11.2 | 19.6 | 14.7 | 21.4 |
| SVM-Struct | Scale & Shift | 10.1 | 19.3 | 12.1 | 16.9 |
| SVM-Struct | Crammer Singer | 14.7 | 26.0 | 13.0 | 18.2 |
| Two-Level Approaches With Class-level and Fold-level Nodes | | | | | |
| SVM-Struct | Scaling | 11.2 | 20.2 | 13.0 | 18.8 |
| SVM-Struct | Scale & Shift | 13.5 | 24.7 | 12.1 | 16.8 |
| SVM-Struct | Crammer Singer | 14.7 | 26.1 | 13.0 | 17.5 |
ZE and BE denote the zero-one error and balanced error percent rates respectively. The results were obtained by optimizing the balanced loss function.
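The two measures reported in these tables can be illustrated with a short sketch. It assumes that the zero-one error is the percentage of misclassified test instances and that the balanced error is the mean of the per-class error percentages, so that small classes count as much as large ones. The formulas are not given in this section, and the function names and toy labels below are hypothetical.

```python
from collections import defaultdict


def zero_one_error(y_true, y_pred):
    """Percentage of test instances whose predicted class is wrong."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return 100.0 * wrong / len(y_true)


def balanced_error(y_true, y_pred):
    """Mean of the per-class error percentages (assumed definition)."""
    total = defaultdict(int)
    wrong = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t != p:
            wrong[t] += 1
    per_class = [wrong[c] / total[c] for c in total]
    return 100.0 * sum(per_class) / len(per_class)


# Toy data (hypothetical): class "a" has 4 instances, class "b" has 2.
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "b", "a"]

ze = zero_one_error(y_true, y_pred)  # 1 of 6 instances is wrong
be = balanced_error(y_true, y_pred)  # class a: 0 of 4 wrong, class b: 1 of 2 wrong
print(f"ZE = {ze:.1f}%, BE = {be:.1f}%")
```

On imbalanced data the balanced error is typically the larger of the two, which is consistent with BE exceeding ZE throughout the tables.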
Zero-one and Balanced error rates for the remote homology detection problem optimized for the zero-one loss function.
| Method | Model | sf95 ZE | sf95 BE | sf40 ZE | sf40 BE |
| MaxClassifier | | 14.7 | 30.0 | 21.0 | 29.7 |
| Direct | | 13.5 | 24.8 | 20.5 | 26.5 |
| Two-Level Approaches Without Hierarchy Information | | | | | |
| Ranking Perceptron | Scaling | 10.6 | 18.0 | 11.7 | 16.5 |
| Ranking Perceptron | Scale & Shift | 13.2 | 24.5 | 10.9 | 13.4 |
| Ranking Perceptron | Crammer Singer | 17.0 | 34.3 | 14.2 | 19.4 |
| SVM-Struct | Scaling | 10.7 | 18.1 | 13.4 | 17.3 |
| SVM-Struct | Scale & Shift | 12.4 | 23.7 | 13.4 | 17.3 |
| SVM-Struct | Crammer Singer | 12.7 | 25.2 | 15.5 | 19.8 |
| Two-Level Approaches With Fold-level Nodes | | | | | |
| SVM-Struct | Scaling | 10.4 | 18.7 | 14.7 | 20.0 |
| SVM-Struct | Scale & Shift | 12.4 | 23.7 | 14.7 | 21.4 |
| SVM-Struct | Crammer Singer | 13.8 | 25.0 | 14.7 | 19.6 |
| Two-Level Approaches With Class-level and Fold-level Nodes | | | | | |
| SVM-Struct | Scaling | 10.9 | 19.1 | 12.6 | 17.7 |
| SVM-Struct | Scale & Shift | 11.2 | 20.9 | 13.4 | 17.8 |
| SVM-Struct | Crammer Singer | 14.1 | 27.6 | 12.6 | 17.1 |
ZE and BE denote the zero-one error and balanced error percent rates respectively. The results were obtained by optimizing the zero-one loss function.
Zero-one and Balanced error rates for the fold recognition problem optimized for the balanced loss function.
| Method | Model | fd25 ZE | fd25 BE | fd40 ZE | fd40 BE |
| MaxClassifier | | 42.0 | 60.3 | 44.4 | 64.6 |
| Direct | | 38.4 | 52.3 | 40.4 | 56.9 |
| Two-Level Approaches Without Hierarchy Information | | | | | |
| Ranking Perceptron | Scaling | 39.5 | 48.7 | 32.5 | 48.0 |
| Ranking Perceptron | Scale & Shift | 38.8 | 51.0 | 29.0 | 43.0 |
| Ranking Perceptron | Crammer Singer | 37.7 | 49.6 | 36.0 | 49.6 |
| SVM-Struct | Scaling | 39.9 | 52.7 | 30.8 | 46.6 |
| SVM-Struct | Scale & Shift | 39.9 | 52.5 | 28.1 | 42.8 |
| SVM-Struct | Crammer Singer | 41.3 | 50.5 | 31.1 | 43.3 |
| Two-Level Approaches With Class-level Nodes | | | | | |
| SVM-Struct | Scaling | 39.2 | 52.4 | 29.9 | 45.0 |
| SVM-Struct | Scale & Shift | 38.1 | 51.6 | 29.0 | 41.7 |
| SVM-Struct | Crammer Singer | 41.7 | 50.9 | 29.9 | 41.7 |
| Two-Level Approaches With Superfamily-level Nodes | | | | | |
| SVM-Struct | Scaling | 40.2 | 52.6 | 30.5 | 44.5 |
| SVM-Struct | Scale & Shift | 40.6 | 52.7 | 29.3 | 42.8 |
| SVM-Struct | Crammer Singer | 38.8 | 48.8 | 31.0 | 44.9 |
| Two-Level Approaches With Superfamily-level and Class-level Nodes | | | | | |
| SVM-Struct | Scaling | 41.0 | 50.9 | 33.7 | 44.6 |
| SVM-Struct | Scale & Shift | 39.5 | 51.5 | 29.3 | 42.3 |
| SVM-Struct | Crammer Singer | 40.2 | 51.9 | 30.2 | 42.4 |
ZE and BE denote the zero-one error and balanced error percent rates respectively. The results were obtained by optimizing the balanced loss function.
Zero-one and Balanced error rates for the fold recognition problem optimized for the zero-one loss function.
| Method | Model | fd25 ZE | fd25 BE | fd40 ZE | fd40 BE |
| MaxClassifier | | 42.0 | 60.3 | 44.4 | 64.6 |
| Direct | | 42.8 | 59.4 | 43.0 | 62.7 |
| Two-Level Approaches Without Hierarchy Information | | | | | |
| Ranking Perceptron | Scaling | 39.9 | 52.9 | 32.2 | 50.6 |
| Ranking Perceptron | Scale & Shift | 38.4 | 51.3 | 27.3 | 44.8 |
| Ranking Perceptron | Crammer Singer | 34.8 | 48.9 | 37.7 | 56.6 |
| SVM-Struct | Scaling | 41.3 | 55.2 | 33.7 | 50.0 |
| SVM-Struct | Scale & Shift | 41.0 | 54.3 | 29.0 | 46.2 |
| SVM-Struct | Crammer Singer | 36.6 | 49.4 | 32.5 | 49.6 |
| Two-Level Approaches With Class-level Nodes | | | | | |
| SVM-Struct | Scaling | 39.9 | 52.2 | 31.9 | 50.2 |
| SVM-Struct | Scale & Shift | 38.4 | 52.9 | 29.3 | 44.6 |
| SVM-Struct | Crammer Singer | 39.2 | 51.8 | 32.8 | 52.9 |
| Two-Level Approaches With Superfamily-level Nodes | | | | | |
| SVM-Struct | Scaling | 39.5 | 53.9 | 31.3 | 48.8 |
| SVM-Struct | Scale & Shift | 39.9 | 53.4 | 31.3 | 48.4 |
| SVM-Struct | Crammer Singer | 37.7 | 52.1 | 33.4 | 51.0 |
| Two-Level Approaches With Superfamily-level and Class-level Nodes | | | | | |
| SVM-Struct | Scaling | 39.2 | 52.2 | 27.3 | 41.0 |
| SVM-Struct | Scale & Shift | 39.9 | 53.9 | 28.4 | 44.1 |
| SVM-Struct | Crammer Singer | 38.8 | 54.7 | 31.3 | 48.0 |
ZE and BE denote the zero-one error and balanced error percent rates respectively. The results were obtained by optimizing the zero-one loss function.
Error rates (top1, top3) for the remote homology detection problem.
| Method | Model | sf95 top1 | sf95 top3 | sf40 top1 | sf40 top3 |
| Two-Level Approaches Without Hierarchy Information | | | | | |
| SVM-Struct | Scaling | 7.5 | 2.6 | 10.1 | 3.8 |
| SVM-Struct | Scale & Shift | 9.0 | 2.0 | 10.1 | 3.4 |
| SVM-Struct | Crammer Singer | 8.1 | 1.7 | 9.2 | 2.5 |
| Two-Level Approaches With Fold-level Nodes | | | | | |
| SVM-Struct | Scaling | 4.6 | 0.9 | 6.3 | 1.7 |
| SVM-Struct | Scale & Shift | 4.0 | 0.9 | 5.0 | 1.7 |
| SVM-Struct | Crammer Singer | 6.6 | 2.6 | 5.0 | 1.7 |
| Two-Level Approaches With Fold-level and Class-level Nodes | | | | | |
| SVM-Struct | Scaling | 5.2 | 1.7 | 5.5 | 1.7 |
| SVM-Struct | Scale & Shift | 5.8 | 2.3 | 4.2 | 2.1 |
| SVM-Struct | Crammer Singer | 6.6 | 2.0 | 5.0 | 1.7 |
The results shown in the table are optimized for the balanced loss function.
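As a rough illustration of the top1 and top3 columns, the sketch below uses the usual top-k definition: a prediction counts as an error when the true class is not among the k highest-scoring classes. This section does not spell out how top1 and top3 were computed, so the definition is an assumption, and `top_k_error` and the toy scores are hypothetical.

```python
def top_k_error(scores, y_true, k):
    """Percentage of instances whose true class is not among the k
    highest-scoring classes. `scores` is a list of {class: score} dicts."""
    misses = 0
    for row, truth in zip(scores, y_true):
        ranked = sorted(row, key=row.get, reverse=True)
        if truth not in ranked[:k]:
            misses += 1
    return 100.0 * misses / len(y_true)


# Toy classifier outputs (hypothetical) for four test sequences.
scores = [
    {"a": 0.9, "b": 0.2, "c": -0.5, "d": -1.0},
    {"a": 0.1, "b": 0.4, "c": 0.3, "d": -0.2},
    {"a": -0.3, "b": 0.8, "c": 0.5, "d": 0.6},
    {"a": 0.7, "b": 0.6, "c": 0.5, "d": -0.9},
]
y_true = ["a", "c", "a", "d"]

top1 = top_k_error(scores, y_true, 1)
top3 = top_k_error(scores, y_true, 3)
print(f"top1 = {top1:.1f}%, top3 = {top3:.1f}%")
```

By construction the top3 error can never exceed the top1 error, which matches the pattern in the tables.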
Error rates (top1, top3) for the fold recognition problem.
| Method | Model | fd25 top1 | fd25 top3 | fd40 top1 | fd40 top3 |
| Two-Level Approaches Without Hierarchy Information | | | | | |
| SVM-Struct | Scaling | 38.5 | 24.5 | 25.6 | 15.4 |
| SVM-Struct | Scale & Shift | 37.4 | 24.8 | 24.7 | 15.1 |
| SVM-Struct | Crammer Singer | 36.3 | 22.7 | 25.0 | 13.4 |
| Two-Level Approaches With Class-level Nodes | | | | | |
| SVM-Struct | Scaling | 36.7 | 21.9 | 20.6 | 11.9 |
| SVM-Struct | Scale & Shift | 36.3 | 21.6 | 21.2 | 12.2 |
| SVM-Struct | Crammer Singer | 37.1 | 22.3 | 25.3 | 13.4 |
| Two-Level Approaches With Superfamily-level Nodes | | | | | |
| SVM-Struct | Scaling | 39.9 | 24.5 | 27.9 | 19.5 |
| SVM-Struct | Scale & Shift | 39.6 | 23.4 | 25.3 | 16.0 |
| SVM-Struct | Crammer Singer | 40.6 | 27.3 | 26.7 | 15.1 |
| Two-Level Approaches With Superfamily-level and Class-level Nodes | | | | | |
| SVM-Struct | Scaling | 39.2 | 25.2 | 20.6 | 13.7 |
| SVM-Struct | Scale & Shift | 38.5 | 23.0 | 20.9 | 12.2 |
| SVM-Struct | Crammer Singer | 37.1 | 23.7 | 24.1 | 12.5 |
The results shown in the table are optimized for the balanced loss function.
Comparative results for the remote homology detection problem on dataset sf95.
| Method | Ie et al. ZE | Ie et al. BE | Scaling Model ZE | Scaling Model BE | Best Model ZE | Best Model BE |
| Without Hierarchy Information | | | | | | |
| Ranking Perceptron | 21.8 | 36.7 | 9.3 | 16.1 | 9.3 | 16.1 |
| SVM-Struct | 20.7 | 37.6 | 9.0 | 15.9 | 9.0 | 15.9 |
| With Fold-level Nodes | | | | | | |
| SVM-Struct | 20.4 | 37.5 | 11.2 | 19.6 | 10.1 | 19.3 |
The results for Ie et al. were obtained from the supplementary website for their work [19] and represent the results obtained using the simple scaling model in their implementation. The results labeled "Scaling Model" correspond to the performance achieved by our two-level classifiers using the simple scaling model, whereas the results labeled "Best Model" correspond to the best performance achieved among the simple scaling, scale & shift, and Crammer-Singer models. Both of these results were obtained from Table 2. All results were obtained by optimizing the balanced loss function. ZE and BE denote the zero-one error and balanced error percent rates, respectively.
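The model names used in these tables are not defined in this section. The following sketch shows one common reading of them, offered purely as an assumption: each model combines the outputs f of the one-vs-rest binary classifiers and predicts the class with the highest combined score. The weights here are hand-picked placeholders; per the tables, they would be learned by the ranking perceptron or SVM-Struct. All function names are hypothetical.

```python
def max_classifier(f):
    """Baseline: pick the class whose binary classifier output is largest."""
    return max(range(len(f)), key=lambda i: f[i])


def scaling(f, w):
    """Assumed form: one weight per class, argmax_i w[i] * f[i]."""
    return max(range(len(f)), key=lambda i: w[i] * f[i])


def scale_and_shift(f, w, b):
    """Assumed form: weight and offset per class, argmax_i w[i] * f[i] + b[i]."""
    return max(range(len(f)), key=lambda i: w[i] * f[i] + b[i])


def crammer_singer(f, W):
    """Assumed form: a full weight vector per class, argmax_i <W[i], f>."""
    def score(i):
        return sum(w_ij * f_j for w_ij, f_j in zip(W[i], f))
    return max(range(len(W)), key=score)


# Toy binary classifier outputs for three classes (hypothetical values).
f = [0.5, 0.4, -0.2]
w = [1.0, 2.0, 1.0]                  # placeholder per-class scales
b = [0.5, 0.0, 0.0]                  # placeholder per-class shifts
W = [[1.0, -1.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]                # placeholder weight matrix

pred_max = max_classifier(f)
pred_scaling = scaling(f, w)
pred_shift = scale_and_shift(f, w, b)
pred_cs = crammer_singer(f, W)
print(pred_max, pred_scaling, pred_shift, pred_cs)
```

Under this reading the models form a nested family: scaling is scale & shift with zero offsets, and also the Crammer-Singer form with a diagonal weight matrix.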
Dataset Statistics.
| Statistic | sf95 | sf40 | fd25 | fd40 |
| ASTRAL filtering | 95% | 40% | 25% | 40% |
| Number of Sequences | 2115 | 1119 | 1294 | 1651 |
| Number of Folds | 25 | 25 | 25 | 27 |
| Number of Superfamilies | 47 | 37 | 137 | 158 |
| Avg. Pairwise Similarity | 12.8% | 11.5% | 11.6% | 11.4% |
| Avg. Max. Similarity | 63.5% | 33.9% | 32.2% | 34.3% |
| Avg. Pairwise Similarity (within folds) | 25.6% | 17.9% | 16.7% | 17.4% |
| Avg. Pairwise Similarity (outside folds) | 10.4% | 11.03% | 11.2% | 11.0% |
The percent similarity between two sequences is computed by aligning the pair of sequences using SW-GSM with a gap opening of 5.0 and a gap extension of 1.0. "Avg. Pairwise Similarity" is the average of all the pairwise percent identities. "Avg. Max. Similarity" is the average of the maximum pairwise percent identity for each sequence, i.e., it measures the similarity of each sequence to its most similar sequence. "Avg. Pairwise Similarity (within folds)" and "Avg. Pairwise Similarity (outside folds)" are the averages, over all sequences, of a sequence's average pairwise percent similarity to the sequences within its own fold and outside its fold, respectively.
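The four similarity statistics can be sketched as follows, assuming the pairwise percent identities have already been computed and stored in a symmetric matrix. The alignment step itself is not shown. The function name `similarity_stats`, the toy matrix, and the fold labels are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def similarity_stats(sim, folds):
    """sim[i][j]: percent identity between sequences i and j (symmetric);
    folds[i]: fold label of sequence i."""
    n = len(sim)
    # Average over all unordered pairs of distinct sequences.
    avg_pairwise = _mean([sim[i][j] for i in range(n) for j in range(i + 1, n)])
    # For each sequence, similarity to its most similar other sequence.
    avg_max = _mean([max(sim[i][j] for j in range(n) if j != i) for i in range(n)])
    within, outside = [], []
    for i in range(n):
        same = [sim[i][j] for j in range(n) if j != i and folds[j] == folds[i]]
        diff = [sim[i][j] for j in range(n) if folds[j] != folds[i]]
        if same:  # skip sequences that are alone in their fold
            within.append(_mean(same))
        if diff:
            outside.append(_mean(diff))
    return {
        "avg_pairwise": avg_pairwise,
        "avg_max": avg_max,
        "within_folds": _mean(within),
        "outside_folds": _mean(outside),
    }


# Toy data (hypothetical): four sequences in two folds.
sim = [
    [100, 30, 10, 12],
    [30, 100, 14, 8],
    [10, 14, 100, 40],
    [12, 8, 40, 100],
]
folds = ["A", "A", "B", "B"]
stats = similarity_stats(sim, folds)
print(stats)
```

As in the dataset table, the within-fold average in this toy example is higher than the outside-fold average.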