Tobias Hamp, Tatyana Goldberg, Burkhard Rost.
Abstract
One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up of 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster, which makes the kernel possibly the top contender when runtime is weighed against prediction performance. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel.
Year: 2013 PMID: 23825697 PMCID: PMC3688983 DOI: 10.1371/journal.pone.0068459
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Sample k-mer tree traversal.
The sketch shows one part of a 3-mer trie traversal with two input profiles (P1 and P2). These profiles were generated from proteins that were 186 (P1) and 241 (P2) residues long (tables on the top). During traversal, each node retains only those conserved multi-mers that still fall below the substitution score threshold σ. The ‘Sample 3-mer trie traversal’ illustrates the transition from the two-letter node ‘AA’ to node ‘AAA’ (‘AAA’ is also a leaf, because k=3). At node ‘AA’, five 2-mers have remained from previous transitions (root -> ‘A’ -> ‘AA’) that still fall below the substitution score threshold σ=5. In the transition to node ‘AAA’, each such 2-mer is extended to a 3-mer and its score is re-calculated (k-mer extension and new scores in red). 3-mers with a score > 5 are discarded (2/5) and those that remain (3/5) are used in the kernel matrix update. Afterwards, the traversal continues until it reaches the lexicographically last leaf (‘YYY’).
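The traversal described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code shipped in the fastprofkernel package: the function name `profile_kernel`, the profile representation (one dictionary of per-letter substitution scores per position), the additive scoring of a k-mer, and the kernel update (product of surviving k-mer counts per profile pair at each leaf) are all assumptions made for the example.

```python
from collections import Counter


def profile_kernel(profiles, k, sigma, alphabet):
    """Hypothetical sketch of a k-mer trie traversal over sequence profiles.

    profiles: list of profiles; each profile is a list of dicts that map a
              letter to its substitution score at that position
              (assumption: lower score = better conserved).
    Returns a symmetric kernel matrix as a list of lists.
    """
    n = len(profiles)
    kernel = [[0] * n for _ in range(n)]

    # At the root, every length-k window of every profile is alive with score 0.
    start = [(p, pos, 0.0)
             for p, prof in enumerate(profiles)
             for pos in range(len(prof) - k + 1)]

    def visit(depth, alive):
        if depth == k:
            # Leaf: count surviving windows per profile and update the matrix.
            counts = Counter(p for p, _, _ in alive)
            for a, count_a in counts.items():
                for b, count_b in counts.items():
                    kernel[a][b] += count_a * count_b
            return
        for letter in alphabet:  # lexicographic order if alphabet is sorted
            extended = []
            for p, pos, score in alive:
                # Extend the (depth)-mer by one letter and re-calculate its score.
                new_score = score + profiles[p][pos + depth][letter]
                if new_score <= sigma:  # scores > sigma are discarded
                    extended.append((p, pos, new_score))
            if extended:  # prune subtrees in which nothing survives
                visit(depth + 1, extended)

    visit(0, start)
    return kernel


# Toy example: two profiles of length 2 over a two-letter alphabet, k=2.
toy_profiles = [
    [{"A": 0.0, "B": 2.0}, {"A": 0.0, "B": 1.0}],
    [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 2.0}],
]
toy_kernel = profile_kernel(toy_profiles, k=2, sigma=1.0, alphabet="AB")
```

Because a subtree is skipped as soon as no k-mer prefix stays within σ, the work done depends on how many k-mers survive rather than on the full size of the trie, which is why the threshold matters for runtime.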
Figure 2. Speed measurements.
Each arrow compares the runtime of the original implementation (upper symbol) to the new implementation (lower symbol). The symbol type indicates the parameter combination. The number above or below an arrow is the acceleration (original runtime divided by new runtime). All runtimes are wall-clock times of single processes. We did not perform an experiment if it was clear that it would take longer than 24 hours. (A) Kernel matrix calculations. In this subfigure we compare kernel matrix creation runtimes. Data sets correspond to subsets of a redundancy reduced Swiss-Prot database with 5920 (‘Euka (5920)’), 12,500 (‘SP60_13k’), 25,000 (‘SP60_25k’) and 100,000 (‘SP60_100k’) samples, respectively. The SP60_100k experiment (“k=5, σ=7.5”) for which we used 100 CPUs in parallel took 40 minutes and is not shown. (B) Prediction of new targets. This subfigure displays the runtimes for predicting three sets of targets (1, 200 and 20,000 profiles; axis on top) using models created with the training data sets (‘Euka (5920)’ to ‘SP60_100k’; axis on bottom).