Distributed optimization of multi-class SVMs
Maximilian Alber, Julian Zimmert, Urun Dogan, Marius Kloft.
Abstract
Training of one-vs.-rest SVMs can be parallelized over the number of classes in a straightforward way. Given enough computational resources, one-vs.-rest SVMs can thus be trained on data involving a large number of classes. The same cannot be said, however, for the so-called all-in-one SVMs, which require solving a quadratic program whose size grows quadratically with the number of classes. We develop distributed algorithms for two all-in-one SVM formulations (by Lee et al. and by Weston and Watkins) that parallelize the computation evenly over the number of classes. This allows us to compare these models to one-vs.-rest SVMs at an unprecedented scale. The results indicate superior accuracy on text classification data.
Year: 2017 PMID: 28570703 PMCID: PMC5453486 DOI: 10.1371/journal.pone.0178161
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
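As a minimal illustration of the abstract's first claim (one-vs.-rest training parallelizes over classes in a straightforward way), the following sketch trains one binary SVM per class independently. It is not the authors' solver: it assumes scikit-learn's LinearSVC for the binary subproblems and joblib for the parallelism, and the helper names are invented for this example.

```python
# Sketch: one-vs.-rest training parallelized over the classes.
# Each binary subproblem (class c vs. the rest) is independent,
# so the subproblems can be solved on separate cores with no communication.
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import LinearSVC

def fit_binary(X, y, cls, C=1.0):
    """Train one binary SVM separating class `cls` from all other classes."""
    clf = LinearSVC(C=C)
    clf.fit(X, np.where(y == cls, 1, -1))
    return clf.coef_.ravel()

def fit_ovr(X, y, C=1.0, n_jobs=-1):
    """One weight vector per class, trained independently in parallel."""
    classes = np.unique(y)
    W = Parallel(n_jobs=n_jobs)(delayed(fit_binary)(X, y, c, C) for c in classes)
    return classes, np.vstack(W)  # W has shape (n_classes, n_features)

def predict_ovr(X, classes, W):
    """Assign each sample to the class with the largest decision value."""
    return classes[np.argmax(X @ W.T, axis=1)]
```

All-in-one formulations such as LLW and WW couple the classes in a single quadratic program, which is why they need the dedicated distributed algorithms developed in the paper.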
Fig 1. 1-factorization.
Illustration of the solution of the 1-factorization problem for a graph with eight nodes. Node 8 is placed centrally, and at each step the pattern is rotated by one position.
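The rotation described in the caption is the classic circle method for 1-factorizing a complete graph with an even number of nodes: one node is fixed in the centre, the others are rotated by one position per round, and each round forms a perfect matching. A minimal sketch (the generic textbook construction, not the paper's code):

```python
# Circle-method 1-factorization of the complete graph K_n (n even):
# node n-1 stays fixed ("central"); the remaining nodes rotate by one
# position per round. Each round is a perfect matching, and the n-1
# rounds together cover every edge of K_n exactly once.
def one_factorization(n):
    assert n % 2 == 0, "requires an even number of nodes"
    others = list(range(n - 1))
    rounds = []
    for _ in range(n - 1):
        pairs = [(others[0], n - 1)]                 # pair with the fixed node
        pairs += [(others[i], others[-i]) for i in range(1, n // 2)]
        rounds.append(pairs)
        others = [others[-1]] + others[:-1]          # rotate by one
    return rounds

# For n = 8 this gives 7 rounds of 4 disjoint pairs, matching Fig 1.
```

Because the pairs within a round are disjoint, each round can be processed fully in parallel.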
Comparison to existing solver.
| Dataset, log(C) | D-LLW Err. | D-LLW Den. | S-LLW Err. | S-LLW Den. | D-WW Err. | D-WW Den. | S-WW Err. | S-WW Den. |
|---|---|---|---|---|---|---|---|---|
| … | 21.34 | 100.0 | 21.34 | 100.0 | 19.88 | 100.0 | 19.88 | 100.0 |
| … | 20.95 | 100.0 | 20.95 | 100.0 | 19.51 | 100.0 | 19.51 | 100.0 |
| … | 20.78 | 100.0 | 20.78 | 100.0 | 19.38 | 100.0 | 19.38 | 100.0 |
| … | 66.67 | 100.0 | 66.67 | 100.0 | 38.10 | 100.0 | 38.10 | 100.0 |
| … | 61.90 | 100.0 | 61.90 | 100.0 | 19.05 | 100.0 | 19.05 | 100.0 |
| … | 33.33 | 100.0 | 33.33 | 100.0 | 19.05 | 100.0 | 19.05 | 100.0 |
| … | 13.33 | 100.0 | 13.33 | 100.0 | 6.67 | 100.0 | 6.67 | 100.0 |
| … | 26.67 | 100.0 | 26.67 | 100.0 | 13.33 | 100.0 | 13.33 | 100.0 |
| … | 26.67 | 100.0 | 26.67 | 100.0 | 13.33 | 100.0 | 13.33 | 100.0 |
| … | 87.04 | 100.0 | 87.04 | 100.0 | 28.25 | 100.0 | 28.26 | 100.0 |
| … | 87.24 | 100.0 | 87.24 | 100.0 | 29.04 | 100.0 | 29.03 | 100.0 |
| … | 61.91 | 100.0 | 87.24 | 100.0 | 28.92 | 100.0 | 28.93 | 100.0 |
| … | 29.23 | 97.24 | 29.23 | 97.24 | 15.32 | 51.16 | 15.30 | 49.72 |
| … | 22.97 | 97.24 | 22.97 | 97.24 | 14.80 | 44.74 | 14.80 | 42.70 |
| … | 16.15 | 97.17 | 16.15 | 97.04 | 15.98 | 45.97 | 15.98 | 43.47 |
| … | 47.96 | 78.00 | 47.96 | 78.00 | 11.31 | 26.42 | 11.31 | 23.45 |
| … | 33.27 | 78.00 | 33.41 | 77.98 | 11.52 | 22.93 | 11.52 | 20.12 |
| … | 12.03 | 78.00 | 12.03 | 77.98 | 12.03 | 23.05 | 12.03 | 20.06 |
| … | 26.75 | 100.0 | 26.73 | 100.0 | 15.80 | 100.0 | 15.80 | 100.0 |
| … | 26.80 | 100.0 | 26.80 | 100.0 | 15.47 | 100.0 | 15.53 | 100.0 |
| … | 26.90 | 100.0 | 26.90 | 100.0 | 15.96 | 100.0 | 16.00 | 100.0 |
| … | 16.29 | 100.0 | 16.37 | 100.0 | 16.16 | 100.0 | 16.16 | 100.0 |
| … | 16.09 | 100.0 | 16.15 | 100.0 | 16.37 | 100.0 | 16.28 | 100.0 |
| … | 16.34 | 100.0 | 16.28 | 100.0 | 16.32 | 100.0 | 16.24 | 100.0 |
| … | 31.84 | 100.0 | 31.84 | 100.0 | 8.17 | 100.0 | 8.17 | 100.0 |
| … | 30.09 | 100.0 | 30.04 | 100.0 | 9.37 | 100.0 | 9.37 | 100.0 |
| … | 28.00 | 100.0 | 28.00 | 100.0 | 10.51 | 100.0 | 10.51 | 100.0 |
Error on the test set and density (both in %) of the Shark solver (denoted S) and the proposed solver (denoted D) for the LLW and WW formulations. Each block of three rows corresponds to one dataset at three values of C. The results of the two solver implementations show good agreement.
Dataset properties.
| Dataset | n_train | n_test | Classes | d |
|---|---|---|---|---|
| LSHTC-small | 4,463 | 1,858 | 1,139 | 51,033 |
| … | 128,710 | 34,880 | 12,294 | 381,581 |
| … | 383,408 | 103,435 | 11,947 | 575,555 |
| LSHTC-2011 | 394,754 | 104,263 | 27,875 | 594,158 |
The datasets used from the LSHTC corpus and their properties. n_train and n_test denote the number of samples in the training and test set, respectively, Classes the number of classes, and d the number of dimensions. The most challenging dataset is LSHTC-2011: it contains the most samples, classes, and dimensions.
Fig 2. Speed-up.
Speed-up of our solvers as a function of the number of cores, averaged over 10 repetitions. For *-MPI-2 and *-MPI-4, the cores are split evenly over 2 and 4 machines, respectively. We observe a linear speed-up in the number of cores for both solvers.
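For reference, the speed-up plotted in Fig 2 is conventionally the single-core time divided by the time on p cores, with each timing averaged over the repetitions. A small helper (the function name is invented for this sketch):

```python
# Speed-up on p cores: mean single-core time divided by mean p-core time.
# With linear speed-up, the returned value is approximately p.
import numpy as np

def speedup(times_1core, times_pcore):
    """Each argument is a list of wall-clock times over repeated runs."""
    return float(np.mean(times_1core) / np.mean(times_pcore))
```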
Test error and model density.
| Dataset, log(C) | OVR Err. | CS Err. | WW Err. | LLW Err. | OVR Den. | CS Den. | WW Den. | LLW Den. |
|---|---|---|---|---|---|---|---|---|
| … | 93.00 | 59.74 | 72.82 | | 92.74 | 11.11 | 69.73 | |
| … | 85.36 | 59.74 | 65.34 | **93.00** | 81.54 | 11.13 | 16.44 | 92.74 |
| … | 74.54 | 59.74 | 57.59 | **93.00** | 46.76 | 11.12 | 6.06 | 92.74 |
| … | **64.37** | **55.49** | **54.57** | **93.00** | 38.20 | 11.76 | 5.74 | 92.74 |
| … | | | | **93.00** | | | | 92.74 |
| … | 88.12 | 58.57 | 66.47 | | 75.26 | 2.53 | 18.50 | |
| … | 85.21 | 58.57 | 60.58 | **95.86** | 45.14 | 2.53 | 4.45 | 100.0 |
| … | 77.96 | 57.82 | 55.28 | **95.86** | 25.28 | 2.55 | 1.71 | 100.0 |
| … | **63.11** | | | **95.86** | 18.33 | | | 100.0 |
| … | | **54.18** | **54.41** | * | | 2.67 | 1.66 | * |
| … | 83.66 | 49.81 | 58.02 | | 72.60 | 1.73 | 16.97 | |
| … | 75.15 | 49.65 | 50.20 | **92.63** | 46.20 | 1.71 | 4.06 | 99.52 |
| … | 60.38 | 46.14 | **44.94** | **92.63** | 25.87 | 1.76 | 1.52 | 99.52 |
| … | **47.33** | | | * | 18.20 | | | * |
| … | | **45.60** | 46.15 | * | | 2.09 | 1.47 | * |
| … | 87.95 | 59.09 | 68.19 | | 72.38 | 1.57 | 13.49 | |
| … | 85.85 | 59.09 | 62.14 | **96.18** | 45.97 | 1.57 | 3.16 | 100.0 |
| … | 76.78 | 58.18 | **57.31** | **96.18** | 25.97 | 1.55 | 1.19 | 100.0 |
| … | **63.11** | | | * | 18.24 | | | * |
| … | | **57.78** | 58.32 | * | | 1.70 | 1.14 | * |
Test-set error and model density (in %) achieved by the OVR, CS, WW, and LLW solvers on the LSHTC datasets. Each block of five rows corresponds to one dataset over a range of C values; for each solver, the result with the best error is in bold font. LLW entries marked with a '*' did not converge within a day of runtime.
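For reference, the two tabulated quantities can be computed as follows. This is a sketch of the conventional definitions (fraction of misclassified test points, fraction of non-zero weights), not the paper's implementation:

```python
# Test error: percentage of misclassified test samples.
# Model density: percentage of non-zero entries in the stacked weight
# matrix W of shape (n_classes, n_features); 100.0 means fully dense.
import numpy as np

def test_error(y_true, y_pred):
    return 100.0 * np.mean(np.asarray(y_true) != np.asarray(y_pred))

def model_density(W, tol=0.0):
    return 100.0 * np.mean(np.abs(W) > tol)
```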
F1-Scores.
| Dataset, log(C) | OVR Micro-F1 | CS Micro-F1 | WW Micro-F1 | LLW Micro-F1 | OVR Macro-F1 | CS Macro-F1 | WW Macro-F1 | LLW Macro-F1 |
|---|---|---|---|---|---|---|---|---|
| … | 7.00 | 40.26 | 27.18 | | 0.61 | 22.08 | 10.73 | |
| … | 14.42 | 40.26 | 34.66 | **7.00** | 2.70 | 22.08 | 16.15 | **0.61** |
| … | 25.46 | 40.26 | 42.41 | **7.00** | 8.72 | 22.08 | 24.71 | **0.61** |
| … | **35.47** | **44.46** | **45.43** | **7.00** | **16.42** | **26.70** | **28.75** | **0.61** |
| … | | | | **7.00** | | | | **0.61** |
| … | 11.77 | 41.35 | 33.53 | | 0.88 | 25.43 | 15.05 | |
| … | 14.80 | 41.52 | 39.42 | **4.14** | 1.51 | 25.41 | 20.83 | **0.09** |
| … | 22.02 | 42.19 | 44.72 | **4.14** | 3.35 | 25.83 | **27.90** | **0.09** |
| … | **36.86** | | | * | **14.76** | 30.99 | | * |
| … | | **45.83** | **45.59** | * | | **31.12** | | * |
| … | 16.34 | 50.19 | 41.98 | | 0.28 | 20.55 | 8.08 | |
| … | 24.85 | 50.35 | 49.80 | **7.37** | 0.69 | 20.72 | 16.17 | **0.01** |
| … | 39.62 | 53.86 | **55.06** | **7.37** | 2.64 | 23.76 | 25.94 | **0.01** |
| … | **52.67** | | | * | **12.46** | | | * |
| … | | **54.40** | 53.85 | * | | **31.84** | **30.95** | * |
| … | 12.05 | 40.91 | 31.81 | | 0.46 | 22.44 | 10.47 | |
| … | 14.15 | 40.91 | 37.86 | **3.82** | 0.62 | 22.46 | 16.48 | **0.05** |
| … | 23.22 | 41.82 | **42.69** | **3.82** | 1.89 | 23.37 | 23.17 | **0.05** |
| … | **36.89** | | | * | **10.60** | | | * |
| … | | **42.22** | 41.86 | * | | **26.31** | **26.97** | * |
Micro-F1 and Macro-F1 scores (in %) achieved by the OVR, CS, WW, and LLW solvers on the LSHTC datasets. Each block of five rows corresponds to one dataset over a range of C values; for each solver and each metric, the best result across C values is in bold font. LLW entries marked with a '*' did not converge within a day of runtime.
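Both scores follow the standard definitions: micro-F1 pools the per-class decisions (and equals accuracy for single-label problems), while macro-F1 averages the per-class F1 scores with equal weight, so rare classes count as much as frequent ones. A sketch using scikit-learn's f1_score (assuming that library; the paper does not prescribe an implementation):

```python
# Micro- vs. macro-averaged F1, as reported in the table (in %).
from sklearn.metrics import f1_score

def f1_scores(y_true, y_pred):
    micro = f1_score(y_true, y_pred, average="micro")  # pooled over classes
    macro = f1_score(y_true, y_pred, average="macro")  # unweighted class mean
    return 100.0 * micro, 100.0 * macro
```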
Fig 3. Training times.
Training time of the various solvers for each value of C, averaged over 10 repetitions.
Results for the LLW solver.
| log(C) | 2 | 3 | 4 |
|---|---|---|---|
| Err. | 87.73 | 66.74 | 59.31 |
| … | 2.08 | 15.07 | 40.69 |
| … | 12.27 | 33.26 | 24.58 |
| … | 92.74 | 92.74 | 92.74 |
| Den. | 99.88 | 99.87 | 99.90 |
Error, Micro-F1, and Macro-F1 on the test set, and model density (all in %), for the LLW solver on the LSHTC-small dataset.