Rafael Coimbra Pinto, Paulo Martins Engel.
Abstract
This work builds upon previous efforts in online incremental learning, namely the Incremental Gaussian Mixture Network (IGMN). The IGMN is capable of learning from data streams in a single pass by improving its model after analyzing each data point and discarding it thereafter. Nevertheless, it scales poorly: its asymptotic time complexity of O(NKD³) for N data points, K Gaussian components and D dimensions renders it inadequate for high-dimensional data. In this work, we reduce this complexity to O(NKD²) by deriving formulas for working directly with precision matrices instead of covariance matrices. The result is a much faster and more scalable algorithm that can be applied to high-dimensional tasks. This is confirmed by applying the modified algorithm to high-dimensional classification datasets.
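The abstract does not reproduce the update formulas themselves. As an illustration only of why working with precision matrices can avoid a cubic cost, the sketch below uses the generic Sherman-Morrison identity, which refreshes an inverse after a rank-one change in O(D²) operations instead of re-inverting in O(D³). The function names and the toy matrices are hypothetical, and this is not presented as the authors' derivation.

```python
def sherman_morrison_update(precision, u, v):
    """Return the inverse of (A + u v^T), given precision = A^{-1}.

    Uses only matrix-vector products, so the cost is O(D^2).
    """
    d = len(u)
    # precision @ u, a column vector (O(D^2))
    pu = [sum(precision[i][j] * u[j] for j in range(d)) for i in range(d)]
    # v^T @ precision, a row vector (O(D^2))
    vp = [sum(v[i] * precision[i][j] for i in range(d)) for j in range(d)]
    # Scalar denominator 1 + v^T A^{-1} u
    denom = 1.0 + sum(v[i] * pu[i] for i in range(d))
    if abs(denom) < 1e-12:
        raise ValueError("rank-one update would make the matrix singular")
    return [[precision[i][j] - pu[i] * vp[j] / denom for j in range(d)]
            for i in range(d)]


def matmul(a, b):
    """Plain matrix product, used here only to check the result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Toy example (assumed values): a diagonal covariance and its known inverse.
cov = [[2.0, 0.0], [0.0, 4.0]]
prec = [[0.5, 0.0], [0.0, 0.25]]
u = [1.0, 2.0]

# Covariance after the rank-one change cov + u u^T.
new_cov = [[cov[i][j] + u[i] * u[j] for j in range(2)] for i in range(2)]
# Precision updated directly, without inverting new_cov.
new_prec = sherman_morrison_update(prec, u, u)
# new_cov @ new_prec should be numerically close to the identity matrix.
product = matmul(new_cov, new_prec)
```

The check multiplies the updated covariance by the updated precision. Getting the identity back confirms that the inverse was maintained without any explicit matrix inversion.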
Year: 2015 PMID: 26444880 PMCID: PMC4596621 DOI: 10.1371/journal.pone.0139931
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Datasets.
| Dataset | Instances (N) | Attributes (D) | Classes |
|---|---|---|---|
| breast-cancer | 286 | 9 | 2 |
| pima-diabetes | 768 | 8 | 2 |
| Glass | 214 | 9 | 7 |
| ionosphere | 351 | 34 | 2 |
| iris | 150 | 4 | 3 |
| labor-neg-data | 57 | 16 | 2 |
| soybean | 683 | 35 | 19 |
| MNIST | 70000 | 784 | 10 |
| CIFAR-10 | 60000 | 3072 | 10 |
Accuracy of different algorithms on standard datasets.
| Dataset | RF | NN | Lin. SVM | RBF SVM | IGMN | FIGMN |
|---|---|---|---|---|---|---|
| breast-cancer | 69.6 ± 9.1 | 75.2 ± 6.5 | 69.3 ± 7.5 | 70.6 ± 1.5 | 71.4 ± 7.4 | 71.4 ± 7.4 |
| pima-diabetes | 75.8 ± 3.5 | 74.2 ± 4.9 | 77.5 ± 4.4 | 65.1 ± 0.4 | 73.0 ± 4.5 | 73.0 ± 4.5 |
| Glass | 79.9 ± 5.0 | 53.8 ± 7.4 | 62.7 ± 7.8 | 68.8 ± 8.7 | 65.4 ± 4.9 | 65.4 ± 4.9 |
| ionosphere | 92.9 ± 3.6 | 92.6 ± 2.4 | 88.0 ± 3.5 | 93.5 ± 3.0 | 92.6 ± 3.8 | 92.6 ± 3.8 |
| iris | 95.3 ± 4.5 | 95.3 ± 5.5 | 96.7 ± 4.7 | 96.7 ± 3.5 | 97.3 ± 3.4 | 97.3 ± 3.4 |
| labor-neg-data | 89.7 ± 14.3 | 89.7 ± 14.3 | 93.3 ± 11.7 | 93.3 ± 8.6 | 94.7 ± 8.6 | 94.7 ± 8.6 |
| soybean | 93.0 ± 3.1 | 93.0 ± 2.4 | 94.0 ± 2.2 | 88.7 ± 3.0 | 91.5 ± 5.4 | 91.5 ± 5.4 |
| Average | 85.2 | 82.0 | 83.1 | 82.4 | 83.7 | 83.7 |
• statistically significant degradation
Number of Gaussian components created.
| Dataset | # of Components |
|---|---|
| breast-cancer | 14.2 ± 1.9 |
| pima-diabetes | 19.4 ± 1.3 |
| Glass | 15.9 ± 1.1 |
| ionosphere | 74.4 ± 1.4 |
| iris | 2.7 ± 0.7 |
| labor-neg-data | 12.0 ± 1.2 |
| soybean | 42.6 ± 2.2 |
Training and testing running times (in seconds).
| Dataset | IGMN Training | FIGMN Training | IGMN Testing | FIGMN Testing |
|---|---|---|---|---|
| MNIST | 32,544.69 | 1,629.81 | 3,836.06 | 230.92 |
| CIFAR-10 | 2,758,252* | 15,545.05 | - | 795.98 |
* estimated time projected from 100 data points
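The speedups implied by the running-time table can be recovered with simple division. The sketch below copies the table values into a hypothetical dictionary and computes the IGMN-to-FIGMN ratio for each phase. The CIFAR-10 IGMN training time appears to be the projected estimate mentioned in the footnote, so its ratio should be read as a rough figure.

```python
# Running times in seconds, copied from the table above.
# None marks the entry reported as "-" (no measurement available).
times = {
    "MNIST": {
        "igmn_train": 32544.69, "figmn_train": 1629.81,
        "igmn_test": 3836.06, "figmn_test": 230.92,
    },
    "CIFAR-10": {
        "igmn_train": 2758252.0, "figmn_train": 15545.05,
        "igmn_test": None, "figmn_test": 795.98,
    },
}


def speedup(slow, fast):
    """Ratio slow / fast, or None when either time is missing."""
    if slow is None or fast is None:
        return None
    return slow / fast


speedups = {
    name: {
        "train": speedup(row["igmn_train"], row["figmn_train"]),
        "test": speedup(row["igmn_test"], row["figmn_test"]),
    }
    for name, row in times.items()
}

for name, ratios in speedups.items():
    for phase, ratio in ratios.items():
        label = "n/a" if ratio is None else f"{ratio:.1f}x"
        print(f"{name:9s} {phase:5s} speedup: {label}")
```

On these numbers the training speedup is roughly 20x on MNIST (D = 784) and roughly 177x on CIFAR-10 (D = 3072), and the MNIST testing speedup is roughly 17x. The gain grows with dimensionality, as expected when a D³ term is replaced by a D² term.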
Fig 1. Training and testing times for both versions of the IGMN algorithm with a growing number of dimensions.