Prem Junsawang, Suphakant Phimoltares, Chidchanok Lursinsap.
Abstract
Due to the fast speed of data generation and collection from advanced equipment, the amount of data easily exceeds the available memory space, making it difficult to achieve high learning accuracy. Several methods based on the discard-after-learn concept have been proposed: some were designed to cope with a single incoming datum, while others were designed for a chunk of incoming data. Although the results of these approaches are rather impressive, most of them learn new incoming data by adding more neurons over time without any neuron-merging process, which obviously increases the computational time and space complexities. Only the online versatile elliptic basis function (VEBF) method introduced neuron merging to reduce the space-time complexity, and only for learning a single incoming datum. This paper proposed a method that further enhances the discard-after-learn concept for a streaming data-chunk environment, with low computational time and neural space complexities. A set of recursive functions for computing the relevant parameters of a new neuron, based on a statistical confidence interval, was introduced. The newly proposed method, named streaming chunk incremental learning (SCIL), increases the plasticity and adaptability of the network structure according to the distribution of the incoming data and their classes. When compared with other incremental-like methods on 11 benchmark data sets of 150 to 581,012 samples, with attributes ranging from 4 to 1,558, formed as streaming data, the proposed SCIL gave better accuracy and computational time on most data sets.
Year: 2019 PMID: 31498787 PMCID: PMC6733468 DOI: 10.1371/journal.pone.0220624
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
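The abstract's "set of recursive functions for computing the relevant parameters of a new neuron" can be illustrated with the standard chunk-wise merge of a running count, center, and scatter matrix. The sketch below shows that generic style of update only; it is not the paper's exact formulas (which additionally involve a statistical confidence interval), and the function name is hypothetical.

```python
import numpy as np

def merge_chunk(n_old, c_old, S_old, chunk):
    """Merge a newly arrived data chunk into a neuron's running statistics.

    n_old : int          number of samples covered so far
    c_old : (d,) array   running center (mean) vector
    S_old : (d,d) array  running scatter matrix (sum of outer products
                         of deviations from the mean)
    chunk : (m,d) array  new data chunk
    Returns the updated (n, c, S); the covariance matrix is S / n.
    """
    m = chunk.shape[0]
    c_chunk = chunk.mean(axis=0)
    dev = chunk - c_chunk
    S_chunk = dev.T @ dev
    n = n_old + m
    # weighted combination of the old center and the chunk center
    c = (n_old * c_old + m * c_chunk) / n
    diff = (c_chunk - c_old).reshape(-1, 1)
    # pooled scatter: both within-group scatters plus a between-means term
    S = S_old + S_chunk + (n_old * m / n) * (diff @ diff.T)
    return n, c, S
```

Because each chunk is summarized into (count, center, scatter) and then discarded, this matches the discard-after-learn setting: no raw data need be retained.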
Fig 1. An example of streaming data with different classes of various sizes in a 2-dimensional space.
The list of symbols and notations used in this paper.
| Symbol | Description |
|---|---|
| — | Number of neurons in |
| — | Total number of data covered by |
| — | Center vector corresponding to |
| — | Covariance matrix corresponding to |
| — | Matrix of orthogonal bases corresponding to |
| — | Width vector corresponding to |
| — | VEBF value of a given input vector |
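Given the notation above (a neuron's center vector, matrix of orthogonal bases, and width vector), the VEBF value of an input can be evaluated as in the earlier VEBF work, where a non-positive value means the input lies inside the neuron's ellipsoid. A minimal sketch, assuming the standard form ψ(x) = Σ_k ((x − c)ᵀu_k)² / a_k² − 1; the names are hypothetical:

```python
import numpy as np

def vebf_value(x, center, bases, widths):
    """Versatile elliptic basis function value of input x.

    center : (d,)   neuron center c
    bases  : (d,d)  orthonormal basis vectors u_k as columns
                    (e.g. eigenvectors of the neuron's covariance matrix)
    widths : (d,)   semi-axis widths a_k along each basis direction
    Returns psi(x) = sum_k ((x - c) . u_k)^2 / a_k^2 - 1;
    psi(x) <= 0 means x is covered by the neuron's ellipsoid.
    """
    proj = bases.T @ (np.asarray(x) - center)  # coordinates in the neuron's basis
    return float(np.sum((proj / widths) ** 2) - 1.0)
```

With `bases` set to the identity and unit `widths`, this reduces to the unit hypersphere test, which is a convenient sanity check.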
Fig 2. An example of a class-wise data chunk.
Fig 3. Two overlapping conditions for merging two neurons.
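The two overlapping conditions of Fig 3 are not spelled out in this record. A plausible geometric approximation is that two same-class neurons become merge candidates when one neuron's center lies inside the other's ellipsoid, or when their bounding spheres intersect. The sketch below implements only that approximation and may differ from the paper's exact conditions; all names are hypothetical.

```python
import numpy as np

def inside(x, center, bases, widths):
    """True when x lies inside the ellipsoid given by (center, bases, widths)."""
    proj = bases.T @ (x - center)
    return np.sum((proj / widths) ** 2) <= 1.0

def merge_candidates(c1, B1, w1, c2, B2, w2):
    """Approximate overlap test between two ellipsoidal neurons:
    (a) one neuron's center lies inside the other's ellipsoid, or
    (b) the bounding spheres (radius = largest width) intersect.
    """
    if inside(c1, c2, B2, w2) or inside(c2, c1, B1, w1):
        return True
    return np.linalg.norm(c1 - c2) <= w1.max() + w2.max()
```

Condition (b) is deliberately loose: bounding spheres can intersect while the ellipsoids do not, so a merge step would typically re-check the merged neuron's fit afterward.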
Description of each data set.
| Data set | Number of Attributes | Number of Instances | Size (attributes × instances) | Number of Classes | Area | Ratio of min/max |
|---|---|---|---|---|---|---|
| Iris | 4 | 150 | 600 | 3 | Life | 1.00 |
| Yeast | 8 | 1,484 | 11,872 | 10 | Life | 0.53 |
| Image segmentation | 19 | 2,310 | 43,890 | 7 | Computer | 1.00 |
| Waveform | 21 | 5,000 | 105,000 | 3 | Physical | 1.00 |
| Letter recognition | 16 | 20,000 | 320,000 | 26 | Computer | 0.90 |
| Forest cover type | 54 | 581,012 | 31,374,648 | 7 | Life | 0.01 |
| Liver | 7 | 345 | 2,415 | 2 | Life | 0.73 |
| Spambase | 57 | 4,601 | 262,257 | 2 | Computer | 0.65 |
| Internet | 1,558 | 2,359 | 3,675,322 | 2 | Computer | 0.19 |
| Protein-protein interaction | 398 | 11,188 | 4,452,824 | 2 | Physical | 1.00 |
| Miniboo | 50 | 130,065 | 6,503,250 | 2 | Physical | 0.39 |
Parameter settings for each data set.
| Data set | SCIL | VEBF | ILVQ |
|---|---|---|---|
| — | 0.7 | 0.3 | (21,17) |
| — | 0.4 | 1 | (70,35) |
| — | 0.7 | 1 | (180,130) |
| — | 0.7 | 1 | (70,110) |
| — | 0.7 | 0.7 | (80,100) |
| — | 0.15 | 1 | (16,80) |
| — | 0.4 | 1 | (90,18) |
| — | 0.7 | 0.7 | (200,60) |
| — | 0.7 | 1.2 | (155,60) |
| — | 0.7 | 0.5 | (200,150) |
| — | 0.05 | 0.7 | (280,180) |
Average classification accuracy with standard deviation of each data set.
| Data set | SCIL | VEBF | ILVQ | CILDA | RIL |
|---|---|---|---|---|---|
| Iris | 92.13 ± 5.92 | 95.73 ± 4.14* | 96.17 ± 3.47* | ||
| Image segmentation | 69.27 ± 10.52 | 78.48 ± 8.66 | 83.74 ± 2.11 | ||
| Liver | 59.77 ± 6.85 | 60.29 ± 5.61 | 62.75 ± 6.58 | ||
| Yeast | 42.62 ± 12.03 | 49.63 ± 3.03 | 25.72 ± 10.77 | ||
| Letter recognition | 58.64 ± 2.33 | 38.86 ± 3.33 | 55.51 ± 0.8 | ||
| Waveform | 70.79 ± 14.19 | 81.71 ± 1.34 | 78.21 ± 1.08 | ||
| Protein-protein interaction | 50.28 ± 3.52 | 59.73 ± 0.67 | 76.26 ± 0.59 | ||
| Miniboo | 59.65 ± 11.44 | 86.19 ± 0.5 | 87.58 ± 1.36* | ||
| Forest cover type | 63.58 ± 0.25 | 51.3 ± 13.12 | 70.11 ± 0.15 | ||
| Spambase | 68.77 ± 7.49 | 70.92 ± 2.44 | N/A | ||
| Internet | 64.3 ± 20.90 | N/A | N/A | ||
| Average rank | 4.45 | 3.10 | 3.50 | 2.22 | |
Average number of hidden neurons used, with standard deviation, for each data set.
| Data set | SCIL | VEBF | ILVQ | CILDA | RIL |
|---|---|---|---|---|---|
| Iris | 4.28 ± 0.98 | 23.04 ± 9.53 | 120 | | |
| Image segmentation | 19.68 ± 1.57 | 196.16 ± 56.53 | 1,848 | | |
| Liver | 31.48 ± 5.55* | 47.84 ± 4.5 | 276 | | |
| Yeast | 54.56 ± 7.93 | 149.36 ± 72.21 | 1,187.4 | | |
| Letter recognition | 235.44 ± 14.17 | 670.48 ± 51.47 | 16,000 | | |
| Waveform | 5.52 ± 2.93 | 177.84 ± 71.3 | 4,000 | | |
| Protein-protein interaction | 37.48 ± 13.43 | 190.2 ± 59.39 | 895.06 | | |
| Miniboo | 2,691 ± 423 | 2,285 ± 43 | 104,051.2 | | |
| Forest cover type | 2,830 ± 248 | 1,550 ± 90 | 464,809.6 | | |
| Spambase | 137.44 ± 27.27 | 3,681.2 | N/A | | |
| Internet | 137.04 ± 47.56 | N/A | N/A | | |
| Average rank | 2.81 | 3.64 | 5 | | |
Average computational time (s) with the standard deviation of each data set.
| Data set | SCIL | VEBF | ILVQ | CILDA | RIL |
|---|---|---|---|---|---|
| Iris | 0.04 ± 0.004 | 0.07 ± 0.005 | | | |
| Image segmentation | 1.23 ± 0.08 | 5.77 ± 0.59 | 27.26 ± 6.3 | | |
| Liver | 0.15 ± 0.05 | 0.36 ± 0.04 | 0.19 ± 0.03 | | |
| Yeast | 0.54 ± 0.07 | 2.43 ± 0.43 | 5.91 ± 1.12 | | |
| Letter recognition | 18.68 ± 0.84 | 109.78 ± 4.61 | 493 ± 145 | | |
| Waveform | 2.37 ± 0.6 | 11.93 ± 1.76 | 33.37 ± 8.7 | | |
| Protein-protein interaction | 2,266 ± 605 | 47.25 ± 2.13 | 5,624 ± 542 | | |
| Miniboo | 936 ± 106 | 603 ± 59 | 1,673 ± 153 | | |
| Forest cover type | 202,913 ± 60,915 | 38,034 ± 465 | 27,536 ± 1,395 | | |
| Spambase | 19.93 ± 1.24 | 8.28 ± 0.48 | N/A | | |
| Internet | 29,229 ± 1,967 | N/A | N/A | | |
| Average rank | 3.36 | 3.45 | 4.11 | | |
Comparison using a paired t-test with a significance level of 0.05 for accuracy over the course of the test between the SCIL and CIL methods on each data set.
| Data set | SCIL avg. accuracy ± s.d. (last ten chunks) | CIL avg. accuracy ± s.d. (last ten chunks) | SCIL accuracy (last chunk) | CIL accuracy (last chunk) | p-value | Reject/Accept |
|---|---|---|---|---|---|---|
| Iris | 96.88 ± 5.09 | 96.88 ± 5.09 | 95.00 | 95.00 | N/A | − |
| Image segmentation | 85.24 ± 5.19 | 92.38 | 0.000 | |||
| Liver | 54.45 ± 14.96 | 77.14 | 0.004 | |||
| Yeast | 46.64 ± 11.14 | 59.03 | 0.003 | |||
| Letter recognition | 87.62 ± 1.45 | 90.58 | 90.58 | 0.005 | ||
| Waveform | 85.11 ± 15.59 | 85.11 ± 15.59 | 82.50 | 82.50 | N/A | − |
| Protein-protein interaction | 83.47 ± 13.05 | 74.03 | 0.139 | |||
| Miniboo | 97.93 ± 0.84 | 98.52 | 0.000 | |||
| Spambase | 85.59 ± 14.60 | 93.58 | 0.147 | |||
| Internet | 94.19 ± 5.79 | 99.09 | 99.09 | 0.278 | ||
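The rejection decisions in the table above come from a paired t-test at the 0.05 significance level on matched per-chunk accuracies. A minimal sketch of the statistic itself (the sample values below are illustrative, not the paper's data):

```python
import math

def paired_t_statistic(a, b):
    """Paired t statistic for two matched samples a and b.

    With differences d = a - b, t = mean(d) / (s_d / sqrt(n)), where
    s_d is the sample standard deviation of d (n - 1 degrees of freedom).
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # unbiased variance
    return mean / math.sqrt(var / n)
```

With n paired chunks, |t| is then compared against the two-sided critical value t₀.₀₂₅,ₙ₋₁ (about 2.262 for n = 10) to reject or accept the null hypothesis of equal mean accuracy.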
Comparison using a paired t-test with a significance level of 0.05 for the number of used neurons over the course of the test between the SCIL and CIL methods on each data set.
| Data set | SCIL avg. neurons ± s.d. (last ten chunks) | CIL avg. neurons ± s.d. (last ten chunks) | SCIL neurons (last chunk) | CIL neurons (last chunk) | p-value | Reject/Accept |
|---|---|---|---|---|---|---|
| Iris | 4.00 ± 0.00 | 4 | N/A | − | ||
| Image segmentation | 11.22 ± 1.55 | 15 | 0.000 | |||
| Liver | 12.89 ± 6.84 | 24 | 0.002 | |||
| Yeast | 25.78 ± 3.42 | 29 | 0.000 | |||
| Letter recognition | 83.20 ± 0.40 | 84 | 0.000 | |||
| Waveform | 3.00 ± 0.00 | 3.00 ± 0.00 | 3 | 3 | N/A | − |
| Protein-protein interaction | 139.80 ± 5.19 | 150 | 0.000 | |||
| Miniboo | 65.00 ± 3.10 | 67 | 0.374 | |||
| Spambase | 28.11 ± 6.69 | 39 | 0.000 | |||
| Internet | 6.33 ± 0.94 | 7 | 7 | 0.278 | ||
Fig 4. Classification accuracy on the last ten data chunks for each data set.
Fig 5. The number of hidden neurons on the last ten data chunks for each data set.