Weronika Wegier, Pawel Ksieniewicz.
Abstract
With a growing number of tools and applications constantly producing massive amounts of data, processing and correctly classifying that data is becoming both harder and more important. The task is hindered by changes in the data distribution over time, known as concept drift, and by disproportion between classes, as in network attack detection or fraud detection. In this work, we propose modifications to two existing stream processing solutions, Accuracy Weighted Ensemble (AWE) and Accuracy Updated Ensemble (AUE), which have demonstrated their effectiveness in adapting to time-varying class distribution. The changes aim to improve their quality in binary classification of imbalanced data. The proposed modifications include aggregate metrics, such as F1-score, G-mean and balanced accuracy score, in the calculation of the member classifiers' weights, which affects the ensemble composition and the final prediction. The impact of data sampling on the algorithms' effectiveness was also examined. Comprehensive experiments were conducted to identify the most promising type of modification and to compare the proposed methods with existing solutions. The experimental evaluation shows an improvement in classification quality compared to the underlying algorithms and to other solutions for processing imbalanced data streams.
Keywords: classification; classifier ensembles; data streams; imbalanced data; oversampling
Year: 2020 PMID: 33286620 PMCID: PMC7517449 DOI: 10.3390/e22080849
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
Comparison of the data streams processed during the experimental evaluation of the modified models: the type of concept drift, the percentage of all samples belonging to the minority class, and the ratio between samples of the two classes.
| # | DRIFT TYPE | MINORITY CLASS % | CLASS RATIO |
|---|---|---|---|
| 1 | sudden | 5% | 1:19 |
| 2 | sudden | 10% | 1:9 |
| 3 | sudden | 20% | 1:4 |
| 4 | sudden | 30% | 3:7 |
| 5 | gradual | 5% | 1:19 |
| 6 | gradual | 10% | 1:9 |
| 7 | gradual | 20% | 1:4 |
| 8 | gradual | 30% | 3:7 |
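The class ratios in the table follow directly from the minority class percentage. A minimal sketch of that arithmetic, where the helper name `class_ratio` is hypothetical and introduced only for illustration:

```python
from fractions import Fraction


def class_ratio(minority_percent: int) -> str:
    """Return the minority:majority ratio in lowest terms."""
    # Fraction reduces e.g. 5/95 to 1/19 automatically.
    ratio = Fraction(minority_percent, 100 - minority_percent)
    return f"{ratio.numerator}:{ratio.denominator}"


for pct in (5, 10, 20, 30):
    print(f"{pct}% minority -> {class_ratio(pct)}")
```

For the four imbalance levels used here, this reproduces the ratios 1:19, 1:9, 1:4 and 3:7.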
Description of the proposed models, including the base ensemble algorithm, the implemented changes (how weights are calculated and which data sampling is used) and the labels shown on plots.
| # | BASE ENSEMBLE | WEIGHTING METHOD | SAMPLING | PLOT LABEL |
|---|---|---|---|---|
| 1 | AWE | proportional to G-mean | undersampling | u-AWE-g |
| 2 | AWE | proportional to balanced accuracy | undersampling | u-AWE-b |
| 3 | AWE | proportional to F1-score | undersampling | u-AWE-f |
| 4 | AWE | proportional to G-mean | oversampling | o-AWE-g |
| 5 | AWE | proportional to balanced accuracy | oversampling | o-AWE-b |
| 6 | AWE | proportional to F1-score | oversampling | o-AWE-f |
| 7 | AWE | proportional to G-mean | — | AWE-g |
| 8 | AWE | proportional to balanced accuracy | — | AWE-b |
| 9 | AWE | proportional to F1-score | — | AWE-f |
| 10 | AWE | in inverse proportion to MSE | undersampling | u-AWE |
| 11 | AWE | in inverse proportion to MSE | oversampling | o-AWE |
| 12 | AUE | proportional to G-mean | undersampling | u-AUE-g |
| 13 | AUE | proportional to balanced accuracy | undersampling | u-AUE-b |
| 14 | AUE | proportional to F1-score | undersampling | u-AUE-f |
| 15 | AUE | proportional to G-mean | oversampling | o-AUE-g |
| 16 | AUE | proportional to balanced accuracy | oversampling | o-AUE-b |
| 17 | AUE | proportional to F1-score | oversampling | o-AUE-f |
| 18 | AUE | proportional to G-mean | — | AUE-g |
| 19 | AUE | proportional to balanced accuracy | — | AUE-b |
| 20 | AUE | proportional to F1-score | — | AUE-f |
| 21 | AUE | in inverse proportion to MSE | undersampling | u-AUE |
| 22 | AUE | in inverse proportion to MSE | oversampling | o-AUE |
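The two kinds of change listed in the table, metric-proportional weighting and chunk resampling, can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, the metrics use their common binary definitions (G-mean as the geometric mean of recall and specificity), weights are normalized to sum to one as an assumption, and plain random resampling is assumed because this excerpt does not name the sampler.

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 as the minority (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(num, den):
    # Convention assumed here: an undefined ratio counts as 0.
    return num / den if den else 0.0


def f1_score(tp, fp, tn, fn):
    return _ratio(2 * tp, 2 * tp + fp + fn)


def g_mean(tp, fp, tn, fn):
    return math.sqrt(_ratio(tp, tp + fn) * _ratio(tn, tn + fp))


def balanced_accuracy(tp, fp, tn, fn):
    return 0.5 * (_ratio(tp, tp + fn) + _ratio(tn, tn + fp))


# Keys mirror the plot-label suffixes -f, -g and -b.
METRICS = {"f": f1_score, "g": g_mean, "b": balanced_accuracy}


def member_weights(y_true, member_predictions, metric):
    """Weight each ensemble member in proportion to its metric on one chunk."""
    scores = [METRICS[metric](*confusion_counts(y_true, y_pred))
              for y_pred in member_predictions]
    total = sum(scores)
    if total == 0:  # every member scored 0: fall back to equal weights
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def resample(samples, labels, mode, seed=0):
    """Balance a binary chunk: 'o' duplicates minority samples, 'u' drops majority ones."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((pos, neg), key=len)
    if not minority:  # nothing to balance against
        return list(samples), list(labels)
    if mode == "o":
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        keep = majority + minority + extra
    else:
        keep = minority + rng.sample(majority, len(minority))
    return [samples[i] for i in keep], [labels[i] for i in keep]


y_true = [1, 1, 0, 0, 0, 0, 0, 0]
preds_a = [1, 0, 0, 0, 0, 0, 0, 0]  # finds one of the two minority samples
preds_b = [0, 0, 0, 0, 0, 0, 0, 0]  # always predicts the majority class
for key in ("f", "g", "b"):
    print(key, member_weights(y_true, [preds_a, preds_b], key))
```

In this toy chunk the always-majority member gets zero weight under F1-score and G-mean but still keeps a nonzero weight under balanced accuracy, which shows how the choice of metric can change the ensemble's composition.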
Average value of the F1-score metric for all compared models and every data stream type, with a subscript listing the other methods that are statistically worse for the given stream type.
| # | METHOD | SUDDEN 5% | SUDDEN 10% | SUDDEN 20% | SUDDEN 30% | GRADUAL 5% | GRADUAL 10% | GRADUAL 20% | GRADUAL 30% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AWE | 0.385 | 0.496 | 0.690 | 0.780 | 0.358 | 0.495 | 0.674 | 0.760 |
| 2 | AWE | 0.384 | 0.486 | 0.704 | 0.781 | 0.355 | 0.483 | 0.681 | 0.758 |
| 3 | AWE | 0.415 | 0.515 | 0.722 | 0.785 | 0.380 | 0.505 | 0.690 | 0.761 |
| 4 | AWE | 0.410 | 0.547 | 0.720 | 0.783 | 0.375 | 0.507 | 0.690 | 0.761 |
| 5 | AWE | 0.433 | 0.577 | 0.720 | 0.784 | 0.393 | 0.539 | 0.688 | 0.763 |
| 6 | AWE | 0.476 | 0.612 | 0.734 | 0.791 | 0.426 | 0.567 | 0.699 | 0.767 |
| 7 | AWE | 0.451 | 0.579 | 0.722 | 0.784 | 0.419 | 0.538 | 0.681 | 0.755 |
| 8 | AWE | 0.421 | 0.600 | 0.725 | 0.785 | 0.377 | 0.548 | 0.686 | 0.756 |
| 9 | AWE | 0.486 | 0.627 | 0.742 | 0.791 | 0.445 | 0.569 | 0.692 | 0.760 |
| 10 | AWE | 0.359 | 0.429 | 0.628 | 0.740 | 0.345 | 0.449 | 0.624 | 0.744 |
| 11 | AWE | 0.358 | 0.464 | 0.663 | 0.741 | 0.305 | 0.442 | 0.646 | 0.741 |
| 12 | AWE | 0.397 | 0.550 | 0.674 | 0.744 | 0.348 | 0.518 | 0.679 | 0.763 |
| 13 | AUE | 0.410 | 0.582 | 0.740 | 0.810 | 0.377 | 0.548 | 0.707 | 0.787 |
| 14 | AUE | 0.403 | 0.567 | 0.733 | 0.807 | 0.374 | 0.541 | 0.708 | 0.786 |
| 15 | AUE | 0.429 | 0.598 | 0.750 | 0.818 | 0.394 | 0.557 | 0.714 | 0.791 |
| 16 | AUE | 0.509 | 0.657 | 0.776 | 0.828 | 0.458 | 0.604 | 0.741 | 0.805 |
| 17 | AUE | 0.494 | 0.645 | 0.756 | 0.819 | 0.454 | 0.607 | 0.737 | 0.803 |
| 18 | AUE | 0.523 | 0.663 | 0.779 | 0.831 | 0.464 | 0.610 | 0.743 | 0.806 |
| 19 | AUE | 0.544 | 0.671 | 0.775 | 0.821 | 0.470 | 0.613 | 0.735 | 0.796 |
| 20 | AUE | 0.499 | 0.646 | 0.757 | 0.815 | 0.456 | 0.611 | 0.732 | 0.794 |
| 21 | AUE | 0.546 | 0.682 | 0.780 | 0.827 | 0.479 | 0.618 | 0.740 | 0.797 |
| 22 | AUE | 0.393 | 0.543 | 0.746 | 0.813 | 0.366 | 0.522 | 0.707 | 0.788 |
| 23 | AUE | 0.467 | 0.610 | 0.760 | 0.820 | 0.421 | 0.563 | 0.724 | 0.800 |
| 24 | AUE | 0.447 | 0.642 | 0.766 | 0.820 | 0.347 | 0.547 | 0.736 | 0.798 |
| 25 | WAE | 0.382 | 0.571 | 0.745 | 0.805 | 0.299 | 0.460 | 0.698 | 0.774 |
| 26 | OOB | 0.488 | 0.529 | 0.624 | 0.679 | 0.424 | 0.524 | 0.624 | 0.682 |
| 27 | UOB | 0.349 | 0.440 | 0.605 | 0.682 | 0.250 | 0.412 | 0.581 | 0.678 |
Average value of the G-mean metric for all compared models and every data stream type, with a subscript listing the other methods that are statistically worse for the given stream type.
| # | METHOD | SUDDEN 5% | SUDDEN 10% | SUDDEN 20% | SUDDEN 30% | GRADUAL 5% | GRADUAL 10% | GRADUAL 20% | GRADUAL 30% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AWE | 0.792 | 0.791 | 0.826 | 0.845 | 0.781 | 0.804 | 0.822 | 0.832 |
| 2 | AWE | 0.791 | 0.771 | 0.836 | 0.845 | 0.781 | 0.777 | 0.826 | 0.831 |
| 3 | AWE | 0.804 | 0.780 | 0.844 | 0.848 | 0.788 | 0.789 | 0.828 | 0.833 |
| 4 | AWE | 0.781 | 0.799 | 0.842 | 0.846 | 0.773 | 0.785 | 0.827 | 0.833 |
| 5 | AWE | 0.789 | 0.819 | 0.842 | 0.847 | 0.779 | 0.810 | 0.826 | 0.834 |
| 6 | AWE | 0.801 | 0.831 | 0.847 | 0.851 | 0.784 | 0.815 | 0.830 | 0.836 |
| 7 | AWE | 0.683 | 0.733 | 0.807 | 0.836 | 0.678 | 0.725 | 0.786 | 0.817 |
| 8 | AWE | 0.583 | 0.729 | 0.809 | 0.837 | 0.573 | 0.712 | 0.786 | 0.818 |
| 9 | AWE | 0.649 | 0.735 | 0.815 | 0.840 | 0.631 | 0.714 | 0.786 | 0.820 |
| 10 | AWE | 0.753 | 0.704 | 0.761 | 0.804 | 0.764 | 0.744 | 0.769 | 0.815 |
| 11 | AWE | 0.718 | 0.707 | 0.783 | 0.805 | 0.723 | 0.709 | 0.780 | 0.810 |
| 12 | AWE | 0.543 | 0.681 | 0.762 | 0.798 | 0.474 | 0.647 | 0.769 | 0.819 |
| 13 | AUE | 0.805 | 0.832 | 0.858 | 0.868 | 0.794 | 0.822 | 0.843 | 0.853 |
| 14 | AUE | 0.801 | 0.824 | 0.852 | 0.865 | 0.791 | 0.819 | 0.843 | 0.853 |
| 15 | AUE | 0.811 | 0.840 | 0.863 | 0.874 | 0.795 | 0.823 | 0.845 | 0.856 |
| 16 | AUE | 0.824 | 0.859 | 0.881 | 0.881 | 0.811 | 0.844 | 0.865 | 0.867 |
| 17 | AUE | 0.820 | 0.853 | 0.866 | 0.874 | 0.812 | 0.846 | 0.862 | 0.866 |
| 18 | AUE | 0.830 | 0.866 | 0.882 | 0.883 | 0.816 | 0.849 | 0.866 | 0.868 |
| 19 | AUE | 0.639 | 0.749 | 0.837 | 0.863 | 0.581 | 0.712 | 0.810 | 0.847 |
| 20 | AUE | 0.595 | 0.733 | 0.824 | 0.858 | 0.562 | 0.708 | 0.808 | 0.845 |
| 21 | AUE | 0.643 | 0.756 | 0.839 | 0.868 | 0.592 | 0.713 | 0.813 | 0.848 |
| 22 | AUE | 0.804 | 0.814 | 0.859 | 0.868 | 0.789 | 0.813 | 0.840 | 0.854 |
| 23 | AUE | 0.804 | 0.841 | 0.869 | 0.875 | 0.783 | 0.822 | 0.852 | 0.863 |
| 24 | AUE | 0.531 | 0.726 | 0.830 | 0.862 | 0.428 | 0.632 | 0.808 | 0.847 |
| 25 | WAE | 0.480 | 0.669 | 0.815 | 0.852 | 0.377 | 0.547 | 0.781 | 0.830 |
| 26 | OOB | 0.708 | 0.686 | 0.735 | 0.754 | 0.641 | 0.706 | 0.745 | 0.759 |
| 27 | UOB | 0.757 | 0.757 | 0.776 | 0.774 | 0.718 | 0.744 | 0.763 | 0.772 |
Average value of the balanced accuracy score metric for all compared models and every data stream type, with a subscript listing the other methods that are statistically worse for the given stream type.
| # | METHOD | SUDDEN 5% | SUDDEN 10% | SUDDEN 20% | SUDDEN 30% | GRADUAL 5% | GRADUAL 10% | GRADUAL 20% | GRADUAL 30% |
|---|---|---|---|---|---|---|---|---|---|
| 1 | AWE | 0.795 | 0.796 | 0.827 | 0.846 | 0.784 | 0.806 | 0.823 | 0.833 |
| 2 | AWE | 0.794 | 0.791 | 0.837 | 0.846 | 0.784 | 0.794 | 0.827 | 0.832 |
| 3 | AWE | 0.807 | 0.805 | 0.846 | 0.849 | 0.791 | 0.803 | 0.830 | 0.834 |
| 4 | AWE | 0.785 | 0.801 | 0.843 | 0.848 | 0.777 | 0.787 | 0.829 | 0.834 |
| 5 | AWE | 0.794 | 0.821 | 0.843 | 0.848 | 0.783 | 0.812 | 0.827 | 0.835 |
| 6 | AWE | 0.807 | 0.834 | 0.849 | 0.852 | 0.791 | 0.818 | 0.832 | 0.838 |
| 7 | AWE | 0.711 | 0.753 | 0.816 | 0.840 | 0.711 | 0.744 | 0.795 | 0.822 |
| 8 | AWE | 0.690 | 0.758 | 0.818 | 0.840 | 0.686 | 0.744 | 0.797 | 0.822 |
| 9 | AWE | 0.705 | 0.765 | 0.825 | 0.844 | 0.696 | 0.747 | 0.799 | 0.825 |
| 10 | AWE | 0.757 | 0.710 | 0.762 | 0.806 | 0.767 | 0.750 | 0.771 | 0.816 |
| 11 | AWE | 0.727 | 0.712 | 0.785 | 0.806 | 0.728 | 0.714 | 0.782 | 0.812 |
| 12 | AWE | 0.650 | 0.712 | 0.771 | 0.802 | 0.636 | 0.706 | 0.786 | 0.825 |
| 13 | AUE | 0.809 | 0.835 | 0.859 | 0.869 | 0.797 | 0.825 | 0.844 | 0.855 |
| 14 | AUE | 0.805 | 0.828 | 0.853 | 0.866 | 0.794 | 0.822 | 0.844 | 0.854 |
| 15 | AUE | 0.815 | 0.843 | 0.864 | 0.875 | 0.799 | 0.826 | 0.846 | 0.857 |
| 16 | AUE | 0.829 | 0.861 | 0.882 | 0.882 | 0.816 | 0.846 | 0.866 | 0.868 |
| 17 | AUE | 0.825 | 0.855 | 0.867 | 0.875 | 0.817 | 0.848 | 0.863 | 0.867 |
| 18 | AUE | 0.835 | 0.868 | 0.883 | 0.884 | 0.821 | 0.851 | 0.867 | 0.869 |
| 19 | AUE | 0.713 | 0.781 | 0.846 | 0.867 | 0.683 | 0.756 | 0.824 | 0.851 |
| 20 | AUE | 0.694 | 0.768 | 0.833 | 0.862 | 0.676 | 0.754 | 0.822 | 0.850 |
| 21 | AUE | 0.714 | 0.787 | 0.849 | 0.872 | 0.687 | 0.757 | 0.826 | 0.853 |
| 22 | AUE | 0.807 | 0.819 | 0.861 | 0.870 | 0.792 | 0.816 | 0.841 | 0.855 |
| 23 | AUE | 0.809 | 0.843 | 0.870 | 0.876 | 0.789 | 0.824 | 0.853 | 0.864 |
| 24 | AUE | 0.676 | 0.766 | 0.839 | 0.865 | 0.634 | 0.726 | 0.824 | 0.853 |
| 25 | WAE | 0.654 | 0.739 | 0.826 | 0.855 | 0.620 | 0.691 | 0.802 | 0.836 |
| 26 | OOB | 0.743 | 0.724 | 0.755 | 0.766 | 0.695 | 0.735 | 0.760 | 0.768 |
| 27 | UOB | 0.761 | 0.759 | 0.778 | 0.776 | 0.722 | 0.747 | 0.765 | 0.773 |
Figure 1. Comparison of base algorithms and their modifications, showing average F1-score value for each chunk of the stream with gradual and sudden concept drifts and 5% of minority class samples.
Figure 2. Comparison of base algorithms and their modifications, showing average G-mean value for each chunk of the stream with gradual and sudden concept drifts and 5% of minority class samples.
Figure 3. Comparison of base algorithms and their modifications, showing average balanced accuracy score for each chunk of the stream with gradual and sudden concept drifts and 5% of minority class samples.
Figure 4. Comparison of the best proposed models with other methods of stream processing using the Test-Then-Train procedure on the stream with gradual and sudden concept drifts and 5% of minority class samples.
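The Test-Then-Train procedure named in the Figure 4 caption can be sketched as a loop over stream chunks in which each chunk is first used to score the current model and only then to update it. The sketch below is an illustration under stated assumptions, not the evaluation code behind the figures: `MajorityClassifier`, `accuracy` and `evaluate_test_then_train` are hypothetical names, the placeholder model stands in for an ensemble, and the first chunk is assumed to be used for training only.

```python
from collections import Counter


class MajorityClassifier:
    """Placeholder model: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, X, y):
        self.counts.update(y)
        return self

    def predict(self, X):
        label = self.counts.most_common(1)[0][0] if self.counts else 0
        return [label] * len(X)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_test_then_train(chunks, model, metric):
    """Score each chunk with the model trained on all earlier chunks, then train on it."""
    scores = []
    for index, (X, y) in enumerate(chunks):
        if index > 0:  # assumption: the first chunk only initializes the model
            scores.append(metric(y, model.predict(X)))
        model.partial_fit(X, y)
    return scores


# Three tiny chunks whose dominant label flips, imitating a sudden drift.
features = [[0.0]] * 4
chunks = [
    (features, [0, 0, 0, 1]),
    (features, [1, 1, 1, 1]),
    (features, [1, 1, 0, 1]),
]
print(evaluate_test_then_train(chunks, MajorityClassifier(), accuracy))  # [0.0, 0.75]
```

The drop to 0.0 on the second chunk and the recovery on the third mirror, in miniature, the per-chunk curves that this kind of evaluation produces around a drift.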