Martin Mundt1, Iuliia Pliushch1, Sagnik Majumder2, Yongwon Hong3, Visvanathan Ramesh1. 1. Department of Computer Science and Mathematics, Goethe University, 60323 Frankfurt am Main, Germany. 2. Department of Computer Science, The University of Texas at Austin, Austin, TX 78712, USA. 3. Department of Computer Science, Yonsei University, Seoul 03722, Korea.
Abstract
Modern deep neural networks are well known to be brittle in the face of unknown data instances and recognition of the latter remains a challenge. Although it is inevitable for continual-learning systems to encounter such unseen concepts, the corresponding literature appears to nonetheless focus primarily on alleviating catastrophic interference with learned representations. In this work, we introduce a probabilistic approach that connects these perspectives based on variational inference in a single deep autoencoder model. Specifically, we propose to bound the approximate posterior by fitting regions of high density on the basis of correctly classified data points. These bounds are shown to serve a dual purpose: unseen unknown out-of-distribution data can be distinguished from already trained known tasks towards robust application. Simultaneously, to retain already acquired knowledge, a generative replay process can be narrowed to strictly in-distribution samples, in order to significantly alleviate catastrophic interference.
Keywords:
catastrophic forgetting; continual deep learning; deep generative models; open-set recognition; variational inference
1. Introduction
Consider an empirically optimized deep neural network for a particular task, say, for the sake of simplicity, the classification of dogs and cats. Typically, such a system is trained in a closed-world setting [1] according to an isolated learning paradigm [2]. That is, we assume the observable world to consist of a finite set of known instances of dogs and cats, where training and evaluation are limited to the same underlying statistical data population. The training process is treated in isolation, i.e., the model parameters are inferred from the entire existing dataset at all times. However, the real world requires dealing with sequentially arriving tasks and data originating from potentially unknown sources.
In particular, should we wish to apply and extend the system to an open world, where several other animals (and non-animals) exist, there are two critical questions: (a) How can we prevent obvious mispredictions if the system encounters a new class? (b) How can we incorporate this new concept into our present system without full retraining? With respect to the former question, it is well known that neural networks yield overconfident mispredictions in the face of unseen unknown concepts [3], a realization that has recently resurfaced in the context of various deep neural networks [4,5,6]. With respect to the latter question, it is similarly well known that neural networks that are trained exclusively on newly arriving data will overwrite their representations and thus forget encoded knowledge, a phenomenon referred to as catastrophic interference or catastrophic forgetting [7,8]. Although we have worded the above questions in a way that naturally exposes their connection (to identify what is new and to think about how new concepts can be incorporated), they are largely subject to separate treatment in the respective literature.
While open-set recognition [1,9,10] aims to explicitly identify novel inputs that deviate from already observed instances, the existing continual-learning literature predominantly concentrates its efforts on finding mechanisms to alleviate catastrophic interference (see [11] for an algorithmic survey).
In particular, the indispensable system component to distinguish seen from unseen unknown data, both as a guarantee for robust application and to avoid the requirement of explicit task labels for prediction, is generally missing from recent continual-learning works. This gap motivates the present work. The underlying connecting element is motivated by the prior work of Bendale and Boult [12], who proposed to leverage extreme value theory (EVT) to address open-set detection in deep neural networks. The authors suggested modifying softmax prediction scores on the basis of feature space distances in black-box discriminative models. Although this approach is promising, it comes with the substantial caveat that purely discriminative networks are prone to encode noise as features [13] or to fall for the simplest discriminative solution that neglects meaningful features [14]. Inspired by these insights, we set out to connect open-set recognition and continual learning, while overcoming present limitations through treatment from a generative modeling perspective.
Our specific contribution is to unify the prevention of catastrophic interference in continual learning with open-set recognition in a single model. Specifically, we extend prior EVT works [9,10,12] to a natural formulation on the basis of the aggregate posterior in variational inference with deep autoencoders [15,16].
By identifying out-of-distribution instances, we can detect unseen unknown data and prevent false predictions; by explicitly generating in-distribution samples from areas of high probability density under the aggregate posterior, we can simultaneously circumvent rehearsal of ambiguous uninformative examples. This leads to robust application, while significantly reducing catastrophic interference. We empirically corroborate our approach in terms of improved out-of-distribution detection performance and simultaneously reduced catastrophic interference. We further demonstrate benefits through recent deep generative modeling advances, such as autoregression [2,17,18] and introspection [19,20], validated by scaling to high-resolution color images.
1.1. Background and Related Work
1.1.1. Continual Learning
In isolated supervised learning, the core assumption is the presence of i.i.d. data at all times, and training is conducted using a dataset consisting of N pairs of data instances and their corresponding labels for C classes. In contrast, in continual learning, data arrives sequentially in T disjoint sets, each with its own number of classes. It is assumed that only the data of the current task is available. Without additional mechanisms, tuning on such a sequence will lead to catastrophic interference [7,8], i.e., representations of former tasks being overwritten through present optimization. A recent review of many continual-learning algorithms to prevent said interference was provided by Parisi et al. [11]. Here, we present a brief summary of the key underlying principles.
Alleviating catastrophic interference is most prominently addressed from two angles. Regularization methods, such as synaptic intelligence (SI) [21] or elastic weight consolidation (EWC) [22], explicitly constrain the weights during continual learning to avoid drifting too far away from the previous tasks’ solutions. In a related picture, learning without forgetting [23] uses knowledge distillation [24] to regularize the end-to-end functional.
Rehearsal methods, on the other hand, store data subsets from distributions belonging to old tasks or generate samples in pseudo-rehearsal [25]. The central component of rehearsal is thus the selection of significant instances. For methods such as incremental classifier and representation learning (iCarl) [26], it is therefore common to resort to auxiliary techniques, such as the nearest-mean classifier [27] or core sets [28]. Inspired by complementary learning systems [29], dual-model approaches sample data from a separate generative memory.
In a bio-inspired incremental learning architecture (GeppNet) [30], long short-term memory [31] is used for storage, whereas generative replay [32] samples from an additional generative adversarial network (GAN) [33].
As detailed in Variational Generative Replay (VGR) [34,35], methods with a Bayesian perspective encompass a natural capability for continual learning by making use of the learned distribution. Existing works nevertheless fall into the above two categories and their combination: a prior-based approach using the former task’s approximate posterior as the new task’s prior [36], or estimating the likelihood of former data through generative replay or other forms of rehearsal [34,37]. Crucially, the success of many continual-learning techniques can be attributed primarily to the considered evaluation scenario. With the exception of VGR [34], the majority of the above techniques train a separate classifier per task and thus either require the explicit storage of task labels or assume the presence of a task oracle during evaluation. This multi-head scenario prevents “cross-talk” between classifier units by not sharing them; such cross-talk would otherwise rapidly decay the accuracy, as newly introduced classes directly confuse existing concepts. While this is acceptable as a means to limit catastrophic interference, it also signifies a major limitation in practical applications. Even though VGR [34] uses a single classifier, its authors trained a separate generative model per task to avoid catastrophic interference in the generator.
Our approach builds upon these previous works and leverages variational inference in deep generative models. However, we propose to tie the prevention of catastrophic interference to open-set recognition through a natural mechanism based on the aggregate posterior in a single model.
1.1.2. Out-of-Distribution and Open Set Recognition
The above-mentioned literature focuses its efforts predominantly on addressing catastrophic interference. Even though continual learning is the desideratum, the corresponding evaluation is thus conducted in a closed-world setting, where instances that do not belong to the observed data distribution are not encountered. In reality, this is not guaranteed, as users could provide arbitrary inputs or unknowingly present the system with novel inputs that deviate substantially from previously seen instances. Our models thus require the ability to identify unseen examples in the unconstrained open world and categorize them as either belonging to the already known set of classes or as presently being unknown. We provide a small overview of approaches that aim to address this question in deep neural networks. A comprehensive survey was provided by Boult et al. [1].
As the simplest approach, the aim of calibration works is to separate known from unknown inputs through prediction confidence, often by fine-tuning or re-training an already existing model. In the out-of-distribution detector for neural networks (ODIN) [38], this is addressed through perturbations and temperature scaling, while Lee et al. [39] used a separately trained GAN to generate out-of-distribution samples from low probability densities and explicitly reduced their confidence through the inclusion of an additional loss term. Similarly, the objectosphere loss [40] defines an objective that explicitly aims to maximize entropy for unknown inputs that are available upfront.
As we do not have access to future data a priori, by definition, a naive conditioning or calibration on unseen unknown data is infeasible. The commonly applied thresholding is insufficient, as overconfident prediction values cannot be prevented [3]. One might expect Bayesian neural network models [41] to intrinsically reject statistical outliers through model uncertainty [34] and thereby overcome this limitation of overconfident prediction values.
For use with deep neural networks, it was suggested that stochastic forward passes with Monte-Carlo Dropout (MCD) [42] can provide a suitable approximation. However, the closed-world assumption in training and evaluation still persists [1]. In addition, variational approximations in deep networks [15,34,37,43] and corresponding uncertainty estimates suffer from similar overconfidence, and the distinction of unseen out-of-distribution data from already trained knowledge is known to be unsatisfactory [5,6].
A more formal approach was suggested in works based on open-set recognition [9]. The key here is to limit predictions originating from open space, that is, the area in obtained embeddings that is outside of a small radius around previously observed training examples. Without re-training, post hoc calibration or modifying loss functions, one approach to open-set recognition in deep networks is through extreme-value theory (EVT) [10,12]. Here, limiting the threat of overconfidence is based on monotonically decreasing the recognition function’s probability with respect to increasing distance of instances to the feature embedding of known training points. The Weibull distribution, as one member of the family of extreme value distributions, has been empirically demonstrated to work well in conjunction with distances in the penultimate deep network layer as the underlying feature space. On the basis of extreme values relative to this layer’s average activation values, the authors devised a procedure to revise the Softmax prediction values, referred to as OpenMax.
In a similar spirit, our work avoids relying on predictive values, while also moving away from empirically chosen deep neural network feature spaces. We instead propose to use EVT to bound the approximate posterior in variational inference. We thus directly operate on the underlying (lower-bound to the) data distribution and the generative factors.
This additionally allows us to constrain the generative replay to distribution inliers, which further alleviates catastrophic interference.
2. Materials and Methods
2.1. Unifying Catastrophic Interference Prevention with Open Set Recognition
We first summarize the preliminaries on continual learning from a perspective of variational inference in deep generative models [15,43]. We then proceed by bridging the improved prevention of catastrophic interference in continual learning with the detection of unseen unknown data in open-set recognition.
2.1.1. Preliminaries: Learning Continually through Variational Auto-Encoding
We start with a problem scenario similar to the one introduced in “Auto-Encoding Variational Bayes” [15], i.e., we assume that there exists a data generation process responsible for the creation of the labeled data given some random latent variable. We consider a model with a shared encoder with variational parameters, as well as a decoder and a linear classifier with their respective parameters. The joint probabilistic encoder learns an encoding to a latent variable, over which a unit Gaussian prior is placed.
Using variational inference, the encoder’s purpose is to approximate the true posterior. The probabilistic decoder and probabilistic linear classifier then return the conditional probability density of the input and the target y under the respective generative model, given a sample from the approximate posterior. This yields a joint generative model over inputs, targets and latent variables, in which the input and the target are each generated conditionally on the latent variable. For variational inference with this model, a lower bound (Equation (1)), summed over all elements in the dataset n ∈ D, is optimized. This bound combines the expected log-likelihoods of the decoder and the classifier under the approximate posterior with the Kullback-Leibler (KL) divergence between the approximate posterior and the prior.
In other words, the right-hand side of Equation (1) defines our loss. This model can be seen as employing a variant of a (semi-)supervised variational auto-encoder (VAE) [16] with a beta term [44], where, in addition to approximating the data distribution, the model learns to incorporate the class structure into the latent space. Without the blue terms, i.e., those that involve the labels, the original unsupervised VAE formulation [15] is recovered. This forms the basis for continual learning with open-set recognition as discussed in the subsequent section. An illustration of the model is shown in Figure 1.
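For reference, the lower bound described above can be written out as a worked equation. This is a sketch under assumed notation, since the symbols are not fixed in the text here: $q_{\theta}(z \mid x)$ denotes the probabilistic encoder, $p_{\phi}(x \mid z)$ the decoder, $p_{\xi}(y \mid z)$ the linear classifier, $p(z)$ the unit Gaussian prior and $\beta$ the weight of the KL term.

```latex
\mathcal{L}\left(x^{(n)}, y^{(n)}; \theta, \phi, \xi\right) =
  \mathbb{E}_{q_{\theta}(z \mid x^{(n)})}\left[
      \log p_{\phi}\left(x^{(n)} \mid z\right)
    + \log p_{\xi}\left(y^{(n)} \mid z\right)
  \right]
  - \beta \, \mathrm{KL}\left( q_{\theta}\left(z \mid x^{(n)}\right) \,\|\, p(z) \right)
```

In this form, dropping the classifier term $\log p_{\xi}(y^{(n)} \mid z)$ leaves the familiar unsupervised objective with a weighted KL term.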
Figure 1
A joint continual-learning model consisting of a shared probabilistic encoder, a probabilistic decoder and a probabilistic classifier. For open-set recognition and generative replay with outlier rejection, extreme-value theory (EVT) based bounds on the basis of the approximate posterior are established.
Abstracting away from the mathematical detail and speaking informally about the intuition behind the model, we first take a data input and encode it into two vectors. These vectors represent the mean and standard deviation of a Gaussian distribution. Using the reparametrization trick, a sample from this distribution is then calculated. During training, the respective embedding, also referred to as the latent space, is encouraged to follow a unit Gaussian distribution through the minimization of the Kullback-Leibler divergence. A linear classifier that operates directly on this latent embedding to predict a class for a sample additionally ensures that the obtained distribution is clustered according to the classes.
Examples of such fits are shown in the later Figure 2. Finally, the decoder takes the latent variable as input and reconstructs the original data input during training. Once the model has finished training, we can also directly draw a sample from the Gaussian distribution, obtain a latent sample and generate a novel data point directly, without the need to compute the encoder first. A corresponding full and formal derivation of Equation (1), the lower bound to the joint distribution, is supplied in Appendix A.1.
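The informal description above (two encoder vectors, the reparametrization trick and a KL penalty towards a unit Gaussian) can be illustrated with a minimal sketch. It assumes a diagonal Gaussian approximate posterior; the function names, the example encoder outputs and the `beta` argument are hypothetical and only serve the illustration, and the decoder and classifier log-likelihoods are passed in as plain numbers rather than computed by networks.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element-wise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_unit_gaussian(mu, sigma):
    """Closed-form KL divergence between N(mu, diag(sigma^2)) and N(0, I)."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))


def negative_lower_bound(log_px_given_z, log_py_given_z, mu, sigma, beta=1.0):
    """Negative of the lower bound for one data point (a quantity to minimize).

    log_px_given_z and log_py_given_z stand for the decoder and classifier
    log-likelihoods evaluated at a single latent sample z (assumed given).
    """
    reconstruction_and_classification = log_px_given_z + log_py_given_z
    return -reconstruction_and_classification + beta * kl_to_unit_gaussian(mu, sigma)


rng = random.Random(0)
mu, sigma = [0.5, -1.0], [1.0, 0.5]   # hypothetical encoder outputs
z = reparameterize(mu, sigma, rng)    # latent sample passed to decoder and classifier
```

Writing the sample as a deterministic function of `mu`, `sigma` and external noise is what allows gradients to flow back into the encoder in an actual implementation.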
Figure 2
2-D latent space aggregate posterior visualization for continually learned MNIST (Modified National Institute of Standards and Technology database). From left to right, the latent spaces for four, six and then eight classes are shown. This is best viewed in color.
Without further constraints, one could continually train the above model by sequentially accumulating and optimizing Equation (1) over all currently present tasks. Being based on the accumulation of real data, this provides an upper bound to the achievable performance in continual learning. However, this form of continued training is generally infeasible if only the most recent task’s data is assumed to be available. Making use of the model’s generative nature, we can follow previous works [34,37] and estimate the likelihood of former data through generative replay (Equations (2) and (3)): the loss is evaluated on the current task’s real data and, in addition, on data generated by the model itself.
Here, each replayed instance is a sample from the generative model with its corresponding classifier label, and the number of replayed instances equals the number of instances of all previously seen tasks. In this way, the expectation of the log-likelihood for all previously seen tasks is estimated, and the dataset at any point in time is a concatenation of past data generations and the current task’s real data.
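The construction of the training set under generative replay can be sketched as follows. This is an illustrative sketch only: `toy_decode` and `toy_classify` are hypothetical stand-ins for the previous model's decoder and latent-space classifier, and drawing the latent sample from a unit Gaussian prior is taken from the informal model description above.

```python
import random


def build_replay_dataset(decode, classify, num_replay, latent_dim, current_data, rng):
    """Concatenate generated (x, y) pairs for past tasks with the current task's real data."""
    replayed = []
    for _ in range(num_replay):
        # Latent sample from the unit Gaussian prior.
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        # Generated input and the label assigned by the classifier for the same z.
        replayed.append((decode(z), classify(z)))
    return replayed + list(current_data)


def toy_decode(z):
    """Hypothetical stand-in for the decoder of the previously trained model."""
    return [2.0 * v for v in z]


def toy_classify(z):
    """Hypothetical stand-in for the latent-space classifier of the previous model."""
    return 0 if z[0] < 0.0 else 1


rng = random.Random(0)
current_task = [([0.1, 0.2], 2), ([0.3, 0.4], 3)]  # real data of the newest task
dataset = build_replay_dataset(toy_decode, toy_classify, 4, 2, current_task, rng)
```

The returned list plays the role of the concatenated dataset on which the loss is then optimized for the current task.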
2.1.2. Open Set Recognition and Generative Replay with Statistical Outlier Rejection
Trained naively in the above fashion, our model will unfortunately suffer from accumulated errors with each successive iteration of generative replay, similar to current approaches in the literature. To avoid this, we would alternatively require the training of multiple encoders to approximate each task’s posterior individually, as in variational continual learning (VCL) [36], or train multiple generators, as in VGR [34]. We posit that the main challenge is that high-density areas under the prior are not necessarily reflected in the structure of the aggregate posterior [45]. The latter refers to the practically obtained encoding [46], i.e., the approximate posterior averaged over the data.
To provide intuition, we illustrate this prior-posterior discrepancy on the obtained two-dimensional latent encodings for a continually trained supervised MNIST (Modified National Institute of Standards and Technology database) [47] model in Figure 2. Here, we can make two observations: to preserve the inherent data structure, the aggregate posterior deviates from the prior. In fact, this is further amplified by the imposed necessity for linear class separation and the beta term in Equation (1); however, we note that the discrepancy is desired even in completely unsupervised scenarios [45,46].
The underlying rationale is that there needs to be a balance in the effective latent encoding overlap [48], which can best be summarized with a direct quote from the recent work of Mathieu et al. [49]: “The overlap is perhaps best understood by considering extremes: with too little the latents effectively become a lookup table; too much, and the data and latents do not convey information about each other. In either case, meaningfulness of the latent encodings is lost.” (p. 4). Additional discussion on the role of beta can be found in Appendix A.2.
Thus, the generated data from low-density regions of the aggregate posterior do not generally correspond to the encountered data instances.
Conversely, data instances that fall into high-density regions under the prior should not generally be considered as statistical inliers with respect to the observed data distribution; recall Figure 2. This boundary between low- and high-density regions forms the basis for a natural connection between open-set recognition and continual learning: generate from high-density regions and reject novel instances that fall into low-density regions.
Ideally, we could find a solution by replacing the prior in the KL divergence of Equation (1) with the aggregate posterior and, respectively, sampling from it in Equations (2) and (3). Even though using the aggregate posterior as a subsequent prior is the objective in multiple recent works, it can be challenging in high dimensions, lead to over-fitting or come at the expense of additional hyper-parameters [45,50,51]. To avoid finding an explicit representation for the multi-modal aggregate posterior, we draw inspiration from the EVT-based OpenMax approach [12] in deep neural networks. However, instead of using knowledge about extreme distances in penultimate layer activations to modify a Softmax prediction, we now propose to apply EVT on the basis of the class conditional aggregate posterior.
In this view, any sample can be regarded as statistically outlying if its distance to the classes’ latent means is extreme with respect to what has been observed for the majority of correctly predicted data instances, i.e., the sample falls into a region of low density under the aggregate posterior and is less likely to belong to the observed data distribution. To obtain bounds on the aggregate posterior, we consider all instances that are correctly classified at the end of task t. For these, we first define the mean latent vector for each class and the respective set of latent distances of the instances to their class mean (Equation (5)), measured under a chosen distance metric. We proceed to model this set of distances with a per-class heavy-tail Weibull distribution for a given tail size.
As these distances are based on the class conditional approximate posterior, we can thus bound the latent space regions of high density. The tightness of the bound is characterized through the tail size, which can be seen as a prior belief with respect to the outlier quantity assumed to be inherently present in the data distribution. The choice of distance metric determines the nature and dimensionality of the obtained distance distribution. For our experiments, we find that the cosine distance, and thus a univariate Weibull distance distribution per class, seems to be sufficient. Using the cumulative distribution function of this Weibull model, we can now estimate any sample’s outlier (or inlier) probability (Equation (6)): the cumulative distribution function of each class is evaluated at the sample’s distance to that class’ latent mean, and the minimum returns the smallest outlier probability across all classes. If this outlier probability is larger than a prior rejection probability, the instance can be considered as unknown. Such a formulation, which we term open variational auto-encoder (OpenVAE), now provides us with the means to learn continually and identify unknown data:
For a novel data instance, Equation (6) yields the outlier probability based on the probabilistic encoder, and a false overconfident classifier prediction can be avoided.
To mitigate catastrophic interference, Equation (6) can be used on top of sampling from the prior to constrain the generative replay (Equation (3)) to the aggregate posterior, thus avoiding the need to sample from it directly.
To give an illustration of the benefits, we show the generated MNIST [47] and larger resolution flower images [52] together with their outlier percentage in Figure 3. In practical application, we discard the ambiguous examples that are due to low-density regions and thus a high outlier probability. Even though we conduct sampling with rejection, note that this is computationally efficient, as we only need to evaluate the heavy probabilistic decoder for accepted, statistically inlying examples, and the cost of sampling from the prior and computing Equation (6) is almost negligible in comparison.
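The two uses of the outlier probability described above, rejecting unknown inputs and restricting replay to inliers, can be sketched as follows. This is a minimal sketch under stated assumptions: the per-class Weibull parameters (`shape`, `scale`, `loc`) and class means are assumed to have been fitted beforehand on the tail of the class-wise distance sets (the fitting step itself is not shown), and all names and numeric values are hypothetical. The outlier probability of a latent vector is taken as the Weibull cumulative distribution function 1 - exp(-(|d - loc| / scale)^shape) of its cosine distance d to a class mean, minimized over classes.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


def weibull_cdf(distance, shape, scale, loc=0.0):
    """Weibull cumulative distribution function at the shifted distance."""
    return 1.0 - math.exp(-((abs(distance - loc) / scale) ** shape))


def outlier_probability(z, class_models):
    """Smallest per-class outlier probability of latent vector z."""
    return min(
        weibull_cdf(cosine_distance(m["mean"], z), m["shape"], m["scale"], m["loc"])
        for m in class_models.values()
    )


def is_unknown(z, class_models, rejection_prior):
    """Flag z as unknown if even its closest class deems it an outlier."""
    return outlier_probability(z, class_models) > rejection_prior


def sample_inlier_latents(num_samples, latent_dim, class_models, rejection_prior,
                          rng, max_tries=10000):
    """Rejection sampling: keep prior samples that are statistical inliers."""
    accepted, tries = [], 0
    while len(accepted) < num_samples and tries < max_tries:
        tries += 1
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        if not is_unknown(z, class_models, rejection_prior):
            accepted.append(z)  # only these would be passed to the decoder
    return accepted


# Hypothetical, already fitted per-class models in a 2-D latent space.
models = {
    0: {"mean": [1.0, 0.0], "shape": 2.0, "scale": 0.3, "loc": 0.0},
    1: {"mean": [-1.0, 0.0], "shape": 2.0, "scale": 0.3, "loc": 0.0},
}
replay_latents = sample_inlier_latents(20, 2, models, 0.9, random.Random(0))
```

Because the check only involves distances in the latent space, rejected samples never reach the decoder, which is the source of the efficiency noted above.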
Figure 3
Generated images and their corresponding class c obtained from the classifier, together with their open-set outlier percentage in our proposed open variational auto-encoder (OpenVAE). Image quality degradation and class ambiguity can be observed with increasing outlier likelihood. Generated MNIST images are from the 2-D latent space of Figure 2, with one predicted class per panel (top left, top right, bottom left and bottom right). Generated larger-resolution flower images are based on a 60-dimensional latent space of a model trained with introspection (see experiments and Appendix A.3), which are classified as “sunflower” (top) and “daisy” (bottom).
3. Results
Instead of presenting a single experiment for continual learning in the constant presence of outlying non-task data, we chose to empirically corroborate our proposed approach in two experimental parts. The first section is dedicated to out-of-distribution detection, where we demonstrate the advantages of EVT in our generative model formulation. We then proceed to showcase how catastrophic interference is also mitigated by confining generative replay to aggregate posterior inliers in class incremental learning.
We emphasize that, whereas the sections are presented individually, our approach’s uniqueness lies in using a core underlying mechanism to address both challenges simultaneously. The rationale behind choosing this form of presentation is to help readers better contextualize the contribution of OpenVAE within the existing literature as, to the best of our knowledge, there exists no other work that yields adequate continual classification accuracy while being able to robustly recognize unknown data instances. As such, we will now see that existing continual-learning approaches provide no suitable mechanism to overcome the challenge of providing robust predictions when data outside the known benchmark set are included.
3.1. Open Set Recognition
We experimentally highlight OpenVAE’s ability to distinguish unknown task data from data belonging to known tasks to avoid overconfident false predictions.
Experimental Set-Up and Evaluation
In summary, our goal is two-fold. The typical goal is to train on an initial task and correctly classify the held-out test data for this task. That is, we desire a large average classification test accuracy. In addition, to ensure that this classification is robust to unknown data, we desire a large value for a second kind of accuracy. Our simultaneous goal is to consider all test data of already trained tasks as inlying, while successfully identifying 100% of completely unknown datasets as outliers.
For this purpose, we evaluate OpenVAE’s and other models’ capability to distinguish the in-distribution test set of the respective training dataset, i.e., MNIST [47], FashionMNIST [53] or AudioMNIST [54], from the other two and from several unknown datasets: Kuzushiji-MNIST (KMNIST) [55], Street-View House Numbers (SVHN) [56] and Canadian Institute for Advanced Research (CIFAR) datasets (in both versions with 10 and 100 classes) [57]. Here, the (Fourier-transformed) audio data is included to highlight the extent of the challenge, as not even a different modality is easy to detect without our proposed approach. In practice, we evaluate three criteria according to which a decision of whether a data instance is an outlier can be made:
(a) The classifier’s predictive entropy, as recently suggested to work surprisingly well in deep networks [58] but technically well known to be overconfident [3]. The intuition here is that the predictive entropy considers the probability of all other classes and is at a maximum if the distribution is uniform, i.e., when the confidence in the prediction is low.
(b) The generative model’s obtained negative log-likelihood, to concur with previous findings [5,6] on overconfidence in generative models. On the basis of Equation (1), the intuition is that the negative log-likelihood should be much larger for unseen data.
(c) Our suggested OpenVAE aggregate posterior-based EVT approach, according to the outlier likelihood introduced in Equation (6).
Results
Figure 4 provides a qualitative intuition behind the three criteria and the respective percentage of the total dataset being considered as outlying for FashionMNIST. Consistent with Nalisnick et al. [6], we can observe that the reconstruction loss can distinguish between the known tasks’ test data and some unknown datasets but fails for others. In the case of the classifier predictive entropy, depending on the exact choice of entropy threshold, generally only a partial separation can be achieved. Furthermore, both of these criteria pose the additional challenge that the results are highly dependent on the choice of the precise cut-off value. In contrast, the test data from the known tasks is regarded as inlying across a wide range of rejection priors for Equation (6), and the majority of other datasets are consistently regarded as outlying by our introduced OpenVAE approach.
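The predictive-entropy criterion and its intuition (maximal for a uniform prediction, small for a confident one) can be made concrete with a short sketch. The logits used here are made-up values for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# A uniform prediction attains the maximum entropy log(number of classes).
uniform = predictive_entropy(softmax([0.0, 0.0, 0.0, 0.0]))
# A confident prediction has entropy close to zero.
confident = predictive_entropy(softmax([10.0, 0.0, 0.0, 0.0]))
```

An overconfident classifier can produce a low entropy, as in the second case, even for an input from an unknown dataset, which is why this criterion alone separates the datasets only partially.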
Figure 4
Model trained on FashionMNIST evaluated on unknown datasets. Robust classification of a known dataset (percentage of dataset outliers at 0%), while correctly flagging unknown datasets as outlying (percentage of dataset outliers at 100%), occurs when the solid green curve is separated from any of the colored dashed curves. (Left) Classifier entropy is insufficient to separate unknown from the known task’s test data. (Center) Reconstruction log-likelihood allows for a partial distinction. (Right) Our posterior-based EVT approach in OpenVAE considers the large majority of unknown data as statistical outliers across a wide range of rejection priors.
Corresponding quantitative outlier detection accuracies are provided in Table 1. To find thresholds for the sensitive entropy and reconstruction curves, we used a validation split to determine the respective value at which 95% of the validation data is considered as inlying, before using these thresholds to determine outlier counts for the known tasks’ test set as well as for the other datasets. In an intuitive picture, we “trace” the solid green curve of Figure 4 for a validation set of the originally trained dataset, check where we intersect with the x-axis for a y-axis value of 5% and then fix the corresponding criterion’s value at this point as an outlier rejection threshold for testing. We then report the percentage of the test set being considered as outlying, together with the percentage for various unknown datasets. In the table, we additionally extend our intuition of Figure 4 to investigate what would happen if we had not trained a single VAE model that learned reconstruction and classification according to Equation (1), but separate models. For this purpose, we also investigate a dual-model approach, i.e., a purely discriminative deep-neural-network-based classifier and a separate unsupervised VAE (Equation (1) without blue terms).
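The threshold selection described above can be sketched in a few lines. This is an illustrative sketch with made-up scores: a score stands for any of the criteria (entropy, negative log-likelihood or outlier probability), larger scores are assumed to indicate outliers, and the function names are hypothetical.

```python
import math


def threshold_at_inlier_fraction(validation_scores, inlier_fraction=0.95):
    """Smallest score such that at least `inlier_fraction` of validation scores do not exceed it."""
    ordered = sorted(validation_scores)
    # The small epsilon guards against floating-point fuzz in the product.
    k = max(1, math.ceil(inlier_fraction * len(ordered) - 1e-9))
    return ordered[k - 1]


def outlier_percentage(scores, threshold):
    """Percentage of instances whose score exceeds the rejection threshold."""
    return 100.0 * sum(s > threshold for s in scores) / len(scores)


validation = list(range(1, 101))           # made-up validation scores 1..100
threshold = threshold_at_inlier_fraction(validation)
known_rate = outlier_percentage(validation, threshold)
unknown_rate = outlier_percentage([200] * 10, threshold)  # made-up unknown-data scores
```

With this rule, roughly 5% of the known task's data is flagged as outlying by construction, so values near 5% for the trained dataset's own test set are the expected reference point when reading the table.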
Table 1
Outlier detection values of the joint model and separate discriminative and generative models (denoted as “CNN + VAE”; discriminative convolutional neural network and variational auto-encoder), when considering 95% of the known tasks’ validation data as inlying. The percentage of detected outliers is reported based on the classifier predictive entropy, reconstruction negative log-likelihood (NLL) and our posterior-based extreme-value theory approach. Note that larger values are better, except for the test data of the trained dataset, where ideally 0% should be considered as outlying. The outlier detection values have additionally been color coded, where worse results appear in red. A deeper shading thus indicates a method’s failure to robustly recognize unknown data as such. With this color coding, we can easily see how MNIST appears to be an easy to identify dataset for all approaches; however, we notice right away that our OpenVAE is the only method (row) that does not have a single red value for any dataset combination. In fact, the lowest outlier detection accuracy of OpenVAE is a very high 94.76%.
Outlier Detection at 95% Validation Inliers (%)
| Trained | Model | Test Acc. | Criterion | MNIST | Fashion | Audio | KMNIST | CIFAR10 | CIFAR100 | SVHN |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | Dual, CNN + VAE | 99.40 | Class entropy | 4.160 | 90.43 | 97.53 | 95.29 | 98.54 | 98.63 | 95.51 |
| | | | Reconstruction NLL | 5.522 | 99.98 | 99.97 | 99.98 | 99.99 | 99.96 | 99.98 |
| | | | OpenMax | 4.362 | 99.41 | 99.80 | 99.86 | 99.95 | 99.97 | 99.52 |
| | Joint, VAE | 99.53 | Class entropy | 3.948 | 95.15 | 98.55 | 95.49 | 99.47 | 99.34 | 97.98 |
| | | | Reconstruction NLL | 5.083 | 99.50 | 99.98 | 99.91 | 99.97 | 99.99 | 99.98 |
| | | | OpenVAE (ours) | 4.361 | 99.78 | 99.67 | 99.73 | 99.96 | 99.93 | 99.70 |
| FashionMNIST | Dual, CNN + VAE | 90.48 | Class entropy | 74.71 | 5.461 | 69.65 | 77.85 | 24.91 | 28.76 | 36.64 |
| | | | Reconstruction NLL | 5.535 | 5.340 | 64.10 | 31.33 | 99.50 | 98.41 | 97.24 |
| | | | OpenMax | 96.22 | 5.138 | 93.00 | 91.51 | 71.82 | 72.08 | 73.85 |
| | Joint, VAE | 90.92 | Class entropy | 66.91 | 5.145 | 61.86 | 56.14 | 43.98 | 46.59 | 37.85 |
| | | | Reconstruction NLL | 0.601 | 5.483 | 63.00 | 28.69 | 99.67 | 98.91 | 98.56 |
| | | | OpenVAE (ours) | 96.23 | 5.216 | 94.76 | 96.07 | 96.15 | 95.94 | 96.84 |
| AudioMNIST | Dual, CNN + VAE | 98.53 | Class entropy | 97.63 | 57.64 | 5.066 | 95.53 | 66.49 | 65.25 | 54.91 |
| | | | Reconstruction NLL | 6.235 | 46.32 | 4.433 | 98.73 | 98.63 | 98.63 | 97.45 |
| | | | OpenMax | 99.82 | 78.74 | 5.038 | 99.47 | 93.44 | 92.76 | 88.73 |
| | Joint, VAE | 98.57 | Class entropy | 99.23 | 89.33 | 5.731 | 99.15 | 92.31 | 91.06 | 85.77 |
| | | | Reconstruction NLL | 0.614 | 38.50 | 3.966 | 36.05 | 98.62 | 98.54 | 96.99 |
| | | | OpenVAE (ours) | 99.91 | 99.53 | 5.089 | 99.81 | 100.0 | 99.99 | 99.98 |
In this way, we can showcase the advantages of a generative modeling formulation that considers the joint distribution in conjunction with EVT. For instance, we can compare our values with the purely discriminative OpenMax EVT approach [59]. At the same time, this provides a justification for why the existing continual-learning approaches of the next section, especially those relying on the maintenance of multiple models, are non-ideal, as they cannot seem to adequately solve the open-set challenge.
In terms of the obtained results, with the exception of MNIST, which appears to be an easy-to-identify dataset for all approaches, we can make two key observations:
Both EVT approaches generally outperform the other criteria, particularly our suggested aggregate posterior-based OpenVAE variant, where a near perfect open-set detection can be achieved.
Even though EVT can be applied to purely discriminative models (as in OpenMax), the generative OpenVAE model trained with variational inference consistently exhibited more accurate outlier detection. We posit that this robustness is due to OpenVAE explicitly optimizing a variational lower bound that considers the data distribution in addition to a pure optimization of features that maximize p(y|x).
Open Set Recognition with Monte-Carlo Dropout Based Uncertainty
One might be tempted to assume that the trained weights of the individual deep neural network encoder layers are still deterministic and that the failure of predictive entropy as a measure for unseen unknown data could thus primarily be attributed to uncertainty not being expressed adequately. Placing a distribution on the weights, akin to a fully Bayesian neural network, would then be expected to resolve this issue. For this purpose, we further repeat all of our experiments by treating the model weights as the random variable being marginalized through the use of Monte-Carlo Dropout (MCD) [42]. Accordingly, the models were re-trained with a Dropout probability of in each layer.
We then conducted 50 stochastic forward passes through the entire model for prediction. The obtained open-set recognition results are reported in Table 2.
Table 2
Outlier detection values of the joint model and separate discriminative and generative models (denoted as “CNN + VAE”; discriminative convolutional neural network and variational auto-encoder), when considering 95% of the known tasks’ validation data as inlying. The percentage of detected outliers is reported based on classifier predictive entropy, reconstruction negative log-likelihood (NLL) and our posterior-based EVT approach. In contrast to Table 1, the results are now averaged over 50 Monte-Carlo dropout samples, with for each layer, per data-point, respectively, to assess the model uncertainty. Note that larger values are better, except for the test data of the trained dataset, where ideally 0% should be considered as outlying. The color coding is analogous to Table 1.
Outlier Detection at 95% Validation Inliers (%)
| Trained | Model | Test Acc. | Criterion | MNIST | Fashion | Audio | KMNIST | CIFAR10 | CIFAR100 | SVHN |
|---|---|---|---|---|---|---|---|---|---|---|
| MNIST | Dual, CNN + VAE | 99.41 | Class entropy | 4.276 | 91.88 | 96.50 | 96.65 | 95.84 | 97.37 | 98.58 |
| | | | Reconstruction NLL | 4.829 | 99.99 | 100.0 | 99.90 | 100.0 | 100.0 | 100.0 |
| | | | OpenMax | 4.088 | 87.84 | 98.06 | 95.79 | 97.34 | 98.30 | 95.74 |
| | Joint, VAE | 99.54 | Class entropy | 4.801 | 97.63 | 99.38 | 98.01 | 99.16 | 99.39 | 98.90 |
| | | | Reconstruction NLL | 5.264 | 99.98 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| | | | OpenVAE (ours) | 4.978 | 99.99 | 100.0 | 99.94 | 99.96 | 99.95 | 99.68 |
| FashionMNIST | Dual, CNN + VAE | 90.58 | Class entropy | 75.50 | 5.366 | 70.78 | 74.41 | 49.42 | 49.17 | 38.84 |
| | | | Reconstruction NLL | 55.45 | 5.048 | 59.99 | 99.83 | 99.35 | 99.35 | 99.62 |
| | | | OpenMax | 77.03 | 4.920 | 55.48 | 70.23 | 58.73 | 57.06 | 44.54 |
| | Joint, VAE | 91.50 | Class entropy | 85.05 | 4.740 | 67.90 | 78.04 | 63.89 | 66.11 | 59.42 |
| | | | Reconstruction NLL | 1.227 | 5.422 | 85.85 | 39.76 | 99.94 | 99.72 | 99.99 |
| | | | OpenVAE (ours) | 95.83 | 4.516 | 94.56 | 96.04 | 96.81 | 96.66 | 96.28 |
| AudioMNIST | Dual, CNN + VAE | 98.76 | Class entropy | 99.97 | 61.26 | 4.996 | 96.77 | 63.78 | 65.76 | 59.38 |
| | | | Reconstruction NLL | 7.334 | 52.37 | 5.100 | 98.19 | 99.97 | 99.90 | 99.96 |
| | | | OpenMax | 92.74 | 67.18 | 5.073 | 90.41 | 90.56 | 90.97 | 89.58 |
| | Joint, VAE | 98.85 | Class entropy | 99.39 | 89.50 | 5.333 | 99.16 | 94.66 | 95.12 | 97.13 |
| | | | Reconstruction NLL | 15.81 | 53.83 | 4.837 | 41.89 | 99.90 | 99.82 | 99.95 |
| | | | OpenVAE (ours) | 99.50 | 99.27 | 5.136 | 99.75 | 99.71 | 99.59 | 99.91 |
Although MCD boosts the outlier detection accuracy, particularly for criteria such as predictive entropy, the previous insights and drawn conclusions still hold. In summary, the joint generative model generally outperforms a purely discriminative model in terms of open-set recognition, independently of the used metric, and our proposed aggregate posterior-based EVT approach of OpenVAE yields an almost perfect separation of known and unseen unknown data. Interestingly, this was already achieved in Table 1 without MCD. The repeated model evaluation required by MCD thus does not appear to offer enough of an advantage to warrant the added computational cost in the context of posterior-based open-set recognition, which is a further key advantage of OpenVAE.
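The Monte-Carlo Dropout procedure of averaging predictions over 50 stochastic forward passes can be illustrated with a toy example. The sketch below is not the model used in the experiments: it assumes a tiny linear classifier with dropout applied to its inputs, and the weights, the input and the dropout probability of 0.5 are purely illustrative values.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stochastic_forward(x, weights, drop_prob, rng):
    """One pass of a toy linear classifier with inverted dropout on its inputs."""
    keep = 1.0 - drop_prob
    dropped = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, dropped)) for row in weights]
    return softmax(logits)

def mc_dropout_predict(x, weights, drop_prob, num_passes=50, seed=0):
    """Average the class probabilities over several stochastic passes."""
    rng = random.Random(seed)
    passes = [stochastic_forward(x, weights, drop_prob, rng)
              for _ in range(num_passes)]
    return [sum(p[c] for p in passes) / num_passes
            for c in range(len(weights))]

def entropy(probs):
    """Predictive entropy (in nats) of the averaged class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical three-class toy model and input (illustrative values only).
weights = [[2.0, -1.0, 0.5], [-1.0, 2.0, 0.5], [0.0, 0.0, 1.0]]
x = [1.0, 0.2, 0.1]

mean_probs = mc_dropout_predict(x, weights, drop_prob=0.5, num_passes=50)
mc_entropy = entropy(mean_probs)
print(mean_probs, mc_entropy)
```

The cost argument in the text is visible here: each prediction requires `num_passes` forward evaluations instead of one.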
3.2. Learning Classes Incrementally in Continual Learning
To showcase how our OpenVAE approach mitigates catastrophic interference in addition to successfully handling unknown data in robust prediction, we conduct an investigation of the test accuracy when learning classes incrementally.
Experimental Set-Up and Evaluation
We consider the incremental MNIST dataset (where classes arrive in groups of two) and the corresponding versions of the FashionMNIST and AudioMNIST datasets, similar to popular literature [11,21,22,32,34]. We re-emphasize that such a setting has a sole focus on mitigating catastrophic interference and does not account for the challenges presented in the previous open-set recognition section, which we detail in the prospective discussion section. For a flexible comparison, we report our aggregate posterior-based generative replay approach in OpenVAE on both a simple multi-layer perceptron (MLP), as well as a deep convolutional neural network (CNN) based on wide residual networks (WRN). For the former, we follow previous continual-learning studies and employ a two-hidden-layer and 400-unit multi-layer perceptron [60]. For the latter, we use both encoder and decoder architectures of 14-layer wide residual networks [61,62] with a latent dimensionality of 60 [2,18]. For our statistical outlier rejection, we use a rejection prior of and dynamically set tail-sizes to 5% of seen examples per class.
For our own experiments, we report the mean and standard deviation of the average classification test accuracy across five experimental repetitions. If our re-implementation of related works achieved a better than original value, we report this number, otherwise the work that reported the specific best value is cited next to it. The full training details, including details on hardware and code, are supplied in Appendix A.4.
Results
In Table 3, we report the final accuracy after having trained on each of the five increments.
For an overall reference, we provide the achievable upper-bound continual-learning performance, i.e., accumulating all data over time and optimizing Equation (1). We can observe that our proposed OpenVAE approach provides significant improvement over generative replay with a conventional supervised VAE. In comparison with the immediately related works, our approach surpasses variational continual learning (VCL) [36], an approach that employs a full Bayesian neural network (BNN), with the additional benefit that our approach scales trivially to complex network architectures.
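The idea of narrowing generative replay to in-distribution samples can be sketched as rejection sampling in the latent space. The code below is a hedged illustration only: Equation (6) is not reproduced in this section, so the sketch assumes that the outlier probability is a Weibull CDF evaluated on the Euclidean distance between a latent sample and a single class mean. The Weibull parameters, the class mean, the distance measure and the rejection prior of 0.5 are all hypothetical choices for the example.

```python
import math
import random

def weibull_outlier_probability(distance, shift, scale, shape):
    """Weibull CDF on a latent distance; values near 1 indicate outliers.
    Assumed stand-in for the paper's outlier likelihood."""
    d = max(distance - shift, 0.0)
    return 1.0 - math.exp(-((d / scale) ** shape))

def sample_inlier_latents(class_mean, weibull, rejection_prior,
                          num_samples, seed=0, max_tries=100000):
    """Draw prior samples z ~ N(0, I) and keep only those whose outlier
    probability does not exceed the rejection prior."""
    rng = random.Random(seed)
    accepted = []
    tries = 0
    while len(accepted) < num_samples and tries < max_tries:
        tries += 1
        z = [rng.gauss(0.0, 1.0) for _ in class_mean]
        dist = math.dist(z, class_mean)
        if weibull_outlier_probability(dist, **weibull) <= rejection_prior:
            accepted.append(z)
    return accepted

# Hypothetical per-class statistics in a 2-D latent space.
class_mean = [0.5, -0.5]
weibull = {"shift": 0.0, "scale": 1.0, "shape": 2.0}
rejection_prior = 0.5

latents = sample_inlier_latents(class_mean, weibull, rejection_prior,
                                num_samples=20)
print(len(latents))
```

In a replay setting, the accepted latent vectors would then be passed through the decoder to generate data for previously seen classes; a multi-class variant would evaluate the outlier probability against every class and keep a sample if it is inlying for at least one of them.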
Table 3
The accuracy at the end of the last increment T = 5 for class incremental learning approaches averaged over five runs. For a fair comparison, if our re-implementation of related works achieved a better than original value, we report our number, otherwise the work that reported the specific best value is cited right next to the result. Intermediate results can be found in Appendix A.6.
Final Accuracy αT (T = 5) [%]
| Method | MNIST | FashionMNIST | AudioMNIST |
|---|---|---|---|
| MLP upper bound | 98.84 | 87.35 | 96.43 |
| WRN upper bound | 99.29 | 89.24 | 97.87 |
| EWC [22] | 55.80 [63] | 24.48 ± 2.86 | 20.48 ± 1.73 |
| DGR [32] | 75.47 [64] | 63.21 ± 1.96 | 48.42 ± 2.81 |
| VCL [36] | 72.30 [35] | 32.60 [35] | - |
| VGR [35] | 92.22 [35] | 79.10 [35] | - |
| Supervised VAE | 60.88 ± 3.31 | 62.72 ± 1.38 | 69.76 ± 1.37 |
| OpenVAE-MLP | 87.31 ± 1.22 | 66.14 ± 0.50 | 81.84 ± 1.44 |
| OpenVAE-WRN | 93.24 ± 3.74 | 69.88 ± 1.71 | 87.72 ± 1.59 |
| OpenPixelVAE | 96.84 ± 0.35 | 80.85 ± 0.72 | 90.23 ± 1.14 |
In comparison with variational generative replay (VGR) [34], OpenVAE initially appears to fall short. This is not surprising, as VGR trains a separate GAN on each task’s aggregate posterior, which makes for an apples-to-oranges comparison considering that we only use a single model. Nevertheless, even in a single model, we can surpass the multi-model VGR by leveraging recent advancements in generative modeling, e.g., by making the neural architecture more complex or augmenting our decoder with autoregressive sampling [2,18] (a complementary technique to OpenVAE, often also called PixelVAE and summarized in Appendix A.3).
At the bottom of Table 3, we can see that this significantly improves upon the previously obtained accuracy. The full accuracies, along with other metrics per dataset for all intermediate steps, can be found in Appendix A.6.
High-Resolution Flower Images
While the main goal of this paper is not to push the achievable boundaries of generation, we take this argument one step further and provide empirical evidence that our suggested aggregate posterior-based EVT sampling provides similar benefits when scaling to higher resolution color images. For this purpose, we consider the additional flowers dataset [52] at a resolution of , investigated with five classes and increments of one class per step [65,66].
In addition to autoregressive sampling, we also include a second complementary generative modeling improvement here, called VAEs with introspection (IntroVAE) [19]. A technical description of PixelVAE and IntroVAE is detailed in Appendix A.3. For each generative modeling variant, including autoregression and introspection, we report the degradation of accuracy over time in Figure 5 and demonstrate how their respective open-set-aware version provides substantial improvements. Intuitively, this improvement is due to an increase in the visual generation quality; see the examples in the earlier Figure 3.
Figure 5
Classification accuracy over five runs for continually learned flowers at resolution to demonstrate how generative modeling advances draw similar benefits from our proposed aggregate posterior constrained generative replay (solid lines) over the open-set-unaware baselines (dashed counterparts).
First, it is apparent how every OpenVAE variant improves upon its non-open-set-aware counterpart. We further observe that the best version, OpenIntroVAE, appears to be in the same ballpark as complex recent GAN approaches [65,66], even though they do not solve the open-set recognition challenge and conduct a simplified evaluation. The latter works use a lower resolution of (we were unable to scale to satisfying results at higher resolution) with additional distillation mechanisms, a continuously trained generator but a classifier that is trained and assessed only once at the end. We nevertheless report the respective values for intuition. We conclude that the obtained final accuracy can be competitive and is remarkably close to the achievable upper bound. A suspected initial limitation of VAE generation quality appears to be lifted with modern extensions and our proposed sampling scheme.
We also support our quantitative statements visually with a few selected generated images for the various generative variants in Figure 6. We emphasize that these examples are primarily supposed to provide visual intuition in support of the existing quantitative results, as it is difficult to draw conclusions about perceived subjective quality from a few images alone. From a qualitative viewpoint, the OpenVAE without generative modeling extensions appears to suffer from the limitations of a traditional VAE and generates blurry images.
Figure 6
Generated flower images for various continually trained models. Images were selected to provide a qualitative intuition behind the quantitative results of Figure 5. Images are compressed for a side-by-side view.
However, our open-set approach nevertheless provides a clearer disambiguation of classes, particularly already at the stage of task 2. The addition of introspection significantly increases the image detail, although the quality still degrades considerably due to ambiguous interpolations in samples from low-density areas outside the aggregate posterior. This is again resolved by combining introspection with our proposed posterior-based EVT approach, where image quality is retained across multiple generative replay steps. From a purely visual perspective, it is clear why this model outperforms the other approaches significantly in terms of quantitative accuracy values.
Interestingly, our visual inspection also hints at why the PixelVAE and its open-set variant perform much worse than perhaps initially expected. As the caveat is the same in both PixelVAE and OpenPixelVAE, we only show generated instances for the latter. From these samples, we can hypothesize why the initial performance is competitive but rapidly declines. It appears that the autoregression suffers from forgetting in terms of its long-range pixel dependency.
Whereas at the beginning, the information is locally consistent across the entire image, in each consecutive step, a further portion of subsequent pixels for old tasks is progressively replaced with uncorrelated noise. The conditioning thus appears to primarily be captured on new tasks only, resulting in interference effects. We continue this discussion alongside potential other general limitations of generative modeling variant choices in Appendix A.5.
4. Discussion
As a final piece of discussion, we would like to recall and emphasize a few important points of how our results should be interpreted and contextualized.
4.1. Presence of Unknown Data and Current Benchmarks
Perhaps most importantly, we re-iterate that OpenVAE is unique in that it provides a grounded basis to conduct continual learning in the presence of unknown data. However, as evidenced from the quantitative open-set recognition results, the inclusion of unknown data instances into continual learning would immediately result in the failure of the present continual-learning approaches at this point, simply because they lack a principled mechanism to provide robust predictions. For this reason, we show traditional incremental classification results as a proxy to assess our improved aggregate posterior-based generation quality.
Our class incremental accuracy reports in this paper should thus be interpreted with caution as they represent only a part of OpenVAE’s capability, similar to a typical ablation study. We nevertheless provided this type of comparison, in order to situate OpenVAE with respect to some existing generative continual-learning methods in terms of catastrophic forgetting, rather than presenting OpenVAE in isolation in a more realistic new setting.
4.2. State of the Art in Class Incremental Learning and Exemplar Rehearsal
Following the above subsection, we note that a fair comparison of realistic class incremental learning is further complicated due to various involved factors. In fact, multiple related works make various additional assumptions on the extra storage of explicit data subsets and the use of multiple generative models per task or even multiple classifiers. We do not make these assumptions here in favor of generality. In this spirit, we focused our evaluation on our contributions’ relevant novelty with respect to combining the detection of unknown data with the prevention of catastrophic forgetting in generative models.
The introduced OpenVAE shows that both are achievable simultaneously. At the same time, the reader familiar with the recent continual-learning literature will likely notice that some modern approaches that are attributed with state of the art in class incremental learning have not been included in our comparison. These approaches all fall into the category of exemplar rehearsal. We would like to emphasize that this is deliberate and not out of ignorance, as we see these works as purely complementary. We nevertheless wish to give deserved credit to these works and provide an outlook to one future research direction.
The primary reason for omitting a direct comparison with state of the art works in continual learning that employ exemplar rehearsal is that we believe such a comparison would be misleading. In fact, contrasting our OpenVAE against these works would imply that these methods are somehow competing. In reality, exemplar rehearsal, or the so-called extraction of core sets, is an auxiliary mechanism that can be applied out-of-the-box to our experimental set-up in this work.
The main premise here is that catastrophic forgetting in continual learning can be reduced by retaining an explicit subset of the original data and subsequently continuously interleaving this stored data into the training process.
Early works, such as iCaRL [26], show that performance is then a function of two key aspects: the data selection technique and the memory buffer size. The former, selection of an appropriate data subset, essentially boils down to a non-continual-learning question, i.e., how to approximate the entire distribution through only a few instances. Exemplar rehearsal works thus make use of existing techniques here, such as core sets [28], herding [67], nearest mean-classifiers [27] or simply picking data samples uniformly at random [68].
The second question, on memory buffer size, has an almost trivial answer. The larger the memory buffer size, the better the performance. This is intuitive, yet also makes comparison challenging, as a memory buffer of the size of the entire dataset is analogous to what we referred to as “incremental upper bound” in our experiments. If we were to simply store the complete dataset, then catastrophic forgetting would be avoided entirely. Modern class incremental learning works make heavy use of this fact and store large portions of the original data, showing that the more data is stored, the higher the performance goes.
Primary examples include the recent works on Mnemonics Training [69], Contrastive Continual Learning (Co2L) [70] or Dark Experience Replay (DER) [71]. We do not wish to dive into a discussion here of whether or not such data storage is realistic or what size of a memory buffer should be assumed.
A respective reference that questions and discusses whether storing of original data is synonymous with progress in continual learning is Greedy Sampler and Dumb Learner (GDumb) [68], where it is shown that the amount of extracted data alone accounts for a significant portion of “state-of-the-art” performance.
Primarily, we point out that the latter works all show that a larger memory buffer yields “better” class incremental learning performance, i.e., less forgetting. However, most importantly, extracting and storing parts of the original data into a separate memory buffer is an auxiliary process that is entirely complementary to our propositions of OpenVAE. As such, each of the methods referenced in this subsection is straightforward to combine with our work. Although we see such a combination as important prospective work, we leave detailed experimentation up to future investigations.
The rationale behind this choice is that inclusion of a memory buffer will inevitably additionally boost the performances of the results of Table 3, yet provide no additional insights to our main hypothesis and contribution: the proposition of OpenVAE to show that detection of unknown data for robust prediction can effectively be achieved alongside reduction of catastrophic forgetting in continual learning.
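The exemplar rehearsal mechanism discussed in this subsection can be sketched in a few lines. The following is a minimal illustration of the simplest selection strategy mentioned above, picking samples uniformly at random per class, followed by interleaving stored exemplars into a new training batch. It is not part of OpenVAE; the function names, buffer size and toy data are hypothetical.

```python
import random
from collections import defaultdict

def select_exemplars(samples, labels, per_class, seed=0):
    """Keep up to `per_class` uniformly random samples for every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    return {y: rng.sample(items, min(per_class, len(items)))
            for y, items in by_class.items()}

def interleave(new_batch, buffer, replay_count, rng):
    """Append randomly drawn stored exemplars to a batch of new-task data."""
    stored = [(x, y) for y, xs in buffer.items() for x in xs]
    replay = rng.sample(stored, min(replay_count, len(stored)))
    return new_batch + replay

# Toy data from a first task with three classes (illustrative values only).
samples = list(range(12))
labels = [i % 3 for i in range(12)]
buffer = select_exemplars(samples, labels, per_class=2)

# A batch from a later task, mixed with three replayed exemplars.
new_batch = [(100, 3), (101, 3)]
mixed = interleave(new_batch, buffer, replay_count=3, rng=random.Random(1))
print(buffer, mixed)
```

Increasing `per_class` until the buffer holds the whole dataset recovers the incremental upper bound described in the text, which is why buffer size dominates such comparisons.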
5. Conclusions
We proposed an approach to unify the prevention of catastrophic interference in continual learning with open-set recognition based on variational inference in deep generative models. As a common denominator, we introduced EVT-based bounds to the aggregate posterior. The correspondingly named OpenVAE was shown to achieve compelling results in being able to distinguish known from unknown data, while boosting the generation quality in continual learning with generative replay.
We believe that our demonstrated benefits from recent generative modeling techniques in the context of high-resolution flower images with OpenVAE provide a natural synergy to be explored in a range of future applications. We envision prospective works to employ OpenVAE as a baseline when relaxing the closed-world assumption in continual learning and allowing unknown data to appear in the investigated benchmark streams at all times in the move to a more realistic evaluation.
Table A1
Losses obtained for different β values for MNIST with a 2-D latent space. Training conducted in isolated fashion to quantitatively showcase the role of β. Un-normalized values in nats are reported in brackets for reference purposes.
In Nats per Dimension (Nats in Brackets)
| 2-D Latent | Beta | KLD | Recon Loss | Class Loss | Accuracy [%] |
|---|---|---|---|---|---|
| train | 1.0 | 1.039 (2.078) | 0.237 (185.8) | 0.539 (5.39) | 79.87 |
| test | 1.0 | 1.030 (2.060) | 0.235 (184.3) | 0.596 (5.96) | 78.30 |
| train | 0.5 | 1.406 (2.812) | 0.230 (180.4) | 0.221 (2.21) | 93.88 |
| test | 0.5 | 1.382 (2.764) | 0.228 (178.8) | 0.305 (3.05) | 92.07 |
| train | 0.1 | 2.055 (4.110) | 0.214 (167.8) | 0.042 (0.42) | 99.68 |
| test | 0.1 | 2.071 (4.142) | 0.212 (166.3) | 0.116 (1.16) | 98.73 |
| train | 0.05 | 2.395 (4.790) | 0.208 (163.1) | 0.025 (0.25) | 99.83 |
| test | 0.05 | 2.382 (4.764) | 0.206 (161.6) | 0.159 (1.59) | 98.79 |
Table A2
Losses obtained for different β values for MNIST with a 60-D latent space. Training conducted in isolated fashion to quantitatively showcase the role of β. Un-normalized values in nats are reported in brackets for reference purposes.
In Nats per Dimension (Nats in Brackets)
| 60-D Latent | Beta | KLD | Recon Loss | Class Loss | Accuracy [%] |
|---|---|---|---|---|---|
| train | 1.0 | 0.108 (6.480) | 0.184 (144.3) | 0.0110 (0.110) | 99.71 |
| test | 1.0 | 0.110 (6.600) | 0.181 (142.0) | 0.0457 (0.457) | 99.03 |
| train | 0.5 | 0.151 (9.060) | 0.162 (127.1) | 0.0052 (0.052) | 99.87 |
| test | 0.5 | 0.156 (9.360) | 0.159 (124.7) | 0.0451 (0.451) | 99.14 |
| train | 0.1 | 0.346 (20.76) | 0.124 (97.22) | 0.0022 (0.022) | 99.95 |
| test | 0.1 | 0.342 (20.52) | 0.126 (98.79) | 0.0286 (0.286) | 99.38 |
| train | 0.05 | 0.476 (28.56) | 0.115 (90.16) | 0.0018 (0.018) | 99.95 |
| test | 0.05 | 0.471 (28.26) | 0.118 (92.53) | 0.0311 (0.311) | 99.34 |
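The relation between the per-dimension values and the bracketed un-normalized values in Tables A1 and A2 can be verified with a short calculation. The sketch below assumes that the KLD is normalized by the latent dimensionality, the reconstruction loss by the 28 × 28 = 784 MNIST pixels and the classification loss by the 10 classes; these divisors are inferred from the reported numbers rather than stated in the captions.

```python
def total_nats(per_dimension, num_dimensions):
    """Convert a per-dimension loss into its un-normalized value in nats."""
    return per_dimension * num_dimensions

# Assumed normalization constants (inferred, see lead-in).
mnist_pixels = 28 * 28
num_classes = 10

# Training row for beta = 1.0 with a 2-D latent space (Table A1).
kld_2d = total_nats(1.039, 2)             # reported in brackets: 2.078
recon_2d = total_nats(0.237, mnist_pixels)  # reported in brackets: 185.8
class_2d = total_nats(0.539, num_classes)   # reported in brackets: 5.39

# Training row for beta = 1.0 with a 60-D latent space (Table A2).
kld_60d = total_nats(0.108, 60)           # reported in brackets: 6.480

print(kld_2d, recon_2d, class_2d, kld_60d)
```

The reconstruction value agrees with the table only up to rounding, because the per-dimension entry is itself rounded to three decimals.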
Table A3
A 14-layer wide residual network (WRN) encoder with a widen factor of 10. Convolutional layers (conv) are parametrized by a quadratic filter size followed by the amount of filters. p and s represent zero padding and stride, respectively. If no padding or stride is specified, then p = 0 and s = 1. Skip connections are an additional operation at a layer, with the layer to be skipped specified in brackets. Convolutional layers are followed by batch-normalization and a rectified linear unit (ReLU) activation. The probabilistic encoder ends on fully-connected layers for μ and σ that depend on the chosen latent space dimensionality and the data’s spatial size.
| Layer Type | WRN Encoder |
|---|---|
| Layer 1 | conv 3×3 - 48, p = 1 |
| Block 1 | conv 3×3 - 160, p = 1; conv 1×1 - 160 (skip next layer) |
| | conv 3×3 - 160, p = 1 |
| | conv 3×3 - 160, p = 1; shortcut (skip next layer) |
| | conv 3×3 - 160, p = 1 |
| Block 2 | conv 3×3 - 320, s = 2, p = 1; conv 1×1 - 320, s = 2 (skip next layer) |
| | conv 3×3 - 320, p = 1 |
| | conv 3×3 - 320, p = 1; shortcut (skip next layer) |
| | conv 3×3 - 320, p = 1 |
| Block 3 | conv 3×3 - 640, s = 2, p = 1; conv 1×1 - 640, s = 2 (skip next layer) |
| | conv 3×3 - 640, p = 1 |
| | conv 3×3 - 640, p = 1; shortcut (skip next layer) |
| | conv 3×3 - 640, p = 1 |
Table A4
A 14-layer WRN decoder with a widen factor of 10. Pw and Ph refer to the input’s spatial dimension. Convolutional (conv) and transposed convolutional (conv_t) layers are parametrized by a quadratic filter size followed by the amount of filters. p and s represent zero padding and stride, respectively. If no padding or stride is specified, then p = 0 and s = 1. Skip connections are an additional operation at a layer, with the layer to be skipped specified in brackets. Every convolutional and fully-connected (FC) layer is followed by batch-normalization and a rectified linear unit (ReLU) activation function. The model ends on a Sigmoid function.
| Layer Type | WRN Decoder |
|---|---|
| Layer 1 | FC 640×[Pw/4]×[Ph/4] |
| Block 1 | conv_t 3×3 - 320, p = 1; conv_t 1×1 - 320 (skip next layer) |
| | conv 3×3 - 320, p = 1 |
| | conv 3×3 - 320, p = 1; shortcut (skip next layer) |
| | conv 3×3 - 320, p = 1 |
| | upsample × 2 |
| Block 2 | conv_t 3×3 - 160, p = 1; conv_t 1×1 - 160 (skip next layer) |
| | conv 3×3 - 160, p = 1 |
| | conv 3×3 - 160, p = 1; shortcut (skip next layer) |
| | conv 3×3 - 160, p = 1 |
| | upsample × 2 |
| Block 3 | conv_t 3×3 - 48, p = 1; conv_t 1×1 - 48 (skip next layer) |
| | conv 3×3 - 48, p = 1 |
| | conv 3×3 - 48, p = 1; shortcut (skip next layer) |
| | conv 3×3 - 48, p = 1 |
| Layer 2 | conv 3×3 - 3, p = 1 |
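The spatial sizes implied by Tables A3 and A4 can be traced with the standard convolution output-size formula. The sketch below assumes square inputs and the usual relation floor((n + 2p - k) / s) + 1; it follows the main convolution path of the encoder table and checks that two stride-2 convolutions reduce the spatial size by a factor of four, in line with the decoder's first fully-connected layer. The helper names are illustrative.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def encoder_spatial_size(size):
    """Trace one spatial dimension through the main path of Table A3."""
    size = conv_out(size, kernel=3, stride=1, padding=1)  # Layer 1
    for first_stride in (1, 2, 2):                        # Blocks 1 to 3
        size = conv_out(size, kernel=3, stride=first_stride, padding=1)
        for _ in range(3):                                # remaining convs
            size = conv_out(size, kernel=3, stride=1, padding=1)
    return size

def flattened_features(size, channels=640):
    """Number of features entering the fully-connected latent layers."""
    out = encoder_spatial_size(size)
    return channels * out * out

print(encoder_spatial_size(28), flattened_features(28))  # 28 x 28 inputs
print(encoder_spatial_size(32), flattened_features(32))  # 32 x 32 inputs
```

For 28 × 28 inputs, the encoder ends on a 7 × 7 map, which matches the decoder's FC layer of size 640×[Pw/4]×[Ph/4].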
Table A5
The results for class incremental continual-learning approaches averaged over five runs, baselines and the reference isolated learning scenario for MNIST at the end of every task increment. This is an extension of Table 3 in the main body. Here, in addition to the accuracy αt, γt also indicates the respective negative log-likelihood (NLL) at the end of every task increment t.
| MNIST | t | UB | FT | SupVAE | OpenVAE | PixelVAE DGR | SupPixelVAE | OpenPixelVAE |
|---|---|---|---|---|---|---|---|---|
| αbase,t (%) | 1 | 100.0 | 100.0 | 99.97 ± 0.029 | 99.98 ± 0.018 | 99.97 ± 0.002 | 99.97 ± 0.026 | 99.86 ± 0.084 |
| | 2 | 99.82 | 00.00 | 97.28 ± 3.184 | 99.30 ± 0.100 | 99.54 ± 0.285 | 96.90 ± 2.907 | 99.64 ± 0.095 |
| | 3 | 99.80 | 00.00 | 87.66 ± 8.765 | 96.69 ± 2.173 | 99.16 ± 0.611 | 90.12 ± 5.846 | 98.88 ± 0.491 |
| | 4 | 99.85 | 00.00 | 54.70 ± 22.84 | 94.71 ± 1.792 | 98.33 ± 1.119 | 76.84 ± 9.095 | 98.11 ± 0.797 |
| | 5 | 99.57 | 00.00 | 19.86 ± 7.396 | 92.53 ± 4.485 | 98.04 ± 1.397 | 56.53 ± 4.032 | 97.44 ± 0.785 |
| αnew,t (%) | 1 | 100.0 | 100.0 | 99.97 ± 0.029 | 99.98 ± 0.018 | 99.97 ± 0.002 | 99.97 ± 0.026 | 99.86 ± 0.084 |
| | 2 | 99.80 | 99.85 | 99.75 ± 0.127 | 99.80 ± 0.126 | 99.71 ± 0.122 | 99.74 ± 0.052 | 99.82 ± 0.027 |
| | 3 | 99.67 | 99.94 | 99.63 ± 0.172 | 99.61 ± 0.055 | 99.41 ± 0.084 | 99.22 ± 0.082 | 99.56 ± 0.092 |
| | 4 | 99.49 | 100.0 | 99.05 ± 0.470 | 99.15 ± 0.032 | 98.61 ± 0.312 | 97.84 ± 0.180 | 98.80 ± 0.292 |
| | 5 | 99.10 | 99.86 | 99.00 ± 0.100 | 99.06 ± 0.171 | 97.31 ± 0.575 | 96.77 ± 0.337 | 98.63 ± 0.430 |
| αall,t (%) | 1 | 100.0 | 100.0 | 99.97 ± 0.029 | 99.98 ± 0.018 | 99.97 ± 0.002 | 99.97 ± 0.026 | 99.86 ± 0.084 |
| | 2 | 99.81 | 49.92 | 98.54 ± 1.638 | 99.55 ± 0.036 | 99.60 ± 0.142 | 98.37 ± 1.448 | 99.69 ± 0.051 |
| | 3 | 99.72 | 31.35 | 95.01 ± 3.162 | 98.46 ± 0.903 | 98.93 ± 0.291 | 96.14 ± 1.836 | 99.20 ± 0.057 |
| | 4 | 99.50 | 24.82 | 81.50 ± 9.369 | 97.06 ± 1.069 | 98.22 ± 0.560 | 91.25 ± 0.992 | 98.13 ± 0.281 |
| | 5 | 99.29 | 20.16 | 64.34 ± 4.903 | 93.24 ± 3.742 | 96.52 ± 0.658 | 83.61 ± 0.927 | 96.84 ± 0.346 |
| γbase,t (nats) | 1 | 63.18 | 62.08 | 64.34 ± 2.054 | 62.53 ± 1.166 | 90.52 ± 0.263 | 100.0 ± 1.572 | 99.77 ± 2.768 |
| | 2 | 62.85 | 126.8 | 74.41 ± 10.89 | 65.68 ± 1.166 | 91.27 ± 0.789 | 100.4 ± 1.964 | 101.2 ± 3.601 |
| | 3 | 63.36 | 160.4 | 81.89 ± 10.09 | 69.29 ± 1.541 | 91.92 ± 0.991 | 100.3 ± 4.562 | 101.1 ± 4.014 |
| | 4 | 64.25 | 126.9 | 90.62 ± 10.08 | 71.69 ± 1.379 | 91.75 ± 1.136 | 102.7 ± 7.134 | 101.0 ± 4.573 |
| | 5 | 64.99 | 123.2 | 101.6 ± 8.347 | 77.16 ± 1.104 | 92.05 ± 1.212 | 102.4 ± 6.195 | 100.5 ± 4.942 |
| γnew,t (nats) | 1 | 63.18 | 62.08 | 64.34 ± 2.054 | 62.53 ± 1.166 | 90.52 ± 0.263 | 100.0 ± 1.572 | 99.77 ± 2.768 |
| | 2 | 88.75 | 87.93 | 89.91 ± 0.107 | 89.64 ± 3.709 | 115.8 ± 0.805 | 125.7 ± 2.413 | 124.6 ± 3.822 |
| | 3 | 82.53 | 87.22 | 87.65 ± 0.530 | 85.37 ± 1.725 | 107.7 ± 0.600 | 118.3 ± 3.523 | 116.5 ± 2.219 |
| | 4 | 72.68 | 74.61 | 79.49 ± 0.489 | 74.75 ± 0.777 | 100.9 ± 0.659 | 107.1 ± 5.316 | 102.3 ± 1.844 |
| | 5 | 85.88 | 92.00 | 93.55 ± 0.391 | 89.68 ± 0.618 | 113.4 ± 0.820 | 118.2 ± 1.572 | 113.3 ± 0.755 |
| γall,t (nats) | 1 | 63.18 | 62.08 | 64.34 ± 2.054 | 62.53 ± 1.166 | 90.52 ± 0.263 | 100.0 ± 1.572 | 99.77 ± 2.768 |
| | 2 | 75.97 | 107.3 | 82.02 ± 5.488 | 76.62 ± 1.695 | 102.9 ± 0.408 | 111.9 ± 2.627 | 112.7 ± 3.300 |
| | 3 | 79.58 | 172.3 | 89.88 ± 3.172 | 82.95 ± 1.878 | 104.8 ± 1.114 | 114.9 ± 4.590 | 114.6 ± 4.788 |
| | 4 | 79.72 | 203.1 | 95.83 ± 2.747 | 85.30 ± 1.524 | 103.9 ± 0.759 | 114.3 ± 3.963 | 112.1 ± 2.150 |
| | 5 | 81.97 | 163.7 | 107.6 ± 1.724 | 92.92 ± 2.283 | 106.1 ± 0.868 | 118.7 ± 5.320 | 111.9 ± 2.663 |
Table A6
The results for class incremental continual-learning approaches averaged over five runs, baselines and the reference isolated learning scenario for FashionMNIST at the end of every task increment. This is an extension of Table 3 in the main body. Here, in addition to the accuracy αt, γt also indicates the respective NLL at the end of every task increment t.
| Fashion | t | UB | FT | SupVAE | OpenVAE | PixelVAE DGR | SupPixelVAE | OpenPixelVAE |
|---|---|---|---|---|---|---|---|---|
| αbase,t (%) | 1 | 99.65 | 99.60 | 99.55 ± 0.035 | 99.59 ± 0.082 | 99.57 ± 0.091 | 99.58 ± 0.076 | 99.54 ± 0.079 |
| | 2 | 96.70 | 00.00 | 92.02 ± 1.175 | 92.36 ± 2.092 | 82.40 ± 6.688 | 90.06 ± 1.782 | 88.60 ± 1.998 |
| | 3 | 95.95 | 00.00 | 79.26 ± 4.170 | 83.90 ± 2.310 | 78.55 ± 3.964 | 83.70 ± 3.571 | 87.66 ± 0.375 |
| | 4 | 91.35 | 00.00 | 50.16 ± 6.658 | 64.70 ± 2.580 | 54.69 ± 3.853 | 50.23 ± 7.004 | 68.31 ± 3.308 |
| | 5 | 92.20 | 00.00 | 39.51 ± 7.173 | 60.63 ± 12.16 | 60.04 ± 5.151 | 47.83 ± 13.41 | 74.45 ± 2.889 |
| αnew,t (%) | 1 | 99.65 | 99.60 | 99.55 ± 0.035 | 99.59 ± 0.082 | 99.57 ± 0.091 | 99.58 ± 0.076 | 99.54 ± 0.079 |
| | 2 | 95.55 | 97.95 | 90.98 ± 0.626 | 92.64 ± 2.302 | 97.73 ± 1.113 | 96.47 ± 0.596 | 97.31 ± 0.475 |
| | 3 | 93.35 | 99.95 | 90.26 ± 1.435 | 83.40 ± 3.089 | 99.09 ± 0.367 | 97.33 ± 0.725 | 96.88 ± 1.156 |
| | 4 | 84.75 | 99.90 | 85.65 ± 2.127 | 84.18 ± 2.715 | 97.55 ± 0.588 | 96.12 ± 0.675 | 95.47 ± 1.332 |
| | 5 | 97.50 | 99.80 | 96.92 ± 0.774 | 96.51 ± 0.707 | 98.85 ± 0.141 | 97.91 ± 0.596 | 98.63 ± 0.176 |
| αall,t (%) | 1 | 99.65 | 99.60 | 99.55 ± 0.035 | 99.59 ± 0.082 | 99.57 ± 0.091 | 99.58 ± 0.076 | 99.54 ± 0.079 |
| | 2 | 95.75 | 48.97 | 91.83 ± 0.730 | 92.31 ± 1.163 | 86.22 ± 3.704 | 92.93 ± 0.160 | 92.17 ± 1.425 |
| | 3 | 93.02 | 33.33 | 83.35 ± 1.597 | 86.93 ± 0.870 | 76.77 ± 4.378 | 84.07 ± 1.069 | 87.30 ± 0.322 |
| | 4 | 87.51 | 25.00 | 64.66 ± 3.204 | 76.05 ± 1.391 | 62.93 ± 3.738 | 64.42 ± 1.837 | 76.36 ± 1.267 |
| | 5 | 89.24 | 19.97 | 58.82 ± 2.521 | 69.88 ± 1.712 | 72.41 ± 2.941 | 63.05 ± 1.826 | 80.85 ± 0.721 |
| γbase,t (nats) | 1 | 209.7 | 209.8 | 208.9 ± 1.213 | 209.7 ± 3.655 | 267.8 ± 1.246 | 230.8 ± 3.024 | 232.0 ± 2.159 |
| | 2 | 207.4 | 240.7 | 212.7 ± 0.579 | 212.1 ± 0.937 | 273.6 ± 0.631 | 232.5 ± 1.582 | 231.8 ± 0.416 |
| | 3 | 207.6 | 258.7 | 219.5 ± 1.376 | 216.9 ± 1.208 | 274.0 ± 0.552 | 235.6 ± 2.784 | 231.6 ± 0.832 |
| | 4 | 207.7 | 243.6 | 223.8 ± 0.837 | 217.1 ± 0.979 | 273.7 ± 0.504 | 236.4 ± 3.157 | 231.4 ± 2.550 |
| | 5 | 208.4 | 306.5 | 232.8 ± 5.048 | 222.8 ± 1.632 | 274.1 ± 0.349 | 241.1 ± 1.747 | 234.1 ± 1.498 |
| γnew,t (nats) | 1 | 209.7 | 209.8 | 208.9 ± 1.213 | 209.7 ± 3.655 | 267.8 ± 1.246 | 230.8 ± 3.024 | 232.0 ± 2.159 |
| | 2 | 241.1 | 240.2 | 241.8 ± 0.502 | 241.9 ± 0.960 | 313.4 ± 1.006 | 275.8 ± 1.888 | 275.3 ± 1.473 |
| | 3 | 213.6 | 211.8 | 215.4 ± 0.501 | 213.0 ± 0.635 | 269.1 ± 0.616 | 268.3 ± 3.852 | 262.9 ± 1.893 |
| | 4 | 220.5 | 219.7 | 223.6 ± 0.381 | 220.9 ± 0.522 | 282.4 ± 0.321 | 259.1 ± 1.305 | 259.6 ± 2.050 |
| | 5 | 246.2 | 242.0 | 248.8 ± 0.398 | 244.0 ± 0.646 | 305.8 ± 0.286 | 283.2 ± 2.150 | 283.5 ± 2.458 |
| γall,t (nats) | 1 | 209.7 | 209.8 | 208.9 ± 1.213 | 209.7 ± 3.655 | 267.8 ± 1.246 | 230.8 ± 3.024 | 232.0 ± 2.159 |
| | 2 | 224.2 | 240.4 | 226.6 ± 2.31 | 226.9 ± 0.918 | 293.8 ± 0.349 | 254.3 ± 1.513 | 255.8 ± 0.436 |
| | 3 | 220.7 | 246.1 | 227.2 ± 0.606 | 224.9 ± 0.642 | 285.7 ± 0.510 | 261.5 ± 2.970 | 259.1 ± 0.929 |
| | 4 | 220.4 | 238.7 | 230.4 ± 0.524 | 226.1 ± 0.560 | 284.9 ± 0.703 | 263.2 ± 2.259 | 259.5 ± 3.218 |
| | 5 | 226.2 | 275.1 | 242.2 ± 0.754 | 234.6 ± 0.823 | 289.5 ± 0.396 | 271.7 ± 2.117 | 267.2 ± 0.586 |
Table A7
Results for class-incremental continual-learning approaches, averaged over five runs, alongside the baselines and the reference isolated learning scenario, for AudioMNIST at the end of every task increment. This is an extension of Table 3 in the main body. Here, in addition to the accuracy α, γ indicates the respective negative log-likelihood (NLL) at the end of every task increment t.