
Unified Probabilistic Deep Continual Learning through Generative Replay and Open Set Recognition.

Martin Mundt¹, Iuliia Pliushch¹, Sagnik Majumder², Yongwon Hong³, Visvanathan Ramesh¹.

Abstract

Modern deep neural networks are well known to be brittle in the face of unknown data instances and recognition of the latter remains a challenge. Although it is inevitable for continual-learning systems to encounter such unseen concepts, the corresponding literature appears to nonetheless focus primarily on alleviating catastrophic interference with learned representations. In this work, we introduce a probabilistic approach that connects these perspectives based on variational inference in a single deep autoencoder model. Specifically, we propose to bound the approximate posterior by fitting regions of high density on the basis of correctly classified data points. These bounds are shown to serve a dual purpose: unseen unknown out-of-distribution data can be distinguished from already trained known tasks towards robust application. Simultaneously, to retain already acquired knowledge, a generative replay process can be narrowed to strictly in-distribution samples, in order to significantly alleviate catastrophic interference.


Keywords: catastrophic forgetting; continual deep learning; deep generative models; open-set recognition; variational inference

Year: 2022 PMID: 35448220 PMCID: PMC9028364 DOI: 10.3390/jimaging8040093

Source DB: PubMed Journal: J Imaging ISSN: 2313-433X

1. Introduction

Consider an empirically optimized deep neural network for a particular task, for the sake of simplicity, say the classification of dogs and cats. Typically, such a system is trained in a closed world setting [1] according to an isolated learning paradigm [2]. That is, we assume the observable world to consist of a finite set of known instances of dogs and cats, where training and evaluation is limited to the same underlying statistical data population. The training process is treated in isolation, i.e., the model parameters are inferred from the entire existing dataset at all times. However, the real world requires dealing with sequentially arriving tasks and data originating from potentially unknown sources. In particular, should we wish to apply and extend the system to an open world, where several other animals (and non animals) exist, there are two critical questions: (a) How can we prevent obvious mispredictions if the system encounters a new class? (b) How can we continue to incorporate this new concept into our present system without full retraining? With respect to the former question, it is well known that neural networks yield overconfident mispredictions in the face of unseen unknown concepts [3], a realization that has recently resurfaced in the context of various deep neural networks [4,5,6]. With respect to the latter question, it is similarly well known that neural networks, which are trained exclusively on newly arriving data, will overwrite their representations and thus forget encoded knowledge—a phenomenon referred to as catastrophic interference or catastrophic forgetting [7,8]. Although we have worded the above questions in a way that naturally exposes their connection: to identify what is new and think about how new concepts can be incorporated, they are largely subject to separate treatment in the respective literature. 
While open-set recognition [1,9,10] aims to explicitly identify novel inputs that deviate with respect to already observed instances, the existing continual learning literature predominantly concentrates its efforts on finding mechanisms to alleviate catastrophic interference (see [11] for an algorithmic survey). In particular, the indispensable system component to distinguish seen from unseen unknown data, both as a guarantee for robust application and to avoid the requirement of explicit task labels for prediction, is generally missing from recent continual-learning works. Inspired by this gap, we set out to connect open-set recognition and continual learning. The underlying connecting element is motivated from the prior work of Bendale and Boult [12], who proposed to leverage extreme value theory (EVT) to address open-set detection in deep neural networks. The authors suggested to modify softmax prediction scores on the basis of feature space distances in blackbox discriminative models. Although this approach is promising, it alas comes with the substantial caveat that purely discriminative networks are prone to encode noise as features [13] or fall for a most simple discriminative solution that neglects meaningful features [14]. Inspired by these former insights, we set out to connect open-set recognition and continual learning, while overcoming present limitations through treatment from a generative modeling perspective. Our specific contributions are that we propose to unify the prevention of catastrophic interference in continual learning with open-set recognition in a single model. Specifically, we extend prior EVT works [9,10,12] to a natural formulation on the basis of the aggregate posterior in variational inference with deep autoencoders [15,16]. 
By identifying out-of-distribution instances, we can detect unseen unknown data and prevent false predictions; by explicitly generating in-distribution samples from areas of high probability density under the aggregate posterior, we can simultaneously circumvent rehearsal of ambiguous uninformative examples. This leads to robust application, while significantly reducing catastrophic interference. We empirically corroborate our approach through improved out-of-distribution detection performance and simultaneously reduced catastrophic interference in continual learning. We further demonstrate benefits through recent deep generative modeling advances, such as autoregression [2,17,18] and introspection [19,20], validated by scaling to high-resolution color images.

1.1. Background and Related Work

1.1.1. Continual Learning

In isolated supervised learning, the core assumption is the presence of i.i.d. data at all times and training is conducted using a dataset $D \equiv \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$, consisting of N pairs of data instances $x^{(n)}$ and their corresponding labels $y^{(n)} \in \{1, \ldots, C\}$ for C classes. In contrast, in continual learning, data $D_t \equiv \{(x_t^{(n)}, y_t^{(n)})\}_{n=1}^{N_t}$ with $t = 1, \ldots, T$ arrives sequentially for T disjoint sets, each with number of classes $C_t$. It is assumed that only the data of the current task is available. Without additional mechanisms, tuning on such a sequence will lead to catastrophic interference [7,8], i.e., representations of former tasks being overwritten through present optimization. A recent review of many continual-learning algorithms to prevent said interference was provided by Parisi et al. [11]. Here, we present a brief summary of the key underlying principles. Alleviating catastrophic interference is most prominently addressed from two angles. Regularization methods, such as synaptic intelligence (SI) [21] or elastic weight consolidation (EWC) [22], explicitly constrain the weights during continual learning to avoid drifting too far away from the previous tasks’ solutions. In a related picture, learning without forgetting [23] uses knowledge distillation [24] to regularize the end-to-end functional. Rehearsal methods, on the other hand, store data subsets from distributions belonging to old tasks or generate samples in pseudo-rehearsal [25]. The central component of the latter is thus the selection of significant instances. For methods such as incremental classifier and representation learning (iCaRL) [26], it is therefore common to resort to auxiliary techniques, such as the nearest-mean classifier [27] or core sets [28]. Inspired by complementary learning systems [29], dual-model approaches sample data from a separate generative memory. 
In a bio-inspired incremental learning architecture (GeppNet) [30], long short-term memory [31] is used for storage, whereas generative replay [32] samples from an additional generative adversarial network (GAN) [33]. As detailed in Variational Generative Replay (VGR) [34,35], methods with a Bayesian perspective encompass a natural capability for continual learning by making use of the learned distribution. Existing works nevertheless fall into the above two categories and their combination: a prior-based approach using the former task’s approximate posterior as the new task’s prior [36] or estimating the likelihood of former data through generative replay or other forms of rehearsal [34,37]. Crucially, the success of many continual-learning techniques can be attributed primarily to the considered evaluation scenario. With the exception of VGR [34], the majority of above techniques train a separate classifier per task and thus either require the explicit storage of task labels or assume the presence of a task oracle during evaluation. This multi-head scenario prevents “cross-talk” between classifier units by not sharing them, which would otherwise rapidly decay the accuracy as newly introduced classes directly confuse existing concepts. While the latter is acceptable to limit catastrophic interference, it also signifies a major limitation in practical applications. Even though VGR [34] uses a single classifier, the researchers trained a separate generative model per task to avoid catastrophic interference in the generator. Our approach builds upon these previous works and leverages variational inference in deep generative models. However, we propose to tie the prevention of catastrophic interference with open-set recognition through a natural mechanism based on the aggregate posterior in a single model.

1.1.2. Out-of-Distribution and Open Set Recognition

The above-mentioned literature focused their efforts predominantly on addressing catastrophic interference. Even though continual learning is the desideratum, the corresponding evaluation is thus conducted in a closed world setting, where instances that do not belong to the observed data distribution are not encountered. In reality, this is not guaranteed as users could provide arbitrary inputs or unknowingly present the system with novel inputs that deviate substantially from previously seen instances. Our models thus require the ability to identify unseen examples in the unconstrained open world and categorize them as either belonging to the already known set of classes or as presently being unknown. We provide a small overview of approaches that aim to address this question in deep neural networks. A comprehensive survey was provided by Boult et al. [1]. As the most simple approach, the aim of calibration works is to separate a known and unknown input through prediction confidence, often by fine tuning or re-training an already existing model. In out-of-distribution detector for neural networks (ODIN) [38], this is addressed through perturbations and temperature scaling, while Lee et al. [39] used a separately trained GAN to generate out-of-distribution samples from low probability densities and explicitly reduced their confidence through the inclusion of an additional loss term. Similarly, the objectosphere loss [40] defines an objective that explicitly aims to maximize entropy for upfront available unknown inputs. As we do not have access to future data a priori, by definition, a naive conditioning or calibration on unseen unknown data is infeasible. The commonly applied thresholding is insufficient as overconfident prediction values cannot be prevented [3]. Bayesian neural network models [41] could be believed to intrinsically be able to reject statistical outliers through model uncertainty [34] and overcome this limitation of overconfident prediction values. 
For use with deep neural networks, it was suggested that stochastic forward passes with Monte-Carlo Dropout (MCD) [42] can provide a suitable approximation. However, the closed-world assumption in training and evaluation still persists [1]. In addition, variational approximations in deep networks [15,34,37,43] and corresponding uncertainty estimates suffer from similar overconfidence, and the distinction of unseen out-of-distribution data from already trained knowledge is known to be unsatisfactory [5,6]. A more formal approach was suggested in works based on open-set recognition [9]. The key here is to limit predictions originating from open space, that is, the area in obtained embeddings that is outside of a small radius around previously observed training examples. Without re-training, post hoc calibration or modifying loss functions, one approach to open-set recognition in deep networks is through extreme-value theory (EVT) [10,12]. Here, limiting the threat of overconfidence is based on monotonically decreasing the recognition function’s probability with respect to increasing distance of instances to the feature embedding of known training points. The Weibull distribution, as one member of the family of extreme value distributions, has been empirically demonstrated to work well in conjunction with distances in the penultimate deep network layer as the underlying feature space. On the basis of extreme values with respect to this layer’s average activation values, the authors devised a procedure to revise the Softmax prediction values, referred to as OpenMax. In a similar spirit, our work avoids relying on predictive values, while also moving away from empirically chosen deep neural network feature spaces. We instead propose to use EVT to bound the approximate posterior in variational inference. We thus directly operate on the underlying (lower-bound to the) data distribution and the generative factors. 
This additionally allows us to constrain the generative replay to distribution inliers, which further alleviates catastrophic interference.

2. Materials and Methods

2.1. Unifying Catastrophic Interference Prevention with Open Set Recognition

We first summarize the preliminaries on continual learning from a perspective of variational inference in deep generative models [15,43]. We then proceed by bridging the improved prevention of catastrophic interference in continual learning with the detection of unseen unknown data in open-set recognition.

2.1.1. Preliminaries: Learning Continually through Variational Auto-Encoding

We start with a problem scenario similar to the one introduced in “Auto-Encoding Variational Bayes” [15], i.e., we assume that there exists a data generation process responsible for the creation of the labeled data given some random latent variable $z$. We consider a model with a shared encoder with variational parameters $\theta$, decoder and linear classifier with respective parameters $\phi$ and $\xi$. The joint probabilistic encoder learns an encoding to a latent variable $z$, over which a unit Gaussian prior is placed. Using variational inference, the encoder’s purpose is to approximate the true posterior to $p_\phi(x, z)$ and $p_\xi(y, z)$. The probabilistic decoder $p_\phi(x|z)$ and probabilistic linear classifier $p_\xi(y|z)$ then return the conditional probability density of the input $x$ and target $y$ under the respective generative model given a sample $z$ from the approximate posterior $q_\theta(z|x)$. This yields a generative model $p(x, y, z)$, for which we assume a factorization and generative process of the form $p(x, y, z) = p(x|z)\,p(y|z)\,p(z)$. For variational inference with this model, the sum over all elements in the dataset $n \in D$ in the following lower-bound is optimized:

$$\mathcal{L}(x^{(n)}, y^{(n)}; \theta, \phi, \xi) = -\beta\, KL\left(q_\theta(z|x^{(n)}) \,\|\, p(z)\right) + \mathbb{E}_{q_\theta(z|x^{(n)})}\left[\log p_\phi(x^{(n)}|z) + \log p_\xi(y^{(n)}|z)\right] \qquad (1)$$

where KL denotes the Kullback-Leibler divergence. In other words, the right hand side of Equation (1) defines our loss $\mathcal{L}$. This model can be seen as employing a variant of a (semi-)supervised variational auto-encoder (VAE) [16] with a $\beta$ term [44], where, in addition to approximating the data distribution, the model learns to incorporate the class structure into the latent space. Without the terms that involve the label $y$ (printed in blue in the original equation), the original unsupervised VAE formulation [15] is recovered. This forms the basis for continual learning with open-set recognition as discussed in the subsequent section. An illustration of the model is shown in Figure 1.
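The lower bound described above combines a reconstruction term, a classification term and a weighted KL term. The following is a minimal NumPy sketch of a single-sample estimate of that bound, not the authors' implementation. It assumes a diagonal Gaussian approximate posterior (for which the KL term to a unit Gaussian has a closed form), a Bernoulli decoder and a softmax classifier; the function names and the `beta` default are illustrative choices.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def bernoulli_log_lik(x, x_prob, eps=1e-7):
    """log p(x|z) for inputs in [0, 1] under an (assumed) Bernoulli decoder."""
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    return np.sum(x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob), axis=-1)

def categorical_log_lik(y, logits):
    """log p(y|z) for integer labels y under a softmax classifier."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))
    return log_probs[np.arange(len(y)), y]

def joint_lower_bound(x, y, mu, log_var, x_prob, logits, beta=1.0):
    """Per-instance bound: reconstruction + classification - beta * KL."""
    return (bernoulli_log_lik(x, x_prob)
            + categorical_log_lik(y, logits)
            - beta * gaussian_kl(mu, log_var))
```

In training, one would minimize the negative of this quantity averaged over a mini-batch; dropping the `categorical_log_lik` term gives the unsupervised variant.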

Figure 1

A joint continual-learning model consisting of a shared probabilistic encoder $q_\theta(z|x)$, probabilistic decoder $p_\phi(x|z)$ and probabilistic classifier $p_\xi(y|z)$. For open-set recognition and generative replay with outlier rejection, extreme-value theory (EVT) based bounds on the basis of the approximate posterior are established.

Abstracting away from the mathematical detail and speaking informally about the intuition behind the model, we first take a data input $x$ and encode it into two vectors. These vectors represent the mean $\mu$ and standard deviation $\sigma$ of a Gaussian distribution. Using the reparametrization trick $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, a sample from this distribution is then calculated. During training, the respective embedding, also referred to as the latent space, is encouraged to follow a unit Gaussian distribution through the minimization of the Kullback-Leibler divergence. A linear classifier that operates directly on this latent embedding to predict a class for a sample additionally ensures that the obtained distribution is clustered according to the classes. Examples of such fits are shown later in Figure 2. Finally, the decoder takes the latent variable as input and reconstructs the original data input during training. Once the model has finished training, we can also directly draw a sample from the unit Gaussian prior, $z \sim p(z)$, and generate a novel data point directly, without the need to compute the encoder first. A corresponding full and formal derivation of Equation (1), the lower-bound to the joint distribution, is supplied in Appendix A.1.
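The informal walk-through above (encode to a mean and a spread, reparametrize, classify and decode, then generate from the prior without the encoder) can be sketched in a few lines. This is a toy illustration only: the random linear maps `W_mu`, `W_log_var`, `W_dec` and `W_cls` are hypothetical stand-ins for the deep encoder, decoder and linear classifier, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Z, C = 8, 2, 3  # toy input, latent and class dimensions (assumed)

# Hypothetical random linear maps standing in for the trained networks.
W_mu, W_log_var = rng.normal(size=(D, Z)), 0.1 * rng.normal(size=(D, Z))
W_dec, W_cls = rng.normal(size=(Z, D)), rng.normal(size=(Z, C))

def encode(x):
    """Return mean and log-variance of the Gaussian approximate posterior."""
    return x @ W_mu, x @ W_log_var

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def decode(z):
    """Map a latent vector to Bernoulli means in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z @ W_dec)))

def classify(z):
    """Linear classifier acting directly on the latent vector."""
    return np.argmax(z @ W_cls, axis=-1)

# Training-time path: encode, sample, then reconstruct and classify.
x = rng.random((4, D))
mu, log_var = encode(x)
z = reparameterize(mu, log_var, rng)
x_rec, y_pred = decode(z), classify(z)

# Generation path: sample the unit Gaussian prior, no encoder needed.
z_prior = rng.standard_normal((5, Z))
x_new, y_new = decode(z_prior), classify(z_prior)
```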

Figure 2

2-D latent space aggregate posterior visualization for continually learned MNIST (Modified National Institute of Standards and Technology database). From left to right, the latent space for four, six and then eight classes are shown. This is best viewed in color.

Without further constraints, one could continually train the above model by sequentially accumulating and optimizing Equation (1) over all currently present tasks $t = 1, \ldots, T$. Being based on the accumulation of real data, this provides an upper bound to the achievable performance in continual learning. However, this form of continued training is generally infeasible if only the most recent task’s data is assumed to be available. Making use of the model’s generative nature, we can follow previous works [34,37] and estimate the likelihood of former data through generative replay:

$$\mathcal{L}_t(\theta, \phi, \xi) = \frac{1}{\tilde{N}_t} \sum_{n=1}^{\tilde{N}_t} \mathcal{L}(\tilde{x}_t^{(n)}, \tilde{y}_t^{(n)}; \theta, \phi, \xi) + \frac{1}{N_t} \sum_{n=1}^{N_t} \mathcal{L}(x_t^{(n)}, y_t^{(n)}; \theta, \phi, \xi) \qquad (2)$$

where

$$\tilde{x}_t \sim p_{\phi,t-1}(x|z), \quad \tilde{y}_t \sim p_{\xi,t-1}(y|z) \quad \text{and} \quad z \sim p(z). \qquad (3)$$

Here, $\tilde{x}_t$ is a sample from the generative model with its corresponding classifier label $\tilde{y}_t$. $\tilde{N}_t$ is the number of instances of all previously seen tasks. In this way, the expectation of the log-likelihood for all previously seen tasks is estimated and the dataset at any point in time is a concatenation of past data generations and the current task’s real data.
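As a rough illustration of generative replay, the sketch below builds the training set for a new task by sampling latent vectors from the unit Gaussian prior, passing them through the previous task's decoder and classifier, and concatenating the result with the new task's real data. The callables `old_decode` and `old_classify` are hypothetical toy stand-ins, and labels are taken deterministically from the classifier rather than sampled, which is a simplifying assumption.

```python
import numpy as np

def generative_replay(decode, classify, num_samples, latent_dim, rng):
    """Draw z ~ N(0, I), then generate inputs and labels with the old model."""
    z = rng.standard_normal((num_samples, latent_dim))
    return decode(z), classify(z)

def build_training_set(x_task, y_task, decode, classify, num_replay, latent_dim, rng):
    """Concatenate replayed past data with the current task's real data."""
    x_old, y_old = generative_replay(decode, classify, num_replay, latent_dim, rng)
    return np.concatenate([x_old, x_task]), np.concatenate([y_old, y_task])

# Toy stand-ins (hypothetical) for the previous task's decoder and classifier.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 6))
old_decode = lambda z: 1.0 / (1.0 + np.exp(-(z @ W)))
old_classify = lambda z: (z[:, 0] > 0).astype(int)  # two old classes: 0 and 1

x_task = rng.random((10, 6))  # real data of the new task
y_task = np.full(10, 2)       # the new task introduces class 2
x_all, y_all = build_training_set(x_task, y_task, old_decode, old_classify,
                                  num_replay=20, latent_dim=2, rng=rng)
```

The model for the new task would then be optimized on `x_all` and `y_all`, so that old classes are rehearsed through generated data only.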

2.1.2. Open Set Recognition and Generative Replay with Statistical Outlier Rejection

Trained naively in the above fashion, our model will unfortunately suffer from accumulated errors with each successive iteration of generative replay, similar to the current literature approaches. To avoid this, we would alternatively require the training of multiple encoders to approximate each task’s posterior individually, as in variational continual learning (VCL) [36], or train multiple generators, as in VGR [34]. We posit that the main challenge is that high-density areas under the prior $p(z)$ are not necessarily reflected in the structure of the aggregate posterior $q_{\theta,t}(z)$ [45]. The latter refers to the practically obtained encoding [46]:

$$q_{\theta,t}(z) = \mathbb{E}_{p_{D_t}(x)}\left[q_{\theta,t}(z|x)\right] \approx \frac{1}{N_t} \sum_{n=1}^{N_t} q_{\theta,t}(z|x^{(n)}) \qquad (4)$$

To provide intuition, we illustrate this prior-posterior discrepancy on the obtained two-dimensional latent encodings for a continually trained supervised MNIST (Modified National Institute of Standards and Technology database) [47] model in Figure 2. Here, we can make two observations: to preserve the inherent data structure, the aggregate posterior deviates from the prior. In fact, this is further amplified by the imposed necessity for linear class separation and the $\beta$ term in Equation (1); however, we note that the discrepancy is desired even in completely unsupervised scenarios [45,46]. The underlying rationale is that there needs to be a balance in the effective latent encoding overlap [48], which can best be summarized with a direct quote from the recent work of Mathieu et al. [49]: “The overlap is perhaps best understood by considering extremes: with too little the latents effectively become a lookup table; too much, and the data and latents do not convey information about each other. In either case, meaningfulness of the latent encodings is lost.” (p. 4). Additional discussion on the role of $\beta$ can be found in Appendix A.2. Thus, the generated data from low-density regions of the aggregate posterior do not generally correspond to the encountered data instances. 
Conversely, data instances that fall into high-density regions under the prior should not generally be considered as statistical inliers with respect to the observed data distribution; recall Figure 2. This boundary between low- and high-density regions forms the basis for a natural connection between open-set recognition and continual learning: generate from high-density regions and reject novel instances that fall into low-density regions. Ideally, we could find a solution by replacing the prior in the KL divergence of Equation (1) with $q_{\theta,t-1}(z)$ and, respectively, sampling $z \sim q_{\theta,t-1}(z)$ in Equations (2) and (3). Even though using the aggregate posterior as a subsequent prior is the objective in multiple recent works, it can be challenging in high dimensions, lead to over-fitting or come at the expense of additional hyper-parameters [45,50,51]. To avoid finding an explicit representation for the multi-modal $q_{\theta,t-1}(z)$, we draw inspiration from the EVT-based OpenMax approach [12] in deep neural networks. However, instead of using knowledge about extreme distances in penultimate layer activations to modify a Softmax prediction, we now propose to apply EVT on the basis of the class conditional aggregate posterior. In this view, any sample can be regarded as statistically outlying if its distance to the classes’ latent mean is extreme with respect to what has been observed for the majority of correctly predicted data instances, i.e., the sample falls into a region of low density under the aggregate posterior and is less likely to belong to the observed data distribution $p_D(x)$. For convenience, let us introduce the indices of all correctly classified instances at the end of task t as $m \in M_t$, with $M_{c,t} \subseteq M_t$ denoting those that belong to class c. To obtain bounds on the aggregate posterior, we first define the mean latent vector $\bar{z}_{c,t}$ for each class for all correctly predicted seen data instances and the respective set of latent distances $\Delta_{c,t}$ as

$$\bar{z}_{c,t} = \frac{1}{|M_{c,t}|} \sum_{m \in M_{c,t}} \mathbb{E}_{q_{\theta,t}(z|x_t^{(m)})}[z], \qquad \Delta_{c,t} \equiv \left\{ f_d\left(\bar{z}_{c,t}, \mathbb{E}_{q_{\theta,t}(z|x_t^{(m)})}[z]\right) \right\}_{m \in M_{c,t}} \qquad (5)$$

Here, $f_d$ signifies a choice of distance metric. We proceed to model this set of distances with a per class heavy-tail Weibull distribution $\rho_{c,t} = (\tau_{c,t}, \kappa_{c,t}, \lambda_{c,t})$ on $\Delta_{c,t}$ for a given tail-size $\eta$. 
As these distances are based on the class conditional approximate posterior, we can thus bound the latent space regions of high density. The tightness of the bound is characterized through $\eta$, which can be seen as a prior belief with respect to the outlier quantity assumed to be inherently present in the data distribution. The choice of $f_d$ determines the nature and dimensionality of the obtained distance distribution. For our experiments, we find that the cosine distance and thus a univariate Weibull distance distribution per class seems to be sufficient. Using the cumulative distribution function of this Weibull model we can now estimate any sample’s outlier (or inlier) probability:

$$\omega_{\rho,t}(z) = \min_{c} \left( 1 - \exp\left( -\left( \frac{|f_d(\bar{z}_{c,t}, z) - \tau_{c,t}|}{\lambda_{c,t}} \right)^{\kappa_{c,t}} \right) \right) \qquad (6)$$

where the minimum returns the smallest outlier probability across all classes. If this outlier probability is larger than a prior rejection probability $\Omega_t$, the instance can be considered as unknown. Such a formulation, which we term open variational auto-encoder (OpenVAE), now provides us with the means to learn continually and identify unknown data: For a novel data instance, Equation (6) yields the outlier probability based on the probabilistic encoder $q_{\theta,t}(z|x)$, and a false overconfident classifier prediction can be avoided. To mitigate catastrophic interference, Equation (6) can be used on top of $z \sim p(z)$ to constrain the generative replay (Equation (3)) to the aggregate posterior, thus avoiding the need to sample it directly. To give an illustration of the benefits, we show the generated MNIST [47] and larger resolution flower images [52] together with their outlier percentage in Figure 3. In practical application, we discard the ambiguous examples that are due to low-density regions and thus a high outlier probability. 
Even though we conduct sampling with rejection, note how this is computationally efficient, as we only need to calculate the heavy probabilistic decoder for accepted statistically inlying examples, and sampling from the prior with computation of Equation (6) is almost negligible in comparison.
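The procedure described in this section (per-class latent means, distances of correctly classified instances, a Weibull fit on the largest distances, the minimum Weibull CDF value as outlier probability, and rejection sampling from the prior) can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the authors' code: it uses a two-parameter Weibull (location fixed at zero) fitted by maximum likelihood with a bisection on the shape, and the latent vectors passed to `fit_class_bounds` are assumed to be the encodings of correctly classified training instances. All function names and the toy data are hypothetical.

```python
import numpy as np

def cosine_distance(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.maximum(1.0 - np.sum(a * b, axis=-1), 0.0)

def fit_weibull(x, iters=100):
    """Two-parameter Weibull MLE (shape, scale); bisection on the shape."""
    x = np.maximum(np.asarray(x, dtype=float), 1e-12)
    m = x.max()
    u, log_u = x / m, np.log(x / m)  # rescaling leaves the shape equation unchanged

    def g(k):  # increasing in k, zero at the MLE shape
        w = u ** k
        return np.sum(w * log_u) / np.sum(w) - 1.0 / k - log_u.mean()

    lo, hi = 1e-3, 1e3
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    shape = 0.5 * (lo + hi)
    scale = m * np.mean(u ** shape) ** (1.0 / shape)
    return shape, scale

def fit_class_bounds(z, y, num_classes, tail_size):
    """Per class: mean latent vector and Weibull fit on the largest distances."""
    bounds = []
    for c in range(num_classes):
        z_c = z[y == c]
        mean_c = z_c.mean(axis=0)
        tail = np.sort(cosine_distance(z_c, mean_c))[-tail_size:]
        bounds.append((mean_c, *fit_weibull(tail)))
    return bounds

def outlier_probability(z, bounds):
    """Smallest Weibull CDF value across classes for each latent vector."""
    probs = [1.0 - np.exp(-(cosine_distance(z, mean_c) / scale) ** shape)
             for mean_c, shape, scale in bounds]
    return np.min(probs, axis=0)

def sample_inliers(bounds, num_samples, latent_dim, rejection_prior, rng,
                   max_rounds=1000):
    """Rejection sampling: keep prior samples with low outlier probability."""
    accepted, count = [], 0
    for _ in range(max_rounds):
        z = rng.standard_normal((4 * num_samples, latent_dim))
        keep = z[outlier_probability(z, bounds) <= rejection_prior]
        accepted.append(keep)
        count += len(keep)
        if count >= num_samples:
            return np.concatenate(accepted)[:num_samples]
    raise RuntimeError("too few prior samples accepted")

# Toy 2-D latent encodings for two classes (synthetic, for illustration only).
rng = np.random.default_rng(0)
z_train = np.concatenate([rng.normal([3, 0], 0.3, (200, 2)),
                          rng.normal([0, 3], 0.3, (200, 2))])
y_train = np.repeat([0, 1], 200)
bounds = fit_class_bounds(z_train, y_train, num_classes=2, tail_size=20)
p_out = outlier_probability(np.array([[3.0, 0.0], [-3.0, -3.0]]), bounds)
z_replay = sample_inliers(bounds, 50, 2, rejection_prior=0.95, rng=rng)
```

Only the accepted latent vectors in `z_replay` would be passed on to the decoder, which is why the rejection step adds little cost compared with generating and then filtering full images.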

Figure 3

Generated images $\tilde{x} \sim p_\phi(x|z)$ with $z \sim p(z)$ and their corresponding class c obtained from the classifier $p_\xi(y|z)$ together with their open-set outlier percentage in our proposed open variational auto-encoder (OpenVAE). Image quality degradation and class ambiguity can be observed with the increasing outlier likelihood. Generated MNIST images are from the 2-D latent space of Figure 2, with one predicted digit class shown per panel (top left, top right, bottom left and bottom right). Generated larger-resolution flower images are based on a 60-dimensional latent space of a model trained with introspection (see experiments and Appendix A.3), which are classified as “sunflower” (top) and “daisy” (bottom).

3. Results

Instead of presenting a single experiment for continual learning in the constant presence of outlying non-task data, we chose to empirically corroborate our proposed approach in two experimental parts. The first section is dedicated to out-of-distribution detection, where we demonstrate the advantages of EVT in our generative model formulation. We then proceed to showcase how catastrophic interference is also mitigated by confining generative replay to aggregate posterior inliers in class incremental learning. We emphasize that whereas the sections are presented individually, our approach’s uniqueness lies in using a core underlying mechanism to unify both challenges simultaneously. The rationale behind choosing this form of presentation is to help readers better contextualize the contribution of OpenVAE with the existing literature as, to the best of our knowledge, there exists no present other work that yields adequate continual classification accuracy while being able to robustly recognize unknown data instances. As such, we will now see that existing continual-learning approaches provide no suitable mechanism to overcome the challenge of providing robust predictions when data outside the known benchmark set are included.

3.1. Open Set Recognition

We experimentally highlight OpenVAE’s ability to distinguish unknown task data from data belonging to known tasks to avoid overconfident false predictions.

Experimental Set-Up and Evaluation

In summary, our goal is two-fold. The typical goal is to train on an initial task and correctly classify the held-out or unseen test data for this task. That is, we desire a large average classification test accuracy. In addition to this, in order to ensure that this classification is robust to unknown data, we now additionally desire to have a large value for a second kind of accuracy. Our simultaneous goal is to consider all test data of already trained tasks as inlying, while successfully identifying 100% of completely unknown datasets as outliers. For this purpose, we evaluate OpenVAE’s and other models’ capability to distinguish the in-distribution test set of a model trained on MNIST (Modified National Institute of Standards and Technology database) [47], FashionMNIST [53] or AudioMNIST [54] from the other two and several unknown datasets: Kuzushiji-MNIST (KMNIST) [55], Street-View House Numbers (SVHN) [56] and Canadian Institute for Advanced Research (CIFAR) datasets (in both versions with 10 and 100 classes) [57]. Here, the (Fourier-transformed) audio data is included to highlight the extent of the challenge, as not even a different modality is easy to detect without our proposed approach. In practice, we evaluate three criteria according to which a decision of whether a data instance is an outlier can be made:

1. The classifier’s predictive entropy, as recently suggested to work surprisingly well in deep networks [58] but technically well known to be overconfident [3]. The intuition here is that the predictive entropy considers the probability of all other classes and is at a maximum if the distribution is uniform, i.e., when the confidence in the prediction is low.
2. The generative model’s obtained negative log-likelihood, to concur with previous findings [5,6] on overconfidence in generative models. On the basis of Equation (1), the intuition is that the negative log-likelihood should be much larger for unseen data.
3. Our suggested OpenVAE aggregate posterior-based EVT approach, according to the outlier likelihood introduced in Equation (6).

Results

Figure 4 provides a qualitative intuition behind the three criteria and respective percentage of the total dataset being considered as outlying for FashionMNIST. Consistent with Nalisnick et al. [6], we can observe that the use of reconstruction loss can sometimes distinguish between the known tasks’ test data and unknown datasets but results in failure for others. In the case of the classifier predictive entropy, depending on the exact choice of entropy threshold, generally only a partial separation can be achieved. Furthermore, both of these criteria pose the additional challenge of the results being highly dependent on the choice of the precise cut-off value. In contrast, the test data from the known tasks is regarded as inlying across a wide range of rejection priors for Equation (6), and the majority of other datasets is consistently regarded as outlying by our introduced OpenVAE approach.

Figure 4

Model trained on FashionMNIST evaluated on unknown datasets. Robust classification of a known dataset (percentage of dataset outliers at 0%), while correctly flagging unknown datasets as outlying (percentage of dataset outliers at 100%), occurs when the solid green curve is separated from any of the colored dashed curves. (Left) Classifier entropy is insufficient to separate unknown from the known task’s test data. (Center) Reconstruction log-likelihood allows for a partial distinction. (Right) Our posterior-based EVT approach in OpenVAE considers the large majority of unknown data as statistical outliers across a wide range of rejection priors $\Omega_t$.

Corresponding quantitative outlier detection accuracies are provided in Table 1. To find thresholds for the sensitive entropy and reconstruction curves, we used a validation split to determine the respective value at which 95% of the validation data is considered as inlying before using these values to determine outlier counts for the known tasks’ test set as well as other datasets. In an intuitive picture, we “trace” the solid green curve of Figure 4 for a validation set of the originally trained dataset, check where we intersect with the x-axis for a y-axis value of 5% and then fix the corresponding criterion’s value at this point as an outlier rejection threshold for testing. We then report the percentage of the test set being considered as an outlier, together with the percentage for various unknown datasets. In the table, we additionally extend our intuition of Figure 4 to now further investigate what would happen if we had not trained a single VAE model that learned reconstruction and classification according to Equation (1) but separate models. For this purpose, we also investigate a dual model approach, i.e., a purely discriminative deep-neural-network-based classifier and a separate unsupervised VAE (Equation (1) without blue terms).
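The threshold selection described above can be illustrated with a short sketch: compute a per-instance score (here the predictive entropy of the classifier), fix the threshold at the value below which 95% of the validation scores fall, and then report the percentage of any other dataset whose scores exceed it. The function names are illustrative, and the same calibration would apply to a reconstruction negative log-likelihood score.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Entropy of the class probabilities; largest for a uniform prediction."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def calibrate_threshold(val_scores, inlier_fraction=0.95):
    """Score value below which the given fraction of validation data falls."""
    return np.quantile(val_scores, inlier_fraction)

def outlier_percentage(scores, threshold):
    """Percentage of instances whose score exceeds the threshold."""
    return 100.0 * np.mean(scores > threshold)

# Usage with hypothetical softmax outputs of shape (num_instances, num_classes):
# threshold = calibrate_threshold(predictive_entropy(val_probs))
# pct_known = outlier_percentage(predictive_entropy(test_probs), threshold)
# pct_unknown = outlier_percentage(predictive_entropy(unknown_probs), threshold)
```

By construction, about 5% of the validation data lies above the threshold, which matches the values close to 5% reported for each trained dataset's own test set in Table 1.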

Table 1

Outlier detection values of the joint model and separate discriminative and generative models (denoted as “CNN + VAE”; discriminative convolutional neural network and variational auto-encoder), when considering 95% of the known tasks’ validation data as inlying. The percentage of detected outliers is reported based on the classifier predictive entropy, reconstruction negative log-likelihood (NLL) and our posterior-based extreme-value theory approach. Note that larger values are better, except for the test data of the trained dataset, where ideally 0% should be considered as outlying. The outlier detection values have additionally been color coded, where worse results appear in red. A deeper shading thus indicates a method’s failure to robustly recognize unknown data as such. With this color coding, we can easily see how MNIST appears to be an easy to identify dataset for all approaches; however, we notice right away that our OpenVAE is the only method (row) that does not have a single red value for any dataset combination. In fact, the lowest outlier detection accuracy of OpenVAE is a very high 94.76%.

Outlier Detection at 95% Validation Inliers (%)				MNIST	Fashion	Audio	KMNIST	CIFAR10	CIFAR100	SVHN
Trained	Model	Test Acc.	Criterion
MNIST	Dual,	99.40	Class entropy	4.160	90.43	97.53	95.29	98.54	98.63	95.51
	CNN +		Reconstruction NLL	5.522	99.98	99.97	99.98	99.99	99.96	99.98
	VAE		OpenMax	4.362	99.41	99.80	99.86	99.95	99.97	99.52
	Joint	99.53	Class entropy	3.948	95.15	98.55	95.49	99.47	99.34	97.98
	VAE		Reconstruction NLL	5.083	99.50	99.98	99.91	99.97	99.99	99.98
			OpenVAE (ours)	4.361	99.78	99.67	99.73	99.96	99.93	99.70
FashionMNIST	Dual,	90.48	Class entropy	74.71	5.461	69.65	77.85	24.91	28.76	36.64
	CNN +		Reconstruction NLL	5.535	5.340	64.10	31.33	99.50	98.41	97.24
	VAE		OpenMax	96.22	5.138	93.00	91.51	71.82	72.08	73.85
	Joint	90.92	Class entropy	66.91	5.145	61.86	56.14	43.98	46.59	37.85
	VAE		Reconstruction NLL	0.601	5.483	63.00	28.69	99.67	98.91	98.56
			OpenVAE (ours)	96.23	5.216	94.76	96.07	96.15	95.94	96.84
AudioMNIST	Dual,	98.53	Class entropy	97.63	57.64	5.066	95.53	66.49	65.25	54.91
	CNN +		Reconstruction NLL	6.235	46.32	4.433	98.73	98.63	98.63	97.45
	VAE		OpenMax	99.82	78.74	5.038	99.47	93.44	92.76	88.73
	Joint	98.57	Class entropy	99.23	89.33	5.731	99.15	92.31	91.06	85.77
	VAE		Reconstruction NLL	0.614	38.50	3.966	36.05	98.62	98.54	96.99
			OpenVAE (ours)	99.91	99.53	5.089	99.81	100.0	99.99	99.98

In this way, we can showcase the advantages of a generative modeling formulation that considers the joint distribution in conjunction with EVT. For instance, we can compare our values with the purely discriminative OpenMax EVT approach [59]. At the same time, this provides a justification for why the existing continual-learning approaches of the next section, especially those relying on the maintenance of multiple models, are non-ideal, as they cannot seem to adequately solve the open-set challenge. In terms of the obtained results, with the exception of MNIST, which appears to be an easy-to-identify dataset for all approaches, we can make two key observations:

(1) Both EVT approaches generally outperform the other criteria, particularly for our suggested aggregate posterior-based OpenVAE variant, where a near perfect open-set detection can be achieved.

(2) Even though EVT can be applied to purely discriminative models (as in OpenMax), the generative OpenVAE model trained with variational inference consistently exhibited more accurate outlier detection. We posit that this robustness is due to OpenVAE explicitly optimizing a variational lower bound that considers the data distribution in addition to a pure optimization of features that maximize the classification likelihood.

Open Set Recognition with Monte-Carlo Dropout Based Uncertainty

One might be tempted to assume that the trained weights of the individual deep neural network encoder layers are still deterministic and that the failure of predictive entropy as a measure for unseen unknown data could thus primarily be attributed to uncertainty not being expressed adequately. Placing a distribution on the weights, akin to a fully Bayesian neural network, would then be expected to resolve this issue. For this purpose, we further repeat all of our experiments by treating the model weights as the random variable being marginalized through the use of Monte-Carlo Dropout (MCD) [42].
Accordingly, the models were re-trained with a Dropout probability of in each layer. We then conducted 50 stochastic forward passes through the entire model for prediction. The obtained open-set recognition results are reported in Table 2.
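As an illustration of the MCD prediction step, the following sketch averages the class probabilities of several stochastic forward passes and computes the predictive entropy of the average. The toy linear classifier, the helper names and the `drop_prob` parameter are assumptions made purely for illustration; the actual models are deep networks with Dropout in each layer.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout_forward(x, weights, drop_prob, rng):
    """One stochastic pass of a toy linear classifier with input dropout."""
    keep = 1.0 - drop_prob
    # Inverted dropout: zero each input with probability drop_prob,
    # rescale the surviving inputs by 1 / keep.
    masked = [xi / keep if rng.random() < keep else 0.0 for xi in x]
    logits = [sum(w * xi for w, xi in zip(row, masked)) for row in weights]
    return softmax(logits)


def mc_dropout_entropy(x, weights, drop_prob, passes=50, seed=0):
    """Average class probabilities over `passes` stochastic forward passes
    and return them together with the entropy of the averaged prediction."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, drop_prob, rng) for _ in range(passes)]
    n_classes = len(samples[0])
    mean_probs = [sum(s[c] for s in samples) / passes for c in range(n_classes)]
    entropy = -sum(p * math.log(p) for p in mean_probs if p > 0.0)
    return mean_probs, entropy


# Toy usage with three classes and two input features.
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probs, entropy = mc_dropout_entropy([2.0, -1.0], weights, drop_prob=0.5)
```

The resulting entropy can then be thresholded on validation data in the same way as the deterministic class entropy.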

Table 2

Outlier Detection at 95% Validation Inliers (%)				MNIST	Fashion	Audio	KMNIST	CIFAR10	CIFAR100	SVHN
Trained	Model	Test Acc.	Criterion
MNIST	Dual,	99.41	Class entropy	4.276	91.88	96.50	96.65	95.84	97.37	98.58
	CNN +		Reconstruction	4.829	99.99	100.0	99.90	100.0	100.0	100.0
	VAE		OpenMax	4.088	87.84	98.06	95.79	97.34	98.30	95.74
	Joint,	99.54	Class entropy	4.801	97.63	99.38	98.01	99.16	99.39	98.90
	VAE		Reconstruction	5.264	99.98	100.0	100.0	100.0	100.0	100.0
			OpenVAE (ours)	4.978	99.99	100.0	99.94	99.96	99.95	99.68
FashionMNIST	Dual,	90.58	Class entropy	75.50	5.366	70.78	74.41	49.42	49.17	38.84
	CNN +		Reconstruction NLL	55.45	5.048	59.99	99.83	99.35	99.35	99.62
	VAE		OpenMax	77.03	4.920	55.48	70.23	58.73	57.06	44.54
	Joint,	91.50	Class entropy	85.05	4.740	67.90	78.04	63.89	66.11	59.42
	VAE		Reconstruction	1.227	5.422	85.85	39.76	99.94	99.72	99.99
			OpenVAE (ours)	95.83	4.516	94.56	96.04	96.81	96.66	96.28
AudioMNIST	Dual,	98.76	Class entropy	99.97	61.26	4.996	96.77	63.78	65.76	59.38
	CNN +		Reconstruction NLL	7.334	52.37	5.100	98.19	99.97	99.90	99.96
	VAE		OpenMax	92.74	67.18	5.073	90.41	90.56	90.97	89.58
	Joint,	98.85	Class entropy	99.39	89.50	5.333	99.16	94.66	95.12	97.13
	VAE		Reconstruction NLL	15.81	53.83	4.837	41.89	99.90	99.82	99.95
			OpenVAE (ours)	99.50	99.27	5.136	99.75	99.71	99.59	99.91

Although MCD boosts the outlier detection accuracy, particularly for criteria, such as predictive entropy, the previous insights and drawn conclusions still hold. In summary, the joint generative model generally outperforms a purely discriminative model in terms of open-set recognition, independently of the used metric, and our proposed aggregate posterior-based EVT approach of OpenVAE yields an almost perfect separation of known and unseen unknown data. Interestingly, this was already achieved in the prior table without MCD. Resorting to the repeated model calculation of MCD thus appears to be without enough of an advantage to warrant the added computational complexity in the context of posterior-based open-set recognition, a further key advantage of OpenVAE.

3.2. Learning Classes Incrementally in Continual Learning

To showcase how our OpenVAE approach mitigates catastrophic interference in addition to successfully handling unknown data in robust prediction, we conduct an investigation of the test accuracy when learning classes incrementally.

Experimental Set-Up and Evaluation

We consider the incremental MNIST dataset (where classes arrive in groups of two) and the corresponding versions of the FashionMNIST and AudioMNIST datasets, similar to popular literature [11,21,22,32,34]. We re-emphasize that such a setting has a sole focus on mitigating catastrophic interference and does not account for the challenges presented in the previous open-set recognition section, which we detail in the subsequent discussion section. For a flexible comparison, we report our aggregate posterior-based generative replay approach in OpenVAE on both a simple multi-layer perceptron (MLP), as well as a deep convolutional neural network (CNN) based on wide residual networks (WRN). For the former, we follow previous continual-learning studies and employ a two-hidden-layer and 400-unit multi-layer perceptron [60]. For the latter, we use both encoder and decoder architectures of 14-layer wide residual networks [61,62] with a latent dimensionality of 60 [2,18]. For our statistical outlier rejection, we use a rejection prior of and dynamically set tail-sizes to 5% of seen examples per class. For our own experiments, we report the mean and standard deviation of the average classification test accuracy across five experimental repetitions. If our re-implementation of related works achieved a better than original value, we report this number, otherwise the work that reported the specific best value is cited next to it. The full training details, including details on hardware and code, are supplied in Appendix A.4.

Results

In Table 3, we report the final accuracy after having trained on each of the five increments.
For an overall reference, we provide the achievable upper-bound continual-learning performance, i.e., accumulating all data over time and optimizing Equation (1). We can observe that our proposed OpenVAE approach provides significant improvement over generative replay with a conventional supervised VAE. In comparison with the immediately related works, our approach surpasses variational continual learning (VCL) [36], an approach that employs a full Bayesian neural network (BNN), with the additional benefit that our approach scales trivially to complex network architectures.
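To make the idea of generative replay restricted to in-distribution samples more concrete, the following is a rough sketch of rejection sampling in latent space. It rests on several assumptions that are not taken from the source: the Weibull parameters `shape` and `scale` are treated as already fitted, the distance to the nearest class mean is Euclidean, and the value of `rejection_prior` is a placeholder. All function names are hypothetical.

```python
import math
import random


def weibull_cdf(distance, shape, scale):
    """Outlier probability of a latent point at the given distance from a
    class mean, under an (assumed, pre-fitted) Weibull tail model."""
    if distance <= 0.0:
        return 0.0
    return 1.0 - math.exp(-((distance / scale) ** shape))


def sample_inlier_latents(class_means, shape, scale, rejection_prior,
                          n_samples, seed=0, max_tries=100000):
    """Draw latents from a standard normal prior and keep only those whose
    outlier probability w.r.t. the nearest class mean is below the prior."""
    rng = random.Random(seed)
    dim = len(class_means[0])
    accepted = []
    tries = 0
    while len(accepted) < n_samples and tries < max_tries:
        tries += 1
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        nearest = min(math.dist(z, mean) for mean in class_means)
        if weibull_cdf(nearest, shape, scale) < rejection_prior:
            accepted.append(z)
    return accepted


# Toy usage: two classes in a 2-D latent space (all values are placeholders).
means = [[0.0, 0.0], [3.0, 3.0]]
latents = sample_inlier_latents(means, shape=2.0, scale=2.0,
                                rejection_prior=0.2, n_samples=20)
```

In a replay setting, only the accepted latents would be passed through the decoder to generate data for previously seen classes, so that samples from low-density regions are not rehearsed.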

Table 3

The accuracy at the end of the last increment T = 5 for class incremental learning approaches averaged over five runs. For a fair comparison, if our re-implementation of related works achieved a better than original value, we report our number, otherwise the work that reported the specific best value is cited right next to the result. Intermediate results can be found in Appendix A.6.

	Final Accuracy αT(T=5) [%]
Method	MNIST	FashionMNIST	AudioMNIST
MLP upper bound	98.84	87.35	96.43
WRN upper bound	99.29	89.24	97.87
EWC [22]	55.80 [63]	24.48 ± 2.86	20.48 ± 1.73
DGR [32]	75.47 [64]	63.21 ± 1.96	48.42 ± 2.81
VCL [36]	72.30 [35]	32.60 [35]	-
VGR [35]	92.22 [35]	79.10 [35]	-
Supervised VAE	60.88 ± 3.31	62.72 ± 1.38	69.76 ± 1.37
OpenVAE—MLP	87.31 ± 1.22	66.14 ± 0.50	81.84 ± 1.44
OpenVAE—WRN	93.24 ± 3.74	69.88 ± 1.71	87.72 ± 1.59
OpenPixelVAE	96.84 ± 0.35	80.85 ± 0.72	90.23 ± 1.14

In contrast to variational generative replay (VGR) [35], OpenVAE initially appears to fall short. This is not surprising as VGR trains a separate GAN on each task’s aggregate posterior, an apples to oranges comparison considering that we only use a single model. Nevertheless, even in a single model, we can surpass the multi-model VGR by leveraging recent advancements in generative modeling, e.g., by making the neural architecture more complex or augmenting our decoder with autoregressive sampling [2,18] (a complementary technique to OpenVAE, often also called PixelVAE and summarized in Appendix A.3). At the bottom of Table 3, we can see that this significantly improves upon the previously obtained accuracy. The full accuracies, along with other metrics per dataset for all intermediate steps, can be found in Appendix A.6.

High-Resolution Flower Images

While the main goal of this paper is not to push the achievable boundaries of generation, we take this argument one step further and provide empirical evidence that our suggested aggregate posterior-based EVT sampling provides similar benefits when scaling to higher resolution color images. For this purpose, we consider the additional flowers dataset [52] at a resolution of , investigated with five classes and increments of one class per step [65,66]. In addition to autoregressive sampling, we also include a second complementary generative modeling improvement here, called VAEs with introspection (IntroVAE) [19]. A technical description of PixelVAE and IntroVAE is detailed in Appendix A.3. For each generative modeling variant, including autoregression and introspection, we report the degradation of accuracy over time in Figure 5 and demonstrate how their respective open-set-aware version provides substantial improvements. Intuitively, this improvement is due to an increase in the visual generation quality; see the examples in the earlier Figure 3.

Figure 5

Classification accuracy over five runs for continually learned flowers at resolution to demonstrate how generative modeling advances draw similar benefits from our proposed aggregate posterior constrained generative replay (solid lines) over the open-set-unaware baselines (dashed counterparts).

First, it is apparent how every OpenVAE variant improves upon its non-open-set-aware counterpart. We further observe that the best version, OpenIntroVAE, appears to be in the same ballpark as complex recent GAN approaches [65,66], even though they do not solve the open-set recognition challenge and conduct a simplified evaluation. The latter works use a lower resolution of (we were unable to scale to satisfying results at higher resolution) with additional distillation mechanisms, a continuously trained generator but a classifier that is trained and assessed only once at the end. We nevertheless report the respective values for intuition. We conclude that the obtained final accuracy can be competitive and is remarkably close to the achievable upper bound. The suspected initial limitation in VAE generation quality appears to be lifted with modern extensions and our proposed sampling scheme. We also support our quantitative statements visually with a few selected generated images for the various generative variants in Figure 6. We emphasize that these examples are primarily meant to provide visual intuition in support of the existing quantitative results, as it is difficult to draw conclusions about perceived subjective quality from a few images alone. From a qualitative viewpoint, the OpenVAE without generative modeling extensions appears to suffer from the limitations of a traditional VAE and generates blurry images.

Figure 6

Generated flower images for various continually trained models. Images were selected to provide a qualitative intuition behind the quantitative results of Figure 5. Images are compressed for a side-by-side view.

However, our open-set approach nevertheless provides a clearer disambiguation of classes, already at the stage of task 2. The addition of introspection significantly increases the image detail, although the quality still degrades considerably due to ambiguous interpolations in samples from low-density areas outside the aggregate posterior. This is again resolved by combining introspection with our proposed posterior-based EVT approach, where image quality is retained across multiple generative replay steps. From a purely visual perspective, it is clear why this model outperforms the other approaches significantly in terms of quantitative accuracy values. Interestingly, our visual inspection also hints at why the PixelVAE and its open-set variant perform much worse than perhaps initially expected. As the caveat is the same in both PixelVAE and OpenPixelVAE, we only show generated instances for the latter. From these samples, we can hypothesize why the initial performance is competitive but rapidly declines. It appears that the autoregression suffers from forgetting in terms of its long-range pixel dependency. Whereas at the beginning, the information is locally consistent across the entire image, in each consecutive step, a further portion of subsequent pixels for old tasks is progressively replaced with uncorrelated noise. The conditioning thus appears to primarily be captured on new tasks only, resulting in interference effects. We continue this discussion alongside potential other general limitations of generative modeling variant choices in Appendix A.5.

4. Discussion

As a final piece of discussion, we would like to recall and emphasize a few important points of how our results should be interpreted and contextualized.

4.1. Presence of Unknown Data and Current Benchmarks

Perhaps most importantly, we re-iterate that OpenVAE is unique in that it provides a grounded basis to conduct continual learning in the presence of unknown data. However, as evidenced from the quantitative open-set recognition results, the inclusion of unknown data instances into continual learning would immediately result in the failure of the present continual-learning approaches at this point, simply because they lack a principled mechanism to provide robust predictions. For this reason, we show traditional incremental classification results as a proxy to assess our improved aggregate posterior-based generation quality. Our class incremental accuracy reports in this paper should thus be interpreted with caution as they represent only a part of OpenVAE’s capability, similar to a typical ablation study. We nevertheless provided this type of comparison, in order to situate OpenVAE with respect to some existing generative continual-learning methods in terms of catastrophic forgetting, rather than presenting OpenVAE in isolation in a more realistic new setting.

4.2. State of the Art in Class Incremental Learning and Exemplar Rehearsal

Following the above subsection, we note that a fair comparison of realistic class incremental learning is further complicated due to various involved factors. In fact, multiple related works make various additional assumptions on the extra storage of explicit data subsets and the use of multiple generative models per task or even multiple classifiers. We do not make these assumptions here in favor of generality. In this spirit, we focused our evaluation on our contributions’ relevant novelty with respect to combining the detection of unknown data with the prevention of catastrophic forgetting in generative models. The introduced OpenVAE shows that both are achievable simultaneously. At the same time, the reader familiar with the recent continual-learning literature will likely notice that some modern approaches that are attributed with state of the art in class incremental learning have not been included in our comparison. These approaches all fall into the category of exemplar rehearsal. We would like to emphasize that this is deliberate and not out of ignorance, as we see these works as purely complementary. We nevertheless wish to give deserved credit to these works and provide an outlook to one future research direction. The primary reason for omitting a direct comparison with state of the art works in continual learning that employ exemplar rehearsal is that we believe such a comparison would be misleading. In fact, contrasting our OpenVAE against these works would imply that these methods are somehow competing. In reality, exemplar rehearsal, or the so called extraction of core sets, is an auxiliary mechanism that can be applied out-of-the-box to our experimental set-up in this work. The main premise here is that catastrophic forgetting in continual learning can be reduced by retaining an explicit subset of the original data and subsequently continuously interleaving this stored data into the training process. 
Early works, such as iCaRL [26], show that performance is then a function of two key aspects: the data selection technique and the memory buffer size. The former, selection of an appropriate data subset, essentially boils down to a non-continual-learning question, i.e., how to approximate the entire distribution through only a few instances. Exemplar rehearsal works thus make use of existing techniques here, such as core sets [28], herding [67], nearest-mean classifiers [27] or simply picking data samples uniformly at random [68]. The second question, on memory buffer size, has an almost trivial answer: the larger the memory buffer size, the better the performance. This is intuitive, yet also makes comparison challenging, as a memory buffer of the size of the entire dataset is analogous to what we referred to as “incremental upper bound” in our experiments. If we were to simply store the complete dataset, then catastrophic forgetting would be avoided entirely. Modern class incremental learning works make heavy use of this fact and store large portions of the original data, showing that the more data is stored, the higher the performance. Primary examples include the recent works on Mnemonics Training [69], Contrastive Continual Learning (Co2L) [70] or Dark Experience Replay (DER) [71]. We do not wish to dive into a discussion here of whether or not such data storage is realistic or what size of a memory buffer should be assumed. A respective reference that questions and discusses whether storing of original data is synonymous with progress in continual learning is Greedy Sampler and Dumb Learner (GDumb) [68], where it is shown that the amount of extracted data alone accounts for a significant portion of “state-of-the-art” performance. Primarily, we point out that the latter works all show that a larger memory buffer yields “better” class incremental learning performance, i.e., less forgetting.
However, most importantly, extracting and storing parts of the original data into a separate memory buffer is an auxiliary process that is entirely complementary to our propositions of OpenVAE. As such, each of the methods referenced in this subsection is straightforward to combine with our work. Although we see such a combination as important prospective work, we leave detailed experimentation up to future investigations. The rationale behind this choice is that inclusion of a memory buffer will inevitably further boost the results of Table 3, yet provide no additional insights to our main hypothesis and contribution: the proposition of OpenVAE to show that detection of unknown data for robust prediction can effectively be achieved alongside reduction of catastrophic forgetting in continual learning.
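For readers unfamiliar with exemplar rehearsal, the mechanism discussed in this subsection can be sketched in a few lines. The example below uses the simplest selection strategy mentioned above, class-balanced uniform random sampling, and is only an assumption-laden illustration: the function names are hypothetical and it does not reproduce any of the cited methods.

```python
import random
from collections import defaultdict


def build_memory_buffer(dataset, buffer_size, seed=0):
    """Select a class-balanced subset of (sample, label) pairs uniformly at
    random, holding at most `buffer_size` exemplars in total."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in dataset:
        by_class[label].append((sample, label))
    per_class = buffer_size // len(by_class)
    buffer = []
    for label in sorted(by_class):
        items = by_class[label]
        buffer.extend(rng.sample(items, min(per_class, len(items))))
    return buffer


def interleave(new_task_data, buffer, seed=0):
    """Mix stored exemplars of old tasks into the data of the new task."""
    rng = random.Random(seed)
    mixed = list(new_task_data) + list(buffer)
    rng.shuffle(mixed)
    return mixed


# Toy usage: keep ten exemplars of an old two-class task and interleave
# them with the data of a new task containing classes 2 and 3.
old_task = [(i, i % 2) for i in range(100)]
new_task = [(i, 2 + i % 2) for i in range(100, 200)]
memory = build_memory_buffer(old_task, buffer_size=10)
training_stream = interleave(new_task, memory)
```

As `buffer_size` approaches the size of the full dataset, this procedure converges to the accumulated-data upper bound, which is why the buffer size dominates the attainable accuracy.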

5. Conclusions

We proposed an approach to unify the prevention of catastrophic interference in continual learning with open-set recognition based on variational inference in deep generative models. As a common denominator, we introduced EVT-based bounds to the aggregate posterior. The correspondingly named OpenVAE was shown to achieve compelling results in being able to distinguish known from unknown data, while boosting the generation quality in continual learning with generative replay. We believe that our demonstrated benefits from recent generative modeling techniques in the context of high-resolution flower images with OpenVAE provide a natural synergy to be explored in a range of future applications. We envision prospective works to employ OpenVAE as a baseline when relaxing the closed-world assumption in continual learning and allowing unknown data to appear in the investigated benchmark streams at all times in the move to a more realistic evaluation.

Table A1

Losses obtained for different β values for MNIST with a 2-D latent space. Training conducted in isolated fashion to quantitatively showcase the role of β. Un-normalized values in nats are reported in brackets for reference purposes.

		In Nats per Dimension (Nats in Brackets)
2-D Latent	Beta	KLD	Recon Loss	Class Loss	Accuracy [%]
train	1.0	1.039 (2.078)	0.237 (185.8)	0.539 (5.39)	79.87
test		1.030 (2.060)	0.235 (184.3)	0.596 (5.96)	78.30
train	0.5	1.406 (2.812)	0.230 (180.4)	0.221 (2.21)	93.88
test		1.382 (2.764)	0.228 (178.8)	0.305 (3.05)	92.07
train	0.1	2.055 (4.110)	0.214 (167.8)	0.042 (0.42)	99.68
test		2.071 (4.142)	0.212 (166.3)	0.116 (1.16)	98.73
train	0.05	2.395 (4.790)	0.208 (163.1)	0.025 (0.25)	99.83
test		2.382 (4.764)	0.206 (161.6)	0.159 (1.59)	98.79
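As a reading aid for the table above, the per-dimension values appear to be the bracketed un-normalized values divided by the size of the respective quantity. The divisors used below (2 latent dimensions, 784 = 28 × 28 pixels, 10 classes) are inferred from the reported numbers and the MNIST setting rather than stated explicitly, so this is a consistency check under that assumption.

```python
def per_dimension(nats, dims):
    """Normalize a loss given in nats by the number of dimensions."""
    return nats / dims


# First row of the table (train, beta = 1.0); divisors are assumptions.
LATENT_DIMS = 2      # 2-D latent space
PIXELS = 28 * 28     # MNIST image size
CLASSES = 10         # number of MNIST classes

kld = per_dimension(2.078, LATENT_DIMS)
recon = per_dimension(185.8, PIXELS)
class_loss = per_dimension(5.39, CLASSES)

print(round(kld, 3), round(recon, 3), round(class_loss, 3))
```

The rounded results match the reported per-dimension entries of 1.039, 0.237 and 0.539 for that row.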

Table A2

Losses obtained for different β values for MNIST with a 60-D latent space. Training conducted in isolated fashion to quantitatively showcase the role of β. Un-normalized values in nats are reported in brackets for reference purposes.

		In Nats per Dimension (Nats in Brackets)
60-D Latent	Beta	KLD	Recon Loss	Class Loss	Accuracy [%]
train	1.0	0.108 (6.480)	0.184 (144.3)	0.0110 (0.110)	99.71
test		0.110 (6.600)	0.181 (142.0)	0.0457 (0.457)	99.03
train	0.5	0.151 (9.060)	0.162 (127.1)	0.0052 (0.052)	99.87
test		0.156 (9.360)	0.159 (124.7)	0.0451 (0.451)	99.14
train	0.1	0.346 (20.76)	0.124 (97.22)	0.0022 (0.022)	99.95
test		0.342 (20.52)	0.126 (98.79)	0.0286 (0.286)	99.38
train	0.05	0.476 (28.56)	0.115 (90.16)	0.0018 (0.018)	99.95
test		0.471 (28.26)	0.118 (92.53)	0.0311 (0.311)	99.34

Table A3

A 14-layer wide residual network (WRN) encoder with a widen factor of 10. Convolutional layers (conv) are parametrized by a quadratic filter size followed by the number of filters. p and s represent zero padding and stride, respectively. If no padding or stride is specified, then p = 0 and s = 1. Skip connections are an additional operation at a layer, with the layer to be skipped specified in brackets. Convolutional layers are followed by batch-normalization and a rectified linear unit (ReLU) activation. The probabilistic encoder ends on fully-connected layers for μ and σ that depend on the chosen latent space dimensionality and the data’s spatial size.

Layer Type	WRN Encoder
Layer 1	conv 3×3—48, p = 1
Block 1	conv 3×3—160, p = 1; conv 1×1—160 (skip next layer)
	conv 3×3—160, p = 1
	conv 3×3—160, p = 1; shortcut (skip next layer)
	conv 3×3—160, p = 1
Block 2	conv 3×3—320, s = 2, p = 1; conv 1×1—320, s = 2 (skip next layer)
	conv 3×3—320, p = 1
	conv 3×3—320, p = 1; shortcut (skip next layer)
	conv 3×3—320, p = 1
Block 3	conv 3×3—640, s = 2, p = 1; conv 1×1—640, s = 2 (skip next layer)
	conv 3×3—640, p = 1
	conv 3×3—640, p = 1; shortcut (skip next layer)
	conv 3×3—640, p = 1

Table A4

A 14-layer WRN decoder with a widen factor of 10. Pw and Ph refer to the input’s spatial dimensions. Convolutional (conv) and transposed convolutional (conv_t) layers are parametrized by a quadratic filter size followed by the number of filters. p and s represent zero padding and stride, respectively. If no padding or stride is specified, then p = 0 and s = 1. Skip connections are an additional operation at a layer, with the layer to be skipped specified in brackets. Every convolutional and fully-connected (FC) layer is followed by batch-normalization and a rectified linear unit (ReLU) activation function. The model ends on a Sigmoid function.

Layer Type	WRN Decoder
Layer 1	FC 640×[Pw/4]×[Ph/4]
Block 1	conv_t 3×3—320, p = 1; conv_t 1×1—320 (skip next layer)
	conv 3×3—320, p = 1
	conv 3×3—320, p = 1; shortcut (skip next layer)
	conv 3×3—320, p = 1
	upsample × 2
Block 2	conv_t 3×3—160, p = 1; conv_t 1×1—160 (skip next layer)
	conv 3×3—160, p = 1
	conv 3×3—160, p = 1; shortcut (skip next layer)
	conv 3×3—160, p = 1
	upsample × 2
Block 3	conv_t 3×3—48, p = 1; conv_t 1×1—48 (skip next layer)
	conv 3×3—48, p = 1
	conv 3×3—48, p = 1; shortcut (skip next layer)
	conv 3×3—48, p = 1
Layer 2	conv 3×3—3, p = 1

Table A5

The results for class incremental continual-learning approaches averaged over five runs, baselines and the reference isolated learning scenario for MNIST at the end of every task increment. This is an extension of Table 3 in the main body. Here, in addition to the accuracy αt, γt also indicates the respective negative log-likelihood (NLL) at the end of every task increment t.

MNIST	t	UB	FT	SupVAE	OpenVAE	PixelVAE DGR	SupPixelVAE	OpenPixelVAE
αbase,t (%)	1	100.0	100.0	99.97 ±0.029	99.98 ±0.018	99.97 ±0.002	99.97 ±0.026	99.86 ±0.084
	2	99.82	00.00	97.28 ±3.184	99.30 ±0.100	99.54 ±0.285	96.90 ±2.907	99.64 ±0.095
	3	99.80	00.00	87.66 ±8.765	96.69 ±2.173	99.16 ±0.611	90.12 ±5.846	98.88 ±0.491
	4	99.85	00.00	54.70 ±22.84	94.71 ±1.792	98.33 ±1.119	76.84 ±9.095	98.11 ±0.797
	5	99.57	00.00	19.86 ±7.396	92.53 ±4.485	98.04 ±1.397	56.53 ±4.032	97.44 ±0.785
αnew,t (%)	1	100.0	100.0	99.97 ±0.029	99.98 ±0.018	99.97 ±0.002	99.97 ±0.026	99.86 ±0.084
	2	99.80	99.85	99.75 ±0.127	99.80 ±0.126	99.71 ±0.122	99.74 ±0.052	99.82 ±0.027
	3	99.67	99.94	99.63 ±0.172	99.61 ±0.055	99.41 ±0.084	99.22 ±0.082	99.56 ±0.092
	4	99.49	100.0	99.05 ±0.470	99.15 ±0.032	98.61 ±0.312	97.84 ±0.180	98.80 ±0.292
	5	99.10	99.86	99.00 ±0.100	99.06 ±0.171	97.31 ±0.575	96.77 ±0.337	98.63 ±0.430
αall,t (%)	1	100.0	100.0	99.97 ±0.029	99.98 ±0.018	99.97 ±0.002	99.97 ±0.026	99.86 ±0.084
	2	99.81	49.92	98.54 ±1.638	99.55 ±0.036	99.60 ±0.142	98.37 ±1.448	99.69 ±0.051
	3	99.72	31.35	95.01 ±3.162	98.46 ±0.903	98.93 ±0.291	96.14 ±1.836	99.20 ±0.057
	4	99.50	24.82	81.50 ±9.369	97.06 ±1.069	98.22 ±0.560	91.25 ±0.992	98.13 ±0.281
	5	99.29	20.16	64.34 ±4.903	93.24 ±3.742	96.52 ±0.658	83.61 ±0.927	96.84 ±0.346
γbase,t (nats)	1	63.18	62.08	64.34 ±2.054	62.53 ±1.166	90.52 ±0.263	100.0 ±1.572	99.77 ±2.768
	2	62.85	126.8	74.41 ±10.89	65.68 ±1.166	91.27 ±0.789	100.4 ±1.964	101.2 ±3.601
	3	63.36	160.4	81.89 ±10.09	69.29 ±1.541	91.92 ±0.991	100.3 ±4.562	101.1 ±4.014
	4	64.25	126.9	90.62 ±10.08	71.69 ±1.379	91.75 ±1.136	102.7 ±7.134	101.0 ±4.573
	5	64.99	123.2	101.6 ±8.347	77.16 ±1.104	92.05 ±1.212	102.4 ±6.195	100.5 ±4.942
γnew,t (nats)	1	63.18	62.08	64.34 ±2.054	62.53 ±1.166	90.52 ±0.263	100.0 ±1.572	99.77 ±2.768
	2	88.75	87.93	89.91 ±0.107	89.64 ±3.709	115.8 ±0.805	125.7 ±2.413	124.6 ±3.822
	3	82.53	87.22	87.65 ±0.530	85.37 ±1.725	107.7 ±0.600	118.3 ±3.523	116.5 ±2.219
	4	72.68	74.61	79.49 ±0.489	74.75 ±0.777	100.9 ±0.659	107.1 ±5.316	102.3 ±1.844
	5	85.88	92.00	93.55 ±0.391	89.68 ±0.618	113.4 ±0.820	118.2 ±1.572	113.3 ±0.755
γall,t (nats)	1	63.18	62.08	64.34 ±2.054	62.53 ±1.166	90.52 ±0.263	100.0 ±1.572	99.77 ±2.768
	2	75.97	107.3	82.02 ±5.488	76.62 ±1.695	102.9 ±0.408	111.9 ±2.627	112.7 ±3.300
	3	79.58	172.3	89.88 ±3.172	82.95 ±1.878	104.8 ±1.114	114.9 ±4.590	114.6 ±4.788
	4	79.72	203.1	95.83 ±2.747	85.30 ±1.524	103.9 ±0.759	114.3 ±3.963	112.1 ±2.150
	5	81.97	163.7	107.6 ±1.724	92.92 ±2.283	106.1 ±0.868	118.7 ±5.320	111.9 ±2.663

Table A6

The results for class incremental continual-learning approaches averaged over five runs, baselines and the reference isolated learning scenario for FashionMNIST at the end of every task increment. This is an extension of Table 3 in the main body. Here, in addition to the accuracy αt, γt also indicates the respective NLL at the end of every task increment t.

Fashion	t	UB	FT	SupVAE	OpenVAE	PixelVAE DGR	SupPixelVAE	OpenPixelVAE
αbase,t (%)	1	99.65	99.60	99.55 ±0.035	99.59 ±0.082	99.57 ±0.091	99.58 ±0.076	99.54 ±0.079
	2	96.70	00.00	92.02 ±1.175	92.36 ±2.092	82.40 ±6.688	90.06 ±1.782	88.60 ±1.998
	3	95.95	00.00	79.26 ±4.170	83.90 ±2.310	78.55 ±3.964	83.70 ±3.571	87.66 ±0.375
	4	91.35	00.00	50.16 ±6.658	64.70 ±2.580	54.69 ±3.853	50.23 ±7.004	68.31 ±3.308
	5	92.20	00.00	39.51 ±7.173	60.63 ±12.16	60.04 ±5.151	47.83 ±13.41	74.45 ±2.889
αnew,t (%)	1	99.65	99.60	99.55 ±0.035	99.59 ±0.082	99.57 ±0.091	99.58 ±0.076	99.54 ±0.079
	2	95.55	97.95	90.98 ±0.626	92.64 ±2.302	97.73 ±1.113	96.47 ±0.596	97.31 ±0.475
	3	93.35	99.95	90.26 ±1.435	83.40 ±3.089	99.09 ±0.367	97.33 ±0.725	96.88 ±1.156
	4	84.75	99.90	85.65 ±2.127	84.18 ±2.715	97.55 ±0.588	96.12 ±0.675	95.47 ±1.332
	5	97.50	99.80	96.92 ±0.774	96.51 ±0.707	98.85 ±0.141	97.91 ±0.596	98.63 ±0.176
αall,t (%)	1	99.65	99.60	99.55 ±0.035	99.59 ±0.082	99.57 ±0.091	99.58 ±0.076	99.54 ±0.079
	2	95.75	48.97	91.83 ±0.730	92.31 ±1.163	86.22 ±3.704	92.93 ±0.160	92.17 ±1.425
	3	93.02	33.33	83.35 ±1.597	86.93 ±0.870	76.77 ±4.378	84.07 ±1.069	87.30 ±0.322
	4	87.51	25.00	64.66 ±3.204	76.05 ±1.391	62.93 ±3.738	64.42 ±1.837	76.36 ±1.267
	5	89.24	19.97	58.82 ±2.521	69.88 ±1.712	72.41 ±2.941	63.05 ±1.826	80.85 ±0.721
γbase,t (nats)	1	209.7	209.8	208.9 ±1.213	209.7 ±3.655	267.8 ±1.246	230.8 ±3.024	232.0 ±2.159
	2	207.4	240.7	212.7 ±0.579	212.1 ±0.937	273.6 ±0.631	232.5 ±1.582	231.8 ±0.416
	3	207.6	258.7	219.5 ±1.376	216.9 ±1.208	274.0 ±0.552	235.6 ±2.784	231.6 ±0.832
	4	207.7	243.6	223.8 ±0.837	217.1 ±0.979	273.7 ±0.504	236.4 ±3.157	231.4 ±2.550
	5	208.4	306.5	232.8 ±5.048	222.8 ±1.632	274.1 ±0.349	241.1 ±1.747	234.1 ±1.498
γnew,t (nats)	1	209.7	209.8	208.9 ±1.213	209.7 ±3.655	267.8 ±1.246	230.8 ±3.024	232.0 ±2.159
	2	241.1	240.2	241.8 ±0.502	241.9 ±0.960	313.4 ±1.006	275.8 ±1.888	275.3 ±1.473
	3	213.6	211.8	215.4 ±0.501	213.0 ±0.635	269.1 ±0.616	268.3 ±3.852	262.9 ±1.893
	4	220.5	219.7	223.6 ±0.381	220.9 ±0.522	282.4 ±0.321	259.1 ±1.305	259.6 ±2.050
	5	246.2	242.0	248.8 ±0.398	244.0 ±0.646	305.8 ±0.286	283.2 ±2.150	283.5 ±2.458
γall,t (nats)	1	209.7	209.8	208.9 ±1.213	209.7 ±3.655	267.8 ±1.246	230.8 ±3.024	232.0 ±2.159
	2	224.2	240.4	226.6 ±2.31	226.9 ±0.918	293.8 ±0.349	254.3 ±1.513	255.8 ±0.436
	3	220.7	246.1	227.2 ±0.606	224.9 ±0.642	285.7 ±0.510	261.5 ±2.970	259.1 ±0.929
	4	220.4	238.7	230.4 ±0.524	226.1 ±0.560	284.9 ±0.703	263.2 ±2.259	259.5 ±3.218
	5	226.2	275.1	242.2 ±0.754	234.6 ±0.823	289.5 ±0.396	271.7 ±2.117	267.2 ±0.586

Table A7

The results for class incremental continual-learning approaches averaged over five runs, baselines and the reference isolated learning scenario for AudioMNIST at the end of every task increment. This is an extension of Table 3 in the main body. Here, in addition to the accuracy αt, γt also indicates the respective NLL at the end of every task increment t.

Audio	t	UB	LB	SupVAE	OpenVAE	PixelVAE DGR	SupPixelVAE	OpenPixelVAE
αbase,t (%)	1	99.99	100.0	99.21 ±0.568	99.95 ±0.035	100.0 ±0.000	99.71 ±0.218	99.27 ±0.410
	2	99.92	00.00	98.98 ±0.766	98.61 ±0.490	99.52 ±0.273	97.86 ±0.799	97.88 ±2.478
	3	100.0	00.00	92.44 ±1.306	95.12 ±2.248	93.15 ±3.062	81.38 ±5.433	95.82 ±3.602
	4	99.92	00.00	76.43 ±4.715	86.37 ±5.63	81.55 ±8.468	50.58 ±14.60	91.56 ±5.640
	5	98.42	00.00	59.36 ±7.147	79.73 ±4.070	64.60 ±8.739	29.94 ±18.47	75.25 ±10.18
αnew,t (%)	1	99.99	100.0	99.21 ±0.568	99.95 ±0.035	100.0 ±0.000	99.71 ±0.218	99.27 ±0.410
	2	99.75	100.0	91.82 ±4.577	89.23 ±7.384	99.71 ±0.043	99.78 ±0.128	99.81 ±0.189
	3	98.92	99.58	95.20 ±1.495	94.43 ±3.030	98.23 ±1.092	98.41 ±0.507	99.30 ±0.550
	4	97.33	98.67	53.02 ±6.132	72.22 ±8.493	95.31 ±0.868	94.30 ±0.914	97.87 ±0.293
	5	98.67	100.0	84.93 ±6.297	89.52 ±6.586	98.18 ±0.885	97.00 ±0.520	99.43 ±0.495
αall,t (%)	1	99.99	100.0	99.21 ±0.568	99.95 ±0.035	100.0 ±0.000	99.71 ±0.218	99.27 ±0.410
	2	99.83	50.00	93.84 ±2.558	93.93 ±3.756	99.50 ±0.157	98.64 ±0.875	99.67 ±0.033
	3	99.56	33.19	94.26 ±1.669	95.70 ±1.524	95.37 ±1.750	90.10 ±1.431	97.77 ±1.017
	4	98.60	24.58	77.90 ±4.210	85.59 ±3.930	86.97 ±2.797	75.55 ±3.891	95.41 ±1.345
	5	97.87	20.02	81.49 ±1.944	87.72 ±1.594	75.50 ±3.032	63.44 ±5.252	90.23 ±1.139
γbase,t (nats)	1	433.7	423.2	435.2 ±15.69	424.2 ±2.511	434.2 ±1.068	432.6 ±0.321	433.8 ±0.370
	2	422.5	439.4	423.9 ±0.517	425.2 ±1.402	434.4 ±1.082	432.5 ±0.551	433.5 ±1.464
	3	420.7	429.2	422.7 ±0.690	423.8 ±1.148	434.6 ±0.785	432.9 ±0.723	433.1 ±1.269
	4	419.9	428.5	422.8 ±0.367	423.5 ±0.937	434.2 ±1.209	433.0 ±0.781	433.0 ±1.283
	5	418.4	432.9	422.7 ±0.182	423.5 ±0.586	435.1 ±1.915	431.4 ±0.666	432.3 ±0.189
γnew,t (nats)	1	433.7	423.2	435.2 ±15.69	424.2 ±2.511	434.2 ±1.068	432.6 ±0.321	433.8 ±0.370
	2	381.2	384.1	382.5 ±1.355	385.3 ±12.56	390.4 ±0.694	389.4 ±0.208	389.4 ±1.304
	3	435.9	436.7	436.3 ±0.639	436.9 ±0.688	444.7 ±0.545	442.7 ±0.513	442.4 ±0.275
	4	485.9	487.1	486.7 ±0.385	486.5 ±0.701	497.4 ±0.740	494.4 ±0.700	494.8 ±0.386
	5	421.3	425.2	423.9 ±0.681	422.9 ±0.537	431.9 ±1.032	428.0 ±0.851	429.7 ±1.223
γall,t (nats)	1	433.7	423.2	435.2 ±15.69	424.2 ±2.511	435.2 ±15.69	432.6 ±0.321	433.8 ±0.370
	2	401.9	411.8	403.2 ±0.831	403.5 ±1.274	412.4 ±0.871	410.9 ±0.351	411.5 ±1.406
	3	412.1	418.9	413.6 ±0.410	413.8 ±0.573	423.3 ±0.618	421.0 ±1.026	421.9 ±0.661
	4	430.3	438.4	432.4 ±0.436	432.6 ±0.862	441.6 ±0.420	439.8 ±0.833	439.8 ±0.718
	5	427.2	440.4	431.4 ±0.255	430.9 ±0.541	440.3 ±1.297	436.9 ±0.751	437.7 ±0.432
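
As a reading aid for the accuracy rows above, the following minimal Python sketch shows one way the three accuracy variants could be computed from test predictions. It assumes that "base" refers to the classes of the first task, "new" to the classes of the most recent increment, and "all" to every class seen so far. The function name `incremental_accuracies` and the toy labels are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def incremental_accuracies(y_true, y_pred, base_classes, new_classes):
    """Return (base, new, all) accuracy in percent.

    Assumption: base = classes of the first task, new = classes of the
    latest increment, all = every test sample seen so far.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    correct = y_true == y_pred

    def acc(mask):
        # Share of correctly classified samples inside the masked subset.
        return 100.0 * correct[mask].mean()

    base_mask = np.isin(y_true, list(base_classes))
    new_mask = np.isin(y_true, list(new_classes))
    all_mask = np.ones(correct.shape, dtype=bool)
    return acc(base_mask), acc(new_mask), acc(all_mask)

# Hypothetical toy data after the second increment: classes 0-1 form the
# first task, classes 2-3 the new one. The predictions mimic a model that
# only ever outputs the newest classes.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [2, 3, 2, 3, 2, 2, 3, 3]
base_acc, new_acc, all_acc = incremental_accuracies(y_true, y_pred, {0, 1}, {2, 3})
print(base_acc, new_acc, all_acc)
```

With these toy labels the base accuracy is 0, the new accuracy is 100 and the overall accuracy is 50, which is the same pattern the LB column shows at t = 2: a model trained only on newly arriving data solves the latest increment but retains nothing of the earlier one.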
