Shichen Cao1, Jingjing Li2, Kenric P Nelson3, Mark A Kon2. 1. Worcester Polytechnic Institute, Worcester, MA 01609, USA. 2. Mathematics & Statistics Department, Boston University, Boston, MA 02215, USA. 3. Photrek, Watertown, MA 02472, USA.
Abstract
We present a coupled variational autoencoder (VAE) method, which improves the accuracy and robustness of the model representation of handwritten numeral images. The improvement is measured in both increasing the likelihood of the reconstructed images and in reducing divergence between the posterior and a prior latent distribution. The new method weighs outlier samples with a higher penalty by generalizing the original evidence lower bound function using a coupled entropy function based on the principles of nonlinear statistical coupling. We evaluated the performance of the coupled VAE model using the Modified National Institute of Standards and Technology (MNIST) dataset and its corrupted modification C-MNIST. Histograms of the likelihood that the reconstruction matches the original image show that the coupled VAE improves the reconstruction and this improvement is more substantial when seeded with corrupted images. All five corruptions evaluated showed improvement. For instance, with the Gaussian corruption seed the accuracy improves by 10^14 (from 10^−57.2 to 10^−42.9) and robustness improves by 10^22 (from 10^−109.2 to 10^−87.0). Furthermore, the divergence between the posterior and prior distribution of the latent distribution is reduced. Thus, in contrast to the β-VAE design, the coupled VAE algorithm improves model representation, rather than trading off the performance of the reconstruction and latent distribution divergence.
1. Introduction

An overarching challenge in machine learning is the development of methodologies that ensure the accuracy and robustness of models given limited training data. By accuracy, we refer to the metrics of information theory, such as minimizing the cross-entropy or divergence of an algorithm. In this paper, we define a measure of robustness based on a generalization of information theory. The variational autoencoder (VAE) contributes to improved learning of models by utilizing approximate variational inference [1,2]. By storing a statistical model rather than a deterministic model at the latent layer, the algorithm has increased flexibility in its use for reconstruction and other applications. The variational inference is optimized by minimization of a loss function, the so-called negative evidence lower bound, which has two components. The first component is a cross-entropy between the generated and the source data, also known as the expected negative log-likelihood, while the second is a divergence between the prior and the posterior distributions of the latent layer.

Our goal in this research is to provide an evaluation as to whether a generalization of information theory can be applied to improving the robustness of machine learning algorithms. Robustness of autoencoders to outliers is critical for generating a reliable representation of particular data types in the encoded space when using corrupted training data [3]. In this paper, a generalized entropy function is used to modify the negative evidence lower bound loss function of a variational autoencoder. With the MNIST handwritten numerals dataset, we are able to measure the improvement in the robustness of the reconstruction, using a metric also derived from the generalization of information theory. In addition, we find that the accuracy of the reconstruction, as measured by Shannon information theory, is also improved.
Furthermore, the divergence between the latent distribution posterior and prior is also reduced. This is important to ensure that the reconstruction improvement is not a result of degrading the latent layer.

Our study builds from the work of Kingma and Welling [4] on variational autoencoders and Tran et al. [5] on deep probabilistic programming. Variational autoencoders are an unsupervised learning method for training encoder and decoder neural networks. Between the encoder and decoder, the parameters of a multidimensional distribution are learned to form a compressed latent representation of the training data [6]. It is an effective method for generating complex datasets such as images and speech. Zalger [7] implemented the application of VAE for aircraft turbomachinery design, and Xu et al. [8] used VAEs to achieve unsupervised anomaly detection for seasonal key performance indicators (KPIs) in web applications. VAEs have been used to construct probabilistic models of complex physical phenomena [9]. Autoencoders can use a variety of latent variable models, but restricting the models can enhance performance. Sparse autoencoders add a penalty for the number of active hidden layer nodes used in the model. Variational autoencoders further restrict the model to a probability distribution q_φ(z|x), specified by a set of encoder parameters φ, which approximates the actual conditional probability p(z|x). Variational inference, as reviewed by Blei et al. [10], is used to learn this approximation by minimizing an objective function such as the Kullback–Leibler divergence. The decoder learns a set of parameters θ for a generative distribution p_θ(x|z), where z is the latent variable and x is the output generated data.
The complexity of the data distribution makes direct computation of the divergence between the approximate and exact latent conditional probabilities intractable; however, a variational or evidence lower bound (ELBO) is computable and consists of two components: the expected reconstruction log-likelihood of the generated data (cross-entropy) and the negative of the divergence between the latent posterior conditional probability q_φ(z|x) and a latent prior distribution p(z), which is typically a standard normal distribution but can be more sophisticated for particular model requirements.

Recently, Higgins et al. [11] proposed a β-VAE framework, which can provide a more disentangled latent representation [12] by increasing the weight of the KL-divergence term of the ELBO. Since the KL-divergence is a regularization that constrains the capacity of the latent information channel z, increasing the weight of the regularization with β > 1 puts pressure on the learnt posterior so that it is more tightly packed. The effect seems to be an encouragement of each dimension to store distinct information, with excess dimensions stored as highly packed noise. However, this improvement is a trade-off between the divergence and reconstruction components of the ELBO metric. We will show that the coupled VAE algorithm improves both components of the ELBO.

The next section provides an introduction to the design of the variational autoencoder. A comparison with other generative algorithms is included. Section 3 introduces nonlinear statistical coupling and its application to defining metrics for the robustness, accuracy, and decisiveness of decision algorithms. In this paper, use of the uppercase terms ‘Robustness’, ‘Accuracy’, and ‘Decisiveness’ refers to the specific metrics, which will be introduced in Section 3.1. Lowercase letters for these terms will be used when referring to the general properties.
Following the definition of the reconstruction assessment metrics, the generalization of the negative ELBO is defined. This coupled negative ELBO provides control over the weighting of rare versus common samples in the distribution of the training set. Additional details of the derivation of the generalized negative ELBO function and metrics are provided in Appendices A.1 and A.2, respectively. In Section 4, the improved autoencoder is evaluated using the MNIST handwritten numeral test set. Measurements of the reconstruction and the characteristics of the posterior latent variables are analyzed. Section 5 provides a visualization of the changes in the latent distribution using a 2-dimensional distribution. Section 6 demonstrates that the coupled VAE algorithm provides significantly improved stability in the model performance when the input image is corrupted from the training set. This provides evidence of the improved robustness of the algorithm. Section 7 provides a discussion, conclusion, and suggestions for future research.
2. The Variational Autoencoder
A variational autoencoder consists of an encoder, a decoder, and a loss function. Figure 1 represents the basic structure of an autoencoder. The encoder Q is a neural network that converts high-dimensional information from the input data into a low-dimensional hidden latent representation z. Some information is lost during this data compression because the dimension is reduced. The decoder P decompresses z from the latent space to reconstruct the data. While, in general, autoencoders can learn a variety of representations, VAEs specifically learn the parameters of a probability distribution. The model used here learns the means and standard deviations of a collection of multivariate Gaussian distributions and stores this information in a two-layered space. The training loss function, which is the negative evidence lower bound, is optimized by using stochastic gradient descent.
Figure 1
The variational autoencoder consists of an encoder, a probability model, and a decoder.
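The probability model between the encoder and decoder is sampled with the standard reparameterization z = μ + σ·ε, which keeps the draw differentiable during training. The following is a minimal numpy sketch of that step (our illustration, not the paper's TensorFlow implementation; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, sigma):
    """Reparameterized draw from N(mu, diag(sigma^2)).

    Sampling eps ~ N(0, I) and then shifting/scaling keeps the draw
    differentiable with respect to mu and sigma during training.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Encoder output for one image: a 20-dimensional latent Gaussian.
mu = np.zeros(20)
sigma = 0.5 * np.ones(20)
z = sample_latent(mu, sigma)
print(z.shape)  # (20,)
```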
2.1. VAE Loss Function
The encoder reads the input data x and compresses and transforms it into a fixed-shape latent representation z, while the decoder decompresses and reconstructs the information from this latent representation, outputting the parameters of a specific distribution to generate a new reconstruction. The true posterior distribution p_θ(z|x) of z given a datapoint x is unknown, but we use the Gaussian approximation q_φ(z|x), with mean vector μ and diagonal covariance matrix σ²I, instead. The goal of the algorithm is to maximize the variational or evidence lower bound (ELBO) on the marginal density of individual datapoints.

For a dataset consisting of N independent and identically distributed samples, the variational lower bound for the i-th datapoint or image x^(i) in the original VAE algorithm [4] is

L(θ, φ; x^(i)) = −D_KL( q_φ(z|x^(i)) ‖ p_θ(z) ) + E_{q_φ(z|x^(i))}[ log p_θ(x^(i)|z) ].
The first term on the right-hand side is the negative Kullback–Leibler divergence between the posterior variational approximation q_φ(z|x) and a prior distribution p_θ(z), which is selected to be a standard Gaussian distribution. The second term on the right-hand side is denoted as the expected reconstruction log-likelihood and is referred to as the cross-entropy. Let J be the dimensionality of z; then, the Kullback–Leibler divergence simplifies to

−D_KL( q_φ(z|x^(i)) ‖ p_θ(z) ) = (1/2) Σ_{j=1}^{J} ( 1 + log σ_j² − μ_j² − σ_j² ).

The expected reconstruction log-likelihood (cross-entropy) can be estimated by sampling, i.e.,

E_{q_φ(z|x^(i))}[ log p_θ(x^(i)|z) ] ≈ (1/L) Σ_{l=1}^{L} log p_θ( x^(i) | z^(i,l) ),
where L denotes the number of samples for each datapoint; we set L = 1 in our study. Supposing that each pixel of the data x given z is Bernoulli-distributed, the probability density is

p_θ(x|z) = Π_i y_i^{x_i} (1 − y_i)^{(1 − x_i)},
where y = (y_1, y_2, …) is the output of the decoder. Therefore, the loss function can be calculated as

Loss(θ, φ; x^(i)) = D_KL( q_φ(z|x^(i)) ‖ p_θ(z) ) − (1/L) Σ_{l=1}^{L} log p_θ( x^(i) | z^(i,l) ).

For our work, the loss function is modified to improve the robustness of the variational autoencoder, something that will be discussed in Section 4.
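The two terms of this loss can be sketched numerically as follows; this is a minimal numpy illustration assuming a diagonal Gaussian posterior, a standard-normal prior, a Bernoulli decoder, and L = 1 (the function and variable names are ours, not the paper's):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    # = -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
    return -0.5 * np.sum(1.0 + np.log(sigma**2) - mu**2 - sigma**2)

def bernoulli_nll(x, y, eps=1e-9):
    # Negative reconstruction log-likelihood of pixels x under the
    # decoder's per-pixel Bernoulli probabilities y.
    return -np.sum(x * np.log(y + eps) + (1 - x) * np.log(1 - y + eps))

def negative_elbo(x, y, mu, sigma):
    # Loss = KL divergence + reconstruction cross-entropy (L = 1 sample).
    return kl_to_standard_normal(mu, sigma) + bernoulli_nll(x, y)

# A perfect reconstruction of a 4-pixel "image" with a posterior equal
# to the prior gives a loss of (near) zero.
x = np.array([1.0, 0.0, 1.0, 1.0])
loss = negative_elbo(x, x, mu=np.zeros(2), sigma=np.ones(2))
```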
2.2. Comparison with Other Generative Machine Learning Methods
The paradigm of generative adversarial networks (GANs) is a recent advance in generative machine learning methods. The basic idea of GANs was published in a 2010 blog post by Niemitalo [13], and the name ‘GAN’ was introduced by Goodfellow et al. [14]. In comparison with variational autoencoders, generative adversarial networks are used for optimizing generative tasks specifically. GANs can produce models with true latent spaces, as in the cases of the bidirectional GAN (BiGAN) and adversarially learned inference (ALI) [15,16], which are designed to improve the performance of GANs. However, GANs cannot generate reasonable results when data are high-dimensional [17]. By contrast, as a probabilistic model, the specific goal of a variational autoencoder is to marginalize out noninformative variables during the training process. The ability to use complex priors in the latent space enables existing expert knowledge to be incorporated.

Bayesian networks form another generative model. Pearl [18] proposed the Bayesian network paradigm in 1985. Bayesian networks have a strong ability to capture the symbolic features of input information and combine objective probabilities with subjective estimates for both qualitative and quantitative modeling. The basic concept of Bayesian networks is built on Bayes’s theorem. Another effective way to solve for the posterior of the distribution derived from neural networks is to train and predict using variational inference techniques [19]. Compared with the original Bayesian network, the basic building blocks of deep networks provide multiple loss functions for making multitarget predictions, for transfer learning, and for varying outputs depending on the situation. Improvements to these deeper architectures, the VAE in particular, continue to appear.

Other generative models are now commonly combined with a variational autoencoder to improve performance. Ebbers et al.
[20] developed a VAE with a hidden Markov model (HMM) as the latent model for discovering acoustic units. Dilokthanakul et al. [2] studied the use of Gaussian mixture models as the prior distribution of the VAE to perform unsupervised clustering through deep generative models. They showed that a heuristic algorithm called ‘minimum information constraint’ is capable of improving the unsupervised clustering performance of this model. Srivastava and Sutton [1] presented an effective autoencoding variational Bayes-based inference method for latent Dirichlet allocation (LDA). This model solves the problems caused for autoencoding variational Bayes by the Dirichlet prior and by component collapsing. Additionally, this model matches traditional methods in accuracy with much better inference time.
3. Accounting for Risk with Coupled Entropy
Machine learning algorithms, including the VAE, have achieved efficient learning and inference for many image processing applications. Nevertheless, assuring accurate forecasts of the uncertainty is still a challenge. Problems such as outliers and overfitting impact the robustness of scientific prediction and engineering systems. This paper concentrates on assessing and improving the robustness of the VAE algorithm.

In this study, we draw upon the principles of nonlinear statistical coupling (NSC) [21,22] to define a generalization of information theory and apply the resulting entropic functions to the definition of the negative ELBO loss function for the training of the variational autoencoder [23]. NSC is derived from nonextensive statistical mechanics [24], which generalizes the variational calculus of maximum entropy to include constraints related to the nonlinear dynamics of complex systems and, in turn, to the nonexponential decay of the maximizing distributions. The NSC framework focuses this theory on the role of nonlinear coupling in generalizing entropy and its related functions. The approach defines a family of heavy-tailed (positive coupling) and compactly supported (negative coupling) distributions which maximize a generalized entropy function referred to as the coupled entropy. The variational methods underlying NSC can be applied to a variety of problems in mathematical physics [25,26]. Here, we examine how NSC can broaden the role of approximate variational inference in machine learning to include sensitivity to the risks of outlier events occurring in the tail of the distribution of the phenomena being learned.
3.1. Assessing Probabilistic Forecasts with the Generalized Mean
First, proper metrics are needed to evaluate the accuracy and robustness of machine learning algorithms such as the VAE. The arithmetic mean and the standard deviation are widely used to measure the central tendency and fluctuation, respectively, of a random variable. Nevertheless, these are inappropriate for probabilities, which are formed by ratios. A random variable formed by the ratio of two independent random variables has a central tendency determined by the geometric mean, as described by McAlister [27]. Information theory addresses this issue by taking the logarithm of the probabilities and then the arithmetic mean; however, we will show that the generalizations of information theory are easier to report and visualize in the probability domain.

In [28], a risk profile was introduced, which is the spectrum of the generalized means of probabilities and provides an assessment of the central tendency and fluctuations of probabilistic inferences. The generalized mean is a translation of generalized information-theoretic metrics back to the probability domain, and is derived in the next section. Its use as a metric for evaluating and training inference algorithms is related to the Wasserstein distance [29], which incorporates the generalized mean. The accuracy of the likelihoods is measured with robust, neutral, and decisive risk bias using the r = −2/3, r = 0 (geometric), and r = 1 (arithmetic) means, respectively. With no risk bias (r = 0), the geometric mean is equivalent to transforming the cross-entropy between the forecast and the distribution of the test samples to the probability domain. The arithmetic mean (r = 1) is a simple measure of the Decisiveness (i.e., were the class probabilities in the right order so that a correct decision can be made?). This measure de-weights probabilities near zero, since increasing r reduces the influence of small probabilities on the average. To complement the arithmetic mean, we choose a negative conjugate value.
The conjugate is not the harmonic mean (r = −1), because this turns out to be too severe a test. Instead, r = −2/3 is chosen based on a dual transformation between the heavy-tail (positive κ) and compact-support (negative κ) domains of the coupled Gaussian distribution. The risk sensitivity r can be decomposed into the nonlinear coupling κ and the power α and dimension d of the variable: r = −ακ/(1 + dκ). The dual transformation between the positive/negative domains of the coupled Gaussians has the following relationship: κ̂ = −κ/(1 + κ). Taking α = 2 and d = 1, the coupling for a risk bias of one is κ = −1/3, and the conjugate values are κ̂ = 1/2 and r̂ = −2/3 [23]. The Robustness metric increases the weight of probabilities near zero, since negative powers invert the probabilities prior to the average.

For simplicity, we refer to these three metrics as the Robustness, Accuracy, and Decisiveness. The label ‘Accuracy’ is used for the neutral accuracy, since ‘neutralness’ is not appropriate and ‘neutral’ does not express that this metric is the central tendency of the accuracy. Summarizing:

Decisiveness ≡ the arithmetic mean (r = 1);
Accuracy ≡ the geometric mean (r → 0);
Robustness ≡ the −2/3 generalized mean (r = −2/3).

Similar to the standard deviation, the arithmetic mean and the −2/3 mean play roles as measures of the fluctuation. Figure 2 shows an example of input images from the MNIST dataset and the generated output images produced by the VAE. Despite the blur in some output images, the VAE succeeds in generating images very similar to the input. However, the histogram in Figure 3, which plots the frequency of the likelihoods over a log scale, shows that the probabilities of the ground truth range over a large scale. The geometric mean or Accuracy captures the central tendency of the distribution, while the Robustness and the Decisiveness capture the span of the fluctuation. The minimal value of the −2/3 mean metric is an indicator of the poor robustness of the VAE model, which can be improved.
We measure and display the performance in the probability space in order to simplify the comparison between the three metrics. In the next subsection, we will show their relationship with a generalization of the log-likelihood. If, however, we were to plot histograms in the log-space, separate histograms would be required for each metric. By using the probability space, we can display one histogram overlaid with three different means. Appendix A.2 describes the origin of the Robustness–Accuracy–Decisiveness metrics.
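The three metrics can be computed directly from a set of reconstruction likelihoods as generalized means; the following is a minimal numpy sketch (the function and variable names are ours), with the r = 0 case taken as the geometric-mean limit:

```python
import numpy as np

def generalized_mean(p, r):
    """Generalized mean ((1/N) * sum_i p_i^r)^(1/r) of likelihoods p.

    r = 1 is the arithmetic mean (Decisiveness), r -> 0 the geometric
    mean (Accuracy), and r = -2/3 the robust mean (Robustness).
    """
    p = np.asarray(p, dtype=float)
    if r == 0:  # geometric-mean limit
        return np.exp(np.mean(np.log(p)))
    return np.mean(p**r) ** (1.0 / r)

likelihoods = np.array([1e-2, 1e-5, 1e-20])
decisiveness = generalized_mean(likelihoods, 1.0)
accuracy = generalized_mean(likelihoods, 0.0)
robustness = generalized_mean(likelihoods, -2/3)
# A single outlier likelihood (1e-20) barely moves the arithmetic mean
# but collapses the -2/3 mean, which is why it probes robustness.
```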
Figure 2
Example set of (a) MNIST input images and (b) VAE-generated output images.
Figure 3
A histogram of the likelihoods that the VAE-reconstructed images match the input images. The objective of the coupled VAE research is to demonstrate that the Robustness, which is the −2/3 generalized mean, can be increased by penalizing the cost of producing outlier reconstructions. The Accuracy is the exponential of the average log-likelihood and the Decisiveness is the arithmetic mean.
In order to improve performance against the robust metric, the training of the variational autoencoder needs to incorporate this generalized metric. To do so, we derive a coupled loss function in the next subsection.
3.2. Definition of Negative Coupled ELBO
As we discussed in Section 2, the goal of a VAE algorithm is to optimize a low-dimensional model of a high-dimensional input dataset. This is accomplished using approximate variational inference by maximizing an evidence lower bound (ELBO). Equivalently, the negative ELBO defines a loss function which can be minimized. In this paper, we provide initial evidence that the accuracy and robustness of the variational inference can be improved by generalizing the negative ELBO to account for the risk of outlier events. Here, we provide a definition of the generalization; a derivation is provided in Appendix A.1.

The generalized loss function in the coupled variational autoencoder (VAE) method is defined as follows (Negative Coupled ELBO):

−L_κ(θ, φ; x^(i)) = D_KL^κ( q_φ(z|x^(i)) ‖ p_θ(z) ) − E_{q_φ(z|x^(i))}[ log_κ p_θ(x^(i)|z) ],

where D_KL^κ is the generalized (coupled) KL-divergence replacing the KL-divergence in the original loss function, and the second term is the generalized reconstruction loss (coupled cross-entropy) replacing the expected reconstruction log-likelihood in the original loss function.

In the next section, we show preliminary experimental evidence that the negative coupled ELBO can be used to improve the robustness and accuracy of the variational inference. We show that increasing the coupling parameter κ of the loss function has the effect of increasing the Accuracy (8) and Robustness (9) metrics of the generated data. Additionally, we show that the improvement in the generation process is not at the expense of the divergence between the posterior and the prior latent distributions. Thus, the overall ELBO is improved, indicating an improvement in the approximate variational inference. Furthermore, in Section 6, we show that improvements are more substantial when the algorithm is seeded by images from the corrupted MNIST database. While the experimental results of this report focus on a two-layer dense neural network and the (corrupted) MNIST datasets, the generalization of information-theoretic cost functions for machine learning training is applicable to a broader range of architectures and datasets.
For instance, the CIFAR-10 reconstruction is typically processed with a deep neural network [30] and is planned for future research.
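The flavor of the generalization can be sketched with the basic coupled logarithm of NSC, ln_κ(x) = (x^κ − 1)/κ, which recovers ln(x) as κ → 0. Applied as a generalized surprisal ln_κ(1/p), it penalizes rare (small-likelihood) samples more heavily when κ > 0. Note that the paper's exact loss (Appendix A.1) additionally involves the power α and dimension d, so this is only an illustrative sketch under the stated simple form, not the authors' implementation:

```python
import numpy as np

def coupled_log(x, kappa):
    # Basic coupled (generalized) logarithm; reduces to ln(x) as kappa -> 0.
    if kappa == 0:
        return np.log(x)
    return (x**kappa - 1.0) / kappa

def coupled_surprisal(p, kappa):
    # Generalized surprisal ln_kappa(1/p): for kappa > 0 it grows faster
    # than -ln(p) as p -> 0, so rare (outlier) reconstructions incur a
    # larger training penalty.
    return coupled_log(1.0 / p, kappa)

p = 1e-6  # likelihood of a poorly reconstructed (outlier) image
plain = coupled_surprisal(p, 0.0)    # = -ln(p), about 13.8
heavier = coupled_surprisal(p, 0.1)  # about 29.8, a larger penalty
```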
4. Results Using the MNIST Handwritten Numerals
The MNIST handwritten digit database is a large database of handwritten digits, consisting of a training set of 60,000 images and a test set of 10,000 images, widely used for evaluating machine learning and pattern recognition methods. The digits have been size-normalized and centered in fixed-size images. Each image in the database contains 28 by 28 grayscale pixels. Pixel values vary from 0 to 255: zero means the pixel is white, or background, while 255 means the pixel is black, or foreground [31]. In this and the next section, we examine the performance of the coupled VAE algorithm in reconstructing images of the MNIST database. In Section 6, we show the stability of the coupled VAE when reconstruction is distorted by samples from the corrupted MNIST database.

For this research, we used the MNIST database as the input, since it was used in the traditional VAE. Specifically, the input is a batch of 28 by 28 pixel images of handwritten numerals. The encoder encodes the data, which are 784-dimensional for each image in a batch, into the latent space. For our experiment, the dimension of the latent variable z can be from 2 to 20. Taking the latent variable z as input, the probability distribution of each pixel is computed using a Bernoulli or Gaussian distribution by the decoder. The decoder outputs the corresponding 784 parameters to reconstruct an image. We used specific numbers of images from the training set as the batch size and a fixed number of epochs. Additionally, for the learned MNIST manifold, visualizations of learned data and reproduced results were plotted. The algorithm and experiments were developed with Python and the TensorFlow library. Our Python code can be found in the Data Availability Statement.

The input images and output images for different values of the coupling are shown in Figure 4; κ = 0 represents the original VAE model.
Compared with the original algorithm, the output images generated by the modified coupled VAE model show small improvements in detail and clarity. For instance, the fifth digit in the first row of the input images is ‘4’, but the output image from the original VAE looks more like ‘9’ than ‘4’, while the coupled VAE method generates ‘4’ correctly. For the seventh digit ‘4’ in the first row, the generated image from the coupled VAE has improved clarity compared to the traditional VAE.
Figure 4
(a) The MNIST input images and (b) the output images generated by the original VAE. (c–e) The output images generated by the modified coupled VAE model show small improvements in detail and clarity. For instance, the fifth digit in the first row of the input images is ‘4’, but the output image from the original VAE looks more like ‘9’ than ‘4’, while the coupled VAE method produced ‘4’ correctly.
Figure 5 shows the likelihood histograms for 5000 input images with coupling values of κ = 0, 0.025, 0.05, and 0.1. The red, blue, and green lines represent the arithmetic mean (Decisiveness), geometric mean (Accuracy), and −2/3 mean (Robustness), respectively. When κ = 0, the minimal value of the Robustness metric indicates that the original VAE suffers from poor robustness. As κ becomes larger, the geometric mean and the −2/3 mean metrics start to increase, while the arithmetic mean metric mostly stays the same. Since the probability of producing a correct image by uniform random sampling is vanishingly small, the accuracy achieved by the VAE algorithm is significantly improved, even though the absolute value of the Accuracy metric seems small. As the coupling increases, the coupled loss function approaches infinity faster. This eventually causes computational errors. For instance, at one larger coupling value, the loss function had a computational error at the 53rd epoch; at a still larger value, at the 8th epoch. Further investigations of the computational bounds of the algorithm are planned. The specific relationship between the coupling and the probabilities for the input images is shown in Table 1. The increased Robustness metric shows that the modified loss does improve the robustness of the reconstructed image. In the next section, we also examine the performance of the divergence between the posterior and prior distributions of the latent layer.
Figure 5
The histograms of likelihood for the reconstruction of the input images with various coupling values. The red, blue, and green lines represent the arithmetic mean (Decisiveness), geometric mean (Accuracy), and −2/3 mean (Robustness), respectively. The minimal value of the Robustness metric indicates that the original VAE suffers from poor robustness. As κ increases, the Robustness and Accuracy improve while the Decisiveness is mostly unchanged.
Table 1
The Decisiveness, Accuracy, and Robustness of the reconstruction likelihood as a function of the coupling κ.
Coupling κ | Decisiveness | Accuracy    | Robustness
0          | 1.31×10^−15  | 2.41×10^−39 | 1.40×10^−79
0.025      | 6.61×10^−15  | 5.89×10^−35 | 9.91×10^−81
0.05       | 7.18×10^−12  | 5.80×10^−32 | 1.31×10^−73
0.1        | 1.34×10^−12  | 7.09×10^−29 | 2.57×10^−71
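As a quick arithmetic check on Table 1, the gains from κ = 0 to κ = 0.1 span roughly ten and eight orders of magnitude for the Accuracy and Robustness, respectively:

```python
import math

# Table 1 values for kappa = 0 (original VAE) and kappa = 0.1.
accuracy_0, accuracy_01 = 2.41e-39, 7.09e-29
robustness_0, robustness_01 = 1.40e-79, 2.57e-71

# Orders-of-magnitude gain in each metric.
accuracy_gain = math.log10(accuracy_01 / accuracy_0)        # ~10.5
robustness_gain = math.log10(robustness_01 / robustness_0)  # ~8.3
```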
Furthermore, compared with the original VAE model, the geometric mean, which measures the accuracy of the input image likelihood, is larger for the coupled algorithm. The improvement of this metric means that the input images (truth) are assigned higher likelihoods on average by the coupled VAE model.

The standard deviations of the latent variables are shown in the rose plots of Figure 6. The angular position of a bar represents the value of σ, clockwise from 0 to 1. The radius of the bar measures the frequency of the different values, from 0 to 100. As the coupling increases, the range and the average value of these standard deviations decrease. To be specific, when κ = 0, σ over all dimensions in all 5000 batches ranges from 0.09 to 0.72; when κ = 0.025, σ ranges from 0.02 to 0.3; when κ = 0.05, σ ranges from 0.001 to 0.09; and when κ = 0.1, σ ranges from 0.00007 to 0.06.
Figure 6
The rose plots of the various standard deviation values in 20 dimensions. The range and average values of these standard deviations are reduced as coupling increases.
We note that as the coupling parameter κ increases, the variability of the latent space diminishes. One possible method to address this problem is to use a heavy-tailed distribution in the latent layer. Chen et al. [32] and Nelson [23] used the Student’s t distribution [33] for the latent layer to incorporate heavy-tail decay.

We choose samples in which the likelihoods of the input images are close to the three metrics and plot the standard deviation of each dimension of the latent variable of these samples in Figure 7. The red, blue, and green lines represent samples near the Decisiveness, Accuracy, and Robustness, respectively. The figure shows that when κ = 0, the standard deviations σ range from 0.1 to 0.7. However, as κ increases, the values of σ fluctuate less and decrease toward 0. Magnified plots are shown to visualize the results further. The general trend is for σ to be larger for samples near the Decisiveness, intermediate for samples near the Accuracy, and smaller for samples near the Robustness. An exception is one coupling value, for which σ overlaps for samples near the Robustness and Accuracy. The histogram likelihood plots with a two-dimensional latent variable are shown in Figure 8. The increased values of the geometric mean metric and the −2/3 mean metric show that the accuracy and robustness of the output MNIST images in the VAE model have been improved, consistent with the results for the 20-D model. While the performance improvements are modest, we will show in Section 6 that the performance improvements are much more substantial when the algorithm is seeded with corrupted images. First, we provide a visualization of the changes in the latent distribution using two dimensions.
Figure 7
The standard deviation of latent variable samples near the three generalized mean metrics. The red, blue, and green lines represent samples near the Decisiveness, Accuracy, and Robustness, respectively. As κ increases, the values of σ fluctuate less and decrease toward 0. Magnified plots are shown to visualize the results further.
Figure 8
The histogram likelihood plots with a two-dimensional latent variable. As in the 20-D model, the increased values of the geometric mean metric and the −2/3 mean metric show that the accuracy and robustness of the VAE model are improved.
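As a concrete illustration of the three metrics, the generalized means can be computed directly from a set of reconstruction likelihoods. This is a minimal sketch assuming Decisiveness, Accuracy, and Robustness correspond to the arithmetic, geometric, and −2/3 generalized means of the marginal likelihoods; the likelihood values below are hypothetical:

```python
import numpy as np

def generalized_mean(p, r):
    """Generalized (power) mean of positive values p with exponent r."""
    p = np.asarray(p, dtype=float)
    if r == 0.0:
        return float(np.exp(np.mean(np.log(p))))  # geometric mean (limit r -> 0)
    return float(np.mean(p ** r) ** (1.0 / r))

# Hypothetical per-image reconstruction likelihoods
likelihoods = [1e-2, 3e-3, 5e-4, 2e-5]

decisiveness = generalized_mean(likelihoods, 1.0)       # arithmetic mean
accuracy     = generalized_mean(likelihoods, 0.0)       # geometric mean
robustness   = generalized_mean(likelihoods, -2.0/3.0)  # emphasizes the worst cases
```

By the power-mean inequality, Decisiveness ≥ Accuracy ≥ Robustness for any sample, so a heavy lower tail of outlier likelihoods depresses the Robustness metric the most.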
5. Visualization of Latent Distribution
In order to understand the relationship between increasing the coupling of the loss function and the means and standard deviations of the Gaussian model, we examine a two-dimensional model, which can be visualized. Compared with the high-dimensional model, the probability likelihoods for the two-dimensional model are lower, indicating that the higher dimensions do improve the model. Nevertheless, as in the 20-dimensional model, the distribution of likelihoods is compressed toward higher values as the coupling increases and can therefore be used to analyze the results further. Larger likelihoods of the input images, means closer to the origin, and smaller standard deviations of the latent variables are the primary characteristics as the coupling parameter of the loss function is increased. As a result, both the robustness and accuracy of the likelihoods increase. Specifically, when κ increases from 0 to 0.075, both the geometric mean metric and the −2/3 mean metric increase, while the arithmetic metric does not change very much. In this case, the reconstructed images have a higher probability of replicating the input image using the coupled VAE method.

The rose plots in Figure 9 show that the range and variability of the mean values of the latent variables decrease as the coupling increases. From the view of the means, the posterior distribution of the latent space is closer to the prior, the standard Gaussian distribution. From the view of the standard deviations, the posterior distribution of the latent space is further from the prior.
Figure 9
Rose plots of the mean values (top four panels) and standard deviation values (bottom four panels) in 2 dimensions. The range of the means is reduced and the mean values move closer to 0 as coupling increases.
The latent space plots shown in Figure 10 are visualizations of images of the numerals from 0 to 9. Images are embedded in a 2D map whose axes are the values of the 2D latent variable. Images of the same numeral share a color and cluster together since they have higher similarity to each other; the distances between points represent the similarities of the images. The latent space plots show that the clusters draw together more tightly as the coupling becomes larger. The plots shown in Figure 11 are visualizations of the learned data manifold generated by the decoder network of the coupled VAE model. A grid of values from a two-dimensional Gaussian distribution is sampled. The distinct digits each occupy different regions of the latent space and smoothly transform from one digit to another. This smooth transformation can be quite useful when interpolation between two observations is needed. Additionally, the distribution of distinct digits in the plot becomes more even, and the sharpness of the digits increases as κ increases.
Figure 10
The plot of the latent space of the VAE trained for 200 epochs on MNIST with various κ values. Different numerals cluster together more tightly as coupling increases.
Figure 11
Visualization of the learned data manifold for generative models, with the axes as the values of each dimension of the latent variable. The distinct digits each occupy different regions of the latent space and smoothly transform from one digit to another.
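The grid sampling described above can be sketched as follows. Here `decoder` is a placeholder for a trained decoder network, and the inverse-CDF spacing is a common convention for visualizing a VAE's learned manifold rather than the paper's confirmed procedure:

```python
from statistics import NormalDist
import numpy as np

# Evenly spaced quantiles of the prior, mapped through the inverse Gaussian CDF,
# give a grid of 2-D latent codes that covers the latent space in proportion
# to the prior density.
nd = NormalDist()  # standard normal prior
n = 15
ticks = [nd.inv_cdf(p) for p in np.linspace(0.05, 0.95, n)]
latent_grid = np.array([[x, y] for y in ticks for x in ticks])  # shape (n*n, 2)

# images = decoder(latent_grid)  # hypothetical decoder: (n*n, 2) -> (n*n, 28, 28)
```

Decoding each row of `latent_grid` and tiling the outputs produces a manifold panel like those in Figure 11.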
As shown in Table 2, as the coupling increases from 0 to 0.075, the negative ELBO (the loss) decreases from 172.3 to 146.7, the coupled KL-divergence decreases from 5.8 to 5.6, and the coupled reconstruction loss decreases from 166.5 to 141.1. The reconstruction loss plays the dominant role (over 96% of the total), while the divergence term has a much smaller effect (under 4%). The overall improvement in the coupled loss is thus based on both a smaller coupled KL-divergence and a smaller coupled reconstruction error, rather than a trade-off between them. There is, however, a high degree of variability in this improvement, so there are reasons to be cautious about its magnitude. In addition, since the coupled loss function adjusts the metric, the property being measured also changes. Part of our future research plan is to explore how the relative performance between the reconstruction and the latent space can be compared.
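To make the role of the coupling concrete, a sketch of a coupled (deformed) logarithm is given below. The exact functional form used in the coupled ELBO follows the nonlinear-statistical-coupling literature; the one-parameter form here, ln_κ(x) = (x^(κ/(1+κ)) − 1)/κ, is an assumption for illustration. Applied to 1/p, it turns the logarithmic surprisal into a power-law penalty, so low-likelihood outliers cost more when κ > 0:

```python
import math

def coupled_log(x, kappa):
    """Deformed logarithm (assumed 1-D form); reduces to ln(x) as kappa -> 0."""
    if kappa == 0.0:
        return math.log(x)
    return (x ** (kappa / (1.0 + kappa)) - 1.0) / kappa

def coupled_surprisal(p, kappa):
    """Generalized negative log-likelihood of a probability p."""
    return coupled_log(1.0 / p, kappa)

# For kappa > 0, the penalty on a rare event grows as a power law in 1/p,
# exceeding the ordinary logarithmic surprisal -ln(p) for small p.
```

For example, `coupled_surprisal(1e-6, 0.1)` exceeds `-math.log(1e-6)`, which is the sense in which positive coupling weighs outlier samples with a higher penalty.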
Table 2
Components of coupled ELBO with a 2-dimensional latent layer under different values of coupling. The improvement in the coupled KL-divergence is very slight, while it is larger for the coupled reconstruction loss.
Coupling κ | Coupled KL-Divergence | Coupled RE Loss | Coupled ELBO | KL Proportion | RE Proportion
0     | 5.8 ± 1.7 | 166.5 ± 52.2 | 172.3 | 3.38% | 96.62%
0.025 | 5.7 ± 1.6 | 156.4 ± 49.8 | 162.1 | 3.53% | 96.47%
0.05  | 5.6 ± 1.6 | 149.2 ± 46.6 | 154.8 | 3.61% | 96.39%
0.075 | 5.6 ± 1.7 | 141.1 ± 44.6 | 146.7 | 3.82% | 96.18%
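The proportion columns of Table 2 follow directly from the two loss components. A quick check (the values are copied from the table and are rounded, so the recomputed percentages match only approximately):

```python
# (kappa, coupled KL-divergence, coupled reconstruction loss) from Table 2
rows = [(0.0, 5.8, 166.5), (0.025, 5.7, 156.4),
        (0.05, 5.6, 149.2), (0.075, 5.6, 141.1)]

for kappa, kl, re in rows:
    elbo = kl + re  # the negative coupled ELBO is the sum of the two components
    print(f"kappa={kappa}: ELBO={elbo:.1f}, "
          f"KL={100 * kl / elbo:.2f}%, RE={100 * re / elbo:.2f}%")
```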
6. Performance with Corrupted Images
We also evaluate the performance of the coupled VAE algorithm when seeded with images from the corrupted MNIST (C-MNIST) dataset [34]. The reconstructed images under five different corruptions (Gaussian, glass blur, impulse noise, shot noise, and shear), with the two coupling values κ = 0 and κ = 0.1, are shown in Figure 12. Based on the visualization of the generated images, the qualitative visual improvement in clarity using the coupling is modest.
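As an illustration of the kind of corruption applied, a Gaussian-noise corruption can be sketched as below; the noise level `sigma` is a hypothetical choice, not the C-MNIST setting:

```python
import numpy as np

def gaussian_corrupt(img, sigma=0.3, seed=0):
    """Add zero-mean Gaussian noise to an image in [0, 1] and clip back to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

# Example on a hypothetical 28x28 MNIST-style image
img = np.zeros((28, 28))
corrupted = gaussian_corrupt(img)
```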
Figure 12
The images with five different corruptions are shown in the first row. The reconstructed images for κ = 0 and κ = 0.1 are shown in the second and third rows, respectively. The qualitative visual improvement in clarity using the coupling is modest.
We also conduct further analyses of the performance of the coupled VAE with each corruption. For the MNIST images with Gaussian corruption, as shown in Figure 13, all three metrics—robustness, central tendency, and decisiveness—increase as the coupling parameter increases. The robustness improves the most, the central tendency is next, and the decisiveness improves the least. Furthermore, we confirm that the reconstruction improvement is not a trade-off with the latent distribution divergence, as shown in Table 3. This is in contrast to the β-VAE [11] method, which merely alters the weight between the reconstruction and divergence components of the negative ELBO cost function.
Figure 13
The histograms of marginal likelihood for the MNIST images with Gaussian corruption are shown. All three metrics increase as the coupling parameter increases. The robustness improves the most, the central tendency is next, and the decisiveness improves the least. From κ = 0 to κ = 0.1, the Robustness improves from 10^−109.2 to 10^−87.0 and the Accuracy improves from 10^−57.2 to 10^−42.9; the Decisiveness shows the smallest improvement.
Table 3
The components of the coupled ELBO under Gaussian corruption are provided in the table. The coupled KL-divergence initially increases when moving away from the standard VAE design (κ = 0 to κ = 0.025); however, it then steadily decreases with increasing κ. The coupled reconstruction loss (column three) shows steady improvement. The overall negative coupled ELBO shows consistent improvement as the coupling increases. The relative importance of the divergence and reconstruction varies with the coupling, but in each case the divergence carries roughly 15–20% of the total and the reconstruction the remainder.
Coupling κ | Coupled KL-Divergence | Coupled RE Loss | Coupled ELBO | KL Proportion | RE Proportion
0     | 23.9 ± 3.8 | 131.6 ± 40.7 | 155.5 | 15.34% | 84.66%
0.025 | 29.6 ± 2.3 | 119.9 ± 38.5 | 149.5 | 19.80% | 80.20%
0.05  | 26.0 ± 0.9 | 111.1 ± 36.5 | 137.1 | 18.94% | 81.06%
0.075 | 21.4 ± 0.5 | 104.4 ± 34.3 | 125.8 | 16.98% | 83.02%
0.1   | 18.4 ± 0.6 | 98.9 ± 32.7  | 117.3 | 15.71% | 84.28%
In Table 3, analyses of the components of the coupled ELBO are provided. Comparisons as the coupling changes are complicated by the fact that the metric itself is changing; therefore, performance becomes more difficult to measure as the coupling increases. Nevertheless, even with this caveat, there is still an overall tendency toward improved performance. The second column shows that the coupled KL-divergence initially increases when moving away from the standard VAE design (κ = 0 to κ = 0.025); however, it then steadily decreases with increasing κ. This may be due to the distinct difference between the logarithm (κ = 0) and even a slight deviation from it (κ > 0). The coupled reconstruction loss (column three) shows steady improvement. The overall negative coupled ELBO shows consistent improvement as the coupling increases. The relative importance of the divergence and reconstruction varies with the coupling, but in each case the divergence carries roughly 15–20% of the total and the reconstruction the remainder.

The improvement of the three metrics under glass blur, impulse noise, shot noise, and shear corruption is also observed, as shown in Figure 14, Figure 15, Figure 16 and Figure 17, respectively. As with the Gaussian corruption, all three metrics gradually increase as the coupling parameter increases from 0 to 0.1. The respective analyses of the components of the coupled ELBO under glass blur, impulse noise, shot noise, and shear corruption are provided in Table 4, Table 5, Table 6 and Table 7. The four corruptions show consistent results: the coupled KL-divergence initially increases when moving away from the standard VAE design (κ = 0 to κ = 0.025), but then steadily decreases with increasing κ. The overall negative coupled ELBO shows consistent improvement as κ increases.
This means that if the coupling parameter is sufficiently large (κ ≥ 0.075 in these experiments), both the KL-divergence and the reconstruction loss improve, so the overall improvement of the algorithm is not a trade-off between the reconstruction accuracy and the latent distribution divergence.
Figure 14
The histograms of marginal likelihood for the MNIST images with glass blur corruption are shown. All three metrics increase as the coupling parameter increases from 0 to 0.1.
Figure 15
The histograms of marginal likelihood for the MNIST images with impulse noise corruption are shown. All three metrics increase as the coupling increases from 0 to 0.1.
Figure 16
The histograms of marginal likelihood for the MNIST images with shot noise corruption are shown. All three metrics increase as the coupling parameter increases from 0 to 0.1.
Figure 17
The histograms of marginal likelihood for the MNIST images with shear corruption are shown. All three metrics increase as the coupling parameter increases from 0 to 0.1.
Table 4
The components of the coupled ELBO under glass blur corruption are provided in the table. The coupled KL-divergence initially increases when moving away from the standard VAE design (κ = 0 to κ = 0.025), but then steadily decreases with increasing κ. The coupled reconstruction loss shows steady improvement. The overall negative coupled ELBO shows consistent improvement as κ increases.
Coupling κ | Coupled KL-Divergence | Coupled RE Loss | Coupled ELBO | KL Proportion | RE Proportion
0     | 22.3 ± 3.5 | 196.1 ± 55.3 | 218.4 | 10.19% | 89.81%
0.025 | 29.4 ± 2.0 | 178.8 ± 50.1 | 208.2 | 14.12% | 85.88%
0.05  | 25.5 ± 0.7 | 164.1 ± 45.7 | 189.6 | 13.44% | 86.56%
0.075 | 20.9 ± 0.4 | 154.0 ± 43.0 | 174.9 | 11.96% | 88.04%
0.1   | 18.0 ± 0.4 | 145.1 ± 40.0 | 163.1 | 11.05% | 88.95%
Table 5
The components of the coupled ELBO under impulse noise corruption are provided in the table. The coupled KL-divergence initially increases when moving away from the standard VAE design (κ = 0 to κ = 0.025), but then steadily decreases with increasing κ. The overall negative coupled ELBO shows consistent improvement as κ increases.
Coupling κ | Coupled KL-Divergence | Coupled RE Loss | Coupled ELBO | KL Proportion | RE Proportion
0     | 24.2 ± 3.8 | 170.7 ± 34.7 | 195.0 | 12.43% | 87.57%
0.025 | 29.9 ± 2.2 | 148.0 ± 31.0 | 177.9 | 16.81% | 83.19%
0.05  | 26.0 ± 0.8 | 131.6 ± 28.5 | 157.7 | 16.52% | 83.48%
0.075 | 21.4 ± 0.6 | 120.9 ± 26.7 | 142.3 | 15.05% | 84.95%
0.1   | 18.5 ± 0.6 | 111.8 ± 25.2 | 130.3 | 14.21% | 85.79%
Table 6
The components of the coupled ELBO under shot noise corruption are provided in the table. The coupled KL-divergence increases when moving away from the standard VAE design (κ = 0 to κ = 0.025), but then steadily decreases with increasing κ. The coupled reconstruction loss shows steady improvement. The overall negative coupled ELBO shows consistent improvement as κ increases.
Coupling κ | Coupled KL-Divergence | Coupled RE Loss | Coupled ELBO | KL Proportion | RE Proportion
0     | 23.9 ± 3.8 | 98.9 ± 28.3 | 122.8 | 19.45% | 80.55%
0.025 | 29.9 ± 2.4 | 88.9 ± 26.2 | 118.8 | 25.14% | 74.86%
0.05  | 26.1 ± 1.0 | 81.8 ± 25.0 | 108.0 | 24.21% | 75.80%
0.075 | 21.6 ± 0.7 | 77.6 ± 23.9 | 99.2  | 21.80% | 78.20%
0.1   | 18.6 ± 0.6 | 73.4 ± 22.8 | 92.0  | 20.17% | 79.83%
Table 7
The components of the coupled ELBO under shear corruption are provided. The coupled KL-divergence increases when moving away from the standard VAE design (κ = 0 to κ = 0.025), but then steadily decreases with increasing κ. The coupled reconstruction loss shows steady improvement. The overall negative coupled ELBO shows consistent improvement as κ increases.
Coupling κ | Coupled KL-Divergence | Coupled RE Loss | Coupled ELBO | KL Proportion | RE Proportion
0     | 24.8 ± 4.0 | 114.1 ± 31.7 | 138.9 | 17.85% | 82.15%
0.025 | 30.4 ± 2.4 | 102.3 ± 29.0 | 132.7 | 22.92% | 77.08%
0.05  | 26.1 ± 0.9 | 94.7 ± 27.5  | 120.8 | 21.61% | 78.39%
0.075 | 21.8 ± 0.7 | 89.5 ± 26.3  | 111.3 | 19.61% | 80.39%
0.1   | 18.6 ± 0.6 | 84.9 ± 24.9  | 103.5 | 17.97% | 82.03%
7. Discussion and Conclusions
This investigation sought to determine whether the accuracy and robustness of variational autoencoders can be improved using statistical methods developed within complex systems theory. Our investigation provides evidence that the tail shape of the negative evidence lower bound can be controlled in such a way that the cost of outlier events is adjustable. We refer to this method as a coupled VAE, since the control parameter models the nonlinear deviation from the exponential and logarithmic functions of linear analysis. A positive coupling parameter increases the cost of these tail events and thereby trains the algorithm to be robust against such outliers. Additionally, this both improves the accuracy of the reconstructed images and reduces the divergence of the posterior latent distribution from the prior. We have documented this improvement using the histogram of the reconstructed marginal likelihoods. Metrics of the histogram are formed from the arithmetic mean, geometric mean, and −2/3 mean, which represent Decisiveness, Accuracy, and Robustness, respectively. Both the accuracy and the robustness are improved by increasing the coupling of the loss function. There is a limit to such increases in the coupling, beyond which the training process no longer converges.

These performance improvements have been evaluated for the MNIST handwritten numeral dataset and its corrupted modification C-MNIST. We used a two-layer dense neural network for the encoder/decoder. The latent layer is a 20-dimensional Gaussian distribution; for visualization, a 2-dimensional distribution was also examined. Without the corruption, we observed improvements in both components of the negative coupled ELBO loss function, namely the image reconstruction loss (marginal likelihood) and the latent distribution (divergence between the prior and posterior).
Thus, the coupled VAE is able to improve the model representation, rather than just trading off reconstruction and divergence performance, as does the highly cited β-VAE design. The likelihood of the reconstructed image matching the original improves in both Accuracy and Robustness when the coupling parameter is increased from κ = 0 (the standard VAE) to the largest value of the coupled VAE reported. The Decisiveness did not change significantly, though there is potential that negative values of the coupling could influence this metric. The performance improvements when the algorithm is seeded with the C-MNIST dataset are far more significant, demonstrating the improved stability of the algorithm. All five corruptions examined (Gaussian, glass blur, impulse noise, shot noise, and shear) show significant improvement in Robustness and Accuracy and some improvement in Decisiveness. For example, under the Gaussian corruption, when the coupling parameter is increased from κ = 0 (the standard VAE) to κ = 0.1, the reconstruction likelihood improves by 10^14 in Accuracy (from 10^−57.2 to 10^−42.9) and by 10^22 in Robustness (from 10^−109.2 to 10^−87.0). The significant improvement in Robustness using the corrupted MNIST dataset demonstrates that the coupled negative ELBO cost function reduces the risk of overfitting by forcing the network to learn general solutions that are less likely to create outliers.

The modifications of the latent posterior distributions have been further examined using a two-dimensional representation. We show that, as the coupling increases, the latent variables have both a tighter distribution of the mean about its prior value of zero and a movement of the standard deviations toward zero, away from the prior of one. Overall, the coupled KL-divergence does indeed decrease as the coupling is increased, indicating improvement in the latent representation.
Thus, the improvements in the reconstruction, evident from both the visual clarity of the images and the increased accuracy of the measured likelihoods, are not due to a trade-off with the latent representation. Rather, the negative coupled ELBO metric shows improvement in both the latent layer divergence and the output image reconstruction. This improvement in the two components of the evidence lower bound provides evidence that the coupled VAE improves the approximate variational inference of the model.

In future research, we plan to study the coupled Gaussian distribution as the prior and posterior distribution of the latent layer. This may help achieve greater separation of the images into distinct clusters, similar to what has been achieved with t-distributed stochastic neighbor embedding (t-SNE) methods [35]. If so, it may be possible to improve the decisiveness of the likelihoods in addition to further improving the accuracy and robustness. Since our approach generalizes the training of the decoder and encoder networks, it is expected to be seamlessly applicable to other datasets and neural network architectures. We are conducting research to apply our method to a convolutional neural network design that can process more complex datasets such as CIFAR-10. This first demonstration of the coupled ELBO cost function has provided experimental results applied to a shallow neural network, but the approach is also applicable to the training of deep neural networks.