
Superiority of quadratic over conventional neural networks for classification of Gaussian mixture data.

Abstract

To enrich the diversity of artificial neurons, a type of quadratic neuron was proposed previously, in which the inner product of inputs and weights is replaced by a quadratic operation. In this paper, we demonstrate the superiority of such quadratic neurons over their conventional counterparts. For this purpose, we train quadratic neural networks using an adapted backpropagation algorithm and perform a systematic comparison between quadratic and conventional neural networks for classification of Gaussian mixture data, which is one of the most important machine learning tasks. Our results show that quadratic neural networks enjoy remarkably better efficacy and efficiency than conventional neural networks in this context, and that they are potentially extendable to other relevant applications.

Entities: Chemical

Keywords: Artificial neural networks; Backpropagation; Classification; Gaussian mixture models; Quadratic neural networks; Quadratic neurons

Year: 2022 PMID: 36167898 PMCID: PMC9515302 DOI: 10.1186/s42492-022-00118-z

Source DB: PubMed Journal: Vis Comput Ind Biomed Art ISSN: 2524-4442

Introduction

In machine learning, artificial neural networks (ANNs), especially deep neural networks, are now the mainstream approach. Usually, a neural network consists of several layers of neurons, each of which consists of a linear compartment in the form of the inner product of inputs and weights and a nonlinear unit, known as an activation function, that turns a signal on (activated) or off (attenuated). Deep neural networks have recently achieved remarkable successes in various applications such as natural language processing [1, 2], autonomous driving [3-5], game-playing [6], image analysis [7, 8], and image reconstruction [9]. Classification/clustering is one of the essential pattern recognition techniques in machine learning, and has a wide range of applications such as bioinformatics [10, 11] and medical imaging [12, 13]. The Gaussian mixture model (GMM) is well known as one of the most popular data models. Since the parameters of each Gaussian component, such as its prior probability, are typically not given and are treated as latent variables, the GMM parameters are estimated using the expectation-maximization (EM) algorithm. Alternatively, a neural network approach can be used to classify GMM data. The decision boundary for the classification can be viewed as a complicated function, which a network with a large number of neurons can approximate. After the classification network is trained, inference by the trained network is more efficient than the EM algorithm, which is iterative and time-consuming. In our previous studies [14-19], a new type of neuron, referred to as a quadratic neuron, was introduced, in which the inner product inside a conventional neuron is upgraded to a quadratic function. The initial motivation was to enrich the diversity of artificial neurons, inspired by the fact that biological diversity exists at the cellular level, and that such diversity enables efficiency, flexibility, functionality, and other benefits. 
Hence, it is hypothesized that a quadratic neural network would be advantageous similarly, which can, for example, approximate a given function with a lighter structure than a conventional neural network. The main purpose of this paper is to highlight the superiority of quadratic over conventional neural networks with the classification task as an illustrative example. The rest of the paper is organized as follows. In the next section, we review the theoretical minimum error in the GMM classification and the EM algorithm that is traditionally used to reach that error bound. In the third section, we present our procedure for initializing and training the conventional and quadratic networks with an adapted backpropagation (BP) algorithm. In the fourth section, we perform numerical experiments systematically and establish the superiority of quadratic networks over the conventional counterparts in the GMM classification. Finally, in the last section we discuss relevant issues and conclude the paper.

Methods

GMM-based classification error

In statistical classification, the Bayes error rate is theoretically optimal. In practice, without knowing latent GMM parameters, the Bayes error rate cannot be directly calculated. To close the gap, the classic EM algorithm can be used to approximate the optimal error rate, which is the benchmark to evaluate the performance of classification neural networks.

Bayes error

Given the mean $\boldsymbol{\mu}_k$, covariance $\boldsymbol{\Sigma}_k$, and prior probability $\pi_k$ of each Gaussian component $k = 1, \ldots, K$ of a GMM, the posterior probability is calculated by

$$p(k \mid \mathbf{x}_n) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}, \tag{1}$$

which means that a D-dimensional sample vector $\mathbf{x}_n$, $n = 1, \ldots, N$, should be assigned to the Gaussian component

$$\hat{y}_n = \arg\max_{k} \; p(k \mid \mathbf{x}_n), \tag{2}$$

where D, N, and K represent the dimensionality of the sample vector, the sample size, and the number of Gaussian components respectively. We can obtain the Bayesian inference results, a size-N vector $\hat{\mathbf{y}}$, by applying Eq. 2 to the entire sample pool, and compare it with the ground truth labels $\mathbf{y}$. However, in most real cases these GMM parameters are not directly known. Fortunately, we can use the EM algorithm to estimate them, as described in the following subsection.

Note that the inference cannot be directly used as the predicted label of each sample, since our task is clustering instead of classification and the component indices returned by the EM algorithm are therefore arbitrary. For example, the parameters that the EM algorithm estimates for component k may correspond to a ground-truth component with a different index. Hence, we have to perform an order correction, i.e., rearranging the estimated components to match the ground-truth order. A solution to this problem is to perform an exhaustive search so that the accuracy or loss can be optimized. By doing so, the best match will be found as our final result. More efficiently, the alternating variables method, a common derivative-free method for numerical optimization, can be used as described in Algorithm 1, with the idea of maximizing the accuracy by exchanging two coordinates each time and fixing all the remaining ones. We set MaxCycle according to the number of Gaussian components K; the value used in our experiments is sufficiently large for the numbers of components considered. Then, we compute Eq. 2 using the parameters after the order correction presented above to obtain the Bayesian inference results and the Bayes error, which serves as the benchmark for the performance of the neural networks.
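The Bayesian inference of Eq. 2 and the order correction can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the function names (`gaussian_pdf`, `bayes_predict`, `align_labels`) and the default `max_cycle=10` are hypothetical, and the pairwise-exchange loop is only one plausible reading of the alternating variables idea described above.

```python
import numpy as np


def gaussian_pdf(X, mean, cov):
    """Density of N(mean, cov) evaluated at each row of X (shape N x D)."""
    D = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    # Squared Mahalanobis distance of every sample to the component mean.
    maha = np.einsum("nd,de,ne->n", diff, inv, diff)
    norm = np.sqrt((2.0 * np.pi) ** D * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm


def bayes_predict(X, means, covs, priors):
    """Assign each sample to the component with the largest posterior."""
    # The denominator of the posterior is shared by all components, so the
    # argmax of prior * likelihood equals the argmax of the posterior.
    joint = np.stack(
        [p * gaussian_pdf(X, m, c) for m, c, p in zip(means, covs, priors)],
        axis=1,
    )
    return np.argmax(joint, axis=1)


def align_labels(pred, truth, K, max_cycle=10):
    """Order correction by pairwise exchanges (hypothetical sketch).

    Starts from the identity relabeling and keeps any exchange of two
    entries that increases the accuracy; max_cycle is an assumed default.
    """
    perm = np.arange(K)
    best = np.mean(perm[pred] == truth)
    for _ in range(max_cycle):
        improved = False
        for i in range(K):
            for j in range(i + 1, K):
                trial = perm.copy()
                trial[[i, j]] = trial[[j, i]]  # exchange two coordinates
                acc = np.mean(trial[pred] == truth)
                if acc > best:
                    perm, best, improved = trial, acc, True
        if not improved:
            break
    return perm, best
```

Calling `perm, acc = align_labels(pred, truth, K)` returns a relabeling such that `perm[pred]` can be compared with `truth`. Like any greedy search, the pairwise exchange may stop at a local optimum, which is why an exhaustive search over all permutations remains the reference for small K.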

EM algorithm

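As a classic iterative method, the EM algorithm consists of the following two steps: expectation (E) and maximization (M). The E step evaluates the expectation function based on the currently available intermediate parameters, and the M step updates the intermediate parameters to maximize the expectation function. To estimate all the parameters, the expectation function in the E step is the posterior probability $p(k \mid \mathbf{x}_n)$ for $k = 1, \ldots, K$. To start the EM procedure, for $k = 1, \ldots, K$, we initialize $\boldsymbol{\mu}_k^{(0)}$ with a size-D vector that is filled with values from the standard normal distribution, $\boldsymbol{\Sigma}_k^{(0)}$ with the D by D identity matrix, and the prior $\pi_k^{(0)}$. Then, for the jth iteration, $j = 1, 2, \ldots$, the posterior probability in the E step is computed as

$$\gamma_{nk}^{(j)} = \frac{\pi_k^{(j-1)} \, \mathcal{N}\big(\mathbf{x}_n \mid \boldsymbol{\mu}_k^{(j-1)}, \boldsymbol{\Sigma}_k^{(j-1)}\big)}{\sum_{i=1}^{K} \pi_i^{(j-1)} \, \mathcal{N}\big(\mathbf{x}_n \mid \boldsymbol{\mu}_i^{(j-1)}, \boldsymbol{\Sigma}_i^{(j-1)}\big)} \tag{3}$$

in terms of the current parameters $\boldsymbol{\mu}_k^{(j-1)}$, $\boldsymbol{\Sigma}_k^{(j-1)}$, and $\pi_k^{(j-1)}$. After this E step, the M step goes as follows:

$$\begin{aligned}
N_k^{(j)} &= \sum_{n=1}^{N} \gamma_{nk}^{(j)}, \\
\boldsymbol{\mu}_k^{(j)} &= \frac{1}{N_k^{(j)}} \sum_{n=1}^{N} \gamma_{nk}^{(j)} \, \mathbf{x}_n, \\
\boldsymbol{\Sigma}_k^{(j)} &= \frac{1}{N_k^{(j)}} \sum_{n=1}^{N} \gamma_{nk}^{(j)} \big(\mathbf{x}_n - \boldsymbol{\mu}_k^{(j)}\big)\big(\mathbf{x}_n - \boldsymbol{\mu}_k^{(j)}\big)^{\top}, \\
\pi_k^{(j)} &= \frac{N_k^{(j)}}{N}.
\end{aligned} \tag{4}$$

The E and M steps are repeated until the parameters being estimated converge within a pre-specified range or a maximum number of iterations is reached. With these estimated GMM parameters, Eq. 2 and Algorithm 1 can be used for GMM-oriented classification.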
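The E and M steps can be sketched with NumPy as below. This is a minimal textbook EM for a GMM, not the authors' code: the uniform initial prior `1 / K`, the stopping rule based on the log-likelihood, the log-domain E step, and all function and argument names are assumptions made for illustration. The means are initialized from the standard normal distribution and the covariances with the identity matrix, as stated in the text.

```python
import numpy as np


def log_joint(X, means, covs, priors):
    """log(prior_k * N(x_n | mean_k, cov_k)) for every sample n and component k."""
    N, D = X.shape
    out = np.empty((N, len(priors)))
    for k, (m, c, p) in enumerate(zip(means, covs, priors)):
        diff = X - m
        maha = np.einsum("nd,de,ne->n", diff, np.linalg.inv(c), diff)
        _, logdet = np.linalg.slogdet(c)
        out[:, k] = np.log(p) - 0.5 * (maha + logdet + D * np.log(2.0 * np.pi))
    return out


def em_gmm(X, K, max_iter=200, tol=1e-8, seed=0):
    """Estimate GMM parameters; returns means, covs, priors, posteriors, history."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    means = rng.standard_normal((K, D))   # standard normal initialization
    covs = np.stack([np.eye(D)] * K)      # identity covariances
    priors = np.full(K, 1.0 / K)          # ASSUMPTION: uniform initial prior
    history = []                          # log-likelihood per iteration
    for _ in range(max_iter):
        # E step: posterior of each component for each sample (log domain).
        lj = log_joint(X, means, covs, priors)
        log_evidence = np.logaddexp.reduce(lj, axis=1, keepdims=True)
        gamma = np.exp(lj - log_evidence)
        history.append(float(log_evidence.sum()))
        # M step: re-estimate means, covariances, and priors.
        Nk = gamma.sum(axis=0)
        means = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        priors = Nk / N
        # ASSUMPTION: stop when the log-likelihood no longer changes.
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return means, covs, priors, gamma, history
```

No covariance regularization is applied here, so a component that collapses onto very few points would make the update singular; a small diagonal term is a common safeguard in practice.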

Neural network training

Training a neural network involves two steps: initialization which sets up network parameters appropriately, and optimization which adjusts the neural parameters iteratively. An optimizer used in the second step is illustrated in Fig. 1. The key idea is to perform computational optimization using the well-known BP algorithm with respect to an objective or loss function.

Fig. 1

Neural network training as a computational optimization process with respect to an objective function which is the error rate for classification, without loss of generality

While the conventional and quadratic neural networks can be trained based on the same idea of computational optimization, they differ in specific steps, since the chain rule must be applied to different functions that summarize data (i.e., inner product versus quadratic operation). Specifically, we formulate the forward and BP processes in the following two subsections respectively, and then describe the whole process in the third subsection.

Forward computation

An exemplary feed-forward neural network is shown in Fig. 2, including input, hidden, and output layers. There are L layers in total, in each of which there is a number of neurons. A typical layer first implements affine transforms for conventional neurons and quadratic operations for quadratic neurons, and then nonlinear activations are performed, which are common for conventional and quadratic neurons.

Fig. 2

Forward computation of a feed-forward neural network with L layers of neurons

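An illustration of the affine layer of conventional and quadratic neurons is shown in Fig. 3. For a conventional neural network, the affine transform of the l-th layer can be expressed in terms of an input matrix $\mathbf{A}^{[l-1]}$ and a weight matrix $\mathbf{W}^{[l]}$ plus a bias row vector $\mathbf{b}^{[l]}$ as follows:

$$\mathbf{Z}^{[l]} = \mathbf{A}^{[l-1]} \mathbf{W}^{[l]} + \mathbf{b}^{[l]}. \tag{5}$$

For a quadratic neural network, the quadratic transform can be expressed as

$$\mathbf{Z}^{[l]} = \big(\mathbf{A}^{[l-1]} \mathbf{W}_r^{[l]} + \mathbf{b}_r^{[l]}\big) \odot \big(\mathbf{A}^{[l-1]} \mathbf{W}_g^{[l]} + \mathbf{b}_g^{[l]}\big) + \big(\mathbf{A}^{[l-1]}\big)^{\circ 2} \mathbf{W}_b^{[l]} + \mathbf{b}_b^{[l]}, \tag{6}$$

where juxtaposition stands for matrix multiplication, $\odot$ for the element-wise product, and $(\cdot)^{\circ 2}$ means an element-wise square operation. In this study, the ReLU function is used as the activation function, but if the l-th layer is the last layer of the network, i.e., $l = L$, the softmax function is computed instead. Therefore, the output of each layer is computed as follows:

$$\mathbf{A}^{[l]} = \begin{cases} \mathrm{ReLU}\big(\mathbf{Z}^{[l]}\big), & l < L, \\ \mathrm{softmax}\big(\mathbf{Z}^{[l]}\big), & l = L. \end{cases} \tag{7}$$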

Fig. 3

Illustration of the affine layer of conventional (left) and quadratic (right) neurons respectively

In other words, the input to the forward process is an N by D sample matrix $\mathbf{X}$, and the output is an N by K matrix $\hat{\mathbf{Y}}$. The prediction of each sample vector is quantified by

$$\hat{y}_n = \arg\max_{k} \; \hat{Y}_{nk}.$$

The loss or error is produced when the prediction differs from the ground truth. Note that in the forward computation we compute and store the output of each affine transform, which is subsequently used for the gradient descent search in the BP process described in the following subsection.
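The forward computation can be sketched with NumPy as follows. This is an illustration rather than the authors' implementation: it assumes that the quadratic transform is the element-wise product of two affine terms plus an affine term in the element-wise squared input (the form of the quadratic neuron in the authors' earlier work), and the parameter names `Wr`, `br`, `Wg`, `bg`, `Wb`, `bb` as well as all function names are hypothetical.

```python
import numpy as np


def relu(Z):
    return np.maximum(Z, 0.0)


def softmax(Z):
    # Subtract the row-wise maximum for numerical stability.
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)


def conventional_affine(A, W, b):
    """Inner product of inputs and weights plus a bias row vector."""
    return A @ W + b


def quadratic_transform(A, Wr, br, Wg, bg, Wb, bb):
    """ASSUMED form: (A Wr + br) * (A Wg + bg) + (A ** 2) Wb + bb."""
    return (A @ Wr + br) * (A @ Wg + bg) + (A ** 2) @ Wb + bb


def forward(X, layers, transform):
    """Run all layers; ReLU on hidden layers, softmax on the last one.

    layers is a list of parameter tuples, one per layer. The input and
    pre-activation of every layer are cached for later use in BP.
    """
    A, cache = X, []
    for l, params in enumerate(layers):
        Z = transform(A, *params)
        cache.append((A, Z))
        A = softmax(Z) if l == len(layers) - 1 else relu(Z)
    return A, cache


def predict(probabilities):
    """Predicted class of each sample: index of the largest output."""
    return np.argmax(probabilities, axis=1)
```

With `Wg` and `Wb` set to zero, `bg` set to one, and `bb` set to zero, `quadratic_transform` reduces to `conventional_affine`, so under this assumed form a conventional neuron is a special case of a quadratic one.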

BP formulation

To optimize a neural network, we perform numerical optimization. Specifically, we first find the partial derivatives of the loss with respect to each of the parameters and then update the parameters via gradient descent search with a suitable step size (learning rate). Using the chain rule, this process is formulated as the well-known BP algorithm, which is widely used to train neural networks. As its name indicates, the BP process computes the partial derivatives layer-wise from the output layer to the input layer. A brief BP diagram is shown in Fig. 4.

Fig. 4

BP process to train a neural network of L layers

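Let Q stand for the cross-entropy loss value defined as

$$Q = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} Y_{nk} \log \hat{Y}_{nk},$$

where N is the number of sample vectors, $\hat{\mathbf{Y}}$ is the predicted result, and $\mathbf{Y}$ is the one-hot ground truth label for each of the samples $n = 1, \ldots, N$. Recall that the activation function of the output layer is the softmax function, hence the gradient of the output layer can be computed as

$$\frac{\partial Q}{\partial \mathbf{Z}^{[L]}} = \frac{1}{N} \big(\hat{\mathbf{Y}} - \mathbf{Y}\big). \tag{9}$$

If $l < L$, the activation function is the ReLU function, and we have

$$\frac{\partial Q}{\partial \mathbf{Z}^{[l]}} = \frac{\partial Q}{\partial \mathbf{A}^{[l]}} \odot \mathbb{1}\big[\mathbf{Z}^{[l]} > 0\big], \tag{10}$$

where $\odot$ is the element-wise product and the indicator is applied element-wise. For a conventional neural network, we know that

$$\frac{\partial Q}{\partial \mathbf{W}^{[l]}} = \big(\mathbf{A}^{[l-1]}\big)^{\top} \frac{\partial Q}{\partial \mathbf{Z}^{[l]}}, \qquad \frac{\partial Q}{\partial \mathbf{b}^{[l]}} = \mathbf{1}_N^{\top} \frac{\partial Q}{\partial \mathbf{Z}^{[l]}}, \tag{11}$$

where $\mathbf{1}_N$ is a column vector of N ones, and

$$\frac{\partial Q}{\partial \mathbf{A}^{[l-1]}} = \frac{\partial Q}{\partial \mathbf{Z}^{[l]}} \big(\mathbf{W}^{[l]}\big)^{\top}. \tag{12}$$

The same chain rule can be applied to optimize a quadratic neural network layer-wise. Specifically, let us consider Eq. 6 in the following three parts:

$$\mathbf{Z}_r^{[l]} = \mathbf{A}^{[l-1]} \mathbf{W}_r^{[l]} + \mathbf{b}_r^{[l]}, \qquad \mathbf{Z}_g^{[l]} = \mathbf{A}^{[l-1]} \mathbf{W}_g^{[l]} + \mathbf{b}_g^{[l]}, \qquad \mathbf{Z}_b^{[l]} = \big(\mathbf{A}^{[l-1]}\big)^{\circ 2} \mathbf{W}_b^{[l]} + \mathbf{b}_b^{[l]},$$

so that $\mathbf{Z}^{[l]} = \mathbf{Z}_r^{[l]} \odot \mathbf{Z}_g^{[l]} + \mathbf{Z}_b^{[l]}$, and we have

$$\frac{\partial Q}{\partial \mathbf{Z}_r^{[l]}} = \frac{\partial Q}{\partial \mathbf{Z}^{[l]}} \odot \mathbf{Z}_g^{[l]}, \qquad \frac{\partial Q}{\partial \mathbf{Z}_g^{[l]}} = \frac{\partial Q}{\partial \mathbf{Z}^{[l]}} \odot \mathbf{Z}_r^{[l]}, \qquad \frac{\partial Q}{\partial \mathbf{Z}_b^{[l]}} = \frac{\partial Q}{\partial \mathbf{Z}^{[l]}}.$$

Then, the gradients with respect to the parameters in the three parts can be respectively found as follows:

$$\begin{aligned}
\frac{\partial Q}{\partial \mathbf{W}_r^{[l]}} &= \big(\mathbf{A}^{[l-1]}\big)^{\top} \frac{\partial Q}{\partial \mathbf{Z}_r^{[l]}}, & \frac{\partial Q}{\partial \mathbf{b}_r^{[l]}} &= \mathbf{1}_N^{\top} \frac{\partial Q}{\partial \mathbf{Z}_r^{[l]}}, \\
\frac{\partial Q}{\partial \mathbf{W}_g^{[l]}} &= \big(\mathbf{A}^{[l-1]}\big)^{\top} \frac{\partial Q}{\partial \mathbf{Z}_g^{[l]}}, & \frac{\partial Q}{\partial \mathbf{b}_g^{[l]}} &= \mathbf{1}_N^{\top} \frac{\partial Q}{\partial \mathbf{Z}_g^{[l]}}, \\
\frac{\partial Q}{\partial \mathbf{W}_b^{[l]}} &= \Big(\big(\mathbf{A}^{[l-1]}\big)^{\circ 2}\Big)^{\top} \frac{\partial Q}{\partial \mathbf{Z}_b^{[l]}}, & \frac{\partial Q}{\partial \mathbf{b}_b^{[l]}} &= \mathbf{1}_N^{\top} \frac{\partial Q}{\partial \mathbf{Z}_b^{[l]}},
\end{aligned} \tag{13}$$

and

$$\frac{\partial Q}{\partial \mathbf{A}^{[l-1]}} = \frac{\partial Q}{\partial \mathbf{Z}_r^{[l]}} \big(\mathbf{W}_r^{[l]}\big)^{\top} + \frac{\partial Q}{\partial \mathbf{Z}_g^{[l]}} \big(\mathbf{W}_g^{[l]}\big)^{\top} + 2\, \mathbf{A}^{[l-1]} \odot \bigg(\frac{\partial Q}{\partial \mathbf{Z}_b^{[l]}} \big(\mathbf{W}_b^{[l]}\big)^{\top}\bigg). \tag{14}$$

In contrast to the forward computation, the input to the BP procedure is the predicted result $\hat{\mathbf{Y}}$, which is the output of the forward process. For layer $l = L, L-1, \ldots, 1$, one layer at a time, we compute $\partial Q / \partial \mathbf{Z}^{[l]}$ using Eq. 9 or 10, depending on whether it is the last layer; this step is the same for the conventional and quadratic neural networks. Then, we compute $\partial Q / \partial \boldsymbol{\theta}^{[l]}$ according to Eq. 11 (for conventional neurons) or Eq. 13 (for quadratic neurons), where $\boldsymbol{\theta}$ denotes a vector of all trainable parameters of the network. Finally, we compute $\partial Q / \partial \mathbf{A}^{[l-1]}$ using Eq. 12 (for conventional neurons) or Eq. 14 (for quadratic neurons), which is needed for the next iteration. After the gradient of the network is obtained, we update the parameters via the Adam optimizer in this study.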
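A backward pass for a single quadratic layer can be sketched as below, together with the softmax cross-entropy gradient of the output layer. The sketch assumes that the quadratic transform has the form (A Wr + br) * (A Wg + bg) + (A ** 2) Wb + bb; the parameter and function names are hypothetical, and the code is an illustration of the chain rule rather than the authors' implementation. The Adam update itself is not shown.

```python
import numpy as np


def quadratic_forward(A, Wr, br, Wg, bg, Wb, bb):
    """ASSUMED quadratic transform, split into its three parts."""
    Zr = A @ Wr + br
    Zg = A @ Wg + bg
    Zb = (A ** 2) @ Wb + bb
    return Zr * Zg + Zb, (A, Zr, Zg)


def quadratic_backward(dZ, cache, Wr, Wg, Wb):
    """Gradients of the loss w.r.t. the layer parameters and the layer input.

    dZ is the gradient of the loss with respect to the layer output Z.
    """
    A, Zr, Zg = cache
    dZr = dZ * Zg          # product rule for the first affine part
    dZg = dZ * Zr          # product rule for the second affine part
    grads = {
        "Wr": A.T @ dZr, "br": dZr.sum(axis=0, keepdims=True),
        "Wg": A.T @ dZg, "bg": dZg.sum(axis=0, keepdims=True),
        "Wb": (A ** 2).T @ dZ, "bb": dZ.sum(axis=0, keepdims=True),
    }
    # The squared-input part contributes 2 * A element-wise.
    dA = dZr @ Wr.T + dZg @ Wg.T + 2.0 * A * (dZ @ Wb.T)
    return grads, dA


def softmax_cross_entropy(Z, Y):
    """Mean cross-entropy of softmax(Z) against one-hot Y, and its gradient in Z."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    P = E / E.sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(Y * np.log(P), axis=1))
    return loss, (P - Y) / Y.shape[0]
```

A finite-difference check, i.e., perturbing one entry of a weight matrix by a small amount and comparing the change in loss with the corresponding entry of `grads`, is a simple way to validate such a derivation before training.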

Whole training process

Initialization. Let us use a series of integers to describe a feed-forward neural network architecture of our interest,

$$n_0\text{-}n_1\text{-}\cdots\text{-}n_L, \tag{15}$$

where $n_l$ represents the dimension of the output of the l-th layer, $\mathbf{A}^{[l]}$. Then, the total number of neurons used in the network is $\sum_{l=1}^{L} n_l$. Note that $n_0 = D$, the dimension of input samples, and $n_L = K$, the number of classes. Then, the network can be randomly initialized with a vector of parameters for each layer. Specifically, for each layer $l = 1, \ldots, L$, let d_from $= n_{l-1}$ be the input dimension and d_to $= n_l$ the output dimension. All weights ($\mathbf{W}^{[l]}$ for a conventional neural network and $\mathbf{W}_r^{[l]}$, $\mathbf{W}_g^{[l]}$, $\mathbf{W}_b^{[l]}$ for a quadratic neural network) and biases ($\mathbf{b}^{[l]}$ for a conventional network and $\mathbf{b}_r^{[l]}$, $\mathbf{b}_g^{[l]}$, $\mathbf{b}_b^{[l]}$ for a quadratic network) are set using NumPy (np, version 1.23.0), a Python package: the bias is a 1 by d_to zero matrix, and the weight is a d_from by d_to matrix.

Optimization. As shown in Fig. 1, given the neural network just initialized and a training dataset containing the samples $\mathbf{X}$ and the corresponding labels $\mathbf{Y}$, we can repeat the forward computation and BP processes described in the above two subsections until the stopping criteria are satisfied. The cross-entropy losses on the training and validation samples are evaluated during the training process.
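The initialization described above can be sketched as follows. Only the shapes (a d_from by d_to weight and a 1 by d_to zero bias) come from the text; the distribution and the `scale=0.1` factor of the random weights are assumptions made for this illustration, as are the function names and the choice of three weight/bias pairs per quadratic layer.

```python
import numpy as np


def init_layer(d_from, d_to, quadratic, rng, scale=0.1):
    """Parameters of one layer as a flat list [W, b] or [Wr, br, Wg, bg, Wb, bb]."""
    n_parts = 3 if quadratic else 1  # ASSUMPTION: three parts per quadratic layer
    params = []
    for _ in range(n_parts):
        # ASSUMPTION: scaled standard normal weights; the text fixes only the shape.
        W = scale * rng.standard_normal((d_from, d_to))
        b = np.zeros((1, d_to))      # bias is a 1 by d_to zero matrix
        params += [W, b]
    return params


def init_network(arch, quadratic, seed=0):
    """arch is the series of integers n_0, n_1, ..., n_L describing the network."""
    rng = np.random.default_rng(seed)
    return [
        init_layer(d_from, d_to, quadratic, rng)
        for d_from, d_to in zip(arch[:-1], arch[1:])
    ]


def count_neurons(arch):
    """Total number of neurons: every layer except the input one."""
    return sum(arch[1:])
```

For example, `init_network([2, 100, 3], quadratic=False)` builds the parameters of a C(2-100-3) network and `init_network([2, 3], quadratic=True)` those of a Q(2-3) network, with `count_neurons` giving 103 and 3 neurons respectively.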

Results and discussion

Using the training methods in the preceding section, we optimized conventional and quadratic neural networks to solve a number of GMM-based classification problems. At the beginning, we solved a three-class problem in the two-dimensional (2D) space to illustrate the working principle. Then, we performed a systematic comparison between conventional and quadratic neural networks on samples with different numbers of classes and dimensions. Finally, we applied all methods on three real data sets. Meanwhile, we used the EM algorithm and Bayes inference to obtain the Bayes error rate as the performance benchmark of the neural networks.

Illustrative classification example

Our initial classification problem assumes a finite number of classes (the first example, K = 3) in the 2D space (D = 2): two Gaussian clusters plus a background, which can be viewed as a special case of the Gaussian distribution. As in other network-based classifiers, a one-hot label vector was used in our networks as well. The parameters of the background, indicated by the subscript b, were fixed, and the parameters of the other Gaussian clusters were set randomly using matrix multiplication in NumPy (np, version 1.23.0), a Python package. Given the mean and covariance of each class, we generated a randomly chosen number of points for each class except the background. Then, we generated the points for the background. The entire dataset was shuffled and split into three parts: 50% as training samples, 20% as validation samples, and 30% as test samples. Figure 5 shows the scatter plot of sample points.
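The data preparation can be sketched as follows. The 50%/20%/30% split and the one-hot labels follow the text; everything else is hypothetical, including the function names, the way a random covariance is built (a random matrix multiplied by its transpose, plus a small diagonal term), and any means or sample counts passed in, none of which are the values used in the paper.

```python
import numpy as np


def random_covariance(D, rng):
    """ASSUMPTION: a random positive-definite matrix built by matrix multiplication."""
    A = rng.random((D, D))
    return A @ A.T + 1e-3 * np.eye(D)


def make_gmm_dataset(means, covs, counts, seed=0):
    """Draw counts[k] points from N(means[k], covs[k]) and label them k."""
    rng = np.random.default_rng(seed)
    X = np.vstack([
        rng.multivariate_normal(m, c, size=n)
        for m, c, n in zip(means, covs, counts)
    ])
    y = np.concatenate([np.full(n, k) for k, n in enumerate(counts)])
    return X, y


def shuffle_split(X, y, fractions=(0.5, 0.2, 0.3), seed=0):
    """Shuffle, then split into training, validation, and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(fractions[0] * len(X))
    n_val = int(fractions[1] * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])


def one_hot(y, K):
    """One-hot encoding of integer labels 0, ..., K-1."""
    return np.eye(K)[y]
```

The test part simply receives whatever remains after the training and validation parts are cut, so the three parts always cover the whole dataset even when the fractions do not divide the sample size evenly.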

Fig. 5

Scatter plot of sample points which contains three classes: two Gaussian clusters plus a background

We trained conventional and quadratic neural networks with different numbers of neurons for GMM classification. The decreasing loss on the validation samples during the training process is shown in Fig. 6. The decision boundaries are shown in Fig. 7 for the conventional and quadratic neural networks as well as the EM algorithm. It took hundreds of neurons for the conventional network to approach the elliptical boundaries, while the quadratic network accurately fitted them with only three quadratic neurons.
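One way to see why quadratic units are a natural fit for such elliptical boundaries is the following standard identity, given here as a supporting sketch rather than a derivation taken from the paper. The symbols $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ denote the mean, covariance, and prior of component k, and $s_k$ is a name introduced here for illustration. The logarithm of each prior-weighted Gaussian density is a quadratic function of the sample, and the posterior is a softmax over these K quadratic scores:

```latex
\begin{aligned}
s_k(\mathbf{x}) &= \log\big(\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\big) \\
 &= -\tfrac{1}{2}\, \mathbf{x}^{\top} \boldsymbol{\Sigma}_k^{-1} \mathbf{x}
    + \boldsymbol{\mu}_k^{\top} \boldsymbol{\Sigma}_k^{-1} \mathbf{x}
    - \tfrac{1}{2}\, \boldsymbol{\mu}_k^{\top} \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k
    - \tfrac{1}{2} \log \lvert \boldsymbol{\Sigma}_k \rvert
    - \tfrac{D}{2} \log 2\pi + \log \pi_k, \\[4pt]
p(k \mid \mathbf{x}) &= \frac{e^{s_k(\mathbf{x})}}{\sum_{j=1}^{K} e^{s_j(\mathbf{x})}}
 = \operatorname{softmax}\big(s_1(\mathbf{x}), \ldots, s_K(\mathbf{x})\big)_k .
\end{aligned}
```

The boundary between two classes i and j is the set where $s_i(\mathbf{x}) = s_j(\mathbf{x})$, which is a conic section in 2D. A softmax output layer with one quadratic score per class therefore has the same functional form as the Bayes posterior, which is consistent with the observation above; whether a particular quadratic neuron parameterization can realize every such score in higher dimensions is not shown here.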

Fig. 6

Decreasing loss on the validation samples during the training process of conventional (C) and quadratic (Q) neural networks with different numbers of neurons. The notation stands for the architecture of a neural network as described in Eq. 15

Fig. 7

Decision boundaries of EM and neural networks, including C and Q neural networks with different numbers of neurons. The notation stands for the architecture of a neural network as described in Eq. 15

The lighter the network structure, the higher the computational efficiency. Table 1 shows the time spent to train the conventional and quadratic neural networks, and the accuracies of the EM algorithm and the neural networks on the test samples. Our quadratic neural network with only one layer produced a performance closer to the EM benchmark than the conventional neural network with more than one hundred conventional neurons. Also, the time needed for the quadratic neural network is only about 7.6% of that of this conventional counterpart (16.30 s versus 213.61 s).

Table 1

The average accuracy of, and time needed by, the EM algorithm and the Q and C neural networks with different numbers of neurons in 2D space with two Gaussian clusters plus a background. The notation in parentheses stands for the architecture of a neural network as described in Eq. 15

	Accuracy (%)	Time (s)
C(2-3)	31.1752	2.44
C(2-10-3)	91.8122	23.15
C(2-100-3)	91.8391	213.61
Q(2-3)	91.8525	16.30
EM	91.8625

Systematic comparison

To systematically compare conventional and quadratic networks, we tested them in 2D and three-dimensional (3D) spaces with K = 5 and 8 Gaussian clusters. In each case, we randomly generated 50 samples using the aforementioned method except we replaced the background by a Gaussian cluster and set . Typical scatter plots of these samples are presented in Fig. 8.

Fig. 8

Typical scatter plots of samples in 2D (left) and 3D (right) spaces with K = 5 (top) and K = 8 (bottom)

We trained and tested the EM algorithm and the conventional and quadratic networks with different numbers of layers/neurons, and compared them in terms of the average accuracy. The results are summarized in Table 2. Very interestingly, in all cases, the accuracy of the quadratic networks with only an output layer of a few neurons is higher than that of the conventional network of over one hundred neurons. Meanwhile, the training time needed for quadratic neural networks is only a fraction, on average, of that taken by the much more complicated conventional network (for example, 15.8 s versus 62.7 s for D = 2 and K = 5). Generally speaking, the quadratic neural networks delivered a performance very close to that of the EM algorithm.

Table 2

The average accuracy of, and time needed by, the EM algorithm and the Q and C neural networks with different numbers of neurons in 2D and 3D spaces with K = 5 and 8 Gaussian clusters. The notation in parentheses stands for the architecture of a neural network as described in Eq. 15

	Accuracy (%)	Time (s)	Accuracy (%)	Time (s)
	D = 2, K = 5		D = 3, K = 5
C(2-3)	92.36 ± 7.89	11.5 ± 5.0	84.72 ± 10.32	13.7 ± 5.8
C(2-10-3)	95.36 ± 5.66	23.3 ± 14.1	92.15 ± 8.04	24.8 ± 8.6
C(2-100-3)	95.47 ± 5.68	62.7 ± 27.9	92.54 ± 8.01	71.8 ± 29.2
Q(2-3)	95.53 ± 5.66	15.8 ± 6.7	92.74 ± 7.98	17.6 ± 7.4
EM	95.60 ± 5.65		92.90 ± 7.97
	D = 2, K = 8		D = 3, K = 8
C(2-3)	83.84 ± 5.53	22.8 ± 8.0	76.47 ± 6.31	28.8 ± 10.6
C(2-10-3)	87.90 ± 3.75	47.6 ± 17.1	82.45 ± 5.36	62.6 ± 18.5
C(2-100-3)	88.06 ± 3.70	122.6 ± 39.2	82.68 ± 5.34	151.8 ± 41.4
Q(2-3)	88.13 ± 3.67	33.3 ± 11.0	82.79 ± 5.31	46.2 ± 13.3
EM	88.19 ± 3.64		82.86 ± 5.32
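The relative training cost can be read off the mean times in Table 2 with a few lines of arithmetic. The snippet below is a reader-side calculation on the tabulated means (standard deviations ignored), not a figure reported by the authors; the dictionary keys and variable names are made up for illustration.

```python
# Mean training times in seconds, copied from the Q(2-3) and C(2-100-3)
# rows of Table 2 for the four settings of dimension D and cluster count K.
time_q = {"D=2,K=5": 15.8, "D=3,K=5": 17.6, "D=2,K=8": 33.3, "D=3,K=8": 46.2}
time_c = {"D=2,K=5": 62.7, "D=3,K=5": 71.8, "D=2,K=8": 122.6, "D=3,K=8": 151.8}

# Ratio of quadratic to conventional training time in each setting.
ratios = {key: time_q[key] / time_c[key] for key in time_q}
mean_ratio = sum(ratios.values()) / len(ratios)

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.1%}")
print(f"mean: {mean_ratio:.1%}")
```

On these means, the quadratic network takes roughly a quarter to a third of the training time of the largest conventional network in every setting.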

Real data

Finally, we applied conventional and quadratic networks to three real data sets from the UCI Machine Learning repository [20]: protein localization sites (yeast), pen-based recognition of handwritten digits (pendigits), and isolated letter speech recognition (isolet). The attribute types of all three data sets are numerical. Some basic information about these datasets is given in Table 3. For the yeast dataset, we split the whole dataset in the same proportions as described in the first subsection. Typical yeast cell (Saccharomyces cerevisiae cell) images from the Cell Image Library [21], visualized through transmission electron microscopy, are shown in Fig. 9. For the pendigits and isolet datasets, with the test samples already provided, 30% of the training samples were used for validation.

Table 3

Basic information about three real data sets: protein localization sites (yeast), pen-based recognition of handwritten digits (pendigits), and isolated letter speech recognition (isolet)

Datasets	Train	Test	Dimensions	Classes
yeast	1484		8	10
pendigits	7494	3498	16	10
isolet	6238	1559	617	26

Fig. 9

Typical yeast images from the Cell Image Library (http://cellimagelibrary.org/groups/50815)

We trained and tested the EM algorithm and the conventional and quadratic networks with different numbers of layers/neurons on each dataset 20 times. The average accuracy of, and time needed by, each method are shown in Table 4. In each application, the quadratic neural network with only one layer of a few neurons has the highest accuracy, while its training time is about half that of the conventional networks, which are orders of magnitude larger than the quadratic version.

Table 4

The average accuracy of, and time needed by, the Q and C neural networks with different numbers of neurons for three real datasets. The notation in parentheses stands for the architecture of a neural network as described in Eq. 15

	Yeast		Pendigits		Isolet
	Accuracy (%)	Time (s)	Accuracy (%)	Time (s)	Accuracy (%)	Time (s)
C(2-3)	57.17 ± 1.79	0.5 ± 0.3	90.01 ± 1.54	3.3 ± 2.7	89.94 ± 3.30	11.2 ± 0.2
C(2-10-3)	58.21 ± 1.93	0.7 ± 0.2	93.23 ± 2.39	4.0 ± 1.9	92.00 ± 0.60	11.8 ± 0.7
C(2-100-3)	59.69 ± 2.96	0.9 ± 0.2	96.66 ± 0.29	14.8 ± 27.4	94.47 ± 0.30	49.8 ± 2.6
Q(2-3)	60.99 ± 1.27	0.5 ± 0.1	97.04 ± 0.30	6.1 ± 0.2	95.01 ± 0.17	21.2 ± 0.2
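
The repeated-run protocol behind entries of the form "mean ± deviation" can be sketched as follows. This is only an illustrative sketch, not the authors' code: the helper names `benchmark` and `format_cell`, the use of the run index as a random seed, and the choice of the sample standard deviation for the "±" term are all assumptions, since the text does not state how the spread was computed.

```python
import statistics
import time
from typing import Callable, Tuple


def benchmark(train_and_score: Callable[[int], float],
              runs: int = 20) -> Tuple[float, float, float, float]:
    """Run a training/evaluation routine several times (hypothetical helper).

    `train_and_score(seed)` is assumed to train one model and return its
    test accuracy in percent. Returns (mean accuracy, accuracy spread,
    mean wall-clock time in seconds, time spread).
    """
    accuracies, durations = [], []
    for seed in range(runs):
        start = time.perf_counter()
        accuracy = train_and_score(seed)
        durations.append(time.perf_counter() - start)
        accuracies.append(accuracy)
    # Sample standard deviation is an assumption; it needs runs >= 2.
    return (statistics.mean(accuracies), statistics.stdev(accuracies),
            statistics.mean(durations), statistics.stdev(durations))


def format_cell(mean: float, spread: float, digits: int = 2) -> str:
    """Format one table cell as 'mean ± spread'."""
    return f"{mean:.{digits}f} ± {spread:.{digits}f}"
```

In use, `train_and_score` would wrap the training of one network (for example a Q(2-3) or C(2-100-3) model) on a fixed train/test split, and `format_cell` would be applied separately to the accuracy and time statistics.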

Conclusions

Although the EM algorithm has been well tested and has a solid theoretical foundation, it needs to load an entire dataset into memory and process it iteratively, which is time-consuming, and it is restricted to data that come from a GMM. Furthermore, when new samples become available, the parameters need to be adjusted again. A neural network approach can be much more desirable, effective, and efficient, and in principle it works with many data models thanks to its universal approximation property. After a network is well trained, new samples can be used to fine-tune the network or be processed for inference in a feed-forward fashion, which is extremely efficient and generalizable to cases much more complicated than GMM. Interestingly, compared to conventional networks, quadratic networks can deliver a performance close to that of the EM algorithm in the GMM cases and yet be orders of magnitude simpler than conventional networks for the same classification task. In conclusion, we have numerically and experimentally demonstrated the superiority of quadratic networks over conventional ones. We underline that a quadratic neural network with a much lighter structure rivals a conventional network that is orders of magnitude more complex in solving the same classification problems. The superior classification performance of quadratic networks could potentially be translated to medical imaging tasks, especially radiomics.
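
One way to see why a very small quadratic network can approach EM-level performance on GMM data is that the log-likelihood ratio of two Gaussian components is itself a quadratic function of the input. The sketch below illustrates this for a deliberately restricted case. It assumes the quadratic neuron form (w_r·x + b_r)(w_g·x + b_g) + w_b·(x∘x) + c from the authors' earlier work, two components with diagonal covariances, and equal priors; the function names are hypothetical, the activation function is omitted, and the weights are set in closed form rather than learned by backpropagation.

```python
import math


def quadratic_neuron(x, w_r, b_r, w_g, b_g, w_b, c):
    """Pre-activation of one quadratic neuron (assumed form, no activation)."""
    lin_r = sum(w * xi for w, xi in zip(w_r, x)) + b_r
    lin_g = sum(w * xi for w, xi in zip(w_g, x)) + b_g
    power = sum(w * xi * xi for w, xi in zip(w_b, x))
    return lin_r * lin_g + power + c


def log_gauss_diag(x, mu, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return -0.5 * sum((xi - m) ** 2 / v + math.log(2 * math.pi * v)
                      for xi, m, v in zip(x, mu, var))


def neuron_for_two_gaussians(mu1, var1, mu2, var2):
    """Closed-form weights so the neuron outputs log p1(x) - log p2(x).

    Assumes diagonal covariances and equal priors (illustrative only).
    """
    # Linear part: set w_g = 0 and b_g = 1 so the product term is w_r.x + b_r.
    w_r = [m1 / v1 - m2 / v2 for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)]
    w_g = [0.0] * len(mu1)
    # Power part: coefficients of the squared inputs.
    w_b = [0.5 / v2 - 0.5 / v1 for v1, v2 in zip(var1, var2)]
    # Constant part: remaining mean and normalization terms.
    c = 0.5 * sum(m2 * m2 / v2 - m1 * m1 / v1 + math.log(v2) - math.log(v1)
                  for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2))
    return dict(w_r=w_r, b_r=0.0, w_g=w_g, b_g=1.0, w_b=w_b, c=c)
```

With these weights, the sign of the neuron output tells which of the two components is more likely for a given sample, so a single neuron reproduces the Bayes decision rule in this restricted setting. The sketch covers only the diagonal, equal-prior case and says nothing about how training behaves in practice.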
