Miguel-Angel Mendez Lucero1, Rafael-Michael Karampatsis1, Enrique Bojorquez Gallardo1, Vaishak Belle1,2.
Abstract
In a seminal book, Minsky and Papert define the perceptron as a limited implementation of what they called "parallel machines." They showed that some binary Boolean functions, including XOR, are not definable in a single-layer perceptron because it can learn only linearly separable functions. In this work, we propose a new, more powerful implementation of such parallel machines. This new mathematical tool is defined using analytic sinusoids, instead of linear combinations, to form an analytic signal representation of the function that we want to learn. We show that this reformulated parallel mechanism can learn, with a single layer, any non-linear k-ary Boolean function. Finally, to provide an example of its practical applications, we show that it outperforms the single-hidden-layer multilayer perceptron in both Boolean function learning and image classification tasks, while also being faster and requiring fewer parameters.
Keywords: learning function spaces; neural networks; parallel machines; perceptron; signal perceptron
Year: 2022 PMID: 35719687 PMCID: PMC9203047 DOI: 10.3389/frai.2022.770254
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Every possible unary Boolean function.
| x | FALSE (constant 0) | NOT x | x (identity) | TRUE (constant 1) |
|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 1 |
Every possible binary Boolean function (a.k.a. logical connectives).
| (x1, x2) | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 | f12 | f13 | f14 | f15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (0, 0) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| (0, 1) | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| (1, 0) | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| (1, 1) | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
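The two truth tables above can be regenerated mechanically. The following is a minimal sketch, not taken from the paper; the helper name `boolean_functions` is hypothetical. It lists every k-ary Boolean function as a truth vector, in the same left-to-right order as the table columns (lexicographic order of the output values read top to bottom).

```python
from itertools import product

def boolean_functions(k):
    """Return (inputs, truth_vectors) for all k-ary Boolean functions.

    Each truth vector lists the function's outputs for the inputs in
    lexicographic order, e.g. (0, 0), (0, 1), (1, 0), (1, 1) for k = 2.
    """
    inputs = list(product((0, 1), repeat=k))                    # 2**k input tuples
    truth_vectors = list(product((0, 1), repeat=len(inputs)))   # 2**(2**k) functions
    return inputs, truth_vectors

inputs, functions = boolean_functions(2)
xor_index = functions.index((0, 1, 1, 0))   # XOR is 1 exactly on (0, 1) and (1, 0)
print(len(functions), xor_index)            # 16 6
```

Counting from zero, XOR is therefore the seventh function column of the binary table.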
Figure 1. From left to right: the abstract definition of a parallel machine, and the perceptron as a parametric implementation of the parallel machine. Figures based on the diagrams in the book Perceptrons (Minsky, 1969).
Figure 2. Modifying the parameters of an analytic sinusoid with amplitude 1, phase 0, and frequency π. The red sinusoid represents the imaginary part of the complex sinusoid and the blue one represents the real part. (A) The original signal, (B) the signal with a phase shift, (C) the signal with a change of amplitude and a phase shift, and (D) the signal with a change of amplitude, phase, and frequency.
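To make the three parameters in the caption concrete, here is a small sketch of a complex sinusoid. It assumes the usual form A·exp(i(ωt + φ)), with "frequency" read as the angular frequency ω; the function name `analytic_sinusoid` is hypothetical and not from the paper.

```python
import cmath
import math

def analytic_sinusoid(t, amplitude=1.0, phase=0.0, frequency=math.pi):
    """Complex sinusoid amplitude * exp(i * (frequency * t + phase)).

    The real part is a cosine and the imaginary part is a sine.
    """
    return amplitude * cmath.exp(1j * (frequency * t + phase))

original = analytic_sinusoid(0.5)                      # real part ~0, imaginary part ~1
shifted = analytic_sinusoid(0.5, phase=math.pi / 2)    # real part ~-1, imaginary part ~0
scaled = analytic_sinusoid(0.5, amplitude=2.0)         # same angle, modulus 2
print(original, shifted, abs(scaled))
```

Changing the phase rotates the signal, changing the amplitude rescales its modulus, and changing the frequency alters how fast the angle advances with t.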
Figure 3. From left to right: the implementation of the MLP with one hidden layer as a parallel machine, and the implementation of the signal perceptron.
Algorithm 1. Learning algorithm for the signal perceptron based on a system of linear equations.
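The steps of this algorithm are not listed here, so the following is only a sketch of one way such a linear-system fit could work, not the authors' procedure. It assumes a real-valued signal perceptron of the form f(x) = Σ_j α_j cos(π⟨w_j, x⟩), where the frequency vectors w_j range over {0, 1}^k. That form, and the helper names `basis`, `solve`, and `fit_signal_perceptron`, are assumptions. Under them, the amplitudes α are the solution of a square linear system with one equation per input.

```python
from fractions import Fraction
from itertools import product

def basis(w, x):
    """cos(pi * <w, x>) for binary vectors, which equals (-1) ** <w, x>."""
    return -1 if sum(wi * xi for wi, xi in zip(w, x)) % 2 else 1

def solve(matrix, rhs):
    """Solve a square, non-singular linear system exactly (Gauss-Jordan)."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def fit_signal_perceptron(truth_vector, k=2):
    """Amplitudes alpha such that sum_j alpha_j * basis(w_j, x) matches the truth vector."""
    points = list(product((0, 1), repeat=k))   # the inputs x, reused as the frequencies w
    design = [[basis(w, x) for w in points] for x in points]
    return solve(design, truth_vector)

xor_params = fit_signal_perceptron((0, 1, 1, 0))
print([float(a) for a in xor_params])   # [0.5, 0.0, 0.0, -0.5]
```

For XOR this yields the amplitudes 0.5, 0, 0, −0.5, which agree with the parameters tabulated for the function "0 1 1 0".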
Table of all parameters required by the signal perceptron to define every binary Boolean function (as defined in Table 2).
| Function (outputs for (0, 0), (0, 1), (1, 0), (1, 1)) | Parameters | Function (outputs for (0, 0), (0, 1), (1, 0), (1, 1)) | Parameters |
|---|---|---|---|
| 0 0 0 0 | 0, 0, 0, 0 | 1 0 0 0 | 0.25, 0.25, 0.25, 0.25 |
| 0 0 0 1 | 0.25, −0.25, −0.25, 0.25 | 1 0 0 1 | 0.5, 0, 0, 0.5 |
| 0 0 1 0 | 0.25, 0.25, −0.25, −0.25 | 1 0 1 0 | 0.5, 0.5, 0, 0 |
| 0 0 1 1 | 0.5, 0, −0.5, 0 | 1 0 1 1 | 0.75, 0.25, −0.25, 0.25 |
| 0 1 0 0 | 0.25, −0.25, 0.25, −0.25 | 1 1 0 0 | 0.5, 0, 0.5, 0 |
| 0 1 0 1 | 0.5, −0.5, 0, 0 | 1 1 0 1 | 0.75, −0.25, 0.25, 0.25 |
| 0 1 1 0 | 0.5, 0, 0, −0.5 | 1 1 1 0 | 0.75, 0.25, 0.25, −0.25 |
| 0 1 1 1 | 0.75, −0.25, −0.25, −0.25 | 1 1 1 1 | 1, 0, 0, 0 |
These were obtained by running an implementation of Algorithms 1 and 2.
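As a sanity check on the tabulated parameters, the sketch below evaluates three rows of the table. It assumes the parameters are the amplitudes of a real-valued signal perceptron f(x) = Σ_j α_j cos(π⟨w_j, x⟩), with the frequency vectors w_j enumerating (0, 0), (0, 1), (1, 0), (1, 1) in that order. The functional form and the name `signal_perceptron` are assumptions, chosen because they reproduce the listed truth vectors.

```python
import math
from itertools import product

def signal_perceptron(params, x):
    """Sum of amplitudes times cos(pi * <w, x>), with w ranging over {0, 1}^k."""
    frequencies = product((0, 1), repeat=len(x))
    return sum(a * math.cos(math.pi * sum(wi * xi for wi, xi in zip(w, x)))
               for a, w in zip(params, frequencies))

# Three rows copied from the table: truth vector -> parameters.
table_rows = {
    (0, 1, 1, 0): (0.5, 0, 0, -0.5),             # XOR
    (0, 1, 1, 1): (0.75, -0.25, -0.25, -0.25),   # OR
    (1, 0, 1, 1): (0.75, 0.25, -0.25, 0.25),
}

for truth, params in table_rows.items():
    outputs = tuple(round(signal_perceptron(params, x))
                    for x in product((0, 1), repeat=2))
    print(truth, outputs, outputs == truth)
```

Each evaluated row returns its own truth vector, including XOR, which is not linearly separable.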
Algorithm 2. Gradient-descent-based learning algorithm for the signal perceptron.
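The steps of the gradient-descent variant are likewise not listed here. The following is a rough sketch under stated assumptions, not the authors' implementation: the model is the real-valued form f(x) = Σ_j α_j cos(π⟨w_j, x⟩) with fixed frequencies w_j in {0, 1}^2, the loss is mean squared error over the four inputs, updates are full-batch, and amplitudes start at zero. All function names are hypothetical.

```python
from itertools import product

POINTS = list(product((0, 1), repeat=2))   # inputs, also reused as frequency vectors

def basis(w, x):
    """cos(pi * <w, x>) for binary vectors, i.e. +1 or -1."""
    return -1.0 if sum(wi * xi for wi, xi in zip(w, x)) % 2 else 1.0

def predict(params, x):
    return sum(a * basis(w, x) for a, w in zip(params, POINTS))

def mse(params, targets):
    return sum((predict(params, x) - y) ** 2 for x, y in zip(POINTS, targets)) / len(POINTS)

def train(targets, epochs=100, lr=0.1):
    """Full-batch gradient descent on the amplitudes only (assumed setup)."""
    params = [0.0] * len(POINTS)
    for _ in range(epochs):
        errors = [predict(params, x) - y for x, y in zip(POINTS, targets)]
        # d(MSE)/d(alpha_j) = (2 / N) * sum_x error(x) * basis(w_j, x)
        grads = [2.0 / len(POINTS) * sum(e * basis(w, x) for e, x in zip(errors, POINTS))
                 for w in POINTS]
        params = [a - lr * g for a, g in zip(params, grads)]
    return params

xor = (0, 1, 1, 0)
learned = train(xor)
print(learned, mse(learned, xor))
```

Because this model is linear in its amplitudes, the loss is convex, which may help explain why the signal perceptron variants in the tables reach very small losses within a few hundred epochs at a learning rate of 0.1.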
Spatial complexity and learning times of different architectures.
| Metric | SP NumPy | RSP NumPy | RSP PyTorch | FSP PyTorch | MLP PyTorch |
|---|---|---|---|---|---|
| Learnable parameters | 4 | 4 | 4 | 12 | 17 |
| Avg forward/inference time (ms) | 0.0085 | 0.0065 | 0.0409 | 0.0270 | 0.0410 |
| Avg backward/backprop time (ms) | 0.0409 | 0.0336 | 0.1698 | 0.1899 | 0.2232 |
Signal perceptron (SP) NumPy, real signal perceptron (RSP) NumPy, RSP PyTorch, Fourier signal perceptron (FSP) PyTorch, and single hidden layer MLP PyTorch.
Number of learned functions per method when learning the sixteen functions of the binary Boolean function space.
| Epochs | Learning rate | SP NumPy | RSP NumPy | RSP PyTorch | FSP PyTorch | MLP PyTorch |
|---|---|---|---|---|---|---|
| 100 | 0.001 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.01 | 0 | 0 | 0 | 0 | 0 |
| 100 | 0.1 | 16 | 16 | 16 | 16 | 1 |
| 1,000 | 0.001 | 0 | 0 | 0 | 0 | 0 |
| 1,000 | 0.01 | 16 | 16 | 16 | 2 | 0 |
| 1,000 | 0.1 | 16 | 16 | 16 | 16 | 0 |
| 10,000 | 0.001 | 16 | 16 | 16 | 1 | 0 |
| 10,000 | 0.01 | 16 | 16 | 16 | 16 | 0 |
| 10,000 | 0.1 | 16 | 16 | 16 | 16 | 6 |
| 20,000 | 0.001 | 16 | 16 | 16 | 9 | 0 |
| 20,000 | 0.01 | 16 | 16 | 16 | 16 | 2 |
| 20,000 | 0.1 | 16 | 16 | 16 | 16 | 14 |
The methods compared are different implementations of the signal perceptron and the MLP with one hidden layer.
Average final loss for different implementations of the signal perceptron and MLP with one hidden layer when learning the sixteen functions of the binary Boolean function space.
| Epochs | Learning rate | SP NumPy | RSP NumPy | RSP PyTorch | FSP PyTorch | MLP PyTorch |
|---|---|---|---|---|---|---|
| 100 | 0.001 | 0.6233 − 1.8718 | 0.7238 | 0.4258 | 0.2629 | 0.2476 |
| 100 | 0.01 | 0.0718 + 2.3059 | 0.0189 | 0.0164 | 0.1776 | 0.2260 |
| 100 | 0.1 | 4.2005 · 10^−20 + 1.7220 | 9.4023 · 10^−20 | 5.7768 · 10^−15 | 0.0566 | 0.1868 |
| 1,000 | 0.001 | 0.6233 − 1.8718 | 0.7238 | 0.4258 | 0.2629 | 0.2476 |
| 1,000 | 0.01 | 1.7142 · 10^−18 + 4.8695 | 2.1990 · 10^−18 | 1.4449 · 10^−12 | 0.0656 | 0.1864 |
| 1,000 | 0.1 | 2.5458 · 10^−32 + 2.4818 | 3.4570 · 10^−32 | 7.2082 · 10^−15 | 5.1609 · 10^−05 | 0.1264 |
| 10,000 | 0.001 | 2.2651 · 10^−18 + 1.2682 | 5.7428 · 10^−18 | 1.4452 · 10^−10 | 0.0535 | 0.1865 |
| 10,000 | 0.01 | 5.6787 · 10^−30 + 8.6296 | 5.0416 · 10^−30 | 1.2703 · 10^−12 | 2.4053 · 10^−05 | 0.1050 |
| 10,000 | 0.1 | 3.0393 · 10^−32 + 6.0275 | 5.7537 · 10^−32 | 6.9512 · 10^−15 | 8.8542 · 10^−08 | 0.0264 |
| 20,000 | 0.001 | 5.7354 · 10^−28 + 3.5035 | 6.0837 · 10^−28 | 1.6507 · 10^−10 | 0.0040 | 0.1806 |
| 20,000 | 0.01 | 5.3189 · 10^−30 + 1.4378 | 4.6655 · 10^−30 | 1.6408 · 10^−12 | 1.2077 · 10^−05 | 0.0536 |
| 20,000 | 0.1 | 2.7013 · 10^−32 + 5.8603 | 3.7074 · 10^−32 | 6.5677 · 10^−15 | 2.8152 · 10^−08 | 0.0012 |
Figure 4. Loss curves for all 16 binary Boolean functions using the FSP PyTorch and MLP PyTorch implementations with a learning rate of 0.1. The graph on the right shows that even after 20,000 epochs, the MLP cannot exactly learn some of the functions.
Spatial complexity and learning times of different implementations of the signal perceptron and of the MLP with one hidden layer when learning all 16 functions of the binary Boolean function space in parallel.
| Metric | SP NumPy | RSP NumPy | RSP PyTorch | FSP PyTorch | MLP PyTorch |
|---|---|---|---|---|---|
| Learnable parameters | 64 | 64 | 64 | 72 | 92 |
| Avg forward/inference time (ms) | 0.0089 | 0.0066 | 0.0412 | 0.0273 | 0.0349 |
| Avg backward/backprop time (ms) | 0.0420 | 0.0323 | 0.1808 | 0.2238 | 0.2360 |
Final loss when learning all 16 functions of the binary Boolean function space, using different implementations of the signal perceptron and the MLP with one hidden layer.
| Epochs | Learning rate | SP NumPy | RSP NumPy | RSP PyTorch | FSP PyTorch | MLP PyTorch |
|---|---|---|---|---|---|---|
| 100 | 0.001 | 1.2122 − 3.9221 | 0.9694 | 0.7269 | 0.8520 | 0.2357 |
| 100 | 0.01 | 0.0262 + 1.0550 | 0.0320 | 0.5925 | 0.3650 | 0.2647 |
| 100 | 0.1 | 1.0533 · 10^−19 + 4.8589 | 8.5732 · 10^−20 | 0.0828 | 0.1835 | 0.2254 |
| 1,000 | 0.001 | 1.2122 − 3.9221 | 0.9694 | 0.7269 | 0.8520 | 0.2357 |
| 1,000 | 0.01 | 3.6959 · 10^−18 + 1.7917 | 4.5429 · 10^−18 | 0.0724 | 0.1837 | 0.2268 |
| 1,000 | 0.1 | 3.2451 · 10^−32 + 4.2776 | 2.5410 · 10^−32 | 1.2004 · 10^−11 | 0.0894 | 0.1934 |
| 10,000 | 0.001 | 5.2071 · 10^−18 + 2.4675 | 6.0695 · 10^−18 | 0.0659 | 0.1848 | 0.2432 |
| 10,000 | 0.01 | 5.7437 · 10^−30 + 6.4925 | 5.4604 · 10^−30 | 4.1244 · 10^−10 | 0.0526 | 0.1873 |
| 10,000 | 0.1 | 3.9144 · 10^−32 + 6.5162 | 2.9243 · 10^−32 | 3.8058 · 10^−12 | 3.2690 · 10^−10 | 0.1245 |
| 20,000 | 0.001 | 7.2465 · 10^−28 + 1.1227 | 6.3567 · 10^−28 | 0.0057 | 0.1796 | 0.2289 |
| 20,000 | 0.01 | 5.3793 · 10^−30 + 8.4984 | 5.4062 · 10^−30 | 3.8469 · 10^−10 | 0.0752 | 0.1859 |
| 20,000 | 0.1 | 3.0273 · 10^−32 + 4.4578 | 3.3691 · 10^−32 | 3.5718 · 10^−12 | 7.1793 · 10^−11 | 0.0556 |
Figure 5. Loss curves of the different architectures when learning the whole binary Boolean space simultaneously with batch gradient descent for 10,000 episodes and a learning rate of 0.1.
The spatial and computational complexity of different architectures for learning MNIST and FashionMNIST.
| Metric | FSP (128 signals) | FSP (512 signals) | MLP, 1 hidden layer (512) | MLP, 2 hidden layers (512, 512) |
|---|---|---|---|---|
| Learnable parameters | | 406,528 | 407,050 | 669,706 |
| Avg forward/inference time (ms) | | 0.0894 | 0.1468 | 0.2142 |
| Avg backward/backprop time (ms) | | 1.0108 | 1.3031 | 1.1778 |
128 signals FSP, 512 signals FSP, MLP of 1 hidden layer (512 nodes), and MLP of 2 hidden layers (512, 512). The results in bold letters indicate the model that has the best spatial and computational complexity.
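The parameter counts reported for the 512-signal FSP and the two MLPs can be reproduced with simple arithmetic. The sketch below is an assumption about the layer layouts, not a description taken from the paper: the FSP is taken to have one frequency weight per input-signal pair and one amplitude per signal-output pair with no biases, the MLPs are taken to be fully connected with biases, and both datasets are taken to have 784 inputs (28 × 28 pixels) and 10 classes. The function names are hypothetical.

```python
def fsp_params(inputs, signals, outputs):
    """Frequency weights (inputs x signals) plus amplitudes (signals x outputs), no biases."""
    return inputs * signals + signals * outputs

def mlp_params(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(fsp_params(784, 512, 10))           # 406528
print(mlp_params([784, 512, 10]))         # 407050
print(mlp_params([784, 512, 512, 10]))    # 669706
```

With four signals or four hidden units on two inputs, the same formulas also give the 12, 17, 72, and 92 parameters reported for the FSP and MLP in the Boolean experiments, which suggests, but does not confirm, that those models had width 4.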
Performance metrics by training different architectures on the MNIST and FashionMNIST datasets.
| Metric | FSP (128 signals) | FSP (512 signals) | MLP, 1 hidden layer (512) | MLP, 2 hidden layers (512, 512) |
|---|---|---|---|---|
| MNIST | | | | |
| Avg training loss | 0.0028 | | 0.2164 | 0.8635 |
| Avg test loss | 0.0805 | | 0.2870 | 0.7757 |
| Avg accuracy | 97.5% | | 89.2% | 67.9% |
| FashionMNIST | | | | |
| Avg training loss | 0.1342 | | 0.5977 | 0.7346 |
| Avg test loss | 0.4034 | | 0.8041 | 1.2000 |
| Avg accuracy | 86.0% | | 68.5% | 53.1% |
Used architectures: 128 signals FSP, 512 signals FSP, MLP of 1 hidden layer (512 nodes), and MLP of 2 hidden layers (512, 512). The results in bold letters indicate the model that achieved the best performance.
Summary table that compares the signal perceptron against the Rosenblatt perceptron, the 1-hidden-layer MLP, and some neurons in the machine learning literature that claim to solve the problem of learning non-linearly separable functions from Boolean function spaces.
| Neuron | Learning method | Parameters (k = 2) | Parameters (k = 3) | Upper bound | Functions learned (k = 2) | Functions learned (k = 3) | k-arity proof: type | k-arity proof: result |
|---|---|---|---|---|---|---|---|---|
| RP (Rosenblatt, | BP | 3 | 4 | ( | - | - | Math | Disproven |
| MLP (Baum, | BP | 9 | 41 | ( | 16/16 | 241/256 | Math | Proven |
| | CBP | 6 | 8 | 2( | 16/16 | 250/256 | Experimental | Disproven |
| GN (Kulkarni and Venayagamoorthy, | BP/PSO | 9 | 11 | 2( | 14/16 | 114/256 | Experimental | Disproven |
| DMN (Ritter and Urcid, | SLMP | 8 | 48 | 2* | 1/16 | 1/256 | Math | Proven |
| SN (Maass and Schmitt, | - | - | - | - | - | - | Math | Disproven |
| SP (ours) | CBP/SLE | 8 | 16 | 2* | 16/16 | 256/256 | Math | Proven |
| RSP (ours) | BP/SLE | 4 | 8 | | 16/16 | 256/256 | Math | Proven |
| FSP (ours) | BP | 8 | 24 | | 16/16 | 256/256 | Math | Proven |
The table is divided into learning method, spatial complexity of each unit, experimental evaluation, and k-arity proof. In the upper bounds used to calculate the spatial complexity, m is the domain size (m = 2 for Boolean functions) and k is the arity, i.e., the number of inputs of each neuron. The symbol − means that the information was missing, not discussed, or not proven by the paper.
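For reference, the denominators 16 and 256 in the experimental columns match the sizes of the binary and ternary Boolean function spaces. Reading those columns as k = 2 and k = 3 is an assumption based on these counts. With domain size m and arity k, the number of distinct functions is:

```latex
\[
\bigl|\{\, f : \{0,\dots,m-1\}^k \to \{0,\dots,m-1\} \,\}\bigr| = m^{m^k},
\qquad
2^{2^1} = 4,\quad 2^{2^2} = 16,\quad 2^{2^3} = 256 .
\]
```

The first two values correspond to the four unary and sixteen binary Boolean functions tabulated earlier.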