Jiahao Su, Jingling Li, Xiaoyu Liu, Teresa Ranadive, Christopher Coley, Tai-Ching Tuan, Furong Huang.
Abstract
We propose a framework of tensorial neural networks (TNNs) that extends existing linear layers on low-order tensors to multilinear operations on higher-order tensors. TNNs have three advantages over existing networks. First, TNNs naturally apply to higher-order data without flattening, which preserves the data's multi-dimensional structure. Second, compressing a pre-trained network into a TNN yields a model with similar expressive power but fewer parameters. Finally, TNNs interpret advanced compact designs of network architectures, such as bottleneck modules and interleaved group convolutions. To learn TNNs, we derive their backpropagation rules using a novel suite of generalized tensor algebra. With backpropagation, we can learn TNNs either from scratch or from pre-trained models using knowledge distillation. Experiments on VGG, ResNet, and Wide-ResNet demonstrate that TNNs outperform the state-of-the-art low-rank methods on a wide range of backbone networks and datasets.
Keywords: deep learning; model compression; neural networks; tensor decomposition; tensor networks
Year: 2022 PMID: 35355829 PMCID: PMC8959219 DOI: 10.3389/frai.2022.728761
Source DB: PubMed Journal: Front Artif Intell ISSN: 2624-8212
Figure 1. A toy example of invariant structures. The periodic and modulated structures are exposed by exploiting the low-rank structure in the reshaped matrix.
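The idea in this caption can be illustrated with a minimal NumPy sketch. The signal, its period, and the amplitudes are hypothetical values chosen for illustration, not the data shown in the figure. A periodic signal, and a modulated one whose repetitions are rescaled, both become rank-1 matrices once reshaped so that each row holds one period.

```python
import numpy as np

# Hypothetical toy signals (not the figure's data):
# a length-4 pattern repeated 6 times, and a modulated copy whose
# repetitions are scaled by different amplitudes.
pattern = np.array([1.0, -2.0, 3.0, 0.5])
periodic = np.tile(pattern, 6)              # shape (24,)
amplitudes = np.linspace(1.0, 2.0, 6)
modulated = np.kron(amplitudes, pattern)    # shape (24,)


def reshaped_rank(signal, period):
    """Reshape a 1D signal into (num_rows, period) and return the matrix rank."""
    return int(np.linalg.matrix_rank(signal.reshape(-1, period)))


# With the correct period, every row is a multiple of `pattern`.
periodic_rank = reshaped_rank(periodic, 4)
modulated_rank = reshaped_rank(modulated, 4)

# With a mismatched period, the rows no longer align with the pattern.
mismatched_rank = reshaped_rank(periodic, 3)
```

With the matching period both reshaped matrices have rank 1, so a single pair of vectors describes all 24 entries; a mismatched reshape does not expose this structure.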
Figure 2. Tensor diagrams of (A) a scalar a ∈ ℝ, (B) a vector v, (C) a matrix M, and (D) a tensor.
Primitive tensor operations.
| | | |
|---|---|---|
| mode-( | | |
| mode-( | | |
| mode-( | | |
Figure 3. Diagrams of primitive tensor operations. Panels (A), (B), and (C) each illustrate one primitive operation between two given tensors using the tensor diagrams above.
Figure 4. Generalized tensor operation diagrams. Generalized tensor operations apply one or more primitive tensor operations to two or more tensors. The above tensor diagrams illustrate three different generalized tensor operations, which represent (A) a 1D-convolutional layer from a neural network, (B) a CP-tensor decomposition, and (C) a tensor-ring decomposition.
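The three generalized operations named in this caption can be sketched with `np.einsum`. This is an illustrative sketch under stated assumptions, not the paper's implementation: all shapes, ranks, and variable names are hypothetical, and the convolution is written as the "valid" cross-correlation commonly used in deep learning.

```python
import numpy as np

rng = np.random.default_rng(0)

# (B) CP decomposition: a 3rd-order tensor as a sum of R rank-one terms.
I, J, K, R = 4, 5, 6, 3
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))
cp_tensor = np.einsum("ir,jr,kr->ijk", A, B, C)

# (C) Tensor-ring decomposition: three cores contracted in a cycle,
# so the first and last rank indices are tied together.
G1 = rng.normal(size=(R, I, R))
G2 = rng.normal(size=(R, J, R))
G3 = rng.normal(size=(R, K, R))
tr_tensor = np.einsum("aib,bjc,cka->ijk", G1, G2, G3)

# (A) 1D convolutional layer: convolve along the length axis and
# contract over the input-channel axis.
X, S, T, H = 10, 2, 3, 3                    # length, in/out channels, filter size
signal = rng.normal(size=(X, S))
kernel = rng.normal(size=(H, S, T))
# windows[x, s, h] == signal[x + h, s]
windows = np.lib.stride_tricks.sliding_window_view(signal, H, axis=0)
conv_out = np.einsum("xsh,hst->xt", windows, kernel)
```

Each `einsum` string plays the role of a tensor diagram: a repeated index that is absent from the output corresponds to an edge that is contracted.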
Figure 5. Tensor diagrams of convolutional layers. (A) The traditional convolutional layer is the building block for CNNs; (B–E) the tensorial convolutional layers are building blocks for TNNs.
Figure 6. Relationship between NNs and TNNs. Suppose the class of NNs and TNNs have the same architecture (i.e., only the tensor operation at each layer is different), and f is the target concept. (1) Learning of a NN with q parameters results in g that is closest to f in , while learning of a TNN with q parameters results in h that is closest to f in . Apparently, h is closer to f than g. (2) Compression of a pre-trained NN to NNs with p parameters (p ≤ q) results in g that is closest to g in , while compression of g to TNNs with p parameters results in h that is closest to g in . Apparently, the compressed TNN h is closer to g than the compressed NN g.
Complexities of traditional convolutional layer and various tensorial convolutional layers.
| | | | | |
|---|---|---|---|---|
| original | | | | |
| mCP | | | | |
| mTK | (2 | (2 | (2 | (2 |
| mTT | | | | |
| mTR | | | | |
Suppose X = Y = X′ = Y′ = D, S = T = N, H = W = k, and D≫k (cf. section 4).
Remark: The number of FLOPs does not accurately reflect the actual running time on GPUs, as the existing CUDA library cannot fully exploit the parallelism available in general tensor operations.
Figure 7. An interleaved group module without nonlinearity (A) is expressed as a tensorial layer (B).
Figure 8. A bottleneck module without nonlinearity is expressed as a Tucker decomposition of the original layer.
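The equivalence stated in this caption can be checked numerically. The sketch below is a hedged illustration, not the paper's code: it assumes a bottleneck of the form 1×1 convolution, k×k convolution, 1×1 convolution with no nonlinearity, stride, padding, or bias, and all sizes and names (`P`, `core`, `Q`, `conv2d`) are hypothetical. Under those assumptions the three layers collapse into one k×k convolution whose kernel is a Tucker-style product of the three factors over the channel modes.

```python
import numpy as np


def conv2d(x, kernel):
    """'Valid' cross-correlation.

    x: (X, Y, S) input; kernel: (H, W, S, T) -> output (X-H+1, Y-W+1, T).
    """
    H, W = kernel.shape[:2]
    # windows[x, y, s, h, w] == x[x + h, y + w, s]
    windows = np.lib.stride_tricks.sliding_window_view(x, (H, W), axis=(0, 1))
    return np.einsum("xyshw,hwst->xyt", windows, kernel)


rng = np.random.default_rng(0)
S, T, R1, R2, k = 8, 8, 3, 4, 3             # hypothetical channel sizes and ranks

P = rng.normal(size=(1, 1, S, R1))          # 1x1 conv: S -> R1 channels
core = rng.normal(size=(k, k, R1, R2))      # kxk conv: R1 -> R2 channels
Q = rng.normal(size=(1, 1, R2, T))          # 1x1 conv: R2 -> T channels
x = rng.normal(size=(6, 6, S))

# Bottleneck module: three convolutions applied in sequence.
bottleneck_out = conv2d(conv2d(conv2d(x, P), core), Q)

# Single layer whose kernel is the product of the three factors.
full_kernel = np.einsum("sa,hwab,bt->hwst", P[0, 0], core, Q[0, 0])
single_out = conv2d(x, full_kernel)

outputs_match = bool(np.allclose(bottleneck_out, single_out))
bottleneck_params = P.size + core.size + Q.size
full_params = full_kernel.size
```

In this toy setting the factored form stores 164 parameters against 576 for the full kernel, which is the kind of saving the compression experiments rely on.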
Figure 9. Test error curves for sequential knowledge distillation (Seq-KD) vs. end-to-end knowledge distillation (E2E-KD) on ResNet-32 for CIFAR-10. Both approaches use layer-wise decomposition (Layer-Decomp) for initialization.
Test accuracy of ResNet-32 on CIFAR-10 — comparison between end-to-end knowledge distillation (E2E-KD) using low-rank compression (NN-C) against sequential knowledge distillation (Seq-KD) with our TNN-based compression (TNN-C).
| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| NN-SVD (Denton et al.) | 83.09 | 87.27 | 89.58 | 90.85 | TNN-TR (Wang et al.) | - | 80.80 | - | 90.60 |
| NN-CP (Denton et al.) | 84.02 | 86.93 | 88.75 | 88.75 | TNN-mCP | 85.7 | 89.86 | | - |
| NN-TK (Kim et al.) | 83.57 | 86.00 | 88.03 | 89.35 | TNN-TK | 61.06 | 71.34 | 81.59 | 87.11 |
| NN-TT (Garipov et al.) | 77.44 | 82.92 | 84.13 | 86.64 | TNN-mTT | 78.95 | 84.26 | 87.89 | - |
Test accuracy of ResNet-32 on CIFAR-10.
Cited from Wang et al.
The architecture is proposed as a baseline in Garipov et al.
The original ResNet-32 achieves 93.2% test accuracy with 0.46M parameters (He et al.).
The bold number indicates the best performance in the table.
Test accuracy of ResNet-32 on CIFAR-10 — comparison between sequential knowledge distillation (Seq-KD) against end-to-end knowledge distillation (E2E-KD) using NN-C.
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| NN-SVD (Denton et al.) | 74.04 | | 85.28 | | | 89.58 | | 90.85 |
| NN-CP (Denton et al.) | 83.19 | | | 86.93 | | 88.75 | | 88.75 |
| NN-TK (Kim et al.) | 80.11 | | | 86.00 | | 88.03 | | 89.35 |
| NN-TT (Garipov et al.) | | 77.44 | | 82.92 | | 84.13 | | 86.64 |
The original ResNet-32 achieves 93.2% accuracy with 0.46M parameters (He et al.).
The bold numbers indicate the better option between sequential knowledge distillation (Seq-KD) and end-to-end knowledge distillation (E2E-KD) for each setting.
Test accuracy of ResNet-32 on CIFAR-10 — comparison between sequential knowledge distillation (Seq-KD) for both baseline low-rank compression (NN-C) and our TNN-based compression (TNN-C).
| | | | | | |
|---|---|---|---|---|---|
| NN-CP (Denton et al.) | 83.19 | 88.50 | TNN-mCP | | |
| NN-TK (Kim et al.) | | | TNN-mTK | 71.34 | 81.59 |
| NN-TT (Garipov et al.) | 80.77 | 87.08 | TNN-mTT | | |
The original ResNet-32 achieves 93.2% accuracy with 0.46M parameters (He et al.).
The bold numbers indicate the better option between traditional low-rank compression and our TNN-based compression.
Test accuracy of LeNet-5 on MNIST.
| | | | |
|---|---|---|---|
| TNN-mCP | 97.21 | 97.92 | 98.65 |
| TNN-mTK | 97.71 | 98.56 | 98.52 |
| TNN-mTT | 97.69 | 98.43 | 98.63 |
We compress all fully-connected layers using .
Test accuracy of ResNet-32 on CIFAR-10—comparison between sequential knowledge distillation (Seq-KD) against learning from scratch (Learn-Scratch) using our TNNs.
| | | | | | | |
|---|---|---|---|---|---|---|
| TNN-mCP | 85.70 | 89.86 | 91.28 | 81.41 | 82.12 | 82.93 |
| TNN-mTK | 61.60 | 71.34 | 81.59 | 60.65 | 61.46 | 65.75 |
| TNN-mTT | 78.95 | 84.26 | 87.89 | 79.95 | 81.82 | 83.08 |
The original ResNet-32 achieves 93.2% accuracy with 0.46M parameters (He et al.).
Test accuracy of Wide-ResNet-28-10 on CIFAR-100.
| | | | | |
|---|---|---|---|---|
| NN-TT (Garipov et al.) | 37.02% | 54.65% | 52.69% | 51.42% |
| NN-CP (Denton et al.) | | | 56.9% | 64.83% |
| | 0.33% | | 0.66% | |
| TNN-mTT | 61.67% | | 66.82% | |
We compare .
The bold numbers indicate the best performance under the same compression rate.
Top-1 test accuracy of ResNet-50 on ImageNet.
| | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NN-CP | 57.86% | 64.17% | 69.37% | 71.52% | 72.08% | 72.44% | TNN-mCP | 72.65% | 73.76% | 74.03% | 75.00% | 75.31% | 77.31% |
| NN-TT | 56.82% | 62.23% | 65.54% | 66.21% | 66.90% | 66.92% | TNN-mTT | 69.27% | 73.04% | 73.51% | 73.50% | 73.87% | 74.14% |
| NN-TR | 56.59% | 62.97% | 69.59% | 71.61% | 73.04% | 73.21% | TNN-mTR | 67.49% | 73.23% | 74.12% | 75.01% | 75.32% | 75.16% |
We compare .
Performance of TNNs vs. NNs counterparts on CIFAR-10.
| | | | | | | |
|---|---|---|---|---|---|---|
| Train | 100% | 100% | 100% | 100% | 100% | 100% |
| Test | 93.68% | 92.64% | 95.09% | 95.83% | 91.79% | 92.49% |
NN stands for the uncompressed model proposed by the original paper. All models are trained from scratch (i.e., without reference models).