Yongping Dan, Zongnan Zhu, Weishou Jin, Zhuo Li.
Abstract
Recently, the Vision Transformer (ViT) has been widely used in image recognition. However, the ViT model stacks 12 encoder layers in sequence, which leads to heavy computation, a large number of parameters, and slow training, and makes the model difficult to deploy on mobile devices. To reduce the computational complexity of the model and improve training speed, a parallel and fast Vision Transformer method for offline handwritten Chinese character recognition is proposed. The method adds parallel encoder branches to the structure of the Vision Transformer model. The parallel modes are two-way, four-way, and seven-way parallel. The original image is fed to the encoder module after flattening and linear embedding. The core step in the encoder is the multihead self-attention mechanism, which can learn the interdependence between image sequence blocks. In addition, data augmentation strategies are used to increase the diversity of the data. In the two-way parallel experiment, the model reaches 98.1% accuracy on the dataset with 43.11 million parameters and 4.32 G FLOPs. Compared with the ViT model, whose parameters and FLOPs are 86 million and 16.8 G, respectively, the two-way parallel model has a 50.1% decrease in parameters and a 34.6% decrease in FLOPs. The method effectively reduces the computational complexity of the model while indirectly improving image recognition speed.
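The abstract names self-attention as the core step of the encoder. The following is a minimal single-head sketch of scaled dot-product self-attention in plain Python, not the authors' implementation: the helper names, the toy tokens, and the identity projection matrices (standing in for learned weights) are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    # Each row of `weights` says how strongly one token attends to every token.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy input: three "patch" tokens of dimension 2 (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned projections
output, weights = self_attention(tokens, identity, identity, identity)
```

A multihead layer would run several such heads with separate projections and concatenate their outputs; that step is omitted here to keep the sketch short.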
Year: 2022 PMID: 36211021 PMCID: PMC9534625 DOI: 10.1155/2022/8255763
Source DB: PubMed Journal: Comput Intell Neurosci
A summary table of offline HCCR related work.
| Algorithm name | Brief methodology | Highlights | Limitations |
|---|---|---|---|
| MCDNN | The model trained eight networks using different datasets, each with four convolutional layers and two fully connected layers. | It is the first model to successfully apply CNN to handwritten Chinese character recognition. | |
| R-CNN and ATR-CNN | R-CNN consists of relaxation convolution layers in which the neurons within a feature map do not share the same convolutional kernel. ATR-CNN further adopts an alternate training strategy, i.e., the weight parameters of a given layer are not updated by backpropagation during a given training epoch. | Relaxation convolution can enhance the learning ability of the neural network. | Replacing the traditional convolutional layer with a relaxation convolution layer cannot further improve the recognition accuracy. |
| BP-NN | The algorithm is improved through the selection of initial weights, excitation function, error function, and so on. | The method improves the speed and accuracy of offline handwritten Chinese character recognition. | The convergence speed is too slow, and training easily falls into a local minimum. |
| HCCR-IncBN | This model takes advantage of the sparse connections of the Inception module, performs convolution operations on the same input feature map at multiple scales, and uses 1 × 1 convolution kernels to compress data multiple times, which increases the depth of the network while reducing the computing resources required. | The model has fewer training parameters, converges faster, and requires only 26 MB of storage for the entire model. | The recognition accuracy of the model is low. |
| SqueezeNet | The proposed model retains small convolution kernels instead of large ones. In addition, a feature fusion algorithm between layers and a softmax function with L2-norm constraints are used. | The model has fewer parameters, trains faster, and is highly portable. | The accuracy of the model drops. |
A summary table of vision transformer related work.
| Algorithm name | Brief methodology | Highlights | Limitations |
|---|---|---|---|
| CrossViT | The architecture consists of a stack of K multiscale transformer encoders. Each multiscale transformer encoder uses two different branches to process image tokens of different sizes and fuses the tokens at the end with an efficient module based on cross-attention of the CLS tokens. | The dual-branch transformer combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. | The FLOPs and the number of model parameters increase. |
| ViT and GCN | First, the scene image is divided into patches, and positional encoding and a vision transformer are used to encode the patches, so that long-range dependencies can be mined. In addition, the scene image is converted into superpixels. | Computing efficiency has been significantly improved. | The dataset is complex. The model is designed for higher-resolution vision tasks. |
| ViT | First, the images under analysis are divided into patches, then converted into sequences by flattening and embedding. To retain position information, position embeddings are added to these patches. Then, the resulting sequence is fed to several multihead attention layers to generate the final representation. | To boost the classification performance, the authors explore several data augmentation strategies to generate additional data for training. | The number of model parameters is large. |
| Vision transformer | In this study, the authors used ViT for the first time to classify breast ultrasound (US) images using different augmentation strategies. The results are reported as classification accuracy and area under the curve (AUC). | The results indicate that the ViT models are comparable to, or even better than, CNNs in the classification of breast US images. | The authors use the strategy of transferring pretrained ViT models for further adaptation. |
| Convolutional vision transformer (CvT) | This is accomplished through two primary modifications: a hierarchy of transformers containing a new convolutional token embedding and a convolutional transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture. | The model has fewer parameters and lower FLOPs. | |
| Re-attention | The model can regenerate the attention maps to increase their diversity at different layers with negligible computation and memory cost. The proposed method makes it feasible to train deeper ViT models with consistent performance improvements via minor modifications to existing ViT models. | The model has minimal computational and memory overhead. | |
Figure 1. The overall architecture of the two-way parallel Vision Transformer.
Figure 2. (a) Transformer encoder module; (b) multihead self-attention head; (c) self-attention head.
Figure 3. Multilayer perceptron head structure.
Figure 4. An example of applying data augmentation to a dataset.
Dataset characteristics.
| Dataset | DHWDB |
|---|---|
| Number of classes | 16 |
| Number of images per class | 1810 to 2575 |
| Image size | 224 × 224 |
| Total number of images in the dataset | 36210 |
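The dataset table gives 224 × 224 images, and the abstract describes flattening and linear embedding before the encoder. The sketch below works out the resulting sequence size under two assumptions that are not stated in this record: 16 × 16 patches and 3 colour channels, as in the standard ViT-Base/16 configuration. `patch_sequence` is a hypothetical helper written for this example.

```python
def patch_sequence(image_size, patch_size, channels):
    """Return (number of patches, length of one flattened patch)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim

# 224 comes from the dataset table; 16 and 3 are assumptions.
num_patches, patch_dim = patch_sequence(224, 16, 3)
print(num_patches, patch_dim)  # 196 768
```

Under these assumptions each image becomes a sequence of 196 flattened patches of length 768, which the linear embedding then projects to the model dimension.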
Figure 5. Examples of samples for each category in the dataset.
Performance of different models on the DHWDB dataset: parameters; FLOPs; accuracy.
| Methods | Number of encoder layers per channel | Epochs | #Params (M) | FLOPs (G) | Acc. (%) |
|---|---|---|---|---|---|
| T-ViT | 3 | 300 | 43.11 | 4.32 | 98.1 |
| T-ViT | 4 | 300 | 57.28 | 5.72 | 98.3 |
| T-ViT | 6 | 300 | 85.62 | 8.52 | 98.6 |
| F-ViT | 2 | 300 | 57.28 | 2.94 | 96.6 |
| F-ViT | 3 | 300 | 85.62 | 4.36 | 97.3 |
| F-ViT | 6 | 300 | 170.63 | 8.61 | 97.7 |
| S-ViT | 2 | 300 | 99.79 | 2.99 | 96.3 |
| S-ViT | 3 | 300 | 148.38 | 4.43 | 97.1 |
| S-ViT | 4 | 300 | 198.98 | 5.86 | 97.0 |
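The parameter column above can be read against the total number of encoder layers in each model. The sketch below assumes that T-ViT, F-ViT, and S-ViT denote the two-way, four-way, and seven-way parallel models; the abstract's two-way figures match the first T-ViT row, while the other two mappings are inferred from the names only. The helper names are hypothetical.

```python
# Assumption: branch counts inferred from the model names, not stated in the table.
BRANCHES = {"T-ViT": 2, "F-ViT": 4, "S-ViT": 7}

# (model, encoder layers per channel, parameters in millions), copied from the table.
ROWS = [
    ("T-ViT", 3, 43.11), ("T-ViT", 4, 57.28), ("T-ViT", 6, 85.62),
    ("F-ViT", 2, 57.28), ("F-ViT", 3, 85.62), ("F-ViT", 6, 170.63),
    ("S-ViT", 2, 99.79), ("S-ViT", 3, 148.38), ("S-ViT", 4, 198.98),
]

def total_layers(model, layers_per_channel):
    """Total encoder layers across all parallel branches."""
    return BRANCHES[model] * layers_per_channel

def params_per_added_layer(rows):
    """Parameter growth (M) per extra encoder layer between consecutive rows of one model."""
    slopes = []
    for (m1, l1, p1), (m2, l2, p2) in zip(rows, rows[1:]):
        if m1 != m2:
            continue  # do not compare across model families
        added = total_layers(m2, l2) - total_layers(m1, l1)
        slopes.append((m1, (p2 - p1) / added))
    return slopes

for model, layers, params in ROWS:
    print(f"{model}: {total_layers(model, layers)} layers in total, {params} M parameters")
for model, slope in params_per_added_layer(ROWS):
    print(f"{model}: about {slope:.2f} M parameters per added layer")
```

Under these assumptions, T-ViT with 4 layers per channel and F-ViT with 2 layers per channel both contain 8 encoder layers, and the table lists 57.28 M parameters for both; each additional encoder layer adds roughly 7 M parameters. The FLOPs column, by contrast, follows the number of layers per channel more closely than the total layer count.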
Parameters and FLOPs of the best proposed T-ViT model compared with other models.
| Model | #Params (M) | FLOPs (G) |
|---|---|---|
| ResNet-101 | 44.7 | 7.9 |
| Swin-B | 88 | 15.4 |
| CrossViT-18 | 43.3 | 9.03 |
| T-ViT | 43.11 | 4.32 |
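The comparison in this table can be expressed as relative reductions. The sketch below applies the usual formula, (baseline − proposed) / baseline, to the values listed above; `relative_reduction` and the dictionary layout are hypothetical names chosen for this example.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is smaller than `baseline`."""
    return (baseline - proposed) / baseline * 100.0

# Values copied from the table: parameters in millions, FLOPs in G.
T_VIT = {"params_m": 43.11, "flops_g": 4.32}
BASELINES = {
    "ResNet-101": {"params_m": 44.7, "flops_g": 7.9},
    "Swin-B": {"params_m": 88.0, "flops_g": 15.4},
    "CrossViT-18": {"params_m": 43.3, "flops_g": 9.03},
}

for name, ref in BASELINES.items():
    p = relative_reduction(ref["params_m"], T_VIT["params_m"])
    f = relative_reduction(ref["flops_g"], T_VIT["flops_g"])
    print(f"{name}: {p:.1f}% fewer parameters, {f:.1f}% fewer FLOPs")
# ResNet-101: 3.6% fewer parameters, 45.3% fewer FLOPs
# Swin-B: 51.0% fewer parameters, 71.9% fewer FLOPs
# CrossViT-18: 0.4% fewer parameters, 52.2% fewer FLOPs
```

Applying the same formula to the ViT baseline quoted in the abstract (86 M parameters, 16.8 G FLOPs) gives about 49.9% and 74.3%, which differs from the percentages stated there; the record does not say how those percentages were derived.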
Figure 6. Validation accuracy of the T-ViT model (3 encoders per channel).