| Literature DB >> 36262120 |
Yongping Dan, Zongnan Zhu, Weishou Jin, Zhuo Li.
Abstract
The Transformer shows good prospects in computer vision. However, the Swin Transformer model has the disadvantage of a large number of parameters and high computational effort. To effectively solve these problems of the model, a simplified Swin Transformer (S-Swin Transformer) model was proposed in this article for handwritten Chinese character recognition. The model simplifies the initial four hierarchical stages into three hierarchical stages. In addition, the new model increases the size of the window in the window attention; the number of patches in the window is larger; and the perceptual field of the window is increased. As the network model deepens, the size of patches becomes larger, and the perceived range of each patch increases. Meanwhile, the purpose of shifting the window's attention is to enhance the information interaction between the window and the window. Experimental results show that the verification accuracy improves slightly as the window becomes larger. The best validation accuracy of the simplified Swin Transformer model on the dataset reached 95.70%. The number of parameters is only 8.69 million, and FLOPs are 2.90G, which greatly reduces the number of parameters and computation of the model and proves the correctness and validity of the proposed model. ©2022 Dan et al.Entities:
Keywords: Handwritten Chinese character recognition; Shifting the window’s attention; Simplified Swin Transformer; Window attention
Year: 2022 PMID: 36262120 PMCID: PMC9575930 DOI: 10.7717/peerj-cs.1093
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1: Complete S-Swin Transformer model architecture (except for the Stage 4 part).
Figure 2: S-Swin Transformer block layer.
Figure 3: The patch size change in the model.
Figure 4: (A) Multi-head attention. (B) Self-attention process.
Dataset characteristics.
| Dataset | Classes | Total images | Training ratio |
|---|---|---|---|
| T-HWDB1.1 | 300 | 104105 | 80% |
Figure 5: Sample examples of individual categories in the dataset.
Figure 6: Each column represents the same character written by different people.
Parameter settings.
| Description | Value |
|---|---|
| Input | 224×224 |
| Learning rate | 0.0001 |
| Batch size | 8 |
| Dropout | 0.1 |
| Epochs | 150 |
AlexNet and VGG16 parameter settings.
| Model | Learning rate | Batch size | Dropout | Epochs |
|---|---|---|---|---|
| AlexNet | 0.001 | 8 | 0.1 | 200 |
| VGG16 | 0.001 | 8 | 0.1 | 200 |
Experimental results (Validation Accuracy, Number of Parameters, FLOPs).
| Model | Window_size | Accuracy (%) | Parameter (M) | FLOPs (G) |
|---|---|---|---|---|
| AlexNet | N/A | 95.10 | 15.19 | 0.30 |
| VGG16 | N/A | 95.10 | 135.48 | 15.40 |
| Swin transformer | 7×7 | 95.10 | 27.70 | 4.30 |
| S-Swin transformer | 7×7 | 95.40 | 8.69 | 2.90 |
| Swin transformer | 14×14 | 95.40 | 27.70 | 4.30 |
| S-Swin transformer | 14×14 | 95.70 | 8.69 | 2.90 |
Figure 7: The validation accuracy and iteration number curves.
Figure 8: The change curve of the loss function with the number of iterations in the verification process.