Improved Handwritten Digit Recognition Using Convolutional Neural Networks (CNN)
Savita Ahlawat, Amit Choudhary, Anand Nayyar, Saurabh Singh, Byungun Yoon.
Abstract
Traditional handwriting recognition systems have relied on handcrafted features and a large amount of prior knowledge, and training an optical character recognition (OCR) system on these prerequisites is a challenging task. Research in the handwriting recognition field now centers on deep learning techniques, which have achieved breakthrough performance in the last few years. Still, the rapid growth in the amount of handwritten data and the availability of massive processing power demand further improvement in recognition accuracy and deserve investigation. Convolutional neural networks (CNNs) are very effective at perceiving the structure of handwritten characters/words in ways that help the automatic extraction of distinct features, which makes CNNs the most suitable approach for solving handwriting recognition problems. Our aim in the proposed work is to explore the various design options, such as number of layers, stride size, receptive field, kernel size, padding, and dilation, for CNN-based handwritten digit recognition. In addition, we aim to evaluate various SGD optimization algorithms for improving the performance of handwritten digit recognition. A network's recognition accuracy can be increased by incorporating an ensemble architecture, but ensembles introduce increased computational cost and high testing complexity. Our objective here is therefore to achieve comparable accuracy using a pure CNN architecture without ensembling. Thus, a CNN architecture is proposed that achieves accuracy even better than that of ensemble architectures, along with reduced operational complexity and cost. Moreover, we present an appropriate combination of learning parameters in designing a CNN that leads us to a new absolute record in classifying MNIST handwritten digits. We carried out extensive experiments and achieved a recognition accuracy of 99.87% on the MNIST dataset.
Keywords: OCR; convolutional neural networks; handwritten digit recognition; pre-processing
Year: 2020 PMID: 32545702 PMCID: PMC7349603 DOI: 10.3390/s20123344
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Shallow neural network vs. deep neural network.

| Factors | Shallow Neural Network (SNN) | Deep Neural Network (DNN) |
|---|---|---|
| Architecture | Single hidden layer (needs to be fully connected). | Multiple hidden layers (not necessarily fully connected). |
| Feature extraction | Requires a separate feature extraction process. | Supersedes handcrafted features and works directly on the whole image. |
| Role of features | Emphasizes the quality of features and their extraction process. | Able to automatically detect the important features of an object (an image, a handwritten character, a face, etc.) without any human supervision or intervention. |
| Data requirement | Requires a small amount of data. | Requires a large amount of data. |
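The design options named in the abstract (number of layers, kernel size, stride, padding, dilation) map directly onto the hyperparameters of a standard convolutional layer. As a concrete illustration, the following PyTorch sketch builds a small network using the 8-16-32 filter counts and the k = 5, s = 2, p = 2 layer settings that appear in the three-layer configuration table below; it is a minimal sketch for orientation, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative three-layer CNN for 28 x 28 MNIST digits.

    Filter counts (8-16-32) and the k=5, s=2, p=2 settings mirror one
    configuration from the three-layer table below; this is a sketch,
    not the paper's exact model.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, stride=2, padding=2),    # 28 -> 14
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=5, stride=2, padding=2),   # 14 -> 7
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2),  # 7 -> 4
            nn.ReLU(),
        )
        self.classifier = nn.Linear(32 * 4 * 4, 10)  # ten digit classes

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```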
Figure 1. Typical convolutional neural network architecture.
Figure 2. Receptive field and projective field.
Figure 3. Visualization of a filter of size 5 × 5 with its activation map (input neurons 28 × 28, convolutional layer 24 × 24).
Figure 4. Representation of kernel and stride in a convolutional layer.
Figure 5. Dilated convolution: (a) receptive field of 3 × 3 using 1-dilated convolution; (b) receptive field of 7 × 7 using 2-dilated convolution; (c) receptive field of 15 × 15 using 4-dilated convolution.
Figure 6. Max pooling with a 2 × 2 filter and a stride of 2.
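Figures 3 through 6 all reduce to the standard output-size relation for a (possibly dilated) convolution or pooling window: o = ⌊(i + 2p − d(k − 1) − 1)/s⌋ + 1. A minimal helper for checking the i/p and o/p columns of the configuration tables below (the function name is ours, introduced only for illustration):

```python
def conv_out(i, k, s=1, p=0, d=1):
    """Spatial output size of one convolution dimension.

    i: input size, k: kernel size, s: stride, p: padding, d: dilation.
    """
    return (i + 2 * p - d * (k - 1) - 1) // s + 1

assert conv_out(28, k=5) == 24            # Figure 3: 5 x 5 filter, 28 -> 24
assert conv_out(28, k=5, s=2, p=2) == 14  # table rows below: 28 -> 14
assert conv_out(14, k=5, s=2, p=2) == 7   # 14 -> 7
```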
Figure 7. Sample MNIST handwritten digit images.
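The samples in Figure 7 come from the standard MNIST set, which can be pulled directly through torchvision for anyone reproducing the experiments. A minimal loading sketch; the batch size and normalization constants below are common defaults, not values taken from the paper:

```python
import torch
from torchvision import datasets, transforms

# Mean/std are the commonly quoted MNIST statistics; batch size 64 is
# an arbitrary illustrative choice.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64)
```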
Configuration details and accuracy achieved for the convolutional neural network with three layers. Column key: k = kernel size, s = stride, d = dilation, p = padding, i/p = input size, o/p = output size, r = receptive field; the six rightmost columns give recognition accuracy (%) for each filter configuration (number of filters per layer).

| Model | Layer | k | s | d | p | i/p | o/p | r | 8-16-32 | 6-12-24 | 12-24-32 | 8-16-24 | 8-24-32 | 12-24-28 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | Layer 1 | 5 | 2 | 2 | 2 | 28 | 14 | 5 | 93.76% | 84.76% | 98.76% | 94.08% | 96.12% | 98.08% |
| | Layer 2 | 5 | 2 | 1 | 2 | 14 | 7 | 9 | | | | | | |
| | Layer 3 | 5 | 2 | 1 | 2 | 7 | 4 | 25 | | | | | | |
| M2 | Layer 1 | 5 | 2 | 1 | 2 | 28 | 14 | 5 | 96.04% | 88.91% | 99.00% | 93.80% | 96.12% | 98.48% |
| | Layer 2 | 3 | 2 | 1 | 2 | 14 | 7 | 9 | | | | | | |
| | Layer 3 | 3 | 2 | 1 | 2 | 7 | 4 | 17 | | | | | | |
| M3 | Layer 1 | 5 | 2 | 1 | 2 | 28 | 14 | 5 | 98.96% | 86.88% | 98.72% | 99.28% | 99.60% | |
| | Layer 2 | 5 | 2 | 1 | 2 | 14 | 7 | 13 | | | | | | |
| | Layer 3 | 5 | 2 | 1 | 2 | 7 | 4 | 29 | | | | | | |
| M4 | Layer 1 | 3 | 3 | 1 | 1 | 28 | 10 | 3 | 80.16% | 68.40% | 87.72% | 78.84% | 85.96% | 88.16% |
| | Layer 2 | 3 | 3 | 1 | 1 | 10 | 4 | 9 | | | | | | |
| | Layer 3 | 3 | 3 | 1 | 1 | 4 | 2 | 27 | | | | | | |
| M5 | Layer 1 | 5 | 3 | 1 | 2 | 28 | 10 | 5 | 87.08% | 80.96% | 90.08% | 87.22% | 92.24% | 93.32% |
| | Layer 2 | 3 | 3 | 1 | 1 | 10 | 4 | 11 | | | | | | |
| | Layer 3 | 3 | 3 | 1 | 1 | 4 | 2 | 29 | | | | | | |
| M6 | Layer 1 | 5 | 3 | 1 | 2 | 28 | 10 | 5 | 96.48% | 87.96% | 97.16% | 93.68% | 97.04% | 98.06% |
| | Layer 2 | 5 | 3 | 1 | 2 | 10 | 4 | 17 | | | | | | |
| | Layer 3 | 5 | 3 | 1 | 2 | 4 | 2 | 53 | | | | | | |
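The r column above can be reproduced with the usual receptive-field recursion: each layer grows the field by its dilated kernel extent times the product of all earlier strides. A sketch under the conventional definition (field and step both start at 1); most rows of the tables follow this recursion:

```python
def receptive_fields(layers):
    """Receptive field after each layer.

    layers: list of (k, s, d) tuples, one per convolutional layer.
    """
    r, jump = 1, 1  # field of one input pixel; step between adjacent outputs
    fields = []
    for k, s, d in layers:
        r += d * (k - 1) * jump  # growth = dilated kernel extent x earlier strides
        jump *= s
        fields.append(r)
    return fields

# Model M3 above: k=5, s=2, d=1 in all three layers.
print(receptive_fields([(5, 2, 1), (5, 2, 1), (5, 2, 1)]))  # [5, 13, 29]
```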
Configuration details and accuracy achieved for the convolutional neural network with four layers (column key as in the table above; headings such as 6-12-24-32 give the number of filters in each of the four layers).

| Model | Layer | k | s | d | p | i/p | o/p | r | 6-12-24-32 | 12-24-24-32 | 12-24-24-24 | 6-24-24-24 | 12-24-28-32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | Layer 1 | 3 | 2 | 1 | 1 | 28 | 14 | 3 | 95.36% | 98.34% | 98.48% | 96.56% | 98.80% |
| | Layer 2 | 3 | 2 | 1 | 1 | 14 | 7 | 7 | | | | | |
| | Layer 3 | 3 | 2 | 1 | 1 | 7 | 4 | 15 | | | | | |
| | Layer 4 | 3 | 2 | 1 | 1 | 4 | 2 | 31 | | | | | |
| M2 | Layer 1 | 3 | 2 | 2 | 2 | 28 | 14 | 5 | 84.20% | 93.72% | 91.56% | 89.96% | 95.16% |
| | Layer 2 | 3 | 2 | 2 | 2 | 14 | 7 | 13 | | | | | |
| | Layer 3 | 3 | 2 | 1 | 1 | 7 | 4 | 21 | | | | | |
| | Layer 4 | 3 | 2 | 1 | 1 | 4 | 2 | 37 | | | | | |
| M3 | Layer 1 | 5 | 2 | 2 | 2 | 28 | 14 | 9 | 77.16% | 94.60% | 90.88% | 85.48% | 94.04% |
| | Layer 2 | 3 | 2 | 2 | 2 | 14 | 7 | 17 | | | | | |
| | Layer 3 | 3 | 2 | 1 | 1 | 7 | 4 | 25 | | | | | |
| | Layer 4 | 3 | 2 | 1 | 1 | 4 | 2 | 41 | | | | | |
| M4 | Layer 1 | 5 | 1 | 2 | 2 | 28 | 17 | 9 | 98.20% | 99.12% | 99.44% | 99.04% | 99.60% |
| | Layer 2 | 3 | 2 | 2 | 2 | 17 | 7 | 13 | | | | | |
| | Layer 3 | 3 | 2 | 1 | 1 | 7 | 4 | 17 | | | | | |
| | Layer 4 | 3 | 2 | 1 | 1 | 4 | 2 | 25 | | | | | |
| M5 | Layer 1 | 5 | 1 | 2 | 2 | 28 | 28 | 9 | 98.60% | 99.64% | 99.64% | 99.20% | |
| | Layer 2 | 5 | 2 | 1 | 2 | 28 | 14 | 13 | | | | | |
| | Layer 3 | 3 | 2 | 1 | 1 | 14 | 7 | 17 | | | | | |
| | Layer 4 | 3 | 2 | 1 | 1 | 7 | 4 | 27 | | | | | |
Figure 8. Receptive fields and recognition accuracies for CNN: (a) architecture having three layers; (b) architecture having four layers.
Recognition accuracy (%) with different optimizers.

| Model | Momentum | Adam | Adagrad | Adadelta |
|---|---|---|---|---|
| CNN_3L | 99.76% | 99.89% | 98.67% | 99.77% |
| CNN_4L | 99.76% | 99.35% | 98% | 99.73% |
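All four optimizers compared above ship with torch.optim. A sketch of how such a sweep could be run, reusing the SmallCNN model and train_loader sketched earlier; the learning rates are illustrative defaults, not the paper's settings:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer):
    """One pass over the training data with cross-entropy loss."""
    criterion = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

def make_optimizer(name, params):
    # Learning rates here are illustrative defaults, not the paper's values.
    if name == "Momentum":
        return torch.optim.SGD(params, lr=0.01, momentum=0.9)
    if name == "Adam":
        return torch.optim.Adam(params, lr=1e-3)
    if name == "Adagrad":
        return torch.optim.Adagrad(params, lr=0.01)
    if name == "Adadelta":
        return torch.optim.Adadelta(params)  # adaptive step sizes
    raise ValueError(name)

for name in ["Momentum", "Adam", "Adagrad", "Adadelta"]:
    model = SmallCNN()  # fresh network per optimizer for a fair comparison
    train_one_epoch(model, train_loader, make_optimizer(name, model.parameters()))
```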
Comparison of the proposed CNN architecture for handwritten numeral recognition with other techniques.

| Reference | Approach | Database | Features | Accuracy (%)/Error Rate |
|---|---|---|---|---|
| [ ] | CNN | MNIST | Pixel based | 0.23% |
| [ ] | CNN | MNIST | Pixel based | 0.19% |
| [ ] | CNN | MNIST | Pixel based | 0.53% |
| [ ] | CNN | MNIST | Pixel based | 0.21% |
| [ ] | CNN | MNIST | Pixel based | 0.17% |
| [ ] | Deep learning | Chars74K | Pixel based | 88.89% (GoogLeNet), 77.77% (AlexNet) |
| [ ] | CNN | Urdu Nasta’liq handwritten dataset (UNHD) | Pixel and geometrical based | 98.3% |
| Proposed approach | CNN | MNIST | Pixel and geometrical based | 99.89% |
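Note on units: the first five MNIST entries report error rates rather than accuracies. Converting via accuracy = 100% − error rate, the strongest of them (0.17% error) corresponds to 99.83% accuracy, which the proposed architecture's reported 99.89% exceeds.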