Inzamam Mashood Nasir1, Muhammad Attique Khan1, Mussarat Yasmin2, Jamal Hussain Shah2, Marcin Gabryel3, Rafał Scherer3, Robertas Damaševičius4.
Abstract
Documents are stored in digital form across many organizations. Printing this volume of data and filing it in folders instead of storing it digitally is impractical, uneconomical, and ecologically unsound. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on a deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logos, and handwritten notes. The major steps of the proposed technique are data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique that balances the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and a Pearson correlation coefficient-based technique then selects the most informative features from the fused vector while removing redundant ones. The proposed technique is tested on the Tobacco3482 dataset, where it achieves a classification accuracy of 93.1% using a cubic support vector machine classifier, supporting the validity of the proposed technique.
Keywords: data augmentation; deep learning; document classification; feature selection; imbalanced dataset
Year: 2020 PMID: 33261136 PMCID: PMC7730850 DOI: 10.3390/s20236793
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Detailed model of the proposed method.
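The fusion and selection steps named in the abstract can be illustrated with a minimal sketch. It assumes serial fusion (concatenating the AlexNet and VGG19 feature vectors of each sample) and a greedy rule that drops any feature whose absolute Pearson correlation with an already kept feature reaches a threshold. The function names, the `0.95` threshold, and the toy vectors are hypothetical; the record does not state the exact selection rule.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)

def fuse(feats_a, feats_b):
    """Serial fusion (assumed): concatenate both feature vectors per sample."""
    return [a + b for a, b in zip(feats_a, feats_b)]

def drop_redundant(samples, threshold=0.95):
    """Keep a feature column only if |r| with every kept column is below threshold."""
    columns = list(zip(*samples))
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept

# Toy stand-ins for per-sample DCNN features (4 samples, 2 features each).
alexnet = [[1.0, 0.2], [2.0, 0.9], [3.0, 0.1], [4.0, 0.7]]
vgg19 = [[2.0, 5.0], [4.0, 3.0], [6.0, 4.0], [8.0, 1.0]]

fused = fuse(alexnet, vgg19)   # 4 features per sample
kept = drop_redundant(fused)   # column 2 duplicates column 0 up to scale
print(kept)
```

In this toy input the third fused feature is an exact multiple of the first, so it is the only one removed. The surviving columns would then be passed to the classifier.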
Nomenclature of variables used in the definitions and equations.
| Variable | Description | Variable | Description |
|---|---|---|---|
| | Threshold value | | Sum of images in the |
| | Difference between the threshold and the sum of a single class | | |
| | Tobacco3482 dataset | | RVL-CDIP dataset |
| | Balanced dataset | | Input from the previous neuron |
| | Output of the current neuron | | Weight of the connection between |
| | Activation function | | DCNN features of AlexNet |
| | DCNN features of VGG19 | | The merit |
| | Feature-classification correlations | | Feature-feature correlations |
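The nomenclature lists a merit together with feature-classification and feature-feature correlations. These are the ingredients of the standard correlation-based feature selection merit heuristic, so the sketch below assumes that form: for a subset of k features, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. Whether the paper uses exactly this expression is an assumption, and the function name and inputs are hypothetical.

```python
from math import sqrt

def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """Merit of a k-feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff)."""
    if k <= 0:
        raise ValueError("k must be positive")
    denominator = sqrt(k + k * (k - 1) * mean_feature_feature_corr)
    return (k * mean_feature_class_corr) / denominator

# Same relevance to the class label, different redundancy between features.
merit_redundant = cfs_merit(10, 0.5, 0.9)
merit_diverse = cfs_merit(10, 0.5, 0.1)
print(merit_redundant < merit_diverse)
```

Under this heuristic, a subset whose features are strongly correlated with each other scores lower than an equally relevant subset of weakly correlated features, which is the sense in which redundant features are penalized.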
Figure 2. Flow diagram of data augmenter.
Dataset before and after applying the data augmentation algorithm.
| Classes in Tobacco3482 | # of Images before Augmentation | # of Images after Augmentation | Classes in RVL-CDIP |
|---|---|---|---|
| Advertisement | 230 | 620 | Advertisement |
| Email | 599 | 620 | Email |
| Form | 431 | 620 | Form |
| Letter | 567 | 620 | Letter |
| Memo | 620 | 620 | Memo |
| News | 188 | 620 | News Article |
| Note | 201 | 620 | Handwritten |
| Report | 265 | 620 | Scientific Report |
| Resume | 180 | 620 | Resume |
| Scientific | 261 | 620 | Scientific Publication |
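The balancing shown in the table can be sketched as follows. The sketch assumes that the threshold is the size of the largest class (620, the Memo class, which is the value every class reaches after augmentation) and that each class is topped up with randomly chosen images from its mapped RVL-CDIP class. Both the random sampling and the function names are assumptions, not details stated in this record.

```python
import random

def plan_augmentation(class_counts):
    """Return (threshold, deficits); the threshold is assumed to be the largest class size."""
    threshold = max(class_counts.values())
    deficits = {name: threshold - n for name, n in class_counts.items()}
    return threshold, deficits

def top_up(primary, secondary_pool, deficit, seed=0):
    """Append `deficit` randomly chosen items from the secondary pool (assumed strategy)."""
    rng = random.Random(seed)
    return primary + rng.sample(secondary_pool, deficit)

# A few rows copied from the table above (images before augmentation).
counts = {"Advertisement": 230, "Form": 431, "Memo": 620, "News": 188, "Resume": 180}
threshold, deficits = plan_augmentation(counts)
print(threshold, deficits)

# Toy file lists standing in for one Tobacco3482 class and its RVL-CDIP counterpart.
balanced = top_up(["t1", "t2"], ["r1", "r2", "r3", "r4"], deficit=2)
print(len(balanced))
```

With these counts, the Memo class needs no additional images, while Advertisement needs 390 and News needs 432.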
Figure 3. Structure of AlexNet Model.
Figure 4. Structure of VGG19 Model.
Figure 5. Labeled outputs of the proposed technique.
Figure 6. Sample images from Tobacco3482 dataset (one image per class). Left to right: Advertisement, Email, Form, Memo, News, Letter, Note, Report, Resume, and Scientific.
Figure 7. Confusion matrices for the Tobacco3482 dataset: (a) AlexNet, (b) VGG19, (c) proposed method on the original dataset, and (d) proposed method on the augmented dataset.
Comparison of classification accuracy, false-negative rate (FNR), and Training Time on Tobacco3482 Dataset. Best values are shown in bold.
| Method | AlexNet | VGG-19 | Proposed (Original Dataset) | Proposed (Augmented Dataset) | Accuracy (%) | FNR (%) | Training Time (s) |
|---|---|---|---|---|---|---|---|
| C-SVM | √ | | | | | | |
| | | √ | | | | | |
| | | | √ | | | | |
| | | | | √ | | | |
| Linear Discriminant | √ | | | | 79.7 | 20.3 | 772.5 |
| | | √ | | | - | - | - |
| | | | √ | | 82.4 | 17.6 | 593.2 |
| | | | | √ | 84.0 | 16.0 | 659.9 |
| L-SVM | √ | | | | 81.6 | 18.4 | 1731.7 |
| | | √ | | | 79.0 | 21.0 | 2198.9 |
| | | | √ | | 81.8 | 18.2 | 971.3 |
| | | | | √ | 84.3 | 15.7 | 1170.2 |
| Q-SVM | √ | | | | 89.6 | 10.4 | 742.2 |
| | | √ | | | 87.1 | 12.9 | 1996.0 |
| | | | √ | | 87.4 | 12.6 | 582.6 |
| | | | | √ | 91.8 | 8.2 | 625.2 |
| F-KNN | √ | | | | 87.1 | 12.9 | 846.8 |
| | | √ | | | 83.7 | 16.3 | 1720.9 |
| | | | √ | | 85.0 | 15.0 | 742.6 |
| | | | | √ | 89.5 | 10.5 | 872.9 |
| M-KNN | √ | | | | 73.8 | 26.2 | 744.9 |
| | | √ | | | 65.5 | 34.5 | 1920.3 |
| | | | √ | | 73.9 | 26.1 | 621.1 |
| | | | | √ | 76.5 | 23.5 | 767.4 |
| C-KNN | √ | | | | 74.0 | 26.0 | 1604.8 |
| | | √ | | | 65.9 | 34.1 | 4147.5 |
| | | | √ | | 72.8 | 27.2 | 598.4 |
| | | | | √ | 76.4 | 23.6 | 719.1 |
| W-KNN | √ | | | | 87.1 | 12.9 | 951.4 |
| | | √ | | | 83.0 | 17.0 | 2393.5 |
| | | | √ | | 84.3 | 15.7 | 687.2 |
| | | | | √ | 88.7 | 11.3 | 746.9 |
| Subspace Discriminant | √ | | | | 89.5 | 10.3 | 5305.0 |
| | | √ | | | 87.5 | 12.5 | 6304.3 |
| | | | √ | | 88.3 | 11.7 | 1716.8 |
| | | | | √ | 89.7 | 10.3 | 2079.2 |
| Subspace KNN | √ | | | | 87.0 | 13.0 | 2498.6 |
| | | √ | | | 83.2 | 16.8 | 2508.9 |
| | | | √ | | 86.9 | 13.1 | 1958.2 |
| | | | | √ | 89.4 | 10.6 | 2398.9 |
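In the rows checked below, the FNR column equals 100 minus the accuracy, both in percent. The short sketch makes that relationship explicit; the function name is hypothetical, and the sample pairs are copied from the Linear Discriminant, Q-SVM, and M-KNN rows of the table.

```python
def fnr_from_accuracy(accuracy_pct):
    """False-negative rate taken as the complement of accuracy (both in percent)."""
    return round(100.0 - accuracy_pct, 1)

# (accuracy, reported FNR) pairs copied from the table above.
reported = [(79.7, 20.3), (82.4, 17.6), (91.8, 8.2), (65.5, 34.5)]
for accuracy, fnr in reported:
    print(accuracy, fnr, fnr_from_accuracy(accuracy) == fnr)
```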
Classification results after feature fusion. Best values are shown in bold.
| Classifier | Sensitivity (%) | Precision (%) | AuC | FNR (%) | Accuracy (%) | Training Time (s) |
|---|---|---|---|---|---|---|
| C-SVM | | | | | | 3037.7 |
| Linear Discriminant | 81.2 | 81.3 | 89.7 | 18.7 | 81.3 | 3055.7 |
| L-SVM | 84.5 | 84.8 | 98.1 | 15.6 | 84.4 | 2861.1 |
| Q-SVM | 90.3 | 90.3 | 99.0 | 9.70 | 90.3 | 2989.4 |
| F-KNN | 86.7 | 87.0 | 92.6 | 13.2 | 86.8 | 2176.9 |
| M-KNN | 75.0 | 76.3 | 95.5 | 24.9 | 75.1 | 2176.7 |
| C-KNN | 74.5 | 76.0 | 95.3 | 25.5 | 74.5 | 5307.6 |
| W-KNN | 86.9 | 87.4 | 98.5 | 13.0 | 87.0 | |
| Subspace Discriminant | 86.1 | 86.2 | 98.5 | 13.7 | 86.3 | 8794.6 |
| Subspace KNN | 86.9 | 87.0 | 95.5 | 13.0 | 87.0 | 3876.5 |
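For readers who want to see how measures of this kind are typically derived, the sketch below computes them from a multi-class confusion matrix. It assumes macro averaging (the unweighted mean over classes) for sensitivity and precision, and takes FNR as 100 minus accuracy, which matches the reported rows. The averaging scheme, the function name, and the toy matrix are assumptions; the record does not say how the authors aggregated per-class values.

```python
def macro_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    recalls, precisions = [], []
    for c in range(n):
        tp = confusion[c][c]
        actual = sum(confusion[c])                          # samples truly in class c
        predicted = sum(confusion[r][c] for r in range(n))  # samples predicted as c
        recalls.append(tp / actual if actual else 0.0)
        precisions.append(tp / predicted if predicted else 0.0)
    accuracy = 100.0 * sum(confusion[c][c] for c in range(n)) / total
    return {
        "sensitivity": 100.0 * sum(recalls) / n,    # macro-averaged recall
        "precision": 100.0 * sum(precisions) / n,   # macro-averaged precision
        "accuracy": accuracy,
        "fnr": 100.0 - accuracy,
    }

# Toy two-class confusion matrix, not taken from the paper.
print(macro_metrics([[8, 2], [1, 9]]))
```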
Analysis of proposed method on original data. Best values are shown in bold.
| Method | Min (%) | Avg (%) | Max (%) | | | |
|---|---|---|---|---|---|---|
| C-SVM | | | | | | |
| LD | 79.4 | 80.90 | 82.4 | 1.5 | 1.0606 | 80.9 ± 2.079 (±2.57%) |
| L-SVM | 78.3 | 80.05 | 81.8 | 1.75 | 1.2374 | 80.05 ± 2.425 (±3.03%) |
| Q-SVM | 84.8 | 86.10 | 87.4 | 1.3 | 0.9192 | 86.1 ± 1.802 (±2.09%) |
| F-KNN | 83.2 | 84.10 | 85.0 | 0.9 | 0.6363 | 84.1 ± 1.247 (±1.48%) |
| M-KNN | 70.6 | 72.25 | 73.9 | 1.65 | 1.6670 | 72.25 ± 2.87 (±3.17%) |
| C-KNN | 71.1 | 71.95 | 72.8 | 0.85 | 0.6010 | 71.95 ± 1.178 (±1.64%) |
| W-KNN | 81.6 | 82.95 | 84.3 | 1.35 | 0.9545 | 82.95 ± 1.871 (±2.26%) |
| ESDA | 85.4 | 86.85 | 88.3 | 1.45 | 1.0253 | 86.85 ± 2.010 (±2.31%) |
| ESKNN | 83.2 | 85.05 | 86.9 | 1.85 | 1.3081 | 85.05 ± 2.564 (±3.01%) |
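The headings of the three rightmost columns did not survive in this record, but the numbers are consistent with simple arithmetic on the Min and Max columns: half the min-max range, that value divided by the square root of 2, and an interval of the average plus or minus 1.96 times the latter, also given as a percentage of the average. The sketch below reproduces the LD row under that reading. The interpretation, the function name, and the parameters `n_runs=2` and `z=1.96` are inferences from the figures, not definitions taken from the paper.

```python
from math import sqrt

def summarize(min_acc, max_acc, n_runs=2, z=1.96):
    """Recompute the derived columns from the minimum and maximum accuracy."""
    avg = (min_acc + max_acc) / 2
    spread = (max_acc - min_acc) / 2   # half of the min-max range
    std_err = spread / sqrt(n_runs)    # spread divided by sqrt(n), n assumed to be 2
    margin = z * std_err               # half-width of the interval around avg
    margin_pct = 100 * margin / avg    # same half-width as a percentage of avg
    return avg, spread, std_err, margin, margin_pct

# LD row of the table above: Min 79.4, Max 82.4.
avg, spread, std_err, margin, margin_pct = summarize(79.4, 82.4)
print(round(avg, 2), round(spread, 2), round(std_err, 4), round(margin, 3), round(margin_pct, 2))
```

The table appears to truncate rather than round the second derived column (1.0606 in the LD row), so the last digit of a recomputed value can differ by one.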
Analysis of proposed method on augmented dataset. Best values are shown in bold.
| Method | Min (%) | Avg (%) | Max (%) | | | |
|---|---|---|---|---|---|---|
| C-SVM | | | | | | |
| LD | 81.7 | 82.8 | 84.0 | 1.15 | 0.8131 | 82.85 ± 1.594 (±1.92%) |
| L-SVM | 82.6 | 83.4 | 84.3 | 0.85 | 0.6010 | 83.45 ± 1.178 |
| Q-SVM | 89.4 | 90.6 | 91.8 | 1.2 | 0.8485 | 90.6 ± 1.663 (±1.84%) |
| F-KNN | 87.1 | 88.3 | 89.5 | 1.2 | 0.8485 | 88.3 ± 1.663 (±1.88%) |
| M-KNN | 73.8 | 75.1 | 76.5 | 1.35 | 0.9545 | 75.1 ± 1.871 (±2.59%) |
| C-KNN | 73.6 | 75.1 | 76.4 | 1.4 | 0.9899 | 75.0 ± 1.940 (±2.59%) |
| W-KNN | 84.9 | 86.8 | 88.7 | 1.9 | 1.3435 | 86.8 ± 2.633 (±3.03%) |
| ESDA | 85.7 | 87.7 | 89.7 | 2.0 | 1.4142 | 87.7 ± 2.772 (±3.16%) |
| ESKNN | 86.3 | 87.8 | 89.4 | 1.55 | 1.0969 | 87.85 ± 2.148 (±2.45%) |
Comparison with existing techniques on the Tobacco3482 dataset.
| Paper | Dataset | Accuracy (%) | Training Time (s) | Prediction Time (s) |
|---|---|---|---|---|
| Afzal et al. | Tobacco3482 | 77.6 | - | - |
| Kölsch et al. | Tobacco3482 | 83.24 | - | - |
| Afzal et al. | Tobacco3482 | 91.13 | - | - |
| Sarkhel & Nandi | Tobacco3482 | 82.78 | - | - |
| Wiedemann & Heyer | Tobacco-800 | 93 | - | - |
| Proposed | Primary: Tobacco3482 | AlexNet: 90.1 | 670.8 | 2.34 |
| | | VGG19: 89.6 | 947.3 | 3.95 |
| | | Original: 92.2 | 329.5 | 1.62 |
| | | Augmented: 93.1 | 364.1 | 0.78 |