| Literature DB >> 35855372 |
Tejal Tiwary, Rajendra Prasad Mahapatra.
Abstract
Recent progress in image understanding and AIC (Automatic Image Captioning) has attracted many researchers to apply AI (Artificial Intelligence) models that assist blind people. AIC integrates the principles of both computer vision and NLP (Natural Language Processing) to generate automatic language descriptions of an observed image. This work presents a new deep-learning-based assistive technology that helps blind people distinguish food items in online grocery shopping. The proposed AIC model involves the following steps: data collection, non-captioned image selection, extraction of appearance and texture features, and generation of automatic image captions. Initially, data is collected from two public sources, and non-captioned images are selected using ARO (Adaptive Rain Optimization). Next, the appearance feature is extracted using the SDM (Spatial Derivative and Multi-scale) approach, and WPLBP (Weighted Patch Local Binary Pattern) is used to extract texture features. Finally, captions are generated automatically using ECANN (Extended Convolutional Atom Neural Network). The ECANN model combines CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) architectures in a caption-reuse system that selects the most accurate caption. The loss in the ECANN architecture is minimized using the AAS (Adaptive Atom Search) optimization algorithm. The implementation tool is Python, and the datasets used for analysis are two grocery datasets (Freiburg Groceries and the Grocery Store Dataset). The proposed ECANN model achieved 99.46% accuracy on the Grocery Store Dataset and 99.32% accuracy on the Freiburg Groceries dataset. The performance of the proposed ECANN model is compared with other existing models to verify the superiority of the proposed work.
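The texture-feature step named in the abstract builds on the classic LBP operator. A minimal sketch of standard (unweighted, single-scale) LBP is shown below; the paper's WPLBP additionally weights patches, which is not reproduced here:

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 local binary pattern: threshold the 8 neighbours
    against the centre pixel and pack the bits into one byte."""
    centre = patch[1, 1]
    # clockwise neighbour order starting at the top-left pixel
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2],
                  patch[1, 2], patch[2, 2], patch[2, 1],
                  patch[2, 0], patch[1, 0]]
    bits = [1 if n >= centre else 0 for n in neighbours]
    return sum(b << i for i, b in enumerate(bits))

def lbp_image(img):
    """LBP code for every interior pixel of a grayscale image."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = lbp_code(img[y:y + 3, x:x + 3])
    return out
```

A histogram of the resulting codes is what typically serves as the texture descriptor.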
Keywords: Alternative text; Blind people; Deep learning; Extended convolutional atom neural network (ECANN); Image captioning; Natural language processing
Year: 2022 PMID: 35855372 PMCID: PMC9283099 DOI: 10.1007/s11042-022-13443-5
Source DB: PubMed Journal: Multimed Tools Appl ISSN: 1380-7501 Impact factor: 2.577
Review of various existing methods
| Author Name | Dataset | Technique | Pre-processing | Performance (%) |
|---|---|---|---|---|
| Heng Song et al. | COCO2014 and Flickr30K | avtmNet | Image resizing | Flickr30K: BLEU (0.248), ROUGE-L (0.494), METEOR (0.208), CIDEr (0.598), SPICE (0.157); COCO2014: BLEU (0.3317), ROUGE (0.567), METEOR (0.273), CIDEr (1.126), SPICE (0.201) |
| Zhenrong Deng et al. | COCO2014 and Flickr30K | DenseNet + LSTM | Image resizing | Flickr30K: BLEU (0.667), METEOR (0.214); COCO2014: BLEU (0.739), METEOR (0.270) |
| Yuchen Wei et al. | RPC, D2S, Grozi-120, Grozi-3.2K | Deep learning models (Priming Network, RetinaNet, DNN, one-shot learning) | Noise elimination and removal of redundant data | RPC: Priming Network, mAP (97.91%); D2S: RetinaNet, mAP (89.6%); Grozi-120: DNN, Precision (45.20%), Recall (52.70%); Grozi-3.2K: one-shot learning, Precision (92.19%), Recall (87.89%) |
| Fen Xiao et al. | MS COCO and Flickr30K | Dual LSTM | – | Flickr30K: BLEU (68.6), METEOR (21.5); MS COCO: BLEU-1 (75.8), METEOR (27.1) |
| Singh, A. et al. | Hindi genome dataset | Encoder (CNN) and decoder (LSTM) based model | IndicNLP tokenizer | BLEU (3.57), RIBES (0.08) |
| Iwamura et al. | MSR-VTT2016-Image, MSCOCO | CNN-LSTM | Image resizing | MSR-VTT2016-Image: BLEU (49.9), METEOR (16.1); MSCOCO: BLEU (75.9), METEOR (26.7) |
Fig. 1 Schematic model of the proposed method
Fig. 2 Schematic representation of the ECANN model
Fig. 3 Architecture of CNN
Fig. 4 Memory block of LSTM
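The LSTM memory block of Fig. 4 can be summarized as one gated state update. A minimal NumPy sketch of a single LSTM step (standard formulation; the stacked weight layout is an illustrative choice, not the paper's exact parameterization):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM memory-block update: input, forget and output gates
    plus a candidate cell state, with all four gates' weights stacked
    row-wise in W (4n x d), U (4n x n) and b (4n,)."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b       # pre-activations for all four gates
    i = sigmoid(z[0 * n:1 * n])      # input gate
    f = sigmoid(z[1 * n:2 * n])      # forget gate
    o = sigmoid(z[2 * n:3 * n])      # output gate
    g = np.tanh(z[3 * n:4 * n])      # candidate cell state
    c = f * c_prev + i * g           # keep part of the old state, add new
    h = o * np.tanh(c)               # expose a gated view of the state
    return h, c
```

The forget gate is what lets the decoder retain caption context across long word sequences.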
Hyperparameters of ECANN model
| Sl. No | Hyperparameters | ECANN |
|---|---|---|
| 1. | Learning algorithm | AAS |
| 2. | Initial learning rate | 0.001 |
| 3. | Activation Function | sigmoid |
| 4. | Loss Function | cross-entropy |
| 5. | Mini batch size | 30 |
| 6. | Size of word vectors | 100 |
| 7. | Maximum length of sequence | 1000 |
| 8. | No of neurons in hidden layer | 100 |
| 9. | Hidden layer | 10 |
| 10. | Max epochs | 100 |
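The hyperparameter table above can be mirrored as a plain configuration dictionary (the dictionary keys are illustrative names, not the paper's code), together with the standard LSTM parameter-count formula that these sizes imply:

```python
# Illustrative configuration mirroring the hyperparameter table.
ECANN_CONFIG = {
    "learning_algorithm": "AAS",
    "learning_rate": 0.001,
    "activation": "sigmoid",
    "loss": "cross_entropy",
    "batch_size": 30,
    "embedding_dim": 100,   # size of word vectors
    "max_seq_len": 1000,    # maximum length of sequence
    "hidden_units": 100,    # neurons per hidden layer
    "hidden_layers": 10,
    "max_epochs": 100,
}

def lstm_param_count(input_dim, hidden_units):
    """Trainable parameters of one LSTM layer: four gates, each with
    input weights, recurrent weights and a bias."""
    return 4 * (hidden_units * input_dim + hidden_units ** 2 + hidden_units)
```

With 100-dimensional word vectors feeding 100 hidden units, the first LSTM layer alone carries 4 × (100·100 + 100² + 100) = 80,400 parameters.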
Fig. 5 Sample images of the Freiburg Groceries dataset
Fig. 6 Sample images of the Grocery Store dataset
Outcomes of ECANN-based automatic image caption generation
| Method | Dataset | Accuracy (%) | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|
| ECANN (Proposed) | Freiburg Groceries Dataset | 99.32 | 99.73 | 98.94 | 99.33 |
| ECANN (Proposed) | Grocery Store Dataset | 99.46 | 99.35 | 99.57 | 99.46 |
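The four metrics reported in the table above follow directly from the confusion-matrix counts. A short reference implementation (function name is illustrative):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score (all in %) from
    true/false positive and negative counts."""
    accuracy = 100 * (tp + tn) / (tp + fp + fn + tn)
    precision = 100 * tp / (tp + fp)        # of predicted positives, how many are right
    recall = 100 * tp / (tp + fn)           # of actual positives, how many are found
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f_score
```

Precision above recall (as on the Freiburg results) means the model rarely mislabels an item but misses slightly more true items; the F-score balances the two.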
Comparison on precision and recall [6]
| Method | Dataset | Precision (%) | Recall (%) |
|---|---|---|---|
| ECANN (Proposed) | Freiburg Groceries Dataset | 99.73 | 98.94 |
| ECANN (Proposed) | Grocery Store Dataset | 99.35 | 99.57 |
| VGG16 | Grozi-3.2K | 61.73 | 43.22 |
| VGG16 | Grozi-120 | 50.44 | 30.69 |
| VGG16 | GP-20 | 90.95 | 92.82 |
| VGG16 | GP-180 | 89.92 | 87.63 |
| VGG16 + AT (BRISK) | Grozi-3.2K | 64.78 | 46.22 |
| VGG16 + AT (BRISK) | Grozi-120 | 46.32 | 29.50 |
| VGG16 + AT (BRISK) | GP-20 | 94.33 | 93.85 |
| VGG16 + AT (BRISK) | GP-180 | 85.55 | 80.74 |
| VGG16 + AT (SIFT) | Grozi-3.2K | 65.83 | 45.52 |
| VGG16 + AT (SIFT) | Grozi-120 | 49.05 | 29.37 |
| VGG16 + AT (SIFT) | GP-20 | 93.85 | 93.85 |
| VGG16 + AT (SIFT) | GP-180 | 92.19 | 87.89 |
| ResNet-18 | Grocery Store Dataset | 89.97 | 88.95 |
| ResNet-101 | Grocery Store Dataset | 93.68 | 93.55 |
| DenseNet-169 | Grocery Store Dataset | 93.22 | 92.55 |
Accuracy comparison on the Grocery Store dataset [14]
| Methods | Accuracy (%) |
|---|---|
| ECANN (Proposed) | 99.46 |
| DenseNet-169 | 84.0 |
| VGG16 | 73.8 |
| AlexNet | 69.3 |
Accuracy comparison on the Freiburg dataset
| Methods | Accuracy (%) |
|---|---|
| ECANN (Proposed) | 99.32 |
| CaffeNet | 78.9 |
| DenseNet-169 | 82.51 |
| VGG16 | 70.86 |
| AlexNet | 67.43 |
Accuracy comparison with and without feature extraction (Grocery Store Dataset) [14]
| Method | Classifier | Accuracy (%) | Accuracy, -ft (%) |
|---|---|---|---|
| ECANN (Proposed) | Softmax | 90.14 | 99.46 |
| VGG16 [6] | SVM | 62.1 | 73.3 |
| VGG16 [7] | SVM | 57.3 | 71.7 |
| AlexNet [6] | SVM | 69.2 | 72.6 |
| DenseNet-169 | SVM | 72.5 | 85.0 |
Accuracy comparison with and without feature extraction (Freiburg Dataset)
| Method | Classifier | Accuracy (%) | Accuracy, -ft (%) |
|---|---|---|---|
| ECANN (Proposed) | Softmax | 90.03 | 99.32 |
| VGG16 [6] | SVM | 61.58 | 72.7 |
| VGG16 [7] | SVM | 55.12 | 70.9 |
| AlexNet [6] | SVM | 68.49 | 71.6 |
| DenseNet-169 | SVM | 72 | 84.79 |
Accuracy comparison of the proposed and existing models using the softmax classifier, with and without feature extraction (Grocery Store and Freiburg datasets)
| Method | Grocery Store: Softmax (%) | Grocery Store: Softmax-ft (%) | Freiburg: Softmax (%) | Freiburg: Softmax-ft (%) |
|---|---|---|---|---|
| ECANN (Proposed) | 90.14 | 99.46 | 90.03 | 99.32 |
| VGG16 [6] | 66.9 | 78.76 | 64.56 | 77.12 |
| VGG16 [7] | 61.72 | 74.58 | 60.92 | 72.82 |
| AlexNet [6] | 72.49 | 76.93 | 70.03 | 74.13 |
| DenseNet-169 | 77.3 | 87.62 | 75.63 | 86 |
Comparative assessment of the proposed model against other basic models
| Method | Dataset | Accuracy (%) | Precision (%) | Recall (%) | F-score (%) |
|---|---|---|---|---|---|
| ECANN (Proposed) | Freiburg Groceries Dataset | 99.32 | 99.73 | 98.94 | 99.33 |
| ECANN (Proposed) | Grocery Store Dataset | 99.46 | 99.35 | 99.57 | 99.46 |
| CNN | Freiburg Groceries Dataset | 97.01 | 97.64 | 96.35 | 97.16 |
| CNN | Grocery Store Dataset | 97.56 | 97.22 | 97.21 | 97.53 |
| RNN | Freiburg Groceries Dataset | 96.26 | 96.57 | 95.47 | 96.32 |
| RNN | Grocery Store Dataset | 96.74 | 96.63 | 96.43 | 96.76 |
| DNN | Freiburg Groceries Dataset | 96.87 | 96.93 | 95.39 | 96.88 |
| DNN | Grocery Store Dataset | 96.93 | 96.87 | 96.92 | 96.92 |
| DBN | Freiburg Groceries Dataset | 95.29 | 95.76 | 94.62 | 95.31 |
| DBN | Grocery Store Dataset | 95.37 | 95.54 | 95.67 | 95.42 |
Fig. 7 Graphical assessment of ECANN-based image caption generation
Fig. 8 Accuracy comparison of various DL models on the Grocery Store dataset
Fig. 9 Accuracy comparison of various DL models on the Freiburg dataset
Fig. 10 Accuracy comparison of different models with and without feature extraction
Fig. 11 Accuracy comparison of different models with and without feature extraction
Fig. 12 (a and b) Accuracy comparison of various DL models using the softmax classifier with and without feature extraction
Fig. 13 ROC curve
Fig. 14 Precision and recall comparison on different datasets
Fig. 15 Training and testing performance of the proposed ECANN model: (a) accuracy, (b) loss
Fig. 16 Outcomes of image captions