Mahdi Rezaei, Mahsa Shahidi.
Abstract
The challenge of learning to recognise a new concept, object, or medical disease without receiving any examples beforehand is called Zero-Shot Learning (ZSL). One of the major issues in deep-learning-based methodologies, such as those used in medical imaging and other real-world applications, is the requirement for large annotated datasets prepared by clinicians or experts to train the model. ZSL aims for minimal human intervention by relying only on previously known or trained concepts plus currently existing auxiliary information. It is an ever-growing research area for cases where very limited or no annotated data are available and the detection/recognition system must learn new concepts in a human-like manner. This makes ZSL applicable to many real-world scenarios, from unknown object detection in autonomous vehicles to medical imaging and unforeseen diseases such as COVID-19 diagnosed from Chest X-Rays (CXR). In this review paper, we first introduce the broader family of few/one-shot learning and define the ZSL problem as an extreme case of few-shot learning. We review the fundamentals and challenging steps of zero-shot learning, including state-of-the-art categories of solutions as well as our recommended solution, the motivation behind each approach, and its advantages over the other categories, to guide both clinicians and AI researchers towards the best techniques and practices for their applications. Inspired by different settings and extensions, we then review the available datasets, including medical and non-medical images, the variety of splits, and the evaluation protocols proposed so far. Finally, we discuss recent applications and future directions of ZSL. We aim to convey a useful intuition towards handling complex learning tasks in a way closer to how humans learn. We mainly focus on two applications in the current challenging era: early and fast diagnosis of COVID-19 cases, and encouraging readers to develop other AI-based automated detection/recognition systems using ZSL.
Keywords: Autonomous vehicles; COVID-19 pandemic; Chest X-Ray (CXR); Deep learning; Machine learning; SARS-CoV-2; Semantic embedding; Supervised annotation; Zero-shot learning
Year: 2020 PMID: 33043311 PMCID: PMC7531283 DOI: 10.1016/j.ibmed.2020.100005
Source DB: PubMed Journal: Intell Based Med ISSN: 2666-5212
Fig. 1: Posterior-Anterior (PA) / Anterior-Posterior (AP) Chest X-Rays and the corresponding CT images of COVID-19 patients.
Fig. 2: Similarities and differences between seen and unseen examples derived from textual descriptions and train/test images. The test images are concept cars (a) and COVID-19 symptoms (b).
Fig. 3: Overview of ZSL models. Typical approaches use one of three embedding types or a combination of them. (a) Semantic embedding models that map visual features to the semantic space. (b) Models that map visual and semantic features to an intermediate latent space. (c) Visual embedding models that map semantic features to the visual space.
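To make the three mappings of Fig. 3 concrete, the following minimal numpy sketch (our illustration, not a specific published model; the projection matrices are random stand-ins for learned ones, and all dimensions are assumptions) classifies one test image by nearest class prototype in each of the three spaces:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vis, d_sem, d_lat = 2048, 85, 128          # assumed dims: CNN features, attributes, latent
x = rng.normal(size=d_vis)                   # visual feature of one test image
S = rng.normal(size=(10, d_sem))             # semantic vectors of 10 unseen classes

def nearest(query, prototypes):
    """Index of the class prototype most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return int(np.argmax(P @ q))

# (a) Semantic embedding: map the image into attribute space, compare to S.
W_vs = rng.normal(size=(d_sem, d_vis))
pred_a = nearest(W_vs @ x, S)

# (b) Cross-modal latent embedding: map both modalities into a shared space.
W_vl = rng.normal(size=(d_lat, d_vis))
W_sl = rng.normal(size=(d_lat, d_sem))
pred_b = nearest(W_vl @ x, S @ W_sl.T)

# (c) Visual embedding: map class semantics into the visual feature space.
W_sv = rng.normal(size=(d_vis, d_sem))
pred_c = nearest(x, S @ W_sv.T)
```

In a trained model the W matrices would be fitted on seen-class data; only the direction of the mapping distinguishes the three families.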
Common ZSL and GZSL methods categorised by their embedding-space model, with further divisions in a top-down manner. A minimal code sketch of the linear direct-learning entry follows the table.
| Embedding Space | Approach | Strategy | Representative Methods |
|---|---|---|---|
| Semantic Embedding | Two-Step Learning | Attribute classifiers | DAP-based |
| | Direct Learning | Implicit knowledge representation | Linear |
| | | Explicit knowledge representation | Graph Convolutional Networks (GCN) |
| Cross-Modal Latent Embedding | Fusion-Based Models | Fusion of seen-class data | Combination of seen-class properties |
| | Common Representation Space Models | Mapping of the visual and semantic spaces into a joint intermediate space | Parametric |
| Visual Embedding | Visual Space Embedding | Learning the semantic-to-visual projection | Linear |
| | Data Augmentation | Image generation | Gaussian distribution |
| | | Visual feature generation | GAN |
| | Leveraging Web Data | Web image crawling | Dictionary learning |
| Hybrid | Visual + Semantic Embedding | Reconstruction of the semantic features | AutoEncoder |
| | Visual + Cross-Modal Embedding | Feature generation with aligned semantic features | Semantic-to-visual mapping |
| | All | Generator and discriminator used together with a regressor | GAN + Dual Learning |
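As a concrete instance of the "Direct Learning / Linear" row above, the following hedged sketch trains an ALE/DeViSE-style bilinear compatibility score(x, y) = xᵀ W s(y) with a simple ranking-style update on seen classes; the training loop and hyperparameters are illustrative assumptions, not any paper's exact recipe:

```python
import numpy as np

def train_bilinear(X, labels, S_seen, epochs=10, lr=1e-3, margin=0.1):
    """X: (n, d) visual features; labels: (n,) seen-class indices;
    S_seen: (k, a) seen-class semantic vectors. Returns W of shape (d, a)."""
    n, d = X.shape
    k, a = S_seen.shape
    W = np.zeros((d, a))
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x, y = X[i], labels[i]
            scores = x @ W @ S_seen.T              # compatibility with each seen class
            scores[np.arange(k) != y] += margin    # margin for all wrong classes
            wrong = int(np.argmax(scores))
            if wrong != y:                         # ranking constraint violated:
                W += lr * np.outer(x, S_seen[y] - S_seen[wrong])
    return W

def zero_shot_predict(x, W, S_unseen):
    """Assign the unseen class with the highest compatibility x^T W s(y)."""
    return int(np.argmax(x @ W @ S_unseen.T))
```

Because W only relates feature dimensions to semantic dimensions, the same scoring rule transfers to unseen classes given their semantic vectors, which is what makes the direct linear family zero-shot capable.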
Statistics of the attribute datasets: number of attributes, number of classes and their splits, and total number of images. A quick consistency check of the splits follows the table.
| Dataset | #Attributes | #Classes | #Seen classes (train + val) | #Unseen classes | #Images |
|---|---|---|---|---|---|
| SUN | 102 | 717 | 580 + 65 | 72 | 14,340 |
| CUB | 312 | 200 | 100 + 50 | 50 | 11,788 |
| AWA1 | 85 | 50 | 27 + 13 | 10 | 30,475 |
| AWA2 | 85 | 50 | 27 + 13 | 10 | 37,322 |
| aPY | 64 | 32 | 15 + 5 | 12 | 15,339 |
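The seen (train + validation) and unseen classes partition each dataset's label set, so the split counts must sum to the class total; a minimal check with the numbers transcribed from the table:

```python
# Sanity-check the class splits: train + val (seen) + unseen == total classes.
splits = {  # name: (total, train, val, unseen)
    "SUN":  (717, 580, 65, 72),
    "CUB":  (200, 100, 50, 50),
    "AWA1": (50, 27, 13, 10),
    "AWA2": (50, 27, 13, 10),
    "aPY":  (32, 15, 5, 12),
}
for name, (total, train, val, unseen) in splits.items():
    assert train + val + unseen == total, name
```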
Zero-shot learning results for the Standard Split (SS) and Proposed Split (PS) on the SUN, CUB, AWA1, AWA2, and aPY datasets. Top-1 accuracy is reported in %, averaged per class; a minimal sketch of this protocol follows the table. † and ‡ denote the inductive and transductive settings, respectively.
| Setting | Method | SUN SS | SUN PS | CUB SS | CUB PS | AWA1 SS | AWA1 PS | AWA2 SS | AWA2 PS | aPY SS | aPY PS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| † | DAP | 38.9 | 39.9 | 37.5 | 40.0 | 57.1 | 44.1 | 58.7 | 46.1 | 35.2 | 33.8 |
| † | IAP | 17.4 | 19.4 | 27.1 | 24.0 | 48.1 | 35.9 | 46.9 | 35.9 | 22.4 | 36.6 |
| † | ConSE | 44.2 | 38.8 | 36.7 | 34.3 | 63.6 | 45.6 | 67.9 | 44.5 | 25.9 | 26.9 |
| † | CMT | 41.9 | 39.9 | 37.3 | 34.6 | 58.9 | 39.5 | 66.3 | 37.9 | 26.9 | 28.0 |
| † | SSE | 54.5 | 51.5 | 43.7 | 43.9 | 68.8 | 60.1 | 67.5 | 61.0 | 31.1 | 34.0 |
| † | LATEM | 56.9 | 55.3 | 49.4 | 49.3 | 74.8 | 55.1 | 68.7 | 55.8 | 34.5 | 35.2 |
| † | ALE | 59.1 | 58.1 | 53.2 | 54.9 | 78.6 | 59.9 | 80.3 | 62.5 | 30.9 | 39.7 |
| † | DeViSE | 57.5 | 56.5 | 53.2 | 52.0 | 72.9 | 54.2 | 68.6 | 59.7 | 35.4 | 39.8 |
| † | SJE | 57.1 | 53.7 | 55.3 | 53.9 | 76.7 | 65.6 | 69.5 | 61.9 | 32.0 | 32.9 |
| † | ESZSL | 57.3 | 54.5 | 55.1 | 53.9 | 74.7 | 58.2 | 75.6 | 58.6 | 34.4 | 38.3 |
| † | SYNC | 59.1 | 56.3 | 54.1 | 55.6 | 72.2 | 54.0 | 71.2 | 46.6 | 39.7 | 23.9 |
| † | SAE | 42.4 | 40.3 | 33.4 | 33.3 | 80.6 | 53.0 | 80.7 | 54.1 | 8.3 | 8.3 |
| † | GFZSL | 62.9 | 60.6 | 53.0 | 49.3 | 80.5 | 68.3 | 79.3 | 63.8 | 51.3 | 38.4 |
| † | DEM | – | 61.9 | – | 51.7 | – | 68.4 | – | 67.1 | – | 35.0 |
| † | GAZSL | – | 61.3 | – | 55.8 | – | 68.2 | – | 68.4 | – | 41.1 |
| † | f-CLSWGAN | – | 60.8 | – | 57.3 | – | 68.8 | – | 68.2 | – | 40.5 |
| † | CVAE-ZSL | – | 61.7 | – | 52.1 | – | 71.4 | – | 65.8 | – | – |
| † | SE-ZSL | 64.5 | 63.4 | 60.3 | 59.6 | 83.8 | 69.5 | 80.8 | 69.2 | – | – |
| ‡ | ALE-tran | – | 55.7 | – | 54.5 | – | 65.6 | – | 70.7 | – | 46.7 |
| ‡ | GFZSL-tran | – | 64.0 | – | 49.3 | – | 81.3 | – | 78.6 | – | 37.1 |
| ‡ | DSRL | – | 56.8 | – | 48.7 | – | 74.7 | – | 72.8 | – | 45.5 |
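The benchmark numbers above use per-class averaging, so sparsely populated classes weigh as much as populous ones; a minimal sketch of this protocol (label arrays are hypothetical toy data):

```python
import numpy as np

def per_class_top1(y_true, y_pred):
    """Average of per-class accuracies, in %: each class contributes equally
    regardless of how many test images it has."""
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return 100.0 * float(np.mean(per_class))

# Toy example: class 1 is predicted perfectly, class 2 only half the time.
y_true = np.array([1, 1, 2, 2])
y_pred = np.array([1, 1, 2, 0])
print(per_class_top1(y_true, y_pred))  # 75.0
```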
Generalised zero-shot learning results for the Proposed Split (PS) on the SUN, CUB, AWA1, AWA2, and aPY datasets. Top-1 accuracy is reported in % for unseen classes (u), seen classes (s), and their harmonic mean (H); a worked example of H follows the table. † and ‡ denote the inductive and transductive settings, respectively.
| Setting | Method | SUN u | SUN s | SUN H | CUB u | CUB s | CUB H | AWA1 u | AWA1 s | AWA1 H | AWA2 u | AWA2 s | AWA2 H | aPY u | aPY s | aPY H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| † | DAP | 4.2 | 25.1 | 7.2 | 1.7 | 67.9 | 3.3 | 0.0 | 88.7 | 0.0 | 0.0 | 84.7 | 0.0 | 4.8 | 78.3 | 9.0 |
| † | IAP | 1.0 | 37.8 | 1.8 | 0.2 | 72.8 | 0.4 | 2.1 | 78.2 | 4.1 | 0.9 | 87.6 | 1.8 | 5.7 | 65.6 | 10.4 |
| † | ConSE | 6.8 | 39.9 | 11.6 | 1.6 | 72.2 | 3.1 | 0.4 | 88.6 | 0.8 | 0.5 | 90.6 | 1.0 | 0.0 | 91.2 | 0.0 |
| † | CMT | 8.1 | 21.8 | 11.8 | 7.2 | 49.8 | 12.6 | 0.9 | 87.6 | 1.8 | 0.5 | 90.0 | 1.0 | 1.4 | 85.2 | 2.8 |
| † | CMT* | 8.7 | 28.0 | 13.3 | 4.7 | 60.1 | 8.7 | 8.4 | 86.9 | 15.3 | 8.7 | 89.0 | 15.9 | 10.9 | 74.2 | 19.0 |
| † | SSE | 2.1 | 36.4 | 4.0 | 8.5 | 46.9 | 14.4 | 7.0 | 80.5 | 12.9 | 8.1 | 82.5 | 14.8 | 0.2 | 78.9 | 0.4 |
| † | LATEM | 14.7 | 28.8 | 19.5 | 15.2 | 57.3 | 24.0 | 7.3 | 71.7 | 13.3 | 11.5 | 77.3 | 20.0 | 0.1 | 73.0 | 0.2 |
| † | ALE | 21.8 | 33.1 | 26.3 | 23.7 | 62.8 | 34.4 | 16.8 | 76.1 | 27.5 | 14.0 | 81.8 | 23.9 | 4.6 | 73.7 | 8.7 |
| † | DeViSE | 16.9 | 27.4 | 20.9 | 23.8 | 53.0 | 32.8 | 13.4 | 68.7 | 22.4 | 17.1 | 74.7 | 27.8 | 4.9 | 76.9 | 9.2 |
| † | SJE | 14.7 | 30.5 | 19.8 | 23.5 | 59.2 | 33.6 | 11.3 | 74.6 | 19.6 | 8.0 | 73.9 | 14.4 | 3.7 | 55.7 | 6.9 |
| † | ESZSL | 11.0 | 27.9 | 15.8 | 12.6 | 63.8 | 21.0 | 6.6 | 75.6 | 12.1 | 5.9 | 77.8 | 11.0 | 2.4 | 70.1 | 4.6 |
| † | SYNC | 7.9 | 43.3 | 13.4 | 11.5 | 70.9 | 19.8 | 8.9 | 87.3 | 16.2 | 10.0 | 90.5 | 18.0 | 7.4 | 66.3 | 13.3 |
| † | SAE | 8.8 | 18.0 | 11.8 | 7.8 | 54.0 | 13.6 | 1.8 | 77.1 | 3.5 | 1.1 | 82.2 | 2.2 | 0.4 | 80.9 | 0.9 |
| † | GFZSL | 0.0 | 39.6 | 0.0 | 0.0 | 45.7 | 0.0 | 1.8 | 80.3 | 3.5 | 2.5 | 80.1 | 4.8 | 0.0 | 83.3 | 0.0 |
| † | DEM | 20.5 | 34.3 | 25.6 | 19.6 | 57.9 | 29.2 | 32.8 | 84.7 | 47.3 | 30.5 | 86.4 | 45.1 | 11.1 | 75.1 | 19.4 |
| † | GAZSL | 21.7 | 34.5 | 26.7 | 23.9 | 60.6 | 34.3 | 25.7 | 82.0 | 39.2 | 19.2 | 86.5 | 31.4 | 14.2 | 78.6 | 24.1 |
| † | f-CLSWGAN | 42.6 | 36.6 | 39.4 | 43.7 | 57.7 | 49.7 | 57.9 | 61.4 | 59.6 | 52.1 | 68.9 | 59.4 | 32.9 | 61.7 | 42.9 |
| † | CVAE-ZSL | – | – | 26.7 | – | – | 34.5 | – | – | 47.2 | – | – | 51.2 | – | – | – |
| † | SE-GZSL | 40.9 | 30.5 | 34.9 | 41.5 | 53.3 | 46.7 | 56.3 | 67.8 | 61.5 | 58.3 | 68.1 | 62.8 | – | – | – |
| † | CADA-VAE | 47.2 | 35.7 | 40.6 | 51.6 | 53.5 | 52.4 | 57.3 | 72.8 | 64.1 | 55.8 | 75.0 | 63.9 | – | – | – |
| ‡ | ALE-tran | 19.9 | 22.6 | 21.2 | 23.5 | 45.1 | 30.9 | 25.9 | – | – | 12.6 | 73.0 | 21.5 | 8.1 | – | – |
| ‡ | GFZSL-tran | 0.0 | 41.6 | 0.0 | 24.9 | 45.8 | 32.2 | 48.1 | – | – | 31.7 | 67.2 | 43.1 | 0.0 | – | – |
| ‡ | DSRL | 17.7 | 25.0 | 20.7 | 17.3 | 39.0 | 24.0 | 22.3 | – | – | 20.8 | 74.7 | 32.6 | 11.9 | – | – |
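The H column is the harmonic mean H = 2us/(u + s) of the unseen (u) and seen (s) accuracies, which stays low unless both are high and thus penalises models that sacrifice unseen-class accuracy for seen-class accuracy; a minimal check against a table entry:

```python
def harmonic_mean(u, s):
    """H = 2*u*s / (u + s); high only when both accuracies are high."""
    return 2 * u * s / (u + s) if (u + s) > 0 else 0.0

# CADA-VAE on AWA1: u = 57.3, s = 72.8 -> H = 64.1, matching the table.
print(round(harmonic_mean(57.3, 72.8), 1))
```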
Seen-unseen relatedness results on the CUB and NAB datasets with easy (SCS, super-category-shared) and hard (SCE, super-category-exclusive) splits. Top-1 accuracy is reported in %.
| Method | CUB SCS | CUB SCE | NAB SCS | NAB SCE |
|---|---|---|---|---|
| MCZSL | 34.7 | – | – | – |
| WAC-Linear | 27.0 | 5.0 | – | – |
| WAC-Kernel | 33.5 | 7.7 | 11.4 | 6.0 |
| ESZSL | 28.5 | 7.4 | 24.3 | 6.3 |
| SJE | 29.9 | – | – | – |
| ZSLNS | 29.1 | 7.3 | 24.5 | 6.8 |
| SynC | 28.0 | 8.6 | 18.4 | 3.8 |
| SynC | 12.5 | 5.9 | – | – |
| ZSLPP | 37.2 | 9.7 | 30.3 | 8.1 |
| GAZSL | 43.7 | 10.3 | 35.6 | 8.6 |
| CANZSL | 45.8 | 14.3 | 38.1 | 8.9 |
Fig. 4: ImageNet results measured with Top-1 accuracy in % for the 9 splits, including classes 2 and 3 hops away from the ImageNet-1K training classes (2H and 3H), the 500, 1K, and 5K most (M) and least (L) populated classes, and all the remaining ImageNet-20K classes (All).