
Zero-shot learning and its applications from autonomous vehicles to COVID-19 diagnosis: A review.

Mahdi Rezaei, Mahsa Shahidi.

Abstract

The challenge of learning to recognise a new concept, object, or medical disease without receiving any examples beforehand is called Zero-Shot Learning (ZSL). One of the major issues in deep-learning-based methodologies, in Medical Imaging as in other real-world applications, is the requirement for large annotated datasets prepared by clinicians or experts to train the model. ZSL is known for requiring minimal human intervention, as it relies only on previously known or trained concepts plus existing auxiliary information. It is a growing research area for cases where very limited or no annotated data are available and where the detection / recognition system should show human-like characteristics in learning new concepts. This makes ZSL applicable in many real-world scenarios, from unknown object detection in autonomous vehicles to medical imaging and unforeseen diseases, such as COVID-19 diagnosis based on Chest X-Ray (CXR). In this review paper, we introduce a novel and broader solution called Few / one-shot learning, and present the definition of the ZSL problem as an extreme case of few-shot learning. We review the fundamentals and the challenging steps of Zero-Shot Learning, including state-of-the-art categories of solutions, our recommended solution, the motivations behind each approach, and the advantages of each category, to guide both clinicians and AI researchers towards the best techniques and practices for their applications. Considering the different settings and extensions, we then review different datasets, including medical and non-medical images, the variety of splits, and the evaluation protocols proposed so far. Finally, we discuss the recent applications and future directions of ZSL. We aim to convey a useful intuition through this paper towards the goal of handling complex learning tasks in a way more similar to how humans learn.
We mainly focus on two applications in the current modern yet challenging era: coping with an early and fast diagnosis of COVID-19 cases, and also encouraging the readers to develop other similar AI-based automated detection / recognition systems using ZSL.
© 2020 Elsevier B.V.

Keywords:  Autonomous vehicles; COVID-19 pandemic; Chest X-Ray (CXR); Deep learning; Machine learning; SARS-CoV-2; Semantic embedding; Supervised annotation; Zero-shot learning

Year:  2020        PMID: 33043311      PMCID: PMC7531283          DOI: 10.1016/j.ibmed.2020.100005

Source DB:  PubMed          Journal:  Intell Based Med        ISSN: 2666-5212


Introduction

Object recognition is one of the most heavily researched areas of computer vision. Recent recognition models have achieved strong performance through established techniques and large annotated datasets. After several years of research, attention to this topic has not dimmed; on the contrary, it has been shown that there is still room to refine models and eliminate existing issues in this area. The number of newly emerging unknown objects is growing. Some examples of these unseen or rarely seen objects are futuristic object designs like the next generation of concept cars, existing concepts with restricted access (such as licensed or private medical imaging datasets), rarely seen objects (such as traffic signs with graffiti on them), or fine-grained categories of objects (such as detecting COVID-19 as opposed to the easier task of detecting a common pneumonia). This calls for a fresh way of solving object recognition problems that requires less human supervision and fewer annotated datasets. Several approaches have tried to gather web images to train deep learning models, but aside from the problem of noisy images, the search keywords are still a form of human supervision. One-shot learning (OSL) and Few-shot learning (FSL) are two solutions that are able to learn new categories from one or a few images, respectively [70,76,104]. Natural language processing (NLP) is another major area of research in AI, and the application of Few-shot learning to the integration of NLP and object recognition has recently become a hot topic. The model in [165] was the first FSL-based model to improve the performance of an NLP system. Zero-shot learning (ZSL) [7,38,80,178,188] is an emerging research area that is completely free of the laborious task of data collection and annotation by experts.
Zero-shot learning is a novel learning technique that has no access to any exemplars of the unseen categories during training, yet is able to build recognition models by transferring knowledge from previously seen categories and auxiliary information. The auxiliary information may include textual descriptions, attributes, or vectors of word labels. This means that ZSL is interdisciplinary by nature, with two inseparable components of visual and textual data. One of the interesting facts about ZSL is its similarity to the way humans learn and recognise a new concept without having seen it beforehand. For example, a ZSL-based model would be able to automatically learn to diagnose COVID-19 patients based on existing chest X-ray images of patients with asthma and inflammatory lung diseases, which are already recognised and labelled by clinicians, plus some new auxiliary information about the COVID-19 attributes. Here, the auxiliary data can be the descriptions given by physicians and clinicians of the unique visual patterns, features, damage, or differences they have noticed on the chest X-rays of COVID-19-positive patients compared with asthma X-ray images. A similar approach is applicable in autonomous vehicles [132], where a self-driving car is responsible for automatic detection of surrounding cars, including e.g. an unseen Tesla concept car, based on the subgroup of labelled classic sedan cars plus auxiliary information about how concept cars commonly differ from classic cars. Another example is recognising a Persian deer based on the auxiliary information available for it and its similarities to, or differences in appearance from, other previously known deer: it belongs to a subgroup of the fallow deer, but has a larger body, bigger antlers, white spots around the neck, and flat antlers in the male.
Fig. 1(a) shows three examples of Posterior-Anterior (PA) and Anterior-Posterior (AP) projections of chest X-rays of positive cases of COVID-19, and Fig. 1(b) shows their corresponding axial CT scans, taken from the COVID-ChestXRay dataset [27]. As can be seen in the images, common evident anomalies may include unilateral or bilateral patchy ground-glass opacities (GGOs), patchy consolidations and parenchymal thickening. The goal of this research is to build an artificial-intelligence-based model that can diagnose COVID-19 without being provided with any visual exemplars in the training phase. In that case, side (auxiliary) information should be provided to assist diagnosis in the test phase. In Fig. 2, the auxiliary information is provided in the form of textual descriptions for two examples: concept cars and COVID-19 X-rays. In Fig. 2(a) we aim to distinguish new unseen concept cars (bottom row) using descriptions of the exterior of the target and of how it differs from a car already learned by an existing classic vehicle classification system such as that in [133]. Similarly, visual differences and similarities between healthy chest X-rays, asthma cases, and COVID-19-positive cases are described in Fig. 2(b) as the auxiliary information.
Fig. 1

Posterior-Anterior (PA)/Anterior-Posterior (AP) Chest X-Rays and the corresponding CT images of COVID-19 patients.

Fig. 2

Similarities and differences between seen and unseen examples derived from textual descriptions and train and test images. The test images are concept cars (a) and COVID-19 symptoms (b).

Let us assume our pre-trained AI-based medical imaging system is capable of detecting asthma cases, based on common deep learning techniques and an existing large dataset of labelled asthma chest X-ray images. However, we are now facing a previously unknown COVID-19 pandemic with very few annotated chest X-rays. Clearly, we cannot train traditional deep-learning methods in the same way, because labelled images for COVID-19 are very sparse. The good news is that medical experts and clinicians can provide some auxiliary information (textual descriptions) about common features and similarities among COVID-19-positive chest X-rays, based on their findings. In Fig. 3, the side information is provided in the form of “attributes”, such as foggy effects, white spot features, blurred edges, and white/low-intensity pixel dominance in various areas of the chest X-ray images of COVID-19 patients.
Fig. 3

Overview of ZSL models. Typical approaches use one of the three embedding types or a combination of them. (a) Semantic embedding models that map visual features to the semantic space. (b) Models that map visual and semantic features to an intermediate latent space. (c) Visual embedding models that map semantic features to the visual space.

Our idea behind the utilisation of ZSL models is to detect, understand, and recognise new concepts using an existing, similar deep-learning-based classifier plus the integration of auxiliary information. This turns it into a completely new and efficient detector/recogniser or diagnosis system without requiring a new dataset to be collected or a vast amount of costly and time-consuming labelling, especially when a speedy solution is crucial and life-saving, as in the recent global pandemic. This research makes four main contributions:
(1) We propose to categorise the reviewed approaches based on the embedding spaces that each model uses to learn/infer unseen objects/concepts, and we describe the variations in data embedding inside those embedding spaces (Fig. 3 and Table 1).
(2) We evaluate the performance of the state-of-the-art models on well-known benchmark datasets (Tables 3–5, Fig. 4). To the best of our knowledge, we are the first to include the evaluation of data-synthesising methods in the research field of applied Zero-shot learning.
(3) We study the motivation behind leveraging each space as a way to solve the ZSL challenge by reviewing current issues and their solutions.
(4) We provide technical justifications to support the idea of using the proposed ZSL model as one of the best practices for COVID-19 diagnosis and other similar applications.
The rest of the article is organised as follows. In Section 2, we introduce the problem of Few-shot, One-shot and Zero-shot learning. In Section 3, we discuss the test and training phases of Zero-shot learning and generalised Zero-shot learning systems. Section 4 presents the embedding approaches, followed by evaluation protocols in Section 5. In Section 6, we analyse the outcome of the experiments performed on different state-of-the-art methodologies. The applications of ZSL are discussed further in Section 7. In Section 8, we discuss the outcome of this research, and finally, Section 9 gives the concluding remarks.

Few-shot/one-shot and zero-shot learning

Few-shot learning (FSL) is the challenge of learning novel classes from a tiny training dataset of one or a few images per category. FSL is closely related to knowledge transfer, where a model previously trained on large data is used for a similar task with less training data. The more accurate the transferred knowledge, the better FSL will generalise. Moreover, many approaches employ meta-learning to address few-shot or few-example learning [64,156]. The main challenge is to improve generalisation, as FSL often faces the overfitting problem. In this type of learning, there is an auxiliary dataset that contains N classes, each having K annotated samples of the new examples in the training phase. This turns the problem into an N-way-K-shot classification:
$$D_{train} = \{(x_i, y_i)\}_{i=1}^{N \times K}$$
where $x_i$ is the training example and $y_i$ is its corresponding label. $N$ denotes the number of categories and $K$ defines the number of examples per category. Few-shot learning has K > 1 samples. Among the relevant research works, [163] uses the features shared among classes to compensate for the lack of large data, and follows a learning procedure based on boosted decision stumps. HDP-DBM [141] develops a compound of a deep Boltzmann machine and a hierarchical Dirichlet process to learn abstract knowledge at different hierarchies of the concept categories. [156] proposes prototypical networks that compute the Euclidean distance to prototype representations of each class. It was not until recently that Few-shot learning was introduced in computer-aided diagnosis. The idea of using additional information (attributes) in FSL was first introduced in [165]. [121] proposes a model to classify skin lesions. [68] uses FSL for glaucoma diagnosis from fundus images. [127] studies the problem of chest X-ray classification of five symptoms, including consolidation.
In the case of one-shot learning, there is only one example per class in the support set, so it is more challenging than FSL. The Bayesian Program Learning (BPL) framework [77] represents each concept of the handwritten characters as a simple probabilistic program. [14] proposes a cross-generalisation algorithm, which replaces the features from the previously learned classes with similar features of the novel classes to adapt to the target task. In Bayesian learning, [41] represents prior knowledge in the form of a probability density function on the parameters of the model, and updates it to compute the posterior model. Matching Nets (MN) [172] uses non-parametric attentional memory mechanisms and “episodes” during training. [25] captures salient features of general lung datasets using an encoder and augments multiple views for images, then uses the prototypical network for a 2-way, 1-shot classification. Zero-shot learning is the extreme case of FSL where $K = 0$. In other words, the difference between the two is the absence of any visual examples of the target classes in the training phase of ZSL, while in few-shot learning, the support set contains a few labelled samples of the novel categories. Also, auxiliary information in the form of class embeddings is one of the main components of Zero-shot learning. ZSL approaches might extend their solutions to one-shot or few-shot learning either by updating the training data with one or a few samples generated by augmentation techniques, or by having access to a few of the unseen images during training [5], [17], [23], [59], [145], [147], [164], [171], [189], [199]. [145,189] both use auxiliary text-based information.
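The N-way-K-shot setting described in this section can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the function name `sample_episode`, the query-set size `n_query`, and the toy data are assumptions made only for illustration. Setting `k_shot=0` gives an empty support set, which corresponds to the zero-shot case.

```python
import random
from collections import defaultdict


def sample_episode(labelled_items, n_way, k_shot, n_query, seed=None):
    """Draw one N-way-K-shot episode from (example, label) pairs.

    Returns (support, query): support holds k_shot examples for each of
    n_way classes, query holds n_query further examples of the same classes.
    With k_shot == 0 the support set is empty (the zero-shot case).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for example, label in labelled_items:
        by_class[label].append(example)

    # Only classes with enough examples can supply both support and query.
    eligible = sorted(c for c, xs in by_class.items()
                      if len(xs) >= k_shot + n_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query examples")

    support, query = [], []
    for label in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy data: 4 classes with 5 examples each (strings stand in for images).
toy = [(f"img_{c}_{i}", c) for c in "ABCD" for i in range(5)]
support, query = sample_episode(toy, n_way=3, k_shot=2, n_query=1, seed=0)
zs_support, zs_query = sample_episode(toy, n_way=3, k_shot=0, n_query=1, seed=0)
```

Here `support` contains 3 × 2 labelled examples and `query` contains one held-out example per sampled class, while `zs_support` is empty.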

ZSL test and training phases

ZSL models can be viewed from two points of view in terms of the training and test phases: the classic ZSL and the Generalised ZSL (GZSL) settings. In the classic ZSL setting, the model only predicts new (unseen) classes at test time, while in the GZSL setting, the model predicts both unseen and seen classes at test time; hence, GZSL is more applicable to real-world scenarios [75,86,94,145,210]. The same idea can be applied to FSL, yielding generalised few-shot learning (GFSL), which detects both known and novel classes at test time. In the next paragraphs, we discuss two types of training approaches: inductive vs. transductive training.
Inductive Training: This training setting only uses seen-class information to learn a new concept. The training data for the inductive setting is:
$$D_{tr} = \{(x, y, c(y)) \mid x \in X^{S},\ y \in Y^{S}\}$$
where $x$ represents image features, $y$ is the class label, and $c(y)$ denotes the class embedding. Moreover, $X^{S}$ and $Y^{S}$ indicate seen class images and seen class labels, respectively. Inductive learning accounts for the majority of the settings used in ZSL and Generalised Zero-Shot Learning (GZSL), e.g. in [7,22,43,52,90,113,137,171,189,204,207].
Transductive Training: Although the original idea of zero-shot learning is more related to the inductive setting, in many scenarios the transductive setting is used, where unlabelled visual information, textual information, or both for the unseen classes are used together with the seen class data, e.g. in [6], [44], [52], [71], [90], and in [134,143,159,171,174,189,192,205,207]. The training data for transductive learning is:
$$D_{tr} = \{(x, y, c(y)) \mid x \in X^{S \cup U},\ y \in Y^{S \cup U},\ c(y) \in C^{S \cup U}\}$$
where $X^{S \cup U}$ denotes that images come from the union of seen and unseen classes. Similarly, $Y^{S \cup U}$ and $C^{S \cup U}$ indicate that the training labels and class embeddings belong to both seen and novel categories. According to [197], any approach that relies on label propagation falls into the category of transductive learning.
Feature generating networks with labelled source data and unlabelled target data [189] are also considered transductive methods. The transductive setting is seen as one of the solutions to the domain shift problem, since the unseen-class information provided during training reduces the discrepancy between the two domains. There is a subtle difference between transductive learning and semi-supervised learning: in the transductive setting, the unlabelled data belong solely to the unseen test classes, while in the semi-supervised setting, unseen test classes might not be present in the unlabelled data. Furthermore, the difference between FSL and transductive ZSL is that few-shot learning has a few labelled examples of the unseen classes alongside the annotated seen class examples, while in the transductive ZSL setting, the examples for the unseen classes are all unlabelled.
ZSL models are developed based on two high-level strategies: a) defining the “Embedding Space” to combine visual and non-visual auxiliary data, and b) choosing an appropriate “Auxiliary Data Collection” technique.
Embedding Spaces. Fig. 3 demonstrates the overall structure of a ZSL system in terms of embedding spaces and auxiliary data collection techniques. Such systems either map the visual data to the semantic space (Fig. 3a), embed both visual and semantic data in a common latent space (Fig. 3b), or see the task as a missing data problem and map the semantic information to the visual space (Fig. 3c). Two or all of these approaches can also be combined to bring together the benefits of the individual categories. From a different point of view, semantic spaces can also be sub-categorised into Euclidean and non-Euclidean spaces. The intrinsic relationship between data points is better preserved when the geometrical relation between them is considered, as in non-Euclidean spaces.
These spaces are commonly based on clusters or graph networks. Some researchers prefer manifold learning for the ZSL challenge, e.g. in [63], [83], [91], [134], [175], [181], [192], [193], [207], [210]. The Euclidean spaces are more conventional and simpler, as the data has a flat representation in such spaces. However, loss of information is a common issue in these spaces. Examples of methods using Euclidean spaces are [43,80,106,137,187], and [145].
Auxiliary Data Collection. As mentioned before, Zero-shot learning is the challenge of learning novel classes without seeing their exemplars during training. Instead, freely available auxiliary information is used to compensate for the lack of visually labelled data. Such information can be categorised into two groups:
Human annotated attributes. The supervised way of annotating each image with its related attributes is an arduous process and requires time and expertise, but since the annotations are manual, they yield the noiseless and important attributes needed for learning and inference. There are several datasets in which side information in the form of attributes is available for each image, namely aPY [40], AWA1 [80], AWA2 [188], CUB [173], and SUN [118]. Several ZSL methods leverage the attributes as the side information [7,97,137], or visual attributes [40,79].
Unsupervised auxiliary information. There are several forms of auxiliary information that require minimal supervision and are widely used in the ZSL setting, such as human gazes [66], WordNet, which is a large-scale lexical database of 117,000 English words [4], [5], [7,83,100,102,123,135,136,181,185], or textual descriptions such as Web search [135], Wikipedia articles [4], [7], [37], [38], [43], [84], [112], [113], [123], [211], and sentence descriptions [129]. Textual side information needs to be transformed into class embeddings in order to be used at the training and testing stages. Word embedding and language embedding are the two representation techniques used for textual side information. The different embedding classes are reviewed in the following sections.

ZSL data embedding techniques

In this section, we first provide the task definition of ZSL and GZSL. Then we review the four recent approaches to the problem. In the standard inductive setting, as mentioned earlier in Section 3, the training set is
$$D_{tr} = \{(x_n, y_n) \mid n = 1, \dots, N\}$$
and the objective function to be minimised is as follows:
$$\frac{1}{N} \sum_{n=1}^{N} L\big(y_n, f(x_n; W)\big) + \Omega(W)$$
where $f(\cdot\,; W)$ is the mapping function with parameters $W$, $L$ is the loss function, and $\Omega$ is the regularisation term. Through the training phase, the classifier is learned for ZSL to predict only the novel classes at test time, or for the GZSL challenge to predict both novel classes and the previously learned seen classes. For instance, the classifier f can be a COVID-19 diagnoser. We categorise the embedding methodologies into four categories based on the space in which they learn/infer target classes (like COVID-19 detection in Fig. 3):
Semantic Embedding: A semantic space of textual nature in which features are in the form of class embeddings.
Intermediate-Space Embedding: A space where both class embeddings and visual feature embeddings are present in conjunction.
Visual Embedding: A space where training and inference are done with visual feature representations, similar to traditional recognition problems.
Hybrid Embedding Models: A combination of spaces is used in some models to bring together the advantages of the different spaces.
The majority of methods focus on general tasks; however, they are scalable to disease classification.

Semantic embedding

Semantic embedding can itself be subcategorised into two tasks, Attribute Classification and Label Embedding, which are discussed here:

Attribute classifiers

Early approaches to Zero-Shot learning leverage manually annotated attributes in a two-stage learning schema. Attributes in an image are predicted in the first stage, and labels of unseen classes are chosen using similarity measures in the second stage. Direct Attribute Prediction (DAP) [80] first learns the posteriors of the attributes, then uses them to estimate the posteriors of the test classes. Indirect Attribute Prediction (IAP) [80], on the other hand, first learns the posteriors for seen classes, then uses them to compute the posteriors for the attributes. [79] uses a probabilistic classifier to learn the attributes and then estimates posteriors for test classes. [136] proposes a method that avoids manual supervision by mining the attributes in an unsupervised manner. [135] adopts DAP together with a hierarchy-based knowledge transfer for large-scale settings. The method of [65] is based on IAP and uses Self-Organising and Incremental Neural Networks (SOINN) to learn and update attributes online. Later, in IAP-SS by [65], an online incremental learning approach is used for faster learning of new attributes. [179] uses a unified probabilistic model based on the Bayesian Network (BN) [110] that discovers and captures both object-dependent and object-independent relationships to overcome the problem of relating the attributes. ConSE [113] learns the probability of the training samples belonging to the seen classes. It then predicts an unseen class through a convex combination of the class label embedding vectors. [59] uses a random forest approach for learning more discriminative attributes. Hierarchy and Exclusion (HEX) [31] considers relations between objects and attributes and maps the visual features [130,161] of the images to a set of scores to estimate labels for unseen categories. [8] takes an unsupervised approach that captures the relations between the classes and attributes with a three-dimensional tensor, while using a DAP-based scoring function to infer the labels.
LAGO by [12] also follows the DAP model. It learns soft and-or logical relations between attributes. Using soft-OR, the attributes are divided into groups, and the class label of unseen samples is predicted via a soft-AND within these groups. If each attribute comes from a singleton group, the all-AND will be used.
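The two-stage schema of attribute-based inference can be sketched in a few lines of Python. This is a simplified, DAP-style illustration under stated assumptions, not the implementation of [80]: the attribute posteriors are taken as given (in practice they come from trained attribute classifiers), the class signatures are binary, and the default attribute prior of 0.5 is an assumption.

```python
import math


def dap_predict(attr_probs, signatures, attr_priors=None):
    """DAP-style inference for one image.

    attr_probs:  list of p(a_m = 1 | x), one entry per attribute.
    signatures:  dict mapping each candidate class to its binary
                 attribute signature (list of 0/1).
    attr_priors: list of p(a_m = 1); defaults to 0.5 (an assumption).
    Returns (best_class, log_scores).
    """
    if attr_priors is None:
        attr_priors = [0.5] * len(attr_probs)
    eps = 1e-12  # guards against log(0)
    log_scores = {}
    for cls, signature in signatures.items():
        score = 0.0
        for p, prior, a in zip(attr_probs, attr_priors, signature):
            # Probability that the attribute takes the value the class expects.
            likelihood = p if a == 1 else 1.0 - p
            prior_a = prior if a == 1 else 1.0 - prior
            score += math.log(max(likelihood, eps)) - math.log(max(prior_a, eps))
        log_scores[cls] = score
    best = max(log_scores, key=log_scores.get)
    return best, log_scores


# Hypothetical attributes: [has_stripes, has_hooves, lives_in_water].
unseen_signatures = {"zebra": [1, 1, 0], "whale": [0, 0, 1]}
predicted, scores = dap_predict([0.9, 0.8, 0.1], unseen_signatures)
```

With these toy posteriors the zebra signature matches all three attribute predictions, so `predicted` is `"zebra"`.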

Label embedding

Instead of using an intermediate step, more recent approaches learn to map images to the structured Euclidean semantic space automatically, which is an implicit way of representing knowledge. The compatibility function for linear mapping is:
$$F(x; w) = w^{\top} \theta(x)$$
where $\theta(x)$ is the image embedding for training classes and $w$ is the vector of parameters to be learned. In the more common case of bilinear projection, $w$ takes the form of a matrix $W$:
$$F(x, y; W) = \theta(x)^{\top} W \varphi(y)$$
where $\varphi(y)$ is the class embedding. SOC [114] first maps the image features to the semantic embedding space, then estimates the correct class using nearest neighbour. DeViSE by [43] uses a linear mapping function with a combination of dot-product similarity and the hinge rank loss used in [183]. ALE [6] optimises the ranking loss in [167] alongside the bilinear compatibility function. SJE [7] learns a bilinear compatibility function using the structural SVM objective function [166]. ESZSL [137] introduces a better regulariser and optimises an objective function with a closed-form solution in a linear manner. ZSLNS [123] proposes an $\ell_{1,2}$-norm based loss function. [17] takes a metric learning approach and linearly embeds the visual features in the attribute space. LAGO [12] is a probabilistic model that represents soft and-or relations between groups of attributes. When all attributes form a single all-OR group, it becomes similar to ESZSL [137] and learns a bilinear compatibility function. AREN [190] uses attentive region embedding while learning the bilinear mapping to the semantic space in order to enhance the semantic transfer. ZSLPP [38] combines two networks: VPDE-net for detecting bird parts in images, and PZSC-net, which trains a part-based Zero-Shot classifier from noisy Wikipedia text. DSRL [197] uses non-negative sparse matrix factorisation to align vector representations with the attribute-based label representation vectors so that more relevant visual features are passed to the semantic space.
Some approaches to ZSL use non-linear compatibility functions. CMT [157] uses a two-layer neural network, similar to the common MLP networks of [131], alongside the compatibility function. In UDA [71], a non-linear projection from feature space to semantic space (word vector and attribute) is proposed in an unsupervised domain adaptation problem based on regularised sparse coding. [84] uses a deep neural network [161] regression which generates pseudo attributes for each visual category via Wikipedia. LATEM [185] constructs a piece-wise non-linear compatibility function alongside a ranking loss. [23] regularises the model using the structural relations of the clusters, in which cluster centres characterise visual features. QFSL by [159] solves the problem in a transductive setting, and projects both source and target images onto several specified points to counter the bias problem. GFZSL [171] uses both linear and non-linear regression models and generates a probability distribution for each class. For the transductive setting, it uses Expectation-Maximisation (EM) to estimate a Gaussian Mixture Model (GMM) of the unlabelled data in an iterative manner.
Leveraging non-Euclidean spaces to capture the manifold structure of the data is another approach to the problem. Together with knowledge graphs, this makes the explicit relations between the labels visible. In this setting, the side information mainly comes from a hierarchical ontology like WordNet. The mapping function has the following form:
$$Z = f(X, A)$$
where $X$ is the feature matrix and $A$ is the adjacency matrix of the graph. Propagated Semantic Transfer (PST) [134] first uses the DAP model to transfer knowledge to novel categories; then, following a graph-based learning schema, it improves the local neighbourhoods among them. DMaP [91] jointly optimises the projection of the visual features and the semantic space to improve the transferability of the visual features to the semantic space manifold.
MFMR [193] decomposes the visual feature matrix into three matrices to further facilitate the mapping of visual features to the semantic spaces. Manifold regularisation is used to improve the representation of the geometrical manifold structure of the visual and semantic features. In [83], a Graph Search Neural Network (GSNN) [102] is used in the semantic space, based on the WordNet knowledge graph, to predict multiple labels per image using the relations between them. [181] distils auxiliary information in two forms, word embeddings and a knowledge graph, to learn novel categories. DGP [63] proposes dense graph propagation to propagate knowledge directly through dense connections. In [210], a graphical model with a low-dimensional visually semantic space is utilised, which has a chain-like structure to close the gap between the high-dimensional features and the semantic domain.
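As a minimal illustration of label embedding with a bilinear compatibility function, the Python sketch below scores an image embedding against class embeddings, predicts the best-scoring class, and evaluates a hinge ranking loss of the kind used by ranking-based methods. It is a simplified sketch under stated assumptions: the matrix `W` is fixed by hand rather than learned, the margin of 1.0 and the toy embeddings are hypothetical, and the loss is a generic unweighted variant rather than the exact objective of any cited model.

```python
def bilinear_score(theta_x, W, phi_y):
    """Compatibility F(x, y; W) = theta(x)^T W phi(y), in plain Python."""
    return sum(theta_x[i] * W[i][j] * phi_y[j]
               for i in range(len(theta_x))
               for j in range(len(phi_y)))


def predict(theta_x, W, class_emb, candidates):
    """Return the candidate class with the highest compatibility score."""
    return max(candidates, key=lambda y: bilinear_score(theta_x, W, class_emb[y]))


def hinge_rank_loss(theta_x, W, class_emb, true_label, margin=1.0):
    """Sum over wrong labels of max(0, margin - F(x, y_true) + F(x, y))."""
    true_score = bilinear_score(theta_x, W, class_emb[true_label])
    return sum(max(0.0, margin - true_score + bilinear_score(theta_x, W, emb))
               for label, emb in class_emb.items() if label != true_label)


# Hypothetical 2-d image embedding and 2-d class embeddings; W is the identity.
W = [[1.0, 0.0], [0.0, 1.0]]
class_emb = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
theta_x = [1.0, 0.0]

best = predict(theta_x, W, class_emb, ["cat", "dog"])
loss_if_cat = hinge_rank_loss(theta_x, W, class_emb, "cat")
loss_if_dog = hinge_rank_loss(theta_x, W, class_emb, "dog")
```

At test time the same `predict` call is simply run over the unseen classes (or over all classes for GZSL), since only their class embeddings are needed.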

Intermediate-Space Embedding

One of the methods of embedding is to measure the similarity between the visual and semantic features in a joint space.

Fusion-based models

Considering unseen classes as a fusion of previously learned seen concepts is called hybrid learning. Standard scoring function for hybrid models is defined as:
SSE [204] considers the histogram similarity between the seen class auxiliary information and seen visual data. SYNC [22] uses two spaces, the semantic space and the model space, and the alignment between them is conducted with phantom classes. The final classifier is learned as a sparse linear combination of the classifiers for the phantom classes. TVSE [192] learns a latent space using collective matrix factorisation with graph regularisation to incorporate the manifold structure between source and target instances; moreover, it represents each sample as a mixture of seen class scores. LDF [93] combines the prototypes of seen classes and jointly learns embeddings for both user-defined attributes and latent attributes.

Joint representation space models

Inferring unseen labels by measuring the similarity between cross-modal data in a shared latent space is another approach to the ZSL challenge. The first term in the objective function for standard cross-modal alignment approaches is a Frobenius-norm ($\|\cdot\|_F$) term involving $y$, a one-hot vector of the corresponding class labels. Approaches to joint space learning are grouped into two categories: parametric methods, which follow slow learning by optimising a problem, and non-parametric methods, which leverage data points extracted from neural networks in a shared space. Among the parametric methods, [44] proposes a multi-view alignment space for embedding low-level visual features. The learning procedure is based on multi-view Canonical Correlation Analysis (CCA) [47]. [100] applies PCA and ICA embeddings to reveal the visual similarity across the classes and obtains the semantic similarity with the WordNet graph, followed by embedding the two outputs into a common space. MCZSL [4] uses visual part and multi-cue language embedding in a joint space. In [108], both images and words are represented by Gaussian distribution embeddings. JLSE [205] adopts a dictionary learning approach to learn the parameters of the source and target domains across two separate latent spaces, where the similarity is computed by a likelihood of similarity that is independent of the class label. CDL [61] uses a coupled dictionary to align the structure of the visual-semantic space using discriminative information of the visual space. In [73,138], a coupled sparse dictionary is leveraged to relate visual and attribute features, and entropy regularisation is used to alleviate the domain shift problem. Among the non-parametric methods, ReViSE [164] combines auto-encoders with a Maximum Mean Discrepancy (MMD) loss [49] in order to align the visual and textual features.
DMAE [109] introduces a latent alignment matrix with representations from auto-encoders optimised by kernel target alignment (KTA) [29] and squared-loss mutual information (SMI) [195]. DCN [94] proposes a novel Deep Calibration Network in which an entropy minimisation principle is used to calibrate the uncertainty of unseen classes as well as seen classes. To narrow the semantic gap, BiDiLEL [176] introduces a sequential bidirectional learning strategy and creates a latent space using the visual data, then the semantic representations of unseen classes are embedded in the previously created latent space. This method comprises both parametric and non-parametric models.
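The inference step shared by these joint-space methods can be illustrated with a minimal sketch. It assumes that the image and the class descriptions have already been projected into the shared latent space by some learned model; the function names and the toy vectors are hypothetical and are not taken from any of the cited works.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_label(image_embedding, class_embeddings):
    """Return the class whose latent embedding is most similar to the image."""
    return max(
        class_embeddings,
        key=lambda name: cosine(image_embedding, class_embeddings[name]),
    )


# Hypothetical latent embeddings of two unseen classes and one test image.
unseen_classes = {"zebra": [0.9, 0.1, 0.8], "whale": [0.1, 0.9, 0.2]}
image = [0.8, 0.2, 0.7]
print(predict_label(image, unseen_classes))
```

Cosine similarity is only one possible choice here; the cited methods differ precisely in how the shared space and the similarity measure are learned.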

Visual embedding

Visual embedding is another type of ZSL method, which performs classification in the original feature space and is orthogonal to semantic space projection. This is done by learning a linear or non-linear projection function. Among the methods with linear projection functions, WAC-Linear [37] uses textual descriptions of seen and unseen categories and projects them to the visual feature space with a linear classifier. [207] follows a transductive setting in which it refines unseen data distributions using unseen image data. To approximate the manifold structure of the data, it uses a global linear mapping for synthesising virtual cluster centres. [52] assigns pseudo labels to samples using reliability (with a robust SVM) and diversity (via diversity regularisation). Among the methods with non-linear projection functions, WAC-Kernel [36] proposes a kernel method that can leverage any kind of side information and predicts a kernel classifier based on the representer theorem [144]. DEM [202] uses a least-squares embedding loss to minimise the discrepancy between the visual features and their class representation embedding vectors in the visual feature space. OSVE [96] maps in the reverse direction, from attribute space to visual space, and then trains the classifier using an SVM [11]. In [60] the authors introduce a stacked attention network that incorporates both global and local visual features, weighted by relevance, along with the semantic features. In [174] a visual constraint is applied to the class centres in the visual space to avoid the domain shift problem.

Visual data augmentation

A variety of generative networks can be used to augment unseen data. Taking GAN [48] as an example, the first term in the objective function is defined over the synthesised data of the generator and random Gaussian noise. The roles of the discriminator D and the generator G oppose each other in the loss function, as the former attempts to maximise the loss while the latter tries to minimise it. Another widely used generative neural network is the Variational AutoEncoder (VAE) [69], whose objective has two terms: the first is the reconstruction loss, and the second is the Kullback-Leibler divergence, which works as a regulariser. RKT [175] leverages relational knowledge of the manifold structure in the semantic space and generates virtually labelled data for unseen classes from Gaussian distributions generated by sparse coding. It then projects them, alongside the seen data, to the semantic space via a linear mapping. GLaP [90] generates virtual instances of an unseen class under the assumption that each representation obeys a prior distribution from which samples can be drawn. To ease the embedding into the semantic space, GANZrl [162] proposes to increase the visual diversity by generating samples with specified semantics using GAN models. SE-GZSL [75] uses a feedback-driven mechanism for its discriminator, which learns to map the produced images to the corresponding class attribute vectors. To enforce similarity between the distributions of the real and generated samples, a loss component is added to the VAE objective function [69]. Synthesised images often look unrealistic because they lack intricate details. A way around this issue is to generate features instead. [18] uses a GMMN model [89] to generate visual features for unseen classes. In [42] a multi-modal cycle consistency loss is used in training the generator for better reconstruction of the original semantic features. CVAE-ZSL [106] takes attributes and generates features for the unseen categories via a Conditional Variational AutoEncoder (CVAE) [158], with a norm-based reconstruction loss. GAZSL [211] utilises noisy textual descriptions from Wikipedia to generate visual features. A visual pivot regulariser is introduced to help generate features of better quality. f-CLSWGAN [187] combines three conditional GAN variants for better data generation. f-VAEGAN-D2 [189] combines the architectures of a conditional VAE [158], a GAN [48] and a non-conditional discriminator for the transductive setting. LisGAN [87] generates unseen features from random noise using conditional Wasserstein GANs [9]. For regularisation, it introduces semantically meaningful soul samples for each class and forces the generated features to be close to at least one of the soul samples. Gradient Matching Network (GMN) [143] trains an improved version of the conditional WGAN [51] to produce image features for the novel classes. It also introduces a Gradient Matching (GM) loss to improve the quality of the synthesised features. In order to synthesise unseen features, SPF-GZSL [86] selects similar instances and combines them to form pseudo features using a centre loss function [182]. Don’t Even Look Once (DELO) [209] is a detection algorithm that synthesises unseen visual features to gain high-confidence predictions for unseen concepts while maintaining low confidence for backgrounds with vanilla detectors. Instead of augmenting data with synthesising methods, data can also be acquired by gathering web images. [112] jointly uses web data, which are treated as weakly-supervised categories, alongside the fully-supervised auxiliary labelled categories. It then learns a dictionary for the two categories.
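For reference, the GAN and VAE objectives described in words in this section have well-known standard forms, given below. The notation is the common textbook one and is an assumption here, not necessarily the notation of the original article: $x$ is real data, $z$ is the Gaussian noise (GAN) or latent variable (VAE), $\tilde{x} = G(z)$ is the synthesised data, and $q_\phi$ and $p_\theta$ are the VAE encoder and decoder. ZSL variants typically also condition these terms on the class embedding.

```latex
% Standard GAN objective: D maximises, G minimises
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[\log\big(1 - D(\tilde{x})\big)\big],
  \qquad \tilde{x} = G(z)

% Standard VAE objective: reconstruction term minus KL regulariser
\mathcal{L}_{\mathrm{VAE}} =
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

In the VAE expression the first term rewards accurate reconstruction and the second term pulls the encoder distribution towards the prior, which is what makes it act as a regulariser.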

Hybrid Embedding Models

Several works make use of both visual and semantic projections to reconstruct better semantics and to counter the domain shift issue by alleviating the contradiction between the two domains. Semantic AutoEncoder (SAE) [72] adds a visual feature reconstruction constraint. It combines a linear visual-to-semantic encoder with a linear semantic-to-visual decoder. SP-AEN [24] is a supervised Adversarial AutoEncoder [101] which improves the preservation of semantics by reconstructing the images from the raw 256 × 256 × 3 RGB colour space. BSR [153] uses two different semantic reconstructing regressors to reconstruct the generated samples into semantic descriptions. CANZSL [26] combines feature synthesis with semantic embedding by using a GAN to generate visual features and an inverse GAN to project them into the semantic space. In this way, the produced features are consistent with their corresponding semantics. Some of the synthesising approaches utilise a common latent space to align the generated feature space with the semantic space, which facilitates capturing the relations between the two spaces. [97] introduces a latent-structure-preserving space in which features synthesised from given attributes suffer less from bias and variance decay, with the help of Diffusion Regularisation. CADA-VAE [145] generates a latent space in which both visual and semantic features are embedded by a VAE [69]. It uses a Distribution Alignment (DA) loss and a Cross-Alignment (CA) loss to align the cross-modal latent distributions. GDAN [58] combines all three approaches and designs a dual adversarial loss. In this way, the regressor and the discriminator learn from each other. A summary of the different approaches is reported in Table 1. The number of methods is growing with time, and we can infer that some areas, such as direct learning, common space learning and visual data synthesising, are more popular for solving the task, while models combining different approaches are fairly new techniques and therefore have fewer works reported here.
Table 1

Common ZSL and GZSL methods categorised based on their embedding space model, with further divisions in a top-down manner.

| Models | Categories | Main Features | Description |
|---|---|---|---|
| Semantic Embedding | Two-Step Learning | Attributes classifiers | DAP-Based [8,12,79,80,135,136], IAP-Based [65,79,80,113], Bayesian network (BN) [179], Random Forest Model [59], HEX Graph [31] |
| Semantic Embedding | Direct Learning | Implicit knowledge representation | Linear [6,7,12,17,38,43,114,123,137,171,190,197] or Non-Linear [23,71,84,159,171,185] Compatibility Functions |
| Semantic Embedding | Direct Learning | Explicit knowledge representation | Graph Conv. Networks (GCN) [181], Knowledge Graphs [63,83,91,134], 3-Node Chains [210], Matrix Tri-Factorisation with Manifold Regularisation [193] |
| Cross-Modal Latent Embedding | Fusion-based Models | Fusion of seen class data | Combination of seen classes properties [22,93,204], Combination of seen class scores [192] |
| Cross-Modal Latent Embedding | Common Representation Space Models | Mapping of the visual and semantic spaces in a joint intermediate space | Parametric [4,44,61,73,100,108,138,205], Non-parametric [94,109,164], or Both [176] |
| Visual Embedding | Visual Space Embedding | Learning of the semantic to visual projection | Linear [37,52,207] or Non-linear [36,60,96,174,202] Projection functions |
| Visual Embedding | Data Augmentation | Image generation | Gaussian distribution [90,175], GAN [162], VAE [75] |
| Visual Embedding | Data Augmentation | Visual feature generation | GAN [42,87,211], WGAN [143,187], CVAE [106,209], VAE + GAN [189], GMMN [18], Similar feature combination [86] |
| Visual Embedding | Leveraging Web Data | Web images crawling | Dictionary learning [112] |
| Hybrid | Visual + Semantic Embedding | Reconstruction of the semantic features | AutoEncoder [72], Adversarial AutoEncoder [24], GAN with two reconstructing regressors [153], GAN and an inverse GAN [26] |
| Hybrid | Visual + Cross Modal Embedding | Feature generation with aligned semantic features | Semantic to visual mapping [97], VAE [145] |
| Hybrid | All | Utilisation of generator and discriminator together with the regressor | GAN + Dual Learning [58] |

Evaluation protocols

In this section, we review some of the standard evaluation techniques used to analyse the performance of ZSL methods on the common benchmark datasets in the field, covering dataset splits, class embeddings, image embeddings, and various evaluation metrics. We begin with the benchmark datasets.

Benchmark datasets

There are several well-known benchmark datasets that are frequently used for Zero-shot learning. North America Birds (NAB) [168] is a fine-grained dataset of birds consisting of 1011 classes and 48,562 images. Images are categorised based on their visual attributes. A new version of this dataset is proposed by [38], in which identical leaf nodes that differed only in gender are merged into their parent nodes, resulting in 404 final classes. Attribute datasets. SUN Attribute [118] is a medium-scale, fine-grained attribute dataset consisting of 102 attributes, 717 categories and a total of 14,340 images of different scenes. CUB-200-2011 Birds (CUB) [173] is a 200-category fine-grained attribute dataset with 11,788 images of bird species and 312 attributes. Animals with Attributes (AWA1) [80] is another attribute dataset, with 30,475 images, 50 categories and 85 attributes; the images in this dataset are licensed and not publicly available. Later, Animals with Attributes 2 (AWA2) was presented by [188] as a free version of AWA1 with more images (37,322), the same number of classes and attributes, but different images. aPascal and Yahoo (aPY) [40] is a dataset combining 32 classes, including 20 Pascal and 12 Yahoo attribute classes, with 15,339 images and 64 attributes in total. A summary of the statistics for the attribute datasets is given in Table 2.
Table 2

Statistics of the attribute datasets: the number of attributes, the number of classes and their splits, and the total number of images.

| Attribute dataset | #attributes | Classes (y) | Seen classes yS (training + validation) | Unseen classes yU | #images |
|---|---|---|---|---|---|
| SUN [118] | 102 | 717 | 580 + 65 | 72 | 14,340 |
| CUB [173] | 312 | 200 | 100 + 50 | 50 | 11,788 |
| AWA1 [80] | 85 | 50 | 27 + 13 | 10 | 30,475 |
| AWA2 [188] | 85 | 50 | 27 + 13 | 10 | 37,322 |
| aPY [40] | 64 | 32 | 15 + 5 | 12 | 15,339 |

ImageNet [32] is a large-scale dataset that contains 14 million images shared between 21k categories, with each image having one label, which makes it a popular benchmark for evaluating models in real-world scenarios. Its organisation is based on the WordNet hierarchy [105]. ImageNet is imbalanced, as the number of samples per class varies greatly, and it is partially fine-grained. A more balanced version has 1k classes with 1000 images in each category. There are several approaches in the FSL setting for COVID-19 diagnosis; however, ZSL is still new in the field of disease recognition, so we introduce a dataset suited to the ZSL/GZSL task that contains the required images and textual descriptions in one place. COVID-ChestXRay [27] is a small public dataset of CXR and CT scans suitable for ZSL and few-shot learning experiments. At the time of this research, it had 444 unique clinical notes for a total of 16 categories, from no finding (normal cases) to other pneumonic cases like COVID-19, MERS, and SARS.

Dataset splits

Here we discuss the original splits of the datasets as well as the other splits proposed for the Zero-shot problem.

Standard Splits (SS). In ZSL problems, unseen classes should be disjoint from seen classes and test-time samples are limited to unseen classes, so the original splits aim to follow this setting. For SUN, [118] proposed using 645 classes for training, of which 580 are used for training proper and 65 for validation, with the remaining 72 classes used for testing. For CUB, [6] introduces a split of 150 training classes (including 50 validation classes) and 50 test classes. For AWA1, [80] introduced the standard split of 40 classes for training (including 13 validation classes) and 10 classes for testing. The same splits are used for AWA2. In aPY, the 20 Pascal classes are used for training (15 classes for training and 5 for validation), while the 12 Yahoo classes are used for testing.

Proposed Splits (PS). Some test images in the standard splits of SUN, CUB, AWA1 and aPY overlap with the ImageNet images used to pre-train the ResNet-101 model. To solve this problem, the proposed splits (PS) are introduced by [186], in which no test images are contained in the ImageNet 1K dataset. [186] also proposes 9 ZSL splits for the ImageNet dataset, two of which evaluate the semantic hierarchy at distances of 2 hops (1509 classes) and 3 hops (7678 classes) from the 1k training classes. Six further splits account for the imbalanced class sizes with increasing granularity, from the 500, 1K and 5K least-populated classes to the 500, 1K and 5K most-populated classes, and the final split, All, denotes a subset of 20k other classes used for testing.

Seen-Unseen relatedness. To measure the relatedness of seen samples to unseen classes, [38] introduces two splits, Super-Category-Shared (SCS) and Super-Category-Exclusive (SCE). SCS is the easy split, since it considers the relatedness to the parent category, while SCE is harder and measures the closeness of an unseen sample to that particular child node.
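The disjointness requirement on class splits can be checked mechanically. The sketch below builds a class-level split with the SUN sizes quoted above (580 training, 65 validation and 72 test classes) and asserts that the test classes never appear among the seen classes. The helper `make_zsl_split`, the class names and the random assignment are all hypothetical: the published benchmark splits are fixed lists of classes, not random draws.

```python
import random


def make_zsl_split(classes, n_train, n_val, n_test, seed=0):
    """Partition class names into disjoint training, validation and test sets."""
    if n_train + n_val + n_test != len(classes):
        raise ValueError("split sizes must add up to the number of classes")
    shuffled = list(classes)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    train = set(shuffled[:n_train])
    val = set(shuffled[n_train:n_train + n_val])
    test = set(shuffled[n_train + n_val:])
    return train, val, test


# Placeholder class names standing in for the 717 SUN scene categories.
sun_classes = [f"scene_{i}" for i in range(717)]
train, val, test = make_zsl_split(sun_classes, 580, 65, 72)

seen = train | val
assert seen.isdisjoint(test)  # unseen (test) classes must not be seen in training
print(len(train), len(val), len(test))
```

Note that the split is made over classes, not over images, which is what distinguishes a ZSL split from an ordinary train/test split.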

Class embeddings

There exist several class embeddings, each suitable for a specific scenario. Class embeddings are in forms of vectors of real numbers which can further be used to make predictions based on the similarity between them and can be obtained through four categories: attributes, word embeddings, hierarchical ontology, and language modelling. The last three are done in an unsupervised manner thus do not require human labour.

Supervised attribute-embeddings

Human-annotated attributes are produced under the supervision of experts and require a great amount of effort. Binary, relative and real-valued attributes are the three types of attribute embeddings. Binary attributes indicate the presence of an attribute in an image, so their value is either 0 or 1. They are the simplest type and are provided in the benchmark attribute datasets AWA1, AWA2, CUB, SUN and aPY. Relative attributes [115], on the other hand, show the strength of an attribute in a given image compared with other images. Real-valued attributes are continuous and therefore have the best quality [7]. In the SUN attribute dataset [118], confidence values are obtained by averaging the binary labels from multiple annotators.

Unsupervised word-embeddings

These are also known as textual corpora embeddings. Bag of Words (BOW) [54] is a one-hot encoding approach. It simply records the number of occurrences of each word in a representation called a bag and ignores word order and grammar. One-hot encoding approaches have the drawback of giving stop words (like “a”, “the” and “of”) high relevancy counts. Later, Term Frequency-Inverse Document Frequency (TF-IDF) [142] used term weighting to alleviate this problem by filtering out the stop words and keeping the meaningful words. Word2Vec [103] is a widely used two-layer neural embedding model with two variants, CBOW and skip-gram. CBOW predicts a target word in the centre of a context from its surrounding words, while the skip-gram model predicts the surrounding words from a target word. CBOW is faster to train and usually gives better accuracy for frequent words, while skip-gram is preferred for rare words and works well with sparse training data. Global Vectors (GloVe) [119] is trained on Wikipedia. It combines local context window methods with global matrix factorisation: GloVe uses global word-word co-occurrence matrix statistics to build the word embeddings.
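The effect of TF-IDF term weighting on stop words can be seen in a small sketch. It assumes the plain, unsmoothed weighting tf × log(N / df), which is only one of several TF-IDF variants in use; the function name and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)  # raw bag-of-words counts
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the bird has a red beak".split(),
    "the bird has a long tail".split(),
    "the whale has a large fin".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

Words such as "the" and "a" occur in every document, so their inverse document frequency is log(1) = 0 and they receive zero weight, whereas a discriminative word such as "red" keeps a positive weight.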

Hierarchy embedding

WordNet [105] is a large-scale public lexical database of 117,000 synsets. Synsets are groups of words that are semantically related to each other, i.e. synonyms, hyponyms and meronyms of English words, organised by hierarchy distances in a graph structure. Approaches based on knowledge graphs therefore often follow WordNet to measure the similarity between word meanings [4,5,7,83,100,102,123,135,136,181,185].

Language modelling

In general ZSL scenarios, word-by-word representations are considered; however, with the advent of transfer learning in natural language processing (NLP) and the introduction of contextual word embeddings, the capabilities of embeddings have been pushed further. Unlike traditional word embeddings, language models can capture the meaning of words based on the context in which they appear. Several contextual representations have been introduced recently and have shown strong results. These pre-trained models can be fine-tuned on various ZSL tasks. ELMo [120] is a contextual embedding model that learns its representations from morphological clues together with a deep bidirectional language model (biLM). Bidirectional Encoder Representations from Transformers (BERT) [33] is a multi-layer bidirectional Transformer encoder [170] trained on the BooksCorpus [212] dataset and English Wikipedia. It outperforms ELMo and has more parameters and layers. The pre-trained BERT model can be fine-tuned with just one additional output layer. However, BERT suffers from a pretrain-finetune discrepancy and ignores the dependency between masked positions. XLNet [196] uses an autoregressive model to overcome these shortcomings of BERT. In addition to the datasets used by BERT, XLNet is pre-trained on Giga5 [116], ClueWeb 2012-B extended by [20], and Common Crawl. ALBERT [81] scales up the model while lowering memory usage through two parameter-reduction techniques: a factorized embedding parameterization and cross-layer parameter sharing. These two techniques result in lower memory usage and higher training speed than BERT. The data used for pre-training is the same as for XLNet.

In this article, we report the results of ZSL and GZSL using the same class embeddings as [186], that is, Word2Vec trained on Wikipedia for ImageNet and per-class attributes for the attribute datasets. For the seen-unseen relatedness task, we follow [38] and use TF-IDF for the CUB and NAB datasets.

Image embeddings

Existing models use either shallow or deep feature representations. Examples of shallow features are SIFT [99], PHOG [16], SURF [15] and local self-similarity histograms [148]. Among these, SIFT is a commonly used feature in ZSL models such as [6,22,44]. Deep features are obtained from deep CNN architectures [161] and contain higher-level features. The extracted features are one of the following: 4096-dim top-layer hidden unit activations (fc7) of AlexNet [74]; the 1000-dim last fully connected layer (fc8) of VGG-16 [155]; the 4096-dim 6th layer (fc6) and 4096-dim last layer (fc7) features of VGG-19 [155]; the 1024-dim top-layer pooling units of GoogleNet [160]; and the 2048-dim last-layer pooling units of ResNet-101 [55]. In this paper, we consider the ResNet-101 network pre-trained on ImageNet-1K without any fine-tuning, which is the same image embedding used in [186]. Features are extracted from whole images for SUN, CUB, AWA1, AWA2 and ImageNet, and from the cropped bounding boxes for aPY. For the seen-unseen relatedness task, VGG-16 is used for CUB and NAB, as proposed in [38].

Evaluation metrics

Common evaluation criteria used for the ZSL challenge are as follows. Classification accuracy. One of the simplest metrics is classification accuracy, which measures the ratio of the number of correct predictions to the number of samples in class y. However, it results in a bias towards the populated classes. Average per-class accuracy. To reduce the bias towards the populated classes, the accuracy is computed independently for each class, and the per-class accuracies are then summed and divided by the number of classes. Harmonic mean. For performance evaluation on both seen and unseen classes (i.e. the GZSL setting), the Top-1 accuracies for the seen and unseen classes, $acc_{y^S}$ and $acc_{y^U}$, are used to compute the harmonic mean: $H = \frac{2 \times acc_{y^S} \times acc_{y^U}}{acc_{y^S} + acc_{y^U}}$. In this paper, we use the Top-1 accuracies and the harmonic mean as the evaluation protocols [188].
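The two metrics used in this paper can be sketched in a few lines. This is a minimal illustration, not the evaluation code behind the reported results; the function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Average of the per-class Top-1 accuracies (each class weighs the same)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean H of the seen and unseen accuracies (GZSL setting)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


# Imbalanced toy example: 8 samples of a populated class, 2 of a rare class.
y_true = ["cat"] * 8 + ["okapi"] * 2
y_pred = ["cat"] * 8 + ["cat", "okapi"]
print(per_class_accuracy(y_true, y_pred))  # cat: 8/8, okapi: 1/2
print(harmonic_mean(72.8, 57.3))
```

In the toy example the plain accuracy is 9/10, while the per-class average is (1.0 + 0.5) / 2 = 0.75, which shows how the populated class hides the errors on the rare one. The last call uses the seen and unseen accuracies reported for CADA-VAE on AWA1 in the GZSL results (72.8 and 57.3), which give the reported harmonic mean of 64.1 after rounding. The harmonic mean is low whenever either accuracy is low, so a model cannot score well by ignoring the unseen classes.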

Experimental results

As the main contribution of this research, and for the first time, we provide comprehensive experiments on 21 state-of-the-art models in the ZSL/GZSL domain, including evaluations and comparisons of data-synthesising methods. In this section, we first provide the results for ZSL, GZSL and seen-unseen relatedness on the attribute datasets, then we present the experimental results on the ImageNet dataset. A minor part of the results is reported from [188] for a more comprehensive comparison.

Zero-shot learning results

For the original ZSL task, where only unseen classes are estimated at test time, we compare 21 state-of-the-art models in Table 3. Among them, DAP [80], IAP [80] and ConSE [113] are attribute classifiers. CMT [157], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], GFZSL [171] and DSRL [197] are compatibility learning approaches, SSE [204] and SYNC [22] are representative models of cross-modal embedding, and DEM [202], GAZSL [211], f-CLSWGAN [187], CVAE-ZSL [106] and SE-ZSL [75] are visual embedding models. From the hybrid or combination category, we compare the results of SAE [72]. Three transductive approaches, ALE-tran [6], GFZSL-tran [171] and DSRL [197], are also among the selected models. Due to the intrinsic nature of the transductive setting, their results are competitive with and in some cases better than those of the inductive methods; e.g. the accuracy of GFZSL-tran [171] is 9.9 higher than that of CVAE-ZSL [106] on the PS split of AWA1. However, there are cases where the inductive form of the same model has better accuracy: on the PS split of aPY, GFZSL scores 38.4 vs 37.1 for GFZSL-tran, and ALE scores 58.1 vs 55.7 for ALE-tran [6] on the PS split of SUN and 54.9 vs 54.5 on the PS split of CUB. GFZSL [171], a compatibility-based approach, has the best scores compared to other models of the same category in every dataset except for CUB, where SJE [7] tops the results in both splits. This superiority could be due to the generative nature of the model. GFZSL [171] performs the best on AWA1 in both the inductive and transductive settings. Among the cross-modal methods, SYNC [22] performs better than SSE [204] on the SUN and CUB datasets, while on AWA1, AWA2 and aPY it outperforms SSE [204] in the SS split but falls behind it in the proposed split. Visual generative methods have proved to perform better, as they turn the problem into the traditional supervised form; among them, SE-ZSL [75] has the most outstanding performance. In one case, on the proposed split of CUB, SE-ZSL [75] even performs better than the transductive ALE-tran [6], with accuracies of 59.6 vs 54.5. On the PS split of AWA1, CVAE-ZSL [106] stays at the top of the inductive models, with 1.9 higher accuracy than the second-best performing model. The accuracies for the SS splits are higher than for PS in most cases; the reason could be the test images included in the training samples, especially for AWA1 and AWA2, as reported in [186].
Table 3

Zero-shot learning results for the Standard Split (SS) and Proposed Split (PS) on SUN, CUB, AWA1, AWA2, and aPY datasets. Top-1 accuracy is reported in %. † and ‡ denote inductive and transductive settings respectively.

| Setting | Methods | SUN SS | SUN PS | CUB SS | CUB PS | AWA1 SS | AWA1 PS | AWA2 SS | AWA2 PS | aPY SS | aPY PS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| † | DAP [80] | 38.9 | 39.9 | 37.5 | 40.0 | 57.1 | 44.1 | 58.7 | 46.1 | 35.2 | 33.8 |
| † | IAP [80] | 17.4 | 19.4 | 27.1 | 24.0 | 48.1 | 35.9 | 46.9 | 35.9 | 22.4 | 36.6 |
| † | ConSE [113] | 44.2 | 38.8 | 36.7 | 34.3 | 63.6 | 45.6 | 67.9 | 44.5 | 25.9 | 26.9 |
| † | CMT [157] | 41.9 | 39.9 | 37.3 | 34.6 | 58.9 | 39.5 | 66.3 | 37.9 | 26.9 | 28.0 |
| † | SSE [204] | 54.5 | 51.5 | 43.7 | 43.9 | 68.8 | 60.1 | 67.5 | 61.0 | 31.1 | 34.0 |
| † | LATEM [185] | 56.9 | 55.3 | 49.4 | 49.3 | 74.8 | 55.1 | 68.7 | 55.8 | 34.5 | 35.2 |
| † | ALE [6] | 59.1 | 58.1 | 53.2 | 54.9 | 78.6 | 59.9 | 80.3 | 62.5 | 30.9 | 39.7 |
| † | DeViSE [43] | 57.5 | 56.5 | 53.2 | 52.0 | 72.9 | 54.2 | 68.6 | 59.7 | 35.4 | 39.8 |
| † | SJE [7] | 57.1 | 53.7 | 55.3 | 53.9 | 76.7 | 65.6 | 69.5 | 61.9 | 32.0 | 32.9 |
| † | ESZSL [137] | 57.3 | 54.5 | 55.1 | 53.9 | 74.7 | 58.2 | 75.6 | 58.6 | 34.4 | 38.3 |
| † | SYNC [22] | 59.1 | 56.3 | 54.1 | 55.6 | 72.2 | 54.0 | 71.2 | 46.6 | 39.7 | 23.9 |
| † | SAE [72] | 42.4 | 40.3 | 33.4 | 33.3 | 80.6 | 53.0 | 80.7 | 54.1 | 8.3 | 8.3 |
| † | GFZSL [171] | 62.9 | 60.6 | 53.0 | 49.3 | 80.5 | 68.3 | 79.3 | 63.8 | 51.3 | 38.4 |
| † | DEM [202] | - | 61.9 | - | 51.7 | - | 68.4 | - | 67.1 | - | 35.0 |
| † | GAZSL [211] | - | 61.3 | - | 55.8 | - | 68.2 | - | 68.4 | - | 41.1 |
| † | f-CLSWGAN [187] | - | 60.8 | - | 57.3 | - | 68.8 | - | 68.2 | - | 40.5 |
| † | CVAE-ZSL [106] | - | 61.7 | - | 52.1 | - | 71.4 | - | 65.8 | - | - |
| † | SE-ZSL [75] | 64.5 | 63.4 | 60.3 | 59.6 | 83.8 | 69.5 | 80.8 | 69.2 | - | - |
| ‡ | ALE-tran [6] | - | 55.7 | - | 54.5 | - | 65.6 | - | 70.7 | - | 46.7 |
| ‡ | GFZSL-tran [171] | - | 64.0 | - | 49.3 | - | 81.3 | - | 78.6 | - | 37.1 |
| ‡ | DSRL [197] | - | 56.8 | - | 48.7 | - | 74.7 | - | 72.8 | - | 45.5 |

Generalised Zero-Shot Learning results

It is also necessary to experiment with a more realistic scenario in which previously learned concepts are estimated alongside new ones. The 21 state-of-the-art models are the same as in the ZSL challenge: DAP [80], IAP [80], ConSE [113], CMT [157], SSE [204], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], SYNC [22], SAE [72], GFZSL [171], DEM [202], GAZSL [211], f-CLSWGAN [187], CVAE-ZSL [106], SE-GZSL [75], ALE-tran [6], GFZSL-tran [171] and DSRL [197]. CADA-VAE [145] is added to the comparison as a model combining the visual feature augmentation approach with cross-modal alignment. CMT∗ [157] includes novelty detection and is reported as an alternative version of CMT [157]. The results in Table 4 are for the PS splits. As shown in the table, the results on the seen classes (yS) are dramatically higher than those on the unseen classes (yU), since in GZSL the test search space includes seen classes as well as unseen classes. This gap is most conspicuous in attribute classifiers like DAP [80], which performs poorly on AWA1 and AWA2, in hybrid approaches, and in GFZSL [171], whose unseen-class accuracy drops to 0 on SUN and CUB when training classes are also estimated at test time. However, for three models on the SUN dataset, f-CLSWGAN [187], SE-GZSL [75] and CADA-VAE [145], the accuracy on yU is higher than on yS; e.g. for SE-GZSL [75] it is 10.4 higher. For a fair comparison, the harmonic mean of the seen and unseen accuracies is also reported. Among the 21 models shared with the ZSL comparison, SE-GZSL [75] has the best harmonic means on AWA1 and AWA2 and f-CLSWGAN [187] on SUN and CUB, although SE-GZSL results have not been reported for aPY. In some cases, an attribute classifier achieves the best results on yS. Transductive models have fluctuating results in comparison with their inductive counterparts. Overall, CADA-VAE [145] achieves the best harmonic mean on every dataset for which it is reported (results for aPY are not reported), higher than all of the transductive methods.
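Table 4

Generalised Zero-Shot Learning results for the Proposed Split (PS) on SUN, CUB, AWA1, AWA2, and aPY datasets. Top-1 accuracy is reported in % for unseen classes (yU), seen classes (yS) and their harmonic mean (H). † and ‡ denote inductive and transductive settings, respectively.

| Setting | Methods | SUN yU | SUN yS | SUN H | CUB yU | CUB yS | CUB H | AWA1 yU | AWA1 yS | AWA1 H | AWA2 yU | AWA2 yS | AWA2 H | aPY yU | aPY yS | aPY H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| † | DAP [80] | 4.2 | 25.1 | 7.2 | 1.7 | 67.9 | 3.3 | 0.0 | 88.7 | 0.0 | 0.0 | 84.7 | 0.0 | 4.8 | 78.3 | 9.0 |
| † | IAP [80] | 1.0 | 37.8 | 1.8 | 0.2 | 72.8 | 0.4 | 2.1 | 78.2 | 4.1 | 0.9 | 87.6 | 1.8 | 5.7 | 65.6 | 10.4 |
| † | ConSE [113] | 6.8 | 39.9 | 11.6 | 1.6 | 72.2 | 3.1 | 0.4 | 88.6 | 0.8 | 0.5 | 90.6 | 1.0 | 0.0 | 91.2 | 0.0 |
| † | CMT [157] | 8.1 | 21.8 | 11.8 | 7.2 | 49.8 | 12.6 | 0.9 | 87.6 | 1.8 | 0.5 | 90.0 | 1.0 | 1.4 | 85.2 | 2.8 |
| † | CMT∗ [157] | 8.7 | 28.0 | 13.3 | 4.7 | 60.1 | 8.7 | 8.4 | 86.9 | 15.3 | 8.7 | 89.0 | 15.9 | 10.9 | 74.2 | 19.0 |
| † | SSE [204] | 2.1 | 36.4 | 4.0 | 8.5 | 46.9 | 14.4 | 7.0 | 80.5 | 12.9 | 8.1 | 82.5 | 14.8 | 0.2 | 78.9 | 0.4 |
| † | LATEM [185] | 14.7 | 28.8 | 19.5 | 15.2 | 57.3 | 24.0 | 7.3 | 71.7 | 13.3 | 11.5 | 77.3 | 20.0 | 0.1 | 73.0 | 0.2 |
| † | ALE [6] | 21.8 | 33.1 | 26.3 | 23.7 | 62.8 | 34.4 | 16.8 | 76.1 | 27.5 | 14.0 | 81.8 | 23.9 | 4.6 | 73.7 | 8.7 |
| † | DeViSE [43] | 16.9 | 27.4 | 20.9 | 23.8 | 53.0 | 32.8 | 13.4 | 68.7 | 22.4 | 17.1 | 74.7 | 27.8 | 4.9 | 76.9 | 9.2 |
| † | SJE [7] | 14.7 | 30.5 | 19.8 | 23.5 | 59.2 | 33.6 | 11.3 | 74.6 | 19.6 | 8.0 | 73.9 | 14.4 | 3.7 | 55.7 | 6.9 |
| † | ESZSL [137] | 11.0 | 27.9 | 15.8 | 12.6 | 63.8 | 21.0 | 6.6 | 75.6 | 12.1 | 5.9 | 77.8 | 11.0 | 2.4 | 70.1 | 4.6 |
| † | SYNC [22] | 7.9 | 43.3 | 13.4 | 11.5 | 70.9 | 19.8 | 8.9 | 87.3 | 16.2 | 10.0 | 90.5 | 18.0 | 7.4 | 66.3 | 13.3 |
| † | SAE [72] | 8.8 | 18.0 | 11.8 | 7.8 | 54.0 | 13.6 | 1.8 | 77.1 | 3.5 | 1.1 | 82.2 | 2.2 | 0.4 | 80.9 | 0.9 |
| † | GFZSL [171] | 0.0 | 39.6 | 0.0 | 0.0 | 45.7 | 0.0 | 1.8 | 80.3 | 3.5 | 2.5 | 80.1 | 4.8 | 0.0 | 83.3 | 0.0 |
| † | DEM [202] | 20.5 | 34.3 | 25.6 | 19.6 | 57.9 | 29.2 | 32.8 | 84.7 | 47.3 | 30.5 | 86.4 | 45.1 | 11.1 | 75.1 | 19.4 |
| † | GAZSL [211] | 21.7 | 34.5 | 26.7 | 23.9 | 60.6 | 34.3 | 25.7 | 82.0 | 39.2 | 19.2 | 86.5 | 31.4 | 14.2 | 78.6 | 24.1 |
| † | f-CLSWGAN [187] | 42.6 | 36.6 | 39.4 | 43.7 | 57.7 | 49.7 | 57.9 | 61.4 | 59.6 | 52.1 | 68.9 | 59.4 | 32.9 | 61.7 | 42.9 |
| † | CVAE-ZSL [106] | - | - | 26.7 | - | - | 34.5 | - | - | 47.2 | - | - | 51.2 | - | - | - |
| † | SE-GZSL [75] | 40.9 | 30.5 | 34.9 | 41.5 | 53.3 | 46.7 | 56.3 | 67.8 | 61.5 | 58.3 | 68.1 | 62.8 | - | - | - |
| † | CADA-VAE [145] | 47.2 | 35.7 | 40.6 | 51.6 | 53.5 | 52.4 | 57.3 | 72.8 | 64.1 | 55.8 | 75.0 | 63.9 | - | - | - |
| ‡ | ALE-tran [6] | 19.9 | 22.6 | 21.2 | 23.5 | 45.1 | 30.9 | 25.9 | - | - | 12.6 | 73.0 | 21.5 | 8.1 | - | - |
| ‡ | GFZSL-tran [171] | 0 | 41.6 | 0 | 24.9 | 45.8 | 32.2 | 48.1 | - | - | 31.7 | 67.2 | 43.1 | 0.0 | - | - |
| ‡ | DSRL [197] | 17.7 | 25.0 | 20.7 | 17.3 | 39.0 | 24.0 | 22.3 | - | - | 20.8 | 74.7 | 32.6 | 11.9 | - | - |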

Seen-unseen relatedness results

For fine-grained problems, it is sometimes important to measure the closeness of previously known concepts to novel unknown ones. For this purpose, a total of eleven models are compared in Table 5: MCZSL [4], WAC-Linear [37], WAC-Kernel [36], ESZSL [137], SJE [7], ZSLNS [123], SynC-fast [22], SynC-OVO [22], ZSLPP [38], GAZSL [211] and CANZSL [26]. SCE is the hard split and thus has lower accuracies than the SCS split. Two variations are reported for the SYNC [22] model: SynC-fast denotes the setting in which the standard Crammer-Singer loss is used, and SynC-OVO denotes the setting with one-versus-other classifiers. The first setting has better accuracies on CUB. CANZSL [26] outperforms all other models on both datasets and splits. Compared to the next best performing model, GAZSL [211], it improves the accuracy by 4, from 10.3 to 14.3, on the SCE split of CUB, and from 35.6 to 38.1 on the SCS split of NAB. As in the previous experiments, models that contain feature-generating steps have the highest results in the seen-unseen relatedness challenge.
Table 5

Seen-Unseen relatedness results on CUB and NAB datasets with easy (SCS) and hard (SCE) splits. Top-1 accuracy is reported in.

Methods            CUB SCS   CUB SCE   NAB SCS   NAB SCE
MCZSL [4]          34.7      -         -         -
WAC-Linear [37]    27.0      5.0       -         -
WAC-Kernel [36]    33.5      7.7       11.4      6.0
ESZSL [137]        28.5      7.4       24.3      6.3
SJE [7]            29.9      -         -         -
ZSLNS [123]        29.1      7.3       24.5      6.8
SynCfast [22]      28.0      8.6       18.4      3.8
SynCOVO [22]       12.5      5.9       -         -
ZSLPP [38]         37.2      9.7       30.3      8.1
GAZSL [211]        43.7      10.3      35.6      8.6
CANZSL [26]        45.8      14.3      38.1      8.9

Zero-shot learning results on ImageNet

ImageNet is a large-scale, single-labelled dataset with an imbalanced number of samples per class. It relies on the WordNet hierarchy instead of human-annotated attributes, and is therefore a useful means of measuring the performance of various methods in recognition-in-the-wild scenarios. The performances of 12 state-of-the-art models are reported here: ConSE [113], CMT [157], LATEM [185], ALE [6], DeViSE [43], SJE [7], ESZSL [137], SYNC [22], SAE [72], f-CLSWGAN [187], CADA-VAE [145] and f-VAEGAN-D2 [189]. All Top-1 accuracies, except those of the data-generating models, are taken from the experiments in [186]. As Fig. 4(a) shows, feature-generating methods clearly outperform the other approaches. Although results for f-VAEGAN-D2 [189] are available only for the 2H, 3H and All splits, it still has the highest accuracies among all models. SYNC [22] and f-CLSWGAN [187] are the next best performing models, with approximately the same accuracies. ConSE [113], a representative of the attribute-classifier based models, is also superior to the direct compatibility approaches. ESZSL [137], a model with a linear compatibility function, outperforms the other models within its category, with one exception: SJE [7] has slightly better accuracy in the L500 split. The figures indicate that results are conspicuously better on coarse-grained classes, while fine-grained classes with few images per class are more challenging. Accuracies also decrease when the test search space becomes too large; for example, M5K has lower accuracies than the L500 split, and the 20K split has the lowest.
Fig. 4

ImageNet results measured with Top-1 accuracy (in %) for the 9 splits: classes 2 and 3 hops away from the ImageNet-1K training classes (2H and 3H), the 500, 1K and 5K most (M) and least (L) populated classes, and All the remaining ImageNet-20K classes.

The GZSL results are important because they show the models' ability to recognise both seen and unseen classes at test time. The results for the SYNC [22] model are reported only in the L5K setting. As shown in Fig. 4(b), the trend is similar to ZSL: the most populated classes have better results than the least populated ones, yet accuracies drop when the search space becomes too large, as the decreasing trends across both the most and least populated splits show. Moreover, data-generating approaches dominate the other strategies. CADA-VAE [145], which combines the advantages of cross-modal alignment and feature synthesis, clearly outperforms the other models. In one case, M500, it has nearly double the accuracy of f-CLSWGAN [187]. In the semantic embedding category, although ESZSL [137] had better results on ZSL, it falls behind approaches such as ALE [6], DeViSE [43] and SJE [7].
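Since the ImageNet classes are imbalanced, it matters how Top-1 accuracy is averaged. The sketch below assumes per-class averaging (each class contributes equally regardless of its number of images); whether every cited result uses exactly this averaging is an assumption, and the function name `per_class_top1` and the toy labels are illustrative only.

```python
import numpy as np


def per_class_top1(y_true, y_pred):
    """Top-1 accuracy averaged over classes rather than over samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    # Accuracy inside each class, then an unweighted mean over the classes.
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))


# Toy, made-up labels: class 0 has four samples, class 1 two, class 2 one.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]

print(per_class_top1(y_true, y_pred))                   # 0.75
print(float(np.mean(np.array(y_true) == np.array(y_pred))))  # per-sample accuracy, about 0.714
```

With per-class averaging, a densely populated class cannot dominate the score, which is the relevant property when comparing the most and least populated splits.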

Applications

In recent years, zero-shot learning has proved to be a necessary challenge to solve for different scenarios and applications, and the demand for learning without access to the unseen target concepts is increasing each year. Zero-shot learning is widely discussed in the computer vision field, for example in general object recognition, as in [133,140], which aim to locate objects in addition to recognising them. Several other variations of ZSL models have been proposed for the same task, such as [13,30,126]. Zero-shot emotion recognition [200] addresses the recognition of unseen emotions, while zero-shot semantic segmentation aims to segment unseen object categories [19,177]. For retrieving images from large-scale datasets, there is a growing body of zero-shot research [98,194], along with sketch-based image retrieval systems [34,35,150]. Zero-shot learning is applied to visual imitation learning to reduce human supervision by automating the exploration of the agent [82,117]. Action recognition is the task of recognising a sequence of actions from the frames of a video; when new actions are not available at training time, zero-shot learning can be a solution, as in [45,107,124,149]. Zero-shot style transfer is the problem of transferring the texture of a source image to a target image when the style is arbitrary rather than pre-determined [151]. Zero-shot resolution enhancement aims to enhance the resolution of an image without pre-defined high-resolution training examples [154]. Zero-shot scene classification for HSR images [85] and scene-sketch classification [191] are other applications of ZSL in computer vision.

Zero-shot learning has also left its footprint in the area of NLP. Zero-shot entity linking links entity mentions in text using a knowledge base [95]. Many research works focus on translating between languages without pre-determined translations between pairs of samples [50,53,62,78]. Zero-shot learning is also used in sentence embedding [10] and in the style transfer of text, where a common technique is to convert the source into an arbitrary target style, like the artistic technique discussed in [21]. In the audio processing field, zero-shot voice conversion to another speaker's voice [122] is an applicable scenario of ZSL.

During the COVID-19 pandemic, many researchers have worked on Artificial Intelligence and Machine Learning based methodologies to recognise positive COVID-19 cases from CT scan images or chest X-rays. Two prominent features in chest CT used for diagnosis are ground glass opacities (GGO) and consolidation, which have been considered by researchers such as [39,92,198] and [146]. [111] uses three CNN models to detect COVID-19, among which ResNet50 shows a very high classification performance. [146] introduces a deep-learning based system that automatically segments the infected regions and the entire lung. [184] shows that increased procalcitonin, and unilateral or bilateral consolidation with a surrounding halo on chest CT, are prominent in paediatric patients. [88] introduces COVNet to extract 2D local and 3D global features from 3D chest CT slices; the method claims the ability to distinguish COVID-19 from community acquired pneumonia (CAP). [152] shows different imaging patterns of COVID-19 cases depending on the time of infection. [208] classifies respiratory CT scan changes into four stages and shows that the most dramatic changes occur in the first 10 days after the onset of initial symptoms.

[201] introduces a deep learning based anomaly detection model that extracts high-level features from the input chest X-ray image. [56] introduces COVIDX-Net to classify positive COVID-19 cases in X-ray images; it uses seven different architectures, among which VGG19 outperforms the others. [3] proposes COVID-CAPS, which is based on Capsule Networks [57] to avoid the drawbacks of CNN-based architectures, as it captures spatial information better; it is evaluated on a small dataset of X-ray images. [1] employs a class decomposition mechanism in DeTraC [2], a deep convolutional network that can handle irregularities in X-ray image datasets. [203] proposes a method for X-ray medical image segmentation using task driven generative adversarial networks. [128] proposes a 121-layer CNN called CheXNet, trained on the ChestX-ray14 dataset [180], to detect pneumonia and localise the most affected areas in X-ray images. [139] shows that a possible diagnostic criterion could be the existence of bilateral pulmonary areas of consolidation in chest X-rays, and [169] uses DenseNet-169 for feature extraction followed by an SVM classifier to detect pneumonia from chest X-ray images.

A common weakness among the majority of the above-mentioned works is that they either conduct their evaluations on a very limited number of cases due to the lack of comprehensive datasets (which calls the validity of the reported results into question), or they suffer from underlying uncertainties due to the unknown nature and characteristics of the novel COVID-19, not only for the medical community but also for machine learning and data analytics experts. In such an uncertain atmosphere with limited training data, we strongly recommend the adoption of zero-shot learning and its variants (as discussed in Fig. 4) as an efficient deep learning based solution for COVID-19 diagnosis.

Diagnosis and recognition of COVID-19, the very recent global challenge caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a perfect real-world application of zero-shot learning: we do not have millions of annotated samples available, and the symptoms of the disease and the chest X-rays of infected people may vary significantly from person to person. Such a scenario can truly be considered a novel unseen target or classification challenge. We only know some of the symptoms of people infected with COVID-19 in the form of advice, text notes and chest X-ray interpretations, all of which serve as auxiliary data and have partial similarities with other lung inflammatory diseases, such as Asthma or SARS. We therefore have to seek a semantic relationship between the training classes and the new unseen classes. ZSL can thus help significantly in coping with this new challenge, for example by inferring SARS-CoV-2 from previously learned diagnoses of Asthma and Pneumonia, using written medical documents on the respiratory tract and chest X-ray images. In the few-shot learning case, a handful of chest CT scans or X-rays of positive COVID-19 cases can also be beneficial as a further support set, alongside the chest X-ray images of SARS, Asthma and Pneumonia, to infer novel COVID-19 cases. As a general rule, and based on recent successful applications, we can infer that in any scenario where the goal is to reduce supervision, and the target of the problem can be learned through side information and its relation to the seen data, zero-shot learning can be adopted as one of the best learning techniques and practices.
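The idea of relating an unseen class to seen ones through side information can be illustrated with a nearest-prototype rule in a semantic space. This is a minimal sketch under stated assumptions, not a diagnostic method: the class names, the four-dimensional prototype vectors and the query are made-up placeholders, and the query is assumed to have already been projected into the semantic space by a model trained only on seen classes (that projection is not shown).

```python
import numpy as np


def cosine_scores(query, prototypes):
    """Cosine similarity between one query vector and each class prototype."""
    q = query / np.linalg.norm(query)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return p @ q


def zero_shot_predict(semantic_query, class_names, class_prototypes):
    """Return the class whose semantic prototype is closest to the query."""
    scores = cosine_scores(semantic_query, class_prototypes)
    return class_names[int(np.argmax(scores))], scores


# Hypothetical side information: one attribute vector per class.
# "unseen_c" has no training images; only its description is available.
class_names = ["seen_a", "seen_b", "unseen_c"]
class_prototypes = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
])

# Hypothetical image already mapped into the same attribute space.
semantic_query = np.array([0.9, 0.8, 0.7, 0.1])

label, scores = zero_shot_predict(semantic_query, class_names, class_prototypes)
print(label)  # unseen_c
```

The unseen class can be predicted here only because its description shares attributes with the seen classes, which is the semantic relationship the paragraph above refers to.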

Discussion

A typical zero-shot learning problem usually faces three common issues that need to be solved in order to enhance the performance of the model: bias, hubness and domain shift. Every model revolves around solving one or more of these issues. In this section, we discuss the efforts made by different approaches to alleviate bias, hubness and domain shift, and infer the logic each approach follows to learn its model.

Bias. In ZSL and GZSL tasks, the imbalance of data between training and test classes causes a bias towards seen classes at prediction time. Other reasons for bias could be high dimensionality and the lack of manifold structure in the features. Several data-generating approaches alleviate bias by synthesising visual data for unseen classes. [187] generates semantically rich CNN features of the unseen classes to make the unseen embedding space more known. [106] generates pseudo seen and unseen class features and then trains an SVM classifier to mitigate bias. [143] improves the quality of the synthesised examples by using a gradient matching loss. Models that combine data generation or reconstruction with other techniques have proved effective in alleviating bias. [97] uses an intermediate space to help discover the geometric structure of the features, which regression-based projections previously failed to capture. [24] uses a calibrated stacking rule. [145] uses latent features of size 64, with the idea that low-dimensional representations tend to mitigate bias. [153] uses two regressors to compute the reconstruction and diminish the bias. Transductive approaches like [143] are also used to solve the bias issue. [159] forces the unseen classes to be projected onto fixed pre-defined points to avoid biased results.

Hubness [125]. In high-dimensional mapping spaces, some samples (hubs) may falsely end up as the nearest neighbours of several other points in the semantic space and result in incorrect predictions. To avoid hubness, [176] proposes a stage-wise bidirectional latent embedding framework. When a mapping is done from a high-dimensional feature space to a low-dimensional semantic space using regressors, the distinctive features partially fade, whereas in the visual feature space the structures are better preserved. Hence, the visual embedding space is well known for mitigating the hubness problem. [174,202] use the output of the visual space of the CNN as the embedding space.

Domain shift. The zero-shot learning challenge can be considered a domain adaptation problem, because the labelled source data is disjoint from the unlabelled target-domain data. This is called projection domain shift. Domain adaptation techniques are used to learn the intrinsic relationships among these domains and transfer knowledge between the two. A considerable amount of work has been done in the transductive setting, which has been successful in overcoming the domain-shift issue. [44], a multi-view embedding framework, performs label propagation on a graph and uses a heuristic one-stage self-learning approach to assign points to their nearest data points. [71] introduces a regularised sparse coding based unsupervised domain adaptation framework that solves the domain-shift problem. [206] uses a structured prediction method to solve the problem by visually clustering the unseen data. [174] uses a visual constraint on the centre of each class when the mapping is being learned. Since the pure definition of the ZSL challenge implies the inaccessibility of unseen data during training, several inductive approaches have tried to solve the problem as well. [72] proposes to reconstruct the visual features to alleviate this issue. [197] performs sparse non-negative matrix factorisation for both domains in a common semantic dictionary. MFMR [193] exploits the manifold structure of test data with a joint prediction scheme to avoid domain shift. [138] uses entropy minimisation in optimisation. [86] preserves the semantic similarity structure in seen and unseen classes to avoid domain shift. [87] mitigates projection domain shift by generating soul samples that are related to the semantic descriptions.

These three common issues, together with the weaknesses of each method, motivate the choice of a particular approach when solving the ZSL problem. Attribute classifiers are considered customised since human annotations are used; however, this makes the problem a laborious task that requires strong supervision. Compatibility learning approaches can learn directly by eliminating the intermediate step, but often face the bias and hubness problems. Manifold learning addresses this weakness of the semantic learning approaches by preserving the geometrical structure of the features. Cross-modal latent embedding approaches take a different point of view and leverage both visual and semantic features, as well as the similarities and differences between them. They often propose methods for aligning the structures of the two modalities. This category of methods also suffers from the hubness problem when dealing with high-dimensional data. Visual space embedding approaches have the advantage of turning the problem into a supervised one by generating or aggregating visual instances for the unseen classes. They are also a favourable approach for solving the hubness problem, because the high-dimensional visual space preserves the information structure better, and for the bias problem, because generating unseen class samples alleviates the data imbalance. The challenge here is generating more realistic-looking data. Another setting is transductive learning, which addresses the bias problem by balancing the data through gathered unseen data; however, it is not applicable to many real-world problems, since the original definition of ZSL limits the use of unseen data during the training phase. Depending on the real-world scenario, each way of solving the problem might be the most appropriate choice. Some approaches improve the solution by combining two or more methods to benefit from each one's strengths.
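Two of the ideas discussed in this section lend themselves to a short numerical sketch. The first function illustrates the calibrated stacking rule used in [24], assuming its common form in which a constant gamma is subtracted from the scores of all seen classes before taking the arg-max. The second counts how often each target point appears among the k nearest neighbours of the queries, a simple way to spot hubs. The function names, the value of gamma and all toy numbers are illustrative assumptions, not values from the reviewed papers.

```python
import numpy as np


def calibrated_stacking_predict(scores, seen_mask, gamma):
    """Pick the best class after lowering every seen-class score by gamma."""
    scores = np.asarray(scores, dtype=float)
    seen_mask = np.asarray(seen_mask, dtype=bool)
    adjusted = scores - gamma * seen_mask.astype(float)
    return np.argmax(adjusted, axis=1)


def k_occurrence(queries, targets, k):
    """Count how often each target is among the k nearest targets of a query."""
    queries = np.asarray(queries, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Pairwise Euclidean distances, shape (num_queries, num_targets).
    dists = np.linalg.norm(queries[:, None, :] - targets[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.bincount(nearest.ravel(), minlength=targets.shape[0])


# Bias: classes 0 and 1 are seen, class 2 is unseen (toy scores).
scores = [[0.60, 0.55, 0.58],
          [0.90, 0.20, 0.30]]
seen_mask = [True, True, False]
print(calibrated_stacking_predict(scores, seen_mask, gamma=0.0))   # [0 0]
print(calibrated_stacking_predict(scores, seen_mask, gamma=0.05))  # [2 0]

# Hubness: target 0 is the nearest neighbour of three of the four queries.
targets = [[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [9.0, 9.0]]
print(k_occurrence(queries, targets, k=1))  # [3 1 0]
```

In the first example, a small gamma flips only the borderline prediction towards the unseen class and leaves the confident seen-class prediction unchanged. In the second, a strongly skewed count vector is the signature of a hub.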

Conclusion

In this article, we performed a comprehensive and multi-faceted review of the Zero-Shot/Generalised Zero-Shot Learning challenge, its fundamentals, and its variants for different scenarios and applications such as COVID-19 diagnosis, autonomous vehicles, and similar complex real-world applications that involve fully or partially new concepts that have never or rarely been seen before, in addition to the barrier of limited annotated datasets. We divided the recent state-of-the-art methods into four space-wise embedding categories. We also reviewed different types of side and auxiliary information, and went through the popular datasets and their corresponding splits for the ZSL problem. The paper also reported experimental results for some of the common baselines and assessed the advantages and disadvantages of each group, as well as the ideas behind different areas of solutions to improve each group. Our evaluation reveals that data synthesis methods and combinational approaches yield the best performance: by synthesising data, the problem shifts to the classic recognition/diagnosis problem, and by combining methods, the model utilises the advantages of each embedding technique. These models even outperform compatibility learning models in the transductive setting. This means that models containing a visual data generation step lead to better results than other approaches and settings. Furthermore, accuracies improve when the unseen classes are closer to the seen classes in the semantic hierarchy and in relatedness distance. Finally, we reviewed the current and potential near-future real-world applications of ZSL and GZSL.

To the best of our knowledge, such a comprehensive and detailed technical review and categorisation of ZSL methodologies, along with an efficient solution for the recent challenge of the COVID-19 pandemic, has not been done before; hence, we expect it to be helpful in developing new research directions in the AI and health-related research communities.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.