
Aerial scene understanding in the wild: Multi-scene recognition via prototype-based memory networks.

Yuansheng Hua1,2, Lichao Mou1,2, Jianzhe Lin3, Konrad Heidler1,2, Xiao Xiang Zhu1,2.   

Abstract

Aerial scene recognition is a fundamental visual task that has attracted increasing research interest in recent years. Most current research focuses on categorizing an aerial image into a single scene-level label, whereas in real-world scenarios a single image often contains multiple scenes. In this paper, we therefore take a step forward to a more practical and challenging task, namely multi-scene recognition in single images. Moreover, we note that manually producing annotations for such a task is extraordinarily time- and labor-consuming. To address this, we propose a prototype-based memory network that recognizes multiple scenes in a single image by leveraging massive well-annotated single-scene images. The proposed network consists of three key components: 1) a prototype learning module, 2) a prototype-inhabiting external memory, and 3) a multi-head attention-based memory retrieval module. More specifically, we first learn the prototype representation of each aerial scene from single-scene aerial image datasets and store it in an external memory. Afterwards, a multi-head attention-based memory retrieval module is devised to retrieve scene prototypes relevant to query multi-scene images for final predictions. Notably, only a limited number of annotated multi-scene images are needed in the training phase. To facilitate the progress of aerial scene recognition, we produce a new multi-scene aerial image (MAI) dataset. Experimental results on different dataset configurations demonstrate the effectiveness of our network. Our dataset and code are publicly available.
© 2021 The Author(s).


Keywords:  Convolutional neural network (CNN); Memory network; Multi-head attention-based memory retrieval; Multi-scene aerial image dataset; Multi-scene recognition in single images; Prototype learning

Year:  2021        PMID: 34219969      PMCID: PMC8218792          DOI: 10.1016/j.isprsjprs.2021.04.006

Source DB:  PubMed          Journal:  ISPRS J Photogramm Remote Sens        ISSN: 0924-2716            Impact factor:   8.979


Introduction

With the enormous advancement of remote sensing technologies, massive high-resolution aerial images are now available and beneficial to a large variety of applications, e.g., urban planning (Marmanis et al., 2018, Audebert et al., 2018, Marcos et al., 2018, Mou and Zhu, 2018b, Li et al., 2018, Qiu et al., 2020, Li et al., 2020b), traffic monitoring (Mou and Zhu, 2018c, Mou and Zhu, 2016), disaster assessment (Vetrivel et al., 2018, Lee et al., 2017), and natural resource management (Lucchesi et al., 2013, Weng et al., 2018, Cheng et al., 2017, Zarco-Tejada et al., 2014, Wen et al., 2017, Mou and Zhu, 2018a, Qiu et al., 2019). Driven by these applications, aerial scene recognition, which refers to assigning scene-level labels to aerial images, has become a fundamental but challenging task. In recent years, many efforts (Zhu et al., 2017), e.g., developing novel network architectures (Murray et al., 2019, Cheng et al., 2020, Bi et al., 2020, Niazmardi et al., 2017, Lin et al., 2020, Zhu et al., 2018) and pipelines (Byju et al., 2000, Xu et al., 2020, Wang et al., 2019, Zhu et al., 2019), publishing large-scale datasets (Xia et al., 2017, Jin et al., 2018), and introducing multi-modal and multi-temporal data (Hu et al., 2020, Tuia et al., 2016, Ru et al., 2020, Li et al., 2020a), have been made to address this task, and most of them treat it as a single-label classification problem. A common assumption shared by these studies is that an aerial image belongs to only one scene category, while in real-world scenarios, multiple scenes more often co-exist in a single image (cf. Fig. 1). Furthermore, we notice that aerial images used to learn single-label scene classification models are usually well cropped so that target scenes are centered and account for the majority of the image. Unfortunately, this is often infeasible in practical applications.
Therefore, in this paper, we aim to deal with a more practical and challenging problem, multi-scene classification in a single image, which refers to inferring multiple scene-level labels for a large-scale, unconstrained aerial image. Fig. 1 shows an example image, where we can see that multiple scenes, e.g., residential, parking lot, and commercial, co-exist in one aerial image. We note that there is another research branch of aerial image understanding, multi-label object classification, which refers to the process of inferring multiple objects present in an aerial image. These studies (Sumbul and Demir, 2019, Zegeye and Demir, 2018, Hua et al., 2020, Khan et al., 2019, Hua et al., 2019, Zeggada et al., 2017, Koda et al., 2018) mainly focus on recognizing object-level labels, while in our task, an image is classified into multiple scene categories, which provides a more comprehensive understanding of large-scale aerial images at the scene level. To the best of our knowledge, multi-scene recognition in unconstrained aerial images still remains underexplored in the remote sensing community.
Fig. 1

Illustration of how humans learn to perceive unconstrained aerial images being composed of multiple scenes. We first learn and memorize individual aerial scenes. Then we can possess the capability of understanding complex scenarios by learning from only a limited number of hard instances. We believe by simulating this learning process, a deep neural network can also learn to interpret multi-scene aerial images.

To achieve this task, huge quantities of well-annotated multi-scene images are needed for training models. However, such annotations are not easy to obtain in the remote sensing community, for two reasons. On the one hand, the visual interpretation of multiple scenes is more arduous than that of a single scene in an aerial image, and therefore labeling multi-scene images requires more work. On the other hand, low-cost annotation techniques, e.g., resorting to crowdsourced OpenStreetMap (OSM) data through keyword searching (Xia et al., 2017, Jin et al., 2018, Long et al., 2020), perform poorly in yielding multi-scene datasets owing to the incompleteness and incorrectness of certain OSM data. Examples of erroneous OSM data are shown in Fig. 2. In addition, manually rectifying annotations generated from crowdsourced data is inevitable due to their error-proneness. Such a procedure is quite labor-consuming, as every scene must be checked in case present scenes are mislabeled as absent. To overcome these limitations, in this work we propose to train a network for recognizing complex multi-scene aerial images by using only a small number of labeled multi-scene images together with a huge amount of existing, annotated single-scene data.
Our motivation is based on an intuitive observation about how humans learn to perceive complex scenes being composed of multiple entities (National Research Council, 2000, Liu et al., 2008, McLaren and Suret, 2000): we first learn and memorize individual objects (through flash cards for example) when we were babies and then possess the capability of understanding complex scenarios by learning from only a limited number of hard instances (cf. Fig. 1). We believe that this learning process also applies to the interpretation of multi-scene aerial images. Driven by this observation, we propose a novel network, termed as prototype-based memory network (PM-Net), which is inspired by recent successes of memory networks in natural language processing (NLP) tasks (Sukhbaatar et al., 2015, Miller et al., 2016) and video analysis (Shi et al., 2019, Park et al., 2020, Lai et al., 2020). To be more specific, we first learn the prototype representation of each aerial scene from single-scene aerial images and then store these prototypes in the external memory of PM-Net. Afterwards, for a given query multi-scene image, a multi-head attention-based memory retrieval module is devised to retrieve scene prototypes that are associated with the query image from the external memory for inferring multiple scene labels (see Fig. 3).
Fig. 2

Examples of incomplete and incorrect OSM data. Incomplete: the commercial area is not annotated in the OSM data. Incorrect: the orchard is mislabeled as residential.

Fig. 3

Architecture of the proposed PM-Net. We first learn scene prototypes from well-annotated single-scene aerial images and then store them in the external memory of PM-Net. Afterwards, given a query multi-scene image, a multi-head attention-based memory retrieval module is devised to retrieve scene prototypes that are relevant to the query image for the prediction of multiple labels. f_ϕ denotes the embedding function, and its output is a D-dimensional feature vector. S and H represent the numbers of scenes and heads, respectively. L and U denote the channel dimensions of the key and value in the memory retrieval module.

The contributions of this work are fourfold. 1) We take a step forward to a more practical and challenging task in aerial scene understanding, namely multi-scene classification in single images, which aims to recognize multiple scenes present in a large-scale, unconstrained aerial image. Such a task is in line with real-world scenarios and capable of providing a comprehensive picture of a given geographic area. 2) Given that labeling multi-scene images is very labor-intensive and time-consuming, we propose PM-Net, which can be trained for our task by leveraging large numbers of existing single-scene aerial images and a small number of labeled multi-scene images. 3) To facilitate the progress of multi-scene recognition in single aerial images, we create a new dataset, the multi-scene aerial image (MAI) dataset. To the best of our knowledge, this is the first publicly available dataset for aerial multi-scene interpretation. Compared to existing single-scene aerial image datasets, images in our dataset are unconstrained and contain multiple scenes, which is more in line with reality. 4) We carry out extensive experiments with different configurations.
Experimental results demonstrate the effectiveness of the proposed network. The remainder of this paper is organized as follows. Section 2 reviews studies on memory networks and prototypical networks, and Section 3 introduces the architecture of the proposed prototype-based memory network. Section 4 describes experimental configurations and analyzes results. Finally, conclusions are drawn in Section 5.

Related work

Since very few efforts have been devoted to this task in the remote sensing community, we only review the literature related to our algorithm in this section.

Memory networks

A memory network takes as input a query and retrieves complementary information from an external memory. In Sukhbaatar et al. (2015), the memory network is first proposed and utilized to address question-answering tasks, where questions are regarded as queries, and statements are stored in the external memory. To retrieve statements for predicting answers, the authors compute relative distances between queries and the external memory through dot product. In follow-up work, Miller et al. (2016) improve the efficiency of retrieving large memories by pre-selecting small subsets with key hashing. Moreover, memory networks have been further applied in video analysis (Shi et al., 2019, Park et al., 2020, Lai et al., 2020) and image captioning (Cornia et al., 2020). In Shi et al. (2019), the authors devise a dual augmented memory network to memorize both target and background features of a video, and use a Long Short-Term Memory (LSTM) to communicate with previous and next frames. In Park et al. (2020), the authors propose a memory network that memorizes normal patterns for detecting anomalies in a video. As an attempt in image captioning, Cornia et al. (2020) devise a learnable memory to learn and memorize a priori knowledge for encoding relationships between image regions. Inspired by these works, we devise a memory network and store scene prototypes in its memory for recognizing the scenes present in multi-scene images.

Prototypical networks

Prototypical networks are characterized by classifying images according to their distances from class prototypes. In learning with limited training samples, such networks have become popular and have achieved many successes recently (Snell et al., 2017, Guerriero et al., 2018, Yang et al., 2018, Huang et al., 2020, Zhang et al., 2020, Tang et al., 2019). To be specific, Snell et al. (2017) propose to first learn a prototype representation for each category and then identify images by finding their nearest category prototypes. Guerriero et al. (2018) aim to alleviate the heavy expense of learning prototypes by initializing and updating prototypes with those learned in previous training epochs. Yang et al. (2018) propose to combine prototypical networks and CNNs for tackling the open-world recognition problem and improving the robustness and accuracy of networks. Similarly, Huang et al. (2020) propose to integrate prototypical networks and graph convolutional neural networks for learning relational prototypes. Although these approaches vary, most existing works share a common way of extracting prototypes: taking the average of samples belonging to the same category. We therefore follow this prototype extraction strategy in our work.

Methodology

Overview

The proposed PM-Net consists of three essential components: a prototype learning module, an external memory, and a memory retrieval module. Specifically, the prototype learning module is devised to encode prototype representations of aerial scenes, which are then stored in the external memory. The memory retrieval module is responsible for retrieving scene prototypes related to query images through a multi-head attention mechanism. Eventually, retrieved scene prototypes are utilized to infer the existence of multiple scenes in the query image.

Scene prototype learning and writing

Following the observation introduced in Section 1, we propose to learn and memorize scene prototypes with the support of single-scene aerial images. The procedure consists of two stages. We first employ an embedding function f_ϕ to learn semantic representations of all single-scene images. Then, the feature representations belonging to the same scene category are encoded into a scene prototype and stored in the external memory. Formally, let x_i^s denote the i-th single-scene image belonging to scene s, where i ranges from 1 to N_s, and N_s is the number of samples annotated as s. The embedding function can be learned via the following objective function:

min_{ϕ, θ} Σ_s Σ_{i=1}^{N_s} ℓ( g_θ(f_ϕ(x_i^s)), y_i^s ),    (1)

where ϕ represents the learnable parameters of f_ϕ, y_i^s is a one-hot vector denoting the scene label of x_i^s, and ℓ is the classification loss. g_θ is a multilayer perceptron (MLP) with parameters θ, and its outputs are activated by a softmax function to predict probability distributions. Following the overwhelming trend of deep learning, here we employ a deep CNN, e.g., ResNet-50 (He et al., 2016), as the embedding function and learn its parameters on public single-scene aerial image datasets. After sufficient training, f_ϕ is expected to be capable of learning discriminative representations for different aerial scenes. Once f_ϕ is learned, each scene prototype can be computed by averaging the representations of all aerial images belonging to the same scene (Snell et al., 2017, Guerriero et al., 2018, Yang et al., 2018). Let p_s be the prototype representation of scene s. We calculate p_s with the following equation:

p_s = (1 / N_s) Σ_{i=1}^{N_s} f_ϕ(x_i^s).    (2)

By doing so, in single-scene classification, an image lying close to p_s in the common embedding space is supposed to belong to scene s. Similarly, in the multi-scene scenario, the representation of an aerial image comprising scene s should show high relevance with p_s. After encoding all scene prototypes, the external memory can be formulated as follows:

M = [p_1; p_2; …; p_S],    (3)

where S denotes the number of scenes and [·;·] represents the concatenation operation. Given that each p_s is a D-dimensional vector, M is an S × D matrix.
Note that D varies when using different backbone CNNs as embedding functions.
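As a concrete illustration, the per-scene averaging and memory writing described above can be sketched in a few lines. This is a minimal NumPy sketch; `build_prototype_memory` and its argument names are our own, and in the paper the feature matrix would come from the trained embedding function f_ϕ:

```python
import numpy as np

def build_prototype_memory(features, labels, num_scenes):
    """Average the embedding of every image that belongs to the same
    scene into one prototype, and stack the S prototypes into an
    S x D external memory (assumes each scene has >= 1 sample)."""
    memory = np.zeros((num_scenes, features.shape[1]))
    for s in range(num_scenes):
        memory[s] = features[labels == s].mean(axis=0)
    return memory
```

For instance, two images of scene 0 with features (1, 1) and (3, 3) yield the prototype (2, 2).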

Multi-head attention-based memory retrieval

Inspired by the successes of the multi-head self-attention mechanism (Vaswani et al., 2017) in natural language processing tasks (Radford et al., 2018, Radford et al., 2019, Devlin et al., 2018, Wolf et al., 2020), we develop a multi-head attention-based memory retrieval module to retrieve scene prototypes from the memory M for a given query multi-scene aerial image. In particular, we first extract the feature representation of the query image through the same embedding function f_ϕ and linearly project it to an L-dimensional query q. Similarly, we transform the external memory M into a key K and a value V, both implemented as MLPs. The channel dimension of the key is L, while that of the value is U. The relevance between q and each scene prototype can be measured by dot-product similarity and a softmax function as follows:

w = softmax(K q).    (4)

The output w is an S-dimensional vector, where each component represents the probability that a specific scene prototype is related to the query image. Subsequently, the retrieved scene prototype is computed by weighted-summing all values with the following equation:

r = w^T V.    (5)

Since the memory retrieval is designed in a multi-head fashion, the final retrieved prototype is reformulated as follows:

r = [r_1; r_2; …; r_H],    (6)

where H denotes the number of heads, and each head h yields a retrieved prototype r_h by transforming the query image representation and M into its own query q_h, key K_h, and value V_h. Eventually, the output r is fed into a fully-connected layer followed by a sigmoid function for inferring the presence of each aerial scene.
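The retrieval step can be illustrated with a single attention head. In this minimal NumPy sketch, the query, key, and value transforms are plain linear projections `Wq`, `Wk`, `Wv` rather than the MLPs used in the paper, and all names are our own:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def retrieve(query_feat, memory, Wq, Wk, Wv):
    """One attention head: project the query image feature to an
    L-dim query, the S x D memory to S x L keys and S x U values,
    score each prototype by dot product + softmax, and return the
    relevance-weighted sum of the values (the retrieved prototype)."""
    q = Wq @ query_feat   # (L,)
    K = memory @ Wk.T     # (S, L)
    V = memory @ Wv.T     # (S, U)
    w = softmax(K @ q)    # (S,) relevance probabilities, sums to 1
    return w @ V          # (U,)
```

In the multi-head case, H such heads run with separate projections and their outputs are concatenated before the final fully-connected layer and sigmoid.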

Implementation details

For a comprehensive assessment of our PM-Net, we implement the embedding function f_ϕ with various backbone CNNs. Specifically, we conduct experiments on four CNN architectures, detailed as follows.

PM-VGGNet: f_ϕ is built on VGG-16 (Simonyan and Zisserman, 2014) by replacing all layers after the last max-pooling layer in block5 with a global average pooling layer.

PM-Inception-V3: Inception-V3 (Szegedy et al., 2015) is utilized, and the layers up to and including the global average pooling layer are employed as f_ϕ.

PM-ResNet: We modify ResNet-50 (He et al., 2016) by discarding the layers after the global average pooling layer and using the remaining layers as f_ϕ.

PM-NASNet: The backbone of f_ϕ is mobile NASNet (Zoph and Le, 2017). As with the modification in PM-ResNet, only the layers up to and including the global average pooling layer are used.

In our experiments, we train the original deep CNNs on single-scene aerial image datasets and then take them as the embedding function following the aforementioned points. Subsequently, we yield scene prototypes and concatenate all of them along the first axis to form M.

Experiments and discussion

In this section, we introduce a newly produced multi-scene aerial image dataset, the MAI dataset, and two single-scene datasets, i.e., the UCM and AID datasets, which are used in our experiments. Then, network configurations and training schemes are detailed in Section 4.2. The remaining subsections discuss and analyze the performance of the proposed network thoroughly.

Dataset description and configuration

MAI dataset

To facilitate the progress of aerial scene interpretation in the wild, we produce a new dataset, the MAI dataset, by collecting and labeling 3923 large-scale images from Google Earth imagery covering the United States, Germany, and France. The size of each image is , and spatial resolutions vary from 0.3 m/pixel to 0.6 m/pixel. After capturing aerial images, we manually assign each image multiple scene-level labels from a total of 24 scene categories, including apron, baseball, beach, commercial, farmland, woodland, parking lot, port, residential, river, storage tanks, sea, bridge, lake, park, roundabout, soccer field, stadium, train station, works, golf course, runway, sparse shrub, and tennis court. Notably, OSM data associated with the collected images cannot be directly employed as reference owing to the problems presented in Section 1. Such a labeling procedure is extremely time- and labor-consuming: annotating one image costs around 20 s, which is ten times more than labeling a single-scene image. Several example multi-scene images are shown in Fig. 4. The numbers of aerial images related to the various scenes are reported in Fig. 5. Among existing datasets, BigEarthNet (Sumbul et al., 2019) is one of the most relevant; it consists of Sentinel-2 images acquired over the European Union with spatial resolutions ranging from 10 m/pixel to 60 m/pixel. Spatial sizes of images vary from pixels to pixels, and each is assigned multiple land-cover labels provided by the CORINE Land Cover map. Compared to BigEarthNet, our dataset is characterized by its high-resolution large-scale aerial images and worldwide coverage.
Fig. 4

Example images in our MAI dataset. Each image is pixels, and their spatial resolutions range from 0.3 m/pixel to 0.6 m/pixel. We list their scene-level labels here: (a) farmland and residential; (b) baseball, woodland, parking lot, and tennis court; (c) commercial, parking lot, and residential; (d) woodland, residential, river, and runway; (e) river and storage tanks; (f) beach, woodland, residential, and sea; (g) farmland, woodland, and residential; (h) apron and runway; (i) baseball field, parking lot, residential, bridge, and soccer field.

Fig. 5

Statistics of the proposed MAI dataset for multi-scene classification in single aerial images.


UCM dataset

The UCM dataset (Yang and Newsam, 2010) is a commonly used single-scene aerial image dataset produced by Yang and Newsam from the University of California, Merced. This dataset comprises 2100 aerial images cropped from aerial ortho imagery provided by the United States Geological Survey (USGS) National Map, and the spatial resolution of the collected images is one foot. The size of each image is 256 × 256 pixels, and all image samples are classified into 21 scene-level classes: overpass, forest, beach, baseball diamond, building, airplane, freeway, intersection, harbor, golf course, runway, agricultural, storage tank, mobile home park, medium residential, sparse residential, chaparral, river, tennis courts, dense residential, and parking lot. The number of aerial images collected for each scene is 100, and several example images are shown in Fig. 6. To learn scene prototypes from these single-scene images, we randomly choose 80% of image samples per scene category to train and validate the embedding function and utilize the rest for testing.
Fig. 6

Example single-scene aerial categories in the UCM dataset: (a) agricultural, (b) dense residential, (c) forest, (d) storage tanks, (e) baseball field, (f) parking lot, (g) river, (h) runway, (i) golf course, and (j) tennis court.


AID dataset

The AID dataset (Xia et al., 2017) is another popular single-scene aerial image dataset, which consists of 10,000 aerial images with a size of 600 × 600 pixels. These images are captured from Google Earth imagery taken over China, the United States, England, France, Italy, Japan, and Germany, and the spatial resolutions of the collected images vary from 0.5 m/pixel to 8 m/pixel. In total, there are 30 scene categories, including viaduct, river, baseball field, center, farmland, railway station, meadow, bare land, storage tanks, beach, mountain, park, bridge, playground, church, commercial, desert, forest, parking, industrial, square, sparse residential, pond, medium residential, port, resort, airport, school, stadium, and dense residential. The number of images in different classes ranges from 220 to 420. Similar to the data split of the UCM dataset, 20% of images are chosen from each scene as test samples, while the remaining images are utilized to train and validate the embedding function. Some example images of the AID dataset are exhibited in Fig. 7.
Fig. 7

Example single-scene aerial categories in the AID dataset: (a) beach, (b) baseball field, (c) airport, (d) railway station, (e) stadium, (f) park, (g) playground, (h) bridge, (i) viaduct, and (j) commercial.


Dataset configuration

In order to widely evaluate the performance of our method, we utilize two dataset configurations, UCM2MAI and AID2MAI, based on the common scene categories shared by UCM/AID and MAI. Specifically, the UCM2MAI configuration consists of 1600 single-scene aerial images from the UCM dataset and 1649 multi-scene images from our MAI dataset. The 16 aerial scenes included in both datasets are considered in UCM2MAI, and the numbers of their associated images are listed in Table 1. Besides, the AID2MAI configuration is composed of 7050 and 3239 aerial images from the AID and MAI datasets, respectively. Here, 20 common scene categories are taken into consideration, and the number of images related to each scene is presented in Table 1. Although such configurations might limit the number of recognizable scene classes, we believe this limitation can be addressed by collecting more single-scene images through crawling OSM data and by producing large-scale multi-scene aerial image datasets. We select only 90 and 120 multi-scene aerial images from UCM2MAI and AID2MAI as training instances, respectively, and test networks on the remaining multi-scene images. For rare scenes (e.g., port and train station), we select all associated training images, while for common scenes, we randomly select several of their training samples. It is noteworthy that we yield the scene prototype of residential by taking an average of the high-level representations of aerial images belonging to the scenes medium residential and dense residential. Besides, although the UCM and AID datasets do not contain images for sea, their images for beach often comprise both sea and beach (cf. (c) in Fig. 7). Therefore, we make use of training samples labeled as beach to yield the prototype representation of sea.
Table 1

The number of images associated with each scene.


                     UCM2MAI           AID2MAI
Scene category       UCM      MAI      AID      MAI
apron                100      194      360      54
baseball field       100      75       220      235
beach                100      94       400      130
commercial           100      607      350      1391
farmland             100      680      370      983
woodland             100      762      250      1312
parking lot          100      708      390      1777
port                 100      3        380      9
residential          200      958      700      2082
river                100      209      410      686
storage tanks        100      89       360      193
sea                  100*     51       400*     59
golf course          100      75       -        -
runway               100      230      -        -
sparse shrub         100      336      -        -
tennis court         100      114      -        -

All                  1600     1649     7050     3239

* indicates that the number of images is not counted in total amounts, as the scene prototype of beach and sea are learned from the same images.


Training details

The training procedure consists of two phases: 1) learning the embedding function on large quantities of single-scene aerial images and 2) training the entire PM-Net on a limited number of multi-scene images in an end-to-end manner. Different training strategies are applied to each phase, detailed as follows. In the first training phase, the embedding function f_ϕ is initialized with the corresponding deep CNN pretrained on ImageNet (Deng et al., 2009), and the weights of the MLP classifier are initialized by a Glorot uniform initializer. Eq. (1) is employed as the loss of the network, and Nesterov Adam (Dozat, 2015) is chosen as the optimizer, with its parameters set to the recommended values. The learning rate is decayed when the validation loss fails to decrease for two epochs. In the second training phase, we initialize f_ϕ with the parameters learned in the previous stage and employ the Glorot uniform initializer to initialize all weights in the memory retrieval module and the last fully-connected layer. L and U are set to the same value of 256, and the number of heads is set to 20. Notably, all weights are trainable, and the embedding function is tuned during the second training phase as well. Differences between the two training phases are summarized in Table 2. Multiple scene-level labels are encoded as multi-hot vectors, where 0 indicates the absence of the corresponding scene and 1 indicates its presence. Accordingly, the loss is defined as binary cross-entropy. The optimizer is the same as in the first training phase, but here we use a relatively large learning rate. The network is implemented in TensorFlow and trained on one NVIDIA Tesla P100 16 GB GPU for 100 epochs. The batch size is set to 32 for both training phases.
Table 2

Differences between two training phases.

Phase   Learnable module             Pretraining f_ϕ    Fine-tuning    Memory
1       prototype learning module    ImageNet           UCM/AID        updated
2       memory retrieval module      UCM/AID            MAI            frozen
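The multi-hot label encoding and binary cross-entropy loss used in the second training phase can be sketched as follows. This is a minimal NumPy sketch with our own function names; the actual training would use TensorFlow's built-in loss:

```python
import numpy as np

def multi_hot(present_scenes, num_scenes):
    """Encode the set of scene indices present in an image as a
    multi-hot vector: 1 for present scenes, 0 for absent ones."""
    y = np.zeros(num_scenes)
    y[list(present_scenes)] = 1.0
    return y

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all scene categories,
    with probabilities clipped away from 0 and 1 for stability."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))
```

For instance, an image containing scenes 0 and 2 out of 4 categories is encoded as (1, 0, 1, 0), and a near-perfect prediction yields a loss close to zero.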

Evaluation metrics

For the purpose of evaluating network performance quantitatively, we utilize example-based F1 (Wu and Zhou, 2016) and F2 (Van Rijsbergen, 1979) scores as evaluation metrics, calculated as:

F_β = (1 + β²) · p_e · r_e / (β² · p_e + r_e),    (7)

where p_e and r_e denote example-based precision and recall (Tsoumakas and Vlahavas, 2007). We calculate p_e and r_e as follows:

p_e = N_tp / (N_tp + N_fp),    r_e = N_tp / (N_tp + N_fn),    (8)

where N_fn, N_fp, and N_tp represent the numbers of false negatives, false positives, and true positives in an example, respectively. In our case, an example is a multi-scene aerial image, and by averaging the scores of all examples in the test set, the mean example-based F scores, precision, and recall can be computed. In addition to example-based metrics, we also calculate label-based precision p_l and recall r_l with Eq. (8), but with N_fn, N_fp, and N_tp replaced by the numbers of false negatives, false positives, and true positives with respect to each scene category. The mean p_l and r_l can then be calculated. Note that the principal indices are the mean F1 and F2 scores.
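Under these definitions, the mean example-based scores can be computed as follows. This is a minimal NumPy sketch; the function name and the small epsilon guarding against empty predictions are our own additions:

```python
import numpy as np

def example_based_scores(y_true, y_pred, beta=1.0):
    """Mean example-based F-beta, precision, and recall over a test
    set, given multi-hot ground-truth and prediction matrices of
    shape (num_images, num_scenes)."""
    eps = 1e-12  # avoids division by zero for empty predictions
    tp = np.sum(y_true * y_pred, axis=1)        # true positives per image
    fp = np.sum((1 - y_true) * y_pred, axis=1)  # false positives per image
    fn = np.sum(y_true * (1 - y_pred), axis=1)  # false negatives per image
    p = tp / (tp + fp + eps)
    r = tp / (tp + fn + eps)
    f = (1 + beta**2) * p * r / (beta**2 * p + r + eps)
    return f.mean(), p.mean(), r.mean()
```

For example, an image with ground truth {scene 0, scene 1} and prediction {scene 0} gives precision 1.0, recall 0.5, and F1 ≈ 0.67.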

Results on UCM2MAI

For a comprehensive evaluation, we compare the proposed PM-Net with two baselines, CNN* and CNN. The former is initialized with parameters pretrained on ImageNet, and the latter is pretrained on single-scene datasets. Besides, we compare our network with a memory network, Mem-N2N (Sukhbaatar et al., 2015). Since Mem-N2N was proposed for the question answering task, we adapt it to ours by replacing its inputs, i.e., embeddings of questions and statements, with query image representations and scene prototypes, respectively. More specifically, we feed the query image to a CNN backbone and take its output as the input of Mem-N2N. Scene prototypes are stored in the memory of Mem-N2N and retrieved according to the query representation. The initialization is the same as that of our network, and the entire Mem-N2N is trained in an end-to-end manner. Various backbones of embedding functions are tested, and quantitative results are reported in Table 3. Besides, we also compare with a multi-attention driven multi-label classification network, termed K-Branch CNN (Sumbul and Demir, 2019). K-Branch CNN samples images at K spatial resolutions and extracts their features with separate branches. Afterwards, a bidirectional recurrent neural network is employed to encode their relationships for inferring multiple labels. In our experiments, K is set to its default of 3, and the input sizes of the three branches follow the default configuration. Here we analyze results from the following three perspectives.
Table 3

Numerical Results on UCM2MAI (%).

| Model | m. F1 | m. F2 | m. pe | m. re | m. pl | m. rl |
|-------|-------|-------|-------|-------|-------|-------|
| VGGNet* (Simonyan and Zisserman, 2014) | 32.16 | 32.79 | 35.08 | 34.35 | 21.74 | 22.57 |
| VGGNet (Simonyan and Zisserman, 2014) | 51.42 | 49.04 | 62.00 | 48.38 | 36.80 | 27.44 |
| Mem-N2N-VGGNet (Sukhbaatar et al., 2015) | 52.16 | 50.93 | 57.26 | 50.73 | 20.79 | 22.58 |
| K-Branch CNN (Sumbul and Demir, 2019) | 47.04 | 43.15 | 64.57 | 41.83 | 37.93 | 22.28 |
| proposed PM-VGGNet | 54.42 | 51.16 | 67.35 | 49.95 | 47.24 | 26.79 |
| Inception-V3* (Szegedy et al., 2015) | 48.03 | 44.37 | 62.22 | 42.80 | 47.36 | 20.43 |
| Inception-V3 (Szegedy et al., 2015) | 53.96 | 51.28 | 65.47 | 50.49 | 51.03 | 32.88 |
| Mem-N2N-Inception-V3 (Sukhbaatar et al., 2015) | 56.06 | 55.27 | 62.95 | 55.92 | 47.90 | 30.48 |
| proposed PM-Inception-V3 | 58.56 | 58.06 | 64.17 | 58.73 | 46.44 | 26.47 |
| ResNet* (He et al., 2016) | 48.36 | 45.00 | 63.90 | 43.84 | 53.63 | 28.35 |
| ResNet (He et al., 2016) | 51.39 | 48.31 | 65.33 | 47.37 | 51.89 | 30.54 |
| Mem-N2N-ResNet (Sukhbaatar et al., 2015) | 54.31 | 51.45 | 63.97 | 50.31 | 44.33 | 24.58 |
| proposed PM-ResNet | 56.89 | 54.11 | 69.85 | 53.38 | 55.93 | 29.76 |
| NASNet* (Zoph and Le, 2017) | 43.64 | 39.94 | 58.56 | 38.39 | 46.01 | 19.69 |
| NASNet (Zoph and Le, 2017) | 52.03 | 49.43 | 64.24 | 48.75 | 49.99 | 33.75 |
| Mem-N2N-NASNet (Sukhbaatar et al., 2015) | 55.17 | 53.05 | 64.71 | 52.65 | 49.60 | 29.14 |
| proposed PM-NASNet | 60.13 | 59.57 | 67.04 | 60.42 | 58.60 | 35.04 |

CNN* is initialized with weights pretrained on ImageNet.

CNN, Mem-N2N, and PM-Net are initialized with parameters pretrained on the UCM dataset.

m. F1 and m. F2 indicate the mean F1 and F2 scores.

m. pe and m. re indicate mean example-based precision and recall.

m. pl and m. rl indicate mean label-based precision and recall.


The effectiveness of learnt single-scene prototypes

To demonstrate the effectiveness of the prototype-inhabiting external memory, here we focus on comparisons between PM-Net and standard CNNs. In Table 3, PM-VGGNet increases the mean F1 and F2 scores by 3.00% and 2.12%, respectively, with respect to VGGNet, and PM-ResNet obtains increments of 5.50% and 5.80% in the mean F1 and F2 scores compared to ResNet. Besides, it is interesting to observe that PM-NASNet achieves not only the best mean F1 and F2 scores (60.13% and 59.57%) but also relatively high example-based precision and recall in comparison with other competitors. This demonstrates that employing NASNet as the embedding function can enhance the robustness of PM-Net. Comparing PM-Inception-V3 with Inception-V3 shows that the external memory module contributes improvements of 4.60% and 6.78% in the mean F1 and F2 scores, respectively. To summarize, memorizing and leveraging scene prototypes learned from huge quantities of single-scene images can improve the performance of the network in multi-label scene recognition when limited training samples are available. For deeper insight, we further conduct ablation studies on the prototype modality and the embedding function. Single- vs. multi-prototype representations. We note that images collected over different countries show high intra-class variability, and we therefore wonder whether learning multi-prototype scene representations could improve the effectiveness of PM-Net. Specifically, instead of yielding scene prototypes via Eq. 2, we partition the representations of single-scene aerial images belonging to the same scene into several clusters and take the cluster centers as the multi-prototype representation of each scene. In our experiments, we test two clustering methods, K-Means (Lloyd, 1982) and Agglomerative clustering (Zepeda-Mendoza and Resendis-Antonio, 2013), with PM-ResNet on both UCM2MAI and AID2MAI, and results are shown in Fig. 9.
We can see that the performance of PM-ResNet decreases as the number of cluster centers increases, whether the K-Means or the Agglomerative clustering algorithm is used. An explanation could be that there are no obvious subclusters within each scene category (cf. Fig. 13), and thus PM-Net does not benefit from fine-grained multi-prototype representations.
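The single- vs. multi-prototype ablation can be sketched as follows (a NumPy toy example; the plain k-means loop is a stand-in for the K-Means and Agglomerative implementations actually used, and the embeddings are random):

```python
import numpy as np

rng = np.random.default_rng(0)

def single_prototype(feats):
    """Single-prototype representation: the mean embedding of all
    single-scene images of one category (in the spirit of Eq. 2)."""
    return feats.mean(axis=0)

def multi_prototypes(feats, k, iters=20):
    """Multi-prototype representation: k cluster centers from a plain
    k-means over one category's embeddings."""
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each embedding to its nearest center, then recompute means
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = feats[assign == j].mean(axis=0)
    return centers

feats = rng.normal(size=(100, 8))      # toy embeddings of one scene category
proto = single_prototype(feats)        # one prototype per scene
protos = multi_prototypes(feats, k=3)  # k prototypes per scene
```

In the ablation, each row of `protos` would occupy one memory slot instead of the single class-mean prototype.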
Fig. 9

The influence of the number of cluster centers on both dataset configurations. K-Means and Agglomerative clustering algorithms are tested with PM-ResNet on UCM2MAI and AID2MAI, respectively.

Fig. 13

T-SNE visualization of image representations and scene prototypes learned by VGGNet on (a) UCM and (b) AID datasets, respectively. Dots in the same color represent features of images belonging to the same scene, and stars denote scene prototypes.

Frozen vs. trainable embedding function. The embedding function plays a key role in both scene prototype learning and memory retrieval. In the former, we train the embedding function on single-scene images, while in the latter, the function is fine-tuned on multi-scene images. To explore the effectiveness of fine-tuning, we conduct experiments with the embedding function frozen while learning the memory retrieval module. Comparisons between PM-Net learned with frozen and trainable embedding functions are shown in Fig. 10. It can be observed that PM-Net with a trainable embedding function performs better on both the UCM2MAI and AID2MAI configurations. The reason could be that single- and multi-scene images come from different sources, and fine-tuning can narrow the gap between them.
Fig. 10

Comparisons between frozen and trainable embedding functions on (a) UCM2MAI and (b) AID2MAI, respectively.

Triplet vs. cross-entropy loss. The triplet loss (Schroff et al., 2015) is known for learning discriminative representations by minimizing distances between embeddings of the same class while pushing apart those of different classes. To study its performance on our task, we train the embedding function by replacing Eq. 1 with the following equation:

L_tri = max( ||fϕ(x) − fϕ(x⁺)||₂² − ||fϕ(x) − fϕ(x⁻)||₂² + α, 0 ),

where x⁺ and x⁻ denote positive and negative samples, i.e., images belonging to the same and to different classes, respectively, and the margin α is set to its default value. The trained embedding function is then utilized to extract scene prototypes and initialize fϕ in the phase of learning the memory retrieval module, while all other setups remain the same. We compare the performance of PM-Net using embedding functions trained with the different loss functions in Fig. 11. It can be seen that training embedding functions with the triplet loss degrades the network performance. This can be attributed to the limited numbers of positive and negative samples in each batch, which can lead to a local optimum. More specifically, the batch size is 32, while the numbers of scenes are 16 and 20 in UCM2MAI and AID2MAI, respectively. Thus, it is highly probable that only some of the scenes are included in one batch, and comprehensively modeling relations between embeddings of samples from all scenes is infeasible. This also explains the larger performance decay on UCM2MAI compared to AID2MAI.
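The triplet loss above can be sketched as follows (a minimal NumPy version; the margin value here is illustrative, not necessarily the paper's default):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss in the sense of Schroff et al. (2015): the squared
    anchor-positive distance should be smaller than the anchor-negative
    distance by at least `margin`; otherwise a penalty is incurred."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0))

a = np.zeros(4)                        # anchor embedding
p = np.zeros(4)                        # positive: same scene as the anchor
n = np.ones(4)                         # negative: a different scene
well_separated = triplet_loss(a, p, n) # constraint satisfied -> 0.0
collapsed = triplet_loss(a, p, p)      # margin violated -> margin itself
```

Because each term involves one positive and one negative pair, a batch that misses some scene categories simply cannot generate the corresponding constraints, which matches the local-optimum argument above.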
Fig. 11

Comparisons of different loss functions, the triplet loss vs. the cross-entropy loss, on (a) UCM2MAI and (b) AID2MAI, respectively.


The effectiveness of our multi-head attention-based memory retrieval module

As a key component of the proposed PM-Net, the multi-head attention-based memory retrieval module is designed to retrieve scene prototypes from the external memory, and we evaluate its effectiveness by comparing PM-Net with Mem-N2N. As shown in Table 3, PM-Net outperforms Mem-N2N with all embedding functions. Specifically, PM-VGGNet increases the mean F1 and F2 scores by 2.26% and 0.23%, respectively, compared to Mem-N2N-VGGNet. With ResNet as the embedding function, the improvement reaches 2.58% in the mean F1 score. Besides, the highest increments in the mean F1 and F2 scores, 4.96% and 6.52%, are achieved by PM-NASNet. These observations demonstrate that our memory retrieval module plays a key role in inferring multiple aerial scenes. An explanation could be that, compared to the memory reader in Mem-N2N, our module comprises multiple heads, each of which focuses on encoding a specific relevance between the query image and the various scene prototypes. In this way, more comprehensive scene-related memories can be used for inferring multiple scene labels. Moreover, we analyze the influence of the number of heads in the memory retrieval module. Fig. 8 shows mean F1 scores achieved by PM-Net with varying head numbers on both UCM2MAI and AID2MAI. We observe that the network performance first improves with an increasing number of heads and then gradually decreases once the number exceeds 20.
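The retrieval module described above can be sketched as follows (a NumPy toy in which random projection matrices stand in for learned weights; the head count and dimensions mirror the settings reported earlier, but everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_retrieval(q, memory, n_heads=20, dim=64):
    """Sketch of multi-head attention-based memory retrieval: each head
    projects the query and the stored scene prototypes, computes a softmax
    relevance per prototype, and returns a relevance-weighted sum of
    prototypes. Random projections stand in for learned weights."""
    outputs = []
    for _ in range(n_heads):
        Wq = rng.normal(scale=0.1, size=(q.shape[-1], dim))
        Wm = rng.normal(scale=0.1, size=(memory.shape[-1], dim))
        rel = softmax(memory @ Wm @ (q @ Wq))  # one relevance per prototype
        outputs.append(rel @ memory)           # aggregate scene prototypes
    # concatenated head outputs feed the final fully-connected classifier
    return np.concatenate(outputs)

memory = rng.normal(size=(16, 256))  # 16 scene prototypes, L = 256
q = rng.normal(size=256)             # query image representation, U = 256
out = multi_head_retrieval(q, memory)
```

Each head attends to the memory with its own projection, so different heads can specialize in different query-prototype relevances, which is the intuition behind the gains over the single memory reader of Mem-N2N.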
Fig. 8

The influence of the number of heads on both dataset configurations. The two dotted lines represent mean F1 scores on UCM2MAI and AID2MAI, and the remaining line indicates their average.

Moreover, we also conduct experiments on directly utilizing relevances to infer multiple scene labels. Specifically, we set the number of heads to 1 and replace the softmax activation in Eq. 4 with the sigmoid function. Relevances between the query image and scene prototypes can then be interpreted as presence scores of each scene. We compare this variant with our memory retrieval module on various backbones, and results are shown in Fig. 12. We can see that utilizing relevances as weights for aggregating scene prototypes leads to higher network performance.
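The single-head sigmoid variant can be sketched as follows (a toy NumPy illustration with hypothetical prototypes and no learned projections):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relevance_as_prediction(q, memory, threshold=0.5):
    """Ablation sketch: one head, with softmax replaced by a sigmoid, so
    each query-prototype relevance is read directly as the presence score
    of that scene (projection weights are omitted for brevity)."""
    rel = sigmoid(memory @ q)        # one score per stored scene prototype
    return (rel > threshold).astype(int)

memory = np.array([[2.0, 0.0],       # prototype of a scene aligned with q
                   [-2.0, 0.0]])     # prototype of a scene opposed to q
q = np.array([1.0, 0.0])             # toy query representation
pred = relevance_as_prediction(q, memory)   # -> [1, 0]
```

Here the relevances themselves are the predictions; in the full module they instead weight the prototypes, whose aggregate is classified by a fully-connected layer.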
Fig. 12

Comparisons between taking relevances as predictions and as prototype weights on (a) UCM2MAI and (b) AID2MAI, respectively. The two sets of bars represent the performance of PM-Net making predictions from relevances and from aggregated scene prototypes, respectively.


The benefit of exploiting single-scene training samples

Let’s start with the conclusion: exploiting single-scene images contributes significantly to our task. To analyze this benefit, we mainly compare CNNs* and CNNs. It can be observed that, even with identical network architectures, the performance of CNN is superior to that of CNN*. More specifically, VGGNet achieves the highest improvement in the mean F1 score, 19.26%, in comparison with VGGNet*. NASNet shows higher performance in all metrics compared to NASNet*, while the other CNNs fall behind their corresponding CNNs* only in the mean example-based precision. Besides, we visualize features of single-scene images learned by VGGNet on the UCM and AID datasets via t-SNE. As shown in Fig. 13, the extracted features are discriminative and separable in the embedding space, which demonstrates the effectiveness of learning the embedding function on single-scene aerial image datasets. To summarize, besides enabling the learning of scene prototypes, single-scene training samples also benefit multi-label scene interpretation by pretraining the CNNs that are used to initialize the embedding function. We exhibit several example predictions of PM-ResNet trained on UCM2MAI in Table 4. False positives are marked in red, while false negatives are in blue. As shown in the fourth example of the top row, PM-Net can accurately perceive aerial scenes even in complex contexts, but unseen scene appearances (e.g., an apron and runway in snow) can influence its predictions.
Table 4

Example images and predictions on UCM2MAI.


Results on AID2MAI

Table 5 reports numerical results on the AID2MAI configuration. It can be seen that PM-Net is superior to all competitors in the mean F1 score. Compared to Mem-N2N-VGGNet, the proposed PM-VGGNet increases the mean F1 and F2 scores by 8.45% and 8.27%, respectively, while the improvements reach 6.07% and 0.64% in comparison with VGGNet. PM-ResNet achieves the best mean F1 score and example-based precision, 57.42% and 70.62%, respectively. With NASNet as the backbone, exploiting the proposed memory retrieval module contributes increments of 1.03% and 1.71% in the mean F1 and F2 scores compared to directly training NASNet on a small number of multi-scene samples.
Table 5

Numerical results on AID2MAI (%).

| Model | m. F1 | m. F2 | m. pe | m. re | m. pl | m. rl |
|-------|-------|-------|-------|-------|-------|-------|
| VGGNet* (Simonyan and Zisserman, 2014) | 41.57 | 36.36 | 64.02 | 34.04 | 25.98 | 12.80 |
| VGGNet (Simonyan and Zisserman, 2014) | 48.30 | 50.80 | 48.53 | 54.19 | 32.89 | 44.75 |
| Mem-N2N-VGGNet (Sukhbaatar et al., 2015) | 45.92 | 43.17 | 56.16 | 42.22 | 23.10 | 18.76 |
| K-Branch CNN (Sumbul and Demir, 2019) | 47.67 | 43.88 | 63.84 | 42.37 | 26.53 | 16.15 |
| proposed PM-VGGNet | 54.37 | 51.44 | 65.69 | 50.39 | 48.06 | 22.40 |
| Inception-V3* (Szegedy et al., 2015) | 45.92 | 40.76 | 66.17 | 38.43 | 39.56 | 14.71 |
| Inception-V3 (Szegedy et al., 2015) | 51.81 | 49.44 | 62.91 | 48.93 | 45.26 | 36.32 |
| Mem-N2N-Inception-V3 (Sukhbaatar et al., 2015) | 52.13 | 53.83 | 52.53 | 56.21 | 33.33 | 29.05 |
| proposed PM-Inception-V3 | 53.08 | 49.26 | 69.42 | 47.85 | 48.20 | 24.65 |
| ResNet* (He et al., 2016) | 50.06 | 46.88 | 64.32 | 45.98 | 39.48 | 22.34 |
| ResNet (He et al., 2016) | 54.74 | 52.76 | 65.54 | 52.62 | 47.54 | 40.23 |
| Mem-N2N-ResNet (Sukhbaatar et al., 2015) | 53.26 | 60.41 | 46.15 | 68.07 | 23.75 | 30.21 |
| proposed PM-ResNet | 57.42 | 54.34 | 70.62 | 53.33 | 55.34 | 29.55 |
| NASNet* (Zoph and Le, 2017) | 47.53 | 42.93 | 65.57 | 40.94 | 34.79 | 16.42 |
| NASNet (Zoph and Le, 2017) | 53.08 | 50.68 | 64.33 | 50.17 | 46.68 | 37.43 |
| Mem-N2N-NASNet (Sukhbaatar et al., 2015) | 39.27 | 40.72 | 38.52 | 42.38 | 20.03 | 20.41 |
| proposed PM-NASNet | 54.11 | 52.39 | 64.03 | 52.30 | 43.16 | 33.99 |

CNN, Mem-N2N, and PM-Net are initialized with parameters pretrained on the AID dataset.

We present some example predictions of PM-ResNet in Table 6. As shown in the top row, PM-ResNet, trained with a limited number of annotated multi-scene images, can accurately identify various aerial scenes even when the image context is complicated. The bottom row shows some inaccurate predictions. It can be observed that although bridge and parking lot account for relatively small areas in the last two examples of the top row, the proposed PM-Net successfully detects them. Similar observations can be made in the first and third examples of the bottom row, where residential and parking lot are recognized by our network even though they are located at the corner of the image. In conclusion, the quantitative results illustrate the effectiveness of our network in learning unconstrained multi-scene classification, and the example predictions further demonstrate it.
Table 6

Example images and predictions on AID2MAI.


Conclusion

In this paper, we propose a novel multi-scene recognition network, PM-Net, to tackle both aerial scene classification in the wild and the scarcity of training samples. More specifically, our network consists of three key elements: 1) a prototype learning module for encoding prototype representations of various aerial scenes, 2) a prototype-inhabiting external memory for storing high-level scene prototypes, and 3) a multi-head attention-based memory retrieval module for retrieving associated scene prototypes from the external memory to recognize multiple scenes in a query aerial image. To facilitate progress in this field as well as to evaluate our method, we propose a new dataset, the MAI dataset, and experiment with two dataset configurations, UCM2MAI and AID2MAI, based on two single-scene aerial image datasets, UCM and AID. In scene prototype learning, we train the embedding function on most of the single-scene images, as we aim to simulate the real-life scenario in which massive single-scene samples can be collected at low cost by resorting to OSM data. To learn memory retrieval, our network is fine-tuned on only around 100 training samples from the MAI dataset. Experimental results on both UCM2MAI and AID2MAI illustrate that learning and memorizing scene prototypes with our PM-Net can significantly improve classification accuracy. The best performance is achieved by employing NASNet and ResNet as the embedding function on UCM2MAI and AID2MAI, respectively, and the best mean F1 score reaches nearly 0.6. We hope that our work can open a new door for further research on a more complicated and challenging task, multi-scene interpretation in single images. Looking into the future, we intend to apply the proposed network to weakly supervised multi-scene recognition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.