
A hybrid generative adversarial network for weakly-supervised cloud detection in multispectral images.

Jun Li¹, Zhaocong Wu²,³,⁴, Qinghong Sheng¹, Bo Wang¹, Zhongwen Hu⁵, Shaobo Zheng², Gustau Camps-Valls⁶, Matthieu Molinier⁷.

Abstract

Cloud detection is a crucial step in the optical satellite image processing pipeline for Earth observation. Clouds in optical remote sensing images seriously affect the visibility of the background and greatly reduce the usability of images for land applications. Traditional methods based on thresholding, multi-temporal or multi-spectral information are often specific to a particular satellite sensor. Convolutional Neural Networks for cloud detection often require labeled cloud masks for training that are very time-consuming and expensive to obtain. To overcome these challenges, this paper presents a hybrid cloud detection method based on the synergistic combination of generative adversarial networks (GAN) and a physics-based cloud distortion model (CDM). The proposed weakly-supervised GAN-CDM method (available online https://github.com/Neooolee/GANCDM) only requires patch-level labels for training, and can produce cloud masks at pixel-level in both training and testing stages. GAN-CDM is trained on a new globally distributed Landsat 8 dataset (WHUL8-CDb, available online doi:https://doi.org/10.5281/zenodo.6420027) including image blocks and corresponding block-level labels. Experimental results show that the proposed GAN-CDM method trained on Landsat 8 image blocks achieves much higher cloud detection accuracy than baseline deep learning-based methods, not only in Landsat 8 images (L8 Biome dataset, 90.20% versus 72.09%) but also in Sentinel-2 images ("S2 Cloud Mask Catalogue" dataset, 92.54% versus 77.00%). This suggests that the proposed method provides accurate cloud detection in Landsat images, has good transferability to Sentinel-2 images, and can quickly be adapted for different optical satellite sensors.


Keywords: Cloud detection; Cloud distortion model; Deep learning; Generative adversarial networks (GAN); Remote sensing

Year: 2022 PMID: 36193118 PMCID: PMC9483037 DOI: 10.1016/j.rse.2022.113197

Source DB: PubMed Journal: Remote Sens Environ ISSN: 0034-4257 Impact factor: 13.850

Introduction

Over the last decades, remote sensing satellites have collected a large amount of Earth Observation data and been used for global environment monitoring (Li et al., 1998; Roy et al., 2014; Woodcock et al., 2001; Zhu et al., 2020). However, about 67% of the Earth's surface is covered by clouds most of the time (King et al., 2013). Clouds block not only the transmission of solar radiation but also the reflected signal from the land surface which greatly reduces the availability of optical remote sensing images (Fisher, 2014). The signals of ground objects are completely blocked due to the low transmittance of thick clouds, while their high reflectance can easily be confused with highlight areas such as bare land or buildings. Although signals from ground objects can partially transmit thin clouds and reach the satellite sensors, image quality is typically deteriorated. Therefore, cloud detection is a very important step in the exploitation of remote sensing images to monitor our planet. Although humans can label clouds successfully, this process is very time-consuming, tedious and expensive, especially in the current scenario of the Big Earth data deluge. Many methods for cloud detection in remote sensing images have been proposed in recent decades (Amato et al., 2008; Mitchell et al., 1977; Scaramuzza et al., 2012; Wind et al., 2010) to reduce the human labor costs. Generally, the most used traditional methods can be divided into two types: single-image rule-based methods and multitemporal change detection-based methods (Gómez-Chova et al., 2017). Rule-based methods are widely used to produce basic masks, and can distinguish clouds from clear-sky pixels by employing the spectral differences between dark land and clouds. (Luo et al., 2008) applied predefined thresholds to specific bands for cloud detection in MODIS images. The threshold for the corresponding band was set according to the characteristics of the cloud and land surface in it. 
An algorithm based on the dynamic threshold for cloud detection in MODIS and Landsat 8 images was proposed in (Wei et al., 2016); this method selects visible-to-NIR bands to separate land surfaces from clouds, and uses the short-wave infrared bands to distinguish clouds from snow/ice. Function of mask (Fmask) (Zhu and Woodcock, 2012), which combines threshold and cloud spectral characteristics to detect clouds, has proven to be very effective for cloud detection in different satellite images (Qiu et al., 2017; Zhu et al., 2015). (Frantz et al., 2018) separates clouds from bright surfaces based on parallax effects and improves the accuracy of potential cloud pixels (PCPs) produced by Fmask. By integrating multiple threshold methods, cloud detection in MEdium Resolution Imaging Spectrometer (MERIS) data achieved good results (Mei et al., 2017). (Li et al., 2017) proposed a multi-feature combined (MFC) method that applies thresholding to spectral features to produce basic cloud masks and then combines geometric features with texture features to optimize cloud detection results. In multi-temporal approaches, images from different periods within a certain area are used to calculate the difference between cloudy and clear images (Frantz et al., 2015; Goodwin et al., 2013; Lyapustin et al., 2008; Gómez-Chova et al., 2017). The sudden increase of reflectance in the blue band on a pixel-by-pixel basis in multitemporal images was used to detect clouds, and successfully applied to Formosat-2 and Landsat (Hagolle et al., 2010). (Jin et al., 2013) used spectral differences of blue, shortwave infrared and thermal infrared bands of the two images to detect clouds and cloud shadows in Landsat TM/ETM+ images. (Zhu and Woodcock, 2014) proposed a multi-temporal Mask (TMask) algorithm which first builds a reflectance prediction model using multi-temporal data and then compares the difference between ground truth and predicted reflectance to detect clouds. 
Combining threshold and multitemporal algorithms, (Han et al., 2014) used the threshold method to detect thick clouds, then used a modified scale-invariant feature transform method to transform clear reference images in the same regions to the coordinates of the cloud images. (Zhu and Helmer, 2018) proposed an automated cloud detection method in which cloud and cloud shadow indexes of multitemporal remote sensing images are computed and analyzed. (Mateo-García et al., 2018) used long time series Landsat 8 images to produce clear images on Google Earth Engine (GEE) (Gorelick et al., 2017) and treated cloud detection as a change detection task between real images and the generated clear images. Unlike rule-based methods and multitemporal methods, machine learning-based cloud detection algorithms require a lot of training samples (Baetens et al., 2019; Zhu et al., 2019; Pérez-Suay et al., 2018). (Lee et al., 1990) proposed a texture-based neural network with four layers for cloud detection in Landsat MSS images, achieving 96% accuracy for Cirrus clouds. (Scaramuzza et al., 2012) trained a statistical classifier C5.0 on about 8 million pixels randomly selected from 103 training scenes to create standalone decision trees. (Hughes and Hayes, 2014) first constructed a neural network to distinguish clouds, cloud shadows, water bodies, ice/snow and clean background, then a series of spatial information such as the membership values of neighboring pixels was used to solve the problem of mixed pixel information. In (Yuan and Hu, 2015), the image was first segmented into super-pixels, then a bag of words (BOW) model was used to construct compact features from the Scale-invariant feature transform (SIFT) (Lowe, 2004), and last a Support Vector Machine (SVM) was trained on the constructed features to classify the superpixels into cloud regions and cloud-free regions. 
(Zi et al., 2018) used Simple Linear Iterative Clustering (SLIC) to divide images into superpixels, which are then classified into cloud and cloud-free by principal component analysis (PCA) and SVM. (Wei et al., 2020) first trained a Random Forest (RF) model on globally sampled cloud and cloud-free pixels to produce a coarse cloud mask, then employed Super-pixels Extracted via Energy-Driven Sampling (SEEDS) to process the preliminary classification results and obtain the final cloud detection results. The main disadvantages of the above methods are: 1) the thresholds are set based on human experience and professional knowledge, and it is difficult to separate clouds from highlights when the number of bands is limited; 2) multi-temporal methods require time series images containing clear pixels, which are not always available, so they may not work well in frequently clouded regions; 3) the performance of machine learning-based methods relies greatly on hand-crafted features. In recent years, deep learning has rapidly developed and been successfully used in many computer vision tasks such as image classification (He et al., 2016) and semantic segmentation (Chen et al., 2018; Ronneberger et al., 2015). Convolutional Neural Networks (CNNs) have been widely used in remote sensing image processing (Zhu et al., 2017b; Reichstein et al., 2019; Camps-Valls et al., 2021), for applications such as land cover and land use classification (LCLU) (Huang et al., 2018; Luus et al., 2015; Zhang et al., 2019), object detection (Cheng et al., 2016; Deng et al., 2018; Zhong et al., 2018) and hyperspectral image classification (Paoletti et al., 2018; Wu and Prasad, 2018). CNNs have also been applied to cloud detection in remote sensing images (Mateo-Garcia et al., 2017). To study the performance of deep learning methods in cloud detection, (Le Goff et al., 2017) proposed an end-to-end CNN-based method to detect clouds in SPOT6 images. 
(Zhan et al., 2017) designed a fully convolutional neural network combined with a multiscale prediction strategy to distinguish cloud and snow in multispectral remote sensing images. (Xie et al., 2017) proposed a multilevel cloud detection method based on deep learning, first transforming RGB to HIS color space, then clustering the image into super-pixels and finally training a CNN model to predict the probability that each super-pixel belongs to the cloud class. A CNN-based method was proposed for pixel-level classification in remote sensing images (Drönner et al., 2018). (Jeppesen et al., 2019) adopted U-Net architecture (Ronneberger et al., 2015) for cloud detection in Landsat 8 images, training their model with cloud masks produced by Fmask as references, and obtaining results close to those of a model trained on manually labeled cloud masks. A multi-scale convolutional feature fusion (MSCFF) method for cloud detection was proposed in (Li et al., 2019). MSCFF up-samples the feature maps at different levels and then fuses all up-sampled feature maps to produce the final results. It achieved high accuracy on medium and high-resolution images. The transfer learning strategy proposed in (Mateo-García et al., 2020) proved that a deep learning-based cloud detection model can be transferred between satellites that have similar spectral and spatial features. In (Li et al., 2021), a lightweight model was designed for cloud detection in Sentinel-2 images by fusing multi-scale spectral-spatial features (CD-FM3SF). The CD-FM3SF model has three input branches to process bands at different resolutions and three output branches to produce masks at different resolutions. Results showed that low-resolution shortwave infrared bands can improve cloud detection accuracy in snow regions, which are usually among the most challenging scenes. 
Although CNN has been successfully used for cloud detection in remote sensing images and outperformed many traditional methods, most proposed methods adopt an end-to-end training strategy that requires a lot of manually labeled cloud masks for the training procedure, which is time-consuming and expensive to collect. Thus, fully automatic cloud detection is still an operational challenge, and a method that not only presents state-of-the-art performance but would also significantly reduce the time required to label the reference training dataset (e.g. with labels at a block level instead of pixel-level) is highly desirable. Recently, generative adversarial networks (GANs) as an unsupervised deep learning model have shown to be very effective in sample generation. GANs consist of at least two networks in the framework: a generative network and a discriminative network. GANs were designed to produce realistic samples by playing a minimax game (Goodfellow et al., 2014), and have been successfully used for image style transformation with paired images (Isola et al., 2017) or unpaired images (Zhu et al., 2017a). Due to its effectiveness, GAN is one of the most promising methods for unsupervised learning on complex distributions and has also been adopted for many remote sensing applications (Bermudez et al., 2019; Gao et al., 2020; Li et al., 2020b). Cloud detection was treated as a mixed energy separation task in (Zou et al., 2019), in which the cloud component of a cloudy image was first extracted by a generative network and fused with a clear image to obtain a fake cloud image, then the cloud component was used as the reference of the fake cloud image to train a cloud matting network. Based on the theory in (Zhu et al., 2017a), (Wu et al., 2020) designed a cloud detection model that combines the self-attention mechanism and GAN (SAGAN). 
By introducing the attention mechanism into the image translation process, SAGAN does not require cloud masks for training and can automatically detect clouds in the image. In our previous work CR-GAN-PM (Li et al., 2020a), a redefined physical model of cloud distortion was proposed and combined with a GAN framework for thin cloud removal and cloud distortion layer extraction. The GANs in CR-GAN-PM were used to learn the mapping from cloudy to clear features, and to produce pixel-level labels from block-level labels. Using the fact that the cloud and background components in a cloudy image are independent, CR-GAN-PM first separated these two components from the cloudy image by using clear images from other regions as “references”. Then the cloudy image was reconstructed by feeding the two components into the cloud distortion model. Assessing the global consistency of the reconstructed image and the original cloudy image ensures that the cloud components are strongly related to the original cloudy image. Although the CR-GAN-PM algorithm produces an accurate cloud reflectance layer, the corresponding cloud distortion layer contains too much background information, which makes it difficult to run a simple binarization process on the cloud distortion layer to obtain the cloud mask. In addition, the cloud distortion model in CR-GAN-PM did not take the radiative transfer process in thermal bands into consideration. In their recent survey of cloud detection methods, Li et al. (2022) concluded that the combination of physical models and deep learning, as well as the integration of cloud removal into cloud detection, will further improve cloud detection efficiency and facilitate applications of remote sensing images. In our earlier work CR-GAN-PM (Li et al., 2020a), we showed that the combination of a physical model and deep learning improves thin cloud removal. 
In this study, we present a hybrid weakly-supervised cloud detection method combining GANs and a cloud distortion model (GAN-CDM) for cloud detection in remote sensing images. GAN-CDM upgrades CR-GAN-PM and addresses its limitations mainly by (1) outputting only one group of cloud distortion layers no matter how many input channels are used (instead of one cloud thickness component per input band for cloud removal in CR-GAN-PM), to make it more suitable for cloud detection; (2) using the cross-entropy loss to put a high gradient penalty on the correlation between all cloud distortion layers and the restored background layer to better separate clouds and background; (3) introducing a new cloud distortion model for thermal infrared bands; and (4) proposing a new, effective threshold selection strategy to further reduce human labor. After comprehensively intercomparing ten cloud detection algorithms, the Cloud Masking Intercomparison eXercise (CMIX) recently gave some recommendations to improve cloud detection performance (Skakun et al., 2022). One such recommendation is to use cloud optical thickness as a potential metric to resolve the disagreement on the definition of “cloud”, which influences the performance and evaluation of the algorithms. It should be noted that the output reflectance value can be treated as cloud thickness, which can represent the cloud influence at pixel-level. Thus, the performance of GAN-CDM will not be affected by the definition of “cloud”. The CMIX study also suggested that cloud datasets covering different surfaces and cloud types at a global scale are desirable to thoroughly test the performance of the algorithms. Since GAN-CDM is a weakly-supervised cloud detection method that requires block-level labels in training, for which a suitable training dataset was not available, a novel cloud detection dataset WHUL8-CDb with block-level labels was collected, harmonized and used in experiments. 
The WHUL8-CDb dataset contains 500 Landsat 8 images distributed globally, covering the same eight biomes as in the L8 Biome dataset (Foga et al., 2017), and cropped into patches which were then classified as either cloudy or clear. The dataset is released to the community as an open dataset for further model development and intercomparison. In summary, the major contributions of this work are as follows:

Improving upon our previous models, GAN-CDM is proposed for cloud detection. Unlike end-to-end methods, the training of GAN-CDM does not require manually labeled cloud masks. The inputs of the proposed GAN-CDM method are image blocks and corresponding block-level labels (0 for cloudy image, 1 for clear image). Labeling images at block level takes about 0.7% of the time of labeling the same image at pixel-level (cloud mask), which is the main advantage of GAN-CDM over end-to-end methods. This has implications for the way labeling will be done in the future, and promises great advances in supervised algorithms.

All Landsat 8 multispectral bands were taken into consideration in GAN-CDM. GAN-CDM and two similar deep learning-based baseline methods were trained on the WHUL8-CDb dataset, then tested on the independent L8 Biome dataset and the Fmask 4.0 Landsat 8 validation dataset for performance on the same sensor and bands. Experimental results on the two Landsat 8 datasets show that GAN-CDM-based models outperform baseline deep learning-based models when using the same bands as inputs. An ablation experiment demonstrates the effectiveness of the proposed cloud distortion model for thermal infrared bands.

The well-trained GAN-CDM-based models and baseline deep learning-based methods were run on the independent Sentinel-2 Cloud Mask Catalogue (S2 CMC) dataset and the Fmask 4.0 Sentinel-2 validation dataset to analyze their transferability to another optical sensor. Results on the two Sentinel-2 datasets show that GAN-CDM-based models have better transferability than similar deep learning-based baseline methods.

The rest of this paper is organized as follows. The experimental data is introduced in Section 2. In Section 3, the proposed GAN-CDM method is introduced. Section 4 shows the experimental results and compares the performance of GAN-CDM with other cloud detection methods, including discussions. Conclusions and final remarks are given in Section 5.

Data

Training datasets

WHUL8-CDb dataset

Since the proposed GAN-CDM model is a weakly-supervised method, our goals are to use samples labeled at the block level to train a model that can produce pixel-level labels, and to transfer the well-trained model to other sensors that have similar bands and spatial resolution. Landsat 8 imagery was selected as the training data, as it is easy to obtain and has a wide spectral range that covers most bands of many optical satellites. The L8 Biome dataset has been widely used for the accuracy assessment of cloud detection methods (Foga et al., 2017). The training dataset used in this work, called WHUL8-CDb, includes 500 globally distributed Landsat 8 Level-1C images (Fig. 1) collected following the distribution of the L8 Biome dataset in terms of biomes and dominant land cover types. The motivation here is that spectral features of different cloud types are similar to each other in different regions, while the backgrounds are very different. Fig. 1 shows that WHUL8-CDb includes a balanced distribution of surfaces. The acquisition time of images in WHUL8-CDb ranges from 18/11/2013 to 9/3/2021, which ensures a wide range of environmental conditions such as different seasons. These features of WHUL8-CDb can support the development of cloud detection algorithms, as recommended in CMIX (Skakun et al., 2022). To increase the diversity of training samples, we selected images with more background pixels by restricting the cloud percentage to between 0 and 50% when searching for images. To handle all cloud situations, 5 images full of clouds were also collected and included in WHUL8-CDb.

Fig. 1

Global distribution of experimental data. The training data includes the WHUL8-CDb dataset, L8_Abs, S2_Abs and Baetens (Baetens et al., 2019). The testing data includes the L8 Biome, S2 CMC, L8_Fm4v and S2_Fm4v. Background image credits: ESA-CCI-LC (Defourny et al., 2017).

Each training image was clipped into small patches with a non-overlapping 384 × 384 pixel sliding window so that the training samples could be fed into CNN models. In this way, 106,424 patches were generated from 500 Landsat 8 images, then manually classified into two subsets: a subset with cloudy images and a subset with clear images. To make full use of the image information, all patches were rotated by 90°, 180° and 270° to increase the number of training samples. Finally, 425,696 training samples were obtained, split into 233,372 clear samples and 192,324 cloudy samples.
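The patch extraction and rotation steps described above can be illustrated with a minimal sketch. The function names are hypothetical, and dropping partial windows at the image border is an assumption, since the text does not say how remainders are handled. The final lines only re-check the sample counts quoted in the text.

```python
def tile_origins(height, width, size=384):
    """Top-left corners of non-overlapping size x size windows.

    Assumption: windows that would extend past the image border are dropped.
    """
    return [(r, c)
            for r in range(0, height - size + 1, size)
            for c in range(0, width - size + 1, size)]


def rot90(patch):
    """Rotate a 2-D patch (list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*patch)][::-1]


def augment(patch):
    """Return the patch followed by its 90, 180 and 270 degree rotations."""
    out = [patch]
    for _ in range(3):
        out.append(rot90(out[-1]))
    return out


# Consistency check of the counts quoted in the text.
n_patches = 106_424
n_samples = n_patches * 4          # original + three rotations
n_clear, n_cloudy = 233_372, 192_324
```

Under these assumptions, every patch contributes four training samples, which is consistent with the totals reported for WHUL8-CDb.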

Auxiliary datasets

Since the values of cloud reflectance produced by the GAN-CDM method range from 0 to 1, a threshold is needed to obtain a binary cloud mask. The optimal threshold for many methods is usually based on manually labeled cloud masks, which are very time-consuming to produce. The goal of GAN-CDM is to achieve high cloud detection accuracy while reducing human labor in model training. Therefore, a threshold selection based on “absolute” pixels (TSAP) in GAN-CDM was carried out with only coarsely delineated and easy-to-get cloud masks as the auxiliary dataset (Table S1). The threshold selection of GAN-CDM requires a few pixel-level labels; however, not all pixels in an image need to be labeled. Fmask is known to be a very effective cloud detection algorithm for Landsat 4–7, 8 and Sentinel-2 images. So, the latest Fmask 4.0 version was used to produce labeled pixels for GAN-CDM with a high degree of confidence (“absolute” pixels). In the result of Fmask 4.0, the lower the cloud probability, the less likely the pixel is to be cloud. The default cloud probability thresholds of Fmask 4.0 are 17.5% and 20% for Landsat 8 and Sentinel-2 images, respectively. To obtain “absolute” clear pixels, the cloud probability thresholds for Landsat 8 and Sentinel-2 are set to 15% and 17.5%, respectively, and the dilated number of pixels is left at its default (3 pixels in Fmask 4.0, to commit more clouds). In this way, Fmask 4.0 will commit a lot of clouds, which means that the predicted clear pixels are “absolute” clear. On the other hand, to obtain “absolute” cloud pixels, the dilated number of pixels for clouds is set to 0, and the cloud probability thresholds for Landsat 8 and Sentinel-2 are set to 30% and 32.5%, respectively. Then the “absolute” clear and cloud pixels for Landsat 8 (L8_Abs) and Sentinel-2 (S2_Abs) are treated as auxiliary data for threshold selection in GAN-CDM.

As shown in Fig. 1, the L8_Abs and S2_Abs datasets include 5 Landsat 8 and 5 Sentinel-2 images randomly selected from around the world, respectively. Both datasets cover snow/ice, barren, urban, vegetation and water to ensure that the auxiliary datasets are representative. To verify the selection of “absolute” clear and cloud pixels, an additional Sentinel-2 cloud cover dataset (Baetens et al., 2019) was selected. The “Baetens” dataset contains 38 Sentinel-2 images, of which 31 images were included as “Reference dataset” and 7 images were included in the “Hollstein dataset”. Those 7 images from the “Hollstein dataset” were not used in our study because they were acquired before Sentinel-2 was operational. The 31 images in the “Baetens Reference dataset” are distributed over 10 regions, and each region contains 3 to 4 images. We only selected 1 image randomly from each region. Then, upon careful visual inspection, two images were discarded because they overlapped with images in the L8 Biome and S2 CMC datasets. Finally, 8 images covering snow/ice, barren, urban, vegetation and water were selected.
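The selection of “absolute” pixels described above can be sketched as follows. This is an illustrative sketch only and does not call Fmask: `cloud_prob` is a hypothetical per-pixel cloud probability grid in percent, the square dilation window and the strict `>` comparisons are assumptions, and the defaults are the Landsat 8 values quoted in the text (15% with a 3-pixel dilation for clear, 30% without dilation for cloud).

```python
def dilate(mask, radius):
    """Binary dilation with a square (2*radius+1) window; mask is a list of rows of bools."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = True
    return out


def absolute_labels(cloud_prob, clear_thr=15.0, cloud_thr=30.0, clear_dilation=3):
    """Label each pixel 'clear', 'cloud' or None (left unlabeled)."""
    # Permissive cloud mask: low threshold plus dilation, so that whatever
    # remains outside it can be trusted as "absolute" clear.
    loose = dilate([[p > clear_thr for p in row] for row in cloud_prob],
                   clear_dilation)
    labels = []
    for row_p, row_loose in zip(cloud_prob, loose):
        # Strict cloud mask: high threshold, no dilation.
        labels.append(["cloud" if p > cloud_thr
                       else (None if in_loose else "clear")
                       for p, in_loose in zip(row_p, row_loose)])
    return labels
```

Pixels that fall between the two masks stay unlabeled, which matches the point made above that not every pixel in an auxiliary image needs a label.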

Validation datasets

We include three independent datasets for experimental validation that contain cloud masks at pixel level. Because the proposed method was trained on Landsat 8 images, the L8 Biome dataset was selected as one of the validation datasets. As shown in Table 1, Landsat 8 images have 11 bands ranging from 0.433 μm to 12.51 μm, 8 of which have corresponding bands in Sentinel-2 imagery. Thus, a publicly available Sentinel-2 cloud cover validation dataset, the S2 CMC dataset, was also included in our validation experiments. Because the resolutions of the coastal and cirrus bands in Sentinel-2 (60 m) are much lower than that of the visible and near-infrared bands, they were not included in the experiments. The validation dataset collected by (Qiu et al., 2019) for Fmask 4.0 validation was also included.

Table 1

Landsat 8 and Sentinel-2 bands information. The bands used in the experiment are marked in bold.

Landsat 8			Sentinel-2
Band	Wavelength (μm)	Resolution (m)	Band	Wavelength (μm)	Resolution (m)
1 (Coastal)	0.430–0.450	30	1 (Coastal)	0.433–0.453	60
2 (Blue)	0.450–0.515	30	2 (Blue)	0.458–0.523	10
3 (Green)	0.525–0.600	30	3 (Green)	0.543–0.578	10
4 (Red)	0.630–0.680	30	4 (Red)	0.650–0.680	10
/			5 (Red Edge)	0.698–0.713	20
/			6 (Red Edge)	0.733–0.748	20
/			7 (Red Edge)	0.773–0.793	20
5 (NIR)	0.845–0.885	30	8 (NIR)	0.785–0.900	10
/			8A (Narrow NIR)	0.848–0.881	20
/			9 (Water vapor)	0.935–0.955	60
9 (Cirrus)	1.360–1.380	30	10 (Cirrus)	1.360–1.390	60
6 (SWIR 1)	1.560–1.660	30	11 (SWIR 1)	1.565–1.655	20
7 (SWIR 2)	2.100–2.300	30	12 (SWIR 2)	2.100–2.280	20
8 (PAN)	0.503–0.676	15	/
10 (TIRS 1)	10.600–11.900	100	/
11 (TIRS 2)	11.500–12.510	100	/

L8 Biome: The Landsat 8 cloud cover assessment validation dataset (Foga et al., 2017), termed L8 Biome, was created by USGS. This dataset contains 96 globally distributed Landsat 8 Level-1T scenes and covers eight biomes. The cloud masks are at 30 m resolution, the same as that of the Operational Land Imager (OLI) multispectral bands. The scenes are split into three cloud cover classes: less than 35%, between 35% and 65%, and over 65%, each class containing 32 scenes.

S2 CMC: The Sentinel-2 Cloud Mask Catalogue, termed S2 CMC in this study, was produced by (Francis et al., 2020). This dataset contains 513 subscenes of 1022 × 1022 pixels with associated cloud masks at 20 m resolution. The images in this dataset were randomly selected from the 2018 Level-1C Sentinel-2 archive and cover different land cover types around the world.

Fmask 4.0 validation dataset: The dataset was collected by (Qiu et al., 2019). Two subsets of this dataset were used in this study: the Landsat 8 and Sentinel-2 validation datasets, called L8_Fm4v and S2_Fm4v, respectively. Each subset contains 16 images, with 2 images randomly selected from the 8 locations of each biome of the L8 Biome dataset. The validation pixels were then selected by simple random sampling from each image. The L8_Fm4v and S2_Fm4v datasets contain 2963 and 2098 labeled pixels, respectively.

Table 2 shows the four cloud mask validation datasets used in this study. According to (Li et al., 2019), 4 images (Table S2) in the L8 Biome dataset were found to be less accurately labeled after careful visual inspection of the cloud masks. Thus, only the remaining 92 images were used in our experiments. All images in S2 CMC were visually checked and used for validation. The 637 globally distributed images from these datasets make it possible to validate the performance of the proposed GAN-CDM method on different land surface types. The validation datasets from Sentinel-2 make it possible to evaluate the transferability of the proposed GAN-CDM method trained on Landsat 8 images. The total of 4.399 billion labeled pixels in the four validation datasets ensures that the performance of GAN-CDM can be quantitatively assessed.
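As a quick arithmetic check, the image and pixel totals quoted above can be reproduced from the per-dataset figures listed in Table 2. The dictionary below simply restates those figures; nothing else is assumed.

```python
datasets = {
    # name: (images, labeled pixels), as listed in Table 2
    "L8 Biome": (92, 3864 * 10**6),
    "S2 CMC": (513, 535 * 10**6),
    "L8_Fm4v": (16, 2963),
    "S2_Fm4v": (16, 2098),
}

total_images = sum(images for images, _ in datasets.values())
total_pixels = sum(pixels for _, pixels in datasets.values())

print(total_images)                  # number of validation images
print(round(total_pixels / 1e9, 3))  # labeled pixels, in billions
```

The two full-image datasets account for almost all labeled pixels; the randomly sampled Fmask 4.0 subsets add only a few thousand.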

Table 2

Four cloud cover validation datasets used in the study.

Name	Source	Label type	Resolution (m)	Images	Image size (pixel)	Labeled pixels
L8 Biome	Landsat 8	Full image	30	92*	7000×6000	3864×10⁶
S2 CMC	Sentinel-2	Full image	20	513	1022×1022	535×10⁶
L8_Fm4v	Landsat 8	Random sampling	30	16	7000×6000	2963
S2_Fm4v	Sentinel-2	Random sampling	20	16	5490×5490	2098


GAN-CDM model

Physical model of cloud distortion

The cloud distortion model was first proposed in (Mitchell et al., 1977) and is widely used in image dehazing. The model includes two main components: transmission of background and reflection of clouds:

where x and y are the coordinates of the pixel in the image, a is the attenuation coefficient of the atmosphere, I is the solar radiation energy, and φ(x, y) and t(x, y) are the reflectance of the background and the transmittance of the cloud, respectively. As shown in Fig. 2, the effects of cloud on the incident energy from the Sun and the ground include reflection, transmission and absorption. In our previous work (Li et al., 2020a), we took the upward reflectance, transmittance and absorptance of clouds into consideration and constrained their sum to be equal to 1. Furthermore, the attenuation coefficient of the atmosphere a in Eq. (1) was treated as a component of the downward transmittance of cloud in the cloud distortion model:

Fig. 2

Cloud distortion model.

The constraints of Eq. (2) are:

where a(x, y), t(x, y), γ(x, y) and α(x, y) are the down-transmittance, up-transmittance, reflectance and absorptance of cloud, respectively. τ(x, y) is called the two-way transmittance and ρ(x, y) is called the constant layer. Unlike visible, near-infrared and shortwave infrared bands (VNS), the signal in Thermal Infrared Sensor (TIRS) bands comes mainly from the thermal radiation of the background. The cloud influence on TIRS bands only contains up-transmittance, absorptance and reflectance. Therefore, a novel cloud distortion model for thermal bands is defined as:

where δ(x, y) and θ(x, y) are the thermal radiation of background and cloud, respectively. With Eqs. (2), (3) and (4), all effects of clouds on the imaging process are considered in the redefined cloud distortion models. The next major task is to find an easy way to solve for their unknown variables. GAN is selected as a part of the framework of the proposed GAN-CDM and the details are introduced in the next subsection.
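To make the roles of the symbols above concrete, the following sketch evaluates one plausible per-pixel form of the two models. The functional forms are assumptions, not quoted from the paper: the VNS signal is taken as background reflectance times two-way transmittance plus the constant layer, and the TIRS signal as background thermal radiation times up-transmittance plus cloud thermal radiation. Only the unit-sum constraint on up-transmittance, reflectance and absorptance is stated in the text. All function names are hypothetical.

```python
def distort_vns(phi, tau, rho):
    """Assumed VNS form: observed = phi * tau + rho.

    phi: background reflectance, tau: two-way transmittance,
    rho: constant layer (cloud contribution).
    """
    return phi * tau + rho


def distort_tirs(delta, t_up, theta):
    """Assumed TIRS form: observed = delta * t_up + theta.

    delta: background thermal radiation, t_up: up-transmittance,
    theta: cloud thermal radiation.
    """
    return delta * t_up + theta


def satisfies_unit_sum(t_up, gamma, alpha, tol=1e-9):
    """Check that up-transmittance, reflectance and absorptance sum to 1."""
    return abs(t_up + gamma + alpha - 1.0) <= tol
```

Under these assumed forms, a cloud-free pixel (tau = 1, rho = 0) returns the background unchanged, while an opaque cloud (tau = 0) returns only the cloud term.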

Networks in GAN-CDM

The framework of GAN-CDM

The GAN algorithm was originally used to generate samples. It consists of two networks: the generative network (G) and the discriminative network (D). G tries to generate realistic samples to confuse D, and D tries to distinguish real samples from generated samples. G and D play a minimax game in this process, formulated as follows:

$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim d_{r}}[\log D(x)] + \mathbb{E}_{z \sim d_{z}}[\log(1 - D(G(z)))]$$

where d_r is the distribution of real samples, d_z is the distribution of input samples, and G(z) is the generated sample. G tries to minimize V(G, D) and D tries to maximize V(G, D). After G is well-trained, D cannot distinguish whether G(z) is from the distribution of real or generated samples. The flowchart of the proposed GAN-CDM method is shown in Fig. 3. GAN-CDM consists of three networks: two generators (G1 and G2) and one discriminator (D). The input image c is first fed into G1 and G2 to produce background and cloud components, respectively. Then D is used to distinguish whether the background component produced by G1 includes clouds or not (0 or 1), regardless of the quality of the generated image. If G1 produces a high-quality but still cloudy image, the output of D will also be 0. The generative loss of G1 will then be large, which forces G1 to produce a clearer image. Finally, the input image is reconstructed from the background and cloud components with the redefined cloud distortion model. For a better understanding of the GAN-CDM process, the intermediate results of two scenes are shown in Fig. 15.
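The minimax game can be made concrete with a small numerical sketch. It evaluates the standard GAN value function from discriminator outputs on real and generated samples. The function name `gan_value` and the clipping constant are illustrative choices, not part of the described method.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Standard GAN value V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs on real samples, in (0, 1)
    d_fake: discriminator outputs on generated samples, in (0, 1)
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates them well obtains a larger value.
v_confident = gan_value([0.9, 0.9], [0.1, 0.1])
```

D pushes the value up by separating the two sets, while G pulls it down by making generated samples look real, which is the behaviour described in the paragraph above.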

Fig. 3

Framework of GAN-CDM. The inputs of GAN-CDM are c, n and their block-level labels (whether the image contains clouds or not).

Fig. 15

OA Curves and added OA values of GCM and WDCD-based models, respectively.

G1 and D in GAN-CDM form the GAN structure and play the minimax game. As shown in LSGAN (Mao et al., 2017), the binary cross-entropy loss saturates easily, while the mean square error (MSE) can avoid this situation. We adopt the MSE between the output of D and the block-level label as the loss function of D, evaluated over the distribution of real clear images and the distribution of cloudy images. To confuse D, G1 tries to produce high-quality “clear” images. The loss function of G1 is defined on G1(c), the produced “clear” image. By minimizing this loss, G1 will update its weight parameters to map the cloud features to background features. With the constraint of the GAN loss, it can only be ensured that G1(c) is a “clear” image. However, it cannot be ensured that the “clear” image is correlated with the input image. Thus, we introduce the redefined cloud distortion models into GAN-CDM to reconstruct the input image with the extracted background and cloud components. We use c′ to represent the reconstructed image. In the reconstruction process for VNS bands, G2 provides the two-way transmittance and the reflectance of clouds. We assume that the reflectance and radiance of clouds in TIRS bands are 0 and that the observed signal comes only from the transmission of the background. Because the up-transmittance is not solved separately, we assume that the higher the reflectance in VNS bands, the lower the transmittance in TIRS bands, and we define the cloud distortion model in thermal infrared bands accordingly. By comparing the difference between the reconstructed image c′ and the input image c, G1 and G2 will cooperate to ensure that the background and cloud components are correlated with the input image. The L1 loss was found effective for image reconstruction in CycleGAN (Zhu et al., 2017b). 
We adopt the L1 distance to measure the difference between c′ and c. With the generative loss and the reconstruction loss, the produced image G1(c) will be clear and correlated with the input image, which means that G1(c) contains the same background component as the input image and that the cloud regions in the input image are translated into clear regions in G1(c). Once the background and cloud components are completely separated, they should be uncorrelated with each other. The cloud component has less varied texture than the background component. Zhang et al. (2018) proposed an edge correlation loss function for separating transmission and reflection layers. In the edge correlation calculation, m is the down-sample factor and is set to 3 in this work, n and q are the numbers of outputs of G1 and G2, ∇ is the gradient operator and ‖∗‖ is the Frobenius norm. By minimizing the edge correlation loss, the correlation between cloud and background components will be as low as possible. It should be noted that G2 only produces three channels (q = γ, τ, ρ), which are the reflectance, transmittance and constraint layers (Eq. (3)). The reasons are: (1) If G2 produced three layers for each input band, the output channels of G2 would be 3n (n is the number of input bands), which would add many parameters and much computation. Instead, the computation loops of the edge correlation loss are reduced from m × n × 3n to m × n × 3. (2) Since GAN-CDM only aims to detect clouds, not remove them, a rough reconstruction of input images with the three layers is enough. The inputs of GAN-CDM are cloudy and clear images. The main use of clear images is to guide G1 to learn the distribution of clear images. As there are no clouds in clear images, the reflectance, two-way transmittance and constraint layers produced by G2 for clear images are supposed to be 0, 1 and 0, respectively. To optimize G2 on clear images and reduce the false detection rate, we also input clear images into G2 and use a constant metric as the reference for G2. 
The cross-entropy loss function (De Boer et al., 2005) is used as the loss function of this optimization process, where y is the constant metric used as the reference for G2(n). The final loss function of G1 and G2, L(G1, G2), is a weighted sum of the individual losses, where the four weights λ of the individual losses within L(G1, G2) are set to 0.5, 10, 1 and 1, similar to those in (Li et al., 2020b) with minor adjustments. In the proposed GAN-CDM method, G1 and G2 cooperate with each other and update their parameters together with the common target of minimizing L(G1, G2).
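The reconstruction step and two of the losses can be sketched in NumPy. The reconstruction formulas are assumptions inferred from the prose, not the paper's equations: with the incident radiation set to 1, VNS bands are taken as background × two-way transmittance + cloud reflectance, and TIRS bands as background × (1 − cloud reflectance). All function names are hypothetical.

```python
import numpy as np

def reconstruct(background, gamma, tau, n_tirs=0):
    """Rebuild the input image from G1 and G2 outputs (ASSUMED formulas).

    background: (H, W, n) "clear" image from G1, TIRS bands stored last
    gamma, tau: (H, W) cloud reflectance and two-way transmittance from G2
    """
    background = np.asarray(background, dtype=float)
    n_vns = background.shape[-1] - n_tirs
    recon = np.empty_like(background)
    # VNS bands: transmitted background plus cloud reflectance (incident radiation = 1).
    recon[..., :n_vns] = background[..., :n_vns] * tau[..., None] + gamma[..., None]
    # TIRS bands (assumption): transmission only, lower where cloud reflectance is high.
    recon[..., n_vns:] = background[..., n_vns:] * (1.0 - gamma[..., None])
    return recon

def l1_loss(reconstructed, observed):
    """Mean absolute difference between c' and c."""
    return float(np.mean(np.abs(reconstructed - observed)))

def mse_loss(d_output, label):
    """LSGAN-style mean square error between discriminator output and a 0/1 label."""
    return float(np.mean((np.asarray(d_output, dtype=float) - label) ** 2))

rng = np.random.default_rng(0)
bg = rng.random((4, 4, 8))                                  # toy 8-band background
clear_rec = reconstruct(bg, np.zeros((4, 4)), np.ones((4, 4)), n_tirs=2)
cloud_rec = reconstruct(bg, np.ones((4, 4)), np.zeros((4, 4)), n_tirs=2)
```

For a clear image (reflectance 0, transmittance 1) the reconstruction returns the background unchanged, which is consistent with the reference values of 0, 1 and 0 that the text assigns to G2 on clear images.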

The network architecture of GAN-CDM

Inspired by (Gandelsman et al., 2018) for image segmentation, both G1 and G2 in GAN-CDM adopt U-Net architecture and are designed for combining the high resolution features with the up-sampled output. The detailed network architecture of G1/G2 is shown in Fig. 4. The basic components of G1/G2 are convolution and deconvolution sequences. It can be seen that for each convolution sequence in the encoder (except for the last), the output is not only passed to the next convolution sequence but also concatenated with the output of the corresponding deconvolution sequence, then passed to the next deconvolution sequence.

Fig. 4

Architecture of the Generators G1/G2 of Fig. 3.

The details for each convolution and deconvolution layer in G1/G2 are shown in Table 3. All layers in G1 and G2 are the same except for the number of output channels and the activation function of the output. The main architecture of G1/G2 consists of thirteen integrated processing units. The first seven processing units have the same sequence of layers: convolution, instance normalization and leaky rectified linear unit (LReLU) layers. We call this convolution sequence Conv. The slope of the LReLU layers is set to 0.2. The kernel size and stride of the convolution layer in Conv_i are set to 3×3 and 2, except that the stride of Conv_1 is set to 1. The first six processing units in the decoder have the same sequence of layers: deconvolution, instance normalization and rectified linear unit (ReLU) layers. We call this deconvolution sequence Deconv. The kernel size and stride of the deconvolution layer in Deconv_i are set to 4×4 and 2. Conv_out_G1 and Conv_out_G2 are the last convolution sequences of G1 and G2 and only contain the convolution layer. The number of output channels of Conv_out_G1 is set to n, the number of input bands. Conv_out_G2 outputs 3 channels, which represent the reflectance, two-way transmittance and constraint layers, respectively. Because the incident solar radiation is set to 1 and the reflectance of the background is necessarily positive and less than 1, a sigmoid function is used as the activation function of G1 to normalize the range of the output values to [0,1]. According to Eq. (5), the sum of the 3 outputs of G2 should be 1, so we adopt the Softmax function as the activation function to satisfy this constraint. As shown in Table 4, the discriminator D in GAN-CDM is the same as that of Patch-GAN (Zhu et al., 2017a), which simply adopts five Conv sequences to extract features and judge whether the input is real or not.

Table 3

Details of parameters' setting of G1/G2.

Block	Id	Output channels	Kernel size	Stride
Encoder	Conv_1	64	3×3	1
	Conv_2	128	3×3	2
	Conv_3	256	3×3	2
	Conv_4	512	3×3	2
	Conv_5	512	3×3	2
	Conv_6	512	3×3	2
	Conv_7	512	3×3	2
Decoder	Deconv_1	512	4×4	2
	Deconv_2	512	4×4	2
	Deconv_3	512	4×4	2
	Deconv_4	256	4×4	2
	Deconv_5	128	4×4	2
	Deconv_6	64	4×4	2
	Conv_out_G1	n	3×3	1
	Conv_out_G2	3	3×3	1
	Sigmoid_G1	n	/	/
	Softmax_G2	3	/	/

Table 4

Details of parameters' setting of D.

Block	Id	Output channels	Kernel size	Stride
Encoder	Conv_1	64	4×4	2
	Conv_2	128	4×4	2
	Conv_3	256	4×4	2
	Conv_4	512	4×4	2
	Conv_5	1	4×4	1
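To make the layer settings in Tables 3 and 4 easier to follow, the sketch below traces the spatial size of a 384×384 input through the encoder, decoder and discriminator. It assumes "same"-style padding, so a stride-s convolution divides the size by s (rounded up) and a stride-s deconvolution multiplies it by s. The padding scheme is not stated in the text, and the output convolutions are left out.

```python
import math

# (layer id, output channels, stride, kind), transcribed from Tables 3 and 4.
GENERATOR = [
    ("Conv_1", 64, 1, "conv"), ("Conv_2", 128, 2, "conv"), ("Conv_3", 256, 2, "conv"),
    ("Conv_4", 512, 2, "conv"), ("Conv_5", 512, 2, "conv"), ("Conv_6", 512, 2, "conv"),
    ("Conv_7", 512, 2, "conv"),
    ("Deconv_1", 512, 2, "deconv"), ("Deconv_2", 512, 2, "deconv"),
    ("Deconv_3", 512, 2, "deconv"), ("Deconv_4", 256, 2, "deconv"),
    ("Deconv_5", 128, 2, "deconv"), ("Deconv_6", 64, 2, "deconv"),
]
DISCRIMINATOR = [
    ("Conv_1", 64, 2, "conv"), ("Conv_2", 128, 2, "conv"), ("Conv_3", 256, 2, "conv"),
    ("Conv_4", 512, 2, "conv"), ("Conv_5", 1, 1, "conv"),
]

def trace(size, layers):
    """Return {layer id: (spatial size, channels)} under the 'same' padding assumption."""
    shapes = {}
    for name, channels, stride, kind in layers:
        if kind == "deconv":
            size = size * stride             # deconvolution upsamples by the stride
        else:
            size = math.ceil(size / stride)  # convolution downsamples by the stride
        shapes[name] = (size, channels)
    return shapes

g_shapes = trace(384, GENERATOR)
d_shapes = trace(384, DISCRIMINATOR)
for name, (size, channels) in g_shapes.items():
    print(f"{name:9s} {size:4d} x {size:<4d} x {channels}")
```

Under these assumptions the six stride-2 convolutions reduce a 384-pixel side to 6 at the bottleneck, and the six deconvolutions restore it to 384, so the skip connections pair feature maps of equal spatial size.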


Results and discussion

Experimental setup

Baseline methods

In the experiments, the proposed GAN-CDM is compared with two traditional cloud detection methods, Fmask 4.0 (Qiu et al., 2019) and Sen2Cor 2.8.0 (Main-Knorn et al., 2017), as well as two deep learning-based methods, GCM (Zou et al., 2019) and WDCD (Li et al., 2020a). GCM, which also combines GAN with the cloud distortion model for cloud detection, and WDCD require the same block-level labeled training samples as GAN-CDM. WDCD first extracts useful features by classifying images as cloudy or clear. Pixel-level labels are then produced by removing pooling layers in the pre-trained model. Because Fmask 4.0 was trained on the L8 Biome dataset to obtain optimal parameters, it was excluded from the validation experiment on the L8 Biome dataset. Since Fmask 4.0 and Sen2Cor require full S2 file input but the S2 CMC dataset only contains sub-scenes, those methods were excluded from the validation experiment on the S2 CMC dataset.

Hyperparameters setting

In the training stage, the hyperparameters of GCM and WDCD-based models were set to their defaults. The Adam optimizer was used to train the GAN-CDM-based models, and its parameters were set to the same values as those in (Zhu et al., 2017a): β1 = 0.9, β2 = 0.999, and an initial learning rate of 0.0002. Exponential decay with a decay rate of 0.96 was adopted as the decay policy. The total number of iterations was 1,000,000. The decay policy was applied after the first 100,000 iterations. The batch size was set to 1. The training experiments were conducted on 2 Intel(R) Xeon(R) CPU E5-2640 v4 x86_64 @ 2.4 GHz running a Linux operating system, with an NVIDIA Tesla V100 GPU with 16 GB memory (4 GB is enough for training GAN-CDM). GCM and GAN-CDM were implemented on the TensorFlow platform (version 1.14) and their discriminator and generators were trained alternately from batch to batch as in CycleGAN (Zhu et al., 2017b). WDCD was implemented with the PyTorch platform (version 1.9). The inputs of GCM, WDCD and GAN-CDM were images of WHUL8-CDb and the corresponding block-level labels in the training stage. The range of the pixel values was normalized to [0,1] by dividing by 65,535 before being fed into deep learning-based methods. As mentioned in section 2.2, bands 1/2/3/4/5/6/7/9 in Landsat 8 correspond to bands 1/2/3/4/8/10/11/12 in Sentinel-2, and the spatial resolution of bands 1/10 in Sentinel-2 is much lower than that of the other bands. Thus, the common bands 2/3/4/5/6/7 in Landsat 8 were selected as the base bands. For each deep learning-based method, two models were trained with different numbers of input bands: 4 bands (bands 2/3/4/5) for the GCM-4, WDCD-4 and GAN-CDM-4 models, and 6 bands (bands 2/3/4/5/6/7) for the GCM-6, WDCD-6 and GAN-CDM-6 models. To analyze the effectiveness of the cloud distortion model for TIRS bands, models based on 8 bands (bands 2/3/4/5/6/7/10/11; GCM-8T, WDCD-8T and GAN-CDM-8T) were trained and evaluated on Landsat 8 imagery. 
Models based on 10 bands (bands 1/2/3/4/5/6/7/9/10/11; GCM, WDCD and GAN-CDM) were also used to analyze the improvement introduced by bands 1/9 (Coastal/Cirrus). To ensure the fairness of the experiments, we only modified the data path and input/output bands in the source code of GCM and WDCD, keeping the original implementations of the models. In the testing stage, the pixel values were normalized to [0, 1] by dividing by 65,535 and 10,000 for Landsat 8 and Sentinel-2 images, respectively. Unlike in training, we only ran the well-trained generator G2 on testing images to generate cloud distortion layers, of which only the reflectance layer was used for producing the final cloud mask. Because the reflectance layer ranges from 0 to 1, it was binarized to get a final cloud mask following the process described in section 4.2. Due to memory limitations of the GPU, a whole testing image could not be processed at once. Thus, we applied a sliding window with a size of 384×384 and a stride of 192×192 on the input images to extract cloud distortion layers and avoid border effects.
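The testing-stage preprocessing can be sketched as follows. The text gives the normalization constants, the window size and the stride, but not how overlapping windows are merged or how image borders are handled. Averaging overlapping predictions and adding a final window aligned with the border are both assumptions, and `predict_fn` is a hypothetical stand-in for running G2 and keeping its reflectance layer.

```python
import numpy as np

def normalize(digital_numbers, scale):
    """Scale pixel values to [0, 1]; scale is 65535 for Landsat 8 and 10000 for Sentinel-2."""
    return np.asarray(digital_numbers, dtype=float) / scale

def window_starts(length, window=384, stride=192):
    """Start offsets of sliding windows covering one image dimension."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # assumption: extra window aligned with the border
    return starts

def predict_tiled(image, predict_fn, window=384, stride=192):
    """Apply predict_fn to overlapping tiles and average the overlaps (assumption)."""
    height, width = image.shape[:2]
    total = np.zeros((height, width), dtype=float)
    count = np.zeros((height, width), dtype=float)
    for r in window_starts(height, window, stride):
        for c in window_starts(width, window, stride):
            tile = image[r:r + window, c:c + window]
            total[r:r + window, c:c + window] += predict_fn(tile)
            count[r:r + window, c:c + window] += 1.0
    return total / count

# Toy usage: the stand-in "model" simply returns the first band of each tile.
rng = np.random.default_rng(0)
image = normalize(rng.integers(0, 65536, size=(500, 600, 4)), 65535.0)
reflectance = predict_tiled(image, lambda tile: tile[..., 0])
```

With a stride of half the window size, interior pixels are covered by several windows, so no pixel depends only on a prediction made near a tile border.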

Accuracy metrics

In order to evaluate the performance of cloud detection methods quantitatively, the predicted binary masks were compared with the manually labeled cloud masks. Overall accuracy (OA), user's accuracy (UA), producer's accuracy (PA), F1-score and intersection over union (IoU) were selected as performance measures in this work:

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{UA} = \frac{TP}{TP + FP}, \quad \mathrm{PA} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \mathrm{UA} \times \mathrm{PA}}{\mathrm{UA} + \mathrm{PA}}, \quad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

where true-positive (TP) and true-negative (TN) are correctly predicted cloud and clear pixels, and false-positive (FP) and false-negative (FN) are wrongly predicted cloud and clear pixels, respectively. OA, F1-score and IoU are comprehensive indicators.
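The five indicators can be computed directly from the confusion counts. The sketch below uses the standard definitions (UA as precision, PA as recall) and assumes all denominators are non-zero. The function name `cloud_metrics` is illustrative.

```python
def cloud_metrics(tp, tn, fp, fn):
    """OA, UA, PA, F1-score and IoU for the cloud class from pixel counts."""
    oa = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    ua = tp / (tp + fp)                    # user's accuracy (precision)
    pa = tp / (tp + fn)                    # producer's accuracy (recall)
    f1 = 2 * ua * pa / (ua + pa)           # harmonic mean of UA and PA
    iou = tp / (tp + fp + fn)              # intersection over union
    return {"OA": oa, "UA": ua, "PA": pa, "F1": f1, "IoU": iou}

# Toy counts: 80 cloud pixels found, 20 false alarms, 10 missed, 90 clear pixels kept.
metrics = cloud_metrics(tp=80, tn=90, fp=20, fn=10)
```

A high PA with a low UA, as reported for some baselines in the tables, corresponds to a method that labels too many pixels as cloud: few clouds are missed, but many clear pixels are flagged.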

Threshold selection

Unlike some end-to-end deep learning-based cloud detection methods that use 0.5 as a default threshold to output binary masks, GCM, WDCD and the proposed GAN-CDM-based models output a continuous value ranging from 0 to 1 and adopt a different process to obtain optimal thresholds. To produce binary cloud masks, GCM-based models treated the pixels whose values are greater than 100 and less than 200 as cloud pixels, and WDCD combined the average and standard deviation values of the results of clear samples in the training dataset to obtain the optimal threshold. We adopted the default binarization processes for GCM and WDCD-based models. For GAN-CDM-based models, the optimal threshold was obtained based on auxiliary datasets such as “absolute” clear and cloud pixels generated by Fmask 4.0. As shown in Fig. 5, looking at the rows, the ranges of reflectance values produced by GAN-CDM-based models with different input bands are different. This is because GAN-CDM is designed to extract common cloud distortion layers that can be used to roughly reconstruct all input bands, and those cloud distortion layers produced by GAN-CDM-based models differ from the input bands. Looking at columns, the ranges of reflectance values produced by the same model with different satellite sensors are a little different which may be caused by the difference of bit number between Landsat 8 and Sentinel-2 sensors.

Fig. 5

Examples of reflectance layers produced by GAN-CDM-4 and GAN-CDM-6. (a) Grass, water and ice in L8 Biome. (b) Forest, mountain and shrubland in S2 CMC.

The experiments in this subsection are used to obtain the optimal thresholds of GAN-CDM-based models with different input bands for different sensors. Since OA is a comprehensive accuracy indicator, the optimal threshold was selected according to its value. OA curves are calculated by applying different thresholds to the cloud reflectance layer, and the threshold at which the model obtains the highest OA is selected as the optimal threshold. To better sample the threshold value, we multiply the reflectance layer by 255 before calculating OA, so that the threshold value ranges from 0 to 255. The L8_Abs dataset was used to obtain the thresholds of GAN-CDM-based models for Landsat 8 images (Fig. 6). The optimal thresholds for GAN-CDM-based models vary with input bands on the L8_Abs dataset. OA values of GAN-CDM-based models increase as the threshold increases, then reach a plateau and decrease. The thresholds at maximum OA values are marked by the dotted line. GAN-CDM-6 and GAN-CDM obtain the highest OA when the threshold is 7. The optimal threshold for GAN-CDM-4 is 11. It can also be seen that GAN-CDM-6 and GAN-CDM always obtain higher OA than GAN-CDM-4 for thresholds ranging from 0 to 15. GAN-CDM-6 and GAN-CDM have very similar OA at different thresholds.
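The threshold search described above can be sketched as a scan over integer thresholds on the rescaled reflectance layer. Whether a pixel exactly equal to the threshold counts as cloud is not stated, so the strict comparison below is an assumption. When several thresholds tie for the highest OA, this sketch returns the smallest.

```python
import numpy as np

def select_threshold(reflectance, reference, thresholds=range(256)):
    """Return (best threshold, OA curve) for a cloud reflectance layer in [0, 1].

    reference: boolean mask of the same shape, True for cloud pixels.
    """
    scaled = np.asarray(reflectance, dtype=float) * 255.0   # thresholds range over 0..255
    reference = np.asarray(reference, dtype=bool)
    candidates = list(thresholds)
    # OA at each threshold: fraction of pixels whose binarized label matches the reference.
    oa_curve = np.array([np.mean((scaled > t) == reference) for t in candidates])
    best = candidates[int(np.argmax(oa_curve))]
    return best, oa_curve

# Toy example: three clear pixels with low reflectance and two cloud pixels.
reflectance = np.array([0.0, 0.01, 0.02, 0.5, 0.9])
reference = np.array([False, False, False, True, True])
best_threshold, oa_curve = select_threshold(reflectance, reference)
```

In this toy case OA rises as the threshold passes the low values of the clear pixels, stays at its maximum over a plateau, and falls once the threshold exceeds the cloud values, which mirrors the curve shape described for Fig. 6.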

Fig. 6

OA Curves of GAN-CDM-4 and GAN-CDM-6 on L8_Abs dataset.

Since GAN-CDM-based models can also be applied to Sentinel-2 images, the S2_Abs dataset was used to obtain the thresholds of GAN-CDM-based models for Sentinel-2 images. Fig. 7 shows the curves of average OA of GAN-CDM-4 and GAN-CDM-6 on the S2_Abs dataset. It can be seen that the OA curves of GAN-CDM-4 and GAN-CDM-6 have a similar trend and that GAN-CDM-6 always obtains higher OA than GAN-CDM-4 for thresholds ranging from 1 to 15. GAN-CDM-6 achieves the highest OA when the threshold is 6, while GAN-CDM-4 achieves its highest OA when the threshold is 7.

Fig. 7

OA Curves of GAN-CDM-4 and GAN-CDM-6 on S2_Abs dataset.

Comparison with baseline methods

Performance on Landsat 8 images

GAN-CDM-6 achieves the highest OA, UA, F1-score and IoU among all methods on 92 images in the L8 Biome (Table 5). GAN-CDM-4 ranks second in all accuracy indicators except for PA. Specifically, the OA of GAN-CDM-4 and GAN-CDM-6 are 86.07% and 90.20%, respectively, while the highest OA of WDCD and GCM-based models is 72.09%. GAN-CDM-based models thus exceed WDCD and GCM-based models by at least 13.98 percentage points in OA, which indicates that GAN-CDM-based models perform much better than WDCD and GCM-based models in cloud detection. The lowest F1-score of GAN-CDM-based models is 86.20%, while the highest F1-score of WDCD and GCM-based models is 77.39%, which means that GAN-CDM-based models achieve a better balance between UA and PA than WDCD and GCM-based models.

Table 5

Average accuracy indicators of different methods on 92 images in L8 Biome. The highest values are marked in bold for the same number of input bands, and underlined for the best overall method.

Method	OA	UA	PA	F1-score	IoU
WDCD-4	72.08%	64.61%	94.32%	76.69%	62.19%
GCM-4	58.92%	58.98%	51.35%	54.90%	37.84%
GAN-CDM-4	86.07%	83.25%	89.38%	86.20%	75.75%
WDCD-6	72.09%	63.90%	98.12%	77.39%	63.13%
GCM-6	60.30%	57.68%	69.34%	62.98%	45.96%
GAN-CDM-6	90.20%	85.43%	96.30%	90.54%	82.72%

In Fig. 8, the three comprehensive accuracy indicators of all methods are visualised according to the different land cover types. The larger the area covered by a curve on the radar plot, the more effective the corresponding method is overall on the different classes. It can be seen that the areas of GAN-CDM-based models are always larger than those of WDCD and GCM-based models on OA, F1-score and IoU when they have the same input bands. This indicates that GAN-CDM-based models are more effective overall than WDCD and GCM-based models across the classes. For methods with 4 bands as input, GAN-CDM-4 performs much better than WDCD-4 and GCM-4 in OA, F1-score and IoU on most biomes, except for snow/ice. When taking 6 bands as input, GAN-CDM-6 performs much better than WDCD-6 and GCM-6 in OA, F1-score and IoU on all biomes.

Fig. 8

Average OA, F1-score and IoU of different methods in different land cover types for 92 images in the L8 Biome (Details are in Table S3).

The most challenging land cover type for WDCD-based models, GCM-6 and GAN-CDM-based models is snow/ice, while for GCM-4 it is grass, which is the easiest land cover type for GAN-CDM-based models. The easiest land cover type for the WDCD-based models is wetland. The easiest land cover type for GCM-4 and GCM-6 is water. The most challenging and easiest land cover types for WDCD and GAN-CDM-based models do not change with input bands, unlike GCM-based models. This indicates that GCM-based models are less stable than WDCD and GAN-CDM-based models on the L8 Biome dataset when taking different bands as inputs. Cloud detection results of four typical scenes on L8 Biome are shown in Fig. 9. WDCD and GCM-based models classify snow/ice in scene a) as clouds, while there are no commission errors in the results of GAN-CDM-based models. In scene b), WDCD-based models misclassify many barren regions as clouds. The omission errors in the results of GCM-4 are slightly high, while there are many commission errors in the result of GCM-6. The results of GAN-CDM-4 and GAN-CDM-6 are very close to the reference. In grass scene c), WDCD-based models cannot separate clouds from the background around them and classify some bright surfaces as clouds. GCM-based models fail at scene c). GAN-CDM-based models detect most clouds correctly. The clouds in the upper middle region are detected by GAN-CDM-6 but omitted by GAN-CDM-4. There are many commission errors in the results of WDCD-based models on scene d), while the results of GCM-based models contain a lot of omission errors. GAN-CDM-4 obtains more accurate results than GAN-CDM-6 on scene d). Compared with WDCD and GCM-based models, GAN-CDM-based models always achieve better cloud detection results on these four scenes with the same input bands.

Fig. 9

Examples of cloud detection results of different methods for L8 Biome. (a) Snow/ice. (b) Barren and water. (c) Grass, barren and urban. (d) Shrubland and water. All images are composited with NIR-G-B except image (a), which is composited with SWIR-G-B.

WDCD-based models always produce many commission errors on these four scenes. This is likely because an empirical constant was used to obtain the threshold in WDCD-based models, which may not be suitable for Landsat 8 images. The results of WDCD-based models on scenes a), b) and c) contain serious grid effects, which may be caused by the fact that they classify the whole image patch as cloud or clear during training, so that the cloud activation map is not accurate. GCM-based models miss some very high reflectance objects on scenes a), b) and d) because they treat pixel values larger than 200 as background in their binarization process. GAN-CDM-based models extract a cloud reflectance layer from the input image: the higher the value, the more likely the pixel is to be a cloud. Therefore, a simple binarization process on cloud reflectance is robust enough.

Transfer learning performance on Sentinel-2 CMC

In this part, methods trained on WHUL8-CDb with 4 and 6 bands as inputs are directly run on 513 images in S2 CMC without any fine-tuning. The quantitative results are shown in Table 6 in which it can be seen that OA, F1-score and IoU of GAN-CDM-6 are 92.54%, 92.90% and 86.74% which are the highest among all methods. Although the UA of GCM-4 and GCM-6 is 93.31% and 97.10%, respectively, their OA, PA, F1-score and IoU are much lower than WDCD and GAN-CDM based models, which indicates that GCM-based models trained on WHUL8-CDb dataset do not transfer well to S2 CMC. Compared with WDCD-based models when using the same input bands, GAN-CDM-based models obtain higher values in OA, UA, F1-score and IoU, except for PA. GAN-CDM-4 even performs better than WDCD-6 in OA, UA, F1-score and IoU. Overall, the results in Table 6 show that GAN-CDM-based models have better transferability than WDCD and GCM-based models from the WHUL8-CDb dataset (Landsat 8) onto the S2 CMC dataset (Sentinel-2).

Table 6

Average accuracy indicators of different methods on 513 images in S2 CMC. The highest values are marked in bold for the same number of input bands, and underlined for the best overall method.

Method	OA	UA	PA	F1-score	IoU
WDCD-4	77.00%	70.58%	96.48%	81.52%	68.80%
GCM-4	60.03%	93.31%	26.04%	40.72%	25.56%
GAN-CDM-4	88.03%	84.34%	94.85%	89.29%	80.65%
WDCD-6	76.25%	69.76%	96.78%	81.08%	68.18%
GCM-6	52.45%	97.10%	10.10%	18.30%	10.07%
GAN-CDM-6	92.54%	92.96%	92.83%	92.90%	86.74%

The land covers in S2 CMC are classified into 11 types, in which open and enclosed water are two separate types. We treat both open and enclosed water as a single water class, reducing S2 CMC to 10 land cover types. The average OA, F1-score and IoU of different methods in the 10 land cover types for 513 images in S2 CMC are shown in Fig. 10. GAN-CDM-based models always obtain higher values than WDCD and GCM-based models in different land cover types, thus performing better overall in different land cover types. The performance of GAN-CDM-4 in snow/ice is almost the same as that of WDCD-4 and worse than in other land cover types, whereas GAN-CDM-6 always outperforms WDCD-6 and GCM-6 and has stable performance in all land cover types. In addition, the added value of GAN-CDM-6 over WDCD-6 and GCM-6 is higher than that of GAN-CDM-4 over WDCD-4 and GCM-4. All WDCD, GCM and GAN-CDM-based models obtain the highest F1-score and IoU on wetlands, which means that on the S2 CMC dataset, the easiest land cover type does not change with input bands. The most challenging land cover types for WDCD-4 and WDCD-6 are mountain and coastal, respectively. One possible explanation is that the information in SWIR bands can improve the performance of WDCD-based models in mountain areas, but is not useful for separating clouds from coastal areas. For GCM-4 and GCM-6, the most challenging classes are urban and snow/ice, respectively, because GCM-based models output as many bands as inputs, which makes it harder to optimize the network parameters when using more input bands. Snow/ice and barren are the most challenging classes for GAN-CDM-4 and GAN-CDM-6, respectively, because SWIR bands can help GAN-CDM-based models distinguish clouds and snow/ice, but do not help much in barren areas. 
The results demonstrate that information usage from SWIR bands differs for different methods when applied to S2 CMC dataset. Although the worst land cover types for GAN-CDM-based models are snow/ice and barren, GAN-CDM-based models still perform much better than GCM and WDCD-based models overall. This is because clear images including snow/ice and barren are used to optimize G2 during training.

Fig. 10

Average OA, F1-score and IoU of different methods in different land cover types for 513 images in S2 CMC (Details are in Table S4).

Cloud detection results of six examples in S2 CMC are shown in Fig. 11. WDCD-based models fail in these six images and can hardly distinguish background from clouds. WDCD-based models tend to classify more pixels as clouds, which causes their high PA. GCM-4 and GCM-6 have high commission errors in snow/ice and barren scenes, respectively, while they omit many high reflectance clouds in these scenes. GAN-CDM-based models do not classify bright surfaces as clouds in snow/ice and barren scenes, except that GAN-CDM-4 classifies very few snow/ice regions as clouds. In urban scenes, GCM-based models can detect thick clouds but omit thin clouds. Although some highlights in urban scenes are classified as clouds by GAN-CDM-based models, the results of GAN-CDM-based models are much better than those of WDCD and GCM-based models and close to the reference cloud masks. In easy scenes, GCM-based models omit many clouds, while the results of GAN-CDM-based models are almost the same as the reference.

Fig. 11

Examples of cloud detection results of different methods in S2 CMC. (a) Snow/ice. (b) Barren. (c) Urban. (d) Forest. (e) Wetland. (f) Water. All images are composited with NIR-G-B except image (a) is composited with SWIR-G-B.

Performance on Fmask 4.0 validation dataset

From Table 7, we can see that Fmask 4.0 performs the best among all methods on the L8_Fm4v dataset. It should be noted that Fmask 4.0 was trained on 90 of 92 images in the L8 Biome dataset, which has overlapping regions with the Fmask 4.0 validation dataset. Nevertheless, the weakly supervised method GAN-CDM-6 performs close to the supervised method Fmask 4.0 and ranks second on OA, UA and F1-score. When taking the same bands as input, WDCD and GCM-based models always have lower OA, UA, F1-score and IoU than GAN-CDM-based models, although WDCD-based models obtain higher PA. GCM-based models always get the lowest values on all accuracy metrics.

Table 7

Average accuracy indicators of different methods on L8_Fm4v. The highest values are marked in bold for the same number of input bands, and underlined for the best overall method.

Method	OA	UA	PA	F1-score	IoU
WDCD-4	74.52%	61.77%	85.34%	71.67%	55.85%
GCM-4	55.11%	42.21%	51.12%	46.24%	30.07%
GAN-CDM-4	84.27%	78.87%	79.71%	79.29%	65.68%
WDCD-6	75.80%	61.19%	98.21%	75.40%	60.52%
GCM-6	56.40%	45.35%	75.42%	56.64%	39.51%
GAN-CDM-6	96.05%	92.53%	97.41%	94.91%	90.31%
Fmask 4.0	96.42%	96.43%	94.01%	95.20%	90.85%

Table 8 shows the quantitative results of different methods on the S2_Fm4v dataset. It can be seen that Fmask 4.0 ranks first and GAN-CDM-6 ranks second on OA, F1-score and IoU among all methods. This may be because Fmask 4.0 was trained on a Sentinel-2 dataset simulated from the L8 Biome dataset for cloud detection on Sentinel-2 images, whereas GAN-CDM-6 was trained on true Landsat 8 images labeled at block-level, with only threshold selection done on the S2_Abs dataset (a coarse Sentinel-2 cloud dataset that is time-saving to produce). Although Sen2Cor obtains the highest UA among all methods (97.92%), its OA, PA, F1-score and IoU are lower than those of GAN-CDM-4. GCM-based models also perform the worst when applied to Sentinel-2 images.

Table 8

Average accuracy indicators of different methods on S2_Fm4v. The highest values are marked in bold for the same number of input bands and underlined for the best overall method.

Method	OA	UA	PA	F1-score	IoU
WDCD-4	66.92%	60.52%	99.81%	75.36%	60.46%
GCM-4	51.62%	52.92%	40.92%	46.15%	30.00%
GAN-CDM-4	90.51%	85.47%	97.93%	91.28%	83.95%
WDCD-6	73.12%	65.39%	99.72%	78.99%	65.27%
GCM-6	45.71%	45.35%	34.90%	39.45%	24.57%
GAN-CDM-6	92.04%	90.43%	94.26%	92.31%	85.71%
Fmask 4.0	95.66%	96.64%	94.73%	95.68%	91.71%
Sen2Cor	88.80%	97.92%	79.59%	87.80%	78.26%

The quantitative results on the Fmask 4.0 validation dataset demonstrate that the weakly supervised GAN-CDM-based models obtain good cloud detection results that are close to those of a supervised method, while requiring much less human effort when preparing training data. When transferred to other sensors such as Sentinel-2, GAN-CDM-based models can also achieve more satisfactory cloud detection than baseline deep learning-based models.

Overall discussion about method performance on the four cloud detection validation datasets

The quantitative results on the four globally distributed cloud detection validation datasets show that GAN-CDM-based models generally detect clouds in Landsat 8 and Sentinel-2 images better than baseline deep learning-based models. The most challenging land cover types for GAN-CDM-based models are snow/ice and barren, yet even there they perform much better than the deep learning-based baseline methods. The easiest land cover types for GAN-CDM-based models are vegetation and wetland, where they produce accurate cloud masks and again outperform the baselines. The hardest land cover types for WDCD-based models are snow/ice, barren and mountain, while the easiest are shrubland and wetland. The hardest land cover types for GCM-based models differ with input bands and datasets, which means GCM-based models are not as stable and robust as GAN-CDM and WDCD-based models.

Although more input spectral bands bring additional information, the cloud detection accuracies of GAN-CDM, WDCD, and GCM-based models do not always improve with more input bands. The quantitative results in Table 5 and Table 6 show that WDCD-6 does not always perform better than WDCD-4 on the L8 Biome and S2 CMC datasets. The visual results in Fig. 9 and Fig. 11 show that, with SWIR bands as input, GAN-CDM-6 almost never classifies barren and snow/ice as clouds, while WDCD-6 and GCM-6 still cannot distinguish some bright surfaces from clouds. This means that GAN-CDM-6 makes better use of the information in SWIR bands than WDCD-6 and GCM-6. Combining the results in Table 5, Table 6, Table 7 and Table 8, GAN-CDM-based models are more effective than the deep learning-based baseline methods. GAN-CDM-6 always performs the best among all deep learning-based methods, not only on Landsat 8 images but also on Sentinel-2 images. When compared with the supervised methods Fmask 4.0 and Sen2Cor, GAN-CDM-6 still achieves competitive performance.

Overall, GAN-CDM-based models perform much better than WDCD and GCM-based models. There are two main reasons for this. First, GAN-CDM-based models combine a physical model of cloud distortion with a GAN. The physical model considers the two-way transmission and absorption of clouds, and can describe the influence of clouds on the signal more accurately. Second, clear images and corresponding constant references are fed into GAN-CDM-based models during training. As a result, the values of the reflectance layer for clear pixels are closer to 0, which reduces the chance of classifying the background as cloud in a simple binarization process. It should also be emphasized that GAN-CDM-based models, trained with only block-level labels and coarse auxiliary datasets for threshold selection, can not only predict pixel-level cloud labels with good performance but also transfer well to cloud detection for other satellite sensors with similar image bands. Combining the quantitative and qualitative results on the L8 Biome and S2 CMC datasets, the cloud detection and transfer learning capabilities of GAN-CDM-based models are superior to those of the deep learning-based baseline methods.

Analysis of theoretical computational cost

Although deep learning-based models are faster and more accurate than traditional methods, a high-performance GPU is required for model training and inference. To analyze the theoretical computational costs of deep learning-based cloud detection methods, we calculate the number of floating-point operations (FLOPs), which indicates model complexity. Table 9 shows the FLOPs of different models in training and inference (1 GFLOPs = 10^9 FLOPs). The FLOPs for a given method differ very little when the number of input bands changes. GAN-CDM-based models rank second-best in training and best at inference. In contrast, WDCD-based models have the lowest FLOPs in training but the highest at inference. GAN-CDM-based models use about three times as many FLOPs in training as at inference, because they include 3 networks in training but only 1 network at inference. GCM-based models also include 3 networks in training and 1 network at inference, but use only about 35 more GFLOPs in training than at inference, because the cloud matting network of GCM-based models is larger than the other two networks. Although WDCD-based models have the lowest FLOPs in training, their FLOPs increase about 31 times at inference. This is because WDCD-based models down-sample feature maps in training, but do not change the feature maps at inference.
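
One plausible way to see why the band count barely affects the totals is that, in a convolutional network, only the first layer operates on the input bands. The sketch below uses the common estimate of 2·K²·C_in·C_out·H·W FLOPs for a stride-1 convolution (one multiply and one add per weight per output pixel). The layer widths (64 filters, 3×3 kernels) are hypothetical values chosen for illustration and are not the GAN-CDM architecture; some tools count multiply-accumulates instead, which halves the numbers.

```python
def conv2d_flops(height, width, c_in, c_out, kernel=3):
    """Approximate FLOPs of one stride-1, same-padded convolution (bias ignored)."""
    return 2 * kernel * kernel * c_in * c_out * height * width


H = W = 384  # input block size used in Table 9

# Hypothetical first layer: the only layer whose cost depends on the band count.
first_4 = conv2d_flops(H, W, c_in=4, c_out=64)
first_6 = conv2d_flops(H, W, c_in=6, c_out=64)

# Hypothetical later layer: its cost is the same for 4-band and 6-band inputs.
hidden = conv2d_flops(H, W, c_in=64, c_out=64)

extra_gflops = (first_6 - first_4) / 1e9
print(f"first layer, 4 bands: {first_4 / 1e9:.2f} GFLOPs")
print(f"first layer, 6 bands: {first_6 / 1e9:.2f} GFLOPs")
print(f"one hidden layer:     {hidden / 1e9:.2f} GFLOPs")
print(f"extra cost of 2 bands: {extra_gflops:.2f} GFLOPs")
```

Under these assumptions, two extra bands add well under 1 GFLOPs, while every later layer costs the same regardless of the input bands.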

Table 9

The training and inference FLOPs of deep learning-based models. The best values are marked in bold. 1 GFLOPs = 10^9 FLOPs.

Input size	(384,384,4)			(384,384,6)
Method	GAN-CDM-4	GCM-4	WDCD-4	GAN-CDM-6	GCM-6	WDCD-6
Training (GFLOPs)	199.65	418.11	33.78	201.66	419.40	33.95
Inference (GFLOPs)	61.40	382.83	1063.33	61.73	383.85	1063.67
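
The ratios quoted in the text can be checked directly against Table 9. The snippet below only redoes that arithmetic for the 4-band models; the dictionary layout is illustrative.

```python
# GFLOPs copied from Table 9 (4-band models).
table9 = {
    "GAN-CDM-4": {"train": 199.65, "infer": 61.40},
    "GCM-4": {"train": 418.11, "infer": 382.83},
    "WDCD-4": {"train": 33.78, "infer": 1063.33},
}

# Training-to-inference ratio for GAN-CDM-4 ("about three times").
gan_ratio = table9["GAN-CDM-4"]["train"] / table9["GAN-CDM-4"]["infer"]

# Absolute training-minus-inference gap for GCM-4 ("about 35 more GFLOPs").
gcm_gap = table9["GCM-4"]["train"] - table9["GCM-4"]["infer"]

# Inference-to-training ratio for WDCD-4 ("about 31 times").
wdcd_ratio = table9["WDCD-4"]["infer"] / table9["WDCD-4"]["train"]

print(f"GAN-CDM-4 train/infer: {gan_ratio:.2f}x")
print(f"GCM-4 train - infer:   {gcm_gap:.2f} GFLOPs")
print(f"WDCD-4 infer/train:    {wdcd_ratio:.2f}x")
```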

Sensitivity analysis on auxiliary datasets for threshold selection

Since the thresholds of GAN-CDM-based models can be selected on coarsely delineated auxiliary datasets, the Baetens dataset was used to analyze the influence of different auxiliary datasets on the performance of GAN-CDM-based models. Fig. 12 shows the OA curves of GAN-CDM-4 and GAN-CDM-6 on the Baetens dataset. The OA curves of both methods are very similar, and the threshold giving the highest OA is 7 for both methods.

Fig. 12

OA Curves of GAN-CDM-based models on the Baetens dataset.
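
An OA curve of this kind can be obtained by sweeping a binarization threshold over the model output and scoring each candidate against the auxiliary labels. The following is a minimal sketch of that selection step, not the released implementation. It assumes an 8-bit output layer in which values above the threshold are called cloud, and binary labels from the auxiliary dataset; `oa_curve` and `best_threshold` are hypothetical helper names, and the toy data are invented for illustration.

```python
def oa_curve(values, labels, thresholds=range(256)):
    """Overall accuracy for each candidate threshold.

    values: per-pixel model outputs (assumed to lie in 0..255).
    labels: 1 for cloud, 0 for clear, taken from the auxiliary dataset.
    """
    n = len(values)
    curve = []
    for t in thresholds:
        # A pixel is predicted as cloud when its value exceeds the threshold.
        correct = sum((v > t) == bool(y) for v, y in zip(values, labels))
        curve.append((t, correct / n))
    return curve


def best_threshold(curve):
    """Threshold with the highest OA; ties go to the lowest threshold."""
    return max(curve, key=lambda item: (item[1], -item[0]))


# Invented toy data: clear pixels have low values, cloud pixels higher ones.
values = [0, 2, 3, 5, 6, 9, 40, 120, 200, 255]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

curve = oa_curve(values, labels)
t, oa = best_threshold(curve)
print(f"selected threshold = {t}, OA = {oa:.2f}")
```

Because only an argmax over the curve is needed, coarse auxiliary labels are sufficient as long as they place the peak of the curve at roughly the right threshold.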

Fig. 13 shows the accuracy values of GAN-CDM-4 and GAN-CDM-6 on the S2 CMC and S2_Fm4v datasets when using different auxiliary datasets for threshold selection. For GAN-CDM-4 on the S2 CMC dataset (Fig. 13 a)), the best performance on all measures except PA is obtained when using the S2_Abs or the Baetens dataset as the auxiliary dataset. Conversely, on the S2_Fm4v dataset, GAN-CDM-4 achieves the best performance on all measures except UA when the threshold is selected on the L8_Abs dataset (Fig. 13 b)). Fig. 13 c) and d) show that GAN-CDM-6 performs best on the S2 CMC and S2_Fm4v datasets for all measures except PA when using the S2_Abs dataset for threshold selection. Overall, the performance of GAN-CDM-6 is more consistent than that of GAN-CDM-4 across validation datasets and auxiliary datasets. This means that GAN-CDM-6 is more robust than GAN-CDM-4 to the choice of auxiliary dataset for threshold selection.

Fig. 13

Performances of GAN-CDM-4 and GAN-CDM-6 on S2 CMC and S2_Fm4v datasets, with different auxiliary datasets for threshold selection.

Fig. 14 shows the performance of GAN-CDM-4 and GAN-CDM-6 with different auxiliary datasets for the different land cover types in the S2 CMC dataset. GAN-CDM-4 with S2_Abs/Baetens as the auxiliary dataset obtains higher OA, F1-score and IoU in the water, forest and urban classes than with L8_Abs, while in the snow/ice and coastal classes GAN-CDM-4 performs better with L8_Abs than with S2_Abs/Baetens. GAN-CDM-6 performs almost the same on all land cover types except coastal, no matter which auxiliary dataset is used. This demonstrates that GAN-CDM-6 is more consistent than GAN-CDM-4 across land cover types when different auxiliary datasets are used for threshold selection.

Fig. 14

Average OA, F1-score and IoU of GAN-CDM-based models in different land cover types for 513 images in S2 CMC dataset (Details are in Table S5).

Superiority of TSAP

The adaptiveness of TSAP

To analyze the adaptability of the proposed TSAP strategy, we also applied TSAP on the L8_Abs and S2_Abs auxiliary datasets for GCM and WDCD-based models, to obtain the “optimal” thresholds which were then used for producing cloud masks. Fig. 15 a) and b) show that the “optimal” thresholds for GCM-4 and GCM-6 were 97 and 67 respectively for L8_Abs, and 118 and 117 respectively for S2_Abs. When using the “optimal” thresholds to produce cloud masks, both GCM-4 and GCM-6 gain OA on the L8 Biome and S2 CMC datasets, except that GCM-6 does not improve on the L8 Biome dataset. The OA improvement for GCM-4 and GCM-6 when using TSAP was higher on the S2 CMC dataset than on the L8 Biome dataset. For WDCD-4 and WDCD-6, the “optimal” thresholds were 159 and 222 for L8_Abs, and 173 and 218 for S2_Abs, respectively. Unlike GCM-4 and GCM-6, WDCD-4 and WDCD-6 improved OA more on the L8 Biome dataset than on the S2 CMC dataset with the “optimal” thresholds. Fig. 15 b), c) and d) show that TSAP improves 6-band models more than 4-band models for GCM and WDCD-based models. The results in Fig. 15 demonstrate that the proposed TSAP strategy can also help improve the performance of GCM and WDCD-based models, except for GCM-6 on the L8 Biome dataset.

Fig. 15

OA Curves and added OA values of GCM and WDCD-based models, respectively.

The robustness of TSAP

In this section, the influence of the cloud probability threshold of Fmask 4.0 on the performance of GAN-CDM-4 and GAN-CDM-6 is discussed. Because the default cloud probability threshold of Fmask 4.0 was set to detect more “clouds”, we set the first stride to −1.5 for “absolute” clear pixels and +7.5 for “absolute” cloud pixels, and each of the next 3 strides to −0.5 and +2.5, respectively. In this way, five groups of cloud probability thresholds (including the default) were produced for each of the L8_Abs and S2_Abs datasets. Table 10 shows the cloud probability thresholds in each group (LT1 to LT5 for L8_Abs, ST1 to ST5 for S2_Abs). After applying the respective cloud probability thresholds in Fmask 4.0 for the L8_Abs and S2_Abs datasets, we obtain five auxiliary datasets for “optimal” threshold selection for each of Landsat 8 and Sentinel-2 imagery.

Table 10

Cloud probability thresholds for “absolute” clear and cloud pixels.

Dataset	No	Cloud probability threshold for “absolute” clear pixels	Cloud probability threshold for “absolute” cloud pixels
L8 Abs	LT1	17.5 (Default)	17.5 (Default)
	LT2	16	25
	LT3	15.5	27.5
	LT4	15 (Adopted)	30 (Adopted)
	LT5	14.5	32.5
S2 Abs	ST1	20 (Default)	20 (Default)
	ST2	18.5	27.5
	ST3	18	30
	ST4	17.5 (Adopted)	32.5 (Adopted)
	ST5	17	35
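
The five groups in Table 10 follow from the stride rule stated in the text: starting from the Fmask 4.0 default, the first stride shifts the clear threshold by −1.5 and the cloud threshold by +7.5, and each of the next three strides shifts them by −0.5 and +2.5. The helper below regenerates the table from the two defaults; its name and signature are illustrative only.

```python
def threshold_groups(default, first=(-1.5, 7.5), step=(-0.5, 2.5), n_steps=3):
    """Return (clear, cloud) cloud-probability thresholds for each group.

    default: Fmask 4.0 default threshold (group 1 uses it for both classes).
    first:   (clear, cloud) shift applied for the first stride.
    step:    (clear, cloud) shift applied for each of the following strides.
    """
    clear = cloud = float(default)
    groups = [(clear, cloud)]
    clear, cloud = clear + first[0], cloud + first[1]
    groups.append((clear, cloud))
    for _ in range(n_steps):
        clear, cloud = clear + step[0], cloud + step[1]
        groups.append((clear, cloud))
    return groups


l8_groups = threshold_groups(17.5)  # LT1..LT5
s2_groups = threshold_groups(20)    # ST1..ST5

for name, groups in (("LT", l8_groups), ("ST", s2_groups)):
    for i, (clear, cloud) in enumerate(groups, start=1):
        print(f"{name}{i}: clear <= {clear}, cloud >= {cloud}")
```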

Fig. 16 shows the optimal thresholds and corresponding OA values for the different cloud probability threshold groups. Fig. 16 a) shows that when the cloud probability threshold ranges from LT1 to LT5, the performance of GAN-CDM-4 on the L8 Biome dataset is almost unchanged. Fig. 16 b) shows that GAN-CDM-6 obtains a slightly higher OA (by 0.01) with LT3 than with the other groups. On the S2 CMC dataset, GAN-CDM-4 performs slightly better (by 0.01) with ST4 and ST5 than with the other groups, and GAN-CDM-6 performs very similarly across groups. The results show that GAN-CDM-4 and GAN-CDM-6 with TSAP are robust to different cloud probability thresholds.

Fig. 16

OA Curves of GAN-CDM-4 and GAN-CDM-6 with different cloud probability thresholds on L8_Abs and S2_Abs datasets, respectively. The optimal thresholds are marked with dotted lines.

Effectiveness of cloud distortion model for TIRS band

To analyze the effectiveness of the proposed cloud distortion model for TIRS bands, we added Landsat 8 bands 10 and 11 to GAN-CDM-6 to construct the GAN-CDM-8 T model. WDCD-8 T and GCM-8 T models were also constructed for comparison. The models (WDCD-8 T, GCM-8 T and GAN-CDM-8 T) trained with bands 2/3/4/5/6/7/10/11 are used to validate the advantage of the cloud distortion model for TIRS bands. The optimal threshold for GAN-CDM-8 T is 6 on the L8_Abs dataset. Fig. 17 shows the results of WDCD-8 T, GCM-8 T and GAN-CDM-8 T on the L8 Biome dataset. Of the three models using TIRS bands, GAN-CDM-8 T obtains the best performance across all accuracy indicators. This shows that GAN-CDM-8 T makes the best use of the information contained in thermal bands, thanks to its cloud distortion model.

Fig. 17

Average accuracy indicators of models with bands 2/3/4/5/6/7/10/11 as input (marked with dots) and corresponding performance degradation (shown as bars) relative to 6-band models on 92 images in the L8 Biome dataset.

Fig. 17 shows that, compared with the models without TIRS bands, the performance of all methods degrades when TIRS bands are added, except OA and UA for GCM-8 T. The reason may be that resampling the TIRS bands from 100 m to 30 m in the Landsat 8 product introduces noise, thus reducing the ability to produce accurate cloud reflectance (GCM-8 T), cloud activation maps (WDCD-8 T) or cloud distortion layers (GAN-CDM-8 T). GCM-based models produce cloud reflectance to generate cloud masks. The cloud reflectance is very low in the TIRS bands, so the cloud reflectance of GCM-8 T decreases when TIRS bands are used as input. The default threshold of GCM-8 T is the same as that of GCM-6, which is why GCM-8 T detects fewer clouds (higher UA) than GCM-6. WDCD-8 T is based on the cloud activation map (CAM). The CAM should become stronger when more helpful information is provided, yet WDCD-8 T obtains the worst performance. This means that the resampling noise in thermal infrared bands affects the performance of weakly-supervised methods such as WDCD-8 T, GCM-8 T and GAN-CDM-8 T. Although GAN-CDM-8 T is affected by the TIRS bands, it still outperforms WDCD-8 T and GCM-8 T. This illustrates that the proposed cloud distortion model for thermal infrared bands can help reduce the influence of resampling noise in the TIRS bands for GAN-CDM.

Performance improvement introduced by coastal and cirrus bands

The Coastal band (1) and Cirrus band (9) are mainly used for monitoring atmospheric water, which may be beneficial for cloud detection. Models with bands 1/2/3/4/5/6/7/9/10/11 (WDCD, GCM and GAN-CDM) were trained and then tested on the L8 Biome dataset; the results are shown in Fig. 18. GAN-CDM performs better than WDCD and GCM on all accuracy indicators except PA, where it is slightly worse than WDCD (by 1.35%). The performance improvement introduced by the Coastal and Cirrus bands over models with bands 2/3/4/5/6/7/10/11 is also shown in Fig. 18: the performance of WDCD, GCM and GAN-CDM improves substantially on all accuracy indicators when the Coastal and Cirrus bands are added as inputs. WDCD gains more than GCM and GAN-CDM on all accuracy indicators. This is likely because WDCD-8 T performed much worse than GCM-8 T and GAN-CDM-8 T (see Fig. 17), so there is more room for WDCD to improve.

Fig. 18

Average accuracy indicators of models with bands 1/2/3/4/5/6/7/9/10/11 as input (marked with dots) and corresponding performance improvement (shown as bars) over models with bands 2/3/4/5/6/7/10/11 on 92 images in the L8 Biome dataset.

Performance on Landsat 8 images versus supervised methods

To put the results of the proposed GAN-CDM method in context, we compare them with several supervised architectures in Table 11. Although a direct comparison between weakly-supervised and supervised methods is neither straightforward nor entirely fair, the proposed weakly supervised method GAN-CDM performs well and is comparable in accuracy and robustness to the supervised methods. It should be noted that DeepLab (Chen et al., 2018), DCN (Zhan et al., 2017) and MSCFF (Li et al., 2019) were trained on 73 of the 92 images in the L8 Biome dataset. GAN-CDM even obtained higher OA, PA, F1-score and IoU than DeepLab and performed better than DCN on OA and PA. GAN-CDM-4 exceeded MSCFF-4 by 2.17 percentage points on OA.

Table 11

Average accuracy indicators of different methods on the 19 L8 Biome test images used in MSCFF. The highest values are marked in bold for the same number of input bands and underlined for the best overall method.

Method	OA	UA	PA	F1-score	IoU
WDCD-4	60.44%	53.82%	96.76%	69.17%	52.87%
GCM-4	88.46%	51.60%	80.51%	62.89%	45.87%
GAN-CDM-4	95.67%	86.17%	91.90%	88.94%	80.09%
MSCFF-4	93.50%	88.08%	94.52%	92.60%	87.80%
WDCD	81.76%	50.96%	99.45%	67.39%	50.82%
GCM	86.44%	36.19%	82.34%	50.28%	33.59%
GAN-CDM	94.33%	77.68%	98.34%	86.80%	76.67%
DeepLab	87.72%	91.26%	81.37%	86.00%	75.50%
DCN	92.37%	95.96%	87.27%	91.40%	84.20%
MSCFF	94.96%	95.05%	93.93%	94.50%	89.50%

Limitations of GAN-CDM

Although GAN-CDM-based models are trained only on block-level labels and outperform the deep learning-based baseline methods and Sen2Cor, a few labeled pixels are still needed to obtain the classification thresholds for GAN-CDM-based models with different input bands and sensors. Obtaining the threshold from the outputs of clear training images, as done in the WDCD method, is a promising alternative that will be considered in future work. Cloud shadow detection is also an important step in the pre-processing of remote sensing images. Although GAN-CDM-based models consider the physical process of cloud absorption, the cloud shadow component is fused into the two-way transmission layer produced by GAN-CDM (Fig. 19 a)). In addition, the shadows of other objects are also fused into the two-way transmission layer. It is therefore challenging to find a simple binarization process that separates cloud shadows from the two-way transmission layer. Cloud shadows are mainly determined by the clouds and by the sun azimuth and altitude angles, which means that cloud shadows can be detected from these parameters. A traditional physical model can therefore be used to obtain cloud shadows from the cloud detection results, as in Fmask. Another limitation of GAN-CDM is that it is difficult to produce “clear” images from all cloud images (Fig. 19 b)). Instead, GAN-CDM may transform cloud pixels into clear-looking pixels such as barren or snow/ice, whose textures are similar to clouds, so that the “clear” images produced by G1 can still fool D.
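
As a rough illustration of the geometric relation between clouds, sun angles and shadows (and not the Fmask procedure itself), the sketch below projects a cloud pixel to its expected shadow position. It assumes a nadir view, flat terrain, a north-up image, and a known cloud height; the cloud height in particular is an assumed input that is unknown in practice and would have to be estimated or searched over. The function name and parameters are hypothetical.

```python
import math


def shadow_offset(cloud_height_m, sun_azimuth_deg, sun_elevation_deg,
                  pixel_size_m=30.0):
    """Displacement (d_row, d_col) in pixels from a cloud pixel to its shadow.

    sun_azimuth_deg:   clockwise from north, direction towards the sun.
    sun_elevation_deg: sun altitude angle above the horizon.
    Assumes nadir viewing, flat terrain and a north-up image grid.
    """
    # Horizontal distance between the cloud and its shadow on the ground.
    distance = cloud_height_m / math.tan(math.radians(sun_elevation_deg))
    az = math.radians(sun_azimuth_deg)
    # The shadow falls away from the sun, i.e. towards azimuth + 180 degrees.
    east = -distance * math.sin(az)
    north = -distance * math.cos(az)
    d_col = east / pixel_size_m
    d_row = -north / pixel_size_m  # image rows increase towards the south
    return d_row, d_col


# Sun due south at 45 degrees elevation, cloud assumed at 3000 m:
# the shadow lies about 100 Landsat pixels to the north (negative rows).
d_row, d_col = shadow_offset(3000.0, sun_azimuth_deg=180.0, sun_elevation_deg=45.0)
print(round(d_row, 2), round(d_col, 2))
```

A shadow search based on this relation would shift the detected cloud mask by the offset for a range of plausible cloud heights and keep the shift that best overlaps dark pixels.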

Fig. 19

Visual results on two Landsat 8 samples. Columns 1–6 are false-color composite images (RGB: 5/4/3), restored images, reconstructed images and cloud distortion layers of GAN-CDM-based models.

Conclusions

In this study, a new algorithm, GAN-CDM, for cloud detection in remote sensing images has been presented, building upon and adapting our cloud removal model CR-GAN-PM for cloud detection. The proposed GAN-CDM and the current study exhibit key advantages and results. First, unlike some end-to-end deep learning-based methods that require pixel-level labels for training, GAN-CDM is designed to extract cloud components from input images with only image blocks and corresponding block-level labels. Second, a block-level labeled Landsat 8 cloud detection dataset over the globe, WHUL8-CDb, was created, harmonized and labeled to train the proposed GAN-CDM-based models. The training dataset, well-trained GAN-CDM models and code are released online (doi:https://doi.org/10.5281/zenodo.6420027 and https://github.com/Neooolee/GANCDM) for reproducibility and to allow further model development and intercomparison by a broader community. Finally, by using coarse auxiliary datasets for threshold selection, the pre-trained GAN-CDM-based models achieve competitive performance on four cloud detection validation datasets, two with Landsat 8 images and two with Sentinel-2 images. The worldwide distribution and different resolutions of the validation datasets confirm the good performance of GAN-CDM-based models across land cover types and resolutions. Both visual and quantitative analyses of the experimental results demonstrate that the proposed hybrid GAN-CDM-based models achieve higher accuracy and better transferability than baseline deep learning-based models for cloud detection when trained on Landsat 8 and tested on Sentinel-2. In future work, we will mainly focus on cloud shadow detection and on improving cloud detection accuracy, especially for the most challenging classes. We will design a post-processing method to correctly separate cloud shadows from the two-way transmission layer. We will also consider designing a solution that does not need a coarse auxiliary dataset to obtain the optimal threshold.

CRediT authorship contribution statement

Jun Li: Conceptualization, Methodology, Writing – original draft, Funding acquisition. Zhaocong Wu: Supervision, Project administration. Qinghong Sheng: Supervision, Project administration. Bo Wang: Validation, Resources, Funding acquisition. Zhongwen Hu: Writing – review & editing. Shaobo Zheng: Data curation, Software. Gustau Camps-Valls: Investigation, Methodology, Writing – review & editing. Matthieu Molinier: Investigation, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
