Elizar Elizar, Mohd Asyraf Zulkifley, Rusdha Muharar, Mohd Hairi Mohd Zaman, Seri Mastura Mustaza.
Abstract
In general, most of the existing convolutional neural network (CNN)-based deep-learning models suffer from spatial-information loss and inadequate feature-representation issues. This is due to their inability to capture multiscale-context information and the exclusion of semantic information throughout the pooling operations. In the early layers of a CNN, the network encodes simple semantic representations, such as edges and corners, while, in the latter part of the CNN, the network encodes more complex semantic features, such as complex geometric shapes. Theoretically, it is better for a CNN to extract features from different levels of semantic representation because tasks such as classification and segmentation work better when both simple and complex feature maps are utilized. Hence, it is also crucial to embed multiscale capability throughout the network so that the various scales of the features can be optimally captured to represent the intended task. Multiscale representation enables the network to fuse low-level and high-level features from a restricted receptive field to enhance the deep-model performance. The main novelty of this review is a comprehensive taxonomy of multiscale-deep-learning methods, which includes details of several architectures and their strengths that have been implemented in the existing works. Predominantly, multiscale approaches in deep-learning networks can be classified into two categories: multiscale feature learning and multiscale feature fusion. Multiscale feature learning refers to the method of deriving feature maps by examining kernels over several sizes to collect a larger range of relevant features and predict the input images' spatial mapping. Multiscale feature fusion uses features with different resolutions to find patterns over short and long distances, without requiring a deep network.
Additionally, several examples of the techniques are also discussed according to their applications in satellite imagery, medical imaging, agriculture, and industrial and manufacturing systems.
Keywords: artificial intelligence; convolutional neural network; deep learning; machine learning; multiscale features; neural network
Year: 2022 PMID: 36236483 PMCID: PMC9573412 DOI: 10.3390/s22197384
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. The primary taxonomy of multiscale-deep-learning architectures used in classification and segmentation tasks.
Figure 2. Multiscale receptive fields of deep-feature maps that are used to activate the visual semantics and their contexts. Multiscale representations help in better segmenting the objects by combining low-level and high-level representations.
Figure 3. A multiscale CNN, defined as a network with multiple distinct CNNs with various contextual input sizes that run concurrently, whereby the outputs are combined at the end of the network to obtain rich multiscale semantic features.
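The parallel-branch idea in Figure 3 can be illustrated with a minimal, framework-free Python sketch: the same feature extractor is run on the input at several scales, and the per-scale features are concatenated at the end. The `avg_pool`, `extract`, and `multiscale_features` helpers, the two-value stand-in extractor, and the scale factors are illustrative assumptions, not the networks from the surveyed papers.

```python
def avg_pool(image, factor):
    """Downsample a 2-D image (list of lists) by averaging factor x factor blocks."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - factor + 1, factor):
        row = []
        for j in range(0, w - factor + 1, factor):
            block = [image[i + di][j + dj] for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def extract(image):
    """Stand-in feature extractor: global mean and max of the image."""
    flat = [v for row in image for v in row]
    return [sum(flat) / len(flat), max(flat)]

def multiscale_features(image, factors=(1, 2, 4)):
    """Run the same extractor on several scales and concatenate the results."""
    feats = []
    for f in factors:
        scaled = image if f == 1 else avg_pool(image, f)
        feats.extend(extract(scaled))
    return feats

img = [[float((i * 8 + j) % 5) for j in range(8)] for i in range(8)]
print(len(multiscale_features(img)))  # 2 features per scale x 3 scales = 6
```

In a real multiscale CNN, each branch would be a convolutional sub-network rather than a hand-written statistic, but the structure is the same: parallel branches over different contextual input sizes, concatenated into one feature vector.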
Figure 4. The spatial-pyramid-pooling module extracts information at different scales, which varies among the different subregions. Using a four-level pyramid, the pooling kernels cover the whole, half, and smaller portions of the image, and a more powerful representation can be obtained by fusing the information from the different subregions within these receptive fields.
Figure 5. A multilevel spatial bin, with the example of a bin size of 6, in which the resultant feature maps are segmented into 6 × 6 subsets.
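The mechanism in Figures 4 and 5 can be made concrete with a short sketch of spatial pyramid pooling: each pyramid level n splits the feature map into an n × n grid of bins and pools each bin, so the output length depends only on the chosen levels, never on the input size. The four-level configuration {1, 2, 3, 6} and the use of max pooling per bin are assumptions for illustration.

```python
def spp(feature_map, levels=(1, 2, 3, 6)):
    """Spatial pyramid pooling: for each level n, split the map into an
    n x n grid of bins and max-pool each bin, yielding a fixed-length vector."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for bi in range(n):
            for bj in range(n):
                # Integer bin boundaries, rounded so the bins tile the whole map.
                r0, r1 = bi * h // n, (bi + 1) * h // n
                c0, c1 = bj * w // n, (bj + 1) * w // n
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

fmap = [[float(i * 12 + j) for j in range(12)] for i in range(12)]
vec = spp(fmap)
print(len(vec))  # 1 + 4 + 9 + 36 = 50 bins, regardless of input size
```

This fixed-length property is what lets SPP sit between convolutional layers and fully connected layers while accepting arbitrarily sized inputs.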
Figure 6. In ASPP, the atrous convolution uses a parameter called the dilation rate, which adjusts the field of view to allow a wider receptive field for better semantic-segmentation results. By increasing the dilation rate at each block, the spatial resolution can be preserved, and a deeper network can be built by capturing features at multiple scales.
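A minimal 1-D sketch of atrous (dilated) convolution: the kernel taps are spaced `dilation` samples apart, so stacking layers with increasing dilation rates grows the receptive field quickly without pooling away spatial resolution. The helper names and the example dilation schedule (1, 2, 4) are illustrative assumptions.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Valid' 1-D convolution with gaps of (dilation - 1) samples between taps."""
    k = len(kernel)
    span = (k - 1) * dilation  # distance covered by one layer's taps
    return [sum(kernel[t] * signal[i + t * dilation] for t in range(k))
            for i in range(len(signal) - span)]

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated convolutions (stride 1):
    each layer adds (kernel_size - 1) * dilation input positions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

x = [0.0] * 8 + [1.0] + [0.0] * 8          # unit impulse
y = dilated_conv1d(x, [1.0, 1.0, 1.0], 2)  # 3-tap kernel, dilation 2
print(receptive_field(3, [1, 2, 4]))       # 1 + 2 + 4 + 8 = 15
```

Three 3-tap layers with dilations 1, 2, and 4 already see 15 input samples, whereas the same stack without dilation would see only 7; this is the efficiency that ASPP exploits in 2-D.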
Figure 7. In early fusion, all local attributes (shapes and colors) are retrieved from identical regions and locally concatenated before encoding. In late fusion, image representations are derived independently for each attribute and concatenated afterward.
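The distinction in Figure 7 can be sketched with toy feature vectors: early fusion concatenates the raw attributes and encodes them once, while late fusion encodes each attribute separately and concatenates the encodings. The two-value `encode` stand-in is an assumption for illustration, not a real image descriptor.

```python
def encode(vector):
    """Stand-in encoder: mean and range of a feature vector."""
    return [sum(vector) / len(vector), max(vector) - min(vector)]

def early_fusion(shape_feats, color_feats):
    """Concatenate the raw local attributes first, then encode once."""
    return encode(shape_feats + color_feats)

def late_fusion(shape_feats, color_feats):
    """Encode each attribute independently, then concatenate the encodings."""
    return encode(shape_feats) + encode(color_feats)

shape = [0.2, 0.4, 0.6]
color = [0.9, 0.1, 0.5]
print(early_fusion(shape, color))  # one joint representation
print(late_fusion(shape, color))   # per-attribute representations, concatenated
```

Note the structural difference: early fusion yields a single joint code whose length is fixed by the encoder, while late fusion's output grows with the number of attributes, preserving per-attribute information at the cost of a larger representation.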
Figure 8. The feature-pyramid-network (FPN) model combines low- and high-resolution features via a top-down pathway to enrich the semantic features at all levels.
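The top-down pathway of Figure 8 can be sketched as nearest-neighbour upsampling of the coarse, semantically strong map followed by an element-wise addition with the finer map arriving through the lateral connection. Real FPNs also apply 1 × 1 and 3 × 3 convolutions around the merge; this toy version omits them.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def top_down_merge(coarse, fine):
    """FPN-style merge: upsample the semantically strong coarse map and add
    it element-wise to the spatially precise fine map (lateral connection)."""
    up = upsample2x(coarse)
    return [[up[i][j] + fine[i][j] for j in range(len(fine[0]))]
            for i in range(len(fine))]

c3 = [[1.0, 2.0], [3.0, 4.0]]       # coarse, high-level 2x2 map
c2 = [[0.1] * 4 for _ in range(4)]  # fine, low-level 4x4 map
p2 = top_down_merge(c3, c2)
print(p2[0])  # [1.1, 1.1, 2.1, 2.1]
```

Repeating this merge down the pyramid gives every output level both the fine spatial detail of the early layers and the semantic strength of the deep layers.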
The Application of Multiscale Deep Learning in Satellite Imagery.
| Literature | Target Task | Network Structure | Method | Strength | Weakness |
|---|---|---|---|---|---|
| Gong et al., 2019 | Hyperspectral Image | Spatial Pyramid Pooling | CNN with multiscale convolutional layers, using multiscale filter banks with different metrics to represent the features for HSI classification. | The accuracy is comparable to or even better than that of other classifiers in both the spectral and spectral-spatial classification of HSI images. | Extracts only the spatial features within limited-size filtering or convolutional windows. |
| Hu et al., 2018 | Small Objects | Multiscale-Feature CNN | Identifies small objects by extracting features at different object convolution levels and applying multiscale features. | When compared with Faster RCNN, the accuracy of the small-object detection is significantly higher. | The performance is restricted by the computational costs and image representations. |
| Cui et al., 2019 | Hyperspectral Image | Atrous Spatial Pyramid Pooling | Integrates both fused features from multiple receptive fields and multiscale spatial features based on the structure of the feature pyramid at various levels. | Better accuracy compared with other classification methods on the Indian Pine, Pavia University, and Salinas Datasets. | The classification significantly depends on the quality and quantity of the labeled samples, which are costly and time-consuming to obtain. |
| Li et al., 2019 | Aerial Image | Multiscale U-Net | The main structure is a U-Net with cascaded dilated convolution at the bottom with varying dilation rates. | Achieved the best accuracy on the whole Inria Aerial Image Dataset compared with four well-known methods, and the best IoU on the Chicago and Vienna images in the same dataset. | The average IoU performance is still very weak, especially on the Inria dataset. |
| Gong et al., 2021 | Hyperspectral Image | Multiscale Fusion + Spatial Pyramid Pooling | The main structure includes a 3D CNN module, a squeeze-and-excitation module, and a 2D CNN pyramid-pooling module. | The method was evaluated on three public hyperspectral classification datasets: Indian Pine, Salinas, and Pavia University. The classification accuracies were 96.09%, 97%, and 96.56%, respectively. | The method still misclassifies bricks and gravel, and the classification performance remains weak, especially on the Indian Pine dataset. |
| Liu et al., 2021 | Hyperspectral Image | Multiscale Fusion | Multiscale feature learning uses three simultaneous pretrained ResNet sub-CNNs, a fusion operation, and a U-shaped deconvolution network. A region-proposal network (RPN) with an attention mechanism extracts building-instance locations, which are used to eliminate building occlusion. | When compared with a Mask R-CNN, the proposed method improved the performance by 2.4% on the self-annotated building dataset for the instance-segmentation task, and by 0.17% on the ISPRS Vaihingen semantic-labeling-contest dataset. | The use of fusion strategies invariably results in increased computational and memory overhead. |
| Liu et al., 2018 | UC Merced, SIRI-WHU, and AID Datasets | Multiscale CNN + SPP | The proposed method trains the network on multiscale images by developing a dual-branch CNN: F-net (trained at a fixed scale) and V-net (trained with varied input scales every n iterations). | The MCNN reached a classification accuracy of 96.66 ± 0.90 on the UC Merced Dataset, 93.75 ± 1.13 on the SIRI-WHU Dataset, and 91.80 ± 0.22 on the AID Dataset. | This method reduces the possibility of feature discrimination by focusing solely on the feature map from the last CNN layer and ignoring the feature data from additional layers. |
| Gao et al., 2022 | Hyperspectral Image | Multiscale Fusion | This method employs a cross-spectral spatial-feature-extraction module (SSCEM), which sends the previous CNN layer's information into the spatial and spectral extraction branches independently, so that changes in the other domain after each convolution can be fully exploited. | The proposed network outperforms many deep-learning-based networks on three HSI datasets. It also reduces the number of training parameters, which helps, to a certain extent, to prevent overfitting. | The performance is restricted by the complexity of the network structure, which implies a greater computational cost. |
The Application of Multiscale CNNs in Medical Imaging.
| Literature | Target Task | Network Structure | Method | Strength | Weakness |
|---|---|---|---|---|---|
| Wolterink et al., 2017 | Vessel Segmentation | CNN + Stacked Dilation Convolution | A ten-layer CNN. The first eight layers are the feature-extraction layers, whereas Layers 9 and 10 are fully connected classification layers. Each feature-extraction layer uses 32 kernels, and the dilation rate increases between Layers 2 and 7. | The myocardium and blood pool had Dice indices of 0.80 ± 0.06 and 0.93 ± 0.02, respectively, average distances to boundaries of 0.96 ± 0.31 and 0.89 ± 0.24 mm, respectively, and Hausdorff distances of 6.13 ± 3.76 and 7.07 ± 3.01 mm, respectively. | Due to hardware limitations, the work still used a large receptive field, which led to less precise predictions. |
| Du et al., 2020 | Vessel Segmentation | Dilated Residual Network + Modified SPP | The network's inception module initializes a multilevel feature representation of cardiovascular images. The dilated-residual-network (DRN) component extracts features, classifies the pixels, and anticipates the segmentation zones. A hybrid pyramid-pooling network (HPPN) then aggregates the local and global DRN information. | Best quantitative segmentation results compared with four well-known methods in all five substructures (left ventricle (LV), right ventricle (RV), left atrium (LA), right atrium (RA), and LV myocardium (LV_Myo)). | The HD value of this method is higher than that of U-Net, which shows that it still has some issues with segmenting small targets. |
| Kim et al., 2018 | Lung Cancer | Multiscale Fusion CNN | Multiscale-convolution inputs with varying levels of inherent contextual abstract information at multiple scales, with progressive integration and multistream feature integration in an end-to-end approach. | On two parts of the LUNA16 Dataset (V1 and V2), the method outperformed other approaches by a wide margin. The average CPMs were 0.908 for V1 and 0.942 for V2. | The anchor scheme used by the nodule detectors introduces an excessive number of hyperparameters that must be fine-tuned for each unique problem. |
| Muralidharan et al., 2022 | Chest X-ray | Multiscale Fusion | The input image is divided into seven modes, which are then fed into a multiscale deep CNN with 14 layers (blocks) and four extra layers. Each block has an input layer, convolution layer, batch-normalization layer, dropout layer, and max-pooling layer, and the block is stacked three successive times. | The proposed model successfully differentiates COVID-19 from viral pneumonia and normal classes, with accuracy, precision, recall, and F1-score values of 0.96, 0.97, 0.99, and 0.98, respectively. | The obtained results are still based on random combinations of the extracted modes, so the model must be run with every possible combination of the hyperparameters to obtain the desired result. |
| Amer et al., 2021 | Echocardiography | Multiscale Fusion + Cascaded Dilated Convolution | The network uses residual blocks and cascaded-dilated-convolution modules to extract both coarse and fine multiscale features from the input image. | A Dice-similarity measure of 95.1% compared with the expert's annotation, surpassing the Deeplabv3 and U-Net performances by 8.4% and 1.2%, respectively. | The work only measures the image-segmentation performance, without including the LV-ejection-fraction (ED and ES) clinical cardiac indicators. |
| Yang et al., 2021 | Cardiac MRI | Dilated Convolution | The dilated block of the segmentation network captures and aggregates multiscale information to create segmentation probability maps. The discriminator differentiates the segmentation probability map and the ground truth at the pixel level to provide confidence probability maps. | The Dice coefficients on ACDC 2017 for the ED and ES are 0.94 and 0.89, respectively. The Hausdorff distances for the ED and ES are 10.6 and 12.6 mm, respectively. | The model still produces weak Dice coefficients in both the ED and ES of the left-ventricle-myocardium part. |
| Wang et al., 2021 | Cardiac MRI | Multiscale Fusion/Dilated Convolution | The encoder uses dilated convolution. The decoder reconstructs the full-size skip-connection structure for contextual-semantic-information fusion. | The Dice coefficients on the ACDC 2017, MICCAI 2009, and MICCAI 2018 datasets reached 96.2%, 98.0%, and 96.8%, respectively. Overall, Jaccard indices of 0.897, 0.964, and 0.937 were observed, with Hausdorff distances of 7.0, 5.2, and 7.5 mm, respectively. | The work only measures the image-segmentation performance, without including the LV-ejection-fraction (ED and ES) clinical cardiac indicators. |
| Amer et al., 2022 | Echocardiography | U-Net + Multiscale Spatial Attention + Dilated Convolution | The model uses a U-Net architecture with channel attention and multiscale spatial attention to learn multiscale feature representations with diverse modalities, as well as shape and size variability. | The proposed model outperformed the basic U-Net, ResDUnet, Attention U-Net, and U-Net3+ models by 4.1%, 2.5%, 1.8%, and 0.4%, respectively, on lung CT images, and by 2.8%, 1.6%, 1.1%, and 0.6%, respectively, on the left-ventricle images. | The approach still struggles to capture edge details accurately, and it loses segmentation detail at complicated edges. |
The Application of Multiscale Deep Learning in Agriculture Sensing.
| Literature | Target Task | Network Structure | Method | Strength | Weakness |
|---|---|---|---|---|---|
| Hu et al., 2018 | Plant Leaf | Multiscale Fusion CNN | With a series of bilinear-interpolation operations, the input image is split into several low-resolution images, which are then fed into the network so that it learns different features at different depths. | Produced a better accuracy rate on most of the MalayaKew Leaf Dataset and the LeafSnap Plant Leaf Dataset. | The training process required a more complex sample set that provides both whole and segmented images. |
| Li et al., 2018 | Chinese Herbal Medicines | Multiscale Fusion CNN | Near and far multiscale input images are fused into a six-channel image using a CNN of three convolutional and three pooling layers. | The requirements of Chinese-herbal-medicine classification were met by the model, with a classification accuracy of more than 90%. | The method still suffers from limited training data, lower classification accuracy, and a weak ability to avoid interference. |
| Turkoglu et al., 2021 | ZueriCrop Dataset | Early Fusion + CNN | The model consists of layered CNN networks. In a hierarchical tree, different network levels correspond to increasingly finer label resolutions. At the refining stage, the three-dimensional probability regions from three different stages are passed to the CNN. | The achieved precision, recall, F1 score, and accuracy are 0.601, 0.498, 0.524, and 0.88, respectively, which outperforms the advanced benchmark methods. | It is unclear how to adapt the model layout to standard CNNs without affecting the feature-extraction backbone for recurrent networks. |
| Li et al., 2021 | Crop Image (UAVSAR and RapidEye) | Multiscale Fusion CNN | A sequence of object scales is gradually fed into the CNN, which transforms the acquired features from smaller scales into larger scales by adopting gradually larger convolutional windows. | This technique provides a novel method for solving the issue of image classification for a variety of terrain types. | The model still generates blurred boundaries between crop fields due to the requirement for an input patch. |
| Wang et al., 2021 | Tomato Gray Mold Dataset | Feature Fusion + MobileNetv2 + Channel-Attention Module | MobileNetv2 was used as the base network, whereby multiscale feature fusion provides the fused feature maps. The efficient channel-attention module then enhances these feature maps, and the relevant feature paths are weighted. The resultant features were used to predict mold on tomatoes. | Precision and F1 score reached 0.934 and 0.956, respectively, outperforming the Tiny-YOLOv3, MobileNetv2-YOLOv3, MobileNetv2-SSD, and Faster R-CNN performances. | Missed detections persist, especially at extreme shooting angles, leading to inaccurate early diagnosis of different plant parts under different shooting conditions. |
| Zhou et al., 2022 | Fish Dataset | ASPP + GAN | A generative adversarial network (GAN) is introduced before applying the CNN to augment the existing dataset. The ASPP module then fuses the input and output of a dilated convolutional layer with a short sample rate to acquire rich multiscale contextual information. | On the validation dataset, the obtained F1 score, GA, and mIoU reached 0.961, 0.981, and 0.973, respectively. | The model still loses a lot of segmentation detail at complicated edges. |
The Application of Multiscale Deep Learning in Industrial and Manufacturing Systems.
| Literature | Target Task | Network Structure | Method | Strength | Weakness |
|---|---|---|---|---|---|
| Ding X., He Q., 2017 | Fault Bearing Dataset | Wavelet-Packet-Energy (WPE) Image + Deep Convolutional Network | The deep convolutional network has three convolutional layers, two max-pooling layers, and one multiscale layer. The multiscale layer combines the final convolutional layer's output with the subsequent pooling layer's output to diagnose any issue in the bearing. Six spindle-bearing datasets with ten-class health states under four loads are used to verify the proposed method's performance. | The deep convolutional network achieved stable and high identification accuracies of 98.8%, 98.8%, 99.4%, 99.4%, 99.8%, and 99.6% for datasets A, B, C, D, E, and F, respectively. | Increased complexity, which implies a greater computational cost and limits practical deployment. |
| Jiang et al., 2020 | C-MAPSS Dataset | Bi-LSTM and Multiscale-CNN Fusion Network | The last three layers of the fusion network use a Bi-LSTM with 64 cells, a multiscale CNN with 32 convolution kernels, and 2 × 2 maximum-pooling kernels. The combined output of the two networks determines the predicted RUL. | The proposed fusion model has better RMSE indicators compared with the CNN, LSTM, and Bi-LSTM, tested on four subsets of the dataset. | The method is prone to overfitting, and it is difficult to use the dropout algorithm to prevent it, because recurrent connections to LSTM units are probabilistically removed from the activation and weight updates during network training. |
| Wang et al., 2021 | Pronostia Bearing Dataset | Multiscale CNN with Dilated Convolution Block | A complex signal is decomposed using an integrated dilated convolution block. Multiple stacked integrated dilated convolution blocks are fused to create a multiscale feature extractor that mitigates the information loss. | The mean absolute error (MAE) and root mean squared error (RMSE) of the proposed method are the lowest among the comparison methods. | The method does not include uncertainty prediction in the deep-learning model, which limits its practical use. |
| Zhu et al., 2019 | Pronostia Bearing Dataset | Multiscale CNN | The time-frequency representation (TFR) can represent a complex and nonstationary signal of the bearing degradation. The TFRs and their assigned RULs are sent to a multiscale model structure to extract more features that can be used to predict the RUL. The multiscale layer maintains the global and local properties to boost the network capacity. | The mean absolute error (MAE) and root mean squared error (RMSE) of the proposed method are the lowest among the other data-driven methods. | The performance is restricted by the complexity of the network structure, which implies a greater computational cost. |
| Li et al., 2020 | C-MAPSS Dataset | Multiscale Deep Convolutional Attention Network | The MS-DCNN has three different sizes of convolution operations and multiscale blocks assembled in parallel. The three multiscale-block output-feature maps are passed to a standard CNN after the multiscale convolution. At the end of the MS-DCNN, one neuron is connected to provide the final predicted RUL value. | Compared with other advanced methods, such as the semi-supervised setup, MODBNE, DBN, and LSTM, the RMSE indicators of the proposed method reduced the error by 8.92%, 14.87%, 3.55%, and 1.94%, respectively, tested on four datasets. | To learn the prediction models, the method needs a substantial amount of data, which may not be feasible in real-life situations. |
| Wang et al., 2021 | Pronostia Bearing Dataset | Multiscale Convolutional Attention Network | First, self-attention modules are constructed to combine multisensor data. Then, an automatic multiscale learning technique is implemented. Finally, high-level representations are loaded into dynamic dense layers for regression analysis and RUL estimation. | The proposed strategy fuses multisensor data and improves the RUL-prediction accuracy. Its prediction performance was better than that of previous prognostics methods. | The approach incorrectly presumes that the monitoring data collected by different sensors contribute equally to the RUL estimation, which leads to an inaccurate RUL prediction. |