
Bridge crack detection based on improved single shot multi-box detector.

Guanlin Lu¹, Xiaohui He¹, Qiang Wang¹, Faming Shao¹, Jinkang Wang¹, Qunyan Jiang¹.

Abstract

Owing to the development of computerized vision technology, object detection based on convolutional neural networks is being widely used in the field of bridge crack detection. However, these networks have limited utility in bridge crack detection because of low precision and poor real-time performance. In this study, an improved single-shot multi-box detector (SSD) called ISSD is proposed, which seamlessly combines the depth separable deformation convolution module (DSDCM), inception module (IM), and feature recalibration module (FRM) in a tightly coupled manner to tackle the challenges of bridge crack detection. Specifically, DSDCM was utilized for extracting the characteristic information of irregularly shaped bridge cracks. IM was designed to expand the width of the network, reduce network calculations, and improve network computing speed. The FRM was employed to determine the importance of each feature channel through learning, enhance the useful features according to their importance, and suppress the features that are insignificant for bridge crack detection. The experimental results demonstrated that ISSD is effective in bridge crack detection tasks and offers competitive performance compared to state-of-the-art networks.

Year: 2022 PMID: 36194591 PMCID: PMC9531840 DOI: 10.1371/journal.pone.0275538

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

1. Introduction

As a fundamental component of the transportation system, a bridge not only carries traffic but must also ensure the safety of the people who use it. However, bridges are prone to various types of damage owing to natural or human factors, and deck cracks are among the most common problems of bridges in service. Cracks accelerate the corrosion of the steel reinforcement, resulting in deterioration of the bridge structure [1]. Furthermore, cracks affect the integrity, durability, and seismic performance of a bridge and considerably reduce its quality [2, 3]. To keep bridges in a healthy state, it is important for the engineering community, government agencies, and bridge construction companies to detect and repair cracks in a timely manner.

The development of bridge crack-detection methods has been relatively slow. Traditional manual inspection is time-consuming and laborious and exposes inspectors to safety risks. A bridge inspection vehicle is a special vehicle that provides a working platform for inspection personnel and carries bridge inspection instruments for mobile inspection and/or maintenance operations. However, its utility is limited by its high production cost and complex manufacturing process. Non-destructive testing technology has been widely used in bridge crack detection. Common non-destructive testing methods include optical fiber sensing [4], ultrasonic detection [5], and acoustic emission detection [6]. However, these methods have limitations. Optical fiber sensing requires the laying of optical fibers, which is expensive. Ultrasonic detection is only suitable for detecting cracks in a single direction of a bridge deck and has a small detection range.
Acoustic emission detection can only detect cracks while they are forming and cannot detect cracks that have already formed. The high cost, restricted working conditions, and low speed of traditional methods based on manual inspection or instrument signal analysis therefore make it important to devise a new technical means of real-time, efficient bridge crack detection.

Computer vision technology has improved rapidly with the development of computer automation. As an important research topic in computer vision, object detection aims to locate a target of interest in an image and to determine its category and position accurately. In recent years, object detection has been widely used in intelligent video surveillance, fault detection, medical treatment, and other fields, and many scholars have proposed object detection models for crack detection. Li et al. [7] presented a model based on a support vector machine (SVM) to detect bridge cracks. Nishikawa et al. [8] used several simple image filters to design a multi-sequential image filter. Wang et al. [9] presented a model based on mathematical morphology for detecting cracks in steel. Cha et al. [10] combined the Hough transform with an SVM to detect cracks. These methods mainly rely on manually extracted features. Compared with traditional crack detection technology, they improve detection accuracy and speed. However, their results are affected by subjective human factors in feature processing, such as the operator's professional ability and interpretation of standards.

In recent years, convolutional neural networks (CNNs) have made significant progress in object detection [11-14], and many researchers have applied CNNs to crack detection. Chen et al.
[15] presented a network called NB-CNN, which combines a CNN with naïve Bayes data fusion to detect cracks. This algorithm can only recognize the location of cracks, and its detection accuracy is low. To achieve higher detection accuracy, Cha et al. [16] proposed a network based on Faster R-CNN [17]; however, its large number of parameters reduces the detection speed. Dung et al. [18] proposed a crack detection model based on a fully convolutional network (FCN), but its detection accuracy is not high because of the influence of complex environmental noise.

Various algorithms have been applied to crack detection, but the data samples used for model training are often collected in an ideal environment and manually modified during labeling. In actual detection, environmental interference (such as illumination, occlusion, and jitter) causes a domain shift in the input image, reduces the model's ability to extract useful crack features, and makes feature processing more difficult. The outputs are then mixed with too many irrelevant features, which limits detection accuracy. Therefore, most models achieve good results in an ideal environment, but their performance degrades significantly under environmental interference in actual detection.

Given the limitations of traditional detection methods and the shortcomings of current deep learning detection algorithms, this paper designs a new, efficient, and interference-resistant bridge crack detection network based on deep learning, with the aim of improving detection accuracy, speed, and robustness. The main contributions of this study are as follows:

1) A depth separable deformation convolution operation replaces the conventional convolution operation, which optimizes the fitting ability of the prediction box and improves the feature extraction ability of the model.
2) An inception module is introduced to expand the network width while controlling the increase in the number of parameters, which reduces the calculations required for detection, recognition, and classification and improves the detection speed of the model.
3) The feature recalibration module, which combines the channel attention mechanism and the spatial mechanism, improves the detection accuracy of the network by suppressing unimportant features and enhancing important features in space and channels.

The remainder of this paper is organized as follows. Section 2 introduces related studies. Section 3 describes the details of the proposed method. Section 4 presents the experiments on the proposed network and their results. Finally, Section 5 gives a brief conclusion.

2. Preliminaries and related work

Currently, object detection is widely used in intelligent video surveillance, fault detection, medical treatment, and other fields. Many scholars have proposed object detection models based on convolutional neural networks, which can be roughly divided into two types. The first type is the candidate region-based object detection model, represented by the region-based convolutional neural network (R-CNN) and its faster, region proposal-based successor (Faster R-CNN), which divides the detection process into two steps. In the first step, candidate regions are selected from the input image by a candidate region selection algorithm (such as selective search [19] or edge search [20]) and their feature information is extracted. In the second step, the feature information obtained from the candidate regions is classified and the box positions are adjusted to output the final detection results. Although these models have high accuracy, their detection speed is slow, and it is difficult for them to meet the real-time requirements of bridge crack detection. The second type is the regression-based object detection model, represented by the single-shot multi-box detector (SSD) [21] and You Only Look Once (YOLO), a unified real-time object detector [22]. Compared with models based on candidate regions, regression-based models have a faster detection speed. Fig 1 shows the structure of the SSD, which is divided into three parts: the main layer based on VGG16 (very deep convolutional networks for large-scale image recognition) [23], the feature extraction layer, and the classification layer. The VGG16 structure in the main layer is modified. First, two convolution layers, Conv6 and Conv7, replace the fully connected layers FC6 and FC7 of the original VGG16 to avoid interference of the fully connected layers with the features and position information of the detected object.
Second, the feature maps of conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2 are combined to form the multiscale feature extraction layer of the SSD. Finally, one 3×3 convolution is applied to each output feature map of the detection layers to obtain the confidence scores required for classification, and another 3×3 convolution is applied to obtain the position information required for box regression. SSD thus performs multiscale object detection and can be trained end to end for bridge crack detection.

Bridge cracks and pavement cracks are both forms of concrete surface cracks. Considering the engineering application prospects of crack detection methods, this paper analyzes some representative crack detection networks based on SSD. Yan et al. [24] designed a pavement crack detection network based on SSD by integrating deformable convolution, which improved the crack detection accuracy of the network. Yang et al. [25] embedded a receptive field enhancement module into the SSD network to enhance its crack feature extraction ability and improve its crack detection accuracy. Feng et al. [26] also proposed an accurate bridge crack-detection algorithm based on SSD. These algorithms achieve good detection results against a simple background with no interference but cannot meet the accuracy and speed requirements of bridge crack detection under complex conditions. Therefore, the original SSD model was optimized in this study to improve detection accuracy and speed.

Fig 1

Structure of SSD.

3. Methodology

The overall architecture of the proposed ISSD is shown in Fig 2. It uses a composite structure, with the DSDCM (depth separable deformation convolution module) and the IM (inception module) as the main components, to extract features efficiently, and the FRM (feature recalibration module) to increase the weight of effective features. Specifically, given an input image x of size 300×300×3, the network generates feature maps of sizes 38×38×512, 19×19×1024, 10×10×512, 5×5×256, 3×3×256, and 1×1×256 after successive stages of DSDCM and composite structure processing. The FRM is then applied to these multiscale feature maps to calibrate the relationships between channels and suppress interference information, which lays a foundation for the subsequent fusion of the multiscale feature maps. Finally, the detections obtained from the fused feature maps are filtered by NMS (non-maximum suppression) to give the final detection results. More implementation details regarding DSDCM, IM, and FRM are described in the following subsections.

Fig 2

Structure of ISSD.

3.1 Depth separable deformation convolution module (DSDCM)

3.1.1 Deformation convolution

Conventional convolution kernels have a fixed size and shape (e.g., 3×3, 5×5, or 7×7), so the adaptability of a model to the geometric deformation of objects comes almost entirely from the diversity of the data [27]. In bridge crack detection, conventional convolution fits long, narrow cracks poorly, which lowers the accuracy of the detection results. Therefore, we adopted deformable convolution, which adds an offset to the position of each sampling point in the conventional convolution kernel so that the kernel can deform arbitrarily, with the aim of enhancing the feature extraction ability of the model for bridge cracks. Typical cases of deformable convolutions are shown in Fig 3.

Fig 3

Three typical cases of convolution kernel shifting of conventional convolution.


Group A is the conventional distribution, Group B is the distribution after arbitrary migration, Group C is the distribution after scaling transformation, and Group D is the distribution after rotational transformation.

The conventional convolution operation is divided into two steps [28]. In the first step, a region R, which corresponds to the receptive field of the convolution kernel, is sampled on the input feature map. In the second step, the value at each sampling point is multiplied by the weight at the corresponding kernel position, and the products are summed. Region R defines the size of the receptive field; for a 3×3 kernel it is given by Eq 1:

$$R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\} \qquad (1)$$

For each position $p_0$ in the output feature map y, the convolution is

$$y(p_0) = \sum_{p_n \in R} w(p_n)\, x(p_0 + p_n)$$

where $p_n$ enumerates the elements in the receptive field R, w is the kernel weight, and x is the input feature map. The deformed convolution kernel is obtained by shifting each element of R by an offset $\Delta p_n$, where the offsets are $\{\Delta p_n \mid n = 1, 2, 3, \ldots, N\}$ and N is the number of elements in R:

$$y(p_0) = \sum_{p_n \in R} w(p_n)\, x(p_0 + p_n + \Delta p_n)$$

Because the offsets are generally not integers, the sampling positions after deformation do not coincide with the pixel positions of the feature map. Bilinear interpolation is therefore used to obtain the sampled values:

$$x(p) = \sum_{q} G(q, p)\, x(q)$$

where $p = p_0 + p_n + \Delta p_n$, q enumerates all integer positions in the feature map, and $G(\cdot,\cdot)$ is the bilinear interpolation kernel.
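The sampling described above can be illustrated with a minimal sketch for a single-channel feature map and a 3×3 kernel. The function names, the toy feature map, and the fixed offsets are assumptions made for illustration (in a real network the offsets are learned), and the kernel G is written in the usual separable form max(0, 1 − |a − b|) along each axis, which the text does not spell out.

```python
import math


def bilinear_sample(x, py, px):
    """Value of feature map x at fractional position (py, px); zero outside the map."""
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                # Assumed separable bilinear kernel G(q, p)
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += g * x[qy][qx]
    return value


def deformable_conv_at(x, weights, offsets, p0):
    """Output y(p0) of a 3x3 deformable convolution on a single-channel map."""
    region = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # region R
    total = 0.0
    for (dy, dx), w_n, (oy, ox) in zip(region, weights, offsets):
        # Sample at p0 + p_n + delta p_n
        total += w_n * bilinear_sample(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total


# Toy 5x5 feature map with value 5*row + col (hypothetical data)
feature = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
weights = [1.0 / 9] * 9            # averaging kernel
zero_offsets = [(0.0, 0.0)] * 9    # reduces to a conventional convolution
half_offsets = [(0.5, 0.5)] * 9    # every sampling point shifted by half a pixel

y_regular = deformable_conv_at(feature, weights, zero_offsets, (2, 2))
y_deformed = deformable_conv_at(feature, weights, half_offsets, (2, 2))
```

With zero offsets the result is the ordinary 3×3 average around the centre; with the half-pixel offsets every sample falls between pixels and is obtained by interpolation.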

3.1.2 Depth separable convolution

A conventional convolution maps spatial and channel information jointly: each kernel operates on all channels of the input region at once. Depth separable convolution [29] handles the spatial dimension and the channel (depth) dimension separately and is divided into two steps. The first step is a channel-by-channel convolution, in which a separate kernel is applied to each input channel; for a three-channel input, this generates three feature maps. The second step is a point-by-point convolution, in which the feature maps generated by the channel-by-channel convolution are weighted and combined in the depth direction to generate a new feature map. The depth separable convolution process is illustrated in Fig 4.
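The two steps can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions (no padding, stride 1, no bias); the function names and the toy input are hypothetical and not taken from the paper.

```python
def depthwise_conv(x, kernels):
    """Step 1: convolve each input channel with its own k x k kernel."""
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        rows = len(channel) - k + 1
        cols = len(channel[0]) - k + 1
        out.append([[sum(kernel[i][j] * channel[r + i][c + j]
                         for i in range(k) for j in range(k))
                     for c in range(cols)]
                    for r in range(rows)])
    return out


def pointwise_conv(maps, weights):
    """Step 2: 1x1 convolution; each output map is a weighted sum over channels."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[[sum(w_m * maps[m][r][c] for m, w_m in enumerate(w_out))
              for c in range(cols)]
             for r in range(rows)]
            for w_out in weights]


# Three-channel 4x4 input of ones and one 3x3 kernel of ones per channel
x = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
kernels = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
mixing = [[1.0, 1.0, 1.0], [0.5, 0.0, 0.0]]  # two output channels

depth_maps = depthwise_conv(x, kernels)     # three 2x2 maps, one per channel
output = pointwise_conv(depth_maps, mixing)  # two 2x2 maps
```

The channel-by-channel step never mixes channels; all cross-channel mixing happens in the 1×1 step, which is what allows the two kinds of information to be decoupled.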

Fig 4

Schematic diagram of depth separable convolution.

The number of parameters and the amount of computation in the convolution affect the detection speed of the model. Compared with conventional convolution kernels, depth separable convolution kernels significantly reduce both. Assume that the input image is D•D•M, the size of each convolution kernel is D•D•M, and the number of kernels is N. The number of parameters of conventional convolution is then

$$P_{conv} = D \cdot D \cdot M \cdot N \qquad (7)$$

The number of parameters of the channel-by-channel convolution is

$$P_{depth} = D \cdot D \cdot M \qquad (8)$$

The number of parameters of the point-by-point convolution is

$$P_{point} = 1 \cdot 1 \cdot M \cdot N = M \cdot N \qquad (9)$$

The number of parameters of depth separable convolution is

$$P_{sep} = D \cdot D \cdot M + M \cdot N \qquad (10)$$

Comparing Formula 7 with Formula 10 gives $P_{sep} / P_{conv} = 1/N + 1/D^2$, so depth separable convolution greatly reduces the number of model parameters and improves the efficiency of model operation by decoupling spatial and depth information.
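As a worked example of this comparison, the short sketch below evaluates the two parameter counts for one layer. The layer size (3×3 kernels, 512 input channels, 512 kernels) is a hypothetical choice for illustration, and biases are ignored.

```python
def conventional_params(d, m, n):
    """Parameters of N conventional kernels of size D x D x M."""
    return d * d * m * n


def depth_separable_params(d, m, n):
    """Channel-by-channel part (D x D x M) plus point-by-point part (M x N)."""
    return d * d * m + m * n


d, m, n = 3, 512, 512  # hypothetical layer
conv = conventional_params(d, m, n)
sep = depth_separable_params(d, m, n)
ratio = sep / conv  # equals 1/N + 1/D^2

print(conv, sep, round(ratio, 4))
```

For this layer the separable form needs roughly one ninth of the parameters, and the ratio is dominated by the 1/D² term whenever N is large.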

3.1.3 Depth separable deformable convolution

A deformable convolutional network is a variant of the convolutional neural network that is very effective for solving complex visual tasks and learning dense spatial transformations. Depth separable convolution accelerates the model mainly by decoupling spatial and depth information, which can further reduce the detection time of the model in practical applications. In this study, depth separable convolution is integrated into the deformable convolution process, and a depth separable deformable convolution module is proposed, which further enhances the feature extraction ability, reduces the number of parameters and the size of the network model, and improves its running speed. The entire depth separable deformable convolution process is divided into three steps, as shown in Fig 5. First, depth separable convolution is used to sample the input feature maps to obtain the offset of each pixel point. Subsequently, a bilinear interpolation algorithm is used to obtain the pixel value at each offset position, which is equivalent to deforming the convolution kernel so that it samples a variable shape. Finally, the deformable convolution operation is performed on the input feature maps using the convolution kernel with offsets. Depth separable deformable convolution reduces the number of network parameters and hence not only improves the speed of network operation but also increases network sparsity and enhances the feature extraction ability of the network.

Fig 5

Schematic diagram of the depth separable deformable convolution.

3.2 Inception module (IM)

The most direct way to improve a deep neural network is to increase its scale, that is, its depth and width. Because edge features are the main features of cracks, deepening the network weakens the learning of shallow features, which is not conducive to crack detection. Therefore, we chose to increase the number of filters in each layer, widening the network to improve its performance. However, this approach has two shortcomings: 1) A larger size usually means more parameters, which makes the network more prone to overfitting, especially when samples are insufficient. 2) Even if the size of each layer is increased uniformly, the total amount of computation increases sharply, and many operations are wasted when the network capacity is underutilized. Inspired by reference [30], we adopted a sparse processing method in the feature dimension to alleviate these shortcomings. Specifically, an inception module was introduced into the SSD network to increase its parallel computing ability and reduce the number of parameters. In addition, considering the limited computing resources of field crack detection platforms, we optimized the original inception module. We used 1×1 convolutions as reduction layers to reduce the number of channels and the amount of calculation, and we replaced the 5×5 convolution in the original structure with two 3×3 convolutions, which cover the same receptive field with fewer parameters [31]. Compared with the original structure, the optimized structure not only reduces the number of module parameters and improves the inference speed of the whole network but also makes the module better able to adjust the dimension of the feature map, realizes cross-channel information combination, and helps enhance the overall detection performance of the network. The structures of the two inception modules are compared in Fig 6.
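The savings from the two modifications can be checked with simple arithmetic. The sketch below is illustrative only: the channel counts (64, 128, 256) are hypothetical and are not the values used in ISSD, and biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Weights of a kernel x kernel convolution from c_in to c_out channels."""
    return kernel * kernel * c_in * c_out


def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    return 1 + sum(k - 1 for k in kernels)


c = 64  # hypothetical channel count
one_5x5 = conv_params(5, c, c)
two_3x3 = 2 * conv_params(3, c, c)

# 1x1 reduction: shrink 256 channels to 64 before a 3x3 convolution
direct = conv_params(3, 256, 128)
reduced = conv_params(1, 256, 64) + conv_params(3, 64, 128)

print(one_5x5, two_3x3, direct, reduced)
```

Two stacked 3×3 convolutions see the same 5×5 region as a single 5×5 convolution while using 18 weights per channel pair instead of 25, and the 1×1 reduction cuts the cost of the following 3×3 convolution by shrinking its input depth.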

Fig 6

Schematic diagram of the inception structure.

Group A represents the structure of the original inception and Group B represents the structure of the improved inception.

Introducing inception modules in the feature extraction stage of the optimized model improves feature fusion in the hidden layers and broadens the channels through which contextual information is shared, which helps improve the feature extraction efficiency of the bridge crack model. Although the optimized model increases structural redundancy and the number of parameters, the change in the number of parameters in the feature layers is kept within a small range, and the feature extraction results are batch-normalized before entering the recognition layer. Thus, the increase in calculation is not significant, and the detection speed of the model is improved.

3.3 Feature recalibration module (FRM)

To make the network pay more attention to channel features that carry effective information, suppress irrelevant features, and calibrate the feature relationships between channels, the FRM was introduced into the network. The FRM follows the design of squeeze-and-excitation networks (SENet) [32], and its structure is shown in Fig 7. The FRM process can be divided into three steps. First, the input feature maps are compressed to obtain global information:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$$

where H and W represent the height and width of the feature map, respectively; $u_c$ represents the c-th channel of the feature map; $u_c(i, j)$ represents the pixel in the i-th row and j-th column of the c-th channel; and $z_c$ represents the output after the compression operation.

Fig 7

Structure of the feature recalibration module.

Second, the channel features obtained by the compression operation are activated to generate the weight of each channel:

$$s = \sigma\left(w_2\, \varphi(w_1 z)\right)$$

where $w_1$ and $w_2$ represent the fully connected operations, s represents the generated weights, φ represents the ReLU activation function, and σ represents the sigmoid function. Finally, the weights generated in the activation operation are assigned to the corresponding channels:

$$y_c = s_c \cdot u_c$$

where $y_c$ represents the output matrix of the c-th channel. Through learning, the FRM automatically obtains the importance of each feature channel, enhances the useful features according to their importance, and suppresses the features that are not useful for the current task. Effective feature maps therefore receive large weights and ineffective or weakly effective feature maps receive small weights, which improves the detection of bridge cracks.
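The three steps (compression, activation, reweighting) can be sketched as below for a tiny two-channel input. This is a minimal illustration in the SENet style the text refers to; the weight matrices `w1` and `w2` are hypothetical fixed values rather than learned ones, and biases are omitted.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(w, v):
    """Fully connected operation without bias."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def feature_recalibration(u, w1, w2):
    """u: list of C channels, each an H x W list. Returns (weights s, outputs y)."""
    # Step 1, compression: global average of each channel gives z_c
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]
    # Step 2, activation: s = sigmoid(w2 . relu(w1 . z))
    s = sigmoid(matvec(w2, relu(matvec(w1, z))))
    # Step 3, reweighting: y_c = s_c * u_c
    y = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]
    return s, y


u = [[[1.0, 1.0], [1.0, 1.0]],   # channel 0
     [[4.0, 4.0], [4.0, 4.0]]]   # channel 1
w1 = [[1.0, 1.0]]                # hypothetical: 2 channels -> 1 hidden unit
w2 = [[1.0], [-1.0]]             # hypothetical: 1 hidden unit -> 2 channels
s, y = feature_recalibration(u, w1, w2)
```

With these made-up weights, channel 0 is kept almost unchanged (weight close to 1) and channel 1 is almost entirely suppressed (weight close to 0), which is the kind of per-channel emphasis the module is meant to learn.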

3.4 Loss function

The loss function of the entire network is composed of the position loss ($L_{loc}$) and the confidence loss ($L_{conf}$). The position loss uses the SmoothL1 loss function to calculate the error between the ground truth box and the prediction box, and the confidence loss uses the Softmax loss function to calculate the correct detection probability. The total loss function of the network is

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)$$

where L(x,c,l,g) denotes the total loss function, c represents the degree of confidence, l represents the prediction box, g represents the ground truth box, N represents the number of matches between the ground truth box and the prediction box, α represents the weight coefficient, and x ∈ {0,1}. When the IOU (intersection over union) is greater than the threshold (set to 0.5 in this study), x = 1; otherwise, x = 0. The position loss function is

$$L_{loc}(x, l, g) = \sum_{i \in P} \sum_{m \in \{o_x, o_y, w, h\}} x_{ij}^{p}\, \mathrm{smooth}_{L1}\left(l_i^m - \hat{g}_j^m\right)$$

where i∈P indicates that the i-th prediction box is a positive sample; $o_x$ and $o_y$ represent the offsets in the x and y directions between the center of the prediction box or the ground truth box and the default box, respectively; w and h represent the deviation between the width and height of the prediction box or the ground truth box and the default box, respectively; $x_{ij}^{p}$ is the matching indicator for category p (when the i-th prediction box matches the j-th ground truth box, $x_{ij}^{p} = 1$, otherwise $x_{ij}^{p} = 0$); $\hat{g}_j^m$ is the position parameter of the ground truth box after encoding; and $l_i^m$ is the position parameter of the prediction box. The confidence loss function is

$$L_{conf}(x, c) = -\sum_{i \in P} x_{ij}^{p} \log \hat{c}_i^{p} - \sum_{i \in N_1} \log \hat{c}_i^{0}, \qquad \hat{c}_i^{p} = \frac{\exp(c_i^{p})}{\sum_{p} \exp(c_i^{p})}$$

where i∈N1 indicates that the i-th prediction box is a negative sample, $\hat{c}_i^{0}$ is the probability that the prediction box is the background, and $\hat{c}_i^{p}$ represents the probability calculated by the Softmax function.
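A compact sketch of how the two terms combine is given below. It is a simplified illustration rather than the training code of the paper: every unmatched box is treated as a negative sample (practical SSD implementations usually select a subset of hard negatives), the box offsets are assumed to be already encoded, and all names and the toy values are hypothetical.

```python
import math


def smooth_l1(d):
    """SmoothL1: quadratic near zero, linear for |d| >= 1."""
    a = abs(d)
    return 0.5 * d * d if a < 1.0 else a - 0.5


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detection_loss(pred_offsets, gt_offsets, logits, labels, alpha=1.0):
    """labels[i] is the matched class of box i, with 0 meaning background."""
    positives = [i for i, c in enumerate(labels) if c > 0]
    n = len(positives)
    if n == 0:
        return 0.0  # no matched boxes: the loss is defined as zero
    # Position loss over the four encoded offsets of positive boxes only
    loc = sum(smooth_l1(p - g)
              for i in positives
              for p, g in zip(pred_offsets[i], gt_offsets[i]))
    # Confidence loss: negative log-probability of the matched class,
    # which is the background class (index 0) for negative boxes
    conf = -sum(math.log(softmax(logits[i])[labels[i]])
                for i in range(len(labels)))
    return (conf + alpha * loc) / n


pred_offsets = [[0.1, 0.0, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
gt_offsets = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
logits = [[0.0, 0.0], [0.0, 0.0]]  # [background, crack] scores per box
labels = [1, 0]                    # box 0 matched to a crack, box 1 background
loss = detection_loss(pred_offsets, gt_offsets, logits, labels)
```

In the toy case the position term is 0.005 + 1.5 (one small and one large offset error) and each box contributes log 2 to the confidence term, since both classes receive equal scores.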

4. Experiment and results

This section first introduces the bridge crack dataset used for model training and testing and verifies the effectiveness of each method separately. The proposed network is then compared with FCN [33], SSD, U-Net [34], CrackDFANet [35], LDCC-Net [36], FPHBN [37], and the network in reference [38] (ABCNet). Finally, conclusions are drawn by analyzing the experimental results.

4.1 Experimental dataset and computer environment

In this study, two crack datasets were used as samples: the SDNET [39] and CCIC [40] datasets. The images in the SDNET dataset were collected from walls, roads, and bridge surfaces. The entire dataset contains more than 56000 images, divided into those with and without cracks. The CCIC dataset contains 40000 images of cracks and non-cracks. Deep learning networks are inherently data-hungry and need massive data to support training. This paper therefore analyzes the characteristics of the samples in SDNET and CCIC and constructs a new dataset by merging samples with similar labels. Using fixed-size, non-overlapping 300×300 cropping windows, 16000 images were randomly selected from the SDNET dataset and 14000 images from CCIC and combined into a larger crack dataset called the WCD dataset. Samples from the WCD dataset were proportionally divided into a training set and a test set. The number of different samples in the WCD dataset is listed in Table 1.
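The non-overlapping cropping and random selection can be sketched as follows. This is a generic illustration, not the preprocessing script of the paper: the image size, the seed, and the function names are assumptions.

```python
import random


def crop_windows(width, height, size=300):
    """Top-left corners of non-overlapping size x size windows that fit in the image."""
    return [(x, y)
            for y in range(0, height - size + 1, size)
            for x in range(0, width - size + 1, size)]


def select_random(items, count, seed=0):
    """Randomly select `count` items without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(items, count)


windows = crop_windows(1024, 768)      # hypothetical source image size
chosen = select_random(windows, 4)     # keep four of the available crops
```

Stepping the window origin by the window size guarantees that no two crops share pixels; any border strip narrower than 300 pixels is discarded.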

Table 1

Number of different samples in the WCD dataset.

Set	SDNET crack	SDNET non-crack	CCIC crack	CCIC non-crack
Train	6000	6000	5000	5000
Test	2000	2000	2000	2000

The bridge crack-detection model proposed in this study was implemented in the TensorFlow framework. The experimental hardware was a Dell Precision T3630 workstation; the specific parameters of the workstation are listed in Table 2.

Table 2

Specific index parameters of the workstation.

Hardware/Software	Specification/Parameters/Version
CPU	Intel Core i5 (8th generation)
GPU	NVIDIA GeForce GTX1060/6GB
RAM	8GB
Anaconda	3–5.1.0
Python	2.7.5
TensorFlow	1.10

4.2 Evaluation indicators

There are various object detection models based on the convolutional neural network, and the principles of object detection are different for different detection models. To quantitatively evaluate the detection performance of the models, we must establish a corresponding reference standard to evaluate the detection performance of all models comprehensively and objectively. The commonly used evaluation indexes of object detection include the mean accuracy and number of images processed per second (FPS) [41]. The confusion matrix is the most basic and intuitive method for measuring an object detection model [42]. The confusion matrix includes the following four indicators: ① The true value is positive, and the model considers it to be positive (true positive = TP). ② The true value is positive, but the model considers it to be negative (false negative = FN). ③ The true value is negative, but the model considers it positive (false positive = FP). ④ The true value is negative, and the model considers it to be negative (true negative = TN). False negatives are statistical errors of the first type, and false positives are statistical errors of the second type. Because the indicators in the confusion matrix count the number of samples, it is difficult to accurately evaluate the model using only the number of samples when processing a large amount of data. Four secondary indicators were extended from the basic statistical results of the confusion matrix: accuracy, accuracy rate, and recall rate. Accuracy is the proportion of all correctly judged results in the model to the total observed values, and its formula is shown in Eq (18). Accuracy rate refers to the proportion of all results in which the model prediction is positive and correct. Its formula is shown in (19). The recall rate refers to the proportion of all results whose true values are positive and correctly predicted by the model. Its formula is shown in (20). 
Precision and recall are a pair of competing indicators: in general, when precision is high, recall is often low, and when recall is high, precision is often low. To account for both indicators, the F-measure (the weighted harmonic mean of precision and recall) is used, and its formula is shown in Eq (21). A high F-measure requires both precision and recall to be high and the gap between them to be small, so it measures the detection performance of the model more comprehensively. In the statistical analysis of the test results, the precision–recall curve (P–R curve) is drawn with recall as the abscissa and precision as the ordinate. The P–R curve is commonly used to measure the detection performance of models, and it generally shows that precision is negatively correlated with recall: recall is low when precision is high, and precision is low when recall is high. IOU [43] reflects the degree of overlap between the box predicted by the model and the real box of the object. IOU is calculated as IOU = area(B_p ∩ B_gt) / area(B_p ∪ B_gt), where B_p represents the detection box, B_gt represents the calibration box of the detection target, area(B_p ∩ B_gt) represents the overlapping area of the two boxes, and area(B_p ∪ B_gt) represents the total area covered by the two boxes combined. The greater the overlap, the higher the IOU. In the model training process, an IOU threshold must be set to measure the detection accuracy of the model. Experimental results in [36] show that an IOU threshold of 0.5 is appropriate for the bridge crack detection task.
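The IOU computation and the 0.5 threshold can be sketched as follows. This is an illustrative sketch that assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max); the function name `iou` and the two sample boxes are hypothetical and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (it may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


IOU_THRESHOLD = 0.5  # threshold reported as appropriate for this task

detected = (10, 10, 50, 50)      # hypothetical detection box
ground_truth = (20, 20, 60, 60)  # hypothetical calibration box
score = iou(detected, ground_truth)
is_match = score >= IOU_THRESHOLD
print(f"IOU={score:.3f}, counted as a correct detection: {is_match}")
```

A detection whose IOU with a calibration box reaches the threshold would be counted as a true positive; otherwise it contributes a false positive, which is how IOU feeds the precision and recall values described above.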

4.3 Network training

The learning rate is a key parameter in network training. An unreasonable learning rate can cause exploding or vanishing gradients and prevent the network from completing training, whereas a reasonable learning rate promotes convergence. The relationship between the loss function value and the number of epochs at different learning rates is shown in Fig 8. When the learning rate was 0.0001, the loss curve declined slowly, and a long time was required to reach convergence. When the learning rate was 0.001, the loss curve decreased rapidly and converged within a short time. When the learning rate was 0.1, the loss curve decreased rapidly in the early stages and gradually in the later stages. When the learning rate was 1.0, a gradient explosion occurred during training, and the network could not complete the training. Therefore, the learning rate was set to 0.001. Furthermore, this paper compares three commonly used gradient descent methods: batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent (MBGD). BGD pursues accuracy at the expense of speed, and its slow convergence cannot meet the timeliness requirements of detection. SGD, in contrast, reduces the number of samples used in each iteration to speed up each parameter update, but it is then difficult to ensure detection accuracy. Considering the small scale of the bridge crack samples and the large size of the data set, we use MBGD with a batch size of 32 and a weight decay of 0.0001 to balance speed and accuracy.
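To make the MBGD setting concrete, the sketch below runs mini-batch gradient descent on a toy linear model with the reported hyperparameters (learning rate 0.001, batch size 32, weight decay 0.0001, and the momentum of 0.9 listed in Table 3). The paper does not state the exact update rule, so the convention used here (weight decay added to the gradient of the weight, velocity accumulated with momentum) is an assumption, and the toy data and the name `mbgd_fit` are hypothetical.

```python
import random


def mbgd_fit(xs, ys, lr=0.001, momentum=0.9, weight_decay=0.0001,
             batch_size=32, epochs=220, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    vel_w, vel_b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new mini-batch partition every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mean gradient of (w * x + b - y)^2 over the mini-batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            grad_w += weight_decay * w  # assumed: L2 penalty on the weight only
            # Momentum update (assumed convention: v = m * v + g, p = p - lr * v).
            vel_w = momentum * vel_w + grad_w
            vel_b = momentum * vel_b + grad_b
            w -= lr * vel_w
            b -= lr * vel_b
    return w, b


# Toy, noise-free data generated from y = 3x + 1 (not from the paper).
xs = [i / 100 for i in range(256)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = mbgd_fit(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

With a batch size equal to the data set size this loop reduces to BGD, and with a batch size of 1 it reduces to SGD, which is the trade-off the paragraph above describes.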

Fig 8

Curves of training loss over epochs at different learning rates.

An epoch is one pass of all training samples through the network, including forward computation and backpropagation. As the number of epochs increases, the number of weight updates increases, and the performance of the network changes. A reasonable number of epochs is the key to training the network to its best state. Fig 9 shows the results at different epochs. When the number of epochs is small (100 epochs), the network is underfitted and the detection effect is poor, with 6 crack parts missing from the results. As training increases to 160 epochs, the detection accuracy of the network gradually improves, and the number of missed parts is reduced to 4. At about 220 epochs, the detection performance of the network is the best, although some detailed features are still missing because of the crack scale. At 280 epochs, false detections begin to appear in crack-free areas, which shows that the network tends to overfit. At 340 epochs, the network is overfitted and the number of falsely detected regions increases to 7, although the falsely detected area itself does not expand. Therefore, we set the number of epochs to 220, which also keeps the convergence of the training loss consistent across the different learning rates.

Fig 9

Comparison of output results at different epochs.

The green boxes locate the missing detection parts of the detection results and the red boxes locate the false detection parts of the detection results.

In addition, other specific training parameters were set during the training process, as listed in Table 3.

Table 3

Network parameter setting.

Type	Parameter	Value
training	Initial learning rate	0.001
	Momentum	0.9
	Weight decay	0.0001
testing	Initial learning rate	0.001
	Momentum	0.9
	Weight decay	0.0001

4.4 Effectiveness of network structure

To evaluate the effectiveness of the components of the proposed ISSD, ablation experiments were performed on the WCD dataset. SSD(VGG-16) was used for comparison with the same parameter settings, and the experimental results are presented in Table 4.

Table 4

Validation results of components in the ISSD.

Network	Accuracy	Precision	Recall	F-measure	FPS
SSD	0.7835	0.7795	0.7846	0.7820	53
SSD+DSDCM	0.8321	0.8249	0.8297	0.8273	52
SSD+IM	0.8153	0.8048	0.8131	0.8089	60
SSD+FRM	0.8415	0.8357	0.8386	0.8371	54
ISSD	0.9153	0.9053	0.9116	0.9084	75

Several conclusions can be drawn from the results. First, the proposed ISSD achieves superior performance compared with the other networks (Precision: 0.9053, Recall: 0.9116, F-measure: 0.9084, FPS: 75). Second, each component of the proposed ISSD improves the detection accuracy or the speed of the network. Comparing the results of SSD with and without the proposed components shows that the DSDCM improved the F-measure by about 4.5 percentage points (0.7820 to 0.8273), the IM raised the speed from 53 to 60 FPS (about 13%), and the FRM increased the F-measure by about 5.5 percentage points (0.7820 to 0.8371). Third, different components optimize the network to different degrees. In particular, the accuracy of SSD + FRM is about 1 percentage point higher than that of SSD + DSDCM (0.8415 versus 0.8321), which shows that FRM reduces the impact of negative samples and is conducive to improving network accuracy. Comparing the results of SSD + DSDCM and SSD + IM, the detection speed of the latter is 60 FPS, which is higher than the 52 FPS of the former, because IM improves the parallel computing ability of the network and reduces the number of parameters. Based on a comprehensive analysis of these results, the ISSD designed in this paper substantially improves both detection accuracy and detection speed compared with the original network: accuracy rises by about 13 percentage points (0.7835 to 0.9153), and speed rises from 53 to 75 FPS (about 42%). This shows that the SSD algorithm has sufficient room for improvement in solving practical engineering problems, and it promotes the application of SSD-based detection networks in the field of bridge crack detection.
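The gains discussed above can be recomputed directly from Table 4. The sketch below assumes that the F-measure is the unweighted harmonic mean of precision and recall, F = 2PR / (P + R); the paper describes a weighted harmonic mean without restating the weights here, so this form is an assumption, although it agrees with the F-measure column of Table 4 to four decimal places. The names `TABLE4`, `f_measure`, and `gains_over_baseline` are hypothetical.

```python
# Rows copied from Table 4: (accuracy, precision, recall, F-measure, FPS).
TABLE4 = {
    "SSD":       (0.7835, 0.7795, 0.7846, 0.7820, 53),
    "SSD+DSDCM": (0.8321, 0.8249, 0.8297, 0.8273, 52),
    "SSD+IM":    (0.8153, 0.8048, 0.8131, 0.8089, 60),
    "SSD+FRM":   (0.8415, 0.8357, 0.8386, 0.8371, 54),
    "ISSD":      (0.9153, 0.9053, 0.9116, 0.9084, 75),
}


def f_measure(precision, recall):
    """Unweighted harmonic mean of precision and recall (assumed form of Eq 21)."""
    return 2 * precision * recall / (precision + recall)


def gains_over_baseline(name, baseline="SSD"):
    """Return (F-measure gain in percentage points, relative FPS change in %)."""
    _, _, _, f_base, fps_base = TABLE4[baseline]
    _, _, _, f_new, fps_new = TABLE4[name]
    return (f_new - f_base) * 100, (fps_new - fps_base) / fps_base * 100


for name, (_, p, r, f, _) in TABLE4.items():
    f_gain, fps_gain = gains_over_baseline(name)
    print(f"{name:10s} F(recomputed)={f_measure(p, r):.4f} F(table)={f:.4f} "
          f"dF={f_gain:+.2f} pts dFPS={fps_gain:+.1f}%")
```

Reporting accuracy-type gains in percentage points and speed gains as a relative change keeps the two kinds of improvement from being confused.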

4.5 Comparison with state-of-the-art bridge crack detection networks

4.5.1 Overall performance analysis

To show that ISSD is competitive with other bridge crack detection networks, the proposed network was compared with FCN, SSD, U-Net, CrackDFANet, LDCC-Net, FPHBN, and ABCNet. All the networks were trained and tested on the same hardware platform using the same dataset. Table 5 shows the quantitative experimental results. On the whole, the F-measure of every network is close to or above 0.8, which indicates that all networks have a certain detection capability. Specifically, the F-measure of FCN is slightly below 0.8 (0.796), those of FPHBN and ABCNet are 0.850 and 0.863, and those of the remaining compared networks are about 0.88 to 0.89. The F-measure of ISSD is the highest, reaching 0.912, which shows that ISSD is good at capturing local details; these details are often rich in texture features and are very important in bridge crack detection.

Table 5

The results of different networks on the WCD dataset.

Network	Precision	Recall	F-measure
FCN	0.801	0.791	0.796
U-Net	0.887	0.873	0.880
CrackDFANet	0.895	0.881	0.888
LDCC-Net	0.896	0.883	0.889
FPHBN	0.851	0.849	0.850
ABCNet	0.869	0.857	0.863
ISSD	0.901	0.917	0.912

Further, we compared the computational efficiency and computational complexity of all networks, as shown in Table 6. In the experiment, all networks ran for the same number of epochs on the same hardware platform with the same experimental settings. The results show that ISSD and LDCC-Net require the fewest floating-point operations (1.28G and 1.29G), slightly fewer than CrackDFANet and ABCNet and far fewer than FCN, U-Net, and FPHBN. In addition, the inference speeds of ISSD and LDCC-Net are outstanding, both exceeding 73 FPS. Although LDCC-Net is close to ISSD in computational efficiency and complexity, Table 5 shows that ISSD achieves the best detection performance while also having the highest computational efficiency and the lowest computational complexity.

Table 6

The computational efficiency and computational complexity of different networks.

Network	Epochs	FLOPs	FPS
FCN	220	15.4G	28
U-Net	220	5.21G	31
CrackDFANet	220	1.32G	73
LDCC-Net	220	1.29G	75
FPHBN	220	2.73G	58
ABCNet	220	1.36G	62
ISSD	220	1.28G	77
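The comparison across Tables 5 and 6 can be reproduced with a short script. The sketch below copies the F-measure, FLOPs, and FPS values from the two tables and selects the best network under each criterion; the dictionary name `RESULTS` and the helper `best_by` are hypothetical.

```python
# Values copied from Tables 5 and 6: (F-measure, FLOPs in G, FPS).
RESULTS = {
    "FCN":         (0.796, 15.4, 28),
    "U-Net":       (0.880, 5.21, 31),
    "CrackDFANet": (0.888, 1.32, 73),
    "LDCC-Net":    (0.889, 1.29, 75),
    "FPHBN":       (0.850, 2.73, 58),
    "ABCNet":      (0.863, 1.36, 62),
    "ISSD":        (0.912, 1.28, 77),
}


def best_by(column, higher_is_better=True):
    """Return the network name that is best in the given column index."""
    pick = max if higher_is_better else min
    return pick(RESULTS, key=lambda name: RESULTS[name][column])


best_f_measure = best_by(0)                       # highest F-measure
lowest_flops = best_by(1, higher_is_better=False)  # fewest floating-point operations
fastest = best_by(2)                              # highest FPS
print(best_f_measure, lowest_flops, fastest)
```

Selecting each criterion separately makes it easy to see whether a single network leads on accuracy, complexity, and speed at the same time.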

To analyze and compare the detection performance of the networks intuitively, we selected four kinds of crack samples for in-depth visual feature analysis. Fig 10 shows the detection results of the networks: a single-shape sample (row 1), a composite-shape sample (row 2), a regional-interference sample (row 3), and a strong-interference sample (row 4). The single-shape results show that all networks can describe the crack shape and its area to varying degrees, apart from the interference of environmental noise. Specifically, the results of FCN, FPHBN, and ABCNet are disturbed by the environment to varying degrees, whereas LDCC-Net and ISSD extract crack texture features more finely, which is conducive to detecting cracks. The results for the composite-shape sample show that FCN lags behind the other networks in expressing crack details. This is due to the fully convolutional structure of FCN: detailed information is diluted by the successive pooling operations, which weakens the expression of local features. The robust feature extraction ability and the negative-sample screening ability of the ISSD designed in this paper support its accurate expression of the crack shape of the composite structure. The backgrounds of the other two types of samples are complex and contain different degrees of interference. Compared with the first two samples, the performance of FCN is weakened the most, and the other networks are also significantly suppressed. U-Net, CrackDFANet, and LDCC-Net cannot even depict the full shape of the cracks. These results show that, for networks based on visual feature detection, interference factors in the environment significantly limit network performance. The ISSD designed in this paper achieves only relatively stable detection results and cannot eliminate this limitation.

Fig 10

Visualization of detection results of compared networks on the WCD dataset.

4.5.2 Performance in real images

To further compare the anti-interference capability of the networks, we selected four kinds of crack samples for in-depth visual feature analysis. Fig 11 shows the detection results of the networks: a transverse crack sample under substantial interference (row 1), a transverse crack sample under shadow interference (row 2), a cross-crack sample under large-area interference (row 3), and a mesh crack sample under substantial interference (row 4). The first sample (row 1) contains substantial interference, which poses a challenge to crack detection. Although ISSD improves noise suppression to a certain extent, the result is still not ideal. For the input image with shadow interference (row 2), the detection results of all networks are disturbed to varying degrees, and those of FCN and FPHBN are seriously disturbed. The remaining networks alleviate the effect of shadow on the detection results to a certain extent, but ISSD extracts the crack texture features more fully, which shows that ISSD maintains a robust feature extraction ability under certain interference conditions. For the input image with large-area interference (row 3), the prominent visual characteristics of the background are difficult for the optimization strategies of the networks to handle, and the detection results of all networks are disturbed. Unlike the first three samples, the crack shape in the fourth sample (row 4) is more complex, which poses a new challenge to the networks. Because of background interference, the ability of the networks to extract crack features is reduced, the networks cannot accurately predict the shape of the crack, and the detection effect is not ideal. Across these four complex samples, the network proposed in this paper achieves relatively stable results in enhancing the target and reducing interference. However, in engineering practice, achieving anti-interference and anti-noise capability remains a severe challenge.

Fig 11

The visualization of detection results of compared networks in real images.

5. Conclusion

This study proposes a bridge crack detection model, ISSD, which closely combines DSDCM, IM, and FRM. Specifically, DSDCM improves the crack feature extraction ability of the model, IM improves the inference speed of the network, and FRM alleviates the interference of irrelevant channel features. A series of experiments shows that, compared with several existing crack detection networks, ISSD performs better, reaching an F-measure of 0.912 at 77 FPS. Although the proposed ISSD obtains better performance than the other methods, the complexity of the network structure and its computing power requirements remain significant challenges for current portable bridge crack detection terminals. In addition, the anti-interference ability of the network is not yet sufficient to overcome the various kinds of environmental noise encountered in engineering applications. We will focus on these issues in future research.

26 Apr 2022

PONE-D-22-07252

Research on Bridge Crack Detection Based on Improved Single Shot Multi-Box Detector

PLOS ONE

Dear Dr. He,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 10 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Ardashir Mohammadzadeh, Phd
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

4. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability. Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized. Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access. We will update your Data Availability statement to reflect the information you provide in your cover letter.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Partly

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: Yes
Reviewer #2: N/A

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: No
Reviewer #2: No

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous.
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: No
Reviewer #2: No

5. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: “An improved single-shot multi-box detector (SSD) called ISSD is proposed, which seamlessly combines the depth separable deformation convolution module (DSDCM), inception module (IM), and feature recalibration module (FRM) in a tightly coupled manner to tackle the challenges of bridge crack detection”. I have the following points which need to be properly answered.

Comment #1: The authors do not discuss the figures presented in the experimental results section. This is a critical issue in this section.
Comment #2: There are lots of typos. English needs to revise again with a professional editing service. Also, the figures are not clear in some cases.
Comment #3: Mention the limitations and future works of the developed system elaborately.
Comment #4: Several techniques have been described in the Introduction section. How do the authors outperform each of these reviewed systems? A clear statement is needed to highlight the contribution (use the table to discuss each method).
Comment #5: Discuss the stability of the system in terms of complexity.
Comment #6: Detailed evaluation of how the proposed algorithm performs and the varying parameters must be provided. You only provided a learning rate.
Comment #7: All the figures must be in 300dpi.
Comment #8: Parametric settings need to discuss. How you select the parameters in your model?
Comment #9: The conclusion is confusing. You need to change it. Should be according to the body of the manuscript. You must write some findings in the conclusion section.
Comment #10: All the figures are not cited properly in the text.

Reviewer #2: Comments:

1. In line 52, what subjective factors are in consideration should be indicated and elaborated.
2. In lines 59, 61, and 64, the complex environmental noise should be elaborated in terms of its nature and in terms of its effect on detection.
3. There are inconsistencies between SSD layer names in lines 95-100 and SSD layer names in Fig 1. In [21], there is a Conv6 layer as also indicated in line 96, however Figure 1 does not have it.
4. In general, Fig 2 is not digestible as a whole. Please give data structure information (i.e., dimensions) about colored boxes. There are color inconsistencies between figure legend and the figure. It is not evident where the composite structure sits in the architecture nor there is a reference about it in the text. Please carefully review the figure to make it easy to comprehend and more connected to the text.
5. References and the text are not consistent:
- [24] is not work of Zhao et al.
- Lu et al.'s work [25] is not about bridge crack detection as asserted by the authors.
- Feng et al.'s work [26] is not about bridge crack detection.
However, one can stress that they are correlated problems. Just align the previous work and your statements.
6. [28] is not about deformable convolutions. Please update all the references in the manuscript and make them consistent to the text. The manuscript is not scrutable with that state.
7. In Section 3.1.1, deformation convolution is illustrated however it does not give any credit to its original inventors: https://arxiv.org/pdf/1703.06211.pdf Please cite properly. ([27] is not that paper.)
8. In Section 3.1.2, in line 160, the reference [29] is not about depth separable convolution. The DOI in [29], doi:10.4271/2014-01-0975, is not a paper titled Xception. It is titled as Fatigue Behavior of Stainless Steel Sheet Specimens at Extremely High Temperatures.
9. WCD dataset does not have any reference, thus not accessible.

NOTE: This manuscript has too many fundamental errors. Please carefully review the paper and then I will be able to read it with a mind in peace. Thank you.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
23 May 2022 Original Manuscript ID: PONE-D-22-07252 Original Article Title: Research on Bridge Crack Detection Based on Improved Single Shot Multi-Box Detector To: PLOS ONE Editor Re: Response to reviewers Dear Editor, Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments. We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with yellow highlighting indicating changes (Supplementary Material for Review), and (c) a clean updated manuscript without highlights (Main Manuscript). Best regards, et al. Reviewer#1, Concern # 1: The authors do not discuss the figures presented in the experimental results section. This is a critical issue in this section.‎ Author response: We gratefully appreciate for your valuable suggestions. We re-combed the writing ideas of the experimental part, focused on the specific experimental results of different networks, and analyzed the deep-seated reasons behind the various experimental results from the level of target detection mechanism. The reviewers put forward valuable opinions, which increased the article's expression logic and helped improve the overall quality of the article. Author action: We revised this part of the article and highlighted it. (Line433-444 and Line452-506).________________________________________ Reviewer#1, Concern # 2: There are lots of typos. English needs to revise again with a professional editing service. Also, the figures are not clear in some cases. Author response: Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript. In view of these problems, we have updated the illustrations in the article. Unfortunately, our article has been polished by a commercial translation agency before submission, and we have further optimized the language part of the article. 
Although we are not very confident, we still hope it can meet your requirements. The English editing service certificate is as follows: . Author action: We improved the resolution of the pictures in the article to 300dpi. The modified pictures are as follows: FIG 1. Structure of SSD. FIG 2. Structure of ISSD. FIG 3. Three typical cases of convolution kernel shifting of conventional convolution. Group A is the conventional distribution, Group B is the distribution after arbitrary migration, Group C is the distribution after scaling transformation and Group D is the distribution after rotational transformation. FIG 4. Schematic diagram of depth separable convolution. FIG 5. Schematic diagram of the inception structure. Group A represents the structure of the original inception and Group B represents the structure of the improved inception. FIG 6. Structure of the Feature recalibration module. FIG 7. Curves of training loss with different learning epochs. FIG 8. Comparison of output results at different epochs. The green boxes locate the missing detection parts of the detection results and the red boxes locate the false detection parts of the detection results. FIG 9. Visualization of detection results of compared networks on the WCD dataset. From left to right: original images, GT, FCN, U-Net, CrackDAFNet, ISSD. FIG 10. The visualization of detection results of compared networks on special cases. From left to right: original images, GT, FCN, U-Net, CrackDFANet, ISSD. FIG 11. Precision-recall curves of composed networks on WCD dataset. ________________________________________ Reviewer#1, Concern # 3: Mention the limitations and future works of the developed system elaborately. Author response: We gratefully appreciate for your valuable suggestions. In the conclusion of the article, we summarize the advantages and disadvantages of the network proposed in this paper, and formulate the direction of the next work. 
Author action: We updated the conclusion of the article to make the structure of the full text more complete. (Line 518-526).________________________________________ Reviewer#1, Concern # 4: Several techniques have been described in the Introduction section. How do the authors outperform each of these reviewed systems? A clear statement is needed to highlight the contribution (use the table to discuss each method). Author response: Special thanks to you for your good comments. This paper designs a bridge crack detection network based on the deep learning theory. The work of this paper is only a part of the bridge crack detection project. The advantage of the detection network based on computer vision is its learnability. The network can learn the relevant knowledge of crack detection from the data samples and can detect the bridge crack accurately and quickly by relying on the mighty computing power of the convolution neural network. Compared with other traditional detection technologies, crack detection methods mainly rely on artificial vision inspection or instrument signal characteristic analysis. The detection accuracy and speed lag behind the detection network designed in this paper. However, bridge crack detection is a complex engineering project. The whole project includes many hardware, software, and corresponding manual operations. The design of bridge crack in this paper is a systematic engineering project, so it is impossible to compare with other detection technologies from the overall perspective of bridge crack detection engineering. Author action: We revised the description of highlighting the traditional detection methods and the advantages and disadvantages of the network designed in this paper. (Line39-41, Line51-54 and Line 62-82).________________________________________ Reviewer#1, Concern # 5: Discuss the stability of the system in terms of complexity. Author response: Thanks again to the reviewer, we are also aware of this problem. 
Network parameter quantity is an index for evaluating the complexity of a network system based on deep learning. We added this part to the article and combined it with the network's accuracy and inference speed to analyze the network's stability from the perspective of complexity. In addition to the content shown in the manuscript itself, we tested the network on different training and test sets and on different computing platforms, and the test results did not change significantly. Therefore, after sufficient training, our network performs stably with low dependence on computing power. Owing to the article's length, we do not think it is necessary to put these additional details in the manuscript.

Author action: We added this part to the manuscript. (Line 452-464 and Table 5).

Reviewer#1, Concern # 6: Detailed evaluation of how the proposed algorithm performs and the varying parameters must be provided. You only provided a learning rate.

Author response: Special thanks to you for your good comments. Network parameters and training mode play a vital role in the final performance of the network. In light of the characteristics of bridge crack detection, we show more detailed information on the network training process in the article and comprehensively analyze the network's performance under different gradient descent modes and at different iteration stages.

Author action: We show more details of the network training process in the article, analyze the speed and performance of network training under different gradient descent modes, and compare the network's performance at different iteration stages on this basis. (Line386-412 and Fig.8).

Reviewer#1, Concern # 7: All the figures must be in 300dpi.

Author response: Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript. In view of these problems, we have updated the figures in the article.
Author action: The fonts in our figures have been replaced, and the resolution of the figures has been increased to 300 ppi.

Reviewer#1, Concern # 8: Parametric settings need to be discussed. How do you select the parameters in your model?

Author response: Special thanks to you for your good comments. This question has been partially answered in our response to Concern # 6.

Author action: We show more details of the network training process in the article, analyze the speed and performance of network training under different gradient descent modes, and compare the network's performance at different iteration stages on this basis. (Line386-412 and Fig.8).

Reviewer#1, Concern # 9: The conclusion is confusing. You need to change it. It should follow the body of the manuscript. You must write some findings in the conclusion section.

Author response: As the reviewer suggested, the conclusion of the article was not clearly expressed. We rewrote the conclusion, focusing on an overview of the work done and the results achieved in this paper. In addition, based on the problems found during the experiments, we analyzed the shortcomings of this work and set out a direction for future work.

Author action: We updated the manuscript by adding the description of this part to the article. (Line 518-526).

Reviewer#1, Concern # 10: All the figures are not cited properly in the text.

Author response: Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript. Before submitting the manuscript, we arranged the article in strict accordance with the requirements of the journal.
We will continue to communicate with the editor to determine the citation format of the figures in the article.

Reviewer#2, Concern # 1: In line 52, what subjective factors are in consideration should be indicated and elaborated.

Author response: Special thanks to you for your good comments. The effectiveness of detection methods based on hand-designed features depends chiefly on the quality of those features. Because the features are usually designed manually, subjective factors are inevitably introduced in the feature design process.

Author action: We updated the manuscript by modifying the description of this part. (Line 51-54)

Reviewer#2, Concern # 2: In lines 59, 61, and 64, the complex environmental noise should be elaborated in terms of its nature and in terms of its effect on detection.

Author response: Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript. In view of these problems, we have updated the illustrations in the article.

Author action: We updated the manuscript by adding the description of this part. (Line62-68)

Reviewer#2, Concern # 3: There are inconsistencies between SSD layer names in lines 95-100 and SSD layer names in Fig 1. In [21], there is a Conv6 layer as also indicated in line 96; however, Figure 1 does not have it.

Author response: Thanks again to the reviewer; we are also aware of this problem. We updated the content of Figure 1 to conform to the description of this part in the article.

Author action: We have updated the content of Figure 1. The modified Fig. 1 is as follows. (Original and updated images not reproduced.)

Reviewer#2, Concern # 4: In general, Fig 2 is not digestible as a whole. Please give data structure information (i.e., dimensions) about colored boxes. There are color inconsistencies between the figure legend and the figure.
It is not evident where the composite structure sits in the architecture, nor is there a reference to it in the text. Please carefully review the figure to make it easy to comprehend and more connected to the text.

Author response: Special thanks to you for your good comments. We have replaced Fig 2. In the revised figure, we added the corresponding parameters of the feature layers, aligned the legend with the colors used in the figure, and marked the position of the composite structure in the network. The composite structure in this paper is our own design, so there is no relevant reference. We explain it at the corresponding position in the article.

Author action: We added a description of Figure 2 in the article (Line129-136). The modified Fig. 2 is as follows. (Original and updated images not reproduced.)

Reviewer#2, Concern # 5: References and the text are not consistent: [24] is not the work of Zhao et al. Lu et al.'s work [25] is not about bridge crack detection as asserted by the authors. Feng et al.'s work [26] is not about bridge crack detection. However, one can stress that they are correlated problems. Just align the previous work and your statements.

Author response: We greatly appreciate your valuable suggestions. Bridge cracks and pavement cracks are both concrete surface cracks. From the perspective of engineering applications, we expect that the network designed in this paper is not limited to bridge crack detection, so we refer to many studies on the detection of similar cracks.

Author action: We have revised the relevant parts of the article to make its statements more precise. (Line 112-121 and Line 579-584)

Reviewer#2, Concern # 6: [28] is not about deformable convolutions. Please update all the references in the manuscript and make them consistent with the text. The manuscript is not scrutable in that state.
Author response: We are very sorry for our negligence in this manuscript. We reviewed the address of reference [28] and revised it in the manuscript. In addition, we reorganized all references and revised 6 of them.

Author action: We reorganized all the references and revised 6 references in this manuscript. (Line566-567, line579-584 and line588-597)

Reviewer#2, Concern # 7: In Section 3.1.1, deformation convolution is illustrated; however, it does not give any credit to its original inventors: https://arxiv.org/pdf/1703.06211.pdf Please cite properly. ([27] is not that paper.)

Author response: We greatly appreciate your valuable suggestions. We reviewed the address of reference [27] and revised it in the manuscript.

Author action: The revised address of reference [27] in the manuscript is highlighted. (Line 588-591)

Reviewer#2, Concern # 8: In Section 3.1.2, in line 160, the reference [29] is not about depth separable convolution. The DOI in [29], doi:10.4271/2014-01-0975, is not a paper titled Xception. It is titled Fatigue Behavior of Stainless-Steel Sheet Specimens at Extremely High Temperatures.

Author response: We greatly appreciate your valuable suggestions. We reviewed the address of reference [29] and revised it in the manuscript.

Author action: The revised address of reference [29] in the manuscript is highlighted. (Line 595-598)

Reviewer#2, Concern # 9: The WCD dataset does not have any reference, thus it is not accessible. NOTE: This manuscript has too many fundamental errors. Please carefully review the paper and then I will be able to read it with a mind in peace.

Author response: Thanks again to the reviewer; we are also aware of this problem. We did not consider this problem while writing the manuscript.
The WCD dataset is a new dataset obtained by merging existing public datasets (SDNET and CCIC), and it was not described in much detail in the first draft. Following the reviewer's reminder, we added this part to the manuscript. At the same time, we checked the manuscript more carefully to ensure that there were no basic errors.

Author action: We added this part to the manuscript and carefully checked the full text of the manuscript. (Line312-317)

Submitted filename: Response to Reviewers.docx

5 Jul 2022

PONE-D-22-07252R1

Bridge Crack Detection Based on Improved Single Shot Multi-Box Detector

PLOS ONE Dear Dr. He, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Aug 19 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

PONE-D-22-07252R2

Bridge Crack Detection Based on Improved Single Shot Multi-Box Detector

PLOS ONE Dear Dr. He, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Oct 17 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Ardashir Mohammadzadeh, Phd Academic Editor PLOS ONE Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. 
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments: Add a direction for readers; add some statements on how the performance can be improved by type-3 fuzzy logic systems.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available.
If there are restrictions on publicly sharing data, e.g. participant privacy or use of data from a third party, those must be specified.

Reviewer #2: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Thank you for addressing most of my comments. I have a concern about one comment I made in the second revision request (Comment 1). I believe I was not clear about it. I originally referred to Szegedy’s Inception-v3 work. There are three models of interest, portrayed in three figures: Figure X, Figure Y, and Figure Z.

Figure X: Figure 2(b) of Going deeper with convolutions (https://arxiv.org/pdf/1409.4842.pdf).
Figure Y: Figure 5 of Rethinking the Inception Architecture for Computer Vision (https://arxiv.org/pdf/1512.00567.pdf).
Figure Z: Figure 6(b) of this manuscript (previously Figure 5(b)).

Figure X is the original Inception module from Szegedy et al.'s work, over which the authors of this manuscript also present an improved version. Figure Y is an improved version of Figure X, also proposed by Szegedy et al. Figure Z is the improved version of Figure X that the authors propose. In lines 222-238, the authors beautifully explained how this proposed network works and why those layers are chosen.
However, I urge the authors to comment on the improvement of the model in Figure Z over the model in Figure Y. In other words, the models in Figure Y and Figure Z should be compared.

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Ahmet Agirman

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

13 Sep 2022

Original Manuscript ID: PONE-D-22-07252R1
Original Article Title: Bridge Crack Detection Based on Improved Single Shot Multi-Box Detector
To: PLOS ONE Editor
Re: Response to reviewers

Dear Editor, Thank you for allowing a resubmission of our manuscript, with an opportunity to address the reviewers’ comments. We are uploading (a) our point-by-point response to the comments (below) (response to reviewers), (b) an updated manuscript with yellow highlighting indicating changes (Supplementary Material for Review), and (c) a clean updated manuscript without highlights (Main Manuscript). Best regards, et al.

Reviewer#1, Concern # 1: Discuss the stability of the system in terms of complexity.

Author response: We greatly appreciate your valuable suggestions. In the experimental part of the manuscript, we added experiments on network computing efficiency and computing power cost, and comprehensively analyzed the relationship between network complexity, computing power consumption and network performance.

Author action: We revised this part of the article and highlighted it. (Line448-466, Table 5 and Table 6).

Reviewer#2, Concern # 1: In lines 234-243, the authors discuss an improved version of the original inception module and illustrate the improved version in Fig5b. Inception-v3 is presented in Szegedy et al.’s work “Rethinking the Inception Architecture for Computer Vision”. The authors should elaborate on the improvement of their Fig5b over Fig5 of Szegedy’s Inception-v3 work. If that network is implied, please reference it. In Fig.5b, the network concatenates onto “previous layer”; this should be checked.

Author response: Special thanks to you for your good comments. The relevant content of the manuscript was not clearly expressed and did not highlight the key points. We have modified the relevant content of the manuscript to make it more rigorous.
Author action: We updated the manuscript by modifying the description of this part. (Line 222-238)

Reviewer#2, Concern # 2: In section 3.1.3, the marriage of depthwise separable convolutions and deformable convolutions is proposed. However, its inner structure and mechanism are not explained or illustrated in a figure other than a box in Fig2. The mechanism of the proposed combination network should be illustrated and explained beyond, or in addition to, lines 212-219.

Author response: Thank you again for your positive comments and valuable suggestions to improve the quality of our manuscript. In view of these problems, we have added a structure diagram of the relevant part.

Author action: We updated the manuscript by adding the structure diagram of this part. FIG 5. Schematic diagram of the depth separable deformable convolution.

Reviewer#2, Concern # 3: Line 285 reads L(x,c,l,g) while eq. (14) does not have such a term.

Author response: Thanks for your careful checks. We are sorry for our carelessness. Based on your comments, we have made corrections so that the notation is consistent throughout the manuscript.

Author action: We have checked the manuscript again to ensure its quality.

Reviewer#2, Concern # 4: In Section 4, the authors compare the performance of their proposed network with non-SSD-based networks such as CrackDFA; however, a quick search turns up another work, by Xu et al., titled “Automatic Bridge Crack Detection Using a Convolutional Neural Network”, which is also a non-SSD-based method but is based on depthwise separable convolutions. The authors are urged to add more literature on the problem of “bridge crack detection” with other NNs, not just SSD, and to add comparisons with such papers as well.

Author response: Special thanks to you for your good comments.
We added more comparison networks to the comparison experiment and carried out the experiment on the same experimental platform with the same experimental settings. This makes the set of networks against which our network is compared more diverse, not limited to SSD-based networks, and makes the manuscript more persuasive.

Author action: We updated the manuscript by modifying the description of this part. (Line 303-306, 448-466 and 471-512, Table 5, Table 6, Fig. 10 and Fig. 11)

Submitted filename: Response to Reviewers.docx

19 Sep 2022

Bridge Crack Detection Based on Improved Single Shot Multi-Box Detector

PONE-D-22-07252R3

Dear Dr. He, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication.
For more information, please contact onepress@plos.org. Kind regards, Ardashir Mohammadzadeh, Phd Academic Editor PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

23 Sep 2022

PONE-D-22-07252R3

Bridge Crack Detection Based on Improved Single Shot Multi-Box Detector

Dear Dr. He: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Ardashir Mohammadzadeh Academic Editor PLOS ONE

5 in total

1. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.

Authors: Shaoqing Ren; Kaiming He; Ross Girshick; Jian Sun
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2016-06-06 Impact factor: 6.226

2. Automatic Pixel-Level Pavement Crack Recognition Using a Deep Feature Aggregation Segmentation Network with a scSE Attention Mechanism Module.

Authors: Wenting Qiao; Qiangwei Liu; Xiaoguang Wu; Biao Ma; Gang Li
Journal: Sensors (Basel) Date: 2021-04-21 Impact factor: 3.576

3. SDNET2018: An annotated image dataset for non-contact concrete crack detection using deep convolutional neural networks.

Authors: Sattar Dorafshan; Robert J Thomas; Marc Maguire
Journal: Data Brief Date: 2018-11-06

4. Automatic Tunnel Crack Detection Based on U-net and a Convolutional Neural Network with Alternately Updated Clique.

Authors: Gang Li; Biao Ma; Shuanhai He; Xueli Ren; Qiangwei Liu
Journal: Sensors (Basel) Date: 2020-01-28 Impact factor: 3.576

5. Clustering of Covid-19 morbidity cases in Germany.

Authors: D A Petrusevich
Journal: IOP Conf Ser Mater Sci Eng Date: 2020-05
