
Deep Convolutional Neural Network with Deconvolution and a Deep Autoencoder for Fault Detection and Diagnosis.

Yasuhiro Kanno, Hiromasa Kaneko

Abstract

In chemical plants and other industrial facilities, the rapid and accurate detection of the root causes of process faults is essential for the prevention of unforeseen accidents. This study focused on deep learning while considering the different phenomena that can occur in industrial facilities. A deep convolutional neural network with deconvolution and a deep autoencoder (DDD) is proposed. DDD assesses the process dynamics and the nonlinearity between process variables. During the operation of DDD, fault detection is carried out using the reconstruction error between the data reconstructed through the model and the input data. After a process fault is detected, the magnitude of the contribution of each process variable to the detected fault is calculated by applying gradient-weighted class activation mapping to the established network. The effectiveness of DDD in fault detection and diagnosis was verified through experiments on the Tennessee Eastman process dataset, demonstrating improved performance compared to conventional fault detection and diagnosis methods.
© 2022 The Authors. Published by American Chemical Society.


Year:  2022        PMID: 35071933      PMCID: PMC8772318          DOI: 10.1021/acsomega.1c06607

Source DB:  PubMed          Journal:  ACS Omega        ISSN: 2470-1343


Introduction

In chemical plants, accidents and mechanical failures result in significant economic losses. In such environments, process variables, such as temperature, flow rates, and pressure, are measured constantly to monitor and control chemical plant operations while ensuring the safety of both the equipment and the personnel.[1] Chemical plant processes can also be monitored by applying techniques based on statistical process control (SPC), which rely on the measured process variable data. The number of process variables required to account for all the physical and chemical phenomena increases as processes become more complex. Therefore, monitoring multiple process variables collectively using multivariate SPC (MSPC) is considered a highly efficient approach, and various MSPC-based methodologies[2] have been proposed. Examples of statistical methods that rely on MSPC include principal component analysis (PCA),[3] independent component analysis,[4] partial least squares analysis,[5] artificial neural networks,[6] and support vector machines.[7] In addition, deep neural networks have been receiving increased research attention because they can express complex relationships between process variables. Using neural networks with multiple layers, the features contained in the data can be learned more deeply in a stepwise manner. Deep learning-based approaches, such as the deep autoencoder (DAE)[8] and convolutional neural networks (CNNs),[9] are used to detect and classify faults in chemical processes[10] and motor bearings.[11] Such approaches are also employed in various other fields and applications, such as the diagnosis of malfunctions, including bearing failures[12] and turbine failures.[13] However, the interpretation of trained deep neural networks can be difficult.
This interpretability is nevertheless necessary for identifying the root causes of process faults in chemical processes once such faults are detected. Therefore, in the field of artificial intelligence, several methods for clarifying the basis of estimates made by deep neural networks have been proposed. For example, CNNs are currently considered effective approaches in the field of image processing, and a visualization method known as gradient-weighted class activation mapping (Grad-CAM)[14] has been developed to clarify the basis for their judgments. Grad-CAM calculates the part of a neural network that contributes the most to a specific output classification using the gradient of the convolutional layer and the probability score. This method has been used in the classification of pig models[15] and of MRI brain images to establish a basis[16] for the classification of Alzheimer's disease. However, although such methods can be applied in supervised learning applications, such as image classification, they cannot be applied in unsupervised learning applications, such as the detection of process faults in chemical plants, where both detection and diagnosis of faults are required. The aim of this study was to develop a method for detecting and diagnosing process faults using a deep neural network. A method that combines a CNN with a DAE is proposed to consider both the nonlinearity between process variables and the process dynamics. Because CNNs can consider pixel intensity as well as the spatial relationships between pixels, they can be used to extract the temporal characteristics of each process variable. Subsequent dimensionality reduction is performed using a DAE, which extracts latent variables that account for the nonlinearity between the variables.
Process faults are detected and diagnosed using the data transformed through the model (similar to T² in PCA-based MSPC) and the reconstruction error of the input data (similar to Q in PCA-based MSPC). This methodology is referred to as a deep convolutional neural network with deconvolution and a deep autoencoder (DDD). By applying Grad-CAM to the constructed neural network, process faults can be diagnosed by visualizing the input variables with high weights for each latent variable. To verify the effectiveness of DDD, its fault detection performance was compared with that of existing methods, i.e., the DAE and the CNN, on the Tennessee Eastman process (TEP) dataset. Furthermore, the process variables related to a process fault were diagnosed by applying Grad-CAM to DDD.

Methods

The proposed method, DDD, combines a DAE with a CNN, and fault diagnosis using DDD is based on the incorporation of Grad-CAM. First, DAE, CNN, and Grad-CAM are explained, and then DDD and fault diagnosis with DDD are discussed.

Deep Autoencoder

A basic autoencoder (AE) is a neural network comprising three layers, namely, an input layer, a single hidden layer, and an output layer. A network model comprising multiple hidden layers is referred to as a DAE. Figure 1 shows a schematic diagram of an AE. Given the input data X ∈ ℝ^(n×u), where n represents the number of samples and u represents the number of process variables, each input sample x_i ∈ ℝ^u, i = 1, 2, ..., n, is encoded into the hidden-layer neurons h_i ∈ ℝ^v, where v represents the number of neurons in the hidden layer, using the following formula:

h_i = f(W_1 x_i + b_1)

where W_1 ∈ ℝ^(v×u) and b_1 ∈ ℝ^v represent the weight and bias in the encoding process, respectively. Subsequently, the input sample is decoded from the neurons h_i into the reconstructed sample x̂_i ∈ ℝ^u using the following formula:

x̂_i = f(W_2 h_i + b_2)

where W_2 ∈ ℝ^(u×v) and b_2 ∈ ℝ^u represent the weight and bias in the decoding process, respectively. The reconstructed data X̂ ∈ ℝ^(n×u) of X are thus obtained from the AE. f represents an activation function that extracts the input features; the sigmoid, tanh, and rectified linear unit (ReLU) functions are commonly used. The AE is trained so that the reconstruction error between X and X̂ diminishes, and θ = {W_1, b_1, W_2, b_2} is updated using the backpropagation method, as shown in the following equation:

θ* = argmin_θ Σ_{i=1}^{n} ‖x_i − x̂_i‖²
Figure 1

Basic concept of an autoencoder.

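The encode-decode mapping above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the layer sizes, the random weight initialization, and the use of the sigmoid activation are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: u input variables, v hidden neurons (not the paper's values).
u, v = 22, 8
W1, b1 = rng.normal(size=(v, u)) * 0.1, np.zeros(v)   # encoder parameters
W2, b2 = rng.normal(size=(u, v)) * 0.1, np.zeros(u)   # decoder parameters

def encode(x):
    # h = f(W1 x + b1)
    return sigmoid(W1 @ x + b1)

def decode(h):
    # x_hat = f(W2 h + b2)
    return sigmoid(W2 @ h + b2)

x = rng.random(u)                 # one input sample
x_hat = decode(encode(x))         # reconstruction
err = np.sum((x - x_hat) ** 2)    # squared reconstruction error used for training
```

Training would then adjust W1, b1, W2, b2 by backpropagation so that this squared reconstruction error shrinks over the training samples.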

Convolutional Neural Network

A CNN is a neural network that comprises an input layer, a convolutional layer, a pooling layer, a deconvolution layer, and an output layer. The convolutional layer applies a defined number of filters to obtain a feature map of the input image, and the pooling layer reduces the number of input features. The convolutional and pooling layers are alternately repeated several times to extract the final number of features. The image reconstructed through the deconvolution layer is then the output.

Convolutional Layer

The output of the convolutional layer comprises feature maps in which each unit is connected to a local patch of the input feature map via a weighted filter. All the units in an output feature map share the same filter, and within a layer, different feature maps use different filters. The convolutional layer can be used to facilitate the detection or recognition of patterns present in the process data. Assuming that there are M input feature maps x_i^l in layer l and N filters, the jth output feature map x_j^(l+1) in layer l + 1 is calculated as follows:

x_j^(l+1) = f( Σ_{i=1}^{M} k_{ij} ∗ x_i^l + b_j )

where k_{ij} represents the kernel of the jth filter connected to the ith input feature map, x_i^l represents the ith input feature map, x_j^(l+1) represents the jth output feature map, b_j represents the bias corresponding to the jth filter, f represents the activation function, and the asterisk symbol (∗) represents the convolution operation. Common activation functions for neural networks include the sigmoid, tanh, and ReLU functions. Assuming a kernel size of s × s, the total number of parameters in the convolutional layer is

N × (s × s × M + 1)

The output feature map obtained from the convolutional layer is transferred to the pooling layer.
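The convolution equation above can be sketched in NumPy for the 1-D case (one time series per input feature map), which matches how DDD treats each process variable's time window. The function name and shapes are illustrative assumptions; note that np.convolve flips its kernel, so the kernel is reversed to obtain the cross-correlation usually computed by CNN layers.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv1d_layer(maps, kernels, biases):
    """Valid 1-D convolution.
    maps:    (M, L) input feature maps
    kernels: (N, M, s) one length-s kernel per (filter, input map) pair
    biases:  (N,) one bias per filter
    Returns (N, L - s + 1) output feature maps."""
    M, L = maps.shape
    N, _, s = kernels.shape
    out = np.empty((N, L - s + 1))
    for j in range(N):                        # one output map per filter
        acc = np.zeros(L - s + 1)
        for i in range(M):                    # sum contributions over input maps
            # reverse the kernel so np.convolve computes cross-correlation
            acc += np.convolve(maps[i], kernels[j, i][::-1], mode="valid")
        out[j] = relu(acc + biases[j])        # add bias, apply activation f
    return out
```

For length-s (1-D) kernels the parameter count is N × (s × M + 1), the 1-D analogue of the s × s expression above.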

Pooling Layer

The pooling layer follows the convolutional layer and downsamples the input feature map. The purpose of the pooling layer is to compress information and transform the input data into a more manageable form. The use of pooling layers has two advantages. First, because the relative positions of the features forming the local pattern may differ slightly, detecting features with similar local positions offers enhanced reliability. Second, the dimensionality of the feature representation can be reduced without additional parameters, thereby significantly reducing the computation time and the parameters of the entire network. There are two main modes of pooling: maximum pooling and average pooling. The maximum pooling mode calculates the maximum value among the units in each pooling region of the feature map, and the average pooling mode calculates the average value of the units. In the pooling layer, when M feature maps of layer l are the input, M feature maps are the output, as shown in the following formula:

x_j^(l+1) = f( β_j · down(x_j^l) + b_j )

where x_j^l represents the jth input feature map, x_j^(l+1) represents the jth output feature map, β_j and b_j represent the multiplicative and additive biases corresponding to the jth map, respectively, f represents the activation function, and down(·) represents the subsampling function.
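The down(·) subsampling step can be sketched as non-overlapping 1-D pooling; the function name and the window size are illustrative assumptions, and the multiplicative/additive biases and activation are omitted for brevity.

```python
import numpy as np

def pool1d(x, size=2, mode="max"):
    """Non-overlapping pooling of a 1-D feature map.
    len(x) must be divisible by `size`; mode is "max" or "mean"."""
    blocks = x.reshape(-1, size)              # split into pooling windows
    return blocks.max(axis=1) if mode == "max" else blocks.mean(axis=1)
```

For example, max pooling [1, 3, 2, 0] with window 2 keeps [3, 2], while average pooling yields [2, 1].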

Deconvolution Layer

The deconvolution layer recovers the number of features extracted through the convolution and pooling layers to the resolution of the original feature. It is advantageous in that it can be used to perform upsampling at the same time as the training process, and it can be used to minimize the loss resulting from feature resampling.[17]
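A transposed (de-)convolution of a 1-D feature map can be sketched as follows; the function, the stride, and the kernel values are illustrative assumptions, not the paper's configuration. Each input unit scatters its value into the output through the kernel, which upsamples while weighting with learnable parameters.

```python
import numpy as np

def deconv1d(x, kernel, stride=2):
    """Transposed 1-D convolution: upsamples x by `stride` using a length-s
    kernel. Output length: (len(x) - 1) * stride + s."""
    s = len(kernel)
    out = np.zeros((len(x) - 1) * stride + s)
    for i, v in enumerate(x):
        # scatter each input value into the output, weighted by the kernel
        out[i * stride : i * stride + s] += v * kernel
    return out
```

In training, the kernel is learned together with the rest of the network, which is why this layer can minimize the loss resulting from feature resampling.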

Training

After obtaining the reconstructed X̂ from the output layer, learning is performed so that the reconstruction error between X and X̂ diminishes, while the network parameters θ are refreshed through the backpropagation method, as shown in the following equation:

θ* = argmin_θ Σ_{i=1}^{n} ‖x_i − x̂_i‖²

Gradient-Weighted Class Activation Mapping

Grad-CAM is a method that enables the visual explanation of CNN-based prediction results using the gradient information that flows into the last convolutional layer of a CNN. The CNNs used in image analysis comprise a feature extraction part that stacks convolutional and pooling layers over multiple layers and an identification part that receives the extracted features and matches them with a class label to perform supervised learning. The identification part typically comprises a fully connected multilayer neural network, and the final layer converts the features into a probability score for each class. Grad-CAM identifies the image locations with a significant effect on the probability score of each class by averaging the changes (derivative coefficients) that occur in the probability score when a small change is applied to an image location. First, the gradient ∂y^c/∂A_{ij}^k of the probability score y^c of class c with respect to the intensity A_{ij}^k at the (i, j) position of the kth convolutional feature map is calculated. By averaging these gradients over all positions, the weighting factor α_k^c for the kth feature map of class c is computed. A larger α_k^c value indicates the increased importance of the feature map A^k for class c:

α_k^c = (1/Z) Σ_i Σ_j ∂y^c / ∂A_{ij}^k

where Z indicates the number of positions (pixels) in the feature map. A heat map of the same size as the convolutional feature map is generated by calculating the weighted sum of the k feature maps using the computed α_k^c values, after which the output of the ReLU function is obtained:

L^c_Grad-CAM = ReLU( Σ_k α_k^c A^k )

Overlaying onto the input data is possible by resizing L^c_Grad-CAM.
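Given the activations of the last convolutional layer and the gradients of the class score with respect to them, the two Grad-CAM equations above reduce to a few array operations. This sketch assumes the gradients have already been obtained from some autodiff framework; the function name is illustrative.

```python
import numpy as np

def grad_cam(feature_maps, grads):
    """feature_maps: (K, H, W) activations A^k of the last convolutional layer.
    grads: (K, H, W) gradients dy^c/dA^k of the class-c score.
    Returns the (H, W) heat map L^c = ReLU(sum_k alpha_k^c A^k)."""
    # alpha_k^c: global average of the gradients over all H*W positions (1/Z sum)
    alphas = grads.mean(axis=(1, 2))
    # weighted sum of the feature maps over the filter axis
    heat = np.tensordot(alphas, feature_maps, axes=1)
    return np.maximum(heat, 0.0)          # ReLU keeps only positive evidence
```

The resulting heat map is then resized to the input resolution and overlaid on the input to visualize which regions drove the class score.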

Proposed DDD

A CNN can be used to assess the correlation between adjacent elements of the input tensor. However, accurate feature extraction is not possible for unordered process data, even when convolution is performed using a general 3 × 3 filter. Accordingly, in this study, only the temporal characteristics of each process variable are initially extracted through the CNN, so the order of the process variables does not matter. The DAE is then connected to extract the nonlinearity between the process variables. Figure 2 shows an outline of DDD.
Figure 2

Basic concept of DDD.

Basic concept of DDD. To evaluate the process dynamics, each input sample is converted into an m × (n + 1) matrix (m represents the number of input variables, and n represents the number of time delay variables). For temporal feature extraction, the sample is first input into hidden layers comprising convolutional and pooling layers. Afterward, the multidimensional data are converted into one-dimensional data via a fully connected layer. The one-dimensional data are then input into the hidden layer of the DAE to realize the connection between the CNN and the DAE. The number of neurons in the middle layer of the DAE is compressed such that it is smaller than m in the input layer. The data are then reconstructed through the decoder and the deconvolution layer. The model is trained so that the loss function L, the reconstruction error at the input layer, becomes small. In DDD, it is necessary to select the number of convolutional layers, the number of filters, the number of hidden layers in the AE, and the number of neurons in each hidden layer. The temporal midpoints of the training data used in this study were held out as validation data, and the final model was constructed using the combination of hyperparameters with the smallest L on this validation data. DDD detects process faults using two statistics, namely, the T² and Q statistics, similar to the PCA-based MSPC method. The T² statistic is calculated as the squared distance from the origin of the standardized middle-layer neurons, as follows:

T² = Σ_{c=1}^{d} (t_c / σ_c)²

where d represents the number of neurons in the middle layer, t_c represents the value of the cth neuron, and σ_c represents the standard deviation of the cth neuron over the training data. The Q statistic is obtained from the reconstruction error between the input and output layers, as follows:

Q = Σ_i (x_i − x̂_i)²

Each threshold, τ_T² or τ_Q, was set to the value containing 99.7% of the T² and Q values, respectively, calculated using the training data.
The 99.7% level is based on the 3σ rule. The resulting model was used to determine whether new data were abnormal. If the statistics obtained by inputting new data into the model satisfy T²_test > τ_T² or Q_test > τ_Q, the new data are considered abnormal; otherwise, they are considered normal. When an abnormal condition is detected through monitoring, the process variables related to the abnormality are searched. For the T² statistic, with w_{ic} denoting the weight of the ith input variable on the cth neuron of the hidden layer, the contribution of the ith input variable to the T² statistic is defined using these weights, attributing each neuron's squared standardized score (t_c/σ_c)² to the input variables in proportion to w_{ic}. The contribution of the ith input variable to the Q statistic is defined as its squared reconstruction error, (x_i − x̂_i)².
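The two monitoring statistics and their 99.7% thresholds can be sketched in NumPy. The function names are illustrative, the latent scores and reconstructions are placeholders standing in for the network's outputs, and the empirical percentile is assumed as the way to take the value containing 99.7% of the training statistics.

```python
import numpy as np

def monitoring_stats(t, x, x_hat, sigma):
    """t: middle-layer scores of one sample (d,); sigma: per-neuron standard
    deviations estimated on training data; x / x_hat: input sample and its
    reconstruction. Returns (T2, Q)."""
    T2 = np.sum((t / sigma) ** 2)     # squared distance in standardized latent space
    Q = np.sum((x - x_hat) ** 2)      # squared reconstruction error
    return T2, Q

def threshold_997(train_values):
    """Threshold containing 99.7% of the training statistics (3-sigma analogue)."""
    return np.percentile(train_values, 99.7)
```

A new sample is flagged as abnormal when its T² exceeds the T² threshold or its Q exceeds the Q threshold; otherwise it is treated as normal.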

Results and Discussion

The TEP dataset[18] was used to verify the effectiveness of DDD. Eastman Chemical Company developed the TEP dataset to mimic an actual industrial process, and this dataset has been used to evaluate the performance of various methods for process control and monitoring. The TEP comprises five main units, namely, a reactor, stripper, condenser, recycle compressor, and separator, with a total of eight components (A through H). The liquid products, G and H, and the by-product, F, are generated from the gaseous reactants A, C, D, and E through chemical reactions. The process is described in detail in the study by Downs and Vogel.[18] The TEP dataset incorporates a total of 52 variables, which include 22 process variables, 11 instrumental variables, and 19 component analysis result variables. In this study, only the 22 process measurement variables were used because the other variables are affected by manipulation. The process variables employed herein are listed in Table S1. The values of these process variables were measured every 3 min. The training data comprised 1500 min of normal data (500 samples), and the test data comprised 2880 min of data (960 samples) in which the 21 types of process faults listed in Table S2 occurred. These datasets and control structures are similar to those previously reported in the literature.[19] In each of the 21 types of test data, a process fault occurred after 480 min (160 samples). The models were constructed using the training data, which include only normal data. To consider the process dynamics, each sample input into DDD was transformed into an m × (n + 1) matrix, where m represents the number of input variables and n represents the number of time delay variables. In this study, m was set to 22 for the number of input variables, and n was set to 22, based on the study by Krizhevsky et al.[20] Data were preprocessed by range-scaling each process variable to the range 0–1.
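The preprocessing described above (0–1 range scaling per variable and the m × (n + 1) time-delay matrix per sample) can be sketched as follows; the function names and the toy array are illustrative assumptions.

```python
import numpy as np

def range_scale(X):
    """Scale each column (process variable) of X to the 0-1 range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def delay_matrix(X, t, n):
    """Build the m x (n + 1) input for sample index t: one row per variable,
    columns are the current value plus the n time-delayed values."""
    return X[t - n : t + 1].T          # shape (m, n + 1)
```

With m = n = 22 as in the study, each network input is thus a 22 × 23 matrix covering the current sample and the previous 22 samples.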
The false negative rate (FNR, %) and false alarm rate (FAR, %) were used to evaluate the fault detection performance of each method, as shown below (note that, in this convention, the normal class is treated as the positive class):

FNR = FN / (TP + FN) × 100
FAR = FP / (TN + FP) × 100

where TP represents the number of actually normal samples that the model also judges normal, FN represents the number of actually normal samples that the model judges abnormal, TN represents the number of actually abnormal samples that the model judges abnormal, and FP represents the number of actually abnormal samples that the model judges normal. FNR is thus the proportion of actually normal samples that are judged abnormal, and FAR is the proportion of actually abnormal samples that are judged normal. Performance improves as either FNR or FAR decreases. The capability to detect process faults was judged inferior to random estimation if both the FNR and FAR were ≥50%. The DAE and CNN were used as comparison methods. Table 1 shows the hyperparameters required for each method. l_AE represents the number of hidden layers in the autoencoder, and s ∈ ℝ represents the rate of reduction from the number of neurons in the previous layer; s is required once l_AE is determined. l_conv represents the number of convolutional layers, and f represents the number of filters; f is required once l_conv is determined. Table 2 lists the candidates for each hyperparameter, and Table 3 shows the combination of hyperparameters with the minimum loss function on the validation data (the temporal midpoints of the training data).
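The FNR and FAR definitions above can be checked with a small helper. This follows the text's convention, in which the normal class is the positive class; the function name and label encoding are illustrative assumptions.

```python
import numpy as np

def fnr_far(y_true, y_pred):
    """y_true / y_pred: 0 = normal, 1 = abnormal. Following the text's
    convention (normal = positive class): FNR is the percentage of actually
    normal samples judged abnormal, and FAR is the percentage of actually
    abnormal samples judged normal."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 0) & (y_pred == 0))   # normal judged normal
    fn = np.sum((y_true == 0) & (y_pred == 1))   # normal judged abnormal
    tn = np.sum((y_true == 1) & (y_pred == 1))   # abnormal judged abnormal
    fp = np.sum((y_true == 1) & (y_pred == 0))   # abnormal judged normal
    return 100.0 * fn / (tp + fn), 100.0 * fp / (tn + fp)
```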
Table 1

Hyperparameters for Each Method

method    hyperparameters
DAE       l_AE, s
CNN       l_conv, f
DDD       l_conv, f, l_AE, s
Table 2

Candidates for Hyperparameters

hyperparameter    candidates
l_AE              1, 2, 3, 4
s                 1/2, 1/3
l_conv            1, 2, 3, 4
f                 8, 16, 32, 64
Table 3

Optimized Hyperparameter Values for Each Method

method    l_AE    s                       l_conv    f
DAE       4       1/2, 1/3, 1/3, 1/3      -         -
CNN       -       -                       4         64, 64, 64, 64
DDD       3       1/3, 1/3, 1/3           4         8, 8, 16, 16
A model using the hyperparameter values listed in Table 3 was constructed, and the 21 process faults were detected. Table 4 shows the results of abnormality detection using the DAE, CNN, and DDD. In this study, methods that exceeded 50% in terms of either FNR or FAR were excluded from the performance comparison. Regarding process faults 3, 4, 9, 15, and 16, no process fault was detected because FAR, FNR, or both exceeded 50% for all methods. DDD exhibited the most favorable FAR for seven process faults, followed by the DAE and the CNN with six and four process faults, respectively. The CNN demonstrated the most favorable FNR for nine process faults, followed by DDD with eight process faults.
Table 4

FNR and FAR Results for Each Method in 21 Process Faults

fault    DAE FNR    DAE FAR    CNN FNR    CNN FAR    DDD FNR    DDD FAR
1        15.6       0.4        4.4        0.4        24.4       0.4
2        8.1        1.4        4.4        1.8        4.4        1.5
3        44.4       51.9       26.9       54.8       10.0       72.5
4        11.3       73.6       4.4        76.8       9.4        72.8
5        11.3       48.9       4.4        56.5       9.4        54.3
6        6.3        0.1        1.3        0.0        7.5        0.1
7        12.5       32.4       8.1        37.4       6.3        35.9
8        19.4       0.9        3.8        0.0        2.5        1.3
9        48.1       52.9       33.1       57.1       31.3       63.5
10       11.9       14.8       16.3       14.1       9.4        11.8
11       17.5       1.4        20.0       0.6        24.4       0.4
12       25.0       0.0        17.5       0.0        7.5        0.0
13       8.1        4.8        1.3        5.3        6.9        3.9
14       18.8       0.1        8.1        0.0        21.3       0.1
15       6.9        61.6       1.3        60.9       9.4        65.4
16       71.3       10.5       61.3       11.5       63.8       13.8
17       24.4       2.3        19.4       2.5        25.6       1.1
18       11.9       6.4        13.8       7.4        8.1        6.0
19       8.1        3.8        3.1        1.8        1.9        1.5
20       10.6       12.4       0.0        13.6       2.5        13.9
21       26.9       27.8       27.5       41.5       10.6       29.9
Figure 3 shows the delay time from the fault onset at 480 min until the point at which the abnormality is detected. The delay times for DDD were confirmed to be short overall. The average delay times for the DAE, CNN, and DDD were 22.9, 41.7, and 22.4 min, respectively, so DDD detected process faults fastest on average.
Figure 3

Expected fault detection delay: (a) DAE, (b) CNN, and (c) DDD.

As examples of detected process faults, Figures 4–7 show the time plots of each statistic for process faults 2, 6, 13, and 19, respectively. The black horizontal line represents the threshold value; an abnormality is signaled when the threshold is exceeded. The time presented on the horizontal axis represents the time at which an abnormality is detected using each statistic. According to Figures 4–7, the time plots of the Q statistics for the DAE, CNN, and DDD are similar. By contrast, the CNN cannot calculate the T² statistic, and the T² statistic of the DAE remained below the threshold in all cases, so neither could detect the process faults with it. Despite this, Figures 4–6 confirm that DDD could accurately detect process faults with the T² statistic. For process fault 13, as shown in Figure 6, the T² statistic detected the fault earlier than the Q statistic. For process fault 19, as shown in Figure 7, DDD could accurately detect the fault based on the Q statistic. Therefore, it was confirmed that DDD can detect abnormalities accurately.
Figure 4

Time plot of each statistic for each method in process fault 2. Horizontal straight lines indicate the thresholds, and values in the x-axis indicate the fault detection times. (a) DAE, (b) CNN, and (c) DDD.

Figure 5

Time plot of each statistic for each method in process fault 6. Horizontal straight lines indicate the thresholds, and values in the x-axis indicate the fault detection times. (a) DAE, (b) CNN, and (c) DDD.

Figure 6

Time plot of each statistic for each method in process fault 13. Horizontal straight lines indicate the thresholds, and values in the x-axis indicate the fault detection times. (a) DAE, (b) CNN, and (c) DDD.

Figure 7

Time plot of each statistic for each method in process fault 19. Horizontal straight lines indicate the thresholds, and values in the x-axis indicate the fault detection times. (a) DAE, (b) CNN, and (c) DDD.

Subsequently, the process variables related to the abnormalities were identified. For DDD, the diagnosis results for process faults 1, 2, 6, and 13 are outlined in Figures 8–11, respectively. The contribution of each process variable to the process fault can be calculated using the T² and Q statistics. It was confirmed that DDD can diagnose which process variables are abnormal using the T² and Q statistics, which cannot be done using conventional deep learning-based techniques. For process fault 1, shown in Figure 8, process variable 1 contributed highly to the Q statistic of DDD, and an abnormality in the supply flow rate of raw material A was identified. The T² statistic of DDD further attributed the detected fault to process variable 20, thereby suggesting that the compressor failure is related to the feed rate of raw material A. For process fault 2, shown in Figure 9, DDD successfully diagnosed an error in the purge (process variable 10), which can be expected to cause abnormalities in the composition of the product.
For process fault 6, as shown in Figure 10, process variables 7, 13, and 16 have larger contributions than the other process variables, indicating abnormalities in the reactor, separator, and stripper. However, according to Table S2, this diagnosis differs from the actual cause of the abnormality. Only the T² statistic of DDD indicates that the supply flow rate of raw material A (process variable 1) is abnormal, and it can thus contribute to identifying the root cause of process fault 6. Regarding process fault 13, shown in Figure 11, abnormal pressure levels in the reactor, separator, and stripper were diagnosed as a consequence of the drift in the reaction rate constant. Moreover, DDD successfully determined that the flow rate of raw material D made a substantial contribution. Therefore, DDD can increase the information available for identifying the causes of process faults by quantifying the influence of the high-dimensional features learned through deep learning on the detected faults, and it can contribute to finding causes of process faults that cannot be confirmed using conventional methods.
Figure 8

Process fault diagnosis results of DDD in process fault 1.

Figure 9

Process fault diagnosis results of DDD in process fault 2.

Figure 10

Process fault diagnosis results of DDD in process fault 6.

Figure 11

Process fault diagnosis results of DDD in process fault 13.


Conclusions

In this study, a deep convolutional neural network with deconvolution and a deep autoencoder (DDD) was proposed for the construction of an MSPC-based deep neural network that assesses the process dynamics and the nonlinearity between process variables. DDD can be used to detect and diagnose process faults through the constructed neural network. Based on the CNN and DAE, DDD can effectively represent the relationships between process variables hidden in process data while simultaneously accounting for their dynamic characteristics and nonlinearity. By calculating the Q and T² statistics using DDD, it is possible to detect process faults, and the T² and Q statistics can be used to quantify the contribution of each process variable to a specific abnormality. A case study using the TEP dataset was conducted to verify the effectiveness of DDD, which determined the contributions of the process variables to each process fault quantitatively. Overall, compared with conventional process fault detection methods, DDD demonstrates enhanced performance, and it successfully increases the information available for identifying the causes of process faults through its ability to present the process variables involved. Because tensorial data can be analyzed in chemical and biological manufacturing processes,[21] tensorial data that consider both the process variables and the process dynamics could be analyzed effectively using DDD in future research. However, DDD has the limitation that process data in normal states are required to construct the fault detection and diagnosis models. The proposed approach is expected to improve the efficiency of process control and management in chemical plants and industrial facilities through the detection and diagnosis of process faults.