
Missing and Corrupted Data Recovery in Wireless Sensor Networks Based on Weighted Robust Principal Component Analysis.

Jingfei He¹, Yunpei Li¹, Xiaoyue Zhang¹, Jianwei Li¹.

Abstract

Although wireless sensor networks (WSNs) have been widely used, the existence of data loss and corruption caused by poor network conditions, sensor bandwidth, and node failure during transmission greatly affects the credibility of monitoring data. To solve this problem, this paper proposes a weighted robust principal component analysis method to recover the corrupted and missing data in WSNs. By decomposing the original data into a low-rank normal data matrix and a sparse abnormal matrix, the proposed method can identify the abnormal data and avoid the influence of corruption on the reconstruction of normal data. In addition, the low-rankness is constrained by weighted nuclear norm minimization instead of the nuclear norm minimization to preserve the major data components and ensure credible reconstruction data. An alternating direction method of multipliers algorithm is further developed to solve the resultant optimization problem. Experimental results demonstrate that the proposed method outperforms many state-of-the-art methods in terms of recovery accuracy in real WSNs.


Keywords: missing and corrupted data recovery; robust principal component analysis; weighted nuclear norm; wireless sensor networks

Year: 2022 PMID: 35271138 PMCID: PMC8914969 DOI: 10.3390/s22051992

Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576

1. Introduction

Wireless sensor networks (WSNs) consist of spatially distributed sensor nodes that communicate wirelessly and collect data from their surroundings [1,2]. WSNs have been widely applied in domains such as environmental monitoring [3], military management [4], and health care [5]. Typically, the main task of a WSN is to gather sensing data from all sensor nodes at a sink and then analyze the monitoring data; the collected data usually consist of readings sensed by multiple nodes in consecutive time slots. However, due to harsh environments and energy constraints, data loss and corruption are inevitable in practical applications. Therefore, it is important to reconstruct the real data from partially collected data with corruption.

Various reconstruction methods have been proposed for data recovery in WSNs. Based on data interpolation techniques, a K nearest neighbor (KNN)-based method [6] estimates the missing values from the values of the nearest neighbors. The Delaunay triangulation (DT) [7] utilizes the vertices as their global errors to reconstruct virtual triangles for data interpolation. Based on compressed sensing (CS) [8], the distributed compressed sensing (DCS) method [9,10] exploits the sparsity of the data under various transform domains. Since signals in many applications can be arranged as two-dimensional data (i.e., in matrix form) and exhibit second-order sparsity (i.e., low-rankness), matrix completion (MC) [11] has emerged as a novel technology and has been applied to many fields, such as image inpainting [12], magnetic resonance imaging [13], and recommendation systems [14]. Matrix completion aims to recover the missing entries of a low-rank matrix from incomplete observations, which can be formulated as a rank minimization problem.
In general, solving this problem is NP-hard, since the rank function is non-convex. Fortunately, the nuclear norm, i.e., the sum of all singular values of the matrix, is a convex approximation of the rank function and can be used as an alternative [11]. Since the readings collected from nodes over time slots in WSNs can also be arranged into a matrix exhibiting low-rankness, matrix completion-based methods have been proposed to exploit the correlation of WSNs data. An efficient data collection approach (EDCA) [15] and the spatiotemporal compressive data collection (STCDG) method [16] were first proposed to recover WSNs data by exploiting the spatiotemporal correlation in the form of low-rankness. More recently, several methods that jointly utilize low-rank and spatiotemporal sparsity features [17,18] were proposed. Considering that a missing row of the data matrix caused by a broken node greatly degrades the recovery accuracy, the matrix completion method in [19] incorporates an interpolation technique for WSNs data recovery. In addition, to meet the need for real-time reconstruction in practical applications, sliding window-based reconstruction approaches [20,21] were proposed.

However, the reconstruction performance of these methods degrades greatly when corruption exists in the sampled data, because directly constraining the low-rankness cannot avoid the impact of corruption on the reconstruction of normal data. A two-phase MC-based data recovery scheme (MC-Two-Phase) [22] was proposed to recover the normal data without the influence of corruption by detecting the corruption with principal component analysis (PCA) [23] before reconstruction. Although PCA can be used to detect faults corrupted by small noise, it suffers from poor robustness. To overcome the limitations of PCA, the robust principal component analysis (RPCA) method [24,25,26] has been proposed in recent years.
The RPCA method improves the robustness because it only assumes that the noise is sparse, regardless of the strength of the noise. However, it is unreasonable to treat all singular values equally, as the traditional RPCA algorithm does, since different singular values may carry signal information of different importance.

To solve the above problem, we propose a weighted robust principal component analysis (WRPCA) method for the reconstruction of WSN data with corruption. The main contributions of this paper are the following. Firstly, based on RPCA, the original data with outliers are decomposed into a sum of a low-rank normal data matrix and a sparse abnormal matrix to avoid the influence of outliers during reconstruction. Secondly, the low-rankness of WSNs data is revealed by the variation of singular values of two real datasets collected from the Intel Berkeley Research lab and GreenOrbs. Thirdly, the weighted nuclear norm is introduced to constrain the low-rankness and preserve the principal components of WSNs data.

The rest of this paper is organized as follows. Section 2 presents the basics of RPCA. Section 3 describes the proposed method and its reconstruction algorithm. Section 4 shows the results of the computer experiments and their analysis, followed by the conclusion of the paper in Section 5.

2. Basics of RPCA

Although PCA can be used to detect corruptions, it is sensitive to gross noise and outliers, and this lack of robustness to gross corruptions limits its performance and applicability in real-life scenarios. As an improvement of PCA, RPCA can handle grossly corrupted data well. Suppose that a data matrix $D$ can be viewed as consisting of two components, a low-rank matrix $L$ and a sparse matrix $S$:

$$D = L + S. \quad (1)$$

The low-rank matrix $L$ and the sparse matrix $S$ can be obtained by solving the following problem:

$$\min_{L,S} \; \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad D = L + S, \quad (2)$$

where $\operatorname{rank}(\cdot)$ denotes the rank of the matrix, $\|\cdot\|_0$ is the $\ell_0$ norm, and $\lambda$ is the balance parameter. Problem (2) is non-convex and NP-hard, and is therefore difficult to solve. Typically, the matrix nuclear norm, the convex approximation of the rank function, can be used as an alternative. Therefore, the above problem can be cast as the following convex optimization problem:

$$\min_{L,S} \; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad D = L + S, \quad (3)$$

where $\|L\|_* = \sum_i \sigma_i(L)$ denotes the nuclear norm of matrix $L$, $\sigma_i(L)$ is the $i$-th singular value of matrix $L$, and $\|\cdot\|_1$ is the $\ell_1$ norm. The main goal of (3) is to reconstruct the low-rank normal data $L$ from the corrupted observation data $D$. RPCA has been successfully applied in different domains, including image processing [27], multimedia [28], and document analysis [29].

The nuclear norm minimization used in (3) shrinks all the singular values equally [30], ignoring that different singular values may have different importance. In practice, the real data sensed in the monitoring area always exhibit low-rankness, and the unavoidable corrupted data are sparsely distributed in the sensed data matrix. Based on RPCA, we propose a weighted robust principal component analysis method to recover the missing data in WSNs in the presence of data corruption.
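To make the relaxation from the rank and sparsity counts to their convex surrogates more concrete, the following minimal sketch evaluates both pairs of quantities on a toy decomposition. The function names `rank_and_nuclear_norm` and `l0_and_l1`, the tolerance `tol`, and the toy matrices are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def rank_and_nuclear_norm(L, tol=1e-10):
    """Return (numerical rank, nuclear norm) computed from the singular values of L."""
    s = np.linalg.svd(L, compute_uv=False)
    # Rank counts the non-negligible singular values; the nuclear norm sums them.
    return int(np.sum(s > tol)), float(np.sum(s))

def l0_and_l1(S):
    """Return (number of non-zero entries, sum of absolute values) of S."""
    return int(np.count_nonzero(S)), float(np.sum(np.abs(S)))

# Toy example (hypothetical data): a rank-1 "normal" matrix plus two outliers.
L = np.outer([1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0])
S = np.zeros_like(L)
S[0, 1], S[2, 3] = 5.0, -4.0
D = L + S  # the observed matrix is no longer low-rank

rank_L, nuclear_L = rank_and_nuclear_norm(L)
nnz_S, l1_S = l0_and_l1(S)
```

The rank and the non-zero count are integer-valued and hard to optimize, whereas the nuclear norm and the sum of absolute values are convex functions of the matrix entries, which is why they are used as substitutes.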

3. The Proposed Method

3.1. Problem Formulation and Signal Feature

Consider a WSN consisting of one sink and $N$ sensor nodes, where the sensor nodes sense the environmental information and send the signal to the sink in each time slot. During $T$ time slots, $N \times T$ readings are gathered in the sink and can be organized into a matrix $D \in \mathbb{R}^{N \times T}$. However, due to hardware and network conditions, data loss and corruption may occur in the network. Mathematically, only partial data $M = P_\Omega(D)$ can be successfully collected in the sink, and the original data $D$ contain corrupted entries. Here, $P_\Omega$ is the random sampling operator. That is, under the sampling ratio $p$, $pNT$ entries of the $N \times T$ matrix are sampled randomly from the whole data. It is worth noting that the sampled partial data also contain sampled corrupted entries. It is necessary to reconstruct the uncorrupted whole data from the sampled partial data under the sampling ratio $p$.

The data sensed in a certain area over consecutive time slots are always redundant and highly correlated, and can be arranged into a matrix (the uncorrupted matrix $L$) exhibiting low-rankness. Since the outliers in a real WSN are uniformly and randomly distributed, the sparsely distributed corrupted data can be denoted by the matrix $S$. Therefore, the whole data $D$ can be regarded as a combination of the uncorrupted data matrix $L$ and the corrupted matrix $S$.

To verify that the uncorrupted data in WSNs are low-rank, two datasets from the Intel Berkeley Research lab [31] and GreenOrbs [32] were used as testing data. Since data loss and corruption exist in both datasets, two small but complete subsets without corruption were selected as the ground truth for the verification experiment. Specifically, the selected Intel Berkeley Research lab subset, including temperature and humidity data, was measured by 49 sensor nodes during 138 time slots, and the selected GreenOrbs subset was measured by 130 sensor nodes during 129 time slots. As shown in Figure 1, the singular values of the two attribute data matrices illustrate the low-rankness of both datasets.

Figure 1

The first 10 singular values of the two attribute data matrices (temperature and humidity) for the Berkeley and GreenOrbs data.
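A low-rankness check of this kind can be sketched as follows. The real Berkeley and GreenOrbs readings are not reproduced here, so the example uses a synthetic stand-in with the same shape as the Berkeley subset (49 nodes, 138 time slots); the function name `singular_value_energy`, the rank-2 structure, and the noise level are all assumptions made for illustration.

```python
import numpy as np

def singular_value_energy(X, k):
    """Fraction of the sum of singular values carried by the k largest ones."""
    s = np.linalg.svd(X, compute_uv=False)  # returned in descending order
    return float(np.sum(s[:k]) / np.sum(s))

rng = np.random.default_rng(0)
n_nodes, n_slots = 49, 138  # shape of the Berkeley subset described in the text

# Synthetic stand-in (assumption): two shared temporal patterns mixed per node,
# plus a small sensing noise so that the matrix is only approximately low-rank.
base = rng.standard_normal((n_nodes, 2)) @ rng.standard_normal((2, n_slots))
readings = base + 0.01 * rng.standard_normal((n_nodes, n_slots))

energy_top2 = singular_value_energy(readings, 2)
```

When a few singular values carry almost all of the energy, as in this synthetic case, the matrix is well approximated by a low-rank one, which is the property that Figure 1 is used to illustrate for the real data.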

3.2. Proposed Method

Since the original data matrix $D$ can be decomposed into a low-rank matrix $L$ and a sparse matrix $S$, the WSNs data recovery problem can be expressed by (3). However, nuclear norm minimization (NNM) applies the same threshold to every singular value, which is not appropriate because the larger singular values usually represent the major components of the data and carry more signal information. The larger singular values should be shrunk less to preserve these major components. To improve the practicality and flexibility of the nuclear norm, the weighted nuclear norm is used in the recovery of WSNs data. The weighted nuclear norm of a matrix $X$ is defined as:

$$\|X\|_{w,*} = \sum_i w_i \sigma_i(X), \quad (4)$$

where $w_i$ is the weight coefficient and $\sigma_i(X)$ is the $i$-th singular value of $X$. The weighted nuclear norm reduces to the conventional nuclear norm when $w_i = 1$ for all $i$.

The weighted nuclear norm minimization (WNNM) based low-rank matrix completion problem uses this weighted norm in place of the nuclear norm, and Gu et al. [30] proved that the problem can be solved by a singular value thresholding formula in which each singular value is shrunk according to its own weight. The larger singular values should be given smaller weights to achieve less shrinkage, and the smaller ones should be given greater weights to achieve more shrinkage. The weights should therefore be inversely proportional to the singular values. In this paper, the weight formula involves a constant, the number of columns of the matrix, the variance of the noise, and a very small number that is needed only to avoid dividing by zero. By introducing the weighted approach, different singular values are shrunk differently, which further preserves the major components of the data.

A WSNs data reconstruction method is then proposed by applying WNNM in traditional RPCA to recover the data from the partial measurement, which gives the model (8). Only the partial measurement is known a priori in (8). Both the original data and the uncorrupted data can be reconstructed from the partial data. By introducing a quadratic penalty term, (8) can be converted to the formulation (9), which contains two regularization parameters. The proposed method incorporates both RPCA and WNNM in a single formulation to further preserve the major data components. The recovered low-rank matrix is taken as the uncorrupted, completed data in WSNs.
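The difference between uniform and weighted shrinkage of singular values can be sketched as below. This is a simplified illustration and not the paper's exact weight formula: it assumes weights of the form `c / (sigma_i + eps)` computed from the singular values of the input itself, and it omits the column-count and noise-variance factors mentioned in the text. The names `svt`, `weighted_svt`, `c`, and `eps` are hypothetical.

```python
import numpy as np

def svt(Y, tau):
    """Uniform singular value thresholding: every singular value is reduced by tau."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def weighted_svt(Y, c, eps=1e-8):
    """Weighted thresholding with weights inversely proportional to singular values.

    Assumption: w_i = c / (sigma_i + eps), so large singular values get small
    weights (little shrinkage) and small ones get large weights (strong shrinkage).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    w = c / (s + eps)  # non-descending, because s is in descending order
    return (U * np.maximum(s - w, 0.0)) @ Vt

# Toy input whose singular values are 10, 1, and 0.5.
Y = np.diag([10.0, 1.0, 0.5])
uniform = svt(Y, 1.0)
weighted = weighted_svt(Y, 2.0)
```

In this toy case the dominant singular value is reduced far less by the weighted rule than by the uniform rule, while the two small singular values are removed by both, which is the behaviour the weighting is meant to achieve.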

3.3. Model Optimization

To solve (9), a reconstruction algorithm based on the alternating direction method of multipliers (ADMM) [33,34] is introduced. The augmented Lagrangian function of (9) is formed with a Lagrangian multiplier and a penalty parameter, and each variable is then updated in turn while the others are held fixed. The first subproblem is solved with the preconditioned conjugate gradient (PCG) algorithm. The second subproblem is a WNNM problem, which is non-convex in general. Gu et al. [30] proved that this problem has a fixed point and can be solved by the singular value thresholding formula: the Singular Value Decomposition (SVD) of the input matrix is computed, and singular value thresholding is applied to the diagonal matrix of singular values. Since the threshold is affected by two parameters, a fixed setting is used here to simplify the solution, and an initial estimate is specified for the first iteration. The third subproblem is solved via the well-known soft thresholding formula. Finally, the Lagrangian multiplier is updated.

In practical implementation, all four variables are initialized as zero matrices. Then, (9) can be solved by repeating the above steps until the stopping criterion falls below a predefined tolerance parameter or the number of iterations reaches the predefined maximum. The main computational cost of solving (9) lies in the low-rank update, which requires computing the SVD of a matrix in every iteration, so the per-iteration complexity is dominated by this SVD.
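The alternating structure described above can be sketched with a simplified augmented Lagrangian loop. This is not the authors' algorithm: it eliminates the separate full-data variable (so no PCG step is needed) by leaving the sparse term unpenalized on unsampled entries, it uses plain nuclear norm thresholding by default, and the step sizes follow the common inexact augmented Lagrangian heuristics for RPCA. The names `robust_completion`, `weight_fn`, `rho`, and the default `lam` are assumptions; `weight_fn` is an optional hook for trying a weighted threshold, which is a heuristic without a convergence guarantee in this sketch. The input is assumed to contain at least one non-zero sampled entry.

```python
import numpy as np

def soft_threshold(T, tau):
    """Entry-wise soft thresholding, the closed-form l1 proximal step."""
    return np.sign(T) * np.maximum(np.abs(T) - tau, 0.0)

def robust_completion(M, mask, lam=None, weight_fn=None,
                      rho=1.2, tol=1e-7, max_iter=500):
    """Split partially observed data into low-rank L and sparse S (sketch).

    M    : observed matrix (values outside the mask are ignored)
    mask : boolean array, True where an entry was sampled
    """
    mask = np.asarray(mask, dtype=bool)
    D = np.where(mask, M, 0.0)                      # zero-fill unsampled entries
    if lam is None:                                 # assumed default balance parameter
        lam = 1.0 / np.sqrt(mask.mean() * max(D.shape))
    spec = np.linalg.norm(D, 2)                     # largest singular value of D
    Y = D / max(spec, np.abs(D).max() / lam)        # multiplier initialization
    mu, mu_max = 1.25 / spec, 1.25e7 / spec         # penalty parameter and its cap
    L, S = np.zeros_like(D), np.zeros_like(D)
    d_norm = np.linalg.norm(D)
    for _ in range(max_iter):
        # Low-rank update: (weighted) singular value thresholding.
        U, s, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        w = np.ones_like(s) if weight_fn is None else weight_fn(s)
        L = (U * np.maximum(s - w / mu, 0.0)) @ Vt
        # Sparse update: soft thresholding on sampled entries, free elsewhere.
        T = D - L + Y / mu
        S = np.where(mask, soft_threshold(T, lam / mu), T)
        # Multiplier update and penalty increase.
        R = D - L - S
        Y = Y + mu * R
        mu = min(rho * mu, mu_max)
        if np.linalg.norm(R) <= tol * d_norm:
            break
    return L, np.where(mask, S, 0.0)
```

A call such as `L_hat, S_hat = robust_completion(M, mask)` returns the completed low-rank estimate and the detected sparse corruption on the sampled entries. Passing, for example, `weight_fn=lambda s: 1.0 / (s / s.max() + 1e-3)` gives larger singular values smaller thresholds, in the spirit of the weighted approach, but that particular scaling is only a hypothetical choice.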

4. Experiments and Analysis

Most existing WSNs data reconstruction methods (e.g., KNN [6], CS [9], EDCA [15], and methods utilizing both low-rank and sparsity features [17,18]) have achieved satisfactory recovery performance, but they do not consider the case in which the WSNs data contain outliers. Therefore, the performance of our proposed method is compared with the RPCA method [24] and the MC-Two-Phase method [22], which consider outliers during reconstruction.

4.1. Experimental Environments

The two datasets adopted to verify the low-rank property of normal WSNs data were also used for the reconstruction experiments. The normal data without corruption, one matrix for the Berkeley data and one for the GreenOrbs data, are regarded as the ground truth WSNs data. In real WSNs, due to data loss and corruption, the measurement is partially sampled and contains corruption. To obtain such a partial measurement from the normal matrix in the experiment, the following steps were performed. First, the partially sampled normal data were obtained by applying the random sampling operator according to the sampling ratio. Then, a number of entries determined by the corruption ratio were randomly selected as the corruption data by adding random Gaussian noise with zero mean and a given variance. Two of the parameters can be chosen according to the characteristics of the signal collected by the sink; the settings used in this paper include the values 0.05, 3.3, and 0.05. The Normalized Mean Absolute Error (NMAE) is used to measure the recovery performance of the different methods on missing data and corrupted data; it compares the recovered data with the ground truth separately over the set of missing entries and the set of corrupted entries. Each reported result is the average of 50 repeated experiments.
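The measurement simulation and the error metric can be sketched as follows. Several details are assumptions rather than facts from the paper: the corruption ratio is applied to the sampled entries, the noise level is passed as a standard deviation `noise_std`, and NMAE is taken in its common form, the sum of absolute errors over an index set divided by the sum of absolute true values over the same set. The function names are hypothetical.

```python
import numpy as np

def simulate_measurement(X, sampling_ratio, corruption_ratio, noise_std, rng):
    """Return (measurement, sampled mask, corrupted mask) built from ground truth X."""
    n_total = X.size
    n_sampled = int(round(sampling_ratio * n_total))
    sampled_idx = rng.choice(n_total, size=n_sampled, replace=False)
    # Assumption: the corrupted entries are a fraction of the sampled entries.
    n_corrupt = int(round(corruption_ratio * n_sampled))
    corrupt_idx = rng.choice(sampled_idx, size=n_corrupt, replace=False)

    mask = np.zeros(n_total, dtype=bool)
    mask[sampled_idx] = True
    corrupted = np.zeros(n_total, dtype=bool)
    corrupted[corrupt_idx] = True

    noisy = X.ravel().astype(float)              # astype returns a copy
    noisy[corrupted] += rng.normal(0.0, noise_std, size=n_corrupt)
    measurement = np.where(mask, noisy, 0.0)     # unsampled entries are zero-filled
    shape = X.shape
    return measurement.reshape(shape), mask.reshape(shape), corrupted.reshape(shape)

def nmae(X_true, X_rec, index_set):
    """Normalized mean absolute error restricted to a boolean index set."""
    abs_err = np.abs(X_true - X_rec)[index_set].sum()
    return float(abs_err / np.abs(X_true)[index_set].sum())
```

With these helpers, the error on missing data would be `nmae(X, X_rec, ~mask)` and the error on corrupted data would be `nmae(X, X_rec, corrupted)`, where `X_rec` is the output of whichever recovery method is being evaluated.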

4.2. Recovery Performance Comparisons

To compare the proposed method with the existing methods, temperature and humidity data from the Intel Berkeley Research lab and GreenOrbs were used for the recovery performance comparisons. With sampling ratios of 0.1, 0.2, 0.3, and 0.4 and corruption ratios of 0.2, 0.3, 0.4, 0.5, and 0.6, Figure 2 and Figure 3 show the recovery performance of each method for the Berkeley temperature and humidity data, while Figure 4 and Figure 5 show the recovery performance of each method for the GreenOrbs temperature and humidity data, respectively. As can be seen, the proposed method has a lower NMAE than the comparison methods on both missing and corrupted data in the two datasets, especially at low sampling ratios and high corruption ratios.

Figure 2

The recovery performance of each method for temperature data sensed in the Intel Berkeley Research lab. (a) NMAE of missing data; (b) NMAE of corrupted data.

Figure 3

The recovery performance of each method for humidity data sensed in the Intel Berkeley Research lab. (a) NMAE of missing data; (b) NMAE of corrupted data.

Figure 4

The recovery performance of each method for temperature data sensed in GreenOrbs. (a) NMAE of missing data; (b) NMAE of corrupted data.

Figure 5

The recovery performance of each method for humidity data sensed in GreenOrbs. (a) NMAE of missing data; (b) NMAE of corrupted data.

As shown in Figure 2b, the NMAE of corrupted data for the proposed method, RPCA, and MC-Two-Phase is 0.030, 0.037, and 0.099 at a lower corruption ratio, while the values are 0.048, 0.097, and 0.140 when the corruption ratio rises to 0.6. The results show that the proposed method not only improves the recovery accuracy of WSNs data but also is strongly robust to gross noise. In particular, Figure 4 shows that as the corruption ratio increases from 0.2 to 0.6, the NMAE values of missing and corrupted data for MC-Two-Phase and RPCA increase dramatically, while those of the proposed method change very little, with the entire range of change below 0.01. As shown in Figure 3, the NMAE values of missing and corrupted data for the proposed method decrease rapidly as the sampling ratio increases, but there is little change for MC-Two-Phase. Specifically, in Figure 3b, the NMAE of corrupted data for the proposed method, RPCA, and MC-Two-Phase is 0.056, 0.069, and 0.076 at a lower sampling ratio, while the values are 0.029, 0.040, and 0.069 when the sampling ratio rises to 0.4.

5. Conclusions

In this paper, we propose a WRPCA method to increase the recovery accuracy of WSNs data with loss and corruption. The original data matrix is treated as the sum of a low-rank normal data matrix and a sparse abnormal matrix to avoid the influence of corruption. In addition, weighted nuclear norm minimization is used to further constrain the low-rankness of the normal data and to overcome the problem that nuclear norm minimization treats all singular values equally. The experimental results show that the proposed method achieves better recovery performance on both missing and corrupted data. In future work, the higher-order low-rankness of multi-attribute data in WSNs can be explored for multi-attribute data reconstruction.
