Yusheng Zhou, Rufu Qin, Huiping Xu, Shazia Sadiq, Yang Yu.
Abstract
With the construction and deployment of seafloor observatories around the world, massive amounts of oceanographic measurement data have been gathered and transmitted to data centers. The growing volume of observed data not only supports marine scientific research but also raises the requirements for data quality control, as scientists must ensure that their research outcomes rest on high-quality data. In this paper, we first analyzed and defined the data quality problems occurring in the East China Sea Seafloor Observatory System (ECSSOS). We then proposed a method to detect and repair these problems. Incorporating data statistics and expert knowledge from domain specialists, the proposed method consists of three parts: a general pretest that preprocesses the data and routes them to further processing, outlier detection methods that label suspect data points, and a data interpolation method that fills in missing and suspect data. The autoregressive integrated moving average (ARIMA) model was improved and applied to seafloor observatory data quality control by using a sliding window and cleaning the input modeling data. A quality control flag system was also proposed and applied to describe the quality control results and the processing procedure. Real observed data from ECSSOS were used to implement and test the proposed method. The results demonstrated that the proposed method was effective at detecting and repairing data quality problems in seafloor observatory data.
Keywords: ARIMA; data interpolation; data quality control; outlier detection; seafloor observatory
Year: 2018 PMID: 30103440 PMCID: PMC6111880 DOI: 10.3390/s18082628
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Main measurement parameters and data collection intervals of some sensors.
| Sensors | Main Measurement Parameters | Data Collection Intervals (s) |
|---|---|---|
| ADCP | depth, current velocity and direction | 10 |
| CTD | temperature, salinity, depth | 10 |
| CO2 Pro | CO2 | 6 |
| Hydrolab DS 5 | pH, chlorophyll-a | 8 |
| PBR Tide-Wave Recorder | depth | 1 |
| Magnetometer | magnetic data | 1 |
| OBS-3+ | turbidity | 10 |
| Ocean Bottom Seismometer | earth motion data | 0.01 |
Figure 1. Typical data quality problems in (a) conductivity and (b) pH observed data series. Red dots denote outliers, and magenta circles denote data gaps.
Figure 2. The seafloor observatory data quality control operational framework. Orange denotes the general pretest process, green denotes the outlier detection process, and blue denotes the data interpolation process.
ECSSOS data quality control flagging system.
| Quality Flag | Description |
|---|---|
| 1 | Good data, passed all tests. |
| 2 | Data quality control procedures are not performed. |
| 3 | Suspect data, failed the ARIMA model-based test. |
| 4 | Bad data, failed the range rationality test. |
| 5 | Stuck value data. |
| 7 | Interpolated single data point by the ARIMA method. |
| 8 | Interpolated successive multi data points by the ARIMA method. |
| 9 | Missing data. |
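
As a rough illustration of how the flag values above might be used in a general pretest, the sketch below encodes the table as an enum and applies three simple checks (missing, out of range, stuck). It is not the authors' implementation: the names `QCFlag` and `general_pretest`, the range limits, the `stuck_run` threshold, and the choice to leave untested points at flag 2 until a later model-based test runs are all assumptions made for this example.

```python
from enum import IntEnum
import math


class QCFlag(IntEnum):
    """Flag values copied from the ECSSOS flag table; member names are hypothetical."""
    GOOD = 1
    NOT_EVALUATED = 2
    SUSPECT = 3
    BAD_RANGE = 4
    STUCK = 5
    INTERPOLATED_SINGLE = 7
    INTERPOLATED_MULTI = 8
    MISSING = 9


def general_pretest(values, valid_min, valid_max, stuck_run=5):
    """Return one flag per value: MISSING, BAD_RANGE, STUCK, or NOT_EVALUATED.

    valid_min, valid_max and stuck_run are illustrative parameters, not
    thresholds taken from the paper.
    """
    flags = []
    for v in values:
        if v is None or math.isnan(v):
            flags.append(QCFlag.MISSING)
        elif not (valid_min <= v <= valid_max):
            flags.append(QCFlag.BAD_RANGE)
        else:
            flags.append(QCFlag.NOT_EVALUATED)

    # Mark runs of identical in-range values that last at least stuck_run samples.
    run_start = 0
    for i in range(1, len(values) + 1):
        run_ends = i == len(values) or values[i] != values[run_start]
        if run_ends:
            long_enough = i - run_start >= stuck_run
            if long_enough and flags[run_start] == QCFlag.NOT_EVALUATED:
                for j in range(run_start, i):
                    flags[j] = QCFlag.STUCK
            run_start = i
    return flags
```

Points that come back as `NOT_EVALUATED` would then be handed to the outlier detection step, which could promote them to flag 1 or 3.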
Figure 3. The data quality control algorithm.
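
The sliding-window idea described in the abstract (fit a model on recent data, and keep flagged points out of the data used for modeling) can be sketched roughly as follows. This is not the authors' algorithm: a least-squares AR(1) predictor stands in for the improved ARIMA model, and `fit_ar1`, `detect_outliers`, `window_size`, `k`, and `min_sigma` are hypothetical names and settings chosen only for illustration.

```python
from statistics import mean


def fit_ar1(window):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi, residual_std)."""
    prev, curr = window[:-1], window[1:]
    mean_prev, mean_curr = mean(prev), mean(curr)
    var = sum((p - mean_prev) ** 2 for p in prev)
    if var == 0:
        phi = 0.0  # constant window: fall back to predicting the mean
    else:
        cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
        phi = cov / var
    c = mean_curr - phi * mean_prev
    residuals = [q - (c + phi * p) for p, q in zip(prev, curr)]
    sigma = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return c, phi, sigma


def detect_outliers(series, window_size=30, k=3.0, min_sigma=1e-6):
    """Flag points that deviate from a one-step AR(1) prediction by more than k sigma.

    Returns flags using the table's values: 2 = not tested (warm-up),
    1 = passed, 3 = suspect. Suspect points are replaced by the prediction
    in the modeling data so they do not contaminate later windows.
    """
    flags = [2] * len(series)
    clean = []  # values used for modeling (observations or substituted predictions)
    for i, x in enumerate(series):
        if len(clean) < window_size:
            clean.append(x)  # warm-up: not enough history to fit a model
            continue
        window = clean[-window_size:]
        c, phi, sigma = fit_ar1(window)
        predicted = c + phi * window[-1]
        if abs(x - predicted) > k * max(sigma, min_sigma):
            flags[i] = 3
            clean.append(predicted)
        else:
            flags[i] = 1
            clean.append(x)
    return flags
```

Substituting the prediction for a suspect value is one simple way to "clean the input modeling data"; the paper may clean its modeling window differently.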
Definition of TP, FN, FP, and TN.
| Actual Data Status | Tested Result: Abnormal | Tested Result: Normal |
|---|---|---|
| Abnormal | True Positive (TP) | False Negative (FN) |
| Normal | False Positive (FP) | True Negative (TN) |
Data outlier detection effectiveness of different methods.
| Methods | TP | FN | FP | TN | Precision | Recall | F1 Score |
|---|---|---|---|---|---|---|---|
| ARIMA | 414 | 27 | 16 | 9543 | 0.9628 | 0.9388 | 0.9506 |
| 3sd method | 192 | 249 | 11 | 9548 | 0.9458 | 0.4354 | 0.5963 |
| 2sd method | 281 | 160 | 153 | 9406 | 0.6475 | 0.6372 | 0.6423 |
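
The precision and recall columns follow directly from the TP, FN, and FP counts, and the final column of the table matches the F1 score (the harmonic mean of precision and recall). The short sketch below recomputes the three scores from the counts; the function name `detection_scores` is hypothetical.

```python
def detection_scores(tp, fn, fp):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of flagged points that are true outliers
    recall = tp / (tp + fn)     # share of true outliers that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the table above: (TP, FN, FP)
counts = {
    "ARIMA": (414, 27, 16),
    "3sd method": (192, 249, 11),
    "2sd method": (281, 160, 153),
}

for method, (tp, fn, fp) in counts.items():
    p, r, f1 = detection_scores(tp, fn, fp)
    print(f"{method}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

TN does not enter any of the three scores, which is why the large number of normal points does not inflate them.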
Figure 4. Outlier detection in a subset of pH data. (a) Original time series; results of the (b) ARIMA method, (c) 3sd method, and (d) 2sd method. Red dots denote manually labeled outliers, while green circles denote outliers detected by the method.
The data interpolation accuracy of ARIMA for three measurements in two situations.
| Metric | Single-Point: Temperature | Single-Point: Conductivity | Single-Point: Pressure | Successive Multipoint: Temperature | Successive Multipoint: Conductivity | Successive Multipoint: Pressure |
|---|---|---|---|---|---|---|
| MAPE | 0.0015% | 0.0075% | 0.0226% | 0.0241% | 0.0798% | 0.0973% |
| MAE | 0.0002 | 0.0020 | 0.0032 | 0.0034 | 0.0220 | 0.0140 |
| RMSE | 0.0023 | 0.0060 | 0.0049 | 0.0114 | 0.0687 | 0.0190 |
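
For reference, the three accuracy measures in the table can be computed from paired observed and interpolated values using their standard definitions, as in the minimal sketch below. The function names and the sample numbers in the usage note are made up for illustration and are not ECSSOS data.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be nonzero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

For example, with hypothetical observed values `[10.0, 20.0]` and interpolated values `[11.0, 18.0]`, `mae` gives 1.5 and `mape` gives about 10 (percent). MAE and RMSE carry the units of the measurement, while MAPE is relative, which is why only the MAPE row is expressed in percent.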
Figure 5. (a) An example of successive multipoint interpolation in pressure data; (b) a partial enlargement of the section of (a) inside the magenta circle. Red dots denote successive interpolated values.