David Gordon, Panayiotis Petousis, Henry Zheng, Davina Zamanzadeh, Alex A. T. Bui.
Abstract
We present a novel approach for imputing missing data that incorporates temporal information into bipartite graphs through an extension of graph representation learning. Missing data is abundant in several domains, particularly when observations are made over time. Most imputation methods make strong assumptions about the distribution of the data. While newer methods may relax some of these assumptions, they may not consider temporality. Moreover, when such methods are extended to handle time, they may not generalize without retraining. We propose a joint bipartite graph approach to incorporate temporal sequence information. Specifically, observation nodes and edges carrying temporal information are used in message passing to learn node and edge embeddings and to inform the imputation task. Our proposed method, temporal setting imputation using graph neural networks (TSI-GNN), captures sequence information that can then be used within the aggregation function of a graph neural network. To the best of our knowledge, this is the first effort to use a joint bipartite graph approach that captures sequence information to handle missing data. We use several benchmark datasets to test the performance of our method under a variety of conditions, comparing against both classic and contemporary methods. We further provide insight into managing the size of the generated TSI-GNN model. Through our analysis, we show that incorporating temporal information into a bipartite graph improves the representation at the 30% and 60% missing rates, particularly when a nonlinear model is used for downstream prediction tasks on regularly sampled datasets, and that our method is competitive with existing temporal methods under different scenarios.
Keywords: deep learning; graph neural networks; imputation; irregular sampling; missing data; temporal data
Year: 2021 PMID: 34604740 PMCID: PMC8480427 DOI: 10.3389/fdata.2021.693869
Source DB: PubMed Journal: Front Big Data ISSN: 2624-909X
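The joint bipartite formulation described in the abstract (observation nodes on one side, feature nodes on the other, one edge per observed value, with each observation's time stamp available as an edge attribute) can be sketched roughly as follows. This is a simplified illustration with made-up names and a plain mean aggregation, not the authors' TSI-GNN implementation, which learns node and edge embeddings via a GNN.

```python
import numpy as np

def build_bipartite(X, times):
    """Build edge lists for a joint bipartite graph.

    X     : (n_obs, n_feat) array with np.nan marking missing entries.
    times : (n_obs,) time stamp of each observation row.
    Returns (edges, values, edge_times): one edge per *observed* cell,
    connecting observation node i to feature node j.
    """
    edges, values, edge_times = [], [], []
    n_obs, n_feat = X.shape
    for i in range(n_obs):
        for j in range(n_feat):
            if not np.isnan(X[i, j]):
                edges.append((i, j))
                values.append(X[i, j])
                edge_times.append(times[i])
    return edges, np.array(values), np.array(edge_times)

def mean_aggregate(edges, values, n_obs, n_feat):
    """One toy message-passing step: each node averages its incident
    edge values (the time attributes are carried on the edges but this
    simplified aggregation ignores them)."""
    obs_emb, obs_deg = np.zeros(n_obs), np.zeros(n_obs)
    feat_emb, feat_deg = np.zeros(n_feat), np.zeros(n_feat)
    for (i, j), v in zip(edges, values):
        obs_emb[i] += v; obs_deg[i] += 1
        feat_emb[j] += v; feat_deg[j] += 1
    return obs_emb / np.maximum(obs_deg, 1), feat_emb / np.maximum(feat_deg, 1)

# Two observations (t=0, t=1) of three features, one value missing.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 5.0]])
edges, values, edge_times = build_bipartite(X, times=np.array([0, 1]))
obs_emb, feat_emb = mean_aggregate(edges, values, *X.shape)
```

In TSI-GNN proper, these aggregated representations would pass through learned transformations over several layers, and the imputation head predicts the value of each missing (observation, feature) pair.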
FIGURE 1. Joint bipartite graph. The graph captures temporal information [e.g., across observation nodes for patient 1 (Pt. 1)] without creating actual edges between them, as well as information between the observation and feature nodes (i.e., edge attributes). Abbreviations: estimated glomerular filtration rate (eGFR), potassium (K), respiratory rate (RR), systolic blood pressure (SBP), patient 2 (Pt. 2).
FIGURE 2. Missing data imputation methods.
Dataset profiles.

| | Stocks | Energy | ICU |
| --- | --- | --- | --- |
| # of observations | 4,120 (2004–2020) | 3,001 | 8,250 (550 patients) |
| # of features (cont., cat.) | 6 (6, 0) | 28 (28, 0) | 16 (15, 1) |
| Label | Volume | Light usage | Ventilator |
| Missing rate (MR) | 30%, 60% | 30%, 60% | 30%, 60% |
| Measurement frequency* | 24 h | Every 10 min | 4 h |
| Sequence length | 21 | 24 | 15 |
| # of trainable edges (30% MR) | 602K | 2.7M | 2.6M |
| # of trainable edges (60% MR) | 345K | 1.5M | 1.5M |

*Average sampling frequency.
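The 30% and 60% missing rates in the profiles above are typically produced by masking a fraction of the observed values. The sketch below assumes a missing-completely-at-random (MCAR) mask, which is a common benchmark protocol but may differ from the exact masking scheme used in the paper.

```python
import numpy as np

def mcar_mask(shape, missing_rate, seed=0):
    """Boolean mask (True = artificially missing), MCAR at the given rate."""
    rng = np.random.default_rng(seed)
    return rng.random(shape) < missing_rate

# Toy dataset: 20 observations of 5 features, 30% masked.
X = np.arange(100, dtype=float).reshape(20, 5)
mask = mcar_mask(X.shape, 0.30)
X_missing = X.copy()
X_missing[mask] = np.nan  # imputation methods see X_missing, not X
```

The ground-truth values hidden by `mask` are what imputation quality (e.g., RMSE) is later measured against.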
Imputation methods: downstream prediction task R² (GBR/LR). An R² on imputed values close to the R² on the original values suggests a more faithful representation.
| Setting | Method | Missing rate | Stocks GBR/LR | Energy GBR/LR | ICU GBR/LR |
| --- | --- | --- | --- | --- | --- |
| n/a | | n/a | 0.588/0.33 | 0.623/0.31 | 0.657/0.66 |
| | | 30% | 0.554/0.36 | 0.732/0.70 | |
| | | 60% | 0.662/ | 0.500/0.38 | 0.840/0.78 |
| | | 30% | 0.677/0.30 | 0.504/ | 0.602/0.57 |
| | | 60% | 0.472/0.43 | | |
| | | 30% | 0.654/ | 0.352/0.24 | 0.458/0.43 |
| | | 60% | 0.754/0.39 | 0.105/0.09 | 0.223/0.21 |
| | | 30% | 0.609/ | 0.579/0.55 | |
| | | 60% | 0.511/0.26 | 0.660/ | 0.362/0.29 |
| | | 30% | 0.563/0.27 | | |
| | | 60% | 0.537/0.26 | 0.379/0.18 | |
| | | 30% | 0.372/0.24 | 0.732/0.73 | |
| | | 60% | 0.968/0.85 | | |
| | | 30% | 0.526/0.30 | 0.489/ | 0.635/0.64 |
| | | 60% | 0.372/0.26 | 0.230/0.14 | 0.613/0.61 |
| | | 30% | 0.538/ | 0.638/ | |
| | | 60% | 0.430/0.25 | 0.221/0.15 | 0.610/ |
| | | 30% | 0.614/0.37 | 0.716/0.43 | 0.732/0.71 |
| | | 60% | 0.777/ | 0.824/0.58 | 0.853/0.76 |
| | | 30% | 0.478/0.26 | 0.463/0.25 | 0.634/0.61 |
| | | 60% | 0.334/0.23 | 0.220/0.12 | 0.603/0.54 |
| | | 30% | 0.531/0.29 | 0.491/0.29 | |
| | | 60% | 0.391/0.23 | 0.209/0.12 | |
| | | 30% | 0.531/0.31 | 0.459/0.27 | 0.636/0.64 |
| | | 60% | 0.414/0.24 | 0.221/0.15 | 0.616/0.60 |
| | | 30% | 0.465/0.26 | 0.466/0.27 | 0.636/ |
| | | 60% | 0.414/0.26 | 0.233/0.15 | 0.607/ |
Our method.
Bold values indicate the best method(s), i.e., closest to the original values, per setting, dataset, missing rate, and model (nonlinear vs. linear).
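To reproduce the style of evaluation used in the table above, a gradient boosting regressor (GBR) and a linear regression (LR) are trained on imputed versus original features and their test-set R² values compared. The scikit-learn sketch below uses synthetic data, a crude zero-fill imputation, and a default train/test split, all of which are illustrative assumptions rather than the paper's setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def downstream_r2(X, y, model):
    """Fit the model on a train split and return R^2 on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model.fit(X_tr, y_tr)
    return r2_score(y_te, model.predict(X_te))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=300)

# Hypothetical "imputed" copy: 30% of entries masked MCAR, then filled
# with 0 (the approximate column mean of standard-normal features).
X_imp = X.copy()
X_imp[rng.random(X.shape) < 0.30] = 0.0

gbr_orig = downstream_r2(X, y, GradientBoostingRegressor(random_state=0))
gbr_imp = downstream_r2(X_imp, y, GradientBoostingRegressor(random_state=0))
lr_orig = downstream_r2(X, y, LinearRegression())
lr_imp = downstream_r2(X_imp, y, LinearRegression())
```

An imputation method whose `*_imp` scores sit close to the `*_orig` scores has preserved the predictive structure of the data, which is the comparison the table reports.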
Imputation methods: RMSE. A smaller RMSE is better.
| Setting | Method | Stocks 30% | Stocks 60% | Energy 30% | Energy 60% | ICU 30% | ICU 60% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | | 0.073 | 0.107 | 0.1305 | 0.2349 | 0.1045 | 0.118 |
| | | 0.036 | 0.041 | 0.1666 | 0.189 | | |
| | | 0.043 | 0.044 | 0.1684 | 0.1672 | 0.2339 | 0.2346 |
| | | 0.031 | 0.1337 | 0.1493 | 0.1777 | 0.244 | |
| | | 0.083 | 0.1364 | 0.1443 | | | |
| | | 0.071 | 0.106 | 0.132 | 0.2328 | 0.1009 | 0.1162 |
| | | 0.097 | 0.1075 | 0.1418 | 0.0901 | | |
| | | 0.0386 | 0.131 | 0.0966 | 0.1939 | 0.0826 | 0.0862 |
| | | 0.0329 | | | | | |
| | | 0.0942 | 0.1078 | | | | |
| | | 0.2299 | 0.231 | 0.2076 | 0.2070 | 0.2141 | 0.2139 |
| | | 0.0394 | 0.180 | 0.1307 | 0.2331 | 0.1395 | 0.3079 |
| | | 0.0387 | 0.131 | 0.1541 | 0.1779 | 0.0827 | 0.1150 |
| | | 0.0337 | 0.085 | 0.1463 | 0.146 | 0.0768 | |
Our method.
Bold values are the best method per setting, dataset, and missing rate.
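The RMSE values above are presumably computed between the imputed values and the ground truth over the artificially masked entries only, which is the common convention for imputation benchmarks (an assumption here, not stated in this record). A minimal sketch under that assumption:

```python
import numpy as np

def masked_rmse(X_true, X_imputed, mask):
    """RMSE restricted to the entries that were masked out (True in mask)."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: two of four entries were masked and then imputed.
X_true = np.array([[1.0, 2.0], [3.0, 4.0]])
X_imp  = np.array([[1.0, 2.5], [3.0, 3.5]])
mask   = np.array([[False, True], [False, True]])
print(masked_rmse(X_true, X_imp, mask))  # → 0.5
```

Restricting the error to the masked entries avoids rewarding methods for trivially copying the values that were never hidden.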