Dongchul Cha, MinDong Sung, Yu-Rang Park.
Abstract
BACKGROUND: Machine learning (ML) is now widely deployed in our everyday lives. Building robust ML models requires a massive amount of data for training. Traditional ML algorithms require the training data to be centralized, which raises privacy and data governance issues. Federated learning (FL) is an approach to overcome this issue. We focused on applying FL to vertically partitioned data, in which an individual's record is scattered among different sites.
Keywords: coding; data; data sharing; dataset; federated learning; machine learning; model; performance; privacy; protection; security; training; unsupervised learning; vertically incomplete data
Year: 2021 PMID: 34106083 PMCID: PMC8262549 DOI: 10.2196/26598
Source DB: PubMed Journal: JMIR Med Inform
Figure 1. Classification of federated learning. Assume a colorectal cancer patient dataset in which only the target label is gathered centrally. (a) In horizontally partitioned data, all sites share the same feature space, but different patients' records are collected at different sites. (b) In vertically partitioned data, a single patient's features are spread across different sites.
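To make the distinction concrete, the following sketch partitions a toy pandas DataFrame horizontally by rows and vertically by columns. It is an illustration only; the column names and the two-site split are invented for this example and are not taken from the paper's datasets.

```python
import pandas as pd

# Toy patient table; feature names are illustrative, not from the paper.
df = pd.DataFrame({
    "age":       [63, 58, 71, 45],
    "sex":       [1, 0, 0, 1],
    "cea_level": [4.2, 1.1, 9.8, 2.5],
    "stage":     [2, 1, 3, 1],
    "target":    [1, 0, 1, 0],   # centrally held outcome label
})
features = df.drop(columns="target")

# (a) Horizontal partitioning: every site holds the SAME features
#     but DIFFERENT patients (split by rows).
horizontal_sites = [features.iloc[:2], features.iloc[2:]]

# (b) Vertical partitioning: every site holds the SAME patients
#     but DIFFERENT features (split by columns).
vertical_sites = [features[["age", "sex"]], features[["cea_level", "stage"]]]

print([s.shape for s in horizontal_sites])  # [(2, 4), (2, 4)]
print([s.shape for s in vertical_sites])    # [(4, 2), (4, 2)]
```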
Figure 2. The autoencoder network, an unsupervised machine learning algorithm. Input and output are the same; thus, they share an identical feature space. (a) The conventional autoencoder has a latent space dimension smaller than the input dimension, whereas (b) the overcomplete autoencoder has a latent space dimension larger than the input dimension.
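A minimal sketch of such an autoencoder, assuming PyTorch and the 64-128-64 layer shape listed in the dataset table below. The framework, layer sizes, and class name are assumptions for illustration, not the authors' published code.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Symmetric autoencoder: input -> hidden -> code -> hidden -> input.

    With code_dim < in_dim this is the conventional form in Figure 2a;
    with code_dim > in_dim it is the overcomplete form in Figure 2b.
    """
    def __init__(self, in_dim: int, hidden_dim: int = 64, code_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, code_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)           # latent (code-layer) representation
        return self.decoder(z), z     # reconstruction and code

# Example: a site holding 5 features trains an overcomplete autoencoder;
# the reconstruction target is the input itself (unsupervised training).
model = AutoEncoder(in_dim=5)
x = torch.randn(32, 5)
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)
```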
Dataset composition and training parameters; each dataset was divided to simulate vertically partitioned data.
| Dataset | Division | Dataset size (number of rows) | Feature dimensions per site | Autoencoder layers | Aggregated dimension (latent features × rows) |
| Adult income | 3 sites | 23,374 | 5, 5, 4 | 64-128-64 | 384×23,374 |
| Schwannoma | 3 sites | 50 | 7, 3, 5 | 64-128-64 | 384×50 |
| eICU | 7 sites | 15,762 | 3, 4, 9, 3, 3, 4, 6 | 64-128-64 | 896×15,762 |
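The aggregated dimension column is simply the per-site 128-dimensional code layers stacked side by side, for example 3 sites × 128 = 384 latent features over the 23,374 adult income rows. A shape-only NumPy sketch of that bookkeeping, with random placeholders standing in for real latent data:

```python
import numpy as np

n_rows, code_dim, n_sites = 23374, 128, 3   # adult income split, per the table above

# Each site emits a (rows x code_dim) latent matrix for the same patients.
site_codes = [np.random.rand(n_rows, code_dim) for _ in range(n_sites)]

# Aggregation concatenates along the feature axis.
aggregated = np.concatenate(site_codes, axis=1)
print(aggregated.shape)   # (23374, 384), i.e., the 384 x 23,374 entry transposed
```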
Figure 3. The workflow of vertical federated learning using an overcomplete autoencoder, illustrated with the UCI adult income dataset. The dataset consists of 14 features and 1 target label (income). The original data are vertically divided into several datasets, three in this case, to simulate data distribution among different sites. The heatmaps show the prevalence of each feature. Each site (a, b, c) trains an autoencoder and transmits its latent data, which are distributed differently, as seen in the heatmaps (a', b', c'). The latent data are aggregated at a server, which performs model training. The accuracy of models trained on the original data versus the aggregated latent data is then compared.
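A hedged, end-to-end sketch of this workflow: each site fits an overcomplete autoencoder on its own partial features, only the code-layer outputs leave the site, and the server trains a classifier on the concatenated latent data. PyTorch, the Adam optimizer settings, and the logistic-regression classifier are illustrative assumptions; this excerpt does not state the authors' exact configuration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

def train_site_autoencoder(x_site: np.ndarray, code_dim: int = 128,
                           epochs: int = 50) -> np.ndarray:
    """Train one site's overcomplete autoencoder locally; return its latent codes."""
    x = torch.tensor(x_site, dtype=torch.float32)
    in_dim = x.shape[1]
    encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                            nn.Linear(64, code_dim), nn.ReLU())
    decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(),
                            nn.Linear(64, in_dim))
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()),
                           lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(x)), x)  # reconstruct input
        loss.backward()
        opt.step()
    with torch.no_grad():
        z = encoder(x)
    return z.numpy()                     # only the code-layer output leaves the site

def vertical_federated_fit(site_features, y):
    """site_features: per-site arrays with the same rows but different columns;
    y: the centrally held target label (as in Figure 1)."""
    latent = [train_site_autoencoder(x) for x in site_features]   # local step
    aggregated = np.concatenate(latent, axis=1)                   # server-side step
    return LogisticRegression(max_iter=1000).fit(aggregated, y)   # assumed classifier
```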
Classification results of the three datasets.
| Site | | Adult income (Accuracy) | Adult income (AUROCa) | Schwannoma (Accuracy) | Schwannoma (AUROC) | eICU (Accuracy) | eICU (AUROC) |
| Central | Before VFLb | 0.83 | 0.91 | 0.90 | 0.84 | 0.81 | 0.89 |
| | After VFLc | 0.82 | 0.90 | 0.82 | 0.84 | 0.80 | 0.88 |
| | Differenced (%) | –1.20 | –1.10 | –8.89 | 0 | –1.23 | –1.12 |
| Site A | Before VFL | 0.81 | 0.89 | 0.82 | 0.81 | 0.70 | 0.72 |
| | After VFL | 0.77 | 0.83 | 0.78 | 0.86 | 0.70 | 0.72 |
| | Difference (%) | –4.94 | –6.74 | –4.88 | +6.17 | 0 | 0 |
| Site B | Before VFL | 0.81 | 0.90 | 0.76 | 0.82 | 0.73 | 0.80 |
| | After VFL | 0.77 | 0.83 | 0.78 | 0.83 | 0.72 | 0.79 |
| | Difference (%) | –4.94 | –7.78 | +2.63 | +1.22 | –1.37 | –1.25 |
| Site C | Before VFL | 0.67 | 0.73 | 0.48 | 0.60 | 0.55 | 0.57 |
| | After VFL | 0.76 | 0.83 | 0.62 | 0.71 | 0.56 | 0.57 |
| | Difference (%) | +13.43 | +13.70 | +29.17 | +18.33 | +1.82 | 0 |
aAUROC: area under the receiver operating characteristics curve.
bVFL: vertical federated learning.
cCorresponding to the latent representation of original data (central, A, B, or C) in the code layer.
dThe difference is the relative change (%) between the before-VFL and after-VFL values in the classification tasks.
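As a quick check, the difference rows are consistent with a relative percentage change between the before-VFL and after-VFL values; for the adult income central row, (0.82 − 0.83) / 0.83 × 100 ≈ −1.20:

```python
before, after = 0.83, 0.82
print(round((after - before) / before * 100, 2))   # -1.2, matching the table's -1.20
```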