
Federated learning of molecular properties with graph neural networks in a heterogeneous setting.

Abstract

Chemistry research incurs high material and computational costs. Institutions are interested in differing classes of molecules, creating heterogeneous data that cannot be easily joined by conventional methods. This work introduces federated heterogeneous molecular learning. Federated learning allows end users to build a global model collaboratively while keeping their training data isolated. We first simulate a heterogeneous federated-learning benchmark (FedChem) by jointly performing scaffold splitting and latent Dirichlet allocation on existing datasets. Our results on FedChem show that significant learning challenges arise when working with heterogeneous molecules across clients. We then propose a method to alleviate the problem: Federated Learning by Instance reweighTing (FLIT(+)). FLIT(+) can align local training across clients. Experiments conducted on FedChem validate the advantages of this method. This work should enable a new type of collaboration for improving artificial intelligence (AI) in chemistry that mitigates concerns about sharing valuable chemical data.

Entities: Chemical

Keywords: federated learning; graph neural network; molecular property prediction

Year: 2022 PMID: 35755872 PMCID: PMC9214329 DOI: 10.1016/j.patter.2022.100521

Source DB: PubMed Journal: Patterns (N Y) ISSN: 2666-3899

Introduction

There is an increasing trend to apply machine learning to molecular-property prediction, either to avoid the expense of experiments or to reduce the tremendous computational cost of accurate quantum-chemical calculations. Much of this work applies graph neural networks to predict molecular properties [1-6]. These works assume a central server that has access to all data. However, such a centralized-learning scenario may not represent how institutions share chemical data. Due to intellectual-property concerns and the intrinsic value of chemical data, it can be difficult for academic labs, national labs, and private institutions to share their molecule datasets. We propose federated learning to obtain a generalized global model without access to private molecular data. In federated learning, local models are trained on the client side with local data and are then aggregated into a global model on the server side, which never sees the data. One of the main concerns for federated molecular-property prediction is that client data are heterogeneously distributed, since institutions focus on specific categories of molecules according to their research interests. For example, institutions may wish to collaborate to construct an accurate model of the pharmacokinetic clearance time of small molecules, as shown in Figure 1. Each institution studies specific drug-like molecules and their variants for its own therapeutic targets. No institution can share its molecules, but each would benefit from a model for clearance time. In this example, the trained local models will deviate heavily from each other, so it is sub-optimal to directly apply vanilla federated-learning methods, e.g., Federated Averaging (FedAvg), to aggregate the heterogeneous local models. Although several works have been proposed to handle the heterogeneity problem, a broader problem is the lack of heterogeneous federated molecular learning benchmarks on which to judge these methods for chemical data.

Figure 1

Heterogeneous federated molecular learning where three institutions focus on different types of molecules

The server has no access to training data.

This paper first proposes a federated heterogeneous molecular learning benchmark, FedChem. FedChem simulates heterogeneous settings based on scaffold splitting and latent Dirichlet allocation (LDA). We first adopt scaffold splitting to split the molecules based on their two-dimensional structure, so that molecules with similar structures are grouped together. A heterogeneous setting is then obtained by applying LDA to the scaffold subgroups; LDA is a commonly used technique for simulating heterogeneous settings in conventional federated classification tasks. We benchmark existing federated-learning methods on FedChem and observe a remarkable performance degradation for the commonly used method FedAvg. We then propose Federated Learning by Instance reweighTing (FLIT) to alleviate the heterogeneity problem by adapting focal loss to federated learning. The motivation of FLIT is that local models are trained to overfit their own data, which, however, do not share the same distribution as the global data. That is, the predictions of local models are over-confident for certain types of molecules and highly uncertain for others. FLIT aligns client training by putting more weight on the uncertain cases, using both the local model and the received global model. As a result, the locally trained models are more consistent with each other, and federated-learning performance eventually improves. We measure the uncertainty of training samples by their loss values and by the prediction consistency among neighboring samples, which yields two methods, FLIT and FLIT+ (FLIT(+) being the abbreviation for both). Our experiments on FedChem validate the advantages of FLIT(+) over existing federated-learning methods.

Our main contributions are summarized as follows: (1) We propose a federated heterogeneous molecular learning benchmark based on MoleculeNet, termed FedChem. FedChem employs scaffold splitting and LDA to simulate the heterogeneous settings. (2) We propose the FLIT(+) algorithms to alleviate the heterogeneity problem. FLIT(+) aligns client training by putting more weight on uncertain samples. (3) We conduct experiments to benchmark the proposed and existing federated-learning methods on FedChem. Comprehensive experiments validate the effectiveness of the proposed methods.

Related work

Federated learning

Federated learning was proposed by McMahan et al. and has been applied in a wide range of fields, including healthcare, biometrics, and natural images and videos. As a popular method, FedAvg aggregates the parameters of local models element-wise to obtain a global one. However, recent studies indicate that FedAvg may not handle the heterogeneity problem properly. Two categories of methods have been developed to alleviate the problem: improvements to server-side aggregation [17-24] and client-side regularization methods [25-30]. Client-side methods can use the local training data and are attracting increasing attention. Our method also follows this line of research. Federated proximal (FedProx) regularizes local learning with a proximal term that encourages the updated local model not to deviate significantly from the global model. A similar idea is adopted in personalized federated learning. SCAFFOLD adopts additional control variates to alleviate the gradient dissimilarity across communication rounds. Federated model distillation transfers the soft predictions on a shared dataset to reduce the communication cost and regularizes local training with a distillation loss. Federated meta-learning incorporates model-agnostic meta-learning (MAML) into local training to improve the generalization ability of local models. Robust federated learning has been studied by several works. FLRA, a federated-learning framework robust to affine distribution shifts, conducts adversarial training on clients. Most client-side federated-learning methods add a regularization term that restricts the local training process so that the optimized local model does not deviate significantly from the global one. Consequently, the local models are more consistent with each other, and this consistency can benefit server-side aggregation. However, the regularization may also hinder local optimization and lead to sub-optimal results for local training.

Our method does not impose constraints on local training; instead, inspired by recent work [32-35], we reweight the local training samples instance-wise to align the local data distribution with the global one. Heterogeneous federated learning is related to federated domain adaptation (FDA) [36-38]. FDA aims to improve performance on specific target domains, while general heterogeneous federated learning aims to improve performance on all training data. Several works focus on federated graph neural networks [39-44] and federated molecular-property prediction. GraphFL applies MAML to improve the robustness of training. The method of Xie et al. alleviates the heterogeneity problem by aggregating clients’ models group-wise. However, existing work does not study federated molecular learning in heterogeneous settings where the clients’ datasets are not independent and identically distributed (non-IID) in molecular structure and properties.

Deep molecular-property prediction

Graph neural networks are commonly adopted for molecular learning [3-5]. The message-passing neural network (MPNN) iteratively propagates vertex features through message-passing layers. SchNet adopts continuous-filter convolutions to achieve E(3)-invariant molecular learning. DimeNet and DimeNet++ include directional information when training graph neural networks for better performance. Other works apply SO(3)-equivariant message-passing layers to predict the properties of molecular data. EQNN proposes a new structure that efficiently achieves E(n) equivariance. We employ MPNN and SchNet for client-side training in the proposed federated molecular learning framework FedChem, and our framework can seamlessly integrate other models for client-side training, e.g., other graph neural networks, sequence models, etc.

Results and discussion

Notations and settings

We first briefly describe federated heterogeneous molecular learning (FedChem). We assume that there are L institutions that work on the same tasks but with largely different groups of molecules. That is, the data are distributed heterogeneously across institutions. Each institution develops a neural network for molecular-property prediction. A neural network trained only on an institution's own data may generalize poorly, so the institutions intend to collaborate on a global model without sharing their data with the central server or other participants. We propose to apply federated learning to obtain a global model for all participants without access to clients’ data. Formally, we denote the overall dataset as $D = \{D_1, \dots, D_L\}$, where $D_l = \{(G_i, y_i)\}$ is the local dataset owned by the l-th institution/client, which may not share the same distribution as the overall data. $G_i = (V_i, E_i)$ is the i-th molecule in graph representation with vertices $V_i$, edges $E_i$, and ground-truth label $y_i$. The ground truth can be either continuous values for regression tasks or categorical values for classification tasks. We utilize a local graph neural network $F_l$ to handle the data of the l-th client; it is implemented with MPNN or SchNet. To enable the clients to collaborate with each other, a central server receives and aggregates the uploaded local networks into a global one, $F_g = \mathrm{Agg}(F_1, \dots, F_L)$, where $F_g$ is the global model and $\mathrm{Agg}$ is the aggregation function, e.g., FedAvg, federated optimization, federated distillation, ensemble distillation for model fusion (FedDF), federated matched averaging, etc. Note that the central server contains no training data and cannot access any local data.

FedChem simulates heterogeneous federated molecular learning with existing datasets, e.g., MoleculeNet. Our method relies on scaffold splitting to group molecules based on their structure (graph). Molecules with similar structures are grouped into a scaffold subset. We first group the molecules into scaffold groups by scaffold splitting and then assign the samples from each group to clients according to the unbalanced partition method LDA. We detail the approach to generating heterogeneous settings in the experimental section. Our method of generating a heterogeneous dataset differs from typical existing methods, which simulate label-distribution shift. For example, Karimireddy et al. and Wang et al. split samples across clients based on class, which makes the label distributions of the local datasets inconsistent with the global label distribution. In reality, institutions focus on molecules with similar structures via processes like lead optimization or hit finding. Thus, we typically see structurally heterogeneous molecules on the client side (domain shift), while the label distributions among local clients can be similar. To simulate this structural heterogeneity with existing centralized datasets, we adopt scaffold splitting and do not rely on the ground-truth labels. Intuitively, samples from different scaffold subsets are analogous to samples from different domains in general machine-learning tasks: molecules (images) within a scaffold subset (domain) share similar structures (styles) but show different chemical properties (ground-truth labels). We illustrate the scaffold splitting to help readers better understand our heterogeneity simulation method. Moreover, it is non-trivial to generalize existing heterogeneous federated dataset simulation methods to regression and multi-label tasks, while our method can be easily adapted to any of these problems. We benchmark several existing federated-learning methods on FedChem and observe that the heterogeneity problem brings significant challenges to federated molecular learning.
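To make the two-stage partition concrete, the following is a minimal sketch of one plausible reading of it, not the authors' implementation. It assumes that a scaffold label has already been computed for every molecule (for example with a cheminformatics toolkit), and it draws, for each scaffold group, the client proportions from a symmetric Dirichlet distribution with concentration α. The function names `dirichlet` and `partition_by_scaffold` are hypothetical.

```python
import random
from collections import defaultdict


def dirichlet(alpha, k, rng):
    """Sample a k-dimensional symmetric Dirichlet(alpha) vector via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # A very small alpha can underflow to all zeros; fall back to a one-hot vector.
        draws = [0.0] * k
        draws[rng.randrange(k)] = 1.0
        total = 1.0
    return [d / total for d in draws]


def partition_by_scaffold(scaffold_ids, num_clients, alpha, seed=0):
    """Assign sample indices to clients, one scaffold group at a time.

    scaffold_ids[i] is an assumed, precomputed scaffold label for molecule i.
    For each scaffold group, client proportions are drawn from Dirichlet(alpha);
    a small alpha concentrates a group on few clients (more heterogeneous).
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, sid in enumerate(scaffold_ids):
        groups[sid].append(idx)

    clients = [[] for _ in range(num_clients)]
    for sid in sorted(groups):
        members = groups[sid]
        rng.shuffle(members)
        probs = dirichlet(alpha, num_clients, rng)
        for idx in members:
            client = rng.choices(range(num_clients), weights=probs, k=1)[0]
            clients[client].append(idx)
    return clients


# Toy usage: 200 molecules spread over 10 made-up scaffold groups, 4 clients.
scaffolds = [i % 10 for i in range(200)]
parts = partition_by_scaffold(scaffolds, num_clients=4, alpha=0.1)
```

Because the assignment never looks at the labels, the same routine applies unchanged to regression, classification, and multi-label tasks.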

Federated learning with FedChem

The basic training pipeline for FedChem is as follows. We first initialize a global model $F_g$ on the server side. Then, in each federated-learning communication round, (1) the server broadcasts the global model $F_g$ to the clients; (2) the clients train in parallel, where the l-th client trains on its own data $D_l$ to obtain an updated local model $F_l$; and (3) the server collects the updated local models from the clients and aggregates them into a global one as $F_g = \mathrm{Agg}(F_1, \dots, F_L)$. We repeat steps 1-3 for C communication rounds to obtain the final global model. We adopt FedAvg for server-side aggregation throughout the paper, but FedChem can be easily extended to other aggregation methods. We summarize the training procedure for federated learning with FedChem in Algorithm 1, taking FedAvg as the aggregation method. Note that the server may select a subset of clients during each communication round for scalability.

Algorithm 1

Input: number of clients L, number of local updates K, number of communication rounds C.
Output: global model $F_g$.
1: Server initializes a global model $F_g$ (server init.)
2: while communication round $c \le C$ do
3:   Server broadcasts $F_g$ to clients
4:   $F_l \leftarrow F_g$ (client init.)
5:   for $l = 1$ to L in parallel do (client update)
6:     for $k = 1$ to K do (update for K steps)
7:       Sample a minibatch from $D_l$
8:       Update local model $F_l$ by gradient descent
9:     end for
10:    Client sends updated model $F_l$ to server
11:  end for
12:  Server gets $F_g = \mathrm{Agg}(F_1, \dots, F_L)$ (server update)
13: end while
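The broadcast, local-update, and aggregation loop can be sketched in a few lines. The example below is an illustrative toy under stated assumptions rather than the paper's code: the model is a linear regressor stored as a plain list of weights instead of a graph neural network, each minibatch has size one, and the aggregation weights each client by its dataset size, which is the usual FedAvg convention but is an assumption here. All function names are hypothetical.

```python
import random


def local_update(weights, data, steps, lr, rng):
    """Run `steps` SGD updates of a linear model y = w . x on one client's data."""
    w = list(weights)  # client init: copy the broadcast global model
    for _ in range(steps):
        x, y = rng.choice(data)  # minibatch of size 1 for brevity
        pred = sum(wi * xi for wi, xi in zip(w, x))
        grad = [2.0 * (pred - y) * xi for xi in x]  # gradient of the squared error
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def fedavg(local_models, sizes):
    """Element-wise average of local parameters, weighted by local dataset size."""
    total = sum(sizes)
    dim = len(local_models[0])
    return [sum(m[j] * n for m, n in zip(local_models, sizes)) / total
            for j in range(dim)]


def train(clients, dim, rounds, steps, lr=0.05, seed=0):
    """Alternate client updates and server aggregation for `rounds` rounds."""
    rng = random.Random(seed)
    global_w = [0.0] * dim  # server init
    for _ in range(rounds):
        local_models = [local_update(global_w, d, steps, lr, rng) for d in clients]
        global_w = fedavg(local_models, [len(d) for d in clients])
    return global_w


# Two clients observe different input regions of the same target y = 2*x0 - x1.
clients = [
    [((1.0, 0.0), 2.0), ((2.0, 0.0), 4.0)],
    [((0.0, 1.0), -1.0), ((0.0, 2.0), -2.0)],
]
w = train(clients, dim=2, rounds=50, steps=5)
```

In this toy, neither client can recover both coefficients on its own, yet the averaged model approaches the shared target because the server only ever combines parameters, never data.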

Client-side updates

For completeness, we describe the typical training steps used to update the graph neural network (GNN) model during client-side training. We adopt MPNN set-to-set (MPNNs2s) and SchNet for molecule-level property prediction in our experiments; other popular models (such as DimeNet, GIN, Graph Convolutional Network [GCN], etc.) can also be unified in FedChem. A molecule-level GNN usually contains two phases: a message-passing phase and a readout phase. The message-passing phase allows vertices to propagate and collect information from their neighbors through the graph and usually consists of two steps: message generation and vertex update. Formally, given the l-th client model $F_l$ with T message-passing layers and a sampled graph $G = (V, E)$ (we omit the sample subscript i), we define the message-passing function on the i-th vertex as

$m_i^{t+1} = \sum_{w \in N(i)} M_t(h_i^t, h_w^t, e_{iw})$ (Equation 1)

and the vertex update function as

$h_i^{t+1} = U_t(h_i^t, m_i^{t+1})$ (Equation 2)

where $h_i^t$ denotes the representation of the i-th vertex in the t-th layer of $F_l$, $e_{iw}$ denotes the edge between the i-th and w-th vertices, and $N(i)$ denotes the set of neighbors of vertex i in graph $G$. $M_t$ generates the message by aggregating the features of vertex i and its neighbors together with the edges between them. $U_t$ updates the i-th vertex by transforming its original features $h_i^t$ and the received message $m_i^{t+1}$. Different GNN models implement $M_t$ and $U_t$ differently. For example, in GCN the message function weights each neighbor's features by the corresponding entry of $\hat{A}$ and the update function applies $W$ to the received message, where $W$ is a linear layer and $\hat{A}$ is the Laplacian-regularized adjacency matrix. SchNet implements the message function with a continuous-filter layer and the update function with a vertex(atom)-wise convolutional module. The message-passing phase aggregates and transforms the vertex features into high-level representations. After T message-passing layers, we adopt a readout function $R$ to aggregate the vertex representations into a graph-level representation as

$h_G = R(\{h_i^T : i \in V\})$ (Equation 3)

$R$ should be permutation invariant and can be implemented with either a simple sum pooling or a learnable neural network. The graph-level representation $h_G$ is further used to obtain an estimate $\hat{y}$ of the ground-truth molecular property $y$.
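The two phases can be illustrated with a parameter-free toy, given here as a sketch rather than as MPNN or SchNet themselves. It assumes a message function that multiplies each neighbor's features by a scalar edge weight, an update function that adds the message to the current features, and sum pooling as the readout; real models replace these with learned layers. The function names are hypothetical.

```python
def message_passing(h, edges, num_layers):
    """h: {vertex: feature list}; edges: {(i, w): edge weight}, stored in both directions."""
    for _ in range(num_layers):
        new_h = {}
        for i, h_i in h.items():
            # Message: sum over neighbors w of M(h_i, h_w, e_iw) = e_iw * h_w.
            m = [0.0] * len(h_i)
            for (a, w), e in edges.items():
                if a == i:
                    m = [mj + e * hj for mj, hj in zip(m, h[w])]
            # Update: U(h_i, m_i) = h_i + m_i, a stand-in for a learned layer.
            new_h[i] = [x + y for x, y in zip(h_i, m)]
        h = new_h
    return h


def readout(h):
    """Sum pooling over vertices, which is permutation invariant by construction."""
    dim = len(next(iter(h.values())))
    return [sum(v[j] for v in h.values()) for j in range(dim)]


# Path graph 0 - 1 - 2 with one scalar feature per vertex.
features = {0: [1.0], 1: [2.0], 2: [3.0]}
edges = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 1.0, (2, 1): 1.0}
graph_repr = readout(message_passing(features, edges, num_layers=1))

# The same graph with relabeled vertices gives the same graph-level output.
features_perm = {2: [1.0], 0: [2.0], 1: [3.0]}
edges_perm = {(2, 0): 1.0, (0, 2): 1.0, (0, 1): 1.0, (1, 0): 1.0}
graph_repr_perm = readout(message_passing(features_perm, edges_perm, num_layers=1))
```

Relabeling the vertices changes the order in which features are visited but not the pooled result, which is the property the readout function is required to have.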

Federated learning by instance reweighting FLIT(+)

According to our experiments on the proposed heterogeneous federated-learning benchmark FedChem, heterogeneity brings significant difficulties to federated molecular learning. This section proposes a method to alleviate the heterogeneity problem, namely FLIT. FLIT adapts the formulation of focal loss to federated learning by involving the global model in the local training objectives, and it aligns local training across clients by focusing on uncertain samples. We illustrate the motivation of FLIT in Figure 2.

Figure 2

Illustration for the motivation of FLIT

We assume two clients as A and B, and the local data on these clients do not share the same distribution as the global one. Local models trained on biased local data will overfit the majority groups of data and underfit others. FLIT measures each sample’s prediction confidence and puts more weight on the uncertain data. As a result, the local data distribution will be better aligned to the global one, and the trained local models will also be more consistent with each other.

Learning to reweight training samples is widely used in curriculum learning, hard-sample mining, domain generalization, debiasing, model calibration, adversarial defense, etc. Our method is closely related to focal loss and worst-case optimization. Mukhoti et al. point out that focal loss can align the objective value with the prediction confidence. GroupDRO improves model generalization by assigning more weight to the groups with the worst performance. FLIT relies on an instance-reweighting framework to improve federated molecular-property prediction in a heterogeneous setting. The basic observation behind FLIT is that, under heterogeneous settings, the local model is trained to overfit the small-scale data at hand. Therefore, the local model will be over-confident on the majority groups of local training samples and may perform poorly, even worse than the received global model, on the molecules that are rare at the client side. As a result, the local models trained on different clients deviate significantly from each other, and this inconsistency markedly degrades the performance of the global model, which is aggregated from the local models in a data-free manner. The sub-optimal performance of FedAvg in this setting is widely acknowledged by existing studies. To alleviate the problem, FLIT uses the local and global models to put more weight on samples with low prediction confidence. FLIT explores two ways to define prediction confidence: by the loss value alone (FLIT), or by the loss value augmented with the prediction consistency among neighbors (FLIT+). By focusing on the identified uncertain samples, FLIT(+) makes local training more consistent across clients and eventually leads to better federated-learning performance.

Federated learning by instance reweighting

By jointly using the local model $F_l$ and the global model $F_g$, FLIT reweights training samples to align the biased local data distribution with the global one, so that the local models across clients are eventually better aligned and perform better. Given a molecule $x$ with label $y$ sampled from the dataset $D_l$ of the l-th client, the original focal loss for binary classification tasks is defined as

$\ell_{\mathrm{focal}}(x, y) = -(1 - p_t)^{\gamma} \log(p_t)$ (Equation 4)

where $p_t$ is defined based on the predicted probability $p$ for molecule $x$ as $p_t = p$ if $y = 1$ and $p_t = 1 - p$ otherwise. By substituting the binary cross-entropy loss $\ell_{\mathrm{BCE}}(x, y) = -\log(p_t)$ into Equation 4, we have

$\ell_{\mathrm{focal}}(x, y) = (1 - \exp(-\ell_{\mathrm{BCE}}(x, y)))^{\gamma} \, \ell_{\mathrm{BCE}}(x, y)$ (Equation 5)

A generalized formulation for instance reweighting can then be obtained as

$\ell_{\mathrm{FLIT}}(x, y) = (1 - \exp(-\omega(x)))^{\gamma} \, \ell(F_l(x), y)$ (Equation 6)

where $\omega(x)$ is a non-negative function that indicates the uncertainty of training samples. It is defined in Equation 7 by jointly utilizing the local model $F_l$ and the global model $F_g$ through $\phi(x; F)$, which indicates the prediction uncertainty of $x$ with the model $F$. Equation 7 puts more weight on a sample if the updated local model is less confident about it than the global model. We note that $\omega$ can take other formulations; we implement it with Equation 7 for simplicity. Moreover, for FLIT, we follow the focal loss and define $\phi$ as the loss value, i.e.,

$\phi(x; F) = \ell(F(x), y)$ (Equation 8)

We substitute Equation 8 into Equations 7 and 6, and the resulting method is termed FLIT. Compared with the vanilla focal loss, FLIT integrates the global model into local training, which turns out to benefit federated learning according to our experiments.
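The relation between focal loss and the generalized reweighting can be checked numerically. The sketch below is illustrative only. The identity between focal loss and the exponential form follows from the definition of binary cross-entropy, but the helper `flit_omega` uses an assumed form of the uncertainty weight, namely the positive part of the local loss minus the global loss, chosen only because it puts more weight on samples where the local model is less confident than the global one. It should not be read as the exact definition used by FLIT, and all function names are hypothetical.

```python
import math


def bce(p, y):
    """Binary cross-entropy for predicted probability p of class 1 and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma):
    """Focal loss, -(1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def reweighted_loss(loss, omega, gamma):
    """Generalized instance reweighting: (1 - exp(-omega))^gamma * loss."""
    return (1.0 - math.exp(-omega)) ** gamma * loss


def flit_omega(loss_local, loss_global):
    """ASSUMED uncertainty weight: positive part of (local loss - global loss)."""
    return max(loss_local - loss_global, 0.0)


# Focal loss is recovered when omega is the sample's own cross-entropy.
loss = bce(0.3, 1)
same = math.isclose(focal_loss(0.3, 1, gamma=2.0), reweighted_loss(loss, loss, gamma=2.0))

# A sample the local model fits worse than the global model receives a positive weight.
omega_hard = flit_omega(loss_local=bce(0.3, 1), loss_global=bce(0.8, 1))
# A sample the local model already fits better receives zero weight under this assumption.
omega_easy = flit_omega(loss_local=bce(0.9, 1), loss_global=bce(0.6, 1))
```

Under this assumed form, samples that the local model already fits better than the global model contribute nothing when γ > 0, while setting γ = 0 removes the reweighting altogether.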

FLIT+

An alternative way to define $\phi$ for a sample $x$ is the prediction discrepancy between the sample and its neighbors. Intuitively, the larger the discrepancy, the less confident the model is in predicting the sample. To measure the prediction discrepancy within a neighborhood, we search for the data pair with the largest prediction discrepancy in that neighborhood. Since directly searching for exact neighbors is computationally expensive and is implausible with the biased local dataset, we instead adopt adversarial neighbors, inspired by virtual adversarial training (VAT). An adversarial neighbor is similar to $x$ in terms of the input but has the most different prediction. Concretely, we measure the discrepancy for $x$ by adversarial learning with a given model $F$ as

$\phi_{\mathrm{VAT}}(x; F) = \max_{\|r\| \le \epsilon} \mathcal{D}(F(x), F(x + r))$ (Equation 9)

where $\epsilon$ is a small positive value, $r$ is optimized with a fixed step size, and $\mathcal{D}$ can be the Kullback-Leibler (KL) divergence for classification or the Euclidean distance for regression. Equation 9 measures the discrepancy between the predictions for the molecule with graph $G$ and for its virtual adversarial neighbor, which is similar to $x$ (since ε is small) but has the most different prediction. We optimize r on the positions for QM9 and on the vertex features for other datasets. We omit the detailed steps for optimizing Equation 9; please refer to Miyato et al. for details. We jointly use the loss value and the discrepancy defined in Equation 9 and obtain

$\phi(x; F) = \ell(F(x), y) + \lambda \, \phi_{\mathrm{VAT}}(x; F)$ (Equation 10)

where λ is a hyperparameter. By substituting this formulation into Equation 7, we obtain $\omega(x)$ to measure the uncertainty of the training samples, and accordingly, we obtain FLIT+ by optimizing the objective in Equation 11, which reweights the training loss and also includes the discrepancy term. Including the discrepancy term in the training objective is essential to make the neighborhood prediction consistency a valid uncertainty measurement. Moreover, in experiments, we notice that federated learning can benefit from virtual adversarial training alone, i.e., without instance reweighting. This should be attributed to the fact that virtual adversarial training can improve the generalization ability of the local model and can be regarded as another way to align local training implicitly. Detailed results and analysis can be found in the experimental procedures. We use FLIT(+) to denote both FLIT and FLIT+. We summarize FLIT(+) for client update in Algorithm 2.

Algorithm 2

Input: global model $F_g$, local dataset $D_l$, γ.
Output: updated local model $F_l$.
1: Save $\phi(x; F_g)$ for the local samples by Equation 8 or Equation 10
2: Init. $F_l \leftarrow F_g$
3: for $k = 1$ to K do (train on the l-th client)
4:   Sample a minibatch from $D_l$
5:   Calculate $\phi(x; F_l)$ by Equation 8 (or by Equation 10)
6:   Obtain $\omega(x)$ by Equation 7
7:   Normalize $\omega(x)$
8:   Update $F_l$ by optimizing Equation 6 (or Equation 11)
9:   Update the moving average of $\omega$
10: end for
11: Client sends updated model $F_l$ to server
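The idea of a virtual adversarial neighbor can be illustrated on a toy regression model. The sketch below is a simplified stand-in for the procedure of Miyato et al. and makes several assumptions: the model is any Python function from a feature vector to a scalar, the discrepancy is the squared difference between predictions, the adversarial direction is estimated with a single power-iteration step, and gradients are taken by finite differences instead of backpropagation. The function name and the parameters `xi` and `h` are hypothetical.

```python
import math
import random


def vat_discrepancy(f, x, eps=0.1, xi=0.1, h=1e-3, seed=0):
    """Approximate max over ||r|| <= eps of (f(x) - f(x + r))^2 with one power-iteration step."""
    rng = random.Random(seed)
    d = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(v * v for v in d))
    d = [v / norm for v in d]  # random unit probe direction

    def dist(direction, scale):
        shifted = [xj + scale * dj for xj, dj in zip(x, direction)]
        return (f(x) - f(shifted)) ** 2

    # Finite-difference gradient of the discrepancy with respect to the probe direction.
    grad = []
    for j in range(len(d)):
        plus = d[:j] + [d[j] + h] + d[j + 1:]
        minus = d[:j] + [d[j] - h] + d[j + 1:]
        grad.append((dist(plus, xi) - dist(minus, xi)) / (2.0 * h))

    gnorm = math.sqrt(sum(g * g for g in grad))
    if gnorm == 0.0:
        return 0.0  # the model is locally flat around x
    r_adv = [eps * g / gnorm for g in grad]  # adversarial perturbation of norm eps
    return dist(r_adv, 1.0)


def linear_model(x):
    """Toy regressor f(x) = 3*x0 + 4*x1, whose worst-case direction is known in closed form."""
    return 3.0 * x[0] + 4.0 * x[1]


phi_vat = vat_discrepancy(linear_model, [1.0, 2.0], eps=0.1)
```

For a linear model the worst-case perturbation of norm ε points along the weight vector, so the discrepancy equals ε² times the squared weight norm, which makes the toy easy to check by hand.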

Implementation details

Since the scale of $\omega$ may vary significantly, especially for regression tasks, it is not appropriate to directly apply Equations 6 and 11 to general tasks. We propose to normalize $\omega$ by its moving average $\bar{\omega}$, i.e., to use $\omega / \bar{\omega}$, where $\bar{\omega}$ is a moving average, with coefficient β, of the mean of $\omega$ over each minibatch and B is the minibatch size; β is set to 0.8 in this paper. Moreover, we note that the prediction and the discrepancy for the received global model only need to be calculated once per communication round and thus add little computational cost.
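A moving-average normalization of this kind can be sketched as follows. This is an illustrative sketch, not the authors' code, and it rests on assumptions: the coefficient β multiplies the previous average (the usual exponential-moving-average convention), the average is initialized with the mean of the first minibatch, and each minibatch is normalized with the current average before that average is updated. The class name is hypothetical.

```python
class MovingAverageNormalizer:
    """Divide per-sample weights by an exponential moving average of their minibatch mean."""

    def __init__(self, beta=0.8):
        self.beta = beta
        self.avg = None  # moving average of the minibatch mean

    def __call__(self, omegas):
        batch_mean = sum(omegas) / len(omegas)  # mean over the B samples of the minibatch
        if self.avg is None:
            self.avg = batch_mean  # assumed initialization from the first minibatch
        scale = self.avg if self.avg > 0.0 else 1.0  # avoid dividing by zero
        normalized = [w / scale for w in omegas]
        # Assumed convention: beta weights the old average, (1 - beta) the new minibatch mean.
        self.avg = self.beta * self.avg + (1.0 - self.beta) * batch_mean
        return normalized


normalizer = MovingAverageNormalizer(beta=0.8)
first = normalizer([1.0, 3.0])   # normalized by the initial average 2.0
second = normalizer([4.0, 8.0])  # still normalized by 2.0, then the average moves toward 6.0
```

Dividing by a slowly moving scale keeps the exponent in the reweighting term in a comparable range across tasks whose losses differ by orders of magnitude.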

Experimental procedures

Resource availability

Lead contact

Any further information, questions, or requests should be sent to A. White (andrew.white@rochester.edu).

Materials availability

Our study did not involve any physical materials.

Data and code availability

All used data are publicly available. For reproducibility, our code is available at https://github.com/ur-whitelab/fedchem.git. The code has also been deposited at Zenodo under https://doi.org/10.5281/zenodo.6485682.

Datasets

We conducted experiments on a total of nine datasets retrieved from MoleculeNet for molecular-property prediction, including four regression datasets (FreeSolv, Lipophilicity, ESOL, and QM9) and five classification datasets (Tox21, SIDER, ClinTox, BBBP, and BACE). We follow the prediction tasks in Wu et al. and summarize the statistics for all datasets in Table 1.

Table 1

Statistics of datasets

Dataset	#Compounds	#tasks	Task type	Metric
FreeSolv	642	1	Reg.	RMSE
Lipophilicity	4,200	1	Reg.	RMSE
ESOL	1,128	1	Reg.	RMSE
QM9	133,885	12	Reg.	MAE
Tox21	7,831	12	Cls.	ROC-AUC
SIDER	1,427	27	Cls.	ROC-AUC
ClinTox	1,478	2	Cls.	ROC-AUC
BBBP	2,039	1	Cls.	ROC-AUC
BACE	1,213	1	Cls.	ROC-AUC

Reg., regression; Cls., classification; RMSE, root-mean-square error; MAE, mean absolute error; ROC-AUC, receiver operating characteristic-area under the curve.


Compared methods

To justify the proposed benchmark FedChem, we compare our results with MoleculeNet (MolNet) for centralized training. To validate the effectiveness of FLIT(+), we compare FLIT(+) with FedAvg, FedProx, and MOON. Moreover, we also implement two variants of FLIT(+): FedAvg with focal loss for client training (federated focal [FedFocal]) and FedAvg with VAT for client training (FedVAT). We describe the compared methods as follows:

FedAvg aggregates the local models element-wise into a global one.
FedProx regularizes local training to alleviate the heterogeneity problem.
MOON applies contrastive learning to federated learning to correct the local training.
FedFocal is proposed in this paper and is a variant of FLIT. FedFocal applies the focal loss (Equation 4) to local training and adopts FedAvg for the server update. It serves to validate the benefit of involving the global model in local training, as FLIT does.
FedVAT is also proposed in this paper and is a variant of FLIT+. FedVAT jointly optimizes Equation 9 and the original training loss for client training and adopts FedAvg for the server update. Compared with FLIT+, FedVAT does not use an instance-reweighting training strategy.
FLIT is proposed in this paper and is described in Algorithm 2.
FLIT+ is proposed in this paper. Compared with FLIT, FLIT+ jointly uses loss values and the discrepancy between nearby samples to measure the uncertainty of samples, as described in Equation 10, and adopts Equation 11 as the learning objective.

We perform grid search on the held-out validation set for hyperparameter tuning and model selection. For FedProx, we grid-search the hyperparameter μ; for MOON, we grid-search its hyperparameter. We grid-search γ, which is used for instance reweighting, for FLIT(+) and FedFocal, and λ for FLIT+. FedVAT adopts a hyperparameter to balance the VAT loss and the primary loss, which is also grid-searched. We report results on the testing set for the model with the best performance on the validation set.

Main results

The experimental results on regression and classification datasets are shown in Tables 2 and 3, respectively. We draw several conclusions from the results. First, comparing our centralized training results (denoted as FedChem) with MolNet, we obtain competitive results by using MPNNs2s and SchNet. Specifically, we obtain a significant performance gain by adopting SchNet for the QM9 dataset. Second, comparing the performance of FedAvg with different α for each dataset, we can conclude that the heterogeneity settings introduced by FedChem indeed lead to performance degradation for 7 out of 9 datasets (i.e., FreeSolv, ESOL, QM9, Tox21, ClinTox, BBBP, and BACE). FedAvg shows stable performance for Lipophilicity and SIDER. The reason may be that we do not consider the relation between scaffold subgroups in our current settings, so the resulting clients’ datasets are rather homogeneous. Third, we observe a significant performance gain for most datasets when comparing heterogeneous federated-learning methods with FedAvg. For example, for FreeSolv, the proposed FLIT+ achieves a 0.543 improvement in RMSE with α = 0.1 and a 0.162 improvement with α = 1. The results suggest the necessity of mitigating heterogeneity when conducting federated learning and validate the effectiveness of the proposed FLIT(+). However, we also observe that the performance improvements of our methods are rather marginal for several datasets. This may be because our current scaffold splitting does not always lead to heterogeneous datasets. We will continue our work toward a better method to simulate the heterogeneity problem for federated molecular-property prediction.
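The improvements quoted for FreeSolv can be reproduced directly from the FedAvg and FLIT+ columns of Table 2. The short check below copies those six RMSE values from the table; the variable names are only illustrative.

```python
# RMSE on FreeSolv (lower is better), keyed by the LDA concentration alpha; values from Table 2.
fedavg_rmse = {0.1: 1.771, 0.5: 1.445, 1: 1.223}
flit_plus_rmse = {0.1: 1.228, 0.5: 1.127, 1: 1.061}

# Improvement of FLIT+ over FedAvg at each alpha, rounded to the table's precision.
improvement = {
    alpha: round(fedavg_rmse[alpha] - flit_plus_rmse[alpha], 3)
    for alpha in fedavg_rmse
}
```

The gap is largest at α = 0.1 and smallest at α = 1, consistent with the table footnote that a smaller α produces a more extreme heterogeneous scenario.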

Table 2

Performance for federated molecular regression

Dataset	α	Centralized training		Federated learning
		MolNet [a]	FedChem [a] (ours)	FedAvg	FedProx	MOON	FedFocal (ours)	FedVAT (ours)	FLIT (ours)	FLIT+ (ours)
FreeSolv ⇓	0.1	1.40	1.430	1.771	1.693	1.376	1.686	1.371	1.634	1.228 [d]
	0.5			1.445	1.376	1.423	1.322	1.299	1.366	1.127 [d]
	1			1.223	1.216	1.469	1.294	1.150	1.277	1.061 [d]
Lipophilicity ⇓	0.1	0.655	0.6290	0.6361 [d]	0.6403	0.6426	0.6403	0.6556	0.6563	0.6392
	0.5			0.6306	0.6365	0.6339	0.6351	0.6333	0.6368	0.6270 [d]
	1			0.6505	0.6474	0.6442	0.6461	0.6488	0.6443	0.6403 [d]
ESOL ⇓	0.1	0.97	0.6570	0.8016	0.7702	0.7537 [d]	0.8022	0.7776	0.7788	0.7642
	0.5			0.7524	0.7382	0.7258	0.7708	0.7243	0.7426	0.7119 [d]
	1			0.7056	0.6828	0.6751	0.6822	0.7253	0.6705 [d]	0.6998
QM9 ⇓	0.1	0.0479 [b]	0.0890 [c]	0.5889	0.6036	0.5817	0.6164	0.5606	0.5713	0.5356 [d]
	0.5			0.5906	0.5751	0.5707	0.6059	0.5656	0.5658	0.5222 [d]
	1			0.5786	0.5691	0.5808	0.5822	0.5602	0.5621	0.5282 [d]

⇓ and ⇑ indicate whether lower or higher numbers are better, respectively.

[a] Results were obtained with centralized training.

[b] Results were retrieved from Klicpera et al. with a separate SchNet for each task.

[c] Results were obtained by a single multitask network.

[d] Best federated-learning results.

A smaller α of LDA generates a more extreme heterogeneous scenario. FedFocal and FedVAT are proposed in this paper as variants of FLIT(+).

Table 3

Performance for federated molecular classification

Dataset	α	Centralized training		Federated learning
Dataset	α	MolNet (a)	FedChem (a, ours)	FedAvg	FedProx	MOON	FedFocal (ours)	FedVAT (ours)	FLIT (ours)	FLIT+ (ours)
Tox21 ⇑	0.1	0.829	0.8182	0.7705	0.7732	0.7331	0.7696	0.7733	0.7711	0.7802b
	0.5			0.7811	0.7774	0.7461	0.7812	0.7787	0.7825	0.7870b
	1			0.7770	0.7775	0.7457	0.7881b	0.7706	0.7748	0.7806
SIDER ⇑	0.1	0.638	0.6260	0.6029	0.6056b	0.5885	0.6016	0.6027	0.6035	0.6038
	0.5			0.6011	0.5931	0.5966	0.6086	0.5981	0.6096	0.6146b
	1			0.6011	0.6023	0.5901	0.6003	0.6053	0.6072	0.6174b
ClinTox ⇑	0.1	0.832	0.8903	0.7491	0.7540	0.7892b	0.7789b	0.7581	0.7761	0.7775
	0.5			0.7521	0.7423	0.7917b	0.7770	0.7614	0.7888b	0.7852
	1			0.7784	0.7791	0.8001	0.8036b	0.7743	0.7849	0.7993
BBBP ⇑	0.1	0.690	0.8674	0.8361	0.8610	0.8737b	0.8550	0.8673b	0.8666	0.8663
	0.5			0.8594	0.8879b	0.8865	0.8726	0.8641	0.8671	0.8774
	1			0.8453	0.8557b	0.8487	0.8378	0.8386	0.8515	0.8515
BACE ⇑	0.1	0.806	0.8834	0.8203	0.8328	0.8373	0.8253	0.8166	0.8242	0.8467b
	0.5			0.8212	0.8398	0.8285	0.8332	0.8417	0.8516	0.8667b
	1			0.8486	0.8408	0.8561	0.8497	0.8578b	0.8497	0.8561

⇓ and ⇑ indicate whether lower or higher numbers are better, respectively.

a: Results were obtained with centralized training.

b: Best federated-learning results.

Moreover, the proposed instance-reweighting methods (FedFocal, FLIT, and FLIT+) outperform the regularization-based methods FedProx and MOON. The proposed FLIT additionally utilizes the global model and performs better than its counterpart FedFocal. For example, FLIT improves on FedFocal for ESOL, reducing the error from 0.8022 to 0.7788 with α = 0.1 and from 0.7708 to 0.7426 with α = 0.5. Lastly, FLIT+ further improves the performance of FLIT by measuring the uncertainty with loss values and the discrepancy between neighbors. We also observe that FedVAT can benefit federated learning by encouraging local smoothness for better generalization performance. By incorporating VAT into the FLIT framework, FLIT+ achieves the best overall performance. FLIT(+) has more consistent results across different settings of α compared with its counterparts, indicating the effectiveness of FLIT+ for dealing with heterogeneity problems (see Figure 3).

Figure 3

Performance of baseline and our methods with varying communication rounds

Asterisk (∗) denotes that the results are obtained with centralized training. We find our method has a strong advantage with a few communication rounds.

Sensitivity analysis for federated learning

This section studies the influence of the number of clients and communication rounds on federated-learning performance. For simplicity, we conduct experiments on ESOL, ClinTox, and BACE. The results for different maximum numbers of communication rounds are shown in Figure 3. We vary the maximum number of communication rounds while fixing the total number of local steps. We find that communicating more frequently can benefit federated learning, although it also increases transfer costs. The performance with different numbers of clients is shown in Figure 4. We keep the number of clients within a small range, since a large number of clients would lead to small local datasets, making training infeasible. We find that the performance of federated learning usually decreases (ESOL and BACE) or is stable (ClinTox) as the number of clients increases. This indicates that small-scale local training data degrade federated-learning performance.
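To make the trade-off concrete: when the total number of local steps is fixed, more communication rounds mean fewer local steps between aggregations. The sketch below assumes the budget is divided evenly by floor division, which is an assumption rather than the paper's stated rule; the round counts in the loop are hypothetical, while the 10,000-step budget and the default of 30 rounds come from the settings reported for the non-QM9 datasets.

```python
def local_steps_per_round(total_local_steps: int, rounds: int) -> int:
    """Local steps each client runs between two aggregations (assumed even split)."""
    if rounds <= 0:
        raise ValueError("rounds must be a positive integer")
    return total_local_steps // rounds


TOTAL_LOCAL_STEPS = 10_000  # reported budget for all datasets except QM9

# Hypothetical round counts for illustration; 30 is the reported default.
for rounds in (10, 30, 50):
    steps = local_steps_per_round(TOTAL_LOCAL_STEPS, rounds)
    print(f"{rounds} rounds: {steps} local steps per round")
```

Under this reading, raising the number of rounds keeps the training budget constant but multiplies the number of model transfers between clients and server.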

Figure 4

Performance of baseline and our methods with different numbers of clients

See Figure 3 for color legend. The small-scale local training data reduce federated-learning performance for all methods.

Settings for heterogeneous FedChem

For all datasets except QM9, we first randomly split the dataset into 80% for training, 10% for validation, and 10% for testing, following Wu et al. QM9 is partitioned into 110,000 samples for training, 10,000 samples for validation, and the remainder for testing, following prior work. To simulate the heterogeneous settings for federated learning, we first perform scaffold splitting to partition the training data into subgroups. Then, we assign the molecules of each subgroup to clients by latent Dirichlet allocation (LDA). We control the degree of heterogeneity by tuning α for LDA. Smaller α leads to more severe heterogeneity, and we vary α over {0.1, 0.5, 1}. Moreover, we deliberately balance the number of molecules for each client, following He et al., to control for the effect of the number of examples on performance.

As for the federated-learning settings, we set the default number of communication rounds C to 30 and the default number of clients to four for all datasets except QM9, for which it is set to eight. For client training, we set the batch size to 64 and use Adam with a learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁵. We train the local model for 10,000 local steps for all datasets except QM9, and for 100,000 local steps for QM9. We conduct federated learning with the FedML framework.

For all datasets except QM9, we use MPNNs2s as implemented in Deep Graph Library. MPNNs2s has three message-passing layers and three set2set layers. The hidden feature size of the edge is 16, and the output feature size of the vertex is 64. We perform three set2set steps. For QM9, we adopt SchNet with six interaction layers, implemented with PyTorch-Geometric. The number of hidden channels and filters of SchNet is 128, and the number of Gaussians is set to 50 for the continuous-filter layers. We implement a multitask network for datasets with multiple objectives. We run experiments on a server with eight NVIDIA RTX 2080 Ti graphics cards.
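The client-assignment step described above can be sketched as follows. This is a minimal illustration of a common Dirichlet-based partitioning scheme, not the authors' implementation: for each scaffold subgroup, client proportions are drawn from a symmetric Dirichlet distribution with concentration α, and the subgroup's molecules are divided accordingly. The function name `dirichlet_partition` and the toy `scaffold_ids` are hypothetical, scaffold computation itself is assumed to have been done beforehand, and the client-size balancing mentioned in the text is omitted.

```python
import numpy as np


def dirichlet_partition(group_ids, num_clients, alpha, seed=0):
    """Assign sample indices to clients, group by group, using Dirichlet proportions.

    Smaller alpha concentrates each group on fewer clients (more heterogeneity).
    """
    rng = np.random.default_rng(seed)
    group_ids = np.asarray(group_ids)
    client_indices = [[] for _ in range(num_clients)]
    for group in np.unique(group_ids):
        idx = np.flatnonzero(group_ids == group)
        rng.shuffle(idx)
        # Share of this group that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert cumulative shares into split points within this group.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices


# Toy example: 100 molecules in 5 hypothetical scaffold subgroups, 4 clients.
scaffold_ids = np.repeat(np.arange(5), 20)
for alpha in (0.1, 0.5, 1.0):
    parts = dirichlet_partition(scaffold_ids, num_clients=4, alpha=alpha)
    print(alpha, [len(p) for p in parts])
```

Without the balancing step, client sizes can be very uneven at small α, which is presumably why the original setup equalizes the number of molecules per client.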

Conclusions

Chemistry can be a challenging domain for deep learning because of the computational and materials cost per training example. For example, each row in the Tox21 dataset costs about $50–$300 million USD. Therefore, contributing data to a public dataset may be impossible for institutions due to the intrinsic value of the data. Federated learning is a way to build global models while preventing the dissemination of chemical data. We propose a benchmark called FedChem for heterogeneous chemical data, which mimics how chemical data are distributed among institutions. FedChem is composed of regression- and classification-learning scenarios from the existing MoleculeNet dataset and utilizes scaffold splitting and LDA to assign molecules with different structures to different clients. FedChem can be tuned to generate scenarios with different degrees of heterogeneity. Given that existing federated-learning methods perform poorly on FedChem, we propose an instance-reweighting framework called FLIT(+), inspired by focal loss, to align the training process across clients. Extensive experiments show that FLIT(+) is robust across different tasks and datasets. One possible future direction is to develop personalized federated learning for FedChem. Moreover, since our current heterogeneity simulation method may not lead to severe structural heterogeneity in some cases, we will explore other approaches for generating more heterogeneous settings.
