| Literature DB >> 33411624 |
Suraj Rajendran1,2, Jihad S Obeid3, Hamidullah Binol4, Ralph D Agostino5, Kristie Foley6, Wei Zhang1, Philip Austin7, Joey Brakefield8, Metin N Gurcan4, Umit Topaloglu1,4,5.
Abstract
PURPOSE: Building well-performing machine learning (ML) models in health care has always been exigent because of the data-sharing concerns, yet ML approaches often require larger training samples than is afforded by one institution. This paper explores several federated learning implementations by applying them in both a simulated environment and an actual implementation using electronic health record data from two academic medical centers on a Microsoft Azure Cloud Databricks platform.Entities:
Mesh:
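The appendix figures name two weight-sharing mechanisms: single weight transfer (the model is trained at one institution, then handed to the next) and cyclical weight transfer (the model repeatedly cycles between institutions, training locally at each stop). A minimal sketch of cyclical weight transfer, assuming a simple logistic-regression model and synthetic stand-ins for the two institutions' data (the paper's actual Databricks implementation is not reproduced here):

```python
# Illustrative sketch only: data, model, and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def make_institution_data(n, shift):
    """Simulate one institution's tabular data (hypothetical)."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 4))
    w_true = np.array([1.0, -2.0, 0.5, 1.5])
    y = (X @ w_true + rng.normal(size=n) > 0).astype(float)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_epoch(w, X, y, lr=0.1):
    """One gradient-descent pass of logistic regression on local data."""
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    return w - lr * grad

# Two institutions keep their data locally; only the weights move.
institutions = [make_institution_data(500, 0.0), make_institution_data(500, 0.3)]

w = np.zeros(4)
# Cyclical weight transfer: the model visits each site in turn,
# training locally for a few epochs before being passed on.
# (Single weight transfer would be one such handoff rather than many rounds.)
for _round in range(20):
    for X, y in institutions:
        for _ in range(5):
            w = local_epoch(w, X, y)

# Evaluate on the pooled data (for illustration only; in practice each
# institution would evaluate on its own held-out test set).
X_all = np.vstack([d[0] for d in institutions])
y_all = np.concatenate([d[1] for d in institutions])
acc = np.mean((sigmoid(X_all @ w) > 0.5) == y_all)
```

The key property is that raw records never leave either site; only the weight vector `w` crosses the institutional boundary at each handoff.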
Year: 2021 PMID: 33411624 PMCID: PMC8140794 DOI: 10.1200/CCI.20.00060
Source DB: PubMed Journal: JCO Clin Cancer Inform ISSN: 2473-4276
FIG A1. Single weight training mechanism.
FIG A2. Cyclical weight training mechanism.
FIG A3. Distributions of each dataset across classes.
A Comprehensive Distribution of the Two Institutions' Datasets Across Each Feature
FIG A4. The workflow of the federated learning environment in Databricks.
ANN Model Performances on Institution 1 Test Data: Base 1 (Model Trained on Institution 1's Data), Base 2 (Model Trained on Institution 2's Data), Single Weight Model A (Institution 1 Trains the Model First), and Single Weight Model B (Institution 2 Trains the Model First)
ANN Model Performances on Institution 2 Test Data: Base 1 (Model Trained on Institution 1's Data), Base 2 (Model Trained on Institution 2's Data), Single Weight Model A (Institution 1 Trains the Model First), and Single Weight Model B (Institution 2 Trains the Model First)
LR Model Performances on WF's Test Data: Base 1 (Model Trained on WF's Data), Base 2 (Model Trained on MUSC's Data), Single Weight Model A (WF Trains the Model First), and Single Weight Model B (MUSC Trains the Model First)
LR Model Performances on MUSC's Test Data: Base 1 (Model Trained on WF's Data), Base 2 (Model Trained on MUSC's Data), Single Weight Model A (WF Trains the Model First), and Single Weight Model B (MUSC Trains the Model First)
LR Model Performances on Institution 1 Test Data: Base 1 (Model Trained on Institution 1's Data), Base 2 (Model Trained on Institution 2's Data), Single Weight Model A (Institution 1 Trains the Model First), and Single Weight Model B (Institution 2 Trains the Model First)
LR Model Performances on Institution 2 Test Data: Base 1 (Model Trained on Institution 1's Data), Base 2 (Model Trained on Institution 2's Data), Single Weight Model A (Institution 1 Trains the Model First), and Single Weight Model B (Institution 2 Trains the Model First)
FIG 1. ROC curves corresponding to performance metrics in tables. (A) ROC curve based on ANN models' performances against institution 1 test data. (B) ROC curve based on ANN models' performances against institution 2 test data. (C) ROC curve based on LR models' performances against institution 1 test data. (D) ROC curve based on LR models' performances against institution 2 test data. ANN, artificial neural network; LR, logistic regression; ROC, receiver operating characteristic.
ANN Model Performances on WF's Test Data: Base 1 (Model Trained on WF's Data), Base 2 (Model Trained on MUSC's Data), Single Weight Model A (WF Trains the Model First), and Single Weight Model B (MUSC Trains the Model First)
ANN Model Performances on MUSC's Test Data: Base 1 (Model Trained on WF's Data), Base 2 (Model Trained on MUSC's Data), Single Weight Model A (WF Trains the Model First), and Single Weight Model B (MUSC Trains the Model First)
FIG 2. ROC curves corresponding to performance metrics in tables. (A) ROC curve based on ANN models' performances against WF's test data. (B) ROC curve based on ANN models' performances against MUSC's test data. (C) ROC curve based on LR models' performances against WF's test data. (D) ROC curve based on LR models' performances against MUSC's test data. ANN, artificial neural network; LR, logistic regression; MUSC, Medical University of South Carolina; ROC, receiver operating characteristic; WF, Wake Forest.