Tsung-Ting Kuo1, Rodney A Gabriel1,2, Krishna R Cidambi3, Lucila Ohno-Machado1,4.
Abstract
OBJECTIVE: Predicting patient outcomes from healthcare and genomics data is an increasingly popular and important area. However, some diseases are rare and require data from multiple institutions to construct generalizable models. To comply with institutional data protection policies, many distributed methods keep the data local but rely on a central server for coordination, which introduces risks such as a single point of failure. We focus on providing a decentralized alternative. We introduce the idea of using blockchain technology for this purpose, with a brief description of its potential advantages and disadvantages.
Keywords: blockchain distributed ledger technology; clinical information systems; decision support systems; online machine learning; privacy-preserving predictive modeling
Year: 2020 PMID: 32364235 PMCID: PMC7309256 DOI: 10.1093/jamia/ocaa023
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. The network topology for a 4-site (s1, s2, s3, and s4) setting, with each site holding protected health information data (d1, d2, d3, and d4). (a): Centralized network topology adopted by state-of-the-art learning methods. Such an architecture carries several risks: (R1) the sites may not be allowed to transmit data outside specific computer environments due to their institutional policies, (R2) the data being disseminated and the transfer records could be altered without a clear way to verify immutability, (R3) trained models could be tampered with on the central server without being noticed by the other participating sites, thus undermining provenance, (R4) the server represents a single point of failure, (R5) additional security to protect data is not offered, (R6) the client-server architecture may present synchronization problems, and (R7) the sites cannot join or leave the network at any time. Furthermore, long-term sustainability of the whole network becomes dependent on the institution that serves as the central coordinating node (R8). Typically, once a coordinating institution is chosen, the network architecture is built around the coordinating center, with the other nodes serving as data providers that are unable to assume the coordinating role when needed. (b): Decentralized network topology (blockchain) adopted by ExplorerChain. Five desirable features make blockchain suitable to mitigate the problems faced by centralized architectures: (R1) Blockchain is, by design, decentralized; the verification of transactions is achieved by majority voting. Each institution can control the use of its computational resources. (R2) A blockchain provides an immutable audit trail; that is, changing the data or records is very difficult. (R3) The traceable origins certify data provenance. In our case, each trained model is recorded in a collaborative and distributed ledger, which cannot be updated silently by any of the sites without being noticed.
(R4) The peer-to-peer architecture of blockchain removes the single point of failure, and thus improves security and robustness. Also, by removing the dependency on a central node, blockchain increases the availability of the models at all sites at all times. (R5) The enhanced security and privacy features of blockchain further protect data and models. Additionally, (R6) the blockchain mechanism can remove synchronization conflicts. (R7) Each site can join or leave the network freely without imposing overhead on a central server or disrupting the machine learning process. Finally, long-term network sustainability (R8) is increased because the architecture is fully transparent and each participating site can collaborate with low operation and maintenance costs.
Comparison of the state-of-the-art distributed learning methods
| Method | Author | Reference | Architecture | Learning | Focus | Status |
|---|---|---|---|---|---|---|
| | Wu et al | | Client-Server | Batch | Healthcare | Evaluated |
| | Wang et al | | Client-Server | Online | Healthcare | Evaluated |
| | Jiang et al | | Client-Server | Batch | Healthcare | Evaluated |
| | Shi et al | | Client-Server | Batch | Healthcare | Evaluated |
| | Kuo et al | | Peer-to-Peer | Batch | Healthcare | Evaluated |
| | Kuo et al | | Peer-to-Peer | Batch | Healthcare | Evaluated |
| | Kuo et al | | Peer-to-Peer | Online | Healthcare | Proposed |
| | Chen et al | | Peer-to-Peer | Online | Privacy/Security | Evaluated |
| | Kuo et al | – | Peer-to-Peer | Online | Healthcare | Evaluated |
Figure 2. A simplified example of ExplorerChain. Only the aggregated data (ie, the machine learning model) and the meta information are stored on-chain, while the protected health information (PHI, the observation-level patient data) is stored off-chain. This design ensures that the institutions can share information to improve the predictive model without transmitting PHI. Note that the transaction amount is set to zero, indicating that the blockchain serves purely as a nonfinancial distributed ledger.
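The on-chain/off-chain split described in the Figure 2 caption can be illustrated with a minimal hash-chained ledger. This is only a sketch under stated assumptions, not ExplorerChain's actual implementation: the `Block` and `Ledger` classes, the field names, the site labels, and the coefficient values are all hypothetical, and there is no networking or consensus. Each block holds a model and meta information with a zero transaction amount, while patient-level data never enters the ledger.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Block:
    index: int
    prev_hash: str
    site: str        # hypothetical site identifier
    model: list      # aggregated model coefficients only (no patient-level data)
    meta: dict       # meta information, e.g. the iteration number
    amount: int = 0  # transaction amount fixed at zero (nonfinancial ledger)

    def digest(self) -> str:
        """SHA-256 over the block contents, including the previous hash."""
        payload = json.dumps(
            [self.index, self.prev_hash, self.site, self.model, self.meta, self.amount],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class Ledger:
    blocks: list = field(default_factory=list)

    def append(self, site, model, meta):
        """Add a block that points at the digest of the current last block."""
        prev_hash = self.blocks[-1].digest() if self.blocks else "0" * 64
        block = Block(len(self.blocks), prev_hash, site, list(model), dict(meta))
        self.blocks.append(block)
        return block

    def is_consistent(self) -> bool:
        """Check every stored prev_hash against the recomputed digest.

        Note: the last block has no successor, so this simple check cannot
        detect tampering with it; a real blockchain relies on peers holding
        copies of the chain for that.
        """
        return all(
            cur.prev_hash == prev.digest()
            for prev, cur in zip(self.blocks, self.blocks[1:])
        )


# Hypothetical models shared by three sites; the PHI itself stays off-chain.
ledger = Ledger()
ledger.append("s1", [0.12, -0.40], {"iteration": 1})
ledger.append("s2", [0.10, -0.38], {"iteration": 2})
ledger.append("s3", [0.11, -0.39], {"iteration": 3})
intact = ledger.is_consistent()

# Silently editing an earlier model breaks the hash link to the next block.
ledger.blocks[0].model[0] = 9.9
tampered_detected = not ledger.is_consistent()
```

Because each block embeds the digest of its predecessor, an edit to an earlier model changes that digest and is exposed at the next block, which is the property the caption relies on for an auditable model history.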
Statistics and features of the datasets tested in our experiments. The class distribution (ie, the percentage of the positive/negative classes) is also included. The numerical covariates are labeled with an asterisk symbol (“*”), while the categorical ones are converted into binary through dummy coding. The values for the myocardial infarction and cancer biomarker datasets are adapted from
| Dataset | Myocardial Infarction (Edin) | Cancer Biomarker (CA) | Length of Hospitalization (THA) | |
|---|---|---|---|---|
| Number of covariates | 9 | 2 | 34 | |
| Number of samples | 1,253 | 141 | 960 | |
| Class distribution (positive / negative) | 0.219 / 0.781 | 0.638 / 0.362 | 0.278 / 0.722 | |
| Outcome | Presence of Disease | Presence of Cancer | Hospital Length of Stay is greater than 3 days | |
| Covariates | Pain in Right Arm | CA-19* | Male Sex | SA - Posterior |
| | Nausea | CA-125* | Age ≥ 65 years old | SA - Anterolateral |
| | Hypo Perfusion | – | Preoperative METs < 4 | SA - Anterior |
| | ST Elevation | – | General Anesthesia (versus Neuraxial Anesthesia) | CM - Chronic Kidney Disease |
| | New Q Waves | – | Non-English Speaker | CM - Chronic Obstructive Pulmonary Disease |
| | ST Depression | – | OG - Mild | CM - Congestive Heart Failure |
| | T Wave Inversion | – | OG - Moderate | CM - Coronary Artery Disease |
| | Sweating | – | OG - Severe | CM - Hypertension |
| | Pain in Left Arm | – | OG - Avascular Necrosis | CM - Diabetes Mellitus |
| | – | – | CHD - No osteoarthritis | CM - Obstructive Sleep Apnea |
| | – | – | CHD - Mild osteoarthritis | CM - Dialysis |
| | – | – | CHD - Moderate osteoarthritis | CM - Psychiatric history (depression, anxiety, or bipolar disease) |
| | – | – | CHD - Severe osteoarthritis | CM - Active Smoker |
| | – | – | CHD - Previous Surgery (ie, hip replacement) | CM - Asthma |
| | – | – | CHD - Avascular Necrosis | CM - Thrombocytopenia (platelets < 150 000/uL) |
| | – | – | Obesity (BMI > 30 kg/m2) | CM - Anemia |
| | – | – | Preoperative Opioid Use | CM - Dementia |
Abbreviations: BMI, body mass index; CHD, Contralateral Hip Description; CM, comorbidities; METs, metabolic equivalents; OG, osteoarthritis grade (operative side); SA, surgical approach; THA, total hip arthroplasty.
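The dummy coding mentioned in the table caption turns one categorical covariate into several binary indicator columns. The sketch below shows the idea on the osteoarthritis grade (OG) levels listed in the table. The `dummy_code` helper and the two patient records are hypothetical and for illustration only, and the source does not say whether a reference level was dropped, so all four levels are kept here as an assumption.

```python
def dummy_code(records, column, categories):
    """Replace one categorical column with one 0/1 indicator per category."""
    coded = []
    for record in records:
        # Copy every field except the categorical one being expanded.
        row = {key: value for key, value in record.items() if key != column}
        for category in categories:
            row[f"{column} - {category}"] = 1 if record[column] == category else 0
        coded.append(row)
    return coded


# Hypothetical records; real patient data would stay at each site.
patients = [
    {"Male Sex": 1, "OG": "Mild"},
    {"Male Sex": 0, "OG": "Severe"},
]
og_levels = ["Mild", "Moderate", "Severe", "Avascular Necrosis"]
coded = dummy_code(patients, "OG", og_levels)
```

Each coded record carries exactly one indicator set to 1 among the `OG - ...` columns, matching the binary covariate names used in the table.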
The experiment results for the myocardial infarction (Edin), the cancer biomarker (CA), and the length of hospitalization (THA) datasets. The evaluation metric is the averaged full area under the receiver operating characteristic curve (AUC) among N sites, for 30 trials. The Pearson Correlation Coefficient (PCC) was computed to evaluate the linear correlation between 2 methods. Finally, the alpha in the 2-sample t-test was 0.05, and the p-values larger than 0.05 (shown in bold italic) indicate no statistically significant difference between the AUC results of EXPLORER and ExplorerChain
| Dataset | N | EXPLORER Mean AUC | EXPLORER Standard Deviation | ExplorerChain Mean AUC | ExplorerChain Standard Deviation | PCC | Delta | t-Test Statistic | t-Test P-value |
|---|---|---|---|---|---|---|---|---|---|
| Edin | 2 | 0.965 | 0.013 | 0.965 | 0.013 | 0.999 | 0.000 | −1.559 | |
| | 4 | 0.962 | 0.010 | 0.960 | 0.011 | 0.867 | 0.000 | 1.868 | |
| | 8 | 0.957 | 0.014 | 0.954 | 0.015 | 0.906 | 0.002 | 1.371 | |
| CA | 2 | 0.893 | 0.054 | 0.891 | 0.055 | 0.977 | 0.000 | 1.106 | |
| | 4 | 0.862 | 0.075 | 0.853 | 0.078 | 0.932 | 0.000 | 1.694 | |
| | 8 | 0.892 | 0.060 | 0.876 | 0.071 | 0.746 | 0.000 | 1.811 | |
| THA | 2 | 0.734 | 0.035 | 0.733 | 0.036 | 0.995 | 0.000 | 1.622 | |
| | 4 | 0.738 | 0.047 | 0.735 | 0.047 | 0.975 | 0.000 | 1.529 | |
| | 8 | 0.718 | 0.040 | 0.712 | 0.040 | 0.909 | 0.000 | 1.878 | |
Abbreviations: AUC, area under the receiver operating characteristic curve; CA, cancer biomarker; PCC, Pearson correlation coefficient; THA, total hip arthroplasty.
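The three quantities named in the results caption (AUC, PCC, and the two-sample t-test statistic) can be sketched with the Python standard library. This is an illustration under assumptions rather than the authors' evaluation code: the function names are hypothetical, the per-trial AUC values are made up, and a pooled-variance (Student) t statistic is assumed because the source does not state which variant was used. Obtaining a P-value would additionally require the t distribution, for example from a statistics package.

```python
import math
from statistics import mean, variance


def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)


def pooled_t_statistic(x, y):
    """Two-sample t statistic with pooled variance (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))


# Made-up per-trial AUCs for two methods (the source reports 30 trials per setting).
explorer_auc = [0.96, 0.97, 0.95, 0.98]
chain_auc = [0.96, 0.96, 0.95, 0.98]

pcc = pearson(explorer_auc, chain_auc)
t_stat = pooled_t_statistic(explorer_auc, chain_auc)
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A PCC near 1 together with a small absolute t statistic is the pattern the table reports: the two methods produce nearly the same AUC trial by trial.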
Number of iterations and time results of ExplorerChain among N sites for 30 trials of the 3 datasets. All time measurements are averaged over N sites, and the total time includes both running time and synchronization time. For ExplorerChain, the per-iteration time is computed by dividing the time by the mean number of iterations, and the additional pausing time (240 seconds per trial) between trials for result collection was deducted
| Dataset | N | ExplorerChain Iterations (Mean) | ExplorerChain Iterations (Standard Deviation) | EXPLORER Total Time (s) | EXPLORER Running Time (s) | ExplorerChain Total Time (s) | ExplorerChain Running Time (s) | ExplorerChain Total Time / Iteration (s) | ExplorerChain Running Time / Iteration (s) |
|---|---|---|---|---|---|---|---|---|---|
| Edin | 2 | 2.433 | 1.455 | 2.477 | 2.426 | 144.519 | 7.609 | 59.391 | 3.127 |
| | 4 | 3.033 | 1.542 | 2.451 | 2.399 | 165.890 | 9.939 | 54.689 | 3.277 |
| | 8 | 3.633 | 2.157 | 2.432 | 2.383 | 184.086 | 12.145 | 50.666 | 3.343 |
| CA | 2 | 2.000 | 0.000 | 2.000 | 1.945 | 129.618 | 6.125 | 64.809 | 3.062 |
| | 4 | 2.700 | 1.088 | 2.011 | 1.949 | 154.175 | 8.829 | 57.102 | 3.270 |
| | 8 | 3.233 | 1.547 | 1.996 | 1.947 | 172.654 | 10.938 | 53.398 | 3.383 |
| THA | 2 | 2.533 | 1.814 | 2.399 | 2.348 | 147.259 | 8.045 | 58.128 | 3.176 |
| | 4 | 8.000 | 2.533 | 2.366 | 2.315 | 317.266 | 23.865 | 39.658 | 2.983 |
| | 8 | 9.833 | 0.913 | 2.364 | 2.314 | 365.355 | 32.713 | 37.155 | 3.327 |
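The per-iteration columns follow directly from the rule in the caption: divide the time by the mean number of iterations. As a worked check, the sketch below applies that rule to the first row of the table (N = 2), using the rounded values as printed. The function name is hypothetical, and the synchronization time is derived from the caption's statement that total time includes running and synchronization time.

```python
def per_iteration(time_seconds, mean_iterations):
    """Per-iteration time: total or running time divided by the mean iteration count."""
    return time_seconds / mean_iterations


# First row of the table (N = 2), ExplorerChain, values as printed.
mean_iterations = 2.433
total_time = 144.519
running_time = 7.609

total_per_iteration = per_iteration(total_time, mean_iterations)
running_per_iteration = per_iteration(running_time, mean_iterations)

# Total time = running time + synchronization time, so the difference
# gives the synchronization share of each trial.
synchronization_time = total_time - running_time
```

The results land close to the reported 59.391 and 3.127 seconds; the small gap in the total comes from using the rounded mean iteration count. The same arithmetic shows that synchronization, not model fitting, accounts for most of ExplorerChain's total time in this row.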
Figure 3. Number of iterations of ExplorerChain on 3 datasets and 2-, 4-, and 8-site settings for each of the 30 trials. The number of iterations increases with the number of sites, up to the maximum number of iterations allowed to reach consensus (10 in our experiment). However, the relatively large standard deviation, especially when the number of sites is small (eg, N = 2), suggests that the number of required iterations is highly dependent on the dataset.