Tsung-Ting Kuo1. 1. UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA.
Abstract
OBJECTIVE: Cross-institutional distributed healthcare/genomic predictive modeling is an emerging technology that fulfills both the need for building a more generalizable model and for protecting patient data, by exchanging only the models and not the patient data. In this article, the implementation details are presented for one specific blockchain-based approach, ExplorerChain, from a software development perspective. The healthcare/genomic use cases of myocardial infarction, cancer biomarker, and length of hospitalization after surgery are also described. MATERIALS AND METHODS: ExplorerChain's 3 main technical components, including online machine learning, metadata of transaction, and the Proof-of-Information-Timed (PoINT) algorithm, are introduced in this study. Specifically, the 3 algorithms (ie, core, new network, and new site/data) are described in detail. RESULTS: ExplorerChain was implemented and its design details were illustrated, especially the development configurations in a practical setting. Also, the system architecture and programming languages are introduced. The code was also released in an open source repository available at https://github.com/tsungtingkuo/explorerchain. DISCUSSION: The design considerations of semi-trust assumption, data format normalization, and non-determinism were discussed. The limitations of the implementation include a fixed number of participating sites, limited join-or-leave capability during initialization, advanced privacy technology yet to be included, and the need for further investigation of ethical, legal, and social implications. CONCLUSION: This study can serve as a reference for researchers who would like to implement and even deploy blockchain technology. Furthermore, the off-the-shelf software can also serve as a cornerstone to accelerate the development and investigation of future healthcare/genomic blockchain studies.
While multiple healthcare/genomic institutions may want to collaboratively create a predictive model to better predict patient outcomes (eg, heart disease, cancer biomarker, and postsurgery hospital length of stay), privacy concerns about sharing patient data directly and the security risks of having a central server coordinate the machine learning process can be potential burdens. To mitigate these issues, several cross-institutional distributed predictive modeling methods have been proposed. In this study, the design and implementation details are presented for one specific blockchain-based approach, ExplorerChain. The healthcare/genomic use cases for ExplorerChain are also described. Specifically, the 3 main technical components, online learning, blockchain network, and consensus algorithm, are introduced. The development configurations, system architecture, and programming languages are described as well. In addition, the design considerations and the limitations of the implementation are discussed. This study can serve as a reference for researchers who would like to implement/deploy blockchain technology in the healthcare/genomic domain. Also, the software of ExplorerChain is available at https://github.com/tsungtingkuo/explorerchain.
BACKGROUND AND SIGNIFICANCE
Cross-institutional predictive modeling is an emerging technology that fulfills both the need for building a more generalizable model and for protecting patient data, by exchanging only the models and not the patient data across multiple healthcare/genomic institutions. Traditional approaches are mainly based on a client-server architecture, which requires a central server to coordinate the learning process, collect the partially trained models from each site, and then integrate and send the global model back to every site. This centralized approach can create risks such as a single point of failure. Therefore, several existing studies proposed leveraging blockchain, a peer-to-peer decentralized architecture, to remove the central server. Blockchain, a technology originating from the financial domain, provides additional desirable technical features such as immutability, provenance, and transparency. Although the literature illustrated the rationale and results of adopting blockchain for cross-institutional predictive modeling, the details of the implementation are yet to be described. For example, one of the blockchain-based cross-institutional predictive modeling methods, ExplorerChain, leverages online machine learning on blockchain and was evaluated on healthcare/genomics datasets (Figure 1). Although the advantages/disadvantages of adopting blockchain, the comparison of different architectures/designs, and the equivalent correctness results of ExplorerChain were shown (more details in the Healthcare/genomic use cases section), the feasibility study did not include the practical considerations encountered while the system was being constructed, and these details could serve as cornerstones for future researchers to develop new algorithms.
Figure 1.
ExplorerChain. Each site maintains an exact copy of the blockchain, and exchanges partially trained machine learning models via the metadata of transactions on-chain. The patient-level data never leave each site to protect patients’ privacy.
OBJECTIVE
In this article, the implementation details for ExplorerChain are dissected from a software development perspective. This article addresses head-on one of the main problems facing blockchain projects today: much is envisioned but little is actually implemented to give developers an idea of how to use blockchain for distributed predictive model building in practice. A detailed design and implementation of an "off-the-shelf" tool can also help technologists decide whether to adopt blockchain for their specific healthcare/genomic use cases, aiming to accelerate cross-institutional research and expedite quality improvement initiatives. ExplorerChain was applied in 3 use cases: myocardial infarction, cancer biomarker, and length of hospitalization after surgery.
MATERIALS AND METHODS
The 3 main technical components of ExplorerChain include online machine learning, metadata of transaction, and the Proof-of-Information-Timed (PoINT) algorithm. As shown in Figure 1, online machine learning generates models, metadata of transaction disseminates the models, and the PoINT algorithm determines the learning order. These 3 components are introduced in the following 3 subsections.
Online machine learning and the EXPLORER algorithm
ExplorerChain is designed to operate without a centralized server to compute and manage the predictive models; hence, online machine learning is of interest here because it updates the model incrementally and can create/update models without a central server. In ExplorerChain, the EXPLORER online learning method was adopted. EXPLORER is an online logistic regression based on a Bayesian approach that can revise the model when the records are updated, without re-training on the entire dataset, and is therefore suitable for the purpose of ExplorerChain. The core EXPLORER modeling algorithm, EXPLORER-Intra, was adopted for ExplorerChain, while the inter-site model update component (ie, the central server) of EXPLORER was replaced by transferring the model directly among sites for updates.
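EXPLORER itself maintains a Bayesian posterior (a mean vector and covariance matrix, per Table 1); as a minimal sketch of only the online-update idea it relies on, the following revises a plain logistic regression one record at a time, so no re-training over the full dataset is ever needed. The function names, learning rate, and toy data stream are illustrative assumptions, not EXPLORER's actual API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_update(weights, record, label, lr=0.1):
    """One incremental gradient step on a single (record, label) pair.
    `weights` has length m+1 (bias term first), mirroring the m+1
    model-mean vector kept on-chain."""
    x = [1.0] + list(record)  # prepend the bias feature
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return [w + lr * (label - p) * xi for w, xi in zip(weights, x)]

# Records arrive as a stream; the model is revised per record.
weights = [0.0, 0.0]
stream = [([2.0], 1), ([-1.5], 0), ([1.0], 1), ([-2.0], 0)] * 50
for record, label in stream:
    weights = online_update(weights, record, label)
```

Because each step consumes one record and discards it, the same update can be applied whenever a site receives a transferred model and new local data, which is the property ExplorerChain exploits.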
Metadata of transaction and blockchain implementation
The partially trained machine learning models are disseminated using the metadata of the blockchain transaction. The details of the data fields are described in Table 1. Note that in the implementation of this private blockchain network, ExplorerChain only provides sites with nonfinancial incentives (ie, improved correctness of the predictive model using cross-institutional data in a privacy-preserving manner), instead of a financial incentive (eg, mining rewards or transaction fees as in Bitcoin), to create the blocks and verify the transactions. MultiChain, a general-purpose blockchain platform, was selected because it is designed for private blockchains and is based on the popular and proven Bitcoin blockchain platform, according to a prior systematic review of blockchain platforms. The default configurations of MultiChain were used to implement ExplorerChain. Also, Mining Diversity, a round-robin-based consensus protocol in MultiChain designed for private blockchain networks, was adopted.
Table 1. The on-chain data in ExplorerChain

| Field name | Description | Possible values |
| --- | --- | --- |
| Model mean | The mean vector of the EXPLORER model. | A numerical vector with length equal to m+1. |
| Model covariance | The covariance matrix of the EXPLORER model. | A numerical (m+1) × (m+1) square symmetric matrix. |
| From site | The site that submitted the model. | A unique name or identifier representing the sender site. |
| To site | The site that will receive the model. | A unique name or identifier representing the receiver site. |
| Time | The time that the site submitted the model. | A timestamp. |
| Iteration | The current iteration of the cross-institutional model learning process. | A non-negative integer. |

The EXPLORER model contains both the mean coefficient vector and the covariance matrix. Also, m is the number of features in the dataset.
PoINT algorithm
The basic idea of the PoINT algorithm is based on the Proof-of-Information (PoI) algorithm: if a site s contains data that cannot be predicted accurately using a current model M, those data may contain more information to improve M; therefore, when choosing the next site to update M, the algorithm should assign s a higher priority. The PoINT algorithm starts with the selection of the best-performing model among all sites to prevent the propagation of error. Next, it selects the site with the highest error for the current model, and this site is charged with updating the model. Then, the model update process is repeated until a site cannot find any other site with a higher error; the final consensus model is then complete.

In the original PoI algorithm, one potential issue is that although the underlying EXPLORER algorithm is guaranteed to converge, it may require too many iterations (ie, model transfers) without achieving the "best" consensus predictive model, thus consuming unnecessary computational power. To solve this issue, a time-to-leave counter was added to limit the maximum number of iterations in the PoINT algorithm. This prevents ExplorerChain from executing for too long.

The training "error" adopted in ExplorerChain was based on the complement of the full Area Under the receiver operating characteristic Curve (AUC). Specifically, the error was defined as E = 1 - AUCtraining, where AUCtraining was computed using the data at each site. AUC was also used as the evaluation metric; the training sets were used to guide the process of PoINT (by using AUCtraining), and the test sets were utilized to compute the evaluation metric (ie, AUCtest).

The detailed PoINT algorithm is shown in Algorithms 1–3. Algorithm 1 is the core part of the PoINT algorithm. It determines the machine learning order, and then repeats the training process until finding the consensus model.
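The training error E = 1 - AUCtraining can be computed with a rank-based AUC; the following is a minimal sketch assuming binary labels (1 = positive, 0 = negative), not the implementation used in ExplorerChain's released code.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    case is scored above a randomly chosen negative case (ties count
    half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def point_error(scores, labels):
    """E = 1 - AUCtraining, the 'error' PoINT uses to rank sites."""
    return 1.0 - auc(scores, labels)
```

A site with E close to 0 predicts its own data well; a site with high E is the one PoINT hands the model to next.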
Each site executes Algorithm 2 (which then executes Algorithm 1) to identify a consensus model for a new blockchain network. A running example of PoINT finding a consensus model is shown in Figure 2. At the initial stage (t = 0), the model with the lowest error (E0 = 0.2) is selected as the initial model; choosing the best initial model (M0) helps prevent the propagation of error. The selected model M0 is then submitted to the other sites. Next (t = 1), M1 (the same model as M0) is evaluated by each site using its local data. Suppose one site has the highest error (E1 = 0.7). Given that the data at that site are less accurately predicted using the model M1, the site is assumed to contain the richest information to improve M1. Therefore, model M1 is conceptually "transferred" to that site within a blockchain transaction (with amount = 0 and transaction fee = 0). Then (t = 2), the site updates the model as M2 and sends M2 to all other sites (within another blockchain transaction), and the site with the highest error (ie, the richest information given the current model) will be the next to update the model locally. This process is repeated until a site updates the model and finds itself still producing the highest error compared to the other sites, or until the maximum number of iterations is reached. For example, when t = 3, the site with the highest error (0.3) wins the bid to update the model; but at t = 4, the same site still has the highest error (0.2) using the updated model. Thus, M4 is regarded as the final consensus model and the online machine learning process stops.
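The site-selection loop just described (pick the best initial model, repeatedly hand the model to the worst-predicting site, stop when the updating site is still the worst or the time-to-leave counter expires) can be sketched as a small simulation. The `evaluate` and `update` callables below are toy stand-ins, not real training, and the function names are illustrative.

```python
def point_consensus(sites, init_models, evaluate, update, max_iter=10):
    """Toy version of the PoINT selection loop (Algorithm 1's logic only).

    sites: list of site ids
    init_models: {site: its locally trained candidate initial model}
    evaluate(site, model) -> error in [0, 1] of `model` at that site
    update(site, model)   -> model after local training at that site
    """
    # t = 0: start from the best-performing candidate model to avoid
    # propagating error.
    best_site = min(init_models, key=lambda s: evaluate(s, init_models[s]))
    model, holder = init_models[best_site], None
    for _ in range(max_iter):  # the time-to-leave counter
        errors = {s: evaluate(s, model) for s in sites}
        worst = max(errors, key=errors.get)
        if worst == holder:    # the updater is still worst: consensus
            break
        holder = worst
        model = update(holder, model)  # model "transferred" on-chain
    return model

# Toy example: a model "knows" the sites it has trained on; a site's
# error is 1 until its data are incorporated.
sites = ["s1", "s2", "s3"]
consensus = point_consensus(
    sites,
    init_models={"s1": frozenset({"s1"})},
    evaluate=lambda s, m: 0.0 if s in m else 1.0,
    update=lambda s, m: m | {s},
)
```

In the toy run, the model is passed to each site exactly once until every site's error is 0, at which point the updater remains the worst site and the loop declares consensus.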
Figure 2.
An example of the Proof-of-Information-Timed (PoINT) algorithm. Mt is the model and Et is the error at time t on site s. The timestamp t is not equivalent to the iteration i; the iteration i only increases when the model initialization or transferring occurs. The model/error with green underline is the selected one at that timestamp (ie, at each t, only one model/error is selected). Abbreviation: PHI, Protected Health Information.
On the other hand, a site executes Algorithm 3 when it has new data (a new site is considered a site of which all data are new), after a consensus model has been found. Algorithm 3 starts from the latest consensus model and also leverages Algorithm 1 to update it. Examples of new site/data inclusion, and of situations in which a site leaves, are demonstrated in Figure 3.
Figure 3.
An example of the PoINT algorithm for the situations in which any site adds new data or a site leaves the network. (A) New data (eg, at site s). (B) Site leaving while it is updating the model. (C) Site leaving while not updating the model.
First, if there are new data at a site (Figure 3A), re-training the whole model is not needed. Instead, a process similar to that shown in Figure 2 can be used to determine whether the model should be updated using the new data. Suppose the new data dʹ arrive at a site while the current (t = 4) consensus model is M4. At time t = 5, that site uses the updated data (including both d and dʹ) to evaluate model M5 (which is the same as M4), and finds that its error (E5 = 0.4) is larger than that of the current updating site (0.2). Therefore, the model M5 is transferred to the new-data site to be updated. The iteration i is reset to 1, and the same process shown in Figure 2 is repeated until a final consensus model is identified. In the case that the new-data site's error is not higher than that of the current updating site, the new data are considered not to bring enough information to update the model M5, and no transfer/update is required. A similar mechanism can be used for a new site (treated as a site where all data are new).

Next, if a site leaves while it is updating the model (Figure 3B), the departure can simply be ignored. This is because, until that site completes the model update, the latest model M5 (at the end of the blockchain) remains unchanged and can be used for prediction tasks by the other sites in the network. Once it returns to the network, the site can continue updating the model. In the case that a site leaves while not updating the model (Figure 3C), the departure can also be ignored; the site can rejoin the network at any time. For example, if the site with the highest error leaves the network without submitting its error, it will time out and the site with the second highest error will replace it, and so on.
As a result, in both above-mentioned situations, the departure of a site can be dealt with by the blockchain mechanism of ExplorerChain.

There are 4 main hyper-parameters in PoINT: (1) the total number of participating sites N, (2) the polling time period Δ, (3) the waiting time period Θ, and (4) the maximum number of iterations Ω. The total model size (including mean and covariance) is O(m²), where m is the number of features. Using online learning (ie, without a central server like the one in EXPLORER to compute the optimized global model), ExplorerChain is in fact an approximation to the optimal solution.
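The O(m²) model size translates directly into the on-chain payload per transfer. As a back-of-the-envelope sketch (assuming 8-byte double-precision values and ignoring encoding overhead, both assumptions of this illustration):

```python
def model_payload_bytes(m, bytes_per_value=8):
    """Approximate on-chain model size for m features: the mean vector
    (m+1 values) plus the covariance matrix ((m+1)^2 values), assuming
    8-byte doubles and ignoring serialization overhead."""
    n = m + 1
    return (n + n * n) * bytes_per_value

small = model_payload_bytes(34)    # 34 covariates: about 10 KB
large = model_payload_bytes(1000)  # 1000 covariates: about 8 MB, which
                                   # may exceed a platform's transaction
                                   # size limit
```

This quadratic growth is why a model with many covariates may require a blockchain platform with a larger transaction size limit, as noted in the Discussion.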
RESULTS
A simplified flowchart of the implemented algorithms is illustrated in Figure 4. When a site starts ExplorerChain, it first determines whether to execute Algorithm 2 (for the initialization scenario) or Algorithm 3 (for the new-data scenario). Afterwards, the site runs Algorithm 1 as a "daemon" service to monitor the blockchain network and perform model or error computations as needed. That is, Algorithm 1 continuously watches the blockchain for any newly updated model or an incoming transferred model. Algorithm 1 keeps running, although the consensus learning process within it pauses once a consensus model is confirmed. Algorithm 1 stops when the site running it leaves the network (while the other sites continue running Algorithm 1), or when the site has new data and stops in order to run Algorithm 3 instead.
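The daemon behavior of Algorithm 1 amounts to a polling loop with period Δ. The sketch below illustrates that loop only; `fetch_latest` and `handle` are hypothetical stand-ins for the blockchain queries and the model/error computations, and `max_polls` exists purely to keep the example finite.

```python
import time

def run_daemon(fetch_latest, handle, poll_period=1.0, max_polls=None):
    """Poll the chain every `poll_period` seconds (the hyper-parameter
    Delta). Each previously unseen transaction is passed to `handle`,
    which returns False to stop the daemon (eg, on site departure, or
    when new local data call for Algorithm 3 instead)."""
    seen, polls = None, 0
    while max_polls is None or polls < max_polls:
        latest = fetch_latest()
        if latest is not None and latest != seen:
            seen = latest
            if handle(latest) is False:
                break
        time.sleep(poll_period)
        polls += 1

# Simulated chain: the same tip may be seen twice; duplicates are skipped.
events = iter(["tx1", "tx1", "tx2", "tx3"])
handled = []
run_daemon(
    fetch_latest=lambda: next(events, "tx3"),
    handle=lambda tx: handled.append(tx) or tx != "tx3",
    poll_period=0.0,
    max_polls=10,
)
```

The deduplication against `seen` mirrors why the daemon can "pause": when no new transaction appears on the chain, nothing is handed to the learning logic.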
Figure 4.
A simplified flowchart demonstrating the implementation of Algorithms 1–3 of PoINT.
The system architecture and programming languages used in ExplorerChain are demonstrated in Figure 5. The system (including ExplorerChain and the underlying MultiChain) was deployed on integrating Data for Analysis, Anonymization, and SHaring (iDASH), a private HIPAA-compliant computing environment including both virtual machines (VMs) and a cloud-based network. Within ExplorerChain, Protected Health Information is only used for internal model training (through EXPLORER-Intra), while the model is disseminated on the blockchain (through Blockchain-Connector).
Figure 5.
System architecture of ExplorerChain. This example shows the architecture under the 8-site configuration.
Java is used as the main implementation language to interact with EXPLORER-Intra (written in MATLAB) and MultiChain (written in C++) via runtime command execution. The EXPLORER-Intra MATLAB code was refactored into APIs (ie, initializing, updating, and evaluating the model) that can be called by ExplorerChain without changing the original learning functionalities. The simulation includes the multiple-site scenarios on iDASH VMs, with different numbers of sites (2, 4, and 8 in the experiment). Each VM contains 2 Intel Xeon 2.30 GHz CPUs, 8 GB RAM, and 100 GB storage.

For EXPLORER-Intra, the prior mean was set to 0 and the variance was set to 5 for the normal distribution. The other hyper-parameters of EXPLORER-Intra were set to the default values of EXPLORER. For ExplorerChain, the hyper-parameter values are as follows: (1) polling time period Δ = 1 (s), (2) waiting time period Θ = 30 (s), (3) maximum iteration Ω = 10, and (4) total number of participating sites N = 2, 4, or 8. The time periods (ie, Δ and Θ) were chosen based on the estimated model training/updating time of EXPLORER-Intra (about 3 s for the experiment datasets in this study, including the time to load MATLAB) and network latency (relatively small in the underlying iDASH computing environment). Also, the blockchain network was checked only for the latest N (ie, 2, 4, or 8) transactions that had hexadecimal transaction metadata size >20 in the PoINT algorithm.
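The runtime-command-execution bridge between the main program and the external tools (MATLAB and multichain-cli) follows a common pattern. The following is a hedged Python analogue of that pattern; `echo` stands in for the real binaries, whose actual names and arguments differ from this illustration.

```python
import subprocess

def run_tool(args):
    """Execute an external tool (eg, a MATLAB batch invocation or a
    multichain-cli command) and return its stdout, raising an exception
    on a nonzero exit code. This mirrors the runtime-command-execution
    bridge described above; `echo` is used as a safe placeholder."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout.strip()

output = run_tool(["echo", "getinfo"])
```

Capturing stdout and checking the exit code is what lets the Java side treat the MATLAB model routines and the blockchain CLI as simple request/response calls.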
HEALTHCARE/GENOMIC USE CASES
The use cases and the efficacy of ExplorerChain were described in Ref. 14 and are summarized below (all with 8 participating sites and an 80%/20% training/test data split over 30 trials): (1) Myocardial infarction. In this use case, the binary outcome is the presence of disease, and the dataset contains 9 covariates (eg, Pain in Right Arm) and 1253 samples. ExplorerChain reached a prediction correctness of 0.954 in AUC, an average of 3.633 learning iterations, and about 184 s of total execution time. (2) Cancer biomarker. The binary outcome is the presence of cancer, and the dataset contains 2 covariates (ie, CA-19 and CA-125) and 141 samples. ExplorerChain reached 0.876 in AUC, an average of 3.233 learning iterations, and about 173 s of total execution time. (3) Length of hospitalization after surgery. The binary outcome is whether the hospital length of stay is greater than 3 days. This dataset contains 34 covariates (eg, preoperative opioid use) and 960 samples. ExplorerChain reached 0.712 in AUC, an average of 9.833 learning iterations, and about 365 s of total execution time.
DISCUSSION
There are design considerations of ExplorerChain that are common to distributed networks federating data at the institutions of origin: it is based on a semi-trust assumption that the sites are willing to share the aggregated model data but not the patient-level data, and the data format (both syntactic and semantic) at each site must be normalized using standards such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). Also, the method is nondeterministic, because the learning process depends on the network and computation latency of each blockchain node. Finally, ExplorerChain is platform independent and can adopt other blockchains such as BigchainDB.

One potential concern may be increased model complexity (ie, a large number of covariates). Since ExplorerChain stores the model covariance matrix, the space complexity of a model with m covariates is O(m²), as shown in Table 1. For a large m, the size of the model covariance may exceed the transaction size limit of MultiChain; in this case, a different blockchain platform that supports a larger transaction size can be adopted.

Regarding limitations, the participants of a permissioned ExplorerChain network are predetermined; therefore, the total number of participating sites (the hyper-parameter N) has a known value. Any site within the predetermined set can join or leave the network; however, nonapproved sites cannot join during the process. Also, with the current design, the participating sites cannot join or leave during the initialization phase.

Furthermore, privacy-preserving methods addressing the re-identification risks considered in the field of differential privacy (ie, when the data at some sites are very small, the model parameters may lead to re-identification of cases and thus compromise privacy) were not fully investigated, while methods such as LearningChain focus on protecting differential privacy.
Beyond theoretical guarantees of privacy protection, the ethical, legal, and social implications that may arise from repeated access to a distributed computing system also need further consideration to protect human subjects.
CONCLUSION
A software implementation of ExplorerChain has been developed and is publicly available in an open source repository (https://github.com/tsungtingkuo/explorerchain). Building on the previously shown accuracy, this study further describes how a blockchain program can be implemented to solve the cross-institutional predictive modeling problem. Also, the healthcare/genomic use cases demonstrate the efficacy of ExplorerChain. This work can serve as a reference for researchers who would like to implement and even deploy blockchain technology, and the off-the-shelf software can also serve as a cornerstone to accelerate the development and investigation of future healthcare/genomic blockchain studies.
FUNDING
T.-T.K. was funded by the U.S. National Institutes of Health (NIH) (OT3OD025462, R00HG009680, R01HL136835, R01GM118609, and U01EB023685) and a UCSD Academic Senate Research Grant (RG084150). The content is solely the responsibility of the author and does not necessarily represent the official views of the NIH. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
AUTHOR CONTRIBUTIONS
T.-T.K. contributed in conceptualization, data curation, formal analysis, funding acquisition, investigation, methodology, project administration, resources, supervision, software, validation, visualization, and writing (original draft).