Timo M Deist, A Jochems, Johan van Soest, Georgi Nalbantov, Cary Oberije, Seán Walsh, Michael Eble, Paul Bulens, Philippe Coucke, Wim Dries, Andre Dekker, Philippe Lambin.
Abstract
Machine learning applications for personalized medicine are highly dependent on access to sufficient data. For personalized radiation oncology, datasets representing the variation in the entire cancer patient population need to be acquired and used to learn prediction models. Ethical and legal boundaries to ensure data privacy hamper collaboration between research institutes. We hypothesize that data sharing is possible without identifiable patient data leaving the radiation clinics and that building machine learning applications on distributed datasets is feasible. We developed and implemented an IT infrastructure in five radiation clinics across three countries (Belgium, Germany, and The Netherlands). We present here a proof-of-principle for future 'big data' infrastructures and distributed learning studies. Lung cancer patient data was collected in all five locations and stored in local databases. Exemplary support vector machine (SVM) models were learned using the Alternating Direction Method of Multipliers (ADMM) from the distributed databases to predict post-radiotherapy dyspnea grade [Formula: see text]. The discriminative performance was assessed by the area under the curve (AUC) in a five-fold cross-validation (learning on four sites and validating on the fifth). The performance of the distributed learning algorithm was compared to centralized learning where datasets of all institutes are jointly analyzed. The euroCAT infrastructure has been successfully implemented in five radiation clinics across three countries. SVM models can be learned on data distributed over all five clinics. Furthermore, the infrastructure provides a general framework to execute learning algorithms on distributed data. The ongoing expansion of the euroCAT network will facilitate machine learning in radiation oncology. The resulting access to larger datasets with sufficient variation will pave the way for generalizable prediction models and personalized medicine.
Keywords: Decision support systems; Distributed learning; Dyspnea; Predictive models; Support vector machine
Year: 2017 PMID: 29594204 PMCID: PMC5833935 DOI: 10.1016/j.ctro.2016.12.004
Source DB: PubMed Journal: Clin Transl Radiat Oncol ISSN: 2405-6308
Overview of patient characteristics per hospital.
| Variable | Maastricht | | Eindhoven | | Hasselt | | Liège | | Aachen | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Count | % | Count | % | Count | % | Count | % | Count | % |
| | 89 | 72% | 50 | 89% | 8 | 57% | 20 | 61% | 36 | 86% |
| | 34 | 28% | 6 | 11% | 6 | 43% | 13 | 39% | 6 | 14% |
| Missing | 0 | 0% | 0 | 0% | 0 | 0% | 0 | 0% | 0 | 0% |
| No | 90 | 73% | 44 | 79% | 2 | 14% | 27 | 82% | 24 | 57% |
| Yes | 33 | 27% | 12 | 21% | 3 | 21% | 6 | 18% | 12 | 29% |
| Missing | 0 | 0% | 0 | 0% | 9 | 64% | 0 | 0% | 6 | 14% |
| None | 16 | 13% | 5 | 9% | 3 | 21% | 0 | 0% | 2 | 5% |
| Sequential | 22 | 18% | 24 | 43% | 2 | 14% | 2 | 6% | 4 | 10% |
| Concurrent | 85 | 69% | 27 | 48% | 8 | 57% | 31 | 94% | 33 | 79% |
| Missing | 0 | 0% | 0 | 0% | 1 | 7% | 0 | 0% | 3 | 7% |
| Mean & Standard Dev | 78 | 21 | 80 | 25 | 80 | 25 | 72 | 23 | 66 | 19 |
| Missing Count & Percentage | 0 | 0% | 20 | 36% | 2 | 14% | 0 | 0% | 20 | 48% |
Fig. 1. Distributed learning flow in euroCAT.
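The flow in which only model coefficients travel between the clinics and a central server can be illustrated with a generic consensus ADMM loop. The sketch below is not the formulation from Appendix A: it swaps the SVM hinge loss for a least-squares loss (in the style of a least-squares SVM) so that each site's update has a closed form, it regularizes the intercept along with the other coefficients, and it runs on synthetic data. The names `local_update`, `admm_consensus`, `centralized`, and `make_sites`, as well as the values of `lam`, `rho`, and `n_iter`, are assumptions made for illustration.

```python
import numpy as np


def local_update(X, y, z, u, rho):
    """Site-side step: closed-form solve of the local penalized least-squares problem."""
    d = X.shape[1]
    A = X.T @ X + rho * np.eye(d)
    b = X.T @ y + rho * (z - u)
    return np.linalg.solve(A, b)


def admm_consensus(sites, lam=1.0, rho=50.0, n_iter=200):
    """Server-side loop: only coefficient vectors move between sites and server."""
    d = sites[0][0].shape[1]
    z = np.zeros(d)                     # consensus coefficients held by the server
    u = [np.zeros(d) for _ in sites]    # scaled dual variables, one per site
    for _ in range(n_iter):
        # Each site fits its own data while being pulled toward the consensus z.
        w = [local_update(X, y, z, u_i, rho) for (X, y), u_i in zip(sites, u)]
        # The server averages the site results and applies the ridge penalty lam.
        z = rho * sum(w_i + u_i for w_i, u_i in zip(w, u)) / (lam + len(sites) * rho)
        # Dual update: accumulate each site's remaining disagreement with z.
        u = [u_i + w_i - z for u_i, w_i in zip(u, w)]
    return z


def centralized(sites, lam=1.0):
    """Reference solution of the same objective on the pooled data."""
    X = np.vstack([X_s for X_s, _ in sites])
    y = np.concatenate([y_s for _, y_s in sites])
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def make_sites(sizes, seed=0):
    """Synthetic stand-in for per-clinic data: two features plus an intercept column."""
    rng = np.random.default_rng(seed)
    w_true = np.array([1.0, -0.5, 0.2])  # hypothetical generating coefficients
    sites = []
    for n in sizes:
        X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])
        y = np.where(X @ w_true + rng.normal(scale=0.5, size=n) >= 0, 1.0, -1.0)
        sites.append((X, y))
    return sites
```

Calling `admm_consensus(make_sites([40, 55, 30]))` and `centralized(...)` on the same sites should give nearly identical coefficient vectors. In this sketch `rho` changes how quickly the iterates agree, not the solution they converge to.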
Discrimination performance (AUC) obtained by learning an SVM on all sites and in a 5-fold CV in distributed learning (ADMM, following the formulation shown in Eqs. (4), (5), (6), (7), Appendix A).
| Train on | All | All except Maastricht | All except Eindhoven | All except Hasselt | All except Liège | All except Aachen | Mean over CV folds |
|---|---|---|---|---|---|---|---|
| Validate on | | Maastricht | Eindhoven | Hasselt | Liège | Aachen | |
| Training AUC | 0.63 | 0.61 | 0.60 | 0.64 | 0.62 | 0.64 | 0.62 |
| Validation AUC | | 0.58 | 0.77 | 0.57 | 0.72 | 0.64 | 0.66 |
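As a reading aid for the AUC values above, the following sketch computes an AUC as the fraction of (positive, negative) patient pairs that a score ranks correctly, and checks that the final value in the Validation AUC row (0.66) is consistent with the unweighted mean of the five per-site values. The helper name `auc` and the toy scores in the usage note are illustrative assumptions; the per-site numbers are copied from the table.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Validation AUC per held-out site, as reported for distributed learning.
validation_auc = {
    "Maastricht": 0.58,
    "Eindhoven": 0.77,
    "Hasselt": 0.57,
    "Liège": 0.72,
    "Aachen": 0.64,
}
mean_validation_auc = sum(validation_auc.values()) / len(validation_auc)
```

For example, `auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])` ranks three of the four positive-negative pairs correctly. The mean over sites is unweighted here, so a small site such as Hasselt counts as much as Maastricht.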
SVM coefficients learned by distributed and centralized learning.
| Trained on | Learning | | | | | |
|---|---|---|---|---|---|---|
| All | Distributed | 0.01 | −0.32 | −0.20 | −0.25 | −0.55 |
| | Centralized | 0.01 | −0.31 | −0.20 | −0.25 | −0.55 |
| All except Maastricht | Distributed | −0.03 | −0.31 | −0.20 | −0.29 | −0.51 |
| | Centralized | −0.02 | −0.31 | −0.20 | −0.29 | −0.51 |
| All except Eindhoven | Distributed | 0.01 | −0.28 | −0.06 | −0.33 | −0.48 |
| | Centralized | 0.02 | −0.28 | −0.06 | −0.33 | −0.48 |
| All except Hasselt | Distributed | 0.00 | −0.32 | −0.20 | −0.26 | −0.55 |
| | Centralized | 0.00 | −0.31 | −0.20 | −0.26 | −0.55 |
| All except Liège | Distributed | 0.00 | −0.31 | −0.20 | −0.25 | −0.55 |
| | Centralized | −0.01 | −0.31 | −0.20 | −0.26 | −0.55 |
| All except Aachen | Distributed | 0.00 | −0.34 | −0.19 | −0.24 | −0.53 |
| | Centralized | 0.00 | −0.34 | −0.19 | −0.24 | −0.53 |
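To make the agreement between the two approaches concrete, the short sketch below takes the coefficients from the table above and computes the largest absolute difference between the distributed and centralized fits for each training set. The coefficient order follows the table columns, whose labels are not reproduced here; the variable names are illustrative.

```python
# (distributed, centralized) coefficient vectors, copied from the table above.
coefficients = {
    "All": ([0.01, -0.32, -0.20, -0.25, -0.55], [0.01, -0.31, -0.20, -0.25, -0.55]),
    "All except Maastricht": ([-0.03, -0.31, -0.20, -0.29, -0.51], [-0.02, -0.31, -0.20, -0.29, -0.51]),
    "All except Eindhoven": ([0.01, -0.28, -0.06, -0.33, -0.48], [0.02, -0.28, -0.06, -0.33, -0.48]),
    "All except Hasselt": ([0.00, -0.32, -0.20, -0.26, -0.55], [0.00, -0.31, -0.20, -0.26, -0.55]),
    "All except Liège": ([0.00, -0.31, -0.20, -0.25, -0.55], [-0.01, -0.31, -0.20, -0.26, -0.55]),
    "All except Aachen": ([0.00, -0.34, -0.19, -0.24, -0.53], [0.00, -0.34, -0.19, -0.24, -0.53]),
}

# Largest absolute coefficient gap per training set.
max_abs_diff = {
    name: max(abs(d - c) for d, c in zip(dist, cent))
    for name, (dist, cent) in coefficients.items()
}
overall_max = max(max_abs_diff.values())
```

At the two-decimal precision reported, no coefficient differs by more than 0.01 between the two approaches, and the fits trained on all sites except Aachen are identical.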
Fig. 2. Convergence graphs of distributed ADMM solutions to centralized solutions for iterations. Vertical lines indicate the iterations in which internal convergence criteria were met in the euroCAT network. The data was created in local simulations. ‘∼’ indicates ‘Trained on all sites except’.
Discrimination performance (AUC) obtained by learning an SVM on all sites and in a 5-fold CV in centralized learning (solving the optimization problem shown in Eqs. (1), (2), (4), Appendix A).
| Train on | All | All except Maastricht | All except Eindhoven | All except Hasselt | All except Liège | All except Aachen | Mean over CV folds |
|---|---|---|---|---|---|---|---|
| Validate on | | Maastricht | Eindhoven | Hasselt | Liège | Aachen | |
| Training AUC | 0.63 | 0.61 | 0.60 | 0.63 | 0.61 | 0.64 | 0.62 |
| Validation AUC | | 0.58 | 0.77 | 0.59 | 0.72 | 0.64 | 0.66 |