Jin Li1, Yu Tian1, Runze Li1, Tianshu Zhou1, Jun Li2, Kefeng Ding2, Jingsong Li3. 1. Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China. 2. Department of Surgical Oncology, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China. 3. Engineering Research Center of EMR and Intelligent Expert System, Ministry of Education, College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China; Research Center for Healthcare Data Science, Zhejiang Lab, Hangzhou, China. Electronic address: ljs@zju.edu.cn.
Abstract
BACKGROUND AND OBJECTIVE: Prediction models for clinical decision support that are developed with data from a single medical institution usually face two challenges: limited clinical data and a lack of labels. Accordingly, research on multicenter clinical collaborative networks, which can provide external medical data, has received increasing attention. With machine learning techniques such as transfer learning increasingly available, leveraging large-scale patient data from multiple hospitals to build data-driven predictive models with clinical application potential offers an alternative solution to the problem of limited patient data.
METHODS: A multicenter hybrid semi-supervised transfer learning model (MHSTL) is proposed in this study. It is built on a unified common data model, which ensures a standardized representation of multicenter data. Hospital-specific features, along with the features that co-occur across domains, are then aligned through a representation learning architecture built on deep neural networks and the newly proposed neural decision forest model. Limited patient data from the target hospital, both labeled and unlabeled, are incorporated during feature adaptation, thereby contributing to better model performance. Without patient-level data sharing, the proposed learning strategy, which overcomes feature misalignment and distribution divergence, enables multi-source transfer learning when patient data at the target hospital are insufficient and unlabeled.
RESULTS: The effectiveness of the proposed transfer learning model was evaluated on a collaborative research network of colorectal cancer patients in the US and China. The results demonstrate that, with limited patient data, the proposed model predicts target risk much better than baseline models. Better discrimination and calibration are also observed in prognosis prediction tasks when sufficient labeled data are not available in the target hospital. Further exploratory experiments show that the proposed approach generalizes well regardless of data heterogeneity. Model interpretation with SHapley Additive exPlanations shows the effectiveness of incorporating hospital-specific features in the transfer learning model.
CONCLUSIONS: The proposed method can develop prediction models from multiple source hospitals and performs well by leveraging cross-domain hospital-specific feature information, thereby enhancing model prediction when applied to a single medical institution with limited patient data.
Keywords:
Clinical decision support systems; Data-limited settings; Distributed data mining; Model generalizability; Prognosis prediction; Transfer learning
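The abstract reports discrimination and calibration as the evaluation criteria but does not name the metrics used. As an illustration only, the sketch below computes two common choices, the area under the ROC curve (discrimination) and the Brier score (calibration-sensitive accuracy), in plain Python. The function names and the toy labels and probabilities are hypothetical and are not taken from the study.

```python
from typing import Sequence


def auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Concordant pairs count 1, tied pairs count 0.5, discordant pairs count 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    # Hypothetical outcomes (1 = event) and predicted risks for four patients.
    labels = [0, 0, 1, 1]
    risks = [0.10, 0.40, 0.35, 0.80]
    print(f"AUC:   {auc(labels, risks):.3f}")
    print(f"Brier: {brier_score(labels, risks):.3f}")
```

A higher AUC indicates better discrimination, while a lower Brier score indicates predicted risks that lie closer to the observed outcomes. The pairwise AUC is quadratic in the number of patients, which is acceptable for a sketch but not for large cohorts.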