Literature DB >> 36204043

Interpretable machine learning prediction of all-cause mortality.

Wei Qiu1, Hugh Chen1, Ayse Berceste Dincer1, Scott Lundberg2, Matt Kaeberlein3, Su-In Lee1.   

Abstract

Background: Unlike the linear models traditionally used to study all-cause mortality, complex machine learning models can capture non-linear interrelations and provide opportunities to identify unexplored risk factors. Explainable artificial intelligence can improve prediction accuracy over linear models and reveal new insights into outcomes like mortality. This paper comprehensively analyzes all-cause mortality by explaining complex machine learning models.
Methods: We propose the IMPACT framework, which uses an XAI technique to explain a state-of-the-art tree ensemble mortality prediction model. We apply IMPACT to understand all-cause mortality for 1-, 3-, 5-, and 10-year follow-up times within the NHANES dataset, which contains 47,261 samples and 151 features.
Results: We show that IMPACT models achieve higher accuracy than linear models and neural networks. Using IMPACT, we identify several overlooked risk factors and interaction effects. Furthermore, we identify relationships between laboratory features and mortality that may suggest adjusting established reference intervals. Finally, we develop highly accurate, efficient and interpretable mortality risk scores that can be used by medical professionals and individuals without medical expertise. We ensure generalizability by performing temporal validation of the mortality risk scores and external validation of important findings with the UK Biobank dataset.
Conclusions: IMPACT's unique strength is the explainable prediction, which provides insights into the complex, non-linear relationships between mortality and features, while maintaining high accuracy. Our explainable risk scores could help individuals improve self-awareness of their health status and help clinicians identify patients at high risk. IMPACT takes a consequential step towards bringing contemporary developments in XAI to epidemiology.
© The Author(s) 2022.

Keywords:  Computational biology and bioinformatics; Epidemiology; Prognostic markers

Year:  2022        PMID: 36204043      PMCID: PMC9530124          DOI: 10.1038/s43856-022-00180-x

Source DB:  PubMed          Journal:  Commun Med (Lond)        ISSN: 2730-664X


Introduction

Identification of risk factors and prediction of all-cause mortality have long been important issues in epidemiology. Most prior studies identify risk factors using associations between each predictor and mortality[1-3]; only a few papers use multivariate linear models to predict mortality and identify risk factors[4,5]. In terms of prediction, a variety of linear mortality risk scores have been proposed to help characterize unhealthy individuals[6-8]. Although linear models have historically been popular because they are interpretable, modern complex machine learning (ML) models often achieve higher predictive accuracy because they can capture interactions among variables in addition to non-linear relationships (e.g., “U-shaped” relationships). The field of artificial intelligence (AI) has seen considerable advances in supervised learning problems, which involve predicting an outcome variable (e.g., all-cause mortality) based on a set of features (e.g., individual-level characteristics). Notable applications of AI in healthcare include diabetic retinopathy detection in ophthalmology images[9], red blood cell classification[10], Alzheimer’s disease prediction[11], lung cancer classification from histopathology images[12], and skin cancer classification[13]. Despite this progress, a major obstacle to the adoption of AI applications in healthcare is that many of them are considered “black box,” which refers to their lack of interpretability. The inability to understand why a model makes a prediction is especially harmful in healthcare applications, where the patterns a model discovers can be even more important than its predictive accuracy. This is especially true in epidemiology, which aims to identify important variables to guide public health policy or detect risk predictors that warrant further study. To address this need, we turn to a variety of techniques from the emerging area of explainable AI (XAI)[14-16] to help us better understand complex ML models. 
In this paper, we present the IMPACT (Interpretable Machine learning Prediction of All-Cause morTality) framework (Fig. 1), which improves the interpretability of complex machine learning models for mortality prediction. We combine an accurate, complex ML model and a state-of-the-art XAI technique to predict all-cause mortality and conduct a systematic and integrated study of the relationships among many variables and all-cause mortality. We apply IMPACT to the NHANES (1999-2014) dataset to reveal important all-cause mortality findings. First, using explainable complex ML models rather than linear models, we identify risk predictors that are highly informative of future mortality. Second, our flexible models capture non-linear relationships, which provide more comprehensive information about the relationship between feature values and mortality risk: for example, the “inflection” points of risk predictors could provide a unique perspective on reference intervals that has consequential implications in public health. Third, understanding which features are the most important enables us to develop highly accurate, efficient (using fewer features) and interpretable mortality risk scores. Furthermore, the individualized explanation of risk scores can help users understand their most important risk factors and adjust their lifestyle. In Table 1, we compare the AUROCs of existing mortality scores and biological ages, as reported in their original papers, with those of the IMPACT-20 model tested for the corresponding follow-up times and age ranges in the NHANES dataset. We find that IMPACT risk scores (Supplementary Methods) have higher predictive power than popular mortality risk scores[5-8] and biological ages[17-20]. We ensure generalizability by performing temporal validation of the mortality risk scores and external validation of feature importances and important relationships with the UK Biobank dataset. 
All our results and risk scores are available on an interactive website (https://suinleelab.github.io/IMPACT) to encourage exploration of important risk predictors and support the use of interpretable individual risk scores for individuals with and without medical expertise. The IMPACT framework can also be applied to other health outcomes and diseases to improve the predictive accuracy and interpretability of complex ML models in epidemiological studies.
Fig. 1

Overview of the IMPACT model and analyses.

a We use the NHANES (1999-2014) dataset, which includes 151 variables and 47,261 samples. The variables can be categorized into four groups: demographics, examination, laboratory and questionnaire. We train the model using different follow-up times and different age groups. b IMPACT combines tree-based models with an explainable AI method. Specifically, IMPACT (1) trains tree-based models for mortality prediction using the NHANES dataset, and (2) uses TreeExplainer to provide local explanations for our models. c We illustrate the advantages of interpretable tree-based models compared to traditional linear models in epidemiological studies. d We further analyze all mortality models and demonstrate the effectiveness of IMPACT at verifying existing findings, identifying new discoveries, verifying reference intervals, obtaining individualized explanations, and comparing models using different follow-up times and age groups. e We propose a supervised distance to help us explore feature redundancy. We further develop a supervised distance-based feature selection method that helps us select predictive and less-redundant features. f We build mortality risk scores that are applicable to professional and non-professional individuals with different cost-vs-accuracy tradeoffs. The individualized explanations of IMPACT show the impact of each risk factor on the overall risk score.

Table 1

Comparison of the AUROC of each existing mortality score or biological age, as reported in its original paper, with that of the IMPACT-20 model tested for the corresponding follow-up time and age range in the NHANES dataset.

Score | Task | Age | AUROC | AUROC of IMPACT-20 | AUROC of IMPACT-20 (temporal validation)

Mortality risk scores
Intermountain[6] | 1-year mortality | 18+ | 0.84 | 0.92 | 0.88
Gagne Index[7] | 1-year mortality | 65+ | 0.79 | 0.85 | 0.85
Intermountain[6] | 5-year mortality | 18+ | 0.87 | 0.89 | 0.88
Prognostic score[5] | 5-year mortality | 40–70 | Male: 0.80; Female: 0.79 | Male: 0.85; Female: 0.83 | Male: 0.80; Female: 0.80
Schonberg Index[8] | 5-year mortality | 65+ | 0.75 | 0.80 | 0.83

Biological ages
Horvath DNAm Age[17,19] | 10-year mortality | 21–84 | 0.56 | 0.90 | 0.89
Hannum DNAm Age[18,19] | 10-year mortality | 21–84 | 0.57 | 0.90 | 0.89
DNAm PhenoAge[19] | 10-year mortality | 21–84 | 0.62 | 0.90 | 0.89
Phenotypic Age[19,67] | 10-year mortality | 20–85 | 0.88 | 0.90 | 0.89

The “AUROC” column shows the AUROCs reported in the original paper. The “AUROC of IMPACT-20” column shows the performance of IMPACT models trained with the selected top 20 features (Supplementary Tables 2 and 3). The “AUROC of IMPACT-20 (temporal validation)” column shows the performance of the IMPACT-20 models evaluated on the temporal validation set (Supplementary Methods).


Methods

Data cohorts

This study primarily focuses on NHANES[21-23] (http://www.cdc.gov/nchs/nhanes.htm) data based on samples collected between 1999 and 2014. We include demographic, laboratory, examination, and questionnaire features that could be automatically matched across different NHANES cycles. The National Center for Health Statistics Research Ethics Review Board approved all NHANES protocols, and all participants gave informed consent. After data preprocessing (Supplementary Methods), 47,261 samples with 151 features (Supplementary Data 1) remain. Follow-up mortality data are provided from the date of survey participation through December 31, 2015. We predict all-cause mortality for two broad categories: (1) follow-up times of 1 year, 3 years, 5 years, and 10 years, and (2) age groups of < 40, 40–65, 65–80, and ≥ 80 years old. For mortality prediction with different follow-up times, we use samples of all ages. For different age groups, we fix the follow-up time to predict 5-year mortality and divide all samples for 5-year mortality prediction into four sets based on age. The dataset is randomly divided into training (80%) and testing (20%) sets. Demographic characteristics and sample sizes of the data for the different tasks are shown in Supplementary Fig. 1 and Supplementary Table 1. Histograms of the samples’ ages in the different data collection cycles are shown in Supplementary Fig. 2. In addition, we use UK Biobank (https://www.ukbiobank.ac.uk/) samples as an external validation dataset. Ethics approval for the UK Biobank study was obtained from the North West - Haydock Research Ethics Committee (21/NW/0157). Informed consent was obtained from all UK Biobank participants (the consent form is available at https://www.ukbiobank.ac.uk/consent). For UK Biobank data, we include the 51 features that overlap (Supplementary Data 1) between the NHANES and UK Biobank datasets and have 384,762 samples with confirmed 5-year mortality status. 
All-cause mortality includes deaths occurring before May 2021. The UK Biobank dataset is likewise randomly divided into training (80%) and testing (20%) sets. More detail about the UK Biobank dataset is provided in Supplementary Methods and Supplementary Fig. 3.

IMPACT framework

To achieve high accuracy and explainable mortality prediction models, we developed the IMPACT framework (Fig. 1), which combines tree-based models and TreeExplainer[24]. To model all-cause mortality, we use gradient boosted trees (GBTs). GBTs are nonparametric models composed of iteratively trained decision trees. The final ensemble of trees can capture non-linear and interaction effects between predictors. The hyperparameters are chosen by grid search and 5-fold cross-validation (Supplementary Methods). Model performance is measured using the area under the receiver operating characteristic curve (AUROC). In our previous work, we introduced TreeExplainer[24], which provides a local (i.e., for each subject) explanation of the impact of input features on individual predictions for GBT models (Supplementary Methods). Specifically, TreeExplainer calculates exact SHAP[15] (SHapley Additive exPlanations) values for GBT models, which guarantee a set of desirable theoretical properties. SHAP values are additive; they sum to the model’s output, i.e., the log-odds for GBTs. They are also consistent, which means features that are unambiguously more important are guaranteed to have a higher SHAP value. Therefore, SHAP values are consistent and accurate calculations of each feature’s contribution to the model’s prediction. TreeExplainer also extends local explanations to capture pairwise feature interactions directly. In this work, we utilize TreeExplainer to conduct a systematic and integrated study of associations between a large number of variables and all-cause mortality. Here, higher SHAP values imply larger contributions to mortality risk. By showing the impact of each variable and the interactions among variables in local, sample-specific explanations, we can obtain a comprehensive understanding of why the model made a specific mortality prediction. 
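The additivity property described above can be illustrated with a brute-force Shapley value computation on a toy model. This is a minimal sketch for illustration only: the risk model f, the instance x, and the background data are all hypothetical, and TreeExplainer computes the same quantities for tree ensembles in polynomial time rather than by subset enumeration.

```python
from itertools import combinations
from math import factorial

def exact_shap(f, x, background):
    """Exact (interventional) Shapley values of f at instance x.

    v(S) averages f over the background with the features in S fixed
    to their values in x; phi[i] is feature i's weighted marginal
    contribution over all subsets of the remaining features.
    """
    n = len(x)

    def v(S):
        total = 0.0
        for z in background:
            mixed = [x[i] if i in S else z[i] for i in range(n)]
            total += f(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

# Hypothetical non-linear "risk model" with an interaction term.
f = lambda x: 0.5 * x[0] + 0.25 * x[0] * x[1]
background = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
x = [2.0, 1.0]
phi = exact_shap(f, x, background)
base = sum(f(z) for z in background) / len(background)
# Additivity: the attributions sum exactly to f(x) minus the base rate.
assert abs(sum(phi) - (f(x) - base)) < 1e-9
```

Because the attributions always sum to the difference between the prediction and the expected model output, each phi[i] can be read directly as that feature's contribution to the individual's predicted risk.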
The foreground samples and the SHAP values of the 1-, 3-, 5-, and 10-year mortality prediction models can be found in Supplementary Data 2–9. In addition to studying the relationships between risk factors and all-cause mortality, we further propose a technique, “relative risk percentage”, to identify sub-optimal reference intervals, and a metric, “supervised distance”, to measure feature redundancy and identify redundant groups of features given a specific prediction task. Building on supervised distance, we propose a recursive feature selection strategy to select feature sets that are both predictive and less redundant. Finally, we use recursive feature elimination to train accurate, efficient (low-cost), and interpretable mortality risk scores.

Supervised distance

Supervised distance and hierarchical clustering

Supervised distance can accurately measure feature redundancy based on a specific prediction task. To calculate the supervised distance between feature i and feature j, we first train a uni-variate GBT model to predict the label (e.g., 5-year mortality in our study) using feature i. We denote the output of this fitted uni-variate GBT as Prediction_i. Next, we fit another uni-variate GBT to predict Prediction_i using feature j, and denote the output of this new GBT as Prediction_{i←j}. All hyperparameter values of the uni-variate GBTs are set to their default values. Following the same steps with the roles of i and j swapped, we obtain Prediction_j and Prediction_{j←i}. The supervised distance between feature i and feature j is then defined as

    supervised distance(i, j) = 1/2 · [ mean((Prediction_i − Prediction_{i←j})²) / var(Prediction_i) + mean((Prediction_j − Prediction_{j←i})²) / var(Prediction_j) ],

where var(x) is the variance of the vector x and mean(x) is the average of the vector x. Supervised distance is scaled roughly between 0 and 1, where a distance of 0 means the features are perfectly redundant and 1 means they are completely independent. To explore redundant feature groups, we hierarchically cluster all features according to the supervised distance. Specifically, we use complete-linkage hierarchical clustering, which at each step merges the two clusters whose merger has the smallest diameter.
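The pairwise measure can be sketched in a few lines of Python. Two assumptions are made for illustration: a simple binned-mean regressor stands in for the uni-variate GBTs, and the residual-variance normalization is one plausible reading of the definition.

```python
import statistics

def binned_mean_fit(x, y, bins=10):
    """Crude uni-variate regressor (stand-in for a one-feature GBT):
    predict the mean of y within equal-width bins of x."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0
    bucket = lambda v: min(int((v - lo) / width), bins - 1)
    groups = {}
    for xv, yv in zip(x, y):
        groups.setdefault(bucket(xv), []).append(yv)
    means = {b: statistics.mean(vals) for b, vals in groups.items()}
    default = statistics.mean(y)
    return lambda v: means.get(bucket(v), default)

def supervised_distance(xi, xj, y, bins=10):
    """Near 0 when each feature's uni-variate prediction of the label
    can be reconstructed from the other feature; near 1 when it cannot."""
    def one_way(a, b):
        model_a = binned_mean_fit(a, y, bins)
        pred_a = [model_a(v) for v in a]              # Prediction_a
        model_ab = binned_mean_fit(b, pred_a, bins)   # predict Prediction_a from b
        recon = [model_ab(v) for v in b]              # Prediction_{a <- b}
        resid = statistics.mean((p - r) ** 2 for p, r in zip(pred_a, recon))
        return resid / max(statistics.variance(pred_a), 1e-12)
    return 0.5 * (one_way(xi, xj) + one_way(xj, xi))
```

A feature paired with a copy of itself scores 0 (perfectly redundant), while a feature paired with an unrelated one scores close to 1.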

Supervised distance-based feature selection

We propose a supervised distance-based feature selection method to select predictive and less-redundant feature sets. First, we fit a GBT for 5-year mortality prediction on all features using the training set and rank the features by mean absolute SHAP value from TreeExplainer. We cluster all features except age and gender into a specific number of groups using supervised distance-based hierarchical clustering and select the most important feature in each cluster. Then, we add age and gender to the selected feature set and re-fit the model. Next, we rerun the clustering on the new feature set, again excluding age and gender. This process is repeated, removing 5 features in every iteration, until all remaining features cluster into a single group. The models are evaluated on the testing set with 1000 bootstrap resamples. We report the average AUROC and the minimum supervised distance within the selected feature sets.
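The cluster-then-pick step can be sketched as follows: complete-linkage agglomerative clustering over a pairwise distance, keeping the most important feature per cluster. The importance and distance callables here are hypothetical placeholders for the mean-absolute-SHAP ranking and the supervised distance.

```python
def cluster_select(features, importance, distance, n_clusters):
    """Complete-linkage agglomerative clustering over pairwise feature
    distances, then keep the most important feature in each cluster."""
    clusters = [[f] for f in features]

    def linkage(a, b):
        # Complete linkage: largest distance between the two clusters.
        return max(distance(x, y) for x in a for y in b)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters whose merger has the smallest diameter.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)
    return [max(c, key=importance) for c in clusters]
```

In the full procedure this would run inside the iterative loop described above, with age and gender excluded from the clustering and re-added to the selected set afterwards.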

5-year mortality risk scores

IMPACT mortality risk scores are defined to be the predictions of the 5-year mortality prediction models. To compare with the Intermountain gender-specific risk scores, we evaluate the models on different gender groups: the models are trained on the whole training set and evaluated on each gender group in the testing set. Furthermore, considering the different feature collection costs for the general public and for medical professionals, we build the risk scores starting from different feature sets. For the general public, the models are trained on all demographic and questionnaire features, plus the examination features that are accessible at home. For medical professionals, the models are trained on all demographic and laboratory features. We implement recursive feature elimination to reduce the number of features included in the risk scores. Recursive feature elimination searches for a subset of features by starting with all features in the training dataset and successively removing features until the desired number remains. First, we train a model on the full dataset with all features. We then rank features by importance (mean absolute SHAP value) and remove the least important ones. Another model is trained on the resulting feature set, and the process iterates, removing 5 features in each iteration, until only the desired number of features remains. We bootstrap the test set 1000 times, assess predictive performance, and report the average AUROC for the selected feature sets.
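The elimination loop can be sketched generically. Here train_and_rank is a hypothetical callable that trains a model on the given features and returns them sorted from most to least important (e.g., by mean absolute SHAP value).

```python
def recursive_feature_elimination(features, train_and_rank, n_keep, step=5):
    """Drop the `step` least-important features per iteration until
    only `n_keep` remain, retraining after each removal."""
    feats = list(features)
    while len(feats) > n_keep:
        ranked = train_and_rank(feats)            # most -> least important
        n_drop = min(step, len(feats) - n_keep)   # never overshoot n_keep
        feats = ranked[:len(feats) - n_drop]
    return feats
```

Retraining after every removal matters: a feature's importance can change once correlated features are dropped, so a single one-shot ranking would not yield the same subset.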
Table 2

Providing additional perspective to laboratory reference intervals.

Feature | Reference Interval | RRP (1-year) | RRP (3-year) | RRP (5-year) | RRP (10-year)
Gamma glutamyl transferase | 0–30 U/L | 16.93% | −4.57% | −0.97% | −6.04%
Globulin, serum | 20–35 g/L | 5.39% | 7.95% | 14.73% | 4.59%
Lymphocyte percent | 20%–40% | 15.63% | 7.02% | 6.55% | 10.81%
Blood urea nitrogen (Male) | 2.86–8.57 mmol/L | 8.12% | 2.92% | 8.02% | 21.08%
Blood urea nitrogen (Female) | 2.14–7.50 mmol/L | −0.15% | 3.07% | 0.40% | 12.16%
Albumin, serum | 35–50 g/L | 28.56% | 49.70% | 59.77% | 93.48%
Blood lead | 0–0.48 µmol/L | 100.00% | 94.71% | 100.00% | 100.00%
Mean cell volume | 80–100 fL | 82.80% | 75.82% | 83.92% | 57.26%
Alanine aminotransferase ALT (Male) | 7–55 IU/L | 100.00% | 100.00% | 100.00% | 100.00%
Alanine aminotransferase ALT (Female) | 7–45 IU/L | 100.00% | 100.00% | 100.00% | 100.00%

The table lists the reference interval and relative risk percentage (RRP) of the selected laboratory features. RRP measures the relative risk of the feature values within the reference interval compared to the relative risk across all values. A higher RRP indicates that the current reference interval is relatively less well aligned with mortality risk. A negative value indicates that the reference interval of that laboratory feature is optimal with respect to mortality risk, whereas a value of 100% suggests that the reference interval may be sub-optimal.

References (57 in total; 10 shown)

1.  Establishing reference intervals for clinical laboratory test results: is there a better way?

Authors:  Alex Katayev; Claudiu Balciza; David W Seccombe
Journal:  Am J Clin Pathol       Date:  2010-02       Impact factor: 2.493

2.  5 year mortality predictors in 498,103 UK Biobank participants: a prospective population-based study.

Authors:  Andrea Ganna; Erik Ingelsson
Journal:  Lancet       Date:  2015-06-03       Impact factor: 79.321

3.  National health and nutrition examination survey: sample design, 2011-2014.

Authors:  Clifford L Johnson; Sylvia M Dohrmann; Vicki L Burt; Leyla K Mohadjer
Journal:  Vital Health Stat 2       Date:  2014-03

4.  Association between serum albumin and mortality from cardiovascular disease, cancer, and other causes.

Authors:  A Phillips; A G Shaper; P H Whincup
Journal:  Lancet       Date:  1989-12-16       Impact factor: 79.321

5.  The effectiveness of BMI, calf circumference and mid-arm circumference in predicting subsequent mortality risk in elderly Taiwanese.

Authors:  Alan C Tsai; Tsui-Lan Chang
Journal:  Br J Nutr       Date:  2010-12-06       Impact factor: 3.718

6.  J-shaped mortality relationship for uric acid in CKD.

Authors:  Mohamed E Suliman; Richard J Johnson; Elvia García-López; A Rashid Qureshi; Hadi Molinaei; Juan Jesús Carrero; Olof Heimbürger; Peter Bárány; Jonas Axelsson; Bengt Lindholm; Peter Stenvinkel
Journal:  Am J Kidney Dis       Date:  2006-11       Impact factor: 8.860

7.  Serum albumin level and physical disability as predictors of mortality in older persons.

Authors:  M C Corti; J M Guralnik; M E Salive; J D Sorkin
Journal:  JAMA       Date:  1994-10-05       Impact factor: 56.272

8.  Low mid-upper arm circumference, calf circumference, and body mass index and mortality in older persons.

Authors:  Hanneke A H Wijnhoven; Marian A E van Bokhorst-de van der Schueren; Martijn W Heymans; Henrica C W de Vet; Hinke M Kruizenga; Jos W Twisk; Marjolein Visser
Journal:  J Gerontol A Biol Sci Med Sci       Date:  2010-06-13       Impact factor: 6.053

9.  An epigenetic biomarker of aging for lifespan and healthspan.

Authors:  Morgan E Levine; Ake T Lu; Austin Quach; Brian H Chen; Themistocles L Assimes; Stefania Bandinelli; Lifang Hou; Andrea A Baccarelli; James D Stewart; Yun Li; Eric A Whitsel; James G Wilson; Alex P Reiner; Abraham Aviv; Kurt Lohman; Yongmei Liu; Luigi Ferrucci; Steve Horvath
Journal:  Aging (Albany NY)       Date:  2018-04-18       Impact factor: 5.682

10.  DNA methylation GrimAge strongly predicts lifespan and healthspan.

Authors:  Ake T Lu; Austin Quach; James G Wilson; Alex P Reiner; Abraham Aviv; Kenneth Raj; Lifang Hou; Andrea A Baccarelli; Yun Li; James D Stewart; Eric A Whitsel; Themistocles L Assimes; Luigi Ferrucci; Steve Horvath
Journal:  Aging (Albany NY)       Date:  2019-01-21       Impact factor: 5.682

