Actionable absolute risk prediction of atherosclerotic cardiovascular disease based on the UK Biobank.

Ajay Kesar¹, Adel Baluch¹, Omer Barber¹, Henry Hoffmann¹, Milan Jovanovic¹, Daniel Renz¹, Bernard Leon Stopak¹, Paul Wicks¹, Stephen Gilbert^1,2.

Abstract

Cardiovascular diseases (CVDs) are the leading cause of death globally. Timely and accurate identification of people at risk of developing an atherosclerotic CVD and its sequelae is a central pillar of preventive cardiology. One widely used approach is risk prediction models; however, currently available models consider only a limited set of risk factors and outcomes, yield no actionable advice to individuals based on their holistic medical state and lifestyle, are often not interpretable, were built with small cohort sizes or are based on lifestyle data from the 1960s, e.g. the Framingham model. The risk of developing atherosclerotic CVDs is heavily lifestyle dependent, potentially making many occurrences preventable. Providing actionable and accurate risk prediction tools to the public could assist in atherosclerotic CVD prevention. Accordingly, we developed a benchmarking pipeline to find the best set of data preprocessing and algorithms to predict absolute 10-year atherosclerotic CVD risk. Based on the data of 464,547 UK Biobank participants without atherosclerotic CVD at baseline, we used a comprehensive set of 203 consolidated risk factors associated with atherosclerosis and its sequelae (e.g. heart failure). Our two best performing absolute atherosclerotic risk prediction models achieved higher performance (AUROC: 0.7573, 95% CI: 0.755-0.7595 and AUROC: 0.7544, 95% CI: 0.7522-0.7567) than Framingham (AUROC: 0.680, 95% CI: 0.6775-0.6824) and QRisk3 (AUROC: 0.725, 95% CI: 0.7226-0.7273). Using a subset of 25 risk factors identified with feature selection, our reduced model achieves similar performance (AUROC: 0.7415, 95% CI: 0.7392-0.7438) while being less complex. Further, it is interpretable, actionable and highly generalizable. The model could be incorporated into clinical practice and might allow continuous personalized predictions with automated intervention suggestions.

Year: 2022 PMID: 35148360 PMCID: PMC8836294 DOI: 10.1371/journal.pone.0263940

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Globally, cardiovascular diseases (CVDs) are the leading cause of death [1, 2]. In 2016, 17.9 million people died of CVDs alone, accounting for 31% of all global deaths [1]. The direct costs of CVDs in the US for 2010 were $272.5b whereas indirect costs were $171.7b and are expected to increase to $818.1b and $275.8b in 2030 respectively [3, 4]. Atherosclerosis alone is responsible for 1.3% of all hospital stays with costs of $9b per year, while all atherosclerosis-related diseases amount to $43.5b of hospital costs annually [5]. Individually, patients with CVD incur more than twice the medical costs of age- and sex-matched patients without CVD, largely because of the increased likelihood of subsequent hospitalizations. The greatest differences in total CVD costs usually occur when comparing patients with and without a secondary CVD hospitalization [6]. All current guidelines on the prevention of CVD in clinical practice recommend the assessment of total CVD risk since atherosclerosis is usually the product of a number of risk factors [7, 8] and in recent years these guidelines have evolved to focus on the absolute risk of disease as opposed to relative risk [7-10]. Clinician tools for CVD risk estimation must enable rapid and accurate estimation of an individual patient’s absolute CVD risk [7], or support opportunistic screening of high-risk patients from relevant populations [11]. Screening is the identification of unrecognized disease or risk of disease in individuals without symptoms. In addition to opportunistic screening, which is carried out without a predefined strategy (e.g. when the individual is consulting a general practitioner (GP) for some other reason), tools can be used for systematic screening, which is centrally organized strategic screening in the general population or in targeted subpopulations, such as subjects with a family history of premature CVD or familial hyperlipidaemia [7].
There is ongoing debate on the role of systematic centralized population based screening in CVD [10, 12] because of burdensome diagnostic testing following the use of risk based screening tools [13]. A relatively new area of screening is self-screening, carried out by proactive individuals using screening tools on mobile devices such as smartphones or smartwatches, which may use built-in app-linked sensors or screening chat-bots [14-16]. There is public demand for reliable, actionable, explainable and usable health information tools [17], including for disease screening. The risk of building up atherosclerotic plaque varies and is determined by multiple factors such as genetics, environment and lifestyle [11, 18–21]. The risk of developing atherosclerotic plaque can be reduced based on an individual’s behavioral risk factors, such as smoking, physical activity and nutrition [1, 11, 19, 20]. Most diseases, including atherosclerotic CVDs, have a complex pathophysiology that involves multiple interacting molecular systems, making it insufficient to look only at an isolated biological pathway or a subset of markers to predict disease risk [22]. A precision medicine based approach is required, where multiple biological layers are considered (i.e., ‘multi-omics’), alongside clinical and lifestyle data [22]. Such an approach has the potential to capture all important interactions or correlations detected between molecules in different biological layers, providing a holistic understanding of an individual’s current health status and enabling the quantification of an individual’s absolute risk of atherosclerotic CVDs [23, 24]. Previous studies in this area use a limited set of risk factors and outcomes for their analyses [7, 25, 26]. In recent years, the knowledge of behavioral risk factors and of the pathophysiology of atherosclerotic CVDs has advanced tremendously [11, 25].
Current absolute risk prediction models have limited predictive capability as they have not been trained on all possible atherosclerotic CVD outcomes [27-29], or they include outcomes which are unmodifiable such as those related to pregnancy, accidents, or congenital factors [29]. Both SCORE (Systematic COronary Risk Evaluation) and SCORE2 [30, 31] are models for predicting relative CVD risk, whereas we focus on predicting absolute CVD risk, which is why we chose to omit those models from our analysis. Another related investigation, which also used the UK Biobank (UKB) dataset, developed multiple Cox Proportional Hazard models for 10-year CVD risk prediction, with a reduced version requiring 47 risk factors and another version disregarding all cholesterol risk factors as well as systolic blood pressure, in order to provide a simple approach for risk prediction in remote settings with limited testing resources [32]. However, survival models such as the proportional hazard model are not designed to provide absolute risk estimates for individual patients. Machine learning (ML) based approaches have many advantages compared to humans or standard statistical algorithms, such as superior performance, being able to identify complex non-linear patterns, the ability to encode diverse and high dimensional data types, being more stable to outliers, allowing continuous model updates, versatility for different domains and scalability [33-36]. However, classic disadvantages of ML based approaches are their lack of interpretability, the risk of inherent bias from the data used, and the difficulty of explaining to physicians why a new risk model might be superior to existing ones, all of which hinder widespread adoption of ML based risk prediction models [36, 37].
One example for ML based CVD risk prediction is the AutoPrognosis based approach, where an ensemble of multiple ML pipelines has also been applied on the UK Biobank dataset for 5-year CVD risk prediction [29]. Further, using a purely ML-driven approach can lead to a model that requires too many risk factors to compute risk, which is infeasible for routine clinical check-ups. Another disadvantage of purely data-driven approaches is the inclusion of risk factors which might show strong correlations but are unrelated to the pathophysiology of CVDs or are not actionable, making them inapplicable in a clinical setting or as an actionable self-management tool [29]. The aim of this study was to use a large-data ML approach to develop an actionable absolute risk prediction tool which considers the holistic health of an individual. Uniquely, we focused on behavioral risk factors relating to all atherosclerotic CVD outcomes. Our goal was to have a holistic understanding of an individual’s current health status, to better quantify their risk of atherosclerotic CVDs, and to provide actionable advice. Our approach is novel in that we employ a highly holistic understanding of an individual’s current health status, to better quantify their risk of all atherosclerotic CVDs. By utilizing a comprehensive set of lifestyle factors, we enable the subsequent suggestion of personalized and actionable advice relating to unhealthy risk factors. Instead of using only a limited set of risk factors, we aimed to achieve this by taking multiple biological layers into account, which include: (i) multi-omics data from blood samples (e.g. lipidome and proteome); (ii) family history (e.g. genome), (iii) lifestyle data, (iv) clinical data and (v) environmental data; along with (vi) an extensive set of risk factors and outcomes. We used data from 464,547 participants of the UK Biobank study who did not have atherosclerotic CVD at their baseline visit. 
We created an automated pipeline to benchmark risk prediction classifier algorithms against each other, then evaluated their predictive performances in the overall population and tested the generalizability of the top-performing classifiers through retraining and testing on different sub-populations. We explored the clinical implications of the proposed classifiers, with a focus on the top-performing models. This study does not focus on the algorithmic aspects of the utilized classifiers. Methodological details on the utilized classifiers can be found in the open-source documentation of the respective algorithms of the scikit-learn [38] and xgboost [39] libraries and in the supporting information (S4 Table).

Materials and methods

Baseline data from the UK Biobank was utilized to extract an extensive set of risk factors and outcomes associated with the pathophysiology of atherosclerotic CVDs. A benchmarking pipeline was used to train and evaluate different standard and ML algorithms for the task of 10-year atherosclerotic CVD risk prediction. The performance was measured using AUROC and compared against the baseline models Framingham and QRisk3, which are widely used and recommended models. We evaluated our best performing models further by analyzing the most informative features and assessed model generalizability and created a reduced model.

Study design and participants

The UK Biobank is a long-term prospective large-scale biomedical database including over 500,000 participants aged 40–69 years (when recruited between 2006 and 2010). The database is globally accessible to approved researchers undertaking research into the most common and life-threatening diseases and continuously collects phenotypic and genotypic data about its participants, including data from questionnaires, physical measures, blood, urine and saliva samples, and lifestyle data [40]. This data is further linked to each participant’s health-related records, accelerometry, multimodal imaging, genome-wide genotyping and longitudinal follow-up data for a wide range of health-related outcomes [40, 41]. The UK Biobank study protocol is available online [42]. The North West Multi-Centre Research Ethics Committee approved the UK Biobank study and all participants provided written informed consent prior to study enrollment. Our research is covered by the UK Biobank’s Generic Research Tissue Bank (RTB) Approval and was approved by the UK Biobank Access Management Team [43]. We excluded participants with atherosclerotic CVDs present before or during baseline, participants who chose to leave the UKB study and participants who were lost for various reasons. The resulting cohort consisted of 464,547 participants. The last available date of participant follow-up was March 5th, 2020.

Risk factor definition

We curated a list of all generally known risk factors and outcomes for atherosclerotic CVDs from the medical literature and from validated risk prediction models. This preliminary list of risk factors was reduced through curation to focus on those factors that were clearly involved in the pathophysiology of atherosclerosis and those that are modifiable through behavioral change. The curation was carried out by three medical doctors with experience in diagnosing or scientifically modelling cardiovascular diseases. We consolidated all relevant UKB columns into 203 risk factors and grouped them into six categories: demographics (e.g. age, biological sex, ethnicity), biomarkers (e.g. cholesterol, glucose, blood pressure, heart rate), lifestyle (e.g. alcohol consumption, smoking, physical activity, sleep, social visits), environment (e.g. exposure to tobacco smoke, work and housing and other socio-economic related factors), genetics (e.g. family history of CVD, stroke, diabetes, high cholesterol, high blood pressure) and comorbidities (e.g. heart arrhythmias, diabetes, acute & chronic kidney injury, migraines, rheumatoid arthritis, systemic lupus erythematosus, severe mental illnesses (schizophrenia, bipolar disorder, depression, psychosis), diagnosis or treatment of erectile dysfunction, atypical antipsychotic medication). A categorized list of all risk factors used in our analysis is provided in the supplementary data (S1 Table).

Outcome definition

In the same manner as described above, an initial list of atherosclerotic CVDs was further reviewed and curated by the same team of medical doctors. All resulting CVDs of interest are associated with atherosclerotic plaque build-up, are modifiable and relate to the collected risk factors only. Thus, we disregard brain haemorrhages due to accidents and congenital and pregnancy-related CVDs, which are not actionable. The curated list of all ICD-10 and ICD-9 outcomes meeting the above criteria consists of 193 total (125 unique) CVD outcomes, e.g. coronary/ischaemic heart disease, heart attack, angina, stroke, cardiac arrest, congestive heart failure, left ventricular failure, myocardial infarction, aortic valve stenosis, cerebral artery occlusions, nontraumatic haemorrhages. A list with all outcome codes used in our analysis is provided in the supplementary data (S2 Table). An atherosclerotic CVD event was defined as the first occurrence of any of the atherosclerotic CVD outcome diagnosis codes, including as a primary or secondary cause of death, during the 10-year follow-up period.

Cohort follow-up

Follow-up time was set to 10 years, as commonly used in other risk models (see Table 2 in [7]), and counted from the date of the initial assessment center visit. Individuals who died from other causes during their follow-up period, or who had a relevant CVD event after their individual follow-up period, were marked as not having had a relevant CVD event.
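
The labelling rule above can be illustrated with a minimal sketch. The function and variable names are hypothetical, and treating the window as open at baseline and closed at its end date is an assumption, since the text does not specify boundary handling. A participant who died from another cause without a relevant CVD event would be passed in with no event date.

```python
from datetime import date
from typing import Optional

FOLLOW_UP_YEARS = 10  # follow-up window stated in the text


def follow_up_end(baseline: date) -> date:
    """Date FOLLOW_UP_YEARS after baseline (29 Feb falls back to 28 Feb)."""
    try:
        return baseline.replace(year=baseline.year + FOLLOW_UP_YEARS)
    except ValueError:  # baseline was 29 Feb and the target year is not a leap year
        return baseline.replace(year=baseline.year + FOLLOW_UP_YEARS, day=28)


def label_event(baseline: date, first_cvd_event: Optional[date]) -> int:
    """Return 1 if a relevant CVD event falls inside the follow-up window, else 0.

    Assumption: the window is (baseline, baseline + 10 years]; events after
    the window, or no event at all, are labelled 0.
    """
    if first_cvd_event is None:
        return 0
    return int(baseline < first_cvd_event <= follow_up_end(baseline))
```

For example, a baseline visit on 1 March 2008 with a first event on 1 May 2012 would be labelled 1, while the same baseline with a first event on 1 January 2019 would be labelled 0.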

Models used in comparison

Framingham risk score

The Framingham 10-year CVD absolute risk score is based on the data of the two prospective studies, the Framingham Heart Study and the Framingham offspring study [27]. The cohort consists of 8491 participants, with 4522 women and 3969 men who attended a baseline examination between 30 and 74 years of age and were free of CVD. A positive CVD outcome was defined as any of the following: coronary death, myocardial infarction, coronary insufficiency, angina, ischemic stroke, hemorrhagic stroke, transient ischemic attack, peripheral artery disease and heart failure. Participants were followed up for 12 years where 1174 participants developed a CVD. Two biological sex-specific risk models were derived, with one model using lipid measurements and the other one Body Mass Index (BMI). The variables used were biological sex, age, total cholesterol, HDL cholesterol, treated and untreated systolic blood pressure, smoking status and diabetes status. The Framingham risk calculators and model coefficients are publicly available [44]. We imputed missing data using simple mean imputation.

QRisk3

The QRisk3 10-year CVD absolute risk score is based on a prospective open cohort study using data from general practices (GPs), mortality and hospital records in England [28]. The cohort consists of 10.56 million patients between the age of 25 and 84 years, where 75% of the patients were used for training and 25% for validation. Patients with a pre-existing CVD, missing Townsend score or using statins were removed from the baseline. Patients were classified as having a positive CVD outcome when any of the following outcomes was present during follow-up in the GP, hospital or mortality records: coronary heart disease, ischaemic stroke, or transient ischaemic attack. QRisk3 used the following ICD-10 codes: G45 (transient ischaemic attack and related syndromes), I20 (angina pectoris), I21 (acute myocardial infarction), I22 (subsequent myocardial infarction), I23 (complications after myocardial infarction), I24 (other acute ischaemic heart disease), I25 (chronic ischaemic heart disease), I63 (cerebral infarction), and I64 (stroke not specified as haemorrhage or infarction). The utilized ICD-9 codes were: 410, 411, 412, 413, 414, 434, and 436. Participants were followed up for 15 years, during which 363,565 participants of the training set (4.6%) developed a relevant CVD. One biological sex-specific risk model was derived. The risk factors used in the final model were age, ethnicity, deprivation, systolic blood pressure, BMI, total cholesterol/HDL cholesterol ratio, smoking status, family history of coronary heart disease, diabetes status, treated hypertension, rheumatoid arthritis, atrial fibrillation, chronic kidney disease, systolic blood pressure variability, diagnosis of migraine, corticosteroid use, systemic lupus erythematosus, atypical antipsychotic use, diagnosis of severe mental illnesses, diagnosis or treatment of erectile dysfunction.
The QRisk3 risk calculator and model coefficients are publicly available [45], built into all major NHS GP systems and included in the UK’s national guidelines (https://www.healthcheck.nhs.uk/seecmsfile/?id=1687, accessed 10th November 2021). We imputed missing data using simple mean imputation.

Standard linear and ML models

Since the introduction of the classic CVD risk prediction methods, the field of supervised machine learning has developed from classical statistics with the sole purpose of maximizing predictive accuracy with modern statistical methods. Therefore, in addition to using standard linear models, we tested the major ML approaches, covering a wide spectrum of the possible ML design space, to evaluate which model type performs best for our task. Based on our initial benchmarking pipeline results, we focused on reporting the results of the initially best performing models: logistic regression, random forest and XGBoost. We compared regularized logistic regression (with L1 penalty), random forests and gradient boosting (xgboost implementation) for assessing the highest achievable Area Under the Receiver Operating Characteristic Curve (AUROC) value, which we used for assessing the trade-off between number of features and predictive performance of several simpler practical risk predictors, as determined by an iterative feature elimination procedure outlined below. L1 regularization for logistic regression implements a strong penalty for non-zero feature weights, resulting in a feature selection procedure that discards features that are likely to be non-predictive. Random Forest is an ensemble method that fits many decision trees independently to a subset of the data. We implemented both methods using their scikit-learn library implementation. Finally, we evaluated Extreme Gradient Boosting. Gradient boosting is an ensemble tree-based machine learning method that combines many weak classifiers to produce a stronger one. It sequentially fits a series of classification or regression trees, with each tree created to predict the outcomes misclassified by the previous tree [46]. By sequentially predicting residuals of previous trees, the gradient boosting process has a focus on predicting more difficult cases and correcting its own shortcomings.
Extreme Gradient Boosting (XGB / XGBoost) is a specific implementation of the gradient boosting process, and uses memory-efficient algorithms to improve computational speed and model performance [39, 47]. For completeness, we briefly evaluated a number of other standard classifiers, but discarded them due to excessive computational complexity or inferior performance so we do not report their performances here: Decision Trees [48], Voting Classifiers, Multi-Layer Perceptrons with 2 layers and 200 and 150 neurons each (Neural Network) [49], stochastic gradient descent implementing a support vector machine algorithm [50, 51], Ada Boost [52, 53], Gradient Boosting [46], K Neighbors [54], Quadratic Discriminant Analysis [55] and Gaussian Naive Bayes [38, 56].
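
The idea of sequentially fitting residuals can be shown with a toy sketch. This is not the XGBoost algorithm used in the study: it is a deliberately simplified squared-error booster on a single feature, with one-split trees (stumps), and all names and parameter values are illustrative assumptions.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single threshold on 1-D x that best fits the residual (squared error)."""
    best = None
    for t in np.unique(x)[:-1]:  # skipping the largest value keeps the right side non-empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    pred = np.full(len(y), y.mean(), dtype=float)
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)      # each new tree targets what is still unexplained
        pred += learning_rate * stump(x)    # shrink each tree's contribution
    return pred
```

Each round reduces the remaining error on the cases the current ensemble predicts worst, which is the behaviour the paragraph above describes; production libraries add regularization, second-order gradients and multi-feature trees on top of this basic loop.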

Model development and benchmarking using pipeline

We built a benchmarking pipeline for automated and reproducible data extraction, normalization, imputation, model training, tuning of model hyperparameters, classification, documentation and reporting. We implemented all models using their respective scikit-learn library or xgboost library implementation using the Python programming language [38, 39]. Details on the used Python libraries, methods and parameters are provided in the supplementary data (S3 and S4 Tables). Categorical values were one-hot encoded. Data normalization was performed by removing the mean and scaling to unit variance. Data imputation was performed for all models using a simple mean imputation. The models’ hyper-parameters were determined using grid search and stratified k-fold cross validation using 3 folds was employed to avoid overfitting. Finally, we assessed model performance mainly using the AUROC. Fig 1 visualizes an overview of all performed steps of our experimental setup.
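
A minimal sketch of such a pipeline with scikit-learn is given below. It is not the authors' code: the column names, the toy data, the hyperparameter grid and the choice of L1 logistic regression as the single classifier are assumptions made for illustration, as is the order of mean imputation before scaling. The steps themselves (one-hot encoding, mean imputation, standardization, grid search, stratified 3-fold cross-validation, AUROC scoring) follow the description above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_search(numeric_cols, categorical_cols):
    """Preprocessing + classifier wrapped in a grid search scored by AUROC."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),  # simple mean imputation
        ("scale", StandardScaler()),                 # zero mean, unit variance
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    model = Pipeline([
        ("preprocess", preprocess),
        ("clf", LogisticRegression(penalty="l1", solver="liblinear")),
    ])
    return GridSearchCV(
        model,
        param_grid={"clf__C": [0.01, 0.1, 1.0]},  # illustrative grid, not from the paper
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    )


def make_toy_data(n=600, seed=0):
    """Synthetic stand-in for the cohort; column names are hypothetical."""
    rng = np.random.default_rng(seed)
    age = rng.normal(55, 8, n)
    sbp = rng.normal(135, 15, n)
    sbp[rng.random(n) < 0.1] = np.nan  # simulate missing measurements
    pace = rng.choice(["slow", "steady", "brisk"], n)
    risk = 1 / (1 + np.exp(-(0.15 * (age - 55) - 1)))
    y = (rng.random(n) < risk).astype(int)
    X = pd.DataFrame({"age": age, "sbp": sbp, "walking_pace": pace})
    return X, y
```

Calling `build_search(["age", "sbp"], ["walking_pace"]).fit(*make_toy_data())` runs the whole chain; keeping preprocessing inside the pipeline means imputation and scaling statistics are re-estimated within each training fold, which avoids leaking information from the validation fold.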

Fig 1

Overview of experimental setup of proposed approach.

Iterative feature elimination

We employed an iterative feature elimination procedure based on the regularized logistic regression for finding the best trade-off between predictive performance and number of risk factors, with the aim of creating a risk prediction algorithm that is applicable in the clinical context. We used the standard L1 regularization (also known as Lasso) proposed by [57]; it implements a strong penalty on non-zero feature weights of our logistic regression model, resulting in a sparse feature set for prediction. A logistic regression coefficient value β can be interpreted as the expected change in log odds of having the outcome per unit change in the feature x. Therefore, increasing the feature by one unit multiplies the odds of having the outcome by e^β. This means that we can interpret the coefficients as feature importance values in the sense that the feature with the smallest absolute coefficient has the least influence on model predictions. Importantly, this holds true only in the context of the parameters contained in the current model. Thus, we re-estimated the model after each feature elimination round. In each iteration, we re-estimated the logistic regression model on the remaining parameters, and then discarded all parameters that were set to zero by the L1 regularization; finally, we also discarded the parameter with the lowest non-zero absolute value. As an additional step, we created a ranking of the relative feature importance value of each feature by dividing its absolute coefficient weight by the sum of all absolute coefficient weights.
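
The elimination loop can be sketched as follows. This is an illustration under stated assumptions rather than the study's implementation: the function name, the regularization strength `C`, the `liblinear` solver and the stopping rule `min_features` are all hypothetical, and the features are assumed to be standardized so that coefficient magnitudes are comparable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def iterative_l1_elimination(X, y, names, C=1.0, min_features=1):
    """Refit an L1 logistic regression each round, dropping zeroed features
    plus the feature with the smallest non-zero absolute coefficient.

    Returns one dict per round mapping feature name to relative importance
    (absolute coefficient divided by the sum of absolute coefficients).
    min_features must be at least 1.
    """
    remaining = list(range(X.shape[1]))
    rounds = []
    while True:
        model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
        model.fit(X[:, remaining], y)
        weights = np.abs(model.coef_.ravel())
        total = weights.sum()
        rounds.append({
            names[i]: (w / total if total > 0 else 0.0)
            for i, w in zip(remaining, weights)
        })
        # keep only features the L1 penalty did not set to zero
        nonzero = sorted((w, i) for w, i in zip(weights, remaining) if w > 0)
        if len(nonzero) <= min_features:
            break
        # additionally discard the weakest surviving feature, then refit
        remaining = sorted(i for _, i in nonzero[1:])
    return rounds
```

Tracking a held-out AUROC alongside each round would give the kind of performance-versus-feature-count curve discussed in the results; that evaluation step is omitted here to keep the sketch short.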

Statistical analysis

To reduce overfitting, we evaluated the classification performance of all our benchmarked algorithms by using 3-fold stratified cross-validation and measuring the Area Under the Receiver Operating Characteristic Curve. For the cross-validation, we used a training set with 325,182 participants to train and derive our standard linear and ML models and then assessed the AUROC performance on the held-out test set with 139,365 participants using 203 risk factors respectively. We reported the AUROC and the 95% confidence intervals (Wilson score intervals) for all models and performed a sensitivity analysis using Shapley Additive Explanations (SHAP values) for the best performing linear model.
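
For reference, a Wilson score interval can be computed as in the sketch below. Treating the AUROC as a proportion and using the size of the held-out test set as the sample size are assumptions; the text names the interval type but does not state which sample size was used.

```python
import math
from typing import Tuple


def wilson_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion p observed on n samples.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width
```

Under these assumptions, `wilson_interval(0.7573, 139_365)` gives bounds of roughly 0.7550 and 0.7595, which is close to the interval reported for the best performing model.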

Generalizability

With 442,620 out of the 502,551 participants in the UK Biobank, the cohort has a high proportion (88.1%) of participants with British White ethnicity. In an effort to estimate a proxy for out-of-sample generalizability, we re-trained the two best models, XGB and logistic regression with L1 regularization, only on White participants and tested their performance on a non-White test set. The White-only training set consists of 378,836 participants (81.5%). The non-White test set consists of 85,711 participants (18.5%).

Results

Characteristics of the training and test populations

Of the 502,551 participants in the UK Biobank, we excluded 7.6%: those who had already experienced a relevant CVD outcome (during or before baseline) and those who were lost or withdrew from the biobank. This resulted in 464,547 participants who met the inclusion criteria. Of those participants, 28,561 (6.1%) developed at least one of the relevant CVD outcomes during their 10-year follow-up period. We used a common split of 70% of the data as a training set and 30% as a hold-out test set. Table 1 shows the overlap of our atherosclerotic CVD outcome definition with the CVD outcome definition used in the related work by Alaa et al. [29]:

Table 1

CVD outcomes statistics according to definition in current study and the comparator study definition by Alaa et al. [29].

Statistic measured	Number
No. of atherosclerotic CVD outcomes that developed in 10-year follow-up according to definition in current study	28,561
No. of CVD outcomes that developed in 10-year follow-up according to comparator study definition	28,242
No. of CVD outcomes after 10-year follow-up that overlap in the current study and comparator study definition	456,184 out of 464,547 (98%)
No. of CVD outcomes identified in the current study but not in comparator studies	4,341
No. of CVD outcomes included in comparator studies, but not in current study	4,022

Prediction accuracy

The resulting prediction accuracy of the benchmarked models is depicted in Table 2. We used both Framingham 10-year CVD risk versions, with and without lipids, as well as QRisk3 as baseline models to assess the performance of predicting someone’s 10-year risk of developing an atherosclerotic cardiovascular disease based on a holistic set of risk factors, with a focus on actionable risk factors and outcomes. The best performing model was XGB with an AUROC of 75.73%, only marginally higher than the logistic regression model with L1 regularization (75.44%) and substantially better than the Random Forest model (66.90%).

Table 2

Performance of all tested classifiers including baseline models.

No.	Algorithm Name	AUROC and 95% confidence intervals
1	Extreme Gradient Boosting (XGB)	0.7573 (0.755–0.7595)
2	Logistic regression with L1 regularization	0.7544 (0.7522–0.7567)
3	QRisk3	0.725 (0.7226–0.7273)
4	Framingham Lipid & BMI	0.680 (0.6775–0.6824) & 0.681 (0.6788–0.6837)
5	Random Forest	0.6690 (0.6666–0.6715)

Fig 2 shows the AUROCs of the two best performing models, XGB and logistic regression with L1 regularization; the latter is the simplest model tested. Logistic regression has the advantages of being interpretable, in that it provides reasoning for its classifications, and of being a simple and robust method [36].

Fig 2

AUROC of logistic regression with L1 regularization and XGBoost.

In order to better evaluate the clinical implications and significance of our results, we compared the results of our benchmarked models with our baseline models Framingham and QRisk3. Table 2 shows that both our XGB and logistic regression classifiers achieved superior performance compared to the baseline models. Apart from the Random Forest model, all tested models had a higher AUROC than both baseline Framingham (68.0% and 68.1%) and QRisk3 (72.5%) models. The difference in AUROC performance of the Framingham score in our experiments in Fig 2 compared to Alaa et al. [29] is explainable by their use of an older UK Biobank version with 40,000 fewer baseline patients with their last available date of participant follow-up being February 17, 2016. The UK Biobank version we used includes biochemistry data which was released May 1, 2019 including cholesterol and additional questionnaires data. Additionally, more diagnosis data was made available over time. These dataset differences may help explain the difference in AUROC. Figs 3 and 4 show the AUROCs of all baseline models on imputed and unimputed data respectively.

Fig 3

AUROC curves of baseline models on imputed data.

Fig 4

AUROC curves of baseline models on unimputed data.

Both Framingham versions perform nearly identically on imputed and unimputed data whereas QRisk3 performs worse on unimputed data.

Feature elimination vs. predictive performance

Fig 5 shows how the performance of the best logistic regression model depends on the number of risk factors used. Discarding the risk factors stepwise leads to a relatively unchanged and stable model performance until around 170 iterations of feature elimination. This indicates that for predicting an individual’s 10-year atherosclerotic CVD risk, many features provide only marginal value and a small subset of features provides substantial informative value. After around 170 iterations, there was a marked decline in model performance associated with further reductions in utilized features.

Fig 5

Performance of best logistic regression model depending on number of features.

AUROC performance of the best performing logistic regression model with L1 regularization (continuous blue line) compared to the number of features utilized in each iterative feature elimination step (orange line). The dotted blue horizontal line shows the intersection of 25 features with the iterative feature elimination step, allowing for extrapolation to model performance.

Table 3 shows in more detail the dependence of the model performance on the number of features. Utilizing only 25 (12%) of the 203 total risk factors, an 88% reduction in utilized features, still leads to a reasonable AUROC performance. Compared to an AUROC of 75.44% when using all 203 risk factors, the model still achieves 74.15% (95% CI: 0.7392–0.7438) with the 25 most informative risk factors.

Table 3

Performance of best logistic regression model depending on number of features.

Number of Features	AUROC
203	75.44
40	75.01
25	74.15
20	73.32
17	72.76
10	70.88
2	68.98

We also assessed the performance for fewer features. To reach the same performance as QRisk3 of 72.5% AUROC, 16 features would be necessary. The two most informative features were age and biological sex. To reach a similar performance as Framingham (68.0%), just two features were necessary (68.98%). It is worth noting, however, that both Framingham and QRisk3 were trained and tuned on other datasets and have different CVD definitions and objectives.

Generalizability of results

We assessed the generalizability of our models by re-training the two best performing models on a White cohort only and then testing them on a non-White cohort. Table 4 and Fig 6 show the results for logistic regression and XGBoost (XGB). The logistic regression model has an AUROC of 75.86% in the generalizability experiment, compared with 75.44% in the previous experiment. XGB has an AUROC of 76.26% in the generalizability experiment, compared with 75.73% in the previous experiment. These results differ only marginally from those of the previous experiments.

Table 4

Model performance when trained on Whites and tested on non-Whites.

Model	AUROC on generalizability experiment	Previous AUROC results
Logistic Regression with L1 regularization	75.86%	75.44%
XGBoost	76.26%	75.73%
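The design of this experiment (fit on one subgroup, score a disjoint subgroup, compare AUROC) can be illustrated with a minimal, dependency-free Python sketch. Everything in it is an assumption made for illustration: the cohort is synthetic, the groups are labelled "A" and "B", the model is a one-feature logistic regression, and the helper names are hypothetical. It does not reproduce the study's models or data.

```python
import math
import random


def fit_logreg_1d(xs, ys, lr=0.5, epochs=300):
    """Plain gradient descent for a one-feature logistic regression (no regularization)."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        errs = [1.0 / (1.0 + math.exp(-(w * x + b))) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * sum(errs) / n
    return w, b


def auroc(ys, scores):
    """Rank-based AUROC (Mann-Whitney U statistic), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average 1-based rank of the tied block
        i = j + 1
    n_pos = sum(ys)
    n_neg = len(ys) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, ys) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(1)


def make_cohort(n, group):
    # Synthetic participants: one standardized risk score x and a binary outcome y.
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = int(rng.random() < 1.0 / (1.0 + math.exp(-1.5 * x)))
        rows.append({"group": group, "x": x, "y": y})
    return rows


cohort = make_cohort(900, "A") + make_cohort(100, "B")

# Train only on group A, evaluate only on group B: the two sets never overlap.
train = [r for r in cohort if r["group"] == "A"]
test = [r for r in cohort if r["group"] == "B"]

w, b = fit_logreg_1d([r["x"] for r in train], [r["y"] for r in train])
test_auroc = auroc([r["y"] for r in test], [w * r["x"] + b for r in test])
print(f"Trained on A (n={len(train)}), AUROC on B (n={len(test)}): {test_auroc:.3f}")
```

Because the synthetic outcome mechanism is identical in both groups, this sketch can only show the mechanics of the split. Whether a real model transfers between populations depends on how the risk factor distributions and outcome rates differ between them.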

Fig 6

AUROC of logistic regression with L1 regularization and XGBoost when trained on Whites and tested on non-Whites.

Predictive ability of individual variables in UK Biobank

Table 5 shows the relative regression feature weights of the 25 most informative risk factors in descending order. A full list is provided in the supplementary materials (S5 Table). Based on our previous manual curation of risk factors and outcomes, we can see that the most informative risk factors are distributed across 5 categories (Table 6), with the lifestyle category contributing the most risk factors. The two most informative features were age and biological sex. We provided a sensitivity analysis using SHAP values of the best performing logistic regression model for all risk factors in the supplementary materials (S1 Fig).

Table 5

Relative regression feature weights of 25 most informative risk factors from best logistic regression model.

Feature number	Risk factor name	Relative informative value (descending)
1	Age	0.0938
2	Biological sex	0.0485
3	Systolic blood pressure	0.0284
4	Social visits: About once a week	0.0277
5	Social visits: 2–4 times a week	0.0273
6	Walking pace: Brisk pace	0.0268
7	Total cholesterol HDL ratio	0.0267
8	Total cholesterol	0.0239
9	LDL cholesterol	0.0235
10	Familial CVD	0.0218
11	Social visits: About once a month	0.0203
12	Sleep problems: Not at all	0.0188
13	Alcohol with meals: Yes	0.0184
14	Smoking	0.0184
15	Social visits: Almost daily	0.0178
16	No. of cigarettes daily	0.0163
17	Hypertension	0.0160
18	Walking pace: Steady average pace	0.0154
19	Waist circumference	0.0150
20	Alcohol with meals: It varies	0.0141
21	Social visits: Once every few months	0.0139
22	Overall health rating: Excellent	0.0134
23	Other Heart Arrhythmias	0.0129
24	Overall health rating: Poor	0.0123
25	Sleep problems: Several days	0.0122

Table 6

Categorization of the 25 most informative risk factors into categories from the best logistic regression model.

Category	Risk Factors
Demographics	Age, Biological sex
Biomarkers	Waist circumference, systolic blood pressure, total cholesterol, LDL cholesterol, total cholesterol HDL ratio
Comorbidities	Hypertension, sleep problems: not at all, sleep problems: several days, other heart arrhythmias
Family History	Familial CVD
Lifestyle Factors	Social visits: about once/week, social visits: 2–4 times/week, social visits: about once/month, social visits: almost daily, social visits: once every few months, smoking, no. of cigarettes daily, alcohol with meals: yes, alcohol with meals: it varies, walking pace: steady average pace, walking pace: Brisk pace, overall health rating: excellent, overall health rating: poor
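Tables 5 and 6 can be combined to see how much of the relative informative value falls into each category. The short Python sketch below simply sums the Table 5 values within the Table 6 categories. Treating the relative values as additive is an assumption made here for illustration; the source does not report such category totals.

```python
from collections import defaultdict

# (risk factor, relative informative value from Table 5, category from Table 6)
TOP_25 = [
    ("Age", 0.0938, "Demographics"),
    ("Biological sex", 0.0485, "Demographics"),
    ("Systolic blood pressure", 0.0284, "Biomarkers"),
    ("Social visits: About once a week", 0.0277, "Lifestyle Factors"),
    ("Social visits: 2-4 times a week", 0.0273, "Lifestyle Factors"),
    ("Walking pace: Brisk pace", 0.0268, "Lifestyle Factors"),
    ("Total cholesterol HDL ratio", 0.0267, "Biomarkers"),
    ("Total cholesterol", 0.0239, "Biomarkers"),
    ("LDL cholesterol", 0.0235, "Biomarkers"),
    ("Familial CVD", 0.0218, "Family History"),
    ("Social visits: About once a month", 0.0203, "Lifestyle Factors"),
    ("Sleep problems: Not at all", 0.0188, "Comorbidities"),
    ("Alcohol with meals: Yes", 0.0184, "Lifestyle Factors"),
    ("Smoking", 0.0184, "Lifestyle Factors"),
    ("Social visits: Almost daily", 0.0178, "Lifestyle Factors"),
    ("No. of cigarettes daily", 0.0163, "Lifestyle Factors"),
    ("Hypertension", 0.0160, "Comorbidities"),
    ("Walking pace: Steady average pace", 0.0154, "Lifestyle Factors"),
    ("Waist circumference", 0.0150, "Biomarkers"),
    ("Alcohol with meals: It varies", 0.0141, "Lifestyle Factors"),
    ("Social visits: Once every few months", 0.0139, "Lifestyle Factors"),
    ("Overall health rating: Excellent", 0.0134, "Lifestyle Factors"),
    ("Other Heart Arrhythmias", 0.0129, "Comorbidities"),
    ("Overall health rating: Poor", 0.0123, "Lifestyle Factors"),
    ("Sleep problems: Several days", 0.0122, "Comorbidities"),
]

totals = defaultdict(float)  # summed relative value per category
counts = defaultdict(int)    # number of risk factors per category
for _, weight, category in TOP_25:
    totals[category] += weight
    counts[category] += 1

for category in sorted(totals, key=totals.get, reverse=True):
    print(f"{category:<18} n={counts[category]:>2}  summed value={totals[category]:.4f}")
print(f"All 25 factors: {sum(totals.values()):.4f}")
```

Read this way, the lifestyle category contributes both the most risk factors (13 of 25) and the largest summed value, even though age is by far the largest single contributor.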

Discussion

Using data gathered from the large longitudinal UK Biobank cohort study, we developed a pipeline to benchmark several classification models for predicting a subject’s 10-year absolute risk of developing an atherosclerotic CVD. We used an extensive, physician-curated set of risk factors and outcomes, taking a holistic view of the subject’s current health status rooted in a precision medicine approach. The models were trained and evaluated using data from 464,547 UK Biobank participants, spanning 203 CVD risk factors for each subject. Using a simple logistic regression model with a holistic set of risk factors significantly improved atherosclerotic CVD risk prediction (AUROC 75.44%) compared to currently available, widely used and recommended models such as Framingham (68.0%) and QRisk3 (72.5%). Both of these existing models rely on a limited set of risk factors and outcomes and do not focus on modifiable lifestyle factors. Further, our best performing logistic regression model utilizes new CVD risk predictors showing high predictive power, namely social visits, walking pace and overall health rating. The frequency of social visits could be indicative of someone’s current mental health status, which has been shown to be a relevant CVD risk factor [58, 59]. These and other non-laboratory risk factors could be collected by means of a questionnaire or passively deduced using data analytics from data sources such as GPS, calendar and sensors [26, 60] from e.g. smartphones, smartwatches and fitness trackers. Additionally, our best performing models, XGBoost and logistic regression, showed marginal differences when trained and tested on particular sub-populations, which is indicative of good generalizability to other ethnicities.

As there was little performance difference between the best performing models, we primarily discuss the simplest model, logistic regression with L1 regularization. This model has the inherent benefit of offering reasoning for its predictions through the learned coefficient of every risk factor, and of having feature selection performed by the L1 regularization. With L1 regularization, the coefficients of less important risk factors are shrunk and can be set to exactly zero, which removes these features from the model entirely, so that fewer risk factors are needed for an accurate prediction. Using iterative feature elimination, we identified a subset of the 25 most relevant risk factors that provides a performance similar to that obtained with all 203 risk factors.

The 25 most relevant risk factors are distributed across five different categories, suggesting that different biological layers contribute to the risk of atherosclerotic CVD. This result confirms that it is insufficient to assess only one biological layer for accurate risk prediction, supporting our initial model development approach [61]. Our approach takes into account multiple biological layers by using multi-omics as well as clinical and lifestyle data with the aim to capture all potential interactions or correlations detected between molecules in different biological layers [22]. Multi-omics data generated for the same set of samples can provide useful insights into the interaction of biological information at multiple layers and thus can help in understanding the mechanisms underlying the complex biological condition of interest. In our model, the lifestyle category contributed the most risk factors, suggesting that accurate prediction relies upon continuous daily lifestyle data and not just periodic snapshots of clinical data. The causal relationships between the risk factors considered in our model and atherosclerotic CVDs have been demonstrated by other studies [11, 19, 21, 25].

Innovative approaches are needed in order to tackle the increasing prevalence and mortality of CVD-related diseases [2], and the associated financial burden on healthcare systems. This is particularly true in low and middle income countries where CVD prevalence has also been increasing and is expected to increase further as a consequence of an aging and growing population [2]. Our atherosclerotic CVD prediction model has the potential to support healthcare systems by identifying more people at risk earlier and more accurately than currently available models and intervening with personalized behavior change programs. Currently available models, like Framingham and QRisk3, have limited predictive capability for atherosclerotic CVDs as they were not trained on all of them and do not provide actionable results.

There is potential for novel disruptive approaches to affordably improve CVD outcomes. Areas where these may have an impact are screening, lifestyle coaching and prevention [2]. Screening will become more accessible and widespread as more (near-)medical-grade sensors are integrated into smartphones and smartwatches, enabling continuous monitoring of relevant behavioral CVD risk factors, as well as biomarkers such as heart rate, blood pressure and blood glucose. By gathering a wider spectrum of relevant risk factors for cardiovascular disease automatically and continuously, an ongoing and personalized cardiovascular disease risk prediction could be enabled. Through linking personalized information on an individual’s CVD risk with app-based programs for sustained behavioral modification, it may be possible to lower the incidence and mortality of CVDs [62]. Combined with a companion smartphone-based app, an AI or healthcare provider-generated personalized intervention program could be provided and targeted at those people who need it the most.

A system and method that gathers personal health data and predicts an individual’s atherosclerotic CVD risk handles sensitive health data (e.g. laboratory values) and must adhere to local regulations and best practices in data transfer, processing and storage to ensure data privacy and security.

Many studies have shown that digital health interventions are cost effective for managing CVD (for a review see [63]). One report found that a community-based prevention program could have a mean return on investment (ROI) in medical cost savings of $5.60 for every $1 spent within a 5 year timeframe by improving physical activity and nutrition and reducing tobacco usage [64]. A review of 11 in-home cardiac rehabilitation programs for the secondary prevention of CVD found that social support, goal setting, monitoring, credible instructions and literature resources are all effective behavior change techniques to reduce behavioral risk factors for CVD [65].

The improvement achieved by our models might be partially attributed to their being trained and assessed on the UK Biobank dataset, whereas the baseline Framingham model was derived from a different population. The population and many of the data sources used in the QRisk3 model are similar, being the general UK population and their GP, hospital and mortality records. However, our risk model generation approach and QRisk3’s approach were designed with different aims and objectives, and the modelling strategy was different. For these reasons, direct comparison between the models is limited. Notable differences between the approaches include the more limited set of risk factors in Framingham and QRisk3 and the focused yet wider range of atherosclerotic CVDs included in our approach. The results from our generalizability sub-analysis indicate that our XGB and logistic regression models might generalize well to other ethnicities and do not overfit to our cohort. However, this needs to be further evaluated with more data from diverse ethnicities.

Our results show that our models have improved performance over the baseline models Framingham and QRisk3 (Table 2). This is because selecting an appropriate disease modelling approach and classifiers, and carefully tuning the model’s hyperparameters, are crucial steps for realizing the potential benefits of ML. Our pipeline automates some of these steps, which makes the tuning and discovery of new disease risk models easily accessible for clinical research. Our prospective cohort modelling approach, which is rooted in precision medicine, is the first to generate an atherosclerotic CVD absolute risk prediction tool based upon a complete definition of atherosclerotic CVD outcomes and a holistic set of risk factors.
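The interpretability argument made in this discussion rests on the form of a logistic regression model: the predicted absolute risk is the logistic function of an intercept plus a weighted sum of risk factor values, so the effect of changing one modifiable factor can be read off directly. The Python sketch below illustrates that mechanism only. All coefficients, the intercept and the feature names are hypothetical values chosen for illustration. They are not the coefficients of the study's model and must not be used to estimate anyone's risk.

```python
import math

# HYPOTHETICAL intercept and coefficients (assumption, for illustration only).
# Inputs are assumed to be standardized (suffix _z) or binary (0/1).
INTERCEPT = -2.5
COEFS = {
    "age_z": 0.9,
    "male": 0.5,
    "systolic_bp_z": 0.3,
    "brisk_walking_pace": -0.3,
    "current_smoker": 0.4,
}


def absolute_risk(profile):
    """Logistic model: risk = 1 / (1 + exp(-(intercept + sum(coef * value))))."""
    z = INTERCEPT + sum(COEFS[name] * value for name, value in profile.items())
    return 1.0 / (1.0 + math.exp(-z))


def what_if(profile, **changes):
    """Return (baseline risk, risk after overriding some risk factor values)."""
    return absolute_risk(profile), absolute_risk({**profile, **changes})


profile = {
    "age_z": 1.0,
    "male": 1,
    "systolic_bp_z": 0.5,
    "brisk_walking_pace": 0,
    "current_smoker": 1,
}

baseline, changed = what_if(profile, current_smoker=0, brisk_walking_pace=1)
print(f"Predicted risk: {baseline:.1%} -> {changed:.1%} after the hypothetical changes")
```

A difference computed this way is a change in the model's prediction, not a guaranteed clinical effect. Whether modifying a risk factor actually lowers an individual's risk depends on the causal evidence for that factor.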

Limitations

The UK Biobank only admitted participants aged 40 and over at initial signup. This might limit the applicability of the risk score to younger populations, and further tests with data from younger populations need to be conducted. Many participants have missing values for many of the potential risk factors. Having more unimputed data on relevant CVD risk factors could improve the predictive performance of all our benchmarked classifiers and could also change the classifier ranking in Table 2 and the relative risk factor importances in Table 5. However, the use of imputed data is highly unlikely to affect our conclusion that a holistic set of risk factors and an exhaustive atherosclerotic CVD outcome definition could improve actionable atherosclerotic CVD risk prediction. An additional limitation of our study is that the UK Biobank dataset consists of participants of predominantly (88%) British ethnicity, with an even larger portion having a White background (91%). Therefore, further assessments of the influence of the ethnicity predictor need to be carried out to enable a generalizable tool. Previous work in this area indicates that the development of plaques seems to be independent of ethnicity [21]. A further limitation of this UK-focused dataset is that socio-economic and other environmental factors differ between countries. This is another potential bias that needs to be further evaluated with datasets from other countries with different socio-economic characteristics. Disease risk prediction models which include subjective non-laboratory risk factors, such as the self-reported health rating and usual walking pace, should be cautiously evaluated to minimize self-report bias. These risk factors have been found to be good predictors of overall CVD risk in another study using UK Biobank data [29].

Conclusions

We benchmarked multiple classifiers to predict an individual’s 10-year risk of developing an atherosclerotic CVD, using a holistic set of risk factors and a specific definition of atherosclerotic CVDs. Our reduced logistic regression classifier with L1 regularization, a simple and interpretable model, is amongst our best prediction models, includes actionable lifestyle factors, achieves an AUROC of 74.15% and requires 13 unique features. Our experiments showed that a two-feature questionnaire is as accurate as the Framingham model and a 16-feature questionnaire is as accurate as QRisk3 for 10-year atherosclerotic CVD risk prediction. Both prediction models, XGBoost and logistic regression, generalize well to non-White people, which might indicate that our models generalize well to other (western) countries. Framingham and QRisk3, which are well established and validated absolute risk prediction models, do not perform as well in predicting individuals’ 10-year risk of developing an atherosclerotic CVD. With our logistic regression model, we created a promising new interpretable, actionable and accurate risk prediction tool that could assist individuals and public health in CVD risk reduction.

Shapley Additive Explanations (SHAP value) of each risk factor for the logistic regression model.

This summary plot combines risk factor importance with risk factor effects. It shows the relationship between the value of a risk factor and its impact on the prediction. Risk factors are sorted according to their importance along the y-axis. Each point in the summary plot is a Shapley value for a risk factor and an instance. The position of a Shapley value on the y-axis is determined by the risk factor importance and on the x-axis by the Shapley value. The color represents the value of a risk factor from low to high. Overlapping points are jittered in the y-axis direction, showing the distribution of the Shapley values per risk factor. (TIFF)

List of all risk factors used in our analysis.

The listed risk factors were summarized into 203 risk factors for the respective UK Biobank participant. (XLSX)

List of all outcomes used in our analysis.

The following outcomes were all consolidated into one final binary outcome column indicating if the respective UK Biobank participant did or did not develop one of the relevant atherosclerotic CVDs during their individual 10-year follow-up period starting from their individual initial assessment attendance date. (XLSX)

Specifications of the python (v3.9.6) libraries and their versions used in this study.

(PDF)

List of utilized open-source methods, best parameters and references.

(PDF)

Full list of relative informative values for each risk factor for logistic regression model.

(XLSX)

19 Dec 2021

PONE-D-21-37349

Actionable absolute risk prediction of atherosclerotic cardiovascular disease: a behavior-management approach based on data from 464,547 UK Biobank participants

PLOS ONE

Dear Dr. Kesar,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

ACADEMIC EDITOR:

Based on the comments from the reviewers and my own observation, I recommend major revisions for the article. It is not mandatory to cite the articles suggested by the reviewers. If the authors feel that the suggested references do not enhance the literature survey they need not cite them.

Please submit your revised manuscript by Feb 02 2022 11:59PM.

We look forward to receiving your revised manuscript.

Kind regards,
Thippa Reddy Gadekallu
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. 
Thank you for stating the following in the Competing Interests section: "All of the authors are or were employees of, contractors for, or hold equity in Ada Health GmbH. AK, AB, OB, HH, MJ, DN, BLS and SG are employees or company directors of Ada Health GmbH and some of the listed authors hold stock options in the company. Ada Health GmbH has received research grant funding from the Bill & Melinda Gates Foundation, Fondation Botnar, the Federal Ministry of Education and Research Germany, the Federal Ministry for Economic Affairs and Energy Germany and the European Union. PW is employed by Wicks Digital Health Ltd, which has received funding from Ada Health, AstraZeneca, Baillie Gifford, Biogen, Bold Health, Camoni, Compass Pathways, Coronna, EIT, Endava, Happify, HealthUnlocked, Inbeeo, Kheiron Medical, Lindus Health, Sano Genetics, Self Care Catalysts, The Learning Corp, The Wellcome Trust, THREAD Research, VeraSci, and Woebot. HH is the topic driver of the AI-based symptom assessment group of the WHO/ITU Focus Group on AI4H (Artificial Intelligence for Health) and SG is a member of the clinical evaluation topic group of the WHO/ITU Focus Group on AI4H. A related patent application is currently pending with the title “System and method for predicting the risk of a patient to develop an atherosclerotic cardiovascular disease” and application number EP21191089.8." We note that you received funding from a commercial source: Ada Health GmbH Please provide an amended Competing Interests Statement that explicitly states this commercial funder, along with any other relevant declarations relating to employment, consultancy, patents, products in development, marketed products, etc. 
Within this Competing Interests Statement, please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. Please include your amended Competing Interests Statement within your cover letter. We will change the online submission form on your behalf. 3. We note that you have a patent relating to material pertinent to this article. Please provide an amended statement of Competing Interests to declare this patent (with details including name and number), along with any other relevant declarations relating to employment, consultancy, patents, products in development or modified products etc. Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. This information should be included in your cover letter; we will change the online submission form on your behalf. 4. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. 
For more information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: No

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data, e.g. participant privacy or use of data from a third party, those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The proposed research discusses a risk prediction approach for diagnosing cardiac arrest risk. It is an interesting research area. However, the following concerns should be addressed.

• The architecture looks very abstract and misses very important details. I recommend authors elaborate design and experimental setup of the proposed approach. The authors have described the materials and methods section, but I recommend including a detailed experimental setup for a better understanding and interpretation of the proposed work.
• A detailed, layered design describing the proposed approach should be included for a better understanding of readers.
• The authors have included 54 references (which occupies a lot of space), which has some unnecessary references which can be removed and essential references such as, https://www.frontiersin.org/articles/10.3389/fpubh.2021.762303/full”, “https://ieeexplore.ieee.org/abstract/document/9170666/” can be referred.
• The results and discussions about how the proposed approach enhances the state of the art is missing. I recommend authors to highlight the contribution of the proposed work separately, along with the limitations of the system.
• The authors have not discussed the security and privacy aspects of the proposed system.

Some more changes are needed:

1. All tables should be symmetrical and should follow a similar formatting style.
2. All the equations should be written using a professional equation editor and should use a similar formatting style and numbering.
3. Check the entire manuscript for grammatical and typo errors.

Reviewer #2: In this manuscript, some machine learning approaches were employed to make a relation among risk and some input factors. Topic is interesting. Please consider the following comments to improve its quality.

-Abstract: please mention results of study in this section.
-Title: I think second part of the title can be reduced and integrated with first part a behavior-management approach based on data from 464,547 UK Biobank participants
-Introduction: why this study is new and novel. Please mention it in the introduction.
-It is recommended to insert a workflow in the methodology section. Moreover, please describe method briefly in first paragraph of the method.
-I cannot understand why these machine learning approaches were employed.
-I would like to know the selected parameters for running each machine learning approach. It is necessary to change parameters and achieve accuracy result. In fact, a sensitivity analysis should be performed.

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

18 Jan 2022

Ajay Kesar
Ada Health GmbH
Karl-Liebknecht-Str. 1
10178 Berlin, Germany
science@ada.com

Dear Mr. Gadekallu,

We thank the editor and the two reviewers for their comments on our manuscript. Below is our revised competing interests statement and responses to each point raised by the academic editor and reviewers. We hope that we satisfyingly addressed them and that the manuscript will be now suited for publication.

Academic editor:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.
We have modified the file naming to comply with the style requirements and are now fully compliant with the style requirements. 2. Competing Interests Statement: We added the first and last sentence to our revised competing interests statement, for further clarification: ”This research was funded by Ada Health GmbH and has been conducted using the UK Biobank under application number 34802.” And “This does not alter our adherence to PLOS ONE policies on sharing data and materials.” Full revised competing interests statement: “This research was funded by Ada Health GmbH and has been conducted using the UK Biobank under application number 34802. All of the authors are or were employees of, contractors for, or hold equity in Ada Health GmbH. AK, AB, OB, HH, MJ, DN, BLS and SG are employees or company directors of Ada Health GmbH and some of the listed authors hold stock options in the company. Ada Health GmbH has received research grant funding from the Bill & Melinda Gates Foundation, Fondation Botnar, the Federal Ministry of Education and Research Germany, the Federal Ministry for Economic Affairs and Energy Germany and the European Union. PW is employed by Wicks Digital Health Ltd, which has received funding from Ada Health, AstraZeneca, Baillie Gifford, Biogen, Bold Health, Camoni, Compass Pathways, Coronna, EIT, Endava, Happify, HealthUnlocked, Inbeeo, Kheiron Medical, Lindus Health, Sano Genetics, Self Care Catalysts, The Learning Corp, The Wellcome Trust, THREAD Research, VeraSci, and Woebot. HH is the topic driver of the AI-based symptom assessment group of the WHO/ITU Focus Group on AI4H (Artificial Intelligence for Health) and SG is a member of the clinical evaluation topic group of the WHO/ITU Focus Group on AI4H. A related patent application is currently pending with the title “System and method for predicting the risk of a patient to develop an atherosclerotic cardiovascular disease” and application number EP21191089.8. 
This does not alter our adherence to PLOS ONE policies on sharing data and materials.”

3. Patent Mention in Competing Interests:

We have declared the requested name and number of the pending patent in the competing interests statement and added the last sentence for further clarification: “This does not alter our adherence to PLOS ONE policies on sharing data and materials.”

4. Data Availability

Please find below our revised data availability statement: “There are restrictions prohibiting the provision of data in this manuscript. The data were obtained from a third party, UK Biobank, upon application. Interested parties can apply for data from UK Biobank directly, at http://www.ukbiobank.ac.uk. UK Biobank will consider data applications from bona fide researchers for health-related research that is in the public interest. By accessing data from UK Biobank, readers will be obtaining it in the same manner as we did.”

Reviewer #1:

1. The architecture looks very abstract and misses very important details. I recommend authors elaborate design and experimental setup of the proposed approach. The authors have described the materials and methods section, but I recommend including a detailed experimental setup for a better understanding and interpretation of the proposed work. A detailed, layered design describing the proposed approach should be included for a better understanding of readers.

Thank you for your feedback. We have included a new figure 1 on page 12 to describe the design and experimental setup of our approach. The following sentence was modified for better visibility on page 12: “Details on the used Python libraries, methods and parameters are provided in the supplementary data (S3 and S4 Tables)” and this sentence added: “Fig 1 visualizes an overview of all performed steps of our experimental setup.”

2. The authors have included 54 references (which occupies a lot of space), which has some unnecessary references which can be removed and essential references such as, “https://www.frontiersin.org/articles/10.3389/fpubh.2021.762303/full”, “https://ieeexplore.ieee.org/abstract/document/9170666/” can be referred.

We thank the reviewer for their suggestions of relevant literature. We can confirm that both studies are relevant as well and are now referenced in our manuscript. We referenced the first study twice and the second study once. As PLOS ONE is an online journal we understand there is not a strict limit on the number of references, but are happy to follow guidance from the editorial staff.

3. The results and discussions about how the proposed approach enhances the state of the art is missing. I recommend authors to highlight the contribution of the proposed work separately, along with the limitations of the system.

We have taken this suggestion into account and extended our discussion section on page 25 to highlight the contribution of our proposed model more clearly: “Our atherosclerotic CVD prediction model has the potential to support healthcare systems by identifying more people at risk earlier and more accurately than currently available models and intervening with personalized behavior change programs. Currently available models, like Framingham and QRisk3, have limited predictive capability for atherosclerotic CVDs as they were not trained on all of them and do not provide actionable results.”

4. The authors have not discussed the security and privacy aspects of the proposed system.

Thank you for highlighting this important missing aspect. We added the following remarks for completeness on page 26: “A system and method gathering personal health data and predicting an individual's atherosclerotic CVD risk is handling sensitive health data (e.g.
laboratory values) and must adhere to local regulations and best practices in data transfer, processing and storage to ensure data privacy and security.”

5. All tables should be symmetrical and should follow a similar formatting style. All the equations should be written using a professional equation editor and should use a similar formatting style and numbering. Check the entire manuscript for grammatical and typo errors.

Thank you for your feedback. We refined all table formatting styles to be more consistent. The whole manuscript was double checked by a native English speaker for grammatical and typographical errors.

Reviewer #2:

1. Abstract: please mention results of study in this section.

We thank the reviewer for their suggestion and have added the results to the abstract on page 1. While doing so, we also noticed a copy and paste error for the confidence intervals of our best performing Logistic Regression model which we have corrected.

2. Title: I think second part of the title can be reduced and integrated with first part

Thank you for your feedback. We shortened the title to “Actionable absolute risk prediction of atherosclerotic cardiovascular disease based on the UK Biobank” on the author page.

3. Introduction: why this study is new and novel. Please mention it in the introduction.

We appreciate the reviewer’s feedback on that matter and have modified and emphasized our unique contributions with a new second to last paragraph in the Introduction section on page 5: “The aim of this study was to use a large-data ML approach to develop an actionable absolute risk prediction tool which takes into account the holistic health of an individual. Uniquely, we focussed on behavioral risk factors relating to all atherosclerotic CVD outcomes. Our goal was to have a holistic understanding of an individual's current health status, to better quantify their risk of atherosclerotic CVDs, and to provide actionable advice. Our approach is novel in that we employ a highly holistic understanding of an individual’s current health status, to better quantify their risk of all atherosclerotic CVDs. By utilizing a comprehensive set of lifestyle factors, we enable the subsequent suggestion of personalized and actionable advice relating to unhealthy risk factors. Instead of using only a limited set of risk factors, we aimed to achieve this by taking multiple biological layers into account, which include: (i) multi-omics data from blood samples (e.g. lipidome and proteome); (ii) family history (e.g. genome), (iii) lifestyle data, (iv) clinical data and (v) environmental data; along with (vi) an extensive set of risk factors and outcomes.”

4. It is recommended to insert a workflow in the methodology section. Moreover, please describe method briefly in first paragraph of the method.

Thank you for your recommendation. We have included a new figure 1 on page 12 to describe the design and experimental setup of our approach in the methodology section. The following sentence was added on page 12: “Fig 1 visualizes an overview of all performed steps of our experimental setup.” We also added a new brief summary to the methods section on page 6: “Baseline data from the UK Biobank was utilized to extract an extensive set of risk factors and outcomes associated with the pathophysiology of atherosclerotic CVDs. A benchmarking pipeline was used to train and evaluate different standard and ML algorithms for the task of 10-year atherosclerotic CVD risk prediction. The performance was measured using AUROC and compared against the baseline models Framingham and QRisk3, which are widely used and recommended models. We evaluated our best performing models further by analysing the most informative features and assessed model generalizability and created a reduced model.”

5. I cannot understand why these machine learning approaches were employed.
We certainly want to clarify for our readers why we have employed a ML approach and thank you for the opportunity to expand on our rationale in the text. Specifically, we added further clarifications to the method section on page 10: “Since the introduction of the classic CVD risk prediction methods, the field of supervised machine learning has developed from classical statistics with the sole purpose of maximizing predictive accuracy with modern statistical methods. Therefore, in addition to using standard linear models, we tested the major ML approaches, covering a wide spectrum of the possible ML design space, to evaluate which model type performs best for our task. Based on our initial benchmarking pipeline results, we focused on reporting the results of the initially best performing models: logistic regression, random forest and XGBoost.”

6. I would like to know the selected parameters for running each machine learning approach. It is necessary to change parameters and achieve accuracy result. In fact, a sensitivity analysis should be performed.

Thanks for your feedback. We have added additional information to address your point in the supplementary file S4 Table “List of utilized open-source methods, best parameters and references”, and here we have provided the parameters of the 3 benchmarked methods. For better visibility, we modified the following sentence on page 12: “Details on the used Python libraries, methods and parameters are provided in the supplementary data (S3 and S4 Tables).” We also added the parameters of the other tested methods to the data supplement file S4 Table.

Additionally, we performed a sensitivity analysis for our best performing Logistic Regression model using Shapley Additive Explanations (SHAP values) and provided the full analysis as a new figure in the supplementary data S1 Figure. We added this sentence to the statistical paragraph of the methods section on page 13: [...] “and performed a sensitivity analysis using Shapley Additive Explanations (SHAP values) for the best performing linear model” and added the following sentences to the manuscript on page 20: “We provided a sensitivity analysis using SHAP values of the best performing Logistic Regression model for all risk factors in the supplementary materials (S1 Fig.)” and on the last page 35: “S1 Fig. Shapley Additive Explanations (SHAP value) of each risk factor for the logistic regression model. (PNG) This summary plot combines risk factor importance with risk factor effects. It shows the relationship between the value of a risk factor and its impact on the prediction. Risk factors are sorted according to their importance along the y-axis. Each point in the summary plot is a Shapley value for a risk factor and an instance. The position of a Shapley value on the y-axis is determined by the risk factor importance and on the x-axis by the Shapley value. The color represents the value of a risk factor from low to high. Overlapping points are jittered on the y-axis direction, showing the distribution of the Shapley values per risk factor.”

We hope these modifications satisfactorily improve the quality of our manuscript.

Sincerely on behalf of all authors,
Ajay Kesar

Submitted filename: Response to reviewers.pdf

31 Jan 2022

Actionable absolute risk prediction of atherosclerotic cardiovascular disease based on the UK Biobank

PONE-D-21-37349R1

Dear Dr. Kesar,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Thippa Reddy Gadekallu
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The presented research work should be shared with the research community. The manuscript can be accepted as it is.

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Sharnil Pandya

Reviewer #2: No

4 Feb 2022

PONE-D-21-37349R1

Actionable absolute risk prediction of atherosclerotic cardiovascular disease based on the UK Biobank

Dear Dr. Kesar:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Thippa Reddy Gadekallu
Academic Editor
PLOS ONE
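
The SHAP-based sensitivity analysis described in the response to Reviewer #2 (point 6) can be illustrated with a minimal sketch. For a linear model such as logistic regression, and under the simplifying assumption that features are treated as independent, the Shapley value of a risk factor for one individual (in log-odds space) reduces to the coefficient times the deviation of that individual's value from the cohort mean. The feature names, coefficients and data below are invented for illustration only; they are not taken from the study or from the UK Biobank, and the libraries the authors actually used are those listed in their S3 and S4 Tables.

```python
from statistics import mean


def linear_shap(weights, intercept, background, instance):
    """Shapley values of a linear log-odds model, assuming independent features.

    For f(x) = intercept + sum_i w_i * x_i, the contribution of feature i is
    w_i * (x_i - mean_i), where mean_i is its average over the background data.
    Returns the base value (model output at the feature means) and the
    per-feature contributions.
    """
    means = [mean(column) for column in zip(*background)]
    base_value = intercept + sum(w * m for w, m in zip(weights, means))
    contributions = [w * (x - m) for w, x, m in zip(weights, instance, means)]
    return base_value, contributions


def log_odds(weights, intercept, instance):
    """Linear predictor (log-odds) of the logistic regression model."""
    return intercept + sum(w * x for w, x in zip(weights, instance))


# Hypothetical risk factors, coefficients and cohort (illustration only).
FEATURES = ["age_years", "systolic_bp_mmHg", "current_smoker"]
WEIGHTS = [0.06, 0.02, 0.7]
INTERCEPT = -8.0
BACKGROUND = [
    [45.0, 120.0, 0.0],
    [55.0, 135.0, 1.0],
    [65.0, 150.0, 0.0],
    [50.0, 125.0, 0.0],
]
PERSON = [60.0, 145.0, 1.0]

base_value, contributions = linear_shap(WEIGHTS, INTERCEPT, BACKGROUND, PERSON)

# Rank risk factors by the magnitude of their contribution for this person.
ranked = sorted(zip(FEATURES, contributions), key=lambda fc: abs(fc[1]), reverse=True)
for name, phi in ranked:
    print(f"{name}: {phi:+.3f}")
```

In this sketch the base value plus the sum of the contributions equals the model's log-odds for the individual, which is the additivity property that makes such values usable for per-person explanations. Computing the contributions for every participant and plotting them per risk factor would give a summary plot of the kind described for S1 Fig.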
