Critical appraisal of artificial intelligence-based prediction models for cardiovascular disease.

Maarten van Smeden^1, Georg Heinze^2, Ben Van Calster^3,4,5, Folkert W Asselbergs^6,7,8, Panos E Vardas^9,10, Nico Bruining^11, Peter de Jaegere^12, Jason H Moore^13, Spiros Denaxas^8,14, Anne Laure Boulesteix^15, Karel G M Moons^1.

Abstract

The medical field has seen a rapid increase in the development of artificial intelligence (AI)-based prediction models. With the introduction of such AI-based prediction model tools and software in cardiovascular patient care, the cardiovascular researcher and healthcare professional are challenged to understand the opportunities as well as the limitations of the AI-based predictions. In this article, we present 12 critical questions for cardiovascular health professionals to ask when confronted with an AI-based prediction model. We aim to support medical professionals to distinguish the AI-based prediction models that can add value to patient care from the AI that does not.

Keywords: Artificial intelligence; Diagnosis; Digital health; Machine learning; Prediction; Prognosis

Year: 2022 PMID: 35639667 PMCID: PMC9443991 DOI: 10.1093/eurheartj/ehac238

Source DB: PubMed Journal: Eur Heart J ISSN: 0195-668X Impact factor: 35.855

Introduction

Artificial intelligence (AI) and its subdiscipline machine learning are receiving increasing attention throughout medicine, including cardiovascular medicine.[1,2] Proponents promise AI will change the way medicine and healthcare are practiced, by making use of technological advancements that allow for collection of increasingly detailed and diverse data and the ever-increasing computational ability to analyse and combine such data. An important part of these promises is the development and implementation of more accurate clinical prediction models (algorithms, tools, or rules, from here onwards simply referred to as prediction models) to improve (or, according to some advocates, even revolutionize) screening, diagnosis, and prognostication of diseases. Prediction models usually fall within one of two major categories: diagnostic prediction models that estimate an individual’s probability of a specific health condition being currently present, and prognostic prediction models that estimate the probability of developing a specific health outcome over a specific time period.[3] Indeed, technological developments in machine learning drive the ability to derive increasingly complex prediction models from data sources that are structured (i.e. data from a sample of individuals that can simply be captured in a spreadsheet format) and unstructured (such as free text in electronic patient health records, medical images, and electrophysiology). Taking advantage of these technological developments, AI has now been introduced for a large variety of healthcare challenges within cardiovascular diseases, for instance, the automated detection of cardiac arrhythmias from electrocardiograms,[4] early detection of aortic stenosis,[5] and mortality prediction of patients undergoing cardiac resynchronization therapy.[6] The increased interest in the development, testing, implementation, and impact of AI-based prediction models in cardiovascular patient care also comes with new challenges. 
One of these challenges is that it requires the cardiovascular disease professional and researcher to familiarize themselves with the opportunities of AI prediction models, as well as their inherent limitations when developed for and applied in their own setting. This article aims to assist researchers and professionals, readers, and reviewers in appraising the development and testing (i.e. validation) of AI prediction models. We propose 12 critical questions for health professionals and researchers to consider (Graphical abstract).

Graphical Abstract

Twelve critical questions to be asked by readers and reviewers when confronted with prediction models that are based on AI.

Question 1: Is artificial intelligence needed to solve the targeted medical problem?

The development of a new AI prediction model should be clearly linked to a relevant medical problem it tries to solve. The literature is populated with many prediction models to detect (diagnose) or prognosticate new onset cardiovascular diseases and predict future health in patients diagnosed with a cardiovascular disease. For instance, over 360 models for cardiovascular disease in the general population,[7] over 160 female-specific models for cardiovascular diseases,[8] and over 80 models for sudden cardiac arrest[9] already exist. Prime examples of such prediction models in cardiovascular diseases are the Framingham risk score[10] and the recently updated SCORE2[11] for cardiovascular disease prediction in the general population, and the EuroSCORE[12] and revised cardiac risk index[13] for inpatient predictions. With so many prediction models already in existence, and few of them used in practice, one may ask before developing a new model: is a new prediction model really needed? While the potential of AI technology to improve predictions over the existing prediction models is beyond dispute when trained in the right way and on the right data, the actual incremental value of AI over prevailing prediction models is not guaranteed for any given healthcare application.[14] This was, for instance, shown in a recent systematic review in which traditional regression techniques were not outperformed by modern AI-based prediction models.[15] Given the targeted medical problem at hand, claims about the incremental value of a more complex and possibly more difficult to implement AI prediction model over existing and often simpler prediction models should therefore be based on solid evidence coming, for example, from careful model and prediction comparisons (see also question 8). For example, an AI-based prediction model aiming to predict 10-year risk of cardiovascular events might be compared with a canonical model like the Framingham risk score.

Question 2: How does the artificial intelligence prediction model fit in the existing clinical workflow?

Knowing the intended place of a prediction model within the existing clinical workflow is essential to identify early on the barriers towards implementation of the model in daily practice. To understand and to be able to appraise the intended use of an AI prediction model, not only should it be clear what the model aims to predict, for whom it predicts (i.e. target population) and over what time period (prediction horizon; see also question 4), but also where in the clinical workflow the prediction model is aimed to be implemented. A new AI prediction model may be developed as an add-on to improve the efficiency of the existing diagnostic testing process. Or it may be aimed at replacing an existing prognostic prediction model that is already part of the workflow. Such contextual information is of critical importance to identify relevant limitations and barriers of the AI-based prediction model early, well before implementation in daily clinical processes. For example, cultural barriers, such as limited trust of intended users in a complex AI model, and technical barriers, such as a mismatch between the technology platforms required to execute and maintain the AI model and those available in the daily clinical process of intended use, could to some degree be addressed during AI prediction model development if they are identified early.[16] These particular barriers may be addressed by aiming for more transparent and simpler modelling strategies and by giving insight into which features contribute most to making the predictions (see question 12). Operational challenges and perceived clinical utility may also play an important role in the adoption of an AI algorithm. 
For instance, a study of a diagnostic AI model for detecting diabetic retinopathy from retinal images found that the increasing workload of medical personnel associated with uploading images was an important barrier to implementation.[17] Another study identified a lack of perceived utility for decision-making of a prognostic model to predict post-operative nausea and vomiting as an important barrier.[18] Such studies of barriers that prevent implementation of prediction models are rare.

Question 3: Are the data for prediction model development and testing representative for the targeted patient population and intended use?

Representativeness of the data for the targeted population and intended use is important for development of well-calibrated prediction models and valid testing of AI predictions. The predictive performance and clinical utility of any prediction model highly depend on the quality and representativeness of the data available for the model’s development. When data of low quality are used to train a prediction model, for instance, data subject to large incompleteness, measurement and misclassification errors, or using data which are based on (partly) wrong and biased human decisions, important patterns may be missed that would otherwise be identified. This usually results in loss in the model’s predictive performance.[19] Conversely, using high-quality data that are not representative for data quality in the targeted population may also result in disappointing predictive performance and even misleading predictions (a phenomenon known as predictor measurement heterogeneity[20,21]) once the model is tested or applied in the targeted setting with data that were not collected primarily for research, such as routine care data, or data collected in settings that differ greatly from the development setting. For example, the aforementioned AI algorithm for detecting diabetic retinopathy from retinal images performed poorly when employed under poor lighting conditions in eye clinics in Thailand.[17] Detailed information on the conditions under which data were collected and standardization of data collection where possible may reduce the chance of unexpected prediction failures due to measurement heterogeneity. Representativeness of the data for the targeted population and context is also critical to ensure that individualized risks (i.e. predicted risks of the outcome given the individuals’ feature values) are appropriately calibrated. Individualized risk estimates are often used to make medical decisions; inaccurate estimates of these risks (i.e. 
miscalibration) can thus lead to poor medical decisions. Data sets used for a prognostic model development with an incidence of the predicted outcome that is not representative for the incidence of the predicted outcome in the targeted clinical setting may require model recalibration.[22] Likewise, models can be poor performing and miscalibrated when developed on data in which certain groups are underrepresented (e.g. based on gender, ethnicity, and comorbidities), which may require model updating or even complete re-development.[22] Developments in transfer learning, in which an AI prediction model can be pre-trained on a data set that is not representative, may be used to alleviate some of the problems encountered with non-representative data.[23] Representativeness of data for the targeted population and context is arguably even more important for testing than for developing any prediction model. First, this is required for a valid assessment of the model’s calibration.[24] Second, it avoids artificial inflation of summary measures of predictive performance, e.g. by including healthy controls that are not part of the targeted population, or through overrepresentation of individuals with advanced diseases, in which prediction errors are less likely to occur.[25] Likewise, exclusion should be avoided of individuals with data that are incomplete (i.e. missing data) or of individuals for whom the outcome is more difficult to be determined, e.g. cases with an atypical presentation which are harder to predict. Excluding such individuals from testing may create a selection bias that results in unrealistic expectations of performance when the model is eventually applied in daily practice.
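As a rough illustration of how a mismatch in outcome incidence translates into miscalibration, the following Python sketch shifts predicted risks on the log-odds scale by the difference between the incidence in the development data and in the target setting. It is only a crude intercept correction, assuming the feature-outcome associations carry over unchanged; the function names and the example incidences are hypothetical, and in practice the intercept would be re-estimated on data from the target setting.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse of logit: map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def shift_risk(pred, dev_incidence, target_incidence):
    """Shift a predicted risk by the log-odds difference in incidence.

    Assumption: only the baseline risk differs between settings.
    """
    offset = logit(target_incidence) - logit(dev_incidence)
    return inv_logit(logit(pred) + offset)


# Hypothetical numbers: the model was developed where 30% had the outcome,
# but is applied where only 10% do.
adjusted = shift_risk(0.30, dev_incidence=0.30, target_incidence=0.10)
unchanged = shift_risk(0.30, dev_incidence=0.30, target_incidence=0.30)
print(round(adjusted, 3), round(unchanged, 3))
```

Without such a correction, an "average" patient in this hypothetical target setting would be assigned a risk of 30% where roughly 10% is observed.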

Question 4: Is the (time)point of prediction clear and aligned with the feature measurements?

The intended timing of prediction and the measurement of the feature data should be aligned, and the prediction model should not be developed or tested using measurements that are unavailable at the time of prediction. In diagnostic prediction models, the goal is to determine whether the condition of interest is present or absent at the moment of prediction—the time a prediction is made. For instance, a disease may already be manifest but not yet assessed by a reference test.[26] Hence, to be aligned with the intended use of a diagnostic prediction model, the feature data (i.e. the data that serves as input in the AI model) should be measured before the true disease status is known. For both diagnostic and prognostic prediction models, measurement and construction of features should generally be done without knowledge of the outcome to avoid artificial inflation of the associations between features and outcome.[27] Feature data should not include information that becomes available only after the intended moment of prediction. For example, an AI model that was developed to pre-operatively predict in-hospital mortality in patients undergoing transcatheter aortic valve replacement included features related to post-operative complications,[28] such as acute kidney injury, sepsis, and cardiac arrest. As post-operative complications will be unknown pre-operatively, the intended point of prediction, such a model cannot be applied as intended. Prognostic models in particular require specification of a prediction horizon—how far ahead in time the model aims to predict outcome occurrence by—and follow-up time to measure the outcome needs to be matched to that. Variation in follow-up times, e.g. because of administrative censoring or competing risks, can be accounted for using survival analysis techniques.[29]
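A simple check for this kind of leakage is to compare the time each feature was measured with the intended moment of prediction. The Python sketch below uses made-up feature names and timestamps that loosely mirror the transcatheter aortic valve replacement example; it is an illustration of the idea, not a description of how any cited study handled its data.

```python
from datetime import datetime


def features_available_at(measurements, prediction_time):
    """Keep only features measured at or before the moment of prediction."""
    return {
        name: value
        for name, (value, measured_at) in measurements.items()
        if measured_at <= prediction_time
    }


# Hypothetical patient record: feature name -> (value, time of measurement).
patient = {
    "age": (81, datetime(2021, 3, 1, 8, 0)),
    "creatinine_preop": (1.4, datetime(2021, 3, 1, 9, 30)),
    "acute_kidney_injury_postop": (1, datetime(2021, 3, 3, 14, 0)),
}

# Hypothetical moment of prediction: the morning before the procedure.
prediction_time = datetime(2021, 3, 2, 7, 0)

usable = features_available_at(patient, prediction_time)
leaked = sorted(set(patient) - set(usable))
print(usable)
print(leaked)
```

Any feature that ends up in `leaked` would not be known when the prediction is needed and should not be used for development or testing.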

Question 5: Is the outcome variable labelling procedure reliable, replicable, and independent?

Verification of the outcome status for each individual in the data set that is accurate and independent of the feature data is essential for the development and valid testing of the AI prediction model. Like all other domains of medicine, there are many situations in cardiovascular disease research in which a perfectly accurate gold standard to diagnose a cardiovascular disease or condition is not available. This is, for instance, applicable to the diagnosis of heart failure with preserved ejection fraction,[30] but can also be relevant when interest is in cause-specific mortality or myocardial infarctions registered in, for instance, a routine healthcare database.[31] When developing an AI prediction model for an outcome for which no perfect reference standard is available, misclassification of the outcome status becomes probable. This can severely hamper the performance of the prediction model developed on the misclassified outcome data.[32] The AI prediction model may then be able to adequately predict the imperfect reference standard but not the true condition of interest. To increase reliability and completeness of the verification of the outcome status, it may therefore be desired to rely on the judgement of individual patients by an expert, or a group of experts, or even independent outcome adjudication committees as commonly used in randomized therapeutic intervention trials. 
In image recognition applications of AI, such a process is known as labelling, often requiring large numbers of images to be scrutinized and annotated, which is a burdensome task that itself carries a risk of error.[33] To ensure verification of the outcome status can be appraised and replicated, detailed information must be provided regarding the experts involved, such as the education, expertise, years of experience of experts, and the setting, such as the number of experts per case, available information per case, time constraints and how discrepancies or disagreements between experts were resolved. Earlier studies into inter-observer variability within cardiovascular testing can serve as good examples.[34-37] Recent innovations for semi-automated labelling[38] may also be a promising area of development to overcome some of the mentioned limitations of case-by-case labelling. While verification of the outcome status should be done as accurately as possible, it should in general be done without knowledge of the patients’ characteristics that are used as candidate features in the development of the AI prediction model. This is important in situations where the outcome status can be influenced by knowledge from patient characteristics (e.g. not likely when non-cause-specific mortality is the outcome), which in turn can create artificial inflation of the associations between features and outcome.[27]
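When two experts label the same cases, their agreement beyond chance can be summarized before discrepancies are resolved. The Python sketch below computes Cohen's kappa, one common agreement statistic; it is offered as an example and is not prescribed by the text above, and the labels are invented.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical outcome labels (1 = condition present) from two experts.
expert_1 = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
expert_2 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0]

kappa = cohen_kappa(expert_1, expert_2)
to_adjudicate = [i for i, (a, b) in enumerate(zip(expert_1, expert_2)) if a != b]
print(round(kappa, 3), to_adjudicate)
```

The cases listed in `to_adjudicate` are the ones for which the procedure for resolving disagreements needs to be reported.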

Question 6: Was the sample size sufficient for artificial intelligence prediction model development and testing?

It is essential to ensure the sample size available for developing and testing of the AI prediction model suffices to allow for reliable predictions in new individuals. The sample size of the data set for development of the AI prediction model must be large enough to develop a model that is reliable when applied to new individuals in the target population and context. For regression-based prediction modelling, there is guidance and easy-to-use software to calculate the minimally required sample size.[39,40] The minimally required sample size for a prediction model increases (i) the further away the incidence or prevalence in the target population is from 0.5, (ii) the lower the predictive value in the features (i.e. the extent to which the features are able to explain the variance in the outcome), and (iii) the more features and complexities are considered during modelling. Hence, to be able to make full use of the potential of AI prediction models, often with much higher complexity than default regression-based modelling, a much larger sample size than for traditional regression-based approaches is usually needed.[39] In particular, with rare outcomes such as inherited cardiomyopathy occurring only in 1/250 to 1/5000,[41] the minimally required sample size for traditional regression-based approaches may already be very high.[40] While the three minimally required sample size drivers mentioned above can be used as a starting point for AI prediction modelling, currently no formal sample size criteria exist for alternatives to regression-based prediction models, such as random forest and neural networks including deep learning, let alone for settings in which the number of candidate features is much larger than the number of available individuals (i.e. high-dimensional data analyses), which limits the possibilities to perform a priori sample size calculations for such applications. 
However, a posteriori sample size calculation can also be performed to justify the sample size of the development data and to ensure it suffices for developing the AI prediction model. A flexible a posteriori approach that can be used in retrospective and prospective model development studies is the learning curve approach.[42] A recent review of sample size determination methodologies in medical imaging research shows such an a posteriori sample size calculation is still rarely applied.[43] The sample size for prediction model test data sets should be large enough to ensure the predictive performance measures (see question 8) can be estimated with sufficient precision (i.e. small confidence intervals). For reasonably precise model testing results, usually data are required for several hundreds of individuals. Recent formulas for minimally required sample size for prediction model testing or validation based on various predictive performance criteria have been published.[44] Unlike with model development, sample size formulas for prediction modelling testing or validation are applicable regardless of the method used to develop the AI prediction model and can thus be calculated and justified a priori.
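To give a feel for why rare outcomes drive sample size, the Python sketch below converts a target number of outcome events into the number of individuals that would be expected to yield them. The target of 100 events is a hypothetical figure chosen for illustration only; it is not a substitute for the formal sample size calculations referenced above.

```python
def individuals_needed(target_events, one_in):
    """Expected number of individuals needed to observe `target_events`
    events when the outcome occurs in 1 out of `one_in` individuals."""
    return target_events * one_in


# Range quoted above for inherited cardiomyopathy: 1/250 to 1/5000.
for one_in in (250, 5000):
    n = individuals_needed(100, one_in)  # 100 events: hypothetical target
    print(f"1 in {one_in}: about {n:,} individuals for 100 expected events")
```

Under these assumptions the required number of individuals ranges from 25,000 to 500,000, which shows how quickly requirements grow as the outcome becomes rarer.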

Question 7: Is optimism of predictive performance of the artificial intelligence prediction model avoided?

Testing of AI prediction is done through internal and external validation approaches that avoid reporting of optimistic model performance. The AI prediction models should foremost be evaluated for their performance in making valid predictions. Estimates of predictive performance, such as the area under the receiver-operating characteristic curve (AUROC), should not be obtained directly from the same data used to develop the AI prediction model, as this will lead to estimates of performance that are too optimistic, for instance too high estimates of the AUROC.[22,45] Instead, performance of AI prediction models must be tested through rigorous internal and external prediction model validation procedures, to provide reliable estimates of their predictive performance. Internal validation procedures use only the original model development study data to get estimates of predictive performance and include methods such as bootstrapping or cross-validation.[46] These methods do not prevent model overfitting but can provide insight in the extent of performance overfitting and aim to get an unbiased assessment of the model’s performance. All steps taken for development should be integrated in the internal validation, i.e. considered as part of the AI modelling process and, in case of bootstrap or cross-validation, repeated in each bootstrap or cross-validation iteration. This includes steps for pre-filtering and selection of features and tuning and selection of the models, to avoid so-called incomplete validation.[47] Strictly speaking, such procedures test the AI modelling process rather than the final model itself. Another common approach to test the AI prediction models is on a single test set after a train–test split of the data (or sometimes training–validation–test split, where validation here confusingly refers to data used for model tuning and selection). 
Train–test splits are common in AI prediction model development studies, but they are typically regarded as an internal validation approach, reduce the effective sample size available for developing the model as compared with bootstrap and cross-validation procedures, and are commonly mistaken for actual external validation of an AI prediction model. To perform external validation, data may come from the same setting as used for development of the AI prediction model, collected in a different time period but often by the same researchers (narrow external validation[46]), or by other researchers in another geographical area (broad external validation[46]), or even from other types of patients, deviating from the original intended use. In a recent example of a narrow external validation, Al-Farra et al.[48] performed a temporal validation of updated prediction models of early mortality after transcatheter aortic valve implantation. Routine healthcare data from 13 Dutch heart centres between 2013 and 2016 were used for updating prediction models, while data from the same hospitals collected in 2017 were used for a narrow external validation of the updated models. For an overview of broad external validations of cardiovascular disease models, see Damen et al.[7] Performance of any prediction model is expected to vary over time and place,[49,50] and therefore an AI prediction model is never truly validated, in the sense that it can be proven to work adequately across time and place. An external validation of an AI prediction model should therefore not be viewed as a method to generate a definitive verdict on the adequacy of performance of the AI prediction model, but as a snapshot that can generate knowledge about the performance and, importantly, variation in performance over time and place, and clues to the need for replacing, updating, and tailoring AI prediction models to specific settings.
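The requirement that all development steps be repeated within each cross-validation iteration can be sketched in a few lines of Python. In this minimal example the `develop` function stands for the whole modelling process (here a toy feature selection followed by a threshold rule) and only ever sees the training part of each fold. The data, the selection rule, and all function names are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k folds that partition range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def accuracy(preds, outcomes):
    return sum(p == y for p, y in zip(preds, outcomes)) / len(outcomes)


def develop(train_rows, train_labels):
    """Toy modelling process: pick the feature with the largest difference in
    class means, then classify by the midpoint between the two class means.
    Assumes higher feature values go with the positive class."""
    def mean(values):
        return sum(values) / len(values)

    best, best_gap, best_cut = 0, -1.0, 0.0
    for f in range(len(train_rows[0])):
        pos = [r[f] for r, y in zip(train_rows, train_labels) if y == 1]
        neg = [r[f] for r, y in zip(train_rows, train_labels) if y == 0]
        gap = abs(mean(pos) - mean(neg))
        if gap > best_gap:
            best, best_gap, best_cut = f, gap, (mean(pos) + mean(neg)) / 2
    return lambda row: int(row[best] > best_cut)


def cross_validate(rows, labels, develop, score, k=5):
    """Repeat the full modelling process inside every fold."""
    fold_scores = []
    for train, test in k_fold_indices(len(rows), k):
        predict = develop([rows[j] for j in train], [labels[j] for j in train])
        preds = [predict(rows[j]) for j in test]
        fold_scores.append(score(preds, [labels[j] for j in test]))
    return sum(fold_scores) / len(fold_scores)


# Hypothetical data: feature 0 equals the outcome, feature 1 is uninformative.
labels = [0, 1] * 5
rows = [(float(y), 0.3) for y in labels]
cv_accuracy = cross_validate(rows, labels, develop, accuracy, k=5)
print(cv_accuracy)
```

Selecting the feature once on all data and cross-validating only the final threshold rule would be an instance of the incomplete validation described above.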

Question 8: Was the artificial intelligence model’s performance evaluated beyond simple classification statistics?

The AUROC and classification accuracy statistics do not provide the full picture of the performance and utility of an AI prediction model. A broader view on performance is necessary. While AI prediction models often have a binary outcome, usually current (diagnosis) or future (prognosis) presence or absence of a certain target status, other outcomes such as multi-category, continuous, and time-to-event outcomes are possible, and their evaluation requires different performance measures (beyond the scope of this article). However, for the more common binary outcomes to be predicted, there is also a large variation in measures that can quantify performance.[51] In general, measures that are sensitive to the relative frequency of the outcome, such as the percentage of correctly predicted individuals, should be interpreted with caution, especially when the outcome is rare (e.g. when the relative frequency of the outcome is only 1%, a naive model that predicts everyone in the majority outcome class already has a percentage correctly predicted of 100% − 1% = 99%). For AI prediction models that quantify the probability of (current or future) presence of the target status for individuals (i.e. risk prediction models), the calibration of the model is important and often the weakest link.[24] Calibration of the model describes the degree to which the estimated risks agree with the observed risks. However, calibration and other traditional predictive performance measures, such as the AUROC, do not describe clinical consequences of the use of a prediction model. For this, decision curves[52] could be useful to determine the relation between a chosen risk threshold (for instance, a threshold above which treatment might be started) and the relative value of false-positive and false-negative predictions. 
This is to evaluate the net benefit of using the model at that specific risk threshold.[53] Another aspect that has received increasing interest is the comparison of the performance of AI models to that of clinical experts, notably for diagnostic tasks. In 2019, a systematic review that compared the performance of deep learning algorithms to that of health professional assessment in diagnosis of various diseases from medical images found that only 17% of the studies compared performance with that of health professionals in other data than the data used to train the model.[54] Such comparative studies come with additional challenges,[55] such as the need for creating realistic settings where physicians work under realistic practical time constraints and have access to all regular patient information (possibly including existing prediction model results), where performance of model and physicians are evaluated on the same scale, and where optimism about performance is prevented.[56] For a broader discussion of human vs. machine, we refer to Mayer-Schönberger and Cukier.[57]
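The two points about rare outcomes and net benefit can be made concrete with a small Python sketch. It uses the commonly used definition of net benefit at a risk threshold t, namely true positives per individual minus false positives per individual weighted by t / (1 − t). The data are invented, and whether a risk exactly equal to the threshold counts as positive is a convention chosen here.

```python
def accuracy(preds, outcomes):
    """Proportion of individuals whose outcome class is predicted correctly."""
    return sum(p == y for p, y in zip(preds, outcomes)) / len(outcomes)


def net_benefit(risks, outcomes, threshold):
    """Net benefit of treating everyone with predicted risk >= threshold."""
    n = len(outcomes)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1.0 - threshold)


# Hypothetical cohort of 100 individuals with a 1% outcome frequency.
outcomes = [1] + [0] * 99

# A naive "model" that assigns everyone to the majority class.
naive_accuracy = accuracy([0] * 100, outcomes)

# Hypothetical predicted risks: the event and two non-events exceed 20%.
risks = [0.4] + [0.3, 0.3] + [0.01] * 97
nb_model = net_benefit(risks, outcomes, threshold=0.2)
nb_treat_all = net_benefit([1.0] * 100, outcomes, threshold=0.2)
print(naive_accuracy, round(nb_model, 4), round(nb_treat_all, 4))
```

In this toy cohort the naive classifier reaches 99% accuracy without identifying a single event, whereas net benefit distinguishes using the model from simply treating everyone.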

Question 9: Were the relevant reporting guidelines for artificial intelligence prediction model studies followed?

Detailed and transparent reporting of AI prediction model development and testing is essential to ensure reproducible, replicable, and critically appraisable results. Replicability (i.e. re-development and evaluation of the AI prediction model on different data with similar results) and reproducibility (i.e. re-development and evaluation of the AI prediction model on the original data with exactly the same results) should be the core principles for development and testing of AI prediction models. This requires detailed planning, conduct, documentation, and transparent reporting of all steps of the prediction modelling, including all data preparation steps (e.g. feature engineering, initial data analysis[58]), all model selection, tuning, recalibration, and testing steps, and the results from internal and/or external validation approaches. Recent reviews of AI prediction models showed that the reporting of AI prediction modelling in academic journals is often poor.[59-62] Not only does incomplete and inaccurate reporting prevent adequate reproducibility and replicability of the study findings, it also prevents reviewers, readers, and users from appreciating the appropriateness of the methodology used for model development, tuning, and testing, compromising their ability to critically appraise the results. Such problems can easily be avoided by following established reporting guidelines. For prediction model development, validation, and updating studies, TRIPOD[46,63] has been the widely accepted reporting guideline. An extension specifically for AI prediction models is underway and soon anticipated.[64,65] A full overview of existing reporting guidelines for prediction models can be found at https://www.tripod-statement.org/, and all other types of reporting guidelines can be found on the Equator Network website: https://www.equator-network.org/.

Question 10: Is algorithmic (un)fairness considered and appropriately addressed?

AI prediction models can be a source of unfairness due to, among other reasons, choices in methodology or the data used for model development. Data driven approaches, including AI prediction models, to inform healthcare professionals about the likely diagnosis and prognosis of patients are often considered to provide objective sources of diagnostic and prognostic information. Applying such prediction models in daily practice can, however, in some cases do more harm than good; some degree of prediction error is inescapable when applying AI prediction models in medical practice. Such errors—and thus potential harm—may be more likely to occur in particular groups of patients. For instance, a comprehensive study of a commercial AI algorithm to manage health in the USA showed Black patients were on average sicker than White patients with the same level of risk.[66] This was attributed to the model using healthcare costs as a proxy for healthcare needs; less money is spent on Black patients with the same level of healthcare needs.[66] There is growing concern about the potential of AI prediction models to increase such racial, gender, and minority group disparities via either choice in model development or existing inequalities encoded in the data used.[67] When developing or testing prediction models, regardless of the used modelling technique, the algorithmic biases that create potential algorithmic unfairness should be acknowledged and investigated where possible.
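One simple way to start investigating such unfairness is to compare predicted and observed risks within subgroups. The Python sketch below does this for two hypothetical groups that receive identical predicted risks but differ in observed outcome frequency, loosely mirroring the example above; the group labels and numbers are invented, and real analyses would need far larger samples and uncertainty estimates.

```python
from collections import defaultdict


def calibration_by_group(risks, outcomes, groups):
    """Mean predicted risk and observed outcome proportion per group."""
    totals = defaultdict(lambda: [0.0, 0, 0])  # sum of risks, events, count
    for risk, outcome, group in zip(risks, outcomes, groups):
        entry = totals[group]
        entry[0] += risk
        entry[1] += outcome
        entry[2] += 1
    return {
        group: {"mean_predicted": s / n, "observed": e / n, "n": n}
        for group, (s, e, n) in totals.items()
    }


# Hypothetical data: both groups are assigned the same risk of 20%.
risks = [0.2] * 10
outcomes = [1, 0, 0, 0, 0] + [1, 1, 0, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

report = calibration_by_group(risks, outcomes, groups)
for group, stats in sorted(report.items()):
    print(group, stats)
```

Here group B has twice the observed outcome frequency of group A at the same predicted risk, which is the kind of discrepancy that should prompt further investigation.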

Question 11: Is the developed artificial intelligence prediction model open for further testing, critical appraisal, updating, and use in daily practice?

Proprietary algorithms, complex algorithms, and algorithms that received regulatory approval may be more limited in openness, testing, and updating, and less welcoming to critical appraisal, than non-proprietary algorithms. The development, testing, and deployment of AI prediction models are increasingly the domain of commercial developers. These developers may choose not to disclose their algorithm and ask for a fee per patient for model use.[68] For example, a biomarker-based model to diagnose ovarian cancer has a cost of $897 per patient, so that testing this model through an external validation may require an investment of more than $400K in model use costs alone.[68] Hence, the ability to test proprietary models may be severely hampered by financial constraints of the user or tester. That commercially available prediction models are also not guaranteed to perform well was recently illustrated in the context of a widely implemented commercial model to predict sepsis, which showed very poor calibration and discrimination in a broad external validation.[69] Even in the absence of fees for use or testing, complex and proprietary algorithms often largely remain a ‘black box’ for testers or users, with limited ability for critical appraisal and updating of the AI prediction model. AI prediction models rarely come with easily applicable model coefficients (see also question 12) that can easily be updated. Instead, the model may be encoded in closed software. Arguably, such closed software AI prediction models require extra scrutiny of their output through thorough testing with special attention to potential algorithmic unfairness. Under the current regulatory standards, commercial and non-commercial AI prediction models that have already received regulatory approval (for instance, via a Conformité Européenne mark for a medical device) are limited in their opportunities to be updated. 
If a model is updated, for instance, to become better calibrated in a new setting (e.g. hospital), the updated AI prediction model may require additional regulatory approval before it can be used in practice.
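To make the idea of updating a model for a new setting concrete, the following minimal sketch re-estimates only the intercept of an existing model, a simple form of updating often called recalibration-in-the-large. It assumes that predicted risks from the original model are available even if the model itself is closed. The function names and the toy data are hypothetical and are not taken from any model cited in this article.

```python
import math

def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def expit(z):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))

def intercept_shift(pred_risks, outcomes, lo=-20.0, hi=20.0, tol=1e-10):
    """Find the shift on the log-odds scale that makes the mean updated
    risk equal to the observed event rate in the new setting.

    The mean updated risk increases monotonically with the shift, so a
    simple bisection search is sufficient. The observed event rate must
    lie strictly between 0 and 1.
    """
    linear_predictors = [logit(p) for p in pred_risks]
    target = sum(outcomes) / len(outcomes)

    def mean_risk(delta):
        return sum(expit(z + delta) for z in linear_predictors) / len(linear_predictors)

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if mean_risk(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical new-setting data: the original model overestimates risk here.
pred_risks = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]  # original predictions
outcomes = [0, 0, 0, 0, 1, 0]                      # observed outcomes

delta = intercept_shift(pred_risks, outcomes)
updated_risks = [expit(logit(p) + delta) for p in pred_risks]

print("shift on the log-odds scale:", round(delta, 3))
print("updated risks:", [round(r, 3) for r in updated_risks])
```

Because only the intercept changes, the ranking of patients (and therefore discrimination) is untouched, while the average predicted risk is aligned with the new setting. More extensive updates, such as re-estimating a calibration slope or individual coefficients, need more data and access to more of the model.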

Question 12: Are presented relations between individual features and the outcome not overinterpreted?

Approaches to identify which features are most important in making the predictions can increase interpretability of a black-box AI prediction model, but come with the risk of overinterpretation, including incorrectly inferring that the strongest associations between features and outcome indicate causal relations. AI prediction models are often criticized by healthcare workers, patients, lawmakers, and scientists for their black-box nature.[70] Regression models are usually presented as equations with regression coefficients representing the strength of the multivariable relation between individual features and the outcome. Outside of the regression framework (e.g. random forest, neural networks), the strength of multivariable feature–outcome relations may not be directly represented in the software output. Indeed, when only the output of a black-box model is presented to the user (i.e. the predictions), whereas the associations between the individual features and the outcome remain hidden, the predictions are difficult to scrutinize and to question, which in turn may hamper trust of the user in the AI prediction model. Several approaches exist that aim at opening the black box after the AI prediction model has been developed, to ‘explain’ which features contribute most to making the prediction. Common examples of these so-called explainable AI approaches[71] are Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP)[72] values. Analogous to regression coefficients, SHAP and LIME values express both the direction and the magnitude of a feature–outcome association. For an application of SHAP in the context of obstructive coronary artery disease, we refer the reader to Al’Aref et al.[73] Despite the increasing popularity of approaches that aim to increase interpretability of AI prediction models, several authors have warned against overinterpretation of their results.[74,75] Such approaches do not generally allow for statements about causal relations between the selected features and the outcome. This is because causal inference inherently requires information that cannot directly be derived from data but must be provided by the analyst as explicit sets of assumptions.[76] Conclusions about cause and effect purely based on prediction model feature–outcome associations are rarely justified, even when the modelling is done using AI techniques. For discussions on AI approaches that are specifically designed to explore cause and effect, we refer the reader to Blakely et al.[77]
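As an illustration of what such feature attributions represent, the sketch below computes exact Shapley values for one patient under a small risk model. Everything in it is an assumption made for illustration: the model `toy_risk`, its coefficients, and the patient and reference values are hypothetical, and features outside a coalition are simply set to a single reference patient's values. This is a simplification and not the implementation used by the SHAP software, which typically works with a background dataset and approximations.

```python
from itertools import combinations
from math import exp, factorial

def toy_risk(age, sbp, smoker):
    """Hypothetical risk model with an age-by-smoking interaction."""
    z = -7.0 + 0.06 * age + 0.01 * sbp + 0.5 * smoker + 0.01 * age * smoker
    return 1.0 / (1.0 + exp(-z))

def shapley_values(model, x, baseline):
    """Exact Shapley values for one prediction.

    Features in a coalition take the patient's values; all other
    features take the baseline (reference) values.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        inputs = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(**inputs)

    phi = {}
    for k in names:
        others = [m for m in names if m != k]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {k}) - value(s))
        phi[k] = total
    return phi

patient = {"age": 70, "sbp": 160, "smoker": 1}
reference = {"age": 50, "sbp": 120, "smoker": 0}

phi = shapley_values(toy_risk, patient, reference)
for name, contribution in phi.items():
    print(name, round(contribution, 4))

# The contributions add up to the difference between the patient's
# predicted risk and the reference patient's predicted risk.
print(round(sum(phi.values()), 4),
      round(toy_risk(**patient) - toy_risk(**reference), 4))
```

The values describe how this particular model arrives at its prediction relative to the chosen reference. They change if the model or the reference changes, and they say nothing about what would happen to the patient's risk if a feature were modified, which is why they should not be read as causal effects.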

Conclusion

In this article, we proposed 12 critical questions to be asked by readers and reviewers when they are confronted with prediction models that are based on AI. Many of these questions may also have relevance for prediction models that are not based on AI (Table 1). With the increasing use of AI in medicine in general and, in particular, in the diagnosis, prognostication, and treatment of cardiovascular diseases, it is important to keep asking critical questions before these prediction models are implemented in practice. Before implementation, many steps need to be taken, including data preparation, training of the model and software, and evaluation of the performance of the model and its impact on clinical decision-making. For an overview of contemporary detailed guidance for each of these steps, we refer to a recent scoping review.[78] For an overview of ethical guidelines related to AI, see Hagendorff.[79]

Table 1

Twelve critical questions about artificial intelligence-based prediction of cardiovascular disease

Question	Key considerations
Is AI needed to solve the targeted medical problem?	• Many prediction models already exist, but few of them are used
	• Value of a new complex model over an existing simpler model is not guaranteed
How does the AI prediction model fit in the existing clinical workflow?	• Knowing the place of a model in the clinical workflow is essential to identify and address cultural and technical barriers early on
Are the data for prediction model development and testing representative for the targeted patient population and intended use?	• Representative data at development is essential for model calibration
	• Excluding individuals with atypical presentation or missing data can create bias in predictive performance measures
Is the (time)point of prediction clear and aligned with the feature measurements?	• Feature data should not include information that becomes available only after the intended moment of prediction
	• Prognostic models require specification of a clear prediction horizon
Is the outcome variable labelling procedure reliable, replicable, and independent?	• Verification of the outcome status should be done accurately
	• Inaccurate verification may bias predictions and estimates of predictive performance
Was the sample size sufficient for AI prediction model development and testing?	• A priori or a posteriori sample size calculations can be used to justify the sample size
Is optimism of predictive performance of the AI prediction model avoided?	• Performance of AI prediction models must be tested through rigorous internal and external validation procedures
Was the AI model’s performance evaluated beyond simple classification statistics?	• There is a large variety of statistics to quantify predictive performance
	• Traditional performance statistics do not describe clinical consequences of using the AI prediction model
Were the relevant reporting guidelines for AI prediction model studies followed?	• Reporting of prediction modelling studies is often poor
	• TRIPOD can be used to guide reporting for model development and testing
Is algorithmic (un)fairness considered and appropriately addressed?	• Prediction models have the potential to do harm when applied
	• Choices in model development and existing inequalities encoded in the data can create prediction models that are harmful to particular groups
Is the developed AI prediction model open for further testing, critical appraisal, updating, and use in daily practice?	• Proprietary AI prediction models can be difficult or expensive to test and critically appraise
	• Regulatory standards can hamper the opportunities to update existing models that already received regulatory approval
Are presented relations between individual features and the outcome not overinterpreted?	• Explainable AI methodology can be helpful to identify which features contribute most to making predictions
	• Conclusions about cause and effect purely based on prediction modelling results are rarely justified
