Anna K Bonkhoff1, Christian Grefkes2,3,4. 1. J. Philip Kistler Stroke Research Center, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA. 2. Cognitive Neuroscience, Institute of Neuroscience and Medicine (INM-3), Research Centre Juelich, Juelich, Germany. 3. Department of Neurology, University Hospital Cologne, Cologne, Germany. 4. Medical Faculty, University of Cologne, Cologne, Germany.
Abstract
Stroke ranks among the leading causes for morbidity and mortality worldwide. New and continuously improving treatment options such as thrombolysis and thrombectomy have revolutionized acute stroke treatment in recent years. Following modern rhythms, the next revolution might well be the strategic use of the steadily increasing amounts of patient-related data for generating models enabling individualized outcome predictions. Milestones have already been achieved in several health care domains, as big data and artificial intelligence have entered everyday life. The aim of this review is to synoptically illustrate and discuss how artificial intelligence approaches may help to compute single-patient predictions in stroke outcome research in the acute, subacute and chronic stage. We will present approaches considering demographic, clinical and electrophysiological data, as well as data originating from various imaging modalities and combinations thereof. We will outline their advantages, disadvantages, their potential pitfalls and the promises they hold with a special focus on a clinical audience. Throughout the review we will highlight methodological aspects of novel machine-learning approaches as they are particularly crucial to realize precision medicine. We will finally provide an outlook on how artificial intelligence approaches might contribute to enhancing favourable outcomes after stroke.
In spite of over 10 million yearly strokes worldwide and a global lifetime risk of 25% to suffer a stroke,[1,2] each of these strokes is a unique and very personal experience, leaving each stroke survivor with his or her very own story. Imagine being that one particular patient: you are female, 65 years old and have arterial hypertension as a known comorbidity, yet are otherwise a healthy and independent person. You have noticed a weaker left-sided grip strength for 1 h and now cannot lift your left arm against gravity; your speech is slurred. Your symptom severity corresponds to a National Institutes of Health Stroke Scale (NIHSS) score of 5 (maximum: 42). Initial MRI indicates ischaemia in the right internal capsule with an acute onset and no evidence for a large vessel occlusion. Would you choose a treatment and outcome prediction based on the ‘average’ stroke patient having a comparable NIHSS score, time constellation and imaging findings? Or would you rather prefer a more personalized version that takes into account (i) your individual constitution with respect to the potential for recovery; and/or (ii) your response to a certain treatment, and that has the potential to produce individualized predictions? The second, more complex choice, considering high-dimensional information, may become increasingly feasible when merging artificial and human intelligence, as we will outline in more depth in this review.
In well-developed countries, stroke outcome has been steadily improving in recent years. These advancements have been achieved by highly effective recanalizing therapies for acute treatment, such as thrombolysis and thrombectomy,[3,4] high-quality imaging, the stratified extension of therapeutic time windows[5,6] and standardized care in dedicated stroke units. Intense rehabilitation programs[7,8] and secondary prevention, such as anticoagulants and statins,[9,10] are further examples in the subacute and chronic phases. 
However, most of these post-stroke treatment options have a high number needed to treat to prevent one unfavourable outcome. Therefore, the optimal and most effective treatment decisions for an individual may not necessarily be derived from population averages.
These insights are not limited to stroke, but pertain to healthcare in general. They have prompted a new focus on individualizing treatments in recent years and ignited increasing numbers of precision medicine endeavours.[11,12] Since then, optimization efforts have increasingly focused on individuals rather than population averages to increase the efficacy of healthcare.
The role of artificial intelligence for precision medicine
Modern artificial intelligence (AI) practices offer the great opportunity to realize the vision of precision medicine.[13-15] AI can be formally defined as ‘the capacity of computers or other machines to exhibit or simulate intelligent behaviour; the field of study concerned with this’ (Oxford English Dictionary, see also Matheny et al.[16]). Of note, the term AI was introduced already about 70 years ago.[17,18] However, since then, AI has also experienced several periods of reduced interest (‘AI winter’) after falling short of expectations. Early AI implementations successfully completed tasks that are usually difficult for humans by applying a sequence of logical rules.[19] Examples may be seen in expert systems that imitate human decision-making processes.[20] These same implementations, however, failed to tackle tasks easy to complete for humans, such as image recognition. With the recent coincidence of growing amounts of data, exponentially increasing computational power, affordable computing and storing resources, as well as a broad software availability,[21] techniques such as machine learning and deep learning have begun to remedy these previous shortcomings. 
In general, both machine and deep learning have led to ground-breaking innovations, such as intelligent software to understand language[22] and images[23] or, as a very recent biological example, the prediction of protein structures based on their amino acid sequence (AlphaFold).[24] Machine and deep learning approaches, as modern branches of AI, excel in automatically detecting patterns in data and leveraging those patterns to predict future data (see Box 1 for examples of individual algorithms).[25-27] Deep learning is special in that it leverages artificial neural networks with multiple (‘deep’) levels of representation that facilitate the acquisition of particularly complex functions.[28]
Supervised learning
The machine-learning algorithms highlighted in this review fall into the category of supervised learning algorithms. This scenario assumes that each observation of the predictor (= input) variables is linked to a response. Responses can be quantitative (i.e. taking on a numerical value, such as a patient’s Fugl–Meyer score) or qualitative (i.e. categorical, such as motor symptoms versus no motor symptoms), resulting in the formulation of regression or classification problems, respectively.[83] Overall, supervised learning stands in contrast to unsupervised learning, where we have observations of measurements but no associated responses. Hence, instead of formulating regression or classification problems, the main aim of unsupervised learning approaches is to understand relationships between observations, which can, for example, be achieved via clustering.[83] Further examples of unsupervised learning are dimensionality reduction techniques, such as principal component analysis (PCA) or non-negative matrix factorization. 
Classical examples of supervised learning algorithms for both regression and classification are linear regression models (regularized or unregularized), tree-based and nearest-neighbour algorithms, support vector machines (SVMs) and deep neural networks:
Linear regression: In linear regression, one (simple linear regression) or more input variables (multiple linear regression) are linked to a response via a linear function. A typical application scenario for linear regression in stroke recovery research is the modelling of Fugl–Meyer follow-up scores based on initial Fugl–Meyer scores.[97] Model parameters are commonly fitted using the least-squares approach or a penalized version for regularized variations (ridge regression: L2-norm penalty; lasso: L1-norm penalty). In case of regularization, coefficient estimates are shrunk towards zero (ridge) or exactly to zero (lasso), which can be particularly helpful in the case of highly variable least-squares estimates that often arise when the number of explanatory variables is almost as large as the sample size, i.e. the number of observations.[83] In these situations, estimates may differ widely between different samples.
Tree-based algorithms: The simplest tree-based algorithm is a decision tree. An exemplary application in recovery research is the PREP algorithm.[128] Other tree-based algorithms are, for example, random forest[210] and gradient boosting algorithms.[122] Regression and classification are achieved by finding sequences of splitting rules that segment the space of input variables into simple regions. While being very transparent and interpretable, decision trees usually cannot compete with other algorithms with respect to prediction performance. 
However, modifications, such as bagging, boosting and random forests, that introduce different ways of combining multiple decision trees (ensemble learning), have been shown to enhance prediction performance substantially.[164] Interpretation is less straightforward in case of ensemble learning approaches due to their complexity (combination of trees). However, it is still possible to extract the importance of input variables for generating predictions, which facilitates their interpretability.
Nearest-neighbour algorithms: These algorithms solve regression or classification problems by finding the k closest observations for any given observation and then creating average responses or majority votes.[211] Therefore, the predicted response is the average value of the responses of all k neighbours in regression scenarios. In classification scenarios, the predicted category is the majority class of the nearest neighbours. Overall, it is appreciated that nearest-neighbour algorithms can find very complex patterns in data, which, however, comes at the cost of increased computational demand and decreased interpretability.[165]
Support vector machines: SVMs are generalizations of so-called maximal margin classifiers.[87] SVMs are frequently used in multivariate lesion-symptom studies relying on neuroimaging data.[159] Consider, for example, a classification problem with two linearly separable classes: in this case, many straight lines can entirely separate the two classes; an SVM finds the one straight line with the widest margin. The observations closest to this separating line are then called support vectors. In reality, classes may not be perfectly separable, and the objective might rather be to accept misclassification in a few instances to allow for a better classification performance in general, i.e. higher generalizability. While ‘classic’ SVMs are linear models, they can be rendered non-linear by introducing a ‘kernel’ that maps the input variables to a higher dimensional space. SVMs are comparably less computationally expensive and more interpretable, but more limited in the complexity of patterns that they can fit.[165]
Deep learning algorithms: Deep learning algorithms, in particular, have gained traction in recent years, not least due to concurrently increasing dataset sizes and available computational power. 
They constitute exceptionally flexible methods that combine multiple stacked layers and non-linear transformations when passing on information from one layer to the next. While each building block is comparably simple, their combination has been shown to be capable of automated feature selection and representation of complex patterns. As Goodfellow and colleagues[19] phrase it: ‘Deep learning allows the computer to build complex concepts out of simpler concepts’. For example, deep learning algorithms have been introduced in stroke outcome prediction scenarios based on clinical data.[113,114]
All of the outlined algorithms have their unique strengths and weaknesses. It may be particularly instructive to compare them with respect to their transparency and complexity[212] (Fig. 6).
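To make the nearest-neighbour procedure described above more tangible, the following is a minimal sketch in plain Python. It is not taken from any of the cited studies: the function names, the toy patients (age, initial NIHSS) and their outcome values are hypothetical and chosen purely for illustration. In a real application, input variables would typically be standardized first, since the Euclidean distance is otherwise dominated by the variable with the largest scale.

```python
import math
from collections import Counter

def k_nearest(train_x, query, k):
    """Return the indices of the k training observations closest to `query`."""
    distances = [(math.dist(x, query), i) for i, x in enumerate(train_x)]
    return [i for _, i in sorted(distances)[:k]]

def knn_regress(train_x, train_y, query, k=3):
    """Regression: average response of the k nearest neighbours."""
    idx = k_nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / k

def knn_classify(train_x, train_labels, query, k=3):
    """Classification: majority class among the k nearest neighbours."""
    idx = k_nearest(train_x, query, k)
    return Counter(train_labels[i] for i in idx).most_common(1)[0][0]

# Hypothetical toy data: (age in years, initial NIHSS) per patient
train_x = [(60, 4), (62, 5), (65, 6), (80, 18), (78, 20), (82, 16)]
# Hypothetical follow-up motor scores (regression target)
train_y = [60.0, 58.0, 55.0, 20.0, 15.0, 25.0]
# Hypothetical categorical outcomes (classification target)
train_labels = ["favourable"] * 3 + ["unfavourable"] * 3

new_patient = (64, 5)
predicted_score = knn_regress(train_x, train_y, new_patient)
predicted_class = knn_classify(train_x, train_labels, new_patient)
print(predicted_score, predicted_class)
```

The sketch also hints at the computational cost mentioned above: every prediction requires computing the distance to all training observations.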
Figure 6
Comparison of various learning algorithms with respect to their model transparency and complexity. Model transparency here refers to the interpretability of input variables and thus the potential scientific insight and mechanistic understanding that can be gained. More complex models, in return, maximize the predictive power. Altogether, increased transparency may come at the cost of decreased model complexity and associated decreased predictive power and vice versa. Figure adapted from Bzdok and Ioannidis,[212] with permission.
Notably, AI is not a new idea in healthcare,[29] as expert-guided, rule-based medical approaches were already introduced in the 1970s, for example featuring the automated interpretation of ECGs.[18,30] Once again, machine and deep learning have recently enabled substantial improvements and demonstrated performances comparable with highly trained physicians, especially in the fields of radiology, dermatology and ophthalmology. For example, Gulshan and colleagues[31] demonstrated the feasibility of automatically detecting diabetic retinopathy in retinal fundus photographs. Esteva and colleagues[32] predicted skin cancer type as accurately as dermatologists, and Hannun and colleagues[33] constructed a deep learning model that could accurately classify electrocardiograms into 12 rhythm classes. These successful AI implementations hold several promises in the longer term, such as predicting future disease manifestations based on routinely collected healthcare data,[34] or automated screening for certain cancer types in imaging data.[35] In the shorter term, AI-based individualized predictions on clinical outcomes could provide essential information for healthcare professionals, as well as patients, their families and friends.[16]
To foster the potential of machine and deep learning, it will be of particular importance to acquire large datasets, comprising subject-level information on hundreds to thousands of patients. 
Only then will these datasets have the potential to adequately represent interindividual variability in the presentation of the disease, comorbidities and predisposition,[36,37] and allow for an advantageous performance of AI models. Recent years have already seen the advent of big medical data initiatives, mostly within the framework of population studies that are impressive not only in their number of participants (>500 000), but also in their data depth (number of variables >1000) (e.g. UK Biobank,[38] NIH All of Us research programme in the USA[39] and the Rhineland Study in Germany[40]). First examples of similar developments in stroke research can be observed as well: the Virtual International Stroke Trials Archive (VISTA) contains clinical data, such as the NIHSS, comorbidities or laboratory results, of 82 000 patients.[41] However, ‘big’ imaging datasets of stroke patients are still at least an order of magnitude smaller (e.g. 2800 structural scans in the MRI-GENIE study,[42,43] 2950 scans in the Meta VCI Map consortium,[44] 1800 scans in ENIGMA[45] or 1333 scans in a single-centre study[46,47]). All in all, there have been calls to accumulate and exploit regularly obtained clinical, imaging and genetic stroke patient data in a collaborative fashion.[48-50]
Article structure
In the following sections, we will specifically illustrate single-subject prediction scenarios within stroke outcome research in the acute, subacute and chronic stage. Additionally, we will highlight important considerations with respect to methodological approaches in line with the aim of this review. We first address general aspects of motor outcome research after stroke (‘Motor impairment after stroke’ section). Then, we will summarize the statistical foundations necessary to understand the basic principles of AI in healthcare (‘Statistical background for precision medicine: inference versus prediction’ section). Afterwards, we present and discuss recent studies on stroke outcome research with a special focus on those using prediction models, organized depending on the type of data, i.e. clinical data (‘Stroke prognostic scales based on clinical data only’ section), neurophysiological data, and combinations of clinical, neurophysiological and basic imaging data (‘Neurophysiology and combination of biomarkers in individual data’ section), as well as more detailed structural (‘Structural imaging’ section) and functional imaging data (‘Functional imaging’ section). Given their prime importance for the realization of precision medicine, we will outline essential methodological aspects at the beginning and end of each section. Finally, we will present a synopsis of methods as employed in concrete scenarios in motor outcome research post-stroke (‘Overview of employed algorithms’ section), their general advantages and promises (‘General advantages and promises’ section), as well as disadvantages and pitfalls (‘Disadvantages and pitfalls’ section). 
All in all, our review complements previous reviews on the use of AI in stroke, for example with a focus on clinical decision support in the acute phase,[51] acute stroke imaging,[52,53] stroke rehabilitation[54] and prognostic scales on clinical outcomes and mortality[55,56] (see the Supplementary material for our literature research strategy and selection criteria).
Motor impairment after stroke
A substantial number of stroke patients find themselves affected by some degree of motor impairment. Studies[7,57] report frequencies as high as 80% and 50%, respectively. The enormous burdens associated with motor impairments with regard to economic costs,[58] rehabilitation need[59] and disability-adjusted life years[60] necessitate optimizing acute and chronic stroke care. While acute stroke treatment has been considerably advanced, leading to both reduced mortality and morbidity in the past decades, it may now be the restorative therapy after stroke that needs to see the same progress.[61] This focus on the subacute-to-chronic post-stroke phase may be of particular importance since only a relatively small fraction of patients presenting with acute ischaemic stroke are eligible for acute treatment options (e.g. 15.9% for thrombolysis and 5.8% for mechanical thrombectomy in Germany in 2017,[62] with comparable numbers in various other countries[63,64]).
Providing accurate outcome predictions has always been a central goal in stroke research. More specifically, predictions may point at the most suitable short- and long-term treatment goals: should the focus of treatment be on true recovery or rather on compensation, when significant behavioural restitution is unlikely?[65] True recovery requires neural repair to allow for an at least partial return to the pre-stroke repertoire of behaviours, e.g. the same grasping movement pattern as present prior to cerebral ischaemia. Compensation implies the substitution of pre-stroke behaviours by newly learned patterns without the necessity of neural repair, e.g. compensatory movements of the shoulder to account for extension deficits of the hand.[61] During rehabilitation, patients often show both phenomena, i.e. a partial recovery, which is complemented by compensatory behaviours. 
In this context, rehabilitation refers to the entire process of care after brain injury and an ‘active change by which a person who has become disabled acquires the knowledge and skills needed for optimum physical, psychological and social function’.[66] The availability of predictions may help patients and their proxies to be informed about what to expect in the future and plan accordingly. Furthermore, predicting spontaneous recovery after stroke may be crucial to evaluate the effect of intervention studies. Using this information to stratify patients into control and treatment groups could decrease the overall number of patients needed to be recruited, thereby not only rendering significantly more studies feasible in terms of design and financial costs, but also yielding faster results.[67] Last, outcome models could also target the prediction of response to specific therapies, such as non-invasive brain stimulation, and thus support the identification of probable responders before the start of the therapy.[68] In the same vein, Stinear and colleagues[54] previously defined several prerequisites for rehabilitation prediction tools that may be useful in clinical practice. Accordingly, prediction tools should forecast an outcome that is meaningful for individual patients at a specific time point in the future.
Statistical background for precision medicine: inference versus prediction
Classical inference statistics, such as F- or t-tests, comprise a powerful tool kit to evaluate research hypotheses, and offer explainable results. Null hypothesis testing represents a frequently used example, which is linked to resulting P-values and ensuing statistical significance statements.[69,70] Importantly, these classical statistical instruments were invented almost a hundred years ago, in an era of rather limited data availability and hardly any computational power.[71] With regard to biomedical research, insights were previously commonly gleaned from either observational descriptions of single patients (e.g. Pierre-Paul Broca’s patient Mr Leborgne, called ‘Tan’),[72] or group comparisons. This situation, however, is changing nowadays.
The perception of statistical significance will most probably experience a redefinition in times of emerging big data scenarios. On the one hand, extensive datasets will more frequently lead to statistical significance of effects with (clinically) negligible effect sizes.[73,74] For example, Miller and colleagues conducted 14 million individual association tests between MRI-derived brain phenotypes, e.g. brain volumes or functional connectivity strength between two brain areas, and sociodemographic, neuropsychological or clinical variables in 10 000 UK Biobank participants.[75] These tests resulted in many statistically significant associations, yet these associations sometimes explained less than 1% of the variance, which calls their relevance into question.[76] On the other hand, the default use and interpretation of P-values has been challenged frequently in recent years. 
This process was triggered by increasing reports on low reproducibility of research findings.[77] When trying to reproduce the findings of 100 psychological research studies, replication studies produced significant results in only 36% of cases, while the original studies had reported significant results in 97% of cases.[78] In response to these findings, Benjamin and colleagues suggested a lower level of significance, i.e. P < 0.005, for the discovery of new effects to increase the robustness of findings.[79] Amrhein and colleagues went a step further and recommended relaxing the over-reliance on P-values by completely abandoning dichotomous decisions.[80] These suggestions have prompted lively discussions: while they have generally been widely supported (the call by Amrhein was accompanied by >800 signatures of international researchers), other statisticians have been more cautious, for example stressing the positive effect of statistical significance as gatekeeper.[81]
It is also important to realize that statistically significant group differences, as indicated by low P-values, do not generally imply good single-subject level prediction performances, as measured by out-of-sample generalization (Fig. 1). The latter, however, is the idea of precision medicine.[37,82-84] In contrast to the previous focus on inference and explanation, recent years have seen an upsurge of AI and, more specifically, machine-learning techniques that predominantly target prediction performance of single-subject outcomes. Examples of these machine-learning models include 
regularized regression, (deep) neural networks, nearest-neighbour algorithms,[85] random forests[86] or kernel support vector machines (SVMs) (Box 1).[87] Given multiple input variables, such as age, sex, initial stroke severity and comorbidities, these models are trained to predict some specific individual outcome, such as a motor score 3 months after stroke, based on a weighted combination of these input variables, with the highest achievable prediction performance. This performance can be quantified by various established measures, such as explained variance, accuracy, sensitivity, specificity and area under the curve. As they are evaluated by their generalization capability to previously unseen, i.e. new data samples, they are well suited to ensure accurate predictions of individual future outcomes. At the same time, these models may not typically and reliably be able to explain their predictions any further and allow for inferences on particular biological mechanisms. This characteristic has prompted the denotation black-box model.[88]
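As an illustration of three of the performance measures named above, the following minimal sketch computes accuracy, sensitivity and specificity from binary predictions. The labels and predictions are hypothetical and stand in for a held-out test set (1 = favourable outcome); the function names are our own, and the sketch assumes that both classes occur at least once among the true labels.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical held-out outcomes and model predictions for ten patients
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Whether such numbers reflect generalization depends entirely on how the predictions were obtained: the same functions yield in-sample estimates if the model has already seen the evaluated patients during training.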
Figure 1
Three scenarios to compare group difference and classification analyses. Data is simulated, differences between groups 1 and 2 are determined via two-sample t-tests, classification via linear methods into groups 1 and 2 is achieved via thresholding (indicated by red dotted lines). (A) A significant group difference is found despite a poor classification performance. (B) Groups do not differ significantly, but classification accuracy is very high. (C) A significant group difference goes along with high classification accuracy. Overall, these three scenarios illustrate that neither significant group differences automatically lead to high classification accuracies, nor high classification accuracies to significant group differences. Adapted from Arbabshirani et al.,[37] with permission.
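Scenario A can be reproduced with a small simulation. The sketch below is our own illustration and not the code underlying the figure: the group means, standard deviation, sample size and random seed are assumptions, and the P-value is obtained from a normal approximation, which should be adequate for samples of this size. The threshold is placed midway between the two sample means, i.e. it is determined in-sample.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (mean of b minus mean of a)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

def threshold_accuracy(a, b, threshold):
    """Accuracy of the rule: assign to group 2 if value > threshold, else group 1."""
    correct = sum(1 for x in a if x <= threshold) + sum(1 for x in b if x > threshold)
    return correct / (len(a) + len(b))

rng = random.Random(0)   # fixed seed for reproducibility
n = 5000                 # assumed number of subjects per group

# Assumed small true difference: 0.2 standard deviations
group1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
group2 = [rng.gauss(0.2, 1.0) for _ in range(n)]

t_stat = welch_t(group1, group2)
# Two-sided P-value via normal approximation of the t distribution
p_approx = math.erfc(abs(t_stat) / math.sqrt(2))

threshold = (statistics.mean(group1) + statistics.mean(group2)) / 2
accuracy = threshold_accuracy(group1, group2, threshold)

print(f"t = {t_stat:.2f}, approximate P = {p_approx:.2g}, accuracy = {accuracy:.3f}")
```

Under these assumptions, the group difference should be highly significant, while the classification accuracy is expected to remain only slightly above the chance level of 0.5.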
Stroke outcome studies
Stroke prognostic scales based on clinical data only
The initial level of impairment induced by the stroke lesion is a well-known explanatory variable of the neurological outcome several months later.[89-91] A number of studies aimed to explain recovery patterns by linking motor impairments at initial and follow-up time points by means of linear regression models (63–211 patients).[92-95] In these scenarios, the change between initial and follow-up motor impairment (i.e. a continuous change = follow-up − initial scores) represented the output, i.e. the dependent variable Y. This output was modelled based on the recovery potential, i.e. the maximum score minus the initial motor score, as input, i.e. the independent variable X. Motor impairment was most frequently captured as the Fugl–Meyer score of the upper limb.[96] The typically obtained performance measure here was the explained variance in the form of the in-sample R² value. This R² value, also called the coefficient of determination, indicates how much of the variance in the dependent variable can be explained by one or more independent variables.
These modelling endeavours resulted in the proportional recovery rule.[97] Stroke patients with mild to moderate motor impairments usually regain a fixed proportion (approximately 70%) of their lost motor function within the first months after their stroke. In one of the largest stroke recovery studies, considering 211 patients, initial motor impairment apparently explained up to 94% of the variance in motor recovery based on the proportional recovery rule.[93] However, recent re-evaluations of the statistics underlying the proportional recovery rule suggest that previous estimates of explained variance were inflated. 
This inflation occurred due to statistical confounds, such as measurement noise, ceiling effects and a phenomenon called mathematical coupling.[98-101] Mathematical coupling here describes a situation where the input and output variables are not independent by construction, which is the case when recovery is defined as the difference between an initial score and the follow-up score, and this change score is then correlated with the very same initial score that was used to compute the change score. Thus, the assumption of no relationship between input and output is void. Simulations have shown that significant relationships between initial and change scores can occur when, in fact, there is no significant link between initial and follow-up scores.[99,100] We recently introduced a Bayesian hierarchical modelling regimen to combine patient data from six recovery studies (n = 385) and demonstrated that reducing analyses to the subset of only severely to moderately affected patients could successfully mitigate the effects of ceiling and mathematical coupling.[102] Notably, after addressing confounds, the initial impairment was shown to explain only a small amount of the variance in recovery, reaching a maximum of 32% explained variance.[102] Therefore, proportional recovery may occur, but to a considerably smaller degree than originally claimed.
Importantly, these recovery studies highlight the distinction between inference and prediction (see the ‘Statistical background for precision medicine: inference versus prediction’ section). In the studies mentioned before, the relationship between initial impairment and recovery was primarily investigated in-sample. In-sample here means that the performance of linear regression models was estimated relying on exactly the same data that were used for model training. Therefore, models had already seen all parts of the data that they were subsequently tested on and could optimally adapt to them. 
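The mathematical coupling confound described above can be made concrete with a small simulation. The sketch below is our own illustration, not the simulation code of the cited studies: initial and follow-up scores are drawn independently from normal distributions with assumed means and standard deviations, and scores are deliberately left unbounded so that coupling is isolated from ceiling effects. The only value taken from the measurement scale itself is the maximum upper-limb Fugl–Meyer score of 66.

```python
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

MAX_SCORE = 66            # maximum upper-limb Fugl-Meyer score
rng = random.Random(1)    # fixed seed for reproducibility
n_patients = 2000         # assumed sample size

# Assumption: follow-up scores are generated independently of initial scores
initial = [rng.gauss(30, 10) for _ in range(n_patients)]
follow_up = [rng.gauss(45, 10) for _ in range(n_patients)]

# Input X: recovery potential; output Y: change score
potential = [MAX_SCORE - i for i in initial]
change = [f - i for i, f in zip(initial, follow_up)]

r_initial_followup = pearson_r(initial, follow_up)
r_potential_change = pearson_r(potential, change)

print(f"r(initial, follow-up)  = {r_initial_followup:.2f}")
print(f"r(potential, change)   = {r_potential_change:.2f}")
print(f"R^2(potential, change) = {r_potential_change ** 2:.2f}")
```

With equal variances of initial and follow-up scores, the expected correlation between recovery potential and change is 1/√2 ≈ 0.71, i.e. roughly half of the variance appears ‘explained’ although the initial score carries no information about the follow-up score.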
This strategy is particularly helpful if the main study aim is to identify significant explanatory variables of the outcome and to obtain interpretable models.[82,84] If several studies then independently point to the same association, this association might be considered more stable and reliable, since it was validated. The estimates of prediction performance have, however, not been validated by these means. When training and test data do not differ, algorithms are also prone to overfitting, i.e. they might capture the characteristics of the data sample at hand very well by explaining a high percentage of the variance observed for this particular sample, but at the same time perform relatively poorly when tested on an independent dataset. Conversely, generalization capability, which conceptually underlies precision medicine, is commonly tested by measuring a model’s prediction performance for unseen, novel data-points, i.e. those that have not been used in the training phase. The developed model is therefore validated out-of-sample.[83,103] It is thus generally crucial to know whether prediction performance estimates were obtained in-sample or out-of-sample (Supplementary Tables 1–3).

Given that the studies highlighted in the previous section mainly used an in-sample approach, they could well infer the significance of input variables (i.e. the initial motor score) and interpret their coefficients (i.e. 70%) at the group-level. In contrast, the aim of the studies presented in the following section is the accurate training of prognostic models that can predict a categorical functional outcome at the level of an individual patient[104] (for recent reviews see Fahey et al.[55] and Drozdowska et al.[56] and ‘Statistical background for precision medicine: inference versus prediction’ and ‘General advantages and promises’ sections for details on the distinction between inference on group-level versus prediction on individual-patient-level).
Most of these prognostic prediction endeavours feature similar methodological steps. First, the outcome is not represented by a change score, as above, but by a binary, categorical (0–1) follow-up outcome, for example, favourable versus unfavourable functional outcome [e.g. modified Rankin Scale (mRS) of ≤2 versus >2, no-mild versus moderate-severe disability]. Most often, the model of choice is a logistic regression model[105] that considers sociodemographic and clinical information as input variables. Training and testing, or in other words developing and validating, are commonly performed in separate datasets. Importantly, it is this separate training and testing approach that enables conclusions on the generalization performance of a model to unseen data of individual patients. Prediction performance itself is frequently quantified as area under the receiver operating characteristic curve (AUROC), or in short area under the curve (AUC), which considers the true positive rate (i.e. sensitivity) as well as the false positive rate (i.e. 1 − specificity) across various thresholds. While a value of 0.5 represents the level of chance, an AUC of 1 signals the best possible performance.[83]

Using data from 10 777 patients included in the clinical trials archive VISTA as an additional validation (test) dataset, Quinn and colleagues[106] compared the predictive capacities of eight well recognized prognostic models to predict favourable outcome post-stroke (90-day mRS ≤ 2).[107-112] The model abbreviated to ASTRAL (Acute Stroke Registry and Analysis of Lausanne)[107] provided the highest prediction fidelity of all models and achieved an AUROC of 0.78. As each model had originally been trained in a different dataset, relying on anywhere between 1645 and 12 262 patients from different countries and continents, each model included a marginally varying number and collection of input variables.
However, most considered age and stroke severity, as well as pre-stroke comorbidities as input (Table 1). More recently, two studies explored the capability of deep learning algorithms to enhance the prediction of functional outcomes based on clinical information. Heo and colleagues[113] compared the performances of deep neural networks, random forest classification and logistic regression to the established ASTRAL score to predict favourable outcomes (90-day mRS ≤ 2) in 2604 patients. Their deep learning model based on 38 clinical variables, such as demographics, stroke severity and stroke subtype, was the only one to significantly outperform the ASTRAL score. Li and colleagues,[114] on the other hand, used deep neural networks, an SVM, random forest classification, a gradient boosting algorithm and logistic regression to predict unfavourable outcomes (mRS > 2) 6 months post-stroke in 1735 patients using information on clinical, demographic and laboratory characteristics. None of their prediction models clearly outperformed the others. Of note, the studies by both Heo and colleagues and Li and colleagues used a held-out test set; thus, their estimates can be regarded as out-of-sample.
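To make the AUC measure used in these studies more tangible, the following minimal sketch computes it as the probability that a randomly chosen patient with a favourable outcome receives a higher predicted score than a randomly chosen patient with an unfavourable outcome. The function name `auroc` and the small test set are hypothetical and serve illustration only; they are not taken from any of the cited studies.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out test set: 1 = favourable outcome (mRS <= 2).
y_test = [1, 1, 1, 0, 1, 0, 0, 0]
# Hypothetical predicted probabilities of a favourable outcome.
p_test = [0.9, 0.8, 0.4, 0.6, 0.7, 0.3, 0.2, 0.5]

auc = auroc(y_test, p_test)
print(f"AUC on the hypothetical test set: {auc:.3f}")
```

Because the labels and scores passed to the function belong to patients that were not used for model fitting, the resulting value would count as an out-of-sample estimate in the sense described above.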
Table 1
Integer-based prognostic ASTRAL score for the calculation of probability of unfavourable outcome in patients with acute ischaemic stroke (1645 patients in total)

| Covariates | Score |
| --- | --- |
| Age: for every 5 years | 1 |
| Severity: for every NIHSS point | 1 |
| Time delay from onset to admission <3 h | 2 |
| Range of visual field defect | 2 |
| Acute glucose >7.3 or <3.7 mmol/l | 1 |
| Level of consciousness decreased | 3 |

Higher scores indicate less favourable outcomes.[108]
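As an illustration of how such an integer-based score is tallied, the following sketch sums the points listed in Table 1. The function name `astral_points`, the use of whole 5-year steps for age and the example patient are assumptions made for illustration. The conversion from points to an outcome probability is not given in the table and is therefore not attempted here.

```python
def astral_points(
    age_years: int,
    nihss: int,
    time_delay_criterion_met: bool,
    visual_field_defect: bool,
    abnormal_glucose: bool,
    decreased_consciousness: bool,
) -> int:
    """Sum the integer points of Table 1 for one patient.

    Each boolean states whether the corresponding criterion, as listed in
    Table 1, is fulfilled. Counting only completed 5-year steps of age is
    an assumption of this sketch.
    """
    points = age_years // 5                        # 1 point per 5 years
    points += nihss                                # 1 point per NIHSS point
    points += 2 if time_delay_criterion_met else 0
    points += 2 if visual_field_defect else 0
    points += 1 if abnormal_glucose else 0
    points += 3 if decreased_consciousness else 0
    return points


# Hypothetical patient: 70 years, NIHSS 8, time-delay criterion met,
# no visual field defect, abnormal glucose, normal level of consciousness.
example_points = astral_points(70, 8, True, False, True, False)
print(example_points)
```

Higher totals correspond to less favourable predicted outcomes, in line with the table note.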
Further studies evaluated modified scenarios as they focused on stroke patients admitted to rehabilitation institutes and specifically strived for modelling outcomes after rehabilitation.[115-117] Brown and colleagues reported that the motor subscore of the Functional Independence Measure (FIM),[118] age and walking distance at admission explained most variance in the FIM-based recovery (i.e. change), length of stay and discharge destination (148 367 patients).[115] Since the authors derived their results only in-sample, the generalization performance to out-of-sample, i.e. new, patients remains to be elucidated. Scrutinio and colleagues[116] developed a prediction model of the motor subscore of the FIM after rehabilitation and considered multiple available variables as predictors during model training. They eventually chose five of them based on forward stepwise logistic regression: age, time from stroke occurrence to rehabilitation admission and unilateral neglect were predictive of higher motor impairment at discharge, while lower admission motor and cognitive impairment predicted lower motor impairment at follow-up. After model development, they tested their algorithm’s capacity to generalize to new patients and obtained a validation sample prediction performance of AUC = 0.866.

In general, the objectives of these prediction model endeavours were to provide additional information to augment a doctor’s judgement on the risk of favourable or unfavourable outcome and to assist in (fast) clinical decision making. Most of these studies translated the original logistic model to an integer-based score or offered an online calculator for a more intuitive and faster outcome calculation (Table 1 and e.g. https://goo.gl/fEAp81 for Scrutinio et al.’s prediction models). Indeed, some of these automated predictions were shown to outperform the intuitions of medical doctors in several datasets.[119-121] However, none of these scores has yet been implemented into clinical routine and several challenges remain to be addressed (see the ‘Disadvantage and pitfalls’ section).
Neurophysiology and combination of biomarkers in individual data
The studies in the following section make use of predictors that are closer to the neurobiology of the brain, i.e. data obtained by neuroimaging or neurophysiological recordings. Such surrogate-based predictions might yield higher prediction accuracies than those based on clinical or behavioural information, as the former may better capture interindividual differences of lesion-induced disturbances in neuronal function as well as the mechanisms driving functional recovery. While some of the authors used stepwise logistic regression to identify critical parameters for recovery or future motor performance, others considered a broader range of model types that go beyond linear methods, for example, decision-tree-based algorithms (Box 1). Among others, these model types have the ability to exploit non-linear relationships and interactions of input variables automatically.[122] Conversely, when working in a linear regression framework as before, interactions have to be inserted manually and thus intentionally, typically based on the researcher’s previous knowledge. Altogether, more flexible models like tree-based algorithms as well as further non-parametric models, such as nearest-neighbour algorithms and those applying kernels, may be better able to ‘let data speak for themselves’.[123] At best, they can uncover complex, predictive patterns in data automatically. However, these models require a lot of data to do so successfully. When data are scarce, their flexibility puts them at risk of overfitting, i.e. of adapting too closely to the data at hand, and hence of poor generalization to new data.

As first examples of broader biomarker consideration: Koh and colleagues[124] used stepwise linear regression to evaluate 19 variables comprising information on clinical and also imaging parameters to build a prediction model of motor recovery, i.e.
the change between admission and follow-up upper limb movement capacity, in a sample of 140 severely affected stroke patients. Four variables were ultimately identified to hold the most explanatory information: ‘baseline upper extremity score’ (positive association with impairment) and ‘baseline NIHSS score’ (negative), as well as the imaging-related variables ‘haemorrhagic stroke’ and ‘cortical lesion excluding primary motor cortex’ (both positive). Model performance, however, was capped at 35% of total variance explained in-sample. This low value signalled a generally limited explanatory capacity of the considered initial input variables. Nonetheless, more complex and less predictable recovery patterns after severe stroke are a frequently described finding,[125] which renders the results of Koh and colleagues[124] less surprising. In another study comprising data of 160 acute stroke patients, mRS-based functional outcome 3 months post-stroke was significantly associated with the clinical variables ‘left-sided lesions’, ‘stroke severity at admission’ (both negative association with favourable outcome) and the ‘presence of motor-evoked potentials (MEPs) on TMS of the ipsilesional motor cortex’ (positive association).[126]

Going yet a step further, several studies employed the combination of clinical, imaging and neurophysiological markers to optimize outcome predictions. The Predict Recovery Potential (PREP) algorithm runs through a sequence involving all of these measures to stratify patients into four recovery groups based on their follow-up Action Research Arm Test (ARAT) score.[127,128] It first divides patients into two groups based on a commonly conducted clinical test of upper limb function 72 h after stroke (SAFE = sum of the shoulder abduction and finger extension grades based on the Medical Research Council muscle scale). Subsequently, it considers information on transcranial magnetic stimulation (TMS)-obtained MEPs in upper limb muscles.
MEP-positivity is here thought to indicate functional integrity of corticospinal pathways. Last, the PREP algorithm incorporates an MRI parameter that represents the structural integrity of the cortico-spinal tract fibres in the posterior limb of the internal capsule (fractional anisotropy asymmetry index). The PREP algorithm was designed in a decision-tree-like fashion relying on expert knowledge, i.e. the sequence and nature of tests was manually chosen and not automatically computed from data to reflect a setting that can be readily implemented into the clinical routine.[127] In general, decision-tree-based algorithms create classifications by finding sequences of splitting rules that segment the space of input variables into simple regions and, as such, are very transparent and interpretable (Box 1). The PREP decision-tree-algorithm was subsequently validated in a dataset of 40 stroke patients.[128] The outcome ‘full recovery’ in particular could be predicted with high positive predictive power (88%), negative predictive power (83%), specificity (88%) and sensitivity (73%). Only slightly lower prediction accuracies of correct outcome classifications (80%) were found when testing the PREP on an independent dataset of 157 patients, underlining its reliable generalization performance.[129] Furthermore, the PREP algorithm was successively refined to the PREP2 algorithm by means of a more, yet still only partly, automated classification and regression tree (CART) approach in 207 patients, thus in-sample (Fig. 2).[130] In contrast to the original PREP algorithm, the authors defined a lower SAFE score cut-off at the first decision point, i.e. it was now reduced to 5 instead of 8 points, with 5 points indicating a higher motor impairment. As a result, MEPs needed to be assessed only in patients with low SAFE scores, without any loss in prediction performance (sensitivity of 75% in comparison to 73% before).
Furthermore, the PREP2 algorithm did not rely on MRI data anymore, thereby facilitating its clinical implementation. A study comparing the lengths of rehabilitation stays suggested a real-world relevance of the PREP predictions: patients in an intervention group, as well as their therapists, were informed of their PREP outcome predictions at the beginning of the rehabilitation stay.[129] Patients in this group could then be discharged about a week sooner than the patients in the control group, for whom the additional PREP estimate was not available. Of note, this finding was controlled for upper limb impairment, age, sex and comorbidities (implementation group: n = 110, 11 days, control group: n = 82, 17 days). Furthermore, there were no adverse effects on later functional outcomes. The authors explained the shorter rehabilitation period by an increase in the therapists’ confidence and modification of therapy content in view of the outcome prediction. Importantly, classification accuracy has been shown to decrease when initial performance is measured 2 weeks post-stroke instead of within the first 72 h (91 patients).[131]
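The validation measures quoted for the PREP algorithm (positive and negative predictive power, specificity, sensitivity and accuracy) all derive from a two-by-two table of predicted versus observed outcomes. The following sketch shows the arithmetic; the function name `diagnostic_metrics` and the counts are hypothetical and do not reproduce the values reported in the cited validation studies.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard measures derived from a 2x2 confusion table.

    tp/fn: patients with the outcome who were predicted correctly/incorrectly,
    tn/fp: patients without the outcome predicted correctly/incorrectly.
    """
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts for the prediction of one outcome category.
metrics = diagnostic_metrics(tp=8, fp=1, fn=2, tn=9)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that the predictive values, unlike sensitivity and specificity, depend on how frequent the outcome is in the sample at hand.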
Figure 2
Prediction of ARAT score-based upper limb recovery potential via the PREP2 algorithm. The PREP2 algorithm combines several assessments in a decision-tree-like fashion considering the SAFE score, age, NIHSS and MEPs. The first decision step is based on the SAFE score, which captures the ability of shoulder abduction and finger extension, using the Medical Research Council grades (0: no palpable muscle activity, to 5: normal power) within the first 3 days after stroke onset. In the case of a SAFE score of 5 or above, the next decision is based on the patient’s age. If younger than 80 years, outcome is predicted to be excellent. If older than 80 years, it is once again the SAFE score that differentiates between outcomes: the algorithm predicts excellent outcome in case of a score of 8 or higher and good outcome if lower than 8. If, however, the patient achieves a SAFE score below 5, the next decision step considers the presence or absence of MEPs on transcranial magnetic stimulation (TMS) of the ipsilesional motor cortex on Days 3–7 after symptom onset. If MEPs are present, the patient is assigned to the second-best outcome group, i.e. a good outcome. Absent MEPs, in contrast, prompt the consideration of the NIHSS on Day 3: a score below 7 leads to the prediction of limited outcomes, while an NIHSS score of 7 and above results in the prediction of the lowest, i.e. poor outcome. Adapted from Stinear and colleagues,[130] with permission.
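The decision sequence described in the caption of Figure 2 can be written down as a few nested conditions. The following is a sketch based solely on that caption, not the authors' implementation; the function name `prep2_prediction` and its parameter names are hypothetical, and treating an age of exactly 80 years as 'older' is an assumption, since the caption leaves this boundary open.

```python
from typing import Optional


def prep2_prediction(
    safe_score: int,
    age_years: float,
    mep_present: Optional[bool] = None,
    nihss_day3: Optional[int] = None,
) -> str:
    """Return the predicted upper limb outcome category.

    Follows the caption of Figure 2. MEP status and the Day-3 NIHSS are only
    needed along the low-SAFE branch and may otherwise be left unspecified.
    """
    if safe_score >= 5:
        if age_years < 80:
            return "excellent"
        # Assumption: patients aged exactly 80 follow the 'older' branch.
        return "excellent" if safe_score >= 8 else "good"

    # SAFE score below 5: MEP status decides next.
    if mep_present is None:
        raise ValueError("MEP status is required when the SAFE score is below 5")
    if mep_present:
        return "good"

    # MEPs absent: the Day-3 NIHSS separates limited from poor outcome.
    if nihss_day3 is None:
        raise ValueError("Day-3 NIHSS is required when MEPs are absent")
    return "limited" if nihss_day3 < 7 else "poor"


print(prep2_prediction(safe_score=6, age_years=72))
print(prep2_prediction(safe_score=3, age_years=65, mep_present=False, nihss_day3=9))
```

Writing the tree in this form makes visible that TMS is only required for patients with a SAFE score below 5.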
Structural imaging
Structural neuroimaging is well implemented in routine diagnostic pathways for treating stroke patients. Therefore, using this information for outcome prediction seems particularly feasible in a clinical setting, and indeed, several studies have already provided clear links between motor outcome and structural markers, such as imaging parameters reflecting pre-stroke brain health or lesion location, for example with respect to fibre tracts.[132-139]

Several studies pursued a hypothesis-driven approach focusing on particular anatomical structures such as the corticospinal tract (CST), the most important motor output tract of the brain. For example, the amount of damage to the CST, as estimated by the spatial overlap between stroke lesion and tract volume, is strongly associated with the level of motor impairment in the chronic phase post-stroke (50 participants, in-sample R2 = 0.71).[133] Furthermore, Feng and colleagues[135] presented evidence that CST lesion load and the initial motor score performed on par when explaining the final motor outcome 3 months after stroke. To demonstrate the generalizability of findings, analyses were conducted in two separate datasets, considering 37 patients for a training cohort and a further 39 for a validation cohort. In the validation cohort, CST-lesion load and the initial motor score explained 69% and 62% of the (out-of-sample) variance of the final Fugl–Meyer score, respectively.
Interestingly, CST-lesion load was also significantly associated with realized recovery, albeit to a limited degree, explaining ∼20% of the in-sample variance (48 patients, with realized recovery = [follow-up − initial motor score] / [maximum score − initial score]).[140] Likewise, studies based on diffusion tensor imaging (DTI) suggest significant associations between measures of CST integrity and long-term functional outcomes as reflected by the mRS 3–12 months after stroke.[141-144]

In contrast to the low-dimensional data underlying CST-lesion-based predictions of stroke outcomes, multivariate approaches have the capacity to capture high-dimensional whole-brain lesion patterns. Thereby, they can consider specific spatial distributions in high granularity, e.g. a particular combination of lesioned voxels. For example, Forkert and colleagues[145] leveraged a multivariate SVM to predict favourable versus unfavourable functional follow-up outcome in 68 patients (favourable: 30-day mRS ≤ 2). The authors found that a favourable outcome could be predicted with a cross-validated accuracy of 85% when considering detailed information on lesion location as derived from MRI-FLAIR images. Two recent studies demonstrated the feasibility of using convolutional neural networks (CNNs) for a similar prediction task, i.e. the prediction of favourable outcomes (90-day mRS ≤ 2). More specifically, Bacchi and colleagues[146] trained CNNs in combination with deep neural networks on information originating from non-contrast CT and clinical data (e.g. age, sex, stroke severity and comorbidities) to generate predictions for 204 patients, and achieved a test set accuracy of 74%. Nishi and colleagues[147] relied on diffusion-weighted MRI (DWI) of 324 patients to predict favourable outcomes, and demonstrated superior test set performance of their deep learning model compared to simpler baseline models (deep learning model: AUC = 0.81 versus ASPECTS: AUC = 0.63).
Several further studies considered imaging data from a database of 132 first-time ischaemic and haemorrhagic stroke patients.[148,149] Two weeks after stroke, DWI-derived lesion volume itself only explained a small amount of variance (cross-validated R2 < 20%) in any of the evaluated functional domains (motor, language, attention, memory, vision). However, explained variances increased when information on lesion location was added (between 25% and 54% for motor deficits, language and attention/visual field biases). Only for verbal and spatial memory did the explained variance remain <20%, which probably reflects their less localized representation in the brain compared to the other functional domains.[148] These analyses relied on a pipeline comprising ridge regression and leave-one-out cross-validation. To reduce the high dimensionality of voxel-wise lesion location information, lesion maps were also embedded in a lower dimensional space via principal component analysis (PCA) before regression analyses.

The aim of a subsequent study relying on the same dataset was to predict the domain-specific 3-month outcomes (in contrast to 2-week outcomes)[150]: lesion size, age, educational attainment, hours of therapy and domain-specific scores obtained in the subacute post-stroke phase could explain in-sample variances between 42% for attention at the lower and 70% for language impairments at the higher end in a linear regression framework. When PCA-transformed lesion location information was added to the models, prediction performance significantly increased for models explaining language, motor and attention impairments (4.0–13.0% increase in explained variance). However, explained variance remained unchanged in case of the verbal and spatial memory domains, suggesting once again that there is no one-size-fits-all solution, and some deficits may not be straightforwardly explained by lesion location alone.
Furthermore, it is important to note that the inclusion of hours of therapy as an input variable in a prediction algorithm of stroke outcomes may be problematic, given that this value is not necessarily known in advance and hence cannot easily be entered into a prediction algorithm at the beginning of rehabilitation.[54] Last, yet another study indicated that more sophisticated neuroimaging parameters may outperform simple lesion location ones. Accordingly, DTI-derived axial diffusion maps were shown to yield higher prediction accuracies of 3-month functional outcomes compared to simple lesion segmentation maps in a sample of 87 patients (median cross-validated accuracy: 82.8 and 76.7%, respectively).[151]

Taken together, the studies reviewed above demonstrate that subacute and chronic post-stroke impairments in several functional domains can be better explained if information on lesion location is included. Nonetheless, the variance that could be explained varied widely for different outcomes, for example from 4% to 54% in the work by Corbetta and colleagues.[148] Thus, these results raise the question whether sample sizes larger than the ones presented here may facilitate deriving more informative low-dimensional lesion representations. Independent of sample size, it may also be necessary to increase the spatial resolution, as even 1-mm isotropic voxel scans may still not capture the interindividual variability that is seen in microscopic analyses of histological brain sections, especially with respect to fibre tract anatomy.[152]
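Several of the explained-variance values above are reported as cross-validated rather than in-sample R2. The following sketch contrasts the two for a simple linear regression with one input variable, using leave-one-out cross-validation. All function names and the data (a hypothetical lesion-load measure and a hypothetical motor score) are invented for illustration; the cited studies used higher-dimensional inputs and ridge regression.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for one input; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual / total sum of squares."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def in_sample_r2(x, y):
    """Fit and evaluate on the very same patients."""
    slope, intercept = fit_line(x, y)
    return r_squared(y, [intercept + slope * xi for xi in x])


def leave_one_out_r2(x, y):
    """Predict each patient from a model fitted on all other patients."""
    predictions = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        predictions.append(intercept + slope * x[i])
    return r_squared(y, predictions)


# Hypothetical data: lesion load (fraction of tract affected) and motor score.
lesion_load = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.90]
motor_score = [60.0, 58.0, 50.0, 52.0, 40.0, 35.0, 30.0, 18.0]

r2_in = in_sample_r2(lesion_load, motor_score)
r2_cv = leave_one_out_r2(lesion_load, motor_score)
print(f"in-sample R2: {r2_in:.3f}, leave-one-out R2: {r2_cv:.3f}")
```

For ordinary least squares, the leave-one-out value cannot exceed the in-sample value, which is one reason why the two kinds of estimates should not be compared directly across studies.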
Functional imaging
In addition to structural scanning, functional MRI has become a valuable method to infer post-stroke alterations of neuronal activity and also to enable individual predictions.[153-156] This technique allows conclusions on neural activity to be drawn non-invasively on the basis of changes of blood flow and oxygen content.[157] Functional imaging can come in two forms: task-based and resting-state functional MRI. While participants are asked to perform a specific task in the first scenario, they are required to lie motionless but awake in the scanner in the second scenario. Analyses are then either centred on activity changes in certain brain areas or functional connectivity strengths between brain areas, respectively.[158]

Sample sizes are usually considerably smaller in functional imaging studies than in other stroke outcome prediction scenarios, due to methodological challenges (longer acquisition times, low signal-to-noise ratio, signal susceptibility to head movement, MRI contraindications) and substantially higher costs. Interestingly, as functional MRI datasets can be considered high-dimensional data containing thousands of voxels, AI approaches have been frequently used to detect certain patterns of activity or connectivity that allow prediction of the functional outcomes of a single patient. Importantly, here we focus only on those functional neuroimaging studies that used rigorous cross-validation schemes to generate single-subject predictions.

Two studies have made use of functional MRI data acquired in the first days after stroke to make predictions on clinical motor impairment at the time of scanning and follow-up motor impairment 4 to 6 months post-stroke (40 and 21 stroke patients).[159,160] These studies were conducted in prediction-focused frameworks similar to the structural stroke studies described previously, e.g. by applying SVMs combined with nested leave-one-out cross-validation.
In a first study, resting-state functional MRI data were used to calculate whole-brain connectivity to a ‘seed region’, i.e. reference region, in the ipsilesional, yet structurally intact primary motor cortex. Subsequently, this connectivity information was used to discriminate between stroke patients with and without acute hand motor deficits as well as healthy controls.[159] Prediction models were successively refined to tell apart stroke patients with favourable versus unfavourable motor outcome several months after stroke.[160] Notably, prediction here relied on task-based functional MRI, instead of resting-state functional MRI data. Motor deficits were measured as Motricity Index of the hand[161] in the first and grip force and ARAT score in the second study.[162] Both studies reported cross-validated prediction accuracies of >80% (82.6% motor-stroke versus non-stroke, 87.6% motor-stroke versus non-motor-stroke in the first study[159] and 86% favourable versus unfavourable motor outcome in the second study[160]). In case of the discrimination of motor-stroke versus non-stroke patients, classification performance particularly relied on interhemispheric connectivity between the primary motor cortices (M1–M1) as well as on connectivity between ipsilesional M1 and premotor areas (Fig. 3). As the resting-state data investigated in the first study were collected during routine scanning sessions, the authors underline the clinical practicability of their approach, particularly for acute and severely affected patients.[159]

Another milestone study, once again relying on the 132 stroke patients introduced in the previous section on structural scans,[148,150] compared predictive capacities of dimensionality-reduced structural lesion topography and functional connectivity via ridge regression (Fig. 4): functional connectivity allowed for more accurate cross-validated predictions in neurocognitive domains (functional connectivity: visual and verbal memory: R2 = 0.36 and R2 = 0.42, respectively).
Nonetheless, lesion topography outperformed functional connectivity in case of predictions in sensorimotor domains (structural lesion information: vision and motor impairments: R2 = 0.50 and R2 = 0.45, respectively).[149] Both imaging and behavioural data were obtained on average 2 weeks post-stroke. Altogether, these rather moderate levels of explained variance also suggest that a substantial fraction of variability in outcome may originate from factors that are not yet captured and considered in current studies. The studies reviewed in this section made use of SVMs as well as ridge regression to compute predictions on behavioural outcome after stroke. These two approaches are influenced by so-called hyperparameters determining the amount of model regularization. In the case of ridge linear regression, a regularized version of linear regression, as e.g. applied in the study by Siegel and colleagues,[149] the hyperparameter lambda determines the amount of shrinkage of the regression coefficients.[25] Likewise, the parameter C defines the amount of regularization of the SVM applied in Rehme and colleagues.[159,160] However, the optimization of these hyperparameters requires some extra care to avoid overfitting. One way to achieve a safe optimization can, for example, be a nested cross-validation framework, i.e. the combination of inner and outer cross-validation loops (Fig. 5). When the computational burden is high, as in case of deep learning approaches, nested cross-validation might not be feasible and, alternatively, the entire dataset can be split in three parts: training, test and validation sets. The optimal hyperparameters can then be obtained by relying on training and test sets, while less biased performance estimates can be attained in the validation set. However, such an approach may require relatively large datasets.
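The nested cross-validation scheme described above can be sketched in a few functions. The example below uses ridge regression with a single input variable, for which the shrunken slope has a simple closed form, so that no external library is needed. This is a deliberate simplification of the high-dimensional setting of the cited studies; all function names, the candidate values for lambda, the fold layout and the data are hypothetical.

```python
from statistics import mean


def ridge_fit(x, y, lam):
    """One-input ridge regression: slope shrunk by lam, unpenalized intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / (sxx + lam)
    return slope, my - slope * mx


def mse(x, y, model):
    """Mean squared prediction error of a fitted (slope, intercept) model."""
    slope, intercept = model
    return mean([(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)])


def folds(n, k):
    """Assign indices 0..n-1 to k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def split(x, y, held_out):
    """Separate the held-out indices from the remaining training data."""
    held = set(held_out)
    x_tr = [v for i, v in enumerate(x) if i not in held]
    y_tr = [v for i, v in enumerate(y) if i not in held]
    return x_tr, y_tr, [x[i] for i in held_out], [y[i] for i in held_out]


def select_lambda(x, y, lambdas, k_inner):
    """Inner loop: choose the lambda with the lowest mean validation error."""
    def cv_error(lam):
        errors = []
        for idx in folds(len(x), k_inner):
            x_tr, y_tr, x_va, y_va = split(x, y, idx)
            errors.append(mse(x_va, y_va, ridge_fit(x_tr, y_tr, lam)))
        return mean(errors)
    return min(lambdas, key=cv_error)


def nested_cv(x, y, lambdas, k_outer=4, k_inner=3):
    """Outer loop: estimate performance on folds never used for tuning."""
    outer_errors, chosen = [], []
    for idx in folds(len(x), k_outer):
        x_tr, y_tr, x_te, y_te = split(x, y, idx)
        lam = select_lambda(x_tr, y_tr, lambdas, k_inner)  # training part only
        chosen.append(lam)
        outer_errors.append(mse(x_te, y_te, ridge_fit(x_tr, y_tr, lam)))
    return mean(outer_errors), chosen


# Hypothetical data: one connectivity measure and a behavioural score.
feature = [0.1, 0.3, 0.2, 0.5, 0.4, 0.7, 0.6, 0.9, 0.8, 1.0, 0.15, 0.55]
score = [1.2, 2.9, 2.4, 5.3, 3.8, 7.4, 5.9, 8.8, 8.3, 9.7, 1.9, 5.0]
candidate_lambdas = [0.0, 0.1, 1.0, 10.0]

outer_error, chosen_lambdas = nested_cv(feature, score, candidate_lambdas)
print(f"outer-loop mean squared error: {outer_error:.3f}")
print(f"lambda chosen in each outer fold: {chosen_lambdas}")
```

The key design point is that `select_lambda` only ever sees the training portion of each outer fold, so the outer-loop error is not contaminated by the hyperparameter search.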
Figure 3
SVM-based prediction of motor deficits after stroke. Whole-brain functional connectivity to an ipsilesional M1 seed region was computed in a voxel-wise fashion for 20 stroke patients with motor impairments, 20 stroke patients without motor impairments and 20 non-stroke controls. (A) Stroke patients with motor impairment could be differentiated from non-stroke controls with an accuracy of 82.6%. (B) Similarly, the classification of stroke patients into those with and without motor impairments resulted in an accuracy value of 87.6%. Regions coloured in blue support the prediction of non-stroke controls or stroke patients without motor impairment; their functional connectivity is enhanced in comparison to stroke patients with motor impairments. Regions coloured in red, on the other hand, indicate a higher functional connectivity in patients with motor impairment and contributed to their classification. Adapted from Rehme and colleagues,[159] with permission.
Figure 4
Overview of the analytical pipeline to predict behavioural impairments in 100 stroke patients based on structural and functional MRI. (A) Manual lesion segmentation in case of structural lesion information and atlas-defined region-of-interest (ROI)-based estimation of functional connectivity in case of functional data. (B) Structural lesion information or functional connectivity data is entered into ridge regression models to predict behavioural outcomes in a leave-one-out cross-validation. (C) Comparison of predicted and true behavioural scores to determine model performance. (D) Visualization of model weights as estimated via ridge regression. Adapted from Siegel and colleagues (Copyright 2016, National Academy of Sciences, USA).[150]
Figure 5
Schematic illustration of nested cross-validation. Two loops of cross-validation are performed, with hyperparameter optimization being performed in the inner, or nested, loop. Adapted from Varoquaux and colleagues,[103] with permission.
General considerations
Overview of employed algorithms
Having illustrated the various data fields of motor-focused stroke outcome studies, it becomes apparent that each field may have its unique repertoire of preferably used (prediction) algorithms (for a theoretical overview on algorithms, see Box 1). ‘Classic’ motor recovery studies that consider the sole recovery potential, e.g. defined as maximum minus initial Fugl–Meyer score as input variable, primarily rely on relatively simple, unregularized linear regression to model quantitative recovery scores, i.e. the change between follow-up and initial Fugl–Meyer scores. Especially when fitted in-sample, these models have proven particularly easy to interpret. On the other hand, there seems to be a preference for logistic regression models within the field of prognostic studies of raw (and not change score) follow-up outcomes (binary categories: favourable versus unfavourable functional outcomes). These logistic regression analyses are often combined with stepwise procedures to select final input variables and construct a model as parsimonious as possible. While the stepwise feature selection step can indeed lead to memorable sets of predictors and allow the construction of simple, yet clinically practical point scores, there are some drawbacks to stepwise feature selection procedures; for example, neither forward nor backward feature selection is guaranteed to result in the overall best model, as models are constructed and tested iteratively and not all conceivable models are considered.[83] Moreover, the performance of stepwise selection models might be overestimated, i.e. too optimistic.[163]
Decision-tree-based algorithms may be a natural choice when combining information from different sources, such as behavioural, neurophysiological and neuroimaging ones, as in case of the PREP or PREP2 algorithm.[128] Decision trees perform regression and classification tasks by finding sequences of splitting rules that segment the space of input variables into simple regions. They may excel in being transparent, easily interpretable and applicable. It has to be noted, however, that more advanced tree-based algorithms, such as random forest or gradient boosting algorithms that combine multiple individual decision trees and are thus more difficult to interpret, usually outperform simple decision-tree algorithms.[164]
Last, SVMs and regularized linear regression (e.g. ridge regression) have been frequent choices to evaluate structural or functional neuroimaging data, given that they have proved to be capable of handling high-dimensional data particularly well. While ridge regression is mostly still combined with some initial (PCA-based) unsupervised dimensionality reduction preprocessing step, SVMs have been shown to generate good predictions despite the combination of moderate sample sizes and thousands of voxels per patient.[165] More generally, SVMs can employ the ‘kernel trick’, i.e. map input information to high-dimensional feature spaces and by these means produce non-linear predictions that may automatically capture complex relationships between the input and output. In contrast, ridge regression, as a regularized version of linear regression, is a linear prediction model. Interestingly, deep learning has also made its entrance into several stroke outcome prediction scenarios. This uptake may have been incentivized by deep learning approaches’ promising success in further, often machine vision-focused medical scenarios. 
For example, deep learning has been shown to excel when detecting skin cancer,[32] inferring genetic mutations in cancerous tissue from routine histopathology tissue slides[166] or evaluating mammography scans.[167,168] As deep learning models typically show most favourable performances when trained on particularly large samples, often >10⁵–10⁶, future studies are warranted to investigate the usefulness of deep learning approaches for stroke outcome predictions more broadly. Particularly as data sample sizes may not grow quickly enough and may not reach the standards in other non-stroke fields, successful deep learning applications might be limited to specific tasks, such as image registration.[169,170] Last, it is important to consider that the ‘No Free Lunch’ theorem[171] guarantees that all algorithms perform similarly on average when all possible problems are taken into account. Thus, while each field currently appears to employ a unique methodological toolset, an enhanced methodological exchange between researchers of the various displayed fields, that may motivate the application of several learning algorithms at once, may be generally beneficial.
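As a small illustration of the ‘kernel trick’ mentioned in this section, the sketch below compares a kernel evaluated on the original inputs with an ordinary dot product in an explicitly constructed, higher-dimensional feature space. The choice of a homogeneous polynomial kernel of degree 2 on two-dimensional inputs is an assumption made purely for illustration (the cited SVM studies may well have used other kernels), and the function names are hypothetical.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2 on 2-D inputs: k(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_features(x):
    """Explicit 3-D feature map phi such that phi(x) . phi(y) equals (x . y)^2."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    """Plain dot product of two equally long tuples."""
    return sum(p * q for p, q in zip(a, b))

x, y = (1.0, 2.0), (3.0, -1.0)

# Kernel computed directly on the 2-D inputs ...
kernel_value = poly2_kernel(x, y)
# ... matches the dot product after mapping both inputs into the 3-D feature space.
explicit_value = dot(explicit_features(x), explicit_features(y))

print(kernel_value, explicit_value)
```

Because the kernel yields the same value without ever constructing the mapped features, a model that only needs dot products between samples can operate in such a feature space at the cost of computing in the original one.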
General advantages and promises
AI approaches in stroke research have already facilitated promising developments in outcome predictions, as well as additional insights into the (neurobiological) factors and mechanisms associated with poor versus good outcomes. Importantly, we have highlighted the delicate difference between in-sample inference and out-of-sample prediction-oriented studies. The former, inference, capitalizes on the interpretability of findings, at best describing an underlying mechanism. In-sample inference, for example, focuses on estimating the importance of individual input variables in explaining the outcome of interest across an entire group (and not individual patients). In contrast, out-of-sample computations are central to prediction studies that put an emphasis on the best generalization performance possible.[83,84,172] This approach targets optimal predictions for an individual patient not only with respect to outcome, but also concerning the response to a certain treatment.
AI-based prediction approaches hold several advantages over ‘classical’ tools used in the field of post-stroke recovery. Most studies in stroke outcome research still apply some variant of linear or logistic regression model. Although such models are often easier to interpret, they cannot automatically exploit non-linear relationships and interactions, which can lead to poorer prediction performance. These limitations can be overcome with machine-learning algorithms, such as decision-tree-based algorithms, SVMs and neural networks. Although these techniques are computationally more demanding and the interpretation of the model parameters more complex, they might augment the prediction performance by exactly the amount that is necessary to turn an interesting prediction model into a diagnostic tool. The PREP algorithm[128,129] represents a promising example. 
This decision-tree-based algorithm has been shown to generate accurate predictions that are clinically beneficial: information on outcome prediction shortened rehabilitation stays without any reduction in functional outcome.[130] Nonetheless, recent non-stroke prediction-focused studies suggest that these more complex relationships, i.e. non-linearities and interactions, may not be generally present or readily exploitable in clinical datasets with, for example, small to moderate sample sizes (n < 100).[173,174] Thus, the authors of these studies caution against unrealistic expectations that the application of machine-learning algorithms instead of simple linear models will automatically enhance prediction performance.
In addition to using linear models, most of the studies highlighted in this review still relied on specifically curated datasets and considered a circumscribed list of input variables only. However, the combination of out-of-sample testing and machine-learning algorithms may allow for the consideration of a broader range of input variables, as long as data sample sizes increase in parallel. For example, it would be conceivable to jointly consider multimodal, structural and functional imaging data[175] or metabolic, demographic and mechanistic variables[176] to enhance prediction performance. Overall, it seems likely that it will be such a combination of multiple data sources, or essentially neurobiologically based biomarkers, that will facilitate the most accurate stroke outcome prediction performance at a personalized level. 
Future studies may hence not only explore a richer methodological toolset (see the ‘Overview of employed algorithms’ section) but could also plan to systematically and explicitly investigate the combination of a variety of biomarkers.
What is more, machine-learning-based prediction performance may be boosted even further when making use of unsystematically collected, but considerably bigger samples.[84] ‘Unsystematically’ here refers to the fact that collected variables might not have been hand-picked, but acquired without any previous hypothesis and selective inclusion and exclusion criteria. Examples of these kinds of data could be registry data, electronic health records or clinical stroke scans that have been recorded independent of specifically planned research projects. The use of general, unstructured clinical data may furthermore enable a better representation of the full spectrum of stroke patients: these prediction scenarios may also include subgroups that are often neglected in stroke outcome studies, such as very young, very old, very severely affected or multimorbid stroke patients with recurrent strokes or other interfering neurological conditions.[36,50,177,178]
A further, desirable next step to enhance current prediction scenarios is the consideration of outcome measures that go beyond coarse-grained classifications, such as favourable versus unfavourable functional outcome based on binarized or ordinal scores like the mRS. 
Several studies already provided evidence that the focus on detailed scales, such as the ARAT or Fugl–Meyer assessment for motor impairments of the upper limb, is feasible and instrumental.[159,160] These more detailed motor assessments could be amended by scores evaluating impairments in further functional domains, such as the cognitive or language domains, and then integrated into multi-outcome prediction algorithms.[179] Such a multi-outcome approach might represent a more holistic and hence realistic approach, as impairments are rarely limited to just one functional domain[148] and may even interact with one another during recovery (e.g. motor recovery and cognitive dysfunction).[180] In conjunction with the selection of outcome scales, it will be important to reflect on the representation of the outcome: Do we want to predict the change between follow-up and initial scores or the final, follow-up score directly? Directly predicting the final score, while of course taking into account the initial baseline score, may be more desirable for several reasons. First, it circumvents any confounds induced by mathematical coupling that arises when a change score is predicted by an initial score (see the ‘Stroke prognostic scales based on clinical data only’ section). In particular, a linear regression model of raw outcome scores could be transformed into a change score model, which would then additionally allow for the interpretation of coefficients with respect to the classic proportional recovery concept.[102,181] Second, interpreting recovery solely on the basis of the change between follow-up and initial scores may mix up different patient subgroups and neurobiological mechanisms underlying different forms of functional recovery. 
For example, it is likely that a recovery change score of 10 points on the Fugl–Meyer assessment scale is driven by very different neurobiological processes depending on whether recovery started with an initial score of 5 (very severely affected) or 55 (almost no deficits). In turn, a patient who has recovered 20 points on the Fugl–Meyer scale, but started with an initial score of 5, is still considerably less recovered than a patient recovering 10 points but starting from 55. Therefore, follow-up scores rather than change scores seem to be better suited for recovery prediction scenarios. Complementing the increase in granularity of targeted outcomes, performance evaluation metrics could also be intelligently varied: the AUROC is currently the predominant score for binary prediction tasks. This one-dimensional approach could be extended to a multidimensional one by considering numerous, complementary metrics, such as positive predictive values, sensitivity and specificity at once.[182]
Altogether, improving predictions by all these means might eventually resolve the disenchantment stemming from reports of low real-world impact, as few of the prediction models are actually used in clinical routine,[91,106,183] and render them more clinically useful.
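The complementary metrics named above can all be derived from the same confusion matrix, as the following minimal sketch shows. The labels are invented for illustration (1 is assumed to code an unfavourable outcome, 0 a favourable one) and the helper name `binary_metrics` is hypothetical; it is not taken from any of the cited studies.

```python
def binary_metrics(y_true, y_pred):
    """Compute sensitivity, specificity, PPV and NPV from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN rather than raising.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true positives that were detected
        "specificity": ratio(tn, tn + fp),  # share of true negatives that were detected
        "ppv": ratio(tp, tp + fp),          # share of positive predictions that were correct
        "npv": ratio(tn, tn + fn),          # share of negative predictions that were correct
    }

# Invented example: 4 patients with and 6 without the unfavourable outcome.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Reporting these values side by side makes explicit that a model can, for instance, combine a high specificity with a modest positive predictive value when the outcome of interest is rare.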
Disadvantages and pitfalls
Interpretability has always been of particular importance for researchers, independent from their specific field of research.[184] However, as mentioned before, some modern learning algorithms capture and instrumentalize patterns in high-dimensional data[185] with sometimes even millions of parameters that may simply be too complex to be readily comprehensible. These characteristics have led to the denotation black box and triggered some scepticism with which these modern statistical tools are regarded.[186] Yet, these black-box characteristics may be acceptable when they produce the best prediction results including high generalization performance, as increasing interpretability—as found in simpler, e.g. linear, models—often comes at the cost of decreasing prediction performance.[88] Essentially, there is currently no consensus on what level of interpretability is required for safe deployment of prediction models.[15] However, independent of the level of model interpretability, it seems necessary that human intelligence acts together with AI.[187] In particular, due to their capacity to extract information from otherwise intractable high-dimensional data, AI or, more specifically, machine-learning approaches could represent a very effective initial step. Medical professionals, such as physicians and therapists, could then include this information into their treatment decisions to achieve an optimal outcome for their patients. 
Physicians, for example, tend to be too optimistic and vastly overestimate life expectancy of terminally ill patients.[188] In contrast, deep learning-based predictions were shown to generate more accurate life expectancy predictions and hence might have yielded better therapeutic decisions.[189] As outlined above (the ‘Stroke prognostic scales based on clinical data only’ section), some stroke outcome models have also already been shown to outperform the predictions made by physicians and/or therapists.[119-121] At the same time, for a safe implementation of machine-learning routines, physicians and therapists need to check on a regular basis whether the models established for a certain diagnostic or therapeutic scenario are still valid.[187] Incongruencies may, for example, arise in case of a ‘dataset shift’, i.e. when there is a mismatch between the data used during model development and the data currently used for model deployment.[190,191] As a prominent, recent example, a major US hospital had to deactivate model-based sepsis predictions to prevent spurious alerts after patients’ characteristics had substantially changed with the onset of the coronavirus disease 2019 pandemic.[191] The same sepsis-alert model has furthermore been shown to perform only poorly in independent, real-world data,[192] motivating a constant ongoing surveillance and validation of already established prediction models.
The current curriculum in medical school might thus be revised to equip physicians with the necessary toolset, e.g. in the field of health informatics.[35] Close collaborations between various disciplines might be strengthened to successfully combine statistical, computational and human perspectives.[193,194] These efforts may then increase physicians’ abilities to recognize both the benefits and limitations of AI in healthcare[195] and enhance the knowledge on how to, for example, continuously quantify and validate prediction performance of used prediction tools. 
Recently presented checklists and guidelines for the transparent reporting of AI algorithms and interventions in medicine may represent an essential foundation.[29,196,197] In general, reliability, privacy and fairness are further important ethical aspects that need to be reconsidered and redefined in greater depth in an interdisciplinary fashion when using more machine-learning algorithms in upcoming years.[15]
Last, it will be important to ensure satisfactory data quality. Otherwise, we may be at risk of encountering the big data paradox, as Xiao-Li Meng outlines it: ‘The more data, the more surely we fool ourselves’.[198] An algorithm can hardly be any better than the data that it learns from. Data acquired in the clinical routine might be noisy and biased, since, for instance, the patient moved during MRI scanning, it took too long until the blood was analysed or a junior doctor systematically misunderstood how to rate certain symptoms of the NIHSS score. Missing data, particularly those missing not at random, represent further challenges.[182] An important, somewhat trivial, but often neglected aspect is the validity of the data with respect to what can be really inferred from them. For example, DTI-based neuroimaging gives the impression of assessing anatomical fibre tracts, but its results remain model-based approximations with a coarse spatial resolution when considering the nearly 1000-fold smaller diameter of axons.[152] Likewise, functional MRI is based on a haemodynamic signal, which is much slower and anatomically blurrier than true neuronal activity.[157] These issues are further complicated by strong interindividual variability, which is encountered at basically all levels of the CNS. 
For example, analyses of post-mortem brains have revealed that even the location of primary areas like M1 or primary visual cortex, which represent highly conserved brain regions within and across species, may vary in a centimetre range between subjects independent of anatomical landmarks.[199] Decomposing anatomical variability is technically feasible to a certain degree, but quickly meets its limits when it comes to spatial and temporal resolution issues of neurons and axons.
To proceed with any of these aspects mentioned, medical doctors, neuroscientists, statisticians, computer scientists and ethicists ideally need to work in an interdisciplinary fashion.[17,193,194] They will need to ensure that inherent biases in data are detected, react accordingly, ignite discussions and develop international standards for big data analytics.[191] In this way, it might be possible to realize data science at its best and develop clinically helpful models, since, after all, ‘All models are wrong, but some are useful’ (George E. P. Box).
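One simple way to approach the regular validity checks and the ‘dataset shift’ problem discussed in this section is to compare the distribution of each input variable between the development data and the data seen during deployment. The sketch below is only one of many conceivable monitoring strategies: it uses a standardized mean difference per variable, the flagging threshold of 0.25 is an arbitrary assumption, and all variable names and values are invented.

```python
import statistics

def standardized_mean_difference(reference, current):
    """Difference in means, scaled by the pooled standard deviation of both samples."""
    pooled_var = (statistics.variance(reference) + statistics.variance(current)) / 2
    diff = statistics.mean(current) - statistics.mean(reference)
    if pooled_var == 0:
        # Both samples are constant: any difference in means is an extreme shift.
        return 0.0 if diff == 0 else float("inf")
    return diff / pooled_var ** 0.5

def flag_shifted_features(reference, current, threshold=0.25):
    """Return the names of variables whose absolute SMD exceeds the (assumed) threshold."""
    return [
        name
        for name in reference
        if abs(standardized_mean_difference(reference[name], current[name])) > threshold
    ]

# Invented development and deployment samples for two input variables.
development = {
    "age": [62, 70, 68, 75, 66, 71],
    "nihss": [4, 6, 5, 7, 3, 5],
}
deployment = {
    "age": [63, 69, 67, 76, 65, 72],
    "nihss": [9, 12, 10, 14, 8, 11],
}

flagged = flag_shifted_features(development, deployment)
print("variables with a possible shift:", flagged)
```

A flagged variable does not by itself show that a model has become invalid; it would rather prompt a renewed evaluation of prediction performance on recent data.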
Conclusion and open questions
Prediction approaches based on AI have the great potential to revolutionize medical care in general. However, it remains to be seen whether expectations can be sustainably met. It is furthermore essential to recall that machine-learning-based prediction scenarios should not be mistaken for causal inference.[200] In this context, randomized clinical studies are an exemplary study type that permits conclusions on whether a specific treatment causally underlies a better outcome in the treatment group. However, we may not generally be able to decide on whether a (standard) treatment is effective or not and whether it should e.g. be stopped based on machine-learning-derived outcome predictions. To infer causal effects, we would rather have to estimate what was most likely to happen, as well as the counterfactual prediction, i.e. what would have happened, if things had been different.[201] As Wilkinson and colleagues[200] put it: We may not be able to learn this counterfactual prediction by relying on the combination of machine learning and observational data, as they do not contain any information on what would have happened given altered circumstances.
Furthermore, if it is not only the physician who is undertaking the clinical decision-taking process, but a prediction algorithm, who is responsible in case of (fatal) error? At present, it is still the physician who has to take the final responsibility and to verify that the result of a prediction algorithm complies with the current medical standards. How to deal with the situation when an effective prediction algorithm is available for a doctor but is not used, especially when the doctor’s decision was wrong and harmed the patient? How can we ensure that we comply with patients’ privacy rights and protect health data from potential cyber-attacks?[202] How can we guarantee that our prediction models are fair, i.e. that prediction performance does not vary depending on ethnicity or gender? 
This aspect might be of particular concern since machine-learning approaches may sometimes even amplify biases present in historical datasets that, for example, include skewed representations of people of colour, women and under-served populations.[203] Last, how do these changes affect the doctor–patient relationship and how can we unlock the potential of AI-assisted healthcare to eventually increase the physician time genuinely spent on ‘caring for the patient’?[187,204]
Finally, advocating for more AI-based studies certainly does not negate the value of small data studies using inference statistics, which (especially if founded on strong theory, robust measurement and effective error variance control) can reveal systematic, functional relationships on the individual subject level[205] and may thus help to take a more mechanistic perspective on the development of therapeutic approaches for stroke recovery.[206] We have also not considered any Bayesian approaches here that hold great promise of capturing essential characteristics of stroke recovery.[207-209] In the very end, conclusions originating from different methodological approaches may be merged to maximize patients’ well-being and we may particularly embrace novel prediction techniques to augment our human performance as medical doctors.