Artificial intelligence in orthopaedics: false hope or not? A narrative review along the line of Gartner's hype cycle.

Jacobien H F Oosterhoff^1,2, Job N Doornberg^1,3.

Abstract

Artificial Intelligence (AI) in general, and Machine Learning (ML)-based applications in particular, have the potential to change the scope of healthcare, including orthopaedic surgery. The greatest benefit of ML is in its ability to learn from real-world clinical use and experience, and thereby its capability to improve its own performance. Many successful applications are known in orthopaedics, but have yet to be adopted and evaluated for accuracy and efficacy in patients' care and doctors' workflows. The recent hype around AI triggered hope for development of better risk stratification tools to personalize orthopaedics in all subsequent steps of care, from diagnosis to treatment. Computer vision applications for fracture recognition show promising results to support decision-making, overcome bias, process high-volume workloads without fatigue, and hold the promise of even outperforming doctors in certain tasks. In the near future, AI-derived applications are very likely to assist orthopaedic surgeons rather than replace us. 'If the computer takes over the simple stuff, doctors will have more time again to practice the art of medicine'.[76]

Cite this article: EFORT Open Rev 2020;5:593-603. DOI: 10.1302/2058-5241.5.190092.

Keywords: artificial intelligence; computer vision; data-driven medicine; machine learning; orthopaedic surgery; orthopaedic trauma; personalized medicine; prediction tools

Year: 2020 PMID: 33204501 PMCID: PMC7608572 DOI: 10.1302/2058-5241.5.190092

Source DB: PubMed Journal: EFORT Open Rev ISSN: 2058-5241

Introduction

Artificial intelligence (AI) is believed to have the capacity to change the scope of medicine, much as the introduction of smartphones changed our day-to-day lives. AI and machine learning (ML) are terms commonly used to cover a range of computer applications such as ML-derived clinical decision support, deep learning (DL)-based computer vision and natural language processing (NLP). In essence, computers use human-created algorithms for analysing patterns in data and improve their performance by learning from their own mistakes. The increase in (cheap) powerful computers and availability of larger and more robust data have fuelled the use of ML in healthcare.[1] For decades, data-driven algorithms have been showing promising results as valuable diagnostic tools to assist clinicians in many respective specialties. As early as the 1980s a data-driven clinical prediction tool to determine which patients with chest pain presenting to the ED (emergency department) could be safely discharged home versus patients who were at high risk of myocardial infarction requiring admission to the intensive care unit (ICU)[2,3] overcame doctors’ inconsistent and inefficient admission strategies. This greatly improved workflow in the ED and resulted in fewer admissions while improving patients’ outcomes. Now, 30 years later, many hospitals build on similar clinical prediction tools and conduct data-driven algorithms to improve workflow from simple tasks in EDs to complex decision-making in ICUs.[4] In the era of AI, these data-driven algorithms are augmented with ML with two theoretical benefits: (1) to add non-linear correlations to the models; and (2) eventually to become self-learning to improve performance. However, according to the Gartner hype cycle,[5] we are over the top of the curve and coming down the slope to realize that AI is not going to solve all patients’ and doctors’ problems (Fig. 1). 
Nevertheless, many successful applications are known: computer vision DL models screen over 50,000 mammograms annually for breast cancer at the Massachusetts General Hospital in Boston.[6] In orthopaedics, our Massachusetts General Hospital-based SORG (Skeletal Oncology Research Group) is on the frontier of ML in orthopaedic musculoskeletal oncology to provide advanced models for predicting surgical outcomes to improve patient-centred care,[7] and the Traumaplatform ML Consortium is broadening the scope of AI to orthopaedic trauma.[8-10] However, critics may argue: ‘Why do so many promising applications have yet to be adopted in patients’ clinical care or doctors’ workflow?’

Fig. 1

Artificial Intelligence Hype Cycle: Machine Learning, Natural Language Processing and Computer Vision on their way down. Adapted from Gartner Hype Cycle for Artificial Intelligence, 2019 (gartner.com/smarterwithgartner).

In this narrative review we focus on AI in orthopaedic surgery. We use examples of existing ML applications in orthopaedics to illustrate the great potential of AI to assist orthopaedic surgeons. In contrast, we will present methodologically sound ML applications, which have not, to date, made it to clinical practice, to exemplify AI’s shortcomings. Finally, we will present a practical stepwise approach on how to develop, validate, test and implement an ML application by using an example of a clinical prediction tool for discharge destination of patients with hip fractures. All examples serve as a narrative along the line of Gartner’s hype cycle to explore the question: ‘Artificial intelligence in orthopaedics: false hope or not?’

Part I: orthopaedic surgery is all about risk stratification – how can AI assist?

Risk of bias: risk stratification to neutralize the influence of ‘biased’ surgeons

In orthopaedics – although generalizable risk factors are well known – the probability of a favourable outcome or an adverse event for each respective individual patient that we care for (i.e. risk stratification), is currently still at best an educated guess when taking into account the great number of all unique specific patient and injury characteristics. ML-derived prediction models – that ultimately become self-learning and thereby constantly increase in accuracy – have great clinical potential in such risk stratification. This is based on the premise that high quality data are (prospectively) collected for the specific prediction-task at hand.[11] For example, predicting which elderly hip fracture patient has high probability of developing delirium on admission means they can be targeted for preventive measures.[12] Another example of a ‘non-medical’ risk stratification, but merely a useful logistical risk stratification that improves ‘workflow’ in the same frail patient group, is predicting discharge destination which could reduce expensive hospital admission days by streamlining post-operative pathways.[13] In the era of data-driven care and personalized or ‘precision’ medicine, decision-making in orthopaedic trauma surgery is flawed by selection bias of surgeons because we still lack good quality prospective outcome data for many common injuries. Moreover, surgeons are notoriously poor at accurately predicting patient outcomes.[14] Hence, there is great – undesired – variation in treatment. For example, in the Netherlands 20% of patients with a wrist fracture undergo surgery, while 80% of patients in Australia are treated operatively.[15,16] Consensus on the optimal treatment strategies for such common fractures is lacking and this leads to sub-optimal workflow, physical impairment and unnecessary costs. When aiming for global consensus, global collective – open access – use of available data is needed. 
First, combining data from multiple institutions is difficult because of ethical, legal, political and administrative barriers. However, it can be done: intensive care doctors showcased an innovative example of the Medical Information Mart for Intensive Care (MIMIC) database that was developed by the Massachusetts Institute of Technology (MIT) and Beth Israel Deaconess Medical Centre (BI), Boston MA, USA.[17] It is free to use and to develop prediction tools for non-commercial purposes. This ‘open access’ mentality will allow improved patient care in a collaborative effort, rather than multiple fragmented individual efforts around the world that have very poor external validity. As such, slowly, supporting systems for collective use of data have come across into healthcare, but have not been evaluated in orthopaedics other than in our implant registries. Distributed learning training – an algorithm learning from data without data leaving the hospital[18] – has been proven an effective alternative for sharing patient imaging data in other specialties for computer vision.[19] Data are allowed to be kept at the source where they are easier to handle and secure. Some firms – start-ups and for-profit organizations – are bypassing hospital routes to buy data directly from patients in order to receive identified data for model development. In the upward slope of Gartner’s hype cycle, over 90 well-funded AI-driven imaging and diagnostic solutions start-ups have been founded to date with combined funding of $1.5 billion.[20] There is a great challenge in conducting infrastructure and pathways for efficient high-quality data-sharing and feasible model development in orthopaedics globally. 
The benefits of sharing data have been recognized by governments and intergovernmental organizations around the world to promote transparency, accountability and value creation by making data available to all.[21] When data are stored centrally on servers, we can aim for ‘open access’ anonymized safe data-sharing and applications and thus aim for personalizing orthopaedic care globally and accessibly throughout the world.[17] Second, we should be cautious when combining data. Combined data can be used when data were collected for a specific research question and collected in an appropriate representative way. In particular, differences in healthcare systems should be acknowledged when combining and translating data across various countries. For example, our discharge prediction tool for elderly patients with a hip fracture that was deployed in Boston MA on data collected across the United States,[12] will likely not be externally valid in different healthcare systems in the Netherlands or Australia. More research is needed to explore these limitations of AI, in particular for ML-driven prediction tools, or computer vision applications using imaging from dissimilar machines from different parts of the world. In conclusion, treatment is not only influenced by biased surgeons, but decision-making can also be biased by differences in healthcare plans and insurance systems.[22] In the clinical case of predicting discharge destination after hip fracture surgery, facilities in the United States are limited by insurance approval, whereas in the Netherlands they are limited by availability. This makes generalizing predicted probabilities difficult; an algorithm should be externally validated thoroughly to overcome these discrepancies, as we will elaborate on below.
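
The idea of distributed learning mentioned in this section (an algorithm learning from data without data leaving the hospital) can be illustrated with a minimal sketch of one possible scheme, federated averaging, applied to a small logistic regression model. This is not necessarily the approach used in the cited studies; the three simulated ‘hospitals’, their synthetic data, the function names (local_update, federated_average, make_site), the learning rate and the number of rounds are all assumptions made purely for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_update(weights, rows, lr=0.1, epochs=20):
    """Run a few epochs of gradient descent on one hospital's own data.

    rows is a list of (features, label) pairs that stays at the site;
    only the updated weight vector is returned to the coordinator.
    """
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in rows:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w


def federated_average(site_weights, site_sizes):
    """Combine the site models, weighting each by its number of patients."""
    total = sum(site_sizes)
    n_features = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(n_features)
    ]


def make_site(n, seed):
    """Synthetic data (assumption): an intercept term plus one risk factor."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-2, 2)
        y = 1 if rng.random() < sigmoid(1.5 * x) else 0
        rows.append(([1.0, x], y))
    return rows


# Three hypothetical hospitals of different sizes.
sites = [make_site(200, seed=1), make_site(300, seed=2), make_site(150, seed=3)]

global_w = [0.0, 0.0]
for _ in range(10):  # communication rounds
    updates = [local_update(global_w, rows) for rows in sites]
    global_w = federated_average(updates, [len(rows) for rows in sites])

print(global_w)  # shared model; no patient-level rows were exchanged
```

In this sketch only weight vectors travel between the sites and the coordinator. A real deployment would additionally need secure transport, governance and, as discussed above, attention to differences between healthcare systems.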

Risk of bias due to (lack of) experience: risk stratification based on ‘big data’

Junior doctors gain experience by treating hundreds of patients during their training. Senior doctors may be considered experienced after treating thousands of patients. Both are prone to bias:[23] the former due to lack of experience, the latter due to personal subjectivity of one’s experience.[24] Based on ‘objective experience’ with tens of thousands of patients, DL-driven computer vision and ML-derived prediction tools could alert clinicians about decisions that are at risk of bias. For example, in terms of decision-making in EDs, the majority of patients are seen by junior doctors. Junior doctors are known to misdiagnose significant trauma abnormalities on radiographs.[25] Food and Drug Administration (FDA)-approved and commercially available computer vision applications[26] can produce a heatmap on a radiograph where there is high probability of suspected fracture to alert junior doctors and improve risk management. In addition, situations with high cognitive load for clinicians, such as decision-making at the end of a clinic day, could be supported by ML predictions. If non-biased ML predictions and real-life clinician decision-making differ in these situations, clinicians can be alerted.[27] The common claim for ML-derived prediction tools is that a better decision can be made with a model than without.[28] Transparency and traceability of the decision-making process of AI systems must be made available to physicians in order to avoid fear of the ‘black box’: ‘How did the computer come to this decision?’. Therefore, it is important for orthopaedic trauma surgeons to have a foundation of knowledge of ML, as well as how it may affect and impact models, in order to critically assess predictions generated by ML and interpret the advice on probability of outcome in clinically meaningful ways. 
Not only treating physicians, but also patients are becoming important consumers of predictive analytics since patients are included in decisions about their treatments. Therefore, better tools to gain insight into risk stratification and communication to patients are needed to achieve true shared decision-making.[1] When intended to diagnose, treat or prevent disease, ML-derived applications are defined as a medical device under the Food, Drug, and Cosmetic Act in the United States.[29] In Europe, ML-derived applications are required to be approved by the Medical Device Regulations as defined by the CE Mark.[30] In addition, regulatory US and European platforms are not yet equipped to oversee AI’s insertion into medical practice.[29]

Part II: three forms of machine learning to aid clinical decision-making

Natural language processing (NLP)

Natural language processing (NLP) is a field of deep learning (DL), with the ability for a computer to understand and analyse human language. Google Translate is the best-known non-medical example. DL is a class of ML characterized by the use of neural networks, in which the algorithm learns to distinguish patterns directly from data and learns on its own to select features to classify the input data. The goal of NLP is to translate the natural human language of a patient’s medical record, for example surgery reports, into structured format data to query for the presence or absence of a finding.[31] In orthopaedics, NLP has been applied to identify surgical site infections in free-text notes of medical records and achieved predictive abilities comparable with the manual abstraction process and superior to models that used administrative data only.[32] In hip arthroplasty, NLP has been used to identify common data elements[33] and to classify periprosthetic femur fractures.[34] Our group applied NLP to evaluate unstructured free-text patient-experience reviews of orthopaedic surgeons throughout the United States. Patient experience reflects quality of care from the patient’s perspective, hence these are important data that can teach us about what creates an (un)satisfying experience.[35] Another simple, yet very elegant, application of NLP in clinical practice has been developed at the Beth Israel Deaconess Medical Center (BIDMC) by Steven Horng – Emergency Physician and Clinical Lead for ML – and colleagues.[36] In the BIDMC’s emergency department, the NLP algorithm automatically ‘reads’ the triage nurse’s admission note. Subsequently, it provides a drop-down menu of ICD diagnoses in order of differential diagnostic likelihood – rather than alphabetical – based on written clinical triage data. 
Moreover, this algorithm is subsequently self-learning based on the final entered ICD diagnosis, increasing the accuracy of the drop-down differential diagnosis based on plain written text. When debating, ‘AI, false hope or not?’, one could consider the larger sum of these respective small advances in our clinical workflow to result in a major reduction of time we spend on our computers (Fig. 2).
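
To make the idea of a likelihood-ordered, self-learning diagnosis list more concrete, the following is a toy sketch using a bag-of-words naïve Bayes ranker. It is not the system described in [36]; the class name DiagnosisRanker, the example triage notes and the choice of naïve Bayes are assumptions for illustration, and the ICD-10 codes are used only as example labels.

```python
import math
import re
from collections import Counter, defaultdict


class DiagnosisRanker:
    """Toy bag-of-words naive Bayes ranker (illustrative assumption only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # code -> word -> count
        self.code_counts = Counter()             # code -> number of notes seen
        self.vocabulary = set()

    @staticmethod
    def tokenize(note):
        return re.findall(r"[a-z]+", note.lower())

    def learn(self, note, final_code):
        """Feedback step: update the counts with the finally entered diagnosis."""
        words = self.tokenize(note)
        self.word_counts[final_code].update(words)
        self.code_counts[final_code] += 1
        self.vocabulary.update(words)

    def rank(self, note):
        """Return the known codes ordered from most to least likely."""
        words = self.tokenize(note)
        total_notes = sum(self.code_counts.values())
        scores = {}
        for code, n_notes in self.code_counts.items():
            counts = self.word_counts[code]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_notes / total_notes)
            for word in words:
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / denom)
            scores[code] = score
        return sorted(scores, key=scores.get, reverse=True)


ranker = DiagnosisRanker()
# Hypothetical triage notes paired with the diagnosis that was finally entered.
ranker.learn("fell on outstretched hand, wrist pain and swelling", "S52.5")
ranker.learn("twisted ankle playing football, lateral ankle pain", "S93.4")
ranker.learn("fall from standing, hip pain, leg shortened and rotated", "S72.0")

ranking = ranker.rank("wrist pain after fall on hand")
print(ranking[0])  # prints "S52.5"
```

Each call to learn plays the role of the feedback loop described above: once the final diagnosis is entered, the counts are updated, and later rankings for similar notes shift accordingly.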

Fig. 2

AI is very likely to assist orthopaedic surgeons: ‘If the computer takes over the simple stuff, doctors will have more time again to practice the art of medicine’ (Courtesy: Marcello Lavallen).

Clinical prediction rule

Predictive tools in orthopaedics consist of diagnostic as well as prognostic outcome applications. In orthopaedic trauma, ML-derived clinical prediction rules may enhance workflow in the ED:[37,38] patients clinically suspected for scaphoid fracture are referred for radiographic evaluation. Of these, up to 20% of patients with a negative radiograph have sustained an actual scaphoid fracture.[39] The developed Clinical Prediction Rule can aid clinicians in identifying patients requiring advanced imaging (i.e. magnetic resonance imaging (MRI) or computed tomography (CT)) and thereby may reduce the number of requested advanced imaging and potential unnecessary casting procedures for up to 31% of patients.[38] Similarly, the Ottawa Ankle Rules, a combination of predictive clinical parameters that increase the likelihood of a fracture, could, with the additional benefit of a self-learning and correcting capacity, support and guide the clinician when taking a history and performing a physical examination. Hence, there would be improved risk stratification for advanced imaging of patients with ongoing improved accuracy when results are fed back into the ML algorithm.

Computer vision for fracture recognition

Computer vision is a domain of DL and describes the process of a machine understanding images or videos, and could be useful to aid diagnostic decision-making in fracture care. In computer vision, convolutional neural networks (CNNs) have proven to be effective for these purposes.[40] Using pre-trained CNNs enables us to transfer knowledge to a specific new fracture recognition task, without the need for new time-consuming computational training. Our systematic review addressed the promise and potential utility in fracture care, and found computer vision was nearly as good as and even outperformed humans in detecting certain common fractures.[9] When classifying proximal humerus fractures, often misdiagnosed due to variable presentation, a CNN outperformed general physicians and general orthopaedic surgeons, and performed on par with specialized upper extremity surgeons. The CNN was trained on ~2000 radiographs classified according to the Neer classification.[41] Moreover, a few studies have been published showing that AI performs at a human level in recognizing fractures on plain radiographs taken in the ED of patients with wrist, hand, and ankle injuries with at least 83% accuracy.[42-45] Arguably, these studies all included simple – easy to identify – fractures only. Of critical note, however, subtle and invisible (occult) fractures may be more challenging than fractures that are easy to detect. In the case of the aforementioned clinically suspected scaphoid fracture, a scaphoid fracture is relatively subtle on radiographs and is often overlooked by non-specialists.[46] Even specialists cannot detect some scaphoid fractures on radiographs – so-called radiographically occult fractures. When applying computer vision to identify true fractures among suspected fractures, many of which were radiographically invisible to human observers, computer vision did not outperform humans. 
Along Gartner’s line: CNN for fracture recognition was embraced for its high potential and lured in many investors supporting numerous start-ups for billions of dollars. But as we are now over the top of the hype cycle, we recognize that, for example, occult fractures of the scaphoid remain occult for expert surgeons as well as for a CNN specially trained for scaphoid fractures.[47] This uncovers one of the problems of supervised learning of CNN for musculoskeletal computer vision of occult fractures: training of the algorithm requires a great number of cases, with a reference standard (MRI, CT or follow-up radiographs) which is at best debatable in accuracy. At the stage we are at now, computer vision will miss (occult) scaphoid fractures just as often as orthopaedic surgeons and radiologists do. However, in other specialties, computer vision has been proven to outperform specialists in cancer screening in picking up early tumours that are often missed, even by specialists.[48,49] The hope of computer vision in orthopaedics is early accurate diagnosis and classification, to improve treatment outcomes. At this point, orthopaedic surgeons are on par with AI, as even the first FDA-approved computer vision application in orthopaedics (OsteoDetect) does not exceed specialists’ accuracy in detecting and diagnosing distal radius fractures.[26]

Outcome calculator

Risk stratification in orthopaedics has the potential to neutralize the influence of biased surgeons and thus overcome treatment inconsistencies,[16,50] thereby improving patients’ functional outcomes and reducing associated healthcare costs (Fig. 3). Thus, small significant changes in daily decision-making in high-volume patient care will result in important overall public health advances.[51,52] In orthopaedics, ML-derived decision tools to assist clinicians in treatment outcomes have been developed in arthroplasty,[53] trauma,[10,12,38] oncology and spinal disorders.[54-57] In orthopaedic oncology, decision tools show accurate performance characteristics in pre-operative estimation of survival in patients with spinal or extremity metastatic disease.[54,55] The developed tools may enhance personalized survival prediction, from 30 days up to five years, and aid shared treatment decision-making, both surgical and non-surgical. In arthroplasty, estimation of patients who will benefit from elective surgery will support optimization in treatment strategy, and prevent patients undergoing an elective procedure with an unacceptably high (individual) risk of adverse events.[55] In orthopaedic trauma, an outcome calculator was developed to identify pre-operative risk of post-operative delirium in hip fracture surgery[12] and the ML algorithm will likely improve the efficiency of a screening programme aimed at identifying patients at risk for delirium. However, the clinical efficacy of the latter tool has yet to be determined and will be the subject of clinical testing and implementation studies.

Fig. 3

Workflow for patients clinically suspected for a distal radius fracture.

Note. ED, emergency department.

Although there are many studies on development of decision tools, few authors have driven further development by successful external validation.[58-60] Methods for evaluating and monitoring models to ensure continued accuracy and performance are in their infancy with regard to embedding ML in healthcare.[61] In the final part of this narrative review, we will demonstrate a logical stepwise approach from clinical problem to implementation, derived from a successfully implemented ML application, which is suggested to be followed to ensure quality in orthopaedic ML research.

Part III: stepwise approach from clinical decision-making problem to implementation

The methodology follows the framework for prediction models proposed by Professor Steyerberg et al,[28] and covers the range of development of applications such as NLP, computer vision and clinical decision support as discussed above (Fig. 4).

Fig. 4

Flowsheet from clinical problem to implementation.

Step 1. Predictive modelling: development of a machine learning algorithm

Data derived from various study designs addressing the clinical decision-making problem at hand could be used for predictive modelling with the use of ML; retrospective, prospective, registry data and nested case-control studies fit best for prognostic modelling whereas cross-sectional and case-control study designs fit better for diagnostic modelling.[62] The benefit of ML may be best realized with larger data sets, particularly those that are periodically updated, with the rule of thumb of having ≥ 200 events and ≥ 200 non-events.[63] For example, an ML algorithm for delirium prediction following elderly hip fracture surgery, and various other SORG ML algorithms, were developed with a large clinical database from the American College of Surgeons (ACS) National Surgical Quality Improvement Programme (NSQIP).[12,56,57] A function is generated consisting of an outcome variable (dependent variable) which is predicted from a given set of features (independent variables). In the case of development of a clinical prediction rule or outcome calculator, a variable importance analysis is first carried out to identify and select those features that contribute most to the outcome variable with clinical importance in mind. Variables included may contain clinical and radiological findings (e.g. patient demographics, trauma mechanism or classification of fracture), as well as intra-operative findings and surgical techniques (e.g. screw and/or plate fixation or arthroplasty). In the case of computer vision and NLP, the algorithm distinguishes patterns directly from data and learns on its own to select features (e.g. edges, curves, colour) to classify the input data, essentially as a black box. Training and internal validation of the supervised ML algorithm continue (‘run’) until the model achieves the best model performance. 
The delirium hip fracture prediction tool targeted post-operative delirium as the dependent variable, with easy, readily available independent variables derived from variable importance (i.e. age, BMI, ASA class, functional status, pre-operative dementia, pre-operative delirium, pre-operative need for mobility-aid and pre-operative creatinine level). Predictive performance of ML algorithms is assessed according to Steyerberg’s structured stepwise ABCD-approach: calibration-in-the-large, or the model intercept (A); calibration slope (B); discrimination, with a concordance statistic (C); and clinical usefulness, with decision-curve analysis (D).[28,64] In addition, overall model performance – a composite of discrimination and calibration – is assessed using the Brier score, compared with the null model Brier score.[65] Classification algorithms include linear classifiers (logistic regression, naïve Bayes), support vector machine, classification trees or neural networks (Fig. 5). Linear classifiers are easy to interpret, and fast to train. Non-linear classifiers are more flexible and have the ability to capture more complex patterns, but are, in small samples, prone to overfitting. Logistic regression involves fitting an S-shaped probability curve to numerical data by taking the log odds to make predictions about binary events. Classification trees (e.g. gradient boosted machine, random forest) use flowchart-like structures to make decisions, which can be readily understood and visualized. Artificial neural networks are inspired by biological neural networks; they mainly use so-called feed-forward architectures with hidden layers and neurons and are, in general, data-hungry. Support vector machines are based on the idea of finding a hyperplane in a 3D (kernel) scatterplot that can divide a dataset into two classes, and work in general quite well on smaller datasets. 
A naïve Bayes machine is a product of probabilities, best visualized as a Venn diagram that shows possible logical relations, works well with smaller datasets, and prefers categorical features over continuous features (where a normal distribution is assumed).
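
As a small worked illustration of two of the performance measures named above, the sketch below computes the Brier score, the null-model Brier score and the concordance statistic in plain Python. The eight outcomes and predicted probabilities are made-up numbers, and the function names are assumptions chosen for this example; in practice an established statistics package would normally be used.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def null_brier_score(y_true):
    """Brier score of a model that predicts the observed event rate for everyone."""
    rate = sum(y_true) / len(y_true)
    return brier_score(y_true, [rate] * len(y_true))


def c_statistic(y_true, y_prob):
    """Share of event/non-event pairs in which the event received the
    higher predicted probability (ties count as half)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                concordant += 1.0
            elif pe == pn:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical outcomes (1 = event, e.g. delirium) and predicted probabilities.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.8, 0.2, 0.4, 0.6, 0.1, 0.7, 0.9, 0.3]

print(round(brier_score(y_true, y_prob), 3))  # 0.125
print(round(null_brier_score(y_true), 3))     # 0.234
print(round(c_statistic(y_true, y_prob), 3))  # 0.933
```

A Brier score below the null-model value indicates that the predictions carry more information than simply assigning every patient the overall event rate; a concordance statistic of 0.5 corresponds to chance and 1.0 to perfect discrimination.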

Fig. 5

Classification algorithms. (Courtesy: B.Y. Gravesteijn)

There is no single solution when choosing the right ML algorithm. The decision is taken after formulating a research question, preparing data and building various models. Comparing model performance is based on all metrics according to the ABCD approach,[28] combined with the most clinically meaningful variable importance. The ML algorithm developed for delirium prediction in hip fracture surgery achieved almost perfect model performance combined with clinically meaningful feature importance, outperformed the default strategy of screening all patients, and included easy and readily available variables.
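
The comparison with a default strategy of screening all patients is what decision-curve analysis (the ‘D’ of the ABCD approach) quantifies through net benefit. The sketch below uses the standard net-benefit formula with made-up outcomes and probabilities; the data, the function names and the two thresholds are assumptions for illustration and are not taken from the delirium model.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on patients whose predicted risk is at or above
    the threshold: true positives minus false positives weighted by the odds
    of the threshold, per patient."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the default strategy of acting on every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = event) and model predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.8, 0.2, 0.4, 0.6, 0.1, 0.7, 0.9, 0.3]

for threshold in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, threshold)
    all_nb = net_benefit_treat_all(y_true, threshold)
    print(threshold, round(model_nb, 3), round(all_nb, 3))
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both that of acting on everyone and that of acting on no one (which is zero). A full decision curve repeats this calculation over a range of clinically reasonable thresholds.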

External validation

External validation is essential to assess performance and generalizability of the algorithm before implementation in clinical practice. External validation can be carried out with temporal, geographical or fully independent validation.[65] External validation is also important because model performance might differ across populations, making an unvalidated algorithm less reliable.[60] Of the current few externally validated ML algorithms the validation cohort was derived from retrospective analysis at a large, tertiary care centre.[58-60] A developed, internally validated ML algorithm is applied to a separate validation dataset to assess model performance according to the same metrics as above. Even though various institutions may all be using the same electronic health record (EHR) vendor, the data structure, field meanings and extent of data cleaning likely differ across organizations.[66] For future research, when prospectively collecting data, common data elements for common data models could support combining data from various institutes and validation of prediction models globally[67] and thereby support fully independent validation.[68] As discussed above, our discharge prediction tool for elderly patients with a hip fracture was deployed in Boston MA on United States data.[12] Differences in healthcare systems, standards of care and treatment strategies can prohibit generalizability to other countries. In some situations, when external validation reveals low generalizability, re-calibration strategies are allowed.[69] In re-calibration, particular components of the developed model are modified and tailored for each study population (such as the intercept of the model or variable effects).[70]
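
One simple re-calibration strategy, updating only the model intercept, can be sketched as follows. The example assumes a model that systematically overestimates risk in a new population; the validation data, the function names and the bisection search are assumptions for illustration, not the procedure of the cited studies. With the slope fixed at 1, the intercept update that best fits the new data is the shift on the log-odds scale at which the mean predicted risk equals the observed event rate.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(z):
    return 1 / (1 + math.exp(-z))


def recalibrate_intercept(y_true, y_prob, tol=1e-9):
    """Find the log-odds shift that makes the mean predicted risk equal the
    observed event rate. Assumes probabilities strictly between 0 and 1 and
    an event rate strictly between 0 and 1."""
    target = sum(y_true) / len(y_true)
    scores = [logit(p) for p in y_prob]
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_pred = sum(expit(s + mid) for s in scores) / len(scores)
        if mean_pred < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical external validation cohort: 20% events, but the original
# model predicts a much higher average risk.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_prob = [0.7, 0.4, 0.5, 0.3, 0.8, 0.4, 0.6, 0.2, 0.5, 0.3]

shift = recalibrate_intercept(y_true, y_prob)
updated = [expit(logit(p) + shift) for p in y_prob]

print(round(sum(y_prob) / len(y_prob), 2))    # mean risk before: 0.47
print(round(sum(updated) / len(updated), 2))  # mean risk after: 0.2
```

Updating the intercept changes every prediction by the same amount on the log-odds scale, so the ranking of patients (and hence the concordance statistic) is unchanged. Tailoring variable effects, as mentioned above, would require re-estimating the coefficients as well.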

Evaluation and implementation

If found to be externally valid, clinicians might use an available (web) application to help incorporate the algorithm into practice to aid decision-making and target actions to be a priority (e.g. https://sorg-apps.shinyapps.io/hipfxdelirium/). A real-time clinical prediction rule, computer vision model or outcome calculator based on the developed ML algorithm and routinely collected clinical data is best established and validated in EHR systems.[71] Derived predictions are integrated and calculated automatically and made available to the clinician.[71] Although ML is a new methodology that greatly expands the ability to analyse data, implementation should follow the same rules as the previously developed diagnostic test.[72,73] Efficacy of the developed ML algorithm is ideally assessed through large randomized controlled trials (RCTs).[74] ML-derived decision support has great power to assist clinicians and change the scope of medicine; however, many powerful algorithms are not utilized yet.[66] Consider the following scenario: a patient is scheduled for hip fracture surgery and randomized to either the intervention or control arm of an RCT. In the intervention arm, an intervention is based on high probability derived from the developed ML algorithm. In the control arm, treatment is according to common practice. The proposed primary end-point is incidence of post-operative delirium to determine benefit from the developed ML algorithm and clinical importance (i.e. patient outcomes). ML requires the use of a computer and EHR integration, which has implications for patient privacy and creates obstacles for implementation.[73] Physicians will need to open the application, enter information and then return to using it in the EHR.[66] The biggest challenge is incorporating an ML-derived decision support tool into an EHR workflow. 
In addition, the distinctive characteristics of ML-based software require a regulatory approach, allowing necessary steps to improve treatment while ensuring that the algorithm is safe.[75]

Improvement of the algorithm: continuous self-learning

Increasing the size of the data set substantially improves ML model performance and allows the model to respond to changes in practice or patient population. Ongoing data collection will lead to improved ML models, though with gradually diminishing returns.[73] The great advantage of ML algorithms over decision rules is the ability to improve the accuracy of the model over time, which may support earlier disease detection, more accurate diagnosis, identification of new observations or patterns, and development of personalized diagnostics and treatment.
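One simple way to see why additional data yield diminishing returns is to look at the statistical uncertainty of an estimated risk, which shrinks with the square root of the sample size. The sketch below is an analogy under stated assumptions, not a model of any specific ML algorithm: the event rate of 0.2 and the data set sizes are hypothetical, and real learning curves depend on the model and the data.

```python
import math


def standard_error(event_rate: float, n: int) -> float:
    """Standard error of an estimated proportion from n independent patients."""
    return math.sqrt(event_rate * (1 - event_rate) / n)


def marginal_gains(event_rate: float, sizes: list[int]) -> list[float]:
    """Reduction in standard error obtained at each step up in data set size."""
    errors = [standard_error(event_rate, n) for n in sizes]
    return [errors[i] - errors[i + 1] for i in range(len(errors) - 1)]


if __name__ == "__main__":
    # Hypothetical event rate and data set sizes, doubling at each step.
    sizes = [500, 1000, 2000, 4000, 8000]
    gains = marginal_gains(0.2, sizes)
    for (small, large), gain in zip(zip(sizes, sizes[1:]), gains):
        print(f"{small} -> {large} patients: standard error falls by {gain:.4f}")
```

Each doubling of the data set still reduces uncertainty, but by less than the previous doubling, which is the pattern of diminishing returns described above.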

Conclusions

Many argue that AI will change the scope of medicine. Indeed, along the upslope of Gartner’s hype cycle, $1.5 billion has been invested in AI in healthcare,[20] and counting. However, coming over the top of the hype curve, we recognize the methodological limitations of ML and DL: for example, a computer can recognize an obvious fracture,[41] which may be beneficial as a support tool for junior doctors in an ED under a highly demanding workload.[25] But for an occult scaphoid fracture, CNN algorithms have yet to outperform orthopaedic specialists.[47] On the downward slope of Gartner’s line, we come to realize that many promising ML prediction tools and DL image recognition tools have been developed with good intentions for commercial benefit, but very few have been externally validated – systematically tested for accuracy in clinical workflow – or implemented in daily practice to date. To do so in orthopaedics, we face ethical, legal, political and administrative barriers. To move forward along the slope of enlightenment, we strongly argue for collaboration in an ‘open access’ mentality as intensive care specialists do:[17] share good quality prospective data to improve the accuracy and external validity of AI-derived algorithms; and – in an ideal world – continue prospective data collection with an active feedback loop to improve performance. We envision the plateau of productivity of the hype cycle as follows: AI-derived applications will facilitate data-driven personalized care for our patients, limit surgeons’ bias, and empower shared decision-making based on patient-specific data. AI is likely to assist orthopaedic surgeons rather than replace us: ‘If the computer takes over the simple stuff, doctors will have more time again to practice the art of medicine’.[76]
