
Bridging the "last mile" gap between AI implementation and operation: "data awareness" that matters.

Federico Cabitza¹, Andrea Campagner², Clara Balsano³,⁴.

Abstract

Interest in the application of machine learning (ML) techniques to medicine is growing fast and wide because of their ability to endow decision support systems with so-called artificial intelligence, particularly in those medical disciplines that extensively rely on digital imaging. Nonetheless, achieving a pragmatic and ecological validation of medical AI systems in real-world settings is difficult, even when these systems exhibit very high accuracy in laboratory settings. This difficulty has been called the "last mile of implementation." In this review of the concept, we claim that this metaphorical mile presents two chasms: the hiatus of human trust and the hiatus of machine experience. The former hiatus encompasses all that can hinder the concrete use of AI at the point of care, including availability and usability issues, but also the contradictory phenomena of cognitive ergonomics, such as automation bias (overreliance on technology) and prejudice against the machine (clearly the opposite). The latter hiatus, on the other hand, relates to the production and availability of a sufficient amount of reliable and accurate clinical data that is suitable to be the "experience" with which a machine can be trained. In briefly reviewing the existing literature, we focus on this latter hiatus of the last mile, as it has been largely neglected by both ML developers and doctors. In doing so, we argue that efforts to cross this chasm require data governance practices and a focus on data work, including the practices of data awareness and data hygiene. To address the challenge of bridging the chasms in the last mile of medical AI implementation, we discuss the six main socio-technical challenges that must be overcome in order to build robust bridges and deploy potentially effective AI in real-world clinical settings. 2020 Annals of Translational Medicine. All rights reserved.


Keywords: Artificial intelligence; DevOps; data hygiene; last mile

Year: 2020 PMID: 32395545 PMCID: PMC7210125 DOI: 10.21037/atm.2020.03.63

Source DB: PubMed Journal: Ann Transl Med ISSN: 2305-5839

Introduction

Interest in the applications of machine learning (ML) techniques in medicine is growing fast because of their ability to endow decision support systems with so-called artificial intelligence (AI), a vague but evocative expression to denote, in this context, the capabilities of machines (i.e., algorithms) to classify or stratify clinical cases or predict related conditions with high accuracy—in some cases, even more accurately than human experts (1). If we limit ourselves to counting how many papers indexed on PubMed have the expression “AI” in their title, we can see that, every 2 years from 2012 to 2017, this number has roughly doubled (43, 70, and 169 in, respectively, 2012 to 2013, 2014 to 2015, and 2016 to 2017), but, in the most recent 2-year period, this number has increased almost tenfold (1,413 in 2018 to 2019). Moreover, this interest goes beyond the ambit of academic research, as it is mirrored by a doubling of the number of FDA approvals for devices endowed with some form of AI in the last 5 years.

The proof of the pudding in medical AI

Despite this increasing interest, there is oddly still a lack of consensus on how to assess whether the adoption of AI in a healthcare setting is even successful and hence useful. In an editorial in this journal (2), we made the point that we must go beyond statistical validation (which is what is usually conducted and reported in scientific reports and articles in terms of accuracy measures, such as C-statistics or F-scores) and demand proof that these systems bring clinical benefit when fed with real-world data (what we called pragmatic validation) and when deployed in real clinical settings (ecological validation). Achieving ecological validation of statistically high-performing AI in diagnostic and other medical tasks has been observed to be more complicated than initially expected; in fact, most of the challenges that make technically sound systems perform poorly in real-world settings lie in the “last mile of implementation” (3), a concept that we equate to the conceptual gap between developing medical ML and the mere application of ML techniques to medical data. Moreover, this “last mile,” although apparently short, is not a flat and regular path, but rather presents two chasms, as shown in Figure 1.

Figure 1

The hiatuses in the “last mile” between medical AI and human agency.

In particular, the hiatus of human trust represents the most serious hindrance to the full realization of the potential of AI at the point of care, as even the most accurate systems are affected when they are not trusted by doctors as a result of what we called “prejudice against the machine” (4). This hiatus represents the failure of medical AI to make a positive impact on doctors’ decisions, irrespective of its intrinsic accuracy. This can be because the interface is inadequate, because the good advice comes late or among several (often too many) false alarms (5), or because the decision-maker cannot take advantage of the technological support due to the emergence of either automation bias (6) or automation complacency (7). Efforts to bridge this hiatus are attracting increasing interest from the specialist communities of ML and human-computer interaction (HCI), in which solutions are designed and tested to improve the usability of AI interfaces (8), their causability (9), that is, the quality of their explanations, and, more generally, their acceptability (10). This is why we make the point here that, paradoxical as it might seem, the other hiatus, which represents how the clinical experience from which ML models might learn is accumulated and processed, is more neglected than the human trust hiatus. In fact, most data scientists would deny that this step is even a gap: data scientists and ML developers usually assume that the datasets with which their predictive models are trained (what is emphatically called ground truth) are (I) truthful, (II) reliable, and (III) representative of the target population. However, this threefold assumption is seldom tenable and often ill-grounded, at least to some extent.
As we have observed (11), ground truth can be considered reliable (note: for our practical aims, we consider a source dataset reliable if it is 95% accurate) only when perfect raters are called to annotate past cases or at least nine averagely accurate raters [that is, at least 85% accurate on average, which is a reasonable estimation (12)] are involved. Such a condition seldom holds true, if ever. Thus, the hiatus of machine experience represents the difference between actual care, as experienced by doctors, and its codified representation in the form of data, which is the only input for any machine, no matter how intelligent it is or might ever become.
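To make the dependence of ground-truth reliability on the number and accuracy of raters concrete, the following is a minimal sketch and not the method used in (11). It assumes binary labels, raters who err independently of each other, and a simple majority vote; the function name `majority_vote_accuracy` is hypothetical. Because real raters tend to make correlated errors on the same hard cases, this idealized model is more optimistic than the estimate quoted above.

```python
from math import comb


def majority_vote_accuracy(n_raters: int, rater_accuracy: float) -> float:
    """Probability that a strict majority of raters assigns the correct label.

    Assumptions (not taken from the source): binary labels, every rater is
    correct with the same probability, and raters err independently.
    """
    if n_raters < 1 or n_raters % 2 == 0:
        raise ValueError("use a positive, odd number of raters to avoid ties")
    if not 0.0 <= rater_accuracy <= 1.0:
        raise ValueError("rater_accuracy must lie in [0, 1]")
    needed = n_raters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_raters, k)
        * rater_accuracy**k
        * (1.0 - rater_accuracy) ** (n_raters - k)
        for k in range(needed, n_raters + 1)
    )


if __name__ == "__main__":
    # Show how the reliability of the pooled label grows with the panel size.
    for n in (1, 3, 5, 7, 9):
        print(n, round(majority_vote_accuracy(n, 0.85), 4))
```

Even in this favorable setting a single rater, or a small panel, falls short of the 95% target, which is the practical point of the passage: reliable ground truth requires deliberate, multi-rater annotation effort.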

A matter of quality

We have thus come to the crux of our position: the primary source for the training of ML models is data produced during the care of patients—with the exception of test results and diagnostic images—by doctors and other clinical practitioners in their daily routines and tasks. However, the quality of data found in medical records is notoriously far from being perfect, with independent studies consistently finding that approximately 5% of records contain errors (13-15). Focusing on diagnostic imaging, radiological reports might be used for automatic annotation (16), though they can have an even higher error rate (17), or, alternatively, images can be deliberately annotated by a pool of radiologists, though they often show a high degree of discordance in their findings (2). Thus, the quality of ML training data is often lower than the level needed to build reliable models for integration into effective decision support systems; what is worse, ML developers and doctors—as end users of the products—usually neglect or underrate this issue. That said, few concepts are as intuitively comprehensible and yet academically elusive as that of data quality (DQ) in health records. In fact, the concept of DQ can intuitively and concisely be equated to the concept of fitness for use (18). This is in line with the operational definition of DQ given by The Joint Commission, which equates DQ with the adequacy to support a number of medically relevant tasks, like identifying the patient, supporting the diagnosis, justifying care and treatment, documenting the course and results of treatment, and promoting continuity and safety of care (19). Despite this apparent simplicity, a recent review by Juddoo et al. (20), which considered 41 high impact papers, extracted a staggering 43 distinct DQ dimensions relevant to health care applications, of which 38 were mentioned more than once. 
In their words, “This confirmed the impression of a lack of a universal DQ framework and the possible fact that different authors might be using different jargon to express the same idea”. Accuracy, which Cabitza and Batini (21) defined as “health data [that] represent the truth and what actually happened” was the dimension most frequently mentioned (58 times), followed by completeness, consistency, reliability, and timeliness. These are the dimensions that appeared more than ten times and hence were considered by the authors to be “most important in the context of Big Data within the health industry.” To try to simplify this complex matter, we could consider three main DQ areas. The first is related to accuracy and reliability (where the latter also includes internal consistency and high inter-rater agreement). The second is in regard to completeness and timeliness (in that missing data can be seen as data that is not yet recorded and, conversely, having complete but obsolete data is like not having useful data at all). The third is related to comparability and (external) consistency, which is also a matter of interoperability between information systems and the communities of practitioners who use those systems.

Crossing the chasms between clinical practice and ML

In the following, we will address the issue(s) of DQ in healthcare in order to bridge the “last mile” of the ML hiatus. This is the challenge of bringing a trustworthy (datafied) representation of health conditions and care actions to the opposite side of this chasm or, in other words, the challenge of engineering a workflow to develop an ML-based AI that supports medical decision making: a composite industrial process encompassing various steps, among which are the engineering of the above representation, the training of predictive models, and their testing and validation. The main challenges to which we refer relate to the following areas of concern (see Figure 2):

Figure 2

The main factors either bridging or separating clinical work and AI development. Oriented arcs represent strong influence.

1. A lack of uniformity and consensus on (i) what to record (MDS in Figure 2, for minimum data set) and (ii) how to record it (STD, for standards);
2. The phenomenon of observer variability (IRR, for inter-rater reliability), which is the extent to which multiple raters disagree on how to classify, and hence codify, a given medical phenomenon;
3. Limitations on doctors’ capacity to record and communicate information; this also includes limited education of processing staff (both clinical and administrative), especially with respect to awareness of the consequences of poor DQ (DW, for data work);
4. Poorly designed data collection tools, both paper-based and electronic interfaces (HCI, for human-computer interaction);
5. No single (or central) repositories for vetted and anonymized data for use at scale for secondary purposes, like research and ML training and validation (DM, for data management); and
6. Lack of planning (or will to plan) by administrative and managerial staff and higher policy makers with respect to DQ assessment and continuous improvement (DG, for data governance).

As the reader may notice, the problems we have mentioned are all socio-technical in nature, with inter-rater reliability (IRR), data work, and data governance being primarily socio-organizational areas of concern, and HCI and data management being mainly technical. Yet, for the very first concern (achieving consensus on minimum data sets and the related classification schemas), both areas are inextricably intertwined, and distinguishing between them is more futile than in other cases. Let us quickly review each of the concerns.
We can begin with an expression of optimism: the concern related to the use of standards (STD in Figure 2) to establish how to report a medical condition (i.e., how to code it) has affected the healthcare environment for decades and still prevents many centers from exchanging information or documents. The impact of this on the whole sequence of challenges of the ML hiatus should be gradually declining due to the increasing adoption of coding standards (such as ICD-10, SNOMED, and LOINC) that have passed the test of time and reached sufficient maturity (including technological maturity) and their integration into current electronic medical records and hospital information systems. In particular, the latter two coding systems are the most widely used terminology standards to date for health measurements, observations, and documents (22) and their adoption appears to be a growing trend.

The problem of the minimum data set (MDS in Figure 2), that is, what to report, is more complex and relates to the challenge of identifying all the relevant attributes of a clinical condition that, in the case of AI training, are good predictors of the target variable. In this regard, both ML developers and clinical practitioners can help each other. The former can employ the largest number of attributes (or features) available (or even conceivable) in a well-circumscribed experimental setting to train a set of ML models on specific relevant targets (like classifying cases in terms of either the associated diagnosis or a stratum of expected improvements or outcomes) and then perform a quantitative feature ranking [or feature selection (23)] to determine the most useful N features for each predictive task at hand.
The union set of these features could then be indicated (including in terms of iconic or graphical signs in the user interface of the electronic record) as being the data that it is recommended to report as carefully and accurately as possible for the secondary use of the data (including, but not limited to, AI development). In a similar but complementary fashion, a clinical study could be designed and conducted to review a sufficient number of retrospective cases that are adequately representative of relevant conditions; when multiple raters agree upon what data affected (or would have affected) the right decision at the right time, the study could identify the necessary data without which most of the cases (e.g., 80%) would not be managed appropriately or in a timely manner; in other words, the minimum data set with the highest impact on the patients’ outcomes. We could call this set the minimum pragmatic data set, which is still lacking for many clinical specialties. In both cases, the common idea is to adopt the motto “less is more” (24), translated into the DQ and ML fields as “less (but good) data is more data.” In so doing, we would fully recognize that the doctors’ responsibilities cannot be further expanded with new and more intensive reporting tasks (see also problem no. 3) and, perhaps more importantly, that medical data should not be treated as any other type of data and that its quality requirements cannot be borrowed from other domains. As a paradigmatic example of this realization, de Mul and Berg (25) reported a convincing case in which missing data did not necessarily indicate a DQ problem, but rather the occurrence of conditions that practitioners deemed not necessary to document (the “all is well” situation), thus shedding light on the unsuitability of establishing (and enforcing algorithmically) requirements for data completeness that are expressed in terms of some fixed threshold, as is common in other fields, such as administration.
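The developer-side route to a minimum data set described above (rank the features for each predictive task, keep the top N, and take the union across tasks) can be sketched as follows. This is an illustration under stated assumptions rather than a recommended pipeline: the ranking criterion (absolute Pearson correlation with the target) is a deliberately simple stand-in for the feature selection methods of (23), and all field names, toy values, and function names are hypothetical.

```python
from math import sqrt


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov / sqrt(var_x * var_y))


def top_n_features(columns, target, n):
    """Rank record fields by their association with one target; keep the best n."""
    scores = {name: abs_pearson(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


def recommended_fields(columns, targets, n):
    """Union of the top-n fields over all predictive tasks."""
    recommended = set()
    for target in targets.values():
        recommended.update(top_n_features(columns, target, n))
    return recommended


# Hypothetical toy record: three fields, six patients, two predictive tasks.
TOY_COLUMNS = {
    "age": [1, 2, 3, 4, 5, 6],
    "marker_x": [6, 1, 5, 2, 4, 3],
    "ward_code": [1, 1, 1, 1, 1, 1],  # constant, so it carries no signal
}
TOY_TARGETS = {
    "diagnosis": [0, 0, 0, 1, 1, 1],
    "outcome": [1, 0, 1, 0, 1, 0],
}

if __name__ == "__main__":
    print(sorted(recommended_fields(TOY_COLUMNS, TOY_TARGETS, n=1)))
```

The resulting set is what the record interface could flag as "report with particular care"; fields outside it are not discarded, they simply receive no additional reporting burden.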
Moving to the next concern: even if a community of specialists agree upon what and how to report, data could still be unreliable whenever more than one clinician is involved in its production (as should be the case for data for ML training) and these practitioners do not agree on how to report the same clinical phenomenon. This is a well-known situation in the medical literature (26,27), which has been studied for almost a century, variously referred to as observer variability, inter-rater agreement, or IRR (see Figure 2). In Cabitza et al. (28), we contributed to raising renewed awareness of the potential distortion that IRR, which is intrinsic to and probably ineradicable from the interpretation of medical conditions, can induce in any medical dataset, especially those used to train ML models. To address this factor, we proposed further investigating the viability and efficacy of some socio-technical solutions, which we can also relate to the HCI concern and to the solution we proposed above for raising awareness of the importance of careful completion of selected fields of the record. In particular, we proposed highlighting the fields that presented low IRR scores during adoption, in a way that is not too different from that shown in Figure 3 (28). In this solution, IRR scores can be computed on the basis of a small user study (or even at regular intervals) by asking two or more clinicians to fill in the same data on a random basis and computing this score on the fly.
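A per-field agreement score of the kind just described could be computed as in the sketch below. It assumes two clinicians filling in the same categorical fields for the same cases and uses Cohen's kappa as the agreement measure; the source does not say which IRR statistic is intended, so kappa is our choice for illustration. The field names, toy entries, and the 0.6 flagging threshold are likewise hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: share of cases on which the raters coincide.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always used the same single label
    return (observed - expected) / (1.0 - expected)


def low_agreement_fields(form_a, form_b, threshold=0.6):
    """Names of form fields whose kappa falls below the (assumed) threshold."""
    return sorted(
        field
        for field in form_a
        if cohen_kappa(form_a[field], form_b[field]) < threshold
    )


# Hypothetical entries by two clinicians for the same six surgical cases.
FORM_A = {
    "asa_class": ["I", "II", "II", "III", "I", "II"],
    "complication": ["yes", "no", "yes", "no", "yes", "no"],
}
FORM_B = {
    "asa_class": ["I", "II", "II", "III", "I", "II"],
    "complication": ["yes", "yes", "no", "no", "yes", "no"],
}

if __name__ == "__main__":
    print(low_agreement_fields(FORM_A, FORM_B))
```

Fields returned by `low_agreement_fields` are the candidates for visual highlighting in the data entry form, so that clinicians know where extra care in reporting matters most.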

Figure 3

A standard form to collect surgery data, with indications of IRR for each field (the darker the red, the lower the agreement among raters). Adapted from Ref. (28).

Data work (DW in Figure 2) is a recent expression (29) that was introduced to cover all the tasks that doctors and nurses perform to document care and coordinate with each other (30) and that produce (and consume) medical data. The concerns with this kind of work (ontologically different from care) are related to excessive paperwork, with the consequent frustration and alienation of health practitioners, possibly leading to potentially serious consequences for the quality of care and health of their patients (31). Thus, if we just assume the limitations of doctors with respect to DQ (and avoid treating it as either their fault or an organizational failure), we can support data work in several ways, partly organizationally and partly technologically. As an example of a radical solution of the former type, we suggested relieving doctors from directly using data collection tools and flanking them with medical scribes (29). These would be “non-licensed health care team members that document patient history and physical examination contemporaneously with the encounter” (32) and who are trained in transcribing the doctors’ orders and notes as well as in describing medical cases in standardized and more consistent and comparable ways. Another example is the technological counterpart of this organizational solution, the so-called virtual scribes. This term can denote either the outsourcing of the medical scribing service (33) or, less frequently (and together with the alternative term digital scribes), the full automation of this service through AI systems that perform speaker diarization (understanding who spoke when), speech recognition, named entity (or knowledge) recognition, and the processing of structured data (32).
These would be, in short, a sort of specialized AI that fills in the electronic medical record autonomously and with reduced effort on the part of the medical staff (who are involved in vetting the AI output). We acknowledge that both the above solutions require additional resources (both human and economic), but we can here paraphrase the famous quotation often misattributed to Derek Bok: “if you think ensuring high DQ is expensive, try low DQ.” In any case, we recall the success of some cost-effective initiatives to improve doctors’ hand hygiene (34) and can envision that similar cognitive-behavioral solutions [“nudges” (35)] could also be applied to data work practices in order to spread good practices of data hygiene. These solutions would likely be more effective than mere economic incentives or disciplinary sanctions, although their effectiveness in the long term may be uncertain and make them of limited sustainability.

The concerns regarding the quality of the electronic data collection tools, namely their low usability and poor HCI (see Figure 2), have several implications, including for medical errors, patient safety, and clinician burnout (36). However, the design of structured and orderly graphical interfaces has also been found to have a positive role in improving DQ for ML training; for instance, Pinto Dos Santos et al. (37) provided proof of the concept that data extracted from structured reports written during clinical routines can be used to successfully train deep learning algorithms. While a plea to improve the usability of the interfaces of electronic medical records would hardly be considered inappropriate, we recognize the difficulties inherent in its realization.
Nevertheless, we believe it is important that scholars talk about these problems, that research is done into the role of human factors in the performance of doctors (38), and that healthcare stakeholders become more aware of the opportunities to improve medical AI not only in terms of its accuracy, but also in terms of the usability of the systems through which we interact with it and, ultimately, in terms of the satisfaction of its users.

The concern about data management (DM in Figure 2) is the most technical one of those mentioned so far; in this regard, we can observe the wider diffusion and stronger reliability of third-party cloud storage solutions in which health facilities can store the data they produce in a secure and safe environment, often maintained by dedicated staff using state-of-the-art equipment. This is justified not only by cost savings and economies of scale, but also by the more robust infrastructure against malicious threats like data poisoning (39), in which an adversary injects bad data into a model’s training dataset to get it to learn something that could make it vulnerable (attack its integrity) or inaccurate for a particular input (attack its availability or usefulness), or against adversarial attack, in which an adversary changes the input (e.g., by adding random pixels to a diagnostic digital image) to prevent the system from classifying the resulting input (without the knowledge of the healthcare provider). Recent events have also raised the awareness of stakeholders, managers, and policy makers of data management risks and made it clear how the greater dependence on technology that AI induces (precisely because of its quality and potential) is also mirrored by greater vulnerability and fragility of the health system as a whole.

Finally, and related to this latter point, the greatest organizational concern is the last one mentioned: data governance (DG in Figure 2).
We emphasize the difference between data management, which is a set of practices around the good operation of an information system and the adequate quality of its data flows, and data governance. This is a term denoting a strategic attitude toward the information assets “under the control of a hospital or health system [which encompasses] all policies and procedures to guide, manage, protect, and govern the electronic information” (40). We consider this factor at the very end of the ML hiatus, as we regard it as the “last yard” of the path from the point of care to the entrance to the ML development pipeline, although its influence, as can be seen in Figure 2, can be easily traced back to almost all of the steps preceding it. In fact, all of the previous elements can be set and aligned to bring high quality data to the ML development pipeline, but if healthcare facilities do not exert full data governance over this flow and process (including governance of the processes by which predictive models are created, validated, updated, and applied to new cases on the basis of daily needs and routine), the ML hiatus depicted in Figure 1 might be closed for a while, but it will open up again, sooner or later, under the attacks of cyber-hackers or just the erosion caused by the drift of practices and the passage of time.

Final remarks

In this contribution, we have focused on the socio-technical elements that we recognize must all line up to allow for the deployment of potentially effective AI in real-world clinical settings. Rather than focusing on the theoretical performance and accuracy of medical AI, which is a rather new and surprising concern that computer scientists seem to have passed on to doctors, we have shed light here on a still relatively neglected and underrated set of concerns regarding the quality of the data that is used to train and adjust the AI algorithms to fit the situated needs of a community of health practitioners. At this point, the famous adage comes to mind, which appears whenever ML and DQ are near each other: “garbage in, garbage out”. This expression refers to the fact that no algorithm, no matter how smart or intelligent it is, can produce value if its input lacks value in the first place. However, the situation in healthcare is unfortunately worse than this common engineering phrase might suggest in other, less critical, domains. In fact, if inadequate input is used to optimize the performance of a decision support system, yet remains undetected, and thus the input is not appropriately improved nor the “support” discarded, but instead erroneously considered truthful and fit for purpose, the resulting garbage output risks being viewed as the proper advice of an accurate tool. An unreliable indication may then be made more “objective” and indisputable thanks to the armor of algorithmic legitimacy that we tend to ascribe to this class of machines, and we eventually risk allowing experts to be misled in more complex decisions by this garbage-in-disguise output and risk novices being deskilled in what should be easy decisions (41).
Thus, before AI can unleash its full potential to help practitioners deliver better (and more human) care, we must, when the implementation chasm is closed and human and machine intelligence converge (42), build robust bridges across both the machine and the human hiatuses. This requires a full range of interventions, both organizational and technical, which alone would be either overambitious or useless, together with the awareness that the accuracy of any technological support is nothing in medicine without the power of doctors to use it to the best of their knowledge and judgment: in a word, responsibly.

31 in total

1. Less is more: how less health care can result in better health.

Authors: Deborah Grady; Rita F Redberg
Journal: Arch Intern Med Date: 2010-05-10

2. The elephant in the record: On the multiplicity of data recording work.

Authors: Federico Cabitza; Angela Locoro; Camilla Alderighi; Raffaele Rasoini; Domenico Compagnone; Pedro Berjano
Journal: Health Informatics J Date: 2019-01-22 Impact factor: 2.681

3. Unintended Consequences of Machine Learning in Medicine.

Authors: Federico Cabitza; Raffaele Rasoini; Gian Franco Gensini
Journal: JAMA Date: 2017-08-08 Impact factor: 56.272

4. Electronic Health Record Logs Indicate That Physicians Split Time Evenly Between Seeing Patients And Desktop Medicine.

Authors: Ming Tai-Seale; Cliff W Olson; Jinnan Li; Albert S Chan; Criss Morikawa; Meg Durbin; Wei Wang; Harold S Luft
Journal: Health Aff (Millwood) Date: 2017-04-01 Impact factor: 6.301

5. Virtual Scribe Services Decrease Documentation Burden Without Affecting Patient Satisfaction: A Randomized Controlled Trial.

Authors: Savannah Benko; Alex J Idarraga; Daniel D Bohl; Kamran S Hamid
Journal: Foot Ankle Spec Date: 2020-08-26

Review 6. Automation bias and verification complexity: a systematic review.

Authors: David Lyell; Enrico Coiera
Journal: J Am Med Inform Assoc Date: 2017-03-01 Impact factor: 4.497

7. The incidence of diagnostic error in medicine.

Authors: Mark L Graber
Journal: BMJ Qual Saf Date: 2013-06-15 Impact factor: 7.035

8. Interrater reliability: the kappa statistic.

Authors: Mary L McHugh
Journal: Biochem Med (Zagreb) Date: 2012 Impact factor: 2.313

9. Structured report data can be used to develop deep learning algorithms: a proof of concept in ankle radiographs.

Authors: Daniel Pinto Dos Santos; Sebastian Brodehl; Bettina Baeßler; Gordon Arnhold; Thomas Dratsch; Seung-Hun Chon; Peter Mildenberger; Florian Jungmann
Journal: Insights Imaging Date: 2019-09-23

10. Effective coding is key to the development and use of the WHO Essential Diagnostics List.

Authors: Jacob McKnight; Michael L Wilson; Pamela Banning; Chris Paton; Felix Bahati; Mike English; Ken Fleming
Journal: Lancet Digit Health Date: 2019-12
