
Methodology minute: a machine learning primer for infection prevention and control.

Timothy L Wiemken¹, Ana Santos Rutschman².

Abstract

The use of machine-learning and predictive modeling in infection prevention and control activities is increasing dramatically. In order for infection preventionists to make informed decisions on the performance of any particular model as well as to determine if the output of the model will be useful for their program needs, a suitable understanding of the creation and evaluation of these models is necessary. The purpose of this primer is to introduce the infection preventionist to the most commonly used machine-learning method in infection prevention: supervised learning.


Keywords: Artificial intelligence; Deep learning; Healthcare-associated infection; Natural language processing; Statistical learning; Supervised learning


Year: 2020; PMID: 33011334; PMCID: PMC7528905; DOI: 10.1016/j.ajic.2020.09.009

Source DB: PubMed; Journal: Am J Infect Control; ISSN: 0196-6553; Impact factor: 2.918

Machine learning, an overview

Machine learning is an umbrella term encompassing many algorithms used to assist human understanding of large amounts of data. Machine learning is often conflated with “predictive” or “prescriptive” modeling, though the terms are not directly interchangeable. It is also commonly confused with “artificial intelligence,” a phrase with little agreement on definition. Simply put, machine learning can be considered a computational method, while general artificial intelligence could be considered a physical manifestation that uses machine learning to perform one or more tasks. Several subclasses of machine learning have been developed and used in medicine, but the 3 major groups are unsupervised, supervised, and reinforcement learning. Each group includes different algorithms, each with its own pros and cons for specific tasks. Here, we focus on supervised machine learning, as it is the most commonly used approach for predictive modeling in infection prevention and control. Since this is only a brief primer, more detailed reviews can be found elsewhere in the literature for further study.

Supervised machine learning

When the phrase machine learning is used, the speaker is frequently talking about supervised machine learning, the approach most nearly synonymous with predictive modeling. Supervised machine learning is a method in which an algorithm learns from available data to create a model, which is then used to predict the same outcome for new data. The learning process is called “training the model,” during which various features (also known as variables) are provided to an algorithm. The prefix “supervised” means that the outcome is known to the model during training. For example, one may be interested in predicting whether someone has a catheter-associated urinary tract infection (CAUTI) before a diagnostic test. In this example, a binary variable, CAUTI, is the outcome of interest and would be obtained retrospectively from an electronic health record. A multitude of features that the modeler thinks might explain the presence or absence of CAUTI are added to the model as well. Depending on the algorithm used, different mathematical computations allow the computer to learn and compare the complex patterns of the features in patients with and without CAUTI. Next, new data for which the outcome is unknown (eg, prospective patients with unknown CAUTI status) are passed to this model, which outputs a prediction of the presence or absence of CAUTI. Supervised machine learning is used regularly in infection prevention to predict various health outcomes, such as Clostridioides difficile infection and other healthcare-associated infections, and most recently for development of vaccine candidates for SARS-CoV-2.
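The train-then-predict workflow described above can be illustrated with a minimal sketch. It uses a nearest-centroid classifier, chosen here only because it is short enough to read in full; it is not the method of any study cited in this primer. The two features (catheter days and a fever flag), the patient records, and the function names train and predict are all hypothetical and made up for illustration.

```python
# Minimal sketch of supervised learning with a nearest-centroid classifier.
# All data and feature names below are hypothetical, for illustration only.

def train(features, outcomes):
    """Learn one centroid (mean feature vector) per outcome class."""
    grouped = {}
    for row, label in zip(features, outcomes):
        grouped.setdefault(label, []).append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(model, row):
    """Return the outcome class whose centroid is closest to the new row."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Retrospective training data: (catheter_days, fever) with known CAUTI status.
X_train = [(2, 0), (3, 0), (1, 0), (9, 1), (12, 1), (10, 0)]
y_train = [0, 0, 0, 1, 1, 1]  # 1 = CAUTI, 0 = no CAUTI

model = train(X_train, y_train)

# Prospective patients whose CAUTI status is unknown.
X_new = [(11, 1), (2, 1)]
predictions = [predict(model, row) for row in X_new]
print(predictions)
```

In this toy example the outcome is supplied during training (the “supervised” part), and the trained model is then applied to new rows that carry features but no outcome. A real model would use many more patients and features, and its predictions would need to be evaluated on data held out from training.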

How good are these models? Performance explained

Model performance is described by several statistics representing how good the model is at predicting the outcome. Care must be taken when evaluating performance statistics, as there are many and they are often used inappropriately, particularly model accuracy. A case in point is a product vendor who creates a model to predict sepsis and reports the model to be 90% accurate. Although this appears great at face value, it is more important to look at the sensitivity, specificity, and positive and negative predictive values than at a grand statistic such as accuracy. This is because accuracy is calculated as the total correct predictions divided by the total predictions, without separating false positives from false negatives. Furthermore, if the model was trained inappropriately, this accuracy statistic may be incongruent with the performance of the model in real life. For example, if the model was trained on a dataset in which 90% of the patients did not have sepsis and 10% did, the model could simply report that every patient did not have sepsis and it would still be 90% accurate. It would miss the 10% with sepsis and thereby defeat the purpose of the prediction, which is to identify what is called the “minority class,” or the outcome group that is less frequent.
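The sepsis example can be worked through numerically. The sketch below assumes a hypothetical cohort of 100 patients (10 with sepsis, 90 without) and a model that predicts “no sepsis” for everyone; the function names are illustrative and not taken from any particular library.

```python
# Sketch: why accuracy alone can mislead when the outcome is imbalanced.
# The cohort and the "always negative" model are hypothetical.

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(numerator, denominator):
    """Return None when a statistic is undefined (zero denominator)."""
    return numerator / denominator if denominator else None

def performance(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),  # share of true cases detected
        "specificity": safe_div(tn, tn + fp),  # share of non-cases cleared
        "ppv": safe_div(tp, tp + fp),          # positive predictive value
        "npv": safe_div(tn, tn + fn),          # negative predictive value
    }

y_true = [1] * 10 + [0] * 90   # 10 patients with sepsis, 90 without
y_pred = [0] * 100             # model predicts "no sepsis" for everyone

stats = performance(y_true, y_pred)
for name, value in stats.items():
    print(name, value)
```

Under these assumptions the accuracy is 0.9, yet the sensitivity is 0 because none of the 10 sepsis cases is detected, and the positive predictive value is undefined because the model never predicts a positive. This is the gap between accuracy and clinically useful performance that the paragraph above describes.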

What variables or features are in the model?

A complex part of building a supervised machine-learning model is “feature engineering”: creating and modifying variables (features in machine-learning language) for inclusion. It is not as simple an approach as traditional explanatory regression modeling, where only “clinically meaningful” or “biologically plausible” variables are added to the model (eg, age, gender, race, etc.). Those variables can certainly be added to a machine-learning model, but the result should not be expected to be superior to that of a traditional regression model with the same variables. With machine learning, the goal is not to identify the impact of variables on the outcome with a risk ratio or odds ratio, but rather to have a model that can accurately predict an outcome. The features used in the model are largely irrelevant as long as they improve the performance of the model. Natural Language Processing is another umbrella term, encompassing many methods for dealing with textual data, such as text mining (extracting specific words or phrases from a corpus of text), topic modeling (defining topics from a corpus), and defining the sentiment of text, among a wide variety of other methods often used for machine-learning models. Natural Language Processing has proven useful in various studies, including for identifying healthcare-associated infections from notes in the electronic health record.
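As a rough illustration of text mining as a feature-engineering step, the sketch below turns free-text notes into binary keyword features. The notes, the keyword list, and the function names are hypothetical; production clinical text pipelines are considerably more involved.

```python
import re

# Sketch: deriving binary features from free-text notes by keyword matching.
# The notes and the keyword list are hypothetical, for illustration only.

KEYWORDS = ["fever", "catheter", "dysuria"]

def tokenize(note):
    """Lowercase the note and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", note.lower())

def keyword_features(note, keywords=KEYWORDS):
    """Return one 0/1 feature per keyword: does the word appear in the note?"""
    tokens = set(tokenize(note))
    return {f"mentions_{word}": int(word in tokens) for word in keywords}

notes = [
    "Patient febrile overnight. Foley catheter in place since admission.",
    "No fever. Reports dysuria; catheter removed yesterday.",
]

rows = [keyword_features(note) for note in notes]
for row in rows:
    print(row)
```

The output also shows the limits of such a simple approach: the first note says “febrile” but is not flagged for fever, while the second note says “No fever” and is flagged. Handling synonyms and negation is part of what makes feature engineering on clinical text difficult.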

A key issue and policy implications

It is critical to understand that machine-learning models are based on data that are often likely to reflect contextual biases, due to having been produced in high-resource settings such as university-affiliated hospitals. Outside these settings, data are unlikely to properly account for diversity in clinical features of the patients or may not have the same variables available (eg, various laboratory values) for modeling. Since machine-learning models strictly learn patterns present in the data supplied to the algorithm, any biased input data will result in biased outputs and predictions. As we begin to rely more on computational predictions in all areas of medicine and health, we must recognize and start addressing the data gaps. Failing to do so can lead to racism, sexism, ageism, or other forms of discrimination.

Conclusions

The future will bring more and more electronic decision support tools using machine learning to our workday. Remembering that any computer model is created by humans and may be error prone underscores the need to implement any decision support tool with caution, ensuring the clinical aspects of any predictions are not disregarded.

8 in total

1. Commentary: The problem of class imbalance in biomedical data.

Authors: Hemant Ishwaran; Robert O'Brien
Journal: J Thorac Cardiovasc Surg Date: 2020-06-29 Impact factor: 5.209

2. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models.

Authors: Evangelia Christodoulou; Jie Ma; Gary S Collins; Ewout W Steyerberg; Jan Y Verbakel; Ben Van Calster
Journal: J Clin Epidemiol Date: 2019-02-11 Impact factor: 6.437

3. A Generalizable, Data-Driven Approach to Predict Daily Risk of Clostridium difficile Infection at Two Large Academic Health Centers.

Authors: Jeeheh Oh; Maggie Makar; Christopher Fusco; Robert McCaffrey; Krishna Rao; Erin E Ryan; Laraine Washer; Lauren R West; Vincent B Young; John Guttag; David C Hooper; Erica S Shenoy; Jenna Wiens
Journal: Infect Control Hosp Epidemiol Date: 2018-04 Impact factor: 3.254

4. Machine Learning in Epidemiology and Health Outcomes Research.

Authors: Timothy L Wiemken; Robert R Kelley
Journal: Annu Rev Public Health Date: 2019-10-02 Impact factor: 21.981

5. Detection of healthcare-associated urinary tract infection in Swedish electronic health records.

Authors: Hideyuki Tanushi; Maria Kvist; Elda Sparrelid
Journal: Stud Health Technol Inform Date: 2014

6. High-performance medicine: the convergence of human and artificial intelligence.

Authors: Eric J Topol
Journal: Nat Med Date: 2019-01-07 Impact factor: 53.440

7. Detecting hospital-acquired infections: A document classification approach using support vector machines and gradient tree boosting.

Authors: Claudia Ehrentraut; Markus Ekholm; Hideyuki Tanushi; Jörg Tiedemann; Hercules Dalianis
Journal: Health Informatics J Date: 2016-08-04 Impact factor: 2.681

8. Computationally Optimized SARS-CoV-2 MHC Class I and II Vaccine Formulations Predicted to Target Human Haplotype Distributions.

Authors: Ge Liu; Brandon Carter; Trenton Bricken; Siddhartha Jain; Mathias Viard; Mary Carrington; David K Gifford
Journal: Cell Syst Date: 2020-07-27 Impact factor: 10.304
