A Systematic Review of Artificial Intelligence and Machine Learning Applications to Inflammatory Bowel Disease, with Practical Guidelines for Interpretation.

Imogen S Stafford^1,2,3, Mark M Gosink^4, Enrico Mossotto^1, Sarah Ennis^1, Manfred Hauben^4,5.

Abstract

BACKGROUND: Inflammatory bowel disease (IBD) is a gastrointestinal chronic disease with an unpredictable disease course. Computational methods such as machine learning (ML) have the potential to stratify IBD patients for the provision of individualized care. The use of ML methods for IBD was surveyed, with an additional focus on how the field has changed over time.
METHODS: On May 6, 2021, a systematic review was conducted through a search of MEDLINE and Embase databases, with the search structure ("machine learning" OR "artificial intelligence") AND ("Crohn* Disease" OR "Ulcerative Colitis" OR "Inflammatory Bowel Disease"). Exclusion criteria included studies not written in English, no human patient data, publication before 2001, studies that were not peer reviewed, nonautoimmune disease comorbidity research, and record types that were not primary research.
RESULTS: Seventy-eight (of 409) records met the inclusion criteria. Random forest methods were most prevalent, and there was an increase in neural networks, mainly applied to imaging data sets. The main applications of ML to clinical tasks were diagnosis (18 of 78), disease course (22 of 78), and disease severity (16 of 78). The median sample size was 263. Clinical and microbiome-related data sets were most popular. Five percent of studies used an external data set after training and testing for additional model validation.
DISCUSSION: Availability of longitudinal and deep phenotyping data could lead to better modeling. Machine learning pipelines that account for imbalanced data and that perform feature selection only on training data will generate more generalizable models. Machine learning models are increasingly being applied to more complex clinical tasks for specific phenotypes, indicating progress towards personalized medicine for IBD.

Keywords: artificial intelligence; inflammatory bowel disease; machine learning

Year: 2022 PMID: 35699597 PMCID: PMC9527612 DOI: 10.1093/ibd/izac115

Source DB: PubMed Journal: Inflamm Bowel Dis ISSN: 1078-0998 Impact factor: 7.290

What is already known? Machine learning has been applied with success to cancer diagnostics, and now these methods are increasingly being applied to complex conditions, such as inflammatory bowel disease.

What is new here? In the past 2.5 years, the number of articles published in this field has increased by 68%; this has been accompanied by a shift in machine learning applications from diagnostics to prognostics, and the use of more complex methods such as neural networks.

How can this study help patient care? Two main requirements were identified for translation of machine learning models into the clinic: generalizable models generated from robust pipelines, and the collection of deep and specific patient phenotype data.

Introduction

Inflammatory bowel disease (IBD) is an umbrella term for a set of chronic diseases, of which there are 2 main subtypes: Crohn’s disease (CD) and ulcerative colitis (UC). The global prevalence of IBD increased to 84.3 per 100 000 by 2017, and with it comes a greater burden to patients and health services.[1] Because a number of factors contribute to its etiology, the IBD disease course is highly variable. Patients can experience a mild disease or a severe, refractory disease requiring many interventions. A patient’s disease course is often unclear at diagnosis. There has been a relative explosion in the use of artificial intelligence and machine learning (ML) techniques for complex diseases, after the success of these algorithms in fields like oncology.[2] Unlike traditional statistical techniques, ML infers patterns from data, allowing model application to unseen cases. Key concepts for this field are included in Box 1, and a further breakdown of ML terms, metrics, and methods is detailed elsewhere.[3,4] For IBD, ML has the potential to improve patient care at every stage of the disease course through prediction modeling: from a quick subtype diagnosis so appropriate treatment can be identified, to assessing disease activity and identifying those patients more likely to develop complications and require surgery. For clinicians, this potential is exciting but comes with many questions about which ML methods may be successful. Here, practical guidelines are provided to guide interpretation of current and future research in this field (Appendix 1). Although this systematic review centers around applications to IBD, these are general guidelines for ML interpretation.

Box 1.

Artificial Intelligence: methods that enable computers to mimic human intelligence.
Machine Learning: methods that infer patterns from data to perform a specific task, usually classification or regression.
Deep Learning: neural network–based approaches that enable machines to train themselves to perform tasks.
Supervised Learning: the ML model learns patterns in data and associates this information with an already present label. The model can then apply this learning to new data and predict these labels.
Unsupervised Learning: the ML model identifies patterns and clusters the data in a way that explains the data structure (not according to labels).
Feature Selection: a collection of methods that reduce the dimensionality of a data set, such that ML is performed on a subset of the most informative variables for the task.
Cross Validation: a method that can reduce the overfitting of ML models, making it more likely that the results will generalize to new data. During training the data are split into k folds, and the ML model is trained on k-1 folds. The model performance is tested on the final fold, and the process is repeated so each fold becomes the test fold exactly once.

In a previous, broader systematic review of artificial intelligence and ML applications to autoimmune diseases,[3] a number of popular methods and applications were identified, and the research assessed guided some recommendations for the field. In addition, other systematic reviews have been published commenting on this area, including Tontini et al’s review of artificial intelligence for gastrointestinal endoscopy[5] and Nguyen et al’s study on machine learning for diagnosis and prognosis in IBD.[6] The aim of this systematic review was to assess common data types, applications, and methods in the field of ML for IBD and to evaluate changes in the field over the past few years. The broad scope of this review allows for the assessment of trends and the recording of the full range of ML applications to IBD.
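The k-fold procedure described under Cross Validation in Box 1 can be sketched in a few lines. This is a minimal illustration rather than the method of any study in the review: the function name `k_fold_indices` is hypothetical, and the samples are assumed to be already shuffled, so folds are taken as contiguous blocks of indices.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_samples-1 into k folds and return (train, test) pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread samples as evenly as possible: the first (n_samples % k) folds get one extra.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i, test in enumerate(folds):
        # Train on the other k-1 folds; the held-out fold is used only for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Assumed toy setting: 10 samples, 5 folds.
splits = k_fold_indices(n_samples=10, k=5)
```

Each sample index appears in exactly one test fold, so every observation contributes once to the performance estimate and k-1 times to training.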

Methods

Literature Search

An electronic literature search was performed using 2 databases available through OvidSP: MEDLINE(R) and Epub Ahead of Print, In-Process, In-Data-Review & Other Non-Indexed Citations and Daily 1946, and Embase 1974. The literature search was completed in May 2021. Search terms were combined using Boolean operators as follows: (“machine learning” OR “artificial intelligence”) AND (“Crohn* Disease” OR “Ulcerative Colitis” OR “Inflammatory Bowel Disease”). Any record in which these search terms were identified in the title, abstract, and/or subject headings would appear in the list of records (last search May 6, 2021).
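The Boolean structure of the search can be illustrated with a small text filter. This is only a sketch of the logic, not a reproduction of how OvidSP evaluates queries: the function name `matches_search` is hypothetical, and the truncation wildcard in "Crohn* Disease" is assumed to match any run of non-space characters.

```python
import re

ML_TERMS = ["machine learning", "artificial intelligence"]
# Assumption: "*" in "Crohn* Disease" matches any run of non-space characters.
IBD_PATTERNS = [
    r"crohn\S*\s+disease",
    r"ulcerative\s+colitis",
    r"inflammatory\s+bowel\s+disease",
]


def matches_search(text: str) -> bool:
    """Return True when the text satisfies (ML term) AND (IBD term)."""
    lowered = text.lower()
    has_ml = any(term in lowered for term in ML_TERMS)
    has_ibd = any(re.search(pattern, lowered) for pattern in IBD_PATTERNS)
    return has_ml and has_ibd
```

In the review, the terms were matched against the title, abstract, and subject headings of each record; here the caller is assumed to pass those fields joined into one string.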

Inclusion and Exclusion Criteria

This systematic review sought to expand and better characterize a subsection of a previous review of ML in autoimmune disease; therefore the same criteria were employed. Studies that applied ML to IBD, or a subtype of IBD, were included. Studies that used ML for analysis of complications that arise from IBD were also included. Studies that were not written in English, were published prior to 2001, did not use real human patient data, were not peer reviewed, or were not original research articles were excluded. Therefore, the following publication types (as labeled by OvidSP) were not assessed during screening: conference abstracts, conference review, editorial, erratum, journal article comment, journal article review, letter, letter comment, note, and review. The abstract of each study was assessed by 2 reviewers independently for inclusion in the systematic review. The full text was assessed when a decision on inclusion could not be made based on the abstract, and a consensus was reached by the 2 reviewers. The following data items were collected for each study that met the criteria: the task ML was applied to, the type of ML (supervised or unsupervised), all ML algorithms trialed by the researchers, the best performing ML algorithm, sample size, clinical population (IBD, UC, or CD), data type, the best results achieved, whether a training and testing split was used, whether other cross-validation was used, whether the model was applied to independent test data, and the year of publication. This systematic review conforms to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standards.[7]

Graphical Representations of Data

Articles were graphically summarized in sunburst or pie chart diagrams using a custom R script utilizing the Plotly library.[8,9] The R scripts can be downloaded from GitHub (github.com/isstafford/review_ml_ibd_2021). Briefly, all articles were classified according to the machine-learning approach used (method), the type of information being analyzed (data type), and the outcome being predicted (task). These categories and the sorting of studies into categories were agreed upon by all authors. The R script counts unique titles under each level of the 3-class hierarchy, and the results are displayed in a sunburst diagram. The innermost ring represents the highest level of the hierarchy, whereas the outermost ring represents the individual articles. Since some articles discuss multiple methods, tasks, or outcomes, they may be represented more than once on the diagram. Graphical summaries of sample sizes and uses of ML method types over time were generated using ggplot2 in R.[10] For the sample size graphical summary, the ML method was counted as used if it was recorded as a method in the research article, even if the ML method did not generate the optimal model. Machine learning methods were sorted into type groups (eg, ridge regression and logistic regression were both included under “regression”). Multiple methods from the same type group within the same article were counted once to avoid skewing the data. In cases where articles investigated multiple classification problems with different sample sizes, each classification problem counted as a separate entry. All ML method groups with sufficient data for a boxplot (n ≥ 5) were included in the visualization. The same ML method groups were plotted for the use of ML types over time.
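The counting step behind the sunburst diagram can be sketched as follows. This is a Python illustration of the idea, not the authors' R script: the records are hypothetical, the function name `count_unique_titles` is invented for the example, and the hierarchy order (method, then data type, then task) is an assumption.

```python
from collections import defaultdict

# Hypothetical records for illustration only; these are not studies from the review.
records = [
    {"title": "Study A", "method": "Random Forest", "data_type": "Microbiome", "task": "Diagnosis"},
    {"title": "Study A", "method": "Regression", "data_type": "Microbiome", "task": "Diagnosis"},
    {"title": "Study B", "method": "Random Forest", "data_type": "Clinical", "task": "Disease Course"},
    {"title": "Study C", "method": "Random Forest", "data_type": "Microbiome", "task": "Diagnosis"},
    {"title": "Study C", "method": "Random Forest", "data_type": "Microbiome", "task": "Diagnosis"},
]

# Assumed ordering of the 3-level hierarchy, innermost ring first.
LEVELS = ("method", "data_type", "task")


def count_unique_titles(rows):
    """Count distinct titles under every path prefix of the hierarchy."""
    titles = defaultdict(set)
    for row in rows:
        for depth in range(1, len(LEVELS) + 1):
            path = tuple(row[level] for level in LEVELS[:depth])
            # A set removes repeated rows for the same article on the same path.
            titles[path].add(row["title"])
    return {path: len(found) for path, found in titles.items()}


counts = count_unique_titles(records)
```

Because "Study A" trialed two methods, it is counted under both the Random Forest and Regression branches, which mirrors the note that an article may be represented more than once on the diagram.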

Results

Initially, 409 records were identified, and 135 records were subsequently removed as duplicates. When the study criteria regarding original research articles, year of publication, and language were applied, 153 entries were removed. Of the remaining 121 screened articles, 33 were excluded after assessing the abstract against the inclusion and exclusion criteria, and a further 9 were excluded after a full-text read (Figure 1). A technical analysis of the ML applications in these studies is outside the scope of this review. Here, summary statistics are provided regarding popular methods, applications, and data; summary statistics are also provided regarding the sample sizes, cross-validation, and trends in ML usage in recent years. The chosen ML models and data types used for each type of task are detailed in Table 1.

Figure 1.

Flowchart documenting number of records found and reviewed at each stage.

Table 1.

Summary of ML Models Chosen as Most Optimal for the Clinical Task, and the Types of Data Used (ML models and data types sorted alphabetically).

Task	No. Studies	Chosen ML Models	Data Types Used
Disease Course	22	Bayes Network, Boosting, Decision Tree, Hierarchical Clustering, Neural Network, Partial Least Squares Discriminant Analysis, Random Forest, Regression, Support Vector Machine	Clinical, Gene Expression, Genetic, Imaging, Metabolomic, Metatranscriptomic, Microbiome
Diagnosis	18	Boosting, Hierarchical Clustering, Neural Network, Random Forest, Regression, Support Vector Machine	Gene Expression, Genetic, Imaging, Metabolomic, Microbiome
Disease Severity	16	Bayes Network, Boosting, Decision Tree, Hierarchical Clustering, Intelligent Monitoring, Neural Network, Regression, Support Vector Machine	Clinical, Gene Expression, Genetic, Imaging, Protein Biomarkers
Disease Subtype	8	Boosting, Hierarchical Clustering, Random Forest, Similarity Network Fusion Clustering, Support Vector Machine	Clinical, Gene Expression, Metabolomic, Microbiome
Treatment Response	7	Neural Network, Random Forest	Clinical, Gene Expression, Microbiome
Risk of Disease	6	Ensemble Model, Random Forest, Regression	Clinical, Gene Expression, Genetic
Patient Clustering	4	Gaussian Mixture Model, Hierarchical Clustering, Latent Dirichlet Allocation, Neural Network	Immunoassay, Metagenomic, Online Posts, Questionnaire
Medication Adherence	1	Support Vector Machine	Clinical
Metabolite Abundance	1	Sparse Neural Encoder-Decoder Network	Metabolomic, Microbiome
Identification of Patients	1	Natural Language Processing	Clinical

Of 78 studies included in the systematic review, the majority used supervised ML, with 4 articles employing unsupervised methods,[11-14] and 5 utilizing both supervised and unsupervised ML[15-19] for varied clinical applications. Many articles trialed different ML methods before selecting the optimal one, and some researchers implemented ML for multiple IBD applications. Three main clinical application areas were identified: diagnosis (23%),[15,20-36] disease course (28%),[15,30,37-56] and disease severity (21%).[19,57-71] Diagnosis classification tasks involved differentiating IBD patients (or one subtype) from controls. Studies of disease course examined relapse, remission, and surgery ML classifiers. Disease severity studies sought to predict patients’ IBD activity or those who may develop complications. The most prevalent method implemented was random forest (47%), with regressions, neural networks, and support vector machines also used regularly (31%, 28%, and 27%, respectively); these percentages sum to over 100% because many studies trialed multiple methods. Other tree-based methods were used by 22% of studies (13% boosting with trees, 9% decision trees). Clinical data (41%) and data related to the microbiome (23%) were the most commonly used in ML modeling. The median sample size, not including external validation data sets that were additional to usual training and testing data, was 263 (range, 12-7 400 000). A breakdown of sample sizes per ML method used can be viewed in Figure 2.
Validation data sets in addition to the expected training and testing sets were used in 5% of studies.[12,32,41,72] Another 7 studies trained their models with cross-validation on one data set and tested their method on an external, independent data set.[23,29,45,59,61,66,73] Crohn’s disease data (only) was used in 27 studies,[12,17,19,26-29,32,37-39,41,44,47-50,58,60,63,65,74-79] and UC data (only) was used in 15 studies,[25,40,42,46,52,55,59,61,62,64,68-70,80,81] with the remainder (n = 36) using a mix of CD and UC data, or IBD data as one class.[11,13-16,18,20-24,30,31,33-36,43,45,51,53,54,56,57,66,67,71-73,82-88] Half of the research using UC-only data focused on predicting disease activity with endoscopy data, whereas the aims of ML classifications on CD data were varied. A breakdown of the method and classification task can be found in Figure 3, which can be customized here (isstafford.github.io/review_ml_ibd_2021/). More details regarding each study included in this review are found in Supplementary Table 1.
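The headline proportions can be checked against the counts reported in the abstract and in the text above. The sketch below is a simple arithmetic check; the count of 4 studies using an additional validation data set is an assumption inferred from the 4 citations given for that statement.

```python
TOTAL_STUDIES = 78

# Counts per main clinical task, as reported in the abstract.
task_counts = {"diagnosis": 18, "disease course": 22, "disease severity": 16}
task_percent = {task: round(100 * n / TOTAL_STUDIES) for task, n in task_counts.items()}

# Clinical populations: CD only, UC only, and mixed or IBD as one class.
population_counts = {"CD only": 27, "UC only": 15, "mixed or IBD as one class": 36}
population_total = sum(population_counts.values())

# Assumption: 4 studies, inferred from the 4 references cited for this result.
external_validation_percent = round(100 * 4 / TOTAL_STUDIES)
```

The rounded values reproduce the reported 23%, 28%, 21%, and 5%, and the three population groups account for all 78 studies.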

Figure 2.

Sample sizes used for each group of machine learning methods. Abbreviations: BN, Bayes Network; DT, decision tree; NN, neural network; RF, random forest; SVM, support vector machine. Note that 10 outlier entries (sample sizes 20 368 to 7 400 000) have been excluded from the visualization.

Figure 3.

Sunburst of machine learning methods and the classification tasks used in conjunction with them.

The literature search assessed in the previous systematic review was completed on December 18, 2018; therefore, comparisons were made between studies published before and during 2018 and those published from 2019 to the literature search date. Fifty-three articles have been published from 2019 to May 2021. If a publication was published online and subsequently printed in a different year, the first publication date was used. Since the end of 2018, there has been a rapid expansion in the use of neural networks (a deep learning method) for IBD, with 21 studies trialing this method on their data from 2019 onwards, compared with 1 study prior to this. This increase coincides with more imaging data sets (4% in 2007-2018 vs 18% in 2019-2021), specifically colonoscopy data; the majority of neural networks were applied to this data type. Support vector machine, random forest, and regression-based methods were popular during both time periods (year on year breakdown of ML method group use, Figure 4). More studies utilized 2 data types in 2019-2021 (8% vs 17%), almost always combining clinical data with another data type. The median sample size of studies has not increased in recent years (N = 273 in 2007-2018, N = 257.5 in 2019-2021). Diagnosis has continued to be a popular ML application. Investigating treatment response was more popular prior to 2019 (24% vs 1.8%), whereas classification tasks connected to disease course are now the most popular application (12% vs 35.8%).

Figure 4.

Implementation of machine learning methods over time; incomplete data for 2021.

Discussion

The increased use of ML methods for IBD demonstrates the wider interest in artificial intelligence for health care. Due to the heterogeneity of ML model workflows, data types and reported metrics, it was not possible to ascertain any superior approaches. It is possible that some studies may have been excluded from the review, as Medical Subject Headings (MeSH) were not utilized in the search strategy. However, when the ML subject heading was expanded, the only algorithm specified as a search term was “support vector machine,” which could have biased the search strategy towards only identifying additional articles that used this classifier. An additional limitation was the search of only 2 databases, as the systematic search focused on capturing models with clinical application. An assessment of the risk of bias was not performed, as there is no clear equivalent of PROBAST (Prediction model Risk Of Bias ASsessment Tool) to assess ML modeling. The construction of a tool that could assess potential ML pipeline bias would be beneficial for the transition of models into clinical settings. Minimizing bias in modeling and creating generalizable models go hand in hand. There is a clear dominance of tree-based methods: one or a combination of random forests, decision trees, and tree-based boosting methods were implemented by 55% of studies. This is potentially due to decision trees being highly interpretable, with tree boosting and random forest preventing overfitting of this model type. Random forests are also well known as an ML algorithm that can leverage nonlinear relationships. This popularity is not inherently a drawback; however, a lack of comparison of different ML methods or a lack of reporting of this comparison in studies may make developing ML models for clinical application more challenging. Overall, there was good reporting of a range of informative metrics in these studies, which was particularly important given many of their data sets had imbalanced classes. 
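Why a range of metrics matters for imbalanced classes can be shown with a small worked example. The data below are hypothetical (95 majority-class and 5 minority-class patients), and the classifier is assumed to predict the majority class for everyone.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


# Hypothetical imbalanced data set: 95 negatives (majority), 5 positives (minority).
y_true = [0] * 95 + [1] * 5
# A trivial model that always predicts the majority class.
y_pred = [0] * 100

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)        # 0.95, despite no minority case being found
sensitivity = tp / (tp + fn)              # 0.0: every minority-class patient is missed
specificity = tn / (tn + fp)              # 1.0
balanced_accuracy = (sensitivity + specificity) / 2  # 0.5, no better than chance
```

Reporting sensitivity, specificity, or balanced accuracy alongside accuracy exposes the failure that the accuracy figure alone would hide.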
In cases where imbalanced data are used, a high accuracy score can mask poor prediction of the minority class. Although some studies sought to correct these imbalances with algorithm weighting[82] or oversampling of the minority class,[48,64] some researchers did not explicitly address this. Imbalanced data sets may be representative of the patient population, but it is important to consider whether enough samples from both classes were present in ML algorithm training, such that accurate predictions can be made for both the majority and minority class. Another potential ML pipeline issue discovered here was a use of feature selection on the whole data set, rather than just the training set. With this workflow, there is a danger that information regarding features in the test set leaks into the training set. Improvements to ML pipelines can only benefit patients, as more robust workflows allow the identification of the most successful models. Although an increase in the use of external data sets in the expected workflow of training and testing on one data set and validating on an independent data set was not observed, other interesting approaches were employed, showing researchers bringing data sets together to extract additional information. Some studies used all of their initial data set to train their model with cross-validation and subsequently tested the model on a different external data set.[23,29,45,59,61,66,73] Others used a type of cross-validation called LODO, or leave-one-data set-out, allowing researchers to utilize many, smaller data sets.[27,84] The range of overall data set sizes used in studies was large. Some of the smaller sample sizes used may not have been appropriate for the chosen ML method, although evaluating whether there was sufficient data to construct a classifier can be challenging; and the required sample size is contingent on the ML task. There is no standard power calculation available for studies using ML. 
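The pipeline issue around feature selection can be made concrete with a short sketch in which the split happens first and selection sees the training rows only. The ranking rule (absolute difference in class means), the function names, and the toy data are all assumptions chosen for illustration; they are not taken from any study in the review.

```python
from statistics import mean


def select_top_features(X, y, n_keep):
    """Rank features by absolute difference in class means; return the top n_keep indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])


def split_then_select(X, y, test_idx, n_keep):
    """Hold out the test rows first, then select features using training rows only."""
    held_out = set(test_idx)
    train_idx = [i for i in range(len(X)) if i not in held_out]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Selection never sees the test rows, so no test-set information leaks into training.
    keep = select_top_features(X_train, y_train, n_keep)
    X_train_sel = [[row[j] for j in keep] for row in X_train]
    # The test rows are only projected onto the features chosen from the training data.
    X_test_sel = [[X[i][j] for j in keep] for i in test_idx]
    return keep, X_train_sel, X_test_sel


# Hypothetical toy data: 6 samples, 3 features, binary labels.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 4.8, 0.1],
    [0.9, 5.1, 0.3],
    [3.0, 5.0, 0.2],
    [3.1, 4.9, 0.1],
    [2.9, 5.2, 0.3],
]
y = [0, 0, 0, 1, 1, 1]
keep, X_train_sel, X_test_sel = split_then_select(X, y, test_idx=[2, 5], n_keep=1)
```

With cross-validation, the same principle applies inside every fold: selection is repeated on each training partition rather than performed once on the whole data set.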
The sample size required depends on the method used, with algorithms such as neural networks requiring more data. This trend was observed in the systematic review: larger data sets were used in conjunction with neural networks. The number of features used for modeling will also affect the required sample size. More features will generally produce a more complex model, so a larger data set is necessary. If the ML model has generalized well from training to testing data (or other independent data), this is a good indicator the data set was sufficient. It is also important to consider how representative the data set is of the patient population. An ML model may perform well in initial training and testing, but if the data set is biased in the demographics or phenotypes represented, then the modeling may translate poorly when implemented in clinical settings. Although diagnosis (classifying controls and IBD patients) is still a popular application in 2021, it was encouraging to see the highest percentage of articles addressing issues surrounding disease course in recent years. This suggests that more longitudinal and deeper phenotyping data are being collected, allowing a move towards more precise and complex classification tasks. The median size of data sets has not grown in recent years. Although data set size is not an indicator of data quality, it is surprising that even though we are in the era of “big data,” data set sizes are not increasing at a rate they might be expected to. A potential roadblock in garnering larger data sets for more specific classifiers may be linking up these other data types with phenotyping data. A community effort may be necessary to accumulate sufficient data sets for more accurate and generalizable ML models and external validation. 
Despite the uncontested power of ‘omics data sets in providing a usually unbiased representation of a patient profile, detailed clinical information remains fundamental for precise phenotyping and patient stratification. Projects such as UK Biobank[89] have gone some way towards meeting this need for data, but phenotyping can be limited. With these data, robust pipelines, and models that generalize well in place, the community can take the next step towards personalized medicine for IBD patients. Ways to assess the generalizability of ML models are addressed in the Appendix.
