Biljana Stankovic, Nikola Kotur, Gordana Nikcevic, Vladimir Gasic, Branka Zukic, Sonja Pavlovic.
Abstract
Research on inflammatory bowel disease (IBD) has identified numerous molecular players involved in disease development. Even so, the understanding of IBD remains incomplete, and disease treatment is still far from precision medicine. Reliable diagnostic and prognostic biomarkers in IBD are limited, which may compromise therapeutic outcomes. High-throughput technologies and artificial intelligence have emerged as powerful tools in the search for unrevealed molecular patterns that could give important insights into IBD pathogenesis and help address unmet clinical needs. Machine learning, a subtype of artificial intelligence, uses complex mathematical algorithms to learn from existing data in order to predict future outcomes. The scientific community has been increasingly employing machine learning for the prediction of IBD outcomes from comprehensive patient data: clinical records and genomic, transcriptomic, proteomic, metagenomic, and other IBD-relevant omics data. This review aims to present the fundamental principles behind machine learning modeling and its current application in IBD research, with a focus on studies that explored genomic and transcriptomic data. We describe different strategies used for dealing with omics data and outline the best-performing methods. Before being translated into clinical settings, the developed machine learning models should be tested in independent prospective studies as well as randomized controlled trials.
Keywords: IBD; artificial intelligence; genomics; prediction modeling; transcriptomics
Year: 2021 PMID: 34573420 PMCID: PMC8466305 DOI: 10.3390/genes12091438
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Machine learning using omics data for prediction of clinically relevant IBD outcomes. Omics data from patients with known clinical outcomes (exploratory cohort) can be used as input data in machine learning algorithms during the prediction model training. Performance of the designed model is further assessed on an independent group with unknown outcomes (testing cohort). Machine learning models that have high prediction performance on the testing cohort are well fitted and could be employed for future improved patients’ diagnosis, classification, prognosis and prediction of drug response. ML: machine learning; CD: Crohn’s disease; UC: ulcerative colitis; IBD: inflammatory bowel disease.
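The workflow in Figure 1 (train on an exploratory cohort, then assess on a held-out testing cohort) can be sketched in a few lines of Python. This is a minimal illustration only: the data are synthetic, the two "features" are hypothetical, and the nearest-centroid classifier is a stand-in chosen for brevity, not a method taken from the reviewed studies.

```python
import random

def split_cohort(instances, test_fraction=0.25, seed=0):
    """Shuffle instances and split them into exploratory (training) and testing cohorts."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_centroids(train):
    """Toy model (assumption): the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))

def accuracy(centroids, cohort):
    """Fraction of instances in the cohort that are classified correctly."""
    hits = sum(predict(centroids, f) == y for f, y in cohort)
    return hits / len(cohort)

# Synthetic cohort (hypothetical): two features per subject, two outcome classes.
rng = random.Random(42)
cohort = [([rng.gauss(0, 1), rng.gauss(0, 1)], "control") for _ in range(100)]
cohort += [([rng.gauss(3, 1), rng.gauss(3, 1)], "IBD") for _ in range(100)]

train, test = split_cohort(cohort)
model = fit_centroids(train)          # training uses the exploratory cohort only
print("training accuracy:", accuracy(model, train))
print("testing accuracy:", accuracy(model, test))  # outcomes withheld during training
```

A large gap between training and testing accuracy would suggest overfitting; similar values on both cohorts are what the caption describes as a well-fitted model.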
Glossary of common terms in machine learning.
| Term | Definition |
|---|---|
| Instance | An entity (a human subject in healthcare applications) whose features are used as inputs for prediction modeling. |
| Feature | An explanatory variable, such as a genetic variant or a gene expression level. |
| Machine learning algorithm | A procedure that is run on data to create a machine learning model. It is a set of mathematical optimization functions that minimizes the error of the model function. |
| Iterations | The number of times the parameters of a machine learning algorithm are updated until the model reaches the desired performance. |
| Classification | Supervised learning technique used to predict a discrete class or category of an instance (disease or healthy subject, good or poor drug responder, etc.). |
| Regression | Supervised learning technique in which the predicted variable is continuous. |
| Model fitting | A measure of how well a machine learning model generalizes to data not used for model training. |
| Penalized regression method | A method used to reduce overfitting of a model. The penalty causes the regression coefficients of less contributive variables to shrink toward zero, thereby reducing the number of variables in the model. |
| Sparse model | A predictive model that includes only the most informative features. |
| Clustering | Unsupervised learning technique that groups instances by their similarity. The groups are called clusters. |
| Black box model | A model built on complex functions that are not easily interpreted (such as neural networks). The input and output are clear, but the process between them is not easily explained. |
| Effect size | A biological measure of the difference or relationship between variables. |
| AUC value | An evaluation metric of a model that typically ranges from 0.5 (a classifier no better than chance) to 1 (a perfect classifier). |
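To make the AUC entry concrete, the sketch below computes AUC from its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The scores and labels are made up for illustration, and the quadratic pairwise loop is chosen for readability rather than efficiency.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]  # hypothetical outcomes: 1 = disease, 0 = healthy

print(auc([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels))  # perfect ranking -> 1.0
print(auc([0.5, 0.5, 0.5, 0.5, 0.5, 0.5], labels))  # uninformative scores -> 0.5
print(auc([0.9, 0.4, 0.7, 0.6, 0.2, 0.1], labels))  # one positive ranked below a negative
```

Values below 0.5 are possible when a model ranks instances worse than chance, which usually indicates inverted labels or scores.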
Figure 2. Classification of machine learning algorithms used in IBD research. LASSO: least absolute shrinkage and selection operator; SVM: support vector machines.
Classification and regression machine learning algorithms employed in IBD research.
| Algorithm | Principle | Usage | Pros and Cons |
|---|---|---|---|
| Logistic regression | Linear model transformed into sigmoid function used as a binary classifier | Classification | Fast to develop; easily interpretable; limited by strong assumptions; prone to overfitting |
| Linear regression | Classical linear model that employs linear codependency for prediction | Regression (can also be used for classification) | Fast to develop; easily interpretable; limited by strong assumptions; prone to overfitting |
| Ridge regression | Linear model with L2 regularization | Classification and regression | Linear model with enhanced interpretability and reduced overfitting |
| LASSO | Linear model with L1 regularization | Classification and regression | Linear model with enhanced interpretability and reduced overfitting |
| Elastic net | Linear model with both L1 and L2 regularization | Classification and regression | Linear model with enhanced interpretability and reduced overfitting |
| Decision trees | Prediction based on a tree-like model. Nodes are splitting points of a dataset based on most informative features; leaves are output values. | Classification and regression | Prone to overfitting but can be improved with ensemble methods; interpretable outputs |
| Random forest | An ensemble method (modified bootstrap aggregation) applied to decision trees. It grows multiple decision trees; output is the average prediction of individual trees. | Classification and regression | High prediction performance; deals with overfitting; requires a large dataset for optimal learning. |
| Gradient boosted trees (GBT) | An ensemble method (gradient boosting) applied to decision trees | Classification and regression | High prediction performance; hard-to-tune parameters of the algorithm |
| K nearest neighbors (KNN) | Predicts an output taking into account (k) most similar instances (nearest neighbors) | Classification and regression | Requires a lot of memory to store all the instances; cannot deal with a large number of input variables. |
| Support vector machines (SVM) | Maximizes margin (decision boundary) between different classes supported by instances that lie near the margin (support vectors) | Classification | Works well with a high number of input variables; flexible (allows a curved margin by using nonlinear kernels); computationally expensive; limited interpretability |
| Naïve Bayes | Employs Bayes’ theorem to compute posterior probabilities but assumes independence between features given the output | Classification | Fast to develop; suitable for large datasets and for making real-time predictions; limited by strong assumptions; requires feature selection and transformation |
| Neural networks | Network of interconnected units resembling the nervous system which renders input information to produce an output. | Classification and regression | High performance; limited interpretability; requires very large dataset; computationally expensive |
LASSO—least absolute shrinkage and selection operator.
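The difference between L1 and L2 regularization in the table can be illustrated with a small sketch. It relies on a strong simplifying assumption, an orthonormal design matrix, under which the penalized solutions have closed forms in terms of the unpenalized (least-squares) coefficients. The coefficient values and penalty strengths below are hypothetical.

```python
def soft_threshold(b, lam):
    """Shrink b toward zero by lam; values within [-lam, lam] become exactly zero."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def lasso_shrink(coefs, lam):
    """L1 penalty (orthonormal design assumed): soft-threshold each coefficient."""
    return [soft_threshold(b, lam) for b in coefs]

def ridge_shrink(coefs, lam):
    """L2 penalty (orthonormal design assumed): scale each coefficient by 1 / (1 + lam)."""
    return [b / (1.0 + lam) for b in coefs]

def elastic_net_shrink(coefs, lam1, lam2):
    """Combined L1 and L2 penalty (naive elastic net): soft-threshold, then scale."""
    return [soft_threshold(b, lam1) / (1.0 + lam2) for b in coefs]

# Hypothetical least-squares coefficients: two strong and three weak features.
ols = [2.0, -1.5, 0.3, -0.1, 0.05]

lasso = lasso_shrink(ols, 0.5)
ridge = ridge_shrink(ols, 0.5)
enet = elastic_net_shrink(ols, 0.5, 0.5)

print("LASSO:      ", lasso)  # weak features are set exactly to zero (sparse model)
print("Ridge:      ", ridge)  # every coefficient shrinks, none becomes zero
print("Elastic net:", enet)   # sparse like LASSO, with additional shrinkage
```

In this toy setting only the L1 term removes features outright, which is why LASSO and elastic net are often used for feature selection in high-dimensional omics data, while ridge regression keeps every feature with a smaller weight.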
Studies that explored machine learning for designing IBD prediction models using genomic and transcriptomic data.
| First Author and Year [ref] | Machine Learning Algorithm | Predictors/Prediction | Performance | Tested on Independent Cohort | Subjects |
|---|---|---|---|---|---|
| Chen 2017 [ | Bayesian mixture approach | GWAS or Immunochip SNPs data/IBD risk score | CD AUC: 0.75, UC AUC: 0.70 | yes | The IIBDGC cohort: over 68,000 IBD patients and 29,000 healthy controls (4:5 ratio for training and testing, respectively) |
| Wei 2013 [ | L1 penalized logistic regression, SVM, gradient boosted trees | Immunochip SNPs data/CD and UC distinction from healthy controls | CD AUC 0.86, | yes | The IIBDGC cohort—~17,000 CD, ~13,000 UC, and ~22,000 controls (randomly divided into 3 folds of equal size for preselection, training and testing, respectively) |
| Romagnoni 2019 [ | Logistic regression, gradient boosted trees, neural network and ensemble method | Immunochip SNPs data/probability of CD | AUC 0.8 | yes | The IIBDGC cohort—train dataset (34,634 samples), test dataset (17,317 samples) |
| Pal 2017 [ | Naïve Bayes | Exome data/CD status | AUC 0.81 | yes | Training set: 64 CD and 47 controls (CAGI4); Testing set: 51 CD and 15 controls (CAGI3) |
| Raimondi 2020 [ | Neural network | Whole exomes/to distinguish between CD and healthy controls | AUC 0.74–0.83 AUPRC 0.81–0.93 | yes | CAGI2, CAGI3, CAGI4 datasets (training and testing) |
| Wang 2019 [ | SVM | Whole exomes/to distinguish between CD and healthy controls | AUC 0.7–0.75 AUPRC 0.73–0.80 | yes | CAGI4 (training set), CAGI3 (testing set) |
| Isakov 2017 [ | Random forest, SVM with polynomial kernel, extreme gradient boosting, elastic net and ensemble method | Data from 2050 genes annotated by the expression (array and RNAseq) and pathway information (categorical terms)/IBD-risk gene prioritization | AUC 0.775–0.829 | yes | Intestinal biopsies of 180 CD, 149 UC, 94 colorectal neoplasms, 90 normal tissue (75:25 ratio for training and testing set, respectively) |
| Cushing 2018 [ | Unsupervised hierarchical clustering, random forest | Whole transcriptome/identification of markers that could predict postoperative disease activity | 92–93% of correct estimates in random forest | no | 24 anti-TNFα-naïve |
| Khorasani 2020 [ | Feature selection algorithm | Wide expression array data/UC and healthy subjects classification | Active UC AUPRC 1, | yes | Training set: 39 UC samples (active and inactive) and 38 controls; testing set: 97 UC samples (active and inactive) and 22 controls |
| Yuan 2017 [ | Feature selection (minimum | Wide expression array data from PBMC samples/CD, UC and normal subject discrimination and candidate gene selection | Accuracy 0.94 | no | 59 Crohn’s disease, 26 ulcerative colitis, and 42 normal samples |
| Hubenthal 2015 [ | Penalized SVM, random forest | miRNAs in whole-blood samples/IBD and control subject distinction | AUC 0.75–1.0 | no | 40 CD, 36 UC, 38 healthy controls and other inflammation controls (24 chronic obstructive pulmonary disease, 23 multiple sclerosis, 38 pancreatitis and 45 sarcoidosis cases) |
| Zarringhalam 2014 [ | Differential expression profile was used to infer upstream regulators using Bayesian approach; posterior probabilities of regulators’ activities were then used in a regularized regression framework to predict outcome | Genome wide expression data/response to infliximab in UC | Accuracy 0.79 | yes | Training set: 22 active UC patients (12 responders and 10 nonresponders); Testing set: 24 active UC patients (8 responders and 16 nonresponders) |
| Li 2020 [ | Random forest, neural network | RNAseq and microarray expression data/identification of susceptibility genes and establishing | AUC 0.95; AUPRC 0.97 | yes | Training set: 206 UC, 20 normal; Testing set: 53 UC and 21 normal |
| Martin 2019 [ | Hierarchical clustering, principal component analysis | Single-cell RNA sequencing data/cell type classification in inflamed and uninflamed tissues | Inflamed tissue (r = 0.96) | no | 11 ileal CD patients; samples taken from inflamed and uninflamed tissues |
GWAS: genome-wide association study; IBD: inflammatory bowel disease; CD: Crohn’s disease; UC: ulcerative colitis; SVM: support vector machine; AUC: area under the receiver operating characteristic curve; AUPRC: area under the precision-recall curve; IIBDGC: International Inflammatory Bowel Disease Genetics Consortium; *: correlation of cell type frequencies between hierarchical clustering analysis applied to the RNA profile of a cell and cytometry results referring to that cell.
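Several studies in the table report AUPRC alongside AUC. As a rough illustration, the sketch below estimates AUPRC with average precision, one common estimator; the reviewed studies may have used a different interpolation. Scores and labels are invented for the example.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision values at each positive's rank.

    Instances are ranked by decreasing score. Tied scores keep their input
    order here, so results with many ties depend on that order.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive instance")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` instances
    return total / n_pos

labels = [1, 1, 1, 0, 0, 0]  # hypothetical outcomes: 1 = disease, 0 = healthy

perfect = average_precision([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels)
imperfect = average_precision([0.9, 0.4, 0.7, 0.6, 0.2, 0.1], labels)
print(perfect)    # all positives ranked first -> 1.0
print(imperfect)  # one positive ranked below a negative, so the value is below 1
```

Unlike AUC, the chance-level value of AUPRC is roughly the fraction of positive instances in the cohort, so AUPRC values are best compared between studies with similar class balance.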