Assessment of Machine Learning Detection of Environmental Enteropathy and Celiac Disease in Children.

Sana Syed¹,², Mohammad Al-Boni³, Marium N Khan¹, Kamran Sadiq², Najeeha T Iqbal², Christopher A Moskaluk⁴, Paul Kelly⁵,⁶, Beatrice Amadi⁶, S Asad Ali², Sean R Moore¹, Donald E Brown⁷.

Abstract

Importance: Duodenal biopsies from children with enteropathies associated with undernutrition, such as environmental enteropathy (EE) and celiac disease (CD), display significant histopathological overlap. Objective: To develop a convolutional neural network (CNN) to enhance the detection of pathologic morphological features in diseased vs healthy duodenal tissue. Design, Setting, and Participants: In this prospective diagnostic study, a CNN consisting of 4 convolutions, 1 fully connected layer, and 1 softmax layer was trained on duodenal biopsy images. Data were provided by 3 sites: Aga Khan University Hospital, Karachi, Pakistan; University Teaching Hospital, Lusaka, Zambia; and University of Virginia, Charlottesville. Duodenal biopsy slides from 102 children (10 with EE from Aga Khan University Hospital, 16 with EE from University Teaching Hospital, 34 with CD from University of Virginia, and 42 with no disease from University of Virginia) were converted into 3118 images. The CNN was designed and analyzed at the University of Virginia. The data were collected, prepared, and analyzed between November 2017 and February 2018. Main Outcomes and Measures: Classification accuracy of the CNN per image and per case and incorrect classification rate identified by aggregated 10-fold cross-validation confusion/error matrices of CNN models.
Results: Overall, 102 children participated in this study, with a median (interquartile range) age of 31.0 (20.3-75.5) months and a roughly equal sex distribution, with 53 boys (51.9%). The model demonstrated 93.4% case-detection accuracy and had a false-negative rate of 2.4%. Confusion metrics indicated most incorrect classifications were between patients with CD and healthy patients. Feature map activations were visualized and learned distinctive patterns, including microlevel features in duodenal tissues, such as alterations in secretory cell populations. Conclusions and Relevance: A machine learning-based histopathological analysis model demonstrating 93.4% classification accuracy was developed for identifying and differentiating between duodenal biopsies from children with EE and CD. The combination of the CNN with a deconvolutional network enabled feature recognition and highlighted secretory cells' role in the model's ability to differentiate between these histologically similar diseases.

Year: 2019 PMID: 31199451 PMCID: PMC6575155 DOI: 10.1001/jamanetworkopen.2019.5822

Source DB: PubMed Journal: JAMA Netw Open ISSN: 2574-3805

Introduction

The interpretation of clinical biopsy images for disease diagnoses can be challenging when clinicians are faced with distinguishing between distinct but related conditions. Recently, increasing attention has been paid to methods in artificial intelligence that help clinicians to translate big data (ie, biomedical images and patient biosample data) into accurate and quantitative diagnostics.[1] To our knowledge, most computer modeling enhancements in health care, particularly in image analysis, have focused on feature engineering, ie, asking a computer to evaluate prespecified, explicit image features to permit computational algorithms to detect disease or specified lesions. In contrast, deep learning or a convolutional neural network (CNN) is a form of artificial intelligence that includes machine learning techniques designed to process data and interpret it (eg, by detecting and segmenting multiple pixel intensities within a single image and labeling features at a pixel-by-pixel level).[2,3] Machine learning and its subtypes (eg, CNNs) are an extension of the traditional tools and methods of statistical analysis (eg, linear regression, comparative t tests).[4] In 2017, Ehteshami Bejnordi et al[1] demonstrated the use of deep learning algorithms to interpret whole-slide pathology images. They used annotated images of metastases in lymph node biopsies to train various algorithms and showed that some algorithms achieved better diagnostic performance compared with a panel of trained pathologists.[1] We hypothesized that deep learning algorithms for pathology slide evaluation could recognize complex disease phenotypes that cannot be measured via molecular approaches and are dependent on tissue diagnostics. We wanted to develop diagnostic methods to enable us to correlate patient-level numerical metadata, including biomarkers, with invasively obtained data (eg, tissue biopsies). 
Our specific focus was on pediatric undernutrition, which is estimated to cause approximately 45% of the 5 million deaths annually in children younger than 5 years worldwide.[5] There are many manifestations of early childhood undernutrition; stunting (linear growth failure, length-for-age z score <−2) is among the most common, affecting approximately 155 million children younger than 5 years.[6] Stunting is a clinical marker for devastating, sometimes irreversible, deficiencies, which have adverse cognitive, physical, immunologic, and socioeconomic effects.[7,8,9,10] A common cause of stunting in the United States is celiac disease (CD), with an estimated 1% prevalence.[11] Celiac disease is an immune-mediated, small-bowel enteropathy triggered by gluten sensitivity in people with genetic susceptibility.[12] Environmental enteropathy (EE), another similar but distinct condition, is thought to be a key factor underlying stunting in children residing in low-income and middle-income countries.[13,14] Environmental enteropathy is an acquired small-intestinal condition that is proposed to be a consequence of the continuous burden of immune stimulation by fecal-oral exposure to enteropathogens, leading to persistent, nonspecific chronic inflammation.[15,16,17] Environmental enteropathy and CD have been described as overlapping enteropathies.[17,18,19,20,21] Currently, the standard diagnostic criterion for these diseases is the evaluation of a small-intestinal biopsy obtained via an endoscopic procedure, which requires sedation.[17,22] Between 4 and 6 biopsies are required for diagnosis,[23] and because only parts of the bowel are affected in some cases, patients may require multiple endoscopic procedures. Therefore, there is a need to develop methods that allow feature extraction from biopsies of children with undernourishment and stunting for further analysis, ie, correlation with numerical metadata.
This would aid traditional pathology-based diagnoses and pave the way for future gastrointestinal diagnostics that depend less on obtaining intestinal biopsies and more on noninvasive interpretations of intestinal histological features for both diagnosis and follow-up. The pathologist’s interpretation of histological tissue is a unique skill set; pathologists associate specific clinical information with biopsy findings to suggest a presumptive diagnosis.[24] However, it is difficult to translate this ability into quantifiable measurements that can be used for statistical analyses with other numerical metadata—most tissue measurements are subjective, whether they are morphometry, immunohistochemistry, or fluorescence intensity quantification.[25,26,27,28] We propose a deep learning–based image analysis platform for the automated extraction of quantitative morphologic phenotypes from gastrointestinal biopsy images to identify novel features that could be used to help differentiate between overlapping conditions (EE and CD). Our overarching aim is to develop methods in data science to support the integration of this data with clinical and molecular data, enabling the construction of biologically informative and clinically useful integrative prognostic models for pediatric undernutrition. The primary aim of the present study is to build and deconstruct a deep learning network, using unannotated images of duodenal biopsy slides, which would characterize intestinal mucosal alterations and distinguish between EE, CD, and healthy tissue. We hypothesized that advances in deconvolutional neural networks (DNNs) could mimic the pathologist’s skill set, including the ability to identify novel features in understudied diseases (eg, EE). 
Deconvolutional neural networks construct hierarchical image representations that are top-down projections representing structures that have stimulated particular feature maps.[29,30] They are hypothesized to be able to look for distinctive patterns in input images[29,30] and, therefore, could find key distinguishing features between many overlapping diseases, not just EE and CD.

Methods

This study is a prospective diagnostic study designed to develop and validate a predictive machine learning model for the interpretation of duodenal biopsy slides and feature detection in diseased vs healthy duodenal tissue. Its report follows the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline.[31] The data were collected, prepared, and analyzed from November 2017 to February 2018. This study was approved by the University of Virginia institutional review board (waiver of consent granted), the Ethical Review Committee of Aga Khan University in Karachi, Pakistan (informed consent obtained from parents and/or guardians), and the Biomedical Research Ethics Committee of the University of Zambia in Lusaka, Zambia (informed consent obtained from caregivers).

Image Analysis Data Sets

We obtained 3118 segmented images from 121 hematoxylin-eosin (H-E)–stained duodenal biopsy glass slides from 102 patients, labeled as EE, CD, or control. Primary study physicians at each site made all diagnoses based on histological and clinical findings. Biopsy slides for patients with EE were obtained from the Aga Khan University Hospital (29 slides from 10 patients) and the University of Zambia Medical Center (16 slides from 16 patients). Biopsy slides for patients with CD (34 slides from 34 patients) and the control group (42 slides from 42 patients) were obtained from the Biorepository and Tissue Research Facility at the University of Virginia (eAppendix 1 in the Supplement).

Biological Sample Collection

Data on various biomarkers obtained from blood, urine, and/or fecal samples of patients from Pakistan and Zambia with EE were also obtained. The complete details of the acquisition and handling of the specimens have been outlined in articles by Iqbal et al[32] (for patients in Pakistan) and Amadi et al[33] (for patients in Zambia). These biomarkers were used to propose an algorithmic framework to correlate this numerical metadata with biopsy features. Since limited and variable biomarkers were obtained in each of these studies, biological inferences could not be made from our results. Therefore, our primary goal was to develop a correlation algorithm to test in a larger, more comprehensive data set.

Statistical Analysis

Descriptive statistics were performed using the R coding language and RStudio (The R Institute). Anthropometric measurements and age were used to calculate median and interquartile range z scores for length for age and height for age, weight for age, and weight for height. These were calculated using reference guidelines and R application macros available through the World Health Organization.[34,35] Weight-for-age z scores were only calculated for children younger than 10 years because no reference guideline exists for patients older than 10 years owing to the discrepancies in height and body mass index because of pubertal changes.[35]
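As an illustration of how such z scores are formed, the World Health Organization growth references describe each indicator with 3 age-specific and sex-specific parameters (L, M, and S). The sketch below is not the WHO macro used in this study; the parameter values are made-up placeholders rather than entries from a WHO table, and the adjustment that WHO applies to weight-based z scores in the extreme tails is omitted.

```python
import math

def lms_z_score(measurement, L, M, S):
    """z score of a measurement under the LMS method.

    L (Box-Cox power), M (median), and S (coefficient of variation) come from
    an age- and sex-specific reference table; the values used below are
    hypothetical placeholders.
    """
    if L == 0:
        return math.log(measurement / M) / S
    return ((measurement / M) ** L - 1.0) / (L * S)

def is_stunted(length_for_age_z):
    # Stunting as defined in the Introduction: length-for-age z score below -2.
    return length_for_age_z < -2.0

# Hypothetical reference values, for illustration only.
z = lms_z_score(measurement=80.0, L=1.0, M=87.0, S=0.035)
stunted = is_stunted(z)
```

A measurement equal to the reference median M yields a z score of 0 regardless of L and S.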

Convolutional Neural Network and DNN Framework

We proposed a CNN, based on AlexNet,[36] to perform multiclass classification on biopsy images (Figure 1). Importantly, while based on AlexNet,[36] which requires the least amount of training data vs deeper architectures (eg, VGG-16[37] and ResNet[38]), several variations to the architecture were made to reduce the number of trainable parameters to enable using a smaller data set. Variations included the use of (1) 1 convolution network pipeline vs 2, (2) 4 convolutions vs 5, and, most importantly, (3) 1 fully connected layer with 1024 neurons vs 2 fully connected layers with 4096 neurons. The network consisted of (1) 4 convolution layers, each followed by a rectified linear unit layer[39] and a max pooling layer; (2) 1 fully connected layer; (3) 1 dropout layer; and (4) 1 softmax layer.

Figure 1.

Illustration of Proposed Convolutional Neural Network Classification and Visualization Framework

The convolutional neural network consists of 4 convolution layers and 1 fully connected layer. Each convolution layer consists of 3 sublayers: (1) a convolution layer, (2) a rectified linear unit activation layer, and (3) a max pooling layer. Deconvolution layers increase image resolution and find locations with high activations. The input image represents a hematoxylin-eosin–stained duodenal biopsy image (original magnification ×100).

The 4 convolution layers had 16, 32, 32, and 32 feature maps, respectively, with pixel-by-pixel filter sizes of 5 × 5, 5 × 5, 5 × 5, and 3 × 3, respectively. Each convolution’s input layer was 0 padded, ensuring equal sizes of input and output. The max pooling layers’ window sizes were set to 2 × 2, 4 × 4, 5 × 5, and 5 × 5, respectively. The stride in convolution and max pooling layers was set to 1. Given an input of 1000 × 1000 × 3, the fourth max pooling layer would generate a 5 × 5 output for each feature map. The output of the 32 feature maps was flattened and concatenated before being connected to the fully connected layer. The dropout layer had a dropout probability of 0.5,[40] and the softmax layer generated 3 probabilities: (1) likelihood of image being healthy duodenal tissue, (2) likelihood of image being duodenal tissue with CD, or (3) likelihood of image being duodenal tissue with EE. An additional component was added to the network for high activation visualization, which is conceptually similar to a DNN.[29] However, we increased the resolution starting from a low-resolution output layer, and instead of deconvoluting the entire output layer, only the highest activation for each feature map was traced back to the source image. We trained 1 DNN model on the entire data set. Then, we visualized the patches with the highest activations from each of the 32 feature maps at layer 4 with respect to the different classes.
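The layer sizes above can be checked with simple shape arithmetic. The following sketch is not the study's code; it only traces output sizes and counts trainable parameters from the stated settings. It assumes that pooling windows do not overlap (the step equals the window size), because the reported 5 × 5 output from a 1000 × 1000 input follows from 1000 / (2 × 4 × 5 × 5) = 5, and it assumes 1 bias term per feature map or neuron.

```python
# Layer settings as stated in the text: (feature maps, filter size, pooling window).
CONV_LAYERS = [(16, 5, 2), (32, 5, 4), (32, 5, 5), (32, 3, 5)]
FC_UNITS = 1024
N_CLASSES = 3

def trace_shapes(side=1000, channels=3):
    """Return per-layer (maps, output side, parameters) and dense-layer sizes."""
    rows = []
    for maps, kernel, pool in CONV_LAYERS:
        # Zero padding keeps the convolution output the same size as its input.
        params = kernel * kernel * channels * maps + maps  # weights + biases
        # Assumption: pooling windows do not overlap (step == window size).
        side = side // pool
        rows.append((maps, side, params))
        channels = maps
    flat = side * side * channels
    fc_params = flat * FC_UNITS + FC_UNITS
    out_params = FC_UNITS * N_CLASSES + N_CLASSES
    return rows, flat, fc_params, out_params

rows, flat, fc_params, out_params = trace_shapes()
total = sum(p for _, _, p in rows) + fc_params + out_params
```

Under these assumptions, the single 1024-neuron fully connected layer holds most of the trainable parameters, which is consistent with the stated motivation for shrinking the fully connected part of AlexNet.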

Image Processing

We had 2 image sources: (1) digitized whole-slide images from Pakistan (single biopsy per slide) and Zambia (multiple biopsies per slide) and (2) scanned slide images at magnification ×40 and ×100 of CD and control slides (multiple biopsies per slide). Both image formats had relatively high resolutions (ie, ranging from 2288 × 1356 pixels to 18 304 × 14 926 pixels). However, although most discriminating features can only be observed at high resolution, it is impractical to input such high-resolution images into the network. Therefore, methods of artificial data augmentation, including segmentation, horizontal and vertical reflection of randomly selected patches, and γ correction, were used to produce a more practical input data set (eFigure 1 in the Supplement). Each biopsy was segmented into multiple 1360 × 1024 images. During testing, averages from all images from each patient were used for the final prediction. To adjust for artifact and hue variations between slides, color augmentation experiments were conducted, including γ correction, contrast-limited adaptive histogram equalization, and γ correction with contrast-limited adaptive histogram equalization.[41] A 10-fold cross-validation performance of the CNN found that the use of γ correction alone provided the best performance.[41] Therefore, γ correction was applied to the data set with a random γ value of 0.5 to 2.0 to account for appreciable intersite color differences in H-E staining. For each image, ten 1000 × 1000 patches and their horizontal and vertical reflections (for variability in feature orientation, such as villi, within the same and different images) were randomly selected. The size of our data set increased by a factor of 30 and enabled the algorithm to learn translation and rotation invariant features. As a result of data augmentation, approximately 85 000 images were included in each fold (approximately 76 059 for training and 8541 for testing). 
For testing, 15 patches were generated from each image: 1 central patch, 4 corner patches, and their horizontal and vertical reflections. Then, the average probability of EE was calculated from these patches. The likelihood of EE in a biopsy was computed as the mean of its segments’ estimated probabilities.

Feature Prediction t Tests

Zeiler and Fergus[29] and Zeiler et al[30] suggested that filters from the CNN would look for distinctive patterns in input images. The degree to which a pattern match is found in the input image is reflected by the activation value, and a higher value corresponds to a better match. Therefore, image patches were run from the different classes, and the maximum activation value per CNN filter was collected. A t test was run to test the hypothesis that activation values from a class (eg, EE) are significantly higher than those generated by other classes (eg, CD or control); this test served as a proxy for testing the prevalence of various tissue patterns in different biopsies (eFigure 2 in the Supplement).
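A comparison of this kind can be sketched as follows. The text does not state which t test variant was used; the unequal-variance (Welch) form is an assumption here, and the activation values are hypothetical. The sketch returns the test statistic and degrees of freedom; a 1-sided P value would then be read from the t distribution with a statistics package.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch t statistic and degrees of freedom for mean(a) > mean(b)."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical maximum activations of one filter, one value per image patch.
ee_activations = [0.9, 1.1, 1.0, 1.2, 0.8]
other_activations = [0.4, 0.6, 0.5, 0.7, 0.3]
t_stat, dof = welch_t(ee_activations, other_activations)
```

A large positive statistic for a given filter would suggest that the pattern it responds to is more prevalent in the first class than in the others.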

Lasso Regression Models for Biomarker Correlation

Lasso regression models were built to correlate EE biomarkers with activation maps and to predict EE biopsy features. The lasso model was chosen to add regularization to the regression to avoid overfitting and to promote sparsity in the feature space to reduce the number of input biomarkers used to predict biopsy features. In the EE studies from Pakistan and Zambia, each patient with EE was associated with various variables, including noninvasive biomarkers (from blood, urine, and stool) that described the patient’s clinical situation. Biomarkers have been used to detect EE and other gastrointestinal diseases, such as biliary atresia and inflammatory bowel disease.[42,43,44,45,46,47,48] However, to our knowledge, there has been no work on estimating or synthesizing biopsies from biomarkers. Only data from Pakistan were used for this biopsy-biomarker correlation framework, which consisted of 2 components: (1) CNN activations and (2) lasso regression (eFigure 3 in the Supplement). The correlation process involved 5 steps. First, for training, the CNN contained 32 different feature maps at the fourth layer; each feature map searched for a specific pixel pattern (ie, morphological feature) in the input images (derived from dividing biopsies into multiple images). Each image would produce 25 × 25 pixel convoluted values for each feature map. Second, a maximum operator function was performed on the activations across all the images per each biopsy. The maximum value of the final 625 convoluted values corresponded to a 142 × 142 pixel segment that maximally activated the feature map. Therefore, each biopsy was mapped onto 32 values that corresponded to 32 segments. Third, each case was associated with multiple variables. Thus, 32 lasso regression models were built to correlate variables with feature map activation values; 32 image segments of 142 × 142 pixels were extracted from each biopsy, and their activation values at layer 4 were correlated with biomarkers. 
Fourth, for testing, the trained lasso regression models estimated 32 activation values from the biomarkers. Rectified linear units generate nonnegative activations; therefore, response variables for the regression models only have positive values. However, because of the possibility that using the trained models for prediction may generate negative estimates, we removed negative estimates from the testing cases’ estimates. Fifth, the estimated values were used to obtain training image sections with similar activation values; these sections were used to reconstruct the testing image. The underlying assumption was that 2 regions from 2 images producing similar activation values would likely contain similar pixel patterns. The correlation model was evaluated with 2 approaches. First, for each 10-fold cross-validation testing case, we estimated 32 activations from biomarkers using the trained lasso regression models. Then, we applied the CNN models to these cases and computed the mean squared error (MSE) between estimated and actual activations. Next, random forest models were used to estimate each biomarker’s importance in predicting biopsy features. We then tested the predictive power of the subsets of the most important biomarkers. Starting with a model using all biomarkers, we sequentially removed the least important feature, retrained the model, and estimated the cross-validation per-image and per-biopsy MSE. Second, we applied the trained CNN models on testing cases and obtained the 32 regions that corresponded to the highest activation values for the 32 feature maps.
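The first 4 steps can be sketched in plain Python as follows. This is an illustration, not the implementation used in the study: the coordinate-descent solver, the penalty value, and the toy biomarker data are all assumptions, and replacing a negative estimate with 0 is only one reading of "removed negative estimates." In the framework described above, 1 such model would be fitted for each of the 32 feature maps.

```python
def max_activation_per_map(images):
    """images: list of images, each a list of feature maps (flat lists of
    activations). Returns, per feature map, the maximum over all positions
    in all images of the biopsy."""
    n_maps = len(images[0])
    return [max(max(img[m]) for img in images) for m in range(n_maps)]

def soft_threshold(value, penalty):
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def fit_lasso(X, y, alpha, n_sweeps=200):
    """Coordinate descent for (1/2n)||y - b - Xw||^2 + alpha * ||w||_1."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that ignores column j.
            rho = sum(
                Xc[i][j] * (yc[i] - sum(Xc[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(Xc[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / scale if scale > 0 else 0.0
    intercept = y_mean - sum(w[j] * x_mean[j] for j in range(p))
    return w, intercept

def predict_activation(weights, intercept, biomarkers):
    estimate = intercept + sum(wj * xj for wj, xj in zip(weights, biomarkers))
    # Assumption: a negative estimate is replaced by 0, because rectified
    # linear units cannot produce negative activations.
    return max(estimate, 0.0)

def mean_squared_error(estimated, actual):
    return sum((e - a) ** 2 for e, a in zip(estimated, actual)) / len(actual)

# Toy data: 4 patients, 2 hypothetical biomarkers; the target depends only on the first.
X = [[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]]
y = [3.0, 5.0, 7.0, 9.0]
weights, intercept = fit_lasso(X, y, alpha=0.01)
```

In the toy data, the penalty drives the weight of the uninformative second biomarker to exactly 0, which is the sparsity property that motivated the choice of lasso.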

Results

Background Clinical Characteristics of Patient Populations

Table 1 summarizes the participants’ background characteristics. The median (interquartile range) age of the 102 participants was 31.0 (20.3-75.5) months, and there was a roughly equal sex distribution, with 53 boys (51.9%). Overall, 26 patients (25.5%) were diagnosed as having EE, 34 patients (33.3%) were diagnosed as having CD, and 42 patients (41.2%) had healthy duodenal tissue. More characteristics are described in eAppendix 2 in the Supplement.

Table 1.

Background Clinical Characteristics of Patient Population

Characteristic	Total Participants	Patients With Environmental Enteropathy, Pakistan	Patients With Environmental Enteropathy, Zambia	Patients With Celiac Disease, United States	Patients With No Disease, United States
Diagnosis, No. (%)	102 (100)	10 (9.8)	16 (15.7)	34 (33.3)	42 (41.2)
Age, median (IQR), mo	31.0 (20.3 to 75.5)	22.0 (20.0 to 23.0)	16.5 (10.5 to 21.0)	129.0 (72.5 to 180.8)	31.5 (22.0 to 49.8)
Sex, No. (%)
Boys	53 (51.9)	5 (50.0)	10 (62.5)	12 (35.0)	26 (62.0)
Girls	49 (48.1)	5 (50.0)	6 (37.5)	22 (65.0)	16 (38.0)
Images, No. (%)^a	121 (100)	29 (24.0)	16 (13.2)	34 (28.0)	42 (34.7)
Weight-for-age z score, median (IQR)	−1.00 (−3.10 to 0.06)	−3.40 (−3.78 to −2.46)	−3.75 (−5.23 to −3.21)	−0.14 (−0.77 to 0.24)^b,c	−0.36 (−1.28 to 0.93)
Length-for-age/height-for-age z score, median (IQR)	−1.00 (−2.33 to 0.31)	−2.85 (−3.47 to −2.35)	−3.06 (−3.84 to −2.29)	−0.12 (−0.91 to 0.67)^b,d	−0.36 (−1.15 to 0.47)
Weight-for-height z score, median (IQR)	−1.00 (−2.68 to 0.27)	−2.68 (−2.87 to −1.90)	−3.05 (−4.62 to −2.61)	0.62 (0.40 to 1.07)^b	−0.23 (−1.06 to 0.50)^e

Abbreviation: IQR, interquartile range.

^a Images refer to the number of hematoxylin-eosin–stained biopsy images made available to the deep learning network; these included both scanned images (celiac disease and no disease) and digitized images (environmental enteropathy from Pakistan and Zambia). For Pakistan, there were 2 to 3 biopsies available from each patient; therefore, there were 29 digitized biopsy images from 10 patients.

^b Three patients with celiac disease did not have anthropometric data available, and they were excluded from the analysis for all z scores.

^c Weight-for-age z scores could only be calculated for approximately 35% of patients with celiac disease because the rest were older than 10 years and there is no reference standard for this age group.

^d Weight-for-height z scores could only be generated for 7 patients with celiac disease using the current algorithm.

^e Weight-for-height z scores could only be generated for 38 patients with no disease using the algorithm.


Deep Learning Prediction Accuracy

We used a case-preserving 10-fold cross-validation setup, ie, all images from a given patient were either in the training set or the testing set for a given fold. The models were trained for only 20 epochs to avoid overfitting and achieved 92.1% cross-validation per-image prediction accuracy (evaluated for each image individually) and 93.4% per-patient accuracy (evaluated after taking mean probabilities from all images). Aggregated confusion/error matrices (a table representing a CNN’s predicted classification vs actual classification, enabling visualization of the algorithm’s performance[49,50]) were generated for the 10-fold cross-validation to understand where most incorrect classifications occurred, and most were found between patients with CD and the control group.[41] On review, all misclassified CD biopsies had a Marsh score of 1. Overall, based on our aggregated confusion matrix, our model had a false-negative rate of 2.4%.
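The case-preserving split and the per-patient aggregation can be sketched as follows. This is an illustration, not the study's code: the round-robin fold assignment is an assumption (the text does not say how patients were allocated to folds), and the probabilities are hypothetical softmax outputs.

```python
from collections import defaultdict

CLASSES = ["control", "CD", "EE"]

def case_preserving_folds(patient_ids, n_folds=10):
    """Assign every patient, and therefore all of their images, to one fold."""
    unique = sorted(set(patient_ids))
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    return [fold_of[pid] for pid in patient_ids]

def per_patient_prediction(patient_ids, image_probs):
    """Average the class probabilities of each patient's images, then pick
    the class with the highest mean probability."""
    grouped = defaultdict(list)
    for pid, probs in zip(patient_ids, image_probs):
        grouped[pid].append(probs)
    result = {}
    for pid, rows in grouped.items():
        means = [sum(col) / len(rows) for col in zip(*rows)]
        result[pid] = CLASSES[means.index(max(means))]
    return result

def confusion_matrix(actual, predicted):
    """matrix[a][p] counts cases of actual class a predicted as class p."""
    matrix = {a: {p: 0 for p in CLASSES} for a in CLASSES}
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical softmax outputs for 5 images from 2 patients.
ids = ["p1", "p1", "p1", "p2", "p2"]
probs = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8], [0.3, 0.2, 0.5]]
predictions = per_patient_prediction(ids, probs)
folds = case_preserving_folds(ids, n_folds=2)
```

In this toy example, 1 image from patient p1 is individually classified as control, yet averaging across all of that patient's images still yields CD; this is how per-patient accuracy can exceed per-image accuracy.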

Deconvolutional Neural Networks Paired With CNN

Various DNN feature maps learned distinctive patterns, and as a result, the highest 9 activations corresponded to relatively similar segments from the training data. The model automatically learned microlevel features in the data, specifically duodenal epithelial secretory cells, which were identified as highly important in the predictive diagnosis of EE or CD (Figure 2).
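Selecting the highest activations of a filter and tracing them back to the source image can be sketched as follows. The 25 × 25 map size, 1000 × 1000 input, and 142 × 142 segment size come from the Methods; centering the segment on the activated cell and shifting it inward at the image border are assumptions, as are the function names.

```python
def strongest_cell(feature_map):
    """(value, row, col) of the highest activation in a 2-D feature map."""
    return max((v, r, c)
               for r, row in enumerate(feature_map)
               for c, v in enumerate(row))

def top_patches(maps_by_image, k=9):
    """Rank images by their strongest activation for one filter; keep the top k."""
    ranked = sorted(((strongest_cell(fm), name) for name, fm in maps_by_image.items()),
                    reverse=True)
    return [(name, value, row, col) for (value, row, col), name in ranked[:k]]

def source_window(row, col, map_side=25, image_side=1000, window=142):
    """Map a feature-map cell to a window x window box in the input patch.

    Assumption: the box is centred on the cell's footprint in the input and
    shifted inward where it would cross the image border.
    """
    step = image_side / map_side  # 40 input pixels per feature-map cell
    centre_r, centre_c = (row + 0.5) * step, (col + 0.5) * step
    top = min(max(int(centre_r - window / 2), 0), image_side - window)
    left = min(max(int(centre_c - window / 2), 0), image_side - window)
    return top, left, top + window, left + window

# Hypothetical 2 x 2 feature maps for 2 images.
maps = {"a": [[0.1, 0.9], [0.2, 0.3]], "b": [[0.5, 0.4], [0.0, 0.1]]}
best = top_patches(maps, k=1)
```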

Figure 2.

High Activation Areas

A, Hematoxylin-eosin–stained duodenal tissues with diagnosed environmental enteropathy (original magnification ×100). B, Hematoxylin-eosin–stained histologically normal duodenal tissue (original magnification ×100). These images were the areas of high activation identified by the model; we observed secretory cells, specifically Paneth cells and goblet cells, in the mucosa. Our classification model identified these secretory cells to be of high importance for distinguishing biopsies with no disease from biopsies of environmental enteropathy and celiac disease.

Furthermore, we extracted 151 deconvolutions to gain insight into the model’s decision-making process. The deconvolutions generated were reviewed by a gastrointestinal pathologist (C.A.M.) and pediatric gastroenterologist (S.S.) and broadly categorized into 10 groups, including Paneth cells, luminal mucin, apposed epithelium, and artifact (Figure 3) (eAppendix 3 in the Supplement).

Figure 3.

Deconvolution Groupings

We selected 151 deconvolutions from hematoxylin-eosin–stained duodenal biopsies for interpretation (original magnification ×40); the 10 groupings the model identified are shown. Red boxes and lines indicate the pixel configuration that the deconvolution model considered an area of importance. Each of these features was used by the model in its decision-making process, but the relative importance of each feature is unknown.

Intercountry EE Comparison

Because we had EE biopsies from Zambia and Pakistan, we were interested in analyzing the intercountry microfeature differences. First, all patients from Zambia were excluded, and 10 models were trained, each of which removed 1 of the 10 patients from Pakistan. Control and CD images were randomly allocated to the models. After training, the patients from Zambia were used for evaluation. Table 2 shows the per-image and per-case performance. Overall, 3 models misclassified almost all specimens from Zambia. We identified the patient from Pakistan whose specimens had been removed from each of these 3 models; these specimens were therefore hypothesized to be very informative for the identification of EE in the images from Zambia. To validate this, an additional model was trained using only the 3 biopsies from these 3 patients from Pakistan and was tested on all patients from Zambia, achieving 99.0% per-image and 100% per-case accuracy.

Table 2.

Classification Accuracy of Model Trained on Patients From Pakistan and Evaluated on Patients From Zambia

Evaluation Method	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6	Model 7	Model 8	Model 9	Model 10
Per-image accuracy	0.94	0.93	0.92	0.07	0.81	0	0.91	0.997	0.14	0.89
Per-case accuracy	1.00	1.00	1.00	0.13	1.00	0	1.00	1.00	0.13	1.00
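The models that failed on the specimens from Zambia can be read off Table 2 programmatically. In the short sketch below, the accuracy values are copied from the table, whereas the 0.5 cutoff is an arbitrary threshold chosen for illustration.

```python
# Accuracies from Table 2, listed in order for models 1 through 10.
per_image = [0.94, 0.93, 0.92, 0.07, 0.81, 0.0, 0.91, 0.997, 0.14, 0.89]
per_case = [1.00, 1.00, 1.00, 0.13, 1.00, 0.0, 1.00, 1.00, 0.13, 1.00]

def failing_models(accuracies, threshold=0.5):
    """Model numbers (1-based) whose accuracy falls below the threshold."""
    return [i for i, acc in enumerate(accuracies, start=1) if acc < threshold]

failed_per_case = failing_models(per_case)
failed_per_image = failing_models(per_image)
```

Both evaluation methods flag the same 3 models, each of which was trained without 1 particular patient from Pakistan.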

Biopsy Patterns–Biomarker Correlation Model

We found that the mean 10-fold cross-validation MSE was 0.0840, with a variance of 0.0038. The top 5 biomarkers identified were interleukin 9, interleukin 6, interleukin 1b, interferon γ-induced protein 10, and regenerating family member 1. eFigure 4 in the Supplement shows that, by using the 12 most important features, we achieved the lowest per-image error. We qualitatively compared the 32 regions that corresponded to the highest activation values for the 32 feature maps against the regions obtained from the correlation algorithm (eFigure 5 in the Supplement). Segments were sorted by the absolute difference between the estimated value, produced by the regression model and matched to the closest corresponding training biopsy segment, and the ground-truth activation values, which corresponded to the highest 32 activations from each EE specimen in the testing set. Our correlation algorithm produced relatively good estimates (eFigure 5 in the Supplement), and the MSE was 0.0751. However, given the minimal biomarker overlap between specimens from Pakistan and Zambia, this proposed method needs to be validated in a larger data set before biological interpretation of the meaning of the correlations can be attempted.

Discussion

This study aimed to develop and validate a machine learning–based histopathological analysis model to distinguish and extract morphologic phenotypes from duodenal biopsy images and identify novel features that could be used to help differentiate between overlapping conditions causing pediatric undernutrition. The major results of this work include the following: (1) the development of a CNN applied to H-E–stained duodenal biopsy specimens from participants with healthy tissue, CD, and EE; (2) the use of a DNN to identify distinguishing features; and (3) a proposed analytic framework to correlate high-dimensional biomarker data with biopsy features. In the past decade, various studies have investigated the use of deep learning to facilitate the detection of medical conditions.[1,51,52,53,54,55,56,57,58,59] In 2017, Ehteshami Bejnordi et al[1] used deep learning algorithms to interpret sentinel lymph node pathology images. They used whole-slide images, which had been annotated for metastases, as their input to train the algorithms.[1] Our analytic framework was set up as a cross-validation, ie, our training set was labeled but our validation set was not. We did not apply any annotations other than the assignment of broad categories (ie, EE, CD, or control). Strengths of our study include a novel machine learning–based histopathological analysis for identifying and differentiating between gastrointestinal diseases and control images and the use of a DNN for feature recognition and novel insights into differentiating 2 histologically similar diseases.

Limitations

This study has some inherent limitations. First, we had different forms of images as inputs, including (1) digitized slides for EE (from Pakistan and Zambia) and (2) images taken from a microscope at different resolutions for CD and the control group. Therefore, the data from our patients with EE were much larger in size and more feature heavy. Second, there was an obvious intersite staining color difference, leading to a potential decision-making bias based on color and likely producing our high degree of accuracy. Although H-E staining is a standard method used by pathologists to study human tissue, differences in commercially available reagents by country led to clear color differences in biopsy slides from Zambia, Pakistan, and the United States. We used γ correction to address this problem, but this only partly solved the issue, as evidenced by the high accuracy of EE classification. Nevertheless, in the context of the scarce literature about applying CNNs to small-intestinal tissue, our results suggest that CNNs can be used to provide quantitative and novel disease insights. Third, our study had broad inclusion criteria for the control group. While pathology reports confirmed healthy small-intestine tissue, patients were not excluded from the study on the basis of disease in other parts of the gastrointestinal tract (eg, eosinophilic esophagitis). Our current work includes more stringent inclusion and exclusion criteria for both CD and the control group. We presented a proposed analytic framework for biopsy pattern correlation with biomarkers; however, our small data set prevented us from making biological inferences regarding the biopsy feature groupings identified via biomarkers.

Future directions for our research include using digitized images for all disease categories, thereby providing the algorithm with high-resolution microscopic features, hypothetically enabling it to more robustly identify novel features and decreasing the amount of artifact used for decision making as data size increases. Future work will also include transfer-learning approaches, ideally allowing the algorithm to make more accurate classifications and feature predictions. Gradient-weighted class activation mapping[60] will be applied for the assignment of relative-importance weights to distinguishing features on each image and disease. Additionally, to address differential staining between study sites and reduce appearance variability within the data set, methods of stain normalization will be implemented, which will modify the image color to resemble a reference sample. Furthermore, a potential reason for most misclassifications occurring between CD and healthy tissue could be that the misclassified CD images had a lower Marsh score or unusual clinical features (eg, normal serology, no weight loss, constipation vs diarrhea). We plan to conduct a secondary review of these misclassifications to identify histological and clinical features that could have caused misclassifications. Further, we plan to expand our current biomarker analysis to correlate microscopic biopsy features with a wide array of biomarkers as well as molecular and genetic data.
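To make the color-handling steps discussed above concrete, the following Python sketch shows a power-law (γ) correction and a very simple form of stain normalization that matches per-channel mean and SD to a reference image. It is an illustrative sketch only, not the study's pipeline: the function names `gamma_correct` and `match_channel_stats`, the γ value, and the synthetic images are assumptions. Published stain normalization methods typically operate in a perceptual or stain-specific color space rather than directly in RGB, which is used here only for brevity.

```python
import numpy as np


def gamma_correct(image, gamma):
    """Apply power-law correction to an 8-bit RGB image (hypothetical helper)."""
    scaled = image.astype(np.float64) / 255.0
    corrected = 255.0 * scaled ** gamma
    return np.clip(np.rint(corrected), 0, 255).astype(np.uint8)


def match_channel_stats(image, reference):
    """Shift and scale each RGB channel so its mean and SD match the reference.

    Simplified stand-in for stain normalization; the study does not state
    which normalization method will be used.
    """
    src = image.astype(np.float64)
    ref = reference.astype(np.float64)
    src_mean, src_std = src.mean(axis=(0, 1)), src.std(axis=(0, 1))
    ref_mean, ref_std = ref.mean(axis=(0, 1)), ref.std(axis=(0, 1))
    scale = ref_std / np.maximum(src_std, 1e-8)  # guard against flat channels
    out = (src - src_mean) * scale + ref_mean
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


# Synthetic stand-ins for a biopsy tile and a reference tile from another site.
rng = np.random.default_rng(0)
tile = rng.integers(60, 140, size=(32, 32, 3)).astype(np.uint8)
reference_tile = rng.integers(100, 200, size=(32, 32, 3)).astype(np.uint8)

brightened = gamma_correct(tile, gamma=0.5)  # gamma < 1 brightens midtones
normalized = match_channel_stats(tile, reference_tile)
print("tile mean:", tile.mean(), "brightened mean:", brightened.mean())
print("normalized channel means:", normalized.mean(axis=(0, 1)))
```

γ correction changes overall brightness and contrast with a single parameter, so it cannot remove hue differences between reagent batches; matching color statistics to a reference sample addresses that more directly, which is consistent with the partial improvement reported above.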

Conclusions

In this diagnostic study, a machine learning–based histopathological analysis model demonstrated 93.4% classification accuracy for identifying and differentiating between duodenal biopsies from children with EE and CD. The combination of the CNN with a deconvolutional network (DNN) enabled feature recognition and highlighted secretory cells’ role in the model’s ability to differentiate between these histologically similar diseases.

41 in total

1. Comparison of the interobserver reproducibility with different histologic criteria used in celiac disease.

Authors: Gino Roberto Corazza; Vincenzo Villanacci; Claudia Zambelli; Massimo Milione; Ombretta Luinetti; Carla Vindigni; Caterina Chioda; Luca Albarello; Daniela Bartolini; Francesco Donato
Journal: Clin Gastroenterol Hepatol Date: 2007-06-04 Impact factor: 11.382

2. Fecal Markers of Environmental Enteropathy and Subsequent Growth in Bangladeshi Children.

Authors: Michael B Arndt; Barbra A Richardson; Tahmeed Ahmed; Mustafa Mahfuz; Rashidul Haque; Grace C John-Stewart; Donna M Denno; William A Petri; Margaret Kosek; Judd L Walson
Journal: Am J Trop Med Hyg Date: 2016-06-27 Impact factor: 2.345

3. Artificial Intelligence and the Pathologist: Future Frenemies?

Authors: Gaurav Sharma; Alexis Carter
Journal: Arch Pathol Lab Med Date: 2017-05 Impact factor: 5.534

Review 4. Histopathological image analysis: a review.

Authors: Metin N Gurcan; Laura E Boucheron; Ali Can; Anant Madabhushi; Nasir M Rajpoot; B Yener
Journal: IEEE Rev Biomed Eng Date: 2009-10-30

5. Big Data and Machine Learning in Health Care.

Authors: Andrew L Beam; Isaac S Kohane
Journal: JAMA Date: 2018-04-03 Impact factor: 56.272

6. Chronic diarrhea and malnutrition--histology of the small intestinal lesion.

Authors: P B Sullivan; M N Marsh; R Mirakian; S M Hill; P J Milla; G Neale
Journal: J Pediatr Gastroenterol Nutr Date: 1991-02 Impact factor: 2.839

7. Environmental Enteropathy, Oral Vaccine Failure and Growth Faltering in Infants in Bangladesh.

Authors: Caitlin Naylor; Miao Lu; Rashidul Haque; Dinesh Mondal; Erica Buonomo; Uma Nayak; Josyf C Mychaleckyj; Beth Kirkpatrick; Ross Colgate; Marya Carmolli; Dorothy Dickson; Fiona van der Klis; William Weldon; M Steven Oberste; Jennie Z Ma; William A Petri
Journal: EBioMedicine Date: 2015-09-25 Impact factor: 8.143

8. Infant Nutritional Status and Markers of Environmental Enteric Dysfunction are Associated with Midchildhood Anthropometry and Blood Pressure in Tanzania.

Authors: Lindsey M Locks; Ramadhani S Mwiru; Expeditho Mtisi; Karim P Manji; Christine M McDonald; Enju Liu; Roland Kupka; Rodrick Kisenge; Said Aboud; Kerri Gosselin; Matthew Gillman; Andrew T Gewirtz; Wafaie W Fawzi; Christopher P Duggan
Journal: J Pediatr Date: 2017-05-09 Impact factor: 4.406

Review 9. Environmental enteric dysfunction pathways and child stunting: A systematic review.

Authors: Kaitlyn M Harper; Maxine Mutasa; Andrew J Prendergast; Jean Humphrey; Amee R Manges
Journal: PLoS Negl Trop Dis Date: 2018-01-19

10. Biomarkers of Environmental Enteropathy, Inflammation, Stunting, and Impaired Growth in Children in Northeast Brazil.

Authors: Richard L Guerrant; Alvaro M Leite; Relana Pinkerton; Pedro H Q S Medeiros; Paloma A Cavalcante; Mark DeBoer; Margaret Kosek; Christopher Duggan; Andrew Gewirtz; Jonathan C Kagan; Anna E Gauthier; Jonathan Swann; Jordi Mayneris-Perxachs; David T Bolick; Elizabeth A Maier; Marjorie M Guedes; Sean R Moore; William A Petri; Alexandre Havt; Ila F Lima; Mara de Moura Gondim Prata; Josyf C Michaleckyj; Rebecca J Scharf; Craig Sturgeon; Alessio Fasano; Aldo A M Lima
Journal: PLoS One Date: 2016-09-30 Impact factor: 3.240

12 in total

Review 1. Artificial Intelligence for Disease Assessment in Inflammatory Bowel Disease: How Will it Change Our Practice?

Authors: Ryan W Stidham; Kento Takenaka
Journal: Gastroenterology Date: 2022-01-04 Impact factor: 22.682

Review 2. The global burden of coeliac disease: opportunities and challenges.

Authors: Govind K Makharia; Prashant Singh; Carlo Catassi; David S Sanders; Daniel Leffler; Raja Affendi Raja Ali; Julio C Bai
Journal: Nat Rev Gastroenterol Hepatol Date: 2022-01-03 Impact factor: 46.802

3. Application of Artificial Intelligence to Clinical Practice in Inflammatory Bowel Disease - What the Clinician Needs to Know.

Authors: David Chen; Clifton Fulmer; Ilyssa O Gordon; Sana Syed; Ryan W Stidham; Niels Vande Casteele; Yi Qin; Katherine Falloon; Benjamin L Cohen; Robert Wyllie; Florian Rieder
Journal: J Crohns Colitis Date: 2022-03-14 Impact factor: 10.020

4. Artificial Intelligence for Understanding Imaging, Text, and Data in Gastroenterology.

Authors: Ryan W Stidham
Journal: Gastroenterol Hepatol (N Y) Date: 2020-07

Review 5. Artificial Intelligence Applied to Gastrointestinal Diagnostics: A Review.

Authors: Vatsal Patel; Marium N Khan; Aman Shrivastava; Kamran Sadiq; S Asad Ali; Sean R Moore; Donald E Brown; Sana Syed
Journal: J Pediatr Gastroenterol Nutr Date: 2020-01 Impact factor: 3.288

6. Artificial Intelligence-based Analytics for Diagnosis of Small Bowel Enteropathies and Black Box Feature Detection.

Authors: Sana Syed; Lubaina Ehsan; Aman Shrivastava; Saurav Sengupta; Marium Khan; Kamran Kowsari; Shan Guleria; Rasoul Sali; Karan Kant; Sung-Jun Kang; Kamran Sadiq; Najeeha T Iqbal; Lin Cheng; Christopher A Moskaluk; Paul Kelly; Beatrice C Amadi; Syed Asad Ali; Sean R Moore; Donald E Brown
Journal: J Pediatr Gastroenterol Nutr Date: 2021-06-01 Impact factor: 3.288

7. A novel histological index for evaluation of environmental enteric dysfunction identifies geographic-specific features of enteropathy among children with suboptimal growth.

Authors: Ta-Chiang Liu; Kelley VanBuskirk; Syed A Ali; M Paul Kelly; Lori R Holtz; Omer H Yilmaz; Kamran Sadiq; Najeeha Iqbal; Beatrice Amadi; Sana Syed; Tahmeed Ahmed; Sean Moore; I Malick Ndao; Michael H Isaacs; John D Pfeifer; Hannah Atlas; Phillip I Tarr; Donna M Denno; Christopher A Moskaluk
Journal: PLoS Negl Trop Dis Date: 2020-01-13