Daniele Pietrucci1,2, Adelaide Teofani3, Marco Milanesi1, Bruno Fosso4, Lorenza Putignani5, Francesco Messina6, Graziano Pesole2,4, Alessandro Desideri3, Giovanni Chillemi1.
Abstract
In recent years, the involvement of the gut microbiota in disease and health has been investigated by sequencing the 16S gene from fecal samples. Dysbiotic gut microbiota was also observed in Autism Spectrum Disorder (ASD), a neurodevelopmental disorder characterized by gastrointestinal symptoms. However, despite the relevant number of studies, it is still difficult to identify a typical dysbiotic profile in ASD patients. The discrepancies among these studies are due to technical factors (i.e., experimental procedures) and external parameters (i.e., dietary habits). In this paper, we collected 959 samples from eight available projects (540 ASD and 419 Healthy Controls, HC) and reduced the observed bias among studies. Then, we applied a Machine Learning (ML) approach to create a predictor able to discriminate between ASD and HC. We tested and optimized three algorithms: Random Forest, Support Vector Machine and Gradient Boosting Machine. All three algorithms confirmed the importance of five different genera, including Parasutterella and Alloprevotella. Furthermore, our results show that ML algorithms could identify common taxonomic features by comparing datasets obtained from countries characterized by latent confounding variables.
Keywords: Alloprevotella; Parasutterella; autism spectrum disorder; dysbiosis; gut microbiota; machine learning data analysis; targeted metagenomics
Year: 2022 PMID: 36009575 PMCID: PMC9405825 DOI: 10.3390/biomedicines10082028
Source DB: PubMed Journal: Biomedicines ISSN: 2227-9059
Datasets used in this study. For each study, the number of ASD samples and HC samples before and after quality filtering is reported. In addition, whenever possible, the BioProject ID is reported. Each study is identified by a Study ID, defined as the first author's name, or AGP for samples downloaded from the American Gut Project [36].
| Study ID | ASD Samples | HC | ASD Samples after Quality Filtering | HC Samples after Quality Filtering | Country | BioProject ID |
|---|---|---|---|---|---|---|
| Averina | 15 | 5 | 15 | 5 | Russia | PRJNA516054 |
| Coretti | 11 | 14 | 4 | 0 | Italy | PRJEB29421 |
| Dan | 142 | 143 | 142 | 143 | China | PRJNA453621 |
| Pulikkan | 30 | 24 | 30 | 24 | India | PRJNA355023 |
| Son | 59 | 44 | 59 | 44 | USA | PRJNA282013 |
| Zurita | 27 | 31 | 27 | 31 | Ecuador | PRJEB27306 |
| Vernocchi | 206 | 108 | 197 | 108 | Italy | PRJNA754695 |
| AGP | 50 | 50 | 47 | 47 | USA | - |
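
As a consistency check, the per-study counts in the table above can be summed and compared with the totals quoted in the abstract (540 ASD, 419 HC, 959 overall). The sketch below hard-codes the table rows; the variable and function names are illustrative and not part of the original study. It also shows that dropping the Coretti dataset, which does not appear among the Study IDs listed for the PERMANOVA test, leaves the sample count implied by that table (total Df = 918, i.e. N − 1).

```python
# Rows copied from the dataset table:
# (study, ASD, HC, ASD after filtering, HC after filtering)
STUDIES = [
    ("Averina", 15, 5, 15, 5),
    ("Coretti", 11, 14, 4, 0),
    ("Dan", 142, 143, 142, 143),
    ("Pulikkan", 30, 24, 30, 24),
    ("Son", 59, 44, 59, 44),
    ("Zurita", 27, 31, 27, 31),
    ("Vernocchi", 206, 108, 197, 108),
    ("AGP", 50, 50, 47, 47),
]


def totals(rows):
    """Return (ASD, HC, ASD filtered, HC filtered) summed over the rows."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


asd, hc, asd_f, hc_f = totals(STUDIES)
print(asd, hc, asd + hc)          # 540 419 959
print(asd_f, hc_f, asd_f + hc_f)  # 521 402 923

# Leave out Coretti (no HC samples survive the quality filtering).
without_coretti = totals([row for row in STUDIES if row[0] != "Coretti"])
print(sum(without_coretti[2:]))   # 919
```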
Figure 1. Relative abundance of the genera identified in all the datasets. For each dataset, the mean relative abundance of the 85 genera was evaluated for ASD patients and HC. The 20 most abundant genera are represented using different colors and the remaining genera are pooled in the “Others” bin.
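
The computation summarized in the Figure 1 caption (per-sample relative abundance, averaged within a group, with low-ranked genera pooled into "Others") can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy counts are hypothetical, and the original analysis may have normalized the data differently.

```python
from collections import defaultdict


def relative_abundance(counts):
    """Convert raw genus counts for one sample into fractions summing to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {genus: n / total for genus, n in counts.items()}


def mean_relative_abundance(samples):
    """Average per-sample relative abundances; absent genera count as 0."""
    sums = defaultdict(float)
    for counts in samples:
        for genus, frac in relative_abundance(counts).items():
            sums[genus] += frac
    return {genus: s / len(samples) for genus, s in sums.items()}


def top_n_with_others(mean_abund, n=20):
    """Keep the n most abundant genera and pool the rest into 'Others'."""
    ranked = sorted(mean_abund.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:n])
    rest = sum(value for _, value in ranked[n:])
    if rest > 0:
        kept["Others"] = rest
    return kept


# Toy counts for one group (hypothetical numbers, not study data).
asd_samples = [
    {"Bacteroides": 60, "Prevotella": 30, "Parasutterella": 10},
    {"Bacteroides": 20, "Prevotella": 70, "Alloprevotella": 10},
]
profile = top_n_with_others(mean_relative_abundance(asd_samples), n=2)
print(profile)
```

With `n=20` the same function would reproduce the binning described in the caption.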
Pseudo-F score (F), Degrees of Freedom (Df), Sum of Variance (Sum), R2 and p-value of the PERMANOVA test conducted on microbial communities. Two variables were analyzed: the phenotype (ASD vs. HC) and the Study ID, which represents an identifier for each dataset (AGP, Averina, Dan, Pulikkan, Vernocchi, Son, Zurita). (A) reports the statistics prior to batch effect removal using ComBat; (B) reports the values after the removal of the batch effect.
| Variable | Df | Sum | R2 | F | Pr (>F) |
|---|---|---|---|---|---|
| (A) Values prior to batch effect removal | |||||
| Phenotype | 1 | 0.0788 | 0.03050 | 39.354 | 0.001 |
| Study ID | 6 | 74.081 | 0.28705 | 616.687 | 0.001 |
| Residual | 911 | 182.396 | 0.70676 | ||
| Total | 918 | 328.85 | 1.0000 | ||
| (B) Values after batch effect removal | |||||
| Phenotype | 1 | 0.0782 | 0.02508 | 27.416 | 0.001 |
| Study ID | 6 | 42.040 | 0.13874 | 255.534 | 0.001 |
| Residual | 911 | 259.968 | 0.85795 | ||
| Total | 918 | 303.012 | 1.000 | ||
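
For readers unfamiliar with the statistics in this table, the sketch below shows how a pseudo-F, an R2 and a permutation p-value can be derived from a distance matrix for a single grouping factor. It is a simplified one-way version written in plain Python, not the two-variable model reported above, and all names and the toy data are assumptions made for illustration.

```python
import random
from itertools import combinations


def ss_pairs(dist, idx):
    """Sum of squared pairwise distances among idx, divided by the group size."""
    if len(idx) < 2:
        return 0.0
    return sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)


def pseudo_f(dist, labels):
    """Return (pseudo-F, R2) for one grouping factor."""
    n = len(labels)
    groups = {}
    for i, g in enumerate(labels):
        groups.setdefault(g, []).append(i)
    a = len(groups)
    ss_total = ss_pairs(dist, list(range(n)))
    ss_within = sum(ss_pairs(dist, idx) for idx in groups.values())
    ss_among = ss_total - ss_within
    f = (ss_among / (a - 1)) / (ss_within / (n - a))
    return f, ss_among / ss_total


def permanova(dist, labels, permutations=999, seed=0):
    """Permutation p-value: share of label shuffles with F >= the observed F."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (permutations + 1)


# Toy example: six samples on a line, two well-separated groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]
labels = ["ASD", "ASD", "ASD", "HC", "HC", "HC"]
f_obs, r2, p_value = permanova(dist, labels)
print(f_obs, r2, p_value)
```

Note that R2 is the share of the total sum of squares attributed to the factor, which is why a drop in the Study ID R2 after batch correction indicates a weaker study effect.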
Figure 2. Principal Coordinate Analysis (PCoA) and Principal Component Analysis (PCA) performed on microbial abundances. The phenotype is indicated by a square for ASD samples and a circle for HC samples. Each color represents a different Project ID, namely one of the seven datasets used in this analysis (AGP, Averina, Dan, Pulikkan, Vernocchi, Son and Zurita). (A) PCoA performed on original data; (B) PCoA performed after the removal of the batch effect using the ComBat function of the sva package; (C) PCA performed on original data; (D) PCA performed after the removal of the batch effect using the ComBat function of the sva package.
Comparison of the RF performance on three datasets. The algorithm parameters for each dataset were selected by using a grid search approach and the values which provided the greatest accuracy were selected. The threshold indicates the probability value at which the Recall (True Positive Rate, TPR) and the Specificity (True Negative Rate, TNR) were the same (see Section 2 and Supplementary Figure S2 for more details on this procedure). The following metrics are reported: Accuracy, Precision, Recall (TPR), Specificity (TNR) and F-score.
| Dataset | Algorithm Parameters | Threshold | Accuracy | Precision | Recall (TPR) | F-Score |
|---|---|---|---|---|---|---|
| | ntree = 1500, mtry = 10.21 | 0.4540 | 0.85 | 0.85 | 0.86 | 0.85 |
| | ntree = 1000, mtry = 10.21 | 0.6190 | 0.72 | 0.82 | 0.72 | 0.77 |
| | ntree = 1500, mtry = 6.21 | 0.5640 | 0.41 | 0.49 | 0.41 | 0.44 |
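
The threshold described in the caption above is the probability cut-off at which Recall and Specificity coincide. A minimal sketch of that selection is given below, assuming each sample has a predicted ASD probability and a true label (1 = ASD, 0 = HC). The function names and toy scores are hypothetical, and the original procedure (Supplementary Figure S2) may scan a finer grid of thresholds.

```python
def tpr_tnr(scores, labels, threshold):
    """Recall and specificity when score >= threshold is called ASD (label 1)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return tp / pos, tn / neg


def balanced_threshold(scores, labels):
    """Pick the observed score that minimizes |TPR - TNR| when used as cut-off."""
    def gap(t):
        tpr, tnr = tpr_tnr(scores, labels, t)
        return abs(tpr - tnr)
    return min(sorted(set(scores)), key=gap)


# Toy predictions (hypothetical, not study data).
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]
best = balanced_threshold(scores, labels)
print(best, tpr_tnr(scores, labels, best))
```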
Comparison of the algorithm performance using three strategies. In Strategy 1, all the datasets were used to train and test the algorithms. In Strategy 2, the datasets that did not admit patients’ siblings were used to train and test the algorithms. In Strategy 3, the two datasets that admitted patients’ siblings were used to train and test the algorithms. The threshold indicates the probability value at which the Recall (True Positive Rate, TPR) and the Specificity (True Negative Rate, TNR) were the same (see Section 2 and Supplementary Figure S2 for more details on this procedure). The following metrics are reported: Accuracy, Precision, Recall (TPR), Specificity (TNR) and F-score.
| Algorithm | Strategy | Parameters | Threshold | Accuracy | Precision | Recall (TPR) | F-Score |
|---|---|---|---|---|---|---|---|
| RF | 1 | ntree = 500, mtry = 7.21 | 0.5580 | 0.67 | 0.71 | 0.67 | 0.70 |
| RF | 2 | ntree = 2000, mtry = 12.27 | 0.5570 | 0.70 | 0.76 | 0.70 | 0.72 |
| RF | 3 | ntree = 500, mtry = 8.21 | 0.5640 | 0.49 | 0.54 | 0.54 | 0.54 |
| GBM | 1 | n.trees = 1000, interaction.depth = 1, n.minobsinnode = 1, shrinkage = 0.1 | 0.6545 | 0.62 | 0.68 | 0.62 | 0.65 |
| GBM | 2 | n.trees = 1000, interaction.depth = 1, n.minobsinnode = 5, shrinkage = 0.1 | 0.6053 | 0.69 | 0.73 | 0.69 | 0.71 |
| GBM | 3 | n.trees = 2500, interaction.depth = 1, n.minobsinnode = 0.1, shrinkage = 20 | 0.9853 | 0.48 | 0.54 | 0.49 | 0.47 |
| SVM | 1 | C = 1, sigma = 2.9802 × 10⁻⁸ | 0.5966 | 0.65 | 0.70 | 0.65 | 0.67 |
| SVM | 2 | C = 246, sigma = 3.1250 × 10⁻² | 0.6025 | 0.69 | 0.74 | 0.70 | 0.72 |
| SVM | 3 | C = 81, sigma = 9.7656 × 10⁻⁴ | 0.5632 | 0.45 | 0.53 | 0.49 | 0.50 |
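
The metrics named in the caption follow from the four confusion-matrix counts. The sketch below uses the standard definitions, with ASD as the positive class; the function name and the example counts are illustrative only. One property worth noting: when the threshold is chosen so that Recall equals Specificity, Accuracy takes that same value regardless of class balance, which may explain why Accuracy and Recall coincide in many rows of the table.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts (ASD = positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # True Positive Rate
    specificity = tn / (tn + fp)     # True Negative Rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 100 ASD and 50 HC samples, recall = specificity = 0.6.
balanced = classification_metrics(tp=60, fp=20, tn=30, fn=40)
print(balanced)
```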
Figure 3. (A) Feature rank importance for the 15 bacterial taxa identified by the feature selection procedure for the RF (green diamond), GBM (red circle) and SVM (orange triangle) algorithms. Feature selection for the (B) RF, (C) GBM and (D) SVM algorithms. The y-axis reports the mean precision, evaluated over k = 5 folds. The x-axis reports the number n of most relevant features used to train the algorithms.
Feature importance for the RF, GBM and SVM algorithms. For each algorithm, the rank of the 15 most important bacterial genera, as identified by a feature selection procedure, is reported. Blank cells indicate that a genus was not among the 15 most important in the feature selection for that algorithm.
| Bacterial Genus | Importance “RF” | Importance “GBM” | Importance “SVM” |
|---|---|---|---|
| | 1 | 1 | 5 |
| | 6 | 6 | 13 |
| | 15 | 9 | 11 |
| | 13 | 4 | 12 |
| | 10 | 10 | 4 |
| | 8 | 5 | |
| | 9 | 3 | |
| | 13 | 1 | |
| | 7 | 12 | |
| | 11 | 7 | |
| | 5 | 3 | |
| | 4 | 2 | |
| | 15 | 6 | |
| | 2 | | |
| | 15 | | |
| | 10 | | |
| | 14 | | |
| | 7 | | |
| | 9 | | |
| | 12 | | |
| | 14 | | |
| | | | |
| | 14 | 8 | |
| | 2 | | |
| | | | |
| | 3 | 11 | |
| | 8 | | |
Figure 4. Results of the SHAP algorithm, showing the contribution of five features (bacterial genera) to the classification of a sample as ASD or HC. Each dot represents a sample, and its color indicates the microbial abundance: red dots are samples in which the genus is abundant, while blue dots are samples in which the genus is poorly represented. Points with an attribution greater than 0 push the prediction toward ASD, while points with an attribution lower than 0 push it toward HC.
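
The attributions plotted in Figure 4 are Shapley values: each feature receives its average marginal contribution to the model output over all orderings of the features. The sketch below computes exact Shapley values for a toy scoring function by enumerating feature subsets, with missing features replaced by a baseline. It is an illustration of the underlying idea only; the SHAP library uses faster approximations, and the model, baseline and names here are assumptions rather than the study's setup.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline) to each feature."""
    n = len(x)

    def value(subset):
        # Features in `subset` take the sample's value, the rest the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(v):
    """Hypothetical ASD score over three genus abundances, with an interaction."""
    return 2 * v[0] - v[1] + v[0] * v[2]


sample = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, sample, baseline)
print(phi)
```

The attributions sum to the difference between the model output for the sample and for the baseline, so positive values push the score up and negative values pull it down, which is the reading used in the figure.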