Ziyuan Jiang, Jiajin Li, Nahyun Kong, Jeong-Hyun Kim, Bong-Soo Kim, Min-Jung Lee, Yoon Mee Park, So-Yeon Lee, Soo-Jong Hong, Jae Hoon Sul.
Abstract
Atopic dermatitis (AD) is a common skin disease in childhood whose diagnosis requires expertise in dermatology. Recent studies have indicated that host gene-microbial interactions in the gut contribute to human diseases including AD. We sought to develop an accurate and automated pipeline for AD diagnosis based on transcriptome and microbiota data. Using these data from 161 subjects, including AD patients and healthy controls, we trained a machine learning classifier to predict the risk of AD. We found that the classifier could accurately differentiate subjects with AD from healthy individuals based on the omics data, with an average F1-score of 0.84. With this classifier, we also identified a set of 35 genes and 50 microbiota features that are predictive of AD. Among the selected features, we discovered at least three genes and three microorganisms directly or indirectly associated with AD. Although further replications in other cohorts are needed, our findings suggest that these genes and microbiota features may provide novel biological insights and may be developed into useful biomarkers for AD prediction.
Year: 2022 PMID: 34997172 PMCID: PMC8741793 DOI: 10.1038/s41598-021-04373-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Baseline characteristics of the subjects in this study.
| | All | Cases (AD) | Control (No AD) | Cases (AD) vs control (No AD) t test p value |
|---|---|---|---|---|
| Average age: months | 14.21 ± 2.14 | 17.37 ± 3.48 | 10.81 ± 2.15 | 0.001 |
| Sex: female | 72 | 32 | 40 | – |
| SCORAD (a) | – | 32.86 ± 5.49 | – | – |
| Total IgE (IU/ml) | 135.191 ± 83.53 | 243.06 ± 160.21 | 22.83 ± 9.46 | 0.004 |
(a) SCORAD: SCOring AD value, an AD assessment index that is available only for patients.
Figure 1. Overview of the atopic dermatitis classification pipelines in two settings: (a) transcriptome data only, and (b) transcriptome and microbiota data.
Results of the different methods with transcriptome data only.
| Feature selection method (number of features) + Classification method | F1 score | Accuracy | Precision | Recall |
|---|---|---|---|---|
| All features (44,608) + SVM (rbf) | 0.7272 | 0.6000 | 0.5714 | |
| chi-squared test (35) + SVM (rbf) | 0.8125 | |||
| All features (44,608) + SVM (rbf), with noise (I = 0.001) and probability threshold = 0.3 | 0.7111 | 0.5667 | 0.5517 | |
| chi-squared test (35) + SVM (rbf), with noise (I = 0.001) and probability threshold = 0.3 | 0.8125 |
The first method trained the model on the original training set without feature selection. The second method performed feature selection with the chi-squared test and selected 35 features. The last two methods mirror the first two, respectively; the only difference is that they added noise and changed the probability threshold. The random seed of the noise was 21, which gave the best result at this intensity (I = 0.001).
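The two modifications described here, seeded noise and a lowered probability threshold, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the source does not define the noise model, so additive zero-mean Gaussian noise with standard deviation equal to the intensity I is assumed, and the function names and toy values are hypothetical.

```python
import random


def add_gaussian_noise(rows, intensity, seed):
    """Return a copy of rows with zero-mean Gaussian noise added to each value.

    Assumption: the noise standard deviation equals the intensity I.
    """
    rng = random.Random(seed)  # fixed seed makes the perturbation reproducible
    return [[x + rng.gauss(0.0, intensity) for x in row] for row in rows]


def predict_with_threshold(probabilities, threshold=0.3):
    """Label a subject as AD (1) when its predicted AD probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical toy inputs, not data from the study.
features = [[0.20, 1.50], [0.90, 0.10]]
noisy = add_gaussian_noise(features, intensity=0.001, seed=21)

probs = [0.10, 0.35, 0.50, 0.29]
labels_default = predict_with_threshold(probs, threshold=0.5)
labels_lowered = predict_with_threshold(probs, threshold=0.3)
```

Lowering the threshold from 0.5 to 0.3 can only move subjects from the control class to the AD class, so recall cannot decrease while precision may drop.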
Figure 2. The ROC curve of the test set with transcriptome data only. (a) All features (44,608) + SVM (rbf). (b) Chi-squared test (35) + SVM (rbf). (c) All features (44,608) + SVM (rbf), with noise (I = 0.001) and probability threshold = 0.3. (d) Chi-squared test (35) + SVM (rbf), with noise (I = 0.001) and probability threshold = 0.3.
The first and second methods used microbiota data only.
| Feature selection method (number of features) + Classification method | F1 score | Accuracy | Precision | Recall |
|---|---|---|---|---|
| All features (366) + SVM (rbf) | 0.7111 | 0.5667 | 0.5517 | |
| chi-squared test (25) + SVM (rbf) | 0.7442 | 0.6333 | 0.5926 | |
| chi-squared test (85) + SVM (rbf) | 0.8750 | |||
| All features (366) + SVM (rbf), with probability threshold = 0.3 | 0.6957 | 0.5333 | 0.5333 | |
| chi-squared test (25) + SVM (rbf), with probability threshold = 0.3 | 0.7111 | 0.5667 | 0.5517 | |
| chi-squared test (85) + SVM (rbf), with noise (I = 0.001) and probability threshold = 0.3 | 0.8750 |
The first method trained the model on the original training set without feature selection. The second method performed feature selection with the chi-squared test and selected 25 features. The third method used both transcriptome and microbiota data, integrated them using the fourth plan mentioned above, and selected 85 features (35 from the transcriptome and 50 from the microbiota). The last three methods mirror the first three, respectively; the only difference is that they changed the probability threshold and added noise.
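The 85-feature setting combines 35 transcriptome and 50 microbiota features. The integration plan itself is not described in this section, so the sketch below makes an assumption: each data type is ranked separately by a chi-squared score and the top features are concatenated per subject. The score compares per-class feature sums with the sums expected under independence, a common formulation for non-negative features. All function names and toy values are hypothetical.

```python
def chi2_scores(rows, labels):
    """Chi-squared score per feature for non-negative feature values."""
    n = len(rows)
    classes = sorted(set(labels))
    scores = []
    for j in range(len(rows[0])):
        total = sum(row[j] for row in rows)
        score = 0.0
        for c in classes:
            # Observed: feature mass inside class c.
            observed = sum(row[j] for row, y in zip(rows, labels) if y == c)
            # Expected: feature mass split according to class frequency.
            expected = total * labels.count(c) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k_indices(scores, k):
    """Indices of the k highest-scoring features, in original column order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def select_and_concatenate(blocks, labels, ks):
    """Select the top-k features in each data block and join them per subject."""
    combined = [[] for _ in labels]
    for rows, k in zip(blocks, ks):
        keep = top_k_indices(chi2_scores(rows, labels), k)
        for out, row in zip(combined, rows):
            out.extend(row[j] for j in keep)
    return combined


# Hypothetical toy data: 4 subjects (1 = AD, 0 = control).
labels = [1, 1, 0, 0]
transcriptome = [[5, 1, 2], [6, 1, 2], [1, 1, 2], [0, 1, 2]]
microbiota = [[0, 3], [0, 4], [7, 3], [9, 4]]

# The study's setting would correspond to ks=(35, 50); the toy example keeps one each.
combined = select_and_concatenate([transcriptome, microbiota], labels, ks=(1, 1))
```

Ranking each data type separately keeps the much larger transcriptome block (44,608 features) from crowding out the 366 microbiota features, which is one plausible reason to fix the split at 35 and 50.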
Figure 3. The ROC curve of the test set with microbiota data. (a) All features (366) + SVM (rbf). (b) Chi-squared test (25) + SVM (rbf). (c) Chi-squared test (85) + SVM (rbf). (d) Chi-squared test (85) + SVM (rbf), with noise (I = 0.001) and probability threshold = 0.3. Panels (a) and (b) use microbiota data only, while panels (c) and (d) also include transcriptome data.
Figure 4. The average feature importance of the top 35 selected probes/genes. See more detailed annotation information in Supplementary Table S5.
Figure 5. The average feature importance of the top 50 selected microorganisms from the microbiota dataset.
| | Predicted class: True | Predicted class: False |
|---|---|---|
| True | | |
| False | | |
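The four metrics reported in the results tables follow directly from the cells of a confusion matrix like the one above. The sketch below shows the standard definitions. The counts are hypothetical, chosen only because they give values close to the first row of the transcriptome-only table; they are not reported in the source.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a 30-subject test set (assumption, not from the source):
# every AD subject is detected, but most controls are also labeled AD.
metrics = classification_metrics(tp=16, fn=0, fp=12, tn=2)
```

Because F1 is the harmonic mean of precision and recall, any one of the three can be recovered from the other two, which is useful for sanity-checking rows where a cell is missing.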