Abstract
BACKGROUND: Current laboratory tests are less than 50% accurate in distinguishing between people who have food allergies (FA) and those who are merely sensitized to foods, resulting in the use of expensive and potentially dangerous Oral Food Challenges. This study presents a purely-computational machine learning approach, conducted using DNA Methylation (DNAm) data, to accurately diagnose food allergies and potentially find epigenetic targets for the disease. METHODS AND
Year: 2019 PMID: 31216310 PMCID: PMC6584060 DOI: 10.1371/journal.pone.0218253
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Top CpGs and associated genes from GEO2R across 8 independent folds.
| Rank | CpG | Associated Gene | Positions in Lists |
|---|---|---|---|
| 1 | cg06410630 | | 6,1,1,21,91,3,13 |
| 2 | cg13560030 | | 60,13,20,38,5,28 |
| 3 | cg02681173 | | 2,35,1,16,56 |
| 4 | cg09755579 | | 11,8,19,71,1 |
| 5 | cg20502977 | | 2,40,4,1 |
| 6 | cg26124569 | | 14,6,8,43 |
| 7 | cg24616138 | | 5,13,2,71 |
| 8 | cg24584002 | | 20,18,40,32 |
| 9 | cg03946731 | | 50,23,34,6 |
| 10 | cg20463995 | - | 39,44,1,39, |
| 11 | cg09618933 | - | 48,12,60,5 |
| 12 | cg10301401 | | 7,18,11 |
| 13 | cg08378782 | | 9,27,38 |
| 14 | cg21615831 | | 13,59,34,74 |
| 15 | cg07060505 | - | 1,11,70 |
A CpG may not appear in the top 99 CpGs for all of the eight folds. The above ranking is based on the frequency of each CpG across the eight GEO2R lists as well as its ranking in each list. The order of the genes in this table has no methodological significance.
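The note above describes a ranking driven by how often a CpG appears across the GEO2R lists and by its position in each list, but it does not give the exact scoring rule. The sketch below is one possible reading, not the authors' method: it sorts by number of appearances (descending) and breaks ties by mean position (ascending). The function name `rank_cpgs` and the tie-break are assumptions; the positions are taken from the first three rows of the table. Row 14 has four appearances yet ranks below rows with three, so the actual rule must weight positions differently from this sketch.

```python
from statistics import mean

def rank_cpgs(positions):
    """Order CpGs by appearance count (descending), then mean list position (ascending).

    positions: dict mapping a CpG ID to the list of its ranks in the lists
    where it appears. This scoring rule is an assumption for illustration.
    """
    return sorted(
        positions,
        key=lambda cpg: (-len(positions[cpg]), mean(positions[cpg])),
    )

# Positions copied from the first three table rows, deliberately shuffled.
positions = {
    "cg02681173": [2, 35, 1, 16, 56],
    "cg06410630": [6, 1, 1, 21, 91, 3, 13],
    "cg13560030": [60, 13, 20, 38, 5, 28],
}
ordered = rank_cpgs(positions)
```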
Fig 1. Average accuracy across eight independent folds for single CpG features.
The accuracy for each CpG is its average hidden-data accuracy across the 8 independent folds. cg06410630 was the strongest CpG biomarker, with an average accuracy of 84.375%. Eighteen CpGs each had a score of 75% or more.
Top CpGs and associated genes using a single input feature to a classifier across 8 independent folds.
| Number | CpG | Gene | Average Accuracy | AUROC |
|---|---|---|---|---|
| 1 | cg06410630 | | 84.375 | 0.8359375 |
| 2 | cg06669701 | | 81.25 | 0.7890625 |
| 3 | cg06628000 | | 79.6875 | 0.8359375 |
| 4 | cg10461264 | - | 78.125 | 0.7421875 |
| 5 | cg18988685 | - | 76.5625 | 0.8125 |
| 6 | cg24616138 | | 76.5625 | 0.7109375 |
| 7 | cg27027230 | | 75 | 0.765625 |
| 8 | cg00936790 | | 75 | 0.7421875 |
| 9 | cg14414100 | | 75 | 0.7734375 |
| 10 | cg00939931 | | 75 | 0.796875 |
| 11 | cg06116095 | | 75 | 0.7421875 |
| 12 | cg02788266 | - | 75 | 0.7734375 |
| 13 | cg03068039 | | 75 | 0.828125 |
| 14 | cg25890092 | | 75 | 0.8203125 |
| 15 | cg19287711 | - | 75 | 0.78125 |
| 16 | cg07033513 | - | 75 | 0.75 |
| 17 | cg07060505 | - | 75 | 0.8125 |
| 18 | cg26963090 | | 75 | 0.7734375 |
These 18 CpGs achieved an accuracy score of 75% or higher when used as the singular feature in the machine learning classifiers. Their accuracy scores and AUROC were averaged over the 8 independent folds. For each fold, the machine learning classifiers were retrained and accuracy was computed on hidden test data.
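The fold-averaged accuracy and AUROC described above can be sketched in plain Python. This is a minimal illustration, not the study's pipeline: the helper names (`accuracy`, `auroc`, `average_over_folds`), the label coding (1 = allergic, 0 = sensitized), and the toy fold data are all assumptions. AUROC is computed here as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
from statistics import mean

def accuracy(y_true, y_pred):
    """Fraction of test samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auroc(y_true, scores):
    """Probability that a positive sample outscores a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_over_folds(folds):
    """folds: list of (y_true, y_pred, scores) tuples, one per hidden test fold.

    Returns (mean accuracy in percent, mean AUROC).
    """
    accs = [accuracy(t, p) for t, p, _ in folds]
    aucs = [auroc(t, s) for t, _, s in folds]
    return 100 * mean(accs), mean(aucs)

# Two hypothetical folds; all values are made up for illustration.
toy_folds = [
    ([1, 1, 0, 0], [1, 1, 0, 0], [0.9, 0.8, 0.3, 0.2]),
    ([1, 1, 0, 0], [1, 0, 1, 0], [0.9, 0.2, 0.8, 0.3]),
]
avg_acc, avg_auc = average_over_folds(toy_folds)
```

In the study's setting the classifier would be retrained for each fold before its hidden test cohort is scored; the sketch covers only the averaging step.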
Fig 2. Distribution of methylation values for cg06410630 and cg06669701.
Fig 3. Average accuracy by combining multiple independent classifiers through a simple voting scheme.
The graph shows the average accuracy achieved by combining classifiers with one to three CpG features through a majority voting scheme. Although individual single-CpG classifiers have lower average accuracy than classifiers with more CpGs, an ensemble of 29 or more single-CpG classifiers achieved perfect classification and outperformed ensembles of classifiers with more features.
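A majority voting scheme of the kind described can be sketched as follows. This is an illustrative toy, not the authors' code: the function name `majority_vote`, the label coding (1 = allergic, 0 = sensitized), and the three hypothetical classifiers are assumptions. Each toy classifier is only 75% accurate on its own, yet their errors fall on different samples, so the vote recovers every label.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one list by majority vote.

    predictions: one list of predicted labels per classifier, all the same
    length. Using an odd number of binary classifiers avoids ties.
    """
    n_samples = len(predictions[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(clf[i] for clf in predictions)
        voted.append(votes.most_common(1)[0][0])
    return voted

y_true = [1, 1, 0, 0]
preds = [
    [1, 1, 0, 1],  # hypothetical classifier A: wrong on the last sample
    [1, 0, 0, 0],  # hypothetical classifier B: wrong on the second sample
    [0, 1, 0, 0],  # hypothetical classifier C: wrong on the first sample
]
ensemble = majority_vote(preds)
```

The gain depends on the classifiers making largely independent errors; classifiers that fail on the same samples would not benefit from voting in this way.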
Fig 4. Distribution of machine learning classifier types for single-CpG feature models.
The MLP was selected most frequently in the single-input case (53%), followed by Logistic Regression (30%), Decision Trees (10%), and Radial Basis Functions (7%). As the number of features per model increased, the MLP classifiers tended to further dominate the classifier selection process, with 86.67% of the twelve-feature classifiers attaining highest cross-validation accuracy with the MLP.
Classifier statistics based on number of input features.
| Features | Best Accuracy Score | Average Score Top 5 | Average AUROC Top 5 | Best Ensemble Accuracy | Steady-State Ensemble Accuracy |
|---|---|---|---|---|---|
| 1 | 84.375 | 80 | 0.8031 | | |
| 2 | 87.5 | 86.25 | 0.9086 | 95.31 | 95.31 |
| 3 | 90.625 | 90.625 | 0.92815 | 92.1875 | 92.1875 |
| 4 | 93.75 | 93.75 | 0.9375 | 95.3125 | 95.3125 |
| 5 | 96.875 | 95.625 | 0.9468 | 96.875 | 96.875 |
| 6 | 95.3125 | 94.6875 | 0.9796875 | 96.875 | 96.875 |
| 7 | 96.875 | 96.25 | 0.9875 | 100 | 96.875 |
| 8 | 96.875 | 96.875 | 0.9890625 | 100 | 96.875 |
| 9 | 98.4375 | 97.5 | 0.99375 | 98.4375 | 98.4375 |
| 10 | 96.875 | 97.8125 | 0.996875 | 100 | 98.4375 |
| 11 | 98.4375 | 98.4375 | 0.9984375 | 98.4375 | 96.875 |
| 12 | 99.0625 |
The table shows the average 8-fold hidden accuracy (accuracy score) achieved by the best classifier for the given number of features. The third and fourth columns show the average accuracy score and AUROC for the top five classifiers, where each classifier has a different feature set. The fifth column shows the best score achieved by combining multiple independent classifiers via a simple voting scheme, and the sixth shows the steady-state (converging) accuracy score achieved by this combination after using 29+ independent classifiers.
Fig 5. Best accuracy on hidden data and average AUROC as a function of the number of features.
The bar graph shows the average accuracy on the hidden data achieved by the best individual classifier for a given number of CpG features, while the line graph shows the best average AUROC.
Top classifiers using twelve features averaged across 8 independent folds.
| Number | CpG | Average Accuracy | AUROC |
|---|---|---|---|
| 1 | cg06410630, cg10461264, cg06116095, cg06628000, cg26963090, cg18988685, cg02788266, cg03068039, cg19287711, cg24616138, cg07060505, | 100% | 1 |
| 2 | cg06410630, cg10461264, cg06116095, cg06628000, cg26963090, cg18988685, cg02788266, cg03068039, cg19287711, cg24616138, cg07060505, | 100% | 1 |
Eleven of the twelve CpGs were common for the two cases; cg00936790 and cg07033513 were the two CpGs that differed. Perfect classification, averaged on the eight completely hidden test cohorts, was achieved.
Fig 6. CpGs that appear in the classifiers with the highest accuracy for a given number of features.
A shaded box indicates that the CpG appeared in the feature list of one of the best classifiers for that number of features. Because multiple combinations of CpGs sometimes achieved the same accuracy, the number of shaded boxes may exceed the number of features.
CpGs and associated genes from top 12-CpG classifiers.
| Number | CpG | Frequency | Gene | Gene description | Identified by Martino et al. [14] |
|---|---|---|---|---|---|
| 1 | cg06410630 | 26 | | Ring finger protein 213 | Yes |
| 2 | cg06628000 | 26 | | Seryl-tRNA Synthetase | Yes |
| 3 | cg03068039 | 26 | | Zinc Finger Protein 252, | No |
| 4 | cg10461264 | 26 | - | | No |
| 5 | cg18988685 | 26 | - | | No |
| 6 | cg02788266 | 25 | | ATP Binding Cassette | No |
| 7 | cg26963090 | 22 | | TIMP Metallopeptidase | Yes |
| 8 | cg19287711 | 22 | - | | No |
| 9 | cg00939931 | 21 | | MAF BZIP Transcription | Yes |
| 10 | cg25890092 | 17 | | CD7 Molecule | Yes |
| 11 | cg07060505 | 16 | - | | No |
| 12 | cg06116095 | 13 | | Pannexin 1 | Yes |
| 13 | cg24616138 | 13 | | C-Terminal Binding | Yes |
| 14 | cg14414100 | 8 | | Solute Carrier Family 24 | No |
| 15 | cg07033513 | 8 | - | | No |
| 16 | cg27027230 | 7 | | AT-Rich Interaction | No |
| 17 | cg00936790 | 7 | | Kinesin Family | No |
| 18 | cg06669701 | 3 | | Coiled-Coil Serine | No |
This table shows the frequency, associated genes, and gene descriptions of the 18 unique CpGs obtained from the 26 twelve-feature classifiers. The frequency shows the number of times each CpG was used across the 26 classifiers. Interestingly, seven of the thirteen genes identified in this study appeared in previous work conducted by Martino et al. 2015 [14]. The two pseudogenes [40], ZNF252 and TMED10P, are counted as a single gene, resulting in a 13-gene signature.
Fig 7. Plot of methylation values for cg06628000 versus cg06410630 for allergy and sensitized samples.
The o markings denote allergy samples, while the * markings denote sensitized samples. There is some overlap in the middle region, while most other samples can be differentiated.
Average hidden data accuracy across a large number of dataset permutations.
| Number | Signature | n | Average Accuracy | AUROC | 95% CI for Accuracy |
|---|---|---|---|---|---|
| 1 | 12-CpG #1 | 200 | 95.313 | 0.98328 | (94.175, 96.451) |
| 2 | 12-CpG #2 | 200 | 95.625 | 0.98531 | (94.483, 96.767) |
| 3 | 18-CpG | 200 | 93.438 | 0.98047 | (92.216, 94.734) |
This table shows the average accuracy and AUROC across n randomized hidden test cohorts. The 95% Confidence Interval for accuracy is also shown and provides an estimate for the true population accuracy of each classifier on similar cohorts of patients.
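The source does not state how the 95% confidence interval was computed. A common choice for a mean over n = 200 repeated runs is the normal approximation, mean ± 1.96 × s / √n, where s is the sample standard deviation of the per-run accuracies. The sketch below assumes that formula; the function name `mean_ci` and the toy accuracy values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci(values, z=1.96):
    """Mean and normal-approximation confidence interval for the mean.

    z = 1.96 corresponds to a 95% interval. This formula is an assumption;
    the source does not say which interval method was used.
    """
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, (m - half_width, m + half_width)

# 200 made-up per-permutation accuracies (percent), for illustration only.
accs = [90.0, 100.0] * 100
avg, (lo, hi) = mean_ci(accs)
```

Under this approximation the interval is symmetric about the mean, and its width shrinks with the square root of the number of permutations.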
Gene Ontology enrichment analysis.
| Number | GO Annotation Data Set | Concept Number (Homo sapiens) |
|---|---|---|
| 1 | Biological process | 3250 |
| 2 | Molecular function | No statistically significant results |
| 3 | Cellular component | No statistically significant results |
The 13-gene signature mapped to 3250 GO biological-process concepts, while there were no statistically significant matches for the molecular function and cellular component GO concepts. This match is based on the GO Ontology database released on 2018-12-01 and was produced with the GO Enrichment Analysis Tool [51].
Gene Ontology terms summarization using clustering by Revigo [39].
| Number | Representative Term | GO Term (GO ID) | Uniqueness |
|---|---|---|---|
| 1 | anatomical structure development | aging (GO:0007568) | 0.781 |
| | | anatomical structure development (GO:0048856) | 0.781 |
| 2 | biosynthesis | biosynthetic process (GO:0009058) | 0.946 |
| 3 | | catabolic process (GO:0009056) | 0.936 |
| 4 | | cellular amino acid metabolic process (GO:0006520) | 0.757 |
| | | cell-cell signaling (GO:0007267) | 0.813 |
| | | cell cycle (GO:0007049) | 0.813 |
| | | mitotic cell cycle (GO:0000278) | 0.836 |
| | | small molecule metabolic process (GO:0044281) | 0.858 |
| 5 | | cell proliferation (GO:0008283) | 0.894 |
| 6 | | extracellular matrix organization (GO:0030198) | 0.762 |
| | | cellular component assembly (GO:0022607) | 0.762 |
| | | cytoskeleton organization (GO:0007010) | 0.777 |
| 7 | growth | growth (GO:0040007) | 0.944 |
| 8 | | homeostatic process (GO:0042592) | 0.924 |
| 9 | | immune system process (GO:0002376) | 0.944 |
| 10 | locomotion | locomotion (GO:0040011) | 0.944 |
| 11 | neurological system process | neurological system process (GO:0050877) | 0.944 |
| 12 | protein targeting | cell motility (GO:0048870) | 0.767 |
| | | transport (GO:0006810) | 0.847 |
| | | transmembrane transport (GO:0055085) | 0.848 |
| | | vesicle-mediated transport (GO:0016192) | 0.865 |
| | | protein targeting (GO:0006605) | 0.869 |
| 13 | reproduction | reproduction (GO:0000003) | 1 |
| 14 | | signal transduction (GO:0007165) | 0.778 |
| | | response to stress (GO:0006950) | 0.911 |
| 15 | symbiosis, encompassing | symbiosis, encompassing mutualism | 0.944 |
| 16 | tRNA metabolism | translation (GO:0006412) | 0.827 |
| | | cellular protein modification process (GO:0006464) | 0.853 |
| | | cellular nitrogen compound metabolic process (GO:0034641) | 0.862 |
| | | tRNA metabolic process (GO:0006399) | 0.868 |
The 37 GO terms were clustered into 16 representative terms using Revigo [38][39]. The concepts are sorted alphabetically using the representative terms. The GO terms within each representative term are sorted based on uniqueness, where smaller values denote higher uniqueness. The bolded representative terms have been known to be associated with the immune system.