Sadia Akter1, Dong Xu1,2,3, Susan C Nagel4, John J Bromfield4, Katherine Pelch4, Gilbert B Wilshire5, Trupti Joshi1,3,6.
Abstract
Endometriosis is a common, complex, and poorly understood gynecological disorder that affects about 176 million women worldwide, significantly impairs their quality of life, and imposes a substantial economic burden. Neither a definitive clinical symptom nor a minimally invasive diagnostic method is available, which leads to an average diagnostic latency of 4 to 11 years. Over the last several decades, machine learning tools have advanced the discovery of relevant biological patterns from microarray expression and next-generation sequencing (NGS) data. We performed machine learning analysis using 38 RNA-seq and 80 enrichment-based DNA methylation (MBD-seq) datasets. We evaluated how well various supervised machine learning methods, including decision tree, partial least squares discriminant analysis (PLSDA), support vector machine, and random forest, classify endometriosis versus control samples when trained on transcriptomics and methylomics data. We assessed two ways of improving classification performance: (a) the effect of three different normalization techniques and (b) the effect of differential analysis using the generalized linear model (GLM). Several candidate biomarker genes were identified by multiple machine learning experiments, including NOTCH3, SNAPC2, B4GALNT1, SMAP2, DDB2, GTF3C5, and PTOV1 from the transcriptomics data analysis and TRPM6, RASSF2, TNIP2, RP3-522J7.6, FGD3, and MFSD14B from the methylomics data analysis. We concluded that an appropriate machine learning diagnostic pipeline for endometriosis should use TMM normalization for transcriptomics data, quantile or voom normalization for methylomics data, and GLM for feature space reduction and classification performance maximization.
Keywords: DNA methylation; RNA-seq; classification; endometriosis; machine learning; methylomics; transcriptomics; translational bioinformatics
Year: 2019 PMID: 31552087 PMCID: PMC6737999 DOI: 10.3389/fgene.2019.00766
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Inclusion and exclusion criteria.
| Group | Inclusion criteria | Exclusion criteria |
|---|---|---|
| Controls | Age 18 to 49 years | Visual observation of lesions |
| Endometriosis | Age 18 to 49 years | Diagnostic laparoscopy without visual observation of endometriotic lesions |
Machine learning experimental approach using decision tree.
| Experimental name | Normalization | GLM | Decision tree |
|---|---|---|---|
| TMM + Decision Tree | TMM | | X |
| qNorm + Decision Tree | qNorm | | X |
| vNorm + Decision Tree | vNorm | | X |
| TMM + GLM + Decision Tree | TMM | X | X |
| qNorm + GLM + Decision Tree | qNorm | X | X |
| vNorm + GLM + Decision Tree | vNorm | X | X |
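The table above names three normalization schemes (TMM, quantile, and voom). As a rough illustration of the quantile step only, the following pure-Python sketch forces every sample onto a shared reference distribution. The function name and the toy matrix are hypothetical, ties are broken by row order rather than averaged, and this is not the authors' implementation, which the record does not describe.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a features x samples matrix given as a list of rows.

    Every sample (column) is mapped onto the same reference distribution:
    the mean of the k-th smallest value across all samples.
    Assumption of this sketch: ties are broken by row order, not averaged.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Reference distribution: mean of the sorted values, rank by rank.
    sorted_cols = [sorted(col) for col in columns]
    reference = [sum(col[k] for col in sorted_cols) / n_cols for k in range(n_rows)]

    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        # Row indices ordered by their value in this sample (stable sort).
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


# Hypothetical toy matrix: 4 features (rows) x 3 samples (columns).
toy_counts = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(toy_counts)
```

After normalization, every column of `normalized` contains the same set of values, so between-sample differences in overall distribution are removed while each feature keeps its within-sample rank.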
Decision tree models using transcriptomics data.
| Gene feature set | Experiment name | Tree model |
|---|---|---|
| All | TMM + Decision Tree | NOTCH3 <= 0.3994181: endometriosis (10) |
| All | qNorm + Decision Tree | NOTCH3 <= 1.710526: endometriosis (13/1) |
| All | vNorm + Decision Tree | NOTCH3 <= -0.05049461: endometriosis (13/1) |
| All | **TMM + GLM + Decision Tree** | NOTCH3 <= 0.3994181: endometriosis (10) |
| All | qNorm + GLM + Decision Tree | NOTCH3 <= 1.684211: endometriosis (13/1) |
| All | vNorm + GLM + Decision Tree | NOTCH3 <= -0.05049461: endometriosis (13/1) |
| Protein-Coding | TMM + Decision Tree | NOTCH3 <= 1.644335: endometriosis (10) |
| Protein-Coding | qNorm + Decision Tree | NOTCH3 <= 1.894737: endometriosis (13/1) |
| Protein-Coding | vNorm + Decision Tree | NOTCH3 <= 1.293087: endometriosis (13/1) |
| Protein-Coding | **TMM + GLM + Decision Tree** | NOTCH3 <= 1.641844: endometriosis (10) |
| Protein-Coding | qNorm + GLM + Decision Tree | NOTCH3 <= 1.815789: endometriosis (13/1) |
| Protein-Coding | vNorm + GLM + Decision Tree | NOTCH3 <= 1.293087: endometriosis (13/1) |
The best model in each subgroup of experiment is presented in bold text.
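The rule strings in the table above appear to follow the C5.0-style leaf notation, in which `(13/1)` means that 13 training cases reached the leaf and 1 of them was misclassified, while `(10)` means 10 cases with no errors. The sketch below parses one such flattened rule into its parts; the regular expression and the `parse_rule` helper are hypothetical conveniences for reading the table, not part of the original analysis.

```python
import re

# One flattened rule line, e.g. "NOTCH3 <= 1.710526: endometriosis (13/1)".
RULE_PATTERN = re.compile(
    r"^\s*(?P<feature>\S+?)\s*(?P<op><=|>)\s*(?P<threshold>-?\d+(?:\.\d+)?)\s*:\s*"
    r"(?P<label>\w+)\s*\((?P<cases>\d+)(?:/(?P<errors>\d+))?\)\s*$"
)


def parse_rule(text):
    """Split a single-leaf rule into feature, operator, threshold, label, and counts."""
    match = RULE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a single-leaf rule: {text!r}")
    return {
        "feature": match.group("feature"),
        "op": match.group("op"),
        "threshold": float(match.group("threshold")),
        "label": match.group("label"),
        "cases": int(match.group("cases")),
        # A leaf written as "(10)" has no misclassified training cases.
        "errors": int(match.group("errors") or 0),
    }


rule = parse_rule("NOTCH3 <= 1.710526: endometriosis (13/1)")
clean_leaf = parse_rule("NOTCH3 <= 0.3994181: endometriosis (10)")
```

Note that the thresholds differ between rows because each normalization method puts NOTCH3 expression on a different scale, so they cannot be compared across experiments.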
Figure 1. Gene interaction network among the genes from the decision tree models using the protein-coding genes of transcriptomics data. Black circles denote the candidate biomarker genes, and gray circles denote the GeneMANIA-predicted genes. Blue and green edges represent co-expression (network weight 99.70) and genetic interactions (network weight 0.33), respectively.
Figure 2. Gene tier plot from Biosigner using transcriptomics data from the experiments using (A) all genes and (B) protein-coding genes only. S, signature genes; A–E, tiers ordered from higher (A) to lower (E).
Candidate biomarker genes from transcriptomics analysis.
| Experiment name | Gene names | Gene names |
|---|---|---|
| TMM + Decision Tree | | |
| qNorm + Decision Tree | | |
| vNorm + Decision Tree | | |
| **TMM + GLM + Decision Tree** | | |
| qNorm + GLM + Decision Tree | | |
| vNorm + GLM + Decision Tree | | |
| Biosigner (PLSDA) | | |
| Biosigner (Random Forest) | | |
| Biosigner (SVM) | No signature or A-tier genes were found in the final model | |
Performance measures using transcriptomics data by leave-one-out cross-validation.
| Gene feature set | Experiment name | Accuracy | Sensitivity | Specificity | Precision | F1 score | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| All | TMM + Decision Tree | 0.711 | 0.688 | 0.727 | 0.647 | 0.667 | 0.412 | 0.665 |
| All | qNorm + Decision Tree | 0.184 | 0.000 | 0.318 | 0.000 | NA | −0.689 | 0.239 |
| All | vNorm + Decision Tree | 0.553 | 0.188 | 0.818 | 0.429 | 0.261 | 0.007 | 0.205 |
| All | **TMM + GLM + Decision Tree** | | | | | | | |
| All | qNorm + GLM + Decision Tree | 0.842 | 0.750 | 0.909 | 0.857 | 0.800 | 0.675 | 0.820 |
| All | vNorm + GLM + Decision Tree | 0.684 | 0.375 | 0.909 | 0.750 | 0.500 | 0.344 | 0.810 |
| All | **Biosigner (PLSDA)** | | | | | | | |
| All | Biosigner (Random Forest) | 0.447 | 0.455 | 0.438 | 0.526 | 0.488 | −0.107 | NA |
| All | Biosigner (SVM) | 0.553 | 0.636 | 0.438 | 0.609 | 0.622 | 0.075 | NA |
| Protein-Coding | TMM + Decision Tree | 0.711 | 0.625 | 0.773 | 0.667 | 0.645 | 0.402 | 0.611 |
| Protein-Coding | qNorm + Decision Tree | 0.421 | 0.125 | 0.636 | 0.200 | 0.154 | −0.268 | 0.554 |
| Protein-Coding | vNorm + Decision Tree | 0.263 | 0.125 | 0.364 | 0.125 | 0.125 | −0.511 | 0.239 |
| Protein-Coding | **TMM + GLM + Decision Tree** | | | | | | | |
| Protein-Coding | qNorm + GLM + Decision Tree | 0.763 | 0.563 | 0.909 | 0.818 | 0.667 | 0.513 | 0.577 |
| Protein-Coding | vNorm + GLM + Decision Tree | 0.763 | 0.563 | 0.909 | 0.818 | 0.667 | 0.513 | 0.573 |
| Protein-Coding | **Biosigner (PLSDA)** | | | | | | | |
| Protein-Coding | Biosigner (Random Forest) | 0.447 | 0.500 | 0.375 | 0.524 | 0.512 | −0.124 | NA |
| Protein-Coding | Biosigner (SVM) | 0.605 | 0.591 | 0.625 | 0.684 | 0.634 | 0.213 | NA |
The best model with corresponding performance measures in each subgroup of experiment is presented in bold text.
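The columns in the table above can all be derived from a single confusion matrix. The sketch below computes them with the usual definitions, treating endometriosis as the positive class. The counts in the two examples are hypothetical back-calculations, not figures reported in the source: they were chosen because they reproduce the "qNorm + GLM + Decision Tree" and "qNorm + Decision Tree" rows for all genes, under the assumption of 16 endometriosis and 22 control samples.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute the table's performance measures from confusion-matrix counts.

    Undefined ratios (zero denominators) are returned as None, which mirrors
    the "NA" entries in the table.
    """
    def ratio(num, den):
        return num / den if den else None

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)

    # F1 is the harmonic mean of precision and sensitivity.
    if precision is None or sensitivity is None or precision + sensitivity == 0:
        f1 = None
    else:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)

    # Matthews correlation coefficient.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ratio(tp * tn - fp * fn, mcc_den)

    return {
        "accuracy": ratio(tp + tn, tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts consistent with "qNorm + GLM + Decision Tree" (all genes).
good = classification_metrics(tp=12, fn=4, tn=20, fp=2)
# Hypothetical counts consistent with "qNorm + Decision Tree" (all genes).
poor = classification_metrics(tp=0, fn=16, tn=7, fp=15)
```

The second example shows why an F1 score can be "NA": with no true positives, both precision and sensitivity are zero and their harmonic mean is undefined. MCC stays defined and is strongly negative, which is why it is a useful complement to accuracy on small, imbalanced cohorts.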
Figure 3. Performance comparisons using both protein-coding and nonprotein-coding genes (“all genes”) of the transcriptomics dataset.
Figure 4. Performance comparisons using the protein-coding genes of the transcriptomics dataset.
Decision tree models using methylomics data.
| Experiment Name | Tree Model |
|---|---|
| TMM + Decision Tree | chr2_147728001_147729000 <= 1.207401: :...chr10_132354001_132355000 <= 1.41709: endometriosis (2) : chr10_132354001_132355000 > 1.41709: control (22) chr2_147728001_147729000 > 1.207401: :...chr1_35106001_35107000 <= 1.102261: :...chr1_20862001_20863000 <= 0.1675461: endometriosis (2) : chr1_20862001_20863000 > 0.1675461: control (11) chr1_35106001_35107000 > 1.102261: :...chr22_16562001_16563000 <= 0.286556: control (2) chr22_16562001_16563000 > 0.286556: endometriosis (38) |
| qNorm + Decision Tree | chr9_94372001_94373000 > 5.356569: :...chr1_3182001_3183000 <= 5.492207: endometriosis (2) : chr1_3182001_3183000 > 5.492207: control (22) chr9_94372001_94373000 <= 5.356569: :...chr1_2908001_2909000 > 5.999371: control (7) chr1_2908001_2909000 <= 5.999371: :...chr16_37922001_37923000 <= 4.049325: control (5) chr16_37922001_37923000 > 4.049325: endometriosis (41/1) |
| vNorm + Decision Tree | chr9_94372001_94373000 > 1.435922: :...chr1_3182001_3183000 <= 1.56803: endometriosis (2) : chr1_3182001_3183000 > 1.56803: control (22) chr9_94372001_94373000 <= 1.435922: :...chr1_2908001_2909000 > 2.063281: control (7) chr1_2908001_2909000 <= 2.063281: :...chr16_37922001_37923000 <= 0.1332516: control (5) chr16_37922001_37923000 > 0.1332516: endometriosis (41/1) |
| TMM + GLM + Decision Tree | chr9_92948001_92949000 > 0.1864191: control (19/1) chr9_92948001_92949000 <= 0.1864191: :...chr2_9142001_9143000 > 0.515642: control (8) chr2_9142001_9143000 <= 0.515642: :...chr4_2757001_2758000 > 0.9228122: control (5) chr4_2757001_2758000 <= 0.9228122: :...chr22_49841001_49842000 <= 1.252199: endometriosis (41/1) chr22_49841001_49842000 > 1.252199: control (4/1) |
| **qNorm + GLM + Decision Tree** | chr9_74884001_74885000 <= 4.534801: :...chr20_4827001_4828000 <= 4.961341: endometriosis (30) : chr20_4827001_4828000 > 4.961341: control (5/1) chr9_74884001_74885000 > 4.534801: :...chr10_71353001_71354000 > 4.124296: control (29/1) chr10_71353001_71354000 <= 4.124296: :...chr20_44466001_44467000 <= 5.021993: endometriosis (10) chr20_44466001_44467000 > 5.021993: control (3) |
| **vNorm + GLM + Decision Tree** | chr9_74884001_74885000 <= 0.280641: :...chr20_4827001_4828000 <= 0.7003891: endometriosis (30) : chr20_4827001_4828000 > 0.7003891: control (5/1) chr9_74884001_74885000 > 0.280641: :...chr10_71353001_71354000 > -0.1261655: control (29/1) chr10_71353001_71354000 <= -0.1261655: :...chr20_44466001_44467000 <= 0.7601537: endometriosis (10) chr20_44466001_44467000 > 0.7601537: control (3) |
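Because the tree models in the table above are flattened onto one line, their nesting is easier to follow as explicit rules. The sketch below writes the "TMM + GLM + Decision Tree" model as a chain of threshold tests, using the thresholds and region IDs from that row. The nesting is one reading of the flattened text, and the function name, input format (a dict of TMM-normalized values keyed by region ID), and example samples are hypothetical.

```python
def classify_tmm_glm_tree(sample):
    """Apply the TMM + GLM + Decision Tree rules to one sample.

    `sample` maps a 1-kb region ID to its normalized methylation value.
    The tree is a single chain: each test either exits to "control" or
    falls through to the next region.
    """
    if sample["chr9_92948001_92949000"] > 0.1864191:
        return "control"        # leaf: control (19/1)
    if sample["chr2_9142001_9143000"] > 0.515642:
        return "control"        # leaf: control (8)
    if sample["chr4_2757001_2758000"] > 0.9228122:
        return "control"        # leaf: control (5)
    if sample["chr22_49841001_49842000"] <= 1.252199:
        return "endometriosis"  # leaf: endometriosis (41/1)
    return "control"            # leaf: control (4/1)


# Hypothetical samples, invented only to exercise two of the branches.
low_everywhere = {
    "chr9_92948001_92949000": 0.0,
    "chr2_9142001_9143000": 0.0,
    "chr4_2757001_2758000": 0.0,
    "chr22_49841001_49842000": 0.5,
}
high_chr9 = dict(low_everywhere, **{"chr9_92948001_92949000": 0.9})

label_low = classify_tmm_glm_tree(low_everywhere)
label_high = classify_tmm_glm_tree(high_chr9)
```

Read this way, the model predicts endometriosis only when all four regions fall at or below their thresholds. The leaf counts in the comments sum to 77 training cases.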
Methylated regions of interest (MROI) and candidate biomarker genes from methylomics analysis.
| Experiment name | Methylated regions of interest (MROI) | Nearby gene names |
|---|---|---|
| TMM + Decision Tree | chr2_147728001_147729000, chr10_132354001_132355000, chr1_35106001_35107000, chr1_20862001_20863000, chr22_16562001_16563000 | Not found |
| qNorm + Decision Tree | chr9_94372001_94373000, chr1_3182001_3183000, chr1_2908001_2909000, chr16_37922001_37923000 | |
| vNorm + Decision Tree | chr9_94372001_94373000, chr1_3182001_3183000, chr1_2908001_2909000, chr16_37922001_37923000 | |
| TMM + GLM + Decision Tree | chr9_92948001_92949000, chr2_9142001_9143000, chr4_2757001_2758000, chr22_49841001_49842000 | |
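| **qNorm + GLM + Decision Tree** | chr9_74884001_74885000, chr20_4827001_4828000, chr10_71353001_71354000, chr20_44466001_44467000 | |
| **vNorm + GLM + Decision Tree** | chr9_74884001_74885000, chr20_4827001_4828000, chr10_71353001_71354000, chr20_44466001_44467000 | |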
| Biosigner (PLSDA) | chr7_5111001_5112000, chr5_29429001_29430000, chr22_49841001_49842000 | |
| Biosigner (Random Forest) | chr11_2027001_2028000, chr2_147728001_147729000, chr9_74884001_74885000 | |
| Biosigner (SVM) | chr18_17526001_17527000, chr4_186970001_186971000, chr4_189277001_189278000 | Not found |
The best model is presented in bold text.
Figure 5. Gene interaction network among the genes from the decision tree models using the methylomics data. The blue, green, orange, teal, purple, yellow, and gray edges represent various GeneMANIA networks: physical interaction network (weight 67.64), co-expression network (weight 13.50), predicted functional relationships between genes (weight 6.35), co-localization network (weight 6.17), pathway network (weight 4.35), genetic interaction network (weight 1.40), and shared protein domain network (weight 0.59), respectively.
Figure 6. Methylated region tier plot from Biosigner using methylomics data. S, signature genes; A–E, tiers ordered from higher (A) to lower (E).
Performance measures using methylomics data by leave-one-out cross-validation.
| Experiment name | Accuracy | Sensitivity | Specificity | Precision | F1 score | MCC | AUC |
|---|---|---|---|---|---|---|---|
| **TMM + Decision Tree** | | | | | | | |
| qNorm + Decision Tree | 0.364 | 0.405 | 0.314 | 0.415 | 0.410 | −0.280 | 0.199 |
| vNorm + Decision Tree | 0.403 | 0.405 | 0.400 | 0.447 | 0.425 | −0.194 | 0.233 |
| TMM + GLM + Decision Tree | 0.714 | 0.714 | 0.714 | 0.750 | 0.732 | 0.427 | 0.679 |
| **qNorm + GLM + Decision Tree** | | | | | | | |
| **vNorm + GLM + Decision Tree** | | | | | | | |
| **Biosigner (PLSDA)** | | | | | | | |
| Biosigner (Random Forest) | 0.429 | 0.314 | 0.524 | 0.355 | 0.333 | −0.164 | |
| Biosigner (SVM) | 0.519 | 0.400 | 0.619 | 0.467 | 0.431 | 0.019 | |
The best model with corresponding performance measures in each subgroup of experiment is presented in bold text.
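The performance measures above come from leave-one-out cross-validation: each sample is held out once, a model is trained on the rest, and the held-out sample is then predicted. The sketch below shows that loop in pure Python. A one-feature decision stump stands in for the tree learners, and the toy values and labels are invented for illustration; none of this reproduces the authors' models or data.

```python
def fit_stump(xs, ys):
    """Fit a one-split classifier: pick the threshold and orientation
    with the fewest training errors. Candidate thresholds are midpoints
    between consecutive distinct training values."""
    distinct = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])] or [distinct[0]]
    orientations = (("endometriosis", "control"), ("control", "endometriosis"))
    best = None
    for t in candidates:
        for low_label, high_label in orientations:
            errors = sum(
                (low_label if x <= t else high_label) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, low_label, high_label)
    return best[1:]


def predict_stump(model, x):
    threshold, low_label, high_label = model
    return low_label if x <= threshold else high_label


def leave_one_out_predictions(xs, ys):
    """Hold out each sample in turn, train on the rest, predict the held-out one."""
    predictions = []
    for i in range(len(xs)):
        model = fit_stump(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(predict_stump(model, xs[i]))
    return predictions


# Hypothetical, cleanly separable toy data for a single feature.
toy_values = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
toy_labels = ["endometriosis"] * 4 + ["control"] * 4
loo_preds = leave_one_out_predictions(toy_values, toy_labels)
loo_accuracy = sum(p == y for p, y in zip(loo_preds, toy_labels)) / len(toy_labels)
```

Because the model is refit for every held-out sample, any feature selection step has to be repeated inside the loop as well. Otherwise the held-out sample influences which features are chosen and the estimate becomes optimistic.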
Figure 7. Performance comparisons using the methylomics dataset.