| Literature DB >> 31270349 |
Brian Kegerreis1, Michelle D Catalina1, Prathyusha Bachali1, Nicholas S Geraci1, Adam C Labonte1, Chen Zeng2, Nathaniel Stearrett3, Keith A Crandall3, Peter E Lipsky1, Amrie C Grammer4.
Abstract
The integration of gene expression data to predict systemic lupus erythematosus (SLE) disease activity is a significant challenge because of the high degree of heterogeneity among patients and study cohorts, especially those collected on different microarray platforms. Here we deployed machine learning approaches to integrate gene expression data from three SLE data sets and used it to classify patients as having active or inactive disease as characterized by standard clinical composite outcome measures. Both raw whole blood gene expression data and informative gene modules generated by Weighted Gene Co-expression Network Analysis from purified leukocyte populations were employed with various classification algorithms. Classifiers were evaluated by 10-fold cross-validation across three combined data sets or by training and testing in independent data sets, the latter of which amplified the effects of technical variation. A random forest classifier achieved a peak classification accuracy of 83 percent under 10-fold cross-validation, but its performance could be severely affected by technical variation among data sets. The use of gene modules rather than raw gene expression was more robust, achieving classification accuracies of approximately 70 percent regardless of how the training and testing sets were formed. Fine-tuning the algorithms and parameter sets may generate sufficient accuracy to be informative as a standalone estimate of disease activity.Entities:
Mesh:
Year: 2019 PMID: 31270349 PMCID: PMC6610624 DOI: 10.1038/s41598-019-45989-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Data sources for active (SLEDAI ≥ 6) and inactive (SLEDAI < 6) SLE WB gene expression.
| Accession | Microarray Platform | N Active | N Inactive | SLEDAI Range | SLEDAI Mean (SD) |
|---|---|---|---|---|---|
| GSE39088 | GPL570 (Affymetrix HG-U133 + 2.0) | 24 | 13 | 2–12 | 6.8 (2.7) |
| GSE45291 | GPL13158 (Affymetrix HG-U133 + PM) | 35 | 35 | 0–11 | 4.3 (3.5) |
| GSE49454 | GPL10558 (Illumina HumanHT-12 v4.0) | 23 | 26 | 0–26 | 7.7 (7.2) |
Data sets are listed by Gene Expression Omnibus (GEO) accession numbers. N Active/Inactive: number of active/inactive patients in data set. Range, mean, and standard deviation of SLEDAI values in each data set are provided.
Adjusted Rand Index of Unsupervised Hierarchical Clustering Compared to Known Disease Activity.
| Adjusted Rand Index | |
|---|---|
| GSE39088 | −0.04 |
| GSE39088; FDR < 0.2 | 0.19 |
| GSE39088; FDR < 0.05 | N/A |
| GSE45291 | 0.03 |
| GSE45291; FDR < 0.2 | −0.01 |
| GSE45291; FDR < 0.05 | 0.94 |
| GSE49454 | 0.04 |
| GSE49454; FDR < 0.2 | 0.14 |
| GSE49454; FDR < 0.05 | 0.14 |
| All Studies | 0.03 |
| All Studies; Three Consistent DE Genes | 0.05 |
Data sets are listed by GEO accession numbers. GSE39088 had no genes with FDR < 0.05. The “Three Consistent DE Genes” are DNAJC13, IRF4, and RPL22.
Figure 1Heatmaps of −log10(overlap p values) from RRHO. Strongest overlaps near the center of each plot indicate weak agreement among the most significantly upregulated and downregulated genes from each data set. Strong agreement between data sets should form a diagonal from the bottom-left corner to the top-right corner.
Figure 2Clustering all three studies on three consistent DE genes. DNAJC13, IRF4, and RPL22 were consistently differentially expressed in each study yet fail to fully separate active from inactive patients. Orange bars denote active patients; black bars denote inactive patients. Blue, yellow, and red bars denote patients from GSE39088, GSE45291, and GSE49454, respectively.
Cell module correlations to disease activity and functional analysis.
| Cell Type | Module Name | Module Size | Correlation with SLEDAI | Top GO Biological Process | Top BIG-C Category |
|---|---|---|---|---|---|
| CD4 | Floralwhite | 237 | 0.81 | type I interferon signaling pathway | Interferon-Stimulated-Genes |
| Turquoise | 805 | 0.50 | positive reg of ubiquitin-protein ligase | Proteasome | |
| Orangered4 | 237 | −0.77 | translational initiation | mRNA-Translation | |
| CD14 | Plum1 | 247 | 0.47 | ubiquitin-dependent protein catabolic process | mRNA-Translation |
| Yellow | 356 | 0.65 | type I interferon signaling pathway | Interferon-Stimulated-Genes | |
| Greenyellow | 89 | −0.49 | transcription from RNA polymerase II promoter | General-Transcription | |
| Pink | 261 | −0.77 | protein phosphorylation | Endosome-and-Vesicles | |
| Purple | 124 | −0.66 | inositol phosphate metabolic process | Fatty-Acid-Biosynthesis | |
| Sienna3 | 222 | −0.64 | translational initiation | mRNA-Translation | |
| CD19 | Darkolivegreen | 591 | 0.78 | cell division | Proteasome |
| Greenyellow | 251 | 0.66 | Notch signaling pathway | mRNA-Translation | |
| Steelblue | 146 | 0.65 | gluconeogenesis | Glycolysis-Gluconeogenesis | |
| Turquoise | 572 | 0.50 | ER to Golgi vesicle-mediated transport | Unfolded-Protein-and-Stress | |
| Violet | 566 | 0.61 | mitochondrial respiratory chain complex I | Interferon-Stimulated-Genes | |
| Brown | 620 | −0.62 | regulation of transcription, DNA-templated | Chromatin-Remodeling | |
| Green | 541 | −0.49 | transcription, DNA-templated | Transcription-Factors | |
| Skyblue | 756 | −0.74 | viral transcription | mRNA-Translation | |
| CD33 | Royalblue | 94 | 0.60 | positive reg of cytosolic calcium ions | Transposon-Control |
| Sienna3 | 133 | 0.76 | type I interferon signaling pathway | Interferon-Stimulated-Genes | |
| Violet | 177 | 0.79 | defense response to virus | Interferon-Stimulated-Genes | |
| Darkmagenta | 273 | −0.49 | ubiquinone biosynthetic process | MHC-Class-TWO | |
| LDG+ | LDG_A | 334 | 0.79 | platelet degranulation | Cytoskeleton |
| LDG_B | 92 | 0.81 | regulation of transcription | Secreted-Immune | |
| LDG_C | 82 | −0.39 | viral process | Nucleus-and-Nucleolus | |
| PC* | PC_Up | 423 | N/A | protein N-linked glycosylation | Endoplasmic-Reticulum |
| PC_Down | 183 | N/A | antigen processing and presentation MHC II | MHC-Class-TWO |
Information on cell modules including number of genes, Pearson correlation coefficient to SLEDAI, and functional analysis. +LDG modules were generated by WGCNA meta-analysis, and r values indicate separation from control and SLE neutrophils as SLEDAI was unavailable. *PC modules are based solely on differential expression. LDG: low-density granulocyte; PC: plasma cell.
Figure 3Cellular gene modules provide the basis for machine learning predictions of SLE activity. GSVA was performed on three SLE WB datasets using 25 WGCNA modules made from purified SLE cells with correlation or published relationship to SLEDAI (See Table 2). Orange: active patient; black: inactive patient. LDG: low-density granulocyte; PC: plasma cell.
Assessment of WGCNA module relationships with SLE disease activity in WB.
| Spearman correlation to SLEDAI | Active vs. Inactive t-test | ||||
|---|---|---|---|---|---|
| rho | p value | t statistic | p value | d | |
| CD4_Floralwhite |
| 3.90E-06 |
| 2.40E-06 | 0.788 |
| CD4_Turquoise | −0.044 | 0.587 | −0.93 | 0.352 | −0.149 |
| CD4_Orangered4 |
| 2.21E-07 |
| 4.35E-07 | −0.853 |
| CD14_Plum1 | 0.010 | 0.904 | −0.35 | 0.729 | −0.054 |
| CD14_Yellow |
| 4.93E-06 |
| 4.44E-06 | 0.761 |
| CD14_Greenyellow | −0.132 | 0.100 |
| 0.037 | −0.339 |
| CD14_Pink | −0.026 | 0.751 | 0.13 | 0.894 | 0.021 |
| CD14_Purple | −0.149 | 0.064 | −1.65 | 0.101 | −0.263 |
| CD14_Sienna3 |
| 2.27E-06 |
| 1.62E-06 | −0.799 |
| CD19_Darkolivegreen | 0.020 | 0.809 | −0.06 | 0.953 | −0.010 |
| CD19_Greenyellow |
| 0.016 |
| 0.012 | 0.403 |
| CD19_Steelblue | 0.016 | 0.838 | 0.55 | 0.580 | 0.089 |
| CD19_Turquoise | −0.069 | 0.393 | −0.84 | 0.403 | −0.132 |
| CD19_Violet | −0.087 | 0.282 | −1.48 | 0.141 | −0.236 |
| CD19_Brown | −0.050 | 0.537 | −1.04 | 0.301 | −0.164 |
| CD19_Green | −0.150 | 0.062 |
| 0.040 | −0.330 |
| CD19_Skyblue |
| 0.010 |
| 0.020 | −0.378 |
| CD33_Royalblue |
| 8.99E-05 |
| 1.03E-04 | 0.637 |
| CD33_Sienna3 |
| 3.41E-06 |
| 6.15E-06 | 0.753 |
| CD33_Violet |
| 4.15E-05 |
| 2.46E-05 | 0.696 |
| CD33_Darkmagenta |
| 6.74E-03 |
| 0.021 | −0.369 |
| LDG_A | −0.044 | 0.588 | −0.25 | 0.802 | −0.040 |
| LDG_B |
| 5.71E-03 |
| 0.019 | 0.377 |
| PC_Up |
| 9.75E-04 |
| 1.61E-03 | 0.508 |
| PC_Down | 0.022 | 0.781 | 0.80 | 0.426 | 0.129 |
Statistics on WGCNA module relationships with SLEDAI and active disease. Correlation to SLEDAI was done by Spearman rank correlation, and the relationship with active versus inactive disease was assessed by Welch’s unequal variances t-test and Cohen’s d. Significant results are bolded (p < 0.05). LDG: low-density granulocyte; PC: plasma cell.
Figure 4Individual WGCNA modules are ineffective at separating active and inactive SLE subjects. GSVA enrichment scores for (a) CD4_Floralwhite and (b) CD4_Orangered4 in SLE WB are unable to fully separate active patients from inactive patients. Asterisks denote significant differences by Welch’s t-test. Error bars indicate mean ± standard deviation.
Figure 5Performance of machine learning classifiers across three independent data sets. Classifiers were trained on the data sets listed across the top and evaluated in the data sets listed across the bottom. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.
Figure 6Area under the ROC curve of machine learning classifiers across three independent data sets. Classifiers were trained on the data sets listed across the top and tested in the other two data sets. Data sets are listed by their GEO accession numbers. Expression (black): gene expression data. WGCNA (blue): module enrichment scores.
Classification metrics of machine learning classifiers.
| 10-fold CV | Trained on GSE39088 | Trained on GSE45291 | Trained on GSE49454 | ||||||
|---|---|---|---|---|---|---|---|---|---|
| Expression | WGCNA | Expression | WGCNA | Expression | WGCNA | Expression | WGCNA | ||
| GLM | Accuracy | 0.80 | 0.72 | 0.51 | 0.56 | 0.57 | 0.56 | 0.63 | 0.63 |
| Sensitivity | 0.78 | 0.73 | 0.86 | 0.79 | 0.51 | 0.60 | 0.54 | 0.59 | |
| Specificity | 0.82 | 0.70 | 0.18 | 0.34 | 0.64 | 0.51 | 0.73 | 0.67 | |
| AUC | 0.84 | 0.73 | 0.62 | 0.65 | 0.68 | 0.55 | 0.63 | 0.69 | |
| Kappa | 0.60 | 0.43 | 0.04 | 0.14 | 0.15 | 0.11 | 0.26 | 0.26 | |
| PPV | 0.83 | 0.73 | 0.50 | 0.53 | 0.63 | 0.60 | 0.71 | 0.69 | |
| NPV | 0.77 | 0.70 | 0.58 | 0.64 | 0.52 | 0.51 | 0.56 | 0.57 | |
| KNN | Accuracy | 0.75 | 0.70 | 0.50 | 0.70 | 0.49 | 0.70 | 0.51 | 0.72 |
| Sensitivity | 0.66 | 0.72 | 0.59 | 0.83 | 0.23 | 0.68 | 0.31 | 0.68 | |
| Specificity | 0.85 | 0.68 | 0.41 | 0.57 | 0.79 | 0.72 | 0.77 | 0.77 | |
| AUC | 0.82 | 0.74 | 0.54 | 0.71 | 0.58 | 0.75 | 0.63 | 0.70 | |
| Kappa | 0.50 | 0.40 | 0.00 | 0.40 | 0.03 | 0.40 | 0.07 | 0.44 | |
| PPV | 0.83 | 0.71 | 0.49 | 0.65 | 0.58 | 0.74 | 0.62 | 0.78 | |
| NPV | 0.69 | 0.68 | 0.51 | 0.78 | 0.46 | 0.65 | 0.47 | 0.66 | |
| RF | Accuracy | 0.83 | 0.72 | 0.45 | 0.63 | 0.47 | 0.63 | 0.61 | 0.66 |
| Sensitivity | 0.83 | 0.77 | 0.86 | 0.91 | 0.53 | 0.62 | 0.54 | 0.61 | |
| Specificity | 0.82 | 0.68 | 0.07 | 0.36 | 0.38 | 0.64 | 0.69 | 0.73 | |
| AUC | 0.89 | 0.77 | 0.69 | 0.73 | 0.58 | 0.68 | 0.65 | 0.74 | |
| Kappa | 0.65 | 0.45 | −0.07 | 0.27 | −0.08 | 0.26 | 0.22 | 0.33 | |
| PPV | 0.84 | 0.72 | 0.47 | 0.58 | 0.51 | 0.67 | 0.68 | 0.73 | |
| NPV | 0.81 | 0.72 | 0.33 | 0.81 | 0.41 | 0.58 | 0.55 | 0.60 | |
Training sets are listed by their GEO accession numbers. Test accuracy was determined by testing the classifiers on the other two data sets. Expression: gene expression data. WGCNA: module enrichment scores. AUC: area under the receiver operating characteristic curve. Kappa: Cohen’s kappa coefficient. PPV: positive predictive value. NPV: negative predictive value.
Figure 7Random forest classifier reveals variable importance of genes and modules. (a) Variable importance of top 25 individual genes as determined by mean decrease in Gini impurity. (b) Variable importance of cell modules. (c) As many modules shared genes, modules were deduplicated to determine the effects on the random forest classifier. The relative importance of the full modules and deduplicated modules was strongly correlated (Spearman’s rho = 0.69, p = 1.94E-4). LDG: low-density granulocyte; PC: plasma cell.