Zariel I Johnson, Jacqueline D Jones, Angana Mukherjee, Dianxu Ren, Carol Feghali-Bostwick, Yvette P Conley, Cecelia C Yates.
Abstract
Progression of systemic scleroderma (SSc), a chronic connective tissue disease that causes a fibrotic phenotype, is highly heterogeneous among patients and difficult to accurately diagnose. To meet this clinical need, we developed a novel three-layer classification model, which analyses gene expression profiles from SSc skin biopsies to diagnose SSc severity. Two SSc skin biopsy microarray datasets were obtained from Gene Expression Omnibus. The skin scores obtained from the original papers were used to further categorize the data into subgroups of low (<18) and high (≥18) severity. Data were pre-processed for normalization, background correction, centering and scaling. A two-layered cross-validation scheme was employed to objectively evaluate the performance of classification models on unobserved data. Three classification models were used: support vector machine, random forest, and naive Bayes, in combination with feature selection methods to improve performance accuracy. For both input datasets, the random forest classifier combined with the correlation-based feature selection (CFS) method, and naive Bayes combined with CFS or the support vector machine-based recursive feature elimination method, yielded the best results. Additionally, we performed a principal component analysis to show that low and high severity groups are readily separable by gene expression signatures. Ultimately, we found that our novel classification prediction model produced global gene signatures that significantly correlated with skin scores. This study represents the first report comparing the performance of various classification prediction models for gene signatures from SSc patients, using current clinical diagnostic factors. In summary, our three-classification model system is a powerful tool for elucidating gene signatures from SSc skin biopsies and can also be used to develop a prognostic gene signature for SSc and other fibrotic disorders.
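The two-layered cross-validation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' pipeline: the univariate filter `rank_features` is a simple stand-in for the feature selection methods named in the abstract, the Gaussian naive Bayes is hand-written, the folds are not stratified, and all function names, the feature-count grid and the synthetic data are assumptions made for illustration. The point it shows is that feature selection and tuning are refit using training samples only, so the outer test fold stays unobserved.

```python
import numpy as np

def severity_labels(skin_scores, cutoff=18):
    """Label samples high (1) if skin score >= cutoff, else low (0)."""
    return (np.asarray(skin_scores) >= cutoff).astype(int)

def rank_features(X, y):
    """Stand-in filter (assumption): rank by |class-mean difference| / overall std."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return np.argsort(-np.abs(m1 - m0) / (X.std(axis=0) + 1e-9))

def fit_gnb(X, y):
    """Gaussian naive Bayes: per-class log prior, feature means and variances."""
    return [(np.log(np.mean(y == c)), X[y == c].mean(axis=0),
             X[y == c].var(axis=0) + 1e-6) for c in (0, 1)]

def predict_gnb(params, X):
    """Pick the class with the largest log prior plus Gaussian log likelihood."""
    scores = [lp - 0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
              for lp, mu, var in params]
    return np.argmax(np.column_stack(scores), axis=1)

def folds(n, k, rng):
    """Split a random permutation of n sample indices into k test folds."""
    return np.array_split(rng.permutation(n), k)

def fit_and_score(Xtr, ytr, Xte, yte, k_features):
    """Select features on training data only, then count correct test predictions."""
    top = rank_features(Xtr, ytr)[:k_features]
    model = fit_gnb(Xtr[:, top], ytr)
    return int((predict_gnb(model, Xte[:, top]) == yte).sum())

def cv_accuracy(X, y, k_features, n_folds, rng):
    """Inner layer: plain k-fold accuracy for one candidate feature count."""
    correct = 0
    for test in folds(len(y), n_folds, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        correct += fit_and_score(X[train], y[train], X[test], y[test], k_features)
    return correct / len(y)

def nested_cv(X, y, k_grid=(5, 10, 20), outer=5, inner=3, seed=0):
    """Outer layer: estimate accuracy on folds never used for selection or tuning."""
    rng = np.random.default_rng(seed)
    correct = 0
    for test in folds(len(y), outer, rng):
        train = np.setdiff1d(np.arange(len(y)), test)
        Xtr, ytr = X[train], y[train]
        best_k = max(k_grid, key=lambda k: cv_accuracy(Xtr, ytr, k, inner, rng))
        correct += fit_and_score(Xtr, ytr, X[test], y[test], best_k)
    return correct / len(y)

# Synthetic demonstration data (not from the paper): 60 samples, 200 features,
# of which the first 10 are shifted upward in the high severity group.
demo_rng = np.random.default_rng(1)
skin_scores = demo_rng.integers(5, 35, size=60)
y = severity_labels(skin_scores)
X = demo_rng.normal(size=(60, 200))
X[:, :10] += 2.0 * y[:, None]
acc = nested_cv(X, y)
print(f"nested CV accuracy on synthetic data: {acc:.2f}")
```

Running feature selection once on the full dataset before cross-validation would leak information from the test folds into the model, which is the bias a two-layered scheme is meant to avoid.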
Year: 2018 PMID: 29924864 PMCID: PMC6010260 DOI: 10.1371/journal.pone.0199314
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of patient information for microarray biopsy samples used in models.
| C | Dataset 1 | Dataset 2 |
|---|---|---|
| | 54 | 58 |
| | 14/40 | 14/44 |
| Race (W/AA/H/A) | 44/4/2/4 | 38/9/10/1 |
| | 52.3 ± 7.1 | 52.2 ± 10.7 |
| | 19.5 ± 2.7 | 14.9 ± 2.7 |
Means ± SEM are shown. W: White, AA: African American, H: Hispanic, A: Asian.
Performance evaluation of various classifier and feature selection methods.
| Dataset | Classifier | Feature Selection | Accuracy | Sensitivity | Specificity | MCC |
|---|---|---|---|---|---|---|
| Dataset 1 | SVM | All Features | 0.87 | 0.88 | 0.87 | 0.74 |
| Dataset 1 | SVM | Chi-Squared | 0.93 | 0.96 | 0.90 | 0.85 |
| Dataset 1 | SVM | CFS | 0.93 | 1.00 | 0.87 | 0.86 |
| Dataset 1 | SVM | SVM-RFE | 0.96 | 1.00 | 0.93 | 0.93 |
| Dataset 1 | SVM | RVFS | 0.89 | 0.88 | 0.90 | 0.78 |
| Dataset 1 | RF | All Features | 0.76 | 0.67 | 0.83 | 0.51 |
| Dataset 1 | RF | Chi-Squared | 0.93 | 0.92 | 0.93 | 0.85 |
| Dataset 1 | RF | CFS | 0.98 | 1.00 | 0.97 | 0.96 |
| Dataset 1 | RF | SVM-RFE | 0.94 | 0.92 | 0.97 | 0.89 |
| Dataset 1 | RF | RVFS | 0.94 | 0.96 | 0.93 | 0.89 |
| Dataset 1 | NB | All Features | 0.74 | 0.85 | 0.73 | 0.48 |
| Dataset 1 | NB | Chi-Squared | 0.89 | 0.83 | 0.93 | 0.78 |
| Dataset 1 | NB | CFS | 0.98 | 1.00 | 0.97 | 0.96 |
| Dataset 1 | NB | SVM-RFE | 0.98 | 1.00 | 0.97 | 0.96 |
| Dataset 1 | NB | RVFS | 0.91 | 0.92 | 0.90 | 0.81 |
| Dataset 2 | SVM | All Features | 0.84 | 0.62 | 0.97 | 0.66 |
| Dataset 2 | SVM | Chi-Squared | 0.91 | 0.90 | 0.92 | 0.82 |
| Dataset 2 | SVM | CFS | 0.95 | 1.00 | 0.92 | 0.90 |
| Dataset 2 | SVM | SVM-RFE | 1.00 | 1.00 | 1.00 | 1.00 |
| Dataset 2 | SVM | RVFS | 0.95 | 0.95 | 0.95 | 0.89 |
| Dataset 2 | RF | All Features | 0.72 | 0.29 | 0.97 | 0.38 |
| Dataset 2 | RF | Chi-Squared | 0.91 | 0.81 | 0.97 | 0.81 |
| Dataset 2 | RF | CFS | 1.00 | 1.00 | 1.00 | 1.00 |
| Dataset 2 | RF | SVM-RFE | 0.97 | 0.90 | 1.00 | 0.93 |
| Dataset 2 | RF | RVFS | 0.98 | 0.95 | 1.00 | 0.96 |
| Dataset 2 | NB | All Features | 0.71 | 0.43 | 0.86 | 0.33 |
| Dataset 2 | NB | Chi-Squared | 0.90 | 0.86 | 0.92 | 0.78 |
| Dataset 2 | NB | CFS | 1.00 | 1.00 | 1.00 | 1.00 |
| Dataset 2 | NB | SVM-RFE | 1.00 | 1.00 | 1.00 | 1.00 |
| Dataset 2 | NB | RVFS | 0.95 | 0.95 | 0.95 | 0.89 |
SVM: support vector machine, RF: random forest, NB: naive Bayes, CFS: correlation-based feature selection method, SVM-RFE: SVM-based recursive feature elimination method, RVFS: random forest-based backward feature elimination method, MCC: Matthews correlation coefficient.
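For reference, the four reported metrics can all be derived from a binary confusion matrix. The sketch below uses the standard definitions, treating high severity as the positive class (an assumption; the record does not state which class is positive). The counts passed in at the end are hypothetical and are not taken from the paper, which does not report confusion matrices here.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=6, fn=15, tn=36, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts, accuracy looks moderate because the negative class dominates, while sensitivity and MCC are both low. This is why MCC is a useful complement to accuracy when the severity groups are unbalanced.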
Number of features (microarray probe IDs) selected for each dataset and feature selection method.
| Dataset | SVM-RFE | Chi-Squared | CFS | RVFS |
|---|---|---|---|---|
| Dataset 1 | 450 | 84 | 84 | 23 |
| Dataset 2 | 50 | 89 | 89 | 9 |
SVM-RFE: SVM-based recursive feature elimination method, CFS: correlation-based feature selection method, RVFS: random forest-based backward feature elimination method.
Fig 1. Principal component analyses (PCA) of gene expression separation between low and high severity groups.
Results based on CFS feature selection method are shown in A (Dataset 1) and B (Dataset 2). Results based on SVM-RFE feature selection method are shown in C (Dataset 1) and D (Dataset 2).
Microarray probe IDs with the ten largest absolute loading values for principal component 1, based on principal component analysis of genes identified by Chi-squared feature selection.
| Dataset | Microarray Probe ID | Gene Symbol | Loading Value | Gene Name | Adjusted p-value |
|---|---|---|---|---|---|
| Dataset 1 | A_23_P73297 | MAGI1 | 0.813 | membrane associated guanylate kinase, WW and PDZ domain containing 1 | 1.56E-03 |
| Dataset 1 | A_32_P86578 | LOC389023 | -0.749 | uncharacterized LOC389023 | 1.75E-03 |
| Dataset 1 | A_23_P403398 | DKFZP586I1420 | 0.747 | uncharacterized protein DKFZp586I1420 | 1.11E-03 |
| Dataset 1 | A_23_P36531 | TSPAN8 | 0.745 | tetraspanin 8 | 1.42E-03 |
| Dataset 1 | A_23_P31996 | SLC46A2 | 0.744 | solute carrier family 46, member 2 | 5.80E-04 |
| Dataset 1 | A_23_P155441 | RFT1 | -0.743 | RFT1 homolog (S. cerevisiae) | 5.19E-03 |
| Dataset 1 | A_23_P85140 | TCEAL2 | 0.740 | transcription elongation factor A (SII)-like 2 | 4.19E-03 |
| Dataset 1 | A_23_P147647 | SGCD | -0.728 | sarcoglycan, delta (35kDa dystrophin-associated glycoprotein) | 4.40E-03 |
| Dataset 1 | A_23_P73220 | FGD6 | 0.727 | FYVE, RhoGEF and PH domain containing 6 | 2.13E-03 |
| Dataset 1 | A_23_P92552 | PET112 | 0.726 | PET112 homolog (yeast) | 1.11E-03 |
| Dataset 2 | ILMN_1812968 | SOX18 | 0.754 | SRY (sex determining region Y)-box 18 | 8.75E-06 |
| Dataset 2 | ILMN_1741688 | CPXM2 | 0.753 | carboxypeptidase X (M14 family), member 2 | 1.10E-05 |
| Dataset 2 | ILMN_2402766 | AFTPH | -0.750 | aftiphilin | 2.02E-03 |
| Dataset 2 | ILMN_1663618 | STAT3 | 0.742 | signal transducer and activator of transcription 3 (acute-phase response factor) | 3.23E-03 |
| Dataset 2 | ILMN_1676893 | ADCY3 | 0.737 | adenylate cyclase 3 | 5.26E-04 |
| Dataset 2 | ILMN_1786197 | NR2F1 | 0.737 | nuclear receptor subfamily 2, group F, member 1 | 9.06E-03 |
| Dataset 2 | ILMN_1821397 | N/A | -0.737 | N/A | 6.71E-03 |
| Dataset 2 | ILMN_1687840 | ABCB7 | -0.726 | ATP-binding cassette, sub-family B (MDR/TAP), member 7 | 8.12E-05 |
| Dataset 2 | ILMN_1786139 | VKORC1 | 0.726 | vitamin K epoxide reductase complex, subunit 1 | 3.83E-04 |
| Dataset 2 | ILMN_1785113 | MUT | -0.722 | methylmalonyl CoA mutase | 2.13E-02 |
Adjusted p-values are Bonferroni-corrected p-values from a t-test comparing normalized levels between low and high severity patient groups.
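One common way to obtain loading values of this kind is sketched below with NumPy. It assumes that a loading is the principal component's eigenvector entry scaled by the square root of its eigenvalue, computed on standardized features, which makes each loading equal to the correlation between that feature and the component. The record does not state which definition or software was used, so this convention, the function names and the synthetic data are all assumptions.

```python
import numpy as np

def pc1_loadings(X):
    """Loadings of principal component 1 for a samples-by-features matrix.

    Features are standardized first, so each loading is the correlation
    between the feature and the PC1 scores and lies in [-1, 1].
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalue = S[0] ** 2 / (Z.shape[0] - 1)   # variance explained by PC1
    return Vt[0] * np.sqrt(eigenvalue)

def top_loadings(loadings, names, k=10):
    """Return the k (name, loading) pairs with the largest absolute loading."""
    order = np.argsort(-np.abs(loadings))[:k]
    return [(names[i], float(loadings[i])) for i in order]

# Synthetic example (not the paper's data): 50 samples, 30 probes,
# where the first 5 probes share a common latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=50)
X = rng.normal(size=(50, 30))
X[:, :5] += 3.0 * latent[:, None]
names = [f"probe_{i}" for i in range(30)]  # hypothetical probe names

loadings = pc1_loadings(X)
top = top_loadings(loadings, names, k=5)
for name, value in top:
    print(f"{name}: {value:+.3f}")
```

The sign of a principal component is arbitrary, so only the relative signs among loadings are meaningful. This is why probes are ranked by absolute value.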
Fig 2. Heat map showing Log2 normalized expression values for patient samples from Dataset 1 for probe IDs identified by CFS feature selection method.
Fig 3. Heat map showing Log2 normalized expression values for patient samples from Dataset 2 for probe IDs identified by CFS feature selection method.
Fig 4. Schematic representation of pipeline for choosing genes that were included in Ingenuity Pathway Analysis.
Red font indicates numbers of genes more highly expressed by high severity patients; blue font indicates numbers of genes more highly expressed by low severity patients.
Results of IPA-based upstream regulator analysis showing potential role for OSM in regulating genes identified by CFS-based classification of Dataset 2 and differentially expressed between low and high severity patients.
| Probe ID | Genes in Dataset | Prediction of OSM Activation (based on measurement direction) | High vs. Low Severity Regulation | Evidence from Literature |
|---|---|---|---|---|
| ILMN_1715417 | SELP | Activated | Upregulated | Upregulated by OSM1 |
| ILMN_1720710 | HSPB3 | Activated | Downregulated | Downregulated by OSM1 |
| ILMN_1741021 | CH25H | Activated | Upregulated | Upregulated by OSM1 |
| ILMN_1720048 | CCL2 | Activated | Upregulated | Upregulated by OSM1 |
| ILMN_3250067 | ANGPT2 | Activated | Upregulated | Upregulated by OSM1 |
Fig 5. Predicted signaling network between OSM and downstream genes related to SSc severity.
Red shading of a gene indicates upregulation in the dataset compared to low severity patients, green shading indicates downregulation, and the intensity of the color depicts the strength of regulation. Relationships between genes that are predicted based on literature are indicated by lines connecting the genes, with red symbolizing predicted upregulation and blue symbolizing predicted downregulation.