Andrew Tran, Chris J Walsh, Jane Batt, Claudia C Dos Santos, Pingzhao Hu.
Abstract
BACKGROUND: Myopathies are a heterogeneous collection of disorders characterized by dysfunction of skeletal muscle. In practice, physicians encounter myopathies frequently, and precise diagnosis remains a challenge in primary care. Molecular expression profiles show promise for disease diagnosis in various pathologies. We propose a novel machine learning-based clinical tool for predicting muscle disease subtypes using multi-cohort microarray expression data.
Keywords: Biomarker; Clinical tool; Machine learning; Microarray; Muscle diseases
Year: 2020 PMID: 33256785 PMCID: PMC7708151 DOI: 10.1186/s12967-020-02630-3
Source DB: PubMed Journal: J Transl Med ISSN: 1479-5876 Impact factor: 5.531
Fig. 1 Model training and validation workflow. The original, augmented, and combined expression profile data are referred to as T0, T1, and T2, respectively. A training-test split of 2:1 was made for T0. The training set of T2 was used for feature selection and for training the support vector machine (SVM) classifier. The test set of T0 was used for making predictions and validating model performance, measured by the multiclass area under the receiver operating characteristic curve (AUC). This workflow was applied to three data augmentation strategies: (a) no class size adjustment, (b) sampling to the mean class size, and (c) sampling to twice the mean class size
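The class-size targets in strategies (b) and (c) can be illustrated with a short sketch. The caption does not say how the augmented samples in T1 were generated, so drawing existing samples with replacement is an assumption made here purely for illustration, and the function name `resample_to_target` is hypothetical.

```python
import random
from collections import defaultdict

def resample_to_target(samples, labels, multiplier=1.0, seed=0):
    """Resample every class to round(multiplier * mean class size).

    multiplier=1.0 mirrors strategy (b), multiplier=2.0 mirrors strategy (c).
    Assumption: samples are drawn with replacement from the original class
    members; the source does not specify the augmentation method.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    # Mean class size over the classes present in the data.
    mean_size = sum(len(v) for v in by_class.values()) / len(by_class)
    target = round(multiplier * mean_size)

    out_x, out_y = [], []
    for y, xs in by_class.items():
        out_x.extend(rng.choice(xs) for _ in range(target))
        out_y.extend([y] * target)
    return out_x, out_y
```

Because every class is brought to the same target, the total sample count equals the multiplier times the original count whenever the mean class size is an integer, which is consistent with the N = 1260 and N = 2520 reported for strategies (b) and (c).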
Fig. 2 SVM model performance as a function of the number of top genes, using an augmentation strategy of (a) no data augmentation (N = 1260), (b) sampling to the mean class size (N = 1260), or (c) sampling to twice the mean class size (N = 2520). Model performance was averaged over 30 iterations. In strategy (a), the model was trained using the training set of the original data T0 with a 2:1 training-test split stratified by class. In strategies (b) and (c), the model was trained using the augmented training set T2 with a 2:1 training-test split stratified by class. Performance was measured by multiclass AUC in the test set of T0. The p-values in the bottom panel indicate the enrichment of the gene signature in muscle-specific genes
Fig. 3 Disease-specific ROC curves for model discrimination using the top 500 genes. Classes were balanced using an augmentation strategy of sampling to twice the mean class size (N = 2520). Optimal thresholds were determined using Youden’s J statistic and are indicated on each disease-specific ROC curve by crosses
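Youden’s J statistic is J = sensitivity + specificity − 1, and the optimal threshold is the one that maximises J along the ROC curve. The sketch below is a minimal illustration of that rule, assuming binary labels (one disease versus the rest, which is an assumption about how the disease-specific curves were framed) and a "score at or above threshold means positive" convention; the function name `youden_threshold` is hypothetical.

```python
def youden_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) maximising Youden's J.

    labels: 1 for the disease of interest, 0 for all other samples.
    A sample is called positive when its score is >= threshold.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")

    best = None
    # Every distinct score is a candidate threshold.
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        sens = tp / pos
        spec = tn / neg
        j = sens + spec - 1
        # Keep the first threshold that attains the largest J.
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]
```

With perfectly separated scores such as `[0.1, 0.2, 0.3, 0.4, 0.8, 0.9]` and labels `[0, 0, 0, 1, 1, 1]`, the function selects the lowest positive score, 0.4, where both sensitivity and specificity equal 1.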
Disease-specific 95% confidence intervals for the AUC using the top 500 genes
| Diseases | AUC (95% CI) | Specificity | Sensitivity | Precision | F1 score |
|---|---|---|---|---|---|
| Control | 0.861 (0.826–0.895) | 0.814 | 0.747 | 0.883 | 0.810 |
| Chronic | 0.872 (0.824–0.920) | 0.708 | 0.851 | 0.957 | 0.901 |
| Congenital | 0.848 (0.805–0.892) | 0.805 | 0.776 | 0.900 | 0.833 |
| IM | 0.794 (0.713–0.876) | 0.585 | 0.883 | 0.951 | 0.916 |
| ICUAW | 0.777 (0.668–0.887) | 0.812 | 0.614 | 0.988 | 0.758 |
| Immobility | 0.789 (0.716–0.861) | 0.850 | 0.598 | 0.974 | 0.741 |
IM: inflammatory myositis; ICUAW: intensive care unit acquired weakness; CI: confidence interval. Classes were balanced using an augmentation strategy of sampling to twice the mean class size (N = 2520). Confidence intervals were generated using 2000 stratified bootstrap replicates. Optimal thresholds were determined using Youden’s J statistic. Specificity, sensitivity, precision, and F1 score were calculated at the optimal threshold
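A stratified bootstrap resamples cases and non-cases separately, so each replicate keeps the original class proportions. The sketch below shows one way such a percentile interval for the AUC could be computed; it is not the authors' code. The percentile method, the one-versus-rest binary labels, and the function names `auc` and `stratified_bootstrap_ci` are all assumptions made for illustration.

```python
import random

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_bootstrap_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the AUC, resampling positives and negatives separately."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]

    stats = []
    for _ in range(n_boot):
        # Draw with replacement within each class (the stratification step).
        bp = rng.choices(pos, k=len(pos))
        bn = rng.choices(neg, k=len(neg))
        stats.append(auc(bp + bn, [1] * len(bp) + [0] * len(bn)))

    stats.sort()
    lower = stats[round(alpha / 2 * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The tabulated F1 scores can also be checked against the reported precision and sensitivity with F1 = 2 × precision × sensitivity / (precision + sensitivity); for the Chronic class, 2 × 0.957 × 0.851 / (0.957 + 0.851) ≈ 0.901, matching the table.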
Fig. 4 Enrichment map of the biological processes related to the top 500 genes identified by ANOVA and included in the (a) congenital and (b) immobility disease classifiers. Circles are referred to as “nodes” and the connectors as “edges”. Nodes represent specific biological pathways and node size represents the number of genes in the pathway. Edges connecting adjacent nodes represent overlapping pathways and edge width represents gene overlap size. The node colour represents enrichment score. Nodes that are blue are upregulated (enrichment score greater than zero) and nodes that are red are downregulated (enrichment score less than zero). The nodes are clustered into general functional groups