Justin Gerolami, Justin Jong Mun Wong, Ricky Zhang, Tong Chen, Tashifa Imtiaz, Miranda Smith, Tamara Jamaspishvili, Madhuri Koti, Janice Irene Glasgow, Parvin Mousavi, Neil Renwick, Kathrin Tyryshkin.
Abstract
Complex high-dimensional datasets that are challenging to analyze are frequently produced through '-omics' profiling. Typically, these datasets contain more genomic features than samples, limiting the use of multivariable statistical and machine learning-based approaches to analysis. Therefore, effective alternative approaches are urgently needed to identify features-of-interest in '-omics' data. In this study, we present the molecular feature selection tool, a novel, ensemble-based, feature selection application for identifying candidate biomarkers in '-omics' data. As proof-of-principle, we applied the molecular feature selection tool to identify a small set of immune-related genes as potential biomarkers of three prostate adenocarcinoma subtypes. Furthermore, we tested the selected genes in a model to classify the three subtypes and compared the results to models built using all genes and all differentially expressed genes. Genes identified with the molecular feature selection tool performed better than the other models in this study in all comparison metrics: accuracy, precision, recall, and F1-score using a significantly smaller set of genes. In addition, we developed a simple graphical user interface for the molecular feature selection tool, which is available for free download. This user-friendly interface is a valuable tool for the identification of potential biomarkers in gene expression datasets and is an asset for biomarker discovery studies.
Keywords: RNA-Seq; big data analysis; biomarker; feature selection; prostate adenocarcinoma
Year: 2022 PMID: 36010347 PMCID: PMC9407361 DOI: 10.3390/diagnostics12081997
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
The feature selection algorithms used in MFeaST.
| Feature Selection Type | Univariable | Multivariable |
|---|---|---|
| Filter Type | Mutual information score | |
| Wrapper Type | Support vector machine | |
| Embedded Type | Treebagger predictor importance | Decision tree with bagging |
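The idea of combining several feature-selection algorithms into one ranking can be illustrated with a minimal sketch. It is not the MFeaST implementation: only the mutual information score comes from the table above, the one-threshold classifier is a simple stand-in for a wrapper-type method (MFeaST lists a support vector machine), the embedded-type method is omitted, and aggregating by mean rank is an assumption because the aggregation rule is not described here. The gene names and values are invented toy data.

```python
import math
from collections import Counter
from statistics import median


def mutual_information(values, labels):
    """Filter-type score: mutual information (bits) between a
    median-binarized feature and the class labels."""
    cut = median(values)
    bins = [v > cut for v in values]
    n = len(labels)
    joint = Counter(zip(bins, labels))
    p_bin, p_lab = Counter(bins), Counter(labels)
    return sum(
        (c / n) * math.log2(c * n / (p_bin[b] * p_lab[l]))
        for (b, l), c in joint.items()
    )


def threshold_accuracy(values, labels):
    """Stand-in wrapper-type score (NOT the SVM listed for MFeaST):
    best accuracy of a one-threshold classifier on exactly two classes."""
    low, high = sorted(set(labels))
    best = 0.0
    for cut in sorted(set(values)):
        hits = sum((high if v > cut else low) == l for v, l in zip(values, labels))
        acc = hits / len(labels)
        best = max(best, acc, 1 - acc)  # allow either orientation of the cut
    return best


def competition_ranks(scores):
    """Rank 1 is the highest score; tied scores share a rank."""
    return [1 + sum(other > s for other in scores) for s in scores]


def ensemble_ranking(features, labels, scorers):
    """Order feature names by mean rank across scorers (assumed aggregation)."""
    names = list(features)
    rank_sums = [0] * len(names)
    for scorer in scorers:
        scores = [scorer(features[name], labels) for name in names]
        for i, r in enumerate(competition_ranks(scores)):
            rank_sums[i] += r
    mean_ranks = {name: rank_sums[i] / len(scorers) for i, name in enumerate(names)}
    return sorted(names, key=mean_ranks.get), mean_ranks


# Hypothetical toy data: 4 samples of subtype "A" followed by 4 of subtype "B".
labels = ["A"] * 4 + ["B"] * 4
toy = {
    "geneX": [1, 2, 1.5, 2.5, 8, 9, 8.5, 9.5],  # separates the subtypes
    "geneY": [1, 8, 2, 9, 1.5, 8.5, 2.5, 9.5],  # uninformative
    "geneZ": [1, 2, 3, 7, 4, 8, 9, 10],         # partially informative
}
ranking, mean_ranks = ensemble_ranking(
    toy, labels, [mutual_information, threshold_accuracy]
)
print(ranking, mean_ranks)
```

Working on ranks rather than raw scores keeps methods with different score scales comparable, which is one common reason to aggregate an ensemble this way.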
The clinicopathologic statistics of important prognostic markers in prostate adenocarcinoma.
| | Basal | Luminal A | Luminal B | Total | χ2 | D.o.F | p |
|---|---|---|---|---|---|---|---|
| | n = 88 | n = 64 | n = 144 | n = 296 | | | |
| Yes | 14 (16%) | 3 (5%) | 24 (17%) | 41 (14%) | 5.77 | 2 | 0.056 |
| No | 74 (84%) | 61 (95%) | 120 (83%) | 255 (86%) | | | |
| | n = 61 | n = 36 | n = 99 | n = 296 | | | |
| Yes | 8 (13%) | 6 (17%) | 14 (14%) | 28 (9%) | 0.24 | 2 | 0.888 |
| No | 53 (87%) | 30 (83%) | 85 (86%) | 168 (57%) | | | |
| | n = 80 | n = 64 | n = 132 | n = 276 | | | |
| Yes | 9 (11%) | 8 (13%) | 25 (19%) | 42 (15%) | 2.76 | 2 | 0.252 |
| No | 71 (89%) | 56 (88%) | 107 (81%) | 234 (85%) | | | |
| Histological type | n = 102 | n = 72 | n = 166 | n = 340 | | | |
| Acinar | 101 (99%) | 72 (100%) | 158 (95%) | 331 (97%) | 6.10 * | 2 | 0.047 |
| Other | 1 (1%) | 0 (0%) | 8 (5%) | 9 (3%) | | | |
| Grade | n = 99 | n = 71 | n = 164 | n = 334 | | | |
| GG1 | 10 (10%) | 8 (11%) | 11 (7%) | 29 (9%) | 21.221 | 6 | 0.020 |
| GG2 | 35 (35%) | 28 (39%) | 31 (19%) | 94 (28%) | | | |
| GG3 | 17 (17%) | 14 (20%) | 30 (18%) | 61 (18%) | | | |
| GG4 + GG5 | 37 (37%) | 21 (30%) | 92 (56%) | 150 (45%) | | | |
| Pathological stage | n = 100 | n = 69 | n = 166 | n = 335 | | | |
| pT2 | 41 (41%) | 35 (51%) | 50 (30%) | 126 (38%) | 9.515 | 2 | 0.009 |
| pT3 + pT4 | 59 (59%) | 34 (49%) | 116 (70%) | 209 (62%) | | | |
| | n = 83 | n = 62 | n = 151 | n = 296 | | | |
| pN0 | 73 (88%) | 52 (84%) | 114 (75%) | 239 (81%) | 5.84 | 2 | 0.054 |
| pN1 | 10 (12%) | 10 (16%) | 37 (25%) | 57 (19%) | | | |

| | Basal med | Basal n | Basal ran | Luminal A med | Luminal A n | Luminal A ran | Luminal B med | Luminal B n | Luminal B ran | Total med | Total n | Total ran | χ2 | D.o.F | p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age at diagnosis | 62 | 102 | 41–77 | 63 | 72 | 46–75 | 62 | 166 | 46–78 | 62 | 340 | 41–78 | 0.17 | 2 | 0.918 |
| PSA | 0.1 | 90 | 0–37.36 | 0.1 | 68 | 0–13.95 | 0.1 | 143 | 0–39.80 | 0.1 | 301 | 0–39.80 | 1.93 | 2 | 0.381 |
Relevant clinicopathologic data collected by the TCGA study were compared between molecular subtypes using the Chi-square or Kruskal–Wallis test where appropriate. Statistically significant differences between molecular subtypes were found for grade (p = 0.020) and pathological stage (p = 0.009). Differences in histological type could not be assessed because one or more categories had expected counts <5. Abbreviations: degrees of freedom (D.o.F), median (med), range (ran). * denotes that one or more cells in the category have expected counts <5.
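As a worked illustration of the Chi-square comparison described in the note, the sketch below recomputes the statistic for the first Yes/No rows of the table (14, 3, 24 versus 74, 61, 120) and checks the expected counts for the Acinar/Other rows. It is a plain-Python sketch rather than the authors' analysis code; the function names are hypothetical, and the closed-form p-value used here is valid only for 2 degrees of freedom.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square test of independence for an r x c table of counts.

    Returns the statistic, the degrees of freedom, and the expected counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof, expected


def p_value_two_dof(chi2):
    """Upper-tail p-value; exp(-x / 2) is exact only when D.o.F = 2."""
    return math.exp(-chi2 / 2)


# Columns: Basal, Luminal A, Luminal B. Rows: Yes, No (first block of the table).
first_block = [[14, 3, 24], [74, 61, 120]]
chi2, dof, _ = chi_square_independence(first_block)
p = p_value_two_dof(chi2)
print(f"chi2 = {chi2:.2f}, D.o.F = {dof}, p = {p:.3f}")

# Rows: Acinar, Other. Small expected counts make the approximation unreliable.
histology = [[101, 72, 158], [1, 0, 8]]
chi2_hist, dof_hist, expected_hist = chi_square_independence(histology)
smallest_expected = min(min(row) for row in expected_hist)
print(f"chi2 = {chi2_hist:.2f}, smallest expected count = {smallest_expected:.2f}")
```

The second call shows why the histological type comparison is flagged: every expected count in the "Other" row falls below 5, so the Chi-square approximation is not dependable for that category.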
Figure 1. Hierarchical clustering analysis performed using top-ranked genes. The top 10% highest ranking MFeaST-selected genes were used to cluster (A) Luminal A and Basal samples; (B) Luminal B and Basal samples; (C) Luminal A and Luminal B samples. Unsupervised hierarchical clustering with average linkage was performed on the log2 transformed expression values. The data were median-centered for proper visualization of the heatmap. Spearman correlation was used as a similarity measure between samples. The top 10% highest ranking genes are listed in Supplementary Table S3 up to the horizontal double line.
Figure 2. Hierarchical clustering analysis performed using differentially expressed genes. All differentially expressed genes (FDR ≤ 0.05) were used to cluster (A) Basal and Luminal A samples; (B) Basal and Luminal B samples; (C) Luminal A and Luminal B samples. Unsupervised hierarchical clustering with average linkage was performed on the log2 transformed expression values. The data were median-centered for proper visualization of the heatmap. Spearman correlation was used as a similarity measure between samples.
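The clustering procedure named in the figure captions (log2 transformation, median-centering, Spearman correlation as the similarity measure, average linkage) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the pseudocount of 1 before the log2 transform, the choice to median-center each gene across samples, the use of 1 minus the correlation as the distance, and the toy expression values are all assumptions made for the example.

```python
import math
from statistics import median


def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def average_linkage(dist, n_clusters):
    """Naive agglomerative clustering; merges the pair of clusters with
    the smallest mean pairwise distance until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical counts: rows are genes, columns are samples S1..S4.
counts = [
    [100, 120, 10, 12],
    [80, 90, 8, 9],
    [5, 6, 60, 70],
    [7, 5, 90, 85],
    [30, 28, 33, 31],
]
log_expr = [[math.log2(v + 1) for v in row] for row in counts]   # assumed pseudocount
centered = [[v - median(row) for v in row] for row in log_expr]  # per-gene centering
profiles = [list(col) for col in zip(*centered)]                 # one vector per sample
dist = [[1 - spearman(a, b) for b in profiles] for a in profiles]
clusters = average_linkage(dist, n_clusters=2)
print(clusters)
```

A rank-based similarity such as Spearman correlation is insensitive to monotone rescaling of the expression values, which makes the grouping less dependent on the exact transformation applied beforehand.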
The classification results for the models built using the MFeaST-selected and differentially expressed genes.
| | MFeaST | Differential Expression | All Features |
|---|---|---|---|
| (A) Basal vs. Luminal A | | | |
| Number of genes | 33 | 295 | 574 |
| Accuracy | 81.08 ± 11.71 | 78.82 ± 9.77 | 77.12 ± 8.96 |
| Precision | 79.59 ± 19.84 | 78.03 ± 17.63 | 73.95 ± 12.80 |
| Recall | 82.86 ± 16.22 | 77.68 ± 13.81 | 75.00 ± 14.31 |
| F1-score | 78.86 ± 11.35 | 75.58 ± 8.80 | 73.09 ± 9.15 |
| (B) Basal vs. Luminal B | | | |
| Number of genes | 18 | 472 | 574 |
| Accuracy | 94.80 ± 2.58 | 92.91 ± 6.18 | 94.02 ± 6.61 |
| Precision | 96.56 ± 3.88 | 93.46 ± 7.36 | 93.53 ± 7.37 |
| Recall | 95.22 ± 3.75 | 95.74 ± 4.17 | 97.50 ± 4.37 |
| F1-score | 95.79 ± 2.07 | 94.42 ± 4.68 | 95.33 ± 5.03 |
| (C) Luminal A vs. Luminal B | | | |
| Number of genes | 15 | 328 | 574 |
| Accuracy | 95.36 ± 3.69 | 91.59 ± 5.20 | 92.45 ± 6.45 |
| Precision | 96.65 ± 5.18 | 93.35 ± 5.84 | 94.50 ± 6.36 |
| Recall | 96.99 ± 3.18 | 95.22 ± 6.09 | 95.26 ± 6.08 |
| F1-score | 96.70 ± 2.56 | 94.05 ± 3.68 | 94.66 ± 4.36 |
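For readers who want to see how the four comparison metrics relate to one another, the sketch below computes accuracy, precision, recall, and F1-score for a two-class problem and summarizes them as mean ± standard deviation in percent. It assumes the ± values in the table are standard deviations across cross-validation folds, which the table itself does not state; the choice of positive class, the fold contents, and the function names are likewise hypothetical.

```python
from statistics import mean, stdev


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1-score for one fold."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def summarize(folds, positive):
    """Mean and sample standard deviation (in percent) across folds."""
    per_fold = [binary_metrics(t, p, positive) for t, p in folds]
    summary = {}
    for name in per_fold[0]:
        values = [100 * m[name] for m in per_fold]
        summary[name] = (mean(values), stdev(values))
    return summary


# Two hypothetical folds of (true labels, predicted labels).
truth = ["Basal", "Basal", "Luminal A", "Luminal A"]
folds = [
    (truth, ["Basal", "Luminal A", "Luminal A", "Luminal A"]),
    (truth, ["Basal", "Basal", "Basal", "Luminal A"]),
]
fold_one = binary_metrics(*folds[0], positive="Basal")
summary = summarize(folds, positive="Basal")
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.2f} ± {sd:.2f}")
```

Because F1-score is the harmonic mean of precision and recall, a model can lead on F1-score while trailing on one of the two, which is worth keeping in mind when reading panel (B) of the table.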
Figure 3. The MFeaST GUI comprises several tabs that guide end-users through feature selection analysis. (A) Input data: The data can be imported and viewed. (B) Feature selection: The feature selection and cross-validation algorithms, and the number of iterations for the sequential algorithm, can be selected. (C) Results: A ranking of the features and a colored scatter plot based on two selected features are presented and can be downloaded. A final list of selected features can be created by selecting all, a top percentage, or a custom list of features. (D) Clustering: The selected features can be visualized using t-SNE and hierarchical clustering analysis.