Milos Radovic, Mohamed Ghalwash, Nenad Filipovic, Zoran Obradovic.
Abstract
BACKGROUND: Feature selection, which aims to identify the subset of features relevant for predicting a response from a possibly large feature set, is an important preprocessing step in machine learning. In gene expression studies this is not a trivial task for several reasons, including the potentially temporal character of the data. However, most feature selection approaches developed for microarray data cannot handle multivariate temporal data without first flattening it, which discards temporal information. We propose a temporal minimum redundancy - maximum relevance (TMRMR) feature selection approach that handles multivariate temporal data without flattening. In the proposed approach, the relevance of a gene is computed by averaging F-statistic values calculated at individual time steps, and the redundancy between genes is computed using dynamic time warping.
Keywords: Feature selection; Gene expression; Temporal data
Year: 2017 PMID: 28049413 PMCID: PMC5209828 DOI: 10.1186/s12859-016-1423-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Data flattening commonly used as a preprocessing step to the mRMR
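To make the flattening step concrete, here is a minimal Python sketch. It assumes the data are stored as nested lists indexed `data[sample][gene][time]`; that layout and the numbers are illustrative and not taken from the paper. After flattening, every gene-time pair becomes an independent column, so a non-temporal method such as mRMR can no longer tell which values belong to the same series.

```python
def flatten_temporal(data):
    """Flatten data[sample][gene][time] into one flat row per sample."""
    return [
        [value for gene_series in sample for value in gene_series]
        for sample in data
    ]

# Two samples, two genes, three time points (made-up numbers).
data = [
    [[1.0, 2.0, 3.0], [0.5, 0.4, 0.3]],
    [[1.1, 2.1, 3.1], [0.6, 0.5, 0.4]],
]
flat = flatten_temporal(data)

# Each sample now has genes * time points = 6 unordered columns.
print(len(flat), len(flat[0]))
```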
Fig. 2 The proposed approach for calculation of relevance and redundancy for temporal data
Fig. 3 Pseudo code of TMRMR-M and TMRMR-C feature selection algorithms
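The two ingredients named in the abstract, relevance as a time-averaged F-statistic and redundancy based on dynamic time warping (DTW), can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pseudo code. The data layout `data[sample][gene][time]`, the conversion of a DTW distance into a redundancy score via `1 / (1 + distance)`, and the quotient criterion used in the greedy step are all choices made here. The sketch does not attempt to reproduce the difference between TMRMR-C and TMRMR-M.

```python
def f_statistic(values, labels):
    """One-way ANOVA F-statistic of `values` grouped by class label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    grand_mean = sum(values) / len(values)
    means = {label: sum(g) / len(g) for label, g in groups.items()}
    between = sum(len(g) * (means[label] - grand_mean) ** 2
                  for label, g in groups.items()) / (len(groups) - 1)
    within = sum((v - means[label]) ** 2
                 for label, g in groups.items() for v in g) / (len(values) - len(groups))
    return between / within if within > 0 else float("inf")

def relevance(data, labels, gene):
    """Average the per-time-step F-statistic of one gene."""
    n_time = len(data[0][gene])
    scores = [f_statistic([sample[gene][t] for sample in data], labels)
              for t in range(n_time)]
    return sum(scores) / n_time

def dtw_distance(a, b):
    """Classic DTW distance with absolute-difference step cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def redundancy(data, g1, g2):
    """Assumed transform: similar series (small DTW distance) score close to 1."""
    distances = [dtw_distance(sample[g1], sample[g2]) for sample in data]
    return 1.0 / (1.0 + sum(distances) / len(distances))

def select_genes(data, labels, m):
    """Greedy selection: most relevant gene first, then best relevance/redundancy."""
    n_genes = len(data[0])
    rel = [relevance(data, labels, g) for g in range(n_genes)]
    selected = [max(range(n_genes), key=rel.__getitem__)]
    while len(selected) < min(m, n_genes):
        def score(g):
            mean_red = sum(redundancy(data, g, s) for s in selected) / len(selected)
            return rel[g] / mean_red
        remaining = [g for g in range(n_genes) if g not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Toy data: 4 samples x 3 genes x 3 time points (made-up numbers).
# Gene 1 nearly duplicates gene 0; gene 2 has a different temporal shape.
data = [
    [[1, 2, 3], [1, 2, 3], [9, 7, 5]],
    [[2, 3, 4], [2, 3, 4], [8, 6, 4]],
    [[4, 5, 6], [4, 5, 6], [6, 4, 2]],
    [[5, 6, 7], [5, 6, 8], [5, 3, 0]],
]
labels = [0, 0, 1, 1]
print(select_genes(data, labels, 2))  # [0, 2]: the near-duplicate gene 1 is skipped
```

In this toy example genes 1 and 2 have the same relevance, so the choice of the second gene is driven entirely by the redundancy term.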
Description of gene expression datasets
| Dataset | # Genes | # Samples (symptomatic/asymptomatic) | # Time points |
|---|---|---|---|
| H3N2 | 12023 | 17 (9/8) | 16 |
| HRV | 12023 | 20 (10/10) | 14 |
| RSV | 12023 | 19 (9/10) | 21 |
Evaluation of feature selection methods on H3N2, HRV and RSV datasets using the top m genes (values represent classification accuracy)
| Dataset | Method | KNN m=1 | KNN m=10 | KNN m=20 | KNN m=30 | KNN m=40 | KNN m=50 | NB m=1 | NB m=10 | NB m=20 | NB m=30 | NB m=40 | NB m=50 | SVM m=1 | SVM m=10 | SVM m=20 | SVM m=30 | SVM m=40 | SVM m=50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H3N2 | mRMR | 58.8 | 76.5 | 82.4 | 88.2 | 88.2 | 88.2 | 64.7 | 76.5 | 76.5 | 70.6 | 70.6 | 76.5 | 58.8 | 70.6 | 64.7 | 70.6 | 76.5 | 88.2 |
| | F-statistic | 58.8 | 82.4 | 88.2 | 88.2 | 88.2 | 94.1 | 64.7 | 82.4 | 88.2 | 94.1 | 94.1 | 94.1 | 58.8 | 88.2 | 88.2 | 88.2 | 88.2 | 100 |
| | ReliefF | 64.7 | 47.1 | 70.6 | 76.5 | 82.4 | 82.4 | 70.6 | 52.9 | 82.4 | 88.2 | 88.2 | 94.1 | 52.9 | 70.6 | 94.1 | 100 | 94.1 | 94.1 |
| | MT-LASSO | 52.9 | 70.6 | 76.5 | 94.1 | 88.2 | 100 | 64.7 | 70.6 | 64.7 | 76.5 | 82.4 | 76.5 | 58.8 | 82.4 | 70.6 | 94.1 | 100 | 100 |
| | TMRMR-C | 100 | 100 | 100 | 100 | 100 | 100 | 94.1 | 100 | 100 | 100 | 100 | 100 | 88.2 | 100 | 94.1 | 94.1 | 94.1 | 100 |
| | TMRMR-M | 100 | 100 | 100 | 100 | 100 | 100 | 94.1 | 94.1 | 100 | 100 | 100 | 100 | 94.1 | 100 | 94.1 | 94.1 | 100 | 100 |
| HRV | mRMR | 40.0 | 40.0 | 50.0 | 60.0 | 55.0 | 60.0 | 35.0 | 40.0 | 65.0 | 60.0 | 75.0 | 75.0 | 35.0 | 40.0 | 70.0 | 65.0 | 60.0 | 65.0 |
| | F-statistic | 40.0 | 55.0 | 85.0 | 75.0 | 75.0 | 75.0 | 35.0 | 75.0 | 70.0 | 70.0 | 80.0 | 80.0 | 30.0 | 60.0 | 70.0 | 85.0 | 75.0 | 80.0 |
| | ReliefF | 45.0 | 55.0 | 55.0 | 55.0 | 60.0 | 60.0 | 50.0 | 50.0 | 40.0 | 50.0 | 50.0 | 60.0 | 55.0 | 50.0 | 45.0 | 50.0 | 60.0 | 60.0 |
| | MT-LASSO | 40.0 | 50.0 | 50.0 | 65.0 | 60.0 | 60.0 | 40.0 | 55.0 | 60.0 | 70.0 | 75.0 | 75.0 | 40.0 | 55.0 | 50.0 | 60.0 | 70.0 | 75.0 |
| | TMRMR-C | 55.0 | 80.0 | 80.0 | 75.0 | 85.0 | 75.0 | 50.0 | 75.0 | 85.0 | 90.0 | 85.0 | 80.0 | 50.0 | 75.0 | 85.0 | 85.0 | 75.0 | 75.0 |
| | TMRMR-M | 55.0 | 60.0 | 75.0 | 75.0 | 80.0 | 75.0 | 50.0 | 75.0 | 85.0 | 80.0 | 80.0 | 80.0 | 50.0 | 70.0 | 80.0 | 75.0 | 80.0 | 75.0 |
| RSV | mRMR | 84.2 | 68.4 | 68.4 | 63.2 | 63.2 | 68.4 | 79.0 | 68.4 | 68.4 | 63.2 | 57.9 | 68.4 | 84.2 | 68.4 | 57.9 | 57.9 | 57.9 | 57.9 |
| | F-statistic | 79.0 | 63.2 | 68.4 | 63.2 | 63.2 | 63.2 | 79.0 | 68.4 | 79.0 | 73.7 | 57.9 | 68.4 | 84.2 | 73.7 | 79.0 | 68.4 | 68.4 | 63.2 |
| | ReliefF | 73.7 | 47.4 | 36.8 | 31.6 | 36.8 | 42.1 | 68.4 | 68.4 | 79.0 | 52.6 | 47.4 | 47.4 | 68.4 | 68.4 | 52.6 | 47.4 | 47.4 | 42.1 |
| | MT-LASSO | 79.0 | 57.9 | 52.6 | 47.4 | 57.9 | 57.9 | 79.0 | 89.5 | 73.7 | 63.2 | 57.9 | 52.6 | 79.0 | 73.7 | 57.9 | 52.6 | 52.6 | 57.9 |
| | TMRMR-C | 79.0 | 84.2 | 73.7 | 84.2 | 84.2 | 84.2 | 79.0 | 84.2 | 84.2 | 84.2 | 84.2 | 84.2 | 79.0 | 84.2 | 79.0 | 89.5 | 84.2 | 84.2 |
| | TMRMR-M | 79.0 | 84.2 | 84.2 | 73.7 | 73.7 | 73.7 | 79.0 | 84.2 | 84.2 | 84.2 | 84.2 | 84.2 | 79.0 | 73.7 | 84.2 | 73.7 | 79.0 | 79.0 |
| Average | mRMR | 61.0 | 61.6 | 66.9 | 70.5 | 68.8 | 72.2 | 59.6 | 61.6 | 70.0 | 64.6 | 67.8 | 73.3 | 59.3 | 59.7 | 64.2 | 64.5 | 64.8 | 70.4 |
| | F-statistic | 59.3 | 66.8 | 80.6 | 75.5 | 75.5 | 77.4 | 59.6 | 75.3 | 79.1 | 79.3 | 77.3 | 80.8 | 57.7 | 74.0 | 79.1 | 80.6 | 77.2 | 81.1 |
| | ReliefF | 61.1 | 49.8 | 54.1 | 54.4 | 59.7 | 61.5 | 63.0 | 57.1 | 67.1 | 63.6 | 61.9 | 67.2 | 58.8 | 63.0 | 63.9 | 65.8 | 67.2 | 65.4 |
| | MT-LASSO | 57.3 | 59.5 | 59.7 | 68.8 | 68.7 | 72.6 | 61.2 | 71.7 | 66.1 | 69.9 | 71.7 | 68.0 | 59.3 | 70.3 | 59.5 | 68.9 | 74.2 | 77.6 |
| | TMRMR-C | **78.0** | **88.1** | 84.6 | **86.4** | **89.7** | **86.4** | **74.4** | **86.4** | **89.7** | **91.4** | **89.7** | **88.1** | 72.4 | **86.4** | 86.0 | **89.5** | 84.4 | **86.4** |
| | TMRMR-M | **78.0** | 81.4 | **86.4** | 82.9 | 84.6 | 82.9 | **74.4** | 84.4 | **89.7** | 88.1 | 88.1 | **88.1** | **74.4** | 81.2 | **86.1** | 80.9 | **86.3** | 84.7 |
Bold represents the best average accuracy
Fig. 4 Classification accuracy obtained using a 5-fold cross-validation procedure on the three gene expression datasets: H3N2 (left), HRV (middle) and RSV (right). Results are given for the three classifiers: KNN (top), NB (middle) and SVM (bottom)
Evaluation of feature selection methods on H3N2, HRV and RSV datasets using the top m genes and reduced number of time points (T=3, T=5 and T=7). Values represent average accuracy on the three datasets obtained by using five-fold cross validation procedure
| Time points | Method | KNN m=1 | KNN m=10 | KNN m=20 | KNN m=30 | KNN m=40 | KNN m=50 | NB m=1 | NB m=10 | NB m=20 | NB m=30 | NB m=40 | NB m=50 | SVM m=1 | SVM m=10 | SVM m=20 | SVM m=30 | SVM m=40 | SVM m=50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T=3 | mRMR | 75.8 | 76.6 | 79.0 | 80.8 | 77.3 | 75.5 | 74.4 | 80.2 | 82.3 | 80.6 | 73.8 | 77.1 | 72.5 | 76.5 | 85.6 | 84.2 | 77.0 | 77.0 |
| | F-statistic | 75.8 | 84.1 | 82.3 | 84.2 | 85.9 | 85.9 | 74.4 | 85.8 | 82.1 | 84.1 | 85.7 | 83.8 | 74.2 | 87.5 | 80.3 | 83.7 | 85.5 | 86.0 |
| | ReliefF | 58.6 | 76.7 | 79.2 | 75.7 | 74.0 | 77.6 | 60.8 | 78.5 | 75.2 | 78.9 | 80.6 | 79.1 | 60.8 | 75.3 | 75.8 | 77.4 | 81.1 | 79.1 |
| | MT-LASSO | 61.6 | 71.0 | 76.8 | 73.7 | 72.0 | 73.8 | 64.1 | 66.0 | 75.2 | 76.5 | 76.7 | 78.6 | 67.8 | 69.9 | 80.5 | 75.5 | 74.0 | 79.2 |
| | TMRMR-C | | | | 86.3 | 89.7 | | | 88.1 | | | | 86.1 | | | | 88.0 | 89.4 | |
| | TMRMR-M | | 87.8 | 85.7 | | | 88.0 | | | 86.0 | 86.0 | | | 87.7 | 85.8 | 84.1 | | | 88.1 |
| T=5 | mRMR | 61.1 | 69.8 | 69.9 | 75.4 | 76.9 | 77.0 | 61.5 | 73.5 | 71.8 | 66.6 | 73.7 | 75.4 | 64.9 | 73.3 | 68.1 | 73.8 | 79.0 | 77.2 |
| | F-statistic | 61.1 | 80.6 | 80.6 | 77.2 | 80.8 | 82.6 | 61.5 | 84.2 | 84.1 | 84.1 | 84.1 | 85.7 | 64.9 | 84.1 | 80.6 | 79.0 | 78.9 | 84.3 |
| | ReliefF | 55.3 | 69.7 | 65.0 | 70.6 | 74.0 | 77.7 | 55.6 | 71.2 | 75.3 | 77.2 | 77.1 | 79.1 | 57.0 | 72.0 | 72.1 | 70.4 | 72.0 | 70.1 |
| | MT-LASSO | 49.8 | 55.8 | 63.1 | 70.7 | 69.0 | 72.4 | 53.8 | 65.0 | 62.1 | 73.3 | 76.9 | 73.8 | 51.5 | 68.0 | 70.4 | 77.7 | 74.4 | 76.1 |
| | TMRMR-C | 83.0 | | | | | | | | | | | | 81.3 | | | | | |
| | TMRMR-M | | 86.3 | 88.0 | 84.7 | 82.7 | 81.0 | | 88.1 | 86.4 | 86.4 | 84.7 | 86.4 | | 88.0 | | 82.8 | 86.1 | 84.4 |
| T=7 | mRMR | 67.3 | 76.3 | 77.0 | 76.7 | 76.6 | 81.8 | 65.3 | 66.1 | 69.1 | 70.7 | 72.3 | 78.4 | 67.1 | 80.0 | 76.5 | 76.5 | 80.1 | 83.6 |
| | F-statistic | 65.5 | 85.6 | 83.6 | 87.3 | 87.5 | 89.3 | 65.3 | 83.8 | 85.4 | 83.5 | 89.1 | 87.1 | 67.1 | 84.0 | 89.1 | 89.1 | 89.3 | 87.6 |
| | ReliefF | 63.7 | 60.5 | 61.3 | 65.2 | 61.8 | 67.4 | 69.0 | 76.7 | 76.9 | 72.3 | 73.9 | 74.2 | 67.3 | 73.8 | 73.7 | 76.0 | 70.7 | 68.8 |
| | MT-LASSO | 63.3 | 76.5 | 81.9 | 82.3 | 84.2 | 84.2 | 67.3 | 72.7 | 74.1 | 80.2 | 82.1 | 75.1 | 69.0 | 79.8 | 83.7 | 84.1 | 84.1 | 86.1 |
| | TMRMR-C | | 89.6 | 91.3 | | | | | 91.3 | | 91.3 | 93.0 | | 87.7 | 89.7 | 93.0 | 91.3 | | |
| | TMRMR-M | | | | 89.6 | 87.8 | | | | | | | 91.3 | | | | | 89.6 | |
Bold represents the best average accuracy
Fig. 5 Average classification accuracy over all datasets obtained in a 5-fold cross-validation procedure. Results are given for different numbers of time points used for both feature selection and classifier training: T=3, T=5, T=7 and the full series (T=16, 14 and 21 for H3N2, HRV and RSV, respectively)
Top 5 GO terms over-represented in the top 50 genes selected by the TMRMR-C algorithm from H3N2, HRV and RSV datasets
| Dataset | GO ID | GO biological process | p-value |
|---|---|---|---|
| H3N2 | GO:0060337 | Type I interferon signaling pathway | 6.17E-23 |
| | GO:0071357 | Cellular response to type I interferon | 6.17E-23 |
| | GO:0034340 | Response to type I interferon | 1.55E-22 |
| | GO:0051607 | Defense response to virus | 2.52E-22 |
| | GO:0009615 | Response to virus | 6.85E-21 |
| HRV | GO:0060337 | Type I interferon signaling pathway | 2.51E-18 |
| | GO:0071357 | Cellular response to type I interferon | 2.51E-18 |
| | GO:0034340 | Response to type I interferon | 5.56E-18 |
| | GO:0009615 | Response to virus | 2.08E-15 |
| | GO:0051607 | Defense response to virus | 1.07E-14 |
| RSV | GO:0070269 | Pyroptosis | 1.46E-03 |
| | GO:0002376 | Immune system process | 1.93E-03 |
| | GO:0006955 | Immune response | 1.95E-03 |
| | GO:0045087 | Innate immune response | 3.68E-03 |
| | GO:0006952 | Defense response | 6.96E-03 |
Fig. 6 Robustness analysis. Average values of Spearman's rank correlation coefficient (ρ), Tanimoto distance (T) and number of features shared across all folds (N) over all experiments (all datasets and all tested numbers of time points)
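The three robustness measures named in the caption can be illustrated with a short Python sketch. It rests on assumptions made here: the Tanimoto measure is taken in its common set form (intersection size over union size, where 1 means identical selections), Spearman's ρ uses the tie-free formula, and the gene names are hypothetical. The exact definitions used in the paper may differ.

```python
from itertools import combinations

def tanimoto(selection_a, selection_b):
    """Set-based Tanimoto measure: |A and B| / |A or B| (assumed form)."""
    a, b = set(selection_a), set(selection_b)
    return len(a & b) / len(a | b)

def spearman(ranks_a, ranks_b):
    """Spearman's rho for two tie-free rankings of the same n features."""
    n = len(ranks_a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))

def robustness(fold_selections):
    """Average pairwise Tanimoto value and number of features shared by all folds."""
    pairs = list(combinations(fold_selections, 2))
    average_t = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    shared = set.intersection(*(set(s) for s in fold_selections))
    return average_t, len(shared)

# Hypothetical gene sets selected in three cross-validation folds.
folds = [["g1", "g2", "g3"], ["g1", "g2", "g4"], ["g1", "g3", "g4"]]
average_t, n_shared = robustness(folds)
print(average_t, n_shared)

# Identical rankings give rho = 1, fully reversed rankings give rho = -1.
print(spearman([1, 2, 3, 4], [1, 2, 3, 4]), spearman([1, 2, 3, 4], [4, 3, 2, 1]))
```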