Michail Tsagris, Vincenzo Lagani, Ioannis Tsamardinos.
Abstract
BACKGROUND: Feature selection is commonly employed for identifying collectively predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work we extend established constraint-based feature-selection methods to high-dimensional "omics" temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables.
Keywords: Longitudinal data; Multiple solutions; Regression; Time course data; Variable selection
Year: 2018 PMID: 29357817 PMCID: PMC5778658 DOI: 10.1186/s12859-018-2023-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Graphical representation of the four different scenarios. In all panels, the x-axis reports time, while the y-axis reports the log-transformed expression value of a randomly chosen probeset from one of the datasets used in the experiments. a Temporal-longitudinal scenario. All data, including the target variable, consist of longitudinal (repeated) measurements. Values from the same subject are linked with a dashed line (data from the GDS3915 dataset). b Temporal-distinct scenario. Each observed value refers to a different subject (data from the GDS964 dataset). c Static-longitudinal scenario. There are two groups (red and black lines), and each group consists of trajectories of longitudinal measurements. Each trajectory refers to the same subject (data from the GDS4146 dataset). d Static-distinct scenario. At every time point, different subjects are measured. Green and red colors indicate the two populations from which the subjects are sampled (data from the GDS2456 dataset)
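The four scenarios in Fig. 1 can be read as the combination of two properties: whether the target is temporal or static, and whether subjects are measured repeatedly or only once. The following minimal sketch encodes that reading; the function name `classify_scenario` and its arguments are hypothetical and serve only as an illustration, not as part of the authors' software.

```python
from collections import Counter
from typing import Hashable, Sequence


def classify_scenario(subject_ids: Sequence[Hashable], target_is_temporal: bool) -> str:
    """Name the scenario implied by a study design (illustrative helper).

    subject_ids: one entry per measurement, identifying the measured subject.
    target_is_temporal: True when the target varies over time (e.g. a gene's
        expression), False when it is a fixed label such as group membership.
    """
    counts = Counter(subject_ids)
    # Longitudinal: at least one subject contributes more than one measurement.
    repeated = any(n > 1 for n in counts.values())
    target = "temporal" if target_is_temporal else "static"
    design = "longitudinal" if repeated else "distinct"
    return f"{target}-{design}"


# Three subjects, each measured at two time points, with a time-varying target.
print(classify_scenario(["s1", "s1", "s2", "s2", "s3", "s3"], True))
# Six different subjects, one measurement each, with a class label as target.
print(classify_scenario(["s1", "s2", "s3", "s4", "s5", "s6"], False))
```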
Temporal-longitudinal scenario: comparison between SES equipped with GLMM (SESglmm) and SES equipped with GEE
| Dataset | MSPE: SESglmm | MSPE: SESgee(CS) | MSPE: SESgee(AR(1)) | Average time (in seconds): SESglmm | Average time (in seconds): SESgee(CS) | Average time (in seconds): SESgee(AR(1)) |
|---|---|---|---|---|---|---|
| GDS5088 | 0.131 (0.000) | 0.189 (0.1) | 0.289 (0.018) | 1562.51 (230.53) | 1022.45 (217.99) | 933.14 (180.34) |
| GDS4395 | 0.116 (0.007) | 0.156 (0.019) | 0.298 (0.028) | 21167.21 (26089.48) | 4862.15 (1724.89) | 5577.80 (1890.15) |
| GDS4822 | 0.066 (0.000) | 0.055 (0.001) | 0.045 (0.004) | 1785.66 (321.92) | 2103.96 (490.74) | 1492.30 (205.03) |
| GDS3326 | 0.062 (0.001) | 0.052 (0.000) | 0.063 (0.002) | 6617.09 (472.16) | 3167.78 (795.74) | 2348.69 (390.10) |
| GDS3181 | 0.805 (0.096) | 0.458 (0.000) | 0.458 (0.00) | 1684.90 (206.26) | 1011.44 (152.59) | 748.18 (105.32) |
| GDS4258 | 0.074 (0.000) | 0.149 (0.003) | 0.152 (0.002) | 4135.76 (506.15) | 2818.024 (418.97) | 2078.52 (462.30) |
| GDS3915 | 0.527 (0.038) | 0.553 (0.01) | 0.439 (0.000) | 669.18 (63.93) | 511.82 (84.22) | 491.91 (108.64) |
| GDS3432 | 0.057 (0.001) | 0.060 (0.008) | 0.038 (0.003) | 3275.22 (474.06) | 2213.11 (371.68) | 2104.05 (546.76) |
| Average | 0.230 (0.280) | 0.209 (0.192) | 0.223 (0.172) | 5112.2 (6756.04) | 2378.56 (1566.36) | 1971.82 (1611.13) |
The latter is indicated as SESgee(CS) or SESgee(AR(1)), depending on the variance estimator employed. TT-corrected, cross-validated mean squared prediction errors (MSPE) are reported for each dataset, along with their standard deviations (in parentheses). Average (standard deviation) computational time is reported as well, and the last row reports performance averaged across datasets. The MSPE values are not statistically significantly different; however, SESgee(AR(1)) is faster than the alternatives
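As a quick check of how the "Average" row relates to the per-dataset rows, the sketch below recomputes the across-dataset mean and standard deviation of the SESglmm MSPE column from the values printed in the table. It assumes the reported spread is the sample standard deviation (n - 1 denominator); because the inputs are already rounded to three decimals, the result should agree with the table only up to rounding.

```python
from statistics import mean, stdev

# Per-dataset MSPE of SESglmm, copied from the table above.
mspe_sesglmm = {
    "GDS5088": 0.131,
    "GDS4395": 0.116,
    "GDS4822": 0.066,
    "GDS3326": 0.062,
    "GDS3181": 0.805,
    "GDS4258": 0.074,
    "GDS3915": 0.527,
    "GDS3432": 0.057,
}

values = list(mspe_sesglmm.values())
avg = mean(values)
sd = stdev(values)  # sample standard deviation (assumed convention)

# Compare with the "Average" row entry for SESglmm: 0.230 (0.280).
print(f"{avg:.3f} ({sd:.3f})")
```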
Fig. 2 a Temporal-longitudinal scenario: Time in seconds required by glmmLasso and SES equipped with different conditional independence tests on the GDS5088 dataset. The number of randomly selected predictors is reported on the x-axis, while the y-axis reports the required computational time: glmmLasso rapidly becomes computationally more expensive than any SES variant. b Gene expression over time for the target gene CSHL1 in dataset GDS5088 (one line for each subject). c Average relative change for the target gene and predictors reported in model 10. The expression of the genes was averaged over subjects for each time point, and the logarithm of the change with respect to the first time point was then computed. The target gene appears as a bold line, whereas the 5 predictor genes are reported as dashed lines. d Differences in performance between SESglmm and glmmLasso for the 20 replications on each dataset. Negative values indicate SESglmm outperforming glmmLasso; SESglmm is always comparable to or better than glmmLasso, especially in dataset GDS5088 (excluded for the sake of clarity). e Static-longitudinal scenario: Expression over time of gene TSIX, selected by SES for dataset GDS4146. The plot shows one line for each subject: there is a clear separation between the two classes included in the dataset (dashed and solid lines, respectively). f Static-distinct scenario: Expression over time of gene Ppp1r42, selected by SES for dataset GDS2882. The dotted and dashed lines correspond to the average trend of the gene in two different classes; differences in intercept and trend are easily noticeable
Cross-validated, TT-corrected performances of SES and LASSO-type methods on the four scenarios
| Temporal-longitudinal scenario | Temporal-distinct scenario | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| MSPE | Selected vars | MSPE | Selected vars | ||||||
| Dataset | SESglmm | glmmLasso | SESglmm | glmmLasso | Dataset | SES | LASSO | SES | LASSO |
| GDS5088 | 0.160 (0.042) | 5.25 (0.85) | 5.15 (8.65) | GDS3859 | 0.068 (0.006) | 3.5 (0.51) | 11.81 (4.66) | ||
| GDS4395 | 0.640 (0.568) | 5.37 (0.56) | 12.35 (13.61) | GDS972 | 0.022 (0.000) | 5.83 (0.92) | 22.2 (9.85) | ||
| GDS4822 | 0.765 (0.436) | 4.75 (0.85) | 3.16 (5.77) | GDS947 | 0.056 (0.000) | 5.92 (0.65) | 12.40 (5.40) | ||
| GDS3326 | 0.234 (0.139) | 5.42 (0.78) | 2.42 (7.45) | GDS964 | 0.033 (0.000) | 5.73 (0.69) | 25.69 (11.86) | ||
| GDS3181 | 0.971 (0.484) | 4.17 (0.87) | 0.35 (2.15) | GDS2688 | 0.184 (0.006) | 5.79 (1.06) | 20.64 (10.93) | ||
| GDS4258 | 9.882 (4.518) | 3.83 (0.51) | 1.48 (4.06) | GDS2135 | 0.053 (0.002) | 3.80 (0.76) | 10 (5.72) | ||
| GDS3432 | 2.283 (1.572) | 1.67 (3.51) | 0.08 (0.55) | Av. diff. | 0.053 | -12.03 | |||
| GDS3915 | 0.150 (0.055) | 5.12 (0.80) | 1.66 (4.62) | ||||||
| Av. diff. | -1.59 | 1.12 | |||||||
| Static-distinct scenario | Static-longitudinal scenario | ||||||||
| PCC | Selected vars | PCC | Selected vars | ||||||
| Dataset | SES | LASSO | SES | LASSO | Dataset | SES | GLASSO | SES | GLASSO |
| GDS4319 | 0.873 (0.000) | 2.1 (0.31) | 8 (0.00) | GDS4146 | 0.858 (0.142) | 1.00 (0.00) | 0.42 (1.38) | ||
| GDS3924 | 0.528 (0.104) | 2.75 (0.44) | 53.56 (28.55) | GDS4518 | 0.417 (0.333) | 1.75 (0.44) | 3.04 (2.15) | ||
| GDS3184 | 0.556 (0.067) | 3.00 (0.00) | 10.62 (5.16) | GDS4820 | 0.500 (0.000) | 2.00 (0.00) | 5.14 (3.19) | ||
| GDS3145 | 0.594 (0.125) | 1.5 (0.88) | 0.6 (0.55) | GDS1840 | 0.500 (0.250) | 1.5 (0.51) | 2.67 (2.03) | ||
| GDS2882 | 0.750 (0.000) | 1.5 (0.88) | 0.25 (0.50) | Av. diff. | 0.108 | -1.23 | |||
| GDS2851 | 0.694 (0.000) | 2.25 (0.44) | 0.75 (0.50) | ||||||
| GDS1784 | 0.694 (0.000) | 1.75 (0.85) | 0.5 (0.58) | ||||||
| GDS2456 | 0.739 (0.000) | 1.2 (0.41) | 0.44 (0.53) | ||||||
| Av. diff. | 0.115 | -6.52 | |||||||
For each dataset, performance is reported as average (standard deviation). Zero standard deviations are caused by numerical rounding. For the Temporal-longitudinal and Temporal-distinct scenarios, performance is computed as Mean Squared Prediction Error (MSPE; lower values indicate better performance) and number of selected variables, while for the other scenarios the Percentage of Correct Classification (PCC; higher is better) is used instead of MSPE. Bold numbers indicate better performance; average differences over all datasets are reported for each scenario. Average differences that are statistically significant at the 0.01 and 0.05 levels are marked with two distinct symbols, respectively. In terms of predictive performance, SES is always on par with or better than LASSO-type algorithms in all scenarios except the Temporal-distinct one
Temporal-longitudinal scenario: comparison between SESglmm and glmmLasso based on 20 replications, each with a different target variable (gene) and 2000 independently and randomly selected genes as predictor variables
| Dataset | GDS5088 | GDS4395 | GDS4822 | GDS3326 | GDS3181 | GDS4258 | GDS3432 | GDS3915 |
|---|---|---|---|---|---|---|---|---|
| Average difference | -3.560 (4.118) | 0.188 (0.516) | -0.003 (0.134) | -0.180 (0.506) | -0.020 (0.04) | -0.139 (0.288) | 0.000 (0.355) | 0.093 (0.455) |
| Proportion | 19/20 | 7/20 | 9/20 | 13/20 | 15/20 | 10/20 | 10/20 | 8/20 |
| p-value | 0.0001 | 0.128 | 0.938 | 0.0015 | 0.0312 | 0.024 | 0.9946 | 0.3842 |
Average difference in performance (the standard deviation of the differences appears in parentheses) and proportion of times SESglmm outperformed glmmLasso. The last row contains the permutation-based p-value for the equality of the mean performances. Average differences that are statistically significant at the 0.01 and 0.05 levels are marked with two distinct symbols, respectively. Note that SESglmm is either statistically significantly better than or on par with glmmLasso in terms of predictive performance
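The exact permutation scheme behind these p-values is not described in this excerpt. One common choice for paired replications is a sign-flip permutation test on the per-replication differences, sketched below together with the proportion of replications won. The function names and the example differences are hypothetical and are not taken from the paper.

```python
import random


def paired_permutation_pvalue(diffs, n_perm=9999, seed=0):
    """Two-sided sign-flip permutation p-value for H0: mean difference = 0."""
    rng = random.Random(seed)
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the sign of each paired difference is exchangeable.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / n) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def proportion_better(diffs):
    """Fraction of replications with a negative difference (lower error for the first method)."""
    return sum(d < 0 for d in diffs) / len(diffs)


# Hypothetical per-replication MSPE differences (method A minus method B);
# these are illustrative numbers, not data from the paper.
example_diffs = [-0.42, -0.10, 0.05, -0.31, -0.22, 0.08, -0.15, -0.27, -0.03, -0.19]
print(proportion_better(example_diffs))
print(paired_permutation_pvalue(example_diffs))
```

A negative mean difference with a small p-value corresponds to the "statistically significantly better" case in the table, while a large p-value corresponds to "on par".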