Suyan Tian, Chi Wang, Howard H Chang.
Abstract
Pathway-based feature selection, which uses the biological information carried by gene sets/pathways to guide the selection of relevant genes, has become increasingly popular. In this study, we adapt a gene set analysis method, the significance analysis of microarray gene set reduction (SAMGSR) algorithm, to carry out feature selection for longitudinal microarray data, and we propose a pathway-based feature selection algorithm, the two-level SAMGSR method. Using simulated data and a real-world application, we demonstrate that a gene's expression profiles over time can be treated as a gene set, so that a suitable gene set analysis method can be used or modified to select relevant genes from longitudinal omics data. We believe this work paves the way for further research that bridges feature selection and gene set analysis through the development of novel pathway-based feature selection algorithms.
Keywords: Core subset; feature selection; gene set analysis; longitudinal microarray data; significance analysis of microarray (SAM)
Year: 2018 PMID: 30271585 PMCID: PMC6124382 DOI: 10.12688/f1000research.15357.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
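The abstract's central idea, that one gene's expression values across time points can be treated as a gene set, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the usual SAM-GS form (the sum of squared SAM-style d statistics, with significance from label permutations) and reads c_k as the SAM-GS p-value of the members left outside the leading subset of size k; neither definition is spelled out in this record. The function names, the fudge factor `s0 = 0`, the number of permutations, and the toy data are all hypothetical.

```python
import numpy as np

def sam_d(x, y, s0=0.0):
    """SAM-style d statistic for each column of x (samples x set members).

    y holds 0/1 group labels; s0 is the SAM fudge factor (0.0 is an
    illustrative choice, not a value taken from the paper).
    """
    a, b = x[y == 0], x[y == 1]
    n0, n1 = len(a), len(b)
    # Pooled standard error of the difference in group means.
    ss = ((a - a.mean(axis=0)) ** 2).sum(axis=0) + ((b - b.mean(axis=0)) ** 2).sum(axis=0)
    se = np.sqrt((1.0 / n0 + 1.0 / n1) * ss / (n0 + n1 - 2))
    return (b.mean(axis=0) - a.mean(axis=0)) / (se + s0)

def sam_gs_pvalue(x, y, n_perm=500, s0=0.0, seed=0):
    """Permutation p-value of the set statistic sum(d_i ** 2)."""
    rng = np.random.default_rng(seed)
    observed = np.sum(sam_d(x, y, s0) ** 2)
    hits = 0
    for _ in range(n_perm):
        permuted = rng.permutation(y)
        if np.sum(sam_d(x, permuted, s0) ** 2) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def reduce_set(x, y, cutoff=0.2, n_perm=500, s0=0.0, seed=0):
    """Assumed reduction step: keep the smallest leading subset (ranked by
    |d|) whose complement is no longer significant, i.e. c_k > cutoff."""
    order = np.argsort(-np.abs(sam_d(x, y, s0)))
    for k in range(1, len(order)):
        c_k = sam_gs_pvalue(x[:, order[k:]], y, n_perm, s0, seed)
        if c_k > cutoff:
            return order[:k]
    return order

# Toy data: one gene measured at 5 time points in 20 + 20 subjects,
# with a group difference only at the second and third time points.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 20)
profile = rng.normal(size=(40, 5))
profile[y == 1, 1:3] += 2.0

p = sam_gs_pvalue(profile, y)   # set-level test on the whole time profile
kept = reduce_set(profile, y)   # time points retained by the reduction
```

Here each column of `profile` plays the role that a gene plays in an ordinary gene set, so the set-level test asks whether the gene differs between groups at any time point, and the reduction step picks out the time points that drive the signal.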
Figure 1. Flowchart illustrating the two-level SAMGSR algorithm.
Performance of the SAMGSR algorithm and our SAMGSR extensions for longitudinal feature selection, evaluated at individual time points.
| | 5-fold CV | | | | | Test set | | | |
|---|---|---|---|---|---|---|---|---|---|
| | Day 1/2 | Day 1 | Day 4 | Day 7 | Day 14 | Day 1/2 | Day 1 | Day 4 | Day 7 |
| A. Using two-level SAMGSR (94-gene signature, cutoff for c_k = 0.2 on the whole training set) | | | | | | | | | |
| # of genes | 40.6 | 31.6 | 33.2 | 38.6 | 45 | 63 | 55 | 49 | 63 |
| GBS | 0.304 | 0.266 | 0.306 | 0.274 | 0.278 | 0.298 | 0.272 | 0.240 | 0.288 |
| BCM | 0.514 | 0.565 | 0.536 | 0.556 | 0.526 | 0.491 | 0.534 | 0.560 | 0.495 |
| AUPR | 0.533 | 0.690 | 0.610 | 0.617 | 0.575 | 0.494 | 0.551 | 0.594 | 0.527 |
| B. Using simple SAMGSR (97-gene signature, cutoff for c_k = 0.2 on the whole training set) | | | | | | | | | |
| # of genes | 51 | 32.6 | 35 | 36.2 | 39.2 | 69 | 53 | 45 | 77 |
| GBS | 0.279 | 0.210 | 0.279 | 0.323 | 0.262 | 0.262 | 0.309 | 0.307 | 0.257 |
| BCM | 0.501 | 0.598 | 0.501 | 0.498 | 0.559 | 0.499 | 0.513 | 0.498 | 0.534 |
| AUPR | 0.551 | 0.739 | 0.514 | 0.522 | 0.609 | 0.503 | 0.521 | 0.514 | 0.572 |
| C. Using SAMGSR at each time point (the size of signature >1000, cutoff for c_k = 0.1 on the training set) | | | | | | | | | |
| # of genes | 230.2 | 23 | 59 | 74.6 | 453.6 | 360 | 30 | 61 | 42 |
| GBS | 0.257 | 0.231 | 0.327 | 0.305 | 0.272 | 0.264 | 0.295 | 0.266 | 0.296 |
| BCM | 0.506 | 0.551 | 0.478 | 0.497 | 0.520 | 0.491 | 0.486 | 0.515 | 0.492 |
| AUPR | 0.535 | 0.655 | 0.482 | 0.518 | 0.584 | 0.490 | 0.482 | 0.529 | 0.512 |
Note: The posterior probabilities were calculated using an SVM classifier. The q-value cutoff in the SAM-GS step is set at 0.05. For the five 5-fold CV columns, # of genes is the number of genes selected by an algorithm at each time point, averaged over the five cross-validation folds of the training set.
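The three performance statistics are computed from posterior probabilities, and a small worked example may help in reading the table. This record does not expand or define GBS, BCM, or AUPR, so the sketch below rests on assumptions: GBS is taken to be a generalized Brier score (two-class form), BCM a belief confusion metric (the mean posterior probability given to the true class), and AUPR the area under the precision-recall curve, estimated here step-wise as average precision. The labels and probabilities are invented toy values.

```python
import numpy as np

def generalized_brier_score(y, p):
    """Assumed two-class form: squared error summed over both classes, halved.

    y: 0/1 labels; p: posterior probability of class 1. Lower is better.
    """
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return float(np.mean(((y - p) ** 2 + ((1 - y) - (1 - p)) ** 2) / 2))

def belief_confusion_metric(y, p):
    """Assumed form: mean posterior probability assigned to the true class.

    Higher is better.
    """
    y, p = np.asarray(y), np.asarray(p, dtype=float)
    return float(np.mean(np.where(y == 1, p, 1 - p)))

def average_precision(y, p):
    """Step-wise estimate of the area under the precision-recall curve."""
    y, p = np.asarray(y), np.asarray(p, dtype=float)
    order = np.argsort(-p, kind="stable")            # rank by decreasing score
    hits = y[order] == 1
    precision = np.cumsum(hits) / np.arange(1, len(y) + 1)
    return float(precision[hits].mean())             # mean precision at each true positive

# Toy posterior probabilities for four samples (hypothetical values).
y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.4, 0.6, 0.2])

gbs = generalized_brier_score(y, p)   # 0.1925
bcm = belief_confusion_metric(y, p)   # 0.625
ap = average_precision(y, p)          # 0.8333...
```

Under these assumed definitions, a classifier that always outputs 0.5 scores GBS = 0.25 and BCM = 0.5, which gives a reference point for the values in the table.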
Performance of the SAMGSR algorithm and our SAMGSR extensions for longitudinal feature selection, when all time points are considered together.
| Method | # of genes | 5-fold CV | | | | Test set | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Error | GBS | BCM | AUPR | Error | GBS | BCM | AUPR |
| Two-level SAMGSR | 94 | 0.419 | 0.258 | 0.507 | 0.541 | 0.356 | 0.239 | 0.525 | 0.566 |
| L-SAMGSR | 97 | 0.442 | 0.268 | 0.515 | 0.576 | 0.356 | 0.230 | 0.535 | 0.622 |
| SAMGSR separately | >400 | 0.419 | 0.246 | 0.510 | 0.559 | 0.428 | 0.243 | 0.511 | 0.553 |
Note: The posterior probabilities were calculated using an SVM classifier. The q-value cutoff in the SAM-GS step is set at 0.05. # of genes is the size of the union of the genes selected at the individual time points. L-SAMGSR: the longitudinal SAMGSR method.
Figure 2. Genes selected by the two-level SAMGSR algorithm in the traumatic injury application.
(A) Venn diagram of the overlap among the genes selected by the two-level SAMGSR method at different time points. (B) Venn diagram of the overlap between the genes found to be concordantly differentially expressed across all time points by the two-level SAMGSR algorithm and by the longitudinal SAMGSR algorithm.
Figure 3. Characteristics of the 5 genes identified as significant at all time points by both the two-level SAMGSR method and the longitudinal SAMGSR method in the traumatic injury application.
Subgroup sample means plotted against time for the 5 common genes identified as significantly different between uncomplicated and complicated patients at all 5 time points. The red line represents the complicated group and the black line represents the uncomplicated group.
The results of simulation 1.
| Method | | Time 1 | Time 2 | Time 3 | Time 4 | Time 5 |
|---|---|---|---|---|---|---|
| L-SAMGSR | # of genes | 19.84 | 19.14 | 13.68 | 9.30 | 11.00 |
| | F13A1 | 72% | 100% | 100% | 92% | 68% |
| | GSTM1 | 0% | 0% | 62% | 22% | 0% |
| Two-level SAMGSR | # of genes | 38.88 | 32.66 | 21.44 | 18.96 | 20.50 |
| | F13A1 | 64% | 92% | 90% | 84% | 52% |
| | GSTM1 | 2% | 62% | 94% | 80% | 36% |
Note: # of genes represents the average number of genes selected by either the longitudinal SAMGSR algorithm or the two-level SAMGSR algorithm at each time point over 50 replicates. Ave # represents the average number of unique genes across 5 time points. The percentages of the causal genes being correctly selected at each time point over these 50 replicates are presented in the corresponding cells.
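The two summaries described in this note, the average number of selected genes and the percentage of replicates in which a causal gene is selected, amount to simple bookkeeping over replicates. The sketch below shows that arithmetic for a single time point. The function name and the four toy replicates are hypothetical (the tables are based on 50 replicates), and `G1` and `G2` are placeholder names for non-causal genes.

```python
def selection_summary(replicates, causal_genes):
    """Summarise feature selection over simulation replicates at one time point.

    replicates: list of sets, each holding the genes selected in one replicate.
    Returns the average number of selected genes and, for each causal gene,
    the percentage of replicates in which it was selected.
    """
    n = len(replicates)
    avg_size = sum(len(selected) for selected in replicates) / n
    pct = {
        gene: 100.0 * sum(gene in selected for selected in replicates) / n
        for gene in causal_genes
    }
    return avg_size, pct

# Four toy replicates for one time point (invented selections).
replicates = [
    {"F13A1", "G1"},
    {"F13A1", "GSTM1", "G2"},
    {"G1"},
    {"F13A1"},
]
avg_size, pct = selection_summary(replicates, ["F13A1", "GSTM1"])
# avg_size == 1.75; pct == {"F13A1": 75.0, "GSTM1": 25.0}
```

Running the same function once per time point would fill one column of the table for a given method.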
The results of simulation 2.
| Method | | Time 1 | Time 2 | Time 3 | Time 4 | Time 5 |
|---|---|---|---|---|---|---|
| L-SAMGSR | # of genes | 182.38 | 56.18 | 35.44 | 30.94 | 123.84 |
| | COX4I2 | 96% | 0% | 0% | 0% | 4% |
| | RP9 | 10% | 4% | 4% | 6% | 96% |
| Two-level SAMGSR | # of genes | 209.44 | 73.40 | 48.04 | 49.38 | 138.66 |
| | COX4I2 | 100% | 0% | 0% | 0% | 0% |
| | RP9 | 4% | 0% | 0% | 0% | 92% |
Note: # of genes represents the average number of genes selected by either the longitudinal SAMGSR algorithm or the two-level SAMGSR algorithm at each time point over 50 replicates. Ave # represents the average number of unique genes across 5 time points. The percentages of the causal genes being correctly selected at each time point over these 50 replicates are presented in the corresponding cells.