Suyan Tian1,2, Chi Wang3,4.
Abstract
With the rapid evolution of high-throughput technologies, time series/longitudinal high-throughput experiments have become possible and affordable. However, the development of statistical methods for gene expression profiles measured across time points has not kept pace with the growth of such data. Feature selection is of critical importance for longitudinal microarray data. In this study, we proposed aggregating a gene's expression values across time into a single value using the sign average method, thereby reducing a longitudinal feature selection problem to a classical one. Regularized logistic regression models with pseudogenes (i.e., the sign averages of genes across time) as predictors were then optimized by either the coordinate descent method or the threshold gradient descent regularization method. By applying the proposed methods to simulated data and a traumatic injury dataset, we have demonstrated that the proposed methods, especially the combination of sign average and threshold gradient descent regularization, outperform other competitive algorithms. To conclude, the proposed methods are highly recommended for studies whose objective is feature selection for longitudinal gene expression data.
Year: 2019 PMID: 31016185 PMCID: PMC6444255 DOI: 10.1155/2019/1724898
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Comparison of methods for optimizing a penalized linear regression model. (a) Coordinate descent. (b) Threshold gradient descent regularization.
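As a rough illustration of the threshold gradient descent regularization (TGDR) idea named in panel (b), the following Python sketch fits a logistic model. It is a minimal sketch under assumptions, not the authors' implementation: the function name `tgdr_logistic`, the parameters `tau`, `step`, and `n_iter`, and the choice to update the intercept without thresholding are all hypothetical. At each iteration, only coefficients whose gradient magnitude is at least `tau` times the largest gradient magnitude are updated, so a `tau` close to 1 tends to give sparser models.

```python
import numpy as np


def tgdr_logistic(x, y, tau=0.8, step=0.01, n_iter=500):
    """Threshold gradient descent sketch for logistic regression.

    x: (n_samples, n_features) array; y: (n_samples,) array of 0/1 labels.
    tau, step, n_iter are illustrative tuning parameters (assumptions).
    """
    n_samples, n_features = x.shape
    beta = np.zeros(n_features)
    intercept = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(intercept + x @ beta)))
        resid = y - prob
        # Gradient of the average log-likelihood with respect to beta.
        grad = x.T @ resid / n_samples
        # Assumption: the intercept is always updated (never thresholded).
        intercept += step * resid.mean()
        gmax = np.abs(grad).max()
        if gmax == 0.0:
            break
        # Update only coefficients with a sufficiently large gradient.
        active = np.abs(grad) >= tau * gmax
        beta += step * grad * active
    return intercept, beta
```

Features whose coefficients are still exactly zero after the final iteration are treated as not selected; in practice `tau` and the number of iterations would typically be tuned, for example by cross-validation.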
Figure 2. Study schema of the injury data application.
Overview of methods under consideration.
| Method | GX values¹ | Pseudogenes² | Summary score used (pseudogene methods only) |
|---|---|---|---|
| Sign Avg & LASSO/CD | | √ | Sign average of a gene's expression over time. |
| Sign Avg & TGDR | | √ | Sign average of a gene's expression over time. |
| Mean & LASSO/CD | | √ | Mean of a gene's expression over time. |
| Mean & TGDR | | √ | Mean of a gene's expression over time. |
| Median & LASSO/CD | | √ | Median of a gene's expression over time. |
| Median & TGDR | | √ | Median of a gene's expression over time. |
| PC1 & LASSO/CD | | √ | First principal component of a gene's expression over time. |
| PC1 & TGDR | | √ | First principal component of a gene's expression over time. |
| EDGE | √ | | |
| limma | √ | | |
| LASSO/CD separately | √ | | |
| TGDR separately | √ | | |
| glmmLASSO | √ | | |

¹ GX values represent actual expression values.
² Pseudogenes are generated to summarize expression values across time.
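The summary scores listed above can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function name `summarize_over_time`, the array layout (samples × genes × time points), and the two-class 0/1 labels are assumptions. In particular, this record does not spell out the sign average, so the sketch assumes that each time point's sign is the direction of the class mean difference at that time point; any other univariate association measure could be substituted.

```python
import numpy as np


def summarize_over_time(x, how="mean", y=None):
    """Collapse x of shape (n_samples, n_genes, n_times) to (n_samples, n_genes)."""
    if how == "mean":
        return x.mean(axis=2)
    if how == "median":
        return np.median(x, axis=2)
    if how == "pc1":
        out = np.empty(x.shape[:2])
        for g in range(x.shape[1]):
            # Center each time point, then project onto the first principal axis.
            centered = x[:, g, :] - x[:, g, :].mean(axis=0)
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            out[:, g] = centered @ vt[0]
        return out
    if how == "sign_average":
        if y is None:
            raise ValueError("sign_average needs class labels y")
        y = np.asarray(y)
        # Assumed definition: sign of (class 1 mean - class 0 mean) per gene and time.
        diff = x[y == 1].mean(axis=0) - x[y == 0].mean(axis=0)
        signs = np.sign(diff)  # shape (n_genes, n_times), broadcast over samples
        return (x * signs).mean(axis=2)
    raise ValueError(f"unknown summary score: {how}")
```

The resulting samples × genes matrix of pseudogenes can then be passed to any cross-sectional feature selection method. Note that the sign of a principal component is arbitrary, so PC1 scores are only defined up to a sign flip.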
Performance of the proposed method on the traumatic injury application and comparison with other methods.
| Group | Method | Size | Rand index | Test set error rate | Test set BCM¹ | Test set AUPR² |
|---|---|---|---|---|---|---|
| Proposed methods | Sign Avg. & LASSO/CD³ | 32 | 19.58% | | | 0.626 |
| Proposed methods | Sign Avg. & TGDR⁴ | 30 | 25.21% | 0.378 | 0.590 | |
| Existing methods | limma | 47 | 21.40% | 0.432 | 0.542 | 0.628 |
| Existing methods | EDGE | 453 | 13.67% | 0.432 | 0.543 | 0.622 |
| Existing methods | glmmLASSO | 8 | 34.99% | 0.432 | 0.519 | 0.532 |
| Existing methods | LASSO/CD separately⁵ | 28 | 13.59% | 0.486 | 0.498 | 0.508 |
| Existing methods | TGDR separately⁶ | 133 | 22.58% | 0.378 | 0.520 | 0.579 |
| Using other summary scores | Mean & LASSO/CD⁷ | 29 | 17.95% | 0.405 | 0.536 | 0.560 |
| Using other summary scores | Mean & TGDR⁸ | 36 | 27.37% | 0.405 | 0.562 | 0.617 |
| Using other summary scores | Median & LASSO/CD⁹ | 22 | 7.76% | 0.351 | 0.543 | 0.617 |
| Using other summary scores | Median & TGDR¹⁰ | 43 | 18.58% | 0.405 | 0.578 | 0.626 |
| Using other summary scores | PC1 & LASSO/CD¹¹ | 3 | 13.59% | 0.405 | 0.504 | 0.541 |
| Using other summary scores | PC1 & TGDR¹² | 29 | 32.68% | 0.432 | 0.539 | 0.548 |

¹ BCM captures the average confidence that a sample belongs to class i when it indeed belongs to that class.
² AUPR is the average of AUPRk over the classes; it captures the ability to correctly rank the samples known to belong to a given class.
³ Sign Avg. & LASSO/CD: pseudogenes were obtained by calculating the sign average of a gene's expression values across time; the feature selection method is LASSO, optimized by coordinate descent (CD).
⁴ Sign Avg. & TGDR: pseudogenes were obtained by calculating the sign average of a gene's expression values across time; the feature selection/optimization method is threshold gradient descent regularization (TGDR).
⁵ LASSO/CD separately: separate LASSO models were trained at individual time points; the optimization method is CD.
⁶ TGDR separately: separate TGDR models were trained at individual time points; the optimization method is TGDR.
⁷ Mean & LASSO/CD: pseudogenes were obtained by calculating the average of a gene's expression values across time; the optimization method is CD.
⁸ Mean & TGDR: pseudogenes were obtained by calculating the average of a gene's expression values across time; the optimization method is TGDR.
⁹ Median & LASSO/CD: pseudogenes were obtained by calculating the median of a gene's expression values across time; the optimization method is CD.
¹⁰ Median & TGDR: pseudogenes were obtained by calculating the median of a gene's expression values across time; the optimization method is TGDR.
¹¹ PC1 & LASSO/CD: pseudogenes were obtained by calculating the first principal component of a gene's expression values across time; the optimization method is CD.
¹² PC1 & TGDR: pseudogenes were obtained by calculating the first principal component of a gene's expression values across time; the optimization method is TGDR.
Performance of the proposed methods and other relevant methods on Simulation I.
| Method | Size | Rand (%) | F13A1 (%) | GSTM1 (%) | Error rate¹ (%) | BCM² | AUPR³ |
|---|---|---|---|---|---|---|---|
| Sign Avg. & LASSO/CD⁴ | 5.52 | 13.78 | 70 | 10 | 22.97 | 0.582 | 0.873 |
| Sign Avg. & TGDR⁵ | 16.76 | 8.12 | 88 | 100 | 6.77 | 0.724 | 0.987 |
| EDGE∗ | 20 | 3.85 | 16 | 0 | 10.80 | 0.719 | 0.936 |
| limma | 6.04 | 11.72 | 8 | 100 | 16.17 | 0.707 | 0.908 |
| LASSO/CD separately⁶ | 4.65 | 29.17 | 36 | 40 | 30.00 | 0.527 | 0.924 |
| TGDR separately⁷ | 32.26 | 5.30 | 100 | 100 | 19.27 | 0.611 | 0.991 |
| glmmLASSO | 114.06 | 3.05 | 0 | 0 | 36.40 | 0.519 | 0.571 |

∗ Using the q-value as the cutoff, EDGE selects all 1,000 genes as significant. We used the 20 most significant genes instead.
¹ Error rate = (false positives + false negatives)/(sample size).
² BCM captures the average confidence that a sample belongs to class i when it indeed belongs to that class.
³ AUPR is computed as the average of the AUPRk for each class and captures the ability to correctly rank the samples known to belong to a given class.
⁴ Sign Avg. & LASSO/CD: pseudogenes were obtained by calculating the sign average of a gene's expression values across time; the optimization method is coordinate descent (CD).
⁵ Sign Avg. & TGDR: pseudogenes were obtained by calculating the sign average of a gene's expression values across time; the optimization method is threshold gradient descent regularization (TGDR).
⁶ LASSO/CD separately: separate LASSO models were trained at individual time points; the optimization method is CD.
⁷ TGDR separately: separate TGDR models were trained at individual time points; the optimization method is TGDR.
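The three predictive metrics described in the footnotes above can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: the function names are hypothetical, class labels are assumed to be integers 0..K-1 indexing the columns of a predicted probability matrix, BCM is taken as the mean probability assigned to the true class, and each per-class AUPR is approximated by average precision (one common estimator of the area under the precision-recall curve) without special handling of tied scores.

```python
import numpy as np


def error_rate(y_true, y_pred):
    """(false positives + false negatives) / sample size."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))


def bcm(y_true, proba):
    """Mean predicted probability of each sample's true class (assumed BCM form)."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    return float(proba[np.arange(len(y_true)), y_true].mean())


def average_precision(is_positive, scores):
    """Average precision for one class; assumes at least one positive sample."""
    is_positive = np.asarray(is_positive)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # rank by decreasing score
    hits = is_positive[order].astype(float)
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())


def mean_aupr(y_true, proba):
    """Average of the per-class AUPR_k estimates over all classes."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    per_class = [average_precision(y_true == k, proba[:, k])
                 for k in range(proba.shape[1])]
    return float(np.mean(per_class))
```

For a hard classification, `error_rate` would be applied to `proba.argmax(axis=1)`, whereas `bcm` and `mean_aupr` use the predicted probabilities directly and therefore also reward well-calibrated and well-ranked predictions.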
Performance of the proposed methods and other relevant methods on Simulation II.
| Method | Size | Rand (%) | F13A1 (%) | GSTM1 (%) | Error rate¹ (%) | BCM² | AUPR³ |
|---|---|---|---|---|---|---|---|
| Sign Avg. & LASSO/CD⁴ | 13.82 | 10.03 | 100 | 96 | 1.27 | 0.854 | 0.994 |
| Sign Avg. & TGDR⁵ | 9.92 | 14.78 | 100 | 96 | 3.33 | 0.841 | 0.993 |
| EDGE∗ | 20 | 2.72 | 0 | 0 | 7.37 | 0.755 | 0.973 |
| limma | 8.9 | 9.75 | 0 | 100 | 5.23 | 0.809 | 0.981 |
| LASSO/CD separately⁶ | 15.88 | 8.81 | 98 | 100 | 6.60 | 0.668 | 0.982 |
| TGDR separately⁷ | 75.48 | 3.38 | 100 | 100 | 4.47 | 0.714 | 0.991 |
| glmmLASSO | 63.52 | 1.63 | 4 | 8 | 46.77 | 0.510 | 0.551 |

∗ Using the q-value as the cutoff, EDGE selects all 1,000 genes as significant. We used the 20 most significant genes instead.
¹ Error rate = (false positives + false negatives)/(sample size).
² BCM captures the average confidence that a sample belongs to class i when it indeed belongs to that class.
³ AUPR is computed as the average of the AUPRk for each class and captures the ability to correctly rank the samples known to belong to a given class.
⁴ Sign Avg. & LASSO/CD: pseudogenes were obtained by calculating the sign average of a gene's expression values across time; the optimization method is coordinate descent (CD).
⁵ Sign Avg. & TGDR: pseudogenes were obtained by calculating the sign average of a gene's expression values across time; the optimization method is threshold gradient descent regularization (TGDR).
⁶ LASSO/CD separately: separate LASSO models were trained at individual time points; the optimization method is CD.
⁷ TGDR separately: separate TGDR models were trained at individual time points; the optimization method is TGDR.
Figure 3. Venn diagram illustrating the overlap of genes selected by the sign average and TGDR method and by the sign average and CD method for the injury application. Genes directly related to injury according to the GeneCards database are underlined.