Suyan Tian, Chi Wang, Bing Wang.
Abstract
Feature selection is critical for analyzing gene expression data with sophisticated grouping structures and for extracting hidden patterns from such data. It is well known that genes do not function in isolation but work together within various metabolic, regulatory, and signaling pathways. A method that takes the biological knowledge contained in these pathways into account is called a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart, in which no biological knowledge is considered. In this article, pathway-based feature selection methods are first divided into three major categories: pathway-level selection, bilevel selection, and pathway-guided gene selection. Regarding bilevel selection methods as a special case of pathway-guided gene selection, we discuss pathway-guided gene selection methods in detail, along with the importance of penalization in such methods. Last, we point out the potential use of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.
Year: 2019 PMID: 31073522 PMCID: PMC6470448 DOI: 10.1155/2019/2497509
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. Major ramifications of pathway-based feature selection methods.
Figure 2. Graphical illustration of the stepwise forward methods.
Figure 3. Graphical illustration of the weighting methods.
Figure 4. Graphical illustration of the penalty methods.
A selective review of pathway-guided gene selection algorithms.
| Reference | Brief description of the proposed method and its characteristics | Category |
|---|---|---|
| Zhu et al. [ | The proposed network-based SVM method combines the network-constrained penalty (see equation ( | Penalty |
| Chen et al. [ | The netSVM method also combines the network-constrained penalty (see equation ( | Penalty |
| Sokolov et al. [ | A generalized elastic net penalty function is defined and combined with an objective function to select important genes. This is named the GELnet method. | Penalty |
| Zhang et al. [ | The Net-Cox method adds a network-constrained penalty term to the corresponding partial likelihood function of a Cox model, aiming to select important prognostic genes | Penalty |
| Bandyopadhyay et al. [ | After ranking genes in a pathway according to their marginal classification power, the proposed BPFS method starts from the gene with the largest power and then adds genes | Stepwise forward |
| Lee et al. [ | In each pathway, the method reorders genes according to their t-scores and then identifies the subset of genes whose combined expression has optimal discriminative power; these genes are called CORGs. | Stepwise forward |
| Razi et al. [ | The proposed NBCG method starts with a seed gene and traverses the network to find the optimal subset on the basis of Shapley value. | Stepwise forward |
| Wu et al. [ | The shortest path method (seeded with well-known genes related to the disease under study, i.e., gastric cancer) is used to mine candidate genes, and a combination of random forest + incremental feature selection is used to obtain the optimal subset. | Stepwise forward¹ |
| Tian et al. [ | The weighted-SAMGSR method extends the SAMGSR algorithm by weighing SAMGS statistics according to genes' connectivity levels in the network. | A hybrid of weighting and stepwise forward |
| Johannes et al. [ | The RRFE method uses the GeneRank algorithm to alter the ranking criterion of the SVM-RFE algorithm and selects a subset with the best discriminative power. | Weighting |
| Chan et al. [ | The wgSVM-SCAD method weighs the expression values of genes in a pathway according to their t-values and then uses a penalized SVM model (with SCAD penalty) to identify relevant genes. | Weighting |
| Tian et al. [ | Using the sign averages of all genes inside a gene set to represent the corresponding gene set, the proposed methods (i.e., one forward bi-level selection method and one backward bi-level selection method) filter out insignificant gene sets and insignificant genes in a specific order. | Bi-level selection |
| Lim and Wong [ | In both FSNet and PFSNet methods, a fuzzy value is assigned to each gene for each sample and then majority voting is used to determine important genes. | Bi-level selection |
Note: Bilevel selection algorithms are regarded as a special case of pathway-guided gene selection algorithms.
¹ Can be loosely categorized into the indicated category (e.g., stepwise forward).
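
The stepwise forward entries in the table share a common pattern: rank the genes in a pathway, then add them one at a time while a discriminative score keeps improving. A minimal Python sketch of that pattern follows. It is not the implementation of any cited method; the function names (`t_score`, `forward_select`), the use of a Welch t-statistic as the score, and the plain average as the "combined expression" are all illustrative assumptions.

```python
import math
from statistics import mean, variance


def t_score(a, b):
    """Welch t-statistic between two lists of values (0.0 if undefined)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def forward_select(expr, labels, pathway):
    """Greedy forward selection of genes within one pathway.

    expr    : dict mapping gene name -> list of expression values per sample
    labels  : list of 0/1 class labels, in the same sample order
    pathway : list of gene names that belong to the pathway
    """
    def split(values):
        case = [v for v, y in zip(values, labels) if y == 1]
        ctrl = [v for v, y in zip(values, labels) if y == 0]
        return case, ctrl

    def activity_score(genes):
        # Assumed pathway activity: per-sample average of the chosen genes.
        activity = [mean(expr[g][i] for g in genes) for i in range(len(labels))]
        return abs(t_score(*split(activity)))

    # Rank genes by marginal discriminative power (absolute t-score).
    ranked = sorted(pathway,
                    key=lambda g: abs(t_score(*split(expr[g]))),
                    reverse=True)

    selected = [ranked[0]]
    best = activity_score(selected)
    for gene in ranked[1:]:
        score = activity_score(selected + [gene])
        if score <= best:  # stop at the first gene that does not improve
            break
        selected.append(gene)
        best = score
    return selected, best
```

Averaging raw expression cancels genes that move in opposite directions, so a fuller treatment would standardize each gene and handle the sign of its t-score; the sketch omits this for brevity.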
Penalty terms used in the penalty methods.
| Methods | Mathematical notation | Characteristics |
|---|---|---|
| Li & Li, 2008 [ | | Aims at smoothing the |
| [ | | Accounts for the possibility that two connected genes might have |
| [ | | Shrinks the weighted |
| [ | for | A 2-step procedure is used to reduce biases; it is proved that this performs better than that with smaller |
| [ | | Encourages simultaneous selection of neighboring genes in the network. However, the indicator function I is not continuous and thus needs special care. |
| The generalized elastic net: | | Includes the network-constrained penalty term by [ |
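
The mathematical notation for these penalties did not survive in the table above. As a hedged illustration only, the sketch below evaluates the network-constrained penalty in the form commonly attributed to Li & Li (2008): an L1 term plus a sum over network edges of squared differences between degree-scaled coefficients. The function name `network_penalty`, its arguments, and the unweighted-edge assumption are illustrative and not taken from the article.

```python
import math


def network_penalty(beta, edges, lam1, lam2):
    """Assumed form: lam1 * sum_j |beta_j|
                   + lam2 * sum_{u~v} (beta_u/sqrt(d_u) - beta_v/sqrt(d_v))^2

    beta  : dict mapping gene name -> coefficient
    edges : list of (u, v) pairs, one per undirected, unweighted edge
    """
    # Degree d_u = number of edges touching gene u.
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    l1_term = sum(abs(b) for b in beta.values())
    smooth_term = sum(
        (beta[u] / math.sqrt(degree[u]) - beta[v] / math.sqrt(degree[v])) ** 2
        for u, v in edges
    )
    return lam1 * l1_term + lam2 * smooth_term
```

Under this form, two connected genes with equal degree-scaled coefficients add nothing to the second term, whereas coefficients of opposite sign are penalized heavily, which is the behavior the second table row sets out to relax.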
Penalty terms used in the bilevel selection methods.
| Methods | Mathematical notation | Characteristics |
|---|---|---|
| Group LASSO | General form | It cannot identify the important genes within the selected gene sets and is thus incapable of bilevel selection; it also heavily shrinks large coefficients, which biases their estimates. |
| Group bridge [ | Outer bridge penalty + inner LASSO penalty | It can provide sparse solutions at both the pathway and gene levels, but it poses considerable practical difficulties because the bridge penalty is not differentiable everywhere. |
| Group MCP [ | Outer MCP penalty + inner MCP penalty | Allows coefficients to grow large and groups to remain sparse. |
| Group exponential LASSO [ | Outer exponential penalty + inner LASSO penalty | A decay parameter controls the degree to which gene selection is coupled within gene sets; the method has several advantages over other composite penalties such as group bridge. |
| Sparse group LASSO [ | | Convex and thus highly likely to reach the global minimum, but extra care is needed since group coordinate descent algorithms cannot be applied. |
Note: the general form for group LASSO, group bridge, and group MCP was given by Breheny & Huang [31]. It is too general to guarantee that all combinations of outer and inner penalties produce sensible models. Thus, a second general form was proposed by Huang et al. [59] to address this issue specifically.
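
To make the contrast between group LASSO and sparse group LASSO concrete, the sketch below computes both penalties for coefficients grouped by gene set. The formulas are the commonly used ones (a size-weighted sum of group L2 norms, optionally plus an overall L1 term) and are an assumption here, since the notation is missing from the table; the function and parameter names are illustrative.

```python
import math


def group_lasso_penalty(groups, lam):
    """Assumed form: lam * sum_g sqrt(p_g) * ||beta_g||_2

    groups : list of coefficient lists, one list per gene set
    """
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(b * b for b in g)) for g in groups
    )


def sparse_group_lasso_penalty(groups, lam_group, lam_gene):
    """Assumed form: the group LASSO term plus lam_gene * sum_j |beta_j|.

    The extra L1 term is what permits sparsity among genes inside a
    selected gene set.
    """
    l1_term = sum(abs(b) for g in groups for b in g)
    return group_lasso_penalty(groups, lam_group) + lam_gene * l1_term
```

Setting `lam_gene` to zero recovers the group LASSO penalty, which depends on each gene set only through its overall norm and therefore cannot single out individual genes within a selected set.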
Figure 5. Statistics for pathway-guided gene selection methods in cancer studies. A literature search was conducted in PubMed using the keywords feature selection, gene expression, pathway/network, and cancer. The number of relevant articles, stratified by the cancer type under study, is given above each bar.