Yong Liang, Cheng Liu, Xin-Ze Luan, Kwong-Sak Leung, Tak-Ming Chan, Zong-Ben Xu, Hai Zhang.
Abstract
BACKGROUND: Microarray technology is widely used in cancer diagnosis. Successfully identifying gene biomarkers will significantly help to classify different cancer types and improve prediction accuracy. Regularization is one of the effective approaches to gene selection in microarray data, which generally contain a large number of genes and a small number of samples. In recent years, various approaches have been developed for gene selection in microarray data. Generally, they are divided into three categories: filter, wrapper and embedded methods. Regularization methods are an important embedded technique and perform continuous shrinkage and automatic gene selection simultaneously. Recently, there has been growing interest in applying regularization techniques to gene selection. A popular regularization technique is the Lasso (L1), and many L1-type regularization terms have been proposed in recent years. Theoretically, Lq-type regularization with a lower value of q should lead to better, sparser solutions. Moreover, L1/2 regularization can be taken as a representative of the Lq (0 < q < 1) regularizations and has been shown to have many attractive properties.
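To make the comparison between penalties concrete, the following is a minimal sketch of the three penalty terms named in this record (L1, L1/2 and elastic net) attached to a logistic loss. The function names, the mixing weight `alpha` and the toy coefficient vectors are assumptions chosen for illustration; this is not the authors' implementation or their solver.

```python
import numpy as np

def lq_penalty(beta, q):
    """Sum of |beta_j|**q; q=1 is the L1 (Lasso) penalty, q=0.5 is L1/2."""
    return float(np.sum(np.abs(beta) ** q))

def elastic_net_penalty(beta, alpha=0.5):
    """Mix of L1 and squared L2 terms; alpha is an assumed mixing weight."""
    beta = np.asarray(beta, dtype=float)
    return float(alpha * np.sum(np.abs(beta)) + (1.0 - alpha) * np.sum(beta ** 2))

def penalized_logistic_loss(beta, X, y, lam, penalty):
    """Mean logistic negative log-likelihood plus lam * penalty(beta); y in {0, 1}."""
    z = X @ beta
    # log(1 + exp(z)) evaluated stably with logaddexp
    nll = np.mean(np.logaddexp(0.0, z) - y * z)
    return float(nll + lam * penalty(beta))

# Two hypothetical coefficient vectors with the same L1 norm (2.0)
dense = np.array([0.5, 0.5, 0.5, 0.5])
sparse = np.array([2.0, 0.0, 0.0, 0.0])

print("L1   dense/sparse:", lq_penalty(dense, 1.0), lq_penalty(sparse, 1.0))
print("L1/2 dense/sparse:", lq_penalty(dense, 0.5), lq_penalty(sparse, 0.5))
```

In this toy case the L1 penalty cannot tell the two vectors apart, while the L1/2 penalty charges the sparse vector less, which is the intuition behind the claim that a lower q favours sparser solutions.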
Year: 2013 PMID: 23777239 PMCID: PMC3718705 DOI: 10.1186/1471-2105-14-198
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
as unbiasedness, sparsity and oracle properties [14].
Lq penalties makes no significant difference and solving the L1/2 regularization is much simpler than solving the L0 regularization. Hence, the L1/2 regularization can be taken as a representative of Lq (0
The average errors (%) for the test data sets obtained by the sparse logistic regressions with the L1/2, LEN and L1 penalties in 30 runs
| L1/2 | LEN | L1 |
| 28.2 | 31.8 | 31.2 |
| 10.7 | 23.1 | 22.2 |
| 8.1 | 16.9 | 15.7 |
| 31.4 | 33.1 | 33.3 |
| 18.4 | 27.1 | 26.6 |
| 14.2 | 22.4 | 21.3 |
| 30.1 | 32.6 | 33.0 |
| 11.1 | 23.3 | 22.9 |
| 9.1 | 19.0 | 16.4 |
| 35.1 | 35.5 | 36.3 |
| 20.5 | 27.2 | 26.9 |
| 15.1 | 22.7 | 22.9 |
The average number of variables selected by the sparse logistic regressions with the L1/2, LEN and L1 penalties in 30 runs
| L1/2 | LEN | L1 |
| 7.5 | 31.6 | 27.1 |
| 8.8 | 43.1 | 40.3 |
| 8.9 | 49.7 | 45.7 |
| 8.3 | 33.6 | 29.2 |
| 10.6 | 45.7 | 41.9 |
| 10.8 | 54.4 | 50.1 |
| 7.8 | 33.5 | 28.3 |
| 8.9 | 44.5 | 41.8 |
| 9.0 | 51.2 | 46.6 |
| 8.6 | 41.3 | 29.9 |
| 10.7 | 45.9 | 44.1 |
| 11.2 | 56.4 | 53.4 |
The frequencies of the relevant variables obtained by the sparse logistic regressions with the L1/2, LEN and L1 penalties in 30 runs
| L1/2 | 21 | 22 | 19 | 15 | 15 | ||
| LEN | 24 | 25 | 21 | 17 | 17 | ||
| L1 | 22 | 24 | 20 | 15 | 17 | ||
| L1/2 | 30 | 30 | 30 | 30 | 30 | ||
| LEN | 30 | 29 | 30 | 30 | 30 | ||
| L1 | 30 | 29 | 30 | 30 | 30 | ||
| L1/2 | 30 | 30 | 30 | 30 | 30 | ||
| LEN | 30 | 30 | 30 | 30 | 30 | ||
| L1 | 30 | 30 | 30 | 30 | 30 | ||
| L1/2 | 17 | 17 | 17 | 14 | 14 | ||
| LEN | 18 | 19 | 17 | 16 | 14 | ||
| L1 | 18 | 18 | 18 | 16 | 15 | ||
| L1/2 | 30 | 29 | 30 | 28 | 28 | ||
| LEN | 30 | 28 | 30 | 28 | 27 | ||
| L1 | 30 | 28 | 30 | 27 | 26 | ||
| L1/2 | 30 | 30 | 30 | 30 | 30 | ||
| LEN | 30 | 30 | 30 | 30 | 30 | ||
| L1 | 30 | 30 | 30 | 28 | 30 | ||
| L1/2 | 19 | 18 | 18 | 16 | 15 | ||
| LEN | 21 | 22 | 21 | 17 | 17 | ||
| L1 | 18 | 21 | 19 | 16 | 17 | ||
| L1/2 | 30 | 30 | 30 | 30 | 30 | ||
| LEN | 30 | 28 | 30 | 29 | 29 | ||
| L1 | 30 | 27 | 30 | 29 | 29 | ||
| L1/2 | 30 | 30 | 30 | 30 | 30 | ||
| LEN | 30 | 30 | 30 | 30 | 30 | ||
| L1 | 30 | 30 | 30 | 29 | 29 | ||
| L1/2 | 14 | 16 | 15 | 12 | 12 | ||
| LEN | 17 | 17 | 17 | 12 | 14 | ||
| L1 | 17 | 15 | 14 | 9 | 13 | ||
| L1/2 | 29 | 25 | 26 | 28 | 29 | ||
| LEN | 28 | 24 | 24 | 27 | 24 | ||
| L1 | 27 | 24 | 24 | 23 | 23 | ||
| L1/2 | 30 | 29 | 30 | 30 | 30 | ||
| LEN | 30 | 27 | 28 | 28 | 30 | ||
| L1 | 29 | 27 | 27 | 28 | 30 |
Four publicly available gene expression datasets used in the experiments
| Dataset | No. of genes | No. of samples | Classes |
| Leukaemia | 3571 | 72 | ALL/AML |
| Prostate | 5966 | 102 | Normal/Tumor |
| Colon | 2000 | 62 | Normal/Tumor |
| DLBCL | 6285 | 77 | DLBCL/FL |
Detailed information on the four microarray datasets used in the experiments
| Dataset | Training set | Test set |
| Leukaemia | 50 (32 ALL/18 AML) | 22 (15 ALL/7 AML) |
| Prostate | 71 (35 Normal/36 Tumor) | 31 (15 Normal/16 Tumor) |
| Colon | 42 (14 Normal/28 Tumor) | 20 (8 Normal/12 Tumor) |
| DLBCL | 60 (45 DLBCL/15 FL) | 17 (13 DLBCL/4 FL) |
The classification performances of the different methods on the four gene expression datasets
| Dataset | Method | Errors/training samples | Errors/test samples | No. of selected genes |
| Leukaemia | L1/2 | 2/50 | 1/22 | 2 |
| | LEN | 1/50 | 1/22 | 9 |
| | L1 | 1/50 | 1/22 | 6 |
| Prostate | L1/2 | 5/71 | 3/31 | 5 |
| | LEN | 5/71 | 4/31 | 34 |
| | L1 | 5/71 | 3/31 | 25 |
| Colon | L1/2 | 4/42 | 3/20 | 5 |
| | LEN | 5/42 | 4/20 | 13 |
| | L1 | 5/42 | 4/20 | 7 |
| DLBCL | L1/2 | 3/60 | 2/17 | 14 |
| | LEN | 2/60 | 1/17 | 38 |
| | L1 | 3/60 | 3/17 | 23 |
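The error counts in this table are easier to compare as percentages. The sketch below converts the second count column to error rates, assuming that column holds misclassifications over the test samples (its denominators 22, 31, 20 and 17 match the test-set sizes listed earlier). The dictionary layout and the helper name `error_rate_percent` are illustrative choices, not part of the paper.

```python
from fractions import Fraction

# Test-set error counts copied from the table above, keyed by (dataset, method)
test_errors = {
    ("Leukaemia", "L1/2"): "1/22", ("Leukaemia", "LEN"): "1/22", ("Leukaemia", "L1"): "1/22",
    ("Prostate", "L1/2"): "3/31", ("Prostate", "LEN"): "4/31", ("Prostate", "L1"): "3/31",
    ("Colon", "L1/2"): "3/20", ("Colon", "LEN"): "4/20", ("Colon", "L1"): "4/20",
    ("DLBCL", "L1/2"): "2/17", ("DLBCL", "LEN"): "1/17", ("DLBCL", "L1"): "3/17",
}

def error_rate_percent(count):
    """Convert an 'errors/samples' string such as '3/31' to a percentage with one decimal."""
    return round(float(Fraction(count) * 100), 1)

for (dataset, method), count in test_errors.items():
    print(f"{dataset:10s} {method:5s} {count:>5s} -> {error_rate_percent(count)}%")
```

Read this way, a single misclassified sample shifts the rate by several percentage points on the smaller test sets, so small differences between methods should be interpreted with care.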
Figure 1. The results of the sparse logistic regression with the L1/2 penalty on the Prostate dataset. The solution paths and the gene selection results of the sparse logistic L1/2 penalty method for the Prostate dataset in one sample run.
Figure 2. The results of the sparse logistic regression with the LEN penalty on the Prostate dataset. The solution paths and the gene selection results of the sparse logistic elastic net penalty method for the Prostate dataset in one sample run.
Figure 3. The results of the sparse logistic regression with the L1 penalty on the Prostate dataset. The solution paths and the gene selection results of the sparse logistic L1 penalty method for the Prostate dataset in one sample run.
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Leukaemia dataset
| 1 | CFD complement factor D (adipsin) * | ||
| 2 | CFD complement factor D (adipsin) * | ||
| 3 | |||
| 4 | GYPB glycophorin B (MNS blood group) | ||
| 5 | TCL1A T-cell leukemia/lymphoma 1A * | ||
| 6 | TCL1A T-cell leukemia/lymphoma 1A * | ||
| 7 | LOC100437488 interleukin-8-like | ||
| 8 | ZYX zyxin * | ||
| 9 | TCRB T cell receptor beta cluster | CD79A CD79a molecule, immunoglobulin-associated alpha | |
| 10 | S100A9 S100 calcium binding protein A9 | CD79A CD79a molecule, immunoglobulin-associated alpha | HBB hemoglobin, beta |
Genes marked with an asterisk (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6; genes common to the L1/2, LEN and L1 classifiers are emphasized in bold.
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the Prostate dataset
| 1 | |||
| 2 | |||
| 3 | KHDRBS1 KH domain containing, RNA binding, signal transduction associated 1 * | ||
| 4 | ZNF787 zinc finger protein 787 * | PRAF2 PRA1 domain family, member 2 * | Gene symbol:AA683055, probe set: 34711_at * |
| 5 | GMPR guanosine monophosphate reductase * | CACYBP calcyclin binding protein * | |
| 6 | Gene symbol:AA683055, probe set: 34711_at * | VSNL1 visinin-like 1 * | |
| 7 | VSNL1 visinin-like 1 * | FLNC filamin C, gamma * | |
| 8 | USP2 ubiquitin specific peptidase 2 | PRAF2 PRA1 domain family, member 2 * | |
| 9 | CACYBP calcyclin binding protein * | ||
| 10 | ACTN4 actinin, alpha 4 | TMCO1 transmembrane and coiled-coil domains 1 * | |
Genes marked with an asterisk (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6; genes common to the L1/2, LEN and L1 classifiers are emphasized in bold.
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the colon dataset
| 1 | |||
| 2 | |||
| 3 | |||
| 4 | CHRND cholinergic receptor, nicotinic, delta polypeptide * | GSN gelsolin * | |
| 5 | PECAM1 platelet/endothelial cell adhesion molecule-1 * | GSN gelsolin * | |
| 6 | COL11A2 collagen, type XI, alpha 2 * | COL11A2 collagen, type XI, alpha 2 * | |
| 7 | ATF7 activating transcription factor 7 | MXI1 MAX interactor 1, dimerization protein * | |
| 8 | PROBABLE NUCLEAR ANTIGEN (Pseudorabies virus)[accession number:T86444] | ssb single-strand binding protein * | UQCRC1 ubiquinol-cytochrome c reductase core protein I * |
| 9 | Sept2 septin 2 * | ||
| 10 | MYH10 myosin, heavy chain 10, non-muscle | MXI1 MAX interactor 1, dimerization protein * | ZEB1 zinc finger E-box binding homeobox 1* |
Genes marked with an asterisk (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6; genes common to the L1/2, LEN and L1 classifiers are emphasized in bold.
The 10 top-ranked informative genes found by the three sparse logistic regression methods from the DLBCL dataset
| 1 | MTH1 metallothionein 1H * | MTH1 metallothionein 1H * | |
| 2 | HLA-DQB1 major histocompatibility complex, class II, DQ beta 1 * | ||
| 3 | |||
| 4 | THRSP thyroid hormone responsive * | TCL1A T-cell leukemia/lymphoma 1A * | |
| 5 | ZFP36L2 ZFP36 ring finger protein-like 2 * | POLD2 polymerase (DNA directed), delta 2, accessory subunit * | |
| 6 | TCL1A T-cell leukemia/lymphoma 1A * | FCGR1A Fc fragment of IgG, high affinity Ia, receptor (CD64) * | |
| 7 | GOT2 glutamic-oxaloacetic transaminase 2, mitochondrial (aspartate aminotransferase 2) * | MELK maternal embryonic leucine zipper kinase * | |
| 8 | Plod procollagen lysyl hydroxylase * | TRB2 Homeodomain-like/winged-helix DNA-binding family protein * | CKS2 CDC28 protein kinase regulatory subunit 2 * |
| 9 | STXBP2 syntaxin binding protein 2 * | MELK maternal embryonic leucine zipper kinase * | EIF2A eukaryotic translation initiation factor 2A, 65kDa * |
| 10 | AQP4 aquaporin 4 * | ||
Genes marked with an asterisk (*) are the most frequently selected genes used to construct the classifiers according to the last column of Table 6; genes common to the L1/2, LEN and L1 classifiers are emphasized in bold.
Summary of the results of KNN classifiers using the genes most frequently selected by our proposed L1/2 penalized logistic regression method
| Leukaemia | 98.3% | 94.4% |
| Prostate | 95.1% | 94.2% |
| Colon | 95.1% | 90.6% |
| DLBCL | 94.8% | 91.2% |
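For readers unfamiliar with the KNN step mentioned in this caption, the following is a minimal sketch of a k-nearest-neighbour classifier applied to a matrix restricted to a few selected genes. The value of `k`, the Euclidean distance, the function name `knn_predict` and the toy expression values are all assumptions for illustration; the record does not state the settings the authors used.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    # Pairwise squared distances, shape (n_test, n_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for idx in nearest:
        labels, counts = np.unique(y_train[idx], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy data: 6 samples x 2 "selected genes" (hypothetical values, not from the paper)
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.1, 0.1], [1.0, 1.0]])

print(knn_predict(X_train, y_train, X_test, k=3))
```

In practice the columns of `X_train` and `X_test` would be the expression values of the selected genes only, so the classifier works in a space of a handful of dimensions rather than thousands.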