Zhenqiu Liu, Dechang Chen, Ming Tan, Feng Jiang, Ronald B Gartenhaus.
Abstract
BACKGROUND: Most genomic data are ultra-high dimensional, with more than 10,000 genes (probes). Regularization methods with L₁ and L_p penalties have been studied extensively in survival analysis with high-dimensional genomic data. However, when the sample size n << m (the number of genes), directly identifying a small subset of genes from ultra-high dimensional (m > 10,000) data is computationally expensive. In current microarray analysis, the usual practice is to first select a few thousand (or a few hundred) genes using univariate analysis or statistical tests, and then apply a LASSO-type penalty to further reduce the number of disease-associated genes. This two-step procedure may introduce bias and inaccuracy and may miss biologically important genes.
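The two-step procedure described above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the abstract: the response is a plain continuous outcome rather than a censored survival time, the univariate step ranks features by absolute Pearson correlation, and the second step is a basic coordinate-descent LASSO without an intercept. The function names (`screen`, `lasso`) and the toy data are hypothetical.

```python
import random

def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def screen(columns, y, keep):
    """Step 1: keep the indices of the `keep` features with the best univariate score."""
    scores = [abs_correlation(col, y) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:keep])

def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal operator of the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso(columns, y, lam, n_iter=100):
    """Step 2: coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    No intercept is fitted, so the data are assumed to be roughly centered.
    """
    n = len(y)
    beta = [0.0] * len(columns)
    resid = list(y)  # residual y - Xb, updated in place of refitting
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            norm = sum(v * v for v in col) / n
            if norm == 0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(v * r for v, r in zip(col, resid)) / n + norm * beta[j]
            new = soft_threshold(rho, lam) / norm
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - delta * v for r, v in zip(resid, col)]
                beta[j] = new
    return beta

# Toy data (assumed for illustration): 60 samples, 200 features,
# and only features 3 and 7 drive the response.
random.seed(0)
n, m = 60, 200
columns = [[random.gauss(0, 1) for _ in range(n)] for _ in range(m)]
y = [3 * columns[3][i] - 3 * columns[7][i] + random.gauss(0, 0.1) for i in range(n)]

kept = screen(columns, y, keep=20)                     # step 1: univariate filter
beta = lasso([columns[j] for j in kept], y, lam=0.5)   # step 2: L1-penalized fit
selected = [j for j, b in zip(kept, beta) if b != 0.0]
print(selected)
```

Any feature dropped in step 1 can never be recovered in step 2, which is one way the bias mentioned in the abstract can arise.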
Year: 2010 PMID: 21176134 PMCID: PMC3019227 DOI: 10.1186/1471-2105-11-606
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Test RRMSE with Different Correlation Structures. Test relative root mean squared error (RRMSE) with different models and different correlation structures: L - linear; p2 - second-order polynomial kernel; p3 - third-order polynomial kernel; and rbf - radial basis function kernel. The upper panels show the performance with the linear model (k = 1) and the lower panels show the performance with the quadratic model (k = 2).
Figure 2. RRMSE with Different Input Dimensions. Test relative root mean squared error (RRMSE) with different input dimensions. The input dimensions vary from 100 to 100,000.
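The captions above do not spell out how RRMSE is computed, and several definitions are in use. As a hedged illustration, the sketch below uses one common convention, which is an assumption here rather than the authors' stated formula: the root mean squared prediction error divided by the root mean squared error of always predicting the test-set mean. Under this convention 0 means perfect prediction and 1 means no better than the mean.

```python
import math

def rrmse(y_true, y_pred):
    """RMSE of the predictions relative to the RMSE of predicting the mean.

    Assumed convention: sqrt(sum (y - yhat)^2 / sum (y - ybar)^2).
    """
    mean = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean) ** 2 for t in y_true)
    return math.sqrt(sse / sst)

y_test = [1.0, 2.0, 3.0, 4.0]
print(rrmse(y_test, y_test))     # 0.0, perfect prediction
print(rrmse(y_test, [2.5] * 4))  # 1.0, same as predicting the mean
```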
Frequencies of Correctly Identified Variables with Different Parameters Out of 100 Simulations
| Parameters | | | |
|---|---|---|---|
| 100 | 100 | 99 | |
| 100 | 98 | 99 | |
| 97 | 99 | 98 | |
| 98 | 99 | 99 | |
| 98 | 98 | 94 | |
| 88 | 77 | 81 | |
| 90 | 86 | 77 | |
| 98 | 95 | 99 | |
| 98 | 98 | 97 | |
| 96 | 96 | 95 | |
| 99 | 99 | 98 | |
| 99 | 100 | 100 |
Model performance with Simulation Data and Different Parameter Values
| Parameter values | Avg. # of Vars | Exactly-match | Overfitting | Underfitting |
|---|---|---|---|---|
| 1 (0.01, 0.6) | 12.61 | 75% | 21% | 4% |
| 0.2 (0.002, 0.6) | 11.52 | 54% | 3% | 43% |
| 0.1 (0.001, 0.6) | 11.43 | 52% | 2% | 46% |
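The three outcome columns in the table above sum to 100% in every row, which suggests each simulation run falls into exactly one category. A minimal sketch of how such a tally could be produced follows. The category definitions are assumptions, not taken from the source: "exact" means the selected set equals the true set, "overfit" means it contains every true variable plus extras, and "underfit" means at least one true variable is missing. The names `classify_selection` and `summarize` are hypothetical.

```python
from collections import Counter

def classify_selection(selected, truth):
    """Label one run as 'exact', 'overfit', or 'underfit' (assumed definitions)."""
    selected, truth = set(selected), set(truth)
    if selected == truth:
        return "exact"
    if truth < selected:  # every true variable found, plus extras
        return "overfit"
    return "underfit"     # at least one true variable was missed

def summarize(runs, truth):
    """Average number of selected variables and the percentage of runs per label."""
    labels = Counter(classify_selection(run, truth) for run in runs)
    avg_size = sum(len(set(run)) for run in runs) / len(runs)
    percent = {k: 100.0 * labels[k] / len(runs) for k in ("exact", "overfit", "underfit")}
    return avg_size, percent

# Toy example: three true variables and four simulated selection results.
truth = {0, 1, 2}
runs = [[0, 1, 2], [0, 1, 2, 5], [0, 1], [0, 1, 2]]
avg_size, percent = summarize(runs, truth)
print(avg_size, percent)
```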
Computational Time (in Seconds): AKRR vs LASSO
| Input Dimensions | AKRR | LASSO |
|---|---|---|
| 100 | 0.4801 | 0.6378 |
| 1000 | 0.5844 | 6.4577 |
| 10000 | 1.7500 | 978.23 |
| 50000 | 7.5255 | >7200 |
| 100000 | 17.4545 | - - |
Genes Associated with Survival Time for DLBCL Data
| Count | GenBank | Symbol | Description |
|---|---|---|---|
| 200 | | RRM2 | ribonucleotide reductase M2 polypeptide |
| 200 | | HSP90B1 | tumor rejection antigen (gp96) 1 |
| 200 | | BMP6 | bone morphogenetic protein 6 |
| 176 | | CD86 | CD86 antigen (CD28 antigen ligand 2, B7-2 antigen) |
| 181 | | MS4A1 | membrane-spanning 4-domains, subfamily A, member 2 |
| 198 | | CD79A | CD79A antigen (immunoglobulin-associated alpha) |
| 200 | | CD19 | CD19 antigen |
| 138 | | BIRC3 | baculoviral IAP repeat-containing 3 |
| 146 | | LRMP | lymphoid-restricted membrane protein |
| 176 | | MAPK10 | mitogen-activated protein kinase 10 |
| 179 | | | |
| 153 | | HLA-C | immunoglobulin kappa constant |
| 164 | | CCL13 | small inducible cytokine subfamily A (Cys-Cys), member 13 |
| 142 | | CLU | clusterin |
| 200 | | IL1R1 | interleukin 1 receptor, type I |
| 183 | | MMP9 | matrix metalloproteinase 9 |
| 200 | | LMO2 | LIM domain only 2 (rhombotin-like 1) |
| 200 | | MNDA | myeloid cell nuclear differentiation antigen |
| 115 | | IGL@ | heat shock 70 kD protein 1A |
| 162 | | MGST1 | microsomal glutathione S-transferase 1 |
| 200 | | ITIH4 | inter-alpha (globulin) inhibitor H4 |
| 200 | | PDGFRA | platelet-derived growth factor receptor, alpha polypeptide |
| 187 | | ESTs | |
Genes Associated with Survival Time for FL Data
| Count | ProbeID | Symbol | Description |
|---|---|---|---|
| 200 | 231760_at | C20orf51 | chromosome 20 open reading frame 51 |
| 200 | 232932_at | | |
| 200 | 235856_at | C4A | complement component 4A (Rodgers blood group) |
| 187 | 224280_s_a | LOC56181 | family with sequence similarity 54, member B |
| 200 | 201425_at | ALDH2 | aldehyde dehydrogenase 2 family (mitochondrial) |
| 180 | 214694_at | M-RIP | Myosin phosphatase Rho-interacting protein |
| 200 | 214713_at | YLPM1 | YLP motif containing 1 |
| 200 | 218477_at | TMEM14A | transmembrane protein 14A |
| 200 | 220669_at | HSHIN1 | HIV-1 induced protein HIN-1 |
| 195 | 203970_s_a | PEX3 | peroxisomal biogenesis factor 3 |
| 200 | 208470_s_a | HPR | haptoglobin-related protein; haptoglobin |
| 175 | 210920_x_a | | |
| 200 | 215444_s_a | TRIM31 | tripartite motif-containing 31 |