Xinyu Tian, Xuefeng Wang, Jun Chen.
Abstract
The classic multinomial logit model, commonly used in multiclass regression problems, is restricted to a small number of predictors and does not take into account the relationships among variables. It therefore has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expression levels are usually related by an underlying biological network. Efficient use of this network information is important for improving both classification performance and biological interpretability. We proposed a multinomial logit model that is capable of addressing both the high dimensionality of the predictors and the underlying network information. Group lasso was used to induce model sparsity, and a network constraint was imposed to induce smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared to models with no prior structure information, both in simulations and in a cancer subtype prediction problem using real TCGA (The Cancer Genome Atlas) gene expression data. The network-constrained model outperformed the traditional ones in both cases.
Keywords: cancer subtype prediction; group lasso; multinomial logit model; network-constraint; proximal gradient algorithm
Year: 2015 PMID: 25635165 PMCID: PMC4295837 DOI: 10.4137/CIN.S17686
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
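The approach summarized in the abstract (a multinomial logit loss, a group lasso penalty, a network smoothness penalty, and a proximal gradient solver) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not give the exact penalty formulas, so the sketch assumes that each feature's coefficients across all classes form one group, that the network constraint is a quadratic graph-Laplacian penalty `lam2/2 * tr(B^T L B)`, and that there is no intercept. The function names (`fit_ngl_mlm`, `group_soft_threshold`, `objective`) and the parameters `lam1` and `lam2` are hypothetical.

```python
import numpy as np


def softmax(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)


def group_soft_threshold(B, t):
    """Proximal operator of t * sum_j ||B[j, :]||_2 (row-wise group lasso).

    Each row is shrunk toward zero; rows with norm <= t become exactly zero.
    """
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * B


def objective(B, X, Y, L, lam1, lam2):
    """Assumed objective: mean NLL + group lasso + Laplacian smoothness."""
    Z = X @ B
    Z = Z - Z.max(axis=1, keepdims=True)
    log_p = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    nll = -np.mean(np.sum(Y * log_p, axis=1))
    group = lam1 * np.linalg.norm(B, axis=1).sum()
    smooth = 0.5 * lam2 * np.sum(B * (L @ B))  # equals lam2/2 * tr(B^T L B)
    return nll + group + smooth


def fit_ngl_mlm(X, Y, L, lam1, lam2, n_iter=500):
    """Proximal gradient descent for the assumed objective above.

    X: (n, p) features; Y: (n, K) one-hot labels; L: (p, p) graph Laplacian.
    Returns a (p, K) coefficient matrix.
    """
    n, p = X.shape
    K = Y.shape[1]
    B = np.zeros((p, K))
    # Fixed step from an upper bound on the Lipschitz constant of the smooth
    # part: 0.5 * ||X||_2^2 / n for the softmax loss plus lam2 * ||L||_2.
    lipschitz = 0.5 * np.linalg.norm(X, 2) ** 2 / n + lam2 * np.linalg.norm(L, 2)
    step = 1.0 / lipschitz
    for _ in range(n_iter):
        P = softmax(X @ B)
        grad = X.T @ (P - Y) / n + lam2 * (L @ B)  # gradient of the smooth part
        B = group_soft_threshold(B - step * grad, step * lam1)  # prox step
    return B
```

The gradient step handles the smooth terms (likelihood and Laplacian penalty), while the row-wise soft-thresholding handles the non-smooth group lasso term; this split is what makes a proximal gradient method a natural fit for the non-smooth objective described above. Features whose rows of `B` are exactly zero are dropped from all classes at once.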
Figure 1. MSE of parameter estimation under ideal structure information for small and large models with ideal, similar, and random coefficients.
Figure 2. Prediction accuracy rate for small and large models with ideal, similar, and random coefficients under ideal structure information.
Figure 3. Brier scores for small and large models with ideal, similar, and random coefficients under ideal structure information.
Figure 4. Comparison of four candidate methods under incorrect network and overlapping network in terms of MSE, accuracy rate, and Brier score.
Average prediction accuracy and average number of predictors in each model (model size) for the GBM data set.
| METHOD | PREDICTION ACCURACY (MEAN/SD) | BRIER SCORE (MEAN/SD) | MODEL SIZE |
|---|---|---|---|
| L-MLM | 0.824/0.043 | 3.352/0.352 | 52.76 |
| GL-MLM | 0.858/0.044 | 2.992/0.381 | 43.18 |
| NGL-MLM | 0.859/0.053 | 3.226/0.323 | 37.54 |
| NGL-MLMa | 0.907/0.040 | 2.816/0.281 | 34.62 |
Figure 5. The subnetwork selected by NGL-MLMa on GBM gene expression data.