Zhenqiu Liu, Halima Bensmail, Ming Tan.
Abstract
Multiclass classification and feature (variable) selection are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as K-nearest neighbor (KNN) extend naturally to multiclass problems and usually perform well with unbalanced data, but they suffer from the curse of dimensionality: their performance degrades on high-dimensional data. On the other hand, model-based methods such as logistic regression require decomposing the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Even though they can be applied to high-dimensional data with L1 or Lp penalized methods, such approaches can only select independent features, and the features selected for different binary problems usually differ. The one-vs.-rest scheme also produces unbalanced classification problems even if the original multiclass problem is balanced.

By combining instance-based and model-based learning, we propose an efficient learning method that integrates KNN and constrained logistic regression (KNNLog) for simultaneous multiclass classification and feature selection. The proposed method simultaneously minimizes the intra-class distance and maximizes the inter-class distance with fewer estimated parameters. It is very efficient for problems with small sample sizes and unbalanced classes, a case common in many real applications. In addition, our model-based feature selection method can identify highly correlated features simultaneously, avoiding the multiplicity problem due to multiple tests. The proposed method is evaluated with simulated and real data, including one unbalanced microRNA dataset for leukemia and one multiclass metagenomic dataset from the Human Microbiome Project (HMP). It performs well in limited computational experiments.
Keywords: feature selection; high-dimensional data; multiclass classification; statistical learning
Year: 2012 PMID: 22577297 PMCID: PMC3347893 DOI: 10.4137/EBO.S9407
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
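The abstract's remark that a one-vs.-rest decomposition yields unbalanced binary problems even when the multiclass problem is balanced can be checked with a few lines of arithmetic. The sketch below is only an illustration: the class names, the sample counts, and the helper `one_vs_rest_balance` are hypothetical and do not come from the paper.

```python
from collections import Counter


def one_vs_rest_balance(labels):
    """Return, for each class, the fraction of samples that are positive
    in the corresponding one-vs.-rest binary problem."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical balanced 4-class problem with 25 samples per class.
labels = [cls for cls in ("A", "B", "C", "D") for _ in range(25)]
fractions = one_vs_rest_balance(labels)

for cls, frac in sorted(fractions.items()):
    print(f"{cls} vs. rest: {frac:.2f} positive, {1 - frac:.2f} negative")
```

With K equally sized classes, each binary problem has a positive fraction of 1/K, so the imbalance grows with the number of classes.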
Frequencies of correctly identified features with different sample sizes.
| Sample size (per-class parameters) | 10 (300, 9, 2) | 20 (350, 19, 2) | 30 (450, 28, 1) | 50 (460, 45, 1) |
|---|---|---|---|---|
| | 90 | 93 | 94 | 96 |
| | 100 | 100 | 100 | 100 |
| | 100 | 100 | 100 | 100 |
| | 100 | 100 | 100 | 100 |
| | 76 | 88 | 91 | 93 |
| | 90 | 93 | 94 | 96 |
| | 100 | 100 | 100 | 100 |
| | 100 | 100 | 100 | 100 |
| | 100 | 100 | 100 | 100 |
| | 76 | 88 | 91 | 93 |
Note: The frequency number indicates the number of times each feature is selected over 100 permutations.
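A selection frequency of this kind can be tallied as follows. This is a minimal sketch, not the authors' procedure: the excerpt does not say how the 100 permutations are generated, so the per-run selections below are hypothetical placeholders, as are the feature names and the helper `selection_frequency`.

```python
from collections import Counter


def selection_frequency(selected_per_run):
    """Count in how many runs each feature was selected.

    `selected_per_run` holds one collection of selected feature names
    per run; a feature listed twice within a run is counted once.
    """
    counts = Counter()
    for selected in selected_per_run:
        counts.update(set(selected))
    return counts


# Hypothetical output of five runs of some feature selector.
runs = [
    ["x1", "x2", "x3"],
    ["x1", "x2"],
    ["x1", "x3", "x3"],
    ["x1", "x2", "x3"],
    ["x2", "x1"],
]
freq = selection_frequency(runs)

for feature in sorted(freq):
    print(f"{feature}: selected in {freq[feature]} of {len(runs)} runs")
```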
Figure 1. Average prediction error with different sample sizes (n = 10, 20, 30, 50) and different methods: left, KNNLog; middle, KNN; right, RF.
Notes: The mean predictive errors are 0.046, 0.41, and 0.104, respectively, for n = 10; 0.04, 0.32, and 0.063 for n = 20; 0.042, 0.25, and 0.06 for n = 30; and 0.0345, 0.197, and 0.0371 for n = 50.
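To compare the three methods across sample sizes, the reported means can be expressed as ratios relative to KNNLog. The sketch below uses only the numbers quoted in the Figure 1 notes; the dictionary layout and the helper `error_ratios` are illustrative choices, not part of the paper.

```python
# Mean predictive errors as reported in the Figure 1 notes,
# keyed by per-class sample size n.
mean_errors = {
    10: {"KNNLog": 0.046, "KNN": 0.41, "RF": 0.104},
    20: {"KNNLog": 0.04, "KNN": 0.32, "RF": 0.063},
    30: {"KNNLog": 0.042, "KNN": 0.25, "RF": 0.06},
    50: {"KNNLog": 0.0345, "KNN": 0.197, "RF": 0.0371},
}


def error_ratios(errors, baseline="KNNLog"):
    """Ratio of each method's mean error to the baseline's, per sample size."""
    return {
        n: {m: e / by_method[baseline] for m, e in by_method.items() if m != baseline}
        for n, by_method in errors.items()
    }


ratios = error_ratios(mean_errors)
for n, r in ratios.items():
    print(f"n = {n}: KNN/KNNLog = {r['KNN']:.2f}, RF/KNNLog = {r['RF']:.2f}")
```

On these numbers KNNLog has the lowest mean error at every sample size, and the RF-to-KNNLog ratio shrinks from roughly 2.3 at n = 10 to roughly 1.1 at n = 50.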
Frequencies of correctly identified features with different σ2/σ1 ratios.
| σ2/σ1 | 4 | 6 | 8 | 10 |
|---|---|---|---|---|
| | 84 | 96 | 98 | 98 |
| | 78 | 98 | 100 | 100 |
| | 84 | 96 | 98 | 98 |
| | 78 | 98 | 100 | 100 |
Note: The frequency numbers represent the number of times each relevant feature is selected over 100 permutations.
32 selected leukemia-associated microRNAs and their relevance counts.
| No. | microRNA | Relev. count | No. | microRNA | Relev. count |
|---|---|---|---|---|---|
| 1 | hsa-mir-125b-1 | 93 | 17 | hsa-mir-514-1 | 100 |
| 2 | hsa-mir-142 | 99 | 18 | hsa-mir-514-2&3 | 100 |
| 3 | hsa-mir-150 | 97 | 19 | hsa-mir-515-15p | 100 |
| 4 | hsa-mir-153-1 | 100 | 20 | hsa-mir-515-25p | 100 |
| 5 | hsa-mir-153-2 | 100 | 21 | hsa-mir-517a | 100 |
| 6 | hsa-mir-154 | 100 | 22 | hsa-mir-518a-1 | 100 |
| 7 | hsa-mir-155 | 100 | 23 | hsa-mir-518b | 100 |
| 8 | hsa-mir-181a | 100 | 24 | hsa-mir-518c | 100 |
| 9 | hsa-mir-181b | 100 | 25 | hsa-mir-518e | 100 |
| 10 | hsa-mir-20b | 100 | 26 | hsa-mir-518e/526c | 100 |
| 11 | hsa-mir-213 | 100 | 27 | hsa-mir-520a | 100 |
| 12 | hsa-mir-216 | 83 | 28 | hsa-mir-520a* | 100 |
| 13 | hsa-mir-302c | 100 | 29 | hsa-mir-520c/526a | 100 |
| 14 | hsa-mir-367 | 88 | 30 | hsa-mir-520d | 100 |
| 15 | hsa-mir-368 | 94 | 31 | hsa-mir-526a-1 | 100 |
| 16 | hsa-mir-373 | 100 | 32 | hsa-mir-526b | 100 |
| Average predictive error | 0.0079 ± 0.003 |
Note: The count number indicates how many times a microRNA is selected over 100 permutations.
Figure 2. Normalized log-gene expressions for the 32 identified microRNAs in three different classes: left, normal; middle, AML; right, CLL.
Predictive performance of the test data for each location.
| True class (rows) / predicted class (columns) | EAC | Gut | Hair | Nostril | OC | Skin |
|---|---|---|---|---|---|---|
| EAC | 10 | 0 | 0 | 0 | 0 | 4 |
| Gut | 0 | 15 | 0 | 0 | 0 | 0 |
| Hair | 0 | 0 | 1 | 0 | 0 | 3 |
| Nostril | 0 | 0 | 0 | 11 | 0 | 4 |
| OC | 0 | 0 | 0 | 0 | 15 | 0 |
| Skin | 0 | 0 | 0 | 1 | 0 | 118 |
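Overall accuracy and per-class recall follow directly from this confusion matrix. The sketch below assumes that rows are the true classes, listed in the same order as the predicted-class columns (EAC, Gut, Hair, Nostril, OC, Skin); that row order is inferred from the table's diagonal structure rather than stated in the record. The helper names are illustrative.

```python
# Assumed ordering: rows = true classes, columns = predicted classes.
classes = ["EAC", "Gut", "Hair", "Nostril", "OC", "Skin"]
confusion = [
    [10, 0, 0, 0, 0, 4],
    [0, 15, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 3],
    [0, 0, 0, 11, 0, 4],
    [0, 0, 0, 0, 15, 0],
    [0, 0, 0, 1, 0, 118],
]


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix, names):
    """Correct predictions for each true class divided by its row total."""
    return {name: row[i] / sum(row) for i, (name, row) in enumerate(zip(names, matrix))}


accuracy = overall_accuracy(confusion)
recall = per_class_recall(confusion, classes)

print(f"overall accuracy: {accuracy:.3f}")
for name in classes:
    print(f"{name}: recall {recall[name]:.3f}")
```

Under that assumption, 170 of the 182 test samples lie on the diagonal (about 93.4%). Skin accounts for 119 of the 182 samples, and 11 of the 12 misclassified samples are predicted as Skin, so per-class recall is more informative here than overall accuracy alone.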
Class-associated OTUs identified with KNNLog.
| Bacteria;Actinobacteria;Actinomycetales;Propionibacteriaceae;Propionibacterium (100) |
| Bacteria;Cyanobacteria;Cyanobacteria_incertae sedis;Chloroplast;Streptophyta (100) |
| Bacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Turicella (100) |
| Bacteria;Proteobacteria;Betaproteobacteria;Neisseriales;Neisseriaceae;Neisseria (100) |
| Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides (100) |
| Bacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium (100) |
| Bacteria;Gammaproteobacteria;Pasteurellales;Pasteurellaceae;Haemophilus (100) |
| Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella (100) |
| Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides (100) |
| Bacteria;Firmicutes;Clostridia;Clostridiales;Incertae-Sedis-XI;Peptoniphilus (72) |
| Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium (89) |