
Efficient feature selection and multiclass classification with integrated instance and model based learning.

Zhenqiu Liu¹, Halima Bensmail, Ming Tan.

Abstract

Multiclass classification and feature (variable) selection are commonly encountered in many biological and medical applications. However, extending binary classification approaches to multiclass problems is not trivial. Instance-based methods such as the K nearest neighbor (KNN) extend naturally to multiclass problems and usually perform well with unbalanced data, but they suffer from the curse of dimensionality: their performance degrades when they are applied to high dimensional data. On the other hand, model-based methods such as logistic regression require the decomposition of the multiclass problem into several binary problems with one-vs.-one or one-vs.-rest schemes. Even though they can be applied to high dimensional data with L(1) or L(p) penalized methods, such approaches can only select independent features, and the features selected for different binary problems are usually different. They also produce unbalanced classification problems under the one-vs.-rest scheme even if the original multiclass problem is balanced. By combining instance-based and model-based learning, we propose an efficient learning method with integrated KNN and constrained logistic regression (KNNLog) for simultaneous multiclass classification and feature selection. Our proposed method simultaneously minimizes the intra-class distance and maximizes the interclass distance with fewer estimated parameters. It is very efficient for problems with small sample size and unbalanced classes, a case common in many real applications. In addition, our model-based feature selection method can identify highly correlated features simultaneously while avoiding the multiplicity problem due to multiple tests. The proposed method is evaluated with simulation and real data, including one unbalanced microRNA dataset for leukemia and one multiclass metagenomic dataset from the Human Microbiome Project (HMP). It performs well with limited computational experiments.


Keywords: feature selection; high-dimensional data; multiclass classification; statistical learning

Year: 2012 PMID: 22577297 PMCID: PMC3347893 DOI: 10.4137/EBO.S9407

Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625

Introduction

Multi-class classification and feature selection are commonly encountered in many biological and medical applications, especially in genomic and metagenomic studies. Such data are usually high-dimensional, with small sample sizes and unbalanced classes, and features (genes) may be highly correlated. It is not trivial to detect disease associated genes and evaluate their predictive power under the multi-class classification framework. Machine learning for multiclass (and more general multilabel) classification has received increasing attention in many areas [1–4]. In the current literature, machine learning methods roughly fall into two categories: instance-based and model-based learning. Instance-based learning (IBL) such as the k-nearest neighbor (KNN) [5] predicts the class of a sample with unknown class by considering the classes of its k nearest neighbors. It is more robust for data with unbalanced classes and is efficient for multiclass classification with a small number of features. However, its predictive accuracy is seriously degraded when there is a large number of irrelevant features because of the curse of dimensionality. On the other hand, model-based learning methods such as support vector machine (SVM) and logistic regression are mainly designed for binary classification. They are designed to separate two different classes as far as possible without considering the intra-class distances. Multiclass problems are often dealt with by combining binary classifier outputs, such as one class against the other (one vs. one) or one class against the rest (one vs. rest). However, this may lead to over-fitting and poor predictive accuracy, especially when the sample size is small, since we need to estimate either c(c − 1)n/2 or (c − 1)n parameters for problems with c classes and n features. It also produces unbalanced classification problems under the one-vs.-rest rule even if the original multiclass problem is balanced.
Instance-based learning only takes into account the minimal distance, while model-based learning incorporates maximizing the interclass distances (eg, maximizing the margin in SVM). It is natural to integrate the instance-based and model-based methods and maximize the interclass distances while minimizing the intraclass distances. While there are some efforts in this direction [6], they only consider the labels of neighborhood instances as additional features for logistic regression. They do not fully take advantage of the robustness of instance-based learning for unbalanced classes and continue to have the same drawbacks of estimating too many parameters and creating unbalanced classes in multi-class classification, even if the original problem is balanced. A fundamental aspect of feature (variable) selection for high dimensional data is to derive interpretable results. Earlier approaches for feature selection [7–9] were based on filtering to select a subset of features, independent of the statistical learning methods. However, filtering methods examine each feature in isolation and ignore the possibility that groups of features may have a combined effect that does not necessarily follow from the individual performance of the features in the group [10]. In addition, they result in multiplicity problems due to multiple comparisons. The more recent L1 and Lp based penalized statistical learning approaches perform variable selection as part of the statistical learning procedure [11–16]. However, they are mainly designed for binary classification and can only select independent features, whereas highly correlated features may function together, and it is very important to select highly correlated genes in biological research. There are two difficulties when dealing with multiclass problems with high dimensional data: small sample size and unbalanced classes.
In this paper, we propose a novel approach that integrates instance-based and model-based learning to overcome both difficulties encountered in multiclass classification with high dimensional data. Our proposed approach combines the k-nearest neighbor (KNN) and a model-based binary classifier and simultaneously maximizes the interclass distance and minimizes the intraclass distance. It is robust for unbalanced classification and can classify multiple classes simultaneously without creating unbalanced classes. It also estimates fewer parameters (only as many as the number of features) and can simultaneously select features and predict multiple classes with simple parameter regulation. Moreover, the proposed method can identify highly correlated features for multiclass classification and overcome both the problem of multiplicity with statistical tests and the problem of failing to identify correlated features with L1 and Lp penalized statistical learning methods. We evaluate the performance of our proposed method through simulation and publicly available microRNA expression and metagenomic data sets. The proposed method is robust across data sets and efficient for feature identification and phenotype prediction.

Methods

A general multiclass classification problem may be described as follows. Given n samples with normalized features, $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a feature vector of dimension m and $y_i \in C = \{c_1, \ldots, c_g\}$ is one of g class labels, find a classifier f(x) such that, for any normalized feature vector x with class label y, f(x) predicts class y. Given two samples $x_i$ and $x_j$, we introduce a general weighted distance function for KNN learning as follows:

$$D(w, x_i, x_j, p) = \sum_{k=1}^{m} w_k \, |x_{ik} - x_{jk}|^p \qquad (1)$$

where |.| denotes the absolute value, $w_k \ge 0$ for k = 1, ..., m are the nonnegative weights, and p is a positive free parameter. In particular, when p = 1 and p = 2, $D(w, x_i, x_j, 1)$ and $D(w, x_i, x_j, 2)$ represent the weighted city-block and Euclidean distances between $x_i$ and $x_j$, respectively. Given a new sample x, we calculate the k nearest neighbors of x in each class c, denoted by $N_k(x, c)$, and then take the average distance to them as the distance of x to class c:

$$\bar{D}(x, c) = \frac{1}{k} \sum_{x_j \in N_k(x, c)} D(w, x, x_j, p) \qquad (2)$$

Finally, we assign x to a class by means of a minimal distance vote:

$$f(x) = \arg\min_{c \in C} \bar{D}(x, c) \qquad (3)$$
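The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the weighted distance has the form sum_k w_k |x_ik − x_jk|^p (no 1/p root), and the function names and toy data are hypothetical.

```python
import numpy as np

def weighted_distance(w, xi, xj, p=1):
    """Weighted distance sum_k w_k * |xi_k - xj_k|**p (assumed form, no 1/p root)."""
    return float(np.sum(w * np.abs(xi - xj) ** p))

def class_distance(w, x, X_train, y_train, c, k, p=1):
    """Average distance from x to its k nearest training samples of class c."""
    Xc = X_train[y_train == c]
    d = np.sum(w * np.abs(Xc - x) ** p, axis=1)
    k_eff = min(k, len(d))  # guard for classes with fewer than k samples
    return float(np.mean(np.sort(d)[:k_eff]))

def predict(w, x, X_train, y_train, k, p=1):
    """Assign x to the class with the smallest average k-NN distance."""
    classes = np.unique(y_train)
    dists = [class_distance(w, x, X_train, y_train, c, k, p) for c in classes]
    return classes[int(np.argmin(dists))]

# Toy data (hypothetical): feature 0 carries the class signal, feature 1 is noise.
X_train = np.array([[0.0, 5.0], [0.2, -3.0],
                    [1.0, 4.0], [1.1, -4.0],
                    [2.0, 0.0], [2.1, 6.0]])
y_train = np.array([0, 0, 1, 1, 2, 2])
x_new = np.array([1.05, 9.0])

w_sparse = np.array([1.0, 0.0])   # noise feature switched off
w_uniform = np.array([1.0, 1.0])  # plain unweighted KNN distance

print(predict(w_sparse, x_new, X_train, y_train, k=2, p=2))
print(predict(w_uniform, x_new, X_train, y_train, k=2, p=2))
```

With the noise feature weighted to zero, the new sample is assigned to the class whose signal feature it matches; with uniform weights the noisy feature dominates the distance and the vote changes, which is the motivation for learning a sparse w.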

Loglikelihood Based Approach for Weight Estimation (KNNLog)

Now, the goal is to choose an optimal w that yields small intraclass distances and large interclass distances simultaneously and automatically identifies features relevant to multiple classes. We propose an integrated KNN and constrained logistic regression (KNNLog) approach for sparse parametric estimation, which forces the weights of irrelevant features to zero. The problem can be formulated as a constrained linear program (LP), Equation (4), in which $|x_i - x_j|^{.p} = [|x_{i1} - x_{j1}|^p, \ldots, |x_{im} - x_{jm}|^p]$ is an element-wise operation, and λ, k, and p are determined through cross validation. In Equation (4), the first constraint represents the k-nearest neighbor intraclass distances, and we restrict them to a soft upper bound of 1. The second constraint represents the interclass distances, with a soft lower bound of 2. Hence, we can enforce a soft margin of 1 between the intra-class and inter-class distances. The solution of Equation (4) therefore favors a small KNN intra-class distance and a large interclass distance simultaneously. Finally, we use the k nearest neighbors instead of all the samples in the same class for the first constraint because samples in one class may have multimodal distributions; it is too stringent and unrealistic to require that all samples in one class have small distances. While we can solve Equation (4) with LP software such as linprog in MATLAB and lp_solve in C (http://lpsolve.sourceforge.net/5.5/), there are limitations with the LP approach. It does not scale well in either time or memory for problems with a large number of samples and variables. The number of constraints grows as O(n^2) in the number of samples n. Even though efficient algorithms exist, handling a large number of constraints is still challenging. We therefore propose an efficient log-likelihood based approach for weight estimation.
Since we would like to minimize the intra-class distances and maximize the inter-class distances, we first define an augmented distribution for the intra-class and inter-class distances with the truncated logit function of logistic regression. Letting h = 1 be the class of the intra-class distances $D(w, x_i, x_j, p)$ (pairs from the same class) and h = 0 the class of the inter-class distances $D(w, x_i, x_j, p)$ (pairs from different classes), we define the class probabilities as functions of the distance, where $D(w, x_i, x_i, p) = 0$. So we have $P(h = 1 \mid D(w, x_i, x_j, p)) = 0.5$ and $1$ when $D(w, x_i, x_j, p) = 2$ and $D(w, x_i, x_j, p) \to \infty$, respectively. Taking the negative log-likelihood of the intra-class and inter-class distances and dropping the constant gives the error function E of Equation (8), a much simpler negative log-likelihood with nonnegativity constraints. It can be solved efficiently, even if the problem has both a large sample size and a high dimension. With the intra-class and inter-class distances collected in two distance matrices, the first-order derivative of Equation (8) is given by Equation (10). Based on Equation (10) and $w_k \ge 0$ for all k = 1, ..., m, we implement a standard conjugate gradient method [17] with nonnegativity constraints. Because minimizing E is a convex optimization problem with a convex constraint set, a global optimal solution is guaranteed theoretically. The global minimum of E is reached if, for each element $\hat{w}_k$, either (i) $\hat{w}_k > 0$ and $(\partial E/\partial w_k)|_{\hat{w}} = 0$, or (ii) $\hat{w}_k = 0$ and $(\partial E/\partial w_k)|_{\hat{w}} \ge 0$. The first condition applies to the positive elements of ŵ, whose corresponding terms in the gradient must vanish, and the second condition applies to the zero elements of ŵ. Here, the corresponding terms of the gradient must be nonnegative, thus pinning $w_k$ to the boundary of the feasible region. Upon reaching the optimal solution, a sparse ŵ with a small number of nonzero parameters can be found. The important features are those with nonzero $\hat{w}_k$. Since w ≥ 0, the sparsity of the model is determined by both k and λ: the larger k and λ, the fewer the nonzero $w_k$.
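The optimality conditions (i) and (ii) above can be illustrated with a small numerical sketch. Two things here are assumptions rather than the paper's method: the loss is a stand-in convex function (a softplus penalty pushing intra-class distances below 1 and inter-class distances above 2, plus a λ-weighted sum of the weights), not the paper's Equation (8), and the solver is plain projected gradient descent rather than the conjugate gradient method cited in the text. All names and the toy matrices are hypothetical.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grad(w, A, B, lam):
    """Stand-in convex loss (assumption, not the paper's Equation (8)).

    A: rows |x_i - x_j|^p for intra-class neighbor pairs (want A @ w below 1).
    B: rows |x_i - x_l|^p for inter-class pairs (want B @ w above 2).
    """
    za = A @ w - 1.0
    zb = 2.0 - B @ w
    E = softplus(za).sum() + softplus(zb).sum() + lam * w.sum()
    g = A.T @ sigmoid(za) - B.T @ sigmoid(zb) + lam
    return E, g

def projected_gradient(A, B, lam, n_iter=2000):
    """Minimize the stand-in loss subject to w >= 0 by projected gradient steps."""
    M = np.vstack([A, B])
    L = 0.25 * np.linalg.norm(M, 2) ** 2  # upper bound on the gradient's Lipschitz constant
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        _, g = loss_and_grad(w, A, B, lam)
        w = np.maximum(w - g / L, 0.0)    # gradient step, then projection onto w >= 0
    return w

def kkt_violation(w, g):
    """Largest violation of: g_k = 0 where w_k > 0, and g_k >= 0 where w_k = 0."""
    pos = w > 0
    v_pos = np.abs(g[pos]).max() if pos.any() else 0.0
    v_zero = np.maximum(-g[~pos], 0.0).max() if (~pos).any() else 0.0
    return max(v_pos, v_zero)

# Toy pair-difference matrices (hypothetical): feature 0 separates the classes.
A = np.array([[0.1, 1.0, 0.5], [0.2, 0.3, 1.2], [0.1, 0.8, 0.2]])
B = np.array([[3.0, 0.9, 0.4], [2.5, 0.2, 1.0], [3.5, 0.7, 0.6]])
lam = 0.1

w_hat = projected_gradient(A, B, lam)
E_hat, g_hat = loss_and_grad(w_hat, A, B, lam)
print("weights:", w_hat)
print("KKT violation:", kkt_violation(w_hat, g_hat))
```

A KKT violation near zero indicates that the positive weights sit at stationary points and the zero weights are held at the boundary by a nonnegative gradient, which is exactly the pair of conditions stated in the text.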
The free parameters λ, k, and p are determined by the leave-one-out jackknife test, choosing the combination with the smallest prediction error. For simplicity, we choose only p = 1 or 2 in all computational experiments, although other choices of p may improve the predictive power of our method. Different p values may be selected in individual computations.

Computational Results

Simulation data

The purpose of our first simulation is to show that the proposed method can predict the class labels with high accuracy and identify the class associated features correctly even if there is a high correlation among the features. The simulated dataset is randomly generated with input dimension m = 1000, and only the first 10 features are relevant to the classes. All other features are random noise generated from N(0, 1). We first generate the input data of 5 classes with sample sizes of 10, 20, 30, and 50 for each class from 5-dimensional multivariate normal distributions with different means and a variance-covariance matrix. The mean of each dimension for each class is randomly chosen as an integer between 1 and 5 inclusive, and the mean of each dimension is different for different classes. In addition, a pairwise correlation among the features (ρ = 0.5) is used to assess the performance of the proposed method. We then duplicate the first 5 features at dimensions 6–10, so that the input features in dimensions 1–5 and 6–10 are exactly the same. The aim is to demonstrate that KNNLog can identify the first 10 class-relevant features correctly even if some of them are highly correlated (exactly the same). We analyze this simulation data with the proposed approach and show that our method can identify features 1–5 and 6–10 simultaneously. The free parameters k, p, and λ are determined through the leave-one-out jackknife test with the training data only. We repeat the experiment 100 times for each sample size and report the number of times each feature is correctly identified in Table 1. Table 1 indicates that, with a sample size of n = 10 for each class, KNNLog selected each of the 10 relevant features in at least 76 of the 100 simulations and selected 6 of the 10 features in all 100 simulations. As the sample size n increases, the accuracy for selecting the true features also increases.
With a sample size of n = 50, KNNLog selected each of the 10 features in at least 93 of the 100 simulations and 6 of the 10 features in all 100 simulations. In addition, KNNLog selected features 1 and 6, 2 and 7, 3 and 8, 4 and 9, and 5 and 10 simultaneously with the same accuracy, even though they are exactly the same. Therefore, KNNLog can identify highly correlated features simultaneously without encountering the multiplicity problem of statistical tests. Moreover, the average number of selected features approaches the true number, 10, as the sample size increases, as shown at the bottom of Table 1. The prediction errors with KNNLog are 0.046, 0.04, 0.041, and 0.034 with sample sizes of 10, 20, 30, and 50 respectively, compared to the much larger prediction errors (0.41, 0.32, 0.25, and 0.20) using KNN without feature selection, as shown in Figure 1. We also compare the performance of KNNLog with random forests (RF). RF is a classification algorithm that uses an ensemble of unpruned decision trees, each of which is built on a bootstrap sample of the training data using a randomly selected subset of variables [18]. Figure 1 shows that KNNLog has similar test errors across the different sample sizes. It also performs better than RF, which has prediction errors of 0.104, 0.063, 0.06, and 0.037 respectively, especially when the sample size is small. Finally, unlike KNNLog, which can identify highly correlated features, RF can only select independent features; the average numbers of features selected with RF are 3.8, 4.2, 4.5, and 4.8 respectively.
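The design of this first simulation can be sketched as follows. This is an illustrative generator only: the variance-covariance matrix is not given in the text, so unit variances with a pairwise correlation of ρ = 0.5 are assumed, and the function name, seed, and return values are hypothetical.

```python
import numpy as np

def simulate_correlated(n_per_class, n_classes=5, n_signal=5,
                        n_total=1000, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)

    # Assumed covariance: unit variances, pairwise correlation rho.
    cov = np.full((n_signal, n_signal), rho)
    np.fill_diagonal(cov, 1.0)

    # For each dimension, the classes receive distinct integer means from 1..5.
    means = np.column_stack(
        [rng.permutation(np.arange(1, 6))[:n_classes] for _ in range(n_signal)]
    )  # shape: (n_classes, n_signal)

    parts, labels = [], []
    for c in range(n_classes):
        parts.append(rng.multivariate_normal(means[c], cov, size=n_per_class))
        labels.append(np.full(n_per_class, c))
    signal = np.vstack(parts)
    y = np.concatenate(labels)

    # Features 6-10 duplicate features 1-5; the remaining features are N(0, 1) noise.
    noise = rng.standard_normal((signal.shape[0], n_total - 2 * n_signal))
    X = np.hstack([signal, signal, noise])
    return X, y, means

X, y, means = simulate_correlated(n_per_class=10)
print(X.shape, np.bincount(y))
```

Because columns 6–10 are exact copies of columns 1–5, any method that keeps only one feature from a correlated group will recover at most 5 of the 10 relevant features on data of this kind.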

Table 1

Frequencies of correctly identified features with different sample sizes.

Sample size per class (parameters λ*, k*, p*)	10 (300, 9, 2)	20 (350, 19, 2)	30 (450, 28, 1)	50 (460, 45, 1)
w₁	90	93	94	96
w₂	100	100	100	100
w₃	100	100	100	100
w₄	100	100	100	100
w₅	76	88	91	93
w₆	90	93	94	96
w₇	100	100	100	100
w₈	100	100	100	100
w₉	100	100	100	100
w₁₀	76	88	91	93
Average no. of features selected	9.32	10.87	9.7	9.96

Note: The frequency number indicates the number of times each feature is selected over 100 permutations.

Figure 1

Average prediction error.

Notes: Errors are shown for different sample sizes (n = 10, 20, 30, 50) and different methods: left, KNNLog; middle, KNN; right, RF. The mean predictive errors are 0.046, 0.41, and 0.104 respectively for n = 10; 0.04, 0.32, and 0.063 respectively for n = 20; 0.042, 0.25, and 0.06 respectively for n = 30; 0.0345, 0.197, and 0.0371 respectively for n = 50.

When two classes have different distributions but the same or similar means, statistical tests based on summary statistics (eg, the t-test) fail to detect the differences and identify important features. KNNLog, based on location parameter, can still be used to select important features and achieve good predictive accuracy. We simulate two classes with a sample size of 100 for each class from a 2-dimensional normal distribution with the same mean m1 = m2 = [1, 2], a standard deviation σ1 for class 1, and a different standard deviation σ2 for class 2, with the ratio σ2/σ1 = 4, 6, 8, and 10, respectively. We then duplicate the generated data at dimensions 3–4, so that the data in dimensions 3–4 are exactly the same as those in dimensions 1–2. The total input dimension of the simulated data is 1000, with the remaining 996 features for both classes generated from N(0, 1). In this setting, the standard t-test fails to identify any features, but KNNLog identifies features 1–4 efficiently, as shown in Table 2. The free parameters (k, p, λ) = (56, 2, 1) are determined through cross-validation with the training data only. We repeat the experiment 100 times for each σ2/σ1 ratio, and the number of times each of features 1–4 is correctly identified is reported in the upper part of Table 2. KNNLog correctly identifies features 1–4 in 78% or more of the simulations with σ2/σ1 = 4, 96% or more with σ2/σ1 = 6, and 98% or more with σ2/σ1 = 8 or 10. The average number of identified features approaches the true number of features (4), and the test areas under the ROC curve (AUCs) become larger as the ratio σ2/σ1 increases, as shown at the bottom of Table 2. Therefore, KNNLog, based on the pairwise distances of individual samples, is more powerful than typical statistical tests.
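A short sketch of this second design shows why a mean-based test has little to work with. The value of σ1 is not given in the text, so σ1 = 1 is assumed here, the two signal dimensions are drawn independently, and the function names and seed are hypothetical. The Welch t statistic is computed by hand to keep the example free of extra dependencies.

```python
import numpy as np

def simulate_equal_means(ratio, n_per_class=100, n_total=1000,
                         sigma1=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.array([1.0, 2.0])                       # same mean for both classes
    x1 = rng.normal(mean, sigma1, size=(n_per_class, 2))
    x2 = rng.normal(mean, ratio * sigma1, size=(n_per_class, 2))
    signal = np.vstack([x1, x2])
    noise = rng.standard_normal((2 * n_per_class, n_total - 4))
    X = np.hstack([signal, signal, noise])            # dims 3-4 duplicate dims 1-2
    y = np.repeat([0, 1], n_per_class)
    return X, y

def welch_t(a, b):
    """Welch's two-sample t statistic (compares means only)."""
    va, vb = a.var(ddof=1), b.var(ddof=1)
    return (a.mean() - b.mean()) / np.sqrt(va / len(a) + vb / len(b))

X, y = simulate_equal_means(ratio=4)
f0_class1, f0_class2 = X[y == 0, 0], X[y == 1, 0]
std_ratio = f0_class2.std(ddof=1) / f0_class1.std(ddof=1)
print("Welch t for feature 1:", welch_t(f0_class1, f0_class2))
print("sample std ratio for feature 1:", std_ratio)
```

The class difference lives entirely in the spread, so the t statistic has an expected value of zero while the spread ratio is large; a distance-based criterion can pick up the latter.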

Table 2

Frequencies of correctly identified features with different σ2/σ1 ratios.

σ₂/σ₁	4	6	8	10
w₁	84	96	98	98
w₂	78	98	100	100
w₃	84	96	98	98
w₄	78	98	100	100
No. of features	3.72 (± 1.37)	3.98 (± 0.58)	3.96 (± 0.28)	3.96 (± 0.28)
Test AUC	0.65 (± 0.052)	0.67 (± 0.038)	0.692 (± 0.03)	0.97 (± 0.024)

Note: The frequency numbers represent the number of times each relevant feature is selected over 100 permutations.

microRNA Expression Profiling for Leukemia

A microRNA is a short ribonucleic acid (RNA) molecule found in eukaryotic cells. It has very few nucleotides (an average of 22) compared with other RNAs (http://en.wikipedia.org/wiki/MicroRNA). Variations in microRNA expression may be associated with different complex diseases, including cancer. The microRNA expression data analyzed in this example are from the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo) under the respective accession numbers E-TABM-969 for normal tissues, E-TABM-972 for acute myeloid leukemia (AML), and E-TABM-973 for chronic lymphocytic leukemia (CLL) [19–21]. There are 506 samples in total, with 255 normal tissues, 141 AMLs, and 110 CLLs, and 390 candidate human microRNAs. We preprocess the data with log2 transformation and quantile normalization, and then evaluate the performance of the proposed approach with 2-fold cross-validation. We divide the data into two subsets of roughly equal size, one for training and one for testing, build a model with the training data, and evaluate the performance with the test data. The free parameters λ, p, and k are estimated using the training data only with the leave-one-out jackknife test. To prevent bias arising from a specific partition, we partition the data 100 times through permutation. The relevance count of a microRNA is the number of times it is selected in our model. The optimal free parameters are (λ*, k*, p*) = (20, 15, 2). The 32 selected microRNAs are reported in Table 3. The predictive error is 0.0079 ± 0.003, so KNNLog distinguishes normal, AML, and CLL samples with over 99% accuracy using only 32 microRNAs. The log expression levels of each microRNA under the different clinical conditions are plotted in Figure 2. Most of the 32 identified microRNA signatures are known to be associated with leukemia.
For instance, the microRNA miR-125b-1 we identified is known to cause leukemia [22]. MicroRNA-125b-1 is involved in several chromosomal translocations, such as t(2;11)(p21;q23) and t(11;14)(q24;q32), which lead to myelodysplasia and acute myeloid leukemia (AML) or B-cell acute lymphoid leukemia (B-ALL), respectively. Because miR-125b-1 negatively regulates many proteins in the p53 pathway, the deregulation of miR-125b expression would impair human and mouse hematopoiesis. Figure 2 indicates that microRNA-125b-1 is overexpressed in both AML and CLL. In addition, several microRNAs are involved in the differentiation process of various hematopoietic lineages. Indeed, miR-150 controls early B-lymphocyte differentiation, and both miR-181a and miR-181b are crucial modulators of T lymphocyte differentiation and are linked to both AML and CLL. miR-181b targets the Mcl-1 protein, and its expression is inversely correlated with the protein levels of the MCL1 and BCL2 target genes. Therefore miR-181b expression values can be used to specify disease progression in chronic lymphocytic leukemia [23]. In addition, since microRNAs control the regulation of fundamental processes, their dysregulation has been clearly linked to cancer and particularly to leukemia. For instance, over-expression of miR-155 has been found in many human leukemias and lymphomas, and mice transplanted with bone-marrow cells ectopically expressing miR-155 may develop a myeloproliferative disorder. Finally, the identified microRNAs also provide important targets for biomedical researchers to pursue in further studies. As an example, hsa-mir-216 (no. 12 in Table 3) and hsa-mir-518c (no. 24 in Table 3) are over-expressed only in AML patients, as shown in Figure 2. These microRNAs need further study to verify whether they have important biological and clinical implications.
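The preprocessing step mentioned at the start of this subsection (log2 transformation followed by quantile normalization) can be sketched as below. This is a generic illustration, not the authors' pipeline: the pseudocount of 1, the layout (microRNAs as rows, samples as columns), and the simple tie handling (ties are ranked by sort order rather than averaged) are all assumptions.

```python
import numpy as np

def log2_transform(X, pseudocount=1.0):
    """log2(X + pseudocount); the pseudocount is an assumption to avoid log2(0)."""
    return np.log2(X + pseudocount)

def quantile_normalize(X):
    """Give every column (sample) the same empirical distribution.

    X has features (microRNAs) as rows and samples as columns.
    """
    order = np.argsort(X, axis=0)                    # per-sample rank order
    sorted_X = np.take_along_axis(X, order, axis=0)
    reference = sorted_X.mean(axis=1)                # mean of the r-th smallest values
    out = np.empty_like(X, dtype=float)
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out

# Toy matrix (hypothetical): 3 microRNAs x 2 samples on very different scales.
raw = np.array([[1.0, 10.0],
                [3.0, 30.0],
                [2.0, 20.0]])
normalized = quantile_normalize(raw)
print(normalized)
```

After normalization both samples share the same set of values, so differences in overall scale between arrays no longer dominate the pairwise distances used by KNN.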

Table 3

32 selected leukemia associated microRNAs and their relevance counts.

No.	microRNA	Relev. count	No.	microRNA	Relev. count
1	hsa-mir-125b-1	93	17	hsa-mir-514-1	100
2	hsa-mir-142	99	18	hsa-mir-514-2&3	100
3	hsa-mir-150	97	19	hsa-mir-515-15p	100
4	hsa-mir-153-1	100	20	hsa-mir-515-25p	100
5	hsa-mir-153-2	100	21	hsa-mir-517a	100
6	hsa-mir-154	100	22	hsa-mir-518a-1	100
7	hsa-mir-155	100	23	hsa-mir-518b	100
8	hsa-mir-181a	100	24	hsa-mir-518c	100
9	hsa-mir-181b	100	25	hsa-mir-518e	100
10	hsa-mir-20b	100	26	hsa-mir-518e/526c	100
11	hsa-mir-213	100	27	hsa-mir-520a	100
12	hsa-mir-216	83	28	hsa-mir-520a*	100
13	hsa-mir-302c	100	29	hsa-mir-520c/526a	100
14	hsa-mir-367	88	30	hsa-mir-520d	100
15	hsa-mir-368	94	31	hsa-mir-526a-1	100
16	hsa-mir-373	100	32	hsa-mir-526b	100
	Average predictive error			0.0079 ± 0.003

Note: The count number indicates how many times a microRNA is selected over 100 permutations.

Figure 2

Normalized log-gene expressions for the 32 identified microRNAs in three different classes: left—normal, middle—AML, and right—CLL.

Human metagenomic count data

KNNLog was applied to a 16S rRNA metagenomic dataset from 6 human body habitats [25]: external auditory canal (EAC), gut, hair, nostril, oral cavity (OC), and skin. This benchmark dataset excludes samples from communities that were transplanted from another subject or body site. Similar to Costello et al [25], it has 552 remaining samples. OTU count data are generated using the Mothur package [24] (pubmed: 19801464) with the standard processing pipeline at a sequence similarity threshold of 97%. This is a highly unbalanced dataset dominated by one class (skin), which could create challenges for classification. We normalize the count data with proportion and arcsin transformation [26], and then detect the body-site associated features and estimate the predictive power with KNNLog. The data are split into training (2/3 of samples) and test (1/3 of samples) sets. We estimate the parameters λ, p, and k using the leave-one-out jackknife test with the training data only. To prevent bias from a specific partition, we repeat the partition 100 times; the relevance count is the number of times an OTU is selected in the 100 permutations. The parameters with the best predictive error are (λ*, k*, p*) = (50, 8, 1). The predictive performance for classification is shown in Table 4. The eleven selected OTUs with nonzero parameters are given in Table 5. The numbers in parentheses are the relevance counts, ie, the number of times an OTU is selected. Table 4 shows that OC and gut can be separated from the other classes perfectly, which is consistent with the result of Costello et al. We also achieved a predictive error of 0.07 (± 0.005) with only the 11 OTUs in Table 5, compared with the predictive error of 0.08 with 27 OTUs reported by Knights et al [27]. KNNLog performs very well even with this highly unbalanced dataset.
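The normalization of the OTU counts can be sketched as follows. The text only says "proportion and arcsin transformation"; the common arcsine square-root form is assumed here, with samples as rows and OTUs as columns. The function name and toy counts are hypothetical.

```python
import numpy as np

def arcsine_sqrt_proportions(counts):
    """Convert per-sample OTU counts to proportions, then apply arcsin(sqrt(.)).

    Assumes every sample (row) has at least one read; a zero row would divide by zero.
    """
    counts = np.asarray(counts, dtype=float)
    proportions = counts / counts.sum(axis=1, keepdims=True)
    return np.arcsin(np.sqrt(proportions))

# Toy count table (hypothetical): 2 samples x 2 OTUs.
toy_counts = np.array([[1, 3],
                       [2, 2]])
transformed = arcsine_sqrt_proportions(toy_counts)
print(transformed)
```

The proportion step removes differences in sequencing depth between samples, and the arcsine square-root step spreads out proportions near 0 and 1, which stabilizes their variance before distances are computed.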

Table 4

Predictive performance of the test data for each location.

True classes	Predicted classes

	EAC	Gut	Hair	Nostril	OC	Skin
EAC	10	0	0	0	0	4
Gut	0	15	0	0	0	0
Hair	0	0	1	0	0	3
Nostril	0	0	0	11	0	4
OC	0	0	0	0	15	0
Skin	0	0	0	1	0	118
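The overall error implied by Table 4 can be checked directly from the counts as printed. The sketch below assumes the table represents one set of test predictions, with rows as true classes and columns as predicted classes; the variable names are illustrative.

```python
import numpy as np

labels = ["EAC", "Gut", "Hair", "Nostril", "OC", "Skin"]
confusion = np.array([
    [10,  0, 0,  0,  0,   4],   # EAC
    [ 0, 15, 0,  0,  0,   0],   # Gut
    [ 0,  0, 1,  0,  0,   3],   # Hair
    [ 0,  0, 0, 11,  0,   4],   # Nostril
    [ 0,  0, 0,  0, 15,   0],   # OC
    [ 0,  0, 0,  1,  0, 118],   # Skin
])

total = int(confusion.sum())
errors = total - int(np.trace(confusion))      # off-diagonal entries are misclassifications
error_rate = errors / total
per_class_recall = np.diag(confusion) / confusion.sum(axis=1)

print(f"misclassified {errors} of {total} test samples, error = {error_rate:.3f}")
for name, recall in zip(labels, per_class_recall):
    print(f"{name}: recall = {recall:.3f}")
```

The 12 misclassified samples out of 182 give an error of about 0.066, in line with the 0.07 reported in the text. All but one of the errors are samples from the smaller classes (EAC, hair, nostril) being assigned to the dominant skin class.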

Table 5

Identified class associated OTUs with KNNLog.

Bacteria;Actinobacteria;Actinomycetales; Propionibacteriaceae;Propionibacterium(100)

Bacteria;Cyanobacteria;Cyanobacteria_incertae sedis; Chloroplast;Streptophyta(100)

Bacteria;Actinobacteria;Actinomycetales; Corynebacteriaceae;Turicella(100)

Bacteria;Proteobacteria;Betaproteobacteria; Neisseriales;Neisseriaceae;Neisseria(100)

Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales; Bacteroidaceae;Bacteroides(100)

Bacteria;Actinobacteria;Actinomycetales; Corynebacteriaceae;Corynebacterium(100)

Bacteria;Gammaproteobacteria;Pasteurellales; Pasteurellaceae;Haemophilus(100)

Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales; Prevotellaceae;Prevotella(100)

Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales; Bacteroidaceae;Bacteroides(100)

Bacteria;Firmicutes;Clostridia;Clostridiales; Incertae-Sedis-XI;Peptoniphilus(72)

Bacteria;Firmicutes;Clostridia;Clostridiales; Ruminococcaceae;Faecalibacterium(89)

Conclusions

We have proposed the KNNLog method, which combines instance-based learning (KNN) and model-based learning (logistic regression) for simultaneous feature selection and multi-class prediction. Unlike L1 and Lp (p < 1) penalized methods, which can select only independent features, KNNLog can identify highly correlated features without encountering the multiplicity problem due to multiple tests. In addition, the proposed method can identify features from data in which different classes have similar means but come from different distributions, a task at which the t-test fails. Finally, it is robust for unbalanced classification and can classify multiple classes simultaneously without creating unbalanced classes. It also estimates fewer parameters (as many as the number of features) than both the one-vs.-one and one-vs.-rest classification schemes, and is efficient for problems with a small sample size and a large number of features. While KNNLog was evaluated with only a limited number of datasets, the results show that the integration of instance-based and model-based learning methods can improve the efficiency of both feature selection and multi-class prediction.

13 in total

1. miR-181b is a biomarker of disease progression in chronic lymphocytic leukemia.

Authors: Rosa Visone; Angelo Veronese; Laura Z Rassenti; Veronica Balatti; Dennis K Pearl; Mario Acunzo; Stefano Volinia; Cristian Taccioli; Thomas J Kipps; Carlo M Croce
Journal: Blood Date: 2011-06-02 Impact factor: 22.113

2. Gene selection: a Bayesian variable selection approach.

Authors: Kyeong Eun Lee; Naijun Sha; Edward R Dougherty; Marina Vannucci; Bani K Mallick
Journal: Bioinformatics Date: 2003-01 Impact factor: 6.937

3. Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data.

Authors: Zhenqiu Liu; William Hsiao; Brandi L Cantarel; Elliott Franco Drábek; Claire Fraser-Liggett
Journal: Bioinformatics Date: 2011-10-07 Impact factor: 6.937

4. Sparse logistic regression with Lp penalty for biomarker identification.

Authors: Zhenqiu Liu; Feng Jiang; Guoliang Tian; Suna Wang; Fumiaki Sato; Stephen J Meltzer; Ming Tan
Journal: Stat Appl Genet Mol Biol Date: 2007-02-10

5. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities.

Authors: Patrick D Schloss; Sarah L Westcott; Thomas Ryabin; Justine R Hall; Martin Hartmann; Emily B Hollister; Ryan A Lesniewski; Brian B Oakley; Donovan H Parks; Courtney J Robinson; Jason W Sahl; Blaz Stres; Gerhard G Thallinger; David J Van Horn; Carolyn F Weber
Journal: Appl Environ Microbiol Date: 2009-10-02 Impact factor: 4.792

6. Sparse support vector machines with Lp penalty for biomarker identification.

Authors: Zhenqiu Liu; Shili Lin; Ming T Tan
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2010 Jan-Mar Impact factor: 3.710

7. Bacterial community variation in human body habitats across space and time.

Authors: Elizabeth K Costello; Christian L Lauber; Micah Hamady; Noah Fierer; Jeffrey I Gordon; Rob Knight
Journal: Science Date: 2009-11-05 Impact factor: 47.728

8. Distinctive microRNA signature of acute myeloid leukemia bearing cytoplasmic mutated nucleophosmin.

Authors: Ramiro Garzon; Michela Garofalo; Maria Paola Martelli; Roger Briesewitz; Lisheng Wang; Cecilia Fernandez-Cymering; Stefano Volinia; Chang-Gong Liu; Susanne Schnittger; Torsten Haferlach; Arcangelo Liso; Daniela Diverio; Marco Mancini; Giovanna Meloni; Robin Foa; Massimo F Martelli; Cristina Mecucci; Carlo M Croce; Brunangelo Falini
Journal: Proc Natl Acad Sci U S A Date: 2008-02-28 Impact factor: 11.205