Sergio Rojas-Galeano, Emily Hsieh, Dan Agranoff, Sanjeev Krishna, Delmiro Fernandez-Reyes.
Abstract
BACKGROUND: The analysis of complex proteomic and genomic profiles involves the identification of significant markers within a set of hundreds or even thousands of variables that represent a high-dimensional problem space. The occurrence of noise, redundancy or combinatorial interactions in the profile makes the selection of relevant variables harder.
Year: 2008 PMID: 18509521 PMCID: PMC2396875 DOI: 10.1371/journal.pone.0001806
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. High-level flow chart of the wKIERA algorithm.
Description of simulated and biological datasets used in this study.
| Dataset | Size | D | R | Description | Ref |
| Linear with redundant variables (LR) | 200 | 206 | 6 | Occurrence of each condition is equiprobable. Six relevant variables are drawn as { | |
| Linear with outlier variables (LOV) | 200 | 205 | 5 | Occurrence of each condition is equiprobable. Five relevant variables are drawn from | |
| Linear with outlier instances (LOI) | 200 | 205 | 5 | Same method as LOV, but this time "instance" outliers are artificially induced by picking 5% of the total samples and redrawing them from the same distribution with a 10-fold augmented standard deviation. See ref. for details. | |
| Linear hyperplane (LH) | 200 | 205 | 5 | Five relevant variables are drawn from normal distribution, | N/A |
| Nonlinear Gaussian (NLG) | 200 | 206 | 6 | Occurrence of each condition is equiprobable. Negative samples are drawn from multivariate | |
| Nonlinear checkers (NLC) | 500 | 202 | 2 | All variables are drawn uniformly at random from the interval [0,1]. Condition label is determined as the logical exclusive-OR between the first 2 variables, | |
| Human African Trypanosomiasis (HAT) | 231 | 206 | ? | SELDI-ToF proteomic dataset of 85 serum samples from patients affected with Human African Trypanosomiasis (sleeping sickness) plus 146 control serum samples. See ref. for full details on demographics and data gathering. | |
| Tuberculosis (TB) | 349 | 219 | ? | SELDI-ToF proteomic dataset consisting of 179 serum samples from patients affected with active Tuberculosis plus 170 control serum samples. See ref. for full details on demographics and data gathering. | |
| Malaria | 56 | 170 | ? | SELDI-ToF proteomic dataset consisting of 28 serum samples from patients affected with Malaria plus 28 control serum samples. To be published elsewhere. | N/A |
| Colon cancer | 62 | 2000 | ? | Publicly available gene expression microarray consisting of 40 tumor and 22 normal colon tissue samples. | |
| Glial cancer | 50 | 12625 | ? | Publicly available gene expression microarray consisting of 28 samples of glioblastomas and 22 samples of anaplastic oligodendrogliomas. See ref. for further details. | |
D = dimension, R = number of relevant variables.
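As an illustration of how a simulated dataset of this kind can be built, the following minimal sketch generates an NLC-style dataset (500 samples, 202 uniform variables, label given by the exclusive-OR of the first two). The function name `make_nlc`, the `threshold` parameter and the ±1 label coding are assumptions made for this example; in particular, binarising each variable at 0.5 before the exclusive-OR is a guess, because the table entry is cut off before it states how the XOR is computed.

```python
import random

def make_nlc(n_samples=500, n_vars=202, threshold=0.5, seed=0):
    """Generate an NLC-style dataset: uniform variables, XOR label on the first two.

    Assumption: each of the first two variables is binarised at `threshold`
    before the exclusive-OR is taken (not stated in the source table).
    """
    rng = random.Random(seed)
    # Every variable is drawn uniformly at random from [0, 1).
    X = [[rng.random() for _ in range(n_vars)] for _ in range(n_samples)]
    # Label is +1 when exactly one of the first two variables exceeds the threshold.
    y = [1 if (row[0] > threshold) != (row[1] > threshold) else -1 for row in X]
    return X, y

X, y = make_nlc()
print(len(X), len(X[0]), sum(1 for label in y if label == 1))
```

Only the first two columns carry class information here; the remaining 200 are pure noise, which matches R = 2 in the table.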
Figure 2. Performance of variable subsets on simulated datasets.
A) LOI dataset (wKIERA settings: poolsize = 10, maxiter = 400, rep = 2000, wkRBF ρ = 0.1); B) NLG dataset (poolsize = 10, maxiter = 400, rep = 2000, wkPoly d = 2). Top: Average SVM accuracy on 100 random train/test splits using subsets of variables obtained by thresholding the estimated factors of a weighted kernel at the cutoff shown on the horizontal axis. The resulting subset size (number of variables) is shown in brackets. Middle: Comparison of the classification accuracy of SVMs trained using the variables ranked best by wKIERA (red), the variables ranked worst by wKIERA (black), the variables selected by rank correlation coefficients (blue), and all variables (green). Results are averaged over 100 random training/test splits. Bottom: ROC-space analysis of the SVM classifiers shown in the middle plot.
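The thresholding step described in the legend can be sketched as follows. This is only an illustration: the helper name `subset_by_cutoff`, the example weights and the use of a greater-or-equal comparison are assumptions, not details taken from the paper.

```python
def subset_by_cutoff(weights, cutoff):
    """Return the 0-based indices of variables whose estimated weight is >= cutoff."""
    return [k for k, wk in enumerate(weights) if wk >= cutoff]

# Hypothetical estimated kernel factors for five variables.
weights = [0.92, 0.05, 0.61, 0.33, 0.88]

for cutoff in (0.1, 0.5, 0.9):
    subset = subset_by_cutoff(weights, cutoff)
    # Subset size is reported in brackets, as in the top panels of the figure.
    print(f"cutoff {cutoff}: variables {subset} [{len(subset)}]")
```

Each subset would then be passed to a classifier and its accuracy averaged over repeated random splits; that part is omitted here.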
Figure 3. Performance of variable subsets on proteomic datasets.
A) HAT dataset (wKIERA settings: poolsize = 10, maxiter = 400, rep = 2000, wkRBF ρ = 0.01); B) TB dataset (poolsize = 10, maxiter = 400, rep = 2000, wkRBF ρ = 1); C) Malaria dataset (poolsize = 10, maxiter = 400, rep = 2000, wkRBF ρ = 1). Top, Middle and Bottom: See the legend of Figure 2.
Figure 4. Performance of variable subsets on gene expression microarray datasets.
A) Colon cancer dataset (wKIERA settings: poolsize = 100, maxiter = 1000, rep = 1000, wkRBF ρ = 0.1); B) Glial cancer dataset (poolsize = 100, maxiter = 1000, rep = 1000, wkRBF ρ = 1×10⁻⁵). Top, Middle and Bottom: See the legend of Figure 2.
Variables selected by wKIERA in the synthetic datasets (poolsize = 10, maxiter = 400).
| Dataset | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 | Rank 9 | Rank 10 | Matched/true relevant | Kernel settings |
| LR | | | | | | | 45 | 116 | 76 | 191 | 6/6 | wkRBF (ρ = 0.1) |
| LOV | | | | | | 28 | 53 | 93 | 75 | 7 | 5/5 | wkRBF (ρ = 0.1) |
| LOI | | | | | | 87 | 132 | 54 | 20 | 142 | 5/5 | wkRBF (ρ = 0.1) |
| LH | | | | | 162 | | 169 | 27 | 191 | 85 | 5/5 | wkPoly (d = 1) |
| NLG | | | | | | | 141 | 73 | 170 | 78 | 6/6 | wkPoly (d = 2) |
| NLC | | | 178 | 64 | 150 | 162 | 84 | 101 | 3 | 27 | 2/2 | wkPoly (d = 2) |
The type of kernel used for each dataset, weighted RBF kernel (wkRBF) or weighted polynomial kernel (wkPoly), is shown in the rightmost column. Numbers in bold represent true relevant variables.
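To make the kernel settings column concrete, here is a minimal sketch of what a weighted RBF kernel and a weighted polynomial kernel might look like. The exact functional forms are not given in this section, so both formulas below are assumptions: each variable k is scaled by a weight w_k inside an otherwise standard RBF (width ρ) or polynomial (degree d) kernel. The function names `wk_rbf` and `wk_poly` are illustrative.

```python
import math

def wk_rbf(x, z, w, rho):
    """Weighted RBF kernel (assumed form): exp(-rho * sum_k w_k * (x_k - z_k)^2)."""
    return math.exp(-rho * sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, z)))

def wk_poly(x, z, w, d):
    """Weighted polynomial kernel (assumed form): (1 + sum_k w_k * x_k * z_k)^d."""
    return (1.0 + sum(wk * a * b for wk, a, b in zip(w, x, z))) ** d

x = [1.0, 2.0, 3.0]
z = [1.5, 0.0, 3.0]
w = [1.0, 0.0, 1.0]  # a zero weight switches the second variable off

print(wk_rbf(x, z, w, rho=0.1))
print(wk_poly(x, z, w, d=2))
```

Under this form, a weight near zero removes a variable from the similarity computation, which is why the estimated weights can be read as relevance scores.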
Weighted Kernel-based Iterative Estimation of Relevance Algorithm (wKIERA).
| Algorithm wKIERA |
| Dataset: D = {( |
| Pool size: poolsize; Max. iterations: maxiter; |
| best |
| n = dim( |
| W = rand_matrix_01(poolsize, n) |
| repeat for (t = 1, top = 0; t < maxiter; t++) |
| [S, U] = random_split(J, n/2) |
| repeat for each row |
| K = compute_wkernel( |
| h = train_kperceptron(KS, |
| score |
| if(score |
| top = score |
| end_if |
| end_repeat |
| B = select_half_best(W, score_{i = 1:poolsize}); |
| μ = mean(B); σ = std_dev(B); |
| [δ, ξ] = skewness_schedule(t, top); |
| Wnew = μ + ((σ + ξ) * rand_matrix_skewed_01(poolsize, n, δ)) |
| Wnew_1 = best |
| end_repeat |
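The overall loop of the pseudocode can be sketched in plain Python as below. This is a simplified illustration under several assumptions, not the authors' implementation: the weighted RBF form, the half/half random split, the fixed number of perceptron epochs, and above all the resampling step are stand-ins. The source's `skewness_schedule` and `rand_matrix_skewed_01` are not specified in this section, so a clipped Gaussian draw around the mean of the better half of the pool replaces them here. The names `wkiera_sketch`, `accuracy` and the toy data are hypothetical.

```python
import math
import random

def wk_rbf(x, z, w, rho=0.1):
    # Assumed weighted RBF form: exp(-rho * sum_k w_k * (x_k - z_k)^2).
    return math.exp(-rho * sum(wk * (a - b) ** 2 for wk, a, b in zip(w, x, z)))

def train_kperceptron(K, y, idx, epochs=10):
    # Kernel perceptron: alpha[j] counts the mistakes made on training sample j.
    alpha = {j: 0 for j in idx}
    for _ in range(epochs):
        for i in idx:
            f = sum(alpha[j] * y[j] * K[i][j] for j in idx)
            if y[i] * f <= 0:
                alpha[i] += 1
    return alpha

def accuracy(K, y, alpha, test_idx):
    # Fraction of held-out samples whose predicted sign matches the label.
    hits = 0
    for i in test_idx:
        f = sum(a * y[j] * K[i][j] for j, a in alpha.items())
        hits += (1 if f > 0 else -1) == y[i]
    return hits / len(test_idx)

def wkiera_sketch(X, y, poolsize=6, maxiter=5, seed=0):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    W = [[rng.random() for _ in range(n)] for _ in range(poolsize)]
    best, top = W[0][:], 0.0
    for _ in range(maxiter):
        J = list(range(m))
        rng.shuffle(J)
        S, U = J[: m // 2], J[m // 2:]  # assumed half/half train/test split
        scores = []
        for w in W:
            K = [[wk_rbf(X[i], X[j], w) for j in range(m)] for i in range(m)]
            alpha = train_kperceptron(K, y, S)
            scores.append(accuracy(K, y, alpha, U))
            if scores[-1] > top:
                top, best = scores[-1], w[:]
        # Keep the better-scoring half of the pool; fit per-variable mean and std.
        order = sorted(range(poolsize), key=lambda i: scores[i], reverse=True)
        cols = list(zip(*[W[i] for i in order[: poolsize // 2]]))
        mu = [sum(c) / len(c) for c in cols]
        sigma = [math.sqrt(sum((v - mk) ** 2 for v in c) / len(c))
                 for c, mk in zip(cols, mu)]
        # Stand-in for the skewed sampling step: Gaussian draw clipped to [0, 1].
        W = [[min(1.0, max(0.0, rng.gauss(mk, sk + 0.05))) for mk, sk in zip(mu, sigma)]
             for _ in range(poolsize)]
        W[0] = best[:]  # the best vector found so far re-enters the pool
    return best, top

# Toy data: variable 0 separates the classes, variables 1 to 4 are noise.
data_rng = random.Random(1)
y = [1 if i % 2 == 0 else -1 for i in range(40)]
X = [[label * 2.0 + data_rng.gauss(0, 1)] + [data_rng.gauss(0, 1) for _ in range(4)]
     for label in y]
best, top = wkiera_sketch(X, y)
print(len(best), round(top, 2))
```

The returned `best` vector holds one weight per variable in [0, 1]; ranking or thresholding these weights is what yields the variable subsets evaluated in the figures.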