Limin Li, Barbara Rakitsch, Karsten Borgwardt.
Abstract
MOTIVATION: Classifying biological data into different groups is a central task of bioinformatics: for instance, to predict the function of a gene or protein, the disease state of a patient or the phenotype of an individual based on its genotype. Support Vector Machines are a widespread approach for classifying biological data, due to their high accuracy, their ability to deal with structured data such as strings, and the ease of integrating various types of data. However, it is unclear how to correct for confounding factors such as population structure, age, gender or experimental conditions in Support Vector Machine classification.
Year: 2011 PMID: 21685091 PMCID: PMC3117385 DOI: 10.1093/bioinformatics/btr204
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Genes are sorted according to the weight vector of the ccSVM (blue dashed line) and according to the weight vector of the standard SVM (green line). The correlation coefficient between each gene expression level and lab membership is calculated. The averaged absolute correlation coefficient of the top i genes is plotted for gene i.
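The curve described in this caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `mean_abs_corr_curve` is hypothetical, Pearson correlation is assumed, and genes are assumed to be ranked by decreasing absolute weight (the caption only says they are sorted "according to the weight vector").

```python
import numpy as np

def mean_abs_corr_curve(X, weights, confounder):
    """Running mean of |corr(gene, confounder)| over genes ranked by weight.

    X          : (n_samples, n_genes) expression matrix
    weights    : (n_genes,) classifier weight vector
    confounder : (n_samples,) numeric confounder, e.g. 0/1 lab membership
    """
    X = np.asarray(X, dtype=float)
    c = np.asarray(confounder, dtype=float)

    # Pearson correlation of every gene column with the confounder.
    Xc = X - X.mean(axis=0)
    cc = c - c.mean()
    num = (Xc * cc[:, None]).sum(axis=0)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (cc ** 2).sum())
    # Constant columns have zero variance; report correlation 0 for them.
    corr = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)

    # Assumption: rank genes by decreasing absolute weight.
    order = np.argsort(-np.abs(np.asarray(weights, dtype=float)))
    abs_corr = np.abs(corr[order])

    # Entry i-1 is the average over the top i genes.
    return np.cumsum(abs_corr) / np.arange(1, abs_corr.size + 1)
```

Calling the function once with the ccSVM weights and once with the standard SVM weights would give the two curves compared in the figure; a lower curve means the top-ranked genes are less correlated with the confounder.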
AUC and P-values for ccSVM, standard SVM, pcaSVM, (K+L)SVM and ccLR for the three different confounding variables on the Tuberculosis dataset
| Side information | AUC: ccSVM | AUC: SVM | p: SVM | AUC: pcaSVM | p: pcaSVM | AUC: (K+L)SVM | p: (K+L)SVM | AUC: ccLR(ML) |
|---|---|---|---|---|---|---|---|---|
| Ethnicity | 0.955±0.002 | 6.3e-05 | 3.6e-09 | 0.942±0.003 | 1.2e-04 | |||
| Age | 0.967±0.002 | 0.939±0.003 | 3.8e-12 | 0.933±0.003 | 1.5e-18 | 0.943±0.002 | 4.0e-16 | 0.49 |
| Gender | 0.938±0.003 | 2.8e-01 | 6.2e-01 | 0.941±0.003 | 1.7e-01 | 0.499 |
Fig. 2. Gene expression levels are sorted according to the weight vector of ccSVM (blue dashed line) and according to the weight vector of standard SVM (green line). The correlation coefficient between each gene expression level and ethnic origin (African) is calculated. The averaged absolute correlation coefficient of the top i genes is plotted for gene i.
AUC and P-values for ccSVM, standard SVM, pcaSVM, (K+L)SVM and ccLR for the five different Arabidopsis phenotypes
| PID | Phenotype | AUC: ccSVM | AUC: SVM | p: SVM | AUC: pcaSVM | p: pcaSVM | AUC: (K+L)SVM | p: (K+L)SVM | AUC: ccLR(ML) | AUC: ccLR(BR) |
|---|---|---|---|---|---|---|---|---|---|---|
| 169 | Chlorosis at 22°C | 0.658±0.004 | 0.623±0.004 | 8.3e-10 | 0.625±0.004 | 6.4e-09 | 0.574±0.004 | 2.2e-28 | 0.632±0.006 | 0.523±0.004 |
| 171 | Anthocyanin at 16°C | 0.590±0.005 | 0.568±0.005 | 1.2e-03 | 0.570±0.004 | 2.1e-03 | 0.560±0.004 | 2.1e-06 | 0.571±0.012 | 0.571±0.003 |
| 172 | Anthocyanin at 22°C | 0.628±0.003 | 0.610±0.003 | 2.7e-05 | 0.610±0.004 | 1.2e-04 | 0.576±0.003 | 1.8e-21 | 0.613±0.004 | 0.552±0.004 |
| 176 | Leaf Roll at 10°C | 0.720±0.002 | 0.695±0.003 | 2.6e-09 | 0.697±0.003 | 3.8e-08 | 0.653±0.003 | 3.3e-31 | 0.691±0.010 | 0.550±0.003 |
| 178 | Leaf Roll at 22°C | 0.587±0.007 | 0.575±0.006 | 1.8e-01 | 0.591±0.005 | 6.0e-01 | 0.580±0.006 | 4.1e-01 | 0.573±0.006 | 0.476±0.008 |
Fig. 3. SNPs are sorted by their absolute weight of the standard SVM. The green line shows the weights of the standard SVM, the blue dashed line shows the weights of ccSVM. Both weight vectors are normalized. The Arabidopsis phenotypes are shown in the following order (from top to bottom): anthocyanin at 16°C (PID: 171, λ=10^-2), chlorosis at 22°C (PID: 169, λ=10^8).
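The ordering used in this figure can be illustrated with a short NumPy sketch. It is an assumption-laden example rather than the authors' procedure: the helper name `normalized_weight_profiles` is hypothetical, and "normalized" is taken to mean unit Euclidean norm, which the caption does not specify.

```python
import numpy as np

def normalized_weight_profiles(w_svm, w_ccsvm):
    """Order SNPs by |standard SVM weight| and return both weight profiles."""
    w_svm = np.asarray(w_svm, dtype=float)
    w_ccsvm = np.asarray(w_ccsvm, dtype=float)

    # Assumption: each weight vector is scaled to unit Euclidean norm.
    w_svm = w_svm / np.linalg.norm(w_svm)
    w_ccsvm = w_ccsvm / np.linalg.norm(w_ccsvm)

    # SNP indices sorted by decreasing absolute standard-SVM weight.
    order = np.argsort(-np.abs(w_svm))

    # Both profiles share the same x-axis ordering, so they can be overlaid.
    return order, w_svm[order], w_ccsvm[order]
```

Because the ccSVM weights are shown in the standard SVM's ordering, SNPs that the standard SVM ranks highly but ccSVM down-weights would stand out as gaps between the two lines.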
Summary of ccSVM results for the presence or absence of chlorosis at 22°C (PID:169)
| Rank | Chrom | Pos | Gene | Gene ID | dist(Gene) |
|---|---|---|---|---|---|
| 109 | 1 | 22050068 | PDR8/PEN3 | AT1G59870 | 6365 |
| 110 | 1 | 22056970 | PDR8/PEN3 | AT1G59870 | 13267 |
| 111 | 1 | 22057369 | PDR8/PEN3 | AT1G59870 | 13666 |
| 208 | 4 | 949836 | MOS6 | AT4G02150 | 775 |
| 224 | 1 | 20910400 | AHG2 | AT1G55870 | 8313 |
| 267 | 1 | 20737467 | CPN60B | AT1G55490 | 14605 |
| 363 | 5 | 25795239 | AT5G64510 | AT5G64510 | 6391 |
| 464 | 5 | 25795805 | AT5G64510 | AT5G64510 | 5825 |
| 489 | 5 | 12625100 | CDR1 | AT5G33340 | 11918 |
In the table, Chrom, Pos and dist(Gene) denote the chromosome, the position of the SNP and the distance from the SNP to the specified gene, respectively.