Martin Lauss, Attila Frigyesi, Tobias Ryden, Mattias Höglund.
Abstract
BACKGROUND: Genome-wide gene expression data are a rich source for identifying gene signatures suitable for clinical purposes, and a number of statistical algorithms have been described for both the identification and the evaluation of such signatures. Some of these algorithms are fairly complex and hence prone to over-fitting, whereas others are simpler and more straightforward. Here we present a new type of simple algorithm, based on ROC analysis and the use of metagenes, that we believe will be a good complement to existing algorithms.
Year: 2010 PMID: 20925936 PMCID: PMC2966465 DOI: 10.1186/1471-2407-10-532
Source DB: PubMed Journal: BMC Cancer ISSN: 1471-2407 Impact factor: 4.430
Figure 1. Distribution of AUC values for all genes in a dataset. AUC values with respect to MI status are plotted as solid lines for the SanchezC data in A) and the Stransky data in B). AUC values with respect to high grade (G3) are plotted as solid lines for the SanchezC data in C) and the Stransky data in D). Distributions of AUC values for randomized versions of the data are plotted as dashed lines, and the intervals that contain 99% of the randomized data are indicated with vertical dashed lines.
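The per-gene AUC and the 99% interval of the randomized data described in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "randomized" means shuffling the class labels, which the caption does not specify, and the function names `auc` and `null_interval` are hypothetical.

```python
import random


def auc(values, labels):
    """AUC of `values` for separating label 1 from label 0.

    Computed as the Mann-Whitney U statistic divided by n_pos * n_neg;
    tied pairs count as one half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def null_interval(values, labels, n_perm=1000, coverage=0.99, seed=0):
    """Approximate central interval holding `coverage` of the AUC values
    obtained after shuffling the labels (assumed randomization scheme)."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(auc(values, shuffled))
    null.sort()
    tail = (1.0 - coverage) / 2.0
    lo = null[int(tail * n_perm)]
    hi = null[min(n_perm - 1, int((1.0 - tail) * n_perm))]
    return lo, hi


# Toy expression values for one gene in eight samples (made-up numbers).
expression = [2.1, 2.5, 3.0, 3.2, 4.1, 4.4, 4.8, 5.0]
status = [0, 0, 0, 0, 1, 1, 1, 1]
gene_auc = auc(expression, status)
lo, hi = null_interval(expression, status)
```

A gene would then be flagged as associated with the phenotype when its AUC falls outside `[lo, hi]`. With only eight samples the interval is very wide, so realistic sample sizes are needed for the cut-offs to be informative.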
Figure 2. Scatter plot of AUC values for the 8518 genes shared by the SanchezC and Stransky datasets. AUC values for MI status are plotted in A) and AUC values for grade in B). Genes with AUC values lower or higher than 99% of the randomized data (dashed lines) in both datasets are depicted in green or red, respectively. r = Pearson correlation coefficient.
Figure 3. Deviation from normal distribution and standard deviation of genes from SanchezC with significantly high or low AUC values in both datasets (SanchezC and Stransky). A) Deviation from normal distribution of genes grouped by their association with MI in Figure 2A. B) Deviation from normal distribution of genes grouped by their association with grade in Figure 2B. C) Box plot of the standard deviation of genes grouped by association with MI. D) Box plot of the standard deviation of genes grouped by association with grade. -log10 p-value = negative base-10 logarithm of the p-value of the Shapiro test for normality.
Figure 4. Area under the curve (AUC) values of the ranked single genes and of the metagene obtained by taking the mean expression of the single genes. A) AUC for association with MI. B) AUC for association with grade. C) and D): Classification performance for MI status of the metagene-based predictor (ROC) and the SVM using LOOCV in the SanchezC (C) and Stransky (D) data. For the SVM we used two different feature selection criteria, t-test and AUC. Accuracy = Balanced Accuracy (see Methods); dashed line = Balanced Accuracy obtained by random class assignment (= 0.5). ROC = metagene-based predictor. SVM = Support Vector Machine.
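The two quantities named in Figure 4, the metagene and the balanced accuracy, can be illustrated with a short sketch. The metagene is taken here as the per-sample mean over a gene list, as the caption states. The balanced accuracy uses the usual definition (mean of sensitivity and specificity), which is an assumption because the Methods section is not reproduced here. The names `metagene`, `balanced_accuracy` and the dictionary layout of `expr` are hypothetical.

```python
def metagene(expr, genes):
    """Per-sample mean expression over `genes`.

    `expr` maps a gene name to a list of expression values, one per sample.
    """
    n_samples = len(expr[genes[0]])
    return [sum(expr[g][i] for g in genes) / len(genes) for i in range(n_samples)]


def balanced_accuracy(predicted, actual):
    """Mean of sensitivity and specificity for 0/1 labels (assumed definition)."""
    n_pos = sum(1 for a in actual if a == 1)
    n_neg = sum(1 for a in actual if a == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy data: two genes measured in four samples (made-up numbers).
expr = {"geneA": [1.0, 3.0, 5.0, 7.0], "geneB": [3.0, 5.0, 7.0, 9.0]}
scores = metagene(expr, ["geneA", "geneB"])

# Calling every sample positive gives 0.5 regardless of class imbalance,
# which is why 0.5 serves as the chance level.
chance = balanced_accuracy([1, 1, 1, 1], [1, 0, 0, 0])
```

The sketch averages the genes as given; how genes with low AUC values (negatively associated with the phenotype) enter a metagene is not covered by the caption and is left out.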
Figure 5. Performance of the metagene-based classifier when applied to independent data. A) Genes with the highest AUC values in the SanchezC data are first tested by LOOCV in the SanchezC dataset (black circles) and then carried over to establish a classifier in the Stransky (green) and Blaveri datasets, again by LOOCV. Gene signatures of 5-500 members are used. Balanced accuracies are plotted; the dashed line at 0.5 indicates the balanced accuracy obtained by chance. B) and C): Procedure repeated with Blaveri and Stransky as training data, respectively. D) Changes in balanced accuracy from one gene signature size to the next larger size, plotted for all balanced accuracies obtained in the validation datasets in A-C).
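The leave-one-out procedure in Figure 5 can be sketched under stated assumptions. The decision rule used here, a threshold halfway between the class means of the training metagene with higher scores predicted as the positive class, is an illustrative choice and not taken from the paper. The function `loocv_metagene` and its parameters are hypothetical. Passing `k` re-selects the top-AUC genes inside every fold (the training-dataset case); passing `genes` keeps a fixed signature carried over from another dataset and only re-estimates the threshold (the validation case).

```python
def auc(values, labels):
    """AUC for separating label 1 from label 0; tied pairs count one half."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def loocv_metagene(expr, labels, k=None, genes=None):
    """Leave-one-out predictions of a metagene classifier (illustrative rule).

    `expr` maps gene name -> per-sample values. Each class needs at least
    two samples so that every training fold contains both classes.
    """
    n = len(labels)
    predictions = []
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        y_train = [labels[i] for i in train]
        if genes is None:
            # Rank genes on the training samples only, so the held-out
            # sample never influences feature selection.
            ranked = sorted(
                expr,
                key=lambda g: auc([expr[g][i] for i in train], y_train),
                reverse=True,
            )
            signature = ranked[:k]
        else:
            signature = list(genes)

        def score(i):
            return sum(expr[g][i] for g in signature) / len(signature)

        pos_scores = [score(i) for i in train if labels[i] == 1]
        neg_scores = [score(i) for i in train if labels[i] == 0]
        # Assumed rule: threshold midway between the two class means.
        threshold = 0.5 * (sum(pos_scores) / len(pos_scores)
                           + sum(neg_scores) / len(neg_scores))
        predictions.append(1 if score(held_out) > threshold else 0)
    return predictions


# Toy data: three genes, six samples (made-up numbers).
expr = {
    "up": [1.0, 2.0, 1.5, 5.0, 6.0, 5.5],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
    "down": [6.0, 5.0, 5.5, 1.0, 2.0, 1.5],
}
labels = [0, 0, 0, 1, 1, 1]
within_dataset = loocv_metagene(expr, labels, k=1)
transferred = loocv_metagene(expr, labels, genes=["up"])
```

Selecting genes inside the loop matters for the training dataset: choosing them once on all samples and then cross-validating would let the held-out sample leak into feature selection.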
Balanced accuracies of prediction obtained in various phenotypes using the metagene classifier (ROC) and linear SVM (SVM).
| Phenotype/ | Training1 | Validation | Validation | Training | Validation | Validation | Training | Validation | Validation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| vandeVijver | WangY | Sotiriou | WangY | vandeVijver | Sotiriou | Sotiriou | vandeVijver | WangY | ||
| ROC | 0.85 | |||||||||
| SVM | 0.83 | |||||||||
| vandeVijver | Sotiriou | Sotiriou | vandeVijver | |||||||
| ROC | 0.84 | |||||||||
| SVM | 0.78 | |||||||||
| vandeVijver | Sotiriou | Sotiriou | vandeVijver | |||||||
| ROC | 0.63 | |||||||||
| SVM | 0.49 | |||||||||
| WangQ | JanoueixL | Attiyeh | JanoueixL | WangQ | Attiyeh | Attiyeh | WangQ | JanoueixL | ||
| ROC | 0.88 | |||||||||
| SVM | 0.84 | |||||||||
| WangQ | JanoueixL | Attiyeh | JanoueixL | WangQ | Attiyeh | Attiyeh | WangQ | JanoueixL | ||
| ROC | 0.64 | |||||||||
| SVM | 0.65 | |||||||||
| Angulo | Takeuchi | Takeuchi | Angulo | |||||||
| ROC | 0.96 | |||||||||
| SVM | 0.88 | |||||||||
| Angulo | Takeuchi | Takeuchi | Angulo | |||||||
| ROC | 0.68 | |||||||||
| SVM | 0.61 |
1 Accuracies given in italics were obtained in the training set using LOOCV
2 ER+ vs. ER-
3 grade 1 vs. grade 3
4 tumors ≤ 2 cm vs. tumors > 2 cm
5 MYCN amplification positive vs. MYCN amplification negative
6 stage 1 and 2 vs. stage 3 and 4
7 adenocarcinoma vs. squamous cell carcinoma
8 grade 1 vs. grade 2 and 3
Abbreviations: BC = Breast Cancer, NB = Neuroblastoma, LC = Lung Cancer
Balanced accuracies of prediction using the metagene classifier (ROC) and various classification algorithms with a t-test as the feature selection criterion.
| Phenotype/ | Training1 | Validation | Validation | Training | Validation | Validation | Training | Validation | Validation | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| vandeVijver | WangY | Sotiriou | WangY | vandeVijver | Sotiriou | Sotiriou | vandeVijver | WangY | ||
| ROC5 | 0.85 | |||||||||
| SVM | 0.83 | |||||||||
| SVMradial | 0.83 | |||||||||
| SVMpoly | 0.76 | |||||||||
| knn | 0.83 | |||||||||
| rforest | 0.84 | |||||||||
| rpart | 0.78 | |||||||||
| bagging | 0.84 | |||||||||
| lda | 0.84 | |||||||||
| dlda | 0.86 | |||||||||
| slda | 0.85 | |||||||||
| neuralnet | 0.81 | |||||||||
| ncc | 0.86 | |||||||||
| WangQ | JanoueixL | Attiyeh | JanoueixL | WangQ | Attiyeh | Attiyeh | WangQ | JanoueixL | ||
| ROC | 0.88 | |||||||||
| SVM | 0.84 | |||||||||
| SVMradial | 0.88 | |||||||||
| SVMpoly | 0.81 | |||||||||
| knn | 0.90 | |||||||||
| rforest | 0.88 | |||||||||
| rpart | 0.86 | |||||||||
| bagging | 0.88 | |||||||||
| lda | 0.80 | |||||||||
| dlda | 0.89 | |||||||||
| slda | 0.91 | |||||||||
| neuralnet | 0.87 | |||||||||
| ncc | 0.90 | |||||||||
| Angulo | Takeuchi | Takeuchi | Angulo | |||||||
| ROC | 0.96 | |||||||||
| SVM | 0.88 | |||||||||
| SVMradial | 0.95 | |||||||||
| SVMpoly | 0.94 | |||||||||
| knn | 0.94 | |||||||||
| rforest | 0.94 | |||||||||
| rpart | 0.87 | |||||||||
| bagging | 0.94 | |||||||||
| lda | 0.87 | |||||||||
| dlda | 0.93 | |||||||||
| slda | 0.96 | |||||||||
| neuralnet | 0.94 | |||||||||
| ncc | 0.95 |
1 Accuracies given in italics were obtained in the training set using LOOCV
2 ER+ vs. ER-
3 MYCN amplification positive vs. MYCN amplification negative
4 adenocarcinoma vs. squamous cell carcinoma
5 The metagene classifier uses AUC values as the feature selection criterion.
Abbreviations: BC = Breast Cancer, NB = Neuroblastoma, LC = Lung Cancer