Kenneth R Hess, Caimiao Wei, Yuan Qi, Takayuki Iwamoto, W Fraser Symmans, Lajos Pusztai.
Abstract
BACKGROUND: Our goal was to examine how various aspects of a gene signature influence the success of developing multi-gene prediction models. We inserted gene signatures into three real data sets by altering the expression level of existing probe sets. We varied the number of probe sets perturbed (signature size), the fold increase of mean probe set expression in perturbed compared to unperturbed data (signature strength) and the number of samples perturbed. Prediction models were trained to identify which cases had been perturbed. Performance was estimated using Monte-Carlo cross validation.
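The perturbation described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `spike_signature`, the probes-by-samples list-of-lists layout, and the choice to add a constant `c` on the log2 scale (so that `c` acts as a log2 fold-change) are assumptions made for illustration.

```python
import random

def spike_signature(expr, n_probes, n_samples, c, seed=0):
    """Return a perturbed copy of `expr`, case labels, and the spiked probe indices.

    expr      : list of rows (probe sets), each a list of log2 intensities per sample
    n_probes  : signature size (number of probe sets to perturb)
    n_samples : number of samples to perturb
    c         : signature strength, added on the log2 scale (assumption)
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(expr), len(expr[0])
    probes = rng.sample(range(n_rows), n_probes)
    samples = set(rng.sample(range(n_cols), n_samples))

    spiked = [row[:] for row in expr]  # copy so the input is left untouched
    for i in probes:
        for j in samples:
            spiked[i][j] += c

    # Label 1 marks a perturbed sample; a classifier would be trained to predict this.
    labels = [1 if j in samples else 0 for j in range(n_cols)]
    return spiked, labels, sorted(probes)
```

Varying `n_probes`, `c` and `n_samples` in such a function corresponds to varying signature size, signature strength and the number of perturbed samples.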
Year: 2011 PMID: 22132775 PMCID: PMC3245512 DOI: 10.1186/1471-2105-12-463
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Spiked probe set recovery rates as a function of informative sample size and signature strength. Both the number of spiked probes (genes) and the number of features included in the predictor model were set to 10. The solid lines indicate the average recovery rates and the dots represent results from the 20 individual iterations. Results from the MAQC-II data set (n = 233) are shown. The "c" value, which is the log2 fold-change, takes the values 0.5, 1.0, 1.2 and 1.5.
Figure 2. Classifier performance as a function of informative sample size and signature strength. Both the number of spiked probes and the number of features included in the predictor model were set to 10. The solid lines indicate the average area above the ROC curve (AAC) from Monte Carlo Cross Validation (MCCV). The smaller the AAC, the more accurate the predictor. The dots represent results from the 20 individual iterations of the analysis performed on the MAQC-II data set (n = 233). The "c" value, which is the log2 fold-change, takes the values 0.5, 1.0, 1.2 and 1.5.
Figure 3. Classifier performance is influenced by signature size, the number of informative cases and the number of features included in the prediction model. The numbers of spiked probes were 10, 25, 50 and 100 and the number of features included in the prediction model was set to 100. Log2 fold-change ("c") was set to 0.5. The solid lines indicate the average area above the ROC curve (AAC) from Monte Carlo Cross Validation (MCCV); the dots represent results from the 20 individual iterations of the analysis performed on the MAQC-II data set (n = 233).
Figure 4. Classifier performance as a function of signature strength and informative sample size when redundant features are also included in the model. The number of spiked probes was set to 10 and the number of features included in the predictor was set to 100. The informative sample sizes were 10, 30, 60 and 100, and the log2 fold increases (i.e., c values) were 0.5, 1.0, 1.2, 1.5, 2.0, 3.0 and 4.0. The solid lines indicate the average area above the ROC curve (AAC) from Monte Carlo Cross Validation (MCCV); the dots represent results from 20 iterations of the analysis performed on the MAQC-II data set (n = 233).
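The figure captions report performance as the area above the ROC curve (AAC), estimated over 20 Monte Carlo cross-validation iterations. As a rough sketch, the AAC can be computed as 1 minus the AUC, with the AUC obtained from pairwise comparisons of case and control scores. The function names and the one-third test fraction below are illustrative assumptions; the source does not state the split proportion or the classifier used.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def aac(scores, labels):
    """Area above the ROC curve: 0 is a perfect predictor, 0.5 is chance level."""
    return 1.0 - auc(scores, labels)

def mccv_splits(n_samples, test_fraction=1 / 3, n_iter=20, seed=0):
    """Yield (train_idx, test_idx) pairs from repeated random splits.

    test_fraction is an assumed value, not taken from the source.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_iter):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]
```

In each iteration a model would be fit on the training indices and scored on the test indices, and the per-iteration AAC values averaged, which is what the solid lines in the figures summarize.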
Fold difference in the expression values of informative probe sets for 3 different clinical prediction problems assessed in the same breast cancer data set (GEO GSE 16716)
| | (i) Feature # 10 | (i) Feature # 100 | (ii) Feature # 10 | (ii) Feature # 100 | (iii) Feature # 10 | (iii) Feature # 100 |
|---|---|---|---|---|---|---|
| FDR adjusted p-value | 4.71E-12 | 3.26E-07 | 0.004 | 0.0205 | 0.4 | 0.68 |
| Fold difference | ||||||
| < 0.5 | 0 | 0 | 0 | 6 | 4 | 59 |
| ≥0.5 - < 1.0 | 0 | 13 | 3 | 43 | 2 | 32 |
| ≥1.0 - < 1.2 | 0 | |||||
| ≥1.2 - < 1.5 | 0 | 0 | ||||
| ≥1.5 - < 2.0 | 0 | 0 | ||||
| ≥ 2.0 - < 3.0 | 0 | 0 | ||||
| ≥ 3.0 - < 4.0 | 0 | 0 | 0 | 0 | ||
| ≥ 4.0 | 0 | 0 | 0 | 0 | ||
Comparison (i): random sample of 41 ER-positive and 50 ER-negative samples.
Comparison (ii): random sample of 41 pathologic CR and 50 residual cancers, regardless of ER status.
Comparison (iii): random sample of 41 pathologic CR and 50 residual cancers, all ER-negative.
Log2 Difference = abs(mean log2 intensity for group1 - mean log2 intensity for group 2) where "abs" is absolute value.
The numbers of probe sets with a given level of differential expression are shown for the 3 comparisons: (i) Estrogen Receptor (ER)-positive versus ER-negative cancers, (ii) cancers with pathologic complete response (pCR) to chemotherapy versus lesser response (RD) and (iii) cases with pCR versus RD in ER-negative cancers only. Probe sets with mean log2 transformed expression difference >1 between comparison groups are highlighted in bold. FDR = false discovery rate (i.e., the proportion of genes detected as informative that are not truly informative). FDR adjusted p-values (also known as FDR q-values) are the estimated FDR values that would be incurred if the p-value associated with the selected gene were used as the threshold for significance (i.e., if genes with that p-value or a lower p-value were detected as informative). For the Feature # 10 column, the reported FDR adjusted p-value is therefore the q-value associated with the 10th highest ranked gene.
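The source does not say which FDR procedure produced the q-values. Assuming a Benjamini-Hochberg style adjustment (an assumption, not a statement about the authors' method), the q-value at a given rank can be sketched as follows. The helper names `bh_qvalues` and `q_at_rank` are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over its own rank and all higher ranks (enforces monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q

def q_at_rank(pvalues, k):
    """q-value of the k-th highest ranked gene (k = 10 for a 'Feature # 10' column)."""
    return sorted(bh_qvalues(pvalues))[k - 1]
```

Under this reading, a "Feature # 100" entry is the q-value that would apply if the top 100 probe sets were all declared informative.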
Fold difference in the expression values of informative probe sets for 6 different prediction problems in different data sets.
| Data set | IBC versus Non-IBC | p53 mutation versus normal | PI3K mutation versus normal | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Receptor status | ER-negative | ER-positive | ER-negative | ER-positive | ER-negative | ER-positive | ||||||
| Phenotype | IBC | Non-IBC | IBC | Non-IBC | mutation | normal | mutation | normal | mutation | normal | mutation | normal |
| Pts # | 19 vs 27 | 6 vs 31 | 44 vs 11 | 14 vs 31 | 8 vs 49 | 15 vs 57 | ||||||
| Feature # | 10 | 100 | 10 | 100 | 10 | 100 | 10 | 100 | 10 | 100 | 10 | 100 |
| FDR q value | 1.000 | 1.000 | 0.280 | 0.500 | 0.14 | 0.33 | 0.49 | 0.6 | 0.08 | 0.09 | 0.24 | 0.55 |
| Fold difference | ||||||||||||
| < 0.5 | 0 | 20 | 0 | 2 | 1 | 9 | 2 | 29 | 0 | 0 | 0 | 17 |
| ≥0.5 - < 1.0 | 4 | 55 | 1 | 41 | 1 | 37 | 5 | 47 | 1 | 23 | 6 | 50 |
| ≥1.0 - < 1.2 | 3 | |||||||||||
| ≥1.2 - < 1.5 | 4 | 2 | ||||||||||
| ≥1.5 - < 2.0 | 2 | 1 | 1 | 0 | ||||||||
| ≥2.0 - < 3.0 | 0 | 2 | 1 | 14 | 1 | 1 | ||||||
| ≥3.0 - < 4.0 | 1 | 0 | 0 | 1 | 0 | 0 | ||||||
| ≥4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||||
| Data set | WANG | TRANSBIG | Mainz | |||||||||
| Receptor status | ER-negative | ER-positive | ER-negative | ER-positive | ER-negative | ER-positive | ||||||
| Recurrence | Yes | No | Yes | No | Yes | No | Yes | No | Yes | No | Yes | No |
| Pts # | 29 vs 50 | 29 vs 50 | 18 vs 45 | 23 vs 112 | 11 vs 20 | 30 vs 139 | ||||||
| Feature # | 10 | 100 | 10 | 100 | 10 | 100 | 10 | 100 | 10 | 100 | 10 | 100 |
| FDR q value | 0.2 | 0.26 | 0.004 | 0.02 | 0.820 | 0.91 | 0.010 | 0.03 | 0.47 | 0.58 | 0.01 | 0.03 |
| Fold difference | ||||||||||||
| < 0.5 | 2 | 35 | 3 | 68 | 2 | 26 | 1 | 26 | 1 | 13 | 3 | 59 |
| ≥0.5 - < 1.0 | 6 | 51 | 7 | 30 | 4 | 47 | 5 | 52 | 5 | 44 | 5 | 26 |
| ≥1.0 - < 1.2 | 0 | 0 | 1 | |||||||||
| ≥1.2 - < 1.5 | 0 | |||||||||||
| ≥1.5 - < 2.0 | ||||||||||||
| ≥2.0 - < 3.0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| ≥3.0 - < 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| ≥4.0 | 0 | 0 | 0 | 0 | 0 | 0 | ||||||
Log2 Difference = abs(mean log2 intensity for group1 - mean log2 intensity for group 2) where "abs" is absolute value.
The numbers of probe sets with a given level of differential expression are shown for the 6 comparisons. Analyses were performed separately for Estrogen Receptor (ER)-positive and ER-negative cancers. IBC = inflammatory breast cancer, PI3K = Phosphatidylinositol-3 kinase, FDR = false discovery rate. The WANG, TRANSBIG and Mainz data sets correspond to references 2, 3 and 28. Probe sets with mean log2 transformed expression difference >1 between comparison groups are highlighted in bold.
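The "Log2 Difference" definition in the table footnotes, together with the fold-difference ranges used as table rows, can be sketched as a small tally. The bin edges are taken from the row labels; the function names and the input layout (one pair of per-group log2 intensity lists per probe set) are illustrative assumptions.

```python
from bisect import bisect_right
from statistics import mean

# Upper edges separating the table rows; each range is closed on the left.
BIN_EDGES = [0.5, 1.0, 1.2, 1.5, 2.0, 3.0, 4.0]
BIN_LABELS = [
    "< 0.5", ">=0.5 - < 1.0", ">=1.0 - < 1.2", ">=1.2 - < 1.5",
    ">=1.5 - < 2.0", ">=2.0 - < 3.0", ">=3.0 - < 4.0", ">=4.0",
]

def log2_difference(group1, group2):
    """abs(mean log2 intensity for group 1 - mean log2 intensity for group 2)."""
    return abs(mean(group1) - mean(group2))

def bin_counts(differences):
    """Count how many probe sets fall into each fold-difference range."""
    counts = [0] * len(BIN_LABELS)
    for d in differences:
        # bisect_right puts a value equal to an edge into the higher bin,
        # matching the ">=" lower bounds in the row labels.
        counts[bisect_right(BIN_EDGES, d)] += 1
    return dict(zip(BIN_LABELS, counts))
```

Applying `bin_counts` to the log2 differences of the top 10 or top 100 ranked probe sets would give one column of such a table.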