Chang Chang, Junwei Wang, Chen Zhao, Jennifer Fostel, Weida Tong, Pierre R Bushel, Youping Deng, Lajos Pusztai, W Fraser Symmans, Tieliu Shi.
Abstract
BACKGROUND: Gene signatures are potentially of considerable value in clinical diagnosis. However, gene signatures defined with different methods can differ substantially even when applied to the same disease and the same endpoint. Previous studies have shown that correctly selecting subsets of genes from microarray data is key to accurately classifying disease phenotypes, and a number of methods have been proposed for this purpose. However, these methods refine the subsets by considering each feature individually, and they do not confirm the association between the genes identified in each gene signature and the disease phenotype. We proposed a new method, termed Minimize Feature's Size (MFS), based on multiple-level similarity analyses and on the association between genes and disease for breast cancer endpoints. The method compares classifier models generated in the second phase of the MicroArray Quality Control project (MAQC-II), with the aim of developing effective meta-analysis strategies that transform the MAQC-II signatures into a robust and reliable set of biomarkers for clinical applications.
Year: 2011 PMID: 22369133 PMCID: PMC3287502 DOI: 10.1186/1471-2164-12-S5-S6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Analysis workflow. This figure illustrates the general outline of the whole process.
Overlap at the levels of probes and genes
| Models | Endpoint | Number of models | Mean (Variance) | Probe Total | Probe Overlapped | Probe Rate(%) | Probe Endpoint Overlap | Gene Total | Gene Overlapped | Gene Rate(%) | Gene Endpoint Overlap |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Normal | D | 28 | 97.04 (91310.26) | 1747 | 345 | 19.75 | 785 | 1350 | 402 | 29.78 | 619 |
| Normal | E | 22 | 143.5 (95148.74) | 1760 | 644 | 36.59 | 785 | 1309 | 589 | 45.00 | 619 |
| Swap | D | 20 | 54.10 (3964.62) | 609 | 207 | 33.99 | 317 | 465 | 174 | 37.42 | 252 |
| Swap | E | 20 | 106.05 (32074.79) | 1047 | 443 | 42.31 | 317 | 793 | 416 | 52.46 | 252 |
This table shows the primary analyses at the probe and gene levels for normal models and swap models. Total refers to the total number of non-redundant probes and genes; the mean and variance of model size (number of features) for each endpoint are listed in Mean (Variance). Rate refers to the proportion of overlaps within the unique sets of probes and genes. The numbers of overlaps between the unique sets of probes and genes of the two endpoints are shown in Endpoint Overlap. Note that two probes for NIEHS_BR_E_5 were removed because they do not appear on the Affymetrix U133A platform. We compared the lists of probes for each group of data to obtain overlaps at the probe level. Two probes that overlap at the probe level must also overlap at the gene level; two probes that are not identical but map to the same gene also overlap at the gene level. The table suggests that the overlap rates of endpoint E are always greater than those of endpoint D. For normal models, the numbers of features for the two endpoints do not differ significantly (F = 0.9507, T-test = 0.5748), nor do those for swap models (F = 0.1236, T-test = 1.2238, degrees of freedom = 23.6263). Furthermore, overlap rates at the gene level are greater than those at the probe level, suggesting that many non-identical probes share the same genes.
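The swap-model statistics quoted in the note can be recomputed from the summary values in the table alone. The sketch below assumes that F is the ratio of the two sample variances and that the t statistic and degrees of freedom come from Welch's unequal-variance t-test; the source does not name the test, so this is an inference from the numbers. The function names are illustrative only.

```python
import math


def variance_ratio(var_a, var_b):
    """F statistic as the ratio of two sample variances."""
    return var_a / var_b


def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom,
    computed from summary statistics (mean, sample variance, sample size)."""
    se2_a = var_a / n_a  # squared standard error of mean A
    se2_b = var_b / n_b  # squared standard error of mean B
    t = (mean_b - mean_a) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Swap-model rows of the table: mean and variance of the number of
# features per model, and the number of models for each endpoint.
swap_d = {"mean": 54.10, "var": 3964.62, "n": 20}
swap_e = {"mean": 106.05, "var": 32074.79, "n": 20}

f_stat = variance_ratio(swap_d["var"], swap_e["var"])
t_stat, df = welch_t(
    swap_d["mean"], swap_d["var"], swap_d["n"],
    swap_e["mean"], swap_e["var"], swap_e["n"],
)
print(f"F = {f_stat:.4f}, t = {t_stat:.4f}, df = {df:.4f}")
```

Under these assumptions the output agrees with the swap-model values in the note (F = 0.1236, t = 1.2238, degrees of freedom = 23.6263) to four decimal places.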
Model parameters and performances
| UniqueModelID | BR_D_Model | Swap_BR_D_Model | BR_E_Model | SwapBR_E_Model |
|---|---|---|---|---|
| Endpoint | D | D | E | E |
| Dataset | training dataset | validation dataset | training dataset | validation dataset |
| Samples | 130 | 100 | 130 | 100 |
| Features | 32 | 33 | 55 | 10 |
| Normalization | MAS5 | MAS5 | MAS5 | MAS5 |
| Batch Effect Removal Method | AGC | AGC | none | None |
| Feature Selection Method | MCC-robustness | MCC-robustness | MCC-robustness | MCC-robustness |
| Classification Method | SVM | SVM | SVM | SVM |
| Internal Validation | 5F-CV | 5F-CV | 5F-CV | 5F-CV |
| Validation Iterations | 10 | 10 | 10 | 10 |
| MFS Fitting Index | index1 | index1 | MCC | MCC |
| MFS Optimized Method | SVM | SVM | SVM | SVM |
| MFS Best Fitting Model | yes | yes | yes | yes |
| CV_MCC | 0.707 | 0.689 | 0.904 | 0.942 |
| CV_ACC | 0.892 | 0.827 | 0.955 | 0.972 |
| CV_SEN | 0.915 | 0.673 | 0.947 | 0.955 |
| CV_SPE | 0.815 | 0.981 | 0.959 | 0.983 |
| MCC_Std Dev | 0.030 | 0.082 | 0.029 | 0.021 |
| ACC_Std Dev | 0.011 | 0.048 | 0.014 | 0.010 |
| SEN_Std Dev | 0.011 | 0.091 | 0.017 | 0.024 |
| SPE_Std Dev | 0.026 | 0.013 | 0.013 | 0.000 |
| Val_MCC | 0.395 | 0.368 | 0.819 | 0.661 |
| Val_ACC | 0.850 | 0.792 | 0.910 | 0.838 |
| Val_SEN | 0.907 | 0.714 | 0.841 | 0.914 |
| Val_SPE | 0.500 | 0.802 | 0.964 | 0.811 |
MCC is short for Matthews correlation coefficient, ACC for accuracy, SEN for sensitivity, SPE for specificity and Std Dev for standard deviation. The CV rows refer to internal validation, and the Val rows refer to validation of the model built on the training dataset against the validation dataset. We balanced the training dataset for Swap_BR_D_Model because its P/N (positive/negative) ratio is too small (0.18): on unbalanced datasets MCC is very sensitive, and its value can change substantially even for a small prediction error. Features of BR_D_Model and BR_E_Model are available in Additional file 17.
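The sensitivity of MCC on unbalanced data can be illustrated with a small sketch that computes the four reported metrics from a confusion matrix. The counts below are hypothetical (18 positives and 100 negatives, chosen only to give a P/N ratio of 0.18) and are not taken from the study; the function name is illustrative.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Return ACC, SEN, SPE and MCC for a binary confusion matrix."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    sen = tp / (tp + fn) if (tp + fn) else 0.0
    spe = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SEN": sen, "SPE": spe, "MCC": mcc}


# Hypothetical unbalanced dataset: 18 positives, 100 negatives (P/N = 0.18).
base = confusion_metrics(tp=15, fp=3, tn=97, fn=3)
# Same classifier with two more positives misclassified.
worse = confusion_metrics(tp=13, fp=3, tn=97, fn=5)

for name in ("ACC", "MCC"):
    print(f"{name}: {base[name]:.3f} -> {worse[name]:.3f}")
```

In this hypothetical case, two extra errors on the minority class move accuracy by less than 0.02 while MCC falls by more than 0.07, which is the kind of instability that balancing the training set is meant to reduce.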
Figure 2. Heatmaps for gene signatures on the validation dataset. (a) Heatmap for BR_D_Model; (b) heatmap for BR_E_Model. Each column represents a sample in the dataset, and each row represents a gene in the gene signature. Note that the last row is the endpoint status.
Figure 3. Performance of original and swap models based on classification-algorithm-level similarity analysis. (a) Endpoint D original models; (b) endpoint E original models; (c) endpoint D swap models; and (d) endpoint E swap models. The coordinate axes are MCC (internal validation), Val_MCC (external validation) and MCC_Std (standard deviation of MCC in internal validation). Each classification algorithm is represented by a different color. The radius of each sphere is related to the number of model features, within a range of 50 to 1,000. The blue stars are our own models, while the spheres are the models from which our models were developed.
Biological associations between genes and breast cancer
| Gene | General description | Support | Endpoint D Overlap | Endpoint D Position | Endpoint E Overlap | Endpoint E Position |
|---|---|---|---|---|---|---|
| | Highly correlated with ERα | + | 67 | -5 | 61 | 4 |
| | Prognostic value in ER(+) primary breast cancers in 3 patient cohorts | + | 43 | -7 | 22 | 26 |
| | Involved in pathological processes of breast cancer | + | 33 | -2 | 53 | 1 |
| | A breast cancer marker | + | 25 | -14 | 40 | 6 |
| | Protein coding | | 22 | 75 | | -142 |
| | Differentially expressed reported [ | | 22 | 84 | 20 | -61 |
| | Protein coding | | 21 | 27 | | -239 |
| | Downregulated reported [ | | 20 | 2 | 19 | -2 |
| | Aberrantly regulated and could contribute to therapeutic failure in the context of ER-positive breast cancer | + | 18 | -56 | | 145 |
| | Downregulated reported [ | | 16 | 7 | | -16 |
| | Protein coding | | | -60 | 33 | 60 |
| | Coregulated with estrogen receptor in some breast cancers | + | | -64 | 25 | 41 |
| | Protein coding | | | -25 | 25 | 19 |
| | Protein coding | | | -34 | 16 | 29 |
A '+' in the Support column indicates experimental support for a role in breast cancer. Unless otherwise stated, descriptions were collected from Entrez Gene [25]. Values in the Overlap columns give each gene's overlap count and are shown for the top 10 ranked genes of each endpoint. Values in the Position columns give each gene's rank in the list of fold-change values sorted in descending order; the sign indicates at which end of the list (up-regulated or down-regulated) the gene sits. Almost all of the top 10 genes for each endpoint rank among the top 200 up-regulated or down-regulated genes. Furthermore, among the genes common to the two endpoints, the direction of the change in expression level did not always agree. In other words, genes up-regulated in endpoint D could be down-regulated in endpoint E.
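The signed Position values can be sketched as follows. The convention used here is an assumption, since the note does not say which sign marks which end: up-regulated genes receive a positive rank counted from the top of the descending fold-change list, and down-regulated genes receive a negative rank counted from the bottom. The gene names, fold-change values and function names are all hypothetical.

```python
def signed_positions(log_fold_changes):
    """Map each gene to a signed rank in the descending fold-change list.

    Assumed convention: genes with a non-negative log fold change get
    +rank from the top; the others get -rank from the bottom.
    """
    ranked = sorted(log_fold_changes, key=log_fold_changes.get, reverse=True)
    n = len(ranked)
    positions = {}
    for i, gene in enumerate(ranked):
        if log_fold_changes[gene] >= 0:
            positions[gene] = i + 1      # 1 = most up-regulated
        else:
            positions[gene] = -(n - i)   # -1 = most down-regulated
    return positions


def opposite_direction(pos_d, pos_e):
    """Genes present in both endpoints whose positions have opposite signs."""
    shared = pos_d.keys() & pos_e.keys()
    return sorted(g for g in shared if pos_d[g] * pos_e[g] < 0)


# Hypothetical log fold changes for two endpoints.
endpoint_d = {"geneA": 2.1, "geneB": 0.4, "geneC": -0.3, "geneD": -1.8}
endpoint_e = {"geneA": -1.2, "geneB": 0.9, "geneC": -0.5, "geneD": 1.5}

pos_d = signed_positions(endpoint_d)
pos_e = signed_positions(endpoint_e)
print(pos_d)
print(opposite_direction(pos_d, pos_e))
```

Comparing signs across the two endpoints, as `opposite_direction` does, is one simple way to list genes whose direction of change differs between endpoint D and endpoint E.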