Zhenqiang Su, Huixiao Hong, Hong Fang, Leming Shi, Roger Perkins, Weida Tong.
Abstract
BACKGROUND: Advances in DNA microarray technology portend that molecular signatures derived from microarrays will eventually be used in clinical environments and personalized medicine. Deriving biomarkers is a large step beyond hypothesis generation and demands considerably more stringency and accuracy in identifying the informative gene subsets that differentiate phenotypes. Microarray data inherently contain few samples and replicates relative to the large number of genes, so informative genes must be identified before a classifier is constructed. However, improving the ability to identify differentiating genes remains a challenge in bioinformatics.
Year: 2008 PMID: 18793473 PMCID: PMC2537560 DOI: 10.1186/1471-2105-9-S9-S9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Nine microarray datasets used in the study.
| Dataset | Cancer type | Endpoint | Samples | Subgroup size | Genes |
|---|---|---|---|---|---|
| Beer | Lung adenocarcinoma | Survival | 86 | 24 | 6532 |
| Bhattacharjee | Lung adenocarcinoma | 4-year survival | 62 | 31 | 5403 |
| Chen | Hepatocellular carcinoma | Tumors | 156 | 82 | 3964 |
| Pomeroy | Medulloblastoma | Medulloblastoma survival | 60 | 21 | 7129 |
| Rosenwald | Non-Hodgkin lymphoma | Survival | 240 | 138 | 7399 |
| Shipp | Diffuse large B-cell lymphoma (DLBCL) | Cured | 58 | 32 | 6817 |
| Singh | Prostate cancer | Tumors | 102 | 52 | 12600 |
| Yeoh | Acute lymphocytic leukaemia | Relapse-free survival | 233 | 32 | 12236 |
| van't Veer | Breast cancer | 5-year metastasis-free survival | 97 | 46 | 4948 |
Figure 1. The flowchart for classifier development and validation using three gene sets: (A) top 50 p-value ranked genes; (B) top 50 VIP genes; and (C) the unique VIP genes. The data set is first randomly divided into two thirds of the samples for training and the remainder for validation. Next, the three gene sets are generated based solely on the training set and are then used to develop Nearest-Centroid classifiers. Lastly, the classifiers predict the validation samples, and their prediction performance is measured by accuracy (Acc), specificity (Spec), sensitivity (Sens), and Matthews correlation coefficient (MCC). The process is repeated 50 times, and the averaged performance metrics are reported in Table 2.
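The split, select, train, and predict loop described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the split is a plain (unstratified) random shuffle, only accuracy is averaged, and `top_genes_by_mean_difference` is a simple stand-in for the p-value and VIP rankings, which are not reproduced here.

```python
import random
from statistics import mean


def nearest_centroid_fit(X, y):
    """Return one centroid (per-gene mean profile) for each class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def top_genes_by_mean_difference(X, y, k=1):
    """Hypothetical stand-in ranking: largest absolute difference of class means.

    Assumes exactly two classes, both present in the training data.
    """
    labels = sorted(set(y))
    a = [x for x, lab in zip(X, y) if lab == labels[0]]
    b = [x for x, lab in zip(X, y) if lab == labels[1]]
    diffs = [abs(mean(ca) - mean(cb)) for ca, cb in zip(zip(*a), zip(*b))]
    return sorted(range(len(diffs)), key=lambda j: diffs[j], reverse=True)[:k]


def repeated_holdout(X, y, select_genes, n_repeats=50, train_frac=2 / 3, seed=0):
    """Average validation accuracy over repeated random train/validation splits."""
    rng = random.Random(seed)
    n = len(X)
    accuracies = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        n_train = round(n * train_frac)
        train, valid = idx[:n_train], idx[n_train:]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Gene selection sees the training samples only, as in Figure 1.
        genes = select_genes(X_tr, y_tr)
        centroids = nearest_centroid_fit([[x[g] for g in genes] for x in X_tr], y_tr)
        hits = sum(
            nearest_centroid_predict(centroids, [X[i][g] for g in genes]) == y[i]
            for i in valid
        )
        accuracies.append(hits / len(valid))
    return mean(accuracies)
```

Calling `repeated_holdout(X, y, top_genes_by_mean_difference)` with `X` as a list of per-sample expression lists and `y` as the class labels returns the mean validation accuracy. Keeping gene selection inside the loop is the important design point: selecting genes on the full data set before splitting would leak validation information into the classifier.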
Comparison of prediction performance for Nearest-Centroid classifiers built from unique VIP genes, top 50 p-value ranked genes, and top 50 VIP genes. The performance metrics, namely accuracy (Acc), specificity (Spec), sensitivity (Sens), and Matthews correlation coefficient (MCC), are averages over 50 repetitions of sample splitting, gene selection, and validation-set prediction.
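| Dataset | Unique VIP genes (n) | Acc (unique VIP) | Spec (unique VIP) | Sens (unique VIP) | MCC (unique VIP) | Acc (top 50 p-value) | Spec (top 50 p-value) | Sens (top 50 p-value) | MCC (top 50 p-value) | Acc (top 50 VIP) | Spec (top 50 VIP) | Sens (top 50 VIP) | MCC (top 50 VIP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beer | 15 | 64.7 | 38.3 | 74.0 | 0.13 | 64.7 | 38.3 | 74.0 | 0.12 | 65.2 | 35.4 | 75.6 | 0.11 |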
| Bhattacharjee | 17 | 58.7 | 57.8 | 59.6 | 0.18 | 58.0 | 59.2 | 56.8 | 0.16 | 58.6 | 59.4 | 57.8 | 0.18 |
| Chen | 14 | 96.5 | 99.9 | 93.6 | 0.93 | 95.3 | 100.0 | 91.2 | 0.91 | 95.8 | 100.0 | 92.1 | 0.92 |
| Pomeroy | 20 | 60.8 | 54.0 | 64.2 | 0.19 | 60.8 | 51.7 | 65.3 | 0.18 | 62.4 | 55.7 | 65.8 | 0.22 |
| Rosenwald | 18 | 55.5 | 58.1 | 53.6 | 0.12 | 56.8 | 63.2 | 52.2 | 0.15 | 57.4 | 62.3 | 53.8 | 0.16 |
| Shipp | 18 | 51.6 | 50.8 | 52.5 | 0.03 | 47.9 | 51.8 | 43.0 | -0.05 | 49.0 | 47.4 | 51.0 | -0.02 |
| Singh | 15 | 94.3 | 98.3 | 91.7 | 0.89 | 98.1 | 100.0 | 96.9 | 0.96 | 97.8 | 100.0 | 96.4 | 0.96 |
| Yeoh | 22 | 74.6 | 37.8 | 80.2 | 0.15 | 78.2 | 31.0 | 85.4 | 0.15 | 80.2 | 35.0 | 87.0 | 0.21 |
| van't Veer | 20 | 64.8 | 64.8 | 64.9 | 0.30 | 65.2 | 61.5 | 68.6 | 0.31 | 66.9 | 66.1 | 67.6 | 0.34 |
Pathways identified for the unique VIP genes and common genes for the van't Veer dataset.
| Gene | Description | Pathway | Category |
|---|---|---|---|
| AF055033 (IGFBP5) | Insulin-like growth factor binding protein 5 | Estrogen signaling pathway | Breast cancer |
| | | IGF signaling pathway | Lung cancer |
| NM_000599 (IGFBP5) | | AR mediated pathway; insulin-like growth factor-1 signaling pathway | Prostate cancer |
| | | Responsive genes | Ovarian cancer |
| NM_000017 (ACADS) | Acyl-coenzyme A dehydrogenase, C-2 to C-3 short chain | Responsive genes | Colon cancer |
| NM_004994 (MMP9) | Matrix metallopeptidase 9 (gelatinase B, 92 kDa gelatinase, 92 kDa type IV collagenase) | Heregulin and CXCL12 signaling pathway | Breast cancer |
| | | Bombesin, IL10, IL8, TGFbeta, and HGF signaling pathway; responsive genes | Prostate cancer |
| | | Responsive genes; thrombospondin signaling pathway | Pancreatic cancer |
| | | Gastrin, HGF, and IL4 signaling pathway; integrin and UPAR mediated pathway | Colon cancer |
| | | Responsive genes | Chronic myeloid leukemia |
| | | EGF signaling pathway; VEGF mediated pathway; responsive genes | Ovarian cancer |
| | | HGF and IL6 signaling pathway; responsive genes | Lung cancer |
| NM_001197 (BIK) | BCL2-interacting killer (apoptosis-inducing) | p53 mediated pathway | Colon cancer |
| NM_001809 (CENPA) | Centromere protein A | Responsive genes | Lung cancer |
| | | p21 mediated pathway | Cell-cycle |
| NM_002808 (PSMD2) | Proteasome (prosome, macropain) 26S subunit, non-ATPase, 2 | Tat signaling pathway | Acquired immunodeficiency syndrome |
| NM_004336 (BUB1) | BUB1 budding uninhibited by benzimidazoles 1 homolog (yeast) | Spindle checkpoint pathway | Cell-cycle |
| NM_004626 (WNT11) | Wingless-type MMTV integration site family, member 11 | Cell-cell signaling pathway | Others |
| | | WNT receptor signaling pathway | Others |
| NM_004887 (CXCL14) | Chemokine (C-X-C motif) ligand 14 | Signal transduction pathway | Others |
| AL050227 (PTGER3) | Prostaglandin E receptor 3 (subtype EP3) | Estrogen signaling pathway | Breast cancer |
| | | PGE2 mediated pathway | Lung cancer |
| NM_006763 (BTG2) | BTG family, member 2 | Estrogen signaling pathway | Breast cancer |
| | | Responsive genes | Prostate cancer |
| | | CEBP alpha mediated pathway | Chronic myeloid leukemia |
| | | Miscellaneous | DNA repair |
| | | BTG mediated pathway | Cell-cycle |
| NM_003862 (FGF18) | Fibroblast growth factor 18 | WNT signaling pathway | Colon cancer |
| NM_006115 (PRAME) | Preferentially expressed antigen in melanoma | Responsive genes | Ovarian cancer |
| X05610 (COL4A2) | Collagen, type IV, alpha 2 | Responsive genes | Glioblastoma |
| NM_003981 (PRC1) | Protein regulator of cytokinesis 1 | p21 mediated pathway | Cell-cycle |
| NM_006027 (EXO1) | Exonuclease 1 | p21 mediated pathway | Cell-cycle |
| NM_002811 (PSMD7) | Proteasome (prosome, macropain) 26S subunit, non-ATPase, 7 | Tat signaling pathway | Acquired immunodeficiency syndrome |
The pathways involved with both unique VIP genes and common genes for the van't Veer dataset
| Pathway | Unique VIP genes | Common genes | Category |
|---|---|---|---|
| Estrogen signaling pathway | IGFBP5 (AF055033, NM_000599) | BTG2, PTGER3 | Breast cancer |
| p21 mediated pathway | BUB1B, CENPA | EXO1, PRC1 | Cell-cycle |
| CEBP alpha mediated pathway | MMP9 | BTG2 | Chronic myeloid leukemia |
| WNT signaling pathway | WNT11 | FGF18 | Colon cancer |
| Tat signaling pathway | PSMD2 | PSMD7 | Acquired immunodeficiency syndrome |
| Responsive genes | MMP9 | BTG2 | Prostate cancer |
| Responsive genes | MMP9, IGFBP5 (AF055033, NM_000599) | PRAME | Ovarian cancer |
Figure 2. The detailed process for identifying a very important pool (VIP) of genes. X1 and X2 are the gene expression profiles of the class 1 and class 2 samples in the training set, respectively. In a bagging step, samples are randomly selected from X1 and X2, and genes are then filtered from the two resampled matrices. Malinowski's factor indicator function (IND) is calculated from the real error (RE) as $\mathrm{IND} = \mathrm{RE}/(n-k)^2$, where RE is computed from the eigenvalues, $\lambda_i$ is the $i$th of the $g$ eigenvalues, $n$ is the number of samples, and $p$ is the number of genes. The optimum number of components ($k$) corresponds to the IND minimum. E11 and E21 are the residue matrices after projecting the class 1 and class 2 data, respectively, into the PCA space for class 1, while E22 and E12 are the residue matrices after projecting the class 2 and class 1 data, respectively, into the PCA space for class 2. The discrimination power (DP) of a gene $j$ is calculated from the $j$th columns of the residue matrices E11, E12, E22, and E21.
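The caption gives IND = RE/(n - k)^2 but does not preserve the expressions for RE or DP, so the sketch below fills them in with commonly used forms and should be read as an assumption rather than the authors' equations. RE is taken as Malinowski's real error, sqrt(sum of the residual eigenvalues / (p (n - k))), and DP as the usual SIMCA-style ratio of cross-model to own-model residual sums of squares. The function names are hypothetical, and the eigenvalues and residual columns are assumed to be supplied by a separate PCA step.

```python
import math


def malinowski_ind(eigenvalues, n_samples, n_genes):
    """Return {k: IND(k)} for each candidate number of components k.

    Assumed form: RE(k) = sqrt(sum_{i>k} lambda_i / (p * (n - k))),
                  IND(k) = RE(k) / (n - k)**2.
    """
    lam = sorted(eigenvalues, reverse=True)
    ind = {}
    # k stops below both g and n so the residual sum and (n - k) stay nonzero.
    for k in range(1, min(len(lam), n_samples)):
        residual = sum(lam[k:])
        re = math.sqrt(residual / (n_genes * (n_samples - k)))
        ind[k] = re / (n_samples - k) ** 2
    return ind


def optimal_components(eigenvalues, n_samples, n_genes):
    """Pick the k at which IND reaches its minimum."""
    ind = malinowski_ind(eigenvalues, n_samples, n_genes)
    return min(ind, key=ind.get)


def discrimination_power(e11_j, e12_j, e22_j, e21_j):
    """Assumed SIMCA-style DP for one gene (column j of each residue matrix).

    Cross-model residuals (E12, E21) go in the numerator and own-model
    residuals (E11, E22) in the denominator; raises ZeroDivisionError if
    the own-model residuals are all zero.
    """
    own = sum(v * v for v in e11_j) + sum(v * v for v in e22_j)
    cross = sum(v * v for v in e12_j) + sum(v * v for v in e21_j)
    return math.sqrt(cross / own)
```

Under these assumed forms, a gene whose expression is fitted well by its own class model but poorly by the other class model gets a large DP, which matches the role the caption assigns to DP when ranking genes.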
The prediction confusion matrix
| Actual \ Predicted | +1 | -1 |
|---|---|---|
| +1 | TP (True positive) | FN (False negative) |
| -1 | FP (False positive) | TN (True negative) |
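The four performance metrics used in the study follow directly from these confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name is hypothetical, and returning 0.0 for MCC when its denominator is zero is a convention chosen here, not something the source specifies.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Compute Acc, Sens, Spec, and MCC from confusion-matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    # Sensitivity: fraction of actual positives predicted as positive.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of actual negatives predicted as negative.
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0.0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sens": sens, "Spec": spec, "MCC": mcc}
```

MCC ranges from -1 to +1 and stays near 0 for a classifier that does no better than chance, which makes it more informative than accuracy when the two classes are very unequal in size.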