Chia Huey Ooi, Madhu Chetty, Shyh Wei Teng.
Abstract
BACKGROUND: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied to microarray datasets are either rank-based, and hence do not take into account correlations between genes, or wrapper-based, which incur high computational cost and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy.
Year: 2006 PMID: 16796748 PMCID: PMC1569877 DOI: 10.1186/1471-2105-7-320
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Table 1. Descriptions of benchmark datasets.
| Dataset | Type | N | K | Training:Test set size | Pmax |
| BRN | cDNA | 7452 | 15 | 176:84 | 150 |
| BRN14 | cDNA | 7452 | 14 | 174:83 | 150 |
| GCM | Affymetrix | 10820 | 14 | 144:54 | 150 |
| NCI60 | cDNA | 7386 | 8 | 40:20 | 150 |
| PDL | Affymetrix | 12011 | 6 | 166:82 | 120 |
| Lung | Affymetrix | 1741 | 5 | 135:68 | 100 |
| SRBC | cDNA | 2308 | 4 | 55:28 | 80 |
| MLL | Affymetrix | 8681 | 3 | 48:24 | 60 |
| AML/ALL | Affymetrix | 3571 | 3 | 48:24 | 60 |
N is the number of features after preprocessing. K is the number of classes in the dataset.
Figure 1. F-splits evaluation procedure at each value of P.
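The caption above names an F-splits procedure without giving its steps, so the following is only a minimal sketch of one plausible reading: for a fixed predictor set size P, draw F random training/test splits, select features on the training part alone, and average the test accuracies. The 2:1 split ratio (suggested by the training:test sizes in the dataset table), the unstratified shuffling, and the callables `select_features` and `fit_and_score` are all assumptions made for illustration, not the authors' implementation.

```python
import random
from statistics import mean


def f_splits_estimate(X, y, P, F, select_features, fit_and_score,
                      train_fraction=2 / 3, seed=0):
    """Average test accuracy over F random splits for one value of P.

    select_features(X_train, y_train, P) -> list of column indices (hypothetical)
    fit_and_score(X_train, y_train, X_test, y_test) -> accuracy (hypothetical)
    """
    rng = random.Random(seed)
    n = len(y)
    n_train = round(train_fraction * n)  # assumed 2:1 training:test ratio
    scores = []
    for _ in range(F):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        # Feature selection sees the training samples only, so the test
        # samples cannot influence the choice of the predictor set.
        cols = select_features(X_tr, y_tr, P)
        X_tr_p = [[row[c] for c in cols] for row in X_tr]
        X_te_p = [[X[i][c] for c in cols] for i in test]
        y_te = [y[i] for i in test]
        scores.append(fit_and_score(X_tr_p, y_tr, X_te_p, y_te))
    return mean(scores)
```

Repeating feature selection inside every split is the design choice that matters here: selecting genes once on the full dataset and only then splitting would let test samples shape the predictor set, which is one way the overly optimistic estimates mentioned in the abstract can arise.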
Table 2. Best estimate of accuracy.
| Dataset | | | Superior method |
| BRN | 94.3%, | 94.0%, | |
| BRN14 | 94.7%, | 94.5%, | |
| GCM | 80.2%, | 80.2%, | Equal |
| NCI60 | 68.0%, | 68.0%, | Equal |
| PDL | 98.4%, | 98.3%, | |
| Lung | 94.1%, | 93.8%, | |
| SRBC | 98.9%, | 98.9%, | Equal |
| MLL | 98.3%, | 97.9%, | |
| AML/ALL | 97.5%, | 97.1%, | |
Best estimate of accuracy from the two W scoring methods, obtained at P = Pmax, followed by the corresponding value of the DDP.
Figure 2. Estimate of accuracy at . Solid line: W scoring method, dashed line: W.
Table 3. Range of class accuracies.
| Dataset | | | Superior method |
| BRN | 0.80 | 0.80 | Equal |
| BRN14 | 0.80 | 0.71 | |
| GCM | 0.81 | 0.83 | |
| NCI60 | 0.68 | 0.85 | |
| PDL | 0.18 | 0.19 | |
| Lung | 0.35 | 0.32 | |
| SRBC | 0.03 | 0.03 | Equal |
| MLL | 0.05 | 0.05 | |
| AML/ALL | 0.05 | 0.06 | |
Range of class accuracies from the two W scoring methods, obtained using the values of DDP shown in Table 2 at P = P.
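The source does not spell out how the range of class accuracies is computed. A minimal sketch, assuming it is the difference between the highest and the lowest per-class accuracy on a test set (the function names are illustrative):

```python
from collections import defaultdict


def class_accuracies(y_true, y_pred):
    """Fraction of correctly classified samples within each true class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}


def range_of_class_accuracies(y_true, y_pred):
    """Assumed definition: best class accuracy minus worst class accuracy."""
    acc = class_accuracies(y_true, y_pred)
    return max(acc.values()) - min(acc.values())
```

Under this reading, a small range means the classifier treats all classes about equally well, whereas a value near 1 means at least one class is almost never recognised even if overall accuracy looks acceptable.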
Figure 3. Range of class accuracies at . Solid line: W scoring method, dashed line: W.
Table 4. Overall comparison of the two W scoring methods.
| Dataset | Superior method based on best estimate of accuracy | Superior method based on range of class accuracies | Overall superior method |
| BRN | Equal | ||
| BRN14 | Undecided | ||
| GCM | Equal | ||
| NCI60 | Equal | ||
| PDL | |||
| Lung | Undecided | ||
| SRBC | Equal | Equal | Equal |
| MLL | |||
| AML/ALL | | | |
Comparing the two W scoring methods through both best estimate of accuracy and range of class accuracies.
Table 5. Comparing the two W scoring methods through statistical tests.
| Dataset | A | C |
| BRN | 16 | 145 |
| BRN14 | 33 | 145 |
| GCM | 18 | 145 |
| NCI60 | 5 | 112 |
| PDL | 24 | 115 |
| Lung | 22 | 95 |
| SRBC | 10 | 75 |
| MLL | 7 | 55 |
| AML/ALL | 11 | 55 |
A is the number of times the null hypothesis that the two W scoring methods perform equally well is rejected in favour of one of them (binomial test). C is the number of times that the right-sided p-value associated with the Wilcoxon signed rank test statistic is below the significance level of 0.05. The maximum values of A and C are both Pmax - 1.
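As a rough illustration of the binomial test and the counting of rejections described above, the sketch below computes a right-sided binomial p-value and tallies how many p-values fall below 0.05. It assumes a sign-test setup in which each of n paired comparisons counts as a win for one method, with success probability 0.5 under the null hypothesis; that pairing and the function names are assumptions for illustration, not the authors' exact procedure.

```python
from math import comb


def right_sided_binomial_p(wins, n, p0=0.5):
    """P(X >= wins) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
               for k in range(wins, n + 1))


def count_rejections(p_values, alpha=0.05):
    """Number of tests whose p-value is strictly below the significance level."""
    return sum(p < alpha for p in p_values)
```

A count such as A or C would then come from applying one test per predictor set size and passing the resulting p-values to `count_rejections`. For the Wilcoxon signed rank counterpart, a statistics library is the practical route, for instance SciPy's `scipy.stats.wilcoxon` with `alternative="greater"`.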
Figure 4. Correlations of . A (left) and C (right) plotted against training set size, M, for all benchmark datasets. A and C are normalized by dividing by their maximum value, Pmax - 1.
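The normalization in this caption can be reproduced from the tabulated values. The sketch below assumes that the last column of the benchmark dataset table is Pmax and that the two numeric columns of the statistical-test table are A and C in that order; both column assignments are inferred, since the table headers did not survive.

```python
# dataset: (training set size M, Pmax, A, C), copied from the tables above
# under the column assumptions stated in the lead-in.
TABLE = {
    "BRN": (176, 150, 16, 145),
    "BRN14": (174, 150, 33, 145),
    "GCM": (144, 150, 18, 145),
    "NCI60": (40, 150, 5, 112),
    "PDL": (166, 120, 24, 115),
    "Lung": (135, 100, 22, 95),
    "SRBC": (55, 80, 10, 75),
    "MLL": (48, 60, 7, 55),
    "AML/ALL": (48, 60, 11, 55),
}


def normalize(count, p_max):
    """Divide a rejection count by its maximum possible value, Pmax - 1."""
    return count / (p_max - 1)


# dataset -> (M, normalized A, normalized C), ready to plot against M
normalized = {
    name: (m, normalize(a, p_max), normalize(c, p_max))
    for name, (m, p_max, a, c) in TABLE.items()
}
```

Dividing by Pmax - 1 puts datasets with different Pmax on a common 0 to 1 scale, which is what makes the comparison against training set size meaningful.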
Figure 5. Area under the accuracy-predictor set size curve.
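The source does not say how this area is computed. One common choice is the trapezoidal rule over the (predictor set size, accuracy) points, sketched below; the function name and the use of this particular rule are assumptions.

```python
def area_under_curve(sizes, accuracies):
    """Trapezoidal area under accuracy plotted against predictor set size.

    sizes must be increasing and the same length as accuracies.
    """
    if len(sizes) != len(accuracies) or len(sizes) < 2:
        raise ValueError("need at least two matching points")
    area = 0.0
    for i in range(1, len(sizes)):
        width = sizes[i] - sizes[i - 1]
        area += width * (accuracies[i] + accuracies[i - 1]) / 2
    return area
```

Summarizing a whole curve by its area rewards a method that is accurate across many predictor set sizes rather than at a single best size.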
Figure 6. Relationship between optimal value of the DDP and number of classes in the dataset. Optimal value of the DDP plotted against number of classes in the dataset for each of the two W scoring methods (left and right panels) for all benchmark datasets.
Table 6. Most frequently selected genes for the BRN dataset.
| Rank | Annotation | Remarks | Group |
| 1 | TYR tyrosinase (oculocutaneous albinism IA) | Identified as marker for skin tumor class in [12] | M |
| 2 | FLJ20624 **hypothetical protein FLJ20624 | Related to the gene PAK1, which is associated with pancreatic cancer [35] | 1 |
| 3 | DMXL1 Dmx-like 1 | Function still unknown, although high level of conservation suggests important roles [36] | 4 |
| 4 | CLDN4 claudin 4 | Identified as marker for ovarian, bladder, lung, and stomach tumor classes in [12] | M |
| 5 | TACSTD1 tumor-associated calcium signal transducer 1 | Identified as marker for stomach, pancreatic, lung, and breast cancer classes in [12] | M |
| 6 | M6PR mannose-6-phosphate receptor (cation dependent) | Defective function of M6PR leads to hepatocellular carcinoma [37] | 1 |
| 7 | PLG plasminogen | Identified as marker for stomach cancer class in [12] | M |
| 8 | SPINT2 serine protease inhibitor, Kunitz type, 2 | Found to be under-expressed in epithelial ovarian cancer patients [38] | 1 |
| 9 | SORD sorbitol dehydrogenase | Suppresses growth arrest induced by a p53 tumor mutant in fission yeast [39] | 2 |
| 10 | FGA fibrinogen, A alpha polypeptide | Mutation of FGA found in breast cancer patients [40] | 1 |
| 11 | BCL6 B-cell CLL/lymphoma 6 (zinc finger protein 51) | Deregulation of BCL6 found in diffuse large cell lymphoma [41] | 1 |
| 12 | APOH apolipoprotein H (beta-2-glycoprotein I) | Identified as marker for liver cancer class in [12] | M |
| 13 | PAX8 paired box gene 8 | Verified as marker for ovarian cancer in [12], also identified as marker for renal and breast cancer classes in [12] | M |
| 14 | APCS amyloid P component, serum | Produces normal circulating plasma protein that is deposited on amyloid fibrils | 4 |
| 15 | S100A1 S100 calcium-binding protein A1 | Identified as marker for breast, kidney, and ovary cancer classes in [12] | M |
| 16 | AMD1 S-adenosylmethionine decarboxylase 1 | Specifically up-regulated in B cell lymphoma [42] | 1 |
| 17 | FABP1 fatty acid binding protein 1, liver | Identified as marker for pancreatic cancer class in [12] | M |
| 18 | HLCS holocarboxylase synthetase (biotin- [proprionyl-Coenzyme A-carboxylase (ATP-hydrolysing)] ligase) | The enzyme holocarboxylase synthetase plays a role in gene regulation (determining whether genes are turned on or off) | 4 |
| 19 | ITIH3 pre-alpha (globulin) inhibitor, H3 polypeptide | Its product is predominantly transcribed in liver, and is involved in pathological conditions such as tumor invasion and metastasis [43] | 2 |
| 20 | KRT18 keratin 18 | Identified as marker for CNS (central nervous system) and stomach cancer classes in [12] | M |
| 21 | LGALS4 lectin, galactoside-binding, soluble, 4 (galectin 4) | Verified as marker for pancreatic cancer in [12], also identified as marker for stomach, liver, kidney, and breast cancer classes in [12] | M |
| 22 | HELO1 homolog of yeast long chain polyunsaturated fatty acid elongation enzyme 2 | Highly expressed in adrenal gland and testis (tissue-specific), probably involved in encoding the major histocompatibility complex, essential to human immune response [44] | 3 |
| 23 | EST | Unknown sequence | 4 |
| 24 | QKI homolog of mouse quaking QKI (KH domain RNA binding protein) | Specifically expressed in the central nervous system (CNS) | 3 |
| 25 | NDP Norrie disease (pseudoglioma) | Regulates neural cell proliferation and differentiation. Norrie disease (caused by mutation of NDP) is also accompanied by intraocular tumor [45] | 3 |
Top 25 genes ranked from the most frequently selected genes for the BRN dataset. Group M: identified as a marker or repressor of a specific tumor type in originating study. Group 1: identified as a marker or repressor of a specific tumor type in other studies. Group 2: known to either promote or inhibit tumor in general. Group 3: tissue-specific genes. Group 4: unknown sequences and genes with either still-unidentified function or general housekeeping roles.
Table 7. Most frequently selected genes for the BRN14 dataset.
| Rank | Annotation | Remarks | Group |
| 1 | FLJ20624 **hypothetical protein FLJ20624 | Gene #2 in the BRN dataset (Table 6) | 1 |
| 2 | M6PR mannose-6-phosphate receptor (cation dependent) | Gene #6 in the BRN dataset (Table 6) | 1 |
| 3 | PAX8 paired box gene 8 | Gene #13 in the BRN dataset (Table 6) | M |
| 4 | DMXL1 Dmx-like 1 | Gene #3 in the BRN dataset (Table 6) | 4 |
| 5 | PLG plasminogen | Gene #7 in the BRN dataset (Table 6) | M |
| 6 | LGALS4 lectin, galactoside-binding, soluble, 4 (galectin 4) | Gene #21 in the BRN dataset (Table 6) | M |
| 7 | APCS amyloid P component, serum | Gene #14 in the BRN dataset (Table 6) | 4 |
| 8 | GATA3 GATA-binding protein 3 | Verified as marker for breast cancer in [12], also identified as marker for bladder cancer class in [12] | M |
| 9 | TACSTD1 tumor-associated calcium signal transducer 1 | Gene #5 in the BRN dataset (Table 6) | M |
| 10 | FGA fibrinogen, A alpha polypeptide | Gene #10 in the BRN dataset (Table 6) | 1 |
| 11 | FABP1 fatty acid binding protein 1, liver | Gene #17 in the BRN dataset (Table 6) | M |
| 12 | SORD sorbitol dehydrogenase | Gene #9 in the BRN dataset (Table 6) | 2 |
| 13 | EST | Unknown sequence | 4 |
| 14 | DDOST dolichyl-diphosphooligosaccharide-protein glycosyltransferase | Identified as marker for testis cancer class in [12] | M |
| 15 | ITIH3 pre-alpha (globulin) inhibitor, H3 polypeptide | Gene #19 in the BRN dataset (Table 6) | 2 |
| 16 | KIAA0128 KIAA0128 protein; septin 2 | Up-regulated in renal cell carcinoma [46] | 1 |
| 17 | BCL6 B-cell CLL/lymphoma 6 (zinc finger protein 51) | Gene #11 in the BRN dataset (Table 6) | 1 |
| 18 | AMD1 S-adenosylmethionine decarboxylase 1 | Gene #16 in the BRN dataset (Table 6) | 1 |
| 19 | SERPINC1 serine (or cysteine) proteinase inhibitor, clade C (antithrombin), member 1 | Controls expression of oncogene for hepatocarcinoma [47] | 1 |
| 20 | APOH apolipoprotein H (beta-2-glycoprotein I) | Gene #12 in the BRN dataset (Table 6) | M |
| 21 | HLCS holocarboxylase synthetase (biotin- [proprionyl-Coenzyme A-carboxylase (ATP-hydrolysing)] ligase) | Gene #18 in the BRN dataset (Table 6) | 4 |
| 22 | QKI homolog of mouse quaking QKI (KH domain RNA binding protein) | Gene #24 in the BRN dataset (Table 6) | 3 |
| 23 | Homo sapiens mRNA for putative nuclear protein (ORF1-FL49) | Expressed in spinal cord (high tissue-specificity) | 3 |
| 24 | GRHPR glyoxylate reductase/hydroxypyruvate reductase | One of the partners of the BCL6 (see Gene #11 in Table 6) translocation in follicular lymphoma, which leads to higher risk of transformation into aggressive lymphoma [48] | 1 |
| 25 | HELO1 homolog of yeast long chain polyunsaturated fatty acid elongation enzyme 2 | Gene #22 in the BRN dataset (Table 6) | 3 |
Top 25 genes ranked from the most frequently selected genes for the BRN14 dataset. Group M: identified as a marker or repressor of a specific tumor type in originating study. Group 1: identified as a marker or repressor of a specific tumor type in other studies. Group 2: known to either promote or inhibit tumor in general. Group 3: tissue-specific genes. Group 4: unknown sequences and genes with either still-unidentified function or general housekeeping roles.
Table 8. Most frequently selected genes for the GCM dataset.
| Rank | Annotation | Remarks | Group |
| 1 | Human DNA sequence from clone 753P9 on chromosome Xq25-26.1. Contains the gene coding for Aminopeptidase P (EC 3.4.11.9, XAA-Pro/X-Pro/Proline/Aminoacylproline Aminopeptidase) and a novel gene. | Ranked #31 in the OVA marker list for the lymphoma class in [2] | M |
| 2 | Antigen, Prostate Specific, Alt. Splice Form 2 | Ranked #8 in the OVA marker list for the prostate cancer class in [2] | M |
| 3 | Galectin-4 | Ranked #1 in the OVA marker list for the colorectal cancer class in [2] | M |
| 4 | Homo sapiens mRNA for APCL protein, complete cds | Under-expressed in ovarian cancer, and thus a potential tumor suppressor gene in ovarian cancer [50] | 1 |
| 5 | Ins(1,3,4,5)P4-binding protein | Ranked #24 in the OVA marker list for the leukemia class in [2] | M |
| 6 | CARCINOEMBRYONIC ANTIGEN PRECURSOR | Ranked #4 in the OVA marker list for the colorectal cancer class in [2] | M |
| 7 | PMEL 17 PROTEIN PRECURSOR | Ranked #6 in the OVA marker list for the melanoma class in [2] | M |
| 8 | KLK1 Kallikrein 1 (renal/pancreas/salivary) | Ranked #18 in the OVA marker list for the prostate cancer class in [2] | M |
| 9 | EST: zt56g08.s1 Soares ovary tumor NbHOT Homo sapiens cDNA clone 726398 3', mRNA sequence. (from Genbank) | Ranked #5 in the OVA marker list for the mesothelioma class in [2] | M |
| 10 | EST: zr71g09.s1 Soares NhHMPu S1 Homo sapiens cDNA clone 668896 3', mRNA sequence. (from Genbank) | Ranked #5 in the OVA marker list for the CNS tumor class in [2] | M |
| 11 | Ribosomal protein S19 | Ranked #1 in the OVA marker list for the lymphoma class in [2] | M |
| 12 | PULMONARY SURFACTANT-ASSOCIATED PROTEIN B PRECURSOR | Ranked #3 in the OVA marker list for the lung cancer class in [2] | M |
| 13 | Mammaglobin 2 | A marker for breast cancer [51] | 1 |
| 14 | MRJ gene for a member of the DNAJ protein family | Associated with a tumor-transforming gene protein [52] | 2 |
| 15 | LPAP gene | Its product mediates proliferative and/or morphologic effects on ovarian cancer cells [53] | 1 |
| 16 | EST: zq49c07.s1 Stratagene hNT neuron (#937233) Homo sapiens cDNA clone 633036 3', mRNA sequence. (from Genbank) | Identified as a marker of CNS tumor in [54] | 1 |
| 17 | Eyes absent homolog (Eab1) mRNA | Up-regulated in epithelial ovarian cancer [55] | 1 |
| 18 | Pulmonary surfactant-associated protein SP-A (SFTP1) gene | Ranked #1 in the OVA marker list for the lung cancer class in [2] | M |
| 19 | EST: ab17g09.s1 Stratagene lung (#937210) Homo sapiens cDNA clone 841120 3' similar to contains LTR7.b2 LTR7 repetitive element ;, mRNA sequence. (from Genbank) | Ranked #10 in the OVA marker list for the lung cancer class in [2] | M |
| 20 | TUMOR-ASSOCIATED ANTIGEN CO-029 | Ranked #5 in the OVA marker list for the colorectal cancer class in [2] | M |
| 21 | MLANA Differentiation antigen melan-A | Ranked #4 in the OVA marker list for the melanoma class in [2] | M |
| 22 | LI-cadherin | Ranked #3 in the OVA marker list for the colorectal cancer class in [2] | M |
| 23 | Antigen, Prostate Specific, Alt. Splice Form 3 | Ranked #39 in the OVA marker list for the prostate cancer class in [2] | M |
| 24 | GPX2 Glutathione peroxidase 2, gastrointestinal | Found to play a role in colon cancer resistance [56] | 1 |
| 25 | Phosphodiesterase 9A | Ranked #28 in the OVA marker list for the prostate cancer class in [2] | M |
Top 25 genes ranked from the most frequently selected genes for the GCM dataset. Group M: identified as a marker or repressor of a specific tumor type in originating study. Group 1: identified as a marker or repressor of a specific tumor type in other studies. Group 2: known to either promote or inhibit tumor in general. Group 3: tissue-specific genes. Group 4: unknown sequences and genes with either still-unidentified function or general housekeeping roles.
Figure 7. Limits of ( in extreme situations of near-maximum and near-minimum redundancy. (U)1- and (V) plotted against α and ρ respectively and (V) plotted against the DDP in the cases where (a) redundancy is close to the theoretical maximum, and (b) redundancy is close to the theoretical minimum, and (c) magnification of plot (b) for (U)1- and (V).