Yuk Yee Leung, Chun Qi Chang, Yeung Sam Hung.
Abstract
BACKGROUND: A hybrid approach to gene selection and classification is common because the results are generally better than those obtained by performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). The outlier detection algorithms proposed so far are either not suitable for microarray data or address outlier detection only in isolation.
Year: 2012 PMID: 23082127 PMCID: PMC3474777 DOI: 10.1371/journal.pone.0046700
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of the six binary-class microarray datasets.
| Dataset | No. of samples | No. of genes | References |
| | 38 ALL, 34 AML | 7129 | |
| | 40 cancer, 22 normal | 2000 | |
| | 25 ER+, 24 ER− | 7129 | |
| | 58 DLBCL, 19 FL | 7129 | |
| | 50 normal, 52 cancer | 12000 | |
| | 150 ADCA, 31 MPM | 12600 | |
Design of our synthetic datasets.
| Dataset | No. of samples | No. of genes | No. of mislabeled samples |
| | 15 Class 1, 15 Class 2 | 10000 | 4 |
| | 15 Class 1, 15 Class 2 | 10000 | 6 |
| | 15 Class 1, 15 Class 2 | 10000 | 10 |
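The synthetic design above pairs a balanced two-class label vector with a fixed number of mislabeled samples. A minimal sketch of how such label noise could be injected is shown below. The record does not state how the mislabeled samples were chosen, so the uniform random choice, the function name `flip_labels`, and the `seed` parameter are all assumptions made for illustration.

```python
import random


def flip_labels(labels, n_flip, seed=0):
    """Return a copy of two-class labels (1 or 2) with n_flip entries swapped.

    Assumption: samples to mislabel are drawn uniformly at random,
    without balancing the flips between the two classes.
    """
    rng = random.Random(seed)
    flipped = sorted(rng.sample(range(len(labels)), n_flip))
    noisy = list(labels)
    for i in flipped:
        noisy[i] = 2 if noisy[i] == 1 else 1
    return noisy, flipped


# 15 Class 1 and 15 Class 2 samples with 4 mislabeled, as in the first row
labels = [1] * 15 + [2] * 15
noisy, flipped = flip_labels(labels, n_flip=4)
print("mislabeled sample indices:", flipped)
```

Keeping the list of flipped indices makes it possible to score an outlier detector against the known ground truth afterwards.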
Figure 1. N-MFMW model in an external LOOCV framework.
Figure 2. MFMW-outlier: Integrating outlier detection into N-MFMW model with external LOOCV.
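Both figures rely on an external leave-one-out cross-validation (LOOCV) loop, in which gene selection is repeated inside every fold so that the held-out sample never influences which genes are chosen. The sketch below illustrates that idea only. It is not the N-MFMW or MFMW-outlier model: the signal-to-noise filter is written in one common form, the nearest-centroid classifier stands in for the wrappers, and all function names and the `n_top` parameter are hypothetical.

```python
from statistics import mean, pstdev


def snr_scores(X, y):
    """Per-gene signal-to-noise ratio: |mean0 - mean1| / (sd0 + sd1)."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        denom = pstdev(a) + pstdev(b)
        scores.append(abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0)
    return scores


def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assumed stand-in classifier: pick the class with the closest centroid."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        dist = sum((x[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def external_loocv(X, y, n_top=1):
    """Hold out each sample in turn; select genes on the remaining samples only."""
    preds = []
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        scores = snr_scores(X_train, y_train)
        genes = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:n_top]
        preds.append(nearest_centroid_predict(X_train, y_train, X[i], genes))
    return preds


# Toy data: gene 0 separates the classes, genes 1 and 2 are noise
X = [[1.0, 0.3, 2.0], [1.2, 0.9, 2.1], [0.9, 0.5, 1.9],
     [5.0, 0.4, 2.0], [5.1, 0.8, 2.2], [4.8, 0.6, 1.8]]
y = [0, 0, 0, 1, 1, 1]
preds = external_loocv(X, y, n_top=1)
print(preds)
```

A sample whose held-out prediction disagrees with its recorded label is a natural candidate for closer inspection, although the criterion used by MFMW-outlier itself is not given in this record.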
Sample(s) removed as outliers in each iteration of MFMW-outlier for all the six microarray datasets.
| Dataset | Iteration | Samples left (#) | Suspected outlier (sample ID) |
| | 1st | 72 | 66 |
| | 2nd | 71 | |
| | 1st | 62 | T33, T36, T37, N20 |
| | 2nd | 58 | T2, T30 |
| | 3rd | 56 | N2, N8, N18 |
| | 4th | 53 | |
| | 1st | 49 | Marks206, Marks213, Nevins24, Nevins26, Marks219, Marks220 |
| | 2nd | 43 | Marks204, Marks216, Nevins21 |
| | 3rd | 41 | |
| | 1st | 77 | DLBC26, FSCC12, FSCC13, FSCC16 |
| | 2nd | 73 | DLBC29, DLBC36, FSCC18 |
| | 3rd | 70 | |
| | 1st | 102 | N35_normal, N38_normal, T39_tumor, T49_tumor, T54_tumor |
| | 2nd | 97 | N06_normal, T17_tumor, T37_tumor |
| | 3rd | 94 | |
| | 1st | 181 | |
|
List of outliers detected by different proposed methods on COL.
| Sample ID | Original [2] | CL-Stability [8] | PRAPIV [25] | FOSD [27] | MFMW-outlier |
| T2 | Y | Y | - | Y | Y |
| T30 | Y | Y | Y | Y | Y |
| T33 | Y | Y | Y | Y | Y |
| T36 | Y | Y | Y | Y | Y |
| T37 | Y | - | Y | Y | Y |
| N8 | Y | - | Y | Y | Y |
| N12 | Y | - | - | - | Y |
| N34 | Y | Y | Y | Y | Y |
| N36 | Y | Y | Y | Y | Y |
| Others | NA | N2, N28 | N2, N28 | N2, N28 | NA |
List of outliers detected by different proposed methods on BRE.
| Sample ID | Original [3] | CL-Stability [8] | PRAPIV [25] | FOSD [27] | MFMW-outlier |
| Nevins21 | Y | - | - | - | Y |
| Nevins24 | Y | Y | - | Y | Y |
| Nevins26 | Y | Y | Y | Y | Y |
| Marks204 | Y | Y | Y | Y | Y |
| Marks206 | Y | - | - | - | Y |
| Marks213 | Y | - | Y | Y | Y |
| Marks216 | Y | Y | Y | Y | Y |
| Marks219 | Y | Y | - | Y | Y |
| Marks220 | Y | - | - | - | Y |
| Others | NA | 47 | 19 | NA | NA |
Final stable set of genes (gene symbols shown) obtained from performing MFMW-outlier (after removal of outliers) on six microarray datasets.
| | | | | | |
| CST3 | VIP | UBE3A | HLA-A | HPN | KLK3 |
| MGST3 | GSTM4 | DSC3 | HMGA1 | LMO3 | PTRF |
| PSMB8 | ETV1 | JTV-1 | NELL2 | SERPINH1 | |
| MYB | ENO1 | MTHFD2 | | | |
| TCRB | | | | | |
Comparison of the mean precision and recall values on the synthetic datasets.
| | Test 1 | | Test 2 | | Test 3 | |
| | PRAPIV | MFMW-outlier | PRAPIV | MFMW-outlier | PRAPIV | MFMW-outlier |
| | 83.41% | 98.15% | 76.44% | 96.39% | 54.84% | 96.86% |
| | 91.50% | 98.91% | 83.33% | 96.77% | 59.00% | 97.34% |
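For reference, precision and recall in an outlier detection setting can be computed from the set of samples flagged by a method and the set of samples known to be mislabeled. The sketch below uses the standard definitions. The function name, the sample IDs, and the convention of returning 0.0 for an empty set are illustrative assumptions, and the averaging used to obtain the mean values in the table is not described in this record.

```python
def precision_recall(detected, true_outliers):
    """Precision = TP / (no. detected); recall = TP / (no. truly mislabeled)."""
    detected, true_outliers = set(detected), set(true_outliers)
    tp = len(detected & true_outliers)  # correctly flagged samples
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_outliers) if true_outliers else 0.0
    return precision, recall


# Hypothetical run: three of the four flagged samples are truly mislabeled
p, r = precision_recall(detected=["s1", "s2", "s3", "s9"],
                        true_outliers=["s1", "s2", "s3", "s4"])
print(f"precision={p:.2%}, recall={r:.2%}")  # precision=75.00%, recall=75.00%
```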
Comparison of the mean precision and recall values on flipped microarray datasets.
| | Reduced- | | Reduced- | |
| | PRAPIV | MFMW-outlier | PRAPIV | MFMW-outlier |
| | 71.61% | 96.87% | 88.44% | 95.49% |
| | 87.65% | 98.28% | 91.46% | 94.54% |
Presence or absence of biologically significant genes as selected by different filters (n = 200).
| | | SNR | TS | AUC |
| | | Y | N | Y |
| | | Y | Y | Y |
| | | Y | Y | N |
| | | Y | Y | Y |
| | | Y | Y | Y |
| | | Y | Y | Y |
| | | N | N | Y |
| | | Y | Y | N |
| | | N | N | Y |
| | | Y | Y | Y |
| | | N | Y | Y |
| | | Y | Y | Y |
| | | Y | Y | Y |
| | | Y | Y | Y |
MFMW-outlier results obtained from using three filters (n = 200) and two wrappers for LYM dataset.
| Wrappers | # of genes | # of subsets |
| WV+ | 8 | 1 |
| WV+SVM | 8 | 1 |
| WV+NB | 8 | 2 |
| | 8 | 20 |
| | 8 | 18 |
| SVM+NB | 6 | 0 |
MFMW-outlier results obtained from using three filters (n = 200) and three/four wrappers for LYM dataset.
| Wrappers | # of genes | # of subsets |
| WV+ | 6 | 1 |
| WV+ | 6 | 2 |
| WV+SVM+NB | 6 | 4 |
| | 6 | 2 |
| WV+ | 8 | 3 |