Anirban Mukhopadhyay, Sanghamitra Bandyopadhyay, Ujjwal Maulik.
Abstract
Advances in microarray technology make it possible to study the expression profiles of thousands of genes simultaneously across different experimental conditions or tissue samples. Microarray cancer datasets, organized as sample-by-gene matrices, are used to classify tissue samples as benign or malignant, or into cancer subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which aids the diagnosis of particular cancer types. In this article, we present an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. The technique uses a real-coded encoding of the cluster centers and optimizes cluster compactness and separation simultaneously. The resulting near-Pareto-optimal set contains a number of non-dominated solutions. We propose a novel approach that combines the clustering information in these non-dominated solutions through a Support Vector Machine (SVM) classifier. The final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method is compared with that of several other microarray clustering algorithms on three publicly available benchmark cancer datasets, and statistical significance tests are conducted to establish its superiority. Furthermore, relevant gene markers are identified from the clustering result produced by the proposed method and demonstrated visually, and the biological relationships among the gene markers are studied based on gene ontology. The results are promising and may have an important impact on unsupervised cancer classification as well as on gene marker identification for multiple cancer subtypes.
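The notion of a non-dominated solution used in the abstract can be illustrated with a minimal sketch. It assumes each candidate clustering is summarized by a hypothetical pair `(compactness, separation)`, with compactness minimized and separation maximized; the actual objective functions and encoding are defined in the paper and are not reproduced here.

```python
from typing import List, Tuple

# Hypothetical representation: (compactness, separation) for one clustering.
Solution = Tuple[float, float]


def dominates(a: Solution, b: Solution) -> bool:
    """Return True if a is no worse than b on both objectives and
    strictly better on at least one (lower compactness, higher separation)."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def non_dominated(solutions: List[Solution]) -> List[Solution]:
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Made-up objective values, for illustration only.
candidates = [(1.0, 5.0), (2.0, 6.0), (2.5, 5.5), (3.0, 7.0), (1.5, 4.0)]
front = non_dominated(candidates)
print(front)
```

Each solution kept in `front` represents a different trade-off between the two objectives, which is why a further step is needed to combine them into one final clustering.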
Year: 2010 PMID: 21103052 PMCID: PMC2980474 DOI: 10.1371/journal.pone.0013803
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The comparison of the average ARI and %CA scores produced by 50 consecutive runs of MOGASVM with ensemble of kernel functions and MOGASVM with individual kernel functions for all the datasets.
| Algorithms | SRBCT ARI | SRBCT %CA | Adult malignancy ARI | Adult malignancy %CA | Brain tumor ARI | Brain tumor %CA |
| MOGASVM | 0.5126 | 76.6412 | 0.8172 | 96.4718 | 0.7172 | 88.5150 |
| MOGASVM (linear) | 0.4726 | 74.7926 | 0.7591 | 95.8244 | 0.6836 | 87.5836 |
| MOGASVM (polynomial) | 0.4682 | 74.5343 | 0.7238 | 94.7375 | 0.6927 | 88.0116 |
| MOGASVM (sigmoidal) | 0.4816 | 76.0284 | 0.7704 | 95.7581 | 0.6734 | 87.2046 |
| MOGASVM (RBF) | 0.4855 | 76.2891 | 0.7926 | 96.2183 | 0.7025 | 88.1173 |
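For readers unfamiliar with the ARI column, the following is a minimal sketch of the adjusted Rand index in its standard pair-counting form. It is assumed, not confirmed by this page, that the paper uses this standard definition; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(true_labels: Sequence[Hashable],
                        pred_labels: Sequence[Hashable]) -> float:
    """Adjusted Rand index between a reference partition and a clustering."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        raise ValueError("at least two samples are required")

    # Contingency counts: pairs agreeing within cells, rows and columns.
    cells = Counter(zip(true_labels, pred_labels))
    rows = Counter(true_labels)
    cols = Counter(pred_labels)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (e.g. both partitions are a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels: the same partition under a different cluster numbering.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index does not depend on how clusters are numbered, so an unsupervised result can be compared with the known tumor classes without first matching cluster IDs to class names.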
The average ARI and %CA scores produced by 50 consecutive runs of different algorithms for the SRBCT data.
| Algorithms | ARI | %CA |
| MOGASVM | 0.5126 | 76.6412 |
| K-means | 0.3135 | 70.1903 |
| EM | 0.3376 | 71.1295 |
| SGA | 0.3198 | 70.8193 |
| Avg. linkage | 0.1021 | 49.0527 |
| SOM | 0.3872 | 71.7845 |
| SiMM-TS | 0.4628 | 74.4853 |
| CSPA | 0.3922 | 72.0297 |
| HGPA | 0.2839 | 67.4533 |
| MCLA | 0.3902 | 71.9764 |
The average ARI and %CA scores produced by 50 consecutive runs of different algorithms for the Adult malignancy data.
| Algorithms | ARI | %CA |
| MOGASVM | 0.8172 | 96.4718 |
| K-means | 0.6924 | 92.5441 |
| EM | 0.7251 | 94.7294 |
| SGA | 0.7491 | 95.7858 |
| Avg. linkage | 0.6190 | 93.0437 |
| SOM | 0.5917 | 92.8100 |
| SiMM-TS | 0.7823 | 96.0139 |
| CSPA | 0.7331 | 95.0801 |
| HGPA | 0.7192 | 94.0549 |
| MCLA | 0.7398 | 95.2813 |
The average ARI and %CA scores produced by 50 consecutive runs of different algorithms for the Brain tumor data.
| Algorithms | ARI | %CA |
| MOGASVM | 0.7172 | 88.5150 |
| K-means | 0.5764 | 84.5144 |
| EM | 0.5581 | 83.1457 |
| SGA | 0.6325 | 87.1433 |
| Avg. linkage | 0.4603 | 78.2811 |
| SOM | 0.6214 | 87.0376 |
| SiMM-TS | 0.6892 | 87.9110 |
| CSPA | 0.6028 | 85.9984 |
| HGPA | 0.5295 | 83.9416 |
| MCLA | 0.5974 | 86.4543 |
Figure 1. Boxplots showing the index scores produced by the different algorithms over 50 consecutive runs for the SRBCT dataset.
Figure 2. Boxplots showing the index scores produced by the different algorithms over 50 consecutive runs for the Adult malignancy dataset.
Figure 3. Boxplots showing the index scores produced by the different algorithms over 50 consecutive runs for the Brain tumor dataset.
The P-values produced by t-test comparing the mean values of the %CA index of MOGASVM with those of the other algorithms.
| Datasets | K-means | EM | SGA | Avg. Link | SOM | SiMM-TS | CSPA | HGPA | MCLA |
| SRBCT | 3.1E-07 | 2.17E-07 | 2.41E-03 | 1.08E-06 | 6.5E-05 | 5.32E-03 | 3.23E-04 | 6.38E-06 | 2.94E-04 |
| Adult malignancy | 2.21E-05 | 1.67E-08 | 3.4E-05 | 4.52E-12 | 1.44E-04 | 2.53E-03 | 7.2E-04 | 2.3E-06 | 1.4E-04 |
| Brain tumor | 3.42E-05 | 7.43E-08 | 5.8E-05 | 2.7E-07 | 2.1E-05 | 1.4E-04 | 8.92E-05 | 6.2E-06 | 9.3E-05 |
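As a rough illustration of how two sets of per-run %CA scores can be compared, the sketch below computes a two-sample t-statistic. The page does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the score lists are made-up values rather than data from the paper. Converting the statistic into a P-value additionally requires the t-distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each sample needs at least two values")
    # Squared standard errors of the two sample means.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical %CA scores from a handful of runs (not the paper's data).
method_a_ca = [76.1, 77.0, 76.8, 76.3, 77.2]
method_b_ca = [70.4, 69.8, 70.9, 70.1, 69.7]
t_stat, dof = welch_t(method_a_ca, method_b_ca)
print(t_stat, dof)
```

A large positive statistic means the first method's mean %CA is well above the second's relative to the run-to-run variation, which corresponds to a small P-value.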
Figure 4. Heatmap of the expression levels of the 10 most frequently selected gene markers for each tumor subtype in the SRBCT data.
Red/green represents up/down regulation relative to black. Each subgroup is in a yellow box to identify its samples and the distinguishing gene markers. The image clone IDs of the marker genes are also shown on the right side of the genes.
The gene markers in the SRBCT data for the EWS class, their Image IDs, symbols, selection frequencies, descriptions and up/down regulation natures.
| Gene Image ID | Symbol | Frequency % | Description | Up/Down |
| 782811 | HMGA1 | 100 | high-mobility group (nonhistone chromosomal) protein isoforms I and Y | Down |
| 796646 | ODC1 | 100 | ornithine decarboxylase 1 | Down |
| 810899 | CKS1B | 96 | CDC28 protein kinase regulatory subunit 1B | Down |
| 745138 | TUBA3D | 96 | tubulin, alpha | Down |
| 30093 | RANBP1 | 90 | RAN binding protein | Down |
| 866702 | PTPN13 | 100 | protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase) | Up |
| 811028 | TMEM49 | 98 | transmembrane protein 49 | Up |
| 505491 | PTTG1IP | 98 | pituitary tumor-transforming 1 interacting protein | Up |
| 470261 | SMA4 | 94 | glucuronidase, beta pseudogene | Up |
| 814260 | KDSR | 92 | 3-ketodihydrosphingosine reductase | Up |
The gene markers in the SRBCT data for the NB class, their Image IDs, symbols, selection frequencies, descriptions and up/down regulation natures.
| Gene Image ID | Symbol | Frequency % | Description | Up/Down |
| 207274 | IGF2 | 100 | Human DNA for insulin-like growth factor II (IGF-2); exon 7 and additional ORF | Down |
| 563673 | ALDH7A1 | 100 | aldehyde dehydrogenase 7 family, member A1 | Down |
| 1416782 | CKB | 100 | creatine kinase, brain | Down |
| 296448 | IGF2 | 96 | insulin-like growth factor 2 (somatomedin A) | Down |
| 250654 | SPARC | 92 | secreted protein, acidic, cysteine-rich (osteonectin) | Down |
| 812965 | MYC | 100 | v-myc avian myelocytomatosis viral oncogene homolog | Up |
| 344134 | IGLL1 | 100 | immunoglobulin lambda-like polypeptide | Up |
| 840942 | HLA-DPB1 | 94 | major histocompatibility complex, class II, DP beta | Up |
| 868304 | ACTA2 | 94 | actin, alpha 2, smooth muscle, aorta | Up |
| 745343 | REG1A | 90 | regenerating islet-derived 1 alpha (pancreatic stone protein, pancreatic thread protein) | Up |
The gene markers in the SRBCT data for the BL class, their Image IDs, symbols, selection frequencies, descriptions and up/down regulation natures.
| Gene Image ID | Symbol | Frequency % | Description | Up/Down |
| 784224 | FGFR4 | 100 | fibroblast growth factor receptor | Down |
| 365826 | GAS1 | 98 | growth arrest-specific | Down |
| 810057 | CSDA | 98 | cold shock domain protein A | Down |
| 839552 | NCOA1 | 94 | nuclear receptor coactivator | Down |
| 244618 | FNDC5 | 94 | fibronectin type III domain containing 5 | Down |
| 878652 | PCOLCE | 100 | procollagen C-endopeptidase enhancer | Up |
| 327350 | HNRNPA2B1 | 100 | heterogeneous nuclear ribonucleoprotein A2/B1 | Up |
| 824041 | SFRS9 | 100 | splicing factor, arginine/serine-rich | Up |
| 950574 | H3F3B | 96 | H3 histone, family 3B (H3.3B) | Up |
| 812105 | MLLT11 | 92 | myeloid/lymphoid or mixed-lineage leukemia (trithorax homolog, Drosophila); translocated to, 11 | Up |
The gene markers in the SRBCT data for the RMS class, their Image IDs, symbols, selection frequencies, descriptions and up/down regulation natures.
| Gene Image ID | Symbol | Frequency % | Description | Up/Down |
| 627939 | CSRP3 | 100 | cysteine and glycine-rich protein 3 (cardiac LIM protein) | Down |
| 52076 | OLFM1 | 100 | olfactomedin-related ER localized protein | Down |
| 781097 | RTN3 | 100 | neurotrophic tyrosine kinase, receptor-related | Down |
| 841620 | DPYSL2 | 98 | dihydropyrimidinase-like | Down |
| 377461 | CAV1 | 98 | caveolin 1, caveolae protein, 22kD | Down |
| 878798 | B2M | 100 | beta-2-microglobulin | Up |
| 770394 | FCGRT | 98 | Fc fragment of IgG, receptor, transporter, alpha | Up |
| 263716 | COL6A1 | 98 | collagen, type VI, alpha | Up |
| 461425 | MYL4 | 96 | myosin light chain 4 | Up |
| 298062 | TNNT2 | 96 | troponin T2, cardiac | Up |
The significant GO terms shared by the gene markers in the SRBCT data for the EWS class. Level refers to the GO Level.
| Level | GO term | Module % | Genome % |
| 3 | cellular component organization and biogenesis (GO:0016043) | 50.0 | 18.3 |
| 4 | transport (GO:0006810) | 42.86 | 18.33 |
| 3 | multicellular organismal development (GO:0007275) | 25.0 | 15.83 |
| 3 | nitrogen compound metabolic process (GO:0006807) | 12.5 | 3.24 |
| 3 | protein localization (GO:0008104) | 12.5 | 5.28 |
| 4 | carbohydrate metabolic process (GO:0005975) | 14.29 | 3.72 |
| 4 | amino acid and derivative metabolic process (GO:0006519) | 14.29 | 2.48 |
| 6 | DNA replication (GO:0006260) | 14.29 | 1.7 |
| 6 | biogenic amine metabolic process (GO:0006576) | 14.29 | 0.52 |
Module % is the percentage of the genes involved in the particular GO term among the gene markers. Genome % is the percentage of genes involved in the particular GO term among the complete genome.
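The two percentages defined above amount to a simple set computation, sketched here under stated assumptions: the gene identifiers and the annotation set are hypothetical toy data, and `percent_annotated` is an illustrative helper rather than part of any GO tool.

```python
from typing import Iterable, Set


def percent_annotated(genes: Iterable[str], annotated: Set[str]) -> float:
    """Percentage of `genes` that carry a given GO term annotation."""
    gene_set = set(genes)
    if not gene_set:
        raise ValueError("gene list is empty")
    return 100.0 * len(gene_set & annotated) / len(gene_set)


# Hypothetical toy data: a 20-gene "genome" and 8 marker genes.
genome = [f"g{i}" for i in range(1, 21)]
markers = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
term_genes = {"g1", "g2", "g3", "g4", "g20"}  # genes annotated with one term

module_pct = percent_annotated(markers, term_genes)  # share among markers
genome_pct = percent_annotated(genome, term_genes)   # share among all genes
print(module_pct, genome_pct, module_pct > genome_pct)
```

A term is of interest when its Module % clearly exceeds its Genome %, as with the first EWS row in the table above (50.0 versus 18.3).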
The significant GO terms shared by the gene markers in the SRBCT data for the NB class.
| Level | GO term | Module % | Genome % |
| 3 | cell proliferation (GO:0008283) | 42.86 | 5.46 |
| 3 | immune response (GO:0006955) | 28.57 | 5.38 |
| 3 | cell cycle (GO:0007049) | 28.57 | 6.23 |
| 3 | response to abiotic stimulus (GO:0009628) | 14.29 | 1.02 |
| 3 | antigen processing and presentation (GO:0019882) | 14.29 | 0.84 |
| 3 | tissue remodeling (GO:0048771) | 14.29 | 0.75 |
| 6 | regulation of progression through cell cycle (GO:0000074) | 40.0 | 4.55 |
| 7 | transmembrane receptor protein tyrosine kinase signaling pathway (GO:0007169) | 66.67 | 1.9 |
Level refers to the GO Level. Module % is the percentage of the genes involved in the particular GO term among the gene markers. Genome % is the percentage of genes involved in the particular GO term among the complete genome.
The significant GO terms shared by the gene markers in the SRBCT data for the BL class.
| Level | GO term | Module % | Genome % |
| 3 | response to stress (GO:0006950) | 69.72 | 7.24 |
| 4 | mitotic cell cycle (GO:0000278) | 16.67 | 2.04 |
| 6 | regulation of progression through cell cycle (GO:0000074) | 16.67 | 4.45 |
| 8 | S phase of mitotic cell cycle (GO:0000084) | 20.0 | 0.14 |
| 9 | RNA splicing, via transesterification reactions with bulged adenosine as nucleophile (GO:0000377) | 25.0 | 1.83 |
Level refers to the GO Level. Module % is the percentage of the genes involved in the particular GO term among the gene markers. Genome % is the percentage of genes involved in the particular GO term among the complete genome.
The significant GO terms shared by the gene markers in the SRBCT data for the RMS class.
| Level | GO term | Module % | Genome % |
| 3 | circulation (GO:0008015) | 25.0 | 1.07 |
| 3 | antigen processing and presentation (GO:0019882) | 25.0 | 0.84 |
| 3 | multicellular organismal development (GO:0007275) | 50.0 | 15.83 |
| 3 | anatomical structure development (GO:0048856) | 50.0 | 14.44 |
| 3 | cellular developmental process (GO:0048869) | 50.0 | 15.58 |
| 3 | regulation of biological quality (GO:0065008) | 25.0 | 4.14 |
| 3 | immune response (GO:0006955) | 25.0 | 5.38 |
| 4 | endothelial cell proliferation (GO:0001935) | 12.5 | 0.09 |
| 4 | homeostatic process (GO:0042592) | 25.0 | 2.73 |
| 4 | cell differentiation (GO:0030154) | 50.0 | 16.04 |
| 6 | regulation of endothelial cell proliferation (GO:0001936) | 20.0 | 0.09 |
| 6 | cardiac inotropy (GO:0002026) | 20.0 | 0.05 |
| 6 | muscle development (GO:0007517) | 40.0 | 1.47 |
| 6 | sterol transport (GO:0015918) | 20.0 | 0.06 |
| 6 | glycerolipid metabolic process (GO:0046486) | 20.0 | 0.19 |
| 7 | negative regulation of endothelial cell proliferation (GO:0001937) | 25.0 | 0.07 |
| 7 | cholesterol transport (GO:0030301) | 25.0 | 0.08 |
| 7 | regulation of nitric oxide biosynthetic process (GO:0045428) | 25.0 | 0.09 |
| 7 | cardiac muscle development (GO:0048738) | 25.0 | 0.05 |
| 9 | protein oligomerization (GO:0051259) | 66.67 | 1.22 |
Level refers to the GO Level. Module % is the percentage of the genes involved in the particular GO term among the gene markers. Genome % is the percentage of genes involved in the particular GO term among the complete genome.
The performance (%CA) of the clustering algorithms on the SRBCT dataset with the initially selected 200 genes, the marker genes selected using the t-statistic and the marker genes selected using the SNR statistic.
| Algorithms | %CA, initially selected 200 genes | %CA, markers selected by the t-statistic | %CA, markers selected by the SNR statistic |
| MOGASVM | 76.6412 | 85.8293 | 90.3781 |
| K-means | 70.1903 | 80.3772 | 85.3914 |
| EM | 71.1295 | 82.3371 | 86.8934 |
| SGA | 70.8193 | 81.1823 | 86.4927 |
| Avg. linkage | 49.0527 | 70.2947 | 76.9837 |
| SOM | 71.7845 | 82.7845 | 86.9833 |
| SiMM-TS | 74.4853 | 84.9648 | 89.1397 |
| CSPA | 72.0297 | 83.2983 | 88.2286 |
| HGPA | 67.4533 | 77.8447 | 83.9824 |
| MCLA | 71.9764 | 83.1845 | 87.9411 |
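The SNR statistic is not defined on this page. The sketch below assumes the form commonly used for microarray data, the difference of class means divided by the sum of class standard deviations, computed for one class against the rest. The gene names, expression values and helper functions are all hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Set


def snr(in_class: Sequence[float], rest: Sequence[float]) -> float:
    """Signal-to-noise ratio of one gene: (mu1 - mu2) / (sigma1 + sigma2)."""
    denom = stdev(in_class) + stdev(rest)
    if denom == 0:
        raise ValueError("expression is constant in both groups")
    return (mean(in_class) - mean(rest)) / denom


def rank_genes(expr: Dict[str, List[float]], class_idx: Set[int]) -> List[str]:
    """Order genes by decreasing |SNR| for the samples in `class_idx`."""
    scores = {}
    for gene, values in expr.items():
        inside = [v for i, v in enumerate(values) if i in class_idx]
        outside = [v for i, v in enumerate(values) if i not in class_idx]
        scores[gene] = snr(inside, outside)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Toy expression values for six samples; samples 0-2 form the class of interest.
expr = {
    "gA": [5, 6, 7, 1, 2, 3],   # up-regulated in the class
    "gB": [1, 2, 3, 1, 2, 3],   # no difference
    "gC": [1, 2, 3, 5, 6, 9],   # down-regulated in the class
}
ranking = rank_genes(expr, {0, 1, 2})
print(ranking)
```

The sign of the score indicates up- or down-regulation in the class, so ranking by absolute value surfaces markers of both kinds.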