Vasily Sachnev, Saras Saraswathi, Rashid Niaz, Andrzej Kloczkowski, Sundaram Suresh.
Abstract
BACKGROUND: Traditional cancer treatments have centered on cytotoxic drugs and general-purpose chemotherapy that may not be tailored to treat specific cancers. Identification of molecular markers that are related to different types of cancers might lead to the discovery of drugs that are patient- and disease-specific. This study aims to use microarray gene expression cancer data to identify biomarkers that are indicative of different types of cancers. Our aim is to provide a multi-class cancer classifier that can simultaneously differentiate between cancers and identify type-specific biomarkers, through the application of the Binary Coded Genetic Algorithm (BCGA) and a neural network based Extreme Learning Machine (ELM) algorithm.
Year: 2015 PMID: 25986937 PMCID: PMC4448565 DOI: 10.1186/s12859-015-0565-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Gene expression for 92 features selected by BCGA-ELM from the GCM dataset (for one of the validation sets of 46 samples). The horizontal bars for each of the 14 cancer types show differentiated gene expression, notably for Lymphoma, Leukemia and CNS, where broad horizontal bars separating the cancer types are distinctly visible. The x-axis represents the 92 genes and the y-axis represents the samples for each type of cancer (see Figure 3 and Additional file 1: Table S2 for gene names and descriptions).
Classification accuracies on four multi-class cancer data sets (GCM, Breast, Leukemia, Lymphoma) and six binary data sets (CNS, colon, DLBCL, GCM, lung, prostate) show that the performance of BCGA-ELM is superior and consistent across all these data sets. GCM multi-class has an accuracy of 95.4%, which is at least 21.6% higher than other methods given in the literature (although some of them use very small sets of genes).

| Data set | GCM (multi-class) | Breast (multi-class) | Leukemia (multi-class) | Lymphoma (multi-class) | CNS (binary) | Colon (binary) | DLBCL (binary) | GCM (binary) | Lung (binary) | Prostate (binary) |
|---|---|---|---|---|---|---|---|---|---|---|
| #Genes, initial set | 16063 | 1213 | 999 | 4026 | 7129 | 2000 | 7129 | 16063 | 12533 | 12600 |
| #Genes, BCGA-ELM | 92 | 30 | 11 | 27 | 17 | 27 | 18 | 73 | 11 | 72 |
| #Samples | 198 | 49 | 38 | 96 | 34 | 62 | 77 | 280 | 181 | 102 |
| #Classes | 14 | 4 | 3 | 5 | 2 | 2 | 2 | 2 | 2 | 2 |
| Accuracy (%) | | | | | | | | | | |
| (*σ² = 0.00083) | | | | | | | | | | |
| LibSVM-linear | 78.9 | 100 | 100 | 91.9 | 100 | 91.9 | 100 | 99.1 | 95.6 | 97.1 |
| RBF Network | 69.8 | 100 | 100 | 82.3 | 98.7 | 79.1 | 96.2 | 85.4 | 96.7 | 93.6 |
| SMO | 83.3 | 100 | 100 | 93.6 | 98.7 | 89.7 | 98.7 | 98.7 | 95.0 | 97.1 |
| Naïve Bayes | 78.6 | 100 | 97.1 | 72.6 | 93.5 | 60.0 | 81.9 | 73.1 | 97.8 | 92.7 |
| Multiclass Classifier | 85.3 | 100 | 100 | 93.6 | 97.4 | 93.5 | 99.8 | 99.7 | 94.5 | 98.8 |

| Method | #Genes | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| HC-k-TSP [ | 5 to 27 | 67.4 | 66.7 | 97.1 | - | 97.1 | 90.3 | 97.4 | 85.4 | 97.0 |
| mul-PAM [ | 5 to 27 | 56.5 | 93.3 | 97.1 | - | 85.3 | 90.3 | 92.2 | 82.9 | 93.9 |
| BMSF(highest) [ | 5 to 27 | - | - | - | - | 97.1 | 95.2 | 97.4 | 98.6 | 100 |
| I-RELIEF(highest)[ | 5 to 27 | - | - | - | - | 88.4 | 82.3 | 95.1 | 96.1 | 91.2 |
| LHR(highest) [ | 5 to 27 | - | - | - | - | 100 | 91.2 | 97.4 | 100 | 100 |

Current results show a 4.2% improvement over our previous method using ICGA-ELM. All other multi-class and binary data sets are classified with 100% accuracy (shown in bold). Genes selected by BCGA-ELM (for all data sets) are also classified using the WEKA [19] machine learning package. The WEKA results are much lower for the GCM multi-class data but, for the other data sets, are fairly consistent with BCGA-ELM and with other results in the literature. (*σ² is the variance.)
Selected genes, biomarkers and activities related to hallmarks of cancer, as identified by IPA®, for four of the eight data sets are given here.

| Data set | Selected genes | Biomarkers | Activities related to hallmarks of cancer |
|---|---|---|---|
| Breast - multiclass | CYC1, CYP2A6, DNASE1L3, EEF1D, EVI2A, GPM6B, HAS1, ICAM3, LAD1, LASP1, LEP, LMO4, LOC54157, LTBR, NAT1, PFKFB4, POU2F2, PPP1R1A, RBP1, TCEAL1, TDRD9, TIMP4 | APOE, APOH, BMP7, CALB2, CLU, COL4A3, EGF, IL4, IL13, ITGAV, LEP, LGALS3BP, LTC4S, MAPK1, MED21, MTOR, PPARG, PTK2, RBP1, SLC29A1, SMAD4, STAT5B, TGFB1, THY1, TIMP4, TLR4, TNF, TREM1 | Cell morphology, hematological system development and function, cell-to-cell signalling and interaction, cell death and survival, cell-mediated immune response, cellular movement, cellular compromise, DNA replication, recombination, and repair. |
| Leukemia - multiclass | PHF15, SPTAN1, FOXI1, MPO, APOC1, CD33, PTX3, LSS, ZYX, ATBF1, WIT1 | APOC1, CEBPA, JUNB, MPO, NOTCH3, PROC, ZYX | Hematological system development and function, immune cell trafficking, inflammatory response, tissue development, cellular function and maintenance, cell death and survival, cell morphology, tissue morphology, cell-to-cell signalling and interaction. |
| CNS - binary | RPS23, TAGLN2, MORC3, BNC1, CSF2, MCFD2, GTF2B, CORO2A, IGF2BP3, UCHL1, EEF1B2, CNR2, CSN1S1, ITIH3, (3 unknowns) | CCL2, CCL3L3, CD28, CD44, CDKN1A, CSF2, ETS1, FASLG, HTT, IGF2BP3, IL2, IL6, IL15, IL18, STAT3, TLR2, TNF, TREM1, UCHL1 | Cell cycle arrest, cell death and survival, cellular compromise, cell-mediated immune response, cell-to-cell signalling and interaction, cellular development, cellular growth and proliferation, angiogenesis, cellular movement. |
| DLBCL - binary | CIRBP, NID2, TRIB2, RPA2, TALDO1, CD28, ECH1, IQGAP2, CD37, CRYAA, ZFP36L1, PON1, CCR1, YWHAH, HLA-A (3 unknowns) | B2M, CALR, CCL5, CD28, CSF2, CCR1, CD37, FLNA, GATA3, HLA-A, IFNG, IL2, IL5, IL2RG, OPRD1, PDCD1, PPARD, PTGER4, SLC7A5, TRAF2, YWHAH | Cell-to-cell signalling and interaction, hematological system development and function, immune cell trafficking, inflammatory response, cell death and survival, cellular assembly and organization, cell cycle arrest, DNA replication, recombination, and repair, cell morphology. |

Figure 2. Framework of the proposed Binary Coded Genetic Algorithm, which is initialized with a randomly selected set of 200 solutions. These sets of genes undergo genetic operations such as crossover, mutation and selection, and are continually evaluated by ELM until a termination criterion is met (maximum number of iterations or maximum classification accuracy). The fitness value f(F, GCM) is computed using Equation 1, where F is a binary string and GCM is the Global Cancer Map database.
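
The loop described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: Equation 1 is not reproduced in this record, so the fitness below is assumed to be the hold-out accuracy of a small ELM, and the operators (binary tournament selection, single-point crossover, bit-flip mutation, elitism), the function names `elm_predict`, `fitness` and `bcga`, and all parameter values other than the population size of 200 are illustrative choices.

```python
import numpy as np

def elm_predict(X_train, y_train, X_test, n_classes, n_hidden=20, seed=0):
    """Extreme Learning Machine: random hidden layer, least-squares output layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X_train.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                      # random hidden biases
    H = np.tanh(X_train @ W + b)                       # hidden-layer activations
    T = np.eye(n_classes)[y_train]                     # one-hot class targets
    beta = np.linalg.pinv(H) @ T                       # output weights via pseudo-inverse
    return np.argmax(np.tanh(X_test @ W + b) @ beta, axis=1)

def fitness(mask, X, y, train, test, n_classes):
    """ASSUMED fitness (stand-in for Equation 1): hold-out ELM accuracy on the masked genes."""
    if not mask.any():                                 # empty gene set cannot classify
        return 0.0
    pred = elm_predict(X[train][:, mask], y[train], X[test][:, mask], n_classes)
    return float(np.mean(pred == y[test]))

def bcga(X, y, pop_size=200, max_iter=20, p_cross=0.8, p_mut=0.02, seed=0):
    """Binary coded GA: each individual is a boolean string F over the genes."""
    rng = np.random.default_rng(seed)
    n_genes, n_classes = X.shape[1], int(y.max()) + 1
    order = rng.permutation(len(y))                    # fixed split keeps fitness deterministic
    train, test = order[: len(y) // 2], order[len(y) // 2 :]
    score = lambda P: np.array([fitness(m, X, y, train, test, n_classes) for m in P])
    pop = rng.random((pop_size, n_genes)) < 0.1        # sparse random initial solutions
    fit = score(pop)
    history = [float(fit.max())]
    for _ in range(max_iter):                          # termination: maximum iterations ...
        if fit.max() >= 1.0:                           # ... or maximum accuracy reached
            break
        elite = pop[fit.argmax()].copy()
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((fit[a] >= fit[b])[:, None], pop[a], pop[b])  # binary tournament
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):            # single-point crossover on pairs
            if rng.random() < p_cross:
                cut = rng.integers(1, n_genes)
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        children ^= rng.random(children.shape) < p_mut  # bit-flip mutation
        children[0] = elite                            # elitism: keep the best string
        pop, fit = children, score(children)
        history.append(float(fit.max()))
    best = int(fit.argmax())
    return pop[best], float(fit[best]), history

# Synthetic stand-in data (not GCM): 60 samples, 40 "genes", 3 classes,
# with only the first three genes carrying class information.
demo_rng = np.random.default_rng(1)
y_demo = np.repeat(np.arange(3), 20)
X_demo = demo_rng.normal(size=(60, 40))
X_demo[:, :3] += 3.0 * np.eye(3)[y_demo]
mask, acc, history = bcga(X_demo, y_demo, pop_size=30, max_iter=10)
print(mask.sum(), "genes selected, hold-out accuracy", acc)
```

Because the split and the ELM seed are fixed, the fitness of a given binary string never changes between generations, so with elitism the best fitness recorded in `history` cannot decrease. A real study would replace the single hold-out split with the validation scheme used in the paper.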
Figure 3. Gene names and descriptions for the 92 genes selected by BCGA-ELM. Some of the important genes implicated in signalling and metabolic pathways, as determined by IPA® and iReport® analysis, are in bold letters.
Figure 4. Genes involved in various cellular activities, as indicated by iReport® analysis of the cancer vs. normal data for the 92 candidate genes selected by BCGA-ELM, are displayed inside a wheel. This figure was consolidated from several figures (given separately in the supplement) in order to show all cell activities in the same figure. Genes that are involved in cellular activities such as signalling, metabolism, growth, apoptosis, survival and proliferation, disease-specific interaction and signalling pathways are listed here. The wheel displays the 52 most important candidate genes, where the colour and size of each gene indicate various properties. The blue and green colours on the outside of the big circle represent interactions and pathways. The purple markings are for different processes and the orange outer circles are for different diseases. Genes are grouped under three major circles for diseases, interaction pathways and processes, as indicated by light grey background circles. The size of a gene indicates the number of diseases, molecular functions or processes it is associated with. Gene circles are coloured according to their expression levels, which range from −3.304 to 1.637, where blue is for lower values and orange for higher values. The small blue circles on the south-east corner of the circular gene symbols indicate that these genes have isoforms. There are 103 pathways involving 20 differentially expressed genes (DEGs), 341 processes involving 40 DEGs and 79 diseases involving 29 DEGs. Some of these are illustrated here and all are listed in Additional file 1: Tables S5 and S6. Genes related to particular types of cancers, which are highlighted on the left of the figure, are circled in red. APOA1, NOTCH2, B2M and VEGFC seem to play major roles in these cancers. Genes responsible for cell death and survival are also given here.
Figure 5. The top six of twelve biomarkers are listed in this table, with their family classification, such as transporters, growth factors, enzymes or regulators. Each biomarker is related to multiple cancers, and each of the top three biomarkers is related to all but one of the 14 cancers. The degree of filling of the circles denotes the number of processes in which the gene is involved. The filled circles in the last column, under biomarker applications, indicate the process- or disease-related evidence, such as diagnosis, efficacy, disease progression, prognosis and safety. It can be seen that one biomarker can be active in multiple cancer classes: APOA1 is involved in 13 cancers (all except CNS), VEGFC is related to all but pancreatic cancer, and YWHAZ is related to all but ovarian cancer. These biomarkers are useful for diagnosis or for determining the efficacy of drugs, while the applications of some of them are unspecified. The other colour coding in this figure is similar to that described in Figure 4.
Figure 6. Hallmarks of cancer genes are listed here. The biological processes and the genes related to some of the cancer hallmark processes, such as cell cycle, death, movement, vasculogenesis, migration, proliferation, transport and invasion, as identified by Ingenuity iReport®, are shown. Cell proliferation and migration involve the largest number of genes and processes. The colour of the circle denotes the expression level of each gene, with blue being the lowest and orange or red the highest. The disease state/evidence of each gene is given by the smaller circles: a small pink circle indicates that the gene is considered a biomarker, an orange circle indicates that the gene is mutated in the disease state, a brown circle indicates the level of expression, and a green circle (none here) indicates that the gene is a drug target. The gene names inside the coloured circles under disease state are listed in the second column.