Sudipta Acharya, Sriparna Saha, N Nikhil.
Abstract
BACKGROUND: Classifying biological samples from gene expression data is a basic building block for several bioinformatics problems, such as diagnosing cancer and other diseases and planning appropriate treatment. A major challenge in sample classification is handling high-dimensional, redundant gene expression data. Gene (feature) selection plays a major role in reducing the complexity of handling such data.
Keywords: Feature selection; Gene Ontology (GO); Gene-GO term annotation matrix; Multi-objective clustering; Sample classification
Year: 2017 PMID: 29166852 PMCID: PMC5700545 DOI: 10.1186/s12859-017-1933-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Flowchart of the proposed framework
Fig. 2 Struct based gene-GO term annotation matrix representation
Chosen K values for PAM clustering algorithm
| Data set | K | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Yeast | 5 | 10 | 20 | 30 | 40 | 50 | - | - |
| Multiple tissues | 5 | 10 | 20 | 30 | 40 | 50 | 60 | 70 |
Silhouette index values for clustering solutions produced by PAM with different values of K
| Data set | K | Silhouette (Euclidean PAM) | Silhouette (city block PAM) | Silhouette (cosine PAM) |
|---|---|---|---|---|
| Yeast | 5 | 0.3792 | 0.367 | 0.381 |
| | **10** | | | |
| | 20 | 0.4415 | 0.437 | 0.435 |
| | 30 | 0.4075 | 0.411 | 0.426 |
| | 40 | 0.40 | 0.421 | 0.423 |
| | 50 | 0.397 | 0.432 | 0.419 |
| Multiple tissues | 5 | 0.354 | 0.361 | 0.359 |
| | 10 | 0.383 | 0.372 | 0.368 |
| | 20 | 0.394 | 0.379 | 0.382 |
| | 30 | 0.406 | 0.394 | 0.392 |
| | **40** | | | 0.404 |
| | **50** | 0.429 | 0.402 | |
| | 60 | 0.415 | 0.398 | 0.416 |
| | 70 | 0.414 | 0.391 | 0.409 |
Boldface marks the optimal value of K, i.e. the dimension of the reduced gene space that gives the best Silhouette index, for each of the three distance-based PAM versions
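As a rough illustration of how K can be chosen by Silhouette index, the sketch below clusters points with a simplified alternating k-medoids routine (a stand-in for PAM, not its full swap search) under Euclidean, city block, or cosine distance, and keeps the K with the highest Silhouette index. The function names, the toy data, and the candidate K values are assumptions made for illustration; none of them come from the paper.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine(a, b):
    # Cosine distance; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assign(points, medoids, dist):
    # Label each point with the position of its nearest medoid.
    return [min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
            for p in points]


def k_medoids(points, k, dist, seed=0, max_iter=100):
    """Simplified alternating k-medoids (hypothetical stand-in for PAM)."""
    medoids = random.Random(seed).sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, medoids, dist)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster empties
                updated.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda i: sum(
                dist(points[i], points[j]) for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(points, medoids, dist)


def silhouette(points, labels, dist):
    """Mean silhouette width over all points (singleton clusters score 0)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to own cluster; b: mean distance to nearest other cluster.
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_k(points, candidate_ks, dist):
    # Return the K whose clustering has the highest silhouette index.
    scores = {k: silhouette(points, k_medoids(points, k, dist), dist)
              for k in candidate_ks}
    return max(scores, key=scores.get), scores


# Toy data: two well-separated groups of four points each.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
chosen, scores = best_k(toy, [2, 3, 4], euclidean)
```

Passing `city_block` or `cosine` instead of `euclidean` mirrors the three distance-based versions compared in the table; the cosine variant needs non-zero vectors, so it would not work on the toy data above, which contains the origin.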
Results for biological significance test: first two obtained clusters by PAM on Yeast data
| Cluster | GO term | Cluster frequency | Genome frequency |
|---|---|---|---|
| Cluster 1 (245 genes) | GO:0022625, cytosolic large ribosomal subunit | 57.1% | 34.5% |
| | GO:0042221, response to chemical | 40.63% | 28.29% |
| | GO:0006325, chromatin organization | 38.62% | 22.86% |
| | GO:0055085, transmembrane transport | 47.94% | 18.33% |
| Cluster 2 (156 genes) | GO:0015934, large ribosomal subunit | 44.1% | 22.82% |
| | GO:0006974, cellular response to DNA damage stimulus | 37.74% | 14.92% |
| | GO:0006366, transcription from RNA polymerase II promoter | 36.94% | 18.58% |
| | GO:0006811, ion transport | 38.37% | 19.47% |
Fig. 3 Cluster profile plot of one cluster (156 genes and 17 samples) after PAM-based clustering on the gene-GO term annotation matrix of the Yeast dataset
Comparative analysis of AMOSA based sample clustering outcomes with respect to three internal validity indices
| Data set | Genes (features) | Samples | Silho | DB | Dunn |
|---|---|---|---|---|---|
| Yeast | 2884 (original) | 17 | 0.2365 | 0.149 | 0.5268 |
| | 10 (reduced) | 17 | | | |
| Multiple tissues | 5565 (original) | 103 | 0.2527 | | 0.6246 |
| | 40 (reduced) | 103 | | 1.0065 | |
The optimal values of the Silhouette, DB and Dunn indices for both datasets are shown in bold
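For readers less familiar with the DB and Dunn indices, the following is a minimal sketch of one standard formulation of each. It assumes Euclidean distance, centroid-based scatter for the Davies-Bouldin (DB) index, and the minimum-separation over maximum-diameter form of the Dunn index. The paper may use different variants, and the function names and toy clusters are illustrative only.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    # Coordinate-wise mean of a list of equal-length tuples.
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def davies_bouldin(clusters):
    """DB index: mean over clusters of the worst (scatter_i + scatter_j) / centroid distance.

    Lower values indicate more compact, better separated clusters.
    """
    cents = [centroid(c) for c in clusters]
    scatter = [sum(euclidean(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k


def dunn(clusters):
    """Dunn index: smallest between-cluster distance / largest cluster diameter.

    Higher values indicate more compact, better separated clusters.
    """
    min_between = min(euclidean(p, q)
                      for i, a in enumerate(clusters)
                      for b in clusters[i + 1:]
                      for p in a for q in b)
    max_diameter = max(euclidean(p, q) for c in clusters for p in c for q in c)
    return min_between / max_diameter


# Toy clusters: two unit squares far apart (not data from the paper).
toy_clusters = [
    [(0, 0), (0, 1), (1, 0), (1, 1)],
    [(10, 10), (10, 11), (11, 10), (11, 11)],
]
db_value = davies_bouldin(toy_clusters)
dunn_value = dunn(toy_clusters)
```

The opposite directions of the two indices matter when reading the table: a good clustering has a high Silhouette index, a low DB index, and a high Dunn index.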
Fig. 5 Graphical comparative analysis of AMOSA based sample clustering outcomes with respect to three internal cluster validity indices
The comparative results of our proposed feature selection based sample clustering technique with other existing techniques
| Data set | Number of genes | Algorithm | %CoA |
|---|---|---|---|
| Yeast | 10 | Proposed (PAM+AMOSA) | |
| | 15 | CLARANS+k-NN | 86.78 |
| | 15 | CLARANS+C4.5 | 94.12 |
| | 15 | CLARANS+RF | 94.12 |
| | 15 | CLARANS+MLP | 94.12 |
| | 15 | CLARANS+NB | 94.12 |
| Multiple tissues | 40 | Proposed (PAM+AMOSA) | 92.14 |
| | 42 | CLARANS+k-NN | 81.03 |
| | 42 | CLARANS+C4.5 | 65.0 |
| | 42 | CLARANS+RF | 76.0 |
| | 42 | CLARANS+MLP | 89.32 |
| | 42 | CLARANS+NB | |
The optimal (maximum) classification accuracy (%CoA) values for both datasets are shown in bold
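The page does not spell out how %CoA is computed. One common convention for scoring a clustering against known class labels, assumed in this sketch, maps each cluster to its majority class and reports the percentage of samples that fall in their cluster's majority class. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_accuracy(cluster_labels, true_labels):
    """Percentage of samples whose true class is the majority class of their cluster.

    Assumed convention (majority-class mapping); the source may define %CoA differently.
    """
    if len(cluster_labels) != len(true_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    classes_by_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        classes_by_cluster.setdefault(cluster, []).append(truth)
    # For each cluster, count the samples that belong to its most frequent class.
    correct = sum(Counter(classes).most_common(1)[0][1]
                  for classes in classes_by_cluster.values())
    return 100.0 * correct / len(true_labels)


# Toy example: six samples, two clusters, one sample in the "wrong" cluster.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_classes = ["a", "a", "b", "b", "b", "b"]
toy_accuracy = clustering_accuracy(toy_clusters, toy_classes)
```

Under this convention, 16 of 17 correctly grouped samples gives 94.12%, which matches the value listed for several methods on the 17-sample Yeast data, although the source does not confirm that those figures arose this way.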
Fig. 6 Graphical comparative analysis of our proposed feature selection based sample clustering technique with other existing techniques
Results for biological significance test: first two obtained clusters by PAM on Multiple tissues data
| Cluster | GO term | Cluster frequency | Genome frequency |
|---|---|---|---|
| Cluster 1 (102 genes) | GO:0009987, cellular process | 73.00% | 59.72% |
| | GO:0008152, metabolic process | 75.00% | 46.46% |
| | GO:0050789, regulation of biological process | 69.00% | 36.75% |
| | GO:0050896, response to stimulus | 67.00% | 26.47% |
| | GO:0032501, multicellular organismal process | 55.00% | 16.69% |
| Cluster 2 (107 genes) | GO:0043170, macromolecule metabolic process | 52.48% | 35.46% |
| | GO:0009058, biosynthetic process | 44.55% | 22.22% |
| | GO:0032501, multicellular organismal process | 40.59% | 16.69% |
| | GO:0007154, cell communication | 32.67% | 19.46% |
| | GO:0007275, multicellular organismal development | 28.71% | 11.47% |
Fig. 4 Cluster profile plot of one cluster (102 genes and 103 samples) after PAM-based clustering on the gene-GO term annotation matrix of the Multiple tissues dataset