Sudipta Acharya, Laizhong Cui, Yi Pan.
Abstract
BACKGROUND: In computational biology, analyzing complex data helps to extract relevant biological information. Sample classification of gene expression data is one popular analysis technique. However, the large number of irrelevant or redundant genes in expression data makes sample classification algorithms work inefficiently. Feature selection is a dimensionality reduction technique that helps to maximize the effectiveness of a sample classification algorithm. Recent advances in biotechnology have enriched biological data with multiple modalities, or views. Different 'omics' resources capture different, equally important biological properties of entities. However, most existing feature selection methods consider only one of the available biological resources. Consequently, they may ignore crucial aspects of the available biological knowledge that could further improve feature selection efficiency.
Keywords: Feature selection; Gene ontology (GO); Multi-objective optimization; Multi-view clustering; Protein protein interaction network; Sample classification
Year: 2020 PMID: 32938388 PMCID: PMC7495900 DOI: 10.1186/s12859-020-03681-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Two views developed based on multiple ‘omics’ data
Fig. 2 The flowchart of the proposed CMVMC-based gene selection algorithm
Fig. 3 Structure of each parent clustering solution in the proposed CMVMC
Fig. 4 Formation of consensus clusters of view 1 and view 2
Silhouette values corresponding to the optimal gene clustering solution obtained by the proposed CMVMC and by other clustering approaches. The number of gene clusters is given in parentheses.
| 0.446( | 0.434( | 0.439( | 0.4299( | |
| 0.4671( | 0.457( | 0.462( | 0.4531( |
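The Silhouette values reported above summarize how compact and well separated the gene clusters are. As a reminder of what the measure computes, here is a minimal sketch in plain Python. It assumes Euclidean distance and toy two-dimensional points, neither of which comes from the paper, and the function names are illustrative rather than the authors' implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points.

    For point i: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters get s = 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two tight groups that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])
bad = silhouette_score(toy_points, [0, 1, 0, 1])
```

Values lie between -1 and 1, and a higher value indicates a better clustering. In the toy data the labelling that matches the two groups scores close to 1, while the mismatched labelling scores below 0.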
Biological significance test outcome (under the ‘Biological process’ ontology) for two randomly chosen gene clusters obtained by CMVMC for each of the two data sets
| Cluster 1 (134 genes) | GO:0050896 (response to stimulus) | 55.38% | 40.14% | Cluster 1 (217 genes) | GO:0071704 (organic substance metabolic process) | 61.2% | 53.7% |
| | GO:0071840 (cellular component organization or biogenesis) | 40.7% | 27.86% | | GO:0044237 (cellular metabolic process) | 63.8% | 54.8% |
| | GO:0048518 (positive regulation of biological process) | 42.9% | 29.3% | | GO:0044249 (cellular biosynthetic process) | 31.4% | 24.5% |
| | GO:0032501 (multicellular organismal process) | 41.65% | 33.07% | | GO:0016043 (cellular component organization) | 43.4% | 33.4% |
| | GO:1901564 (organonitrogen compound metabolic process) | 36.3% | 25.3% | | GO:0044260 (cellular macromolecule metabolic process) | 42.8% | 33.5% |
| Cluster 2 (124 genes) | GO:0065008 (regulation of biological quality) | 37.39% | 19.24% | Cluster 2 (197 genes) | GO:0006807 (nitrogen compound metabolic process) | 53.9% | 47.18% |
| | GO:0051049 (regulation of transport) | 19.76% | 8.8% | | GO:0071840 (cellular component organization or biogenesis) | 40.09% | 37.4% |
| | GO:0032879 (regulation of localization) | 26.85% | 13.12% | | GO:0016043 (cellular component organization) | 40% | 33.4% |
| | GO:0009966 (regulation of signal transduction) | 25.04% | 14.89% | | GO:0065007 (biological regulation) | 41.7% | 34% |
| | GO:0051128 (regulation of cellular component organization) | 24.05% | 11.69% | | GO:0043170 (macromolecule metabolic process) | 50.8% | 40.4% |
Fig. 5 Cluster-profile plot for one random gene cluster from each of the Multiple tissues (131 genes and 103 samples) and Yeast (180 genes and 17 samples) data sets
Comparative analysis of obtained sample clusters with respect to internal validity measures
| 5565 (original) | 103 | 0.2527 | 0.998 | |
| 41 (reduced by CMVMC) | ||||
| 34 (reduced by UMC-v1-v2) | 0.3526 | 1.37 | ||
| 40 (reduced by PAM(GO-based)) | 0.4299 | 1.0065 | ||
| 2884 (original) | 17 | 0.2365 | 0.149 | |
| 10 (reduced by CMVMC) | 0.087 | |||
| 6 (reduced by UMC-v1-v2) | 0.385 | 0.251 | ||
| 10 (reduced by PAM(GO-based)) | 0.4531 |
Fig. 6 The comparative Silhouette and DB values for obtained sample clustering solutions for both data sets
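The record does not expand "DB". Assuming it refers to the Davies-Bouldin index, a common internal validity measure for which lower values indicate better clustering, the following sketch shows how such a value can be computed. The Euclidean metric, the toy points, and the function names are illustrative assumptions, not the authors' implementation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index: mean over clusters of the worst-case ratio
    (scatter_i + scatter_j) / distance(centroid_i, centroid_j)."""
    clusters = {}
    for point, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(point)
    if len(clusters) < 2:
        raise ValueError("Davies-Bouldin needs at least two clusters")

    centroids = {}
    scatters = {}
    for lab, members in clusters.items():
        dim = len(members[0])
        centroid = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        centroids[lab] = centroid
        # Scatter: mean distance of the cluster's points to its centroid.
        scatters[lab] = sum(euclidean(m, centroid) for m in members) / len(members)

    worst_ratios = []
    for i in clusters:
        ratios = []
        for j in clusters:
            if i == j:
                continue
            separation = euclidean(centroids[i], centroids[j])
            if separation == 0:
                ratios.append(math.inf)  # coincident centroids: worst possible case
            else:
                ratios.append((scatters[i] + scatters[j]) / separation)
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(worst_ratios)


# Toy data (hypothetical): two tight groups that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
db_good = davies_bouldin_index(toy_points, [0, 0, 1, 1])
db_bad = davies_bouldin_index(toy_points, [0, 1, 0, 1])
```

Under this definition, compact clusters with distant centroids give a small value, so the index is read in the opposite direction to the Silhouette value.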
Comparative percentage of Classification Accuracy (%CA) values of proposed CMVMC-based gene selection as well as other existing methods
| CMVMC-gene selection | |||
| 40 | PAM+AMOSA | 92.14 | |
| 34 | UMC-v1-v2 | 78.4 | |
| 42 | CLARANS+k-NN | 81.03 | |
| CLARANS+RF | 76.0 | ||
| CLARANS+C4.5 | 65.0 | ||
| CLARANS+NB | 92.23 | ||
| CLARANS+MLP | 89.32 | ||
| CMVMC-gene selection | |||
| 10 | PAM+AMOSA | 95.63 | |
| 6 | UMC-v1-v2 | 79.5 | |
| 15 | CLARANS+k-NN | 86.78 | |
| CLARANS+RF | 94.12 | ||
| CLARANS+C4.5 | 94.12 | ||
| CLARANS+NB | 94.12 | ||
| CLARANS+MLP | 94.12 |
Fig. 7 The comparative Classification Accuracy (CA) of samples by proposed and existing gene selection approaches