| Literature DB >> 27932867 |
Jelili Oyelade1, Itunuoluwa Isewon1, Funke Oladipupo2, Olufemi Aromolaran2, Efosa Uwoghiren2, Faridah Ameh2, Moses Achas3, Ezekiel Adebiyi1.
Abstract
Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a substantial opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the number of genes involved increase the challenge of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. The use of clustering techniques is therefore a first step toward addressing these challenges, and it is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Clustering gene expression data has proven useful in revealing the natural structure inherent in the data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. A further benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to identify the appropriate clustering technique that will guarantee stability and a high degree of accuracy in its analysis procedure.
Keywords: bioinformatics; biological process; clustering algorithm; gene expression data; homology
Year: 2016 PMID: 27932867 PMCID: PMC5135122 DOI: 10.4137/BBI.S38316
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Classification of clustering techniques.
Some clustering algorithms and software packages/tools corresponding to the algorithms.
| ALGORITHMS | SOFTWARE/TOOLS |
|---|---|
| K-means | KMC; MATLAB; PYTHON; APACHE SPARK; JAVA (WEKA); R |
| K-medoids | MATLAB |
| Gaussian Mixture Model (GMM) | APACHE SPARK; PYTHON |
| Self-Organizing Maps (SOM) | R; MATLAB |
| Hierarchical Clustering | XLSTAT; R; PYTHON |
| Expectation Maximization (EM) | MATLAB |
| Fuzzy K-means | APACHE MAHOUT |
| Affinity Propagation (AP) | PYTHON; AFFINITY PROPAGATION WEB APPLICATION |
| PAM | R; STAT |
| CLARANS | R; MATLAB |
| OPTICS | MATLAB |
| Hierarchical Dirichlet Process (HDP) Algorithm | PYTHON |
| Binary Matrix Factorization (BMF) | PYTHON |
| Multi-Objective Clustering (MOCK) | C++/JAVA |
| DBSCAN | R; PYTHON |
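As a minimal sketch of one table entry (K-means via Python's scikit-learn, one of the tools listed above), the following clusters a small synthetic "expression matrix". The gene counts, values, and parameters are hypothetical, chosen only to make the two co-expressed groups obvious:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic expression matrix: 6 genes x 4 conditions,
# built as two clearly separated co-expression groups
expr = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(3, 4)),  # low-expression group
    rng.normal(loc=5.0, scale=0.1, size=(3, 4)),  # high-expression group
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(expr)
print(km.labels_)  # genes 0-2 share one label, genes 3-5 the other
```

Note that `n_clusters` must be supplied up front, which is exactly the drawback discussed for K-means in the next table.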
Some clustering algorithms, their drawbacks, and proposed solutions.
| ALGORITHMS | DRAWBACKS | PROPOSED SOLUTION |
|---|---|---|
| K-means | Expected number of clusters is required | A fully unsupervised clustering method called Intelligent Kernel K-Means (IKKM) |
| | High computational complexity | |
| | Does not satisfy a quality guarantee | Chandrasekhar et al. |
| | Based on random selection of the initial seed points of preferred clusters | Cluster Center Initialization Algorithm (CCIA) |
| | Sensitive to noise and outliers | |
| | Gets trapped in local optima | |
| | Different runs on the same data might produce different clusters | |
| | Vulnerable to the existence of scattered genes | |
| Fast Genetic K-Means Algorithm (FGKA) | If the mutation probability is quite small, the number of allele changes will be small | Incremental Genetic K-means Algorithm (IGKA) |
| Fuzzy K-means (FKM) | More time-consuming to calculate the membership function | PK-means |
| K-Medoid | Time-consuming | |
| Expectation-Maximization (EM) Algorithm | Gets trapped in a local maximum of the log-likelihood | |
| Gaussian Mixture Model (GMM) Algorithm | Requires prior information on the number of clusters | Intelligent K-Means (IK-Means) |
| Partitioning Around Medoid (PAM) | Vulnerable to the presence of scattered genes | |
| Multi-Elitist QPSO (MEQPSO) | Needs prior information on the number of clusters | |
| | Runtime deterioration with lengthy particles | |
| Self-Organizing Map (SOM) | The grid structure and the number of clusters are required | |
| | Merging different patterns into a cluster can make SOM ineffective | |
| Hierarchical Agglomerative Clustering (HAC) algorithm | Lacks robustness when dealing with data containing noise | Self-Organizing Tree Algorithm (SOTA) |
| | Difficulty in interpreting patterns when a large amount of data is applied | Hierarchical Growing Self-Organizing Tree (HGSOT) |
| | Difficulty in clustering larger data | |
| Clustering Algorithm based on Randomized Search (CLARANS) | Increased run time when faced with large databases | |
| Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | Poor performance on high-dimensional data | |
| Affinity Propagation (AP) | Robustness limitations | Relaxation of the AP hard constraints |
| WAVECLUSTER | Not quite suitable for high-dimensional datasets | |
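Two of the drawbacks above (K-means' sensitivity to outliers versus DBSCAN's built-in noise handling) can be sketched on synthetic data; the point values and `eps`/`min_samples` settings are hypothetical, chosen to make the contrast visible:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(1)
# Two tight groups plus a single extreme outlier
pts = np.vstack([
    rng.normal(0, 0.1, size=(10, 2)),
    rng.normal(5, 0.1, size=(10, 2)),
    [[50.0, 50.0]],  # outlier
])

# K-means must place every point, so the outlier captures a
# centroid and the two genuine groups collapse into one cluster
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pts)

# DBSCAN leaves the two genuine groups intact and flags the
# outlier as noise (label -1) instead of forcing it into a cluster
db_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(pts)
print(km_labels[-1], db_labels[-1])
```

This is the mechanism behind "sensitive to noise and outliers" in the table: a single extreme measurement can distort the entire K-means partition.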
Some internal and external validity indexes.
| INTERNAL CLUSTERING MEASURES |
|---|
| The partition entropy |
| Dunn index |
| The partition coefficient |
| Calinski-Harabasz |
| C-Index |
| Davies-Bouldin |
| Krzanowski-Lai index |
| VFS |
| VXB |
| The fuzzy hypervolume |
| Modification of the MPE index |
| Modification of the VPC index |
| Kwon index |
| The PBM index |
| PBM-index for Fuzzy c-means |
| VPCAES |
| Mutual information and variation of information |
| Zhang index |
| VSC(c,U) |
| Rezaee compactness and separation |
| WGLI |
| Graded Distance Index |
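Two of the internal indexes listed above, Davies-Bouldin (lower is better) and Calinski-Harabasz (higher is better), are available in scikit-learn and can be used to pick the number of clusters. A minimal sketch on hypothetical two-group data, where both indexes should favor k = 2:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

rng = np.random.default_rng(0)
# Synthetic data: two tight, well-separated groups
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])

db, ch = {}, {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    db[k] = davies_bouldin_score(X, labels)     # lower is better
    ch[k] = calinski_harabasz_score(X, labels)  # higher is better
    print(k, round(db[k], 3), round(ch[k], 1))
```

Both indexes are internal measures in the sense of the table: they score compactness and separation from the data and labels alone, without reference to any external ground truth.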