Jorge Parraga-Alava, Marcio Dorn, Mario Inostroza-Ponta.
Abstract
BACKGROUND: Biologists aim to understand the genetic background of diseases, metabolic disorders or any other genetic condition. Microarrays are one of the main high-throughput technologies for collecting information about the behaviour of genetic information under different conditions. To analyse these data, clustering is one of the main techniques used; it aims to find groups of genes that share some criterion, such as a similar expression profile. However, the problem of finding groups is normally multi-dimensional, which makes it necessary to approach clustering as a multi-objective problem in which several cluster validity indexes are optimised simultaneously. These indexes are usually based on criteria such as compactness and separation, which may not be sufficient, since they cannot guarantee clusters that have both similar expression patterns and biological coherence.
Keywords: External biological knowledge; Gene expression data; Multi-objective clustering (MOC); Pareto local search (PLS); Path-relinking (PR)
Year: 2018 PMID: 30100924 PMCID: PMC6081857 DOI: 10.1186/s13040-018-0178-4
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Schematic representation of the integration of biological knowledge to MOC-GaPBK
Cluster validity indexes used as objective functions
| Validity index | Definition | Type |
|---|---|---|
| Xie-Beni index (XB) | The quotient between the total variance and the minimum separation of the elements in the clusters. | Minimisation |
| Overall cluster deviation (Dev) | The overall summed distances between genes and their corresponding cluster medoid. | Minimisation |
| Cluster separation (Sep) | The inter-cluster distances between cluster medoids. | Maximisation |
The distance D in each formula is measured using both expression profiles-based distance (D) and biological-based distance (D)
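The three indexes described in the table can be sketched in a few lines. This is only an illustration under stated assumptions: the exact equations are not recoverable from this record, so the normalisation of XB (total squared deviation divided by n times the minimum squared medoid distance) and the use of a plain sum of pairwise medoid distances for Sep are assumptions, as are all function names. The `dist` argument stands in for either the expression-based or the biological-based distance.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Placeholder distance; any expression- or biology-based distance could be passed instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, medoids, dist=euclidean):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda k: dist(p, medoids[k])) for p in points]

def deviation(points, medoids, labels, dist=euclidean):
    """Dev (minimise): summed distance between each element and its cluster medoid."""
    return sum(dist(p, medoids[k]) for p, k in zip(points, labels))

def separation(medoids, dist=euclidean):
    """Sep (maximise): summed pairwise distance between medoids (assumed form)."""
    return sum(dist(a, b) for a, b in combinations(medoids, 2))

def xie_beni(points, medoids, labels, dist=euclidean):
    """XB (minimise): total squared deviation over n times the minimum squared medoid distance (assumed form)."""
    total_var = sum(dist(p, medoids[k]) ** 2 for p, k in zip(points, labels))
    min_sep = min(dist(a, b) ** 2 for a, b in combinations(medoids, 2))
    return total_var / (len(points) * min_sep)

# Toy data: two well-separated groups, with one medoid in each.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
medoids = [(0, 0), (4, 0)]
labels = assign(points, medoids)
print(deviation(points, medoids, labels))  # 2.0
print(separation(medoids))                 # 4.0
print(xie_beni(points, medoids, labels))   # 0.03125
```

Passing the distance as a parameter mirrors the note above: the same index can be evaluated once per distance to obtain one objective value from expression data and one from biological knowledge.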
Fig. 2 Construction of trajectories in the Path Relinking procedure: a schematic representation
Fig. 3 Schematic representation of the Pareto Local Search (a) Population duplication, (b) Iterative exploration (c)
Gene expression datasets used in experiments
| Dataset | Samples | Original elements | Selected elements |
|---|---|---|---|
| Arabidopsis thaliana | 8 | 138 | 133 |
| Yeast cell cycle | 17 | 6000 | 384 |
| Yeast sporulation | 7 | 6118 | 472 |
| Human fibroblasts serum | 13 | 8613 | 501 |
Best hypervolume values achieved by objective functions over 20 runs in all datasets
| Objective functions | Arabidopsis | Cell cycle | Sporulation | Serum |
|---|---|---|---|---|
| XB | | | | |
| Dev | 0.9018 | 0.9258 | 0.9307 | 0.9449 |
| Sep | 0.7913 | 0.8823 | 0.8625 | 0.7922 |
In italics, we highlight the highest values
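For readers unfamiliar with the hypervolume indicator, the sketch below computes it for a two-objective Pareto front. It assumes that both objectives are minimised, that they are normalised to [0, 1], and that the reference point is (1, 1); none of these settings is stated in this record, and the function names are illustrative.

```python
def pareto_front(points):
    """Return the non-dominated points (both objectives minimised), sorted by the first objective."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(set(front))

def hypervolume_2d(points, ref=(1.0, 1.0)):
    """Area dominated by the front and bounded by the reference point."""
    front = [p for p in pareto_front(points) if p[0] < ref[0] and p[1] < ref[1]]
    area = 0.0
    prev_y = ref[1]
    # The front is sorted by x ascending, so y is strictly descending:
    # each point adds one horizontal slab of the dominated region.
    for x, y in front:
        area += (ref[0] - x) * (prev_y - y)
        prev_y = y
    return area

solutions = [(0.2, 0.8), (0.5, 0.4), (0.8, 0.1), (0.6, 0.6)]
print(pareto_front(solutions))    # (0.6, 0.6) is dominated by (0.5, 0.4)
print(hypervolume_2d(solutions))  # approximately 0.42
```

A larger value means the front covers more of the objective space, which is why the table reports the best (highest) hypervolume per objective function.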
Fig. 4 Comparison of objective functions based on expression (D) and biological (D) information optimised by the MOC-GaPBK algorithm. Pareto fronts for (a) Arabidopsis thaliana, (b) Yeast Cell Cycle, (c) Yeast Sporulation and (d) Human Fibroblasts Serum
Fig. 5 Comparison of the MOC-GaPBK algorithm and its variations regarding the hypervolume indicator with Expression Index (D) and Biology Index (D). The best Pareto fronts for (a) Arabidopsis thaliana, (b) Yeast Cell Cycle, (c) Yeast Sporulation and (d) Human Fibroblasts Serum
Fig. 6 Clustering solution yielded by the MOC-GaPBK algorithm in Arabidopsis thaliana. (a) Eisen plot (b) Cluster profile plots
Fig. 7 Clustering solution yielded by the MOC-GaPBK algorithm in Yeast Cell Cycle. (a) Eisen plot (b) Cluster profile plots
Fig. 8 Clustering solution yielded by the MOC-GaPBK algorithm in Yeast Sporulation. (a) Eisen plot (b) Cluster profile plots
Fig. 9 Clustering solution yielded by the MOC-GaPBK algorithm in Human Fibroblasts Serum. (a) Eisen plot (b) Cluster profile plots
The most significant GO terms in datasets
| Dataset | Cluster | Significant GO term | p-value |
|---|---|---|---|
| Arabidopsis | Cluster 1 | Response to wounding (GO:0009611) | 3.63E-16 |
| | | Cellular biogenic amine metabolic process (GO:0006576) | 1.00E-14 |
| | | Cellular amine metabolic process (GO:0044106) | 1.62E-14 |
| | Cluster 2 | Lipid catabolic process (GO:0016042) | 1.91E-09 |
| | | Response to wounding (GO:0009611) | 9.68E-09 |
| | | Phenylpropanoid metabolic process (GO:0009698) | 7.61E-08 |
| | Cluster 3 | Response to organonitrogen compound (GO:0010243) | 5.36E-11 |
| | | Response to chitin (GO:0010200) | 9.51E-10 |
| | | Jasmonic acid mediated signaling pathway (GO:0009867) | 3.03E-09 |
| | Cluster 4 | Jasmonic acid biosynthetic process (GO:0009695) | 7.76E-04 |
| | | Jasmonic acid metabolic process (GO:0009694) | 1.08E-03 |
| | | Lipid oxidation (GO:0034440) | 1.35E-03 |
| Cell cycle | Cluster 1 | Positive regulation of transport (GO:0051050) | 1.84E-04 |
| | | Regulation of transport (GO:0051049) | 2.93E-03 |
| | | Regulation of localization (GO:0032879) | 3.39E-03 |
| | Cluster 2 | Cell cycle (GO:0007049) | 8.13E-17 |
| | | Cell division (GO:0051301) | 3.26E-16 |
| | | Cell cycle process (GO:0022402) | 2.30E-14 |
| | Cluster 3 | Cell cycle phase (GO:0022403) | 2.34E-10 |
| | | Mitotic interphase (GO:0051329) | 2.71E-10 |
| | | Interphase (GO:0051325) | 2.71E-10 |
| | Cluster 4 | DNA replication (GO:0006260) | 1.24E-16 |
| | | DNA metabolic process (GO:0006259) | 4.36E-16 |
| | | Cell cycle (GO:0007049) | 1.29E-11 |
| Sporulation | Cluster 1 | Glucose metabolic process (GO:0006006) | 3.69E-08 |
| | | Carbohydrate metabolic process (GO:0005975) | 1.04E-07 |
| | | Hexose metabolic process (GO:0019318) | 2.49E-07 |
| | Cluster 2 | Oxoacid metabolic process (GO:0043436) | 1.76E-05 |
| | | Organic acid metabolic process (GO:0006082) | 1.80E-05 |
| | | Monocarboxylic acid transport (GO:0015718) | 4.42E-05 |
| | Cluster 3 | Cell cycle process (GO:0022402) | 2.76E-19 |
| | | Cell cycle (GO:0007049) | 5.83E-19 |
| | | Anatomical formation in morphogenesis (GO:0048646) | 6.88E-19 |
| | Cluster 4 | Translation (GO:0006412) | 1.03E-28 |
| | | Ribosome biogenesis (GO:0042254) | 1.84E-08 |
| | | Ribonucleoprotein complex biogenesis (GO:0022613) | 6.70E-08 |
| Serum | Cluster 1 | Mitotic recombination (GO:0006312) | 1.55E-11 |
| | | G2/M transition of mitotic cell cycle (GO:0000086) | 1.68E-09 |
| | | Chromosome segregation (GO:0007059) | 1.74E-09 |
| | Cluster 2 | Cellular response to zinc ion (GO:0071294) | 5.25E-08 |
| | | Striated muscle cell differentiation (GO:0051146) | 5.98E-07 |
| | | Response to zinc ion (GO:0010043) | 1.26E-06 |
| | Cluster 3 | Cholesterol metabolic process (GO:0008203) | 7.46E-14 |
| | | Cholesterol biosynthetic process (GO:0006695) | 1.39E-13 |
| | | Sterol biosynthetic process (GO:0016126) | 2.95E-13 |
| | Cluster 4 | Multi-multicellular organism process (GO:0044706) | 8.55E-16 |
| | | Regulation of smooth muscle cell proliferation (GO:0048660) | 1.50E-14 |
| | | Smooth muscle cell proliferation (GO:0048659) | 1.84E-14 |
We consider p-values < 0.01 across all tests to be strong evidence against the null hypothesis, i.e. highly significant. A significant p-value means that most of the genes belonging to a cluster share the biological function described by the GO term
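This record does not state which statistical test produced the p-values. A common choice for GO term enrichment is the one-sided hypergeometric test, and the sketch below assumes it purely for illustration; the function name and the toy numbers are hypothetical and unrelated to the datasets in the table.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, cluster_size).

    population   -- number of genes in the dataset
    annotated    -- genes in the dataset annotated with the GO term
    cluster_size -- number of genes in the cluster
    hits         -- genes in the cluster annotated with the GO term
    """
    upper = min(annotated, cluster_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, cluster_size)

# Toy example: 10 genes, 4 annotated, a cluster of 3 genes containing 2 annotated ones.
p = enrichment_p_value(10, 4, 3, 2)
print(p)         # 40/120, i.e. about 0.333
print(p < 0.01)  # False: not significant at the 0.01 threshold used above
```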
Mean values of Silhouette index over 20 runs of different algorithms
| Algorithm | Arabidopsis | Cell cycle | Sporulation | Serum |
|---|---|---|---|---|
| MOC-GaPBK | *0.49* | *0.63* | *0.80* | *0.58* |
| Semi-FeaClustMOO | 0.46 | 0.50 | 0.70 | 0.44 |
| MO fuzzy | 0.41 | 0.43 | 0.59 | 0.40 |
| MOGA | 0.40 | 0.42 | 0.58 | 0.38 |
| SOM | 0.23 | 0.38 | 0.58 | 0.34 |
| Avg. link. | 0.32 | 0.44 | 0.50 | 0.36 |
In italics, we highlight the highest values
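The Silhouette index reported in the table can be computed as follows. This is a minimal sketch of the standard definition, not the authors' code: for each element, `a` is the mean distance to the other members of its own cluster, `b` is the smallest mean distance to any other cluster, and the score is `(b - a) / max(a, b)`. The function name and the toy data are illustrative, and at least two clusters are assumed.

```python
def silhouette(points, labels, dist):
    """Mean silhouette width over all points; values close to 1 indicate compact, well-separated clusters."""
    n = len(points)
    scores = []
    for i in range(n):
        sums, counts = {}, {}
        for j in range(n):
            if i == j:
                continue
            c = labels[j]
            sums[c] = sums.get(c, 0.0) + dist(points[i], points[j])
            counts[c] = counts.get(c, 0) + 1
        own = labels[i]
        if own not in counts:
            scores.append(0.0)  # singleton cluster: score defined as 0
            continue
        a = sums[own] / counts[own]
        b = min(sums[c] / counts[c] for c in sums if c != own)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n

# Toy example on a line: two tight groups far apart give a score close to 1.
values = [0.0, 1.0, 10.0, 11.0]
groups = [0, 0, 1, 1]
print(silhouette(values, groups, lambda x, y: abs(x - y)))  # about 0.90
```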
Friedman test ranking results comparing the MOC-GaPBK algorithm with other state-of-the-art single- and multi-objective clustering techniques
| Dataset | MOC-GaPBK | Semi-FeaClust | MO fuzzy | MOGA | SOM | Avg. link. |
|---|---|---|---|---|---|---|
| Arabidopsis | 0.49 (1) | 0.46 (2) | 0.41 (3) | 0.40 (4) | 0.23 (6) | 0.32 (5) |
| Cell cycle | 0.63 (1) | 0.50 (2) | 0.43 (4) | 0.42 (5) | 0.38 (6) | 0.44 (3) |
| Sporulation | 0.80 (1) | 0.70 (2) | 0.59 (3) | 0.58 (4) | 0.58 (4) | 0.50 (6) |
| Serum | 0.58 (1) | 0.44 (2) | 0.40 (3) | 0.38 (4) | 0.34 (6) | 0.36 (5) |
| Avg. rank | (1) | (2) | (3.25) | (4.25) | (5.5) | (4.75) |
In brackets we show the ranking of the algorithm. Last row shows the average ranking of each algorithm
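The ranks and the average-rank row can be reproduced from the Silhouette values in the table. The sketch below uses "competition" ranking, in which tied values share the smallest rank, because that matches the ranks shown for Sporulation (MOGA and SOM both ranked 4, Avg. link. ranked 6). A standard Friedman test would normally assign mid-ranks (4.5 each) to ties, so this tie rule is an observation about the table rather than a confirmed detail of the authors' procedure.

```python
algorithms = ["MOC-GaPBK", "Semi-FeaClust", "MO fuzzy", "MOGA", "SOM", "Avg. link."]

# Silhouette values copied from the table, one row per dataset.
scores = {
    "Arabidopsis": [0.49, 0.46, 0.41, 0.40, 0.23, 0.32],
    "Cell cycle":  [0.63, 0.50, 0.43, 0.42, 0.38, 0.44],
    "Sporulation": [0.80, 0.70, 0.59, 0.58, 0.58, 0.50],
    "Serum":       [0.58, 0.44, 0.40, 0.38, 0.34, 0.36],
}

def rank_desc(values):
    """Rank 1 is the highest value; tied values share the smallest rank."""
    return [1 + sum(other > v for other in values) for v in values]

ranks = {name: rank_desc(row) for name, row in scores.items()}
avg_rank = [
    sum(ranks[name][k] for name in scores) / len(scores)
    for k in range(len(algorithms))
]

for name, value in zip(algorithms, avg_rank):
    print(f"{name}: {value}")
# MOC-GaPBK: 1.0, Semi-FeaClust: 2.0, MO fuzzy: 3.25,
# MOGA: 4.25, SOM: 5.5, Avg. link.: 4.75
```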