Saurav Mallik, Zhongming Zhao.
Abstract
Numerous genetic and epigenetic datasets have been generated for the study of complex diseases, including neurodegenerative diseases. However, analysis of such data is often hampered by the difficulty of detecting outliers in the samples, which in turn affects the extraction of the true biological signals involved in the disease. To address this issue, we developed a framework for identifying methylation signatures that applies a well-known outlier detection algorithm, density-based spatial clustering of applications with noise (DBSCAN), followed by hierarchical clustering. We applied the framework to two representative neurodegenerative diseases, Alzheimer's disease (AD) and Down syndrome (DS), using DNA methylation datasets from a public source (Gene Expression Omnibus, GEO accession ID: GSE74486). We first applied the DBSCAN algorithm to eliminate outliers, and then used the Limma statistical method to determine differentially methylated genes. Next, hierarchical clustering was applied to detect gene modules. Our analysis identified a methylation signature comprising 21 genes for AD and a methylation signature comprising 89 genes for DS. Our evaluation indicated that these two signatures yielded classification accuracies of 92% for AD and 70% for DS. In summary, this framework will be useful for detecting outlier-free genetic and epigenetic signatures in various complex diseases and their developmental stages.
Year: 2020 PMID: 33335112 PMCID: PMC7747741 DOI: 10.1038/s41598-020-78463-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Flowchart of the DBSCAN framework and analysis.
Summary of DBSCAN clustering for outlier (noisy feature) removal prior to statistical test.
| Comparison | Features | # outliers | # features in Cluster 1 | # features in Cluster 2 | # features in Cluster 3 |
|---|---|---|---|---|---|
| AD vs control | # border features | 439 | 206 | 0 | – |
| | # seed features | 0 | 19,592 | 10 | – |
| | Total | 439 | 19,798 | 10 | – |
| DS vs control | # border features | 1525 | 559 | 0 | 0 |
| | # seed features | 0 | 18,148 | 10 | 5 |
| | Total | 1525 | 18,707 | 10 | 5 |
AD Alzheimer’s disease, DS Down syndrome.
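The distinction between seed (core) features, border features, and outliers in the table above follows from how DBSCAN labels points. The block below is a minimal sketch of that labelling, not the authors' implementation: the function name `dbscan`, the toy points, and the values of `eps` and `min_pts` are illustrative assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Return (labels, kinds): a cluster id per point (-1 for noise) and
    whether each point is a 'core', 'border', or 'noise' point."""
    n = len(points)
    # A point's eps-neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()  # only core points are ever pushed
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif labels[i] != -1:
            kinds.append("border")  # reachable from a core point, but not dense itself
        else:
            kinds.append("noise")   # the outliers removed before testing
    return labels, kinds

# Toy data (hypothetical): a dense square, one fringe point, one far outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (10, 10)]
labels, kinds = dbscan(points, eps=1.5, min_pts=4)
print(list(zip(labels, kinds)))
```

In this toy run the four corner points are core points, the fringe point is a border point of the same cluster, and the distant point is noise. Only the noise points correspond to the features counted as outliers.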
Figure 2. KNN distance plot and partitioning clustering plot from the DBSCAN clustering algorithm for DS vs control (FC neurons). (A) KNN distance plot used to find the knee point (marked by the red dotted line), which serves as EPS in the DBSCAN clustering algorithm for DS vs control (FC neurons). (B) Partitioning clustering plot from the DBSCAN clustering algorithm for DS vs control (FC neurons). Three clusters were identified: the blue cluster contained 18,148 core (seed) features and 559 border features, the orange cluster had only 10 core features, and the violet cluster consisted of only 5 core features. In addition, a total of 1525 unclustered (outlier/noisy) features (denoted by light green dots) were identified.
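Panel (A) reads the EPS value off the knee of a sorted k-nearest-neighbour distance curve. One way this could be automated is sketched below, under stated assumptions: the helper names are hypothetical, and the knee rule (largest gap below the chord joining the curve's end points) is a simple heuristic chosen for illustration, not necessarily how the red dotted line in the figure was placed.

```python
import math

def kth_neighbour_distances(points, k):
    """Sorted distances from every point to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)

def knee_value(sorted_values):
    """Value at the point lying farthest below the straight line that joins
    the first and last points of an ascending curve."""
    n = len(sorted_values)
    if n < 2:
        return sorted_values[0]
    y0, y1 = sorted_values[0], sorted_values[-1]
    best_index, best_gap = 0, -1.0
    for i, y in enumerate(sorted_values):
        chord = y0 + (y1 - y0) * i / (n - 1)
        gap = chord - y
        if gap > best_gap:
            best_index, best_gap = i, gap
    return sorted_values[best_index]

# Toy 1-D data (hypothetical): three evenly spaced points and one outlier.
distances = kth_neighbour_distances([(0,), (1,), (2,), (10,)], k=1)
print(distances, knee_value(distances))
```

The curve stays flat for points in dense regions and rises sharply for isolated points, so a value near the knee separates the two regimes.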
Figure 3. Plots for Voom normalization, power calculation for soft-thresholding, and dendrogram for DS vs control. (A) Voom normalization plot for DS vs control. Voom normalization was applied to the non-noisy features of the clusters retained after pre-filtering by DBSCAN clustering. (B) Power calculation for soft-thresholding in the comparison of DS vs control. This power calculation is used to ensure scale-free topology in the corresponding network. In this case, the final power was set to 10. (C) Dendrogram with color thresholding using the dynamic tree cut method for the comparison of DS with control. The x-axis denotes the gene modules, represented by different colors, and the y-axis shows the height of the tree (dendrogram).
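Soft-thresholding, as referred to in panel (B), is commonly done by raising the absolute correlation between two features to a power. The sketch below assumes that unsigned form, which the caption does not state. Only the power of 10 comes from the caption; the function names and the toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def soft_threshold_adjacency(profiles, power):
    """Adjacency a_ij = |cor(x_i, x_j)| ** power, with a zero diagonal."""
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(profiles[i], profiles[j])) ** power
            adj[i][j] = adj[j][i] = a
    return adj

def connectivity(adj):
    """Row sums of the adjacency matrix (each feature's total connection strength)."""
    return [sum(row) for row in adj]

# Toy methylation profiles (hypothetical values), one list per feature.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 2, 4]]
adj = soft_threshold_adjacency(profiles, power=10)
print(connectivity(adj))
```

A larger power pushes weak correlations towards zero while leaving strong ones nearly unchanged, which is what moves the connectivity distribution towards a scale-free shape.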
Comparison of various cluster validity indices between our proposed method and k-means clustering for AD vs control.
| Cluster validity index | Proposed method | K-means clustering |
|---|---|---|
| Ball_hall | 0.331 | 0.111 |
| Davies_bouldin | 5.205 | 2.090 |
| Dunn | 0.082 | 0.009 |
| G_plus | 0.136 | 0.070 |
| Gdi11 | 0.082 | 0.009 |
| Gdi12 | 0.363 | 0.068 |
| Gdi31 | 0.394 | 0.209 |
| Ray_turi | 14.018 | 2.229 |
Within each row (cluster validity index), the higher value is the better one.
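As an illustration of how one of these indices is obtained, the sketch below computes the Dunn index: the smallest distance between points in different clusters divided by the largest within-cluster diameter. The Gdi11 row reports the same values as the Dunn row, which is consistent with GDI11 being the classical Dunn index. The function name and toy points are assumptions for illustration only.

```python
import math

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    # Largest distance between any two points that share a cluster.
    max_diameter = max(math.dist(a, b) for g in groups for a in g for b in g)
    # Smallest distance between any two points in different clusters.
    min_separation = min(
        math.dist(a, b)
        for i, g in enumerate(groups)
        for h in groups[i + 1:]
        for a in g
        for b in h
    )
    return min_separation / max_diameter

# Toy data (hypothetical): two tight pairs that are far apart.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(dunn_index(points, [0, 0, 1, 1]))  # compact, well-separated partition
print(dunn_index(points, [0, 1, 0, 1]))  # partition that mixes the two pairs
```

The first partition scores 5.0 and the second 0.2, so a higher Dunn value indicates more compact and better separated clusters.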
Comparison of various cluster validity indices between our proposed method and k-means clustering for DS vs control.
| Cluster validity index | Proposed method | K-means clustering |
|---|---|---|
| Ball_hall | 0.221 | 0.192 |
| Davies_bouldin | 3.040 | 2.259 |
| Dunn | 0.071 | 0.014 |
| G_plus | 0.089 | 0.084 |
| Gdi11 | 0.071 | 0.014 |
| Gdi12 | 0.325 | 0.085 |
| Gdi31 | 0.257 | 0.233 |
| Ray_turi | 5.746 | 2.770 |
Within each row (cluster validity index), the higher value is the better one.
Figure 4. Area under the curve (AUC) result with 2-fold cross-validation and heatmap of the gene signature for AD vs control. (A) Empirical and smoothed patterns for the specificity vs sensitivity plot in AUC. (B) AUC plot (AUC value ). (C) Heatmap for cluster 5 (the gene signature, shown in red), containing 19 hyper-methylated and 2 hypo-methylated genes for AD vs control, where “AD” and “ctr” on the x-axis stand for Alzheimer’s disease samples and control samples, respectively.
Classification accuracy, area under the curve (AUC), and precision by cross-validation (CV) in two comparisons.
| Comparison | CV | Avg. accuracy | Avg. AUC | Avg. precision |
|---|---|---|---|---|
| AD vs control | 2 fold | 0.921 | 0.795 | 0.967 |
| | 4 fold | 0.929 | 0.783 | 1 |
| | 5 fold | 0.929 | 0.771 | 1 |
| DS vs control | 2 fold | 0.700 | 0.664 | 0.736 |
| | 5 fold | 0.705 | 0.676 | 0.783 |
| | 8 fold | 0.700 | 0.673 | 0.754 |
CV cross-validation, AUC area under the curve, AD Alzheimer’s disease, DS Down syndrome.
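Averages like those in the table above come from scoring a classifier on each held-out fold and then taking the mean across folds. The sketch below shows that bookkeeping for accuracy and precision only; AUC is omitted for brevity. The classifier is not named in this section, so the nearest-centroid rule used here is purely a stand-in, and all function names and data are hypothetical.

```python
import math

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def cross_validate(X, y, k):
    """Mean accuracy and mean precision of a nearest-centroid classifier over k folds.
    Assumes every training split contains both classes (labels 0 and 1)."""
    accuracies, precisions = [], []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        c1 = centroid([X[i] for i in train if y[i] == 1])
        c0 = centroid([X[i] for i in train if y[i] == 0])
        tp = fp = correct = 0
        for i in test_idx:
            pred = 1 if math.dist(X[i], c1) < math.dist(X[i], c0) else 0
            correct += pred == y[i]
            tp += pred == 1 and y[i] == 1
            fp += pred == 1 and y[i] == 0
        accuracies.append(correct / len(test_idx))
        if tp + fp:  # precision is undefined when nothing is predicted positive
            precisions.append(tp / (tp + fp))
    mean_precision = sum(precisions) / len(precisions) if precisions else float("nan")
    return sum(accuracies) / len(accuracies), mean_precision

# Toy data (hypothetical): one feature, classes interleaved so each fold sees both.
X = [[0.1], [0.9], [0.2], [0.8], [0.15], [0.85], [0.05], [0.95]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
print(cross_validate(X, y, k=2))
```

Because the folds are contiguous, samples should be shuffled or interleaved by class beforehand. Otherwise a training split can end up with only one class.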
Top significant GO terms enriched with the Alzheimer’s disease specific genes.
| GO | Enrichment p value | |
|---|---|---|
| GO-CC: GO:0016021 Integral component of membrane | 0.031 | |
| GO-MF: GO:0005198 Structural molecule activity | 0.011 | |
| GO-CC: GO:0001533 Cornified envelope | 0.027 | |
| GO-BP: GO:0031424 Keratinization | 0.031 | |
| GO-CC: GO:0016021 Integral component of membrane | 0.031 | |
| GO-BP: GO:0018149 Peptide cross-linking | 0.033 |
Gsig: genes belonging to the gene signature. GO terms fall into three domains: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF).
Top significant KEGG pathways and GO terms enriched with Down syndrome specific genes.
| KEGG pathway or GO | Enrichment p value | |
|---|---|---|
| KEGG pathway: hsa05204: Chemical carcinogenesis | ||
| KEGG pathway: hsa00982: Drug metabolism-cytochrome P450 | 0.011 | |
| KEGG pathway: hsa04740: Olfactory transduction | ||
| GO-MF: GO:0004872 Receptor activity | ||
| GO-MF: GO:0030246 Carbohydrate binding | ||
| GO-BP: GO:0006935 Chemotaxis | ||
| GO-BP: GO:0070098 Chemokine-mediated signaling pathway | ||
| KEGG pathway: hsa04060: Cytokine–cytokine receptor interaction | ||
| KEGG pathway: hsa04062: Chemokine signaling pathway | 0.044 | |
| GO-BP: GO:0006955 Immune response | ||
| GO-BP: GO:0030216 Keratinocyte differentiation | ||
| KEGG pathway: hsa04740: Olfactory transduction | ||
| KEGG pathway: hsa04740: Olfactory transduction | ||
| GO-BP: GO:0006955 Immune response | ||
| GO-CC: GO:0005576 Extracellular region | ||
| GO-CC: GO:0016021 Integral component of membrane | ||
| GO-BP: GO:0001580 Detection of chemical stimulus involved in sensory perception of bitter taste | ||
| GO-BP: GO:0006955 Immune response | ||
| GO-BP: GO:0006955 Immune response | ||
| GO-MF: GO:0004872 Receptor activity | ||
| KEGG pathway: hsa04060: Cytokine–cytokine receptor interaction |
Gsig: genes belonging to the gene signature. GO terms fall into three domains: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF).
Figure 5. Plots for Voom normalization, dendrogram, KNN distance, and partitioning clustering using the DBSCAN clustering algorithm for the dataset with NCBI GEO ID GSE134379 (AD vs control in CBL). (A) Voom normalization plot for the additional data (AD vs control in CBL). Voom normalization was applied to the non-noisy features of the clusters retained after pre-filtering by DBSCAN clustering. (B) Dendrogram with color thresholding using the dynamic tree cut method for the comparison of AD with control in CBL. (C) KNN distance plot used to find the knee point (marked by the red dotted line), which serves as EPS in the DBSCAN clustering algorithm for AD vs control in CBL. (D) Partitioning clustering plot from the DBSCAN clustering algorithm for AD vs control in CBL. Two clusters were identified: the blue cluster contained 19,711 core (seed) features and 71 border features, and the red cluster had only 10 core features. In addition, a total of 398 unclustered (outlier/noisy) features (denoted by light green dots) were identified.