John Matta, Junya Zhao, Gunes Ercal, Tayo Obafemi-Ajayi.
Abstract
With the growing ubiquity of data in network form, clustering in the context of a network, represented as a graph, has become increasingly important. Clustering is a useful exploratory machine learning tool that helps make sense of heterogeneous data by grouping data points with similar attributes according to some criterion. This paper investigates the application of a novel graph-theoretic clustering method, Node-Based Resilience clustering (NBR-Clust), to address the heterogeneity of Autism Spectrum Disorder (ASD) and identify meaningful subgroups. The hypothesis is that analysis of these subgroups would reveal relevant biomarkers that provide a better understanding of ASD phenotypic heterogeneity and are useful for further ASD studies. We address graph constructions suited to representing the ASD phenotype data. The sample population is drawn from a very large, rigorous dataset: the Simons Simplex Collection (SSC). Analysis of the results, performed using graph quality measures, internal cluster validation measures, and clinical outcome analysis, demonstrates the potential usefulness of resilience measure clustering for biomedical datasets. We also conduct feature extraction analysis to characterize relevant biomarkers that delineate the resulting subgroups. The optimal results predominantly favored a 5-cluster configuration.
Keywords: Autism spectrum disorders; Clustering; Graph theory; Resilience measures
Year: 2018 PMID: 30839816 PMCID: PMC6214326 DOI: 10.1007/s41109-018-0093-0
Source DB: PubMed Journal: Appl Netw Sci ISSN: 2364-8228
Summary of internal cluster validation used to determine optimal clustering configuration
| Validation Metric | Mathematical Description | Optimal Value |
|---|---|---|
| Silhouette index (SI) | | Max |
| Calinski-Harabasz index (CH) | | Max |
| Davies-Bouldin index (DB) | | Min |
| Dunn's index | | Max |
| Xie-Beni index (XB) | | Min |
| SD validity index (SD) | | Min |
| S_Dbw validity index (S_Dbw) | | Min |
| I index | | Max |
| CVNN index | | Min |
D denotes the data set; N: number of objects in D; C: center of D;
k: number of clusters; C_i: the i-th cluster; n_i: number of objects in C_i;
c_i: center of C_i; d(x, y): distance between x and y; NN: number of nearest neighbors
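To make the "Max" and "Min" column concrete, the following is a minimal sketch of three of the listed indices (Silhouette, Davies-Bouldin, Calinski-Harabasz) in plain Python. It assumes Euclidean distance and the common textbook definitions, with the silhouette averaged over points; the paper's exact formulas may differ in detail. The function names and the toy data are hypothetical and exist only for illustration.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def silhouette(clusters):
    """Mean silhouette width over all points (assumed variant); higher is better."""
    scores = []
    for i, ci in enumerate(clusters):
        for x in ci:
            if len(ci) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a: mean distance to the other members of x's own cluster
            a = sum(math.dist(x, y) for y in ci) / (len(ci) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(sum(math.dist(x, y) for y in cj) / len(cj)
                    for j, cj in enumerate(clusters) if j != i)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def davies_bouldin(clusters):
    """Average worst-case scatter-to-separation ratio; lower is better."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(math.dist(x, c) for x in cl) / len(cl)
               for cl, c in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

def calinski_harabasz(clusters):
    """Between-cluster over within-cluster dispersion; higher is better."""
    all_pts = [p for c in clusters for p in c]
    n, k = len(all_pts), len(clusters)
    g = centroid(all_pts)
    cents = [centroid(c) for c in clusters]
    between = sum(len(cl) * math.dist(c, g) ** 2 for cl, c in zip(clusters, cents))
    within = sum(math.dist(x, c) ** 2 for cl, c in zip(clusters, cents) for x in cl)
    return (between / (k - 1)) / (within / (n - k))

# Toy data: the same six points, grouped well ("tight") and badly ("mixed").
tight = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
mixed = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

for name, part in (("tight", tight), ("mixed", mixed)):
    print(name, silhouette(part), davies_bouldin(part), calinski_harabasz(part))
```

On this toy data the well-separated grouping scores higher on the two "Max" indices and lower on the "Min" index, which is the direction the table prescribes.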
Description of 36 phenotype features used to cluster ASD sample
| Category | ASD phenotype features |
|---|---|
| | ADOS communication & social interaction score |
| | ADOS restricted & repetitive behavior score |
| | ADOS Social Affect score |
| | Social score (ADI-R A) |
| | Verbal score (ADI-R B) |
| | Repetitive and stereotyped patterns of behavior (ADI-R C) |
| | Abnormality evidence (ADI-R Q86) |
| | |
| | Vineland social score |
| | Vineland daily living skills score |
| | Verbal & non-verbal IQ score |
| | |
| | Vineland communication score |
| | Regression |
| | Word delay |
| | Overall Level of Language (ADI-R Q30) |
| | |
| | ABC |
| | RBS |
| | CBCL |
| | SRS |
| | SRS |
| | |
| | BAPQ |
ABC: Aberrant Behavior Checklist;
RBS: Repetitive Behavior Scale;
CBCL: Child Behavior Checklist;
SRS: Social Responsiveness Scale;
BAPQ: Broader Autism Phenotype Questionnaire
Optimal Cluster configuration by graph type and resilience measures
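| Measure | Complete clustering: Integrity k=3 | Complete clustering: Tenacity k=5 | Complete clustering: VAT k=4 | Complete clustering: kNN3 Integrity (a) | No node reassignment: VAT k=2 | No node reassignment: Integrity k=5 | No node reassignment: Tenacity k=5 |
|---|---|---|---|---|---|---|---|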
| Silhouette | 0.11 | 0.05 | 0.07 | 0.07 | 0.12 | 0.04 | 0.05 |
| Davies-Bouldin | 3.18 | 4.28 | 4.19 | 4.40 | 3.37 | 3.66 | 3.75 |
| Xie-Beni | 3.38 | 7.16 | 8.10 | 8.92 | 3.01 | 5.77 | 6.48 |
| Dunn | 0.13 | 0.15 | 0.15 | 0.17 | 0.14 | 0.14 | 0.14 |
| Calinski-Harabasz | 152.57 | 154.11 | 166.22 | 165.32 | 167.71 | 142.52 | 141.58 |
| I Index | 0.14 | 0.08 | 0.12 | 0.08 | 0.12 | 0.06 | 0.09 |
| SD Index | 9.96 | 14.62 | 14.40 | 20.10 | 8.52 | 7.71 | 9.10 |
| S_Dbw Index | 1.37 | 1.07 | 1.16 | 1.06 | 1.87 | 1.10 | 1.05 |
| CVNN Index | 1.38 | 0.95 | 1.21 | 0.54 | 2.00 | 2.00 | 2.00 |
| Separability | 31.63 | 11.14 | 20.21 | 8.34 | 8.53 | 11.39 | 15.25 |
| Modularity (> 0.6) | 0.42 | 0.72 | 0.65 | 0.68 | 0.27 | 0.67 | 0.68 |
| Conductance (< 0.07) | 0.02 | 0.04 | 0.03 | 0.06 | 0.06 | 0.05 | 0.04 |
(a) kNN3 using the Integrity measure on correlation-filtered data
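The last two rows of the table report graph quality measures with thresholds (modularity > 0.6, conductance < 0.07). As a minimal sketch of what those quantities measure, the code below computes Newman modularity for a partition and the conductance of a single cluster on an unweighted, undirected edge list. These are the standard textbook definitions and are an assumption here: the source does not state how the per-configuration conductance value was aggregated across clusters. The function names and toy graph are hypothetical.

```python
def modularity(edges, labels):
    """Newman modularity: sum over clusters of (inside/m - (degree/(2m))^2)."""
    m = len(edges)
    inside = {}   # number of edges with both endpoints in the cluster
    degree = {}   # sum of node degrees per cluster
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            inside[labels[u]] = inside.get(labels[u], 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

def conductance(edges, labels, cluster):
    """Edges leaving the cluster divided by the smaller side's total degree."""
    cut = vol_in = vol_out = 0
    for u, v in edges:
        in_u = labels[u] == cluster
        in_v = labels[v] == cluster
        cut += in_u != in_v                  # edge crosses the boundary
        vol_in += in_u + in_v                # degree mass inside the cluster
        vol_out += (not in_u) + (not in_v)   # degree mass outside the cluster
    return cut / min(vol_in, vol_out)

# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

q = modularity(edges, labels)       # 5/14, about 0.357
phi = conductance(edges, labels, 0) # 1/7, about 0.143
print(q, phi)
```

High modularity together with low conductance indicates clusters that are densely connected internally and sparsely connected to the rest of the graph, which is why the table treats one as a lower bound and the other as an upper bound.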
Fig. 1 Visualization of optimal clustering results by resilience measure. NR indicates no reassignment of attack set nodes. a kNN2 Integrity k=3. Red nodes denote C0, Blue: C2, and Green: C1. b kNN2 VAT k=2 NR. Red nodes denote C0, and Blue: C1. c kNN2 VAT k=4. Red nodes denote C2, Blue: C3, Purple: C0, and Green: C1. d kNN2 Integrity k=5 NR. Red nodes denote C0, Blue: C2, Gold: C1, Purple: C3, and Green: C4. e kNN2 Tenacity k=5. Red nodes denote C2, Blue: C3, Gold: C0, Purple: C1, and Green: C4. f kNN2 Tenacity k=5 NR. Red nodes denote C1, Blue: C2, Gold: C3, Purple: C0, and Green: C4
Fig. 2 Visualization of the graph of the k=5 optimal clustering result for kNN3 with Integrity using the correlation-filtered set. Red nodes denote C2, Blue: C4, Gold: C0, Purple: C3, and Green: C1
Demographics per cluster configuration with node reassignment
| Measure | Integrity k=3: C0 | Integrity k=3: C1 | Integrity k=3: C2 | VAT k=4: C0 | VAT k=4: C1 | VAT k=4: C2 | VAT k=4: C3 | Tenacity k=5: C0 | Tenacity k=5: C1 | Tenacity k=5: C2 | Tenacity k=5: C3 | Tenacity k=5: C4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean age | 9.0 | 8.4 | 8.5 | 8.9 | 8.6 | 8.9 | 8.6 | 9.0 | 9.0 | 8.9 | 8.6 | 8.6 |
| % Caucasian | 81.0 | 64.5 | 73.3 | 78.3 | 64.4 | 83.9 | 71.5 | 82.6 | 77.5 | 83.7 | 71.9 | 69.3 |
| % Male | 86.4 | 84.1 | 87.2 | 87.2 | 84.9 | 85.8 | 87.2 | 85.4 | 89.7 | 86.6 | 87.4 | 82.0 |
Demographics per cluster configuration without node reassignment
| Measure | kNN2 Tenacity k=5: C0 | kNN2 Tenacity k=5: C1 | kNN2 Tenacity k=5: C2 | kNN2 Tenacity k=5: C3 | kNN2 Tenacity k=5: C4 | kNN2 Tenacity k=5: S | kNN2 Integrity k=5: C0 | kNN2 Integrity k=5: C1 | kNN2 Integrity k=5: C2 | kNN2 Integrity k=5: C3 | kNN2 Integrity k=5: C4 | kNN2 Integrity k=5: S | kNN2 VAT k=2: C0 | kNN2 VAT k=2: C1 | kNN2 VAT k=2: S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean age | 8.8 | 9.0 | 8.7 | 8.9 | 8.6 | 9.32 | 8.9 | 8.8 | 8.7 | 9.3 | 8.4 | 9.6 | 8.9 | 8.6 | 8.6 |
| % Caucasian | 78.7 | 85.3 | 71.1 | 74.5 | 64.5 | 77.8 | 83.8 | 77.4 | 71.6 | 79.3 | 67.2 | 83.3 | 79.9 | 71.7 | 74.5 |
| % Male | 88.3 | 85.3 | 86.3 | 86.1 | 84.9 | 100.0 | 86.0 | 85.1 | 87.3 | 88.5 | 85.4 | 94.4 | 86.1 | 87.2 | 90.2 |
Statistical analysis of optimal clustering configurations (complete clustering) by graph type and node resilience measure using selected ASD outcome measures
| Cluster (size) | ABC overall | RBS-R overall | ADOS CSS | Vineland composite score | Overall IQ | PPVTA 4A | Epilepsy |
|---|---|---|---|---|---|---|---|
| kNN 2 Integrity k=3 | |||||||
| C0 (1903) | 45.73(25.8) | 27.31(18.1) | 7.37(1.7) | 75.72(10.9) | 88.19(23.5) | 91.51(24.7) | 1.74% |
| C1 (189) | 54.08(24.6) | 29.01(13.8) | 7.58(1.5) | 57.57(9.6) | 38.91(18.7) | 41.75(22.3) | 6.91% |
| C2 (588) | 47.55(25.9) | 26.26(16.3) | 7.63(1.6) | 70.58(12.3) | 69.99(27.6) | 72.41(28.8) | 2.73% |
| ANOVA | < 0.001 | 0.15 | 0.003 | < 0.001 | < 0.001 | < 0.001 | |
| Tukey HSD (NS) | C0:C2 | All pairs | C1:C0,C2 | None | None | None | |
| Eta-squared (η²) | 0.007 | 0.001 | 0.004 | 0.157 | 0.244 | 0.209 | |
| kNN 2 VAT k=4 | |||||||
| C0 (811) | 50.86 (28.0) | 31.76 (19.6) | 7.48 (1.7) | 72.69 (10.3) | 80.06 (22.8) | 82.98 (24.1) | 2.47% |
| C1 (219) | 60.02 (28.5) | 32.33 (17.4) | 7.68 (1.5) | 57.74 (9.4) | 39.55 (18.7) | 41.93 (21.3) | 5.99% |
| C2 (1117) | 41.55 (22.2) | 23.89 (15.4) | 7.24 (1.7) | 78.16 (10.4) | 94.77 (20.8) | 98.34 (22.1) | 1.25% |
| C3 (535) | 45.73 (25.2) | 25.08 (16.0) | 7.70 (1.6) | 70.50 (12.8) | 69.12 (28.2) | 71.39 (29.1) | 2.82% |
| ANOVA | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
| Tukey HSD (NS) | None | C0:C1;C2:C3 | C0:C1,C3 C1:C3 | None | None | None | |
| Eta-squared (η²) | 0.047 | 0.046 | 0.013 | 0.212 | 0.322 | 0.288 | |
| kNN 2 Tenacity k=5 | |||||||
| C0 (535) | 67.12(21.5) | 41.78(18.2) | 7.30(1.7) | 73.71(9.7) | 91.46(21.6) | 96.46 (23.8) | 1.31% |
| C1 (497) | 36.57(19.6) | 22.10(12.6) | 7.40(1.6) | 74.56(10.2) | 80.65(22.5) | 82.57 (23.4) | 2.42% |
| C2 (781) | 33.78(19.3) | 18.34(11.4) | 7.22(1.7) | 78.85(11.0) | 92.99(22.8) | 96.59(22.9) | 1.41% |
| C3 (484) | 44.05(23.9) | 25.35(15.5) | 7.58(1.7) | 74.29(11.0) | 79.31(23.7) | 81.04(24.4) | 2.07% |
| C4 (383) | 61.19(27.5) | 33.83(18.9) | 7.97(1.5) | 58.62(9.5) | 42.20(19.4) | 44.86(22.8) | 5.76% |
| ANOVA | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
| Tukey HSD (NS) | C1:C2 | None | C0:C1,C2,C3 C1:C2,C3 | C0:C1,C3 C1:C3 | C0:C2;C1:C3 | C0:C2;C1:C3 | |
| Eta-squared (η²) | 0.274 | 0.253 | 0.022 | 0.272 | 0.361 | 0.333 | |
| kNN 3 Integrity k=5 corr | |||||||
| C0 (462) | 66.56(24.0) | 41.65(18.3) | 7.54(1.7) | 73.26(9.6) | 87.95(22.9) | 92.57(24.5) | 0.87% |
| C1 (276) | 57.20(27.4) | 29.89(15.5) | 7.39(1.4) | 57.38(8.9) | 37.27(17.8) | 39.84(21.8) | 5.82% |
| C2 (743) | 33.48(18.7) | 17.39(10.6) | 7.18(1.7) | 79.86(10.6) | 96.22(20.6) | 99.49(21.6) | 1.62% |
| C3 (744) | 46.25(24.8) | 28.80(17.8) | 7.53(1.6) | 72.53(10.4) | 78.72(23.3) | 80.97(24.9) | 3.10% |
| C4 (455) | 42.58(23.2) | 24.26(14.9) | 7.66(1.8) | 73.63(12.0) | 77.54(25.2) | 79.01(26.2) | 1.54% |
| ANOVA | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
| Tukey HSD (NS) | C3:C4 | C1:C3 | C0:C1,C3,C4 C1:C2,C3,C4 C3:C4 | C0:C3,C4 C3:C4 | C3:C4 | C3:C4 | |
| Eta-squared (η²) | 0.197 | 0.215 | 0.011 | 0.258 | 0.354 | 0.306 | |
NS: pairs for which the Tukey HSD test was not significant.
Values for each measure are given as mean (standard deviation).
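The eta-squared effect sizes in the table can be approximately recovered from the per-cluster sizes, means, and standard deviations alone. The sketch below does this for one-way ANOVA, assuming the reported standard deviations are sample standard deviations (n - 1 denominator) and that every cluster member contributed a value to the measure; both are assumptions, and missing data in the original analysis would shift the result slightly. The function names are hypothetical.

```python
def sums_of_squares(groups):
    """groups: list of (n, mean, sd) tuples, one per cluster.

    Returns (SS_between, SS_within) for a one-way ANOVA, reconstructed
    from summary statistics instead of raw observations.
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    return ss_between, ss_within

def eta_squared(groups):
    """Share of total variance explained by cluster membership."""
    ss_between, ss_within = sums_of_squares(groups)
    return ss_between / (ss_between + ss_within)

def f_statistic(groups):
    """One-way ANOVA F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    ss_between, ss_within = sums_of_squares(groups)
    k = len(groups)
    total_n = sum(n for n, _, _ in groups)
    return (ss_between / (k - 1)) / (ss_within / (total_n - k))

# Overall IQ for the kNN2 Integrity k=3 configuration, read from the table:
# (cluster size, mean, standard deviation)
iq_integrity_k3 = [
    (1903, 88.19, 23.5),  # C0
    (189, 38.91, 18.7),   # C1
    (588, 69.99, 27.6),   # C2
]

eta2 = eta_squared(iq_integrity_k3)  # approximately 0.244, as reported
f_val = f_statistic(iq_integrity_k3)
print(round(eta2, 3), round(f_val, 1))
```

Read this way, cluster membership accounts for roughly a quarter of the variance in overall IQ under this configuration, while the same calculation on ADOS CSS yields effect sizes near zero even where the ANOVA p-value is small, which reflects the large sample size.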
Statistical analysis of optimal clustering configurations using selected ASD outcome measures: for kNN2 graphs without node reassignment
| Cluster (size) | ABC overall | RBS-R overall | ADOS CSS | Vineland composite score | Overall IQ | PPVTA 4A | Epilepsy |
|---|---|---|---|---|---|---|---|
| kNN 2 VAT k=2 | |||||||
| C0 (2072) | 47.01(26.1) | 27.74(17.9) | 7.38(1.7) | 73.99(12.0) | 83.50(27.1) | 87.51(27.9) | 2.27% |
| C1 (506) | 45.66(25.3) | 25.01(16.1) | 7.69(1.6) | 70.47(12.9) | 69.09(28.5) | 71.46(29.3) | 2.77% |
| S (102) | 46.03(21.6) | 26.89(15.1) | 7.52(1.5) | 73.71(9.6) | 81.75(23.2) | 77.94(35.1) | 0.98% |
| ANOVA | 0.295 | 0.002 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
| Eta-squared (η²) | 0 | 0.004 | 0.005 | 0.013 | 0.042 | 0.049 | |
| kNN 2 Integrity k=5 | |||||||
| C0 (1096) | 41.02(22.0) | 23.30(14.8) | 7.19(1.7) | 78.36(10.3) | 94.63(20.6) | 97.91(22.2) | 1.37% |
| C1 (430) | 67.76(24.6) | 43.94(19.2) | 7.61(1.7) | 70.22(9.7) | 78.58(23.8) | 83.02(25.0) | 1.86% |
| C2 (402) | 41.49(23.8) | 24.42(15.8) | 7.57(1.7) | 73.88(11.3) | 78.34(24.3) | 79.36(25.4) | 2.00% |
| C3 (384) | 32.54(18.1) | 18.99(10.5) | 7.47(1.6) | 75.41(10.6) | 82.90(21.9) | 84.73(23.3) | 2.09% |
| C4 (350) | 59.67(27.2) | 30.74(17.2) | 7.83(1.5) | 58.58(9.9) | 40.55(19.3) | 43.70(22.9) | 6.03% |
| S (18) | 56.33(22.6) | 32.50(15.0) | 7.72(1.9) | 69.33(14.2) | 68.78(36.4) | 65.56(46.7) | 11.1% |
| ANOVA | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
| Tukey HSD (NS) | C0:C2 | C0:C2 | C1:C2,C3,C4; C2:C3,C4 | C2:C3 | C1:C2 | C1:C2,C3 | |
| Eta-squared (η²) | 0.211 | 0.210 | 0.019 | 0.279 | 0.383 | 0.329 | |
| kNN 2 Tenacity k=5 | |||||||
| C0 (741) | 47.01(26.1) | 29.38(18.8) | 7.46(1.6) | 73.25(9.9) | 81.26(22.4) | 83.63(23.3) | 2.30% |
| C1 (951) | 41.23(22.4) | 22.84(14.8) | 7.08(1.7) | 78.68(10.5) | 95.58(20.3) | 99.41(21.6) | 1.26% |
| C2 (591) | 45.06(26.3) | 25.66(17.3) | 7.61(1.7) | 71.44(13.3) | 70.86(28.5) | 72.95(29.9) | 2.54% |
| C3 (216) | 68.07(25.8) | 41.92(17.0) | 8.56(1.5) | 68.76(8.9) | 75.79(27.1) | 80.39(28.1) | 2.33% |
| C4 (172) | 54.72(25.2) | 28.90(14.0) | 7.35(1.4) | 56.13(8.9) | 36.17(16.6) | 37.64(18.6) | 7.60% |
| S (9) | 46.22(19.4) | 23.56(15.7) | 7.78(1.0) | 72.67(11.8) | 79.00(23.5) | 81.88(34.3) | 0% |
| ANOVA | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | < 0.001 | |
| Tukey HSD (NS) | C0:C2 | C4:C0,C2 | C0:C2,C4 C4:C1,C2 | None | C2:C3 | C0:C3 | |
| Eta-squared (η²) | 0.078 | 0.086 | 0.054 | 0.214 | 0.298 | 0.270 | |
NS: pairs for which the Tukey HSD test was not significant. S is not included in the ANOVA, Tukey HSD, and eta-squared analyses.
Values for each measure are given as mean (standard deviation).
Fig. 3 Visualization of the optimal clustering result for the kNN2 Tenacity 5-cluster graph in terms of the distribution of high overall IQ (≥ 70) vs. lower IQ (< 70). Large circles denote high IQ while small circles denote low IQ. Only the green cluster shows a high concentration of low-IQ nodes. This demonstrates that the clustering obtained reflects a combination of factors, not just IQ scores
Set of discriminant features by clustering result for complete clustering configuration
| Integrity k=3 | Tenacity (corr) k=5 | VAT k=4 | kNN3 Integrity k=5 |
|---|---|---|---|
| ADI-R Q30 (Overall level of language) | ADI-R Q30 (Overall level of language) | ADI-R Q30 (Overall level of language) | ADI-R Q30 (Overall level of language) |
| ADI-R Q86 (Abnormality evidence) | RBS-R (Ritualistic Behavior) | ADI-R Q86 (Abnormality evidence) | ABC-Inappropriate speech |
| CBCL Externalizing T Score | ABC-Irritability | Verbal score (ADI-R B) | RBS-R-Stereotyped behavior |
| Regression | BAPQ Avg (Mother) | BAPQ Avg (Mother) | BAPQ Avg (Mother) |
| | Regression | Regression | Regression |
| | ADOS Social Affect | ADI-R C (Repetitive behavior) | ADI-R C (Repetitive behavior) |
| | Social (ADI-R A) | SRS Mannerisms | Social (ADI-R A) |
| | SRS T score | Word delay | SRS cognition |
| | Word delay | | Word delay |
Set of discriminant features by clustering result for no node reassignment
| VAT k=2 | Integrity k=5 | Tenacity k=5 |
|---|---|---|
| CBCL externalizing T score | ABC-Inappropriate speech | ABC-Inappropriate speech |
| BAPQ Avg (Mother) | ADOS social affect | ADOS communication & social |
| Regression | BAPQ Avg (Mother) | ADI-R Q30 (Overall level of language) |
| Verbal IQ | ADI-R Q30 (Overall level of language) | Regression |
| | Regression | Social (ADI-R A) |
| | SRS T score | Word delay |
| | Verbal score (ADI-R B) | |
| | Word delay | |
Fig. 4 Analysis of ASD outcome measures (values normalized using known feature ranges) across clusters for the kNN2 Tenacity 5-cluster configuration. The colors of the boxes correspond to the colors of the clusters in Fig. 1e. Gold denotes C0, Purple: C1, Red: C2, Blue: C3, and Green: C4. a Outcome measures for which values are positively correlated with ASD severity. b Outcome measures for which values are inversely correlated with ASD severity