Lian Beijers, Hanna M van Loo, Jan-Willem Romeijn, Femke Lamers, Robert A Schoevers, Klaas J Wardenaar.
Abstract
BACKGROUND: Cluster analyses have become popular tools for data-driven classification in biological psychiatric research. However, these analyses are known to be sensitive to the chosen methods and/or modelling options, which may hamper generalizability and replicability of findings. To gain more insight into this problem, we used Specification-Curve Analysis (SCA) to investigate the influence of methodological variation on biomarker-based cluster-analysis results.
Keywords: biochemistry; cluster analysis; complexity; heterogeneity; psychiatry; specification-curve analysis; subtyping
Year: 2020 PMID: 32779563 PMCID: PMC9069352 DOI: 10.1017/S0033291720002846
Source DB: PubMed Journal: Psychol Med ISSN: 0033-2917 Impact factor: 10.592
Fig. 1. Flowchart of the complete analytical process, including real data preparation, data simulation and specification curve analysis.
Biochemical analytes and associated biological processes
| Analyte | Biological process |
|---|---|
| Alpha-1-antichymotrypsin | PM |
| Alpha-1-antitrypsin | PM |
| CD40 antigen | CC,ST |
| Complement factor H-related protein 1 | IM |
| EN-RAGE | CC,ST |
| Growth-regulated alpha protein | IM |
| Interleukin-12P40 | IM |
| Interleukin-1 receptor antagonist | CC,ST |
| Macrophage migration inhibitory factor | CC,ST |
| Lactoylglutathione lyase (not included because of high correlation with MIF) | M |
| Insulin growth factor-binding protein-5 | CC,ST |
| Urokinase-type plasminogen activator receptor | CC,ST |
| Cathepsin D | PM |
| Receptor tyrosine-protein kinase ERBB-3 | CC,ST |
| Hepsin | PL |
| Cellular fibronectin | CG |
| Matrix metalloproteinase-10 | PM |
| Matrix metalloproteinase-3 | PM |
| Tenascin C | CC,ST |
| Carcinoembryonic antigen | IM |
| Angiogenin | M |
| Angiopoietin 2 | CC,ST |
| Vascular endothelial growth factor | CC,ST |
| Apolipoprotein A4 | T |
| Apolipoprotein D | T |
| Fatty acid-binding protein, adipocyte | CC,ST |
| Pancreatic polypeptide | CC,ST |
| Von Willebrand factor | PM |
| Luteinizing hormone (not included because of high correlation with FSH) | CC,ST |
| Follicle-stimulating hormone | CC,ST |
| Cystatin C | PM |
| Fetuin-A | CC,ST |
| Prostasin | PM |
CC, cell-cell communication; CG, cell growth/maintenance; IM, immune response; M, metabolism; PL, proteolysis and peptidolysis; PM, protein metabolism; ST, signal transduction; T, transport.
From the Human Protein Reference Database, according to Bot et al. (2015).
Fig. 2. Descriptive Specification Curve in the sample with MDD subjects only, with small clusters (⩽1% of subjects) removed. Each black dot in the top panel depicts an estimate of the optimal number of clusters (K) from a different specification; the dots vertically aligned in the lower panel indicate the analytic decisions behind those estimates. The green lines indicate the expected range of results at each position. N.B. this is not the expected range of the specific combination of options, but rather the range of the m
Stability measures of models with different numbers of clusters (K) for the MDD dataset
| K | Number of models, % of 1200 (n) | Distinct solutions, n | Dominant solution, % (n) | Unique solutions, % (n) |
|---|---|---|---|---|
| 1 | 60.2 (722) | | | |
| 2 | 14.3 (172) | 15 | 33.1 (57) | 0.6 (1) |
| 3 | 3.5 (42) | 8 | 57.1 (24) | 2.4 (1) |
| 4 | 2.4 (29) | 7 | 34.5 (10) | 3.4 (1) |
| 5 | 2.2 (26) | 9 | 38.5 (10) | 11.5 (3) |
| 6 | 1.2 (15) | 7 | 46.7 (7) | 26.7 (4) |
| 7 | 1.8 (22) | 6 | 36.4 (8) | 0 (0) |
| 8 | 1.1 (13) | 5 | 53.8 (7) | 23.1 (3) |
| 9 | 0.8 (10) | 5 | 30 (3) | 10 (1) |
| 10 | 0.6 (7) | 6 | 28.6 (2) | 71.4 (5) |
| 11 | 0.6 (7) | 5 | 28.6 (2) | 42.9 (3) |
| 12 | 1.1 (13) | 7 | 30.8 (4) | 30.8 (4) |
| 13 | 2.8 (34) | 4 | 79.4 (27) | 5.9 (2) |
| 14 | 1.3 (16) | 7 | 37.5 (6) | 18.8 (3) |
| 15 | 2.5 (30) | 6 | 43.3 (13) | 0 (0) |
| Error | 3.5 (42) | | | |
Dominant solution: the model solution (i.e. specific division of subjects) that occurs most often within the group of models containing K clusters.
Unique solutions: the number of model solutions that occur only once.
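As an illustration of how the stability measures in this table could be tallied, the following is a minimal sketch, not the authors' code. It assumes each model's solution is available as a list of cluster labels per subject, and that two solutions count as the same division of subjects when they differ only in how the clusters are named. The function names `canonical` and `stability_summary` are hypothetical.

```python
from collections import Counter


def canonical(labels):
    """Relabel clusters in order of first appearance, so partitions that
    differ only in label names compare equal (an assumption of this sketch)."""
    mapping = {}
    out = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
        out.append(mapping[lab])
    return tuple(out)


def stability_summary(solutions):
    """Summarise a group of model solutions that share the same K.

    Percentages are taken relative to the number of models in the group.
    """
    counts = Counter(canonical(s) for s in solutions)
    n_models = len(solutions)
    dominant_n = max(counts.values())  # size of the most frequent solution
    unique_n = sum(1 for c in counts.values() if c == 1)  # occur only once
    return {
        "n_models": n_models,
        "distinct": len(counts),
        "dominant_n": dominant_n,
        "dominant_pct": round(100 * dominant_n / n_models, 1),
        "unique_n": unique_n,
        "unique_pct": round(100 * unique_n / n_models, 1),
    }


# Toy data: four hypothetical K = 2 models over four subjects.
toy = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same division as the first, labels swapped
    [0, 0, 1, 1],
    [0, 1, 0, 1],  # a division that occurs only once
]
summary = stability_summary(toy)
```

The percentages in the table appear to follow the same convention of dividing by the number of models with that K: for K = 2, the dominant solution covers 57 of 172 models (about 33.1%) and the single unique solution accounts for 1 of 172 (about 0.6%).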
Fig. 3. Specification curves based on simulated datasets with K = 2, with small clusters (⩽1% of subjects) removed. Each dot in the top panel depicts an estimate of the optimal number of clusters (K) from a different specification; the dots vertically aligned in the lower panel indicate the analytic decisions behind the estimates of the baseline analysis. N.B. the analytic decisions behind the other analyses are not presented here.
Fig. 4. Specification curves based on simulated datasets with K = 5, with small clusters (⩽1% of subjects) removed. Each dot in the top panel depicts an estimate of the optimal number of clusters (K) from a different specification; the dots vertically aligned in the lower panel indicate the analytic decisions behind the estimates of the baseline analysis. N.B. the analytic decisions behind the other analyses are not presented here.
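The figure captions report K "with small clusters (⩽1% of subjects) removed". One way such a filter might work is sketched below. This is an assumption about the procedure, not a description of the authors' implementation: clusters holding at most 1% of subjects are simply left out of the count, and the function name `effective_k` is hypothetical.

```python
from collections import Counter


def effective_k(labels, min_share=0.01):
    """Count clusters whose share of subjects exceeds min_share.

    Clusters containing <= min_share of all subjects are not counted
    (assumed reading of "small clusters removed").
    """
    n = len(labels)
    sizes = Counter(labels)
    return sum(1 for size in sizes.values() if size / n > min_share)


# Toy example: 200 hypothetical subjects; cluster 2 holds exactly 1% of them.
labels = [0] * 120 + [1] * 78 + [2] * 2
k_filtered = effective_k(labels)           # small cluster not counted
k_raw = effective_k(labels, min_share=0)   # every non-empty cluster counted
```

Under this reading, a specification that nominally returns three clusters contributes K = 2 to the curve when its third cluster contains only 2 of 200 subjects.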