Muhammad Asif, Hugo F M C Martiniano, Ana Rita Marques, João Xavier Santos, Joana Vilela, Celia Rasga, Guiomar Oliveira, Francisco M Couto, Astrid M Vicente.
Abstract
The complex genetic architecture of Autism Spectrum Disorder (ASD) and its heterogeneous phenotype make molecular diagnosis and patient prognosis challenging. To establish more precise genotype-phenotype correlations in ASD, we developed a novel integrative machine-learning approach that seeks to delineate associations between patients' clinical profiles and disrupted biological processes, inferred from their copy number variants (CNVs) that span brain genes. Clustering analysis of the relevant clinical measures from 2446 ASD cases in the Autism Genome Project identified two distinct phenotypic subgroups. Patients in these clusters differed significantly in ADOS-defined severity, adaptive behavior profiles, intellectual ability, and verbal status, with verbal status contributing the most to cluster stability and cohesion. Functional enrichment analysis of brain genes disrupted by CNVs in these ASD cases identified 15 statistically significant biological processes, including cell adhesion, neural development, cognition, and polyubiquitination, in line with previous ASD findings. A Naive Bayes classifier, built to predict the ASD phenotypic clusters from disrupted biological processes, achieved high precision (0.82) but low recall (0.39) for a subset of patients with higher biological Information Content scores. This study shows that milder and more severe clinical presentations can have distinct underlying biological mechanisms. It further highlights how machine-learning approaches can reduce clinical heterogeneity by using multidimensional clinical measures, and establishes genotype-phenotype correlations in ASD. However, predictions depend strongly on each patient's information content. These findings are therefore a first step toward the translation of genetic information into clinically useful applications, and they emphasize the need for larger datasets with very complete clinical and biological information.
Year: 2020 PMID: 32066720 PMCID: PMC7026098 DOI: 10.1038/s41398-020-0721-1
Source DB: PubMed Journal: Transl Psychiatry ISSN: 2158-3188 Impact factor: 6.222
Fig. 1 Integrative systems medicine approach to identify complex genotype–phenotype associations.
Clinical and genetic data from the Autism Genome Project (AGP) were used in this study. (a) Clinical data analysis: clinical data comprise reports of ASD diagnosis and neurodevelopmental assessment instruments. Agglomerative hierarchical clustering (AHC) was used to identify clinically similar subgroups of individuals in stable, validated clusters, defined by multiple clinical measures. (b) CNV data processing: rare high-confidence CNVs previously identified by the AGP, targeting brain-expressed genes, were retained for analysis. CNV data were merged with clinical data from clustered ASD subjects to produce a final list of CNVs targeting brain genes. (c) Functional annotation analysis: biological processes defined by brain-expressed genes targeted by CNVs were obtained by using g:Profiler. (d) Classifier design: a Naive Bayes machine-learning classifier was trained and tested on patients' data to predict the phenotypic clustering of patients from biological processes disrupted by rare CNVs targeting brain-expressed genes.
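The classifier design step can be illustrated with a minimal sketch, assuming each patient is encoded as a vector of binary flags that mark whether a given biological process is disrupted by that patient's CNVs. The feature encoding, the toy patients, the Laplace smoothing, and the function names are all hypothetical and are not taken from the study.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit a Bernoulli Naive Bayes model with Laplace smoothing (alpha)."""
    model = {}
    n_features = len(X[0])
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        log_prior = math.log(len(rows) / len(X))
        # Smoothed probability that each feature equals 1 within this class.
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[label] = (log_prior, probs)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for one patient."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, probs) in model.items():
        score = log_prior + sum(
            math.log(p if value else 1.0 - p) for value, p in zip(x, probs)
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: columns flag disruption of three processes
# (cell adhesion, nervous system development, cognition).
X = [[1, 0, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
y = ["cluster1", "cluster1", "cluster1", "cluster2", "cluster2", "cluster2"]

model = train_bernoulli_nb(X, y)
print(predict(model, [1, 0, 0]))
print(predict(model, [0, 1, 1]))
```

On this toy input the first patient is assigned to `cluster1` and the second to `cluster2`. Real data would need held-out evaluation, as in the performance table reported for the study.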
Clustering validation, after removal of weakly clustered individuals.
| Cluster validation measures | Cluster 1 | Cluster 2 |
|---|---|---|
| Cluster size (n) | 903 | 494 |
| Average distance between two patients | 0.235 | 0.231 |
| Silhouette value | 0.567 | 0.579 |
| Average Silhouette of both clusters | 0.571 | |
| Cluster stability | 0.998 | 0.996 |
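The overall average silhouette in the table is consistent with a size-weighted mean of the two per-cluster values. The following is a minimal check, assuming that is how the overall figure was obtained; the function name is illustrative.

```python
def weighted_mean_silhouette(sizes, silhouettes):
    """Average per-cluster silhouette values, weighted by cluster size."""
    total = sum(sizes)
    return sum(n * s for n, s in zip(sizes, silhouettes)) / total

# Cluster sizes and silhouette values from the validation table.
overall = weighted_mean_silhouette([903, 494], [0.567, 0.579])
print(round(overall, 3))  # 0.571
```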
Clusters 1 and 2 statistics for each clinical measure.
| Clinical measure | Clinically defined categories | Cluster 1, n (%) | Cluster 2, n (%) | P value |
|---|---|---|---|---|
| ADI-R verbal status | ADI-R-nonverbal | 0 (0) | 494 (100) | <0.00001 (a) |
| | ADI-R-verbal | 903 (100) | 0 (0) | |
| ADOS severity score | ADOS severity score Autism (score 6–10) | 714 (79.07) | 392 (79.35) | <0.00001 (b) |
| | ADOS severity score ASD (score 4–5) | 64 (7.09) | 102 (20.65) | |
| | ADOS severity score Non-spectrum (score 1–3) | 125 (13.84) | 0 (0) | |
| VABS communication | Dysfunctional VABS communication (score ≤ 70) | 307 (34) | 493 (99.8) | <0.00001 (a) |
| | Normal VABS communication (score > 70) | 596 (66) | 1 (0.2) | |
| VABS daily living skills | Dysfunctional VABS daily living skills (score ≤ 70) | 478 (52.94) | 484 (97.98) | <0.00001 (b) |
| | Normal VABS daily living skills (score > 70) | 425 (47.07) | 10 (2.02) | |
| VABS socialization | Dysfunctional VABS socialization (score ≤ 70) | 497 (55.04) | 490 (99.19) | <0.00001 (a) |
| | Normal VABS socialization (score > 70) | 406 (44.96) | 4 (0.81) | |
| Performance IQ Scale | Severe disability (score < 50) | 2 (0.22) | 218 (44.13) | <0.00001 (b) |
| | Moderate disability (score ≥ 50 and ≤ 70) | 31 (3.43) | 125 (25.3) | |
| | Normal ability (score > 70) | 870 (96.35) | 151 (30.57) | |
| Gender | Male | 830 (91.92) | 417 (84.41) | 0.000015 (b) |
| | Female | 73 (8.08) | 77 (15.59) | |

(a) Fisher's exact test.
(b) Chi-square test.
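As a worked example of the chi-square comparison, the gender counts from the table can be tested with a hand-written Pearson statistic for a 2×2 table. This is a sketch that assumes no continuity correction; the source does not state which variant or software was used, and the function name is illustrative.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: male, female. Columns: Cluster 1, Cluster 2.
chi2, p_value = chi_square_2x2(830, 417, 73, 77)
print(f"chi2 = {chi2:.2f}, p = {p_value:.6f}")
```

Under these assumptions the statistic comes out near 18.75 and the p-value near 0.000015, which agrees with the value listed for gender.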
Statistically significant enriched biological processes for CNVs spanning brain-expressed genes (N = 2738).
| Biological processes | Enriched genes (n) | FDR |
|---|---|---|
| Homophilic cell adhesion via plasma-membrane adhesion molecules | 53 | 6.30E–09 |
| Cell–cell adhesion via plasma-membrane adhesion molecules | 66 | 1.70E–07 |
| Cellular component organization or biogenesis | 944 | 5.70E–05 |
| Cellular component organization | 915 | 7.00E–05 |
| Cellular component biogenesis | 475 | 0.00066 |
| Cellular component assembly | 434 | 0.00177 |
| Nervous system development | 363 | 0.00215 |
| Organelle organization | 562 | 0.00475 |
| Protein polyubiquitination | 64 | 0.00592 |
| Cell projection organization | 231 | 0.00836 |
| Cellular localization | 418 | 0.0091 |
| Single-organism behavior | 83 | 0.0196 |
| Regulation of cellular component organization | 364 | 0.0257 |
| Plasma-membrane-bounded cell projection organization | 223 | 0.0282 |
| Cognition | 56 | 0.0364 |
| Single-organism organelle organization | 263 | 0.044 |
FDR, false discovery rate.
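For readers unfamiliar with FDR-adjusted values, the sketch below shows the Benjamini-Hochberg procedure, one common way to control the false discovery rate. Whether the values in the table were produced by exactly this procedure is an assumption, and the raw p-values in the example are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four biological processes.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(q, 4) for q in benjamini_hochberg(raw)])
```

A process would then be reported as enriched when its adjusted value falls below the chosen threshold, for example 0.05.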
Importance of each biological process from random forest in classifying ASD subjects into defined phenotypic clusters.
| Random Forest rank | Biological process | Mean decrease in accuracy |
|---|---|---|
| 1 | Regulation of cellular component organization | 0.052 |
| 2 | Cell projection organization | 0.025 |
| 3 | Cellular component assembly | 0.025 |
| 4 | Single-organism behavior | 0.020 |
| 5 | Organelle organization | 0.018 |
| 6 | Single-organism organelle organization | 0.017 |
| 7 | Cellular component biogenesis | 0.014 |
| 8 | Cognition | 0.013 |
| 9 | Nervous system development | 0.010 |
| 10 | Cellular localization | 0.009 |
| 11 | Cellular component organization | 0.006 |
| 12 | Protein polyubiquitination | 0.005 |
| 13 | Homophilic cell adhesion via plasma-membrane adhesion molecules | 0.005 |
| 14 | Cell adhesion via plasma-membrane adhesion molecules | 0.005 |
| 15 | Cellular component organization or biogenesis | 0.003 |
Naive Bayes performance in predicting the severe phenotype of ASD.
| Data used for classification | N | Precision | Recall | Specificity | F score |
|---|---|---|---|---|---|
| All ASD cases | 1300 | 0.221 | 0.379 | 0.655 | 0.279 |
| ASD cases from the first quantile with the highest IC | |||||
| ASD cases from the first and second quantiles of IC | 649 | 0.23 | 0.384 | 0.65 | 0.284 |
| ASD cases from the first three quantiles of IC | 974 | 0.29 | 0.389 | 0.672 | 0.329 |
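The F score column can be related to the precision and recall columns with a short check, assuming the F score is the F1 score (the harmonic mean of precision and recall). The source does not state the exact definition, and the function name is illustrative.

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (label, precision, recall, reported F score) taken from the table rows
# that list values.
rows = [
    ("All ASD cases", 0.221, 0.379, 0.279),
    ("First and second quantiles of IC", 0.23, 0.384, 0.284),
    ("First three quantiles of IC", 0.29, 0.389, 0.329),
]
for label, precision, recall, reported in rows:
    print(label, round(f_score(precision, recall), 3), reported)
```

The first row is reproduced to three decimals. The other two differ from the reported values by less than 0.005, which may reflect precision being rounded to two decimals in the table.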