Worrawat Engchuan, Kiret Dhindsa, Anath C Lionel, Stephen W Scherer, Jonathan H Chan, Daniele Merico.
Abstract
BACKGROUND: A substantial proportion of Autism Spectrum Disorder (ASD) risk resides in de novo germline and rare inherited genetic variation. In particular, rare copy number variation (CNV) contributes to ASD risk in up to 10% of ASD subjects. Despite the striking degree of genetic heterogeneity, case-control studies have detected a specific burden of rare disruptive CNV for neuronal and neurodevelopmental pathways. Here, we used machine learning methods to classify ASD subjects and controls, based on rare CNV data and comprehensive gene annotations. We investigated the performance of different methods and estimated the percentage of ASD subjects that could be reliably classified based on the presumed etiologic CNV they carry.
Year: 2015 PMID: 25783485 PMCID: PMC4315323 DOI: 10.1186/1755-8794-8-S1-S7
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Curated gene-sets description and gene number
| Gene-set ID | Gene-set description | Number of genes |
|---|---|---|
| hi015 | Predicted haploinsufficiency (most inclusive) | 8862 |
| hi035 | Predicted haploinsufficiency | 4136 |
| hi055 | Predicted haploinsufficiency (most stringent) | 2214 |
| ExpsNov_BrainFeAd_sp | Specific expression in human adult or fetal brain (Novartis Tissue Atlas) | 1285 |
| Synapse_GrantFull | Post-synaptic density components | 1407 |
| FMR1_Targets_Darnell | FMR1 targets (Darnell et al) | 840 |
| FMR1_Targets_Ascano | FMR1 targets (Ascano et al) | 927 |
| thrEXPR_log2rpkm | Expressed in brain (BrainSpan) | 13802 |
| thr4.86_log2rpkm | Expressed in brain, very high (BrainSpan) | 4595 |
| thr3.32_log2rpkm | Expressed in brain, high/medium (BrainSpan) | 4604 |
| thr0.84_log2rpkm | Expressed in brain, medium/low (BrainSpan) | 4603 |
| thr.MIN_log2rpkm | Not expressed in brain (BrainSpan) | 4600 |
| PhHs_NervSys_ADX | Human nervous system phenotype (HPO), autosomal dominant or X-linked | 620 |
| PhHs_NervSys_All | Human nervous system phenotype (HPO) | 784 |
| PhHs_MindFun_ADX | Higher mental function phenotype (HPO), autosomal dominant or X-linked | 395 |
| PhHs_MindFun_All | Higher mental function phenotype (HPO) | 687 |
| MmHs_Neuro_All | Mouse neuro phenotype (MGI/MPO) | 3479 |
| MmHs_Extend_All | Mouse developmental phenotype (MGI/MPO) | 4314 |
| NeuroF_large | Neurobiological function, inclusive | 2601 |
| NeuroF_small | Neurobiological function, stringent | 1088 |
| Total | Total gene count | 18203 |
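
A minimal sketch of how gene-set annotations like those in the table could be turned into per-subject features. It assumes, which the text above does not state, that each feature is the number of genes hit by a subject's rare CNVs that belong to a given gene-set, counted separately for losses and gains. The function name `gene_set_features`, the gene symbols, and the set memberships are illustrative placeholders, not the actual gene-set contents.

```python
from collections import defaultdict

# Hypothetical gene-set memberships (toy data; the real sets in the table
# hold hundreds to thousands of genes each).
GENE_SETS = {
    "Synapse_GrantFull": {"SHANK3", "NRXN1", "DLG4"},
    "FMR1_Targets_Darnell": {"SHANK3", "NLGN3"},
}


def gene_set_features(cnvs, gene_sets):
    """Count genes hit by loss and gain CNVs, per gene-set.

    cnvs: iterable of (cnv_type, genes), where cnv_type is "loss" or "gain"
          and genes lists the gene symbols overlapped by that CNV.
    Returns a dict keyed by "<gene-set>_<type>" plus "Total_<type>".
    """
    hit = defaultdict(set)
    for cnv_type, genes in cnvs:
        if cnv_type not in ("loss", "gain"):
            raise ValueError(f"unknown CNV type: {cnv_type!r}")
        # A gene hit by several CNVs of the same type is counted once.
        hit[cnv_type].update(genes)

    features = {}
    for cnv_type in ("loss", "gain"):
        features[f"Total_{cnv_type}"] = len(hit[cnv_type])
        for name, members in gene_sets.items():
            features[f"{name}_{cnv_type}"] = len(hit[cnv_type] & members)
    return features


# One toy subject carrying a two-gene deletion and a one-gene duplication.
subject_cnvs = [("loss", ["SHANK3", "ACR"]), ("gain", ["NLGN3"])]
features = gene_set_features(subject_cnvs, GENE_SETS)
```

Keeping loss and gain counts in separate features mirrors the separate "Gain CNV" and "Loss CNV" columns reported in the result tables.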
Figure 1. Cross-validation strategy. The data-set is divided into three equal subsets, each with the same proportion of ASD and control subjects. Two of the three subsets are used as the training set for the model, whereas the other subset is used as the validation set for performance quantification; this is iterated three times, so that each subset is used twice for training and once for validation. The feature selection is performed only for GO and pathway-based features. The remaining set is used as test set to assess the performance of classification. The cross-validation procedure is repeated multiple times to estimate the mean performance and its standard deviation.
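
The three-fold scheme described in the Figure 1 caption can be sketched as follows. This is an illustrative standard-library implementation of a stratified split, not the authors' code; the function names, the toy labels, and the seed are assumptions.

```python
import random


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds with near-equal class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cross_validation_splits(labels, k=3, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = stratified_folds(labels, k, seed)
    for i in range(k):
        validation = sorted(folds[i])
        training = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield training, validation


# Toy cohort: 6 ASD subjects and 9 controls.
labels = ["ASD"] * 6 + ["control"] * 9
splits = list(cross_validation_splits(labels, k=3, seed=1))
```

Repeating the whole procedure with different seeds gives a set of performance values from which a mean and standard deviation can be computed, as in the caption.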
CF and RF classification performance for 20 neurally-relevant curated features (mean ± sd)
| Subject | Classifier | All CNV | Gain CNV | Loss CNV |
|---|---|---|---|---|
| All subjects | RandomForest | 0.531±0.005 | 0.509±0.004 | 0.544±0.006 |
| All subjects | CForest | 0.533±0.004 | 0.513±0.005 | 0.546±0.003 |
| De novo | RandomForest | 0.805±0.012 | 0.769±0.024 | 0.840±0.010 |
| De novo | CForest | 0.787±0.008 | 0.732±0.013 | 0.846±0.011 |
| Pathogenic | RandomForest | 0.913±0.014 | 0.913±0.012 | 0.935±0.016 |
| Pathogenic | CForest | 0.880±0.012 | 0.897±0.008 | 0.922±0.030 |
Figure 2. RF and CF feature relevance, boxplots for the 20 curated neurally-relevant features. Feature relevance boxplots for loss-based features (red) and gain-based features (blue). Mean decrease Gini (MDG) and mean decrease accuracy (MDA) were used for RF. MDA, with and without correlation adjustment, was used for CF. For all relevance metrics, higher values correspond to more relevant features.
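
As a rough illustration of the idea behind mean decrease accuracy, the sketch below permutes one feature at a time and records the average drop in accuracy. It is a simplified, model-agnostic version: random forest implementations typically compute MDA per tree on out-of-bag samples, and the correlation-adjusted variant permutes conditionally on correlated features, neither of which is reproduced here. The toy classifier, data, and function names are assumptions.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Toy classifier that looks only at feature 0 (e.g. a loss-based count)."""
    return 1 if row[0] > 0 else 0


rows = [[0, 5], [0, 1], [2, 3], [3, 0], [0, 2], [1, 4]]
labels = [toy_predict(r) for r in rows]
importance = permutation_importance(toy_predict, rows, labels)
```

In this toy setup the second feature is never used by the classifier, so shuffling it leaves accuracy unchanged and its importance is zero, while shuffling the first feature lowers accuracy and yields a positive value.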
Classification results for all subjects using 20 neurally-relevant curated features, 20 matched randomized features, Gene Ontology and pathways (mean ± sd)
| Gene set (All subjects) | All CNV | Gain CNV | Loss CNV |
|---|---|---|---|
| 20 curated | 0.533±0.004 | 0.513±0.005 | 0.546±0.003 |
| GO | 0.512±0.005 | 0.506±0.002 | 0.519±0.002 |
| GO (man. selected) | 0.520±0.005 | 0.505±0.005 | 0.524±0.003 |
| GO (f.s.: 20 MDA dec.) | 0.524±0.003 | 0.510±0.003 | 0.529±0.005 |
| Pathway | 0.500±0.000 | 0.500±0.000 | 0.504±0.004 |
| Pathway (man. selected) | 0.500±0.000 | 0.500±0.001 | 0.510±0.004 |
| Pathway (f.s.: 20 MDA dec.) | 0.513±0.003 | 0.510±0.004 | 0.513±0.003 |
| Random (20 curated) | 0.517±0.005 | 0.510±0.007 | 0.515±0.007 |
| Total count | 0.515±0.005 | 0.505±0.005 | 0.516±0.004 |