Rayees Rahman1, Arad Kodesh2,3, Stephen Z Levine3, Sven Sandin4,5, Abraham Reichenberg4,5,6,7, Avner Schlessinger1.
Abstract
BACKGROUND: Current approaches for early identification of individuals at high risk for autism spectrum disorder (ASD) in the general population are limited, and most ASD patients are not identified until after the age of 4. This is despite substantial evidence suggesting that early diagnosis and intervention improve developmental course and outcome. The aim of the current study was to test the ability of machine learning (ML) models applied to electronic medical records (EMRs) to predict ASD early in life, in a general population sample.
Keywords: autism spectrum disorder; electronic biomarker; pharmacology; random forest; risk prediction
Year: 2020 PMID: 32100657 PMCID: PMC7315872 DOI: 10.1192/j.eurpsy.2020.17
Source DB: PubMed Journal: Eur Psychiatry ISSN: 0924-9338 Impact factor: 5.361
Figure 1. Workflow used to build the machine learning model of autism spectrum disorder (ASD) incidence. To evaluate the utility of electronic medical records (EMRs) and machine learning for predicting the risk of having a child with ASD, we developed a comprehensive dataset. (A) For each mother–father pair, the parental age difference, the number of unique medications either parent had taken, the socioeconomic status, and the proportion of drugs taken by the parent, grouped by level 2 Anatomical Therapeutic Chemical (ATC) code, were used for further analysis. (B) Workflow for the 10-fold cross-validation used to evaluate model performance. First, the data were partitioned into ASD and non-ASD cases; from each partition, 80% of the records were randomly sampled as the training set and 20% were withheld as the testing set. The training sets were then combined, and the synthetic minority oversampling technique (SMOTE) was used to generate synthetic records of ASD cases. A multilayer perceptron (MLP; also known as a feedforward neural network), a logistic regression model, and a random forest model were trained on the oversampled training data. They were then evaluated on the testing data based on sensitivity, specificity, precision, false positive rate, and area under the ROC curve (AUC; C-statistic). Because the testing data contained no synthetic cases, the measured performance is indicative of performance on real data. This process was repeated 10 times and the average model performance was reported.
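The split-then-oversample order described in panel B can be illustrated with a minimal sketch. It uses toy data and only the Python standard library. The helper names `stratified_split` and `smote_like` are hypothetical, and `smote_like` is a simplified interpolation in the spirit of SMOTE, not the implementation used in the study.

```python
import math
import random


def stratified_split(records, labels, test_fraction=0.2, rng=None):
    """Hold out test_fraction of each class at random; return (train, test)."""
    rng = rng if rng is not None else random.Random(0)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]

    def pick(ids):
        return [records[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(test_idx)


def smote_like(minority, n_new, k=5, rng=None):
    """Simplified SMOTE: interpolate between a minority point and one of
    its k nearest minority neighbours. Needs at least two minority points."""
    rng = rng if rng is not None else random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy data (an assumption for illustration): 200 controls, 20 cases, 3 features.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
X += [[rng.gauss(1, 1) for _ in range(3)] for _ in range(20)]
y = [0] * 200 + [1] * 20

(X_train, y_train), (X_test, y_test) = stratified_split(X, y, 0.2, rng)

# Oversample the minority class in the training set only.
minority = [x for x, label in zip(X_train, y_train) if label == 1]
n_new = y_train.count(0) - y_train.count(1)
synthetic = smote_like(minority, n_new, k=5, rng=rng)
X_balanced = X_train + synthetic
y_balanced = y_train + [1] * len(synthetic)
```

Oversampling after the split keeps every synthetic record out of the testing set, which is the property the caption relies on when it says that test performance reflects real data.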
Figure 2. Performance of the machine learning-based autism spectrum disorder (ASD) risk predictor. A balanced dataset was generated to train various algorithms to predict the probability of an ASD child from the electronic medical records of the parents. (A) Receiver operating characteristic (ROC) curves for all methods tested: logistic regression, random forest, and MLP. (B) Boxplot of the importance values of each feature in the random forest model after 10-fold cross-validation (10× CV). The importance of a feature is defined as the mean decrease in Gini impurity attributable to that feature when training a model. Level 2 Anatomical Therapeutic Chemical (ATC) codes are represented by a three-character alphanumeric code.
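As a rough illustration of the importance measure in panel B, the sketch below computes the Gini decrease for a single split on toy labels. It assumes the standard impurity-based definition, in which a random forest accumulates such decreases over every split that uses a feature and averages them over trees. The function names are hypothetical and the labels are invented for the example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node with 4 cases (1) and 4 controls (0), split on a hypothetical feature.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [1, 0, 0, 0]
decrease = gini_decrease(parent, left, right)

# A split that separates the classes perfectly removes all impurity.
perfect = gini_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])
```

A feature whose splits separate cases from controls more cleanly yields larger decreases and therefore ranks higher in the boxplot.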
Performance of classifiers after 10-fold cross-validation
| Model | True positives (TP) | False negatives (FN) | False positives (FP) | True negatives (TN) | Sensitivity | Specificity | Precision (PPV) | False positive rate (FPR) | Accuracy | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic regression | 88.1 | 160.7 | 175.4 | 6,202 | 0.354 | 0.972 | 0.336 | 0.0275 | 0.949 | 0.727 |
| MLP | 74.8 | 172.7 | 126.7 | 6,252 | 0.301 | 0.98 | 0.393 | 0.0198 | 0.955 | 0.709 |
| Random forest | 60.5 | 188.3 | 45.6 | 6,332 | 0.243 | 0.993 | 0.572 | 0.00715 | 0.965 | 0.693 |
| Average | 74.5 | 173.9 | 115.9 | 6,262 | 0.299 | 0.982 | 0.434 | 0.018 | 0.956 | 0.709 |
Note: Performance metrics of each classifier (logistic regression, multilayer perceptron [MLP], and random forest), averaged over the 10 cross-validation folds. The "Average" row gives the mean performance across all three methods.
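The rate columns in the table follow from the confusion-matrix counts. The sketch below recomputes them for the logistic regression row. The function name `confusion_metrics` is hypothetical. Because the counts are fold averages, a metric recomputed from them need not match the reported value exactly; the small gap in precision here is presumably because the reported metrics were averaged per fold, which is an assumption rather than something the table states.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive the standard rates from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "precision": tp / (tp + fp),              # positive predictive value
        "fpr": fp / (fp + tn),                    # 1 - specificity
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Averaged counts from the logistic regression row of the table.
lr = confusion_metrics(tp=88.1, fn=160.7, fp=175.4, tn=6202)

for name, value in lr.items():
    print(f"{name}: {value:.4f}")
```

The high accuracy alongside low sensitivity reflects the class imbalance: with roughly 250 cases among about 6,600 test records, a model can be right most of the time while still missing most ASD cases.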
Figure 3. Sensitivity analysis of the generated machine learning models. (A) Receiver operating characteristic (ROC) curves for all methods tested with the “missing parental information” label included. (B) ROC curves of models generated when all parental medication data are removed. (C) ROC curves of models generated when all maternal medication data are removed.
Performance of classifiers in sensitivity analysis
| Model | True positives (TP) | False negatives (FN) | False positives (FP) | True negatives (TN) | Sensitivity | Specificity | Precision (PPV) | False positive rate (FPR) | Accuracy | AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| Logistic regression | 75.7 | 166 | 97.9 | 6,286 | 0.313 | 0.985 | 0.436 | 0.0153 | 0.96 | 0.716 |
| MLP | 77.1 | 167.9 | 116 | 6,265.5 | 0.314 | 0.982 | 0.404 | 0.0181 | 0.957 | 0.694 |
| Random forest | 57.1 | 184.6 | 52 | 6,332.3 | 0.236 | 0.992 | 0.529 | 0.0081 | 0.96 | 0.687 |
| Average | 69.9 | 172.8 | 88.6 | 6,294.7 | 0.287 | 0.986 | 0.456 | 0.0138 | 0.959 | 0.699 |
Note: Performance metrics of each classifier (logistic regression, multilayer perceptron [MLP], and random forest), averaged over the 10 cross-validation folds, with the “missing parental information” label included. The "Average" row gives the mean performance across all three methods.