Scott H Lee, Matthew J Maenner, Charles M Heilig.
Abstract
OBJECTIVE: The Centers for Disease Control and Prevention (CDC) coordinates a labor-intensive process to measure the prevalence of autism spectrum disorder (ASD) among children in the United States. Random forest methods have shown promise in speeding up this process, but they lag behind human classification accuracy by about 5%. We explore whether more recently available document classification algorithms can close this gap.
Year: 2019 PMID: 31553774 PMCID: PMC6760799 DOI: 10.1371/journal.pone.0222907
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Minimum, first quartile, median, third quartile, and maximum word counts per child.
The first row shows statistics for total (i.e., non-unique) words, while the second shows those for unique words. We represented each child’s record as the unordered collection of his or her abstracted evaluations, which we treated as a single block of text (i.e., a document). We preprocessed the text by lowercasing all strings, removing stop words and special characters, and converting all words to their dictionary forms, or lemmas.
| Words | Min | 1Q | Med | 3Q | Max |
|---|---|---|---|---|---|
| Total | 2 | 813 | 1,528 | 2,737 | 20,801 |
| Unique | 2 | 344 | 527 | 773 | 2,407 |
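The preprocessing and word counting described in the caption above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the stop list, the `preprocess` and `word_counts` helpers, and the sample evaluations are all hypothetical, and lemmatization is left as a pluggable function (identity by default) because the source does not say which lemmatizer was used.

```python
import re

# Tiny illustrative stop list (an assumption; the source does not list its stop words).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "was", "with"}

def preprocess(text, lemmatize=lambda w: w):
    """Lowercase, drop special characters and stop words, then pass each
    remaining word through `lemmatize` (identity here; a real lemmatizer
    would be supplied by the caller)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # special characters become spaces
    return [lemmatize(w) for w in text.split() if w not in STOP_WORDS]

def word_counts(tokens):
    """Return (total words, unique words) for one child's document."""
    return len(tokens), len(set(tokens))

# One child's record: the evaluations are joined into a single block of text.
evaluations = [
    "The child was referred in 2008.",
    "Child avoids eye-contact; referred again!",
]
tokens = preprocess(" ".join(evaluations))
total, unique = word_counts(tokens)
print(total, unique)
```

Applying `word_counts` to every child's document and taking the minimum, quartiles, and maximum of each count would give a table with the same layout as the one above.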
Mean performance for our 8 models across the 10 train-test splits.
Metrics include sensitivity (Sens), specificity (Spec), positive predictive value (PPV), negative predictive value (NPV), F1, and accuracy (Acc), all shown as percentages. The best scores for each metric are shown in bold, and the final column presents differences in accuracy between each of the models and the most accurate model, the NB-SVM. Simultaneous confidence intervals are multiplicity-adjusted to control the family-wise error rate (FWER).
| Model | Sens | Spec | PPV | NPV | F1 | Acc (95% CI) | Diff acc (95% CI) |
|---|---|---|---|---|---|---|---|
| LDA | 44.2 | 72.4 | 60.6 | 57.5 | 51.1 | 58.6 (55.0, 62.2) | -29.0 (-32.4, -25.6) |
| MNB | 82.3 | 72.6 | 74.2 | 81.0 | 78.0 | 77.3 (73.9, 80.7) | -10.3 (-12.5, -8.1) |
| SVM | 83.5 | 84.5 | 83.8 | 84.2 | 83.6 | 84.0 (80.8, 87.2) | -3.7 (-6.6, -0.7) |
| LSA | 81.5 | 88.5 | 87.2 | 83.3 | 84.2 | 85.1 (83.1, 87.0) | -2.6 (-4.2, -0.9) |
| NNsum | 85.5 | 84.7 | 84.4 | 86.0 | 84.9 | 85.1 (81.9, 88.3) | -2.6 (-5.2, 0.1) |
| NNavg | 86.3 | 86.4 | 85.9 | 86.9 | 86.0 | 86.3 (84.4, 88.2) | -1.3 (-3.3, 0.6) |
| RF | 87.1 | 86.6 | 86.8 | 87.1 (83.8, 90.4) | -0.5 (-2.2, 1.1) | ||
| NB-SVM | 85.2 | 86.4 | * |
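For readers who want the definitions behind the column headers, the six metrics can be computed from the four cells of a confusion matrix. The sketch below uses the standard formulas; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the six metrics as percentages, given counts of true positives,
    false positives, true negatives, and false negatives."""
    sens = tp / (tp + fn)               # sensitivity (recall)
    spec = tn / (tn + fp)               # specificity
    ppv = tp / (tp + fp)                # positive predictive value (precision)
    npv = tn / (tn + fn)                # negative predictive value
    f1 = 2 * ppv * sens / (ppv + sens)  # harmonic mean of PPV and sensitivity
    acc = (tp + tn) / (tp + fp + tn + fn)
    raw = {"Sens": sens, "Spec": spec, "PPV": ppv, "NPV": npv, "F1": f1, "Acc": acc}
    return {name: 100.0 * value for name, value in raw.items()}

# Hypothetical counts chosen only to illustrate the arithmetic.
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.1f}")
```

The function assumes every denominator is nonzero, which holds whenever the test set contains both classes and the model makes at least one call of each kind.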
Mean prevalence-related metrics for our 8 models across the 10 train-test splits.
Metrics included are false positives (FP); false negatives (FN); number of positive calls (Pos calls); number of true positives in the test set; discordance; and the difference in discordance from the least discordant model, the SVM. Here, discordance is the difference between the predicted percentage positive and the true percentage positive. Simultaneous confidence intervals are multiplicity-adjusted to control FWER.
| Model | FP | FN | Pos calls | True pos | Disc % (95% CI) | Diff disc % (95% CI) |
|---|---|---|---|---|---|---|
| LDA | 158 | 306 | 401 | 549 | -13.2 (-15.1, -11.3) | -13.0 (-15.5, -10.5) |
| MNB | 157 | 97 | 609 | 549 | -3.2 (-5.1, -1.3) | 5.5 (1.3, 9.8) |
| LSA | 66 | 102 | 513 | 549 | -2.1 (-5.5, 1.4) | -3.0 (-5.2, -0.9) |
| NB-SVM | 58 | 81 | 526 | 549 | -2.0 (-5.4, 1.3) | -1.9 (-5.5, 1.7) |
| RF | 74 | 71 | 552 | 549 | -1.1 (-3.5, 1.4) | -0.9 (-3.7, 1.9) |
| NNsum | 88 | 80 | 557 | 549 | 0.7 (-2.6, 4.0) | 0.9 (-2.3, 4.1) |
| NNavg | 78 | 76 | 552 | 549 | 0.3 (-2.9, 3.5) | 0.4 (-3.0, 3.9) |
| SVM | 89 | 91 | 547 | 549 | | * |
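The discordance defined in the caption above follows directly from the error counts: the number of positive calls equals the true positives in the test set minus the false negatives plus the false positives, and discordance is the gap between predicted and true percentage positive. The sketch below illustrates this arithmetic. The helper names are hypothetical, and `n_test=1000` is a placeholder, since this excerpt does not state the test set size; the printed percentage is therefore not comparable to the table.

```python
def positive_calls(true_pos, fp, fn):
    """Number of records a model calls positive, from the count of actual
    positives in the test set and the model's FP and FN counts."""
    return true_pos - fn + fp

def discordance(pos_calls, true_pos, n_test):
    """Predicted percentage positive minus true percentage positive."""
    return 100.0 * (pos_calls - true_pos) / n_test

# FP, FN, and true-positive counts from the SVM row; n_test is a placeholder.
calls = positive_calls(true_pos=549, fp=89, fn=91)
disc = discordance(calls, true_pos=549, n_test=1000)
print(calls, disc)
```

Because false positives and false negatives cancel in this measure, a model can have low discordance while still misclassifying many individual records, which is why the table reports FP and FN alongside it.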