Gang Peng, Yishuo Tang, Tina M Cowan, Gregory M Enns, Hongyu Zhao, Curt Scharfe.
Abstract
Newborn screening (NBS) for inborn metabolic disorders is a highly successful public health program that by design is accompanied by false-positive results. Here we trained a Random Forest machine learning classifier on screening data to improve prediction of true and false positives. Data included 39 metabolic analytes detected by tandem mass spectrometry and clinical variables such as gestational age and birth weight. Analytical performance was evaluated for a cohort of 2777 screen positives reported by the California NBS program, which consisted of 235 confirmed cases and 2542 false positives for one of four disorders: glutaric acidemia type 1 (GA-1), methylmalonic acidemia (MMA), ornithine transcarbamylase deficiency (OTCD), and very long-chain acyl-CoA dehydrogenase deficiency (VLCADD). Without changing the sensitivity to detect these disorders in screening, Random Forest-based analysis of all metabolites reduced the number of false positives for GA-1 by 89%, for MMA by 45%, for OTCD by 98%, and for VLCADD by 2%. All primary disease markers and previously reported analytes such as methionine for MMA and OTCD were among the top-ranked analytes. Random Forest's ability to classify GA-1 false positives was found similar to results obtained using Clinical Laboratory Integrated Reports (CLIR). We developed an online Random Forest tool for interpretive analysis of increasingly complex data from newborn screening.
Keywords: Random Forest; false positive; inborn metabolic disorders; machine learning; newborn screening; second-tier testing; tandem mass spectrometry
Year: 2020; PMID: 32190768; PMCID: PMC7080200; DOI: 10.3390/ijns6010016
Source DB: PubMed; Journal: Int J Neonatal Screen; ISSN: 2409-515X
Number of patients, false positives, and positive predictive value (PPV) of first- and second-tier testing. Abbreviations: NBS, newborn screening; GA-1, glutaric acidemia type 1; MMA, methylmalonic acidemia; OTCD, ornithine transcarbamylase deficiency; VLCADD, very long-chain acyl-CoA dehydrogenase deficiency.
| Disorder | Confirmed | First-Tier NBS: False Positives | First-Tier NBS: PPV | Second-Tier Analysis (RF, This Study): False Positives * | Second-Tier Analysis (RF, This Study): PPV |
|---|---|---|---|---|---|
| GA-1 | 48 | 1344 | 3.10% | 150 | 22.30% |
| MMA | 103 | 502 | 16.40% | 276 | 26.40% |
| OTCD | 24 | 496 | 3.50% | 11 | 62.10% |
| VLCADD | 60 | 200 | 23.10% | 196 | 23.40% |
* Median of false positives from 1000 repeats of 10-fold CV.
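The false-positive reductions quoted in the abstract can be recomputed from the two false-positive columns of the table above. The following is a minimal sketch of that arithmetic; the function and variable names are illustrative and not taken from the study.

```python
# First-tier and second-tier (RF) false-positive counts, copied from the table above.
false_positives = {
    "GA-1": (1344, 150),
    "MMA": (502, 276),
    "OTCD": (496, 11),
    "VLCADD": (200, 196),
}


def fp_reduction_percent(first_tier: int, second_tier: int) -> float:
    """Percentage of first-tier false positives removed by second-tier analysis."""
    return 100.0 * (first_tier - second_tier) / first_tier


reductions = {
    disorder: fp_reduction_percent(first, second)
    for disorder, (first, second) in false_positives.items()
}

for disorder, pct in reductions.items():
    print(f"{disorder}: {pct:.1f}% fewer false positives")
```

Rounded to whole percentages, these values agree with the 89%, 45%, 98%, and 2% reported in the abstract.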
Participant and Subgroup Demographics for four disorders.
| | GA-1 | MMA | OTCD | VLCADD | Control * |
|---|---|---|---|---|---|
| Gestational Age, week | |||||
| <37 | 340 (24.4%) | 175 (28.9%) | 181 (34.8%) | 42 (16.2%) | 5490 (5.5%) |
| 37–41 | 1005 (72.2%) | 412 (68.1%) | 325 (62.5%) | 206 (79.2%) | 93,603 (94.0%) |
| >41 | 47 (3.4%) | 18 (3.0%) | 14 (2.7%) | 12 (4.6%) | 444 (0.4%) |
| Birth Weight, g | |||||
| <2500 | 279 (20.0%) | 173 (28.6%) | 130 (25.0%) | 26 (10.0%) | 4045 (4.1%) |
| 2500–4000 | 1025 (73.6%) | 381 (63.0%) | 354 (68.1%) | 223 (85.8%) | 87,268 (87.7%) |
| >4000 | 88 (6.3%) | 51 (8.4%) | 36 (6.9%) | 11 (4.2%) | 8224 (8.3%) |
| Sex | |||||
| Male | 845 (60.7%) | 321 (53.1%) | 325 (62.5%) | 165 (63.5%) | 51,352 (51.6%) |
| Female | 542 (38.9%) | 281 (46.4%) | 194 (37.3%) | 93 (35.8%) | 47,882 (48.1%) |
| Unknown | 5 (0.4%) | 3 (0.5%) | 1 (0.2%) | 2 (0.8%) | 303 (0.3%) |
| Race/Ethnicity | |||||
| Asian | 136 (9.8%) | 63 (10.4%) | 33 (6.3%) | 40 (15.4%) | 14,275 (14.3%) |
| Black | 212 (15.2%) | 25 (4.1%) | 50 (9.6%) | 15 (5.8%) | 6630 (6.7%) |
| Hispanic | 444 (31.9%) | 407 (67.3%) | 224 (43.1%) | 94 (36.2%) | 49,400 (49.6%) |
| White | 554 (39.8%) | 92 (15.2%) | 197 (37.9%) | 102 (39.2%) | 26,341 (26.5%) |
| Other/Unknown | 46 (3.3%) | 18 (3.0%) | 16 (3.1%) | 9 (3.5%) | 2891 (2.9%) |
| Age at Blood Collection, hour | |||||
| <12 | 246 (17.7%) | 142 (23.5%) | 45 (8.7%) | 47 (18.1%) | 21,564 (21.7%) |
| 12–24 | 877 (63.0%) | 319 (52.7%) | 259 (49.8%) | 183 (70.4%) | 71,396 (71.7%) |
| >24 | 269 (19.3%) | 144 (23.8%) | 216 (41.5%) | 30 (11.5%) | 6577 (6.6%) |
| Total Parenteral Nutrition | |||||
| No | 1178 (84.6%) | 393 (65.0%) | 453 (87.1%) | 248 (95.4%) | 97,269 (97.7%) |
| Yes | 146 (10.5%) | 187 (30.9%) | 57 (11.0%) | 3 (1.2%) | 998 (1.0%) |
| Unknown | 68 (4.9%) | 25 (4.1%) | 10 (1.9%) | 9 (3.5%) | 1270 (1.3%) |
* Numbers and percentages were calculated from 99,537 singleton screen-negative newborns randomly selected from the California NBS program between 2013 and 2015.
Figure 1. Analysis of newborn metabolic profiles with Random Forest (RF). Receiver operating characteristic (ROC) curve analysis for infants with and without a confirmed diagnosis using RF analysis of 39 MS/MS analytes. Without altering the sensitivity of primary newborn screening (NBS) for each of the four disorders, RF reduced the number of false-positive cases (vertical dotted line) by 89% for GA-1 (A), 45% for MMA (B), 98% for OTCD (C), and 2% for VLCADD (D). For each disease, the number in parentheses shows the 95% confidence interval of the area under the ROC curve (AUC).
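One way to read "without altering the sensitivity" is that the score cutoff is placed at the lowest score observed among confirmed cases, so that every confirmed case stays flagged and only lower-scoring false positives are removed. The sketch below illustrates that reading only. It assumes that a higher score means "more likely a true positive", and the scores are made up for illustration; neither the data nor the helper names come from the study.

```python
def sensitivity_preserving_cutoff(scores, is_confirmed):
    """Return the lowest score among confirmed cases (assumes higher = more suspicious)."""
    confirmed = [s for s, c in zip(scores, is_confirmed) if c]
    if not confirmed:
        raise ValueError("at least one confirmed case is required")
    return min(confirmed)


def count_flagged(scores, is_confirmed, cutoff):
    """Count (confirmed, unconfirmed) samples scoring at or above the cutoff."""
    kept_true = sum(1 for s, c in zip(scores, is_confirmed) if c and s >= cutoff)
    kept_false = sum(1 for s, c in zip(scores, is_confirmed) if not c and s >= cutoff)
    return kept_true, kept_false


# Hypothetical classifier scores for ten screen positives (not study data).
scores = [0.92, 0.81, 0.40, 0.35, 0.55, 0.10, 0.05, 0.22, 0.47, 0.02]
is_confirmed = [True, True, True, False, False, False, False, False, False, False]

cutoff = sensitivity_preserving_cutoff(scores, is_confirmed)
kept_true, kept_false = count_flagged(scores, is_confirmed, cutoff)
total_false = is_confirmed.count(False)

print(f"cutoff = {cutoff}")
print(f"confirmed cases still flagged: {kept_true} of {is_confirmed.count(True)}")
print(f"false positives remaining: {kept_false} of {total_false}")
```

In this toy example all three confirmed cases remain flagged, while five of the seven false positives fall below the cutoff.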
Figure 2. Assessing the performance of RF using cross validation. A 10-fold cross validation (1000 repeats) of the RF model was performed for each disorder to classify each screen positive as either a true or false positive. Only RF scores from testing samples were used to plot the ROC curve and to calculate the AUC. The small variation in AUC values without extreme outlier cases for each disorder demonstrates the overall stability of our RF model.
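The repeated 10-fold cross validation described in the Figure 2 caption can be sketched as an index-splitting routine. This is a sketch under stated assumptions: it stratifies folds by class label, which the source does not state, it omits model fitting entirely, and all function names are hypothetical. The label vector uses the GA-1 counts from the first table (48 confirmed cases, 1344 false positives), and only two repeats are run here to keep the example fast.

```python
import random


def stratified_folds(labels, k, rng):
    """Assign sample indices to k folds, spreading each class evenly (an assumption)."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal shuffled indices round-robin, continuing where the last class stopped.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)
    return folds


def repeated_cv_splits(labels, k=10, repeats=1000, seed=0):
    """Yield (repeat, train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    all_idx = range(len(labels))
    for repeat in range(repeats):
        for test_idx in stratified_folds(labels, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in all_idx if i not in held_out]
            yield repeat, train_idx, test_idx


# 1 = confirmed case, 0 = false positive (GA-1 counts from the first table).
labels = [1] * 48 + [0] * 1344
splits = list(repeated_cv_splits(labels, k=10, repeats=2, seed=0))
cases_per_fold = [sum(labels[i] for i in test_idx) for _, _, test_idx in splits]

print(f"{len(splits)} train/test splits")
print(f"confirmed cases per test fold: {min(cases_per_fold)} to {max(cases_per_fold)}")
```

Within a repeat, each sample lands in exactly one test fold, so scores collected from the test folds cover every screen positive once, which is consistent with the caption's statement that only scores from testing samples were used.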
Figure 3. The contribution of individual metabolic analytes in the RF model. The mean decrease in accuracy (MDA) was used to rank the relative importance of individual MS/MS analytes and clinical variables for metabolic pattern recognition in the RF model. Only the 20 top-ranked analytes and variables for each disease are shown: (A) GA-1; (B) MMA; (C) OTCD; (D) VLCADD, with the primary markers labeled in red. Abbreviations: TPN–Yes, infants with total parenteral nutrition (TPN); AaC, age at blood collection.
The five top-ranked analytes identified by Random Forest for each disease.
| MDA Ranking | GA-1 | MMA | OTCD | VLCADD |
|---|---|---|---|---|
| 1 | C5DC a | Methionine b | Methionine b | C14:1 a |
| 2 | C3DC | Free Carnitine | Proline b | C14 a |
| 3 | C8 b | C2 a | Alanine | C3DC |
| 4 | C10 | C3 a | Glycine | C2 b |
| 5 | Ornithine | C4 | Citrulline a | C5DC |
a Metabolic analytes used as the primary marker for each disorder in the California NBS program. b Metabolic analytes used as part of informative marker ratios, including C5DC/C8 for GA-1, C3/Met for MMA (CblC, D or F), methionine/citrulline and proline/citrulline for OTCD, and C14:1/C2 for VLCADD. Additional references provide support for analytes identified in this study.
Comparison of CLIR and Random Forest for GA-1 screen positives (cleaned data). Abbreviations: TP, true positive; FP, false positive.
| Algorithm | Predicted by Algorithm | NBS Results (Truth): TP | NBS Results (Truth): FP |
|---|---|---|---|
| CLIR | TP | 44 | 67 |
| CLIR | FP | 4 | 911 |
| Random Forest | TP | 43 | 53 |
| Random Forest | FP | 5 | 925 |
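To make the comparison above easier to read, the counts can be converted into sensitivity, specificity, and PPV for the second-tier classification. The sketch below uses only the eight counts from the comparison table; the `Confusion` class and its field names are illustrative, and treating a first-tier false positive that the algorithm also calls FP as a "true negative" is a labeling choice made here, not terminology from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # confirmed cases the algorithm predicts as TP
    fn: int  # confirmed cases the algorithm predicts as FP (missed)
    fp: int  # first-tier false positives the algorithm still predicts as TP
    tn: int  # first-tier false positives the algorithm predicts as FP

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)


# Counts copied from the CLIR vs. Random Forest comparison table above.
results = {
    "CLIR": Confusion(tp=44, fn=4, fp=67, tn=911),
    "Random Forest": Confusion(tp=43, fn=5, fp=53, tn=925),
}

for name, c in results.items():
    print(
        f"{name}: sensitivity {c.sensitivity:.1%}, "
        f"specificity {c.specificity:.1%}, PPV {c.ppv:.1%}"
    )
```

On these counts CLIR retains one more confirmed case (44 vs. 43 of 48), while Random Forest removes 14 more false positives (925 vs. 911 of 978), consistent with the abstract's description of the two methods as similar.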
Figure 4. Graphical user interface (GUI) of the web-based Random Forest (RF) tool for NBS data analysis. The menu panel on the left was designed to upload data and select parameters, while the panel on the right shows the results from RF-based analysis of the metabolic input data. Users can select a cutoff value or use the default cutoff, which was calculated for each disorder based on the median sensitivity of 1000 repeats in the 10-fold cross validation (Figure 2).