Corrado Lanera, Ileana Baldi, Andrea Francavilla, Elisa Barbieri, Lara Tramontan, Antonio Scamarcia, Luigi Cantarutti, Carlo Giaquinto, Dario Gregori.
Abstract
The burden of infectious diseases is crucial for both epidemiological surveillance and prompt public health response. A variety of data, including textual sources, can be fruitfully exploited. Dealing with unstructured data requires methods for automatic, data-driven variable construction, and machine learning techniques (MLT) show promising results. In this framework, varicella-zoster virus (VZV) infection was chosen to perform an automatic case identification with MLT. Pedianet, an Italian pediatric primary care database, was used to train a series of models to identify, starting from free-text fields, whether a child was diagnosed with VZV infection between 2004 and 2014 in the Veneto region. Given the nature of the task, a recurrent neural network (RNN) with bidirectional gated recurrent units (GRUs) was chosen; the same models were then used to predict the children's status for the following years. A gold standard produced by manual extraction for the same interval was available for comparison. RNN-GRU improved its performance over time, reaching a maximum area under the ROC curve (AUC-ROC) of 95.30% at the end of the period. The absolute bias in estimates of VZV infection was below 1.5% in the last five years analyzed. The findings in this study could assist the large-scale use of EHRs for clinical outcome predictive modeling and help establish high-performance systems in other medical domains.
Keywords: deep learning; electronic health records; infectious disease; natural language processing; varicella-zoster
Year: 2022 PMID: 35627495 PMCID: PMC9141951 DOI: 10.3390/ijerph19105959
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Characteristics of units analyzed. Descriptive statistics are reported and stratified for outcome class, i.e., negative or positive case of VZV in the corresponding year.
| | N | VZV Negative | VZV Positive |
|---|---|---|---|
| Sex | 60,659 | | |
| Female | | 47% (27,340) | 46% (1068) |
| Male | | 53% (30,994) | 54% (1257) |
| Age [days] | 60,342 | 0.7/2.2/4.28 * | 0.6/1.4/3.1 |
* I/II (median)/III quartile.
Figure 1. Chart of the general strategy for model development and testing. Top line of the chart: in each new year X (right-most position on the x-axis), an updated model can be trained on the gold-standard data already available, i.e., up to two years earlier (blue), and used to predict the following two years, X − 1 and X (white). Middle line: in the following year X + 1, a second, updated prediction (yellow) can be made with the previous model on one of the tested years (X − 1). Bottom line: in year X + 2, the gold standard is assumed to be ready for year X − 1, which becomes new training data (blue). The model can then provide an updated prediction for year X and new predictions for years X + 1 and X + 2 (i.e., the current one, “just ended”).
Cases in each set of the models trained. All child records for a given year represent a case, i.e., the same child in distinct years represents distinct and independent cases. Each row reports datasets for the training, validation, and test of a model.
| Train Years | Test Years | Training Phase: Train (#) | Training Phase: Validation (#) | Testing Phase: Train (#) | Testing Phase: Test (#) |
|---|---|---|---|---|---|
| 2004 | 2005–2006 | 1588 | 396 | 1984 | 7854 |
| 2004–2005 | 2006–2007 | 4405 | 1099 | 5504 | 9454 |
| 2004–2006 | 2007–2008 | 7873 | 1965 | 9838 | 10,852 |
| 2004–2007 | 2008–2009 | 11,969 | 2989 | 14,958 | 12,020 |
| 2004–2008 | 2009–2010 | 16,555 | 4135 | 20,690 | 13,062 |
| 2004–2009 | 2010–2011 | 21,586 | 5392 | 26,978 | 13,848 |
| 2004–2010 | 2011–2012 | 27,006 | 6746 | 33,752 | 14,139 |
| 2004–2011 | 2012–2013 | 32,666 | 8160 | 40,826 | 14,017 |
| 2004–2012 | 2013–2014 | 38,319 | 9572 | 47,891 | 12,768 |
| 2004–2013 | 2014 | 43,882 | 10,961 | 54,843 | 5816 |
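The train/test year pairs in the table follow an expanding-window pattern: training always starts in 2004, and each model is tested on the (up to) two years that follow its last training year. A minimal sketch that enumerates these pairs is given below; the function name `rolling_splits` and its parameters are illustrative assumptions, not part of the authors' code.

```python
def rolling_splits(first_year=2004, last_year=2014, horizon=2):
    """Enumerate (train_years, test_years) pairs for an expanding window.

    Hypothetical helper: training always starts at first_year, and testing
    covers the next `horizon` years, truncated at last_year.
    """
    splits = []
    for end in range(first_year, last_year):
        train = list(range(first_year, end + 1))
        test = [y for y in range(end + 1, end + 1 + horizon) if y <= last_year]
        splits.append((train, test))
    return splits


splits = rolling_splits()
for train, test in splits:
    print(f"train {train[0]}-{train[-1]} -> test {test[0]}-{test[-1]}")
```

With the default arguments this yields ten pairs, from training on 2004 alone (tested on 2005 and 2006) to training on 2004 through 2013 (tested on 2014 only).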
Figure 2. Flowchart of the trained network. The boxes report the shapes, and the interpretation of the shapes, of the data between each computation step, i.e., between layers of the network. Layers are reported as the connections linking the boxes. N represents the number of records passed as input; for our minibatch training, N is 16; for overall records, N depends on the set reported in Table 2.
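The layer-by-layer details of the network are not recoverable from this record. For orientation only, the textbook GRU update is sketched below in one common convention; the symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden state, $W$, $U$, $b$ for learned parameters) are assumptions of this sketch and may differ from the exact variant the authors used.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new state)} \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right] && \text{(bidirectional concatenation)}
\end{aligned}
```

In the bidirectional case, one GRU reads the token sequence left to right and a second reads it right to left, and their hidden states are concatenated at each step.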
Figure 3. AUC-ROC (y-axis) performance progression across epochs of training (x-axis) and model years (panels from 2004 to 2004–2013, from left to right) for both the train (green) and test (red) sets. The 95% CIs are reported as shaded areas.
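As a reminder of what the plotted metric measures, AUC-ROC equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison on toy data; the function name, scores, and labels are invented for illustration and are not Pedianet values, and the authors' actual implementation is not stated in this record.

```python
def auc_roc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    Quadratic in the number of records, so suitable only as an illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores and labels (hypothetical)
toy_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 0, 1, 1, 1]
toy_auc = auc_roc(toy_scores, toy_labels)
print(toy_auc)
```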
Number of positives and negatives, area under the receiver operating characteristic curve (AUC-ROC), predicted true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn), and precision or positive predictive value (prec) and recall or sensitivity (rec) for each model year (by row, indexed by the column Model Year), computed on the corresponding test sets. Bold face is used to highlight the best performance column-wise.
| Model Year | Positives | Negatives | AUC | tp | tn | fp | fn | prec | rec |
|---|---|---|---|---|---|---|---|---|---|
| 2004–2004 | 637 | 1,954 | 0.804 | 540 | 5,180 | 2,037 | 97 | 0.210 | 0.848 |
| 2004–2005 | 172 | 3,348 | 0.385 | 188 | 8,474 | 35 | 757 | 0.843 | 0.199 |
| 2004–2006 | 465 | 3,869 | 0.588 | 194 | 10,024 | 41 | 593 | 0.826 | 0.247 |
| 2004–2007 | 480 | 4,640 | 0.649 | 130 | 11,403 | 33 | 454 | 0.798 | 0.223 |
| 2004–2008 | 307 | 5,425 | 0.582 | 102 | 12,470 | 51 | 439 | 0.667 | 0.189 |
| 2004–2009 | 277 | 6,011 | 0.652 | 98 | 13,386 | 46 | 318 | 0.681 | 0.236 |
| 2004–2010 | 264 | 6,510 | 0.775 | 37 | 13,870 | 40 | 192 | 0.481 | 0.162 |
| 2004–2011 | 152 | 6,922 | 0.835 | 43 | 13,848 | 19 | 107 | 0.694 | 0.287 |
| 2004–2012 | 77 | 6,988 | 0.832 | 45 | 12,645 | 22 | 56 | 0.672 | 0.446 |
| 2004–2013 | 73 | 6,879 | 0.953 | 17 | 5,698 | 90 | 11 | 0.159 | 0.607 |
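The precision and recall columns follow directly from the confusion counts in each row: prec = tp / (tp + fp) and rec = tp / (tp + fn). A short sketch that reproduces the first row (model 2004–2004) from its tabulated counts is shown below; the function name `precision_recall` is an illustrative assumption.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Counts taken from the 2004-2004 row of the table
prec, rec = precision_recall(tp=540, fp=2037, fn=97)
print(f"prec={prec:.3f} rec={rec:.3f}")  # prec=0.210 rec=0.848
```

The first row shows the trade-off clearly: the model recovers most true cases (high recall) at the cost of many false positives (low precision), while the later rows show the opposite pattern.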
Incidences of VZV infections observed in Pedianet and estimated by the model trained.
| Model Year | Test Years | Positives | Negatives | Observed (%) | Estimated (%) | Bias (Estimated − Observed) |
|---|---|---|---|---|---|---|
| 2004 | 2005–2006 | 637 | 1,954 | 8.11 | 32.8 | 24.7 |
| 2004–2005 | 2006–2007 | 172 | 3,348 | 10 | 2.36 | −7.64 |
| 2004–2006 | 2007–2008 | 465 | 3,869 | 7.25 | 2.17 | −5.09 |
| 2004–2007 | 2008–2009 | 480 | 4,640 | 4.86 | 1.36 | −3.5 |
| 2004–2008 | 2009–2010 | 307 | 5,425 | 4.14 | 1.17 | −2.97 |
| 2004–2009 | 2010–2011 | 277 | 6,011 | 3 | 1.04 | −1.96 |
| 2004–2010 | 2011–2012 | 264 | 6,510 | 1.62 | 0.54 | −1.08 |
| 2004–2011 | 2012–2013 | 152 | 6,922 | 1.07 | 0.44 | −0.63 |
| 2004–2012 | 2013–2014 | 77 | 6,988 | 0.79 | 0.52 | −0.27 |
| 2004–2013 | 2014 | 73 | 6,879 | 0.48 | 1.84 | 1.36 |
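The tabulated incidences appear consistent with simple ratios over each test set: observed incidence as (tp + fn) / n, estimated incidence as (tp + fp) / n, and the last column as their difference. This relationship is inferred from the numbers rather than stated in the record, so the sketch below should be read as an assumption that happens to reproduce the last row (model 2004–2013, tested on 2014); the function name is hypothetical.

```python
def incidence_estimates(tp, tn, fp, fn):
    """Return (observed %, estimated %, difference) for one test set.

    Assumed definitions: observed counts gold-standard positives,
    estimated counts records the model classifies as positive.
    """
    n = tp + tn + fp + fn
    observed = 100 * (tp + fn) / n
    estimated = 100 * (tp + fp) / n
    return observed, estimated, estimated - observed


# Confusion counts of the 2004-2013 model on its 2014 test set
obs, est, bias = incidence_estimates(tp=17, tn=5698, fp=90, fn=11)
print(f"observed={obs:.2f} estimated={est:.2f} bias={bias:.2f}")
```

A negative difference means the model underestimates incidence (more false negatives than false positives); a positive one means it overestimates.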
Figure 4. Receiver operating characteristic (ROC) curves of the models trained to classify VZV infections. The training years of each model are reported in the facet headers. Testing years are the following two, up to 2014. Color variation along the curves represents the variation of the error in the incidence estimation. The optimal cut-off, which maximizes the product of precision and recall, is marked (red dot) alongside the error produced by classifying records with it.
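The cut-off selection described in the caption can be sketched as a scan over candidate thresholds, keeping the one with the largest precision × recall product. The example below uses invented toy scores and labels, and the function name `best_cutoff` is hypothetical; the authors' search procedure and tie-breaking rule are not given in this record.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, precision * recall) maximizing the product.

    A record is classified positive when its score is >= cutoff.
    Ties keep the lowest qualifying cutoff.
    """
    best_cut, best_value = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 1)
        if tp == 0:
            continue  # precision and recall are both zero or undefined
        value = (tp / (tp + fp)) * (tp / (tp + fn))
        if value > best_value:
            best_cut, best_value = cut, value
    return best_cut, best_value


# Toy predicted probabilities and gold-standard labels (hypothetical)
demo_scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
demo_labels = [0, 0, 1, 0, 0, 1, 1, 1]
cutoff, product = best_cutoff(demo_scores, demo_labels)
print(cutoff, product)  # 0.7 0.75
```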