Corrado Lanera, Paola Berchialla, Ileana Baldi, Giulia Lorenzoni, Lara Tramontan, Antonio Scamarcia, Luigi Cantarutti, Carlo Giaquinto, Dario Gregori.
Abstract
BACKGROUND: The detection of infectious diseases through the analysis of free text in electronic health reports (EHRs) can provide prompt and accurate background information for the implementation of preventative measures, such as advertising and monitoring the effectiveness of vaccination campaigns.
Keywords: electronic health report; machine learning technique; pediatric infectious disease; text mining; varicella zoster
Year: 2020 PMID: 32369038 PMCID: PMC7238079 DOI: 10.2196/14330
Source DB: PubMed Journal: JMIR Med Inform
Main characteristics used for the train (Veneto) and test (Sicilia) data sets.
| Characteristic | Train | Test |
| Database | Pedianet | Pedianet |
| Language | Italian | Italian |
| Italian Region | Veneto | Sicilia |
| Date span | January 2, 2004-December 31, 2014 | January 7, 2004-December 30, 2014 |
| Records, n | 1,230,355 | 569,926 |
| Children, n | 7631 | 2347 |
| Pediatricians, n | 46 | 13 |
| Positive cases, n (%) | 3481 (45.6%) | 128 (5.4%) |
Tables used from the Pedianet database.
| Table topic | Content | Type of data | Example |
| Accessing | Reasons for accessing the pediatrician and diagnoses | Free text (including codes) | Ritardo di crescita <783.4> |
| Diaries | Pediatrician’s free-text diaries | Free text | DIBASE OS GTT 10ML 10000UI/ML n° conf. 2\r\n per Visita di controllo e di follow up\r\n\r\n |
| Hospitalizations | Details on hospital admissions, diagnoses, and length of stays | Free text | Divisione di pediatria Tosse, difficolta' respiratoria e di alimentazione |
| SOAPa | Symptoms, objectivity, diagnosis, or prescriptions | Free text (including codes) | |
| Specialistic visits | Visit type and its diagnosis | Free text (including codes) | |
aSOAP: symptoms, objectivity, diagnosis, or prescriptions.
bFor tables with multiple fields, field names are reported in italics.
Figure 1. Flowchart of the processing pipeline for the training set: acquisition of the five tables containing the electronic health records (dark gray), which were merged into a single table (dark blue) and preprocessed (gray), with the removed content specified (pink), prior to the creation of the document-term matrix (DTM) (yellow); computation of the weights (light blue); dimensionality reduction, that is, the reduction of the terms used (light gray); and the final DTM used (green). DTM: document-term matrix; SOAP: symptoms, objectivity, diagnosis, or prescriptions; TF-iDF: term frequencies–inverse document frequencies.
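The steps named in Figure 1 (build the DTM, weight it by TF-iDF, reduce the terms) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the toy records, the tokenizer, the TF-iDF variant (relative term frequency times log(N/df)), and the `min_docs` threshold used to drop rare terms are all assumptions introduced here.

```python
import math
import re
from collections import Counter

# Hypothetical toy records standing in for merged free-text EHR entries.
records = [
    "varicella vescicole febbre",
    "visita di controllo",
    "febbre tosse",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def build_dtm(docs):
    """Return (vocabulary, raw term-count matrix), one row per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def tf_idf(dtm):
    """Weight counts by relative term frequency times log(N / document frequency)."""
    if not dtm:
        return []
    n_docs, n_terms = len(dtm), len(dtm[0])
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) for d in df]
    weighted = []
    for row in dtm:
        total = sum(row)
        weighted.append(
            [(row[j] / total) * idf[j] if total else 0.0 for j in range(n_terms)]
        )
    return weighted

def drop_rare_terms(vocab, dtm, min_docs=2):
    """Keep only terms found in at least min_docs documents (hypothetical rule)."""
    keep = [
        j for j in range(len(vocab))
        if sum(1 for row in dtm if row[j] > 0) >= min_docs
    ]
    return [vocab[j] for j in keep], [[row[j] for j in keep] for row in dtm]

vocab, dtm = build_dtm(records)
weights = tf_idf(dtm)
reduced_vocab, reduced_dtm = drop_rare_terms(vocab, dtm)
```

In this toy corpus only "febbre" appears in two records, so it is the only term that survives the reduction step. A real pipeline would apply the preprocessing and removal rules shown in the flowchart before building the matrix.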
Performance on the training set of the three machine learning techniques using a 5-fold cross-validation method.
| Technique | Sensitivity, mean (95% CI) | PPVa, mean (95% CI) | NPVb, mean (95% CI) | Specificity, mean (95% CI) | F score, mean (95% CI) |
| GLMNetc | 80.2 (77.7-82.7) | 73.2 (70.9-75.6) | 90.9 (89.6-92.2) | 87.1 (85.6-88.7) | 76.5 (75.6-77.5) |
| MAXENTd | 68.8 (66.8-70.7) | 66.0 (62.5-69.5) | 86.1 (85.2-86.9) | 84.5 (82.7-86.3) | 67.4 (64.7-70.0) |
| Boosting | 86.6 (82.1-91.1) | 95.8 (93.2-98.5) | 94.4 (92.4-96.3) | 98.3 (97.0-99.6) | 90.9 (89.7-92.1) |
aPPV: positive predictive value.
bNPV: negative predictive value.
cGLMNet: elastic-net regularized generalized linear model.
dMAXENT: maximum entropy.
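The measures reported in the table follow from the four cells of a confusion matrix, and the last column is numerically consistent with an F score, that is, the harmonic mean of sensitivity and PPV. The sketch below shows the standard definitions; the confusion-matrix counts are hypothetical and the function names are introduced here for illustration only. Because the table reports means over folds, recomputing the F score from the mean sensitivity and mean PPV is only an approximation.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_score": f_score,
    }

def f_score_from(sensitivity, ppv):
    """Harmonic mean of sensitivity and PPV (same units in, same units out)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)

# Hypothetical counts, not taken from the study.
toy = classification_metrics(tp=80, fp=20, tn=90, fn=10)

# Approximate check against the GLMNet row (mean sensitivity 80.2, mean PPV 73.2).
glmnet_f = f_score_from(80.2, 73.2)
```

For the GLMNet row this gives about 76.5, which matches the value in the last column.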
Performance on the test set of the three machine learning techniques under consideration.
| Technique | Sensitivity, mean (95% CI) | PPVa, mean (95% CI) | NPVb, mean (95% CI) | Specificity, mean (95% CI) | F score, mean (95% CI) |
| GLMNetc | 72.3 (66.4-78.1) | 24.5 (21.0-28.0) | 98.3 (97.9-98.6) | 87.4 (85.4-89.5) | 36.5 (32.2-40.8) |
| MAXENTd | 74.8 (62.2-87.5) | 11.0 (9.5-12.5) | 98.0 (97.3-98.6) | 65.5 (54.7-76.2) | 19.1 (17.2-20.9) |
| Boosting | 79.2 (69.7-88.7) | 63.1 (42.7-83.5) | 98.8 (98.3-99.3) | 96.9 (94.2-99.6) | 68.5 (59.3-77.7) |
aPPV: positive predictive value.
bNPV: negative predictive value.
cGLMNet: elastic-net regularized generalized linear model.
dMAXENT: maximum entropy.
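The much lower PPV on the test set is what Bayes' rule would predict when the share of positive cases falls from 45.6% (training) to 5.4% (test) while sensitivity and specificity stay roughly stable. The sketch below illustrates this with the GLMNet test-set means; it assumes that the prevalence among children (128 of 2347) is the relevant base rate, which the tables do not state explicitly, and the function name is introduced here for illustration.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """Expected PPV given sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Test-set prevalence from the data set characteristics (128 positives of 2347 children).
test_prevalence = 128 / 2347

# GLMNet means on the test set: sensitivity 72.3%, specificity 87.4%.
ppv_test = ppv_from_prevalence(0.723, 0.874, test_prevalence)

# Same sensitivity and specificity at the training-set prevalence (45.6%).
ppv_train_like = ppv_from_prevalence(0.723, 0.874, 0.456)
```

With these inputs the expected PPV at the test prevalence is about 25%, close to the 24.5% reported for GLMNet, whereas the same classifier at the training prevalence would have a PPV above 80%.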
Agreement between elastic-net regularized generalized linear model, maximum entropy, and boosting using 5-fold cross-validation.
| Technique | Wrongly agreea, n | Correctly agreeb, n | Disagreec, n | Gwet AC1d,e (95% CI) |
| GLMNetf vs MAXENTg | 669 | 5609 | 1353 | 0.68 (0.67-0.70) |
| GLMNet vs boosting | 195 | 6269 | 1146 | 0.74 (0.72-0.75) |
| MAXENT vs boosting | 224 | 5895 | 1491 | 0.66 (0.65-0.68) |
aThe “Wrongly Agree” column refers to the number of records misclassified by both techniques.
bThe “Correctly Agree” column states the number of records correctly classified by both techniques.
cThe “Disagree” column lists the number of records for which the techniques disagree in the classification.
dAC1: agreement coefficient 1.
eGwet AC1 is the index of agreement between the two techniques compared. The AC1 legend is as follows: AC1<0=disagreement; AC1 0.00-0.40=poor; AC1 0.41-0.60=discrete; AC1 0.61-0.80=good; AC1 0.81-1.00=optimal.
fGLMNet: elastic-net regularized generalized linear model.
gMAXENT: maximum entropy.
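For two classifiers and a binary outcome, Gwet's AC1 is (pa - pe) / (1 - pe), where pa is the observed agreement and pe = 2π(1 - π), with π the average of the two classifiers' proportions of positive predictions. The sketch below is a minimal illustration under that two-rater, two-category assumption; the label vectors are hypothetical, and the cut-offs in `interpret_ac1` follow the legend above with the gaps between bands closed by using inclusive upper bounds.

```python
def gwet_ac1(labels_a, labels_b):
    """Gwet's AC1 for two raters and binary (0/1) labels."""
    n = len(labels_a)
    pa = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    pi = (sum(labels_a) / n + sum(labels_b) / n) / 2
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)

def interpret_ac1(ac1):
    """Map an AC1 value to the bands given in the table legend."""
    if ac1 < 0:
        return "disagreement"
    if ac1 <= 0.40:
        return "poor"
    if ac1 <= 0.60:
        return "discrete"
    if ac1 <= 0.80:
        return "good"
    return "optimal"

def observed_agreement(wrongly_agree, correctly_agree, disagree):
    """Share of records on which both techniques give the same classification."""
    agree = wrongly_agree + correctly_agree
    return agree / (agree + disagree)

# Hypothetical predictions from two classifiers on ten records.
pred_a = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
pred_b = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
ac1 = gwet_ac1(pred_a, pred_b)

# Observed agreement for the GLMNet vs MAXENT row of the table.
pa_glmnet_maxent = observed_agreement(669, 5609, 1353)
```

The table row alone gives only the observed agreement (about 0.82 for GLMNet vs MAXENT); reproducing the reported AC1 would also require each technique's proportion of positive predictions, which is not listed.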