Yung-Chun Chang, Yu-Wen Chiu, Ting-Wu Chuang.
Abstract
BACKGROUND: Globalization and environmental changes have intensified the emergence or re-emergence of infectious diseases worldwide, such as outbreaks of dengue fever in Southeast Asia. Collaboration on region-wide infectious disease surveillance systems is therefore critical but difficult to achieve because of the different transparency levels of health information systems in different countries. Although the Program for Monitoring Emerging Diseases (ProMED)-mail is the most comprehensive international expert-curated platform providing rich disease outbreak information on humans, animals, and plants, the unstructured text content of the reports makes analysis for further application difficult.
Keywords: ProMED-mail; bidirectional long short-term memory; dengue; dual channel; natural language processing
Year: 2022 PMID: 35830225 PMCID: PMC9491834 DOI: 10.2196/34583
Source DB: PubMed Journal: JMIR Public Health Surveill ISSN: 2369-2960
The statistics of our corpus (N=965)^a.
| Sentence type | Single-clausal sentences, n (%) | Multiclausal sentences, n (%) |
| Dengue case sentences (n=246) | 49 (19.9) | 197 (80.1) |
| Non-dengue case sentences (n=719) | 407 (56.6) | 312 (43.4) |
^a Number of paragraphs: 129, number of sentences: 965, number of single-clausal sentences: 456, and number of multiclausal sentences: 509.
Figure 1. Overview of the proposed framework. A: attention layer; BiLSTM: bidirectional long short-term memory; CD: cardinal number; DT: determiner; JJ: adjective; L: forward long short-term memory layer and backward long short-term memory layer; NN: noun, singular or mass; NNS: noun, plural; POS: parts of speech; ProMED: Program for Monitoring Emerging Diseases; VBG: verb, gerund, or present participle.
The performance results of the compared methods.
| System | Negative, precision; recall; F1 score (%) | Positive, precision; recall; F1 score (%) | Macroaverage, precision; recall; F1 score (%) | P value |
| NB^a | 81.69; 99.30; 89.64 | 94.51; 34.96; 51.04 | 88.10; 67.13; 70.34^b | <.001 |
| DT^c | 95.79; 85.40; 90.29 | 67.59; 89.02; 76.84 | 81.69; 87.21; 83.57^b | <.001 |
| RF^d | 96.53; 88.87; 92.54 | 73.60; 90.65; 81.24 | 85.06; 89.76; 86.89^b | <.001 |
| SVM^e | 94.12; 95.69; 94.90 | 86.75; 82.52; 84.58 | 90.43; 89.10; 89.74^b | <.001 |
| XGB^f | 92.55; 91.52; 92.03 | 75.98; 78.46; 77.20 | 84.26; 84.99; 84.61^b | <.001 |
| MLP^g | 94.64; 90.82; 92.69 | 76.00; 84.96; 80.23 | 85.32; 87.89; 86.46^b | <.001 |
| CNN^h for text | 94.47; 94.99; 94.73 | 85.12; 83.74; 84.43 | 89.80; 89.37; 89.58^b | <.001 |
| LSTM^i | 94.72; 94.85; 94.79 | 84.90; 84.55; 84.73 | 89.81; 89.70; 89.76^b | <.001 |
| BiLSTM^j | 95.74; 93.88; 94.80 | 83.08; 87.80; 85.38 | 89.41; 90.84; 90.09^k | .94 |
| DuBiLSTM^l | 95.89; 94.16; 95.02 | 83.78; 88.21; 85.94 | 89.84; 91.18; 90.48^k | .95 |
| Our method | 97.72; 95.27; 96.48 | 87.12; 93.50; 90.20 | 92.42; 94.38; 93.34 | —^m |
^a NB: naïve Bayes.
^b P<.001 (a chi-square test was applied to determine whether our method significantly improves performance in comparison with other methods).
^c DT: decision tree.
^d RF: random forest.
^e SVM: support vector machine.
^f XGB: extreme gradient boosting.
^g MLP: multilayer perceptron.
^h CNN: convolutional neural network.
^i LSTM: long short-term memory.
^j BiLSTM: bidirectional long short-term memory.
^k P>.05 (a chi-square test was applied to determine whether our method significantly improves performance in comparison with other methods).
^l DuBiLSTM: dual-channel bidirectional long short-term memory.
^m Not available.
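The per-class and macroaverage figures in the performance table can be cross-checked with a few lines of arithmetic. The sketch below assumes the standard definition of the F1 score (harmonic mean of precision and recall) and assumes the macroaverage is the unweighted mean of the two per-class values. The tabulated numbers for "Our method" are consistent with both assumptions, but the source does not state the exact computation. The function names `f1_score` and `macro_average` are illustrative, not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(negative, positive):
    """Unweighted mean of each metric across the two classes (assumed definition)."""
    return tuple((n + p) / 2 for n, p in zip(negative, positive))


# Precision and recall for "Our method", copied from the table.
neg_p, neg_r = 97.72, 95.27
pos_p, pos_r = 87.12, 93.50

# Each tuple holds (precision, recall, F1) for one class.
neg = (neg_p, neg_r, f1_score(neg_p, neg_r))
pos = (pos_p, pos_r, f1_score(pos_p, pos_r))
macro = macro_average(neg, pos)

print("Negative F1:", round(neg[2], 2))
print("Positive F1:", round(pos[2], 2))
print("Macro F1:", round(macro[2], 2))
```

The recomputed values agree with the table to within rounding, which suggests the reported macroaverage is a simple mean over the two classes rather than a support-weighted one.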
Figure 2. The precision-recall curves of the compared methods. CNN: convolutional neural network; DT: decision tree; LSTM: long short-term memory; MLP: multilayer perceptron; NB: naïve Bayes; RF: random forest; SVM: support vector machine; XGB: extreme gradient boosting.
Figure 3. The network visualization for generated linguistic patterns. CDC: Taiwan Centers for Disease Control.
Figure 4. The network visualization for generated parts-of-speech (POS) patterns. CC: coordinating conjunction; CD: cardinal number; DT: determiner; EX: existential there; IN: preposition or subordinating conjunction; JJ: adjective; JJS: adjective, superlative; NN: noun, singular or mass; NNP: proper noun, singular; NNPS: proper noun, plural; NNS: noun, plural; PRP$: possessive pronoun; RB: adverb; RBS: adverb, superlative; TO: to; VBD: verb, past tense; VBG: verb, gerund, or present participle; VBN: verb, past participle; VBP: verb, nonthird person singular present; VBZ: verb, third person singular present; WDT: wh-determiner.
Error distribution of dengue case information detection.
| Clause type | False positive, n (%) | False negative, n (%) | Error rate, n (%) |
| Single-clausal (n=456) | 7 (1.5) | 7 (1.5) | 14 (3.1) |
| Multiclausal (n=509) | 29 (5.7) | 15 (2.9) | 44 (8.6) |
| Corpus (n=965) | 36 (3.7) | 22 (2.3) | 58 (6.0) |
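The percentages in the error distribution table follow directly from the counts. As a minimal sketch, assuming each percentage is the error count divided by the number of sentences of that clause type, the rows can be recomputed as follows. The helper `rate` and the dictionary layout are illustrative choices, not part of the original study.

```python
def rate(count, total):
    """Percentage of `total`, rounded to one decimal place as in the table."""
    return round(100 * count / total, 1)


# Counts copied from the error distribution table.
errors = {
    "Single-clausal": {"n": 456, "fp": 7, "fn": 7},
    "Multiclausal": {"n": 509, "fp": 29, "fn": 15},
}

# The corpus row is the sum of the two clause types.
corpus = {key: sum(row[key] for row in errors.values()) for key in ("n", "fp", "fn")}
errors["Corpus"] = corpus

for name, row in errors.items():
    total_errors = row["fp"] + row["fn"]
    print(
        name,
        f"FP {row['fp']} ({rate(row['fp'], row['n'])})",
        f"FN {row['fn']} ({rate(row['fn'], row['n'])})",
        f"errors {total_errors} ({rate(total_errors, row['n'])})",
    )
```

Under this assumption, multiclausal sentences show an error rate of 8.6% against 3.1% for single-clausal sentences, and most of that gap comes from false positives.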
Figure 5. Box plot of expert assessment on a 5-point Likert scale of the quality of generated summaries.