
EventEpi – A natural language processing framework for event-based surveillance.

Auss Abbood1, Alexander Ullrich1, Rüdiger Busche2,3, Stéphane Ghozzi1,4.   

Abstract

According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of public health agents sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural language processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at the RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles' key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We extracted the key country and disease using a heuristic, with good results, and trained a naive Bayes classifier to find the key date and confirmed-case count, using the RKI's EBS database as labels, which performed modestly. Then, for relevance scoring, we defined two classes to which any article might belong: the article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using bag-of-words, document, and word embeddings. The best classifier, a logistic regression, achieved a sensitivity of 0.82 and an index balanced accuracy of 0.61. Finally, we integrated these functionalities into a web application called EventEpi, where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code and data are publicly available under open licenses.


Year:  2020        PMID: 33216746      PMCID: PMC7717563          DOI: 10.1371/journal.pcbi.1008277

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Introduction

Event-based surveillance

One of the major goals of public health surveillance is the timely detection and subsequent containment of infectious disease outbreaks, to minimize health consequences and the burden on the public health apparatus. Surveillance systems are an essential part of efficient early-warning mechanisms [1, 2]. In traditional reporting systems, the acquisition of these data is mostly a passive process and follows routines established by the legislator and the public health institutes [2]. This process is called indicator-based surveillance (IBS). Hints of an outbreak, however, can also be detected through changed circumstances that are known to favor outbreaks, e.g., warm weather might contribute to more salmonellosis outbreaks [3], or a loss of proper sanitation might lead to cholera outbreaks [4]. Therefore, besides traditional surveillance, which typically relies on routine reporting from healthcare facilities, secondary data such as weather, attendance monitoring at schools and workplaces, social media, and the web are also significant sources of information [2]. The monitoring and analysis of information generated outside the traditional reporting system is called event-based surveillance (EBS).

EBS can greatly reduce the delay between the occurrence and the detection of an event compared to IBS. It enables public health agents to detect and report events before the recognition of human cases in the routine reporting system of the public health system [2]. Especially on the web, the topicality and quantity of data can be useful to detect even rumors of suspected outbreaks [5]. As a result, more than 60% of initial outbreak reports refer to such informal sources [6]. Filtering this massive amount of data poses the difficulty of finding the right criteria for which information to consider and which to discard. This task is particularly difficult because it is important that the filter does not miss any important events (sensitivity) while being confident about what to exclude (specificity). Without such a filter process, it is infeasible to perform EBS on larger data sources. Algorithms in the field of natural language processing are well suited to tap these informal resources and to structure and filter this information automatically and systematically [7].

Motivation and contribution

At the RKI, the Information Centre for International Health Protection (Informationsstelle für Internationalen Gesundheitsschutz, INIG), among other units, performs EBS to identify events relevant to public health in Germany. Their routine tasks are defined in standard operating procedures (SOPs) and include reading online articles from a defined set of sources, evaluating them for their relevance, and then manually filling a spreadsheet with information from the relevant articles. This spreadsheet is INIG's EBS database, called Incident Database (Ereignisdatenbank, IDB). The existence of SOPs and the amount of time spent on manual information extraction, and especially data entry, led to the idea of automating parts of the process. Applying methods of natural language processing and machine learning to the IDB, we developed a pipeline that: automatically extracts key entities (disease, country, confirmed-case count, and date of the case count, which are the mandatory entries of the IDB and thus are complete) from an epidemiological article and puts them in a database, making tedious data entry unnecessary; scores articles for relevance to allow the most important ones to be shown first; and provides the results in a web service named EventEpi that can be integrated into EBS workflows. We did not formally define what a "disease" was but rather followed the conventions at INIG. Although considering symptoms or syndromes might lead to earlier event detection, those were rarely entered in the IDB. All code and data necessary to reproduce this work are freely available under open licenses: the source code can be found on GitHub under a GNU license [8]; the IDB and word embeddings (see Training of the classifiers) are on Figshare under a CC BY 4.0 license, [9] and [10] respectively.

Related work

The Global Rapid Identification Tool System (GRITS) [11] by the EcoHealth Alliance is a web service that provides automatic analyses of epidemiological texts. It uses EpiTator [12] to extract crucial information from a text, such as dates or countries, and suggests the most likely disease the text is about. However, GRITS cannot be automated and is not customizable. To use it in EBS, one would need to manually copy-paste both the URLs and the output of the analysis. Furthermore, GRITS does not filter texts for relevance but only extracts entities from provided texts. The recent disease incidents page of MEDISYS [13, 14], which channels PULS [15, 16], presents automatically-extracted outbreak information from a vast number of news sources in tabular form. However, it is not clear how articles are filtered, how information is extracted, or how uncertain the output is. Therefore, it cannot be used as such, and since it is closed software, we could not develop it further.

Materials and methods

The approach presented here consists of two largely independent, but complementary parts: key information extraction and relevance scoring. Both approaches are integrated in a web application called EventEpi. After preprocessing the IDB, texts of articles from the RKI’s main online sources have to be extracted. The full pipeline is shown in Fig 1. With the exception of the convolutional neural network (CNN) for which we used Keras [17], we used the Python package scikit-learn to implement the machine learning algorithms [18]. The exact configurations of all algorithms can be found in S1 Table.
Fig 1

An illustration of the EventEpi architecture.

The orange part of the plot describes the relevance scoring of epidemiological texts, vectorized with word embeddings (created with word2vec), document embeddings (the mean over word embeddings), and bag-of-words, and fed to different classification algorithms (support vector machine (SVM), k-nearest neighbor (kNN), and logistic regression (LR), among others). The part of EventEpi that extracts the key information is colored in blue. Key information extraction is trained on sentences containing named entities, using a naive Bayes classifier or the most-frequent approach applied to the output of EpiTator, an epidemiological annotation software. The workflow ends with the results being saved into the EventEpi database, which is embedded in EventEpi's web application.


Key information extraction

Key information extraction from epidemiological articles was in part already solved by EpiTator. EpiTator is a Python library to extract named entities that are particularly relevant in the field of epidemiology, namely: disease, location, date, and count entities. EpiTator uses spaCy [19] in the background to preprocess text. One function of EpiTator is to return all entities of an entity class (e.g., disease) found in a text. However, INIG, like other EBS groups, is mostly interested in the key entity of each class. Accordingly, the IDB contains a single value for each type of information. Thus, we needed to be able to filter the output of EpiTator to a single entity per class that best describes the corresponding event. In the case of the IDB, these are the source, disease, country, confirmed-case count, and date of the number of confirmed cases of an outbreak article. Before we could explore methods to find the aforementioned key entities, we applied standard cleaning and preprocessing to the IDB such that it could be fed into machine learning algorithms (see S1 Text). To find the key entities among those found by EpiTator, we explored two methods, a heuristic and a classification-based approach, which we refer to as key entity filtering (see Sec. Key entity filtering). If the filtered output of EpiTator for a given article matched the respective key information in the IDB, we knew that the filter had selected the correct key entities.
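
As a minimal sketch of this first step, the following shows how EpiTator's annotators can be asked for all candidate entities of a class. The example text is invented; tier and metadata names follow EpiTator's interface as we understand it, and the geonames database importer must have been run beforehand.

```python
from epitator.annotator import AnnoDoc
from epitator.geoname_annotator import GeonameAnnotator
from epitator.count_annotator import CountAnnotator
from epitator.date_annotator import DateAnnotator

# Invented example article snippet.
text = ("As of 3 May 2018, 45 confirmed cholera cases "
        "have been reported in Yemen.")

doc = AnnoDoc(text)
doc.add_tiers(GeonameAnnotator())  # candidate locations (needs the geonames import)
doc.add_tiers(CountAnnotator())    # candidate case counts
doc.add_tiers(DateAnnotator())     # candidate dates

# Each tier holds *all* candidate spans; the key entity filtering
# described next reduces them to one entity per class.
for span in doc.tiers["geonames"].spans:
    print("location:", span.text)
for span in doc.tiers["counts"].spans:
    print("count:", span.text, span.metadata.get("count"))
for span in doc.tiers["dates"].spans:
    print("date:", span.text)
```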

Key entity filtering

A naive approach to finding the key entity out of all the entities returned by EpiTator is to pick the most frequent one (the mode). We call this the most-frequent approach. To find the key country, we focused only on the first three geographic entities mentioned in the text, since articles tend to contain mentions of other geographic entities different from the key country.

To improve performance, we developed a learning-based approach for the key date and confirmed-case count. This is shown in the supervised learning block in Fig 1. For the learning approach, we took the texts of the articles published in 2018 from the two most relevant sources, WHO DONs [20] and ProMED [21, 22] (a list of the RKI's frequently used sources and the reason for selecting those two are described in S1 Text), and applied a sentence tokenizer from the Python library NLTK [23]. Tokenization is the process of splitting text into atomic units, typically sentences, words, or phrases. We filtered all sentences to only keep those that contained some entity $e_j^c$ recognized by EpiTator, with $c$ being the class of the entity (date or confirmed-case count) and $j$ indexing the $j$-th entity in a text. If an entity $e_j^c$ in a sentence matched the entry of class $c$ in the IDB, then we labeled this sentence as key. Every other sentence was labeled not key. The distribution of samples in the datasets obtained is summarized in S1 Fig.

Then we trained a Bernoulli naive Bayes classifier (Bernoulli NBC) [24] on these labeled sentences to learn the relevant properties of sentences that led to the inclusion of their information in the IDB. Before applying a classifier, a text needs to be represented as a vector of numbers (vectorization). During training, a Bernoulli NBC receives for each input sentence a binary vector $b$ over the whole vocabulary (all the words seen during training), where position $i$ of the vector indicates the $i$-th term of the vocabulary: if the $i$-th term $t_i$ is present in the input sentence, then $b_i = 1$, and $b_i = 0$ otherwise. Based on the binary vectors and the corresponding labels, the Bernoulli NBC assigns probabilities to individual sentences of being key and not key. The key information for class $c$ was set to the entity recognized in the sentence that has the highest probability of being key and contains a recognized entity of class $c$. This method ensures that some entity is still chosen even if no sentence is classified as key, i.e., if all sentences in a text have less than 50% probability of being key.

Additionally, we applied the multinomial NBC for comparison. The only difference to the Bernoulli NBC is that the multinomial NBC takes an occurrence vector $o$, with $o_i$ being the frequency of term $t_i$ in the text, as an input instead of a binary vector $b$. This approach is called bag-of-words. We combined bag-of-words with tf-idf (term frequency-inverse document frequency), where each term frequency is scaled so as to correct for overly frequent terms within and across documents. Formally, tf-idf is defined as

$$\mathrm{tfidf}(t, d, D) = f_{t,d} \cdot \log \frac{N}{|\{d \in D : t \in d\}|},$$

where $t$ is a term from the bag-of-words, $d$ is a document, $D$ is the corpus of all epidemiological articles seen during training (containing $N$ documents), and $f_{t,d}$ is the frequency of term $t$ in document $d$. The Bernoulli NBC might perform better on a small vocabulary, while the multinomial NBC usually performs equally well or even better on a large vocabulary [24]. We also applied further standard methods of text preprocessing (see S1 Text).
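
The following sketch illustrates this classification step with NLTK and scikit-learn. The two training sentences and their labels are invented; in EventEpi, the labels come from matching recognized entities against the IDB.

```python
from nltk.tokenize import sent_tokenize  # may require nltk.download("punkt")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical training sentences containing a date entity,
# labeled 1 ("key") if the entity matched the IDB entry, else 0.
sentences = [
    "As of 21 June 2018, 120 cases have been confirmed.",
    "The first outbreak in the region occurred in 1998.",
]
labels = [1, 0]

# Binary bag-of-words vectors, as expected by the Bernoulli NBC.
vectorizer = CountVectorizer(binary=True)
clf = BernoulliNB().fit(vectorizer.fit_transform(sentences), labels)

# For a new article, score every sentence that contains a date entity
# and keep the entity from the sentence most likely to be "key".
article = ("Cases rose sharply in May 2019. "
           "As of 2 June 2019, 300 cases were confirmed.")
candidates = sent_tokenize(article)  # in practice: only sentences with date entities
probs = clf.predict_proba(vectorizer.transform(candidates))[:, 1]
print("key sentence:", candidates[probs.argmax()])
```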

Relevance scoring

The second part of developing a framework to support EBS was to estimate the relevance of epidemiological articles. We framed the relevance evaluation as a classification problem, where articles present in the IDB were labeled relevant and all others irrelevant. We had access to all assessments in the IDB for the year 2018 and therefore scraped all WHO DON and ProMED articles from that year. This resulted in a dataset of 3236 articles, 164 of them labeled relevant and 3072 irrelevant. The exact statistics of the dataset are summarized in S2 Fig.

Training of the classifiers

Modern text classifiers tend to use word embeddings [25, 26] for vectorization rather than the tf-idf and bag-of-words approach. Word embeddings are vector representations of words that are learned on large amounts of text in an unsupervised manner. Proximity in the word embedding space tends to correspond to semantic similarity. This is accomplished by assigning similar embeddings to words appearing in similar contexts. First, we used standard pre-trained embeddings, trained on the Wikipedia 2014 and Gigaword 5th Edition corpora [27]. However, many terms specific to epidemiology were not represented. Thus, we produced custom 300-dimensional embeddings, training the word2vec algorithm [28] on the Wikipedia corpus of 2020 [29] and all available WHO DON and ProMED Mail articles (61,320 articles). We applied the skip-gram approach and hierarchical softmax [28]. Those settings helped to incorporate infrequent terms [30]. The embeddings were trained for five epochs. See S1 Text for information on computational resources and elapsed time.

Since we ultimately wanted to classify whether a whole document was relevant, we needed document embeddings. Although dedicated algorithms for document embeddings exist [31], we did not have enough data to apply them meaningfully. However, taking the mean over all word embeddings of a document is a valid alternative [32] and suffices to show whether learning the relevance of an article is possible.

A further issue was imbalance: only a small fraction (5.0%) of the articles in the dataset was labeled relevant. Instead of discarding data from the majority class, we chose to up-sample the dataset using the ADASYN algorithm [33]. It generates new data points of the minority class by repeating the following steps until the proportion of minority and majority classes reaches the desired proportion (1:1): choose a random data point $x_i$ (the document embedding of article $i$) of the minority class; randomly choose another minority-class data point $x_{zi}$ among the 5-nearest neighbors of $x_i$; generate a new data point $y_i$ at a random position between $x_i$ and $x_{zi}$, i.e., $y_i = a x_i + (1 - a) x_{zi}$ with $a$ drawn uniformly at random between 0 and 1. One problem of up-sampling is that it still uses the minority class to create new examples, which might hinder the generalizability of the classifier [34]. We used the imbalanced-learn package [35] to implement ADASYN.

We compared different classifiers for the relevance scoring task using embeddings or the bag-of-words approach. The support vector machine (SVM), k-nearest neighbors (kNN), logistic regression (LR), and multilayer perceptron (MLP) used document embeddings as features. The CNN operated on the word embeddings instead of the document embeddings; that way the striding filters of the CNN, if large enough, could learn relationships between adjacent words. We capped the input documents to a maximum of 400 words for the CNN; 597 documents contained fewer than 400 words, which we padded with zero embeddings such that each document had the same shape. For the multinomial and complement NBCs, we used the bag-of-words approach, since this feature representation coincides with the assumption of the NBC to predict a class given the occurrence (probability) of a feature. See S1 Table for a detailed, tabular view of the vectorizations and parameters used.

Finally, we used layer-wise relevance propagation [36] to make decisions of the CNN explainable. This is done by assessing which of the word embeddings passed through the CNN led to the final classification.
We used iNNvestigate to implement this step [37].
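
The core of this training pipeline can be condensed into a short sketch, assuming gensim for word2vec and imbalanced-learn for ADASYN; the toy corpus and the random stand-in document embeddings and labels are invented (layer-wise relevance propagation is sketched further below, with Fig 2).

```python
import numpy as np
from gensim.models import Word2Vec
from imblearn.over_sampling import ADASYN
from sklearn.linear_model import LogisticRegression

# Toy tokenized corpus standing in for Wikipedia + WHO DON + ProMED.
corpus = [["cholera", "outbreak", "reported", "in", "yemen"],
          ["routine", "surveillance", "data", "were", "published"]]

# Skip-gram (sg=1) with hierarchical softmax (hs=1), 300 dimensions,
# five epochs, as described in the text.
w2v = Word2Vec(corpus, vector_size=300, sg=1, hs=1, epochs=5, min_count=1)

def doc_embedding(tokens, model):
    """Document embedding: mean over the word embeddings of a document."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Hypothetical imbalanced training set (8 relevant, 32 irrelevant);
# random vectors stand in for real document embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 300))
y = np.array([1] * 8 + [0] * 32)  # 1 = relevant, 0 = irrelevant

# ADASYN up-samples the minority class to a 1:1 ratio.
X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```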

Evaluation

The output of the key entity filtering (see Section Key entity filtering) was compared with the IDB entries of the respective texts. If a found key entity matched the IDB entry exactly, we counted the filtered output as correctly classified. Less stringently, the extracted date was counted as correctly classified if it was within three days of the IDB entry. This is due to EpiTator's API, which returns date ranges instead of single dates when parsing a text; a tolerance of three days allows a fair comparison with EpiTator's date ranges. The performances of all classifiers were evaluated on a test set consisting of 25% of the whole dataset. We applied stratification to ensure that both classes were evenly distributed between the training and test sets. The data for the CNN was split into a training (60%), validation (20%), and test (20%) set with slightly different class composition due to different sampling strategies (see S1 Text for details).

We consider a number of scores defined as functions of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN): precision = TP/(TP + FP); sensitivity = TP/(TP + FN); specificity = TN/(TN + FP); and F1 = 2 ⋅ TP/(2 ⋅ TP + FP + FN) [38]. Since public health agents are interested in not missing any positives by classifying them incorrectly as negatives, we considered the sensitivity a good indicator of the performance of the classifiers. As a measure of overall accuracy, we preferred the index balanced accuracy (IBA) [39], which has been developed to gauge classification performance on imbalanced datasets. It is based on sensitivity, i.e., the fraction of correctly classified relevant articles or key entities, and specificity, i.e., the fraction of correctly classified irrelevant articles or non-key entities. It is defined as

$$\mathrm{IBA}_\alpha = \big(1 + \alpha \cdot (\text{sensitivity} - \text{specificity})\big) \cdot \text{sensitivity} \cdot \text{specificity},$$

where sensitivity − specificity is called the dominance and 0 ≤ α ≤ 1 is a weighting factor that can be fine-tuned based on how significant the dominating class is supposed to be. IBA measures the trade-off between global performance and a signed index that accounts for imbalanced predictions. It favors classifiers with better true positive rates, assuming that correct predictions on the positive class are more important than true negative predictions. As in the original publication [34], we use α = 0.1.
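
For concreteness, the following sketch computes these scores from raw confusion-matrix counts; the example counts are chosen to approximately reproduce the logistic regression row of Table 3 and are otherwise illustrative.

```python
def scores(tp, tn, fp, fn, alpha=0.1):
    """Precision, sensitivity, specificity, F1, and IBA as defined above."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    dominance = sensitivity - specificity
    iba = (1 + alpha * dominance) * sensitivity * specificity
    return dict(precision=precision, sensitivity=sensitivity,
                specificity=specificity, f1=f1, iba=iba)

# Approximate counts for the logistic regression of Table 3
# (38 relevant and 771 irrelevant test articles):
print(scores(tp=31, tn=578, fp=193, fn=7))
# -> precision ~0.14, sensitivity ~0.82, specificity ~0.75, F1 ~0.24, IBA ~0.61
```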

Results

In this section we present the performance of a series of key information extraction and relevance-scoring algorithms, and describe how the findings were embedded into the web application EventEpi.

Performance of key date and count extraction

We identified the most probable true entity among the many proposed by EpiTator using the most-frequent approach and two NBCs. The most-frequent approach worked well for detecting the key country (85% correctly classified) and disease (88% correctly classified). Note that EpiTator systematically failed to detect Nipah virus infection and anthrax. However, key information extraction of date and case-count entities using the most-frequent approach did not work: no entity was correctly classified in either case. The performance of both NBC algorithms applied to extract the key date and key confirmed-case count is shown in Tables 1 and 2, respectively. Confusion matrices and ROC curves present the performances in greater detail (see S5 and S7 Figs, respectively).
Table 1

Evaluation of the key date extraction.

Classifier | Pre. | Sen. | Spec. | F1 | IBA | key sample size | not key sample size
Multinomial naive Bayes | 0.55 | 0.78 | 0.69 | 0.65 | 0.54 | 27 | 54
Bernoulli naive Bayes | 0.55 | 0.81 | 0.67 | 0.66 | 0.55 | 27 | 54

For each classifier and label, the precision (Pre.), sensitivity (Sen.), specificity (Spec.), F1, index balanced accuracy (IBA) with α = 0.1, and sample size for both classes, key and not key, of the test set are given. The best values for each score are highlighted in bold.

Table 2

Evaluation of the key confirmed-case count extraction.

Classifier | Pre. | Sen. | Spec. | F1 | IBA | key sample size | not key sample size
Multinomial naive Bayes | 0.39 | 0.45 | 0.93 | 0.42 | 0.40 | 89 | 874
Bernoulli naive Bayes | 0.20 | 0.81 | 0.67 | 0.32 | 0.55 | 89 | 874

Definitions and parameters are the same as in Table 1. The best values for each score are highlighted in bold.

For both date and count information extraction, the Bernoulli NBC had the highest IBA and sensitivity. Thus, without offering perfect results, applying classification to the output of EpiTator enables key entity extraction for date and confirmed-case count. We might be able to improve the performance by hyperparameter tuning or better feature extraction (e.g., looking for key words such as "confirmed"). Increasing the amount of training data, however, would probably not lead to much improvement (see S3 Fig).

Performance of relevance scoring

The results of the relevance scoring are shown in Table 3. The confusion matrices for these results are displayed in S6 Fig and the respective ROC curves in S8 Fig. For comparison, classifiers trained on data that were not up-sampled achieved no better performance than a sensitivity of 0.03 and an IBA of 0.02 (see S2 Table). While the logistic regression has the highest sensitivity (0.82) and IBA (0.61), no model has a precision higher than 0.22, which suggests that all classifiers overfit the positive class.
Table 3

The performance evaluation of the relevance classification.

Classifier | Pre. | Sen. | Spec. | F1 | IBA | relevant sample size | irrelevant sample size
Multinomial naive Bayes | 0.22 | 0.42 | 0.93 | 0.29 | 0.37 | 38 | 771
Complement naive Bayes | 0.19 | 0.61 | 0.87 | 0.29 | 0.51 | 38 | 771
Logistic regression | 0.14 | 0.82 | 0.75 | 0.24 | 0.61 | 38 | 771
k-nearest neighbor classifier | 0.12 | 0.63 | 0.77 | 0.20 | 0.48 | 38 | 771
Support vector machine | 0.13 | 0.79 | 0.74 | 0.22 | 0.59 | 38 | 771
Multilayer perceptron | 0.22 | 0.42 | 0.93 | 0.29 | 0.37 | 38 | 771
Convolutional neural network | 0.14 | 0.55 | 0.78 | 0.23 | 0.42 | 42 | 606

For each classifier and label, the precision (Pre.), sensitivity (Sen.), specificity (Spec.), F1, index balanced accuracy (IBA) with α = 0.1, and sample size of both classes, relevant and irrelevant articles, of the test set is given. The best values for each score are highlighted in bold.

The complement NBC had a better performance (sensitivity and IBA) than the multinomial NBC, possibly because of the dataset imbalance [40]. Although the scores are relatively high in general, all classifiers overfit the positive class. Overfitting can usually be mitigated for some models: e.g., for the CNN, we can apply further dropout (random removal of nodes in the network during training, to avoid overly specialized nodes), regularization (e.g., L2, to penalize strong weighting of nodes), and early stopping (to minimize the difference between training and validation loss). Most models can incorporate a class-weight parameter that can be adjusted to control overfitting. Also, we did not optimize the decision boundary of the tested classifiers, which might improve the balance between both classes. All these points fall under hyperparameter optimization and can be tackled in a separate step.

It is nevertheless interesting to use the CNN as an example for explaining what contributed to the classification. A plot of a layer-wise relevance propagation shows one example where a relevant article was correctly classified (Fig 2). We see that words like 500 at the beginning of the text are highlighted as strongly important for the classification of the text as relevant. Also, the word schistosomiasis, an infectious disease caused by flatworms, is labeled as strongly relevant for the classification. Interestingly, it is also relevant for the classifier that this disease is treated with antiparasitic drugs (anthelmintic). Both make sense, since a very high number of cases of a dangerous infectious disease is of interest to public health agents. All other case numbers are labeled as slightly irrelevant, which does not necessarily make sense: an event might, for instance, be less relevant when, out of 500 confirmed cases of some infectious disease, half of the patients are already in treatment.
Fig 2

A layer-wise relevance propagation of the CNN for relevance classification.

This text was correctly classified as relevant. Words highlighted in red contributed to the classification of the article as relevant, and blue words contradicted this classification. The saturation of the color indicates the strength with which the single words contributed to the classification. <UNK> indicates a token for which no word embedding is available.
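
A minimal sketch of the layer-wise relevance propagation behind Fig 2, assuming iNNvestigate's Keras-based v1 API (which targets the TensorFlow 1.x era); the tiny CNN and random input merely stand in for the trained relevance classifier.

```python
import numpy as np
import innvestigate
import innvestigate.utils as iutils
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Dense

# Stand-in for the relevance CNN: input is a 400x300 matrix of word
# embeddings, output a two-class softmax (relevant/irrelevant).
model = Sequential([
    Conv1D(16, 5, activation="relu", input_shape=(400, 300)),
    GlobalMaxPooling1D(),
    Dense(2, activation="softmax"),
])

# LRP is computed on the pre-softmax scores.
analyzer = innvestigate.create_analyzer(
    "lrp.epsilon", iutils.model_wo_softmax(model))

x = np.random.rand(1, 400, 300).astype("float32")  # one embedded document
relevance = analyzer.analyze(x)

# Summing relevance over the embedding dimension gives one score per
# word, which can be rendered as the red/blue highlighting of Fig 2.
word_relevance = relevance.sum(axis=-1)
```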

The focus of this work was to show a proof of concept that classification methods can serve to determine the relevance of an article. We did not try to fine-tune all of the compared classifiers. Since training of the algorithms was only a matter of minutes, it should be cheap to perform hyperparameter optimization. Computational time and resources to train all models are described in S1 Text. For now, logistic regression (LR), although having a low precision, is preferred due to its good sensitivity and IBA. Although the relevance classification does not have a very strong performance, it could already aid public health agents. The algorithms could be retrained every time articles are entered into the IDB to increase performance continuously. Indeed, testing the classifiers on fractions of the data shows a positive trend of performance (IBA) with increasing data size (see S4 Fig). Until very high performance can be achieved, relevance scores could be displayed and used to sort articles, but not to filter content.

Web service

To showcase the analyses presented above and show how key information and relevance scoring can be used simultaneously to aid EBS, we developed the web application EventEpi. Fig 3 shows a screenshot of its user interface. EventEpi is a Flask [41] app that uses DataTables [42] as an interface to its database. EventEpi lets users paste URLs and automatically analyze texts from sources they trust or are interested in. The last step in Fig 1 shows how the EventEpi database is filled with the output of the key information extraction and relevance scoring algorithms. With our colleagues at INIG in mind, we integrated a mechanism that would automatically download and analyze the newest unseen articles from WHO DONs and ProMED. Currently, this process is slow and depends on pre-analyses for a good user experience. To allow for the integration of the functionality into another application, we also wrote an ASP.NET Core application to analyze texts via API calls.
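
As an illustration of how such a service can be wired up, here is a hypothetical minimal Flask endpoint in the spirit of EventEpi; the route name and the stub analysis function are invented and do not reflect EventEpi's actual API.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def analyze_text(text):
    # Stand-in for the pipeline above: key information extraction
    # plus relevance scoring, returning one row for the database.
    return {"disease": None, "country": None, "date": None,
            "confirmed_cases": None, "relevance_score": 0.0}

@app.route("/analyze", methods=["POST"])
def analyze():
    # Hypothetical endpoint: receive raw article text and return
    # the extracted summary as JSON.
    text = request.get_json()["text"]
    return jsonify(analyze_text(text))

if __name__ == "__main__":
    app.run()
```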
Fig 3

A screenshot of the EventEpi web application.

The top input text field receives a URL. This URL is summarized when the SUMMARIZE button is pushed, and the result of the summary is entered into the datatable, which is displayed as a table. The buttons Get WHO DONs and Get ProMED Articles automatically scrape the latest articles from both platforms that are not yet in the datatable. Furthermore, the user can search for words in the search text field and download the datatable as CSV, Excel, or PDF.


Conclusion

We have shown that novel natural language processing methodology can be applied in combination with available resources, in this case the IDB of the RKI, to improve public health surveillance. Even with limited datasets, EBS can be supported by automatic processes, such as pre-screening large amounts of news articles to forward a condensed batch of articles for manual review. More work is needed to bring EventEpi into production. While the key disease and country can be extracted satisfactorily, the performance of the key date and confirmed-case count extractions needs to be improved. Relevance scoring shows promising results. We believe it could already be helpful to public health agents, and it could be improved greatly with fine-tuning and larger datasets.

The web application EventEpi is a scalable tool. Thus, the scope of EBS might be increased without a comparable increase in effort. This is particularly relevant with the availability of automatic translation (for example DeepL [43]). It could allow an EBS team to access many more sources than those in the few languages its members typically speak, without being overwhelmed. It is possible to provide better classifications that work for different languages using multilingual word embeddings [44], or better key information extraction using contextual embeddings [45, 46], which adjust the embedding based on the textual context. Contrary to the relevance of a document, key information is mostly defined by its nearby words.

The same fundamental issues encountered in using machine learning in general apply here as well, in particular bias and explainability. Tackling individual biases and personal preferences during labeling by experts is essential to continue this project and make it safe to use. It will also be important to show why EventEpi extracted certain information or computed a certain relevance, for it to be adopted but also critically assessed by public health agents. For artificial neural networks, we showed that layer-wise relevance propagation can be used in the domain of epidemiological texts to make a classifier explainable. For other models, model-agnostic methods [47, 48] could be applied analogously.

At the moment, EventEpi only presents results to the user. However, it could be expanded to be a general interface to an event database and allow public health agents to note which articles were indeed relevant, as well as to correct key information. This process would allow more people to label articles and thus expand the datasets, as well as help better train the relevance-scoring algorithms, an approach called active learning [49]. With a large labeled dataset, a neural network could be (re)trained for the relevance classification. Later, transfer learning (tuning of the last layer of the network) could be used to adapt the relevance classification to single-user preferences.

This work demonstrates how machine learning methods can be applied meaningfully in public health using data readily available: as experts evaluate and document events as part of their daily work, valuable labeled datasets are routinely produced. If systematically gathered and cataloged, these offer immense potential for the development of artificial intelligence in public health.

Supporting information

S1 Table. Hyperparameter settings of the classification algorithms.

This table lists the parameters and the vectorization methods, stratified by task and model (naive Bayes classifier (NBC), support vector machine (SVM), k-nearest neighbors (kNN), logistic regression (LR), multilayer perceptron (MLP), and convolutional neural network (CNN)). More information on the parameters used can be found at https://scikit-learn.org.

S2 Table. Performance evaluation of the relevance classification without up-sampling using ADASYN.

For each classifier and label, the precision (Pre.), sensitivity (Sen.), specificity (Spec.), F1, index balanced accuracy (IBA) with α = 0.1, and sample size for both classes, relevant and irrelevant articles, of the test set are given. The best values for each score are highlighted in bold.

S1 Fig. Sample distribution for key information extraction.

The number of articles used for each class (positive/negative, i.e., key/not key) in the partitions of the dataset (train, test) is shown for each task.

S2 Fig. Sample distribution for relevance scoring.

The number of articles used for each class (positive/negative, i.e., relevant/irrelevant) in the partitions of the dataset (train, test, and, for the CNN, validation) is shown for each task.

S3 Fig. Learning curves for key count and date entity extraction.

Dependency of key (date and count) classification performance on training data size, as measured using 5-fold cross-validation for the multinomial and Bernoulli naive Bayes classifiers. The performance is measured by the IBA score. The points show mean scores; the shaded regions show the mean plus and minus one standard deviation over the cross-validation folds.

S4 Fig. Learning curves for relevance scoring.

Dependency of relevance classification performance on training data size, as measured using 5-fold cross-validation for different classifiers. The performance is measured by the IBA score. The points show mean scores; the shaded regions show the mean plus and minus one standard deviation over the cross-validation folds.

S5 Fig. Confusion matrices of the key count and date entity extraction.

The plot shows the true and predicted labels of the test set in the key entity extraction task. The plots are stratified by algorithm (multinomial and Bernoulli naive Bayes classifier (NBC)) and task (key count and date extraction). Furthermore, the proportion of misclassified samples is shown below.

S6 Fig. Confusion matrices of the relevance scoring.

The plot shows the true and predicted labels of the relevance scoring task. The plots are stratified by algorithm. Furthermore, the proportion of misclassified samples is shown below.

S7 Fig. Receiver operating characteristics of the key count and date entity extraction.

The plot shows the true positive rate against the false positive rate, stratified by algorithm (multinomial and Bernoulli naive Bayes classifier (NBC)) and task (key count and date extraction), together with the area under the curve (AUC). The black dotted line shows the expected curve for random classification.

S8 Fig. Receiver operating characteristics of the relevance scoring.

The plot shows the true positive rate against the false positive rate, stratified by algorithm (complement naive Bayes classifier (compl. NBC), k-nearest neighbors (kNN), logistic regression (LR), multilayer perceptron (MLP), multinomial naive Bayes classifier (multi. NBC), support vector machine (SVM), and convolutional neural network (CNN)), together with the area under the curve (AUC). The black dotted line shows the expected curve for random classification.
References (6 in total)

1.  What is epidemic intelligence, and how is it being improved in Europe?

Authors:  R Kaiser; D Coulombier; M Baldari; D Morgan; C Paquet
Journal:  Euro Surveill       Date:  2006-02-02

2.  Internet surveillance systems for early alerting of health threats.

Authors:  J P Linge; R Steinberger; T P Weber; R Yangarber; E van der Goot; D H Al Khudhairy; N I Stilianakis
Journal:  Euro Surveill       Date:  2009-04-02

3.  ProMED-mail: 22 years of digital surveillance of emerging infectious diseases.

Authors:  Malwina Carrion; Lawrence C Madoff
Journal:  Int Health       Date:  2017-05-01       Impact factor: 2.473

4.  Effect of temperature and precipitation on salmonellosis cases in South-East Queensland, Australia: an observational study.

Authors:  Dimity Maree Stephen; Adrian Gerard Barnett
Journal:  BMJ Open       Date:  2016-02-25       Impact factor: 2.692

5.  "What is relevant in a text document?": An interpretable machine learning approach.

Authors:  Leila Arras; Franziska Horn; Grégoire Montavon; Klaus-Robert Müller; Wojciech Samek
Journal:  PLoS One       Date:  2017-08-11       Impact factor: 3.240

6.  The Impact of Water, Sanitation and Hygiene Interventions to Control Cholera: A Systematic Review.

Authors:  Dawn L Taylor; Tanya M Kahawita; Sandy Cairncross; Jeroen H J Ensink
Journal:  PLoS One       Date:  2015-08-18       Impact factor: 3.240

Cited by (3 in total)

1.  Using digital surveillance tools for near real-time mapping of the risk of infectious disease spread.

Authors:  Anne Cori; Pierre Nouvellet; Sangeeta Bhatia; Britta Lassmann; Emily Cohn; Angel N Desai; Malwina Carrion; Moritz U G Kraemer; Mark Herringer; John Brownstein; Larry Madoff
Journal:  NPJ Digit Med       Date:  2021-04-16

2.  Machine and cognitive intelligence for human health: systematic review.

Authors:  Xieling Chen; Gary Cheng; Fu Lee Wang; Xiaohui Tao; Haoran Xie; Lingling Xu
Journal:  Brain Inform       Date:  2022-02-12

3.  Usage of social media in epidemic intelligence activities in the WHO, Regional Office for the Eastern Mediterranean.

Authors:  Heidi Abbas; Mohamed Mostafa Tahoun; Ahmed Taha Aboushady; Abdelrahman Khalifa; Aura Corpuz; Pierre Nabeth
Journal:  BMJ Glob Health       Date:  2022-06
