Literature DB >> 34423261

A fast, resource efficient, and reliable rule-based system for COVID-19 symptom identification.

Himanshu S Sahoo1, Greg M Silverman2, Nicholas E Ingraham3, Monica I Lupei4, Michael A Puskarich5, Raymond L Finzel6, John Sartori1, Rui Zhang6, Benjamin C Knoll7, Sijia Liu8, Hongfang Liu8, Genevieve B Melton2, Christopher J Tignanelli2, Serguei V S Pakhomov6.   

Abstract

OBJECTIVE: With COVID-19, there was a need for a rapidly scalable annotation system that facilitated real-time integration with clinical decision support systems (CDS). Current annotation systems suffer from high resource utilization and poor scalability, limiting real-world integration with CDS. A potential solution to mitigate these issues is to use the rule-based gazetteer developed at our institution.
MATERIALS AND METHODS: Performance, resource utilization, and runtime of the rule-based gazetteer were compared with five annotation systems: BioMedICUS, cTAKES, MetaMap, CLAMP, and MedTagger.
RESULTS: This rule-based gazetteer was the fastest, had a low resource footprint, and showed similar performance for weighted microaverage and macroaverage measures of precision, recall, and f1-score compared to the other annotation systems.
DISCUSSION: Opportunities to increase its performance include fine-tuning lexical rules for symptom identification. Additionally, it could run on multiple compute nodes for faster runtime.
CONCLUSION: This rule-based gazetteer overcame key technical limitations facilitating real-time symptomatology identification for COVID-19 and integration of unstructured data elements into our CDS. It is ideal for large-scale deployment across a wide variety of healthcare settings for surveillance of acute COVID-19 symptoms for integration into prognostic modeling. Such a system is currently being leveraged for monitoring of postacute sequelae of COVID-19 (PASC) progression in COVID-19 survivors. This study conducted the first in-depth analysis and developed a rule-based gazetteer for COVID-19 symptom extraction with the following key features: low processor and memory utilization, faster runtime, and similar weighted microaverage and macroaverage measures for precision, recall, and f1-score compared to industry-standard annotation systems.
© The Author(s) 2021. Published by Oxford University Press on behalf of the American Medical Informatics Association.

Keywords:  artificial intelligence; clinical decision support systems; follow-up studies; information extraction; natural language processing; signs and symptoms

Year:  2021        PMID: 34423261      PMCID: PMC8374371          DOI: 10.1093/jamiaopen/ooab070

Source DB:  PubMed          Journal:  JAMIA Open        ISSN: 2574-2531


BACKGROUND AND SIGNIFICANCE

With COVID-19 came an unprecedented need to identify symptoms of COVID-19 persons under investigation (PUIs) in a time-sensitive, resource-efficient, and accurate manner. When attempting to identify COVID-19 symptoms from clinical notes in near-real time, we identified significant limitations of industry-standard annotation systems (hereby referred to as "annotation systems"), including (1) poor scalability with an increasing number of notes and (2) high resource needs. While available annotation systems perform well in smaller healthcare settings, they fail to scale in larger healthcare systems (like ours), where more than 10 000 clinical notes are generated each day. For example, to process 12 000 notes, one instance of MetaMap takes approximately 105 h, CLAMP 28 h, and cTAKES 9 h, limiting scalability, especially for time-sensitive prognosis such as for COVID-19 PUIs. Similar issues were also found by other researchers. Proposed solutions to mitigate scalability issues included increasing the number of servers, NLP engines, and databases. Although these solutions led to improved runtime, they did not address the key issue of high resource utilization, which is problematic for healthcare sites lacking robust infrastructure. After evaluating several annotation systems against the above-mentioned limitations, we developed a solution using a dictionary of terms (called a gazetteer) with significantly lower resource utilization, faster runtime, and similar weighted microaverage and macroaverage measures compared to annotation systems. This is extremely important when time-sensitive decisions with minimal patient contact are crucial, as during the COVID-19 pandemic. This study presents our findings.

Multiple studies have demonstrated the success of rule-based gazetteers consisting of domain-specific lexica as an alternative to annotation systems. In one study, Liu et al. 
successfully used a gazetteer to select cohorts of heart failure and peripheral arterial disease patients from unstructured text, while Wagholikar et al. used a gazetteer based on radiological findings to automate limb fracture classification. Gazetteer lexicons are highly targeted within clinical domains through construction by subject matter experts, especially when combined with appropriate lexical rules, and work very well with continuous maintenance. Gazetteers can easily be deployed as a standalone tool using containerization technologies, and their rule base alone can be deployed as part of an existing infrastructure, such as the one developed by the Open Health NLP (OHNLP) consortium for the National COVID Cohort Collaborative (N3C). This study developed a rule-based gazetteer based on a lexicon of COVID-19 symptoms (hereby referred to as the "COVID-19 gazetteer") and compared it to five annotation systems in terms of (1) document processing times; (2) resource needs; and (3) performance in terms of weighted microaverage and macroaverage measures for precision, recall, and f1-score.

MATERIALS AND METHODS

Metrics used for comparing annotation systems

Runtime

Amount of time taken by an annotation system to process a given set of documents.

Resource utilization

Central processing units (CPUs) and random access memory (RAM) utilized by an annotation system. Henceforth, CPUs are referred to as “processor” and RAM is referred to as “memory.”

Weighted microaverage and macroaverage measures

Weighted microaverage (henceforth referred to as "microaverage performance measures") and macroaverage measures for positive predictive value (precision), sensitivity (recall), and their harmonic mean (f1-score) for the task of symptom identification.
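As a concrete illustration of these measures, the dependency-free sketch below computes micro- and macro-averaged precision, recall, and f1-score over document-level labels. The labels and counts are invented for illustration; micro-averaging pools true/false positives across all classes, while macro-averaging weights every class equally.

```python
from collections import Counter

def micro_macro(gold, pred):
    """Compute micro- and macro-averaged precision/recall/f1-score
    for document-level labels (simplified illustration)."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1

    def prf(t, f_p, f_n):
        prec = t / (t + f_p) if t + f_p else 0.0
        rec = t / (t + f_n) if t + f_n else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1

    # Micro: pool counts over all classes before computing P/R/F1.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute P/R/F1 per class, then average with equal weight.
    per_class = [prf(tp[l], fp[l], fn[l]) for l in labels]
    macro = tuple(sum(m[i] for m in per_class) / len(labels) for i in range(3))
    return micro, macro

# Hypothetical gold vs. system labels using the paper's feature naming:
gold = ["cdc_cough_p", "cdc_fever_n", "cdc_cough_p", "cdc_dyspnea_p"]
pred = ["cdc_cough_p", "cdc_fever_p", "cdc_cough_p", "cdc_dyspnea_p"]
micro, macro = micro_macro(gold, pred)
print("micro P/R/F1:", micro)
print("macro P/R/F1:", macro)
```

The weighted variant reported in the paper additionally weights per-class scores by class support rather than equally.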

System overview

Runtime evaluations were performed on a computing system with the configuration listed in Supplementary Appendix A. All annotation systems were containerized using Docker. To ensure equal access to system resources, all tests were executed serially in a Kubernetes/Argo workflow, where each annotation system ran as a single Kubernetes pod.

Data

Notes were collected from M Health Fairview, affiliated with the University of Minnesota (UMN), comprising 12 hospitals and services in the ambulatory and postacute settings. There are over 368 000 ED visits per year, with 9% to 30% of cases admitted as inpatients. Between March 2020 and December 2020, there were 19 924 total ED visits for 10 110 unique COVID-19 positive patients. A total of 12 000 notes were randomly selected from the pool of ED notes for comparing runtime and resource utilization of the annotation systems.

Expert-curated manually annotated reference corpora

At the time of this study, there were no existing corpora annotated for COVID-19 symptoms. Small corpora of notes were quickly developed by UMN and Mayo Clinic to assess the COVID-19 symptom identification performance of the annotation systems. Due to the small corpora size, the results obtained would not be sufficient to establish which annotation system is better than the others for symptom identification. However, in their study of UMLS-extracted symptoms for predictive modeling of influenza, Stephens et al. used only 20 randomly selected notes (with only 200 labeled symptoms) to assess their extraction process, suggesting the corpora used in this study are adequate for testing symptom identification performance and finding potential gaps between annotation systems.

UMN reference corpus

Forty-six notes from M Health Fairview (hereby referred to as "UMNCor") were randomly selected and manually annotated by a board-certified critical care physician with 12 years of clinical experience who is also a board-certified clinical informaticist (CT). The annotator had experience treating over 250 COVID-19 positive patients and was blinded to the results of the annotation systems. Notes in UMNCor were manually reviewed for positive and negative document-level mentions of 11 acute COVID-19 symptoms as documented by the Centers for Disease Control and Prevention (CDC) (hereby referred to as "acute CDC symptoms"). The phrase "positive document-level mention" means at least one positive mention of the acute CDC symptom in the entire note. Similarly, "negative document-level mention" means at least one negative mention. UMNCor contained a total of 259 document-level mentions (shown in Table 1).
Table 1.

Count of document-level mentions for acute CDC symptoms for the corpora

Feature                       No. of mentions
                              UMNCor    MayoCor
cdc_aches_n                        3          3
cdc_aches_p                       18          3
cdc_cough_n                       11          6
cdc_cough_p                       28         22
cdc_diarrhea_n                    12         14
cdc_diarrhea_p                    11         18
cdc_dyspnea_n                     10         15
cdc_dyspnea_p                     28         34
cdc_fatigue_n                      2          1
cdc_fatigue_p                     15         14
cdc_fever_n                        9         24
cdc_fever_p                       30         18
cdc_headaches_n                    5          8
cdc_headaches_p                    8         15
cdc_nausea_vomiting_n             19         20
cdc_nausea_vomiting_p             13         27
cdc_rhinitis_congestion_n          7          1
cdc_rhinitis_congestion_p          8          2
cdc_sore_throat_n                  6          4
cdc_sore_throat_p                  9          3
cdc_taste_smell_loss_n             2          3
cdc_taste_smell_loss_p             5          5
Sum                              259        260

Note: the suffix "_p" following an acute CDC symptom represents a positive document-level mention for that symptom, while the suffix "_n" represents a negative document-level mention.


Mayo reference corpus

This corpus, developed by Mayo Clinic, consists of 148 fully deidentified notes for COVID-19 positive patients (hereby referred to as "MayoCor"). Each note was labeled for symptoms based on the CDC and Mayo lexicons. The annotation guidelines were developed in collaboration with the Coronavirus Infectious Disease Ontology (CIDO) team. MayoCor contained a total of 260 document-level mentions (shown in Table 1).

Symptom selection criteria

Only acute CDC symptoms were included in the study. Any document-level mention with a negligible number of instances relative to the mention with the highest number of instances contributes little to microaverage performance measures. Hence, document-level mentions with fewer than five instances were excluded from the calculation of microaverage performance measures. Using the above-mentioned criteria and Table 1, the document-level mentions included in the calculation of microaverage performance measures for both corpora are listed in Supplementary Appendix C. The mentions selected for UMNCor and MayoCor are hereby referred to as "UMNCor features" and "MayoCor features," respectively. For macroaverage measures of precision, recall, and f1-score, positive and negative document-level mentions of all acute CDC symptoms were used. Since macroaverage measures assign equal weight to every class, it is worthwhile to examine how annotation systems compare when treating every acute CDC symptom equally, irrespective of sample size.

Lexicon of COVID-19 symptoms

A lexicon of 171 terms based on the CDC's guidelines was iteratively created by three board-certified clinicians (NI, ML, and MP) using equivalent medical terminology, abbreviations, synonyms, allied symptoms, alternate spellings, misspellings, etc. Terms in this lexicon (see Supplementary Appendix B.1), hereafter referred to as "derived COVID-19 symptoms," were used by the COVID-19 gazetteer and to derive the Unified Medical Language System (UMLS) lexicon used by the other annotation systems.

Query expansion of derived COVID-19 symptoms

We utilized a word2vec model trained on clinical text by Pakhomov et al. to expand the derived COVID-19 symptoms list (see Supplementary Appendix B.2). The model was trained on a corpus of notes (4 169 696 714 tokens) from M Health Fairview from 2010 to 2014, inclusive. The model created embeddings with up to four-word sequences by using the word2phrase tool. The 2018 version of MetaMap was used to map lexicon terms to the UMLS. The final set of terms mapped to UMLS concepts was further reviewed by three board-certified clinicians (NI, ML, and MP) to ensure semantic expansions were clustered appropriately on the acute CDC symptoms. This final set of terms and concepts (see Supplementary Appendix B.4) was made available as a UMLS lexicon for use by the annotation systems (refer to subsection "UIMA-based annotation pipeline").
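Query expansion with word2vec amounts to a nearest-neighbour search in the embedding space. The sketch below is a dependency-free stand-in: the toy vectors and vocabulary are invented for illustration, and a real pipeline would query the trained word2vec model (eg, via its most-similar lookup) instead.

```python
import math

def expand_terms(seed_terms, embeddings, k=3):
    """For each seed term, return the k vocabulary terms with highest
    cosine similarity (a simplified stand-in for word2vec expansion)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    expanded = {}
    for term in seed_terms:
        v = embeddings[term]
        scores = [(cos(v, u), w) for w, u in embeddings.items() if w != term]
        scores.sort(reverse=True)  # highest similarity first
        expanded[term] = [w for _, w in scores[:k]]
    return expanded

# Toy embedding table (hypothetical 3-d vectors; real vectors would come
# from the clinical-notes word2vec model):
emb = {
    "fever":         [0.90, 0.10, 0.00],
    "pyrexia":       [0.85, 0.15, 0.05],
    "cough":         [0.10, 0.90, 0.00],
    "hacking_cough": [0.12, 0.88, 0.02],
}
print(expand_terms(["fever"], emb, k=1))
```

Candidate expansions produced this way would still be reviewed by clinicians, as described above, before entering the lexicon.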

UIMA-based annotation pipeline

Notes were annotated for Concept Unique Identifiers (CUIs) from the Disorders semantic group using the NLP Artifact Discovery and Preparation Toolkit for Kubernetes (NLP-ADAPT-kube). NLP-ADAPT-kube contains the following Unstructured Information Management Architecture (UIMA) compatible annotation systems as Docker images: (1) BioMedICUS v2.2.0; (2) cTAKES v4.0.1; (3) MetaMap 2018 Linux version; (4) CLAMP v1.6.4. Features relevant to the acute CDC symptoms were constructed using extracted UMLS concepts present in the derived UMLS lexicon described in subsection “Query expansion of derived COVID-19 symptoms.”

MedTagger

MedTagger v1.0.9 is a rule-based gazetteer developed by the Mayo Clinic. We used two versions of MedTagger: (1) the COVID-19 gazetteer lexicon adapted to MedTagger's ruleset format (hereby referred to as "MedTagger Custom") and (2) Mayo Clinic's COVID-19 lexicon adapted to MedTagger's ruleset format.

COVID-19 gazetteer

The COVID-19 gazetteer used the lexicon described in subsection "Lexicon of COVID-19 symptoms" to narrow searches for concepts belonging to sentence-level mentions of acute CDC symptoms. The COVID-19 gazetteer uses spaCy's Matcher and EntityRuler classes to add lexicon terms to the spaCy en_core_web_sm model. The Matcher instance reads in notes and returns the span of text containing symptom mentions. Returned spans are further processed by the spaCy pipeline to search for custom entities added using EntityRuler. This extra step is necessary because we observed that spaCy missed certain phrases in the lexicon; thus, the Matcher instance detected terms the EntityRuler instance had missed. Span length was predetermined through initial tuning on a held-out set of 1700 randomly selected notes. Output was then lemmatized to convert text to its base form (eg, the base form of "wheezing" is "wheeze"). The NegEx component of spaCy (negspaCy) was added at the end of the spaCy pipeline for negation detection. More details about the COVID-19 gazetteer are available in the GitHub repository. The COVID-19 gazetteer used multiple server cores by distributing nearly equal numbers of notes to each core.
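A minimal, dependency-free sketch of the gazetteer's core idea (lexicon matching plus negation-aware document-level features) is shown below. The actual implementation uses spaCy's Matcher, EntityRuler, and negspaCy; the three-term lexicon and negation cues here are hypothetical stand-ins.

```python
import re

# Tiny stand-in lexicon mapping surface terms to acute CDC symptom
# features (hypothetical subset; the real lexicon has 171 terms).
LEXICON = {
    "sore throat": "cdc_sore_throat",
    "fever": "cdc_fever",
    "shortness of breath": "cdc_dyspnea",
}
# Illustrative NegEx-style negation cues (the real pipeline uses negspaCy).
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b", re.I)

def annotate(note):
    """Return document-level features: each matched symptom gets suffix
    '_n' if a negation cue precedes it in the same sentence, else '_p'."""
    features = set()
    for sentence in re.split(r"[.!?]\s*", note):
        low = sentence.lower()
        for term, feature in LEXICON.items():
            idx = low.find(term)
            if idx == -1:
                continue
            negated = bool(NEGATION_CUES.search(low[:idx]))
            features.add(feature + ("_n" if negated else "_p"))
    return features

print(annotate("Patient denies fever. Reports sore throat for 3 days."))
```

The real pipeline additionally lemmatizes matched spans and tunes the span length, as described above; this sketch only shows the match-then-negate flow.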

RESULTS

Overall microaverage performance measures of annotation systems are shown for both corpora in Table 2. As mentioned in subsection “Symptom selection criteria,” UMNCor uses only UMNCor features and MayoCor uses only MayoCor features for calculating microaverage performance measures.
Table 2.

Overall microaverage performance measures of the annotation systems for both corpora (confidence intervals are present in Supplementary Appendix D.1-2)

                     UMNCor                          MayoCor
System               Precision  Recall  f1-score    Precision  Recall  f1-score
BioMedICUS           0.78       0.75    0.75        0.89       0.89    0.89
CLAMP                0.84       0.85    0.85        0.91       0.92    0.91
cTAKES               0.83       0.80    0.81        0.91       0.90    0.91
MetaMap              0.85       0.84    0.85        0.90       0.91    0.90
COVID-19 Gazetteer   0.89       0.86    0.87        0.91       0.91    0.91
MedTagger Custom     0.82       0.82    0.82        0.92       0.92    0.92
MedTagger            0.88       0.85    0.85        0.91       0.91    0.91
Table 3 shows the macroaverage measures for precision, recall, and f1-score for positive and negative document-level mentions of all the acute CDC symptoms (as mentioned in subsection "Symptom selection criteria").
Table 3.

Macroaverage performance measures of the annotation systems for both corpora calculated using positive and negative document-level mentions for all the acute CDC symptoms (confidence intervals are present in Supplementary Appendix E.1-2)

                     UMNCor                          MayoCor
System               Precision  Recall  f1-score    Precision  Recall  f1-score
BioMedICUS           0.71       0.75    0.72        0.73       0.74    0.73
CLAMP                0.81       0.81    0.81        0.79       0.71    0.74
cTAKES               0.77       0.82    0.79        0.75       0.78    0.76
MetaMap              0.80       0.82    0.81        0.75       0.71    0.73
COVID-19 Gazetteer   0.82       0.88    0.84        0.77       0.79    0.78
MedTagger Custom     0.77       0.78    0.77        0.79       0.80    0.80
MedTagger            0.80       0.87    0.82        0.80       0.75    0.77
Figure 1 shows total CPU and RAM utilization for the annotation systems over their runtime on 9000 clinical notes. Total utilization values for CPU and RAM (referred to as cores*sec and RAM*sec in Figure 1, respectively) were calculated as a running summation of the CPU (in cores) and RAM (in gigabytes (GB)) utilized by an annotation system over its runtime. The ideal system would minimize resources while executing in the least amount of time. MedTagger Custom was omitted from runtime analysis because it uses the same underlying implementation as MedTagger.
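The running summation described above amounts to approximating total usage from periodic samples: each sampled value (cores or GB) is multiplied by the sampling interval and accumulated. A minimal sketch, with invented sample values:

```python
SAMPLE_INTERVAL_S = 30  # statistics were collected every 30 seconds

def total_utilization(samples, interval=SAMPLE_INTERVAL_S):
    """Approximate total resource usage (eg, cores*sec or GB*sec) as a
    running summation of sampled values times the sampling interval."""
    return sum(v * interval for v in samples)

# Hypothetical CPU samples (in cores) over five 30 s windows:
cpu_cores = [4.0, 7.5, 8.0, 8.0, 2.5]
print(total_utilization(cpu_cores))  # total cores*sec over the run
```

The same function applies unchanged to RAM samples in GB, yielding GB*sec.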
Figure 1.

Total CPU and RAM utilization over the period of execution of the annotation systems on 9000 notes. A, CPU utilization (in number of cores); B, Zoomed in view of (A); C, RAM utilization; D, Zoomed in view of (C); E, Total utilization of CPU (represented as cores*s) and RAM (represented as RAM*s). Statistics for CPU and RAM utilization were collected every 30 s and appended to a file using a bash script that queried the Kubernetes cluster.

Figure 2 shows the runtimes of the annotation systems when run on 9000 clinical notes.
Figure 2.

Runtime of annotators for 9000 notes. The COVID-19 gazetteer had the least processing time.

To demonstrate the efficiency of the COVID-19 gazetteer, we analyzed its runtime by keeping the number of notes constant while increasing the number of cores, for sets of 3000, 6000, 9000, and 12 000 clinical notes (see Figure 3).
Figure 3.

Runtime of the COVID-19 gazetteer with increasing number of CPU cores on a given set of notes. Runtime decreased as the number of cores increased for a constant set of notes.


DISCUSSION

The purpose of this study was to develop a rule-based gazetteer for COVID-19 and compare it to five annotation systems. This study makes the following contributions: (1) first in-depth analysis involving a rule-based gazetteer for COVID-19 symptom identification; (2) compares performance (weighted microaverage and macroaverage measures for precision, recall, and f1-score) of the COVID-19 gazetteer to other annotation systems; (3) highlights the potential of the COVID-19 gazetteer as a low-resource solution by comparing its processor and memory utilization to other annotation systems; (4) compares runtime of the COVID-19 gazetteer to other annotation systems, demonstrating its efficacy for high-throughput real-time annotation of notes for identifying a patient's presenting symptoms (eg, identifying symptoms of COVID-19 PUIs in a time-sensitive manner).

Performance of systems

Results in Tables 2 and 3 and Supplementary Appendices D and E demonstrate the COVID-19 gazetteer has similar weighted microaverage and macroaverage performance measures compared to other annotation systems. Based on these results, we emphasize the importance of a carefully designed gazetteer for diseases with manageable sets of defined symptoms translatable to lexical rules to aid CDS, including surveillance for long-term care (eg, PASC progression in COVID-19 survivors).

Resource utilization of annotation systems

Figure 1 demonstrates that BioMedICUS, the COVID-19 gazetteer, and MedTagger had the lowest CPU and RAM utilization, making them good candidates for compute devices with minimal processor and memory resources, compared to MetaMap and CLAMP (which had the highest resource requirements). BioMedICUS utilizes a fast algorithm along with in-memory maps for concept detection but comes with the tradeoff of increased memory utilization. The COVID-19 gazetteer uses the en_core_sci_sm and en_core_web_sm spaCy models (about 13-15 MB) for detection of mentions. This is one possible reason why the COVID-19 gazetteer had the lowest memory requirement. The COVID-19 gazetteer used all available cores to minimize runtime and was among the lowest in overall CPU utilization, although the average CPU utilized at any given instant was high. MedTagger had low CPU utilization because it processes documents through data streams, and it loads the compiled ruleset into memory for lower memory utilization. It should be noted that annotation systems with minimal resource requirements (eg, BioMedICUS, the COVID-19 gazetteer, and MedTagger) have the potential to incur significantly lower monetary costs when run on cloud-based platforms. In addition, annotation systems with minimal resource requirements are ideal for deployment at healthcare sites lacking robust infrastructure.

Scaling of annotation systems for real-time processing of notes

Results in Figure 2 show the COVID-19 gazetteer consistently outperformed the other annotation systems in runtime. The COVID-19 gazetteer took 34 min to process 9000 notes, about 3× faster than MedTagger (the second-fastest annotation system) and 123× faster than MetaMap (the slowest annotation system). Hence, the COVID-19 gazetteer is the best candidate for high-throughput real-time processing of notes for clinical surveillance (eg, identifying symptoms of COVID-19 PUIs). Figure 3 shows the effect of scaling the COVID-19 gazetteer by increasing CPU cores on a given set of notes, where runtime decreases linearly with increasing cores. The COVID-19 gazetteer operating on multiple compute nodes has far greater potential to significantly decrease the runtime to process notes compared to standard annotation systems. It is possible to scale "off-the-shelf" annotation systems for real-time processing through both pipeline customization and distribution across multiple compute nodes. Demner-Fushman et al. introduced MetaMap Lite and found it to be at least 5× faster than MetaMap and cTAKES on various corpora, with higher precision, recall, and f1-score. Stephens et al. used MetaMap Lite for its processing speed and ease of use and compared it to MetaMap and cTAKES on a corpus containing 7278 EHR notes. In the workshop on "Large Scale Ensembled NLP Systems" with Docker and Kubernetes, Finzel et al. scaled MetaMap by running 80 Kubernetes pods on 8 compute nodes to achieve a processing speed of about 15 documents per second. The study conducted by Miller et al. to extract patients' phenotypes from 10 000 EHR notes achieved a processing speed of about 2.45 notes per second when run on an Amazon Web Services (AWS) instance with 2 CPUs and 8 GB of RAM. This was equivalent to processing 1 million notes per day when run on 10 AWS Elastic Compute Cloud (EC2) nodes. 
The presentation on "Fault-Tolerant, Distributed, and Scalable Natural Language Processing with cTAKES" discusses scaling cTAKES using distributed Apache Spark on 245 worker machines, each with 64 CPUs and 240 GB of RAM. The developed pipeline was able to process 84 387 661 notes in 89 min, compared to about 396 days for standalone cTAKES. Despite such high processing capacity, these systems incur incredibly high resource utilization.

Lexicon creation and maintenance for annotation systems

The COVID-19 gazetteer lexicon creation process (described in subsection "Lexicon of COVID-19 symptoms") required clinical expertise. This process could be automated using transformer models like Bidirectional Encoder Representations from Transformers (BERT). Preliminary experiments conducted in our lab indicate that using only 40 terms representing acute CDC symptoms to fine-tune a BERT model for Named Entity Recognition (NER) yielded 360 terms belonging to the acute CDC symptoms from 10 000 ED notes for COVID-19 positive patients (refer to Supplementary Appendix F for details on the BERT setup for NER). The BERT symptom extraction process took about 6 h, compared to the several weeks required by subject matter experts to create the COVID-19 gazetteer lexicon. In addition, the lexicon of 360 terms extracted using BERT had similar symptom identification performance on UMNCor, with respect to microaverage performance measures, compared to the 171 terms of the COVID-19 gazetteer lexicon created using clinical expertise. As there are variations in the lexical constructs used to document symptoms among medical scribes, as well as over time, it is necessary to maintain the COVID-19 gazetteer lexicon by periodically checking for new lexical constructs of acute CDC symptoms present in notes but absent from the existing lexicon. This could be done either by using the lexicon creation process outlined in subsection "Lexicon of COVID-19 symptoms" or by using transformer models like BERT. The COVID-19 gazetteer lexicon could also easily be extended to COVID-19 symptoms not present in the list of acute CDC symptoms. This process would also work for any disease with a well-defined symptomatology, including PASC. On the other hand, UMLS lexicon creation for the UIMA-based annotation systems required the steps mentioned in subsection "Query expansion of derived COVID-19 symptoms" in addition to the COVID-19 gazetteer lexicon creation process. 
Maintenance of the UMLS lexicon also requires periodically searching for new lexical constructs of acute CDC symptoms present in clinical notes and mapping them to UMLS concepts using the rules used to create the UMLS lexicon. Mapping of new lexical constructs to UMLS concepts cannot be automated; it requires costly subject matter intervention and is time-consuming. To summarize, the UMLS lexicon creation process took two steps compared to the single step required to create the COVID-19 gazetteer lexicon, and the second step of UMLS lexicon creation required extensive clinical expertise. Hence, the COVID-19 gazetteer lexicon was simpler to create and maintain than the UMLS lexicon.

Complementing UMLS with gazetteer lexicon

Results in Tables 2 and 3 and Supplementary Appendices D and E confirm the COVID-19 gazetteer performs similarly to annotation systems reliant on the UMLS Metathesaurus. Of the 171 terms in the COVID-19 gazetteer lexicon, 120 were UMLS terms. For these 120 UMLS terms, we observed a weighted microaverage f1-score of 0.85 across all the mentions present in UMNCor features, which is 2% less than the observed overall microaverage f1-score of 0.87 for the COVID-19 gazetteer (shown in Table 2). With the use of the remaining 51 non-UMLS terms, the COVID-19 gazetteer improved the matching of relevant terms not detected by the 120 UMLS terms. Thus, the COVID-19 gazetteer lexicon complements the UMLS lexicon, making it an ideal candidate for inclusion in an ensemble of UIMA-based annotation systems reliant on a UMLS lexicon. A non-UMLS rule-based gazetteer complemented by UMLS terms could be tailored to any disease with a clearly defined symptomatology.

Limitations and future work

The corpora used consisted of a limited number of document-level mentions: 259 for UMNCor and 260 for MayoCor. Due to the small corpora size, the annotation systems had mostly wide and overlapping confidence intervals for weighted microaverage and macroaverage performance measures of precision, recall, and f1-score. Thus, the small corpora size failed to highlight differences between the annotation systems. However, in their study of UMLS-extracted symptoms for predictive modeling of influenza, Stephens et al. used only 20 randomly selected notes (with only 200 labeled symptoms) to assess their extraction process, suggesting the corpora used in this study are adequate for testing. To address the issue of generalizability and assess significant differences between annotation systems for the task of symptom identification, we are in the process of creating a larger reference corpus of notes manually annotated by multiple raters. To understand some of the limitations of the COVID-19 gazetteer for future improvements, we manually audited the output of the gazetteer against a few notes from UMNCor. We observed that the span of text containing the mention of an acute CDC symptom analyzed by the COVID-19 gazetteer was sometimes too short to contain the negation for the mention. For example, the COVID-19 gazetteer detected a positive mention instead of a negative mention for "sore throat" in the following sentence: "The patient denies fever, myalgias, nausea, vomiting, abdominal pain, chest pain, dysuria, hematuria, numbness and tingling, leg pain, difficulty walking, headache, visual disturbance, sore throat, rhinorrhea, and any other symptoms at this time." This was because the span of text containing the mention of "sore throat" did not include the word "denies," which negates the mention. 
This issue led to mislabeling a negative document-level mention of "sore throat" (cdc_sore_throat_n) as a positive document-level mention (cdc_sore_throat_p). It is an implementation issue that could be avoided by using sentence boundary detection, which we are currently testing for the COVID-19 gazetteer. Future work on the COVID-19 gazetteer includes expanding the experiments on automating lexicon generation by increasing the pool of ED notes for COVID-19 patients. Lastly, the COVID-19 gazetteer is being ported across multiple compute nodes to improve runtime.
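The failure mode and the proposed sentence-boundary fix can be sketched as follows. This is a simplified, regex-based stand-in for the spaCy pipeline: the fixed 20-character window mimics a span too short to reach the negation cue, and the cue list and window size are illustrative only.

```python
import re

NEGATION_CUES = re.compile(r"\b(denies|no|without)\b", re.I)

def negated_fixed_window(text, term, window=20):
    """Original-style check: look for a negation cue only within a short
    window of characters before the matched term (can miss distant cues)."""
    idx = text.lower().find(term)
    return bool(NEGATION_CUES.search(text[max(0, idx - window):idx]))

def negated_sentence_scope(text, term):
    """Proposed fix: look for a cue anywhere earlier in the same sentence,
    found via simple sentence boundary detection."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        idx = sentence.lower().find(term)
        if idx != -1:
            return bool(NEGATION_CUES.search(sentence[:idx]))
    return False

note = ("The patient denies fever, myalgias, nausea, vomiting, abdominal pain, "
        "chest pain, dysuria, hematuria, numbness and tingling, leg pain, "
        "difficulty walking, headache, visual disturbance, sore throat, "
        "rhinorrhea, and any other symptoms at this time.")
print(negated_fixed_window(note, "sore throat"))    # short span misses "denies"
print(negated_sentence_scope(note, "sore throat"))  # sentence scope catches it
```

With sentence-level scoping, "sore throat" in the example above is correctly flagged as negated, yielding cdc_sore_throat_n instead of cdc_sore_throat_p.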

CONCLUSIONS

Compared to other annotation systems, the COVID-19 gazetteer demonstrates greater potential as a high-throughput annotation system for real-time processing of notes, thereby providing an opportunity for clinicians to make more accurate, time-sensitive decisions around patient care (eg, identifying symptoms of COVID-19 PUIs). With a continuously maintained and properly devised set of lexical rules, the COVID-19 gazetteer has the potential to perform similarly to standard annotation systems for the task of symptom identification. In contrast to standard annotation systems, the COVID-19 gazetteer has a considerably lower resource footprint and hence could be deployed at medical sites lacking robust healthcare infrastructure. Thus, the COVID-19 gazetteer could be used as a fast, resource-efficient, and reliable tool for high-throughput real-time clinical decision support for COVID-19 or any other disease with well-defined symptomatology. It can easily be deployed at large scale across a wide variety of healthcare settings for continuous surveillance of COVID-19 symptoms for prognostic purposes. In addition, it holds promise as a useful resource to study long-term sequelae of the disease in survivors (eg, PASC progression in COVID-19 survivors).

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.

CONTRIBUTORS

HSS, MS—Project lead, lead developer of COVID-19 gazetteer, developer for UMLS NLP feature extraction, data analysis, data interpretation, writing, and critical revision. GMS, BS—Lead developer on Fairview COVID-19 pipeline process automation and NLP feature extraction, NLP-ADAPT-kube, study design, data analysis, data interpretation, writing, and critical revision. NEI, MD—Study design, architect on NLP feature extraction, ETL of extracted NLP features, data analysis, data interpretation, writing, and critical revision. MIL, MD—Study design, architect on NLP feature extraction, data interpretation, writing, and critical revision. MP, MD, MS—Study design, architect on NLP feature extraction, data analysis, data interpretation, writing, and critical revision. RLF, BS—Lead developer on NLP-ADAPT-kube and developer, writing, and critical revision. JS, PhD—Project advisor, study design, writing, and critical revision. RZ, PhD—Project advisor, study design, writing, and critical revision. BKK, BS—Lead developer on BioMedICUS and developer, writing, and critical revision. SL, PhD—Lead developer on MedTagger and developer, writing, and critical revision. HL, PhD—Project lead on MedTagger, writing, and critical revision. GBM, MD, PhD—Project advisor, study design, data interpretation, writing, and critical revision. CJT, MD—Project advisor, lead architect Fairview COVID-19 pipeline, architect of COVID-19 Patient Registry and NLP feature extraction, lead on study design, data analysis, data interpretation, writing, and critical revision. SP, PhD—Project advisor, NLP feature extraction conception, study design, data analysis, data interpretation, writing, and critical revision.
REFERENCES

1. Stephane Meystre, Peter J Haug. Automation of a problem list using natural language processing. BMC Med Inform Decis Mak. 2005.

2. Amol Wagholikar, Guido Zuccon, Anthony Nguyen, Kevin Chu, Shane Martin, Kim Lai, Jaimi Greenslade. Automated classification of limb fractures from free-text radiology reports using a clinician-informed gazetteer methodology. Australas Med J. 2013.

3. Özlem Uzuner, Brett R South, Shuying Shen, Scott L DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inform Assoc. 2011.

4. Peter L Elkin, David Froehling, Dietlind Wahner-Roedler, Brett Trusko, Gail Welsh, Haobo Ma, Armen X Asatryan, Jerome I Tokars, S Trent Rosenbloom, Steven H Brown. NLP-based identification of pneumonia cases from free-text radiological reports. AMIA Annu Symp Proc. 2008.

5. Ergin Soysal, Jingqi Wang, Min Jiang, Yonghui Wu, Serguei Pakhomov, Hongfang Liu, Hua Xu. CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines. J Am Med Inform Assoc. 2018.

6. Andrew Wen, Sunyang Fu, Sungrim Moon, Mohamed El Wazir, Andrew Rosenbaum, Vinod C Kaggal, Sijia Liu, Sunghwan Sohn, Hongfang Liu, Jungwei Fan. Desiderata for delivering NLP to accelerate healthcare AI advancement and a Mayo Clinic NLP-as-a-service implementation. NPJ Digit Med. 2019.

7. Veronika Vincze, György Szarvas, Richárd Farkas, György Móra, János Csirik. The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinformatics. 2008.

8. Hongfang Liu, Suzette J Bielinski, Sunghwan Sohn, Sean Murphy, Kavishwar B Wagholikar, Siddhartha R Jonnalagadda, K E Ravikumar, Stephen T Wu, Iftikhar J Kullo, Christopher G Chute. An information extraction framework for cohort identification using electronic health records. AMIA Jt Summits Transl Sci Proc. 2013.

9. Timothy A Miller, Paul Avillach, Kenneth D Mandl. Experiences implementing scalable, containerized, cloud-based NLP for extracting biobank participant phenotypes at scale. JAMIA Open. 2020.

10. Yongqun He, Hong Yu, Edison Ong, Yang Wang, Yingtong Liu, Anthony Huffman, Hsin-Hui Huang, John Beverley, Junguk Hur, Xiaolin Yang, Luonan Chen, Gilbert S Omenn, Brian Athey, Barry Smith. CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis. Sci Data. 2020.
