
Natural language processing algorithms for mapping clinical text fragments onto ontology concepts: a systematic review and recommendations for future studies.

Martijn G Kersloot, Florentien J P van Putten, Ameen Abu-Hanna, Ronald Cornet, Derk L Arts.

Abstract

BACKGROUND: Free-text descriptions in electronic health records (EHRs) can be of interest for clinical research and care optimization. However, free text cannot be readily interpreted by a computer and, therefore, has limited value. Natural Language Processing (NLP) algorithms can make free text machine-interpretable by attaching ontology concepts to it. However, implementations of NLP algorithms are not evaluated consistently. Therefore, the objective of this study was to review the current methods used for developing and evaluating NLP algorithms that map clinical text fragments onto ontology concepts. To standardize the evaluation of algorithms and reduce heterogeneity between studies, we propose a list of recommendations.
METHODS: Two reviewers examined publications indexed by Scopus, IEEE, MEDLINE, EMBASE, the ACM Digital Library, and the ACL Anthology. Publications reporting on NLP for mapping clinical text from EHRs to ontology concepts were included. Year, country, setting, objective, evaluation and validation methods, NLP algorithms, terminology systems, dataset size and language, performance measures, reference standard, generalizability, operational use, and source code availability were extracted. The studies' objectives were categorized by way of induction. These results were used to define recommendations.
RESULTS: Two thousand three hundred fifty-five unique studies were identified. Two hundred fifty-six studies reported on the development of NLP algorithms for mapping free text to ontology concepts. Seventy-seven described development and evaluation. Twenty-two studies did not perform a validation on unseen data and 68 studies did not perform external validation. Of 23 studies that claimed that their algorithm was generalizable, 5 tested this by external validation. A list of sixteen recommendations regarding the usage of NLP systems and algorithms, usage of data, evaluation and validation, presentation of results, and generalizability of results was developed.
CONCLUSION: We found many heterogeneous approaches to the reporting on the development and evaluation of NLP algorithms that map clinical text to ontology concepts. Over one-fourth of the identified publications did not perform an evaluation. In addition, over one-fourth of the included studies did not perform a validation, and 88% did not perform external validation. We believe that our recommendations, alongside an existing reporting standard, will increase the reproducibility and reusability of future studies and NLP algorithms in medicine.


Keywords:  Annotation; Concept mapping; Entity linking; Evaluation studies; Named-entity recognition; Natural language processing; Ontologies; Recommendations for future studies


Year:  2020        PMID: 33198814      PMCID: PMC7670625          DOI: 10.1186/s13326-020-00231-z

Source DB:  PubMed          Journal:  J Biomed Semantics


Background

One of the main activities of clinicians, besides providing direct patient care, is documenting care in the electronic health record (EHR). Currently, clinicians document clinical findings and symptoms primarily as free-text descriptions within clinical notes in the EHR since they are not able to fully express complex clinical findings and nuances of every patient in a structured format [1, 2]. These free-text descriptions are, amongst other purposes, of interest for clinical research [3, 4], as they cover more information about patients than structured EHR data [5]. However, free-text descriptions cannot be readily processed by a computer and, therefore, have limited value in research and care optimization. One method to make free text machine-processable is entity linking, also known as annotation, i.e., mapping free-text phrases to ontology concepts that express the phrases’ meaning. Ontologies are explicit formal specifications of the concepts in a domain and relations among them [6]. In the medical domain, SNOMED CT [7] and the Human Phenotype Ontology (HPO) [8] are examples of widely used ontologies to annotate clinical data. After the data has been annotated, it can be reused by clinicians to query EHRs [9, 10], to classify patients into different risk groups [11, 12], to detect a patient’s eligibility for clinical trials [13], and for clinical research [14]. Natural Language Processing (NLP) can be used to (semi-)automatically process free text. The literature indicates that NLP algorithms have been broadly adopted and implemented in the field of medicine [15, 16], including algorithms that map clinical text to ontology concepts [17]. Unfortunately, implementations of these algorithms are not being evaluated consistently or according to a predefined framework and limited availability of data sets and tools hampers external validation [18]. 
To improve and standardize the development and evaluation of NLP algorithms, a good practice guideline for evaluating NLP implementations is desirable [19, 20]. Such a guideline would enable researchers to reduce the heterogeneity between the evaluation methodology and reporting of their studies. Generic reporting guidelines such as TRIPOD [21] for prediction models, STROBE [22] for observational studies, RECORD [23] for studies conducted using routinely-collected health data, and STARD [24] for diagnostic accuracy studies, are available, but are often not used in NLP research. This is presumably because some guideline elements do not apply to NLP and some NLP-related elements are missing or unclear. We, therefore, believe that a list of recommendations for the evaluation methods of and reporting on NLP studies, complementary to the generic reporting guidelines, will help to improve the quality of future studies. In this study, we will systematically review the current state of the development and evaluation of NLP algorithms that map clinical text onto ontology concepts, in order to quantify the heterogeneity of methodologies used. We will propose a structured list of recommendations, which is harmonized from existing standards and based on the outcomes of the review, to support the systematic evaluation of the algorithms in future studies.

Methods

This study consists of two phases: a systematic review of the literature and the formation of recommendations based on the findings of the review.

Literature review

A systematic review of the literature was performed using the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement [25].

Search strategy and study selection

We searched Scopus, IEEE, MEDLINE, EMBASE, the Association for Computing Machinery (ACM) Digital Library, and the Association for Computational Linguistics (ACL) Anthology for the following keywords: Natural Language Processing, Medical Language Processing, Electronic Health Record, reports, charts, clinical notes, clinical text, medical notes, ontolog*, concept*, encod*, annotat*, code, and coding. We excluded the words ‘reports’ and ‘charts’ in the ACL and ACM databases since these databases also contain publications on non-medical subjects. The detailed search strategies for each database can be found in Additional file 2. We searched until December 19, 2019 and applied the filters “English” and “has abstract” for all databases. Moreover, we applied the filters “Medicine, Health Professions, and Nursing” for Scopus, the filters “Conferences”, “Journals”, and “Early Access Articles” for IEEE, and the filter “Article” for Scopus and EMBASE. EndNote X9 [26] and Rayyan [27] were used to review and delete duplicates.

The selection process consisted of three phases. In the first phase, two independent reviewers with a Medical Informatics background (MK, FP) individually assessed the resulting titles and abstracts and selected publications that fitted the criteria described below. Inclusion criteria were:

- Medical language processing as the main topic of the publication
- Use of EHR data, clinical reports, or clinical notes
- Algorithm performs annotation
- Publication is written in English

Some studies only list NLP as the method used, without describing its specific implementation. Additionally, some studies create their own ontology to perform NLP tasks, instead of using an established, domain-accepted ontology. Both approaches limit the generalizability of the study’s methods. Therefore, we defined the following exclusion criteria:

- Implementation was not described
- Implementation does not use an existing established ontology for encoding
- Not published in a peer-reviewed journal (except for ACL and ACM publications)

In the second phase, both reviewers assessed the titles, abstracts, and, in case of uncertainty, the Method section of each publication, and excluded publications in which the developed NLP algorithm was not evaluated. In the third phase, both reviewers independently evaluated the resulting full-text articles for relevance. The reviewers used Rayyan [27] in the first phase and Covidence [28] in the second and third phases to store the information about the articles and their inclusion. In all phases, both reviewers independently reviewed all publications. After each phase the reviewers discussed any disagreement until consensus was reached.
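The first-phase criteria can be read as a simple decision rule. The sketch below is only an illustration of that rule, not the reviewers' actual tooling: the screening was done manually in Rayyan, and every name here (`ScreeningRecord`, its fields, `passes_first_phase`) is hypothetical.

```python
from dataclasses import dataclass

# Databases exempt from the peer-reviewed-journal requirement (per the text).
EXEMPT_SOURCES = {"ACL", "ACM"}


@dataclass
class ScreeningRecord:
    """Reviewer judgments for one publication; all field names are hypothetical."""
    mlp_is_main_topic: bool          # medical language processing is the main topic
    uses_clinical_text: bool         # EHR data, clinical reports, or clinical notes
    performs_annotation: bool        # algorithm performs annotation
    in_english: bool                 # publication is written in English
    implementation_described: bool   # NLP implementation is actually described
    uses_established_ontology: bool  # existing, established ontology used for encoding
    peer_reviewed_journal: bool      # published in a peer-reviewed journal
    source_db: str                   # database the record came from


def meets_inclusion(r: ScreeningRecord) -> bool:
    """All four inclusion criteria must hold."""
    return all([r.mlp_is_main_topic, r.uses_clinical_text,
                r.performs_annotation, r.in_english])


def meets_exclusion(r: ScreeningRecord) -> bool:
    """Any single exclusion criterion is enough to exclude."""
    not_peer_reviewed = (not r.peer_reviewed_journal
                         and r.source_db not in EXEMPT_SOURCES)
    return (not r.implementation_described
            or not r.uses_established_ontology
            or not_peer_reviewed)


def passes_first_phase(r: ScreeningRecord) -> bool:
    return meets_inclusion(r) and not meets_exclusion(r)


# Two made-up records: an ACL paper, and a study that built its own ontology.
acl_paper = ScreeningRecord(True, True, True, True, True, True, False, "ACL")
own_ontology = ScreeningRecord(True, True, True, True, True, False, True, "MEDLINE")
print(passes_first_phase(acl_paper))     # True
print(passes_first_phase(own_ontology))  # False
```

In practice each boolean is a human judgment made from the title and abstract, so the rule only makes the combination logic explicit.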

Data extraction and categorization

Both reviewers categorized the implementations of the found algorithms and noted their characteristics in a structured form in Covidence. The objectives of the included studies and their associated NLP tasks were categorized by way of induction. The results were compared and merged into one result set. We collected the following characteristics of the studies, based on a combination of TRIPOD [21], STROBE [22], RECORD [23], and STARD [24] statement elements (see Additional file 3): year, country, setting, objectives, evaluation methods, used NLP systems or algorithms, used terminology systems, size of datasets, performance measures, reference standard, language of the free-text data, validation methods, generalizability, operational use, and source code availability.

List of recommendations

Based on the findings of the systematic review and elements from the TRIPOD, STROBE, RECORD, and STARD statements, we formed a list of recommendations. The recommendations focus on the development and evaluation of NLP algorithms for mapping clinical text fragments onto ontology concepts and the reporting of evaluation results.

Results

The literature search generated a total of 2355 unique publications. After reviewing the titles and abstracts, we selected 256 publications for additional screening. Out of the 256 publications, we excluded 65 publications, as the described Natural Language Processing algorithms in those publications were not evaluated. The full text of the remaining 191 publications was assessed; 114 of these did not meet our criteria, including 3 in which the algorithm was not evaluated, resulting in 77 included articles describing 77 studies. Reference checking did not provide any additional publications. The PRISMA flow diagram is presented in Fig. 1.
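The selection counts and the proportions quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text. The variable names are illustrative, and treating the 65 and 3 unevaluated publications together as the "not evaluated" group is our reading of the text, not something the authors state explicitly.

```python
# Counts reported in the Results and the abstract.
identified = 2355              # unique publications from the search
selected_for_screening = 256   # selected after title/abstract review
not_evaluated_screening = 65   # excluded: algorithm not evaluated
excluded_full_text = 114       # full texts that did not meet the criteria
not_evaluated_full_text = 3    # of those 114: algorithm not evaluated

full_text_assessed = selected_for_screening - not_evaluated_screening
included = full_text_assessed - excluded_full_text
assert full_text_assessed == 191
assert included == 77

# "Over one-fourth of the identified publications did not perform an evaluation."
not_evaluated = not_evaluated_screening + not_evaluated_full_text
share_not_evaluated = not_evaluated / selected_for_screening

# Validation figures from the abstract, relative to the 77 included studies.
no_validation = 22
no_external_validation = 68
share_no_validation = no_validation / included
share_no_external = no_external_validation / included

print(f"not evaluated:          {share_not_evaluated:.1%}")
print(f"no validation:          {share_no_validation:.1%}")
print(f"no external validation: {share_no_external:.1%}")
```

Under these assumptions the three shares come out just above 26%, 28%, and 88%, matching the "over one-fourth" and "88%" statements in the conclusion.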
Fig. 1

PRISMA flow diagram

The induction process resulted in eight categories and ten associated NLP tasks that describe the objectives of the papers: computer-assisted coding, information comparison, information enrichment, information extraction, prediction, software development and evaluation, and text processing. Our definitions of these NLP tasks and the associated categories are given in Table 1 and Table 2.
Table 1

Induced objective tasks with their definition and an example

| Induced NLP task(s) | Description | Example |
| --- | --- | --- |
| Concept detection ¹ | Assign ontology concepts to phrases in free text (i.e., entity linking or annotation) | “Systolic blood pressure” can be represented as SNOMED-CT concept 271649006 \| Systolic blood pressure (observable entity) \| |
| Event detection | Detect events in free text | “Patient visited the outpatient clinic in January 2020” is an event of type Visit. |
| Relationship detection | Detect semantic relationships between concepts in free text | The concept Lung cancer in “This patient was diagnosed with recurrent lung cancer” is related to the concept Recurrence. |
| Text normalization | Transform free text into a single canonical form | “This patient was diagnosed with influenza last year.” becomes “This patient be diagnose with influenza last year.” |
| Text summarization | Create a short summary of free text and possibly restructure the text based on this summary | “Last year, this patient visited the clinic and was diagnosed with diabetes mellitus type 2, and in addition to his diabetes, the patient was also diagnosed with hypertension” becomes “Last year, this patient was diagnosed with diabetes mellitus type 2 and hypertension”. |
| Classification | Assign categories to free text | A report containing the text “This patient is not diagnosed yet” will be assigned to the category Undiagnosed. |
| Prediction | Create a predictive model based on free text | Predict the outcome of the APACHE score based on the (free-text) content in a patient chart. |
| Identification | Identify documents (e.g., reports or patient charts) that match a specific condition based on the contents of the document | Find all patient charts that describe patients with hypertension and a BMI above 30. |
| Software development | Develop new or build upon existing NLP software | A new algorithm was developed to map ontology concepts to free text in clinical reports. |
| Software evaluation | Evaluate the effectiveness of NLP software | The mapping algorithm has an F-score of 0.874. |

1. Also known as Medical Entity Linking and Medical Concept Normalization
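To make the concept detection and software evaluation rows concrete, the following is a minimal sketch of dictionary-based concept detection scored against a reference standard. It is a toy illustration under stated assumptions, not the method of any reviewed study: the lexicon holds only the SNOMED CT example from Table 1 plus one made-up placeholder entry, and the names `LEXICON`, `Annotation`, `detect_concepts`, and `precision_recall_f1` are hypothetical.

```python
import re
from dataclasses import dataclass

# Toy lexicon. The first entry is the example from Table 1; the second
# uses a made-up placeholder identifier, NOT a real SNOMED CT code.
LEXICON = {
    "systolic blood pressure": "271649006",
    "example finding": "PLACEHOLDER-0001",
}


@dataclass(frozen=True)
class Annotation:
    start: int        # character offset where the phrase begins
    end: int          # character offset just past the phrase
    text: str         # the matched phrase as it appears in the note
    concept_id: str   # identifier of the linked ontology concept


def detect_concepts(text, lexicon=LEXICON):
    """Case-insensitive exact phrase lookup; returns annotations in text order."""
    annotations = []
    for term, concept_id in lexicon.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append(
                Annotation(match.start(), match.end(), match.group(0), concept_id))
    return sorted(annotations, key=lambda a: a.start)


def precision_recall_f1(predicted, reference):
    """Exact-match scoring of (start, end, concept_id) triples."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


note = "Systolic blood pressure was measured twice."
found = detect_concepts(note)

# Hypothetical reference standard: the annotator marked two concepts,
# one of which the lexicon lookup cannot find.
reference = {(0, 23, "271649006"), (28, 36, "PLACEHOLDER-0002")}
predicted = {(a.start, a.end, a.concept_id) for a in found}
precision, recall, f1 = precision_recall_f1(predicted, reference)
print(found[0].text, found[0].concept_id)
print(precision, recall, round(f1, 3))
```

Exact span matching is the strictest scoring choice. Whether partial overlaps count as correct changes the reported F-score, which is one reason evaluation details need to be reported alongside the score itself.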

Table 2

Induced objective categories with their definition and associated NLP task(s)

| Induced category | Induced NLP task(s) | Definition |
| --- | --- | --- |
| Computer-assisted coding | Concept detection | Perform semi-automated annotation (i.e., with a human in the loop) |
| Information comparison | Concept detection; Event detection; Relationship detection | Compare extracted structured information to information available in free-text form |
| Information enrichment | Concept detection; Event detection; Relationship detection; Text normalization; Text summarization | Extract structured information from free text and attach this new information to the source |
| Information extraction | Concept detection; Event detection; Relationship detection | Extract structured information from free text |
| Prediction | Classification; Prediction; Identification | Use structured information to classify free-text reports, predict outcomes, or identify cases |
| Software development and evaluation | Software development; Software evaluation | Develop new NLP software or evaluate new or existing NLP software |
| Text processing | Text normalization; Text summarization | Transform free text into a new, more comprehensible form |
Table 3 lists the included publications with their first author, year, title, and country. Table 4 lists the included publications with their evaluation methodologies. The non-induced data, including data regarding the sizes of the datasets used in the studies, can be found as supplementary material attached to this paper.
Table 3

Included publications and their first author, year, title, and country

| Author | Year | Country | Challenge | Induced objective | Data origin | Dataset | Data language | Used system | Terminology system | In use | Source code | Ref. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Afshar | 2019 | USA | No | Information extraction | Clinical Data Warehouse Data | Own | English | New (+ existing) | UMLS (CPT, HCPCS, ICD-10, ICD10CM / ICD9CM, LOINC, MeSH, SNOMED-CT, RxNorm) | Not listed | No, only links to cTAKES source code | [29] |
| Alnazzawi | 2016 | UK | No | Information enrichment | PhenoCHF corpus ¹ | Existing | English | Existing | UMLS | Not listed | Not applicable | [30] |
| Atutxa | 2018 | Spain | No | Information enrichment | EHR documents | Own | Spanish | New | ICD (SNOMED-CT for normalization) | Not yet, aim to embed it in human-supervised loop | Not listed | [31] |
| Barrett | 2013 | USA | No | Information extraction | Palliative care consult letters | Own | English | New | SNOMED CT | Not listed | No, but planned | [32] |
| Becker | 2016 | Germany | No | Information extraction | ShARe/CLEF corpus (2013) ² | Existing | German | Existing | SNOMED CT (English), UMLS (German) | Not yet, still under development | Not applicable | [33] |
| Becker | 2019 | Germany | No | Information extraction | Clinical notes of patients with known colorectal cancer | Own | German | New (+ existing) | UMLS | Yes, led to improved quality of care for colorectal patients | Not listed | [34] |
| Bejan | 2015 | USA | No | Information extraction | Discharge summaries and i2b2/VA challenge dataset (2010) ³ | Own + Existing | English | Existing | UMLS | No | Not applicable | [35] |
| Castro | 2010 | Spain | No | Information extraction | Clinical notes with ‘most relevant information’ | Own | Spanish | Existing | SNOMED CT | Not listed | Not applicable | [36] |
| Catling | 2018 | UK | No | Software development and evaluation | MIMIC-III dataset ⁴ | Existing | English | New | ICD-9-CM | Not listed | Not listed | [37] |
| Chapman | 2004 | USA | No | Information extraction | Emergency department reports | Own | English | Existing | UMLS | Not listed | Not applicable | [38] |
| Chen | 2016 | USA | No | Information enrichment | Discharge summaries and progress notes | Own | English | New (+ existing) | UMLS | Not listed | Not listed | [39] |
| Chiaramello | 2016 | Italy | No | Information extraction | Clinical notes (cardiology, diabetology, hepatology, nephrology, and oncology) | Own | Italian | Existing | UMLS | Not listed | Not applicable | [40] |
| Chodey | 2016 | USA | SemEval (2014) | Information extraction | ICU Data: Discharge summaries, ECG, echo, and radiology | Existing | English | New (+ existing) | UMLS | Not listed | Not listed | [41] |
| Chung | 2005 | USA | No | Information extraction | Echocardiogram reports | Own | English | New (+ existing) | UMLS | Not yet, it will be used to populate a registry | Not listed | [42] |
| Combi | 2018 | Italy | No | Information extraction | VigiSegn (adverse drug reactions) reports | Own | Italian + English | New | MedDRA | Yes, implemented in VigiFarmaco | Pseudocode | [43] |
| De Bruijn | 2011 | Canada | i2b2/VA (2010) | Information extraction | Hospital discharge summaries and progress reports | Existing | English | New (+ existing) | UMLS | Not listed | Not listed | [44] |
| Deisseroth | 2019 | USA | No | Information extraction | Six sets of real patient data from four different medical centers | Own | English | New | HPO | Not listed | Yes | [45] |
| Demner-Fushman | 2017 | USA | No | Software development and evaluation | BioScope ⁵, NCBI disease corpus ⁶, i2b2/VA challenge corpus (2010) ³, ShARe corpus ⁷, LHC test collection (biological/clinical journal abstracts) | Existing | English | New (+ existing) | UMLS | Yes, used in other papers identified in literature search | Yes | [46] |
| Divita | 2014 | USA | Parts: i2b2/VA (2010) | Software development and evaluation | Randomly selected clinical records from the most frequent document types | Own | English | New | UMLS (level 0 + 9) | Yes, used by VA Informatics and Computing Infrastructure | Yes | [47] |
| Duarte | 2018 | Portugal | No | Information enrichment | Death certificates, clinical bulletins, and autopsy reports | Own | Portuguese | New | ICD-10 | Yes, used by Portuguese Ministry of Health for near real-time death cause surveillance | Not listed | [48] |
| Falis | 2019 | UK | No | Information extraction | MIMIC-III dataset ⁴ | Existing | English | New | ICD-9 | Not listed | Not listed | [49] |
| Ferrão | 2013 | Portugal | No | Information enrichment | Inpatient adult episodes from the EHR | Own | Portuguese | New | ICD-9-CM | Not listed | Not listed | [50] |
| Gerbier | 2011 | France | No | Information extraction | Computerized emergency department medical records | Own | French | New | ICD-10, CCAM, SNOMED CT, ATC, MeSH, ICPC-2, DCR | Not yet, will be integrated into a CDSS | Not listed | [51] |
| Goicoechea Salazar | 2013 | Spain | No | Information enrichment | Diagnostic text from patient records | Own | Spanish | New | ICD-9-CM | Not listed | Not listed | [52] |
| Hamid | 2013 | USA | No | Classification | Notes of Iraq and Afghanistan veterans from the VA national clinical database | Own | English | Existing | UMLS | Not listed | Not applicable | [53] |
| Hassanzadeh | 2016 | Australia | No | Information extraction | ShARe/CLEF corpus (2013) ² | Existing | English | Existing | UMLS, SNOMED CT | Not applicable | Not applicable | [54] |
| Helwe | 2017 | Lebanon | No | Computer-assisted coding | MIMIC-III dataset | Existing | English | New | UMLS, ICD | Not listed | Not listed | [55] |
| Hersh | 2001 | USA | No | Information enrichment | Radiology image reports | Own | English | Existing | UMLS | No, still in development/testing | Pseudocode | [56] |
| Hoogendoorn | 2015 | Netherlands | No | Prediction | Consultation notes of patients in a primary care setting | Own | Dutch | New | SNOMED-CT, UMLS, ICPC | Not listed | Not listed | [57] |
| Jindal | 2013 | USA | i2b2 (2012) | Information extraction | i2b2 challenge corpus (2012) ⁸ | Existing | English | New (+ existing) | UMLS, SNOMED CT, MeSH | Not listed | Not listed | [58] |
| Kang | 2009 | Korea | No | Information extraction | Discharge summaries | Own | Korean | New | KOMET, UMLS | Not listed | Not listed | [59] |
| Kersloot | 2019 | Netherlands | No | Information extraction | (Non-small cell) Lung cancer charts | Own | English | New (+ existing) | SNOMED CT | Not listed | Yes | [60] |
| König | 2019 | Germany | No | Software development and evaluation | Discharge letters from BASE-II study | Own | German | New (+ existing) | Wingert-Nomenclature | No, still has to prove its value | Not listed | [61] |
| Li | 2015 | USA | No | Information comparison | Clinical notes and discharge prescription lists | Own | English | New (+ existing) | UMLS, SNOMED CT, RxNorm | Not yet, plans to move to production | Pseudocode | [62] |
| Li | 2019 | USA | No | Information extraction | EHR notes | Own | English | New (+ existing) | UMLS, SNOMED CT, MedDRA | Not listed | Not listed | [63] |
| Lingren | 2016 | USA | No | Classification | Structured and unstructured data from two EHR databases | Own | English | New (+ existing) | UMLS, ICD-9, RxNorm | Not listed | Not listed | [12] |
| Liu | 2019 | USA | No | Information extraction | Clinical notes from different institutions + PubMed Case report abstracts | Own + Existing | English | Existing | HPO | Not listed | Not applicable | [64] |
| Lowe | 2009 | USA | No | Information extraction | Single-specimen pathology reports | Own | English | Existing | UMLS, SNOMED CT | Not listed | Not applicable | [65] |
| Luo | 2014 | USA | No | Information extraction | Pathology reports | Own | English | New (+ existing) | UMLS, SNOMED CT | Yes, currently working on project in multiple hospitals | Not listed | [66] |
| Meystre | 2006 | USA | No | Information enrichment | Clinical documents from adult inpatients in a cardiovascular unit | Own | English | New (+ existing) | UMLS (level 0), SNOMED CT | Not yet, testing in practice | Not listed | [67] |
| Meystre | 2010 | USA | i2b2 (2009) | Information extraction | i2b2 challenge dataset (2009) ⁹ | Existing | English | New | UMLS | Not yet, possible integration in research infrastructure | Not listed | [68] |
| Minard | 2011 | France | i2b2/VA (2010) | Information extraction | i2b2/VA challenge corpus (2010) ³ | Existing | English | New (+ existing) | UMLS | Not listed | Not listed | [69] |
| Mishra | 2019 | USA | No | Information extraction | Clinical notes from NIH Clinical Center data warehouse | Own | English | Existing | UMLS, HPO | Not listed | Not applicable | [70] |
| Nguyen | 2018 | Australia | No | Computer-assisted coding | Hospital progress notes | Own | English | New (+ existing) | SNOMED CT, ICD-10-AM | Not listed | Not listed | [71] |
| Oellrich | 2015 | UK | No | Information extraction | PubMed abstracts, clinical trial information, i2b2/VA challenge corpus (2010) ³, ShARe/CLEF (2013) ² | Existing | English | Existing | UMLS | Not listed | Not applicable | [72] |
| Patrick | 2011 | Australia | i2b2/VA (2010) | Information extraction | i2b2/VA challenge corpus (2010) ³ | Existing | English | New | UMLS, SNOMED CT | Not listed | Not listed | [73] |
| Pérez | 2018 | Spain | No | Text processing | Spontaneous DTs randomly selected entries | Own | Spanish | New | ICD | Not listed | Not listed | [74] |
| Reátegui | 2018 | Canada | No | Information extraction | i2b2 challenge corpus (2008) ¹⁰ | Existing | English | New (+ existing) | UMLS, SNOMED CT, RxNorm | Not listed | Not listed | [75] |
| Roberts | 2011 | USA | i2b2/VA (2010) | Information extraction | i2b2/VA challenge corpus (2010) ³ | Existing | English | New (+ existing) | UMLS, ICD-9 | Not listed | Not listed | [76] |
| Rousseau | 2019 | USA | No | Information comparison | ED encounters for patients with headaches who received head CT | Own | English | Existing | UMLS: SNOMED CT, RadLex | Not listed | Not applicable | [77] |
| Savova | 2010 | USA | i2b2 (2006, 2008) | Information extraction | Subset of clinical notes from the EMR | Own | English | New (+ existing) | UMLS, SNOMED CT, RxNorm | Yes, used in other papers identified in literature search | Yes | [78] |
| Shivade | 2015 | USA | i2b2/UTHealth (2014) | Classification | i2b2 challenge corpus (2014) ¹¹ | Existing | English | Existing | UMLS | Not listed | Not applicable | [11] |
| Shoenbill | 2019 | USA | No | Information extraction | EHR notes from hypertension patients | Own | English | Existing | UMLS, SNOMED CT | Not listed | Not applicable | [79] |
| Sohn | 2014 | USA | No | Information extraction | Clinical notes with medication mentions | Own | English | New | RxNorm | Not listed | Yes | [80] |
| Solti | 2008 | USA | No | Information enrichment | Cardiology ambulatory progress notes | Own | English | Existing | UMLS | Not listed | Not applicable | [81] |
| Soriano | 2019 | Spain | No | Information extraction | Clinical emergency discharge reports | Own | Spanish | New | SNOMED CT | Not yet | Yes | [82] |
| Soysal | 2018 | USA | Parts: i2b2 (2009 + 2010), ShARe/CLEF (2013), Sem-EVAL (2014) | Software development and evaluation | Discharge summaries from the i2b2/VA challenge corpus (2010) ³, outpatient clinic visit notes, mock clinical documents | Own + Existing | English | New | UMLS | Yes, used by various institutions and industrial entities | Yes | [83] |
| Spasić | 2015 | UK | No | Information extraction | MRI reports of patients | Own | English | New (+ existing) | TRAK, UMLS, MEDCIN, RadLex | Not listed | Yes | [84] |
| Strauss | 2013 | USA | No | Information extraction | Pathology reports of breast and prostate cancer patients | Own | English | New | SNOMED CT | Not listed | Yes | [85] |
| Sung | 2018 | Taiwan | No | Information extraction | Cases of adult patients with AIS | Own | English | Existing | UMLS | Not listed | Not applicable | [86] |
| Tchechmedjiev | 2018 | France | No | Information extraction | Quaero (French MEDLINE abstract titles + EMEA drug labels) + CépiDC (ICD-10 coding of death certificates) | Existing | French | New (+ existing) | UMLS terminologies (ICD-10) | Yes, available in SIFR BioPortal | Yes | [87] |
| Ternois | 2018 | France | No | Classification | Endoscopy reports written between 2015 and 2016 | Own | French | New | CCAM | Not listed | Not listed | [88] |
| Travers | 2004 | USA | No | Information extraction | Chief complaint text entries for all emergency department visits | Own | English | New | UMLS | Not listed | Not listed | [89] |
| Tulkens | 2019 | Belgium | No | Information extraction | i2b2/VA challenge corpus (2010) ³ | Existing | English | New (+ existing) | UMLS | Not listed | Yes | [90] |
| Usui | 2018 | Japan | No | Prediction | Electronic medication history data from pharmacy | Own | Japanese | New | ICD-10 | Not yet, expect to use it | Not listed | [91] |
| Valtchinov | 2019 | USA | No | Classification | Radiology reports, emergency department notes + other clinical reports | Own | English | Existing | SNOMED CT, RadLex | Not listed | Not applicable | [92] |
| Wadia | 2018 | USA | No | Classification | Chest CT reports | Own | English | Existing | SNOMED CT, UMLS | Not listed | Not applicable | [93] |
| Walker | 2019 | USA | No | Information extraction | Treatment sites from EMR | Own | English | New | UMLS | Not listed | Not listed | [94] |
| Xie | 2019 | China | No | Information extraction | MIMIC-III dataset ⁴ | Existing | English | New | ICD-9-CM, ICD-10 | Not listed | Not listed | [95] |
| Xu | 2011 | USA | No | Classification | CRC patient cases from the Synthetic Derivative database | Own | English | Existing | UMLS | No, still under development | Not applicable | [96] |
| Yadav | 2013 | USA | No | Prediction | Emergency department CT imaging reports | Own | English | Existing | UMLS | Not listed | Yes, command line command | [97] |
| Yao | 2019 | USA | No | Prediction | i2b2 challenge corpus (2008) ¹⁰ | Existing | English | New (+ existing) | UMLS | Not listed | Part (Sorl) | [98] |
| Zeng | 2018 | USA | No | Classification | Progress notes and breast cancer surgical pathology reports | Own | English | New (+ existing) | UMLS | Not listed | Not listed | [99] |
| Zhang | 2013 | USA | No | Information extraction | i2b2/VA challenge corpus (2010) ³ and GENIA corpus (MEDLINE abstracts) | Existing | English | New | UMLS | Not listed | Not listed | [100] |
| Zhou | 2006 | USA | No | Information extraction | Records of patients with breast complaints | Own | English | New | UMLS | No, still under development | Not listed | [101] |
| Zhou | 2011 | USA | No | Software development and evaluation | COPD and CAD patients | Own | English | New | SNOMED CT, RxNorm, UMLS, PPL, MDD, HL7 value sets | Yes, described in other paper [103] | Not listed | [102] |
| Zhou | 2014 | USA | No | Information extraction | Admission notes and discharge summaries | Own | English | Existing | SNOMED CT, HL7 RoleCodes | Not listed | Not applicable | [103] |

1. PhenoCHF corpus: narrative reports from electronic health records (EHRs) and literature articles

2. ShARe/CLEF corpus (2013): narrative clinical reports

3. i2b2/VA challenge dataset (2010): discharge summaries and progress reports

4. MIMIC-III dataset: demographics, vital sign measurements, laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality

5. BioScope corpus: medical free texts, biological full papers and biological scientific abstracts

6. NCBI disease corpus: PubMed abstracts

7. ShARe corpus: deidentified clinical free-text notes from the MIMIC II database

8. i2b2 challenge corpus (2012): discharge summaries

9. i2b2 challenge dataset (2009): de-identified hospital discharge summaries

10. i2b2 challenge corpus (2008): discharge summaries of overweight and diabetic patients

11. i2b2 challenge corpus (2014): longitudinally ordered clinical notes from three cohorts of diabetic patients

Table 4

Included publications and their evaluation methodologies

| Author | Year | Reference standard | Validation | External validation | Generalizability ᵃ | Ref. |
| --- | --- | --- | --- | --- | --- | --- |
| Afshar | 2019 | Existing EHR data | Hold-out validation (train, test, development) | No | No, validation is needed | [29] |
| Alnazzawi | 2016 | Existing annotated corpus | External | ShARe/CLEF, NCBI disease, Heart failure and pulmonary embolism corpora | Yes, achieves competitive performance on other corpora | [30] |
| Atutxa | 2018 | Manual retrospective review | Hold-out validation (train, test, development) | No | Yes, easily portable to other languages | [31] |
| Barrett | 2013 | Manual annotations | 10-fold cross validation | Multiple datasets (different provider) | Yes, expect that it is generalizable | [32] |
| Becker | 2016 | Existing annotated corpus | Not used | No | Not listed | [33] |
| Becker | 2019 | Manual annotations | Hold-out validation (train, test, development) | No | Not listed | [34] |
| Bejan | 2015 | Manual annotations | External | i2b2 data (2010) | Yes, good performance on the i2b2 dataset, even though not optimized on it | [35] |
| Castro | 2010 | Manual annotations | Not used | No | Not listed | [36] |
| Catling | 2018 | Existing annotated corpus | Hold-out validation (train, test, development) | No | Not listed | [37] |
| Chapman | 2004 | Manual annotations | Not used | No | Yes, generalizable to other domains within and outside of bio surveillance | [38] |
| Chen | 2016 | Manual annotations | 10-fold cross validation | No | Not listed | [39] |
| Chiaramello | 2016 | Manual annotations | Not used | No | Not listed | [40] |
| Chodey | 2016 | Existing annotated corpus | Hold-out validation (train, test) | No | Not listed | [41] |
| Chung | 2005 | Manual annotations | Hold-out validation (train, test) | Reports from a second hospital | Not listed | [42] |
| Combi | 2018 | Manual annotations | Not used | No | Not listed | [43] |
| De Bruijn | 2011 | Existing annotated corpus | 15-fold cross validation | No | Not listed | [44] |
| Deisseroth | 2019 | Manual annotations | Hold-out validation (train, test) | Data from a second hospital | Yes, it can be immediately incorporated into clinical practice | [45] |
| Demner-Fushman | 2017 | Existing annotated corpus | External | Multiple datasets | Not listed | [46] |
| Divita | 2014 | Manual annotations | Not used | No | Not listed | [47] |
| Duarte | 2018 | Manual annotations | Hold-out validation (train, test) | Second dataset | Not listed | [48] |
| Falis | 2019 | Existing annotated corpus | Hold-out validation (train, test, development) | No | Yes, method is not specific to an ontology, and could be used for a graph of any formation | [49] |
| Ferrão | 2013 | Existing EHR data | Hold-out validation (train, test) | No | Not listed | [50] |
| Gerbier | 2011 | Manual annotations | Hold-out validation (train, test) | No | Yes, it could also serve other types of clinical decision support systems | [51] |
| Goicoechea Salazar | 2013 | Manual annotations | Hold-out validation (train, test) | No | Not listed | [52] |
| Hamid | 2013 | Manual annotations | 10-fold cross validation | No | Possible, the classifier may be applicable in academic hospital samples | [53] |
| Hassanzadeh | 2016 | Existing annotated corpus | Hold-out validation (train, test) | No | Not applicable | [54] |
| Helwe | 2017 | Existing annotated corpus | Hold-out validation (train, test, development) | No | Not listed | [55] |
| Hersh | 2001 | Manual annotations | Hold-out validation (train, test) | No | Not listed | [56] |
| Hoogendoorn | 2015 | Existing EHR data | 5-fold cross validation | No | Not listed | [57] |
| Jindal | 2013 | Existing annotated corpus | Hold-out validation (train, test) | No | Yes, broad applicability | [58] |
| Kang | 2009 | Manual annotations | Hold-out validation (train, test) | No | Yes, extensible to other languages | [59] |
| Kersloot | 2019 | Manual annotations | Hold-out validation (development, test) | No | Possible, but external validation is needed | [60] |
| König | 2019 | Existing EHR data | Not used | No | Still to be tested | [61] |
| Li | 2015 | Manual annotations | 10-fold cross validation | No | Not listed | [62] |
| Li | 2019 | Existing annotated corpus | Hold-out validation (train, test, development) | No | Not listed | [63] |
| Lingren | 2016 | Manual annotations | Hold-out validation (train, test, development) | No | Not listed | [12] |
| Liu | 2019 | Manual annotations | Not used | No (but multiple datasets / non-trained) | No, limited because of NYP/CUIMC and Mayo notes | [64] |
| Lowe | 2009 | Manual retrospective review | Hold-out validation (train, test) | No | Yes, has the potential to index other classes of clinical documents | [65] |
| Luo | 2014 | Existing EHR data | 10-fold cross validation | No | No, challenging, not currently working on it | [66] |
| Meystre | 2006 | Manual retrospective review | Not used | No | Not listed | [67] |
| Meystre | 2010 | Existing annotated corpus | Hold-out validation (train, test) | No | Not listed | [68] |
| Minard | 2011 | Existing annotated corpus | Hold-out validation (train, test, development) | No | Not listed | [69] |
| Mishra | 2019 | Manual annotations | Not used | No | Not listed | [70] |
| Nguyen | 2018 | Existing EHR data | Not listed | No | Not listed | [71] |
| Oellrich | 2015 | Existing annotated corpus | External | Multiple datasets | Not listed | [72] |
| Patrick | 2011 | Existing annotated corpus | 10-fold cross validation | No | Yes, adaptable to different requirements in clinical information extraction and classification by choosing relevant feature sets | [73] |
| Pérez | 2018 | Existing annotated corpus | Hold-out validation (train, test, development) | No | Yes, extensible to different hospital-sections and hospitals | [74] |
| Reátegui | 2018 | Existing annotated corpus | Not used | No | Not listed | [75] |
| Roberts | 2011 | Existing annotated corpus | Hold-out validation (train, test) | No | Not listed | [76] |
| Rousseau | 2019 | Manual annotations | Not used | No | Not listed | [77] |
| Savova | 2010 | Manual annotations | 10-fold cross validation | No | Yes, implemented in several applications | [78] |
| Shivade | 2015 | Manual annotations | Hold-out validation (train, test) | No | Not listed | [11] |
Shoenbill2019Manual annotationsHold-out validation (train, test)NoYes, can allow further evaluation and improvement in care delivery models and treatment approaches to multiple chronic illnesses[79]
Sohn2014Manual annotationsHold-out validation (train, test, development)NoYes, with adaptions: create flexible mechanism for adaptation process[80]
Solti2008Manual annotationsHold-out validation (train, test)NoNot listed[81]
Soriano2019Manual annotationsNot listedNoNot listed[82]
Soysal2018Existing annotated corpusHold-out validation (train, test)NoYes, can be used to quickly develop customized clinical information extraction pipelines[83]
Spasić2015Manual annotationsHold-out validation (train, test)NoNot listed[84]
Strauss2013Manual annotationsNot usedNoYes, can be shared between institutions and used to support clinical + epidemiological research[85]
Sung2018Manual annotationsNot listedNoNot listed[86]
Tchechmedjiev2018Existing annotated corpusHold-out validation (train, test, development)NoYes, but not universally[87]
Ternois2018Existing EHR data5-fold cross validation + Hold-out validation (train, test)NoNot listed[88]
Travers2004Manual retrospective reviewNot usedNoNot listed[89]
Tulkens2019Existing annotated corpusHold-out validation (train, test, development)NoNot listed[90]
Usui2018Manual annotationsNot usedNoNot listed[91]
Valtchinov2019Manual annotationsNot usedNoNo[92]
Wadia2018Manual annotationsNot usedNoNot listed[93]
Walker2019Manual retrospective reviewHold-out validation (development, test)NoYes, it can be incorporated in institutional data warehouse[94]
Xie2019Existing annotated corpusHold-out validation (train, test, development)NoNot listed[95]
Xu2011Manual annotationsHold-out validation (train, test)NoYes, generable approach to combine information from heterogeneous data sources in EHRs[96]
Yadav2013Manual annotationsNot usedNoYes, should be broadly applicate to outcomes of clinical interest[97]
Yao2019Existing annotated corpusHold-out validation (train, test)NoNot listed[98]
Zeng2018Manual annotations5-fold cross validation + Hold-out validation (train, test)NoYes, potential to be replicated[99]
Zhang2013Existing annotated corpusExternalTwo different sets with same settingsYes, can be adapted to different semantic categories and text genres[100]
Zhou2006Manual annotations5-fold cross validationNoNot listed[101]
Zhou2011Manual retrospective reviewHold-out validation (train, test)NoNot listed[102]
Zhou2014Manual annotationsNot usedNoNot listed[103]

a As reported by authors

1. PhenoCHF corpus: narrative reports from electronic health records (EHRs) and literature articles
2. ShARe/CLEF corpus (2013): narrative clinical reports
3. i2b2/VA challenge dataset (2010): discharge summaries and progress reports
4. MIMIC-III dataset: demographics, vital sign measurements, laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality
5. BioScope corpus: medical free texts, biological full papers and biological scientific abstracts
6. NCBI disease corpus: PubMed abstracts
7. ShARe corpus: deidentified clinical free-text notes from the MIMIC II database
8. i2b2 challenge corpus (2012): discharge summaries
9. i2b2 challenge dataset (2009): de-identified hospital discharge summaries
10. i2b2 challenge corpus (2008): discharge summaries of overweight and diabetic patients
11. i2b2 challenge corpus (2014): longitudinally ordered clinical notes from three cohorts of diabetic patients

Table 5 summarizes the general characteristics of the included studies and Table 6 summarizes the evaluation methods used in these studies. In all 77 papers, we found twenty different performance measures (Table 7).
Table 5

Characteristics of the included studies

Description | n (%) | References
Main objective
  Information extraction | 45 (58%) | [29, 32–36, 38, 40–45, 49, 51, 58–60, 63–66, 68–70, 72, 73, 75, 76, 78–80, 82, 84–87, 89, 90, 94, 95, 100, 101, 103, 104]
  Information enrichment | 9 (12%) | [30, 31, 39, 48, 50, 52, 56, 67, 81]
  Classification | 8 (10%) | [11, 12, 53, 88, 92, 93, 96, 99]
  Software development and evaluation | 6 (7.8%) | [37, 46, 47, 61, 83, 102]
  Prediction | 4 (5.2%) | [57, 91, 97, 98]
  Information comparison | 2 (2.6%) | [62, 77]
  Computer-assisted coding | 2 (2.6%) | [55, 71]
  Text processing | 1 (1.3%) | [74]
Part of challenge
  i2b2 (Informatics for Integrating Biology and the Bedside) | 10 (13%) | [11, 44, 47, 58, 68, 69, 73, 76, 78, 83]
    Entire system | 8 (10%) | [11, 44, 58, 68, 69, 73, 76, 78]
    Parts of the system | 2 (2.6%) | [47, 83]
  SemEval (Semantic Evaluation) | 2 (2.6%) | [41, 83]
    Entire system | 1 (1.3%) | [41]
    Parts of the system | 1 (1.3%) | [83]
  ShARe/CLEF (Shared Annotated Resources/Conference and Labs of the Evaluation Forum) | 1 (1.3%) | [83]
    Parts of the system | 1 (1.3%) | [83]
Dataset: language
  English | 60 (78%) | [11, 12, 29, 30, 32, 35, 37–39, 41–47, 49, 53, 55, 56, 58, 60, 62–73, 75–81, 83–86, 89, 90, 92–104]
  Spanish | 5 (6.5%) | [31, 36, 52, 74, 82]
  French | 3 (3.9%) | [51, 87, 88]
  German | 3 (3.9%) | [33, 34, 61]
  Italian | 2 (2.6%) | [40, 43]
  Portuguese | 2 (2.6%) | [48, 50]
  Dutch | 1 (1.3%) | [57]
  Japanese | 1 (1.3%) | [91]
  Korean | 1 (1.3%) | [59]
Dataset: origin
  Data present in institute | 55 (71%) | [12, 29, 31, 32, 34–36, 38–40, 42, 43, 45, 47, 48, 50–53, 56, 57, 59–67, 70, 71, 74, 77–86, 88, 89, 91–94, 96, 97, 99, 101–103]
  Existing dataset | 25 (33%) | [11, 30, 33, 35, 37, 41, 44, 46, 49, 55, 58, 64, 68, 69, 72, 73, 75, 76, 83, 87, 90, 95, 98, 100, 104]
  Included reference to dataset | 21 (27%) | [11, 30, 35, 37, 41, 44, 46, 49, 55, 58, 64, 72, 75, 76, 83, 87, 90, 95, 98, 100, 104]
Training of algorithm
  Trained | 47 (61%) | [11, 12, 29, 31, 32, 34, 37, 39, 41, 42, 44, 45, 48–53, 55–59, 62, 63, 65, 66, 68, 69, 73, 74, 76, 78–84, 87, 88, 90, 95, 96, 98, 99, 104]
  Not listed | 3 (3.9%) | [30, 101, 102]
Development of algorithm
  Use of development set | 16 (21%) | [12, 29, 31, 34, 37, 49, 55, 60, 63, 69, 74, 80, 87, 90, 94, 95]
  Not listed | 4 (5.2%) | [30, 82, 83, 101]
Used NLP system or algorithm
  New NLP system or algorithm | 29 (38%) | [31, 32, 37, 43, 45, 47–52, 55, 57, 59, 68, 73, 74, 80, 82, 83, 85, 88, 89, 91, 94, 95, 100–102]
  New NLP system or algorithm with existing components | 25 (33%) | [12, 29, 34, 39, 41, 42, 44, 46, 58, 60–63, 66, 67, 69, 71, 75, 76, 78, 84, 87, 90, 98, 99]
  Existing NLP system or algorithm | 23 (30%) | [11, 30, 33, 35, 36, 38, 40, 53, 56, 64, 65, 70, 72, 77, 79, 81, 86, 93, 96, 97, 103, 104]
Use in practice
  Plans to implement / still under development and testing | 12 (16%) | [31, 33, 51, 56, 62, 66–68, 82, 91, 96, 101]
  Implemented in practice | 10 (13%) | [34, 42, 43, 46–48, 78, 83, 87, 102]
Availability of code
  Published algorithm or source code | 15 (20%) | [31, 45–47, 60, 78, 80, 82–85, 87, 90, 97, 98]
  Pseudocode in manuscript | 3 (3.9%) | [43, 56, 62]
  Planning to publish algorithm or source code | 1 (1.3%) | [32]
  Not applicable, used an existing system | 20 (26%) | [11, 30, 33, 35, 36, 38, 40, 53, 64, 65, 70, 72, 77, 79, 81, 86, 93, 96, 103, 104]
Table 6

Evaluation methods of the included studies

Description | n (%) | References
Evaluation: Reference standard
  Manual annotations | 40 (52%) | [11, 12, 32, 34–36, 38–40, 42, 43, 45, 47, 48, 51–53, 56, 59, 60, 62, 64, 70, 77–82, 84–86, 91–93, 96, 97, 99, 101, 103]
  Existing annotated corpus | 24 (31%) | [30, 33, 37, 41, 44, 46, 49, 55, 58, 63, 68, 69, 72–76, 83, 87, 90, 95, 98, 100, 104]
  Existing EHR data | 7 (9.1%) | [29, 50, 57, 61, 66, 71, 88]
  Manual retrospective review | 6 (7.8%) | [31, 65, 67, 89, 94, 102]
Evaluation: Validation
  Hold-out validation | 40 (52%) | [11, 12, 29, 31, 34, 37, 41, 42, 45, 48–52, 55, 56, 58–60, 63, 65, 68, 69, 74, 76, 79–81, 83, 84, 87, 88, 90, 94–96, 98, 99, 102, 104]
  Cross-validation | 12 (16%) | [32, 39, 44, 53, 57, 62, 66, 73, 78, 88, 99, 101]
  External validation | 9 (12%) | [30, 32, 35, 42, 45, 46, 48, 72, 100]
    Solely external validation | 5 (6.5%) | [30, 35, 46, 72, 100]
    In addition to another type of validation | 4 (5.2%) | [32, 42, 45, 48]
  Not performed or not listed | 22 (29%) | [33, 36, 38, 40, 43, 47, 61, 64, 67, 70, 71, 75, 77, 82, 85, 86, 89, 91–93, 97, 103]
Generalizability
  Claimed | 23 (30%) | [30–32, 35, 38, 45, 49, 51, 58, 59, 65, 73, 74, 78–80, 83, 85, 87, 94, 96, 97, 100]
  Externally validated | 5 (6.5%) | [30, 32, 35, 45, 100]
Comparison
  Compared to other existing algorithms or models | 24 (31%) | [30, 35, 39, 45–47, 49, 58, 60, 63, 64, 72, 75, 80, 83, 87, 90, 94, 95, 98–101, 104]
  Tested difference in outcomes for statistical significance | 4 (5.2%) | [35, 39, 60, 63]
Table 7

Performance measures used in the included studies

Description | Formula | n (%) | References
Confusion Matrix
  Lists the True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN), and the Total (n) amount in a 2 × 2 contingency table | | 12 (16%) | [34, 44, 47, 51, 56, 58, 60, 61, 84, 87, 91, 93]
    TP: Text annotated with ontology concept when ontology concept is present in reference standard
    TN: Text not annotated with ontology concept when ontology concept is absent in reference standard
    FP: Text annotated with ontology concept when ontology concept is absent in reference standard
    FN: Text not annotated with ontology concept when ontology concept is present in reference standard
Performance measures
  Recall | $\frac{TP}{FN + TP}$ | 68 (88%) | [11, 12, 29–31, 33–53, 56–58, 60–64, 66–73, 75–88, 90–94, 96, 99–104]
  Precision | $\frac{TP}{FP + TP}$ | 66 (86%) | [11, 12, 29–31, 33–36, 38–51, 53, 56–58, 60–73, 75–88, 90, 91, 93, 94, 96, 99–104]
  F-score | $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ | 57 (74%) | [11, 12, 30, 31, 33–36, 39–41, 44, 46–50, 52, 53, 55, 57–63, 66–73, 75–80, 82–84, 86–88, 90, 91, 95, 96, 98–100, 102–104]
  Accuracy | $\frac{TP + TN}{n}$ | 11 (14%) | [30, 32, 34, 41, 48, 52, 67, 74, 78, 92, 96]
  Specificity | $\frac{TN}{FP + TN}$ | 6 (7.8%) | [29, 34, 85, 92, 93, 96]
  AUC | Not applicable | 5 (6.5%) | [29, 39, 57, 95, 99]
  Kappa | $\frac{p_o - p_e}{1 - p_e} = 1 - \frac{1 - p_o}{1 - p_e}$ | 3 (3.9%) | [85, 89, 97]
  Processing time | Not applicable | 3 (3.9%) | [32, 47, 83]
  Negative Predictive Value | $\frac{TN}{FN + TN}$ | 3 (3.9%) | [29, 85, 93]
  False Positive Rate | $\frac{FP}{FP + TN}$ | 1 (1.3%) | [34]
  False Negative Rate | $\frac{FN}{TP + FN}$ | 1 (1.3%) | [34]
  Information entropy | $-\sum_{i=1}^{n} P_i \log(P_i)$ | 1 (1.3%) | [64]
  Mean Reciprocal Rank | $\frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\mathrm{rank}_i}$ | 1 (1.3%) | [74]
  Initial annotator agreement | Not applicable | 1 (1.3%) | [79]
  Match/no match (%) | Not applicable | 1 (1.3%) | [89]
  Overgeneration | $\frac{FP}{TP + FP}$ | 1 (1.3%) | [93]
  Undergeneration | $\frac{FN}{TP + FN}$ | 1 (1.3%) | [68]
  Error | $\frac{FN + FP}{TP + FN + FP}$ | 1 (1.3%) | [68]
  Fallout | $\frac{FP}{TN + FP}$ | 1 (1.3%) | [68]
  Mean Standard Error | $\frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2$ | 1 (1.3%) | [57]
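The confusion-matrix-based measures in Table 7 can all be derived from the four counts TP, TN, FP, and FN. The following Python sketch illustrates this; the class name `ConfusionMatrix`, the helper `_ratio`, the function `measures`, and the example counts are hypothetical and are not taken from any of the included studies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    tp: int  # annotated with concept, concept present in reference standard
    tn: int  # not annotated, concept absent in reference standard
    fp: int  # annotated with concept, concept absent in reference standard
    fn: int  # not annotated, concept present in reference standard

    @property
    def n(self) -> int:
        return self.tp + self.tn + self.fp + self.fn


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def measures(cm: ConfusionMatrix) -> dict:
    """Compute the Table 7 measures that depend only on TP, TN, FP, and FN."""
    recall = _ratio(cm.tp, cm.fn + cm.tp)
    precision = _ratio(cm.tp, cm.fp + cm.tp)
    return {
        "recall": recall,
        "precision": precision,
        "f_score": _ratio(2 * precision * recall, precision + recall),
        "accuracy": _ratio(cm.tp + cm.tn, cm.n),
        "specificity": _ratio(cm.tn, cm.fp + cm.tn),
        "negative_predictive_value": _ratio(cm.tn, cm.fn + cm.tn),
        "false_positive_rate": _ratio(cm.fp, cm.fp + cm.tn),
        "false_negative_rate": _ratio(cm.fn, cm.tp + cm.fn),
    }


# Hypothetical counts, chosen only for illustration.
example = measures(ConfusionMatrix(tp=40, tn=45, fp=5, fn=10))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Returning NaN for an empty denominator is a design choice of this sketch: it keeps an undefined measure visible instead of silently reporting it as zero.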

Discussion

In this systematic review, we reviewed the current state of NLP algorithms that map clinical text fragments onto ontology concepts with regard to their development and evaluation, in order to propose recommendations for future studies.

Main findings and recommendations

We identified 256 studies that reported on the development of such algorithms, of which 68 did not evaluate the performance of the system. We included 77 studies. Many publications did not report their findings in a structured way, which made it challenging to extract all the data in a reliable manner. We discuss our findings and recommendations in the following five categories: Used NLP systems and algorithms, Used data, Evaluation and validation, Presentation of results, and Generalizability of results. A checklist for determining if the recommendations are followed in the reporting of an NLP study is added as supplementary material to this paper.

Used NLP systems and algorithms

A variety of NLP systems are used in the reviewed studies. Researchers use existing systems (n = 23, 30%), develop new systems with existing components (n = 25, 33%), or develop a completely new system (n = 29, 38%) (Table 5). Most studies, however, do not publish their (adapted) source code (n = 57, 74%), and the description of the algorithm in the final publication is often not detailed enough to replicate it. To ensure reproducibility, implementation details, including details on data processing, and preferably the source code should be published, allowing other researchers to compare their implementations or to reproduce the results. Based on these findings, we formulated three recommendations (Table 8).
Table 8

Recommendation regarding the use of systems and algorithms

1. Describe the system or algorithm that is used or the system that is developed for the specific NLP task.

 1. When an existing NLP system or algorithm is used, describe how it is set up, how it is implemented in practice, and if and how the implementation differs from the original implementation.

 2. When a new system is developed, describe the components and features used in the system, and preferably include a flow chart that explains how these elements work together.

2. Include the source code of the developed algorithm as supplementary material to the publication or upload the source code to a repository such as GitHub.

3. Specify which ontologies are used in the encoding task, including the version of the ontology.

 1. If a new ontology is developed for the encoding task, report on the development and content of the ontology and rationale for the development of a new ontology instead of the use of an existing one. The MIRO guidelines could be used to structure the report [105].


Used data

Most authors evaluate their algorithms with manual annotations (n = 40, 52%) and use data present in their institutions (n = 55, 71%). However, it is not clear what these datasets consist of. Most studies describe the data as ‘reports’, ‘notes’, or ‘summaries’, but do not list the contents or example rows from the dataset. It is, therefore, not clear what types of patients and what specific types of data are included, making the study hard to reproduce. Finally, we found a wide range of dataset sizes and formats. The training datasets, for example, ranged from 10 clinical notes to 636,439 discharge reports. The use of small datasets can result in an overfitted algorithm that either performs well on the dataset but not on an external dataset, or performs poorly because the algorithm was only trained on a specific type of data. More difficult recognition tasks require more data, and therefore sample size planning is recommended [106]. To improve the description and availability of datasets used in NLP studies, we formulated three recommendations (Table 9).
Table 9

Recommendation regarding the use of data

1. To ensure that new algorithms can be compared against your system, aim to publish the used training, development, and validation data in a data repository.

 1. In case the data cannot be published, determine if the data can be accessed on request or can be used in a federated learning approach (i.e., a learning process in which the data owners collaboratively train a model in which process any data owner does not expose the data to others [107]).

2. In case a reference standard is used, include information about the origin of the data (external dataset, subset of the dataset) and the characteristics of the data in the dataset. If possible, reference the dataset using a DOI or URL.

3. If an external dataset is used, give a short description of the data present in the dataset and reference the source of the dataset.


Evaluation and validation

Evaluation of the algorithm determines its performance on the dataset, and validation determines whether the algorithm is overfitted on that dataset and thus whether it might work on other datasets as well. Over one-fourth of the studies that we identified (n = 68, 27%) did not evaluate their algorithms. In addition, 22 included studies (29%) did not validate the developed algorithm. A statement claiming that an algorithm can be used in clinical practice can be questioned if the algorithm has not been evaluated and validated. Across all studies, 20 performance measures were used. To harmonize evaluation and validation efforts, we formulated three recommendations (Table 10).
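The two internal validation designs that dominate Table 6, hold-out validation and k-fold cross-validation, differ only in how the documents are partitioned. The sketch below shows both partitioning schemes using the Python standard library; the function names, the 20% test fraction, the seed, and the placeholder note identifiers are assumptions made for illustration, not settings reported by the included studies.

```python
import random


def hold_out_split(documents, test_fraction=0.2, seed=0):
    """Shuffle once and split the documents into (train, test)."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(documents, k=10, seed=0):
    """Yield k (train, test) pairs; every document is held out exactly once."""
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, held_out


# Placeholder identifiers standing in for clinical notes.
notes = [f"note_{i}" for i in range(50)]
train_set, test_set = hold_out_split(notes)
folds = list(k_fold_splits(notes, k=5))
print(len(train_set), len(test_set), len(folds))
```

Neither scheme is external validation: both draw the test documents from the same source as the training documents. In practice, notes from the same patient would usually need to stay on the same side of a split, a grouping step that this sketch omits.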
Table 10

Recommendation regarding the evaluation and validation of Natural Language Processing algorithms

1. Perform an evaluation using generic (i.e., precision, recall, and F-score) performance measures and appropriate aspects of evaluation including discrimination, calibration, and preferably accuracies of predictions (e.g., AUC, calibration graphs, and the Brier score).

 1. Include a motivation for the choice of measures, with references to existing literature where appropriate (e.g., Sokolova and Lapalme’s analysis of performance measures [108]).

2. Perform an error analysis and discuss the errors in the Discussion section of the paper. Include possible changes to the algorithm that could improve its performance for these specific errors.

3. When using a non-probabilistic NLP method: determine the cut-off value (a priori) for a ‘good’ test result before evaluating the algorithm. Elaborate why this cut-off value is chosen.
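For algorithms that output a probability per text fragment, the AUC and the Brier score named in recommendation 1 can be computed directly from the predicted probabilities and the reference standard. The following is a minimal sketch with hypothetical function names and made-up example values; the AUC is computed here by comparing every positive case with every negative case, which is one of several equivalent ways to obtain it.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


def auc(probabilities, outcomes):
    """Probability that a random positive case is ranked above a random negative one."""
    positives = [p for p, y in zip(probabilities, outcomes) if y == 1]
    negatives = [p for p, y in zip(probabilities, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A tie between a positive and a negative case counts as half a win.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up predictions for four text fragments (1 = concept present).
predicted = [0.9, 0.2, 0.3, 0.1]
observed = [1, 1, 0, 0]
print(brier_score(predicted, observed), auc(predicted, observed))
```

The pairwise AUC computation is quadratic in the number of cases, which is acceptable for a sketch but may be too slow for large evaluation sets.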


Presentation of results

Authors report the evaluation results in various formats. Only twelve articles (16%) included a confusion matrix, which helps the reader understand the results and their impact. Omitting the true positives, true negatives, false positives, and false negatives from the Results section of the publication could lead readers to misinterpret the results. For example, a high F-score in an evaluation study does not directly mean that the algorithm performs well. It is also possible that, out of 100 included cases in the study, there was only one true positive case and 99 true negative cases, indicating that the author should have used a different dataset. Results should be clearly presented to the reader, preferably in a table, as results only described in the text do not provide a proper overview of the evaluation outcomes (Table 11). A table also helps the reader interpret results, as opposed to having to scan a free-text paragraph. Most publications did not perform an error analysis, although such an analysis helps to understand the limitations of the algorithm and suggests topics for future research.
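The scenario of one true positive and 99 true negatives can be worked through numerically. The sketch below uses the F-score written in confusion-matrix counts, 2TP / (2TP + FP + FN), which is algebraically the same as the precision and recall form in Table 7; the function name `f_score` and the second scenario are illustrative assumptions.

```python
def f_score(tp, fp, fn):
    """F-score in confusion-matrix counts: 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else float("nan")


# Scenario from the text: 100 cases, 1 true positive and 99 true negatives.
# True negatives do not enter the F-score at all.
perfect_on_paper = f_score(tp=1, fp=0, fn=0)

# The same 100 cases if the single positive case had been missed instead.
one_case_missed = f_score(tp=0, fp=0, fn=1)

print(perfect_on_paper, one_case_missed)
```

The reported F-score here rests entirely on a single case, which a reader can only see when the underlying counts are published alongside the score.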
Table 11

Recommendation regarding the presentation of results

1. Report the outcomes of the evaluation in a clear manner, preferably in a table accompanied by a textual description of the outcomes.

 1. Aim to include a confusion matrix in the reporting of the outcomes.

2. Use figures if they contribute to making the results more readable and understandable for the reader. If a figure is used, make sure that the data is also available in the text or in a table.


Generalizability of results

Of the included studies, 88% (n = 68) did not perform external validation. Of the studies that claimed that their algorithm was generalizable, only 22% (n = 5) assessed this claim through external validation. However, one cannot claim generalizability without testing for it. Moreover, in 19% (n = 3) of the cases where external datasets were used, the datasets were not referenced and only listed in the text of the article, making it harder to find the used data and reproduce the results. Algorithm performance should be compared to that of other state-of-the-art algorithms, as this helps the reader decide whether the new algorithm could be considered useful for clinical practice. However, only 24 studies (31%) made this comparison, and only four of those studies (17%) tested the performance difference for statistical significance. We also found that the authors’ descriptions of generalizability are often ambiguous. We formulated five recommendations regarding the generalizability of results (Table 12).
Table 12

Recommendation regarding the generalizability of results

1. Compare the results of the evaluated algorithm with other algorithms by using the same dataset as reported in the publication of the other algorithm or by processing the same dataset with another algorithm available through the literature. Report the outcomes of both experiments and test for statistical significance.

2. Describe in what setting the research is performed. Include if the research is part of a challenge (e.g., i2b2 challenge), or that the research is carried out in a specific institute or department.

3. Before claiming generalizability, perform external validation by testing the algorithm on a different, external dataset from other research projects or other publicly available datasets. Aim to use a dataset with a different case mix, different individuals, and different types of text.

4. Determine and describe if there are potential sources of bias in data selection, data use by the NLP algorithm or system, and evaluation.

5. When claiming generalizability, clearly describe the conditions under which the algorithm can be used in a different setting. Describe for which population, domain, and type and language of data the algorithm can be used.
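Recommendation 1 asks for a significance test when two algorithms are compared on the same dataset but does not prescribe a particular test. As one possible illustration, the sketch below applies an exact McNemar test to paired per-item outcomes (whether each algorithm annotated an item correctly). The choice of McNemar's test, the function name, and the example data are assumptions of this sketch, not part of the recommendations.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness flags."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both algorithms must be scored on the same items")
    # Only items on which the two algorithms disagree carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    discordant = only_a + only_b
    if discordant == 0:
        return 1.0
    # Under the null hypothesis each discordant item favours A or B with p = 0.5.
    tail = sum(comb(discordant, k) for k in range(min(only_a, only_b) + 1))
    return min(1.0, 2 * tail / 2 ** discordant)


# Made-up outcomes for 20 items: A alone is correct on 8, B alone on 1.
algorithm_a = [True] * 8 + [False] * 1 + [True] * 11
algorithm_b = [False] * 8 + [True] * 1 + [True] * 11
p_value = mcnemar_exact(algorithm_a, algorithm_b)
print(f"p = {p_value:.4f}")
```

A paired test of this kind requires item-level outputs from both algorithms, which is one practical reason to publish predictions or source code alongside aggregate scores.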


Strengths

Our study has three main strengths: First, to our knowledge, this is the first systematic review that focuses on the evaluation of NLP algorithms in medicine. Second, we used a large number of databases for our search, resulting in publications from many different sources, such as medical journals and computer science conferences. Third, we used existing statements and guidelines and harmonized them to induce our findings and used these findings to propose a list of recommendations.

Limitations

Several limitations of our study should be noted as well. First, we only included studies that evaluated the outcomes of the developed algorithms. Second, the majority of the studies found by our literature search used NLP methods that are not considered to be state of the art. Only a small proportion of the included studies used state-of-the-art NLP methods, such as word and graph embeddings. This indicates that these methods are not yet broadly applied in algorithms that map clinical text to ontology concepts in medicine and that future research into these methods is needed. Lastly, we did not focus on the outcomes of the evaluation, nor did we exclude publications that were of low methodological quality. However, we feel that NLP publications are too heterogeneous to compare and that including all types of evaluations, including those of lesser quality, gives a good overview of the state of the art.

Conclusion

In this study, we found many heterogeneous approaches to the development and evaluation of NLP algorithms that map clinical text fragments to ontology concepts and the reporting of the evaluation results. Over one-fourth of the publications that report on the use of such NLP algorithms did not evaluate the developed or implemented algorithm. In addition, over one-fourth of the included studies did not perform a validation and nearly nine out of ten studies did not perform external validation. Of the studies that claimed that their algorithm was generalizable, only one-fifth tested this by external validation. Based on the assessment of the approaches and findings from the literature, we developed a list of sixteen recommendations for future studies. We believe that our recommendations, along with the use of a generic reporting standard, such as TRIPOD, STROBE, RECORD, or STARD, will increase the reproducibility and reusability of future studies and algorithms.