
Developing a scalable FHIR-based clinical data normalization pipeline for standardizing and integrating unstructured and structured electronic health record data.

Na Hong1, Andrew Wen1, Feichen Shen1, Sunghwan Sohn1, Chen Wang1, Hongfang Liu1, Guoqian Jiang1.   

Abstract

OBJECTIVE: To design, develop, and evaluate a scalable clinical data normalization pipeline for standardizing unstructured electronic health record (EHR) data leveraging the HL7 Fast Healthcare Interoperability Resources (FHIR) specification.
METHODS: We established an FHIR-based clinical data normalization pipeline known as NLP2FHIR that mainly comprises: (1) a module for a core natural language processing (NLP) engine with an FHIR-based type system; (2) a module for integrating structured data; and (3) a module for content normalization. We evaluated the FHIR modeling capability focusing on core clinical resources such as Condition, Procedure, MedicationStatement (including Medication), and FamilyMemberHistory using Mayo Clinic's unstructured EHR data. We constructed a gold standard reusing annotation corpora from previous NLP projects.
RESULTS: A total of 30 mapping rules, 62 normalization rules, and 11 NLP-specific FHIR extensions were created and implemented in the NLP2FHIR pipeline. The elements that need to integrate structured data from each clinical resource were identified. The performance of unstructured data modeling achieved F scores ranging from 0.69 to 0.99 for various FHIR element representations (0.69-0.99 for Condition; 0.75-0.84 for Procedure; 0.71-0.99 for MedicationStatement; and 0.75-0.95 for FamilyMemberHistory).
CONCLUSION: We demonstrated that the NLP2FHIR pipeline is feasible for modeling unstructured EHR data and integrating structured elements into the model. The outcomes of this work provide standards-based tools for clinical data normalization, which is indispensable for enabling portable EHR-driven phenotyping and large-scale data analytics, as well as useful insights for future developments of the FHIR specification with regard to handling unstructured clinical data.
© The Author(s) 2019. Published by Oxford University Press on behalf of the American Medical Informatics Association.

Keywords:  Fast Healthcare Interoperability Resources; data standards; electronic health records; natural language processing

Year:  2019        PMID: 32025655      PMCID: PMC6993992          DOI: 10.1093/jamiaopen/ooz056

Source DB:  PubMed          Journal:  JAMIA Open        ISSN: 2574-2531


INTRODUCTION

With the widespread adoption of electronic health records (EHRs) in healthcare organizations, there is ample opportunity for secondary use of EHR data in clinical and translational research. However, the lack of EHR data interoperability between institutions makes it challenging to integrate and share healthcare and clinical research data, thus impeding effective and efficient collaboration. A standardized model for data representation would promote the exchange of EHR data, enable large-scale data-driven research collaborations, and support rapid generation of accurate and computable phenotypes. As a next-generation standards framework, Fast Healthcare Interoperability Resources (FHIR) was developed by HL7 to meet clinical interoperability needs. FHIR defines a collection of “resources” that “can easily be assembled into working systems that solve real-world clinical and administrative problems at a fraction of the price of existing alternatives.” This assembly process typically requires “profiling,” that is, the adaptation of the FHIR core resources for use in particular contexts and use cases. FHIR also leverages the latest web standards and places a strong focus on implementability. Notably, major EHR vendors (eg, Epic, Cerner) and healthcare providers (eg, Mayo Clinic, Intermountain Healthcare, and Partners Healthcare) have been involved in the development and adoption of FHIR through the HL7 Argonaut Project. In the clinical research domain, FHIR-based standard Application Programming Interfaces (APIs) have been leveraged in a national collaboration known as the Sync for Science (S4S) initiative to help patients share EHR data with researchers and to empower individuals to participate in health research. While FHIR is rapidly being adopted in different EHR systems at various institutions, there are a number of gaps in how to represent unstructured information in clinical narratives using FHIR.
First, there are unmet needs in standardizing unstructured clinical data. The recent proposal from the Office of the National Coordinator for Health Information Technology (ONC) and the Centers for Medicare & Medicaid Services (CMS) that FHIR APIs be required for certified EHR systems highlighted the importance of the FHIR standard. In particular, using natural language processing (NLP) to gain access to the narrative content in EHRs via FHIR will be of great value to data analytics, quality improvement, and advanced decision support. However, the current HL7 Argonaut Project has not yet provided a solution for standardizing unstructured data. Second, although it is certainly feasible to use the FHIR Composition as a document resource for representing clinical narratives in EHRs, few studies have addressed (1) tool development for generating FHIR resource instances from clinical narratives using NLP technology; and (2) assessment of the discrepancies between FHIR data models and NLP type systems. The DeepPhe project created a conceptual model (DeepPhe Ontology) leveraging FHIR models to provide a terminology of entities and relationships between them to represent cancer phenotypes extracted from unstructured EHR data. The DeepPhe project focused more on adapting FHIR data models to represent cancer phenotypes than on developing a data normalization pipeline to formally model unstructured data and NLP outputs using the FHIR specification. In addition, no work has been done to assess whether FHIR can represent core elements (eg, negation and certainty) from different clinical NLP systems for handling unstructured clinical data. Third, a common type system for clinical NLP has been regarded as an important way to enable interoperability between structured and unstructured data generated in different clinical settings.
As part of the SHARPn data normalization pipeline, cTAKES implemented a common type system that has an end target of deep semantics based on the clinical element models (CEMs). In the context of secondary use of EHR data, we envision that an FHIR standard-based common type system would better improve semantic interoperability between heterogeneous clinical data sources, given the rapid adoption of FHIR as an international standard in different EHR systems. This novel FHIR-based type system not only can enable effective exchange, integration, sharing, and reuse of encoded and structured clinical narratives, along with well-structured EHR data, but it can also serve as a target data model for advanced development of NLP systems. The latter includes the following two innovative aspects: (1) a well-defined target data model based on the FHIR type system allows us to easily integrate multiple distinct NLP pipelines, each of which may have its own specialties; and (2) FHIR provides a powerful modeling mechanism that enables the creation of new standard models for particular NLP-based information retrieval tasks, for example, cancer-specific phenotype extraction. The objective of this study was to design, develop, and evaluate a scalable and standards-based EHR data modeling framework and accompanying clinical data normalization pipeline leveraging the HL7 FHIR specification. We implemented a generic pipeline known as NLP2FHIR for modeling unstructured EHR data using the FHIR specification and evaluated the main outcomes as well as the performance of our pipeline using EHR data from the Mayo Clinic.

MATERIALS AND METHODS

Materials

Clinical narrative corpora

To support the experiment and evaluation of the NLP2FHIR pipeline, we reused a corpus of 734 clinical notes from Mayo Clinic’s previous clinical NLP research projects, including SHARPn, the Mayo MedXN project, and the Mayo Clinic’s family member history (FMH) NLP project. These notes were randomly collected from Mayo Clinic’s EHR. Four section types (ie, problem list, family history, medication list, and past procedure list) with 940 individual sections were used for unstructured data modeling in this study. These corpora had previously been annotated by clinical subject matter experts for research purposes.

UIMA-based clinical NLP tools

UIMA, short for Unstructured Information Management Architecture, is a data-driven architecture in which individual components communicate with one another through a data structure called the Common Analysis Structure (CAS), which uses a specified hierarchical type system. The type system allows for flexible passing of input and output data types between components of an NLP system. In this study, the NLP2FHIR pipeline implementation integrated three UIMA-based clinical NLP tools: (1) cTAKES, an open-source NLP system for extracting information from clinical free text in EHRs, which provides a tool for selecting different descriptors to support common clinical NLP tasks (eg, part-of-speech tagging, chunking, and dictionary lookup); (2) MedXN, an open-source medication entity/attribute extraction and normalization tool, which extracts comprehensive medication information and normalizes it to the most specific applicable RxNorm concept unique identifier (RxCUI); and (3) MedTime, an open-source temporal information detection system, which identifies EVENT and TIMEX3 expressions and temporal links (TLINKs) in clinical text.

FHIR specification and application programming interface

The building block in FHIR is a Resource, which provides a common way to define and represent all exchangeable content and related metadata in a particular modeling domain. In this study, we leveraged both the document resources Composition and Bundle and the clinical resources Condition, Procedure, MedicationStatement/Medication, and FamilyMemberHistory to model unstructured EHR data and NLP outputs. Although FHIR R4 had been officially released as of September 10, 2019, we used an earlier version, the Standard for Trial Use version 3 (STU3), released on April 19, 2017, in this study. HAPI FHIR, an open-source Java implementation of the FHIR specification, was used as the FHIR modeling interface in our implementation. HAPI FHIR defines model classes for every resource type and data type defined by the FHIR specification. In addition, HAPI supports data validation within FHIR modeling. Therefore, we used the HAPI FHIR application programming interface to serialize elements extracted from clinical documents into standard FHIR eXtensible Markup Language (XML) and JavaScript Object Notation (JSON) representations.
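
To make the serialization step concrete, the following is a minimal sketch of what a Condition instance might look like once rendered as FHIR JSON. The study itself used the HAPI FHIR Java model classes; this snippet is only a plain Python illustration that uses no FHIR library and performs no validation. The helper name `build_condition` and the patient id `example-1` are assumptions made for this example.

```python
import json


def build_condition(snomed_code, display, patient_id):
    """Return a dict shaped like a minimal FHIR Condition resource."""
    return {
        "resourceType": "Condition",
        "code": {
            "coding": [
                {
                    "system": "http://snomed.info/sct",
                    "code": snomed_code,
                    "display": display,
                }
            ],
            "text": display,
        },
        # Hypothetical patient reference, for illustration only.
        "subject": {"reference": f"Patient/{patient_id}"},
    }


condition = build_condition("140004", "Chronic pharyngitis", "example-1")
condition_json = json.dumps(condition, indent=2)
print(condition_json)
```

A library such as HAPI FHIR adds typed model classes and validation on top of this basic shape, which is why the pipeline relies on it rather than on hand-built structures.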

Methods

Figure 1 shows the system architecture of the FHIR-based clinical data normalization pipeline. The NLP2FHIR pipeline comprises the following three modules: (1) a module for a core NLP engine with an FHIR-based type system, (2) a module for integrating structured data, and (3) a module for content normalization. In addition, a graphical user interface allows users to configure the pipeline with the following parameters: Unified Medical Language System (UMLS) username and password, input directory and type (eg, TEXT or COMPOSITION_RESOURCE), section definition directory and file, resources to produce, and output directory and formats (eg, FHIR JSON). Figure 2 shows a screenshot of the graphical user interface of the implemented NLP2FHIR pipeline.
Figure 1. NLP2FHIR pipeline for EHR data modeling. EHR: electronic health record; FHIR: Fast Healthcare Interoperability Resources.

Figure 2. A screenshot of the graphical user interface of the implemented NLP2FHIR pipeline. Input type can be COMPOSITION_RESOURCE, BUNDLE_RESOURCE, XMI, or TEXT.


Module for a core NLP engine

Unstructured clinical documents (eg, clinical notes, radiology reports) usually convey large amounts of valuable information. The module for modeling unstructured data contains the following components:

Rendering clinical documents in FHIR Composition resource as input: As a type of FHIR document resource, the Composition resource defines a set of elements that are assembled together into a single logical document and provides a coherent statement for meaningful document representation. Therefore, we used the FHIR Composition resource to standardize variants of clinical documents as standard input of the FHIR NLP engine. A collection of standard Logical Observation Identifiers Names and Codes (LOINC) codes were assigned for encoding sections, including Reported Problem List (11450-4), History of Present Illness Narrative (10164-2), History of Medication Use Narrative (10160-0), History of Procedures Document (47519-4), and History of Family Member Diseases Narrative (10157-6). After analyzing section content and FHIR resource definitions, we created mappings from the sections to FHIR resources.

Integrating existing clinical NLP tools as the NLP engine of the NLP2FHIR pipeline: We integrated the existing clinical NLP tools cTAKES, MedXN, and MedTime to extract clinical entities from the corresponding document sections, and standardized them using the FHIR resources Condition, Procedure, MedicationStatement (including Medication), and FamilyMemberHistory. Different tools were used to handle different clinical narrative extraction tasks. cTAKES and MedTime were used for the FHIR element entity and relation extraction tasks from the problem list (corresponding to FHIR Condition), past history of surgery (corresponding to FHIR Procedure), and FMH (corresponding to FHIR FamilyMemberHistory). For the medication list (corresponding to the FHIR MedicationStatement and Medication), MedXN and MedTime were set up for extracting and standardizing drug names and drug-related temporal expressions.

Creating a FHIR-based type system to interoperate with UIMA-based NLP tools: UIMA provides a software framework for building type systems while supporting interaction between multiple NLP components. To allow for rapid integration of the NLP tooling output or particular FHIR element extraction results, we generated an FHIR-based type system using the FHIR Standard for Trial Use (STU) 3 v1.8.0 specification. The FHIR-based type system is used to meet the need for interoperability between different NLP pipelines; it enhances NLP component interoperability by maintaining the consistent naming of elements, structure hierarchy, and data restrictions present within the FHIR definitions.

Defining mapping rules: We compared the NLP output types with the FHIR specification, and reassembled the extraction outputs of the NLP tools by creating mapping rules between heterogeneous NLP outputs and standard FHIR elements. Mappings were defined at two levels of granularity: (1) narrative sections to FHIR resources and (2) NLP output types to FHIR elements. In total, 30 mapping rules of different types and levels were generated to support integrating heterogeneous NLP outputs into our NLP2FHIR pipelines, and 59 target FHIR elements could be directly populated from NLP tools. Table 1 shows examples of the mapping rules. Additional details are provided in Supplementary Material S2.

Table 1. Examples of mapping rules between EHR sources, NLP output types, and FHIR elements

Source | NLP output types | FHIR elements | Mapping types | Examples
Medication list | MedXN: Drug | MedicationStatement.medicationCodeableConcept | 1:1 | Oxamniquine → [SNOMED: 747006]
Medication list | MedXN: Drug: attributes: type="frequency"; MedTime: MedTimex3: type="SET" | MedicationStatement.dosage.timing.frequency; MedicationStatement.dosage.timing.frequencyMax; MedicationStatement.dosage.timing.period; MedicationStatement.dosage.timing.periodMax; MedicationStatement.dosage.timing.periodUnit; MedicationStatement.dosage.asNeeded.asNeededBoolean; MedicationStatement.dosage.timing.dayofWeek; MedicationStatement.dosage.timing.when | 1:n | Once daily → 1[frequency], 1[period], d[periodUnit]; 4–6 times → 4[frequency], 6[frequencyMax]; As needed for heel pain → true; Regular: once daily; every six hours; Irregular: as needed for pain; every Monday, Tuesday Wednesday
Medication list | MedXN: Drug: attributes: type="duration"; MedTime: MedTimex3: type="DURATION" | Dosage.duration; Dosage.durationMax; Dosage.durationUnit | 1:n | 3 days → 3[duration], d[durationUnit]
Medication list | MedXN: Drug: attributes: type="route" | Dosage.route | 1:1 | By mouth [oral route]
Medication list | MedXN: Drug: attributes: type="strength" | Medication.ingredient.amount.numerator.quantity.value; Medication.ingredient.amount.numerator.quantity.unit; Medication.ingredient.amount.denominator.quantity.value; Medication.ingredient.amount.denominator.quantity.unit | 1:n | Regular: 500 mg /5 mL → 500[numerator.quantity.value], mg[numerator.quantity.unit], 5[denominator.quantity.value], mL[denominator.quantity.unit]; Irregular: 200 mg → Default assign: 1[denominator.quantity.value]
Medication list | MedXN: Drug: attributes: type="form" | Medication.form | 1:1 | tab[Tablet]
Medication list | MedXN: Drug: attributes: type="dosage" | Dosage.doseQuantity.value; Dosage.doseQuantity.unit; Dosage.doseQuantity.Range.low.value; Dosage.doseQuantity.Range.low.unit; Dosage.doseQuantity.Range.high.value; Dosage.doseQuantity.Range.high.unit | 1:n | 10 mL → 10[value], mL[unit]; 2–3 tabs → 2[range.low.value], tab[range.low.unit], 3[range.high.value], tab[range.high.unit]
Problem list | cTAKES: Disease_disorder | Condition.code | 1:1 | The Lingering sore throat → [SNOMED: 140004] / Chronic pharyngitis
Problem list | cTAKES: Anatomical_Site; relations: type="LocationOf" | Condition.bodySite | 1:1 | Back of the head → [SNOMED: 774007] / Head and neck
Problem list | cTAKES: modifier: type="Severity" | Condition.severity | 1:1 | Very bad → [SNOMED: 24484000] / Severe
Family history | Relation; relations: type:"SideOfFamily"; relations: type:"Blood"; relations: type:"Adopted" | FamilyMemberHistory.relationship | n:1 | Grandpa → MGRFTH / maternal grandfather
Laboratory test | test_code | Observation.code | 1:1 | Albumin in Semen → [LOINC: 10558-5] / Albumin [Moles/volume] in Semen
Section | Source Section | Composition.section.code | 1:1 | Family history → [LOINC: 10157-6] / History of family member diseases narrative
Temporal information | MedTime: MedTimex3: type="DATE" | MedicationStatement.effectiveDatetime | 1:1 | April 16th
Temporal information | MedTime: MedTimex3: type="TIME" | MedicationStatement.effectiveDatetime; Dosage.timeofDay | 1:n | April 8, 2008 at 04:38 pm

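
The 1:n rows in Table 1 can be read as small parsing rules that fan one NLP attribute out into several FHIR elements. The sketch below illustrates this for two of the frequency examples and for the strength example. It is an illustration only: the function names, lookup tables, and regular expressions are hypothetical and are not taken from the NLP2FHIR code base, which builds on MedXN and MedTime output rather than on raw strings.

```python
import re

# Hypothetical lookups covering only the phrasings handled in this sketch.
_WORD_COUNTS = {"once": 1, "twice": 2}
_PERIOD_UNITS = {"daily": "d", "weekly": "wk"}


def map_frequency(text):
    """Map a frequency mention to several timing elements (1:n mapping)."""
    text = text.strip().lower()
    # "4-6 times" -> frequency and frequencyMax
    m = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)\s+times", text)
    if m:
        return {"frequency": int(m.group(1)), "frequencyMax": int(m.group(2))}
    # "once daily" -> frequency, period, periodUnit
    m = re.fullmatch(r"(once|twice)\s+(daily|weekly)", text)
    if m:
        return {
            "frequency": _WORD_COUNTS[m.group(1)],
            "period": 1,
            "periodUnit": _PERIOD_UNITS[m.group(2)],
        }
    return {}  # unrecognized mention: leave the elements unpopulated


def map_strength(text):
    """Map a strength mention to numerator and denominator quantities."""
    m = re.fullmatch(
        r"\s*([\d.]+)\s*([A-Za-z]+)\s*(?:/\s*([\d.]+)\s*([A-Za-z]+))?\s*", text
    )
    if not m:
        return {}
    result = {"numerator": {"value": float(m.group(1)), "unit": m.group(2)}}
    if m.group(3):
        result["denominator"] = {"value": float(m.group(3)), "unit": m.group(4)}
    else:
        # Irregular case from Table 1: default the denominator value to 1.
        result["denominator"] = {"value": 1.0}
    return result


print(map_frequency("Once daily"))
print(map_strength("500 mg /5 mL"))
```

Returning an empty dict for unrecognized mentions keeps the rule set conservative: an element stays unpopulated rather than being filled with a guess.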
Creating NLP-specific FHIR extensions: We noticed that the current FHIR resource definitions did not cover all the elements from NLP outputs, and some NLP-specific elements of these outputs were essential within the context of subsequent downstream analytics. Therefore, we created a list of FHIR extensions to keep these NLP-specific elements by analyzing a set of clinical NLP elements defined in the latest OHDSI CDM v6.0, the cTAKES type system definitions, and input from NLP experts. Table 2 shows a group of 11 common NLP-specific FHIR extensions created for supporting extended unstructured EHR data normalization. The extension elements were aligned for semantic overlap or similarity through review by NLP experts. The NLP-specific elements defined in the FHIR extensions were reviewed using the following two basic criteria: (1) whether the element was commonly identified in clinical narratives; and (2) whether the existing NLP tools could handle the entity/relation element extraction.
Table 2. Proposed FHIR NLP extensions for clinical NLP

Proposed FHIR NLP extension | FHIR resource | Definition reference sources (a) | Description
offset | Any | [Ref: cTAKES/LineAndTokenPosition]; [Ref: OHDSI NLP/offset] | Token line and offset of the extracted term in the input note
raw_text | Any | [Ref: OHDSI NLP/lexical_variant] | Raw text extracted from the NLP tool
context | Any | [Ref: cTAKES/LookupWindowAnnotation]; [Ref: cTAKES/ContextAnnotation]; [Ref: OHDSI NLP/snippet] | Contextual information of an entity
nlp_system | Any | [Ref: OHDSI NLP/nlp_system] | Name and version of the NLP system that extracted the term. Useful for data provenance
nlp_date/nlp_datetime | Any | [Ref: OHDSI NLP/nlp_date, nlp_datetime] | The date or datetime of the note processing. Useful for data provenance
term_temporal | Any | [Ref: cTAKES/HistoryOfModifier]; [Ref: OHDSI NLP/term_temporal] | The time modifier associated with the extracted term
confidence_score | Any | NLP experts inputs | The confidence score indicates the probability of accuracy with the extracted term
conditional_modifier | Any | [Ref: cTAKES/ConditionalModifier] | Used to indicate that a procedure or assertion occurs under certain conditions
negated_modifier | Condition; Procedure; Medication | [Ref: cTAKES/PolarityModifier] | Used to indicate that a procedure or assertion did not occur or does not exist
certainty_modifier | Condition | [Ref: cTAKES/UncertaintyModifier] | An introduction of a measure of doubt into a statement
LabDeltaFlag_modifier | Observation | [Ref: cTAKES/ssLabDeltaFlagModifier] | An indicator to warn that the laboratory test result has changed significantly from the previous identical laboratory test result

Abbreviations: FHIR: Fast Healthcare Interoperability Resources; NLP: natural language processing.

(a) For expansions of abbreviations used in definition reference sources, please refer to text.
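
As an illustration of how the extensions in Table 2 could travel with a resource, the sketch below attaches a few of them to a Condition using the generic FHIR extension shape (a `url` plus one `value[x]` element). The base URL is a placeholder, because the text does not give the actual extension URLs; the helper names and the example values (including the confidence score) are likewise assumptions for illustration.

```python
# Placeholder base URL; the real NLP2FHIR extension URLs are not given in the text.
EXT_BASE = "http://example.org/fhir/StructureDefinition/nlp-"


def nlp_extension(name, value):
    """Build one FHIR extension entry, choosing value[x] from the Python type."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        key = "valueBoolean"
    elif isinstance(value, int):
        key = "valueInteger"
    elif isinstance(value, float):
        key = "valueDecimal"
    else:
        key = "valueString"
        value = str(value)
    return {"url": EXT_BASE + name, key: value}


def annotate(resource, **nlp_fields):
    """Return a copy of the resource with NLP-specific extensions appended."""
    annotated = dict(resource)
    annotated["extension"] = list(resource.get("extension", [])) + [
        nlp_extension(name, value) for name, value in nlp_fields.items()
    ]
    return annotated


condition = {"resourceType": "Condition", "code": {"text": "Chronic pharyngitis"}}
annotated = annotate(
    condition,
    raw_text="lingering sore throat",  # illustrative values only
    negated_modifier=False,
    confidence_score=0.93,
)
print(annotated["extension"])
```

Keeping the NLP-specific fields in extensions leaves the core resource elements untouched, so consumers that do not understand the extensions can still read the resource.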


Module for integrating structured data

Although the entity mentions extracted from clinical narratives using NLP tools covered the majority of the elements defined in the FHIR Composition resource and its referenced clinical resources, several pieces of information still needed to be captured from structured EHR data and integrated with the NLP output to complete the population of the corresponding FHIR resource content. The crucial steps for integrating structured data with NLP output were: (1) setting templates for mapping the structured source data elements to the corresponding FHIR resource elements; (2) extracting the instance data from the EHR, applying normalization where needed; and (3) linking structured instance data with NLP output through a primary key reference (eg, patient id) or directly as an attribute defined within an FHIR resource. For example, when populating each instance of the FHIR MedicationStatement resource, we could directly get its subject information (ie, who is/was taking the medication) from structured EHR data and link each subject to the specific MedicationStatement instance through the Reference (Patient|Group) identifier. Table 3 lists the information that was captured from structured EHR data and integrated with each component of the NLP2FHIR pipeline.
Table 3. Structured data integrated from EHRs for the NLP2FHIR pipeline

NLP2FHIR pipeline | Elements populated from structured data | Data type | Definitions
Condition | Condition.clinicalStatus | CodeableConcept | active, recurrence, relapse, inactive, remission, resolved (HL7 ValueSet: ConditionClinicalStatusCodes)
Condition | Condition.category | CodeableConcept | problem-list-item, encounter-diagnosis (HL7 ValueSet: ConditionCategoryCodes)
Condition | Condition.subject | Reference | Who has the condition
Condition | Condition.encounter | Reference | The encounter during which this condition was created or diagnosed
Condition | Condition.recordedDate | dateTime | Date record was first recorded
Procedure | Procedure.status | code | A code specifying the state of the procedure. Generally, this will be the in-progress or completed state
Procedure | Procedure.subject | Reference | The person, animal or group on which the procedure was performed
Procedure | Procedure.category | CodeableConcept | Classification of the procedure
Procedure | Procedure.encounter | Reference | The Encounter during which this Procedure was created or performed or to which the creation of the record is tightly associated
MedicationStatement | MedicationStatement.status | code | active, completed, entered-in-error, intended, stopped, on-hold, unknown, not-taken
MedicationStatement | MedicationStatement.subject | Reference | Who is/was taking the medication
MedicationStatement | MedicationStatement.category | CodeableConcept | Type of medication usage (SNOMED CT)
MedicationStatement | MedicationStatement.dateAsserted | dateTime | When the statement was asserted
FMH | FamilyMemberHistory.status | code | partial, completed, entered-in-error, health-unknown (HL7 ValueSet: FamilyHistoryStatus)
FMH | FamilyMemberHistory.dataAbsentReason | CodeableConcept | subject-unknown, withheld, unable-to-obtain, deferred (HL7 ValueSet: FamilyHistoryAbsentReason)
FMH | FamilyMemberHistory.patient | Reference | Patient history is about
FMH | FamilyMemberHistory.date | dateTime | When history was recorded or last updated

Abbreviations: EHR: electronic health record; FHIR: Fast Healthcare Interoperability Resources; HL7: Health Level Seven International; NLP: natural language processing.
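
The linking step described for this module can be sketched as a simple join on a shared key. The example below is a hedged illustration, not the pipeline's implementation: the function name, the note-to-patient lookup, and the layout of the structured rows are all assumptions, and a real EHR extract would typically key status and dates per medication record rather than per patient.

```python
def integrate_structured(nlp_resources, structured_rows, note_to_patient):
    """Attach structured EHR values to NLP-derived resources.

    nlp_resources: list of (note_id, resource dict) pairs from the NLP engine.
    structured_rows: dict keyed by patient id, holding structured values.
    note_to_patient: dict mapping note id to patient id (the linking key).
    """
    merged = []
    for note_id, resource in nlp_resources:
        patient_id = note_to_patient.get(note_id)
        row = structured_rows.get(patient_id, {})
        enriched = dict(resource)  # shallow copy; the NLP output is not mutated
        if patient_id is not None:
            enriched["subject"] = {"reference": f"Patient/{patient_id}"}
        for element in ("status", "dateAsserted"):
            if element in row:
                enriched[element] = row[element]
        merged.append(enriched)
    return merged


# Hypothetical inputs for illustration.
nlp_resources = [
    (
        "note-1",
        {
            "resourceType": "MedicationStatement",
            "medicationCodeableConcept": {"text": "Oxamniquine"},
        },
    )
]
structured_rows = {"p-001": {"status": "active", "dateAsserted": "2010-04-16"}}
note_to_patient = {"note-1": "p-001"}

merged = integrate_structured(nlp_resources, structured_rows, note_to_patient)
print(merged[0])
```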


Module for content normalization

Content normalization makes the resource content conform to the FHIR specification in terms of its datatype definitions for corresponding model elements and its content semantics through terminology binding. As mentioned previously, we leveraged a number of core FHIR resources (Condition, Procedure, MedicationStatement, and FamilyMemberHistory) to capture clinical concepts identified from the unstructured narratives. Therefore, we followed the recommendations in the definitions of these core resources on the use of preferred code systems. In addition, many FHIR elements have specific datatype requirements (eg, boolean, integer, string, and decimal); thus, we implemented datatype conversion and value transformation to match the target element definitions. Handling terminology binding is one of the concept normalization tasks, which requires binding an FHIR element with the identity and version of a terminology system, the codes, and their display names, as shown in Supplementary Table 2. In addition to standard codes defined in external terminologies, FHIR also defines its own value sets with a list of codes in its specification. We created a set of transformation rules to normalize the element instances in terms of terminology binding. For example, tab is an instance for the element Medication.form, which is normalized to Tablet (385055001) defined in the SNOMED CT Form Codes. A number of NLP tools support concept normalization for the identified entities. For instance, MedXN normalizes a variety of nonstandard medication mentions to RxNorm codes, and cTAKES assigns UMLS concept unique identifiers to the extracted entities. However, the code systems recommended by FHIR may not be the same as those used in existing NLP tools. For example, FHIR suggests the use of SNOMED CT codes for the element “MedicationStatement.medicationCodeableConcept,” but we acquired its corresponding RxNorm codes from MedXN. In such situations, terminology mapping is necessary.
Therefore, we used manually created transformation rules and leveraged existing terminology mappings as the main methods for content normalization. Although the normalization pipeline outputs a variety of individual resources, these resources are directly or indirectly related to each other. Following the FHIR specification, we normalized various expressions from source EHR data using a group of normalization rules. A total of 62 normalization rules were created and implemented (Table 4). Other value set and data type conformance requirements for each FHIR element are included in Supplementary Material S2.
Table 4. Normalization results for each NLP2FHIR pipeline

NLP2FHIR pipeline | No. of rules | Element examples | Data type | Normalization examples
MedicationStatement | 25 | MedicationStatement.medicationCodeableConcept | CodeableConcept | Oxamniquine → 747006[coding.code]
MedicationStatement | | MedicationStatement.dosage.timing.frequency | integer | Once daily → 1[frequency]
MedicationStatement | | MedicationStatement.dosage.asNeeded.asNeededBoolean | boolean | As needed for heel pain → true
MedicationStatement | | MedicationStatement.dosage.timing.dayofWeek | code | Every Monday → mon[http://hl7.org/fhir/ValueSet/days-of-week]
Procedure | 10 | Procedure.code | CodeableConcept | Kidney echography → 306005/echography of kidney
Procedure | | Procedure.reasonCode | CodeableConcept | 134006/decreased hair growth
Procedure | | Procedure.performed[x].performedDateTime | dateTime | April 16th, 2010
Condition | 13 | Condition.code | CodeableConcept | The Lingering sore throat → 140004/Chronic pharyngitis
Condition | | Condition.bodySite | CodeableConcept | 774007/Head and neck
Condition | | Condition.abatementString | string | Resolved
FMH | 14 | FamilyMemberHistory.condition.code | CodeableConcept | 3511005/Infectious thyroiditis
FMH | | FamilyMemberHistory.relationship | CodeableConcept | MGRFTH/maternal grandfather

Abbreviations: FHIR: Fast Healthcare Interoperability Resources; MGRFTH: a role code for maternal grandfather; NLP: natural language processing.
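
The normalization rules in Table 4 combine terminology binding with datatype conversion. The following sketch shows both for three of the examples mentioned in the text (tab, Every Monday, and As needed for heel pain). The rule tables and function names are hypothetical; only the `tab` to Tablet (385055001) entry and the Monday to `mon` entry are taken from the text, and the remaining day codes follow the FHIR days-of-week value set.

```python
import re

SNOMED = "http://snomed.info/sct"

# Hypothetical rule tables for this sketch.
FORM_RULES = {"tab": ("385055001", "Tablet")}
DAY_RULES = {
    "monday": "mon", "tuesday": "tue", "wednesday": "wed", "thursday": "thu",
    "friday": "fri", "saturday": "sat", "sunday": "sun",
}


def normalize_form(mention):
    """Bind a dose form mention to a SNOMED CT coded CodeableConcept."""
    rule = FORM_RULES.get(mention.strip().lower())
    if rule is None:
        return None  # no rule: leave the element for manual review
    code, display = rule
    return {
        "coding": [{"system": SNOMED, "code": code, "display": display}],
        "text": mention,
    }


def normalize_days(mention):
    """Convert day names in a mention to days-of-week codes (datatype: code)."""
    words = re.findall(r"[a-z]+", mention.lower())
    return [DAY_RULES[word] for word in words if word in DAY_RULES]


def normalize_as_needed(mention):
    """Convert an 'as needed' mention to a boolean value."""
    return mention.strip().lower().startswith("as needed")


print(normalize_form("tab"))
print(normalize_days("Every Monday"))
print(normalize_as_needed("As needed for heel pain"))
```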

In the FHIR specification, the Bundle resource is a container that gathers a collection of resources into a single instance with a specific context. In this study, the FHIR Bundle resource is used to contain both the instances of the FHIR Composition resource and its referenced clinical resources. We developed a wrapping process as a part of the NLP2FHIR pipeline for connecting individual resources into an exchangeable Bundle resource that preserves complete semantics to support secondary use of the standardized instance data. An example of the FHIR Bundle resource is shown in Figure 3.
Figure 3.

Example of the FHIR Bundle resource with a standard section of “Problem List - Reported (LOINC: 11450-4)” and its referenced FHIR resources. FHIR: Fast Healthcare Interoperability Resources; LOINC: Logical Observation Identifiers Names and Codes.
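
The wrapping step described above can be sketched as follows. This is an illustrative approximation, not the pipeline's code: the helper names are hypothetical, the Bundle type `document` is an assumption, and the Composition omits elements that the FHIR specification requires (such as `type`, `date`, and `author`), so the output is not a validated instance.

```python
import uuid

LOINC = "http://loinc.org"
SNOMED = "http://snomed.info/sct"


def make_condition(code, display):
    """Build a minimal Condition resource (hypothetical helper)."""
    return {
        "resourceType": "Condition",
        "id": str(uuid.uuid4()),
        "code": {"coding": [{"system": SNOMED, "code": code, "display": display}]},
    }


def wrap_in_bundle(section_code, section_title, resources):
    """Connect resources to a Composition section and wrap all in one Bundle."""
    section = {
        "title": section_title,
        "code": {"coding": [{"system": LOINC, "code": section_code}]},
        # Each section entry points at a resource carried in the same Bundle.
        "entry": [{"reference": f"urn:uuid:{r['id']}"} for r in resources],
    }
    composition = {
        "resourceType": "Composition",
        "id": str(uuid.uuid4()),
        "status": "final",
        "title": "Illustrative NLP output",
        "section": [section],
    }
    # The Composition goes first, followed by the resources it references.
    members = [composition] + list(resources)
    return {
        "resourceType": "Bundle",
        "type": "document",  # assumption; the source does not state the type
        "entry": [
            {"fullUrl": f"urn:uuid:{m['id']}", "resource": m} for m in members
        ],
    }


bundle = wrap_in_bundle(
    "11450-4",
    "Problem List - Reported",
    [make_condition("140004", "Chronic pharyngitis")],
)
print(len(bundle["entry"]))
```

Because every section entry uses the same `urn:uuid:` value as the `fullUrl` of the matching Bundle entry, a consumer can resolve all references without contacting an external server.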

Evaluation design

The main purpose of the performance evaluation is to determine whether the standardization process causes a loss in performance. A common concern is that standardization degrades performance because some data elements originally output by NLP cannot be represented in a standard (eg, word-sense disambiguation, bag-of-words n-grams, co-occurrences). The performance evaluation was conducted through the following steps.

Reusing annotation corpora: We reused the corpora from the SHARPn, MedXN, and FMH projects, which were built from Mayo Clinic’s unstructured EHR data. These corpora contain annotations made by medical experts, the quality of which has been sufficiently verified through previous studies.

Standardizing the annotation corpora using an FHIR-based annotation schema: To support corpora reuse and integration, we designed and developed a framework that contains the following two components: (1) an automatic schema transformation component, in which the annotation schema of each corpus is automatically transformed into the FHIR-based schema; and (2) an expert-based verification and annotation component, in which existing annotations can be verified and new annotations can be added for new elements defined in FHIR. Three co-authors (NH, AW, and GJ) reviewed and verified the annotations. NH and AW have extensive experience in medical informatics, FHIR-based research applications, and clinical NLP; GJ has a medical background with extensive expertise in medical informatics and standards-based research applications. The generated FHIR-represented corpora were used as the gold standard to facilitate performance tuning and evaluation of the FHIR NLP engine.

Evaluating the performance of the NLP2FHIR pipeline: We computed standard measures (precision, recall, and F score), using the FHIR-based annotation corpora as the gold standard.
Building on the NLP output mappings and the integrated machine learning methods, we evaluated core element extraction and normalization performance for the FHIR resources Condition, Procedure, MedicationStatement, and FamilyMemberHistory. Because the FHIR model contains more granular clinical elements than the output types of existing NLP tools, our FHIR NLP engine also supports extraction algorithms for specific FHIR elements that leverage machine learning methods; three annotation corpora were used for the different FHIR element machine learning tasks.
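
The standard measures used in the evaluation can be computed as in the sketch below. The matching criterion (exact equality of annotation tuples) and the tuple layout are assumptions made for illustration; the source does not specify how predicted and gold annotations were aligned, and the example annotations are invented.

```python
def precision_recall_f(gold, predicted):
    """Compute precision, recall, and F score over two sets of annotations.

    An annotation counts as a true positive only when it appears in both
    sets unchanged (exact match), which is an assumption of this sketch.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented annotations: (document id, start offset, end offset, normalized value).
gold = {("doc1", 10, 17, "7 d"), ("doc1", 40, 48, "2 wk"), ("doc2", 5, 12, "3 d")}
predicted = gold | {("doc2", 30, 36, "1 mo"), ("doc3", 0, 6, "5 d")}

p, r, f = precision_recall_f(gold, predicted)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this made-up case every gold annotation is found (recall 1) but two of the five predictions are spurious (precision 0.6), so the harmonic mean gives an F score of about 0.75.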

RESULTS

Our pipeline achieved F scores ranging from 0.690 to 0.995 for various FHIR element representations, which is comparable to performance on general clinical NLP tasks. The performance results for the core elements, together with those of the original baseline tools, are shown in Table 5.
Table 5.

Evaluation results on the performance of the NLP2FHIR pipeline

| FHIR resource | FHIR element | Precision | Recall | F score | Baseline F score |
| --- | --- | --- | --- | --- | --- |
| MedicationStatement and Medication | MedicationStatement.medicationCodeableConcept | 0.996 | 0.982 | 0.988 | MedXN: 0.581–0.954; MedTime: 0.880 |
| | Dosage.timing.repeat.frequency | 0.795 | 0.873 | 0.832 | |
| | Dosage.timing.repeat.period | 0.959 | 0.914 | 0.936 | |
| | Dosage.timing.repeat.duration | 0.600 | 1 | 0.750 | |
| | Dosage.route | 0.957 | 0.816 | 0.878 | |
| | Medication.ingredient.amount.numerator.quantity.value | 0.930 | 0.815 | 0.869 | |
| | Medication.ingredient.amount.numerator.quantity.unit | 0.926 | 0.899 | 0.911 | |
| | Medication.form | 0.871 | 0.704 | 0.779 | |
| | Dosage.timing.repeat.when | 1 | 0.571 | 0.727 | |
| | Dosage.asNeededBoolean | 0.913 | 0.583 | 0.712 | |
| Condition | Condition.code | 0.865 | 0.696 | 0.771 | cTAKES: 0.768–0.954 |
| | Condition.bodySite | 0.871 | 0.611 | 0.718 | |
| | Condition.severity | 0.909 | 0.556 | 0.690 | |
| | Condition.extension.negated_modifier | 0.992 | 0.998 | 0.995 | |
| Procedure | Procedure.code | 0.889 | 0.643 | 0.746 | cTAKES: 0.768–0.954 |
| | Procedure.bodySite | 0.895 | 0.798 | 0.844 | |
| FamilyMemberHistory | FamilyMemberHistory.condition.code | 0.940 | 0.716 | 0.813 | cTAKES: 0.768–0.954 |
| | FamilyMemberHistory.extension.negated_modifier | 0.937 | 0.967 | 0.952 | |
| | FamilyMemberHistory.relationship | 0.756 | 0.739 | 0.747 | |

Abbreviations: FHIR: Fast Healthcare Interoperability Resources; NLP: natural language processing.

The results demonstrated that the NLP2FHIR pipeline, through our integration framework established to enhance EHR interoperability, does not cause a decrease in performance compared with the diverse existing tools. The element FamilyMemberHistory.extension.negated_modifier is one of the NLP-specific FHIR extensions, and its performance results were based on the cTAKES outputs; the element FamilyMemberHistory.relationship was newly identified using the machine learning-based relation extraction algorithm; and the evaluation of the other elements was based on the mappings and normalization rules for existing NLP tools. Therefore, the results verified the feasibility of the NLP2FHIR pipeline for standardizing unstructured EHR data.

DISCUSSION

The use of standards in modeling EHR data has the potential to unlock clinical data interoperation, high-throughput computation, and meaningful use. To promote FHIR-oriented EHR data modeling, we designed and developed an FHIR-based clinical data normalization pipeline (ie, NLP2FHIR) that can extract, standardize, and integrate data from unstructured clinical narratives. We believe that modeling unstructured EHR data using the NLP2FHIR pipeline can play an important role in enabling advanced semantic interoperability across different EHR systems. The key contributions of our study are: (1) the creation of mapping rules to support the automatic population of FHIR instances from heterogeneous clinical databases or from the output types of multiple NLP tools; (2) the creation of normalization rules to support transforming nonstandard data content into standard FHIR representations; (3) the definition of a collection of NLP-specific FHIR extensions to enhance the FHIR model's support for unstructured data; and (4) the construction of the FHIR-based type system used for improving interoperability among existing NLP tools and components. The design architecture supports extensibility and scalability because the FHIR-based type system covers all core clinical resources in the FHIR specification, which makes the NLP2FHIR pipeline modular. For instance, we can easily extend the architecture to produce a new data normalization pipeline profile using the FHIR DiagnosticReport resource to support the modeling of unstructured diagnostic reports (eg, pathology or radiology reports) in the future. The NLP2FHIR pipeline provides a generic and scalable framework to support the FHIR modeling of unstructured EHR data. We have focused on the use of the core clinical resources Condition, Procedure, MedicationStatement (including Medication), and FamilyMemberHistory.
We needed to handle those FHIR elements that were not covered by the NLP outputs by investigating: (1) whether the values of the elements could be retrieved using structured data (Table 3); and (2) whether new relationship detection algorithms should be developed for a specific element (eg, FamilyMemberHistory.relationship). We solicited a collection of such elements and developed corresponding FHIR extensions (Table 2) within the NLP2FHIR pipeline. We argue that community-based consensus development is a critical next step to broaden the applicability of the NLP2FHIR pipeline in the clinical informatics research community. Meanwhile, we identified several barriers to EHR data modeling using FHIR. First, we noticed that while some of the NLP output types could be directly mapped to FHIR elements without semantic differences, in most other cases there were semantic gaps between the data models of the existing NLP systems and the FHIR specification. Second, content normalization in FHIR remains a challenging task, as it depends to a large extent on both external terminology services and the FHIR internal value sets. Many of the elements in an FHIR resource are associated with a list of coded values (ie, a value set); some are a set of fixed values defined in the FHIR specification, while others are a list of concept codes defined in external terminologies or ontologies (eg, LOINC, RxNorm, or SNOMED CT). If needed, a locally maintained dictionary and/or look-up table can even be used as part of an FHIR profile. Currently, the FHIR code system and value set lists are under construction, and integrating FHIR terminology services into our pipeline is critical for future work. Fortunately, a number of community efforts have been initiated to develop open FHIR terminology services, including the HAPI FHIR Terminology Loader for SNOMED CT and LOINC, the LOINC FHIR Terminology Server, and the Health Open Terminology FHIR Server.
Third, the mapping and content normalization rules are currently executed as part of the transformation scripts within our NLP2FHIR pipeline. For future work, we plan to adopt formal models such as the FHIR StructureMap resource to represent the structure mapping rules and the ConceptMap resource to represent the content normalization rules. This would allow the automated conversion process itself to be expressed in a form standardized by the FHIR specification.
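
As a rough illustration of how a content normalization rule might be expressed as a ConceptMap, the sketch below encodes two day-of-week mappings and looks one up. It is a sketch under stated assumptions rather than a planned design: the source code system URL is hypothetical, the `translate` helper is not an FHIR library call, and the `equivalence` element follows the FHIR R4 layout (later releases name this element differently).

```python
# Hypothetical local code system for day mentions found in clinical text.
LOCAL_SYSTEM = "http://example.org/nlp2fhir/day-mentions"
DAYS_OF_WEEK_SYSTEM = "http://hl7.org/fhir/days-of-week"

concept_map = {
    "resourceType": "ConceptMap",
    "status": "draft",
    "group": [
        {
            "source": LOCAL_SYSTEM,
            "target": DAYS_OF_WEEK_SYSTEM,
            "element": [
                {
                    "code": "monday",
                    "target": [{"code": "mon", "equivalence": "equivalent"}],
                },
                {
                    "code": "tuesday",
                    "target": [{"code": "tue", "equivalence": "equivalent"}],
                },
            ],
        }
    ],
}


def translate(cmap, system, code):
    """Return (target system, target code) pairs for a source code.

    A plain dictionary walk written for this sketch; a terminology server
    would normally perform this lookup.
    """
    matches = []
    for group in cmap.get("group", []):
        if group.get("source") != system:
            continue
        for element in group.get("element", []):
            if element.get("code") == code:
                for target in element.get("target", []):
                    matches.append((group.get("target"), target["code"]))
    return matches


print(translate(concept_map, LOCAL_SYSTEM, "monday"))
```

Keeping the rules in a resource like this, instead of inside script logic, means they can be shared, versioned, and reviewed independently of the code that applies them.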

CONCLUSION

In this study, we developed and evaluated a standards-based clinical data normalization pipeline to model EHR data using the FHIR specification. We demonstrated that our NLP2FHIR pipeline is feasible for standardizing unstructured EHR data and integrating structured data into the model. The outcomes of this work provide standards-based tools for clinical data normalization, which are indispensable for enabling portable EHR-driven phenotyping and large-scale data-driven analytics, as well as useful insights for future development of the FHIR specification on the handling of unstructured clinical data. With standards-based FHIR modeling of both structured and unstructured EHR data, the NLP2FHIR pipeline would greatly benefit electronic health care data exchange, utilization, and rapid generation of computational data for advancing clinical and translational research. We are actively working on improving the performance of the NLP2FHIR pipeline, developing new pipeline profiles with support for more FHIR clinical resources, and applying the pipeline to EHR-driven cohort identification and data analytics. To accelerate community collaboration and tooling validation, the source code of our tooling and related resources are publicly available at the GitHub site: https://github.com/BD2KOnFHIR/NLP2FHIR.
  12 in total

1.  The SHARPn project on secondary use of Electronic Medical Record data: progress, plans, and possibilities.

Authors:  Christopher G Chute; Jyotishman Pathak; Guergana K Savova; Kent R Bailey; Marshall I Schor; Lacey A Hart; Calvin E Beebe; Stanley M Huff
Journal:  AMIA Annu Symp Proc       Date:  2011-10-22

2.  Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications.

Authors:  Guergana K Savova; James J Masanz; Philip V Ogren; Jiaping Zheng; Sunghwan Sohn; Karin C Kipper-Schuler; Christopher G Chute
Journal:  J Am Med Inform Assoc       Date:  2010 Sep-Oct       Impact factor: 4.497

3.  Meaningful use of electronic health records: the road ahead.

Authors:  Ashish K Jha
Journal:  JAMA       Date:  2010-10-20       Impact factor: 56.272

4.  Standardizing Heterogeneous Annotation Corpora Using HL7 FHIR for Facilitating their Reuse and Integration in Clinical NLP.

Authors:  Na Hong; Andrew Wen; Majid Rastegar Mojarad; Sunghwan Sohn; Hongfang Liu; Guoqian Jiang
Journal:  AMIA Annu Symp Proc       Date:  2018-12-05

5.  MedXN: an open source medication extraction and normalization tool for clinical text.

Authors:  Sunghwan Sohn; Cheryl Clark; Scott R Halgrim; Sean P Murphy; Christopher G Chute; Hongfang Liu
Journal:  J Am Med Inform Assoc       Date:  2014-03-17       Impact factor: 4.497

6.  DeepPhe: A Natural Language Processing System for Extracting Cancer Phenotypes from Clinical Records.

Authors:  Guergana K Savova; Eugene Tseytlin; Sean Finan; Melissa Castine; Timothy Miller; Olga Medvedeva; David Harris; Harry Hochheiser; Chen Lin; Girish Chavan; Rebecca S Jacobson
Journal:  Cancer Res       Date:  2017-11-01       Impact factor: 12.701

7.  Ontologies as integrative tools for plant science.

Authors:  Ramona L Walls; Balaji Athreya; Laurel Cooper; Justin Elser; Maria A Gandolfo; Pankaj Jaiswal; Christopher J Mungall; Justin Preece; Stefan Rensing; Barry Smith; Dennis W Stevenson
Journal:  Am J Bot       Date:  2012-07-30       Impact factor: 3.844

8.  An information model for computable cancer phenotypes.

Authors:  Harry Hochheiser; Melissa Castine; David Harris; Guergana Savova; Rebecca S Jacobson
Journal:  BMC Med Inform Decis Mak       Date:  2016-09-15       Impact factor: 2.796

9.  Systematic Analysis of Free-Text Family History in Electronic Health Record.

Authors:  Yanshan Wang; Liwei Wang; Majid Rastegar-Mojarad; Sijia Liu; Feichen Shen; Hongfang Liu
Journal:  AMIA Jt Summits Transl Sci Proc       Date:  2017-07-26

10.  A common type system for clinical natural language processing.

Authors:  Stephen T Wu; Vinod C Kaggal; Dmitriy Dligach; James J Masanz; Pei Chen; Lee Becker; Wendy W Chapman; Guergana K Savova; Hongfang Liu; Christopher G Chute
Journal:  J Biomed Semantics       Date:  2013-01-03