Meizhi Ju1, Andrea D Short2, Paul Thompson1, Nawar Diar Bakerly3, Georgios V Gkoutos4,5,6,7,8,9, Loukia Tsaprouni10, Sophia Ananiadou1.
Abstract
OBJECTIVES: Chronic obstructive pulmonary disease (COPD) phenotypes cover a range of lung abnormalities. To allow text mining methods to identify pertinent and potentially complex information about these phenotypes from textual data, we have developed a novel annotated corpus, which we use to train a neural network-based named entity recognizer to detect fine-grained COPD phenotypic information.
Keywords: chronic obstructive pulmonary disease; information extraction; natural language processing; phenotype; text mining
Year: 2019 PMID: 31984360 PMCID: PMC6951876 DOI: 10.1093/jamiaopen/ooz009
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. Example of a phenotype that includes other concepts nested within it.
Figure 2. Workflow for annotation and detection of information relating to COPD phenotypes. COPD: chronic obstructive pulmonary disease; CRF: conditional random field.
Descriptions, examples, and counts of each category in the COPD annotation scheme
| Type | Description | Examples | Number of concepts |
|---|---|---|---|
| Problem | An overall category for any COPD indications of concern | COPD exacerbations; past pulmonary TB | 2556 |
| Condition | Any disease or medical condition, including COPD comorbidities | emphysema; pulmonary vascular disease; asthma | 5119 |
| RiskFactor | A phrase signifying a patient’s increased chances of having COPD | increased levels of the C-reactive protein; alpha1 antitrypsin deficiency | 1211 |
| SignOrSymptom | An observable irregularity manifested by a COPD patient | chronic cough; shortness of breath | 2065 |
| IndividualBehaviour | A patient’s habits that increase susceptibility to COPD | smoking for 25 years; exercise-limited patients | 194 |
| TestResult | Findings based on COPD-relevant examinations | decrease in rate of lung function; FEV1 45% predicted | 685 |
| Treatment | Any medication, therapy, or treatment program | inhaled corticosteroids; oxygen therapy; pulmonary rehabilitation | 4337 |
| Test | An overall category for any COPD-relevant examinations or measures/parameters | spirometry; respiratory frequency; FEV1 | 3576 |
| RadiologicalTest | Any of the radiological tests for detecting COPD | computed tomography scanning; high resolution computed tomography | 29 |
| MicrobiologicalTest | An examination of a COPD-relevant specimen | complete blood count; bacterial isolates | 11 |
| PhysiologicalTest | A measurement of a COPD patient’s capacity to exercise | 6-min walking distance; incremental cardiopulmonary exercise testing | 17 |
| ConstituentConcept | An umbrella type for elementary concepts that may form part of a phenotype description; should only be chosen if none of the subtypes below apply | bronchodilation; enhancement of skeletal muscle contractility | 5 |
| AnatomicalConcept | A mention pertaining to anatomical entities | lung; heart; pulmonary; hepatic; respiratory airway | 2616 |
| Drug | Any drug name; will mostly overlap with treatment | corticosteroids; short-acting bronchodilators | 2593 |
| Protein | Any protein name | alpha1 antitrypsin; pro-inflammatory cytokines | 820 |
| Quality | Expressions which modify or qualify any of the concepts above | chronic; obstructed; damaged; decreased rate; enhanced; decreased amount | 1153 |
Abbreviations: COPD: chronic obstructive pulmonary disease; FEV1: forced expiratory volume in 1 second.
Figure 3. Hierarchical entity annotation scheme for COPD phenotypic information. COPD: chronic obstructive pulmonary disease.
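The hierarchy in the annotation scheme can be represented as a simple parent-to-subtype mapping. The sketch below is illustrative only: the grouping is an assumption inferred from the row order and descriptions in the table above ("overall category", "umbrella type", "subtypes below"), not a transcription of Figure 3, and `SCHEME` and `parent_of` are hypothetical names.

```python
# Assumed parent -> subtype grouping, inferred from the row order and
# descriptions in the annotation-scheme table (not taken from Figure 3).
SCHEME = {
    "Problem": ["Condition", "RiskFactor", "SignOrSymptom",
                "IndividualBehaviour", "TestResult"],
    "Treatment": [],
    "Test": ["RadiologicalTest", "MicrobiologicalTest", "PhysiologicalTest"],
    "ConstituentConcept": ["AnatomicalConcept", "Drug", "Protein", "Quality"],
}


def parent_of(label):
    """Return the top-level category of a label (itself if already top-level)."""
    if label in SCHEME:
        return label
    for parent, subtypes in SCHEME.items():
        if label in subtypes:
            return parent
    raise KeyError(f"unknown label: {label}")
```

For example, `parent_of("Drug")` returns `"ConstituentConcept"` under this assumed grouping.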
Number of entities normalized by HYPHEN
| Category | Total entities | Number of entities normalized | Percentage of entities normalized |
|---|---|---|---|
| Problem | 2556 | 2151 | 84.15 |
| Condition | 5119 | 4969 | 97.07 |
| RiskFactor | 1211 | 942 | 77.79 |
| SignOrSymptom | 2065 | 1140 | 55.21 |
| IndividualBehaviour | 194 | 124 | 63.92 |
| TestOrMeasureResult | 685 | 259 | 37.81 |
| Treatment | 4337 | 3775 | 87.04 |
| TestOrMeasure | 3576 | 2609 | 72.96 |
| AnatomicalConcept | 2616 | 2372 | 90.67 |
| Drug | 2593 | 2368 | 91.32 |
| Protein | 820 | 727 | 88.66 |
| Quality | 1153 | 1015 | 88.03 |
| Total | 26 925 | 22 451 | 83.38 |
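The percentage column is the number of normalized entities divided by the total for that category. A minimal sketch of that arithmetic, using three rows copied from the table (the function name `percent_normalized` is hypothetical):

```python
def percent_normalized(total, normalized):
    """Share of entities normalized, as a percentage rounded to 2 decimals."""
    return round(100 * normalized / total, 2)


# (total entities, entities normalized), copied from the table above.
ROWS = {
    "Condition": (5119, 4969),
    "SignOrSymptom": (2065, 1140),
    "Total": (26925, 22451),
}

for category, (total, normalized) in ROWS.items():
    print(category, percent_normalized(total, normalized))
```

For these rows the computed values agree with the reported 97.07, 55.21, and 83.38.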
Sample normalization results
| Entity annotation | Semantic category | Mapped UMLS concept |
|---|---|---|
| increased PVR | Problem | Increased pulmonary vascular resistance (C1867423) |
| lung failure | Condition | Pulmonary failure (C0948755) |
| left atrial | AnatomicalConcept | Left atrium (C0225860) |
| arm training | Treatment | Upper limb training (C0556501) |
| spirometric test | TestOrMeasure | Spirometry test (C0037981) |
| genetic predisposition | RiskFactor | Genetic susceptibility to disease (C1455997) |
Figure 4. Overview of the layered-BiLSTM-CRF model architecture. B-AC: B-AnatomicalConcept; B-T: B-treatment; I-T: I-treatment; B-D: B-drug; I-D: I-drug.
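The B-/I- labels in the caption follow the usual BIO tagging convention, with one tag sequence per nesting level. The sketch below shows one way nested spans could be turned into per-layer BIO tags, innermost layer first. It illustrates the label encoding only and is not the authors' preprocessing or model code; the function name `layered_bio`, the token list, and the spans are hypothetical.

```python
def layered_bio(n_tokens, spans):
    """Convert nested spans into one BIO tag sequence per nesting layer.

    spans: iterable of (start, end_exclusive, label) token offsets.
    Returns a list of tag lists, innermost layer first.
    """
    depth = {}
    # Visit shorter spans first so contained spans already have a depth.
    for span in sorted(set(spans), key=lambda s: s[1] - s[0]):
        inner = [d for other, d in depth.items()
                 if span[0] <= other[0] and other[1] <= span[1]]
        depth[span] = 1 + max(inner, default=0)

    layers = [["O"] * n_tokens for _ in range(max(depth.values(), default=0))]
    for (start, end, label), d in depth.items():
        tags = layers[d - 1]
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return layers


# Hypothetical sentence; entity phrases borrowed from the scheme's examples.
tokens = ["inhaled", "corticosteroids", "improve", "lung", "function"]
spans = [(1, 2, "Drug"), (0, 2, "Treatment"), (3, 4, "AnatomicalConcept")]
for layer in layered_bio(len(tokens), spans):
    print(layer)
```

Here the first layer carries the Drug and AnatomicalConcept tags, and the second carries the Treatment span that contains the Drug mention.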
Performance of different NER models at different levels of entity nesting
| Level | Model | P (%) | R (%) | F (%) |
|---|---|---|---|---|
| Innermost | CRF | | 68.78 | 72.74 |
| Innermost | BiLSTM-CRF | 73.93 | | |
| Innermost | Layered BiLSTM-CRF | 69.79 | 70.41 | 70.10 |
| Outermost | CRF | 73.63 | 66.41 | 69.83 |
| Outermost | BiLSTM-CRF | | 67.35 | 71.24 |
| Outermost | Layered BiLSTM-CRF | 74.00 | | |
| All | CRF | 75.44 | 67.61 | 71.31 |
| All | BiLSTM-CRF | 74.71 | 70.42 | 72.50 |
| All | Layered BiLSTM-CRF | | | |
Note: For each different level, the best precision (P), recall (R), and F-score (F) amongst the 3 models is shown in bold.
Abbreviations: NER: named entity recognition; CRF: conditional random field.
A significant difference between CRF and (flat) BiLSTM-CRF models at P < .05. Since the layered BiLSTM-CRF takes as input different entities than the baseline models (ie, all entities vs innermost or outermost entities), we did not apply significance testing between layered and flat models.
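The F-scores in the table are consistent with the standard balanced F-score, the harmonic mean of precision and recall. A minimal sketch, checked against rows where all three values are given (the function name `f_score` is hypothetical):

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) for rows of the table with all three values present.
rows = [
    (69.79, 70.41, 70.10),  # Innermost, Layered BiLSTM-CRF
    (73.63, 66.41, 69.83),  # Outermost, CRF
    (75.44, 67.61, 71.31),  # All, CRF
    (74.71, 70.42, 72.50),  # All, BiLSTM-CRF
]
for p, r, reported in rows:
    print(round(f_score(p, r), 2), reported)
```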
Figure 5. Counts of different types of errors for each semantic type.