John D Osborne1, Matthew B Neu1, Maria I Danila1, Thamar Solorio2, Steven J Bethard3.
Abstract
BACKGROUND: Traditionally, text mention normalization corpora have normalized concepts to single ontology identifiers ("pre-coordinated concepts"). Less frequently, normalization corpora have used concepts with multiple identifiers ("post-coordinated concepts"), but the additional identifiers have been restricted to a defined set of relationships to the core concept. This approach limits the ability of the normalization process to express semantic meaning. We generated a freely available corpus using post-coordinated concepts without a defined set of relationships, which we term "compositional concepts", to evaluate their use in clinical text.
Keywords: Concept normalization; Concept recognition; Fine grained named entity recognition; Information extraction; NLP
Year: 2018 PMID: 29316970 PMCID: PMC5761157 DOI: 10.1186/s13326-017-0173-6
Source DB: PubMed Journal: J Biomed Semantics
Examples of pre-coordinated and post-coordinated concepts from the NCBI disease corpus
| Type / Subtype | Identifiers | Text mention example | Concept name/s |
|---|---|---|---|
| Pre-coordinated | 1 | Bone dysplasia | Bone diseases, Developmental |
| Compositional / Aggregate (\|) | 2 | Breast or ovarian cancer | Breast cancer\|Ovarian cancer |
| Compositional / Composed (+) | 3 | Inherited neuromuscular disease | Neuromuscular disease + Genetic diseases + Inborn |
Post-coordinated concepts (of type “aggregate” or “composed”) have 2 or more identifiers
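The distinction in the table above can be sketched as a small data structure. This is an illustrative sketch only, not the corpus's actual file format: the class `ConceptAnnotation` and the function `parse_annotation` are hypothetical names, and the parser assumes an annotation string uses at most one operator type (`|` for aggregate, `+` for composed), as in the table rows.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ConceptAnnotation:
    """Hypothetical container for one normalized text mention."""
    names: Tuple[str, ...]  # one entry per identifier
    operator: str           # "" (single), "|" (aggregate) or "+" (composed)

    @property
    def kind(self) -> str:
        # A single identifier is pre-coordinated; two or more are
        # post-coordinated, split here by the joining operator.
        if len(self.names) == 1:
            return "pre-coordinated"
        return "aggregate" if self.operator == "|" else "composed"


def parse_annotation(text: str) -> ConceptAnnotation:
    """Split a concept string on "|" or "+" (assumes no mixed operators)."""
    for op in ("|", "+"):
        if op in text:
            parts = tuple(part.strip() for part in text.split(op))
            return ConceptAnnotation(parts, op)
    return ConceptAnnotation((text.strip(),), "")


# The three concept strings from the table above.
for concept in (
    "Bone diseases, Developmental",
    "Breast cancer|Ovarian cancer",
    "Neuromuscular disease + Genetic diseases + Inborn",
):
    parsed = parse_annotation(concept)
    print(parsed.kind, len(parsed.names))
```

Keeping the operator alongside the identifier list preserves the difference between "any of these concepts" (aggregate) and "all of these concepts together" (composed), which a flat list of identifiers would lose.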
CUI-less examples from SEMEVAL2015 and CUILESS2016 annotation of ShARe corpus
| Annotation set | Slot | Aggregate example | Composed example |
|---|---|---|---|
| SEMEVAL2015 | Negation | Yes | No* |
| | Subject | Patient* | Patient* |
| | Uncertainty | No* | Yes |
| | Course | Unmarked* | Unmarked* |
| | Severity | Unmarked* | Unmarked* |
| | Conditional | False* | False* |
| | Generic | False* | False* |
| | Body location CUI | C0225754 (Both lungs) | C1521748 (Entire mastoid) |
| | Disorder CUI | CUI-less | CUI-less |
| CUILESS2016 | Disorder CUI | C0034642 (Rales) | C0543467 (Operative surgery) |
| | | C0035508 (Rhonchi) | C2004491 (Cicatrix) |
| | | C0043144 (Wheezing) | |
An * indicates the default value for that slot in SEMEVAL2015. Our CUILESS2016 annotators added identifiers to describe the disorder when the Disorder CUI was marked “CUI-less” in SEMEVAL2015
SEMEVAL2015 CUI-less distribution by clinical document type
| Data set | Document type | CUI-less count | Average CUI-less per note |
|---|---|---|---|
| Development | Discharge summaries | 1929 | 13.9 |
| Training | Discharge summaries | 2796 | 20.6 |
| Training | Echocardiogram | 331 | 6.1 |
| Training | Electrocardiogram | 91 | 1.7 |
| Training | Radiology | 250 | 4.6 |
Only discharge summaries were available for annotation in the development document set
SEMEVAL2015 and CUILESS2016 document statistics
| Set | Word count | Discharge notes | ECG notes | EKG notes | Radiology notes |
|---|---|---|---|---|---|
| Train | 182K | 136 | 54 | 54 | 54 |
| Development | 153K | 133 | 0 | 0 | 0 |
| Total | 335K | 269 | 54 | 54 | 54 |
CUILESS2016 annotator agreement type examples
| Exact mention score | Hierarchical mention score | Text mention | Annotator 1 Concept/s | Annotator 2 Concept/s |
|---|---|---|---|---|
| 1.0 | 1.0 | | Drug allergy; Levofloxacin | Drug allergy; Levofloxacin |
| 0.0 | 0.52 | | (O/E) - posturing | Posturing behaviour |
| 0.0 | 0.64 | | Midline shift of brain; To the right | Midline shift of brain |
| 0.0 | 0.22 | | Erythema | Redness |
The computed hierarchical mention score was used instead of annotator judgment in determining an approximate level of agreement
Fig. 1 Annotation workflow. BRAT 1.3 [17] was used to normalize concepts to UMLS CUIs from SNOMED CT
Disorder multiple identifier distribution by data set
| Disorder CUI type | Development count | Development proportion (%) | Training count | Training proportion (%) |
|---|---|---|---|---|
| CUI-less | 1 | 0.05 | 7 | 0.20 |
| Single | 1687 | 87.46 | 2823 | 81.40 |
| Double | 221 | 11.46 | 562 | 16.21 |
| Triple | 18 | 0.93 | 73 | 2.11 |
| Quadruple | 2 | 0.10 | 3 | 0.09 |
| Total | 1929 | 100 | 3468 | 100 |
Differences in disorder mention distribution between the development and training data sets are likely due to note composition (see Table 3), the larger set of annotators (4) for the training data, and the lack of a consensus process for the training data, since each training document was annotated by only a single annotator
Overall disorder and attribute multiple identifier distribution
| Identifier type | Disorder count | Disorder proportion | Disorder + Attributes count | Disorder + Attributes proportion |
|---|---|---|---|---|
| CUI-less | 8 | 0.1% | 3 | 0.06% |
| Single | 4502 | 83.54% | 966 | 17.90% |
| Double | 783 | 14.53% | 2505 | 46.41% |
| Triple | 91 | 1.7% | 1608 | 29.79% |
| Quadruple | 5 | 0.1% | 263 | 4.87% |
| Pentuple | 0 | 0.0% | 50 | 0.93% |
| Hextuple | 0 | 0.0% | 2 | 0.04% |
| Total | 5397 | 100% | 5397 | 100% |
The Disorder column shows the count and proportion of disorders annotated with one or more concepts excluding attributes. The Disorder + Attributes column includes identifiers from attributes in the count to capture post-coordination with other identifiers
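A distribution of this kind can be tallied by counting the identifiers attached to each mention. The sketch below is a minimal illustration under stated assumptions: the function `identifier_distribution`, the `LABELS` mapping, and the toy input are hypothetical and are not the corpus release format. The CUIs in the toy input are taken from the CUI-less examples table.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

# Row labels follow the distribution tables; larger counts get a generic label.
LABELS = {0: "CUI-less", 1: "Single", 2: "Double", 3: "Triple", 4: "Quadruple"}


def identifier_distribution(
    mentions: Iterable[Sequence[str]],
) -> Dict[str, Tuple[int, float]]:
    """Map each identifier-count label to (count, percentage of mentions)."""
    counts = Counter(len(cuis) for cuis in mentions)
    total = sum(counts.values())
    return {
        LABELS.get(n, f"{n} identifiers"): (c, 100.0 * c / total)
        for n, c in sorted(counts.items())
    }


# Toy input: one list of CUIs per disorder mention.
toy_mentions = [
    ["C0034642", "C0035508", "C0043144"],  # aggregate example, three CUIs
    ["C0543467", "C2004491"],              # composed example, two CUIs
    [],                                    # mention left CUI-less
]
for label, (count, pct) in identifier_distribution(toy_mentions).items():
    print(f"{label}: {count} ({pct:.2f}%)")
```

To reproduce a "Disorder + Attributes" style count, the attribute identifiers (for example the body location CUI) would be appended to each mention's list before tallying.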
Development dataset annotator agreement
| Agreement type | Agreement count | Proportionate agreement (%) |
|---|---|---|
| Exact | 1011 | 52.4 |
| Hierarchical | NA | 78.2 |
| Total mentions | 1929 | |
There is no count for hierarchical agreement since each mention is assigned a value based on Eq. (1), whereas exact agreement assigns every mention a score of either 1.0 (match) or 0.0 (no match)
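Exact agreement can be sketched as follows. This is an assumption-laden illustration, not the authors' scoring code: the function names are hypothetical, and two annotations are treated as matching when they contain the same set of concepts regardless of order. The hierarchical score is not sketched because Eq. (1) is not reproduced in this section.

```python
from typing import Iterable, Sequence, Tuple


def exact_mention_score(cuis_a: Sequence[str], cuis_b: Sequence[str]) -> float:
    """1.0 if both annotators chose the same concept set, else 0.0."""
    return 1.0 if set(cuis_a) == set(cuis_b) else 0.0


def exact_agreement(
    pairs: Iterable[Tuple[Sequence[str], Sequence[str]]],
) -> float:
    """Percentage of mentions on which the two annotators match exactly."""
    scores = [exact_mention_score(a, b) for a, b in pairs]
    if not scores:
        raise ValueError("no mentions to compare")
    return 100.0 * sum(scores) / len(scores)


# Two rows from the annotator agreement examples table.
pairs = [
    (["Drug allergy", "Levofloxacin"], ["Drug allergy", "Levofloxacin"]),
    (["Erythema"], ["Redness"]),
]
print(exact_agreement(pairs))  # 50.0

# The reported development-set figure follows from the counts in the table.
print(round(100.0 * 1011 / 1929, 1))  # 52.4
```

Because exact scoring gives no credit for near-synonyms such as "Erythema" and "Redness", it acts as a lower bound on agreement, which is why the table reports a hierarchical figure alongside it.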
Compositional CUI normalization error analysis
| Mention | Error Class |
|---|---|
| Allergies, Calcium | Named entity recognition failure |
| Atrial sensed | Named entity recognition failure |
| Left ventricular inflow pattern | Named entity recognition failure |
| RCIA | Ambiguous text |
| RC one Aneurysm | Ambiguous text |
| Echogenic kidney | No composition found |
| Making grammatical errors | No composition found |
| Tortous aorta | No composition found |
All 8 mentions where annotators were unable to annotate the disease using the compositional approach