Lisa Grossman Liu, Raymond H Grossman, Elliot G Mitchell, Chunhua Weng, Karthik Natarajan, George Hripcsak, David K Vawdrey.
Abstract
The recognition, disambiguation, and expansion of medical abbreviations and acronyms are of utmost importance to prevent medically dangerous misinterpretation in natural language processing. To support recognition, disambiguation, and expansion, we present the Medical Abbreviation and Acronym Meta-Inventory, a deep database of medical abbreviations. A systematic harmonization of eight source inventories across multiple healthcare specialties and settings identified 104,057 abbreviations with 170,426 corresponding senses. Automated cross-mapping of synonymous records using state-of-the-art machine learning reduced redundancy, which simplifies future application. Additional features include semi-automated quality control to remove errors. The Meta-Inventory demonstrated high completeness (coverage) of abbreviations and senses in new clinical text, a substantial improvement over the next largest repository (6-14% increase in abbreviation coverage; 28-52% increase in sense coverage). To our knowledge, the Meta-Inventory is the most complete compilation of medical abbreviations and acronyms in American English to date. The multiple sources and high coverage support application in varied specialties and settings. This allows for cross-institutional natural language processing, which previous inventories did not support. The Meta-Inventory is available at https://bit.ly/github-clinical-abbreviations.
Year: 2021 PMID: 34078918 PMCID: PMC8172575 DOI: 10.1038/s41597-021-00929-4
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Source Sense Inventories.
| Source | Description | Underlying Corpus | Medical Specialty | Last Updated | Records |
|---|---|---|---|---|---|
| UMLS-LRABR | Unified Medical Language System Lexical Resource for Abbreviations and Acronyms | Biomedical research | Multiple | 2019 | 294484 |
| ADAM | Another Database of Abbreviations in Medline | Biomedical research | Multiple | 2007 | 94657 |
| Berman | Manually-curated general pathology abbreviations | Clinical notes | Pathology | 2004 | 12087 |
| Wikipedia | Publicly-curated list of medical and clinical trial abbreviations | Clinical notes | Multiple | 2018 | 2952 |
| Vanderbilt1 | Semi-automatically derived from the medical record | Sign-out notes | Medicine | 2013 | 2414 |
| Vanderbilt2 | Semi-automatically derived from the medical record | Discharge notes | Medicine | 2013 | 2090 |
| Stetson | Manually-curated from the general medical record | Sign-out notes | Medicine | 2002 | 765 |
| Columbia | Manually-curated from the obstetric medical record | Clinical notes | Obstetrics | 2018 | 219 |
Fig. 1 Overview of Data Harmonization.
Performance of Cross-Mapping Models on Clinician-Labeled Data*.
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| Baseline | 0.788 | 0.759 | 0.773 |
| LightGBM | 0.813 | 0.785 | 0.799 |
| BERT Architecture | 0.815 | 0.772 | 0.793 |
| Ensemble | 0.828 | 0.801 | 0.814 |
*Scores calculated using the mean predictions of 3 runs with different random seeds.
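As a quick consistency check on the table above, the F1 column can be recomputed from the precision and recall columns. The sketch below assumes the standard harmonic-mean definition, F1 = 2PR / (P + R), which the source does not state explicitly; the names `SCORES` and `f1_score` are illustrative only.

```python
# Precision and recall as reported in the table above.
SCORES = {
    "Baseline": (0.788, 0.759),
    "LightGBM": (0.813, 0.785),
    "BERT Architecture": (0.815, 0.772),
    "Ensemble": (0.828, 0.801),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for model, (p, r) in SCORES.items():
    print(f"{model}: F1 = {f1_score(p, r):.3f}")
```

Under this assumption, the recomputed values agree with the reported F1 scores to three decimal places for all four models.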
Data Dictionary.
| Data Field | Name | Description | Auxiliary Source* | Example |
|---|---|---|---|---|
| GroupID | Group Unique Identifier | Identifies a group of synonymous records | | G169326 |
| RecordID | Record Unique Identifier | Identifies each record (one per record) | | R349343 |
| SF | Short Form | Abbreviated version of an abbreviation | | O.C. |
| SFUI | Short Form Unique Identifier | Identifies a unique short form | | S050750 |
| NormSF | Normalized Short Form | Lexically normalized version of the short form | | oc |
| LF | Long Form | Spelled-out version of an abbreviation | | oral contraceptives |
| LFUI | Long Form Unique Identifier | Identifies a unique long form | | L121977 |
| NormLF | Normalized Long Form | Lexically normalized version of the long form | | oral contraceptive |
| Source | Source Inventory | Name of the source sense inventory | | ADAM |
| Modified | Modified | Modified by quality control or not | | modified |
| SFEUI | Short Form Entry Unique Identifier | Identifies a unique UMLS short form | UMLS-LRABR | E0319213 |
| LFEUI | Long Form Entry Unique Identifier | Identifies a unique UMLS long form | UMLS-LRABR | E0044077 |
| Type | Type of Entry | Abbreviation or acronym | UMLS-LRABR | acronym |
| PrefSF | Preferred Short Form | Preferred version of a short form | ADAM | o.c. |
| Count | Count | Number of occurrences in the corpus | ADAM, Vanderbilt | 10 |
| Score | Score | Adjusted proportion of occurrences | ADAM | 0.7357 |
| Frequency | Frequency | Frequency of the sense in the corpus | Vanderbilt | 0.4168 |
| UMLS.CUI | UMLS Concept Unique Identifier | UMLS CUI that mapped to the sense | Vanderbilt | c0009905 |
*Auxiliary data fields are unique to a single source and found only in the “auxiliary” version of the Meta-Inventory available in the GitHub repository (https://bit.ly/github-clinical-abbreviations). Abbreviations: UMLS, Unified Medical Language System; LRABR, Lexical Resource for Abbreviations and Acronyms; ADAM, Another Database of Abbreviations in Medline.
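To illustrate how the `SF`, `NormSF`, and `NormLF` fields might be used together for abbreviation expansion, the sketch below indexes a few records by normalized short form and looks up candidate senses. It is a minimal sketch under several assumptions: the first record reuses the example values from the data dictionary, the second record is hypothetical and only demonstrates a second sense, and `normalize_short_form` (lowercasing and dropping non-alphanumeric characters) is a simplified stand-in for the Meta-Inventory's lexical normalization, not the authors' actual procedure.

```python
import re
from collections import defaultdict

# First record: example values from the data dictionary.
# Second record: HYPOTHETICAL, added only to show an ambiguous short form.
RECORDS = [
    {"RecordID": "R349343", "SF": "O.C.", "NormSF": "oc",
     "LF": "oral contraceptives", "NormLF": "oral contraceptive",
     "Source": "ADAM"},
    {"RecordID": "R000000", "SF": "OC", "NormSF": "oc",
     "LF": "oral cavity", "NormLF": "oral cavity",
     "Source": "hypothetical"},
]


def normalize_short_form(sf: str) -> str:
    """Simplified normalization (assumption): lowercase, keep only a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", sf.lower())


def build_index(records):
    """Map each normalized short form to the records that carry it."""
    index = defaultdict(list)
    for rec in records:
        index[rec["NormSF"]].append(rec)
    return index


def expand(sf: str, index) -> list:
    """Return the sorted, de-duplicated normalized long forms for a short form."""
    matches = index.get(normalize_short_form(sf), [])
    return sorted({rec["NormLF"] for rec in matches})


INDEX = build_index(RECORDS)
print(expand("O.C.", INDEX))
```

Keying the index on `NormSF` rather than `SF` means surface variants such as "O.C." and "OC" resolve to the same set of candidate senses, which is the practical benefit of the normalized fields.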
Fig. 2 Formulas for Calculating Coverage.
Fig. 3 Coverage Estimates for the Meta-Inventory and its Sources.
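The exact coverage formulas are those given in Fig. 2. As a rough illustration only, the sketch below assumes a simple reading: coverage is the proportion of items observed in new text that are also present in the inventory, where an item is a short form for abbreviation coverage and a (short form, long form) pair for sense coverage. The function name `coverage` and all sample data are hypothetical.

```python
def coverage(observed: set, inventory: set) -> float:
    """Proportion of observed items that also appear in the inventory.

    Assumed definition for illustration; see Fig. 2 for the authors' formulas.
    """
    if not observed:
        raise ValueError("observed set must not be empty")
    return len(observed & inventory) / len(observed)


# HYPOTHETICAL toy data: normalized short forms seen in new clinical text
# versus those present in an inventory.
observed_abbreviations = {"oc", "mi", "zzq"}
inventory_abbreviations = {"oc", "mi", "ra"}

# HYPOTHETICAL toy data: (short form, long form) sense pairs.
observed_senses = {
    ("oc", "oral contraceptive"),
    ("mi", "myocardial infarction"),
    ("mi", "mitral insufficiency"),
}
inventory_senses = {
    ("oc", "oral contraceptive"),
    ("mi", "myocardial infarction"),
}

abbreviation_coverage = coverage(observed_abbreviations, inventory_abbreviations)
sense_coverage = coverage(observed_senses, inventory_senses)
print(f"abbreviation coverage: {abbreviation_coverage:.2f}")
print(f"sense coverage: {sense_coverage:.2f}")
```

Treating senses as pairs keeps the two measures distinct: an inventory can contain a short form yet still miss the particular sense used in the text.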
| Measurement(s) | Controlled Vocabulary • Linguistic Form |
| Technology Type(s) | digital curation • data combination |
| Sample Characteristic - Location | United States of America |