Shikhar Vashishth, Denis Newman-Griffis, Rishabh Joshi, Ritam Dutt, Carolyn P Rosé.
Abstract
OBJECTIVES: Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction: extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types.
Keywords: Distant supervision; Entity typing; Information extraction; Medical concept normalization; Medical entity linking; Natural language processing
Year: 2021 PMID: 34390853 PMCID: PMC8952339 DOI: 10.1016/j.jbi.2021.103880
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317
Fig. 1. Overview of article contributions. We present MedType, a novel, modular system for biomedical semantic type prediction, together with WikiMed and PubMedDS, two large-scale, automatically created datasets for medical concept normalization that we use to pretrain MedType. We show that integrating MedType with five commonly used packages for biomedical information extraction improves performance across the board on four benchmark datasets.
Fig. 2. Overview of MedType. For a given input text, MedType takes as input the set of identified mentions along with their lists of candidate concepts. Then, for each mention, MedType predicts its semantic type based on its context in the text. The predicted semantic type is used to filter out irrelevant candidate concepts, thus controlling overgeneration of candidates and improving medical entity linking. Please refer to Section 3 for details.
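The filtering step described in this caption can be illustrated with a minimal sketch. It is not the MedType implementation: the function name `filter_candidates`, the concept identifiers, and the small type inventory are hypothetical, and the type prediction itself (the learned part of the system) is assumed to have already happened.

```python
from typing import Dict, List, Set


def filter_candidates(
    candidates: List[str],
    predicted_types: Set[str],
    concept_types: Dict[str, Set[str]],
) -> List[str]:
    """Keep only candidates whose semantic types overlap the predicted types."""
    return [
        concept_id
        for concept_id in candidates
        if concept_types.get(concept_id, set()) & predicted_types
    ]


# Hypothetical concept inventory: identifier -> semantic types (not real CUIs).
concept_types = {
    "CONCEPT_COMMON_COLD": {"Disease or Syndrome"},
    "CONCEPT_COLD_TEMPERATURE": {"Phenomena"},
}

# Candidates an entity linker might generate for the mention "cold".
candidates = ["CONCEPT_COMMON_COLD", "CONCEPT_COLD_TEMPERATURE"]

# Assume the type predictor chose "Disease or Syndrome" from the context.
kept = filter_candidates(candidates, {"Disease or Syndrome"}, concept_types)
print(kept)
```

In this sketch a candidate with no known semantic type is dropped; whether a real pipeline should keep such candidates instead is a design choice the caption does not settle.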
Grouping of the 127 semantic types in the UMLS Metathesaurus into 24 semantic groups. The semantic groups were derived from McCray et al. [48] and is-a relationships in the Semantic Network. Refer to Section 3.3 for details.
| Groups | Semantic Types |
|---|---|
| Activities & Behaviors | Activity, Behavior, Daily or Recreational Activity, Event, Governmental or Regulatory Activity, Individual Behavior, Machine Activity, Occupational Activity, Social Behavior |
| Anatomy | Anatomical Structure, Body Location or Region, Body Part, Organ, or Organ Component, Body Space or Junction, Body Substance, Body System, Cell, Cell Component, Embryonic Structure, Fully Formed Anatomical Structure, Tissue |
| Chemicals & Drugs | Amino Acid, Peptide, or Protein, Antibiotic, Biologically Active Substance, Biomedical or Dental Material, Chemical, Chemical Viewed Functionally, Chemical Viewed Structurally, Element, Ion, or Isotope, Enzyme, Hazardous or Poisonous Substance, Hormone, Immunologic Factor, Indicator, Reagent, or Diagnostic Aid, Inorganic Chemical, Nucleic Acid, Nucleoside, or Nucleotide, Receptor, Vitamin |
| Concepts & Ideas | Classification, Conceptual Entity, Group Attribute, Idea or Concept, Intellectual Product, Language, Quantitative Concept, Regulation or Law, Spatial Concept, Temporal Concept |
| Devices | Drug Delivery Device, Medical Device, Research Device |
| Disease or Syndrome | Disease or Syndrome |
| Disorders | Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Experimental Model of Disease, Injury or Poisoning |
| Finding | Finding |
| Functional Concept | Functional Concept |
| Genes & Molecular Sequences | Amino Acid Sequence, Carbohydrate Sequence, Gene or Genome, Molecular Sequence, Nucleotide Sequence |
| Living Beings | Age Group, Amphibian, Animal, Archaeon, Bacterium, Bird, Eukaryote, Family Group, Fish, Fungus, Group, Human, Mammal, Organism, Patient or Disabled Group, Plant, Population Group, Professional or Occupational Group, Reptile, Vertebrate, Virus |
| Mental or Behavioral Dysfunction | Mental or Behavioral Dysfunction |
| Neoplastic Process | Neoplastic Process |
| Objects | Geographic Area, Entity, Food, Manufactured Object, Physical Object, Substance |
| Occupations | Biomedical Occupation or Discipline, Occupation or Discipline |
| Organic Chemical | Organic Chemical |
| Organizations | Health Care Related Organization, Organization, Professional Society, Self-help or Relief Organization |
| Pathologic Function | Pathologic Function |
| Pharmacologic Substance | Clinical Drug, Pharmacologic Substance |
| Phenomena | Biologic Function, Environmental Effect of Humans, Human-caused Phenomenon or Process, Laboratory or Test Result, Natural Phenomenon or Process, Phenomenon or Process |
| Physiology | Cell Function, Clinical Attribute, Genetic Function, Mental Process, Molecular Function, Organ or Tissue Function, Organism Attribute, Organism Function, Physiologic Function |
| Procedures | Diagnostic Procedure, Educational Activity, Health Care Activity, Laboratory Procedure, Molecular Biology Research Technique, Research Activity, Therapeutic or Preventive Procedure |
| Qualitative Concept | Qualitative Concept |
| Sign or Symptom | Sign or Symptom |
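As a rough illustration of how the grouping above might be used in code, the sketch below inverts a small excerpt of the table into a lookup from fine-grained semantic type to semantic group. Only four of the 24 groups are included, and the names `SEMANTIC_GROUPS` and `TYPE_TO_GROUP` are hypothetical.

```python
from typing import Dict, List

# Excerpt of the grouping table: semantic group -> fine-grained semantic types.
SEMANTIC_GROUPS: Dict[str, List[str]] = {
    "Devices": ["Drug Delivery Device", "Medical Device", "Research Device"],
    "Occupations": ["Biomedical Occupation or Discipline", "Occupation or Discipline"],
    "Pharmacologic Substance": ["Clinical Drug", "Pharmacologic Substance"],
    "Sign or Symptom": ["Sign or Symptom"],
}

# Invert the table so each fine-grained type maps to its group.
TYPE_TO_GROUP: Dict[str, str] = {
    semantic_type: group
    for group, semantic_types in SEMANTIC_GROUPS.items()
    for semantic_type in semantic_types
}

print(TYPE_TO_GROUP["Clinical Drug"])
```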
Fig. 3. Constructing WikiMed from Wikipedia data. We map each linked mention in Wikipedia articles to a UMLS concept using mappings obtained from the Freebase, Wikidata, and NCBI knowledge bases.
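A minimal sketch of the mapping step in this caption follows, under stated assumptions: each knowledge base is reduced to a dictionary from Wikipedia page title to concept identifier, the mappings are consulted in a fixed order, and mentions whose link target has no mapping are dropped. The data, identifiers, and helper names are hypothetical; the caption does not specify how conflicts between knowledge bases are resolved.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class LinkedMention(NamedTuple):
    text: str        # anchor text of the hyperlink in the article
    page_title: str  # title of the Wikipedia page the link points to


def lookup_concept(page_title: str, mappings: List[Dict[str, str]]) -> Optional[str]:
    """Return the first concept identifier found for a page title, if any."""
    for mapping in mappings:
        if page_title in mapping:
            return mapping[page_title]
    return None


def annotate(
    mentions: List[LinkedMention], mappings: List[Dict[str, str]]
) -> List[Tuple[str, str]]:
    """Pair each linked mention with a concept identifier; skip unmapped links."""
    annotated = []
    for mention in mentions:
        concept_id = lookup_concept(mention.page_title, mappings)
        if concept_id is not None:
            annotated.append((mention.text, concept_id))
    return annotated


# Hypothetical page-title -> concept mappings from two sources.
source_a = {"Aspirin": "CONCEPT_ASPIRIN"}
source_b = {"Headache": "CONCEPT_HEADACHE"}

mentions = [
    LinkedMention("aspirin", "Aspirin"),
    LinkedMention("headaches", "Headache"),
    LinkedMention("Paris", "Paris"),  # no medical mapping, so it is dropped
]
annotations = annotate(mentions, [source_a, source_b])
print(annotations)
```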
Details of the medical entity linking datasets used in our experiments; #Unq Concepts refers to the number of unique CUIs in each dataset. WikiMed is our novel automatically annotated Wikipedia dataset, and PubMedDS is our novel distantly supervised dataset.
| Datasets | #Documents | #Sentences | #Mentions | #Unq Concepts |
|---|---|---|---|---|
| NCBI | 792 | 7,645 | 6,817 | 1,638 |
| Bio CDR | 1,500 | 14,166 | 28,559 | 9,149 |
| ShARe | 431 | 27,246 | 17,809 | 1,719 |
| MedMentions | 4,392 | 42,602 | 352,496 | 34,724 |
| WikiMed | 393,618 | 11,331,321 | 1,067,083 | 57,739 |
| PubMedDS | 13,197,430 | 127,670,590 | 57,943,354 | 44,881 |
Frequencies of semantic types in our evaluation datasets and novel training datasets. Overall, we find that our WikiMed and PubMedDS datasets give diverse coverage across all semantic types.
| Categories | NCBI | Bio CDR | ShARe | MedMentions | WikiMed | PubMedDS |
|---|---|---|---|---|---|---|
| Activities & Behaviors | 4 | 7 | 1 | 12,249 | 554 | 2,725,161 |
| Anatomy | 3 | 29 | 4 | 19,098 | 14,366 | 10,688,138 |
| Chemicals & Drugs | 0 | 32,436 | 1 | 46,420 | 26,809 | 44,476,957 |
| Concepts & Ideas | 0 | 0 | 1 | 60,475 | 2,562 | 5,274,354 |
| Devices | 0 | 0 | 0 | 2,691 | 483 | 242,599 |
| Disease or Syndrome | 10,760 | 22,603 | 5,895 | 11,709 | 84,706 | 9,846,667 |
| Disorders | 664 | 1,853 | 997 | 3,575 | 8,635 | 1,115,186 |
| Finding | 749 | 2,220 | 500 | 15,666 | 9,285 | 1,778,023 |
| Functional Concept | 0 | 0 | 1 | 23,672 | 117 | 48,553 |
| Genes & Molecular Sequences | 20 | 0 | 0 | 5,582 | 446 | 281,662 |
| Living Beings | 0 | 43 | 7 | 31,691 | 919,694 | 21,339,662 |
| Mental or Behavioral Dysfunction | 293 | 3,657 | 410 | 2,463 | 19,196 | 2,353,547 |
| Neoplastic Process | 4,022 | 2,301 | 323 | 4,635 | 16,823 | 1,476,843 |
| Objects | 0 | 129 | 2 | 10,357 | 421 | 5,184,355 |
| Occupations | 0 | 0 | 0 | 1,443 | 1,156 | 654,604 |
| Organic Chemical | 0 | 90,428 | 1 | 10,258 | 17,330 | 50,248,085 |
| Organizations | 0 | 0 | 0 | 2,276 | 0 | 298,119 |
| Pathologic Function | 143 | 3,290 | 2,285 | 4,121 | 4,474 | 1,895,835 |
| Pharmacologic Substance | 0 | 90,872 | 1 | 11,935 | 24,878 | 50,696,769 |
| Phenomena | 4 | 163 | 2 | 7,210 | 317 | 1,722,873 |
| Physiology | 15 | 166 | 3 | 24,753 | 2,054 | 10,674,561 |
| Procedures | 5 | 73 | 4 | 37,616 | 4,008 | 7,471,434 |
| Qualitative Concept | 0 | 0 | 7 | 32,564 | 106 | 1,211,747 |
| Sign or Symptom | 211 | 9,844 | 2,687 | 1,809 | 4,212 | 3,750,734 |
Fig. 4. Constructing PubMedDS using distant supervision on the PubMed corpus. For each article, we apply biomedical NER to its abstract to obtain relevant entity mentions, which are then linked using supervision from the MeSH headings of the article. Refer to Section 4.2 for details.
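One plausible reading of this distant-supervision step is sketched below. It assumes that mentions have already been extracted by an NER tool and that a mention is linked only when its text matches, case-insensitively, the name of one of the article's MeSH headings. The matching rule, the function name `link_with_mesh`, and all identifiers are assumptions for illustration, not the procedure from Section 4.2.

```python
from typing import Dict, List, Tuple


def link_with_mesh(
    mentions: List[str], mesh_headings: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Link mentions whose text matches a MeSH heading name of the same article.

    mesh_headings maps heading name -> concept identifier for one article.
    Mentions without a matching heading are left unlinked (dropped).
    """
    by_name = {name.lower(): concept_id for name, concept_id in mesh_headings.items()}
    return [
        (mention, by_name[mention.lower()])
        for mention in mentions
        if mention.lower() in by_name
    ]


# Hypothetical article: NER output and MeSH headings with made-up identifiers.
ner_mentions = ["asthma", "inhaler", "Albuterol"]
article_mesh = {"Asthma": "CONCEPT_ASTHMA", "Albuterol": "CONCEPT_ALBUTEROL"}

linked = link_with_mesh(ner_mentions, article_mesh)
print(linked)
```

A strict matching rule of this kind would favor precision over coverage, since any mention not covered by the article's headings is discarded.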
Quality assessment of PubMedDS, based on the subset of documents it shares with the NCBI Disease Corpus, Bio CDR, and MedMentions. Precision and recall are calculated with respect to overlap between our automated annotations in PubMedDS and the gold standard annotations in the comparison datasets. We find that although PubMedDS has low coverage, extracted mentions have high precision across the three datasets.
| Documents shared with | Precision | Recall |
|---|---|---|
| NCBI | 86.3 | 6.5 |
| Bio CDR | 75.8 | 1.3 |
| MedMentions | 90.3 | 5.3 |
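The precision and recall figures above compare automated annotations with gold annotations on shared documents. A minimal sketch of such a comparison is given below, assuming each annotation is a (document id, start offset, end offset, concept id) tuple and that only exact matches count as overlap; the tuple layout and the toy data are hypothetical.

```python
from typing import Set, Tuple

# (document id, start offset, end offset, concept id)
Annotation = Tuple[str, int, int, str]


def precision_recall(
    predicted: Set[Annotation], gold: Set[Annotation]
) -> Tuple[float, float]:
    """Precision and recall of predicted annotations against gold annotations."""
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# Toy example: two automated annotations, four gold annotations, one in common.
automated = {("doc1", 0, 6, "CONCEPT_A"), ("doc1", 10, 15, "CONCEPT_B")}
gold = {
    ("doc1", 0, 6, "CONCEPT_A"),
    ("doc1", 20, 25, "CONCEPT_C"),
    ("doc1", 30, 35, "CONCEPT_D"),
    ("doc1", 40, 45, "CONCEPT_E"),
}
precision, recall = precision_recall(automated, gold)
print(precision, recall)
```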
Semantic type prediction results, comparing MedType (with and without additional corpora) to our four baselines; we report the area under the precision-recall curve as our evaluation metric. MT ← X denotes MedType first trained on dataset X and then fine-tuned using T. We find that MedType outperforms other methods on 3 out of 4 datasets. Also, pre-training on WikiMed and PubMedDS gives a substantial boost in performance. More details are provided in Section 6.1.
| | NCBI | Bio CDR | ShARe | MedMentions |
|---|---|---|---|---|
| AttentionNER | 94.5 | 89.1 | 88.7 | 72.0 |
| DeepType-FC | 95.1 | 82.9 | 89.3 | 72.9 |
| DeepType-RNN | 92.8 | 86.9 | 86.1 | 74.1 |
| Type-CNN | 95.2 | 88.9 | 89.8 | 74.4 |
| MedNER | 95.6 | 90.2 | 84.4 | 67.5 |
| MedType | 94.5 | 90.4 | 90.5 | 83.5 |
| MT ← W | 94.9 | 93.5 | 93.2 | 84.0 |
| MT ← P | 96.8 | | 93.6 | 86.8 |
| MT ← Both | | | | |
For quantifying the impact of semantic type prediction on medical entity linking, we report the F1-score for five medical entity linking methods on multiple datasets. For each method, the first row is its base performance, and the following rows indicate the change in F1-score on incorporating a type-based candidate concepts filtering step. Bold indicates the case when MedType performance matches with an oracle. We report the results with the oracle type predictors (fine-grained and coarse-grained) and MedType. Overall, we find that MedType gives performance comparable to an oracle and improves medical entity linking across all settings. Please refer to Section 6.2 for details.
| | NCBI (Exact) | NCBI (Partial) | Bio CDR (Exact) | Bio CDR (Partial) | ShARe (Exact) | ShARe (Partial) | MedMentions (Exact) | MedMentions (Partial) |
|---|---|---|---|---|---|---|---|---|
| | 39.6 | 45.0 | 54.2 | 56.3 | 33.8 | 34.6 | 36.7 | 39.8 |
| Oracle (Fine) | +0.8 | +1.0 | +0.3 | +0.4 | +0.5 | +0.6 | +6.4 | +6.9 |
| Oracle (Coarse) | +0.8 | +1.0 | +0.2 | +0.3 | +0.5 | +0.6 | +5.7 | +6.1 |
| Type-CNN | +0.7 | +0.8 | +0.2 | +0.3 | +0.2 | +0.3 | +3.6 | +3.8 |
| MedType | | | | | +0.3 | +0.4 | +4.0 | +4.3 |
| | 39.2 | 45.9 | 54.5 | 57.0 | 32.3 | 33.3 | 16.9 | 18.3 |
| Oracle (Fine) | +0.3 | +0.3 | +0.1 | +0.2 | +0.1 | +0.2 | +0.2 | +0.2 |
| Oracle (Coarse) | +0.3 | +0.3 | +0.1 | +0.2 | +0.1 | +0.2 | +0.2 | +0.2 |
| Type-CNN | +0.3 | +0.3 | +0.1 | +0.2 | +0.0 | +0.1 | +0.1 | +0.1 |
| MedType | | | | | +0.1 | +0.1 | | |
| | 35.4 | 39.4 | 50.3 | 51.5 | 27.1 | 27.5 | 32.6 | 35.2 |
| Oracle (Fine) | +5.9 | +5.9 | +2.7 | +2.8 | +4.7 | +4.8 | +7.2 | +7.8 |
| Oracle (Coarse) | +5.9 | +5.9 | +2.6 | +2.7 | +4.7 | +4.7 | +6.0 | +6.5 |
| Type-CNN | +5.7 | +5.7 | +2.3 | +2.4 | +4.1 | +4.1 | +3.9 | +4.0 |
| MedType | | | +2.5 | +2.6 | +4.3 | +4.4 | +4.4 | +4.6 |
| | 27.0 | 31.7 | 36.5 | 39.1 | 17.3 | 19.2 | 28.7 | 31.4 |
| Oracle (Fine) | +0.2 | +0.6 | +5.0 | +5.2 | +5.2 | +5.5 | +9.8 | +10.7 |
| Oracle (Coarse) | +0.2 | +0.6 | +4.5 | +4.6 | +5.1 | +5.4 | +7.7 | +8.5 |
| Type-CNN | +0.0 | +0.2 | +4.0 | +4.1 | +4.0 | +4.2 | +4.9 | +5.2 |
| MedType | +0.1 | +0.5 | +4.3 | +4.4 | +4.8 | +5.0 | +5.9 | +6.4 |
| | 43.1 | 47.5 | 49.4 | 53.7 | 25.4 | 29.0 | 37.2 | 40.6 |
| Oracle (Fine) | +2.2 | +4.1 | +1.7 | +2.6 | +3.5 | +5.1 | +8.2 | +9.4 |
| Oracle (Coarse) | +2.2 | +4.1 | +1.7 | +2.5 | +3.4 | +5.0 | +6.8 | +7.8 |
| Type-CNN | +1.7 | +3.6 | +0.5 | +1.2 | +2.9 | +4.0 | +3.5 | +3.9 |
| MedType | +1.9 | +3.8 | +1.3 | +2.2 | +3.1 | +4.5 | +4.1 | +4.6 |
Results of Partial_mention_id_match evaluation of ScispaCy on all four evaluation datasets. Evaluation is restricted to only predicted samples that overlap with gold annotations, to control for the effects of mention detection errors. The number of samples in this restricted subset of each dataset is given in the column headers.
| | NCBI | Bio CDR | ShARe | MedMentions |
|---|---|---|---|---|
| ScispaCy | 56.0 | 60.9 | 30.9 | 42.8 |
| Oracle (Fine) | +4.2 | +2.7 | +5.3 | +9.9 |
| Oracle (Coarse) | +4.2 | +2.6 | +5.3 | +8.1 |
| Type-CNN | +3.5 | +1.2 | +4.2 | +4.1 |
| MedType | +3.8 | +2.2 | +4.7 | +4.9 |
Type-wise analysis of the impact of pre-training MedType on WikiMed and PubMedDS for the NCBI, Bio CDR, ShARe, and MedMentions datasets. We report the F1-score for each semantic type. MT denotes MedType, ← W and ← P indicate MedType first pre-trained on the WikiMed and PubMedDS datasets, respectively, and ← B denotes MedType pre-trained on both datasets. '-' means that the semantic type was not part of the dataset.
| | NCBI MT | NCBI ← W | NCBI ← P | NCBI ← B | Bio CDR MT | Bio CDR ← W | Bio CDR ← P | Bio CDR ← B | ShARe MT | ShARe ← W | ShARe ← P | ShARe ← B | MedMentions MT | MedMentions ← W | MedMentions ← P | MedMentions ← B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Activities & Beh. | - | - | - | - | - | - | - | - | - | - | - | - | 71.9 | 71.7 | 74.4 | |
| Anatomy | - | - | - | - | - | - | - | - | - | - | - | - | 81.3 | 82.7 | | 86 |
| Chemicals & Drugs | - | - | - | - | 83 | 83 | 91.5 | | - | - | - | - | 77.8 | 78.1 | | |
| Concepts & Ideas | - | - | - | - | - | - | - | - | - | - | - | - | 80.5 | 81.2 | | |
| Devices | - | - | - | - | - | - | - | - | - | - | - | - | 52.2 | 46.4 | | 54.1 |
| Disease or Syn. | 94.5 | 95.5 | 97.2 | | 87.8 | 90.5 | 93.2 | | 84.6 | 91.3 | 92.3 | | 79 | 81 | 84.4 | |
| Disorders | 58.9 | 68.7 | 69 | | 82.4 | 79.4 | | 85.7 | 50.7 | 78 | 79.9 | | 62.1 | 64.4 | 67.9 | |
| Finding | 0 | 45 | 46.8 | | 59.6 | 77.1 | 86.1 | | 47.5 | 79.5 | 82.5 | | 54.8 | 57.5 | 58.5 | |
| Functional Concept | - | - | - | - | - | - | - | - | - | - | - | - | 76.7 | 76.4 | 77.2 | |
| Genes & Mol. Seq. | - | - | - | - | - | - | - | - | - | - | - | - | 67.8 | 67 | | |
| Living Beings | - | - | - | - | 0 | 0 | | 40 | - | - | - | - | 88.1 | 88.6 | 90.1 | |
| Mental/Beh. Dys. | 17.4 | 81.1 | | | 58.8 | 90.1 | 92.6 | | 48.4 | 83.2 | 78.8 | | 76.7 | 79 | 80.7 | |
| Neoplastic Process | 91.7 | 93.1 | | 92.7 | 90.9 | 90.8 | | 92.2 | 71.5 | 89.2 | 90.9 | | 85.6 | 86 | 87.4 | |
| Objects | - | - | - | - | 0 | 20.8 | | 29.2 | - | - | - | - | 72.3 | 71.6 | 75.7 | |
| Occupations | - | - | - | - | - | - | - | - | - | - | - | - | 46.7 | 47.1 | | 55.5 |
| Organic Chemical | - | - | - | - | 91.9 | 91.3 | | 94.1 | - | - | - | - | 71.9 | 73.6 | | 80.2 |
| Organizations | - | - | - | - | - | - | - | - | - | - | - | - | 73 | 74 | 75.6 | |
| Pathologic Function | 0 | 76.2 | | 80 | 59.6 | 86.2 | 90.2 | | 74.6 | 85.1 | 85.9 | | 65.6 | 69.9 | 70.1 | |
| Pharm. Substance | - | - | - | - | 92 | 91.8 | | 93.1 | - | - | - | - | 63.6 | 64.3 | | 70.3 |
| Phenomena | - | - | - | - | 33.3 | 74.3 | | 92.3 | - | - | - | - | 51.1 | 54.3 | | 60.7 |
| Physiology | - | - | - | - | 0 | 60.8 | | 60.8 | - | - | - | - | 72.7 | 74.6 | 77.3 | |
| Procedures | - | - | - | - | 0 | 0 | 44.4 | | - | - | - | - | 77.1 | 78.3 | | 80.2 |
| Qualitative Concept | - | - | - | - | - | - | - | - | - | - | - | - | 82.8 | 83.5 | 84.1 | |
| Sign or Symptom | 0 | 81.8 | | | 46.4 | 89.5 | 89.9 | | 80.6 | 92.8 | | 94.4 | 72.1 | 75.4 | 75.1 | |
Most frequent confusions in semantic type predictions on the MedMentions validation set, using MedType pretrained on WikiMed and PubMedDS.
| Target Semantic Type | Top Confused Semantic Types |
|---|---|
| Devices | Concepts & Ideas, Objects, Procedures |
| Disorders | Disease or Syndrome, Finding |
| Finding | Concepts & Ideas, Physiology, Functional Concept |
| Functional Concept | Procedures, Concepts & Ideas |
| Genes & Mol. Sequences | Chemicals & Drugs |
| Mental or Behavioral Dys. | Disease or Syndrome, Finding |
| Objects | Concepts & Ideas, Chemicals & Drugs |
| Occupations | Procedures, Concepts & Ideas, Functional Concept |
| Organic Chemical | Chemicals & Drugs, Pharmacologic Substance |
| Organizations | Concepts & Ideas, Procedures, Living Beings |
| Pathologic Function | Disease or Syndrome, Finding, Functional Concept |
| Pharmacologic Substance | Chemicals & Drugs, Organic Chemical |
Fig. 5. Outcomes of semantic type filtering in MedMentions data, in terms of reduction in candidate set size. All results are reported using the best-performing information extraction model (ScispaCy). Top graphs display candidate set reduction using oracle type filtering, broken down by whether the correct candidate was included in the list generated by ScispaCy. Bottom graphs illustrate corresponding outcomes from MedType and the strongest type prediction baseline (Type-CNN), broken down by whether the predicted type was correct. The number of samples each graph displays is provided, along with the percentage of these samples included in each reduction category.
Fig. 6. Error analysis of output predictions from all information extraction tools on the MedMentions test set (annotated set size: 70,405 mentions). False positive mentions are spurious entity spans extracted by the tools; Missing correct candidate cases indicate exclusion of the correct entity from the returned candidate list. Matched indicates that neither of these errors was present. Refer to Section 6.5 for details.
Fig. 7. Distribution of candidate set sizes in MedMentions using ScispaCy, comparing unfiltered concepts to candidate sets filtered using semantic type prediction strategies. Only mentions predicted by ScispaCy that included the correct CUI in the candidate set are included. Larger bars to the left-hand side of the figure indicate greater reductions in candidate set size.