| Literature DB >> 36032678 |
Abstract
The linguistic rules of medical terminology help readers become acquainted with rare and complex clinical and biomedical terms. Medical language follows a Greek- and Latin-derived nomenclature, which helps stakeholders simplify medical terms and gain semantic familiarity. However, natural language processing models misrepresent rare and complex biomedical words. In this study, we present MedTCS, a lightweight post-processing module that simplifies hybridized or compound terms into regular words using medical nomenclature. MedTCS enabled word-based embedding models to achieve 100% coverage and enabled the BioWordVec model to achieve high correlation scores (0.641 and 0.603 on the UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model, FastText-OA-All-300d, to improve the F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus. Similarly, in the drug indication classification task, our model increased coverage by 9% and the F1-score by 1%. Our results indicate that a medical terminology-based module provides distinctive contextual clues to enhance vocabulary as a post-processing step over pre-trained embeddings. We demonstrate that the proposed module enables word embedding models to generate vectors for out-of-vocabulary words effectively. We expect this study to serve as a stepping stone toward the use of biomedical knowledge-driven resources in NLP.
Keywords: biomedical nomenclature; linguistic approach; medical terminology; named entity recognition; natural language processing; out-of-vocabulary
Year: 2022 PMID: 36032678 PMCID: PMC9411640 DOI: 10.3389/fmolb.2022.928530
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
FIGURE 1 Understanding biomedical terms by mapping term components to the human organ systems.
FIGURE 2 MedTCS framework: (A) the MedTCS detector normalizes unknown terms and searches the vocabulary; (B) a rule-based pluralizer/singularizer sub-module normalizes unknown terms; (C) architecture of the term parser, where a compound word is decomposed into its components, which are mapped through the dictionary to their semantic words and encoded as their mean vector; (D) architecture of the term segmenter, where a pre-trained segmentation model splits the word into subwords that are encoded as their mean vector.
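The term-parser idea of Figure 2C can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `COMPONENTS` dictionary here is a hypothetical toy (the real MedTCS resource is far larger), and `parse_term` is an assumed greedy longest-match strategy. The paper's tokenization table (e.g. mastodynia → breast, pain, discomfort) supplies the example entries.

```python
# Toy Greek/Latin component dictionary (hypothetical; entries mirror the
# examples in the paper's tokenization table).
COMPONENTS = {
    "mast": ["breast"],
    "dynia": ["pain", "discomfort"],
    "dys": ["painful"],
    "pnea": ["breathing"],
    "derm": ["skin"],
    "itis": ["inflammation"],
}

def parse_term(term):
    """Greedily peel the longest known component off the term, allowing a
    trailing combining vowel (o/a) to be dropped, e.g. 'masto' -> 'mast'."""
    words, i = [], 0
    while i < len(term):
        match = None
        for j in range(len(term), i, -1):  # longest candidate first
            raw = term[i:j]
            if raw in COMPONENTS:
                match, i = raw, j
                break
            if raw.rstrip("ao") in COMPONENTS:
                match, i = raw.rstrip("ao"), j
                break
        if match:
            words.extend(COMPONENTS[match])
        else:
            i += 1  # no component starts here; skip one character
    return words

print(parse_term("mastodynia"))  # ['breast', 'pain', 'discomfort']
print(parse_term("dyspnea"))     # ['painful', 'breathing']
```

Per the figure caption, the vector for an out-of-vocabulary term would then be the mean of the embedding vectors of the recovered plain-English words.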
TABLE 1 Statistics of the datasets.
| Evaluation | Dataset | Corpus size | Type |
|---|---|---|---|
| Intrinsic evaluation | UMNSRS-similarity | 566 term pairs | Pairwise similarity |
| Intrinsic evaluation | UMNSRS-relatedness | 588 term pairs | Pairwise relatedness |
| Intrinsic evaluation | MyoSRS | 101 term pairs | Pairwise relatedness |
| Intrinsic evaluation | EHR-RelB | 3630 term pairs | Pairwise relatedness |
| Extrinsic evaluation | BC5CDR | 1500 articles | Disease name |
| Extrinsic evaluation | NCBI-Disease | 793 abstracts | Disease name |
| Extrinsic evaluation | DICE | 7231 sentences | Drug indication |
FIGURE 3 Performance changes of biomedical embedding models on the intrinsic-evaluation datasets of Table 1 after adding the MedTCS module.
FIGURE 4 Performance changes of clinical embedding models on the intrinsic-evaluation datasets of Table 1 after adding the MedTCS module.
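The Sp columns in the tables below report Spearman rank correlation between model-assigned similarities and human ratings over the covered term pairs. A minimal sketch of that computation, using the no-ties formula and toy numbers (the real datasets are UMNSRS, MyoSRS, and EHR-RelB, with cosine similarities from the embedding model):

```python
def spearman(xs, ys):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy example: human ratings vs model similarities for four term pairs.
human = [3.9, 1.2, 2.8, 0.5]      # e.g. averaged annotator scores
model = [0.81, 0.32, 0.66, 0.10]  # e.g. cosine similarities
print(spearman(human, model))     # 1.0 (identical rankings)
```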
TABLE 2 Comparison of sub-word embeddings with word embeddings + MedTCS on the UMNSRS-Similarity dataset.
| Model | Version | Sp |
|---|---|---|
| BERT | bert-base-uncased | 0.07 |
| BioBert | dmis-lab/biobert-v1.1 | 0.30 |
| BlueBert | bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12 | 0.36 |
| Bio_ClinicalBERT | emilyalsentzer/Bio_ClinicalBERT | 0.23 |
| SciBERT | allenai/scibert_scivocab_uncased | 0.18 |
| PubMedBERT | microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext | 0.23 |
| CODER | GanjinZero/UMLSBert_ENG | 0.47 |
| PubMed-w2v + MedTCS | PubMed-w2v.bin | 0.52 |
| PubMed-PMC-w2v + MedTCS | // | 0.49 |
| Wiki-PubMed-PMC-w2v + MedTCS | // | 0.49 |
| Bio-NLP-30 + MedTCS | Bio-NLP-30 | 0.63 |
| BioWordVec + MedTCS | BioWordVec | 0.64 |
TABLE 3 Comparison of the word embedding + MedTCS best scores with the latest reported results.
| Model | Similarity #pairs (of 566) | Similarity Sp | Relatedness #pairs (of 587) | Relatedness Sp | Model description |
|---|---|---|---|---|---|
| BioWordVec + Graph Embeddings (GCN) | 480 | 0.629 | 473 | 0.590 | A combined model of a graph convolutional network (GCN) path-based graph embedding with the BioWordVec embedding |
| Context2Vec + BioWordVec + PubMed + PMC | 471 | 0.634 | 484 | 0.561 | A composite model of contextual embeddings, with BioWordVec concatenated with PubMed and PMC word embeddings |
| CoderBERT | 543 | 0.543 | 564 | 0.473 | A BERT-based model obtained by fine-tuning a pre-trained BioBERT on UMLS synonyms and relations |
| SapBERT-S | 543 | 0.585 | 564 | 0.505 | A BERT-based model that fine-tunes a pre-trained PubMedBERT on UMLS using a self-alignment objective to cluster term concepts |
| BioWordVec + MedTCS | **566** | **0.641** | **587** | **0.603** | BioWordVec with our MedTCS module, used to extract vector representations of both known and unknown terms |
The highest correlation and coverage scores are shown in bold.
FIGURE 5 Performance changes of biomedical word embedding models on the NER datasets of Table 1 after adding the MedTCS module.
FIGURE 6 Performance changes of the clinical FastText embedding model on the NER datasets of Table 1 after adding the MedTCS module.
FIGURE 7 Model performance improvements with MedTCS on the drug indication classification task.
TABLE 4 Examples of the sub-word tokenization schemes of different algorithms alongside the medical terminology-based MedTCS module.
| Term | MedTCS | FastText | BioBert | CODER |
|---|---|---|---|---|
| mastodynia | breast, pain, discomfort | <ma,mas,ast,sto,tod,ody,dyn,yni,nia,ia> | [CLS],mast,##ody,##nia,[SEP] | [CLS],mast,##odynia,[SEP] |
| prostatism | prostate, gland, state, of, or, condition | <pr,pro,ros,ost,sta,tat,ati,tis,ism,sm> | [CLS],pro,##sta,##tism,[SEP] | [CLS],prost,##atism,[SEP] |
| prostatorrhea | prostate, gland, flow, excessive discharge | <pr,pro,ros,ost,sta,tat,ato,tor,orr,rrh,rhe,hea,ea> | [CLS],pro,##sta,##tor,##r,##hea,[SEP] | [CLS],prost,##ator,##rh,##ea,[SEP] |
| blepharospasm | eyelid, or, eyelash, sudden, or, involuntary | <bl,ble,lep,eph,pha,har,aro,ros,osp,spa,pas,asm,sm> | [CLS],b,##le,##pha,##ros,##pas,##m,[SEP] | [CLS],ble,##pha,##rosp,##asm,[SEP] |
| dyslipidemia | painful, fat, blood, condition | <dy,dys,ysl,sli,lip,ipi,pid,ide,dem,emi,mia,ia> | [CLS],d,##ys,##lip,##ide,##mia,[SEP] | [CLS],dyslipidemia,[SEP] |
| dyspnea | painful, breathing | <dy,dys,ysp,spn,pne,nea,ea> | [CLS],d,##ys,##p,##nea,[SEP] | [CLS],dyspnea,[SEP] |
| urethrorrhea | urethra, flow, excessive discharge | <ur,ure,ret,eth,thr,hro,ror,orr,rrh,rhe,hea,ea> | [CLS],u,##ret,##hr,##or,##r,##hea,[SEP] | [CLS],ureth,##ro,##r,##rh,##ea,[SEP] |
| arteriosclerosis | artery, hardening | <ar,art,rte,ter,eri,rio,ios,osc,scl,cle,ler,ero,ros,osi,sis,is> | [CLS],art,##eri,##os,##cle,##rosis,[SEP] | [CLS],arterio,##sc,##ler,##osis,[SEP] |
| dermatitis | skin, inflammation | <de,der,erm,rma,mat,ati,tit,iti,tis,is> | [CLS],der,##mat,##itis,[SEP] | [CLS],dermatitis,[SEP] |
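The FastText column above uses boundary-marked character n-grams. A short sketch of how those subword strings are produced (the table shows n = 3; note that real FastText uses a range of lengths, 3 to 6 by default, plus the whole word itself as an extra feature):

```python
def char_ngrams(word, n=3):
    """Boundary-marked character n-grams in the style of FastText's
    subword features: the word is wrapped in '<' and '>' markers before
    n-grams are extracted."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(",".join(char_ngrams("mastodynia")))
# <ma,mas,ast,sto,tod,ody,dyn,yni,nia,ia>
```

Because these n-grams carry no morphological knowledge, a vector built from them cannot recover that "mast-" means breast, which is the gap the MedTCS column fills.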