Isabel Segura-Bedmar, David Camino-Perdones, Sara Guerrero-Aspizua.
Keywords: Deep learning; Named entity recognition; Rare diseases
Year: 2022 PMID: 35794528 PMCID: PMC9258216 DOI: 10.1186/s12859-022-04810-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Annotated sentences from the RareDis corpus. Sentence (a) shows two overlapping entities, a sign and a disease, and a 'Produces' relationship between a rare disease and a sign. Sentence (b1) contains nested named entities of different types: a symptom and a rare disease. Sentence (b2) contains a multi-token mention of a rare disease. Sentence (c) contains several discontinuous mentions of signs.
Statistics of the RareDis corpus
| | Training | Validation | Test | Total |
|---|---|---|---|---|
| Documents | 729 | 104 | 208 | 1041 |
| Sentences | 6451 | 903 | 1787 | 9141 |
| Tokens | 135,656 | 18,492 | 37,893 | 192,041 |
| Diseases | 1647 | 230 | 454 | 2331 |
| Rare Diseases | 3608 | 525 | 1095 | 5228 |
| Symptoms | 319 | 24 | 54 | 397 |
| Signs | 3744 | 528 | 958 | 5230 |
Fig. 2 BiLSTM method. This figure shows the architecture of the BiLSTM network followed by a softmax layer.
Fig. 3 BiLSTM + CRF method. This figure shows the architecture of the BiLSTM network with a CRF classifier.
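The practical difference between the softmax output layer of Fig. 2 and the CRF layer of Fig. 3 can be illustrated with a small decoding sketch. It is not the authors' implementation: the label set is reduced to three tags, and the emission and transition scores are hand-picked values chosen only to make the effect visible. The function names `greedy_decode` and `viterbi_decode` are hypothetical.

```python
# Illustrative only: hand-picked scores, not values from the paper.
LABELS = ["O", "B-SIGN", "I-SIGN"]


def greedy_decode(emissions):
    """Softmax-style decoding: pick the best label for each token independently."""
    return [LABELS[max(range(len(LABELS)), key=lambda j: row[j])] for row in emissions]


def viterbi_decode(emissions, transitions):
    """CRF-style decoding: maximise emission + transition scores over the whole sequence."""
    n = len(LABELS)
    score = list(emissions[0])          # best score of any path ending in each label
    backpointers = []
    for row in emissions[1:]:
        step_score, step_back = [], []
        for j in range(n):
            # Best previous label i for reaching label j at this token.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            step_score.append(score[best_i] + transitions[best_i][j] + row[j])
            step_back.append(best_i)
        score = step_score
        backpointers.append(step_back)
    # Follow the back-pointers from the best final label.
    best = max(range(n), key=lambda j: score[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    return [LABELS[j] for j in reversed(path)]


# Two tokens; emissions[t][j] is the network's score for label j at token t.
emissions = [
    [1.0, 0.8, 0.1],   # token 0: slightly prefers O over B-SIGN
    [0.2, 0.1, 1.5],   # token 1: strongly prefers I-SIGN
]
# transitions[i][j] is the score of moving from label i to label j.
# O -> I-SIGN is heavily penalised because it is not a valid BIO sequence.
transitions = [
    [0.0, 0.0, -5.0],  # from O
    [0.0, 0.0, 1.0],   # from B-SIGN
    [0.0, 0.0, 1.0],   # from I-SIGN
]

greedy = greedy_decode(emissions)
crf = viterbi_decode(emissions, transitions)
print(greedy)  # ['O', 'I-SIGN']
print(crf)     # ['B-SIGN', 'I-SIGN']
```

With these toy scores, per-token decoding yields an `I-SIGN` tag that follows `O`, whereas sequence-level decoding prefers a well-formed `B-SIGN`, `I-SIGN` pair. In a trained model the transition scores would be learned rather than set by hand.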
Fig. 4 BERT-based method. This figure shows the architecture for the three BERT-based models.
Comparison of the methods.
| Approach | F1 |
|---|---|
| CRF | 0.6487 |
| BiLSTM (Wiki-PubMed-PMC) | 0.4326 |
| BiLSTM+CRF (Wiki-PubMed-PMC) | 0.5805 |
| BERT | 0.6710 |
| BioBERT | |
| ClinicalBERT | 0.6810 |
Best micro F1 is in bold
Entity-level results of CRF
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| DISEASE | 0.6991 | 0.4912 | 0.5770 | 454 |
| RAREDISEASE | 0.8332 | 0.8164 | 0.8247 | 1095 |
| SIGN | 0.5313 | 0.3987 | 0.4556 | 958 |
| SYMPTOM | 0.7778 | 0.5185 | 0.6222 | 54 |
| Micro-avg | 0.7112 | 0.5963 | 0.6487 | 2561 |
| Macro-avg | 0.7103 | 0.5562 | 0.6199 | 2561 |
| Macro-weighted | 0.6953 | 0.5963 | 0.6384 | 2561 |
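The three summary rows can be related to the per-label rows with a short calculation. The sketch below copies the four label rows of the entity-level CRF table and recomputes the macro and support-weighted averages, plus the F1 of the micro-averaged precision and recall. The helper names (`macro_average`, `weighted_average`, `f1`) are hypothetical, and reading "Macro-weighted" as a support-weighted mean is an assumption that the arithmetic appears to support.

```python
# Per-label rows copied from the entity-level CRF table:
# (precision, recall, F1, support)
rows = {
    "DISEASE":     (0.6991, 0.4912, 0.5770, 454),
    "RAREDISEASE": (0.8332, 0.8164, 0.8247, 1095),
    "SIGN":        (0.5313, 0.3987, 0.4556, 958),
    "SYMPTOM":     (0.7778, 0.5185, 0.6222, 54),
}


def macro_average(rows, column):
    """Unweighted mean of one metric column over all labels."""
    values = [r[column] for r in rows.values()]
    return sum(values) / len(values)


def weighted_average(rows, column):
    """Mean of one metric column, weighted by each label's support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[column] * r[3] for r in rows.values()) / total


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for name, column in (("precision", 0), ("recall", 1), ("F1", 2)):
    print(f"{name}: macro={macro_average(rows, column):.4f} "
          f"weighted={weighted_average(rows, column):.4f}")

# F1 of the micro-averaged precision and recall reported in the table.
print(f"micro F1: {f1(0.7112, 0.5963):.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the table in the last digit. The macro average treats the 54 SYMPTOM mentions the same as the 1095 RAREDISEASE mentions, which is why it sits below the weighted and micro scores here.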
Token-level results of CRF
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| B-DISEASE | 0.7116 | 0.5124 | 0.5958 | 454 |
| I-DISEASE | 0.7133 | 0.5225 | 0.6032 | 400 |
| B-RAREDISEASE | 0.8464 | 0.8369 | 0.8416 | 1095 |
| I-RAREDISEASE | 0.8681 | 0.8261 | 0.8466 | 1179 |
| B-SYMPTOM | 0.8286 | 0.5800 | 0.6824 | 54 |
| I-SYMPTOM | 0.6429 | 0.2250 | 0.3333 | 80 |
| B-SIGN | 0.5883 | 0.4894 | 0.5343 | 958 |
| I-SIGN | 0.5591 | 0.3991 | 0.4658 | 2215 |
| Micro-avg | 0.7112 | 0.5818 | 0.6400 | 6243 |
| Macro-avg | 0.7198 | 0.5489 | 0.6129 | 6243 |
| Macro-weighted | 0.6945 | 0.5818 | 0.6292 | 6243 |
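The `B-` and `I-` labels in the token-level table follow the BIO scheme: the first token of a mention is tagged `B-<TYPE>`, the remaining tokens `I-<TYPE>`, and everything else `O`. The sketch below shows that mapping on an invented sentence. The sentence, the span positions and the function name `to_bio` are assumptions for illustration and are not taken from the RareDis corpus. The sketch also assumes non-overlapping, contiguous spans, so it does not cover the nested or discontinuous mentions described in Fig. 1.

```python
def to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end, label) token offsets, end exclusive.
    Assumes spans are contiguous and do not overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


# Invented example sentence with two hypothetical annotations.
tokens = ["Patients", "with", "cystic", "fibrosis", "often",
          "develop", "chronic", "cough"]
spans = [(2, 4, "RAREDISEASE"), (6, 8, "SIGN")]

tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")

# Each mention contributes exactly one B- tag.
n_begin = sum(tag.startswith("B-") for tag in tags)
print(n_begin == len(spans))
```

Since every mention produces exactly one `B-` tag, the support of each `B-` label in the token-level table equals the support of the corresponding entity type in the entity-level table (for example, 454 for both B-DISEASE and DISEASE).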
Entity-level results of BiLSTM models.
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| DISEASE | 0.4387 | 0.2913 | 0.3502 | 454 |
| RAREDISEASE | 0.4592 | 0.4712 | 0.4651 | 1095 |
| SIGN | 0.3224 | 0.3256 | 958 | |
| SYMPTOM | 0.0000 | 0.0000 | 0.0000 | 54 |
| Micro-avg | 0.3668 | 0.3742 | 0.3705 | 2561 |
| Macro-avg | 0.2454 | 0.2170 | 0.2282 | 2561 |
| Macro-weighted | 0.3946 | 0.3742 | 0.3820 | 2561 |
| DISEASE | 0.4432 | 0.3071 | 0.3628 | 454 |
| RAREDISEASE | 0.4796 | 0.4971 | 0.4882 | 1095 |
| SIGN | 0.3166 | 0.3419 | 0.3287 | 958 |
| SYMPTOM | 0.4571 | 0.3200 | 0.3765 | 54 |
| Micro-avg | 0.3724 | 0.4020 | 0.3866 | 2561 |
| Macro-avg | 0.3393 | 0.2932 | 0.3112 | 2561 |
| Macro-weighted | 0.4084 | 0.4020 | 0.4028 | 2561 |
| DISEASE | 0.4246 | 0.3622 | 0.3909 | 454 |
| RAREDISEASE | 0.5194 | 0.5356 | 1095 | |
| SIGN | 0.3114 | 958 | ||
| SYMPTOM | 54 | |||
| Micro-avg | 0.3850 | 0.4190 | 2561 | |
| Macro-avg | 0.3742 | 0.3630 | 2561 | |
| Macro-weighted | 0.4236 | 0.4387 | 2561 | |
| DISEASE | 454 | |||
| RAREDISEASE | 0.5388 | 1095 | ||
| SIGN | 0.3167 | 0.3570 | 0.3356 | 958 |
| SYMPTOM | 0.5946 | 0.4074 | 0.4835 | 54 |
| Micro-avg | 0.4494 | 2561 | ||
| Macro-avg | 0.3474 | 2561 | ||
| Macro-weighted | 0.4494 | 2561 | ||
Best micro and macro scores are in bold. Best scores for each entity type are also in bold
Token-level results of BiLSTM.
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| B-DISEASE | 0.6105 | 0.3102 | 0.4113 | 454 |
| I-DISEASE | 0.6447 | 0.3660 | 0.4669 | 400 |
| B-RAREDISEASE | 0.6232 | 0.5804 | 0.6010 | 1095 |
| I-RAREDISEASE | 0.7812 | 0.6631 | 0.7174 | 1179 |
| B-SYMPTOM | 0.0000 | 0.0000 | 0.0000 | 54 |
| I-SYMPTOM | 0.0000 | 0.0000 | 0.0000 | 80 |
| B-SIGN | 0.5930 | 0.3311 | 0.4249 | 958 |
| I-SIGN | 0.5924 | 0.4323 | 0.4999 | 2215 |
| Micro-avg | 0.6403 | 0.4633 | 0.5376 | 6243 |
| Macro-avg | 0.4806 | 0.3354 | 0.3902 | 6243 |
| Macro-weighted | 0.6227 | 0.4633 | 0.5271 | 6243 |
| B-DISEASE | 0.6301 | 0.3690 | 0.4654 | 454 |
| I-DISEASE | 0.6807 | 0.3256 | 0.4405 | 400 |
| B-RAREDISEASE | 0.6729 | 0.6392 | 0.6556 | 1095 |
| I-RAREDISEASE | 0.8259 | 0.6375 | 0.7196 | 1179 |
| B-SYMPTOM | 0.6452 | 0.4082 | 0.5000 | 54 |
| I-SYMPTOM | 0.5000 | 0.0263 | 0.0500 | 80 |
| B-SIGN | 0.5980 | 0.4178 | 0.4919 | 958 |
| I-SIGN | 0.6203 | 0.4477 | 0.5200 | 2215 |
| Micro-avg | 0.6685 | 0.4906 | 0.5659 | 6243 |
| Macro-avg | 0.6466 | 0.4089 | 0.4804 | 6243 |
| Macro-weighted | 0.6640 | 0.4906 | 0.5593 | 6243 |
| B-DISEASE | 0.6230 | 0.4198 | 0.5016 | 454 |
| I-DISEASE | 0.6320 | 0.4553 | 0.5293 | 400 |
| B-RAREDISEASE | 0.6838 | 0.6765 | 0.6801 | 1095 |
| I-RAREDISEASE | 0.8321 | 0.6702 | 0.7424 | 1179 |
| B-SYMPTOM | 0.6562 | 0.4286 | 0.5185 | 54 |
| I-SYMPTOM | 0.6667 | 0.1053 | 0.1818 | 80 |
| B-SIGN | 0.5937 | 0.5354 | 0.5630 | 958 |
| I-SIGN | 0.5994 | 0.5454 | 0.5711 | 2215 |
| Micro-avg | 0.6544 | 6243 | ||
| Macro-avg | 0.6609 | 0.5360 | 6243 | |
| Macro-weighted avg | 0.6568 | 6243 | ||
| B-DISEASE | 0.7600 | 0.4718 | 0.5822 | 454 |
| I-DISEASE | 0.7546 | 0.5150 | 0.6122 | 400 |
| B-RAREDISEASE | 0.7163 | 0.6636 | 0.6889 | 1095 |
| I-RAREDISEASE | 0.8489 | 0.6480 | 0.7350 | 1179 |
| B-SYMPTOM | 0.6765 | 0.4600 | 0.5476 | 54 |
| I-SYMPTOM | 1.0000 | 0.0750 | 0.1395 | 80 |
| B-SIGN | 0.5318 | 0.5106 | 0.5210 | 958 |
| I-SIGN | 0.5807 | 0.4614 | 0.5142 | 2215 |
| Micro-avg | 0.5369 | 0.5956 | 6243 | |
| Macro-avg | 0.4757 | 6243 | ||
| Macro-weighted avg | 0.5369 | 0.5934 | 6243 | |
Best micro and macro scores are in bold
Entity-level results of BiLSTM-CRF models.
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| DISEASE | 0.5414 | 0.3780 | 0.4451 | 454 |
| RAREDISEASE | 0.6540 | 0.7144 | 0.6829 | 1095 |
| SIGN | 0.4892 | 0.4391 | 0.4628 | 958 |
| SYMPTOM | 54 | |||
| Micro-avg | 0.5421 | 0.5494 | 0.5457 | 2561 |
| Macro-avg | 0.4223 | 0.4563 | 2561 | |
| Macro-weighted | 0.5748 | 0.5494 | 0.5582 | 2561 |
| Google news | ||||
| DISEASE | 0.5597 | 0.4304 | 0.4866 | 454 |
| RAREDISEASE | 0.6482 | 0.7548 | 0.6975 | 1095 |
| SIGN | 0.4166 | 0.4675 | 958 | |
| SYMPTOM | 0.6667 | 0.5600 | 0.6087 | 54 |
| Micro-avg | 0.5556 | 0.5654 | 0.5604 | 2561 |
| Macro-avg | 0.4815 | 0.4324 | 0.4521 | 2561 |
| Macro-weighted | 0.5887 | 0.5654 | 0.5711 | 2561 |
| DISEASE | 0.4720 | 0.4899 | 454 | |
| RAREDISEASE | 0.7240 | 1095 | ||
| SIGN | 0.5068 | 958 | ||
| SYMPTOM | 0.5385 | 0.5600 | 0.5490 | 54 |
| Micro-avg | 0.5489 | 0.5821 | 0.5650 | 2561 |
| Macro-avg | 0.4480 | 0.4508 | 0.4490 | 2561 |
| Macro-weighted | 0.5937 | 0.5821 | 0.5874 | 2561 |
| DISEASE | 0.4890 | 454 | ||
| RAREDISEASE | 0.6339 | 0.7030 | 1095 | |
| SIGN | 0.4994 | 0.4562 | 0.4768 | 958 |
| SYMPTOM | 0.6739 | 0.5741 | 0.6200 | 54 |
| Micro-avg | 2561 | |||
| Macro-avg | 0.5056 | 2561 | ||
| Macro-weighted | 2561 | |||
Best micro and macro scores are in bold. Best scores for each entity type are also in bold
Token-level results of BiLSTM+CRF models.
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| B-DISEASE | 0.5714 | 0.3957 | 0.4676 | 454 |
| I-DISEASE | 0.5649 | 0.4640 | 0.5095 | 400 |
| B-RAREDISEASE | 0.6858 | 0.7490 | 0.7160 | 1095 |
| I-RAREDISEASE | 0.7703 | 0.7710 | 0.7707 | 1179 |
| B-SYMPTOM | 0.9375 | 0.6122 | 0.7407 | 54 |
| I-SYMPTOM | 0.8333 | 0.2632 | 0.4000 | 80 |
| B-SIGN | 0.6029 | 0.5616 | 0.5816 | 958 |
| I-SIGN | 0.6112 | 0.5669 | 0.5882 | 2215 |
| Micro-avg | 0.6521 | 0.6118 | 0.6313 | 6243 |
| Macro-avg | 0.6972 | 0.5480 | 0.5968 | 6243 |
| Macro-weighted | 0.6499 | 0.6118 | 0.6270 | 6243 |
| B-DISEASE | 0.6123 | 0.4519 | 0.5200 | 454 |
| I-DISEASE | 0.5953 | 0.5130 | 0.5511 | 400 |
| B-RAREDISEASE | 0.6913 | 0.7990 | 0.7412 | 1095 |
| I-RAREDISEASE | 0.7727 | 0.8117 | 0.7917 | 1179 |
| B-SYMPTOM | 0.8108 | 0.6122 | 0.6977 | 54 |
| I-SYMPTOM | 0.6818 | 0.1974 | 0.3061 | 80 |
| B-SIGN | 0.6624 | 0.5308 | 0.5894 | 958 |
| I-SIGN | 0.7074 | 0.5236 | 0.6018 | 2215 |
| Micro-avg | 0.6103 | 0.6530 | 6243 | |
| Macro-avg | 0.5549 | 0.5999 | 6243 | |
| Macro-weighted | 0.6103 | 0.6450 | 6243 | |
| B-DISEASE | 0.5219 | 0.5428 | 0.5321 | 454 |
| I-DISEASE | 0.4875 | 0.6167 | 0.5445 | 400 |
| B-RAREDISEASE | 0.7792 | 0.7510 | 0.7649 | 1095 |
| I-RAREDISEASE | 0.8009 | 0.8037 | 0.8023 | 1179 |
| B-SYMPTOM | 0.6739 | 0.6327 | 0.6526 | 54 |
| I-SYMPTOM | 0.4878 | 0.2632 | 0.3419 | 80 |
| B-SIGN | 0.6372 | 0.5753 | 0.6047 | 958 |
| I-SIGN | 0.6566 | 0.5730 | 0.6120 | 2215 |
| Micro-avg | 0.6789 | 0.6583 | 6243 | |
| Macro-avg | 0.6306 | 0.6069 | 6243 | |
| Macro-weighted | 0.6798 | 0.6390 | 6243 | |
| B-DISEASE | 0.7616 | 0.5192 | 0.6174 | 454 |
| I-DISEASE | 0.7789 | 0.5550 | 0.6482 | 400 |
| B-RAREDISEASE | 0.6617 | 0.8295 | 0.7361 | 1095 |
| I-RAREDISEASE | 0.7694 | 0.8346 | 0.8007 | 1179 |
| B-SYMPTOM | 0.7273 | 0.6400 | 0.6809 | 54 |
| I-SYMPTOM | 0.6296 | 0.2125 | 0.3178 | 80 |
| B-SIGN | 0.5919 | 0.6015 | 0.5967 | 958 |
| I-SIGN | 0.5929 | 0.5589 | 0.5754 | 2215 |
| Micro-avg | 0.6621 | 0.6561 | 6243 | |
| Macro-avg | 0.6892 | 0.5939 | 6243 | |
| Macro-weighted | 0.6634 | 0.6535 | 6243 | |
Best micro and macro scores are in bold
Entity-level results of the BERT-based models.
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| BERT base | ||||
| DISEASE | 0.5197 | 0.6101 | 0.5613 | 454 |
| RAREDISEASE | 0.8008 | 0.8325 | 1095 | |
| SIGN | 0.5079 | 0.5515 | 958 | |
| SYMPTOM | 0.5469 | 0.6481 | 54 | |
| Micro avg | 0.6298 | 0.6710 | 2561 | |
| Macro avg | 0.5938 | 0.6821 | 0.6346 | 2561 |
| Macro-weighted | 0.6361 | 0.6743 | 2561 | |
| BioBERT | ||||
| DISEASE | 0.5607 | 0.6067 | 454 | |
| RAREDISEASE | 0.8530 | 1095 | ||
| SIGN | 0.5877 | 958 | ||
| SYMPTOM | 0.5143 | 0.6667 | 0.5806 | 54 |
| Micro avg | 0.7157 | 2561 | ||
| Macro avg | 0.6212 | 0.6530 | 2561 | |
| Macro-weighted | 0.7157 | 2561 | ||
| BioClinical BERT | ||||
| DISEASE | 0.6388 | 454 | ||
| RAREDISEASE | 0.8167 | 0.8584 | 0.8370 | 1095 |
| SIGN | 0.5296 | 0.5501 | 0.5397 | 958 |
| SYMPTOM | 0.6435 | 54 | ||
| Micro avg | 0.6625 | 0.7005 | 0.6810 | 2561 |
| Macro avg | 0.6831 | 2561 | ||
| Macro-weighted | 0.6627 | 0.7005 | 0.6810 | 2561 |
Best micro and macro scores are in bold. Best scores for each entity type are also in bold
Token-level results of the BERT-based models.
| Label | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| B-DISEASE | 0.6012 | 0.6637 | 0.6309 | 454 |
| I-DISEASE | 0.5186 | 0.5884 | 0.5513 | 400 |
| B-RAREDISEASE | 0.8451 | 0.9003 | 0.8718 | 1095 |
| I-RAREDISEASE | 0.8704 | 0.9024 | 0.8861 | 1179 |
| B-SYMPTOM | 0.6607 | 0.7400 | 0.6981 | 54 |
| I-SYMPTOM | 0.6000 | 0.4918 | 0.5405 | 80 |
| B-SIGN | 0.6514 | 0.7073 | 0.6782 | 958 |
| I-SIGN | 0.6725 | 0.7099 | 0.6907 | 2215 |
| Micro avg | 0.7353 | 0.7794 | 0.7567 | 6243 |
| Macro avg | 0.6775 | 0.7130 | 0.6935 | 6243 |
| Macro-weighted avg | 0.7379 | 0.7794 | 0.7579 | 6243 |
| B-DISEASE | 0.6356 | 0.7088 | 0.6702 | 454 |
| I-DISEASE | 0.5716 | 0.6964 | 0.6279 | 400 |
| B-RAREDISEASE | 0.8825 | 0.8816 | 0.8821 | 1095 |
| I-RAREDISEASE | 0.9142 | 0.8927 | 0.9033 | 1179 |
| B-SYMPTOM | 0.6349 | 0.8000 | 0.7080 | 54 |
| I-SYMPTOM | 0.5538 | 0.5538 | 0.5538 | 80 |
| B-SIGN | 0.7238 | 0.7049 | 0.7142 | 958 |
| I-SIGN | 0.7330 | 0.6978 | 0.7150 | 2215 |
| Micro avg | 0.7830 | 6243 | ||
| Macro avg | 0.7062 | 0.7218 | 6243 | |
| Macro-weighted avg | 6243 | |||
| B-DISEASE | 0.6503 | 0.6885 | 0.6689 | 454 |
| I-DISEASE | 0.5969 | 0.6557 | 0.6249 | 400 |
| B-RAREDISEASE | 0.8614 | 0.8807 | 0.8710 | 1095 |
| I-RAREDISEASE | 0.8829 | 0.9076 | 0.8951 | 1179 |
| B-SYMPTOM | 0.7547 | 0.8000 | 0.7767 | 54 |
| I-SYMPTOM | 0.7158 | 0.5231 | 0.6044 | 80 |
| B-SIGN | 0.6996 | 0.6961 | 0.6979 | 958 |
| I-SIGN | 0.7575 | 0.6220 | 0.6831 | 2215 |
| Micro avg | 0.7609 | 0.7742 | 6243 | |
| Macro avg | 0.7217 | 6243 | ||
| Macro-weighted avg | 0.7873 | 0.7609 | 0.6243 | 11,909 |
Best micro and macro scores are in bold