Antonio Jimeno, Ernesto Jimenez-Ruiz, Vivian Lee, Sylvain Gaudan, Rafael Berlanga, Dietrich Rebholz-Schuhmann.
Abstract
BACKGROUND: In recent years, the recognition of semantic types in the biomedical scientific literature has focused on named entities such as protein and gene names (PGNs) and gene ontology terms (GO terms). Other semantic types, such as diseases, have not received the same level of attention. Different solutions have been proposed to identify disease named entities in the scientific literature. While matching the terminology with language patterns suffers from low recall (e.g., Whatizit), other solutions make use of morpho-syntactic features to better cover the full scope of terminological variability (e.g., MetaMap). Currently, MetaMap, which is provided by the National Library of Medicine (NLM), is the state-of-the-art solution for the annotation of concepts from UMLS (Unified Medical Language System) in the literature. Nonetheless, its performance has not yet been assessed on an annotated corpus. In addition, little effort has been invested so far to generate an annotated dataset that links disease entities in text to disease entries in a database, thesaurus or ontology and that could serve as a gold standard to benchmark text mining solutions.
Year: 2008 PMID: 18426548 PMCID: PMC2352871 DOI: 10.1186/1471-2105-9-S3-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
NER evaluation.
This table shows the NER results. For each alignment and method, we first give the number of diseases annotated in the corpus (OMIM), then the total number of annotations produced by the method (Annotated), and finally how many of them agree with the benchmark (TP).
| Alignment | Method | OMIM | Annotated | TP | Precision | Recall | F-measure |
| Exact boundary | D look-up | 908 | 1063 | 584 | | | |
| Exact boundary | Statistical | 908 | 1051 | 274 | 26.07 | 30.18 | 27.97 |
| Exact boundary | MetaMap | 908 | 873 | 273 | 31.27 | 30.07 | 30.66 |
| Left alignment | D look-up | 908 | 1063 | 598 | | | |
| Left alignment | Statistical | 908 | 1051 | 390 | 37.11 | 42.95 | 39.82 |
| Left alignment | MetaMap | 908 | 873 | 348 | 39.86 | 38.33 | 39.08 |
| Right alignment | D look-up | 908 | 1063 | 685 | | | |
| Right alignment | Statistical | 908 | 1051 | 439 | 41.77 | 48.35 | 44.82 |
| Right alignment | MetaMap | 908 | 873 | 467 | 53.49 | 51.43 | 52.44 |
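The measures in this table can be reproduced from the raw counts. The sketch below assumes the standard definitions, which the record does not state explicitly: precision is TP divided by the number of annotations, recall is TP divided by the number of OMIM diseases in the corpus, and the F-measure is their harmonic mean. The function name `prf` is illustrative.

```python
def prf(tp, annotated, gold):
    """Return (precision, recall, F-measure) as percentages.

    Assumes the usual definitions: precision = TP / annotated,
    recall = TP / gold, F-measure = harmonic mean of the two.
    """
    precision = 100.0 * tp / annotated
    recall = 100.0 * tp / gold
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts (TP, Annotated, OMIM) from the exact-boundary rows of the table.
exact_rows = {
    "Statistical": (274, 1051, 908),
    "MetaMap": (273, 873, 908),
}

for method, (tp, annotated, gold) in exact_rows.items():
    p, r, f = prf(tp, annotated, gold)
    print(f"{method}: P={p:.2f} R={r:.2f} F={f:.2f}")
# Statistical: P=26.07 R=30.18 F=27.97
# MetaMap: P=31.27 R=30.07 F=30.66
```

The same function can be applied to any other row of counts in the table.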
Recognition of disease in sentences.
This table shows the total number of diseases in the benchmark (Benchmark), the number of distinct diseases among them (Diseases), the number of diseases annotated by each method (Annotated) and the number of distinct diseases that the method identified (second Diseases column). It then gives the false positives (FP), the true positives (TP) and the standard measures: recall, precision and F-measure.
| Method | Benchmark | Diseases (distinct) | Annotated | Diseases (identified) | FP | TP | Recall | Precision | F-measure |
| D look-up | 924 | 280 | 699 | 226 | 144 | 555 | 60.06 | 79.40 | |
| Statistical | 924 | 280 | 937 | 309 | 317 | 620 | | 66.17 | 66.63 |
| MetaMap | 924 | 280 | 590 | 192 | 95 | 495 | 53.57 | | 65.39 |
| Vote 1 | 924 | 280 | 1164 | 358 | 298 | 866 | | 74.40 | |
| Vote 2 | 924 | 280 | 696 | 228 | 124 | 572 | 61.90 | 82.18 | 70.62 |
| Vote 3 | 924 | 280 | 388 | 137 | 30 | 358 | 38.74 | | 54.57 |
Figure 1. Recognition of diseases in sentences based on voting. This figure compares the performance of the different voting combinations with the voting after removing the most frequent diseases (dotted lines).
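The record does not spell out the voting rule. The counts in the table are consistent with "Vote k" keeping an annotation when at least k of the three methods propose it, since Vote 1 annotates more than any single method and Vote 3 fewer. Under that assumption, a minimal sketch of the combination could look as follows; the function name, the method keys and the example annotations are all hypothetical.

```python
from collections import Counter


def vote(annotations_by_method, min_votes):
    """Keep each annotation proposed by at least `min_votes` methods.

    Assumed rule, not confirmed by the source: one vote per method.
    """
    counts = Counter()
    for annotations in annotations_by_method.values():
        counts.update(set(annotations))  # a method votes once per annotation
    return {a for a, n in counts.items() if n >= min_votes}


# Hypothetical (sentence id, disease term) pairs, for illustration only.
example = {
    "d_lookup": {(1, "aniridia"), (2, "ovarian cancer")},
    "statistical": {(1, "aniridia"), (2, "ovarian cancer"), (3, "thrombosis")},
    "metamap": {(1, "aniridia")},
}

vote1 = vote(example, 1)  # union of all methods
vote2 = vote(example, 2)  # at least two methods agree
vote3 = vote(example, 3)  # all three methods agree
```

Raising the threshold trades recall for precision, which matches the direction of the Vote 1 to Vote 3 rows.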
Most frequent terms in the disease recognition benchmark
| Frequency | Term |
| 67 | Ovarian cancer |
| 58 | Breast carcinoma |
| 51 | Glucose-6-phosphate dehydrogenase deficiency anemia |
| 26 | Myotonic dystrophy |
| 26 | Muscular dystrophy, duchenne |
| 21 | Enzyme deficiency |
| 20 | Adrenoleukodystrophy, Neonatal |
| 20 | Aniridia |
| 17 | Hereditary breast/ovarian cancer (BRCA1, BRCA2) |
| 17 | Familial cancer of breast |
| 15 | Phenylketonurias |
Recognition of disease in sentences; without the most frequent diseases.
This table shows the total number of diseases in the benchmark (Benchmark), the number of distinct diseases among them (Diseases), the number of diseases annotated by each method (Annotated) and the number of distinct diseases that the method identified (second Diseases column). It then gives the false positives (FP), the true positives (TP) and the standard measures: recall, precision and F-measure.
| Method | Benchmark | Diseases (distinct) | Annotated | Diseases (identified) | FP | TP | Recall | Precision | F-measure |
| D look-up | 586 | 269 | 542 | 217 | 143 | 399 | 68.09 | 73.62 | |
| Statistical | 586 | 269 | 760 | 299 | 313 | 447 | | 58.82 | 66.42 |
| MetaMap | 586 | 269 | 413 | 182 | 95 | 318 | 54.27 | | 63.66 |
| Vote 1 | 586 | 269 | 908 | 347 | 398 | 510 | | 56.17 | 68.27 |
| Vote 2 | 586 | 269 | 532 | 218 | 123 | 409 | 69.80 | 76.88 | |
| Vote 3 | 586 | 269 | 278 | 130 | 30 | 248 | 42.32 | | 57.41 |
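The benchmark sizes in the two recognition tables can be related to the list of most frequent terms. The check below assumes that the reduced benchmark was obtained by dropping exactly the eleven listed terms; the record does not say so explicitly, but the counts agree with that reading.

```python
# Frequencies copied from the "Most frequent terms" table.
frequent_terms = {
    "Ovarian cancer": 67,
    "Breast carcinoma": 58,
    "Glucose-6-phosphate dehydrogenase deficiency anemia": 51,
    "Myotonic dystrophy": 26,
    "Muscular dystrophy, duchenne": 26,
    "Enzyme deficiency": 21,
    "Adrenoleukodystrophy, Neonatal": 20,
    "Aniridia": 20,
    "Hereditary breast/ovarian cancer (BRCA1, BRCA2)": 17,
    "Familial cancer of breast": 17,
    "Phenylketonurias": 15,
}

benchmark_total = 924     # Benchmark column, full benchmark
benchmark_distinct = 280  # Diseases column, full benchmark

# Assumption: removing the listed terms yields the reduced benchmark.
remaining_total = benchmark_total - sum(frequent_terms.values())
remaining_distinct = benchmark_distinct - len(frequent_terms)

print(remaining_total, remaining_distinct)  # 586 269
```

Both results match the Benchmark and Diseases columns of the table without the most frequent diseases.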
Terms provided by the curators not found in our term selection.
The following terms were identified by the curators as candidate diseases, but the UMLS assigns them to a semantic type outside the set that we selected.
| CUI | Term | Semantic Type |
| C0003862 | Arthralgia | Sign or Symptom |
| C0011071 | Sudden death | Finding / Pathologic Function |
| C0011991 | Diarrhea | Sign or Symptom |
| C0018790 | Cardiac arrest | Pathologic Function |
| C0019054 | Haemolysis | Cell Function |
| C0026838 | Spasticity | Sign or Symptom |
| C0040053 | Thrombosis | Pathologic Function |
| C0085298 | Sudden cardiac death | Pathologic Function |
| C0264202 | Somatic dysfunction | Finding |
| C1257806 | Chromosomal instability | Cell or Molecular Dysfunction |
| C1384666 | Hearing impairment | Finding |
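To illustrate why these curated terms fall outside the term selection, the sketch below filters them by semantic type. The set `SELECTED_TYPES` is an assumption made for the example: the record does not list the semantic types that were actually selected, so two UMLS disease-related types stand in for them.

```python
# Assumption: stand-in for the selected set, which the record does not list.
SELECTED_TYPES = {"Disease or Syndrome", "Neoplastic Process"}

# (CUI, term, semantic type) rows copied from the table above.
curated = [
    ("C0003862", "Arthralgia", "Sign or Symptom"),
    ("C0011071", "Sudden death", "Finding / Pathologic Function"),
    ("C0011991", "Diarrhea", "Sign or Symptom"),
    ("C0018790", "Cardiac arrest", "Pathologic Function"),
    ("C0019054", "Haemolysis", "Cell Function"),
    ("C0026838", "Spasticity", "Sign or Symptom"),
    ("C0040053", "Thrombosis", "Pathologic Function"),
    ("C0085298", "Sudden cardiac death", "Pathologic Function"),
    ("C0264202", "Somatic dysfunction", "Finding"),
    ("C1257806", "Chromosomal instability", "Cell or Molecular Dysfunction"),
    ("C1384666", "Hearing impairment", "Finding"),
]


def outside_selection(terms, selected=SELECTED_TYPES):
    """Return (CUI, term) pairs whose semantic types all fall outside `selected`."""
    missed = []
    for cui, term, types in terms:
        # A term may carry several semantic types, separated by "/".
        type_set = {t.strip() for t in types.split("/")}
        if not type_set & selected:
            missed.append((cui, term))
    return missed


missed = outside_selection(curated)
```

With this stand-in selection, every row of the table ends up in `missed`, which is the situation the table describes.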