Jitendra Jonnagaddala, Toni Rose Jue, Nai-Wen Chang, Hong-Jie Dai.
Abstract
The rapidly growing biomedical literature calls for automatic approaches to recognizing and normalizing disease mentions, in order to increase the precision and effectiveness of disease-based information retrieval. A variety of methods have been proposed for disease named entity recognition and normalization. Among them, conditional random fields (CRFs) and dictionary lookup are widely used for named entity recognition and normalization, respectively. We developed a CRF-based model for the automated recognition of disease mentions and studied the effect of various techniques on improving normalization results based on the dictionary lookup approach. The dataset from the BioCreative V CDR track was used to report the performance of the developed normalization methods and to compare them with other existing dictionary lookup based normalization methods. The best configuration achieved an F-measure of 0.77 for disease normalization, outperforming the best dictionary lookup based baseline method studied in this work by 0.13 in F-measure.

Database URL: https://github.com/TCRNBioinformatics/DiseaseExtract
Year: 2016 PMID: 27504009 PMCID: PMC4976299 DOI: 10.1093/database/baw112
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Number of publications resulting from the search query ‘disease OR diseases OR disorder OR disorders’ from 2000 to 2014.
Figure 2. Overview of methods to extract disease information from the text.
Figure 3. Example of the BIESO tagging format used in this study and graphical representation of ‘paracetamol consumption for renal papillary necrosis or any of these cancers’ tagged as [O, O, O, B, I, E, O, O, O, O, S].
Figure 4. Sample dictionary entries.
Figure 5. Representation of the developed priority rules.
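
The BIESO scheme in the Figure 3 caption can be illustrated with a minimal Python sketch. It assumes entity boundaries are already known as token index spans with an exclusive end; the function name `bieso_tags` and the span representation are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple


def bieso_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Assign B/I/E/S/O tags given entity spans as (start, end) token indices.

    `end` is exclusive. Single-token entities get S; longer entities get
    B on the first token, E on the last, and I in between.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "E"
    return tags


tokens = ("paracetamol consumption for renal papillary necrosis "
          "or any of these cancers").split()
# Assumed spans: 'renal papillary necrosis' and 'cancers'.
spans = [(3, 6), (10, 11)]
print(bieso_tags(tokens, spans))
# ['O', 'O', 'O', 'B', 'I', 'E', 'O', 'O', 'O', 'O', 'S']
```

The output reproduces the tag sequence given in the Figure 3 caption.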
Summary distribution and characteristics of the training, development and test set
| | Training set | Development set | Test set |
|---|---|---|---|
| No. of documents | 500 | 500 | 500 |
| No. of sentences | 4597 | 4604 | 4800 |
| No. of tokens | 108 378 | 107 668 | 113 290 |
| Average word count | 216.76 | 215.34 | 226.58 |
| No. of disease mentions | 4182 | 4244 | 4424 |
| No. of MeSH IDs (excluding disease mentions without any IDs) | 4252 | 4328 | 4430 |
| No. of disease mentions without MeSH IDs | 32 | 16 | 61 |
| No. of unique disease mentions | 1384 | 1254 | 1337 |
| No. of unique MeSH IDs | 664 | 604 | 645 |
Sentence and token statistics were generated using the Stanford PTBTokenizer.
Comparison of DNER module performance (NER performance at mention level)
| Run | TP | FP | FN | P | R | F |
|---|---|---|---|---|---|---|
| BANNER | 3348 | 854 | 1076 | 0.80 | 0.76 | 0.78 |
| Our DNER module | 3351 | 637 | 1073 | **0.84** | 0.76 | 0.80 |
| Our DNER module + Post-processing | 3529 | 811 | 895 | 0.81 | **0.80** | **0.81** |
The bold value signifies the highest value within the column.
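
The P, R and F columns in these tables are consistent with the standard definitions of precision, recall and balanced F-measure computed from the TP, FP and FN counts. The sketch below assumes the balanced (F1) form, which the source does not state explicitly; the function name `prf` is hypothetical.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# BANNER row of the DNER comparison table.
p, r, f = prf(3348, 854, 1076)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.80 0.76 0.78
```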
Baseline methods performance on the test set (DNORM)
| Baseline method | TP | FP | FN | P | R | F |
|---|---|---|---|---|---|---|
| Simple dictionary lookup | 1341 | 1799 | 647 | 0.43 | **0.67** | 0.52 |
| BeCAS | 413 | 197 | 1575 | **0.68** | 0.21 | 0.32 |
| MetaMap | 1272 | 950 | 716 | 0.57 | 0.64 | 0.60 |
| OBA | 1219 | 592 | 772 | 0.67 | 0.61 | **0.64** |
The bold value signifies the highest value within the column.
Performance of proposed methods on the test set (DNORM)
| Configuration # | Configuration description | TP | FP | FN | P | R | F |
|---|---|---|---|---|---|---|---|
| 1 | DNER + Dictionary lookup | 758 | 65 | 1230 | 0.92 | 0.38 | 0.54 |
| 2 | 1 + Abbreviation resolution | 760 | 65 | 1228 | 0.92 | 0.38 | 0.54 |
| 3 | 2 + MEDIC vocabulary synonyms | 1177 | 105 | 811 | 0.92 | 0.59 | 0.72 |
| 4 | 3 + WordNet synonyms | 1220 | 121 | 768 | 0.91 | 0.61 | 0.73 |
| 5 | 4 + Query expansion + Priority Rules | 1342 | 158 | 646 | 0.89 | 0.68 | 0.77 |
| 6 | 5 + NER post-processing | 1371 | 184 | 617 | 0.88 | 0.69 | 0.77 |
Processing speed (in seconds per document) for publicly available DNORM systems on the test set
| Run | MetaMap | OBA | BeCAS | Configuration 5 |
|---|---|---|---|---|
| 1 | 1.03 | 12.98 | 0.62 | 0.31 |
| 2 | 1.14 | 12.81 | 0.51 | 0.32 |
| 3 | 1.01 | 13.09 | 0.51 | 0.3 |
| 4 | 1.04 | 13.08 | 0.45 | 0.3 |
| 5 | 1.21 | 12.75 | 0.46 | 0.3 |
| Average | 1.09 | 12.94 | 0.51 | |
The bold value signifies the highest value within the column.
Performance of proposed methods on the test set with term and phrase matching (normalization)
| Configuration | TP | FP | FN | P | R | F |
|---|---|---|---|---|---|---|
| Configuration 5 + Term match | 1444 | 777 | 544 | 0.65 | 0.73 | 0.69 |
| Configuration 5 + Phrase match | 1419 | 339 | 569 | 0.81 | 0.71 | 0.76 |
| | MetaMap | cTAKES | OBA | BeCAS |
|---|---|---|---|---|
| Overall Pipeline | NP → Lexical variants → String matching (Exact & Partial) → Custom score → Disambiguation | Norm → NP → Non-lexical variants → Partial string matching → No disambiguation | Mgrep → String matching (Exact & Partial) → Semantic Expansion | Modules for PubMed article fetching, Sentence splitting → tokenization → lemmatization → POS tagging → chunking → Partial matching |
| Dictionary Lookup Matching Type | Partial matching using custom score | Partial matching | Partial matching using rules and semantic expansion | Partial matching using deterministic finite automatons |
| Abbreviation Resolution | Yes | No | No | Yes |
| Query Expansion | Lexical variants generated using SPECIALIST lexicon and Lexical Variant Generation (LVG) tools | Non-lexical variants (variations of head & modifiers within noun phrases.) | Semantic expansion (hierarchical and mapping info of ontologies) | Synonyms and Orthographic variants |
| Dictionary Enhancement | No | Enriched with synonyms from UMLS and a Mayo-maintained list of terms | No | No |
| Word Sense Disambiguation | Yes | No | Yes | No |
| Entity Type | All semantic types in UMLS | Disorders/diseases with a separate group for signs/symptoms, test/procedures, anatomy and medication/drugs | All semantic types in UMLS | Species, anatomical concepts, miRNAs, enzymes, chemicals, drugs, diseases, metabolic pathways, cellular components, biological processes, genes, proteins and molecular functions |
| Terminologies | UMLS | SNOMED and RxNORM | Ontologies listed on NCBO BioPortal | UMLS, LexEBI, JoChem, NCI Metathesaurus, miRBase, Gene Ontology |
| Availability | Desktop, Local Java API, Web API, Web Portal | Desktop | REST API, Virtual machine and Web Portal | REST API, Python Command line client and Web Portal |
| UMLS semantic type | UMLS semantic type code | UMLS semantic type acronym |
|---|---|---|
| Congenital abnormality | T019 | cgab |
| Acquired abnormality | T020 | acab |
| Injury or poisoning | T037 | inpo |
| Pathologic function | T046 | patf |
| Disease or syndrome | T047 | dsyn |
| Mental or behavioral dysfunction | T048 | mobd |
| Cell or molecular dysfunction | T049 | comd |
| Experimental model of disease | T050 | emod |
| Anatomical abnormality | T190 | anab |
| Neoplastic process | T191 | neop |
| Sign or symptom | T184 | sosy |
| Exact match | Phrase match | Term match |
|---|---|---|
| Each term in the query must be present in the dictionary entry, in the same order. The matching dictionary entry must contain only the query terms; no additional terms are allowed. | Each term in the query must be present in the dictionary entry, in the same order. The dictionary entry may contain other terms before or after the query terms. | Query terms are matched against the dictionary entry without regard to order. The dictionary entry must contain at least one query term. |
| Example: The query ‘TORCH Syndrome’ will return the ‘TORCH Syndrome’ dictionary entry. | Example: The query ‘TORCH Syndrome’ will return the ‘TORCH Syndrome’ and ‘Pseudo-TORCH Syndrome’ entries. | Example: The query ‘TORCH syndrome’ will return the ‘TORCH syndrome’, ‘Pseudo-TORCH Syndrome’, ‘TORCH’ and ‘Syndrome’ entries. |
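
The three matching types in the table above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the tokenizer (lowercasing and splitting on non-alphanumeric characters, so that ‘Pseudo-TORCH’ yields the term ‘torch’) and all function names are hypothetical, and the small `entries` list stands in for a real dictionary.

```python
import re
from typing import List


def terms(text: str) -> List[str]:
    """Assumed tokenizer: lowercase, split on any non-alphanumeric character."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(query: str, entry: str) -> bool:
    """Same terms, same order, no additional terms in the entry."""
    return terms(query) == terms(entry)


def phrase_match(query: str, entry: str) -> bool:
    """Query terms appear contiguously and in order; extra terms may surround them."""
    q, e = terms(query), terms(entry)
    n = len(q)
    return n > 0 and any(e[i:i + n] == q for i in range(len(e) - n + 1))


def term_match(query: str, entry: str) -> bool:
    """Entry shares at least one term with the query; order is ignored."""
    return bool(set(terms(query)) & set(terms(entry)))


entries = ["TORCH Syndrome", "Pseudo-TORCH Syndrome", "TORCH", "Syndrome",
           "Renal papillary necrosis"]
query = "TORCH Syndrome"
for name, fn in [("exact", exact_match), ("phrase", phrase_match), ("term", term_match)]:
    print(name, [e for e in entries if fn(query, e)])
# exact ['TORCH Syndrome']
# phrase ['TORCH Syndrome', 'Pseudo-TORCH Syndrome']
# term ['TORCH Syndrome', 'Pseudo-TORCH Syndrome', 'TORCH', 'Syndrome']
```

Each looser matching type returns a superset of the stricter one, which is consistent with the higher recall and lower precision reported for term match compared with phrase match.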