Hsin-Chun Lee, Yi-Yu Hsu, Hung-Yu Kao.
Abstract
Diseases play central roles in many areas of biomedical research and healthcare. Consequently, aggregating disease knowledge and treatment research reports has become a critical issue, especially in rapidly growing knowledge bases (e.g. PubMed). We therefore developed a system, AuDis, for disease mention recognition and normalization in biomedical texts. Our system uses a second-order conditional random fields (CRF) model. To optimize the results, we customize several post-processing steps, including abbreviation resolution, consistency improvement and stopword filtering. In the official evaluation of the CDR task in BioCreative V, AuDis obtained the best performance (an F-score of 86.46%) among 40 runs (16 unique teams) on disease normalization in the DNER subtask. These results suggest that AuDis is a high-performance system for disease recognition and normalization in biomedical literature.

Database URL: http://ikmlab.csie.ncku.edu.tw/CDR2015/AuDis.html
Year: 2016 PMID: 27278815 PMCID: PMC4897593 DOI: 10.1093/database/baw091
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. An example of extracting mentions from PubMed literature and assigning a MeSH concept identifier to each mention.
Statistics of CDR corpus
| Dataset | Articles | Chemical mentions | Chemical IDs | Disease mentions | Disease IDs | CID relations |
|---|---|---|---|---|---|---|
| Training | 500 | 5203 | 1467 | 4182 | 1965 | 1038 |
| Development | 500 | 5347 | 1507 | 4244 | 1865 | 1012 |
| Test | 500 | 5385 | 1435 | 4424 | 1988 | 1066 |
Figure 2. The overall architecture of AuDis.
An example of the annotations for PMID: 9625142
| Field | Text |
|---|---|
| Title | Acute hepatitis, autoimmune hemolytic anemia, and erythroblastocytopenia induced by ceftriaxone. |
| Abstract | An 80-yr-old man developed acute hepatitis shortly after ingesting oral ceftriaxone. Although the transaminases gradually returned to …… |

| PMID | Start offset | End offset | Mention | Mention type | Database identifier |
|---|---|---|---|---|---|
| 9625142 | 6 | 15 | hepatitis | Disease | D056486 |
| 9625142 | 17 | 44 | autoimmune hemolytic anemia | Disease | D000744 |
| 9625142 | 50 | 72 | erythroblastocytopenia | Disease | −1 |
| 9625142 | 130 | 139 | hepatitis | Disease | D056486 |
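The offsets in this table can be checked against the text directly. The sketch below is illustrative only and makes three assumptions that the page does not state: offsets are half-open character ranges (start inclusive, end exclusive), the title contains a comma after "anemia" (as the token table and the offset 50 imply), and the title and abstract are joined by a single separator character. The helper name `check_offsets` is hypothetical.

```python
# Title and the visible start of the abstract for PMID 9625142.
title = ("Acute hepatitis, autoimmune hemolytic anemia, and "
         "erythroblastocytopenia induced by ceftriaxone.")
abstract_start = ("An 80-yr-old man developed acute hepatitis shortly "
                  "after ingesting oral ceftriaxone.")

# Assumption: one separator character between title and abstract.
document = title + " " + abstract_start

# (start, end, mention, identifier) rows copied from the table above.
annotations = [
    (6, 15, "hepatitis", "D056486"),
    (17, 44, "autoimmune hemolytic anemia", "D000744"),
    (50, 72, "erythroblastocytopenia", "-1"),
    (130, 139, "hepatitis", "D056486"),
]


def check_offsets(text, rows):
    """Return the rows whose [start, end) slice does not equal the mention."""
    return [row for row in rows if text[row[0]:row[1]] != row[2]]


mismatches = check_offsets(document, annotations)
print(mismatches)
```

Under these assumptions every annotated span slices out exactly its mention string, so the list of mismatches is empty.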
An example of breaking the title of PMID: 9625142 into tokens and assigning a label to each token
| Token | Acute | hepatitis | , | autoimmune | hemolytic | anemia | , | and | erythroblastocytopenia | induced | by | ceftriaxone | . |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Label | O | B | O | B | I | E | O | O | B | O | O | O | O |
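The labelling scheme in this table can be sketched as a small function that maps annotated mention spans onto tokens. This is a minimal illustration, not the AuDis tokenizer: the regular-expression tokenizer and the names `tokenize` and `tag_tokens` are assumptions, and the rule that a single-token mention receives `B` is inferred from the label shown for "hepatitis".

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens with character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def tag_tokens(text, mentions):
    """Label each token: B = first token of a mention, I = inside,
    E = last token of a multi-token mention, O = outside any mention."""
    tagged = []
    for token, start, end in tokenize(text):
        label = "O"
        for m_start, m_end in mentions:
            if start >= m_start and end <= m_end:
                if start == m_start:
                    label = "B"
                elif end == m_end:
                    label = "E"
                else:
                    label = "I"
                break
        tagged.append((token, label))
    return tagged


title = ("Acute hepatitis, autoimmune hemolytic anemia, and "
         "erythroblastocytopenia induced by ceftriaxone.")
# Disease mention spans (start, end) taken from the annotation table.
disease_spans = [(6, 15), (17, 44), (50, 72)]

tagged_title = tag_tokens(title, disease_spans)
labels = "".join(label for _, label in tagged_title)
print(labels)  # OBOBIEOOBOOOO
```

The first seven labels reproduce the labelled part of the table; the remaining ones follow from applying the same rule to the rest of the title.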
The groups of three disease types
| Groups | Conditions |
|---|---|
| Disease terminologies | Impairment, nausea, vomiting, disease, cancer, toxicity, insufficiency, effusion, deficit, dysfunction, injury, pain, neurotoxicity, infect, syndrome, symptom, hyperplasia, retinoblastoma, defect, disorder, failure, hamartoma, hepatitis, tumor, damage, illness, abnormality, tumour, abortion |
| Body part | Pulmonary, neuronocular, orbital, breast, renal, hepatic, liver, hart, eye, pulmonary, ureter, bladder, pleural, pericardial, colorectal, head, neck, pancreaticobiliary, cardiac, leg, back, cardiovascular, gastrointestinal, myocardial, kidney, bile, intrahepatic, extrahepatic, memorygastric |
| Human ability | Visual, auditory, learning, opisthotonu, sensory, motor, memory, social, emotion |
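One plausible way to turn these three groups into token-level features is a set of membership flags. The sketch below is an assumption about how such features might be computed, not the AuDis implementation: it uses only a few entries from each group, exact lowercase matching, and the hypothetical function name `terminology_features`. Stem-like entries in the table such as "infect" suggest that the original system may match prefixes or substrings instead.

```python
# Small excerpts of the three groups listed in the table above.
DISEASE_TERMS = {"disease", "cancer", "toxicity", "syndrome",
                 "failure", "hepatitis", "tumor"}
BODY_PARTS = {"renal", "hepatic", "liver", "cardiac", "kidney", "pulmonary"}
HUMAN_ABILITIES = {"visual", "auditory", "learning", "sensory",
                   "motor", "memory"}


def terminology_features(token):
    """Return one boolean flag per terminology group for a token."""
    t = token.lower()
    return {
        "is_disease_term": t in DISEASE_TERMS,
        "is_body_part": t in BODY_PARTS,
        "is_human_ability": t in HUMAN_ABILITIES,
    }


print(terminology_features("Hepatitis"))
print(terminology_features("renal"))
```

Flags like these could be fed to a CRF alongside other token features, so that "renal failure" contributes a body-part flag on the first token and a disease-term flag on the second.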
The performance of four different lexicon orders on the CDR development set
| Run | Lexicon priority order | Precision | Recall | F-score |
|---|---|---|---|---|
| 1 | Train > 791 > MEDIC | 0.8827 | 0.7909 | 0.8343 |
| 2 | Train > MEDIC > 791 | 0.8832 | 0.7909 | 0.8345 |
| 3 | Train > MEDIC > 791 > Extend | 0.8791 | 0.8070 | **0.8415** |
| 4 | MEDIC > Train > 791 > Extend | 0.8567 | 0.7920 | 0.8231 |
The highest F-score is shown in bold.
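The ">" notation in this table can be read as a priority list: a mention is looked up in each lexicon in turn, and the first lexicon that contains it supplies the identifier. The sketch below illustrates that reading only. The function `normalize`, the toy lexicon contents, the placeholder identifier, and the use of "-1" for unmatched mentions are all assumptions, not details of AuDis.

```python
def normalize(mention, lexicons):
    """Return (identifier, lexicon name) from the first lexicon that
    contains the mention, or ("-1", None) when no lexicon matches."""
    key = mention.lower()
    for name, lexicon in lexicons:
        if key in lexicon:
            return lexicon[key], name
    return "-1", None


# Toy lexicons for illustration; the conflicting entry is hypothetical.
train = {"hepatitis": "D056486"}
medic = {"hepatitis": "HYPOTHETICAL_MEDIC_ID",
         "autoimmune hemolytic anemia": "D000744"}

train_first = [("Train", train), ("MEDIC", medic)]
medic_first = [("MEDIC", medic), ("Train", train)]

print(normalize("Hepatitis", train_first))   # ('D056486', 'Train')
print(normalize("Hepatitis", medic_first))   # ('HYPOTHETICAL_MEDIC_ID', 'MEDIC')
print(normalize("erythroblastocytopenia", train_first))  # ('-1', None)
```

When two lexicons disagree on a mention, the order decides which identifier is returned, which is one way the choice of order could affect the scores in the table.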
The performance of disease normalization on the CDR test set for three runs
| Run | Training set for CRF | Precision | Recall | F-score |
|---|---|---|---|---|
| 1 | Train | 0.8942 | 0.8244 | 0.8579 |
| 2 | Train + Dev | 0.8963 | 0.8350 | **0.8646** |
| 3 | Train + Dev + 791 | 0.8832 | 0.8365 | 0.8592 |
The highest F-score is shown in bold.
The performance of disease normalization on the CDR test set. The results shown are the best submissions of all participating teams; our best submission corresponds to the current setting of AuDis.
| Team | TP | FP | FN | Precision | Recall | F-score |
|---|---|---|---|---|---|---|
| AuDis | 1660 | 192 | 328 | 89.63% | 83.50% | 86.46% |
| 304 | 1713 | 277 | 275 | 86.08% | 86.17% | 86.12% |
| 277 | 1629 | 191 | 359 | 89.51% | 81.94% | 85.56% |
| 363 | 1606 | 168 | 382 | 90.53% | 80.78% | 85.38% |
| 310 | 1627 | 247 | 361 | 86.82% | 81.84% | 84.26% |
| Average of all teams | 1487 | 418 | 501 | 78.99% | 74.81% | 76.03% |
| Dictionary-lookup | 1341 | 1799 | 647 | 42.71% | 67.45% | 52.30% |
| DNorm | 1593 | 370 | 395 | 81.15% | 80.13% | 80.64% |
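The precision, recall, and F-score columns follow from the TP, FP, and FN counts using the standard definitions, with the F-score taken as the harmonic mean of precision and recall. A minimal sketch that recomputes the AuDis row; the function name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    """Compute precision, recall, and F-score from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Counts from the AuDis row of the table above.
p, r, f = prf(tp=1660, fp=192, fn=328)
print(f"{p:.2%} {r:.2%} {f:.2%}")  # 89.63% 83.50% 86.46%
```

Note that TP + FN = 1988 for every row, which equals the number of disease IDs in the test set.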
The performance when one of our six feature groups is removed
| Removed feature group | Precision | Recall | F-score |
|---|---|---|---|
| None (all features) | 89.63% | 83.50% | 86.46% |
| Dictionary-lookup | 91.67% | 71.43% | 80.29% |
| Morphology | 88.23% | 82.24% | 85.13% |
| POS | 88.93% | 82.44% | 85.57% |
| Vowel | 89.09% | 82.60% | 85.72% |
| Abbreviation | 89.05% | 82.65% | 85.73% |
| Terminology | 89.15% | 82.70% | 85.80% |
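The contribution of each feature group can be read off as the drop in F-score when that group is removed. The short sketch below does this arithmetic on the values in the table; the variable names are illustrative.

```python
FULL_F = 86.46  # F-score (%) with all six feature groups

# F-score (%) after removing each feature group, from the table above.
ablation_f = {
    "Dictionary-lookup": 80.29,
    "Morphology": 85.13,
    "POS": 85.57,
    "Vowel": 85.72,
    "Abbreviation": 85.73,
    "Terminology": 85.80,
}

# Drop in F-score, in percentage points, for each removed group.
drops = {group: round(FULL_F - f, 2) for group, f in ablation_f.items()}
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)

for group, drop in ranked:
    print(f"{group}: -{drop:.2f}")
```

Removing the dictionary-lookup features costs 6.17 points, mostly through lower recall, while each of the other groups accounts for between 0.66 and 1.33 points.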
Figure 4. The eight categories of false positives and false negatives.