Cheng-Ju Kuo, Maurice H T Ling, Kuan-Ting Lin, Chun-Nan Hsu.
Abstract
BACKGROUND: Text mining tools are becoming essential for automatically processing large quantities of biological literature for knowledge discovery and information curation. Abbreviation recognition is related to named entity recognition (NER) and can be treated as the task of recognizing pairs in free text, each pair consisting of a term and its corresponding abbreviation. Successfully identifying an abbreviation and its corresponding definition is not only a prerequisite for indexing the terms of text databases to return articles of related interest, but also a building block for improving existing gene mention tagging and gene normalization tools.
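To make the pair recognition task concrete, the sketch below extracts candidate short-form/long-form pairs from the common `long form (SF)` pattern. It is a toy first-letter heuristic written for illustration, not the method of this study; the function name `candidate_pairs` and the matching rule are assumptions.

```python
import re


def candidate_pairs(text):
    """Return (short_form, long_form) candidates from 'long form (SF)' patterns.

    Hypothetical heuristic: a short form of k characters is paired with the
    k words that precede the parenthesis, but only if their initials spell
    the short form (case-insensitive).
    """
    pairs = []
    # A short form here is 2-10 alphanumeric characters in parentheses,
    # starting with a letter.
    for match in re.finditer(r"\(([A-Za-z][A-Za-z0-9]{1,9})\)", text):
        short_form = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        k = len(short_form)
        if len(words) < k:
            continue
        window = words[-k:]
        initials = "".join(word[0] for word in window)
        if initials.lower() == short_form.lower():
            pairs.append((short_form, " ".join(window)))
    return pairs


print(candidate_pairs("We used polymerase chain reaction (PCR) here."))
print(candidate_pairs("Cells were treated with lipopolysaccharide (LPS)."))
```

The second call finds nothing, because "LPS" is not built from word initials. Pairs like this one are where a simple rule breaks down and a learned classifier becomes attractive.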
Year: 2009 PMID: 19958517 PMCID: PMC2788358 DOI: 10.1186/1471-2105-10-S15-S7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Number of features of each feature set and total number of all features generated in feature extraction
| Feature Set(s) | BIOADI corpus | AB3P corpus |
|---|---|---|
| M | 251 | 239 |
| L | 23601 | 25993 |
| N | 8 | 8 |
| C | 19264 | 20231 |
| M + L + N + C | 43124 | 46471 |
M, String morphological features; L, LF tokens; N, Numeric features; C, Contextual features
Performance of various learning algorithms tested on the BIOADI corpus and the AB3P corpus
| Training Corpus | AB3P corpus | | | BIOADI corpus | | |
|---|---|---|---|---|---|---|
| Test Corpus | BIOADI corpus | | | AB3P corpus | | |
| Learning Algorithm | Precision | Recall | F-measure | Precision | Recall | F-measure |
| Naïve Bayes | 0.9733 | 0.4828 | 0.6454 | 0.9784 | 0.7518 | 0.8503 |
| Logistic Regression | 0.9352 | 0.7995 | 0.8620 | 0.9586 | 0.8464 | 0.8990 |
| MCMaximum Entropy | 0.9320 | 0.7013 | 0.8004 | 0.9301 | 0.8066 | 0.8640 |
| SVM (linear kernel) | 0.9446 | 0.7808 | 0.8549 | 0.9619 | 0.8398 | 0.8967 |
| SVM (RBF kernel) | 0.9212 | 0.8103 | 0.8622 | 0.9256 | 0.8580 | 0.8906 |
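The three values reported per test corpus are consistent with precision, recall, and their harmonic mean (the F-measure). The following sketch recomputes the third value from the first two for two rows of the table; the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows taken from the table above (trained on AB3P, tested on BIOADI).
rows = {
    "Naive Bayes": (0.9733, 0.4828, 0.6454),
    "Logistic Regression": (0.9352, 0.7995, 0.8620),
}

for name, (p, r, reported) in rows.items():
    computed = f_measure(p, r)
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```

The Naïve Bayes row shows why the harmonic mean is used: a very high precision cannot compensate for a recall below 0.5.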
Performance of logistic regression classifier trained with different feature sets and tested on the BIOADI corpus and the AB3P corpus
| Training Corpus | AB3P corpus | | | BIOADI corpus | | |
|---|---|---|---|---|---|---|
| Test Corpus | BIOADI corpus | | | AB3P corpus | | |
| Feature Set(s) | Precision | Recall | F-measure | Precision | Recall | F-measure |
| M | 0.9155 | 0.7242 | 0.8087 | 0.9392 | 0.8082 | 0.8688 |
| M + L | 0.9153 | 0.7489 | 0.8238 | 0.9401 | 0.8207 | 0.8763 |
| M + L + N | 0.9260 | 0.7995 | 0.8581 | 0.9556 | 0.8398 | 0.8939 |
| M + L + N + C | 0.9352 | 0.7995 | 0.8620 | 0.9586 | 0.8464 | 0.8990 |
M, String morphological features; L, LF tokens; N, Numeric features; C, Contextual features
Performance of the AR systems tested on the BIOADI corpus and the AB3P corpus
| Training Corpus | AB3P corpus | | | BIOADI corpus | | |
|---|---|---|---|---|---|---|
| Test Corpus | BIOADI corpus | | | AB3P corpus | | |
| System | Precision | Recall | F-measure | Precision | Recall | F-measure |
| This study | 0.9352 | 0.7995 | 0.8620 | 0.9586 | 0.8464 | 0.8990 |
| Sohn et al. | 0.9482 | 0.7832 | 0.8578 | 0.9701 | 0.8356 | 0.8979 |
| Schwartz et al. | 0.9416 | 0.7766 | 0.8512 | 0.9500 | 0.7883 | 0.8613 |
Significance tests among the AR systems on the BIOADI corpus and the AB3P corpus
| | BIOADI corpus | | | AB3P corpus | | |
|---|---|---|---|---|---|---|
| Comparing with | | | p-value | | | p-value |
| Sohn et al. | 1000 | 810 | < .0001 | 1000 | 619 | < .0001 |
| Schwartz et al. | 1000 | 989 | < .0001 | 1000 | 1000 | < .0001 |
Figure 1. Performance versus training data size tested on the BIOADI corpus.
Figure 2. Performance versus training data size tested on the AB3P corpus.
Figure 3. Performance versus training data size tested on the merged corpus (AB3P corpus + BIOADI corpus).
Testing time (in seconds) of three AR systems on PubMed abstract sets of different sizes; column headers give the number of abstracts, with the token count in parentheses
| System | 1200 (240918) | 1250 (229501) | 2450 (470419) | 5000 (988828) |
|---|---|---|---|---|
| This study | 13.355 | 13.316 | 25.059 | 45.506 |
| Sohn et al. | 159.292 | 135.254 | 292.343 | 630.917 |
| Schwartz et al. | 0.873 | 0.897 | 1.598 | 3.138 |
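The timings are easier to compare as throughput. The sketch below converts the 5000-abstract column (988828 tokens) into tokens per second for each system; the variable and function names are illustrative, and the figures are copied from the table above.

```python
# Testing time in seconds on the 5000-abstract set (988828 tokens),
# copied from the table above.
TOKENS = 988828
seconds = {
    "This study": 45.506,
    "Sohn et al.": 630.917,
    "Schwartz et al.": 3.138,
}


def tokens_per_second(tokens, elapsed_seconds):
    """Throughput of a system given token count and elapsed time."""
    return tokens / elapsed_seconds


throughput = {name: tokens_per_second(TOKENS, t) for name, t in seconds.items()}
for name, rate in sorted(throughput.items(), key=lambda item: -item[1]):
    print(f"{name}: {rate:,.0f} tokens/s")

# How many times faster this study is than Sohn et al. on this set.
speedup_vs_sohn = seconds["Sohn et al."] / seconds["This study"]
print(f"Speed-up over Sohn et al.: {speedup_vs_sohn:.1f}x")
```

On this column, the system from this study sits between the two baselines: slower than Schwartz et al. and more than ten times faster than Sohn et al.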
Top 20 most frequent SF-LF pairs extracted from 17,551,165 PubMed abstracts
| Rank | Short Form | Long Form | Frequency | Class |
|---|---|---|---|---|
| 1 | CI | confidence interval | 36142 | Statistical Measure |
| 2 | PCR | polymerase chain reaction | 26951 | Experimental Technique |
| 3 | NO | nitric oxide | 25229 | Medical Relevance |
| 4 | CT | computed tomography | 20084 | Experimental Technique |
| 5 | HIV | human immunodeficiency virus | 20027 | Virus Studied |
| 6 | LPS | lipopolysaccharide | 20027 | Experimental Technique |
| 7 | OR | odds ratio | 19914 | Statistical Measure |
| 8 | MRI | magnetic resonance imaging | 19396 | Experimental Technique |
| 9 | IL | interleukin | 16934 | Medical Relevance |
| 10 | CNS | central nervous system | 15915 | Medical Relevance |
| 11 | BMI | body mass index | 15614 | Clinical Observation |
| 12 | RA | rheumatoid arthritis | 14221 | Medical Relevance |
| 13 | AD | Alzheimer's disease | 14103 | Medical Relevance |
| 14 | PKC | protein kinase C | 13958 | Medical Relevance |
| 15 | ELISA | enzyme-linked immunosorbent assay | 13946 | Experimental Technique |
| 16 | CSF | cerebrospinal fluid | 13815 | Medical Relevance |
| 17 | DA | dopamine | 11297 | Medical Relevance |
| 18 | BP | blood pressure | 10991 | Clinical Observation |
| 19 | MR | magnetic resonance | 10973 | Experimental Technique |
| 20 | HCV | hepatitis C virus | 10944 | Virus Studied |
Top 5 long-form occurrences for the abbreviation "APC"
| Long Form | Frequency |
|---|---|
| antigen-presenting cells | 2077 |
| adenomatous polyposis coli | 1108 |
| activated protein C | 924 |
| anaphase-promoting complex | 254 |
| argon plasma coagulation | 152 |
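The "APC" counts show how ambiguous a single short form can be. As a minimal sketch, the extracted pairs can be held in a frequency table and queried for the most common long form. This is a context-free baseline assumed for illustration, not a disambiguation method described in the source; the names `LONG_FORMS` and `most_frequent_long_form` are hypothetical.

```python
from collections import Counter

# Top 5 long forms for "APC" with their frequencies, from the table above.
LONG_FORMS = {
    "APC": Counter({
        "antigen-presenting cells": 2077,
        "adenomatous polyposis coli": 1108,
        "activated protein C": 924,
        "anaphase-promoting complex": 254,
        "argon plasma coagulation": 152,
    }),
}


def most_frequent_long_form(short_form):
    """Return (long_form, share) for the most frequent listed long form.

    The share is computed only over the long forms present in the table,
    so it overstates the true share if rarer long forms exist.
    """
    counts = LONG_FORMS[short_form]
    long_form, frequency = counts.most_common(1)[0]
    return long_form, frequency / sum(counts.values())


best, share = most_frequent_long_form("APC")
print(f"{best} ({share:.0%} of the listed occurrences)")
```

Even the most frequent long form covers less than half of the listed occurrences, so a frequency lookup alone would often choose the wrong expansion for "APC".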