Yanpeng Li, Hongfei Lin, Zhihao Yang.
Abstract
BACKGROUND: Gene named entity classification and recognition are crucial preliminary steps of text mining in biomedical literature. Machine-learning-based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in feature space. As a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information.
Year: 2009 PMID: 19615051 PMCID: PMC2725142 DOI: 10.1186/1471-2105-10-223
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Framework of the named entity recognition system.
Figure 2. Feature coupling generalization.
Figure 3. An example of the FCG method applied to the gene named entity classification task. Here an EDF can be viewed as the conjunction of an EDF root and a term. Only one FCD type is used. An FCD feature is the conjunction of an EDF root and a CDF.
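The conjunction described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the root name `whole_term`, the CDF names, and the use of raw co-occurrence counts as feature values are all assumptions made for the example.

```python
from collections import Counter

def fcd_features(term, edf_roots, cdfs, cooccurrence):
    """Map one candidate term to FCD features named '<EDF root>&<CDF>'.

    An EDF is the pair (EDF root, term). `cooccurrence` holds hypothetical
    counts of how often each EDF appears together with each CDF; using the
    raw count as the feature value is an assumption of this sketch.
    """
    features = {}
    for root in edf_roots:
        edf = (root, term)
        for cdf in cdfs:
            # The feature name drops the term, so every term shares one feature space.
            features[f"{root}&{cdf}"] = cooccurrence[(edf, cdf)]
    return features

# Hypothetical co-occurrence counts, invented for illustration only.
cooccurrence = Counter({
    (("whole_term", "BRCA1"), "right=gene"): 12,
    (("whole_term", "BRCA1"), "left=the"): 3,
})
cdfs = ["right=gene", "left=the"]
feats = fcd_features("BRCA1", ["whole_term"], cdfs, cooccurrence)
unseen = fcd_features("XYZ9", ["whole_term"], cdfs, cooccurrence)
```

Because the feature names contain only the EDF root and the CDF, a term that never appeared in training still maps onto the same features, which is the property the sketch is meant to show.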
Figure 4. Components and FCG settings of the final named entity classification system.
Statistical information and NER performance of dictionaries
| Dictionary | # of entries | Coverage | Precision | Recall | F-score |
| BioThesaurus | 4,480,469 | 36.78% | 15.36 | 77.21 | 25.62 |
| ABGene lexicon | 1,101,716 | 35.98% | 31.54 | 53.58 | 39.71 |
| Combined | 5,522,822 | 54.32% | 16.20 | 82.59 | 27.09 |
| Combined+variants | 10,034,696 | 65.68% | 7.32 | 79.26 | 13.40 |
The rightmost three columns show the recognition performance of each dictionary on the BioCreative 2 test corpus using the maximum match method.
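A maximum match tagger of the kind mentioned in the note above can be sketched as a greedy longest-match scan over tokens. The tiny dictionary, the lowercasing, and the `max_len` limit are assumptions for illustration; the excerpt does not describe how the authors normalized or matched entries.

```python
def max_match(tokens, dictionary, max_len=5):
    """Greedy longest-match tagging: return (start, end) token spans found in the dictionary."""
    spans = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(min(len(tokens), i + max_len), i, -1):
            if " ".join(tokens[i:j]).lower() in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end  # continue after the matched span
    return spans

# Hypothetical dictionary entries, stored lowercased.
dictionary = {"tumor necrosis factor", "tumor necrosis factor alpha", "p53"}
tokens = "Tumor necrosis factor alpha activates p53 .".split()
spans = max_match(tokens, dictionary)
```

Preferring the longest entry makes the tagger pick "tumor necrosis factor alpha" over its shorter prefix, which also shows why a large dictionary tends to raise recall while lowering precision.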
Examples of EDFs and CDFs
| Feature type | Examples | # of features |
| EDF I | | -- |
| EDF II | | -- |
| CDF I (left) | | 300 |
| CDF I (right) | | 300 |
| CDF II | | 11 |
In the examples of CDF II, t is the prediction score given by an SVM trained on local context words.
Performance of lexical features on named entity classification. F1: bag-of-n-grams; F2: boundary n-grams; F3: sliding character window; F4: boundary substrings; F5: morphology patterns.
| Feature | All terms (P/R/F1) | OOV terms (P/R/F1) |
| F1 | 53.97/70.72/61.22 | -- |
| F1+F2 | 73.98/67.13/70.39 | -- |
| F1+F2+F3 | 72.54/71.15/71.84 | 50.88/2.94/5.56 |
| F1+F2+F3+F4 | 73.75/85.22/79.07 | 70.93/75.89/73.32 |
| F1+F2+F3+F5 | 72.81/ | 71.00/74.16/72.55 |
| F1+F2+F3+F4+F5 |
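To make the five lexical feature families in the table above concrete, here is a rough sketch of how they might be extracted from a candidate term. The excerpt names the families but does not define them, so every concrete choice below (n-gram order, window size, affix length, the shape alphabet) is an assumption.

```python
import re

def lexical_features(term, n_max=2, window=3, affix_max=4):
    """Return a set of string features for one candidate term (illustrative definitions)."""
    words = term.split()
    feats = set()
    # F1: bag of word n-grams inside the term.
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            feats.add("F1=" + " ".join(words[i:i + n]))
    # F2: n-grams at the left and right boundaries of the term.
    for n in range(1, min(n_max, len(words)) + 1):
        feats.add("F2_left=" + " ".join(words[:n]))
        feats.add("F2_right=" + " ".join(words[-n:]))
    # F3: sliding character window over the whole term.
    for i in range(len(term) - window + 1):
        feats.add("F3=" + term[i:i + window])
    # F4: boundary substrings (prefixes and suffixes of the term).
    for k in range(1, min(affix_max, len(term)) + 1):
        feats.add("F4_prefix=" + term[:k])
        feats.add("F4_suffix=" + term[-k:])
    # F5: morphology pattern (uppercase -> A, lowercase -> a, digit -> 0).
    shape = re.sub(r"[A-Z]", "A", term)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    feats.add("F5=" + shape)
    return feats

features = lexical_features("IL-2 receptor")
```

Features such as F3, F4 and F5 describe the internal form of a term rather than its identity, which is one plausible reason they still fire on OOV terms.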
Performance of models with FCD features on named entity classification and recognition
| ID | Feature (model) | Classification (all terms) | | | Classification (OOV terms) | | | Named entity recognition | | |
| | | Precision | Recall | F-score | Precision | Recall | F-score | Precision | Recall | F-score |
| Run 1 | Lexical (linear) | 75.52 | 85.63 | 80.26 | 74.03 | 75.68 | 74.85 | 85.70 | 78.36 | 81.86 |
| Run 2 | FCD (linear) | 81.59 | 87.77 | 84.57 (+4.31) | 83.74 | 87.64 | 85.64 (+10.79) | 87.98 | 80.70 | 84.18 (+2.32) |
| Run 3 | FCD (SVD + RBF) | 83.02 | 88.24 | 85.55 (+5.29) | 83.12 | 85.31 | 84.2 (+9.35) | 89.80 | 81.76 | 85.59 (+3.73) |
| Run 4 | FCD (Combine (2, 3)) | 82.46 | | 86.23 (+5.97) | 83.21 | 88.35 | 85.7 (+10.85) | 89.29 | | 85.74 (+3.88) |
| Run 5 | All (linear) | 82.96 | 89.31 | 86.02 (+5.76) | 83.65 | | 86.32 (+11.47) | 89.93 | 81.71 | 85.62 (+3.76) |
| Run 6 | All (Combine (3, 5)) | 89.99 | 88.86 | 82.40 | ||||||
In Runs 1, 2 and 5, SVMs with a linear kernel are used. In Run 3, SVD is used to reduce the feature dimension and an SVM with an RBF kernel is used to classify examples; only features related to CDF I are used in this run. In Run 4, the outputs of Runs 2 and 3 are combined. Run 6 is the combination of Run 3 and Run 5.
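As a consistency check on the table above, the F-score is the harmonic mean of precision and recall. The worked numbers use Run 2 (classification, all terms). Reading the parenthesized values as differences from Run 1 is an inference from the arithmetic, not something this excerpt states.

```latex
F = \frac{2PR}{P + R}
\qquad\Rightarrow\qquad
F_{\text{Run 2}} = \frac{2 \times 81.59 \times 87.77}{81.59 + 87.77}
                 = \frac{14322.31}{169.36} \approx 84.57,
\qquad
84.57 - 80.26 = 4.31 .
```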
Impact of different CDFs on named entity classification.
| CDF type | Linear (P/R/F1) | SVD-RBF (P/R/F1) |
| CDF I | 81.43/86.50/83.89 | |
| CDF II | 75.64/82.20/78.78 | 78.41/80.35/79.37 |
| CDF I+II | 81.59/87.67/84.57 | 82.51/90.01/86.10 |
| Combine | 82.46/ | -- |
The run in the last row combines the results of the SVD-RBF model (with CDF I) and the linear model (with CDF I+II), and is the same as Run 4 in Table 4. Since the combining method is a linear function, we list it under the linear case.
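A linear combination of two classifier outputs, as mentioned in the note above, could look like the sketch below. The equal weighting and the zero decision threshold are assumptions; the excerpt says only that the combining method is a linear function.

```python
def combine_scores(score_a, score_b, weight=0.5, threshold=0.0):
    """Linearly combine two decision scores and threshold the result.

    `weight` and `threshold` are hypothetical parameters chosen for this sketch.
    """
    combined = weight * score_a + (1.0 - weight) * score_b
    return combined, combined > threshold

# Example: one model is confident the term is a gene, the other mildly disagrees.
combined, is_gene = combine_scores(1.2, -0.4)
```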
Figure 5. Relation between named entity classification performance and the number of context patterns in CDF I. The patterns are selected in descending order of information gain scores.
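Ranking context patterns by information gain, as in the Figure 5 caption, can be sketched with the standard definition for a binary label and a binary "pattern present" indicator. The toy examples and pattern names are invented for illustration, and the authors' exact counting scheme is not given in this excerpt.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels; 0.0 for an empty list."""
    if not labels:
        return 0.0
    h = 0.0
    for count in (sum(labels), len(labels) - sum(labels)):
        if count:
            p = count / len(labels)
            h -= p * math.log2(p)
    return h

def information_gain(examples, pattern):
    """examples: list of (set_of_patterns, label) pairs with label in {0, 1}."""
    with_p = [y for pats, y in examples if pattern in pats]
    without_p = [y for pats, y in examples if pattern not in pats]
    n = len(examples)
    return (entropy([y for _, y in examples])
            - len(with_p) / n * entropy(with_p)
            - len(without_p) / n * entropy(without_p))

def top_patterns(examples, k):
    """Return the k patterns with the highest information gain (ties broken by name)."""
    patterns = set().union(*(pats for pats, _ in examples))
    return sorted(patterns, key=lambda p: (-information_gain(examples, p), p))[:k]

# Hypothetical labeled contexts: 1 = gene mention, 0 = not a gene mention.
examples = [
    ({"right=gene", "left=the"}, 1),
    ({"right=gene"}, 1),
    ({"left=the"}, 0),
    ({"right=patients"}, 0),
]
selected = top_patterns(examples, 2)
```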
Performance of different EDFs on named entity classification
| EDF type | Precision | Recall | F-score |
| EDF I | 77.07 | 79.71 | 78.37 |
| EDF I + EDF II (1-gram) | 80.73 | 86.90 | 83.70 (+5.33) |
| EDF I + EDF II (1,2-gram) | 81.17 | 87.57 | 84.24 (+5.87) |
| EDF I + EDF II (1,2,3-gram) |
Figure 6. Relation between named entity classification performance and unlabeled data. The years are the final publication years of MEDLINE abstracts. The 'Full text' setting includes all the MEDLINE abstracts and the TREC 2006 Genomics Track data collection.
Comparison of different FCD metrics
| FCD metric | Precision | Recall | F-score |
| Binary | 77.60 | 82.39 | 79.92 |
| | 78.74 | 83.22 | 80.92 (+1.00) |
| PMI | 79.51 | 84.84 | 82.09 (+2.17) |
| Normalized PMI | 79.83 | 85.33 | 82.49 (+2.57) |
| | 81.42 | 87.35 | 84.28 (+4.36) |
| | 81.52 | 87.54 | 84.42 (+4.50) |
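Three of the named metrics in the table above (Binary, PMI, Normalized PMI) can be computed from co-occurrence counts as sketched below. These are the textbook definitions, with normalized PMI taken as PMI divided by -log p(x, y). Whether the paper uses exactly these forms is an assumption, and the unnamed rows of the table are not covered.

```python
import math

def binary(n_xy):
    """1 if the two features ever co-occur, else 0."""
    return 1 if n_xy > 0 else 0

def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information log2(p(x,y) / (p(x) p(y))) from raw counts.

    Returning 0.0 when the pair never co-occurs is a convention of this sketch.
    """
    if n_xy == 0:
        return 0.0
    return math.log2((n_xy * n_total) / (n_x * n_y))

def normalized_pmi(n_xy, n_x, n_y, n_total):
    """PMI divided by -log2 p(x,y), which bounds the score to [-1, 1]."""
    if n_xy == 0:
        return 0.0
    if n_xy == n_total:
        return 1.0  # the pair occurs in every observation
    return pmi(n_xy, n_x, n_y, n_total) / -math.log2(n_xy / n_total)

# Hypothetical counts: the pair co-occurs 20 times in 1000 observations.
score_pmi = pmi(20, 40, 100, 1000)
score_npmi = normalized_pmi(20, 40, 100, 1000)
```

With these counts the pair co-occurs five times more often than independence would predict, so PMI equals log2(5); normalization rescales that value so scores for rare and frequent pairs are easier to compare.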
Comparison of performance and applicability of different NER systems on BioCreative 2 test set
| System or authors | Precision | Recall | F-score | # of features | Tagging complexity | Availability |
| CRF 1 (ABNER+) | 87.30 | 80.68 | 83.86 | 171,251 | LM | N |
| CRF 2 (ABNER++) | 87.39 | 81.96 | 84.59 | 355,461 | LM | N |
| Dictionary | 90.37 | 82.40 | 86.20 | |||
| Dictionary + CRF 2 | 87.63 | 355,609 | LM | |||
| BANNER [ | 88.66 | 84.32 | 86.43 | 500,876 | LM+POS tagger | |
| Ando [ | 88.48 | 85.97 | 87.21 | -- | 2*LM+POS tagger+syntactic parser | N |
| Hsu | 88.95 | | 88.30 | 8 * 5,059,368 | 8*LM+POS tagger | N |
In the 6th column, 'LM' and 'Trie' refer to the time complexities of a linear model and of a trie-based dictionary match, respectively. The 'Dictionary' method does not need any features once the dictionary is constructed. For Ando's system, we could not find the number of features in the paper [5]. Since the systems in the last two rows used classifier combination, their tagging complexities and numbers of features are multiplied by the numbers of sub-models.