Hongning Wang, Minlie Huang, Shilin Ding, Xiaoyan Zhu.
Abstract
BACKGROUND: Efficient features play an important role in automated text classification, which facilitates access to large-scale data. In the bioscience field, biological structures and terminologies are described by a large number of features, and domain-dependent features would significantly improve classification performance. The major issue studied in this paper is how to effectively select and integrate different types of features to improve the performance of biological literature classification.
Year: 2008 PMID: 18426549 PMCID: PMC2349297 DOI: 10.1186/1471-2105-9-S3-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
KL Divergence on Training, Cross Validation and Testing Set
| 0.0216 | 0.0703 | |
| 0.0369 |
(Top 50 features according to Chi-Square statistics)
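The record does not say how the KL divergence between data sets was computed. A minimal sketch, assuming it is taken over the normalized frequency distributions of a shared set of top features, with a small additive smoothing term (`eps`, a hypothetical parameter) to avoid division by zero:

```python
import math

def kl_divergence(p_counts, q_counts, vocab, eps=1e-6):
    """KL(P || Q) over a shared feature vocabulary.

    p_counts and q_counts map feature -> raw count in each data set.
    The smoothing constant eps is an assumption, not taken from the paper.
    """
    p_total = sum(p_counts.get(f, 0) for f in vocab) + eps * len(vocab)
    q_total = sum(q_counts.get(f, 0) for f in vocab) + eps * len(vocab)
    kl = 0.0
    for f in vocab:
        p = (p_counts.get(f, 0) + eps) / p_total
        q = (q_counts.get(f, 0) + eps) / q_total
        kl += p * math.log(p / q)
    return kl
```

A value near zero indicates that the two sets use the selected features with similar relative frequencies; larger values indicate a distribution shift.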
Figure 1. Overlap of Features between Training and Testing Set
(Top 300 selected distinct features from the training and testing set according to Chi-Square statistics respectively)
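For illustration, the Chi-Square ranking and the overlap measure can be sketched as follows. This assumes the common 2x2 contingency formulation of the Chi-Square statistic for a term and a class; the paper's exact variant is not given in this record, and the function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-Square statistic for one term and one class.

    a: in-class documents containing the term
    b: out-of-class documents containing the term
    c: in-class documents without the term
    d: out-of-class documents without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def top_k_overlap(train_scores, test_scores, k):
    """Fraction of the top-k features (by score) shared by both sets."""
    top_train = set(sorted(train_scores, key=train_scores.get, reverse=True)[:k])
    top_test = set(sorted(test_scores, key=test_scores.get, reverse=True)[:k])
    return len(top_train & top_test) / k
```

Calling `top_k_overlap` for increasing `k` gives an overlap curve of the kind the figure caption describes.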
TF*ML Feature Value Schema.
Precision, Recall and F-Score measure the classification capability of the model, while AUC (area under the receiver operating characteristic curve) evaluates its ranking capability.
| 0.7015 | 0.8213 | 0.7567 | 0.8036 | |
| 0.7014 | 0.7796 | 0.8231 |
(Performance under unigram feature)
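The four measures can be computed as in the sketch below, which assumes binary labels (1 for relevant, 0 for irrelevant), the balanced F-Score (harmonic mean of precision and recall), and AUC computed as the fraction of positive/negative pairs ranked correctly, with ties counted as one half.

```python
def f_score(precision, recall):
    """Balanced F-Score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(y_true, y_pred):
    """Precision, recall and F-Score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a consistency check, `f_score(0.7015, 0.8213)` is approximately 0.7567, which matches the third value in the first row of the table above and suggests the columns are Precision, Recall, F-Score and AUC.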
KL Divergence on Training, Cross Validation and Testing Set
| 0.0029 | 0.0163 | |
| 0.0357 |
(Top 50 features according to Chi-Square statistics)
Top 10 Unigram Features and String Features ('_' denotes a white space)
| Unigram feature | String feature |
| interaction | interac |
| bind | nteract |
| interact | _intera |
| domain | teracti |
| proteome | eractio |
| proteomic | proteom |
| complex | raction |
| protein | _domain |
| yeast | binding |
| kinase | _proteo |
(According to Chi-Square statistics)
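The string features in the table are all seven characters long, so a length-fixed string feature can be read as a character window slid over the text. A minimal sketch under that reading; the window length `n=7` is inferred from the table, while lowercasing and padding the text with a leading and trailing '_' are assumptions:

```python
def fixed_length_strings(text, n=7):
    """Slide a window of n characters over the text.

    Whitespace runs are collapsed to a single '_' so that features such as
    '_intera' record a word boundary.
    """
    normalized = "_".join(text.lower().split())
    padded = "_" + normalized + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For example, `fixed_length_strings("protein interaction")` contains `_intera`, `interac`, `nteract` and `raction`, all of which appear in the table. Such windows capture morphological variants (interact, interaction, interacting) without stemming.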
KeyBT-extracted Templates.
| <PTN> E* <DNA> E* association E* <PTN> |
| <PTN> E* bind E* <DNA> |
| <PTN> E* interact E* <PTN> |
| <PTN> E* colocalize E* <CEL> |
| <PTN> E* contact E* <DNA> E* <PTN> |
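How the templates are matched is not described in this record. The sketch below assumes that `E*` stands for any (possibly empty) run of tokens, that named entities in the sentence have already been replaced by tags such as `<PTN>` and `<DNA>`, and that words have been reduced to their stems. The function names are hypothetical.

```python
import re

def template_to_regex(template):
    """Compile a template into a regex over space-terminated tokens."""
    parts = ["(?:^| )"]  # a match must start at a token boundary
    for tok in template.split():
        if tok == "E*":
            parts.append(r"(?:\S+ )*?")  # any run of whole tokens, possibly empty
        else:
            parts.append(re.escape(tok) + " ")
    return re.compile("".join(parts))

def matches_template(template, tokens):
    """True if the tagged, stemmed token sequence matches the template."""
    text = " ".join(tokens) + " "
    return template_to_regex(template).search(text) is not None
```

For example, `matches_template("<PTN> E* bind E* <DNA>", ["<PTN>", "can", "bind", "to", "<DNA>"])` returns `True`. Each matched template could then serve as one binary feature of the document.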
Length-fixed String Feature (TF*IDF)
| 0.7015 | 0.8213 | 0.7567 | 0.8036 | |
| 0.6497 | 0.7615 | 0.8245 |
(Performance under TF*IDF schema)
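As a reminder of what the TF*IDF schema computes, here is a minimal sketch using raw term frequency and `log(N / df)` as the inverse document frequency. The exact variant used in the paper is not stated here, so this weighting is an assumption.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {feature: weight} dict per document.

    docs is a list of feature lists (for example, unigrams or
    length-fixed strings extracted from each abstract).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each feature once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors
```

Features that occur in every document receive a weight of zero, while features concentrated in few documents are weighted up.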
Named Entity and Semantic Template Feature
| 0.7015 | 0.8213 | 0.7567 | 0.8036 | |
| 0.5815 | 0.7243 | 0.7570 | ||
| 0.7647 | 0.7973 | 0.7806 | 0.8156 | |
| 0.7653 | 0.7746 |
Feature-level Integration
| 0.7044 | 0.8960 | 0.7887 | 0.8416 | |
| 0.7360 | 0.8773 | 0.8004 | 0.8479 | |
| 0.7416 | 0.8880 | 0.8082 | 0.8372 | |
| 0.7584 | 0.8373 | 0.7959 |
(Normalize each part of the features and unify them into new feature vectors)
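A minimal sketch of this feature-level integration, assuming each feature part (for example string, entity and template features) is scaled to unit L2 norm before concatenation. The choice of norm is an assumption; the note above only says that each part is normalized.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are kept)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def integrate_features(*blocks):
    """Normalize each feature block separately, then concatenate them."""
    combined = []
    for block in blocks:
        combined.extend(l2_normalize(block))
    return combined
```

Normalizing per block keeps a large block, such as thousands of string features, from dominating a small one, such as a handful of templates, in the unified vector.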
Classifier-level Integration.
Integration on length-fixed string feature, entity feature and template feature
| 0.7015 | 0.8213 | 0.7567 | 0.8036 | |
| 0.7248 | 0.8853 | |||
| AdaBoost | 0.7995 | 0.8933 |
(Normalize the output of each classifier and unify them into new feature vectors)
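The classifier-level scheme can be sketched as a stacking step: each base classifier's scores are normalized, and the normalized scores for one document form its new feature vector. Min-max scaling is an assumption here, and the second-level learner (the table above lists AdaBoost) is not shown.

```python
def min_max(scores):
    """Rescale one classifier's scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def stack_outputs(outputs_per_classifier):
    """Turn per-classifier score lists into per-document feature vectors.

    outputs_per_classifier[i][j] is classifier i's score for document j.
    """
    normalized = [min_max(scores) for scores in outputs_per_classifier]
    return [list(row) for row in zip(*normalized)]
```

Each row of the result has one entry per base classifier and can be passed to a second-level classifier for the final decision.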
Statistical Significance Test (s-test).
The null hypothesis is that the performance of two methods is the same; the alternative hypothesis is that the former is better than the latter.
| 0.015 | 0.012 | 0.0188 | |
| 0.0026 | 0.0010 | ||
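The record does not define the s-test. Assuming it is a sign test over paired decisions, as is common in text categorization comparisons, the one-sided p-value can be sketched as follows: `wins` counts the cases where the former method is right and the latter wrong, `losses` the reverse, and ties are discarded.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """One-sided sign test.

    Under the null hypothesis each non-tied case is a win with
    probability 0.5; the p-value is P(X >= wins) for X ~ Binomial(n, 0.5).
    """
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

A small p-value, like those in the table above, rejects the null hypothesis in favour of the former method being better.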
Mean, Standard Deviation and Best Performance from BioCreAtIvE 2006 vs. Our Final Performance.
The best BioCreAtIvE 2006 value for each measure is selected separately from the 51 runs submitted by 19 teams.
| 0.6642 | 0.7636 | 0.6868 | 0.7351 | ||
| 0.0810 | 0.1926 | 0.1035 | 0.0741 | ||
| - | - | 0.7800 | 0.8554 | ||
| - | 0.7995 | 0.8933 |