| Literature DB >> 21998156 |
Aurélie Névéol, W John Wilbur, Zhiyong Lu.
Abstract
MOTIVATION: Research in the biomedical domain can have a major impact through open sharing of the data produced. For this reason, it is important to be able to identify instances of data production and deposition for potential re-use. Herein, we report on the automatic identification of data deposition statements in research articles.
Year: 2011 PMID: 21998156 PMCID: PMC3223368 DOI: 10.1093/bioinformatics/btr573
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Overview of annotated datasets used in this work.
Overview of component occurrences in data deposition statements
| Component | Unique occurrences | Total occurrences | Variability (%) |
|---|---|---|---|
| Data | 468 | 645 | 73 |
| Action | 77 | 611 | 13 |
| Location (general) | 387 | 584 | 66 |
| Location (detailed)ᵃ | 521 | 534 | 98 |
ᵃWhen accession numbers were unified, the variability lessened considerably, with only 71 unique occurrences.
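The Variability column is consistent with unique occurrences divided by total occurrences, rounded to the nearest percent. A minimal check (figures taken from the table above):

```python
# Verify that Variability (%) = round(100 * unique / total) for each component.
rows = {
    "Data":                (468, 645, 73),
    "Action":              (77,  611, 13),
    "Location (general)":  (387, 584, 66),
    "Location (detailed)": (521, 534, 98),
}

for name, (unique, total, reported) in rows.items():
    variability = round(100 * unique / total)
    assert variability == reported, (name, variability, reported)
    print(f"{name}: {variability}%")
```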
Average precision of SVM and NB models for 5-fold cross-validation with various feature sets
|  | Token | Position tags | POS tags | Component | SVM | NB |
|---|---|---|---|---|---|---|
| One-feature set | X |  |  |  | 95.68 | 94.95 |
| Two-feature sets | X | X |  |  | 95.91 | 94.96 |
| Two-feature sets | X |  | X |  | 97.33 | 96.11 |
| Two-feature sets | X |  |  | X | 97.02 | 96.75 |
| Three-feature sets | X | X | X |  | 97.40 | 96.11 |
| Three-feature sets | X | X |  | X | 97.04 | 96.75 |
| Three-feature sets | X |  | X | X | **97.98** | **97.23** |
| All four feature sets | X | X | X | X | 97.23 |  |
The best performance is shown in bold characters.
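The paper's excerpt does not include code; purely as an illustration, here is a hypothetical sketch of how the four feature sets (tokens, position tags, POS tags, component tags) might be combined into symbolic features for a linear classifier. All function names, tag values, and the example sentence are invented, not taken from the paper:

```python
def sentence_features(tokens, pos_tags, component_tags, position_in_article,
                      use_position=True, use_pos=True, use_components=True):
    """Combine token, position, POS, and component features for one sentence.

    tokens / pos_tags / component_tags are parallel lists; position_in_article
    is a coarse section label such as "METHODS". All names are illustrative.
    """
    feats = [f"tok={t.lower()}" for t in tokens]            # token features
    if use_position:
        feats.append(f"sect={position_in_article}")          # position-tag feature
    if use_pos:
        feats += [f"pos={p}" for p in pos_tags]              # part-of-speech features
    if use_components:
        # deposition-statement components (Data / Action / Location), "O" = none
        feats += [f"comp={c}" for c in component_tags if c != "O"]
    return feats

feats = sentence_features(
    tokens=["sequences", "deposited", "in", "GenBank"],
    pos_tags=["NNS", "VBN", "IN", "NNP"],
    component_tags=["DATA", "ACTION", "O", "LOCATION"],
    position_in_article="METHODS",
)
print(feats)
```

Ablating the keyword arguments reproduces the one-, two-, and three-feature-set configurations compared in the table above.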
Overall precision (P), recall (R), F-measure (F) and accuracy (A) of NB and SVM models for sentence classification
| Model | Features | P | R | F | A |
|---|---|---|---|---|---|
| NB | Tokens, position, POS tags | 60 | 70 | 75 |  |
| NB | Above features plus component tags | 78 | 79 | 86 |  |
| SVM | Tokens, position, POS tags | 74 | 81 | 77 | 84 |
| SVM | Above features plus component tags | 78 | 83 |  |  |
Threshold is set at the 25th percentile of model scores on the training set Train-D. The best performance is shown in bold characters.
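The 25th-percentile decision threshold can be reproduced in a few lines. The score values below are made up for illustration, and the nearest-rank percentile method is an assumption (the excerpt does not specify the interpolation):

```python
def percentile_threshold(scores, pct=25):
    """Return the value at the given percentile of the scores
    (nearest-rank method; interpolation choice is an assumption)."""
    ordered = sorted(scores)
    idx = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[idx]

train_scores = [0.12, 0.35, 0.48, 0.51, 0.63, 0.70, 0.81, 0.94]  # illustrative
threshold = percentile_threshold(train_scores, 25)
# A sentence is labelled a deposition statement when its score >= threshold.
print(threshold)
```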
Error analysis for SVM sentence classification
| Classification error | Error type | Cases |
|---|---|---|
| False negative | Low score | 34 |
| False negative | GS dispute | 2 |
| False negative | Ambiguous sentence | 3 |
| False negative | Total | 39 |
| False positive | Data reuse | 32 |
| False positive | Database mention | 7 |
| False positive | Ambiguous sentence | 7 |
| False positive | GS dispute | 6 |
| False positive | Non-biological data | 4 |
| False positive | Total | 56 |
Positive precision (P), recall (R) and F-measure (F) of NB and SVM models for article classification on the test set
| Model | Features | P | R | F |
|---|---|---|---|---|
| NB | Tokens, position, POS tags | 67 | 74 |  |
| NB | Above features plus component tags | 83 | 78 |  |
| SVM | Tokens, position, POS tags | 82 | 75 | 79 |
| SVM | Above features plus component tags | 76 |  |  |
Threshold is set at the 25th percentile of model scores on the training set Train-D. The best performance is shown in bold characters.
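At the article level, one plausible reading (an assumption, not stated verbatim in this excerpt) is that an article is flagged as reporting data deposition when its best-scoring sentence clears the threshold. A sketch under that assumption:

```python
def classify_article(sentence_scores, threshold):
    """Flag an article as a positive (data-deposition) instance when any
    sentence score reaches the threshold; returns (label, best_score).

    Assumption: article label = max over sentence scores vs. threshold.
    """
    if not sentence_scores:          # e.g. no scorable sentences in the article
        return False, None
    best = max(sentence_scores)
    return best >= threshold, best

label, best = classify_article([0.10, 0.42, 0.87], threshold=0.5)
print(label, best)
```

This max-aggregation reading would also explain the error categories in the next table: articles with no deposition sentence found, or whose only candidate sentence was never scored, can only be false negatives.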
Error analysis for article classification with NB model
| Error type | Cases |
|---|---|
| Low score, ranked in top 5 | 49 |
| Low score, ranked in top 10 | 2 |
| Low score, other rank | 2 |
| No deposition sentence found in article | 6 |
| Sentence not scored (length >500) | 2 |
| Total | 61 |
Fig. 2. Precision/recall curves for SVM and NB models built using all features.