K Bretonnel Cohen, Helen L Johnson, Karin Verspoor, Christophe Roeder, Lawrence E Hunter.
Abstract
BACKGROUND: An increase in work on the full text of journal articles and the growth of PubMedCentral have the potential to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full-text journal articles differ from the abstracts that have so far been the subject of most biomedical text mining research.
Year: 2010 PMID: 20920264 PMCID: PMC3098079 DOI: 10.1186/1471-2105-11-492
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of sentence length between different article sections.
| | Abst. | Intro. | Results | Methods | Disc. | Concl. | Capt. |
|---|---|---|---|---|---|---|---|
| Abst. | | | | | | | |
| Intro. | * | | | | | | |
| Results | * | * | | | | | |
| Methods | * | * | * | | | | |
| Disc. | * | -- | -- | * | | | |
| Concl. | -- | -- | -- | -- | -- | | |
| Capt. | * | * | * | * | * | * | |
| Median/mean | 25/26.49 | 28/29.52 | 29/31.17 | 23/26.80 | 29/30.35 | 26/28.34 | 21/24.85 |
Statistically significantly different pairs at P < .01 are marked by *. Pairs that are not statistically significantly different are marked by --.
Figure 1. Distribution of mean sentence lengths across the various article sections. See Table 1 for significance.
Summary of results on parenthesis usage distribution.
| | Abstracts | Bodies |
|---|---|---|
| List enumerators | 3 | 1,399 |
| Part of gene name | -- | -- |
| Table or figure | -- | 2,825 |
| Citation | -- | 172 |
| P value | 2 | 146 |
| Data | 11 | 2,116 |
| Singular/plural | 2 | 33 |
| Abbreviation or symbol | 124 | 1,862 |
| Parenthetical statement | 23 | 3,453 |
| Unknown | 61 | 4,831 |
| Total | 226 | 16,837 |
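The raw counts in the parenthesis usage table are hard to compare directly because bodies contain far more parentheses than abstracts. A minimal sketch of one way to normalize them, assuming the per-column totals are the right denominators, is to express a category as a percentage of all parentheses in that text set. The function and variable names are illustrative; the counts are taken from the "Abbreviation or symbol" and "Total" rows.

```python
# Counts copied from the parenthesis usage table.
PAREN_TOTALS = {"abstracts": 226, "bodies": 16837}
ABBREVIATION = {"abstracts": 124, "bodies": 1862}


def share_percent(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# Share of parentheses used for abbreviations or symbols in each text set.
shares = {
    name: share_percent(ABBREVIATION[name], PAREN_TOTALS[name])
    for name in PAREN_TOTALS
}

for name, value in shares.items():
    print(f"{name}: {value:.1f}% of parentheses")
```

On these counts, abbreviations and symbols account for over half of the parentheses in abstracts but only about a ninth of those in bodies.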
Differences in incidence of linguistic features per thousand tokens.
| | Abstracts | Bodies |
|---|---|---|
| Conjunction | 37.5 | 36.2 |
| Passives | 3.7* | 4.3* |
| Negation | 3.8* | 5.3* |
| Pronominal anaphora | 5.3** | 3.98** |
Values marked with * are significantly different at the P < .01 level. Values marked with ** are significantly different at the P < .001 level.
Gene mention tagger performance for six combinations of tagger and model.
| | Abstracts | | | | | | Bodies | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tagger/model | P | R | F | TP | FP | FN | P | R | F | TP | FP | FN |
| Human annotations | 1,180 | | | | | | 22,432 | | | | | |
| ABNER/BioCreative | 0.634 | 0.393 | 0.485 | 464 | 267 | 716 | 0.505 | 0.322 | 0.394 | 7,240 | 7,081 | 15,186 |
| ABNER/NLPBA | 0.629 | 0.363 | 0.460 | 429 | 253 | 751 | 0.464 | 0.298 | 0.363 | 6,694 | 7,719 | 15,730 |
| BANNER/BioCreative | 0.678 | 0.482 | 0.563 | 569 | 270 | 611 | 0.540 | 0.473 | 0.504 | 10,621 | 9,042 | 11,806 |
| LingPipe/BioCreative | 0.591 | 0.575 | 0.583 | 679 | 468 | 501 | 0.353 | 0.583 | 0.440 | 13,094 | 23,992 | 9,330 |
| LingPipe/GENIA | 0.398 | 0.309 | 0.348 | 365 | 551 | 815 | 0.250 | 0.284 | 0.266 | 6,380 | 19,101 | 16,046 |
| LingPipe/JNLPBA | 0.464 | 0.277 | 0.347 | 327 | 377 | 853 | 0.289 | 0.256 | 0.271 | 5,741 | 10,498 | 16,684 |
The Human annotations row gives the number of gene mentions in each set of texts. P, R, and F are precision, recall, and F-measure; TP, FP, and FN are counts of true positives, false positives, and false negatives.
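The scores in the gene mention tagger table can be recomputed from its counts. The sketch below assumes that the six columns for each text set are precision, recall, F-measure, true positives, false positives, and false negatives, and that F is the balanced F-measure. This reading fits the arithmetic but is an interpretation, not something the table states. The function name is illustrative; the example counts are the ABNER/BioCreative row for abstracts.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and balanced F-measure from raw counts."""
    if tp <= 0:
        raise ValueError("tp must be positive for the scores to be defined")
    precision = tp / (tp + fp)   # fraction of tagger output that is correct
    recall = tp / (tp + fn)      # fraction of gold mentions that were found
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# ABNER/BioCreative on abstracts: TP=464, FP=267, FN=716.
p, r, f = precision_recall_f(tp=464, fp=267, fn=716)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

The results agree with the tabulated 0.634, 0.393, and 0.485 to within rounding. TP + FN for this row also equals the 1180 human-annotated gene mentions in abstracts.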
Figure 2. Precision of the various tagger/model combinations. Each pair of bars shows the gene mention system followed by the model on which it was trained; e.g., LingPipe/JNLPBA is the LingPipe gene mention system trained on the JNLPBA data.
Figure 3. F-measure of the various tagger/model combinations. Each pair of bars shows the gene mention system followed by the model on which it was trained; e.g., LingPipe/JNLPBA is the LingPipe gene mention system trained on the JNLPBA data.
Figure 4. Recall of the various tagger/model combinations. Each pair of bars shows the gene mention system followed by the model on which it was trained; e.g., LingPipe/JNLPBA is the LingPipe gene mention system trained on the JNLPBA data.
The number of abstracts and article bodies mentioning the four semantic classes of named entities that we examined, out of 97 abstracts and article bodies.
| Semantic class | Abstracts mentioning | Bodies mentioning |
|---|---|---|
| Genes | 94 | 97 |
| Mutations | 1 | 19 |
| Drugs | 18 | 85 |
| Diseases | 65 | 96 |
Except for genes, distributions differed markedly.
Average number of mentions of semantic class in abstracts and bodies.
| Semantic class | Abstracts average | Bodies average |
|---|---|---|
| Genes | 15 | 280 |
| Mutations | 0.02 | 1.74 |
| Drugs | 0.72 | 13.6 |
| Diseases | 1 | 23 |
Density of mentions of semantic class per thousand words in abstracts and bodies.
| Semantic class | Abstracts | Bodies |
|---|---|---|
| Genes | 61 | 47 |
| Mutations | 0.08* | 0.28* |
| Drugs | 2.97* | 2.21* |
| Diseases | 4.1* | 3.74* |
Statistically significantly different pairs at P < .01 are marked by *.
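Density per thousand words normalizes mention counts by text length, which is why bodies can have far more mentions on average yet a lower density than abstracts. A minimal sketch of the calculation follows. The word count in the example is hypothetical and chosen only for illustration; the record does not report word counts.

```python
def density_per_thousand(mentions, words):
    """Return the number of mentions per thousand words."""
    if words <= 0:
        raise ValueError("words must be positive")
    return 1000.0 * mentions / words


# Hypothetical input: 15 gene mentions in a 250-word abstract.
example = density_per_thousand(mentions=15, words=250)
print(f"{example:.1f} mentions per thousand words")
```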
Phenomena that are and are not normally distributed between abstracts and article bodies.
| Normally distributed | Not normally distributed |
|---|---|
| Coordination | Sentence length |
| Pronominal anaphora | Negation |
| Passives | |
| Gene mentions | |
| Disease mentions | |
| Mutation mentions | |
| Drug mentions | |