Xu-Cheng Yin1, Chun Yang1, Wei-Yi Pei1, Haixia Man2, Jun Zhang1, Erik Learned-Miller3, Hong Yu4.
Abstract
Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high-quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally, we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/.
Year: 2015 PMID: 25951377 PMCID: PMC4423993 DOI: 10.1371/journal.pone.0126200
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Representative biomedical figures and their texts.
(a) experimental results (gene sequence), (b) research models, and (c) biomedical objects.
Fig 2. An example biomedical figure with a complex layout, color text, and irregular text arrangement.
Fig 3. The annotation tool for DeTEXT.
The figure and its annotated text regions are shown on the left. The annotated information (e.g., text and locations) is shown on the right, together with functions for displaying the figure (e.g., zooming in and out).
Fig 4. An example of the annotation information.
Each figure in the database has a corresponding ground truth file (a “.txt” file that stores the annotation information), in which each line records the information of the text in one text region.
Annotation agreement on 10 randomly selected figures.
| | Original annotations | Re-annotations |
|---|---|---|
| Number of text regions | 181 | 189 |
| Number of text regions which have | 176 | 176 |
| Number of text regions which have | 176 | 176 |
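
The counts in the table can be turned into agreement rates. The following is a minimal sketch, assuming the 176 regions in the last two rows are those on which the two annotation passes are consistent (the row labels are truncated in this copy). The function name `agreement_rate` is illustrative and not part of DeTEXT.

```python
def agreement_rate(matched: int, total: int) -> float:
    """Fraction of the regions in one annotation pass that were matched."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matched / total


original_total = 181     # text regions in the original annotations
reannotated_total = 189  # text regions in the re-annotations
matched = 176            # assumption: regions consistent across both passes

rate_original = agreement_rate(matched, original_total)
rate_reannotated = agreement_rate(matched, reannotated_total)

print(f"{rate_original:.1%} of original regions matched")        # 97.2%
print(f"{rate_reannotated:.1%} of re-annotated regions matched")  # 93.1%
```

Under that assumption, roughly 97% of the original regions and 93% of the re-annotated regions are consistent between the two passes.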
Fig 5. Examples of disagreement between the original annotation and the re-annotation. Thick blue and red boxes mark text regions with inconsistent annotations.
Statistics of text (word) regions and figures with different categories.
| Text region category | NO. of regions (%) | NO. of figures (%) |
|---|---|---|
| Normal | 3519 (37.8%) | 424 (84.8%) |
| Small | 2419 (26.0%) | 151 (30.2%) |
| Blurry | 1118 (12.0%) | 65 (13.0%) |
| Color | 293 (3.1%) | 39 (7.8%) |
| Short | 4354 (46.8%) | 379 (75.8%) |
| Complex_background | 670 (7.2%) | 86 (17.2%) |
| Complex_symbol | 240 (2.6%) | 75 (15.0%) |
| Specific_text | 74 (0.8%) | 14 (2.8%) |
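
The percentages in this table can be recomputed from the dataset totals given in the abstract (9308 text regions, 500 figures). The sketch below is only a consistency check on the published numbers; the variable and function names are illustrative.

```python
TOTAL_REGIONS = 9308  # text regions in DeTEXT (from the abstract)
TOTAL_FIGURES = 500   # figures in DeTEXT (from the abstract)

# category -> (number of regions, number of figures), copied from the table
category_counts = {
    "Normal": (3519, 424),
    "Small": (2419, 151),
    "Blurry": (1118, 65),
    "Color": (293, 39),
    "Short": (4354, 379),
    "Complex_background": (670, 86),
    "Complex_symbol": (240, 75),
    "Specific_text": (74, 14),
}


def as_percent(count: int, total: int) -> float:
    """Share of `total` as a percentage, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


percentages = {
    name: (as_percent(regions, TOTAL_REGIONS), as_percent(figures, TOTAL_FIGURES))
    for name, (regions, figures) in category_counts.items()
}

for name, (region_pct, figure_pct) in percentages.items():
    print(f"{name}: {region_pct}% of regions, {figure_pct}% of figures")

# The per-category region counts add up to more than the dataset total.
region_sum = sum(regions for regions, _ in category_counts.values())
print(region_sum, ">", TOTAL_REGIONS)
```

The region counts sum to 12687, which exceeds 9308, so a single region can carry more than one category label. This is consistent with the table of category combinations.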
Fig 6. Region samples of different categories.
Statistics of text (word) regions and figures with combination of categories.
| Combination of region categories | NO. of regions | NO. of figures |
|---|---|---|
| short, complex_symbol | 71 | 18 |
| small, short | 1786 | 126 |
| complex_background, complex_symbol | 23 | 9 |
| color, short | 96 | 22 |
| small, blurry | 858 | 47 |
| small, blurry, short | 485 | 33 |
| short, complex_background | 279 | 48 |
| blurry, short | 603 | 44 |
| small, complex_symbol | 19 | 9 |
| color, specific_text | 35 | 2 |
| small, blurry, complex_symbol | 7 | 5 |
| small, complex_background | 106 | 13 |
| blurry, complex_symbol | 14 | 7 |
| small, short, complex_background | 47 | 8 |
| color, complex_background | 81 | 16 |
| color, short, complex_background | 24 | 9 |
| small, color, short | 10 | 4 |
| color, complex_symbol | 2 | 1 |
| small, color | 28 | 7 |
| small, blurry, complex_background | 43 | 4 |
| small, blurry, short, complex_background | 9 | 2 |
| short, complex_background, complex_symbol | 5 | 2 |
| small, short, complex_symbol | 5 | 2 |
| blurry, complex_background, complex_symbol | 2 | 1 |
| blurry, short, complex_background | 11 | 3 |
| small, color, complex_background | 15 | 2 |
| complex_background, specific_text | 3 | 2 |
Statistics of text (word) regions with orientation attributes.
| Orientation attribute | NO. of regions | NO. of figures |
|---|---|---|
| Horizontal | 8461 | 492 |
| Oriented | 847 | 268 |
Statistics of biomedical figures with five different types.
| | | | | | |
|---|---|---|---|---|---|
| NO. of figures | 16 | 46 | 232 | 124 | 82 |
Training, validation, and testing sets of DeTEXT.
| Subset | NO. of figures | NO. of articles | Remarks |
|---|---|---|---|
| Training set | 100 | 100 | Select one figure for each article. |
| Validation set | 100 | 45 | Randomly select 45 articles from the dataset that remains after removing the training set, and include all common figures in these articles. |
| Testing set | 300 | 143 | The remaining subset after selecting the validation set. |
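
The remarks in the table describe an article-level split, in which no article contributes figures to more than one subset. A minimal sketch of such a split follows, run on toy data. It assumes the training articles are also chosen at random, which the table does not state, and the function `split_by_article` and the figure identifiers are hypothetical rather than taken from the DeTEXT tooling.

```python
import random


def split_by_article(figures_by_article, n_train_articles, n_val_articles, seed=0):
    """Split figures into train/val/test so that subsets share no article."""
    rng = random.Random(seed)
    articles = sorted(figures_by_article)
    rng.shuffle(articles)  # assumption: articles are assigned at random

    train_articles = articles[:n_train_articles]
    val_articles = articles[n_train_articles:n_train_articles + n_val_articles]
    test_articles = articles[n_train_articles + n_val_articles:]

    # Training set: one figure per selected article.
    train = [rng.choice(figures_by_article[a]) for a in train_articles]
    # Validation and testing sets: every figure of the selected articles.
    val = [f for a in val_articles for f in figures_by_article[a]]
    test = [f for a in test_articles for f in figures_by_article[a]]
    return train, val, test


# Toy data: 6 articles with 2 figures each (DeTEXT has 288 articles, 500 figures).
toy = {
    f"article_{i}": [f"article_{i}_fig_{j}" for j in range(2)]
    for i in range(6)
}
train, val, test = split_by_article(toy, n_train_articles=2, n_val_articles=1)
print(len(train), len(val), len(test))  # 2 2 6
```

With the sizes from the table, the call would use `n_train_articles=100` and `n_val_articles=45`, leaving the remaining 143 articles for testing.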
Statistics of text regions and figures with different categories on the training, validation, and testing sets.
| Text region category | Regions (training) | Regions (validation) | Regions (testing) | Figures (training) | Figures (validation) | Figures (testing) |
|---|---|---|---|---|---|---|
| Normal | 731 | 597 | 2191 | 76 | 83 | 265 |
| Small | 703 | 483 | 1233 | 37 | 36 | 78 |
| Blurry | 638 | 8 | 472 | 28 | 1 | 36 |
| Color | 52 | 11 | 230 | 7 | 3 | 29 |
| Short | 964 | 780 | 2610 | 81 | 63 | 235 |
| Complex_background | 270 | 126 | 294 | 24 | 15 | 47 |
| Complex_symbol | 112 | 20 | 128 | 33 | 5 | 42 |
| Specific_text | 10 | 8 | 56 | 2 | 5 | 7 |
Challenges for text detection and recognition from biomedical literature figures.
| Challenges | Sub-categorization | Difficulty |
|---|---|---|
| From image quality and complex images | Blurred text | “blurry” (see |
| | Small-size character | “small” (see |
| | Color image / text | “color” (see |
| | Complex background and layout | “complex_background” (see |
| From text complexity | Short word | “short” (see |
| | Complex symbol | “complex_symbol” (see |
| | Specific text | “specific_text” (see |
| | Oriented text | “oriented” (see |