Tobias Kuhn, Mate Levente Nagy, Thaibinh Luong, Michael Krauthammer.
Abstract
Authors of biomedical publications use gel images to report experimental results such as protein-protein interactions or protein expression under different conditions. Gel images offer a concise way to communicate such findings, not all of which need to be explicitly discussed in the article text. This fact, together with the abundance of gel images and the patterns they share, makes them prime candidates for automated image mining and parsing. We introduce an approach for the detection of gel images and present a workflow to analyze them. We are able to detect gel segments and panels with high accuracy, and we present preliminary results for the identification of gene names in these images. While we cannot provide a complete solution at this point, we present evidence that this kind of image mining is feasible.
Year: 2014 PMID: 24568573 PMCID: PMC4190668 DOI: 10.1186/2041-1480-5-10
Source DB: PubMed Journal: J Biomed Semantics
Figure 1. Categorization of images from open access articles of PubMed Central.
Figure 2. Two examples of gel images from biomedical publications (PMID 19473536 and 15125785) with tables showing the relations that could be extracted from them.
Figure 3. The procedure of our approach: (1) figure extraction, (2) segmentation, (3) text recognition, (4) gel detection, (5) gel panel detection, (6) named entity recognition, and (7) relation extraction.
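The seven steps named in the Figure 3 caption can be read as a linear pipeline in which each stage consumes what the earlier stages produced. The following is only a minimal sketch of that ordering, not the authors' implementation: the stage names come from the caption, while `FigureState`, `run_pipeline`, the handler signature, and the toy file name are hypothetical names introduced purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names follow the Figure 3 caption; all code around them is hypothetical.
STAGES: List[str] = [
    "figure extraction",
    "segmentation",
    "text recognition",
    "gel detection",
    "gel panel detection",
    "named entity recognition",
    "relation extraction",
]


@dataclass
class FigureState:
    """Hypothetical container holding the output of every stage for one figure."""
    source: str
    results: Dict[str, object] = field(default_factory=dict)


Handler = Callable[[FigureState], object]


def run_pipeline(state: FigureState, handlers: Dict[str, Handler]) -> FigureState:
    """Run the stages in caption order; a stage without a handler stores None."""
    for stage in STAGES:
        handler = handlers.get(stage)
        state.results[stage] = handler(state) if handler else None
    return state


# Toy usage: only the segmentation stage has a (placeholder) handler.
state = run_pipeline(
    FigureState(source="example-figure.png"),
    {"segmentation": lambda s: ["segment-1", "segment-2"]},
)
print(list(state.results))
```

Keeping the stage order in a single list makes the dependency explicit: a later stage such as gel panel detection can read the segmentation and text recognition results already stored in `state.results`.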
The results of the gel segment detection classifiers (top) and the gel panel detection algorithm (bottom)
| Target | Method | Threshold | Precision | Recall | F-score | ROC area |
|---|---|---|---|---|---|---|
| Segments | Random forests | 0.15 | 0.439 | 0.909 | 0.592 | |
| Segments | Random forests | 0.30 | 0.765 | 0.739 | 0.752 | |
| Segments | Random forests | 0.60 | 0.926 | 0.301 | 0.455 | |
| Segments | Naive Bayes | | 0.172 | 0.739 | 0.279 | 0.883 |
| Segments | Bayesian network | | 0.394 | 0.531 | 0.452 | 0.914 |
| Segments | PART decision list | | 0.631 | 0.496 | 0.555 | 0.777 |
| Segments | Convolutional networks | | 0.142 | 0.949 | 0.248 | |
| Panels | Hand-coded rules | | 0.951 | 0.368 | 0.530 | |
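The F-score values in the table above appear consistent with the balanced F-score, the harmonic mean of precision and recall. The sketch below recomputes it from the listed precision and recall; the recomputed values agree with the table to within about 0.001, a difference that presumably comes from rounding of the published figures. The function name `f_score` and the row labels are illustrative choices, not taken from the source.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Random forests, threshold 0.15": (0.439, 0.909),
    "Random forests, threshold 0.30": (0.765, 0.739),
    "Random forests, threshold 0.60": (0.926, 0.301),
    "Hand-coded rules (panels)": (0.951, 0.368),
}

for name, (p, r) in rows.items():
    print(f"{name}: F = {f_score(p, r):.3f}")
```

The three random forest rows show the usual trade-off: raising the threshold increases precision and lowers recall, and the middle threshold of 0.30 yields the highest F-score of the three.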
The results of running the pipeline on the open access subset of PubMed Central
| Quantity | Value |
|---|---|
| Total articles | 410 950 |
| Processed articles | 386 428 |
| Total figures from processed articles | 1 110 643 |
| Processed figures | 884 152 |
| Detected gel panels | 85 942 |
| Detected gel panels per figure | 0.097 |
| Detected gel labels | 309 340 |
| Detected gel labels per panel | 3.599 |
| Detected gene tokens | 1 854 609 |
| Detected gene tokens in gel labels | 75 610 |
| Gene token ratio | 0.033 |
| Gene token ratio in gel labels | 0.068 |
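Two of the ratios in the table above can be reproduced from the counts listed next to them. The sketch below assumes that "per figure" refers to processed figures rather than total figures, an assumption that the arithmetic supports. The two gene token ratios are not recomputed because their denominators (the total numbers of recognized tokens) are not listed. The dictionary keys are illustrative names.

```python
# Counts copied from the table above.
counts = {
    "processed_figures": 884_152,
    "gel_panels": 85_942,
    "gel_labels": 309_340,
}

# Assumption: the per-figure ratio is taken over processed figures.
panels_per_figure = counts["gel_panels"] / counts["processed_figures"]
labels_per_panel = counts["gel_labels"] / counts["gel_panels"]

print(f"Detected gel panels per figure: {panels_per_figure:.3f}")
print(f"Detected gel labels per panel:  {labels_per_panel:.3f}")
```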
Numbers of recognized gene/protein tokens in 2 000 random figures
| Category | Tokens | Share |
|---|---|---|
| Not mentioned (OCR errors) | 28 | 17.9% |
| Not references to genes or proteins | 26 | 16.7% |
| Partially correct (could be more specific) | 14 | 9.0% |
| Fully correct | 88 | 56.4% |
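The percentages in the table above follow from the token counts if the four categories are taken to be exhaustive, which the shares summing to 100% suggests. A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Token counts copied from the table above.
token_counts = {
    "Not mentioned (OCR errors)": 28,
    "Not references to genes or proteins": 26,
    "Partially correct (could be more specific)": 14,
    "Fully correct": 88,
}

# Assumption: the four categories cover every evaluated token.
total = sum(token_counts.values())
shares = {name: 100 * n / total for name, n in token_counts.items()}

for name, share in shares.items():
    print(f"{name}: {token_counts[name]} of {total} ({share:.1f}%)")
```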
Figure 4. Original and mask image after ConvNet classification for an exemplary image from PMID 14993249. Green means gel, brown means other, and white means not enough gradient information.