Tim Adams1, Marcin Namysl2,3, Alpha Tom Kodamullil1, Sven Behnke2,3, Marc Jacobs1. 1. Fraunhofer Institute for Algorithms and Scientific Computing, Schloss Birlinghoven, Sankt Augustin, Germany. 2. Fraunhofer Institute for Intelligent Analysis and Information Systems, Schloss Birlinghoven, Sankt Augustin, Germany. 3. Autonomous Intelligent Systems, Computer Science Institute VI, University of Bonn.
Abstract
MOTIVATION: Table recognition systems are widely used to extract and structure quantitative information from the vast number of documents that are increasingly available from different open sources. While many systems already perform well on tables with a simple layout, tables in the biomedical domain are often much more complex. Benchmark and training data for such tables are, however, very limited. RESULTS: To address this issue, we present a novel, highly curated benchmark data set based on a hand-curated literature corpus on neurological disorders, which can be used to tune and evaluate table extraction applications for this challenging domain. We evaluate several state-of-the-art table extraction systems on our proposed benchmark and discuss challenges that emerged during the benchmark creation, as well as factors that can impact the performance of recognition methods. For the evaluation procedure, we propose a new metric as well as several improvements that result in a better performance evaluation. AVAILABILITY: The resulting benchmark data set, as well as the source code of our novel evaluation approach, can be openly accessed. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.