PDF text classification to leverage information extraction from publication reports.

Duy Duc An Bui, Guilherme Del Fiol, Siddhartha Jonnalagadda.

Abstract

OBJECTIVES: Data extraction from original study reports is a time-consuming, error-prone process in systematic review development. Information extraction (IE) systems have the potential to assist humans in the extraction task; however, the majority of IE systems were not designed to work on Portable Document Format (PDF) documents, an important and common extraction source for systematic reviews. In a PDF document, narrative content is often mixed with publication metadata or semi-structured text, which adds challenges to the underlying natural language processing algorithm. Our goal is to categorize PDF texts for strategic use by IE systems.
METHODS: We used an open-source tool to extract raw texts from a PDF document and developed a text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories. To validate the algorithm, we developed a gold standard of PDF reports that were included in the development of previous systematic reviews by the Cochrane Collaboration. In a two-step procedure, we evaluated (1) classification performance, comparing it with that of machine learning classifiers, and (2) the effects of the algorithm on an IE system that extracts clinical outcome mentions.
RESULTS: The multi-pass sieve algorithm achieved an accuracy of 92.6%, which was 9.7% (p<0.001) higher than the best performing machine learning classifier that used a logistic regression algorithm. F-measure improvements were observed in the classification of TITLE (+15.6%), ABSTRACT (+54.2%), BODYTEXT (+3.7%), SEMISTRUCTURE (+34%), and METADATA (+14.2%). In addition, use of the algorithm to filter semi-structured texts and publication metadata improved performance of the outcome extraction system (F-measure +4.1%, p=0.002). It also reduced the number of sentences to be processed by 44.9% (p<0.001), which corresponds to a processing time reduction of 50% (p=0.005).
CONCLUSIONS: The rule-based multi-pass sieve framework can be used effectively in categorizing texts extracted from PDF documents. Text classification is an important prerequisite step to leverage information extraction from PDF documents.
Copyright © 2016 Elsevier Inc. All rights reserved.
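
The multi-pass sieve idea described in METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the actual rules, so every rule, threshold, and function name below (`metadata_sieve`, `semistructure_sieve`, `title_sieve`, `abstract_sieve`, `classify`, `narrative_only`) is a hypothetical stand-in. The sketch only shows the general pattern: ordered passes, each labeling the snippets it is confident about and deferring the rest, followed by filtering before information extraction.

```python
import re
from typing import Callable, List, Optional

# A sieve looks at one snippet and its position, and returns a label
# or None to defer the decision to a later pass.
Sieve = Callable[[str, int], Optional[str]]

def metadata_sieve(text: str, index: int) -> Optional[str]:
    # Hypothetical rule: DOI, copyright, ISSN, or e-mail-like patterns.
    if re.search(r"\bdoi\b|©|copyright|\bISSN\b|@\w+\.", text, re.IGNORECASE):
        return "METADATA"
    return None

def semistructure_sieve(text: str, index: int) -> Optional[str]:
    # Hypothetical rule: captions, or rows dominated by numeric tokens.
    tokens = text.split()
    if not tokens:
        return None
    numeric = sum(1 for t in tokens if re.fullmatch(r"[\d.,%()<>=±-]+", t))
    if re.match(r"(Table|Figure|Fig\.)\s+\d+", text) or numeric / len(tokens) > 0.5:
        return "SEMISTRUCTURE"
    return None

def title_sieve(text: str, index: int) -> Optional[str]:
    # Hypothetical rule: the first snippet, short, with no final period.
    if index == 0 and len(text.split()) <= 30 and not text.rstrip().endswith("."):
        return "TITLE"
    return None

def abstract_sieve(text: str, index: int) -> Optional[str]:
    # Hypothetical rule: an early snippet that opens with an abstract heading.
    heading = r"(abstract|objectives?|background|methods|results|conclusions?)\s*[:.]"
    if index < 10 and re.match(heading, text, re.IGNORECASE):
        return "ABSTRACT"
    return None

# Passes run in this order; the ordering itself is an assumption.
SIEVES: List[Sieve] = [metadata_sieve, semistructure_sieve, title_sieve, abstract_sieve]

def classify(snippets: List[str], default: str = "BODYTEXT") -> List[str]:
    labels: List[Optional[str]] = [None] * len(snippets)
    for sieve in SIEVES:                      # one full pass per sieve
        for i, text in enumerate(snippets):
            if labels[i] is None:             # earlier decisions are never overwritten
                labels[i] = sieve(text, i)
    return [label or default for label in labels]

def narrative_only(snippets: List[str]) -> List[str]:
    # Drop semi-structured text and metadata before running an IE system.
    labels = classify(snippets)
    skip = {"SEMISTRUCTURE", "METADATA"}
    return [s for s, label in zip(snippets, labels) if label not in skip]

snippets = [
    "PDF text classification to leverage information extraction from publication reports",
    "Abstract: Data extraction from original study reports is time-consuming.",
    "Copyright © 2016 Elsevier Inc. All rights reserved.",
    "12.5 (3.1) 14.2 (2.9) p<0.001",
    "We evaluated the classifier on a gold standard of PDF reports.",
]
print(list(zip(classify(snippets), snippets)))
```

Because each pass only touches snippets that are still unlabeled, a decision made by an earlier pass is never revised by a later one, and anything no pass claims falls through to the default category.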

Keywords:  Document analysis; Machine learning; Natural language processing; Text classification

Year: 2016; PMID: 27044929; PMCID: PMC4893911; DOI: 10.1016/j.jbi.2016.03.026

Source DB: PubMed; Journal: J Biomed Inform; ISSN: 1532-0464; Impact factor: 6.317


