
A System for Automated Extraction of Metadata from Scanned Documents using Layout Recognition and String Pattern Search Models.

Dharitri Misra, Siyuan Chen, George R Thoma.   

Abstract

One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, and official government records, where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.

At the U.S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.

In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
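The rule-based search step described in the abstract can be illustrated with a minimal sketch. It is not the NLM implementation: the field names, the regular expressions, and the sample text are all hypothetical, chosen only to show how string patterns within and around a field might be used to pull metadata from the text of an already recognized layout.

```python
import re
from typing import Dict, List, Optional, Pattern

# Hypothetical rules: each metadata field maps to an ordered list of patterns.
# The first capture group of the first matching pattern is taken as the value.
FIELD_RULES: Dict[str, List[Pattern[str]]] = {
    # Cue text before the field ("Case No.") identifies the number that follows.
    "case_number": [re.compile(r"\bCase\s+No\.?\s*(\d+)", re.IGNORECASE)],
    # A leading keyword plus a date-shaped string.
    "date": [re.compile(r"\b(?:Dated|Issued)[:\s]+([A-Z][a-z]+ \d{1,2}, \d{4})")],
    # Text after "U. S. v." up to the next period or end of line.
    "defendant": [re.compile(r"\bU\.\s?S\.\s+v\.\s+(.+?)(?:\.|$)", re.MULTILINE)],
}


def extract_metadata(text: str) -> Dict[str, Optional[str]]:
    """Apply each field's patterns in order; None means no rule matched."""
    result: Dict[str, Optional[str]] = {}
    for field, patterns in FIELD_RULES.items():
        result[field] = None
        for pattern in patterns:
            match = pattern.search(text)
            if match:
                result[field] = match.group(1).strip()
                break
    return result


# Hypothetical OCR text from one recognized layout zone.
SAMPLE = (
    "Case No. 1234\n"
    "U. S. v. Acme Drug Company. Misbranding of tablets.\n"
    "Issued: March 5, 1941\n"
)

if __name__ == "__main__":
    for name, value in extract_metadata(SAMPLE).items():
        print(f"{name}: {value}")
```

In a system like the one described, a separate rule set would presumably be selected per recognized layout type. Here a single rule table stands in for that step.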


Year:  2009        PMID: 21179386      PMCID: PMC3004227     

Source DB:  PubMed          Journal:  Archiving        ISSN: 2161-8798


  1 in total

1.  Interactive Publication: The document as a research tool.

Authors:  George R Thoma; Glenn Ford; Sameer Antani; Dina Demner-Fushman; Michael Chung; Matthew Simpson
Journal:  Web Semant       Date:  2010-07-01       Impact factor: 1.897

