Doublet method for very fast autocoding.

Jules J Berman.

Abstract

BACKGROUND: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding.
METHODS: An autocoder was written that transforms plain text into intercalated word doublets (e.g., "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Each matching doublet is assigned the numeric code specific to that doublet in the nomenclature. Text doublets that do not match the index cannot be part of a valid nomenclature term. Runs of matching doublets from the text are concatenated and matched against nomenclature terms (also represented as runs of doublets).
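The steps in the METHODS paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Perl implementation distributed with the article: the tokenizer, the function names, the three-term nomenclature, and the codes "C0001" to "C0003" are all hypothetical, single-word terms are skipped, and the per-doublet numeric codes are replaced by direct lookups of doublet runs.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric words (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def doublets(words):
    """Return the overlapping word pairs ("doublets") of a word list."""
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_index(nomenclature):
    """Index a nomenclature given as a hypothetical dict of term -> code.

    Returns a map from each term's run of doublets to its code, plus the
    set of all doublets that occur in any term. Single-word terms have no
    doublets and are skipped in this sketch.
    """
    term_runs = {}
    doublet_index = set()
    for term, code in nomenclature.items():
        run = tuple(doublets(tokenize(term)))
        if run:
            term_runs[run] = code
            doublet_index.update(run)
    return term_runs, doublet_index


def autocode(text, term_runs, doublet_index):
    """Return (term, code) pairs for nomenclature terms found in the text."""
    words = tokenize(text)
    pairs = doublets(words)
    found = []
    i = 0
    while i < len(pairs):
        if pairs[i] not in doublet_index:
            i += 1  # this doublet cannot be part of any term
            continue
        # Extend the run of consecutive doublets that are in the index.
        j = i
        while j < len(pairs) and pairs[j] in doublet_index:
            j += 1
        # Inside the run, report the longest term starting at each position.
        for start in range(i, j):
            for end in range(j, start, -1):
                code = term_runs.get(tuple(pairs[start:end]))
                if code is not None:
                    found.append((" ".join(words[start:end + 1]), code))
                    break
        i = j
    return found


# Hypothetical nomenclature; the codes are made up for illustration.
nomenclature = {
    "ciliary body": "C0001",
    "aqueous humor": "C0002",
    "ciliary body melanoma": "C0003",
}
term_runs, doublet_index = build_index(nomenclature)
print(autocode("The ciliary body produces aqueous humor",
               term_runs, doublet_index))
# prints [('ciliary body', 'C0001'), ('aqueous humor', 'C0002')]
```

Most text doublets fail the first set lookup and are discarded immediately, so only short runs of candidate doublets ever reach the term comparison. In this sketch that early rejection is the step that keeps the work per word small.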
RESULTS: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder.
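The speed figures in the RESULTS paragraph can be cross-checked with simple arithmetic. The short Python snippet below uses only the numbers reported above; the variable names are illustrative, and the corpus size is taken as 170 Megabytes, the lower bound implied by "170+".

```python
# Figures reported in the RESULTS paragraph.
corpus_megabytes = 170   # "170+ Megabyte" corpus, treated as a lower bound
doublet_seconds = 211    # run time of the doublet autocoder
phrase_seconds = 1776    # run time of the phrase autocoder

speedup = phrase_seconds / doublet_seconds
throughput = corpus_megabytes / doublet_seconds

print(f"speed-up: {speedup:.1f}x")           # prints speed-up: 8.4x
print(f"throughput: {throughput:.1f} MB/s")  # prints throughput: 0.8 MB/s
```

Both derived values agree with the reported 8.4-fold speed-up and the rate of 0.8 Megabytes per second.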
CONCLUSIONS: The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.

Year:  2004        PMID: 15369595      PMCID: PMC521082          DOI: 10.1186/1472-6947-4-16

Source DB:  PubMed          Journal:  BMC Med Inform Decis Mak        ISSN: 1472-6947            Impact factor:   2.796


  5 in total

1.  SPIN query tools for de-identified research on a humongous database.

Authors:  Clement J McDonald; Paul Dexter; Gunther Schadow; Henry C Chueh; Greg Abernathy; John Hook; Lonnie Blevins; J Marc Overhage; Jules J Berman
Journal:  AMIA Annu Symp Proc       Date:  2005

2.  A comparison of Intelligent Mapper and document similarity scores for mapping local radiology terms to LOINC.

Authors:  Daniel J Vreeman; Clement J McDonald
Journal:  AMIA Annu Symp Proc       Date:  2006

3.  Automatic extraction of candidate nomenclature terms using the doublet method.

Authors:  Jules J Berman
Journal:  BMC Med Inform Decis Mak       Date:  2005-10-18       Impact factor: 2.796

4.  Improved de-identification of physician notes through integrative modeling of both public and private medical text.

Authors:  Andrew J McMurry; Britt Fitch; Guergana Savova; Isaac S Kohane; Ben Y Reis
Journal:  BMC Med Inform Decis Mak       Date:  2013-10-02       Impact factor: 2.796

5.  NOBLE - Flexible concept recognition for large-scale biomedical natural language processing.

Authors:  Eugene Tseytlin; Kevin Mitchell; Elizabeth Legowski; Julia Corrigan; Girish Chavan; Rebecca S Jacobson
Journal:  BMC Bioinformatics       Date:  2016-01-14       Impact factor: 3.169
