
A method for determining the number of documents needed for a gold standard corpus.

David Juckett.

Abstract

The unstructured narratives in medicine have been increasingly targeted for content extraction using the techniques of natural language processing (NLP). In most cases, these efforts are facilitated by creating a manually annotated set of narratives containing the ground truth, commonly referred to as a gold standard corpus. This corpus is used for modeling, fine-tuning, and testing NLP software, as well as providing the basis for training in machine learning. Determining the number of annotated documents (size) for this corpus is important but rarely described; instead, the factors of cost and time appear to dominate decision-making about corpus size. In this report, a method is outlined to determine gold standard size based on the capture probabilities for the unique words within a target corpus. To demonstrate this method, a corpus of dictation letters from the Michigan Pain Consultant (MPC) clinics for pain management is described and analyzed. A well-formed working corpus of 10,000 dictations was first constructed to provide a representative subset of the total, with no more than one dictation letter per patient. Each dictation was divided into words and common words were removed. The Poisson function was used to determine probabilities of word capture within samples taken from the working corpus, and then integrated over word length to give a single capture probability as a function of sample size. For these MPC dictations, a sample size of 500 documents is predicted to give a capture probability of approximately 0.95. Continuing the demonstration of sample selection, a provisional gold standard corpus of 500 documents was selected and examined for its similarity to the MPC structured coding and demographic data available for each patient. It is shown that a representative sample, of justifiable size, can be selected for use as a gold standard.
Copyright © 2012 Elsevier Inc. All rights reserved.
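The Poisson-based capture idea in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: assuming a word appearing in f of the N working-corpus documents has an expected count of f·n/N in a random sample of n documents, its capture probability under the Poisson approximation is 1 − exp(−f·n/N), and an overall capture probability can be taken as the mean over all unique words.

```python
import math

def capture_probability(word_doc_freqs, corpus_size, sample_size):
    """Mean probability that each unique word appears at least once in a
    random sample of `sample_size` documents drawn from a working corpus
    of `corpus_size` documents, using the Poisson approximation
    P(capture) = 1 - exp(-f * n / N) for a word found in f documents."""
    probs = [1.0 - math.exp(-f * sample_size / corpus_size)
             for f in word_doc_freqs]
    return sum(probs) / len(probs)

# Toy example (invented frequencies, not the MPC data): number of
# working-corpus documents that contain each unique word.
freqs = [1, 2, 5, 10, 50, 200]
N = 10000          # working-corpus size, as in the MPC demonstration
p = capture_probability(freqs, N, 500)
```

Sweeping `sample_size` upward until the mean capture probability reaches a target (the paper reports roughly 0.95 at 500 documents for the MPC corpus) yields the recommended gold standard size.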


Year:  2012        PMID: 22245601     DOI: 10.1016/j.jbi.2011.12.010

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


Related articles:  6 in total

1.  Discordant patient pain level reporting between questionnaires and physician encounters of the same day.

Authors:  David A Juckett; Fred N Davis; Mark Gostine; Eric P Kasten; Philip L Reed; Joseph Gardiner; Rebecca Risko
Journal:  AMIA Annu Symp Proc       Date:  2017-02-10

2.  [Review] Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis.

Authors:  S Velupillai; D Mowery; B R South; M Kvist; H Dalianis
Journal:  Yearb Med Inform       Date:  2015-08-13

3.  Assisted annotation of medical free text using RapTAT.

Authors:  Glenn T Gobbel; Jennifer Garvin; Ruth Reeves; Robert M Cronin; Julia Heavirland; Jenifer Williams; Allison Weaver; Shrimalini Jayaramaraja; Dario Giuse; Theodore Speroff; Steven H Brown; Hua Xu; Michael E Matheny
Journal:  J Am Med Inform Assoc       Date:  2014-01-15       Impact factor: 4.497

4.  Patient-reported outcomes in a large community-based pain medicine practice: evaluation for use in phenotype modeling.

Authors:  David A Juckett; Fred N Davis; Mark Gostine; Philip Reed; Rebecca Risko
Journal:  BMC Med Inform Decis Mak       Date:  2015-05-28       Impact factor: 2.796

5.  [Review] Machine learning in pain research.

Authors:  Jörn Lötsch; Alfred Ultsch
Journal:  Pain       Date:  2018-04       Impact factor: 6.961

6.  Identifying Suicide Ideation and Suicidal Attempts in a Psychiatric Clinical Research Database using Natural Language Processing.

Authors:  Andrea C Fernandes; Rina Dutta; Sumithra Velupillai; Jyoti Sanyal; Robert Stewart; David Chandran
Journal:  Sci Rep       Date:  2018-05-09       Impact factor: 4.379

