
A method for determining the number of documents needed for a gold standard corpus.

David Juckett.

Abstract

The unstructured narratives in medicine have been increasingly targeted for content extraction using the techniques of natural language processing (NLP). In most cases, these efforts are facilitated by creating a manually annotated set of narratives containing the ground truth, commonly referred to as a gold standard corpus. This corpus is used for modeling, fine-tuning, and testing NLP software, as well as providing the basis for training in machine learning. Determining the number of annotated documents (size) for this corpus is important but rarely described; instead, the factors of cost and time appear to dominate decision-making about corpus size. In this report, a method is outlined to determine gold standard size based on the capture probabilities for the unique words within a target corpus. To demonstrate this method, a corpus of dictation letters from the Michigan Pain Consultant (MPC) clinics for pain management is described and analyzed. A well-formed working corpus of 10,000 dictations was first constructed to provide a representative subset of the total, with no more than one dictation letter per patient. Each dictation was divided into words and common words were removed. The Poisson function was used to determine probabilities of word capture within samples taken from the working corpus, and then integrated over word length to give a single capture probability as a function of sample size. For these MPC dictations, a sample size of 500 documents is predicted to give a capture probability of approximately 0.95. Continuing the demonstration of sample selection, a provisional gold standard corpus of 500 documents was selected and examined for its similarity to the MPC structured coding and demographic data available for each patient. It is shown that a representative sample, of justifiable size, can be selected for use as a gold standard.
Copyright © 2012 Elsevier Inc. All rights reserved.
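The Poisson-based capture idea in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: assuming a word appearing in f of the N working-corpus documents has an expected count of f·n/N in a random sample of n documents, its capture probability under the Poisson approximation is 1 − exp(−f·n/N), and an overall capture probability can be taken as the mean over all unique words.

```python
import math

def capture_probability(word_doc_freqs, corpus_size, sample_size):
    """Mean probability that each unique word appears at least once in a
    random sample of `sample_size` documents drawn from a working corpus
    of `corpus_size` documents, using the Poisson approximation
    P(capture) = 1 - exp(-f * n / N) for a word found in f documents."""
    probs = [1.0 - math.exp(-f * sample_size / corpus_size)
             for f in word_doc_freqs]
    return sum(probs) / len(probs)

# Toy example (invented frequencies, not the MPC data): number of
# working-corpus documents that contain each unique word.
freqs = [1, 2, 5, 10, 50, 200]
N = 10000          # working-corpus size, as in the MPC demonstration
p = capture_probability(freqs, N, 500)
```

Sweeping `sample_size` upward until the mean capture probability reaches a target (the paper reports roughly 0.95 at 500 documents for the MPC corpus) yields the recommended gold standard size.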


Year:  2012        PMID: 22245601     DOI: 10.1016/j.jbi.2011.12.010

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


Related articles:  6 in total

1.  Discordant patient pain level reporting between questionnaires and physician encounters of the same day.

Authors:  David A Juckett; Fred N Davis; Mark Gostine; Eric P Kasten; Philip L Reed; Joseph Gardiner; Rebecca Risko
Journal:  AMIA Annu Symp Proc       Date:  2017-02-10

2.  [Review] Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis.

Authors:  S Velupillai; D Mowery; B R South; M Kvist; H Dalianis
Journal:  Yearb Med Inform       Date:  2015-08-13

3.  Assisted annotation of medical free text using RapTAT.

Authors:  Glenn T Gobbel; Jennifer Garvin; Ruth Reeves; Robert M Cronin; Julia Heavirland; Jenifer Williams; Allison Weaver; Shrimalini Jayaramaraja; Dario Giuse; Theodore Speroff; Steven H Brown; Hua Xu; Michael E Matheny
Journal:  J Am Med Inform Assoc       Date:  2014-01-15       Impact factor: 4.497

4.  Patient-reported outcomes in a large community-based pain medicine practice: evaluation for use in phenotype modeling.

Authors:  David A Juckett; Fred N Davis; Mark Gostine; Philip Reed; Rebecca Risko
Journal:  BMC Med Inform Decis Mak       Date:  2015-05-28       Impact factor: 2.796

5.  [Review] Machine learning in pain research.

Authors:  Jörn Lötsch; Alfred Ultsch
Journal:  Pain       Date:  2018-04       Impact factor: 6.961

6.  Identifying Suicide Ideation and Suicidal Attempts in a Psychiatric Clinical Research Database using Natural Language Processing.

Authors:  Andrea C Fernandes; Rina Dutta; Sumithra Velupillai; Jyoti Sanyal; Robert Stewart; David Chandran
Journal:  Sci Rep       Date:  2018-05-09       Impact factor: 4.379

