Stephen B Johnson1, Michael E Bales2, Daniel Dine3, Suzanne Bakken3, Paul J Albert4, Chunhua Weng3. 1. Department of Public Health, Weill Cornell Medical College, New York, United States. Electronic address: johnsos@med.cornell.edu. 2. Department of Biomedical Informatics, Columbia University, New York, United States. 3. Department of Biomedical Informatics, Columbia University, New York, United States; The Irving Institute for Clinical and Translational Research, Columbia University, New York, United States. 4. Samuel J. Wood Library, Weill Cornell Medical College, New York, United States.
Abstract
OBJECTIVE: Publications are a key data source for investigator profiles and research networking systems. We developed ReCiter, an algorithm that automatically extracts bibliographies from PubMed using institutional information about the target investigators.
METHODS: ReCiter executes a broad query against PubMed, groups the results into clusters that appear to constitute distinct author identities, and selects the cluster that best matches the target investigator. Using information about investigators from one of our institutions, we compared ReCiter results to queries based on author name and institution and to citations extracted manually from the Scopus database. Five judges created a gold standard using citations of a random sample of 200 investigators.
RESULTS: About half of the 10,471 potential investigators had no matching citations in PubMed, and about 45% had fewer than 70 citations. Interrater agreement (Fleiss' kappa) for the gold standard was 0.81. Scopus achieved the best recall (sensitivity) of 0.81, while name-based queries had 0.78 and ReCiter had 0.69. ReCiter attained the best precision (positive predictive value) of 0.93, while Scopus had 0.85 and name-based queries had 0.31.
DISCUSSION: ReCiter accesses the most current citation data, uses limited computational resources, and minimizes manual entry by investigators. Generation of bibliographies using name-based queries will not yield high accuracy. Proprietary databases can perform well but require manual effort. Automated generation with higher recall is possible but requires additional knowledge about investigators.
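The cluster-and-select strategy described in the METHODS section can be illustrated with a minimal sketch. This is not ReCiter's actual implementation; the `Citation` structure, the affiliation-overlap clustering rule, and the scoring function are simplified assumptions used only to show the two-stage workflow: group candidate citations into apparent author identities, then pick the cluster that best matches the target investigator's known institutional information.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    """A candidate PubMed citation reduced to the fields used here (hypothetical)."""
    authors: list
    affiliations: list

def cluster_by_affiliation(citations):
    """Group citations into clusters that share at least one affiliation
    token -- a crude stand-in for identity clustering."""
    clusters = []
    for cit in citations:
        for cluster in clusters:
            if any(set(cit.affiliations) & set(c.affiliations) for c in cluster):
                cluster.append(cit)
                break
        else:
            clusters.append([cit])
    return clusters

def select_cluster(clusters, target_affiliations):
    """Pick the cluster whose pooled affiliations best overlap the
    target investigator's known institutions."""
    def score(cluster):
        tokens = set().union(*(set(c.affiliations) for c in cluster))
        return len(tokens & set(target_affiliations))
    return max(clusters, key=score)

# Toy example: two citations from Cornell, one from elsewhere.
candidates = [
    Citation(["Smith J"], ["Weill Cornell"]),
    Citation(["Smith J"], ["Weill Cornell", "New York"]),
    Citation(["Smith J"], ["Stanford"]),
]
clusters = cluster_by_affiliation(candidates)
best = select_cluster(clusters, ["Weill Cornell"])
```

In this toy run, the first two citations fall into one cluster and the Stanford citation into another; the Cornell cluster wins the selection step. A real system would cluster on richer evidence (coauthors, MeSH terms, journals) rather than affiliation strings alone.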
Authors: Vivienne J Zhu; Marc J Overhage; James Egg; Stephen M Downs; Shaun J Grannis Journal: J Am Med Inform Assoc Date: 2009-06-30 Impact factor: 4.497
Authors: Allison B McCoy; Adam Wright; Michael G Kahn; Jason S Shapiro; Elmer Victor Bernstam; Dean F Sittig Journal: BMJ Qual Saf Date: 2013-01-29 Impact factor: 7.035
Authors: Michael E Bales; Stephen B Johnson; Jonathan W Keeling; Kathleen M Carley; Frank Kunkel; Jacqueline A Merrill Journal: Am J Prev Med Date: 2011-07 Impact factor: 5.043
Authors: Michael E Bales; Daniel C Dine; Jacqueline A Merrill; Stephen B Johnson; Suzanne Bakken; Chunhua Weng Journal: J Biomed Inform Date: 2014-07-19 Impact factor: 6.317
Authors: Stephen B Johnson; Glen Whitney; Matthew McAuliffe; Hailong Wang; Evan McCreedy; Leon Rozenblit; Clark C Evans Journal: J Am Med Inform Assoc Date: 2010 Nov-Dec Impact factor: 4.497
Authors: Paul J Albert; Sarbajit Dutta; Jie Lin; Zimeng Zhu; Michael Bales; Stephen B Johnson; Mohammad Mansour; Drew Wright; Terrie R Wheeler; Curtis L Cole Journal: PLoS One Date: 2021-04-01 Impact factor: 3.240
Authors: Karen Elizabeth Gutzman; Michael E Bales; Christopher W Belter; Thane Chambers; Liza Chan; Kristi L Holmes; Ya-Ling Lu; Lisa A Palmer; Rebecca C Reznik-Zellen; Cathy C Sarli; Amy M Suiter; Terrie R Wheeler Journal: J Med Libr Assoc Date: 2018-01-02
Authors: Armen Yuri Gasparyan; Bekaidar Nurmashev; Marlen Yessirkepov; Dmitry A Endovitskiy; Alexander A Voronov; George D Kitas Journal: J Korean Med Sci Date: 2017-11 Impact factor: 2.153