Literature DB >> 18648494

Sample size and precision in NIH peer review.

David Kaplan¹, Nicola Lacetera, Celia Kaplan.

Abstract

The Working Group on Peer Review of the Advisory Committee to the Director of NIH has recommended that at least 4 reviewers should be used to assess each grant application. A sample size analysis of the number of reviewers needed to evaluate grant applications reveals that a substantially larger number of evaluators are required to provide the level of precision that is currently mandated. NIH should adjust their peer review system to account for the number of reviewers needed to provide adequate precision in their evaluations.

Entities: Disease Species

Mesh：

Year: 2008 PMID： 18648494 PMCID： PMC2447157 DOI： 10.1371/journal.pone.0002761

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

On February 21, 2008 the recommendations of the Working Group on Peer Review of the Advisory Committee to the Director of the National Institutes of Health (NIH) were posted on the internet [1]. This committee made several suggestions including shortening of the application size, giving applicants unambiguous feedback about resubmission, using short pre-buttals to correct factual errors in review, and eliminating the special status of amended applications. A further recommendation of the group was to “engage more persons to review each application – “optimally 4 or more” [2]. Thus, the Advisory Committee has left the actual number of reviewers to evaluate each grant application ambiguous. No guidelines were provided to determine the number of reviewers that would be needed. Consequently, we have conducted a statistical analysis to provide guidance in arriving at appropriate numbers. Our analysis shows an inherent statistical inconsistency in the NIH peer review recommendations concerning the number of reviewers. We also demonstrate how crucial this number is and how it influences the precision of the eventual score.

Analysis

For each grant proposal reviewers from the relevant scientific community are asked to report their evaluations within a pre-defined scale. The average grade obtained through this process is considered a valid estimate of the “true” value of the proposal. The survey sample size is a crucial parameter in determining whether we can rely on these mean estimates. Elementary sampling techniques give us the minimum number of respondents that are needed for the evaluation procedure to deliver reliable estimates:In expression (1), n is the minimum required sample size or number of evaluators. Zα/2 is the upper percentile of the standard normal distribution. For a 95% confidence interval and an alpha (type I error, i.e. the probability of rejecting the null hypothesis when it is true) of .05, Zα/2 is equal to 1.96. The parameter σ represents the underlying standard deviation. Finally, L indicates the desired half-width of the interval between two consecutive evaluations or the precision of the evaluation. There are two important implications of this equation. First, the inverse correlation between n and L indicates that more reviewers are needed to obtain a more fine-grained or precise evaluation. Moreover, this relation is exponential so that greater precision comes with an increasingly greater number of reviewers. Second, typically the standard deviation σ of a population is not observed and needs to be estimated. Since the data necessary to estimate σ for the review of biomedical research proposals have not been collected in a statistically robust sampling system, we have relied on a model system of peer review with short movie proposals reviewed on a scale from 1 to 5 by undergraduate students [Lacetera, Kaplan, Kaplan, submitted]. We used short movie proposals in order to increase the potential sample size since all undergraduate students could be considered expert enough to grade the proposals. In this study 10 proposals were scored by an average of 48 reviewers. The average standard deviation was approximately 1.0 with a standard deviation considerably less than 0.1. Therefore, we estimate σ to be equal to 1. Obviously, a more accurate estimate of the standard deviation can eventually be obtained for each form of application requested by NIH, although it should be clear that a large number of independent evaluators is required to make any estimate of σ reliable. Using equation (1), we can assess the effect of having 4 reviewers for each proposal. With four reviewers and a standard deviation of 1, the review would be expected to distinguish applications at the level of the unit interval:Thus, four reviewers would be able to distinguish among whole integer scores. Yet, in the evaluation of grant proposals NIH currently uses a 41-grade scale with a range of scores from 1.0 to 5.0 [3]. Moreover, these scores are averaged to yield a score with 3 significant figures instead of 2 [3]. It is this number, inappropriately expanded to 3 significant figures by averaging, that is used by NIH in their scoring decisions. Although NIH does not explain the rationale for the conversion of their scores to 3 significant figures, with 80,000 applications per year it seems likely that the NIH peer review system needs that level of precision to facilitate their making choices close to the funding line. As a consequence to the use of scores with 3 significant figures, differences as small as 0.01 are used in making funding decisions. Nevertheless, in order to obtain reliable scores with a precision level of 0.01, an unrealistically large number of reviewers would be needed:Expression (3) implies that, in order for a mean score of 3.56 to be taken as reliable and therefore as identifying a better, more promising proposal than one receiving a rating of 3.57, the evaluation of almost 40 thousand referees would need to be obtained. In Figure 1 the exponential relationship between the number of reviewers and the precision of the ratings that would provide reliable estimates of the mean is shown. On the x-axis, smaller numbers indicate higher precision. Even for a precision level of 0.1, as many as 384 reviewers would be required.

Figure 1

The relationship between the precision of the evaluation system (how fine-grained it is established to be) and the minimum required number of evaluators needed for reliable estimates.

The disconnect between the needed precision in order to allocate funds in a fair way and the number of reviewers required for this level of precision demonstrates a major inconsistency that underlies NIH peer review. With only four reviewers used for the evaluation of applications, an allocation system that requires a precision level in the range of 0.01 to 0.1 is not statistically meaningful and consequently not reliable. Moreover, the 4 reviewers NIH proposes are not independent which degrades the precision that could be obtained otherwise. Consequently, NIH faces a major challenge. On the one hand, a fine-grained evaluation is mandated by their review process. On the other hand, for such criterion to be consistent and meaningful, an unrealistically high number of evaluators, independent of each other, need to be involved for each and every proposal. Further insights can be derived from the analysis of expression (1). The value of σ is a measure of the underlying variability in the ratings. The minimum number of reviewers for any given degree of ratings precision decreases with decreasing standard deviations. The standard deviation across ratings is also an indicator of the degree of agreement among different reviewers. If the standard deviation is small, for instance equal to 0.01 instead of our previous working estimate of 1.0, there is essentially consensus among the referees. If σ = 0.01, then the following relation holds:Therefore, 4 independent evaluators can provide statistical legitimacy only under the circumstance of all evaluators giving essentially the same evaluation. For proposals that are expected to be more controversial, as potentially transformative ideas have been proposed to be5, a small number of evaluators would lead to unreliable mean estimates. Our estimate of σ is not based on an analysis of biomedical research experts judging research projects close to their area of specialty. Scoring standard deviations for large numbers of experts obtained in a statistically acceptable sampling system have not been collected. Instead, as described above, we have used a model system that has allowed us to readily collect opinion data about proposals with undefined potential. Although we believe our estimate is reasonable, it is informative to visualize how the sample size estimate varies with different values of standard deviation for a level of precision of 0.1 (Figure 2). It is evident that small sample sizes are able to provide levels of precision only when the standard deviation is exceptionally small. We used a level of precision of 0.1 because the NIH peer review system mandates scoring at this precision level. For greater levels of precision, as suggested by the conversion from 2 significant figures to 3, the increase in sample size is steeper with increasing standard deviation.

Figure 2

The relationship between the standard deviation of the scores and the minimum required number of evaluators needed for a precision of 0.1, which is the level of precision currently obtained in the NIH peer review system.

The importance of scoring accuracy ultimately relates to the rank ordering of proposals. In our model system there were 5 movie proposals with mean scores ranging from 3.46 to 3.64. We have analyzed how the rank ordering of these 5 proposals varied as reviewers were randomly included in the analysis from 1 to 40 reviewers (Figure 3). What is most striking in these graphs is the extreme variability in the rank ordering with low numbers of reviewers. For instance, the upper-left and lower-left panels of Figure 3 show proposals that had relatively good rankings with less than 10 reviewers but that ended with relatively poor rankings with over 30 reviewers. Conversely, the lower-right panel shows a proposal that began with poor rankings but settled at with the best ranking after 25 reviewers. Even the addition of 1 reviewer can markedly change the rank ordering of the proposals and consequently the funding decision. This effect is especially apparent when there are few reviewers. The number of reviewers has profound implications in terms of the actual funding decisions that are eventually made.

Figure 3

Five individual movie proposals were evaluated by 40 reviewers and the rank ordering of the proposals was assessed as reviewers were randomly included in the analysis.

The 5 proposals were closely spaced with mean scores of 3.46 to 3.64. Proposals that had the same score were given an averaged rank; the figures changed little by assigning proposals with the same score the highest ranking.

Five individual movie proposals were evaluated by 40 reviewers and the rank ordering of the proposals was assessed as reviewers were randomly included in the analysis.

Discussion

It is clear from our analysis that NIH needs to adjust their peer review system to account for low precision evaluations. Additionally, it would be valuable to determine the standard deviations of scores given by independent reviewers. This information could be used to obtain more appropriate estimates of σ and consequently would be invaluable in designing and implementing a statistically rational system of social choice for NIH. Our data demonstrate that funding decisions will vary widely with the number of reviewers in considering proposals that are closely scored. Making choices between applications that vary by less than 1 will require larger numbers of reviewers than NIH has been contemplating. Recognition of the statistical inconsistencies of NIH peer review will allow for the implementation of new policies that take into consideration the accepted relationship between the number of reviewers, the precision of scoring needed, and the standard deviation of the scores given. The Working Group also recommended shortening the length of the application although no specific suggestions were included2. Obviously, the length of the application impacts the number of reviewers that could possibly be used for scoring. More reviewers can be used for shorter applications. It is commonly accepted that NIH will not fund clinical trials that do not include a cogent sample size determination. It is ironic that NIH insists on this analysis for clinical studies but has not recognized its value in evaluating its own system of peer review. We posit that this analysis should be considered in the revisions of NIH scientific review. The NIH peer review structure has not been based in rigorous applications of statistical principles involving sampling [4]. It is this deficiency that explains the statistical weakness and inconsistency of NIH peer review. Although NIH has made an excellent effort to remedy some of the most egregious problems inherent to their peer review system, the Working Group has neither fully realized nor addressed the statistical problems that have beset the NIH peer review system.

1 in total

1. Point: Statistical analysis in NIH peer review--identifying innovation.

Authors: David Kaplan
Journal: FASEB J Date: 2007-02 Impact factor: 5.191

1 in total

18 in total

1. Reforming science: structural reforms.

Authors: Ferric C Fang; Arturo Casadevall
Journal: Infect Immun Date: 2011-12-19 Impact factor: 3.441

2. NIH peer review reform--change we need, or lipstick on a pig?

Authors: Ferric C Fang; Arturo Casadevall
Journal: Infect Immun Date: 2009-01-21 Impact factor: 3.441

3. Recommendations for Reviewers of Biomedical Imaging Grant Applications.

Authors: Mark D Pagel
Journal: Mol Imaging Biol Date: 2019-08 Impact factor: 3.488

4. Peer review versus editorial review and their role in innovative science.

Authors: Georg Steinhauser; Wolfram Adlassnig; Jesaka Ahau Risch; Serena Anderlini; Petros Arguriou; Aaron Zolen Armendariz; William Bains; Clark Baker; Martin Barnes; Jonathan Barnett; Michael Baumgartner; Thomas Baumgartner; Charles A Bendall; Yvonne S Bender; Max Bichler; Teresa Biermann; Ronaldo Bini; Eduardo Blanco; John Bleau; Anthony Brink; Darin Brown; Christopher Burghuber; Roy Calne; Brian Carter; Cesar Castaño; Peter Celec; Maria Eugenia Celis; Nicky Clarke; David Cockrell; David Collins; Brian Coogan; Jennifer Craig; Cal Crilly; David Crowe; Antonei B Csoka; Chaza Darwich; Topiciprin Del Kebos; Michele Derinaldi; Bongani Dlamini; Tomasz Drewa; Michael Dwyer; Fabienne Eder; Raúl Ehrichs de Palma; Dean Esmay; Catherine Evans Rött; Christopher Exley; Robin Falkov; Celia Ingrid Farber; William Fearn; Sophie Felsmann; Jarl Flensmark; Andrew K Fletcher; Michaela Foster; Kostas N Fountoulakis; Jim Fouratt; Jesus Garcia Blanca; Manuel Garrido Sotelo; Florian Gittler; Georg Gittler; Juan Gomez; Juan F Gomez; Maria Grazia Gonzales Polar; Jossina Gonzalez; Christoph Gösselsberger; Lynn Habermacher; Michael Hajek; Faith Hakala; Mary-Sue Haliburton; John Robert Hankins; Jason Hart; Sepp Hasslberger; Donalyn Hennessey; Andrea Herrmann; Mike Hersee; Connie Howard; Suzanne Humphries; Laeeth Isharc; Petar Ivanovski; Stephen Jenuth; Jens Jerndal; Christine Johnson; Yonas Keleta; Anna Kenny; Billie Kidd; Fritz Kohle; Jafar Kolahi; Marianne Koller-Peroutka; Lyubov Kostova; Arunachalam Kumar; Alejandro Kurosawa; Tony Lance; Michael Lechermann; Bernhard Lendl; Michael Leuchters; Evan Lewis; Edward Lieb; Gloria Lloyd; Angelika Losek; Yao Lu; Saadia Maestracci; Dennis Mangan; Alberto W Mares; Juan Mazar Barnett; Valerie McClain; John Sydney McNair; Terry Michael; Lloyd Miller; Partizia Monzani; Belen Moran; Mike Morris; Georg Mößmer; Johny Mountain; Onnie Mary Moyo Phuthe; Marcos Muñoz; Sheri Nakken; Anne Nduta Wambui; Bettina Neunteufl; Dimitrije Nikolić; Devesh V Oberoi; Gregory Obmode; Laura Ogar; Jo Ohara; Naion Olej Rybine; Bryan Owen; Kim Wilson Owen; Rakesh Parikh; Nicholas J G Pearce; Bernhard Pemmer; Chris Piper; Ian Prince; Terence Reid; Heiner Rindermann; Stefan Risch; Josh Robbins; Seth Roberts; Ajeandro Romero; Michael Thaddäus Rothe; Sergio Ruiz; Juliane Sacher; Wolfgang Sackl; Markus Salletmaier; Jairaj Sanand; Clemens Sauerzopf; Thomas Schwarzgruber; David Scott; Laura Seegers; David Seppi; Kyle Shields; Jolanta Siller-Matula; Beldeu Singh; Sibusio Sithole; Florian Six; John R Skoyles; Jildou Slofstra; Daphne Anne Sole; Werner F Sommer; Mels Sonko; Chrislie J Starr-Casanova; Marjorie Elizabeth Steakley; Wolfgang Steinhauser; Konstantin Steinhoff; Johannes H Sterba; Martin Steppan; Reinhard Stindl; Joe Stokely; Karri Stokely; Gilles St-Pierre; James Stratford; Christina Streli; Carl Stryg; Mike Sullivan; Johann Summhammer; Amhayes Tadesse; David Tavares; Laura Thompson; Alison Tomlinson; Jack Tozer; Siro I Trevisanato; Michaela Trimmel; Nicole Turner; Paul Vahur; Jennie van der Byl; Tine van der Maas; Leo Varela; Carlos A Vega; Shiloh Vermaak; Alex Villasenor; Matt Vogel; Georg von Wintzigerode; Christoph Wagner; Manuel Weinberger; Peter Weinberger; Nick Wilson; Jennifer Finocchio Wolfe; Michael A Woodley; Ian Young; Glenn Zuraw; Nicole Zwiren
Journal: Theor Med Bioeth Date: 2012-10

5. A quantitative linguistic analysis of National Institutes of Health R01 application critiques from investigators at one institution.

Authors: Anna Kaatz; Wairimu Magua; David R Zimmerman; Molly Carnes
Journal: Acad Med Date: 2015-01 Impact factor: 6.893

6. The calculus of committee composition.

Authors: Eric Libby; Leon Glass
Journal: PLoS One Date: 2010-09-17 Impact factor: 3.240

7. An analysis of preliminary and post-discussion priority scores for grant applications peer reviewed by the Center for Scientific Review at the NIH.

Authors: Michael R Martin; Andrea Kopstein; Joy M Janice
Journal: PLoS One Date: 2010-11-17 Impact factor: 3.240