
What do we know about grant peer review in the health sciences?

Susan Guthrie¹, Ioana Ghiga¹, Steven Wooding².

Abstract

Background: Peer review decisions award an estimated >95% of academic medical research funding, so it is crucial to understand how well they work and if they could be improved.
Methods: This paper summarises evidence from 105 papers identified through a literature search on the effectiveness and burden of peer review for grant funding.
Results: There is a remarkable paucity of evidence about the efficiency of peer review for funding allocation, given its centrality to the modern system of science. From the available evidence, we can identify some conclusions around the effectiveness and burden of peer review. The strongest evidence around effectiveness indicates a bias against innovative research. There is also fairly clear evidence that peer review is, at best, a weak predictor of future research performance, and that ratings vary considerably between reviewers. There is some evidence of age bias and cronyism. Good evidence shows that the burden of peer review is high and that around 75% of it falls on applicants. By contrast, many of the efforts to reduce burden are focused on funders and reviewers/panel members.
Conclusions: We suggest funders should acknowledge, assess and analyse the uncertainty around peer review, even using reviewers' uncertainty as an input to funding decisions. Funders could consider a lottery element in some parts of their funding allocation process, to reduce both burden and bias, and allow better evaluation of decision processes. Alternatively, the distribution of scores from different reviewers could be better utilised as a possible way to identify novel, innovative research. Above all, there is a need for open, transparent experimentation and evaluation of different ways to fund research. This also requires more openness across the wider scientific community to support such investigations, acknowledging the lack of evidence about the primacy of the current system and the impossibility of achieving perfection.


Keywords: funding allocation; grant awarding; grant reviewing; peer review

Year: 2017 PMID: 29707193 PMCID: PMC5883382 DOI: 10.12688/f1000research.11917.2

Source DB: PubMed Journal: F1000Res ISSN: 2046-1402

Introduction

Health research has contributed enormously to society, but it is also expensive. This has led to increasing demands to understand and improve how research is supported. Most effort has focused on evaluating the impacts of research on society and the economy. Funders are attempting to gather evidence of impact using online survey platforms such as Researchfish in the UK, and national assessment frameworks including Excellence in Research for Australia (ERA). Much less work has focused on understanding how research is selected for support. Peer review is used to allocate the vast majority of competitive research funding internationally ( Ismail estimated that >95% of UK medical research funding was allocated by peer review). It is therefore crucial to understand whether peer review is effective and efficient: whether it can allocate research funding fairly and reliably, without bias. In this study, we carried out a rapid evidence assessment which asked whether the peer review process lives up to these aspirations. The research was commissioned by the Canadian Institutes of Health Research (CIHR) to support an ongoing review of CIHR’s peer review system, particularly the Peer Review Expert Panel which was convened to review the design and adjudication processes of CIHR’s investigator-initiated research programmes.

Methods

Search strategy

We identified relevant literature through five routes, using 2009 as our cut-off date because this was the date of our previous review ( Ismail ):

1. Google Scholar search using the search terms below, for publications from 2009 onwards. We reviewed the top 500 search results for each query. Search terms:
   ‘Grant peer review’
   ‘Grant review’ AND ‘panel’
   (‘Peer review’ AND ‘funding application’) OR (‘peer review’ AND proposal) OR (‘peer review’ AND funding) OR (‘peer review’ AND award) OR (‘peer review’ AND ‘reviewer bias’)
2. Grey literature: we searched the websites of major funding bodies and other academic bodies (e.g. learned societies) that we expected to have published relevant research ( Table 1).

Table 1.

Academic bodies considered in the review of literature.

Organisation	Country
National Institutes of Health (NIH)	USA
Canadian Institutes of Health Research	Canada
Health Research Board of Ireland	Ireland
Science Foundation of Ireland	Ireland
Netherlands Organisation of Health Research and Development (ZonMw)	Netherlands
Research Council of Norway	Norway
National Institute for Health Research (NIHR)	UK
Wellcome Trust	UK
National Health And Medical Research Council	Australia
Health Research Council of New Zealand	New Zealand
Medical Research Council	UK
Deutsche Forschungsgemeinschaft (DFG)	Germany
Lundbeck Foundation, Copenhagen	Denmark
Swedish Medical Research Council	Sweden
Swedish Society for Medicine	Sweden
European Commission	European Union

3. Searching the Cochrane publication list for systematic reviews on grant peer review. This did not identify any relevant reviews conducted since 2009.
4. An initial set of publications already known to the authors and sponsors of the work.
5. Snowballing: from the reference lists of publications identified following screening.

Some elements of our strategy were focused on evidence from the health sciences (particularly grey literature), but our wider searches, including Google Scholar, were not restricted by field of research.

Screening strategy

Publications were initially screened on title and abstract (where available). Studies needed to include empirical consideration of the effectiveness and/or burden of grant review processes. Studies were excluded on the basis of being:
- Purely descriptive, describing a specific peer review process.
- Focused on wider concerns around the funding process, with no (or only tangential) reference to the peer review process in particular.
- Focused on manuscript peer review rather than peer review for funding purposes.
- From 2008 or earlier.
- Reviews, with no additional synthesis or analysis, summarising work from before 2008, or studies already identified and included individually.

If studies were relevant, the full text was retrieved and an Excel spreadsheet was used to capture key information on the study and its conclusions. We identified 105 studies for inclusion. Table 2 summarises the range of studies identified. At the suggestion of our reviewers we added five additional references ( Bollen ; Bromham ; Doran ; Höylä ; Kulage ). We also added the term (‘fellowship’ AND ‘peer review’) to our final Google Scholar search and reviewed the top 100 results, adding two further references ( Ginther ; Kurokawa ).

Table 2.

Breakdown of articles included in the review.

Number of studies included		105
Type of document	Peer-reviewed publication	70
	Other articles from journals, not peer reviewed (e.g. comment pieces)	22
	Grey literature	8
	Working paper	1
	Book chapter	4
Format of document	Commentary	21
	Review	15
	Empirical study	69
Type of data used (empirical studies)	Quantitative	53
	Qualitative	12
	Mixed methods	4
Subject focus	Biomedical	13
	Wider health	45
	Wider research	47
Quality of studies (GRADE)	0 (Lower quality, e.g. commentary with no or limited evidence base)	14
	1 (Triple-downgraded randomized trials, downgraded observational studies, or case series/case reports)	35
	2 (Double-downgraded randomized trials or observational studies)	38
	3 (Downgraded randomized trials or upgraded observational studies)	17
	4 (Randomized Trials or double- upgraded observational studies)	1

Assessment of evidence quality

Quality of evidence was rated on a scale of 1–4 based on GRADE ( Guyatt ) [1]. We aggregated the overall strength of the evidence for each area of criticism based on the scale in Box 1.

Box 1.

Assumptions: Intuitive assumptions and widely shared beliefs prevail
Suggestive: There is insufficient evidence to draw a clear conclusion (but the evidence is at least suggestive)
Conflicting: There are conflicting results from well-conducted studies
Agreement: A number of well-conducted studies agree
Compelling: Systematic reviews are compelling

When synthesising our findings, we also drew on our previous review of the topic ( Ismail ).

Results

We summarise our findings in Table 3 with each discussed in detail below.

Table 3.

Summary of evidence from the literature regarding the effectiveness and burden of peer review.

Evaluation question	General critique	Particular criticism(s)	Is the criticism valid?	Strength of the evidence base
Is peer review an effective system for awarding grants?	Peer review does not fund the best science	It is anti-innovation	Yes	Suggestive
		It does not reward interdisciplinary work	Unclear	Suggestive
		It does not reward translational/ applied research	Unclear	Suggestive
		It is only a weak predictor of future performance	Yes	Agreement
	Peer review is unreliable	Ratings vary considerably between reviewers	Yes	Agreement
	Peer review is unreliable	It struggles to achieve an acceptable level of consistency	Unclear	Conflicting
	Peer review is unfair	It is gender-biased	Unclear	Conflicting
		It is age-biased	Unclear	Conflicting
		It is biased by cognitive particularism	Unclear	Conflicting
		It is open to cronyism	Yes	Agreement
	Peer review is not accountable	Review anonymity reduces transparency	N/A	N/A
	Peer review is not timely	It slows down the grant award process detrimentally	Unclear	Suggestive
	Peer review does not have the confidence of key stakeholders	It is not the preferred method of resource allocation	No	Agreement
What is the burden of peer review on the research system?	Peer review is an overly burdensome way of distributing research funding	Burden of peer review is increasing	Yes	Agreement
What is the burden of peer review on the research system?		Burden of the peer review system is high and falls primarily on the applicants	Yes	Agreement

Is peer review an effective system for awarding grants?

What constitutes the ‘best’ science will vary; however, it may include research that is innovative, interdisciplinary and applied. This section considers biases against any particular type of research and whether peer review is a good predictor of future success. Braben (2004) has suggested that supporting highly innovative research is important because it drives technological change and economic growth – an idea increasingly embraced by research funders. NIH has expressed concern at falling numbers of innovative or risky applications, suggesting ‘competitive pressures have pushed researchers to submit more conservative applications’ ( Kaplan, 2005; Scarpa, 2006). Low success rates may have exacerbated the situation, inducing ‘conservative, short-term thinking in applicants, reviewers, and funders’ ( Alberts ). On the other hand, a system is necessary to distinguish between innovative research and that grounded in ‘reckless speculation’ ( Hackett & Chubin, 2003). Although ‘innovative research’ and ‘high-risk research’ are often conflated, they are not necessarily synonymous; here we include both aspects of innovation. Innovative proposals may have less preceding work supporting them, and hence receive less praise from reviewers ( RIN, 2010; Spier, 2002). This lack of preceding work requires a less risk-averse mind-set from the reviewer ( Spier, 2002). Innovative proposals from young researchers may suffer a ‘double disadvantage’: lacking previous work, both because of their novelty and the researcher’s shorter track record. The challenge of supporting innovation is not new: in 1977, Thomas Kuhn wrote of an ‘essential tension’ between originality and tradition. These tensions were also included in a 2006 UK Treasury report which noted ‘the UK is still susceptible to a charge of risk aversion, as classic peer review criteria emphasise tests of scholarship over potential impact’ ( Treasury, 2006, p. 16).
Empirical evidence of this problem comes from recent work identifying lower scoring of novel proposals, even controlling for factors such as proposal quality; furthermore, this deficit could not be explained by the novel proposals being less feasible ( Boudreau ; Boudreau ). Risk aversion may also affect the preparation of applications: Fang & Casadevall (2009) suggested that falling success rates lead to conservatism because of the perceived increased risk associated with innovative proposals. Approaches to these problems include using reviewers with different cognitive biases for different schemes – specifically targeting specialists in translational or high-risk, innovative research ( Langfeldt, 2001). This approach has been used (though not evaluated) in NIH’s high-risk, high-reward Pioneer awards ( Gewin, 2012). Making ‘innovation’ an assessment criterion is another approach ( Lindner ; Luukkonen, 2012). Views on this are mixed, with some suggesting panels lack the expertise to assess innovation ( Costello, 2010), whilst others see the approach as effective ( Spiegel, 2010). Analysis of NIH application scores suggests that those for innovation are closely correlated with overall scores ( Lindner ). Other analysis ( Giraudeau ; Linton, 2016) suggests that disagreement among reviewers’ scores could be used to identify innovative research – high disagreement being taken as an indicator of work with high potential but also high risk. Similarly, Lee (2015) suggests combating conservatism by increasing the weight given to criteria – such as innovation – which are typically underweighted by reviewers. An approach that sidesteps the issue is to select researchers purely on their merit, regardless of the research they plan to conduct. Researchers then have freedom to pursue new and novel ideas and work flexibly, as opportunities arise (an approach used, for example, by the MacArthur Fellows programme [4]).
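The suggestion that disagreement among reviewers' scores could flag potentially innovative work can be illustrated with a minimal sketch. It assumes each proposal has a handful of numeric reviewer scores and takes the sample standard deviation as the measure of disagreement. The proposal names, scores and threshold are hypothetical, and the cited analyses may define disagreement differently.

```python
from statistics import mean, stdev

def flag_high_disagreement(scores_by_proposal, threshold):
    """Return (proposal, mean score, spread) for proposals whose reviewer
    scores have a sample standard deviation above `threshold`.

    The threshold is an illustrative assumption, not a value from the literature.
    """
    flagged = []
    for proposal, scores in scores_by_proposal.items():
        if len(scores) < 2:
            continue  # spread is undefined for a single review
        spread = stdev(scores)
        if spread > threshold:
            flagged.append((proposal, round(mean(scores), 2), round(spread, 2)))
    # Most contested proposals first
    return sorted(flagged, key=lambda row: row[2], reverse=True)

# Hypothetical scores on a 1-5 scale from four reviewers per proposal
example_scores = {
    "proposal_A": [4, 4, 5, 4],
    "proposal_B": [1, 5, 2, 5],
    "proposal_C": [3, 3, 2, 3],
}
flagged = flag_high_disagreement(example_scores, threshold=1.5)
```

A funder experimenting with this idea would presumably treat flagged proposals as candidates for further discussion rather than as automatic awards, since high spread can also reflect a weak or confusing application.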
Finally, Holliday & Robotin (2010) suggest that a Delphi process (a structured deliberative process) could be used to assess the merits of research ‘in situations where the available scientific evidence is limited and if review panels have widely divergent opinions’. The process was also found to be efficient and flexible from a time perspective. Critics argue interdisciplinary research is disadvantaged because (1) interdisciplinary proposal reviews may have to combine multiple distinct understandings of ‘quality’ – undermining the strength of the review ( Feller, 2006), and (2) it is more difficult to identify ‘peers’ to review such work. This latter challenge is exacerbated by the standard structure of peer review processes in which only a few reviewers examine each proposal in detail, or at the initial stages, reducing the breadth of reviewing expertise further ( Gluckman, 2012). Bromham analyse 18,476 submissions to the Australian Research Council’s Discovery Programme and show that increased interdisciplinarity leads to lower success rates. A study on the US National Science Foundation (NSF) revealed that, in interdisciplinary studies at least, peer review favours ‘research that is performed by academics, in the sciences, and that falls completely within the reviewers’ own domain of expertise’ ( Porter & Rossini, 1985, p. 37). With interdisciplinary teams it can be hard to isolate the contribution of each researcher, which can reduce the investigators’ chance of getting further funding by ‘weakening’ their track record ( Cooksey, 2006a). There has been limited further work in this area since 2009. Increasing the size of the review panel and broadening the range of expertise and disciplines present has been suggested as a way to address these problems. However, this increases burden and can only work if the role of the initial in-depth reviewer(s) is diminished ( Gluckman, 2012).
The Cooksey Report on health research funding in the UK noted that peer review ‘can in some instances inhibit programmes in translational and applied health research’ ( Cooksey, 2006b). The report suggested that one reason for this inhibition was that peer review prevented the iterative development of research projects where funder and researcher worked together. Cooksey also suggested that because applied researchers publish in specialist (i.e. lower-impact) journals, they received less credit for publications than basic researchers. Including research users and considering the likely impact of research as part of the funding process may address these concerns. In our 2009 review, we noted the Canadian Health Services Research Foundation’s pioneering use of ‘merit review panels’ to evaluate proposals, combining members from both academic and wider user/policy communities. This approach has now spread to other major funders, notably NIHR. Considering impact at the application stage – an approach criticised for disadvantaging innovative research – is likely to be beneficial when reviewing research which is closer to being applied. The evidence around peer review’s bias against applied research is not strong and has changed little since 2009. It is hard to know what criteria individual reviewers apply, as studies are hampered by methodological problems and funders are reluctant to release scores from peer review panels ( Feller, 2006). While several studies have examined how reviewers assess proposals in the humanities and social sciences ( Guetzkow ; Mansilla, 2006), work in the natural sciences is lacking. A study of NIH shows the success rate of clinical research proposals is marginally lower than that for laboratory research ( Kotchen ). This is in line with a recent CIHR study showing that health services and policy research applications were less successful than biomedical research applications ( Tamblyn ).
Work by Fang & Casadevall suggests peer review can ‘winnow’ out bad research proposals ( Fang & Casadevall, 2012). However, recent studies from several NIH Institutes and the Netherlands have challenged the idea that peer review can effectively select the best research. Studies comparing percentile application rankings with the research’s subsequent bibliometric performance found no association ( Danthi ; Danthi ; Doyle ; Fang ; Kaltman ; van den Besselaar & Sandström, 2015). Two further such studies found that grant review outcomes only weakly predict bibliometric performance ( Lauer ; Reinhart, 2009). Bibliometric analyses are by no means perfect measures of performance – only capturing a proxy of academic performance ( Belter, 2015). Nonetheless, the findings suggest that peer review assessment is, at best, a crude predictor of performance. Using an alternative metric, Galbraith et al. showed that peer reviewers’ opinions were only weakly predictive of the commercial success of early stage technologies in small businesses ( Galbraith ). Fang & Casadevall (2012) comment that, while reviewers can usually identify the top 20–30 per cent of grant applications, going further to identify the top 10 per cent is ‘impossible without a crystal ball or time machine’ (p.898).
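The comparisons described above, between review rankings and later bibliometric performance, amount to asking whether two orderings of the same grants agree. A minimal sketch of one common way to quantify this, Spearman's rank correlation, is given below. It is not the method of any particular cited study, and the percentile and citation figures are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: review percentile (lower is better) and later citations
review_percentile = [5, 10, 15, 20, 25]
citations = [40, 12, 55, 30, 33]
rho = spearman(review_percentile, citations)
```

If review scores predicted performance well, `rho` would be strongly negative here (better percentile, more citations); values near zero correspond to the "no association" findings reported above.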

Is peer review reliable?

If peer review is reliable, the judgements of different peer reviewers on the same proposal should be highly correlated. The grounds for the continuing use of peer review would be severely undermined if systematic unreliability were demonstrated. Funders have been criticised for not making sufficient efforts to measure and monitor the reliability of assessments across reviewers ( Fang & Casadevall, 2009). In this section, we consider two concerns surrounding peer review, namely the reliability of individual reviews and the overall consistency of decisionmaking, and how they might be addressed. Estimates of single-rater reliability [5] are not encouraging, but such studies have been hampered by the methodological difficulties of modelling the complex interactions between reviewers in multi-stage peer review processes. In particular, the work of Jayasinghe demonstrates a single-rater reliability correlation of just 0.21 for the humanities and social sciences, and an even lower correlation of 0.19 for the sciences. Similarly, Fogelholm et al. found an inter-rater reliability of around 0.23 for medical research ( Fogelholm ). In contrast, two studies have found a higher level of agreement between reviewers. The first study, which built in some of the complexities of the peer review process, found a dependent reliability [6] rating for individual peer reviewers of 0.80. The second study, on the review process for Marie Curie Actions (a major EU funding stream), measured inter-rater reliability based on the average deviation in scores between raters, and found a high level of agreement ( Pina ). Strikingly, the chance of improvements from initial ratings during panel discussion is virtually nil (e.g. from ‘no award’ or ‘possible award’ to ‘award’). This suggests that initial triage of applications may be preferable to re-rating rounds (Bornmann et al.). Increasing diversity of background and discipline of peer reviewers also reduces rating consistency.
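To make the reliability figures above more concrete, the following is a minimal sketch of how a single-rater reliability could be estimated as a one-way intraclass correlation, and how the Spearman-Brown formula projects it to the mean of several reviewers. It assumes a balanced design (every proposal scored by the same number of reviewers) and exchangeable reviewers. The cited studies use their own, often more elaborate, models, so this is an illustration rather than their method, and the function names are hypothetical.

```python
def icc_single_rater(scores):
    """One-way random-effects ICC for a single rater.

    `scores` is a list of rows, one per proposal, each holding k reviewer scores.
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)
    """
    n = len(scores)
    k = len(scores[0])
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    # Between-proposal and within-proposal mean squares
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    msw = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def spearman_brown(single_rater_reliability, n_raters):
    """Projected reliability of the mean score of `n_raters` reviewers."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)

# Taking the 0.21 single-rater figure quoted above, averaging three
# reviewers would still give a projected reliability below 0.5
projected = spearman_brown(0.21, 3)
```

Under these assumptions, a single-rater reliability of 0.21 projects to roughly 0.44 for the mean of three reviewers, which gives a sense of why low single-rater figures are described as discouraging.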
Lobb identified a low intra-class correlation coefficient (0.12) when comparing reviewers from a research, practice or policy background. They also noted that the level of agreement among experts from different disciplines was considerably lower than that among adjudicators of the same discipline, meaning that the presence of several practitioners from the same discipline area could have the potential to skew funding outcomes, depending on the wider makeup of the panel. This suggests that peer review processes may not work well for transdisciplinary teams integrating both academic and non-academic experts. Taking a different perspective, Reinhart found that although the global intra-class correlation coefficient was 0.41, there were considerable differences between fields, for example, biology (0.45) versus medicine (0.20) ( Reinhart, 2009). Existing studies offer mixed judgements on the reliability of grant peer review. Bornmann identified a threshold of 80–90 per cent as the expectation for agreement for this kind of decisionmaking ( Bornmann ). Two early studies ( Cole ; Hodgson, 1997) that we noted in 2009 found reliability rates across funding boards of 75 and 73 per cent respectively for funding decisions, which they felt was a satisfactory level of agreement. More recent evidence is mixed. The most recent study comparing the outcome of two independent panels found an agreement rate of 83 per cent ( Clarke ), whilst a previous study in 2012 was less favourable, showing agreement levels of 65–69 per cent ( Fogelholm ). Graves examined the variability of panel members’ individual scores and calculated how this translates into the variability of overall proposal ranking, and hence funding decisions. They found that such variability could affect the outcome for 29 per cent of the proposals considered, and that variability differed widely between panels.
Abdoul et al. have suggested that scoring variability might be partially explained by differences in reviewer behaviour, such as the time taken to do the assessment, assessment methods, and variation in the relative weighting of different criteria by different reviewers ( Abdoul ). Recent studies focusing more on the impact of panel meetings have shown very limited effects on improving consistency and reliability. Fogelholm suggested that mean reviewer scores prior to the panel meeting were similar to the panel consensus score. The authors concluded that using the mean reviewers’ scores was a practical and economical alternative. Similarly, although Pina identified both a subset of panels and subset of proposals with high levels of disagreement, where consensus meetings improved agreement, across the whole population they could not detect an overall improvement in agreement. In contrast, Martin found meeting discussions had an important effect in more than 13 per cent of applications in their analysis of a sample of standard (R01) NIH research grant applications. Two funders have experimented with, and evaluated, virtual peer review both by teleconference and through the use of Second Life, a virtual world. NIH estimated that using Second Life telepresence for peer review could cut panel costs by one third ( Bohannon, 2011). Pier compared videoconference and face-to-face panels. They set up one videoconference and three face-to-face panels modelled on NIH review procedures, concluding that scoring was similar between face-to-face and videoconference panels. Both the Bohannon and Pier studies of virtual panels noted that participants valued the social aspects of meeting in person and preferred the face-to-face arrangements. Gallo examined four years of peer review discussions, two years face-to-face and two years teleconferencing. They found minimal differences in merit score distribution, inter-rater reliability or reviewer demographics.
They also noted that panel discussion, of any type, affects the funding decision for around 10 per cent of applications relative to original scores. The NIH peer review self-study suggested some possible improvements to the peer review process to combat low reliability, focusing principally on better training for reviewers ( NIH, 2008). NIH suggested such training should focus on: (1) emphasising the strengths (rather than weaknesses) of research proposals; (2) focusing on the potential impact of research; (3) reviewing the merit of the proposal and not re-writing it; (4) recognising the problem of implicit bias in study sections; (5) using benchmark applications during panel meetings to provide review guidelines; and (6) pointing out potential bias towards lesser known applicant organisations. Recent work by Sattler has evaluated the effect of this type of brief training programme. The study found inter-rater reliability increased from 0.61 to 0.89, and the amount of time spent reviewing also increased, for both new and experienced reviewers. If inconsistency stems from discrepancies in review quality (which is by no means clear), it might be feasible to evaluate the quality of reviews, although this approach has its own challenges – for example, what is a ‘good’ review? If a review is not consistent with other reviews, does that intrinsically make it ‘bad’? It could be the outlier picking up on the true potential of an innovative application. However, this approach is used by many funders, as shown in a report by the European Science Foundation (2011) which found in a survey of European research funders that more than half (60 per cent) evaluate the quality of all reviews as standard practice using a range of criteria (e.g. completeness, level of substantiation, appropriateness, comprehensibility, timeliness and usefulness), and may return the review to a reviewer or reject the review.
Organisations felt that review quality was higher where these checks were made, but noted little difference in quality between cases where all reviews are evaluated versus just a sample. However, no data was available to assess these suggestions, and no empirical analysis had been carried out. Adding such an evaluation process clearly adds to the burden of the process. Having considered the evidence suggesting that consensus on peer review decisions is rare, what factors might underlie the observed discrepancies? To what extent is peer review open to the same allegations of bias that plague science more widely, particularly around gender, race, intellectual school or institutional affiliation? A recent study ( Day, 2015) has shown that low levels of passive bias as well as individual cases of significant active bias among reviewers can have significant impacts on the outcomes of a grant peer review process, and earlier work showed the likely presence of racial bias in NIH funding decisions ( Ginther ). In this section we consider the potential for bias in peer review across four main areas: gender, age, cronyism and cognitive particularism. Bias could occur at various places in the peer review process. While bias on the part of the peer reviewers themselves (such as sexism or racism) has received considerable attention in the literature, funding competitions can also be biased through eligibility and award selection criteria. Such criteria may be prejudiced against early career researchers or innovative research – although there is no strong evidence that this occurs. In addition, wider systemic biases may mean that the number of applications received is lower from particular groups. Blinding of applications provides a defence against the most obvious abuses by reviewers – rejecting proposals on the grounds of race, gender, institutional affiliation and so forth ( Lee ).
A study from South Korea by Lee demonstrated a significant bias in sighted proposal evaluation towards those from particular research departments, senior researchers, and those already academically recognised. This is reinforced by a review of studies by the NSF, which found only ‘a weak correlation’ between panel ratings of blinded short versions and unblinded full versions of the same applications ( Bhattacharjee, 2012). While some funding bodies now routinely attempt to anonymise proposals before passing them on to reviewers, there is some dispute as to whether anonymisation is truly possible. Some authors contend that some degree of identification is always possible from anonymised research proposals ( Bhattacharjee, 2012). The overall demographics of science, with increasing under-representation of women as seniority levels increase, point to particular challenges for women in advancing in science. However, the evidence on gender bias in peer review is inconclusive. Studies suggesting bias include an important study of the grant peer review system of the Swedish Medical Research Council, which strongly suggested that reviewers were unable to judge scientific merit independently of gender ( Wenneras & Wold, 1997). These findings were supported by a subsequent meta-analysis of 21 studies on this topic, which found that grant applications submitted by men were 7 per cent more likely to be approved than those submitted by women ( Bornmann ). [7] Furthermore, recent studies have also found evidence of gender bias ( Jang ; Kaatz ; Kaatz ; Tamblyn ; van der Lee & Ellemers, 2015; Volker & Steenbeek, 2015). For example, van der Lee & Ellemers (2015) reported a 4 per cent ‘loss’ of women during the grant review process for awards to early career scientists by the Netherlands Organization for Scientific Research (NWO).
According to a review of research on gender bias by Kaatz, women generally have lower rates of publication and lower success rates for high-status research awards than do men. On the other hand, a review of the gender bias literature by Ceci & Williams (2011) showed that the weight of evidence suggests that peer review is fair across gender, with all smaller-scale studies analysed, along with all but one of the large-scale studies, failing to replicate Wenneras & Wold’s findings. Even for the remaining large-scale study, the findings were reversed by a reanalysis. The lack of gender bias has been supported by several subsequent studies, in particular a careful, large-scale primary study and meta-analysis by Marsh et al. ( Marsh ; Mutz ; Reinhart, 2009; Turner ; Van Arensbergen ). Although review processes that partly rely on the previous publications or funding successes of the applicant may be biased against early career researchers, Jayasinghe ; Jayasinghe found that the age of the applicants did not directly impact upon grant success, a finding supported by Reinhart (2009). However, this finding is directly contradicted by a study comparing sighted and blinded reviews of research grant proposals in South Korea ( Lee ). A subsequent study, also based in South Korea ( Jang ), found that evaluation scores and selection success rates decline with age. Concerns about age bias are closely tied to concerns about bias against early career researchers, who may be disadvantaged through lacking preliminary results or a substantial portfolio of work. The challenge of providing adequate support for early career researchers is widely recognised ( Bazeley, 2003) and was raised in a 2008 NIH review which identified significant decreases in early career success rates which could not be accounted for by variations in application quality ( NIH, 2008).
Similar concerns were also noted by Spiegel (2010), who showed that the average age at which researchers won their first full NIH project grant awards (R01) had been steadily increasing. Since then, the NIH has introduced measures aimed at equalising success rates for new and established investigators for new (not renewal) applications.

Cronyism is a concern for many major funders, who have detailed conflict of interest processes in place to counter the presence or perception of such biases. However, Wenneras & Wold (1997) show that prior affiliation with a reviewer considerably increased a researcher’s chances of funding. Similarly, a large-scale study of applications to the National Science Foundation of Korea found that applications reviewed by previous or current affiliates were more likely to be successful (Jang). A review of NSF proposals reported by Bhattacharjee (2012) is harder to interpret: when full proposals and shorter, anonymised versions of the same proposals were compared, there were only weak correlations. Panelists and applicants suggested anonymisation made a difference, but the shorter length of proposals was also seen as important. Luukkonen (2012) notes that panel debate may fail to counter crude forms of cronyism since panels often cover a wide area of research, and each specific area is only represented by a few experts, so the other members may defer to the experts’ knowledge. Members of funding panels may also benefit directly from their membership. One study noted that panel members submit more applications, and have more grant awards (van den Besselaar, 2012). The challenge in this area is separating nepotism from other factors, such as good researchers who submit more applications being selected to join panels, or panel members having a better sense of what makes a good application.
The idea that reviewers and panel members will favour proposals in their own fields or that align with their ways of thinking has been termed ‘cognitive particularism’ (Travis & Collins, 1991). Fang & Casadevall (2009) suggest that ‘reviewer biases favour topics well understood and appreciated by the [funding panel]’ (p.930). Travis & Collins (1991) found that reviewers tend to favour proposals supporting their own school of thought, and argue that this is likely to have a much bigger impact on the direction of science than institutional bias or cronyism identified by other studies (Langfeldt, 2006; Wenneras & Wold, 1997). Research by Li (2015) suggests the same. Work by Wang & Sandström (2015) suggests that ‘cognitive distance’ may influence reviewer decisions in a more complicated way, with reviewers more likely to favour applications in areas they are either very familiar with, or completely unfamiliar with. Other studies find that reviewers are more critical of applications in areas of their own expertise (Boudreau; Gallo).

A number of studies suggest that applications in molecular biology are more likely to be successful than those in other fields of bioscience. Bornmann & Daniel (2006) found a slight statistical effect, and further studies reveal that peer-reviewed grant proposals in molecular biology tend to have a better chance of receiving grant funding than proposals in other bioscience fields (Kotchen; Taylor, 2001). There is also dispute about how to resolve this potential problem. Alberts suggests that such effects could be countered by broadening ‘the range of scientific problems judged by each group and include[ing] a diversity of fields on each panel’, suggesting that ‘senior scientists with a wide appreciation for different fields can play important roles by counteracting the tendency of specialists to overvalue work in their own field’ (p.5777).
However, Li (2015) advises caution, noting that though evaluators may be biased in favour of projects in their own area, they are also likely to be better able to assess the quality of those projects, and the benefits of this expertise may well outweigh any possible biases.

Is peer review timely?

In some cases, such as an emerging epidemic, the time taken by peer review could reduce the number of people benefiting from the research; such slowing of the research process could also reduce the economic viability of a new product (e.g. Agres, 2005; Cures, 2005; Daniels, 2004; Roy, 1985). The many stages of grant peer review can take from 9 to 18 months from submission to funding. It is less clear how often this time significantly hinders the progress of science. In the health sciences, research is one of many steps in developing new treatments and practices (Hanney). Research suggests that the time required for translation of research from initial idea to adopted practice is around 17 years, so peer review may be a relatively small contributor; however, any one translation pathway may have multiple stages of peer review (Morris).

Though criticism of the peer review process abounds, the limited empirical evidence available indicates that support for peer review amongst the academic community remains strong (Bornmann, 2011; Wooding & Grant, 2003). The dominance of peer review across funding systems internationally suggests it has the confidence of institutional stakeholders. A recent review of literature about the NIH peer review processes found a firm belief in the transparency and objectivity of peer review amongst grant reviewers (Miner, 2011). There is a striking disconnect between the institutional and community support for the peer review system and the empirical evidence of its effectiveness. Unfortunately, the scope of our review excluded the types of research that might explain this divergence. In contrast, beyond the classical model of research, an emerging body of literature suggests traditional academic peer review may not be appropriate for all types of research. A recent study on indigenous research showed the competitive nature of peer review was counterproductive and that peer review did not have the confidence of relevant stakeholders (Street).
Similar concerns have been expressed about the assessment of community engagement proposals ( Ahmed & Palermo, 2010).

What is the burden of peer review on the research system?

In a survey of 28 biomedical research funding organisations across 19 countries (Schroter), declined review requests, late reports and administrative burden were the most frequently mentioned challenges, and all organisations reported an increase in burden in the previous five years (although they reported that the quality of reviews had remained the same). A study by the Royal Society of New Zealand reported a similar increase in the difficulty of recruiting senior reviewers (Gluckman, 2012). The overall monetised cost of the peer review system, including application preparation, has been estimated to account for as much as 20–35 per cent of the allocated budget (Gluckman, 2012). Graves report that the monetised costs of the application system for NHMRC are $14,000 per grant, whilst extrapolating RCUK estimates (Research Councils UK, 2006) suggests that the costs of the application process are 10–17 per cent of the total cost of research. An evaluation of the CIHR Operating Open Grants Program (OOGP) found the application cost of OOGP grants to be Can$14,000 (Peckham). A detailed review of preparing grants for NIH for nursing research reports similarly high costs to institutions (Kulage). When providing congressional testimony, individual researchers have estimated that as much as 60 per cent of their time is devoted to seeking funding (Fang & Casadevall, 2009).

The bulk of the resources consumed by the peer review process go into the writing and reviewing of applications. RCUK work showed the distribution of monetised burden was 74 per cent in application production, 21 per cent in the reviewing process (including time of reviewers, panel membership and modifying proposals), and 5 per cent in Research Council costs and payments to reviewers (Research Councils UK, 2006).
More recent work by Graves used a small survey of NHMRC researchers to estimate that the burden fell even more heavily on the applicants, assigning a split of 85 per cent for application production, 9 per cent for reviewing and 5 per cent for administration. Barnett reinforced this conclusion with a larger survey of 285 applicants who had submitted 632 proposals to four health services research funding rounds from May 2012 to November 2013, at the Australian Centre for Health Services Innovation. A review by the New Zealand Royal Society made a similar estimate of the burden shouldered by the applicants, pegging it at 80 per cent (Gluckman, 2012).

In contrast, two studies of the Natural Sciences and Engineering Research Council (NSERC) of Canada peer review process came to strikingly different conclusions. Gordon & Poulin (2009) estimated the cost of the NSERC system, including application preparation, review and administration costs, at Can$44m. They suggest this money could alternatively provide all researchers in the field with an annual baseline grant of Can$30,000. However, Roorda (2009) takes issue with Gordon and Poulin’s assumptions, suggesting they have overestimated costs by a factor of 23. The correct answer appears to be in between: there is disagreement about how the costs should be allocated and neither side provides a justification of their estimates of the time spent on grant preparation (the key driver).

Herbert suggest that the burden on NHMRC applicants could be reduced by simplifying the application process (currently 80–120 page applications). Other examples of funding agencies reducing the length and complexity of applications include the NIH, which cut the length of its R01 applications [8] from 25 pages to 12 in 2009, although there were calls to make the application even shorter (Fang & Casadevall, 2009). Barnett examined the effect of reducing the complexity of the application.
Surprisingly, they found that reducing application complexity slightly increased preparation time. They suggest that this may be because researchers allocate a fixed fraction of their time to application preparation. Theoretical work by Geard & Noble (2010) using agent-based modelling found that applicants devote ‘excessive’ time to proposal preparation. Barnett examined four rounds of a funding scheme in Australia which significantly shortened the application (to 1,200 words). Qualitative feedback was positive, suggesting it took seven days to develop an application, but generalisability is limited. The level of effort devoted to application preparation is all the more striking given the finding by Herbert that increased effort did not translate into increased success rates.

A few qualitative studies have examined the burden of the system on particular groups of researchers and the wider implications for researchers’ quality of life. A survey of 215 NHMRC applicants concluded that the ‘impact of preparing grant proposals for a single annual deadline is stressful, time consuming and conflicts with family responsibilities’ (p.1), although it did not quantify the effects or time taken (Herbert). A study of early career investigators applying for funding at CIHR identified the application process as burdensome and noted the decrease in success rates for open operating grants from 30 per cent in 2005–2006 to 15 per cent in 2014–2015 (Association of Canadian Early Career Health Researchers, 2016).
The institutional costs of application preparation were examined by the US Government Accountability Office (GAO) in 2016, which concluded that pre-award requirements for applicants to develop and submit detailed documentation for grant proposals, and increased prescriptiveness of certain requirements, had increased universities’ workload and costs, but the study (GAO, 2016) did not quantify these increases.

Time invested by reviewers and panel members is consistently identified as the second-highest monetised cost of peer review, making up about 15 per cent of the burden. Two types of studies carried out in this area have both aimed at optimising the process, balancing the trade-off between burden and quality to achieve efficiency. The first study approach trialled simplified processes for grant review, in particular the use of a shortened application form and smaller review panels, to test how much time they saved and whether they affected funding decisions (Herbert). They found the simplified processes achieved agreement with the current award system of close to 75 per cent (which they suggested was the ‘acceptable’ threshold based on a review of previous surveys), at estimated savings of 33–78 per cent of review costs. The second study used statistical techniques to estimate the optimum number of reviewers (Snell, 2015), trading off improved reproducibility against additional reviewer burden. They found that five reviewers were optimal; similar work by Graves on a different funding scheme found 11 reviewers was the most effective number.

In addition to experimental changes, there are examples of funding agency policy changes that have been examined. The NSF changed its review procedures in 2012 to reduce burden by introducing triage on short preliminary applications with a 75 per cent cull rate, and by moving to annual rather than six-monthly applications. The Government Accountability Office has praised the system, which reduces administrative burden on programme officers.
However, because several changes happened simultaneously, it is not clear whether this is because of the triaging. It also resulted in reduced success rates, partly because of more applications (perhaps because they were easier to write) but also because of funding reductions (Mervis, 2016).

One of the drivers of the burden on funders is identifying appropriate reviewers for each proposal. Mervis (2014) reports on a radical experiment at NSF where applicants reviewed each other’s grants (each applicant completing seven reviews), consequently reducing this burden to zero. To guard against applicants marking their competitors down, they were rewarded for scores that aligned with the other reviewers. The pilot allowed the number of reviews per proposal to be increased from three or four to seven, and the reviews provided were more detailed. Because of the additional reviews, NSF was able to dispense with panel discussion, thus saving administrative costs.

Discussion

In this section we summarise our findings: firstly, on the availability of evidence, considering the scope and coverage of the existing literature; secondly, on what the evidence shows; and finally, on the implications for health research funders.

Availability of evidence

Questions about the effectiveness and burden of peer review can be addressed at two levels. At a high level, does peer review support valuable science? And at a lower level, can the design of peer review systems be improved to increase effectiveness and reduce burden? It is clear that the current system of funding has produced significant benefits for society, suggesting that peer review supports valuable science. However, whether peer review is demonstrably better than any other system is impossible to judge with certainty because of the lack of comparators: no funding agencies have made significant use of alternative systems. Moving to the lower level, considering comparisons between or research on peer review systems, there is only a very small number of robust, well-conducted studies. Much of the literature identified is anecdotal in nature and we found no systematic reviews, underlining the fragility of the evidence base. However, we did identify a series of robust, high-quality studies that have been carried out since our last review in 2009. Despite this new work it is still true that most studies examine the peer review process of one particular funder in one particular context, rather than looking across funders or contexts, and few go beyond process measures to judge effectiveness. This persistent lack of evidence about the allocation of the ‘inputs’ to research is all the more striking given the advances in understanding the outputs and outcomes of research through research impact assessment over the last decade.

Findings from the available evidence

The central problem when assessing peer review is the lack of an absolute standard or ‘ground truth’ to judge against. There will be uncertainty in all peer review decisions - it is, after all, predicting the future. And there is evidence suggesting it is not a particularly good predictor, at least for bibliometric performance. At present most funders do not capture, use, or even acknowledge this uncertainty, despite clear evidence of inconsistency in peer review ratings and mixed evidence on the reproducibility of panel decisions.

There is good evidence that peer review suffers from biases. The strongest evidence is of a bias against innovation, and although a range of improvements have been suggested, none have been robustly evaluated. There is some evidence that peer review is influenced by cognitive distance and suffers from cronyism, and suggestive evidence that there are age biases. Considerable work has been done on gender bias, with conflicting results, which illustrates the challenges of accounting for biases outside the scope of the peer review process, for example through eligibility or the culture of the wider scientific system.

Though the problem of burden is widely recognised, funders’ considerations often focus on their own and reviewers’ burden as these are more immediately visible (and costly) to them. However, it is clear that the burden largely falls on applicants (rather than reviewers or panel members). Falling success rates across many funders compound the burden on applicants. One way to address these challenges could be to reduce the complexity of the application process, with evidence suggesting similar decisions can be made with much shorter applications and less information. However, small decreases in application length do not seem to translate into reduced application preparation time, so such changes would need to be carefully evaluated.
Despite the plethora of comment pieces criticising the peer review system, there is no empirical evidence suggesting whether peer review has more or less support among key stakeholders than it did in 2009.

Potential improvements

This section outlines our reflections on ideas for improving peer review processes. We concentrate on ideas that augment or refine peer review, as those approaches were most comprehensively covered by our search approach. Other approaches that are more complete alternatives to peer review, for example peer-to-peer allocation, were beyond the scope of this review (Bollen).

We feel the uncertainty in peer review - clear in the inconsistency of ratings and weak predictive power in terms of future academic performance - should be acknowledged, captured and used to improve decision making and for analysis. Reviewers should be asked both for their rating of the proposal and a measure of their confidence in this rating - some smaller funders, such as the Villum and Velux Foundations in Denmark, are starting to implement such systems. Funders could also analyse levels of disagreement between reviewers, which may be an indicator of innovative research (Linton, 2016), or take a portfolio approach, selecting projects scoring highly across different criteria, including innovation (Lee, 2015).

A second approach is to acknowledge the difficulty of predicting the future and introduce an explicit element of randomness into the allocation system. This could be done to differing extents, from completely random allocation of funding to the use of a lottery system within set groups of applicants. Fang & Casadevall (2016) propose a two-stage system, in which the best applications are identified and then a smaller percentage are funded using a lottery. Avin (2015) proposes using two thresholds: above the higher threshold all applications are funded, below the lower threshold all applications are rejected, and applications between the two thresholds are funded at random, effectively blurring the funding line.
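The two-threshold idea attributed to Avin (2015) can be made concrete with a short sketch. This is only an illustration under stated assumptions, not a description of any funder's procedure: the function name `allocate_two_threshold`, the numeric score scale, the `n_awards` budget cap and the rule that the lottery fills whatever awards remain are all hypothetical choices introduced here.

```python
import random


def allocate_two_threshold(scores, lower, upper, n_awards, seed=None):
    """Illustrative two-threshold allocation (hypothetical helper).

    scores   : dict mapping application id -> panel score (higher is better)
    lower    : applications scoring below this are rejected outright
    upper    : applications scoring at or above this are funded outright
    n_awards : assumed total number of awards the budget allows
    seed     : optional seed so a draw can be reproduced and audited
    """
    rng = random.Random(seed)

    # Applications at or above the upper threshold are funded on score alone.
    funded_on_score = [a for a, s in scores.items() if s >= upper]
    # Applications between the thresholds form the lottery pool.
    pool = [a for a, s in scores.items() if lower <= s < upper]
    # Applications below the lower threshold are rejected.
    below = [a for a, s in scores.items() if s < lower]

    # Assumption: the lottery fills whatever awards remain in the budget.
    remaining = max(0, n_awards - len(funded_on_score))
    drawn = rng.sample(pool, min(remaining, len(pool)))

    not_drawn = [a for a in pool if a not in drawn]
    return {
        "funded_on_score": sorted(funded_on_score),
        "funded_by_lottery": sorted(drawn),
        "rejected": sorted(below + not_drawn),
    }


if __name__ == "__main__":
    example = {"A": 9.1, "B": 8.7, "C": 7.0, "D": 6.5, "E": 6.1, "F": 4.0}
    print(allocate_two_threshold(example, lower=6.0, upper=8.5, n_awards=4, seed=1))
```

In this sketch the panel only has to sort applications into three bands rather than rank every proposal around a single funding line, which is the sense in which the funding line is blurred. Recording the seed is one possible way to make a draw reproducible for later evaluation.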
A lottery approach should reduce biases in decision making since the selection from the fundable pool is random; however, applicant eligibility restrictions/selection for the lottery could reintroduce bias. Selecting into a fundable pool requires less fine-grained decisions, which addresses concerns about the reliability of peer review. The use of lottery systems is a promising but politically challenging idea. So far it has only been used in very limited cases, such as the Explorer Grants offered by the Health Research Council of New Zealand; the Seed Projects offered by Science for Technological Innovation, also in New Zealand; and the Experiment! Grants from the Volkswagen Foundation [9]. As such, we think using elements of lottery allocation merits further empirical research (Barnett, 2016). Complex approaches combining assessment and lottery, although theoretically attractive, suffer from the disadvantage of sacrificing understandability (Kurokawa).

Other approaches to address bias include blinding of reviewers (e.g. Lee), though the feasibility of this is debated (Bhattacharjee, 2012). More practically, funders have also used training approaches to address bias (e.g. CIHR) and to improve the quality of reviews (e.g. NIH, 2008), and there is limited evidence that the approach could reduce the discrepancies between reviewers (Sattler).

Applicant burden should be considered as a priority compared to reviewer and administrative burden as it represents around 75% of the system burden. This can be addressed by reducing the level of burden or by increasing the value unsuccessful applicants receive by applying. Changes to reduce burden need to be carefully evaluated as there is evidence that even significant reductions in application length/complexity may not reduce applicant burden as much as expected. An alternative approach is to make the process more valuable for the applicants.
Reviewer and panel feedback may be one way to do this (although one of the reviewers of this paper noted the concern that providing feedback may open a funder to appeals from rejected applicants).

Technology provides ways to reduce the time burden of the peer review process for panel members and funders - for example by eliminating travel - and does not appear to significantly affect the outcomes. However, face-to-face discussion of applications brings other side-benefits, including social interaction and network formation. Other research suggests these side-benefits may be important to the progress of science, and hence they may need to be supported in other ways if peer review is done remotely. Altering the format of research proposals to incorporate multi-media or video has been suggested as a way to improve information transmission and reduce burden, but the effects of doing so have not been tested (Doran).

It remains striking how little robust evidence is available about peer review as a method for grant allocation. Given the centrality of the peer review process in the current science funding system, there is a need for better evidence, not only on the overall effectiveness of peer review but also to help improve the design of peer review processes. We suggest three fruitful areas for investigation are the links between the peer review process and the wider context of science funding; the social processes of peer review and panel meetings. System changes (such as the overall amount of funding) affect the peer review process, and peer review changes affect the system, so both need to be considered together to understand the dynamic behaviour of the overall research process. Nearly all of the studies we identified considered aspects of the peer review system in isolation, for example tracking success rates or reviewer burden.
However, system changes such as decreased funding, or changes in researcher demographics, often happen alongside, and interact with, changes to the peer review system. Addressing these questions may require developing modelling and simulation approaches such as those in Avin (2015), Geard & Noble (2010) and Höylä. Even in the fairly barren landscape of evidence we explored, it was startling that we could find no studies examining the social processes that occur during panel discussions, a central part of the peer review process. Such studies will clearly be challenging and require the cooperation of funders working in concert, but we feel they are essential to understand how to optimise one of the fundamental processes of science.

At a more mundane level, funders should be more willing to experiment with, evaluate and publish results from evaluations of alternative approaches. Through our conversations with funders it appears that where analysis is carried out it is often not published, partly because of the extreme sensitivity around funding allocation procedures. Funders are not the only ones who need to take a more reflective approach: they will need the wider scientific community to support such investigations, and to acknowledge the lack of evidence about the primacy of the current system and the impossibility of achieving perfection.

Conclusions

Many criticisms of the peer review system reflect conflicts between the needs of stakeholders. Researchers look to peer review to uphold research standards and promote the ‘best’ science, while politicians and funders use it to provide accountability for spending ( Viner ). This tension requires peer review to both protect the identities of reviewers while appearing transparent to applicants; to be innovative yet assure quality; to be based on human judgement yet free of human biases ( Hackett & Chubin, 2003). We think that current dissatisfaction with the peer review process is amplified by falling success rates, so it is important to remember that the concerns around peer review are heavily influenced by funding policy and the size of research budgets. As a society, if we are to improve how we use our research funds, we need a better understanding of the peer review process. When making changes, funders should: build in before and after comparisons; strive to make data available for analysis; openly publish studies of their processes and work together on comparative analysis. We need to overcome the reluctance of funders and scientists to acknowledge the uncertainties intrinsic to allocating research funding, and encourage them to experiment with peer review and other allocation processes.

Data availability

All data underlying the results are available as part of the article and no additional source data are required.

Notes

1 GRADE is an internationally accepted system for the assessment of evidence quality. GRADE offers four levels of evidence quality: high, moderate, low, and very low. Randomised trials begin as high-quality evidence and observational studies as low-quality evidence, and studies may be downgraded as a result of limitations in study design or implementation, imprecision of estimates, variability in results, indirectness of evidence, or publication bias. Equally, quality may be upgraded based on a very large magnitude of effect or if all plausible biases would reduce an apparent effect (Guyatt).

4 As of 5 January 2017: https://www.macfound.org/programs/fellows/

5 Defined as ‘the correlation between two independent assessors of the same submissions across a large number of different submissions’ (Jayasinghe, p.280).

6 In a multi-stage review process, the assessor at each evaluation stage will know the score given to a particular research proposal at the previous stage. This particular study assessed the reliability of grant peer review processes by determining the proportion of those applications for which the dependent ratings on the same proposal did not change from the first to the second and third stage.

7 Bornmann are clear, however, that the reasons for this observed discrepancy are not known. This is important because aggregation effects over a range of fields of study may, as the authors acknowledge, create strong statistical effects implying gender bias. The authors also suggest that future improvements to the model will need to take into account the cohort of application, since the study described here covered publications produced over the period 1979–2004, and there have been significant changes to reduce gender bias in science and science funding over this period.

8 The Research Project Grant (R01) is the original and historically oldest grant mechanism used by NIH. The R01 provides support for health-related research and development based on the mission of the NIH. R01s can be investigator-initiated or can be solicited via a Request for Applications.

9 Web sites accessed on 13 February 2018: Explorer grants: http://www.hrc.govt.nz/funding-opportunities/researcher-initiated-proposals/explorer-grants; Seed projects: http://www.sftichallenge.govt.nz/research/seed-projects; Experiment! Grants: https://www.volkswagenstiftung.de/en/funding/our-funding-portfolio-at-a-glance/experiment.html

The authors have answered my previous queries. This will be a useful paper for those looking for evidence for funding peer review. I wouldn't say that the NSF approach "reduced the burden to zero" (page 11, second column) as the researchers still needed to spend time on their reviews. It was an interesting approach and could be more efficient. Plus it avoids researchers only "taking" from the system and not giving back.

Minor comments

Page 7, first column, space missing in "decisionmaking"
Page 9, first column, “decreases in early career” add “in”
Page 9, citations to Wenneras (first column) and Bornmann (second column) should have brackets around the year not the authors
Page 10, second column, “include NIH did cut” needs re-wording
Page 10, second column, last paragraph, “Australian” should be “Australia”
Page 11, last full paragraph in first column, open bracket with no closing bracket
Page 11, second column, last paragraph, Donald Braben suggested it be called “peer preview” rather than “peer review” because reviewers try to predict the future [1]
Page 13, second column, our group has recently published a qualitative study of funding peer review panels that included observations on the dynamics between panelists [2]

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
This is a nice and comprehensive review of the studies that assess the strengths and weaknesses of the peer review process. Although there are no new conclusions, it does reveal or highlight the three major issues that plague our mechanism to dole out research funding and the use of peer review. The first is fundamental - without a clear and agreed upon definition of what constitutes the "best" science, or the "best" outcomes, it will be impossible to know whether the grant adjudication process is meeting, or can ever meet, its objectives. Second, the paper highlights what is often overlooked, and this is the real cost in time of writing funding proposals. Although scientists often manufacture a narrative that writing grants is good for them (for example, making you catch up on the literature), the real costs to the system are rarely factored in. I think this paper did a nice job in highlighting this issue. Finally, I would have liked a greater discussion on what is an amazing disconnect. All the empirical evidence highlights the deficiencies of the process to allocate grant funding. It is clear that it is neither scientifically founded, nor evidence-based. Yet one of the only strongly supported aspects of the peer review process is that it has the strong support of the community. I find this fascinating.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Thank you for your review and comments. Reflecting on the disconnect you highlight in the final paragraph above we have added a sentence to the paper commenting on this and noting it as a topic that might warrant further investigation.

This is a very useful review of an area that is vitally important for science and that is constantly being examined by funding agencies. It included some papers that I had not read, but there were a few additional papers that I thought could be included (detailed below).
The results once again highlight the incredible lack of studies in this area. The paper ends with some sensible recommendations, including the need for funders to experiment more and make their data available to researchers.

I was surprised that some of the more innovative solutions to funding peer review were not included, specifically using prediction markets [1] and using the “wisdom of the crowd” [2].

Why was 2009 chosen as the time threshold? Is it because that was the year of the previous review?

This paper should be included in the discussion on interdisciplinary research: Interdisciplinary research has consistently lower funding success, Lindell Bromham, Russell Dinnage & Xia Hua. Nature 534, 684–687 doi:10.1038/nature18315 [3]. This paper agrees with the other two mentioned, as there are lower success rates for applications with more cross-disciplinary researchers.

One force against “cognitive particularism” is that strict conflict of interest rules from funding bodies can often rule out reviewers with the greatest knowledge, particularly in small fields or small countries. This study may touch on this issue: Abdoul H et al, Non-Financial Conflicts of Interest in Academic Grant Evaluation: A Qualitative Study of Multiple Stakeholders in France, PLoS ONE, 7/4: e35247. [4]

In terms of using technology in the review process, some researchers have suggested that videos may produce more reliable peer reviewer ratings and take less time to prepare: Doran MR, Lott WB, Doran SE. Trends Biochem Sci. 2014 Apr;39(4):151-3. doi: 10.1016/j.tibs.2014.01.004. Multimedia: a necessary step in the evolution of research funding applications [5].

Minor comments
· Introduction, 1st paragraph. As an Australian researcher I would argue that the ERA has not really measured research quality, rather it has simply measured research output. Maybe you could say “Funders have attempted to gather evidence…”
· The “>95%” figure in the abstract feels about right, but is there a reference for this?
· For the Google search terms, “fellowship” could also have been added, so “Fellowship OR funding”.
· The link to this paper did not work: The Novelty Paradox & Bias for Normal Science: Evidence from Randomized Medical Grant Proposal Evaluations
· Page 6, “when reviewing research closer to application”, I didn’t understand this.
· Page 7, “only affects the funding decision for around 10 per cent of applications relative to original scores” but that could still be an important percentage, particularly if it’s those near the funding line
· Page 9, “found that panel assessments of full proposals and shorter anonymised versions of the same proposals showed weak correlations” do you need to add, “implying that knowledge of the applicants influences the score”. Although as well as a change in blinding there was also a change in the size of the application, so it may be hard to conclude anything about cronyism here.
· Footnote 9, the NZ Health Research Council has been using random allocation for this scheme since at least April 2015
· I agree that giving more feedback would improve the value of the process (page 12), but our experience with the NHMRC is that this also opens them up to appeals which can take a lot of time for their staff. This doesn’t mean that we shouldn’t try, and giving frank feedback was a feature of a funding scheme we designed (already cited paper “Streamlined research funding using short proposals and accelerated peer review: an observational study”).
· This paper may be of interest: Scientometrics, July 2016, Volume 108, Issue 1, pp 263–288, The consequences of competition: simulating the effects of research grant allocation strategies. [6]
· This parody of a grant application may be useful for the section on peer reviewers being biased against innovative proposals: doi:10.1097/ede.0000000000000453 “John Snow’s Grant Application” [7]
· This paper examined the costs of applying for NIH funding and could be included in the section on the costs to applicants: Nursing Outlook, Volume 63, Issue 6, November–December 2015, Pages 639-649, Time and costs of preparing and submitting an NIH grant application at a school of nursing, https://doi.org/10.1016/j.outlook.2015.09.003 [8]

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Thank you for your review and helpful comments. We have made some revisions to the paper in response to these suggestions. In particular, we have added most of the references you suggest to the paper and made some clarifications where required. We have also updated methods with the reason for the 2009 cut off and run searches using the term ‘fellowship’ as suggested, which added two additional papers to the review. Responses to main specific points raised are as follows:
· Prediction markets and “wisdom of the crowds”: These approaches are complete alternatives to peer review, rather than refinements, and hence are beyond the scope of the review. The wisdom of the crowds approach is for ex post evaluation rather than ex ante – it simulated REF2014 assessments.
· 2009 time threshold: Yes, this is because it was the date of the previous review.
· Paper by Bromham et al: Added.
· Conflict of interest rules: Interesting point, but we don’t have any evidence which would enable us to comment on this in detail. The study by Abdoul et al you note mainly concerns the awareness that reviewers have of conflicts of interests rather than the effects of these conflicts in terms of excluding knowledgeable reviewers.
· Use of videos: Added a note that this approach has been suggested but not yet evaluated.
· Minor comments also all addressed as appropriate within the revised paper.
