Dynamic methods for ongoing assessment of site-level risk in risk-based monitoring of clinical trials: A scoping review.

William J Cragg¹,², Caroline Hurley³, Victoria Yorke-Edwards¹, Sally P Stenning¹.

Abstract

BACKGROUND/AIMS: It is increasingly recognised that reliance on frequent site visits for monitoring clinical trials is inefficient. Regulators and trialists have recently encouraged more risk-based monitoring. Risk assessment should take place before a trial begins to define the overarching monitoring strategy. It can also be done on an ongoing basis, to target sites for monitoring activity. Various methods have been proposed for such prioritisation, often using terms like 'central statistical monitoring', 'triggered monitoring' or, as in the International Conference on Harmonization Good Clinical Practice guidance, 'targeted on-site monitoring'. We conducted a scoping review to identify such methods, to establish if any were supported by adequate evidence to allow wider implementation, and to guide future developments in this field of research.
METHODS: We used seven publication databases, two sets of methodological conference abstracts and an Internet search engine to identify methods for using centrally held trial data to assess site conduct during a trial. We included only reports in English, and excluded reports published before 1996 or not directly relevant to our research question. We used reference and citation searches to find additional relevant reports. We extracted data using a predefined template. We contacted authors to request additional information about included reports.
RESULTS: We included 30 reports in our final dataset, of which 21 were peer-reviewed publications. In all, 20 reports described central statistical monitoring methods (of which 7 focussed on detection of fraud or misconduct) and 9 described triggered monitoring methods; 21 reports included some assessment of their methods' effectiveness, typically exploring the methods' characteristics using real trial data without known integrity issues. Of the 21 with some effectiveness assessment, most contained limited information about whether or not concerns identified through central monitoring constituted meaningful problems. Several reports demonstrated good classification ability based on more than one classification statistic, but never without caveats of unclear reporting or other classification statistics being low or unavailable. Some reports commented on cost savings from reduced on-site monitoring, but none gave detailed costings for the development and maintenance of central monitoring methods themselves.
CONCLUSION: Our review identified various proposed methods, some of which could be combined within the same trial. The apparent emphasis on fraud detection may not be proportionate in all trial settings. Despite some promising evidence and some self-justifying benefits for data cleaning activity, many proposed methods have limitations that may currently prevent their routine use for targeting trial monitoring activity. The implementation costs, or uncertainty about these, may also be a barrier. We make recommendations for how the evidence-base supporting these methods could be improved.

Keywords: Good Clinical Practice; Trial monitoring; central statistical monitoring; data fabrication; research misconduct; risk-based monitoring; triggered monitoring

Year: 2021; PMID: 33611927; PMCID: PMC8010889; DOI: 10.1177/1740774520976561

Source DB: PubMed; Journal: Clin Trials; ISSN: 1740-7745; Impact factor: 2.486

Introduction

Monitoring, a major component of assuring the quality of clinical trials, has traditionally relied on frequent on-site monitoring visits,[1] particularly to facilitate sometimes extensive source data verification (SDV).[2] However, it is increasingly recognised that this model may be inefficient and unnecessary in many cases,[3,4] with trialists questioning the value of 100% SDV.[5-7] In recent years, regulators[8-10] and trialists[1,11] have proposed a risk-based approach to monitoring, whereby monitoring methods, including the frequency and nature of on-site visits, vary across trials depending on the risks specific to each one. The support of regulators is encouraging, indicating that risk-based methods might be implemented even in clinical trials of investigational medicinal products, that is, those historically subject to particular regulatory control and claimed to suffer under a ‘regulatory burden’.[12,13]

Risk-based monitoring methods can be applied at different stages of a trial. Pre-trial risk assessments can help define the overarching strategies appropriate to the trial’s risks. In some models,[14,15] this is predominantly a one-off assessment during trial set-up. However, it is also possible to modify the monitoring strategy, or incorporate flexibility, based on emerging risks during the course of the trial.[16]

Risk-based monitoring is often associated with fewer on-site visits than ‘traditional’ monitoring.[17] Although effective central monitoring methods alone could, in some respects, provide adequate trial monitoring in place of visits, on-site visits offer particular benefits over central monitoring. These include, for example, the ability to access site-held source data (such as patients’ medical notes, although some have suggested these might be accessed remotely instead),[18-20] conduct in-person facility review[21] or assess processes such as informed consent.[22] On-site visits may also be necessary to investigate potential fraudulent activity.

In a risk-based monitoring framework, visits to sites may not be routine, but can be based on assessed risk; we therefore need methods to assess site-level risk on an ongoing basis. We can interpret these methods as assessing the risk of not going to site now. If the risk seems too high, a visit – or some other corrective action – is triggered. Methods of this kind have been referred to using various terms, including ‘triggered monitoring’[16] or, as in ICH Good Clinical Practice guidance, ‘targeted monitoring’,[23] and may employ data-driven approaches from methods known collectively as ‘central statistical monitoring’,[24] or more subjective assessments.[16,25,26]

A recent systematic review has established the breadth of tools available to assess overall trial risk (and to use this assessment to define the monitoring strategy) in the set-up stage,[27] but so far there has been no such exercise for methods to assess ongoing site-level risk once a trial has started. We conducted a scoping review[28] to identify and characterise available methods. Our aims were (a) for trialists, to establish if any published methods were supported by adequate evidence to support implementation in routine practice and (b) for researchers in this area, to consolidate the existing evidence and point towards future developments in this growing field.

Methods

We conducted a scoping review to identify methods for using centrally held clinical trial data to assess site-level risk of deviations from Good Clinical Practice or the trial protocol, or research misconduct, and thereby to target sites for further monitoring activity. We chose scoping review methodology as we anticipated finding a variety of results, and we wanted to characterise the extent, range and nature of research activity.[29] There is no published protocol for this scoping review.

Eligibility criteria

We defined our eligibility criteria before beginning any searches, with minor refinements (mainly to the exclusion criteria) after search strategy piloting. We included original reports (a) describing methods for using centrally held data (i.e. at the trial coordinating centre) to assess, in ongoing trials, site-level risk of protocol or Good Clinical Practice deviation, risk of data fabrication or research misconduct, or to target sites in some other way for corrective action based on assessed risk (regardless of whether the corrective action involved an on-site monitoring visit or not); (b) with methods described in enough detail that we considered them – subjectively – reproducible; (c) either published in peer-reviewed journals or available as grey literature; (d) about clinical trials, not limited to trials of Investigational Medicinal Products; and (e) in English. We excluded reports (a) published before 1996 (the year of the first version of the International Conference on Harmonization Good Clinical Practice Guidance, E6[R1])[30]; (b) about quality assurance only in the context of intervention fidelity[31] or ‘rater differences’[32] for subjective trial outcome measures; (c) about ‘data monitoring’ in general, for example, data monitoring committees, or ‘monitoring’ in any sense other than the Good Clinical Practice sense, for example, clinical monitoring; (d) focusing only on trial recruitment; (e) about more efficient alternatives to standard on-site activity, for example, remote SDV; and (f) about site selection during trial set-up.

Information sources and search strategies

Database searches

We designed search strategies for the following databases: (a) PubMed, (b) Embase (Ovid), (c) Medline (Ovid), (d) Web of Science (Clarivate Analytics), (e) CINAHL, (f) Cochrane Central and (g) Scopus. Full database searches took place on 23 October 2017 (run and extracted by W.J.C.). The search strategy for Medline is given in the Supplementary Information. We developed our search strategy following review of systematic reviews in this area[1,33] to identify relevant search terms. The final search strategy combined searches around two concepts: clinical trials (using terms based on those used in a previous systematic review of monitoring methods)[33] and targeted or risk-based clinical trial monitoring. No database filters were applied. Both reviewers (W.J.C. and C.H.) imported results into reference management software and used in-built tools to remove duplicate entries. Both reviewers carried out initial title and abstract screening, producing an initial shortlist of potential papers. We reviewed and discussed these, using full-text reports where possible, to agree on a final list of relevant reports. Throughout the process, S.P.S. acted as third reviewer where required.

In order to ensure that our results were current, this element of the search strategy was repeated on 28 August 2018. W.J.C. ran the searches and conducted the title and abstract screening. A shortlist of potentially relevant reports was shared with S.P.S. and C.H.; S.P.S. and W.J.C. agreed on a final list of additional relevant reports from this repeated search.

Conference abstracts

We hand-searched for relevant conference abstracts from the first four International Clinical Trials Methodology Conferences (occurring between 2011 and 2017) and all annual meetings of the Society for Clinical Trials since 1996 (initial searches completed on 8 December 2017). Keywords used for the conference abstract searches, based on the key database search strategy terms, were ‘monitor’, ‘supervision’, ‘oversight’, ‘risk’, ‘performance’, ‘metric’, ‘quality’, ‘fraud’, ‘fabrication’ and ‘error’. Both W.J.C. and C.H. performed the abstract searches. This produced an initial shortlist of potentially relevant abstracts. A final list was agreed upon through discussion, with S.P.S. acting as third reviewer where required.

Internet searches

We conducted structured searches using the Google Internet search engine (searches carried out during 15–19 December 2017). Google searches were performed without limitations or use of quotes. Search terms were based on the main database search: ‘Risk based monitoring’, ‘Risk adapted monitoring’, ‘Central monitoring’, ‘Central statistical monitoring’, ‘Triggered monitoring’, ‘Targeted monitoring’, ‘Performance metric’, ‘Site metric’, ‘Key risk indicator’, ‘Site performance’, ‘Centre performance’, ‘Detect fraud’ and ‘Detect fabrication’. We reviewed the results on the first 20 pages, or fewer if there were no relevant results on any three consecutive pages before this. W.J.C. and C.H. conducted the searches. Any potential additions to the included list of reports were discussed and agreed upon, with S.P.S. acting as third reviewer where required.
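
The stopping rule for reviewing search result pages can be expressed as a short sketch. It reflects one reading of the rule (stop as soon as three consecutive pages contain no relevant results, otherwise stop at page 20); the function name, parameters and input format are assumptions made for illustration and are not taken from the review.

```python
def pages_reviewed(relevant_per_page, max_pages=20, empty_run_limit=3):
    """Return how many result pages would be reviewed under the stopping rule.

    relevant_per_page: number of relevant results found on each page, in order.
    max_pages: hard cap on pages reviewed (20 in the text).
    empty_run_limit: consecutive pages without relevant results that end the review.
    """
    empty_run = 0
    reviewed = 0
    for count in relevant_per_page[:max_pages]:
        reviewed += 1
        # Extend the run of empty pages, or reset it when a relevant result appears.
        empty_run = empty_run + 1 if count == 0 else 0
        if empty_run == empty_run_limit:
            break
    return reviewed


# Hypothetical search: pages 4-6 contain nothing relevant, so review stops at page 6.
example_pages = pages_reviewed([2, 0, 1, 0, 0, 0, 3])
```

With relevant results on every page, the same function stops at the 20-page cap.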

References, citations and author contact

To identify other relevant reports, we reviewed references (manually) and citations (using Web of Science) of all papers included or considered for inclusion in the final results, and of review articles relevant to the topic. Whenever required, we contacted report authors to help ascertain if given reports should be included, and to ask about the availability of full-text articles.

Data collection

We extracted data from full journal articles, where available. We recorded data into an Excel-based tool. W.J.C. carried out the final data collection used for this report, with S.P.S. double-checking all data for inclusion; consensus was reached on any areas of disagreement. Article authors were contacted (two attempts maximum) for missing descriptive data and further clarifications. Our data collection template was designed and agreed prior to any data collection, with minor refinement after a first review of all relevant papers (a list of data collection variables is available as Supplementary Information). We collected descriptive data about each of the included reports, including any information on cost implications of the proposed methods.

When designing this study, although we predicted we would find a range of methods, we agreed that most of them would in essence address a classification problem, that is, methods to assign sites a status as ‘concerning’ or ‘not-concerning’, with a ‘true’ deviation status – that is, confirmed existence of meaningful problems – that could be uncovered by further review. The ‘gold standard’ reference test required to assess true status might be study-specific, but could be on-site monitoring or, if the true status was created through simulation, prior knowledge. We considered a key measure of the reported methods’ effectiveness to be a demonstrated ability, ideally in a real-life setting, not only to detect ‘true’ sites of concern, but also to show with confidence that sites apparently not of concern are performing well. We therefore aimed to summarise the available information on classification, that is, any or all of specificity, sensitivity, positive and negative predictive value. We gathered the best reported classification statistics for each method, or, if these were not reported, used available statistics to calculate them. These calculations were verified by an independent statistician at the Medical Research Council Clinical Trials Unit at University College London.

We did not formally assess the quality of the studies. However, review of the QUADAS-2 tool for quality assessment in diagnostic accuracy studies[34] informed development of our data collection template.

Synthesis of results

The results are summarised descriptively rather than combined, as it was clear through preliminary review of the relevant papers that we would have a variety of study types.

Results

Figure 1 gives a PRISMA flow diagram[35] showing the different stages of the review. From the various data sources, we ultimately included 30 reports in our final dataset. Twenty-one of these are peer-reviewed publications. The results are characterised in Table 1 and listed in full in Table 2. Figure 2 shows reports by year of publication.

Figure 1.

PRISMA flow diagram.

[a] Reasons: no relevant methods presented (n = 28); no novel methods presented (e.g. review article; n = 28); method to measure variation between trial sites but no ‘flagging’ of sites of concern (n = 25); abstract only and not enough detail to confirm relevance (n = 10); duplicate or abstract where full paper also available (n = 8); grey literature not considered to present reproducible methods (n = 5); not about ‘monitoring’ according to ICH Good Clinical Practice definition (n = 5); trial-level assessment only, not site-level (n = 4); focus on consistency of outcome assessment only (n = 4); method from observational study only, not clinical trial (n = 1).

Table 1.

General characteristics of included studies.

Characteristic	N (total = 30)	%
Publication year
1996–2000	0	0
2001–2005	2	7
2006–2010	2	7
2011–2015	13	43
2016–2018	13	43
Type of source
Peer-reviewed paper	21	70
Conference abstract or poster	8	27
Thesis	1	3
Disease setting of trial involved
Cardiovascular disease	4	13
Emergency medicine	1	3
Haematology	1	3
Infectious diseases	1	3
Mental health	3	10
Neurology	1	3
Oncology	3	10
Ophthalmology	1	3
Renal disease	1	3
Respiratory disease	1	3
Unknown or no specific trial involved	13	43
Geographical setting of trials involved
Brazil	1	3
International	7	23
Japan	1	3
North America	4	13
UK	2	7
Unknown or no specific trial involved	15	50
Use of Investigational Medicinal Product (IMP) in involved trials
Involves IMP	14	47
No IMP	1	3
Unknown or no specific trial involved	15	50
Phase of trials involved
Phase I	0	0
Phase II	1	3
Phases II and III	1	3
Phase III	9	30
Unknown or no specific trial involved	19	63
Status of investigational medicinal product used[a]
Unlicensed	0	0
Licensed, used outside of its licensed indication	5	17
Licensed, used within its licensed indication	4	13
Unknown or no specific trial involved	22	73
Focus of work[a]
Central statistical monitoring, focus on fraud or misconduct	7	23
Central statistical monitoring, general	13	43
Triggered monitoring	9	30
Other method(s) for highlighting sites at risk	2	7
Scope of work
Description or development of method	9	30
Some assessment of methods’ effectiveness	21	70

[a] Categories not mutually exclusive.

Table 2.

Full listing of all included reports.

Author(s)	Type of source	Focus of work	Scope of work
Agrafiotis et al.[36]	Peer-reviewed paper	Triggered monitoring	Some assessment of methods’ effectiveness
Almukhtar and Glassman[37]	Conference abstract/poster	Central statistical monitoring, general	Description or development of method
Atanu et al.[38]	Peer-reviewed paper	Central statistical monitoring, general	Description or development of method
Bailey et al.[39]	Conference abstract/poster	Triggered monitoring	Description or development of method
Bengtsson[40]	Thesis	Central statistical monitoring, general	Some assessment of methods’ effectiveness
Biglan et al.[41]	Conference abstract/poster	Triggered monitoring	Some assessment of methods’ effectiveness
Desmet et al.[42]	Peer-reviewed paper	Central statistical monitoring, general	Some assessment of methods’ effectiveness
Desmet et al.[43]	Peer-reviewed paper	Central statistical monitoring, general	Some assessment of methods’ effectiveness
Diani et al.[44]	Peer-reviewed paper	Triggered monitoring	Some assessment of methods’ effectiveness
Djali et al.[45]	Peer-reviewed paper	Other method(s) for highlighting sites at risk (combines site metric scores directly to flag sites of concern)	Some assessment of methods’ effectiveness
Dress et al.[46]	Conference abstract/poster	Triggered monitoring	Description or development of method
Edwards et al.[47]	Peer-reviewed paper	Central statistical monitoring with triggered monitoring	Some assessment of methods’ effectiveness
Kirkwood et al.[24]	Peer-reviewed paper	Central statistical monitoring, general	Some assessment of methods’ effectiveness
Knepper et al.[26]	Peer-reviewed paper	Central statistical monitoring, focus on fraud or misconduct	Some assessment of methods’ effectiveness
Knott et al.[48]	Conference abstract/poster	Central statistical monitoring, general	Some assessment of methods’ effectiveness
Kodama et al.[49]	Conference abstract/poster	Central statistical monitoring, focus on fraud or misconduct	Some assessment of methods’ effectiveness
Lindblad et al.[50]	Peer-reviewed paper	Central statistical monitoring, general	Some assessment of methods’ effectiveness
O’Kelly[51]	Peer-reviewed paper	Central statistical monitoring, focus on fraud or misconduct	Some assessment of methods’ effectiveness
Pogue et al.[52]	Peer-reviewed paper	Central statistical monitoring, focus on fraud or misconduct	Some assessment of methods’ effectiveness
Smith and Seltzer[53]	Peer-reviewed paper	Other method(s) for highlighting sites at risk (use of “statistical process control methodology” to combine per-site risk indicator scores)	Description or development of method
Stenning et al.[16]	Peer-reviewed paper	Triggered monitoring	Some assessment of methods’ effectiveness
Taylor et al.[54]	Peer-reviewed paper	Central statistical monitoring, focus on fraud or misconduct	Some assessment of methods’ effectiveness
Timmermans et al.[55]	Peer-reviewed paper	Central statistical monitoring, general	Some assessment of methods’ effectiveness
Tudur Smith et al.[25]	Peer-reviewed paper	Triggered monitoring	Description or development of method
Valdes-Marquez et al.[56]	Conference abstract/poster	Central statistical monitoring, general	Description or development of method
Valdes-Marquez et al.[57]	Conference abstract/poster	Central statistical monitoring, general	Description or development of method
Van den Bor et al.[58]	Peer-reviewed paper	Central statistical monitoring, focus on fraud or misconduct	Some assessment of methods’ effectiveness
Whitham, 2018[59]	Peer-reviewed paper	Triggered monitoring	Description or development of method
Wu and Carlsson[60]	Peer-reviewed paper	Central statistical monitoring, focus on fraud or misconduct	Some assessment of methods’ effectiveness
Zink et al.[61]	Peer-reviewed paper	Central statistical monitoring, general	Some assessment of methods’ effectiveness

Figure 2.

Publications by year and type.

Where information on trial intervention was available, methods had most often been used in Phase III trials of investigational medicinal products. The investigational medicinal product risk category,[62] when known, was either ‘licensed and used within its licensed indication’, or ‘licensed and used outside its licensed indication’ (i.e. we found no reports involving trials of unlicensed investigational medicinal products).

We classified 20/30 of our results as central statistical monitoring methods, of which 7 focussed on detection of investigator fraud or research misconduct. We classified 9, including 1 of the 20 that used central statistical monitoring, as ‘triggered monitoring’, that is, review of each trial site against pre-set thresholds in key performance metrics, usually without any statistical testing. A final two did not fit into either of these categories; these involved using measured site metrics to directly compare sites against one another.[45,53]

A total of 21/30 reports included some assessment of the effectiveness of the methods; these are summarised in Table 3. The most common experimental designs were exploring the methods’ characteristics using real trial data with no known integrity issues (n = 9), and simulating data integrity problems at sites within real trial datasets and then using the method to try to identify the problem sites (n = 6).
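
The ‘triggered monitoring’ approach described above, in which each site is reviewed against pre-set thresholds without statistical testing, can be illustrated with a minimal sketch. The metric names, threshold values, site data and the `min_triggers` parameter are all hypothetical and chosen only for illustration; the included reports each define their own triggers and rules.

```python
# Hypothetical pre-set thresholds: a trigger "fires" when a site's metric exceeds its limit.
THRESHOLDS = {
    "data_query_rate": 0.20,    # queries raised per data item entered
    "overdue_forms_pct": 10.0,  # percentage of case report forms overdue
    "protocol_deviations": 3,   # number of recorded protocol deviations
}


def fired_triggers(site_metrics, thresholds=THRESHOLDS):
    """Return the names of the metrics on which a site exceeds its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if site_metrics.get(name, 0) > limit
    )


def sites_to_review(all_sites, min_triggers=1, thresholds=THRESHOLDS):
    """Map each flagged site to its fired triggers (flagged = at least min_triggers fired)."""
    flagged = {}
    for site, metrics in all_sites.items():
        fired = fired_triggers(metrics, thresholds)
        if len(fired) >= min_triggers:
            flagged[site] = fired
    return flagged


# Hypothetical centrally held metrics for three sites.
sites = {
    "Site A": {"data_query_rate": 0.05, "overdue_forms_pct": 2.0, "protocol_deviations": 0},
    "Site B": {"data_query_rate": 0.31, "overdue_forms_pct": 14.5, "protocol_deviations": 1},
    "Site C": {"data_query_rate": 0.10, "overdue_forms_pct": 4.0, "protocol_deviations": 5},
}
flagged = sites_to_review(sites)
```

In this made-up example Site B fires two triggers and Site C one, so both would be candidates for a visit or other corrective action, while Site A would not.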

Table 3.

Types of assessments and evidence presented by reports that included some assessments of their methods’ effectiveness.

Author(s)	Case studies	Illustration of method(s) on data with no known issues	Assessment of methods’ ability to identify simulated problem sites	Assessment of methods’ ability to identify known problems in real trial data	Methods used in ongoing trial, results of on-site monitoring reported	Methods used in ongoing trial, effects reported on trial in general (e.g. in terms of cost or data quality)	Prospectively designed, controlled study to assess methods’ ability to target on-site monitoring visits to most problematic sites
Agrafiotis et al.[36]					X	X
Bengtsson[40]		X
Biglan et al.[41]					X	X
Desmet et al.[42]	X	X	X
Desmet et al.[43]		X	X
Diani et al.[44]						X
Djali et al.[45]	X
Edwards et al.[47]	X
Kirkwood et al.[24]		X	X
Knepper et al.[26]			X
Knott et al.[48]					X
Kodama et al.[49]		X
Lindblad et al.[50]				X
O’Kelly[51]			X
Pogue et al.[52]		X		X
Stenning et al.[16]							X
Taylor et al.[54]		X
Timmermans et al.[55]		X
Van den Bor et al.[58]				X
Wu and Carlsson[60]			X	X
Zink et al.[61]		X
Total	3	9	6	4	3	3	1

Of the 21 reports, 9 had no information about sites’ ‘true’ status, that is, whether the problems identified through central monitoring constitute meaningful problems (either recorded through on-site monitoring or audit activity, or known because statuses were created through simulation). One report[47] only contained case studies, that is, partial and selective reporting. Seven[16,41,48,50,51,58,60] had partial information, for example, some of sites’ true statuses were reported, but not all. Two explored classification ability through extensive simulation,[42,43] and two had detailed information from a limited set of scenarios on the number of true and false positives and negatives.[26,52] The best reported or deducible classification ability for the 11 papers with at least some information on sites’ ‘true’ status (excluding the case study paper) is shown in Table 4. Seven of these reports ascertained the ‘true’ status through on-site monitoring, audit or regulatory inspection and in three the ‘true’ status was known because it had been simulated. The final report[42] presented both real and simulated scenarios. ‘Best’ classification statistics were reported or deducible in 8 of these reports (of the remaining 3, 1 did not report enough data to allow any calculations, and 2 reported extensive simulations that precluded reporting of a ‘best’ result).

Table 4.

Best reported information on methods’ classification ability, where available or deducible.

Author(s)	Available information on methods’ classification abilities	Definition of ‘positive’ centres	‘True’ test status: real or simulated?	Test for ‘true’ centre status	Sensitivity[a]	Specificity[b]	Positive predictive value[c]	Negative predictive value[d]
Biglan et al.[41]	Partial (‘true’ status known for only one centre; total number of centres not known)	Not clearly defined[e]	Real	On-site monitoring	Unavailable due to limited data; report states that one ‘low-risk’ centre was visited and considered to be misclassified (i.e. should have been ‘medium risk’ or ‘high risk’). However, the total number of sites classified and visited (overall and within each risk category) is not known
Desmet et al.[42]	Explored through simulation	Presence of atypical data	Simulated	Known because simulated	Dependent on simulation scenario; no specific figure given
	Detailed information (vital signs data used as illustrative example)	Presence of atypical data	Real	Unclear (‘closer inspection’)	Reported: 83% (10/(10+2))	Reported: 99% (204/(204+2))	Calculated: 83% (10/(10+2))	Calculated: 99% (204/(204+2))
Desmet et al.[43]	Explored through simulation	Presence of atypical data	Simulated	Known because simulated	Reported: dependent on simulation scenario; no specific figure given	Reported: median specificity varied from 98%–100% depending on scenario	Not reported and not possible to calculate (results of many simulations presented)
Knepper et al.[26]	Detailed information	Presence of fabricated data	Simulated with physician input	Known because simulated	Reported: best result from 4 scenarios (study 1): 86% (6/(6+1))	Reported: best result from 4 scenarios (study 1a): 87% (148/(148+23))[f]	Reported: best result from 4 scenarios (study 2a): 27% (3/(3+8))	Reported: best result from 4 scenarios (study 1): 99% (132/(132+1))
Knott et al.[48]	Partial (total number of sites not reported but likely more than number whose results reported; ‘true’ status of any unreported centres not known)	Presence of any findings	Real	On-site monitoring	Calculated: 85% (11/(11+2))	Calculated: 88% (7/(7+1))	Calculated: 92% (11/(11+1))	Calculated: 78% (7/(7+2))
		Presence of findings ‘indicative of sloppy practice’ (clearer definition not reported)	Real	On-site monitoring	Calculated: 83% (10/(10+2))	Calculated: 78% (7/(7+2))	Calculated: 83% (10/(10+2))	Calculated: 78% (7/(7+2))
		Presence of serious findings (clearer definition not reported)	Real	On-site monitoring	Calculated: 100% (1/(1+0))	Calculated: 45% (9/(9+11))	Calculated: 8% (1/(1+11))	Calculated: 100% (9/(9+0))
Lindblad et al.[50]	Partial (‘true’ status known only at 21/413 centres)	Presence of serious problems	Real	Regulatory inspection	Reported: 83% (5/(5+1))	Cannot be calculated without making assumptions about the 392/413 sites with unknown ‘true’ status
		Presence of minor problems	Real	Regulatory inspection	Reported: 89% (8/(8+1))
		Presence of any problems	Real	Regulatory inspection	Reported: 87% (13/(13+2))
O’Kelly[51]	Detailed information, but sample of data from trial[g]	Presence of fabricated data	Simulated with physician input	Known because simulated	Calculated: 33% (1/(1+2))	Calculated: 95% (18/(18+1))	Calculated: 50% (1/(1+1))	Calculated: 90% (18/(18+2))
Pogue et al.[52]	POISE trial data: detailed information from all sites with ≥ 20 randomisations	Presence of fabricated data	Real	On-site monitoring	Reported: different models and different thresholds give different pros and cons in terms of classification. Models 1, 3 and 5 all have at least some scenarios where both specificity and sensitivity >80%. (Models 1 and 5, risk score ≥ 7; Model 3, risk score ≥ 5)
Pogue et al.[52]	HOPE trial data: summary information from all sites with ≥ 20 randomisations	Presence of fabricated data	Real	On-site monitoring	N/a (no true positives)	Reported: model 1: 99% (178/(178+2))	Calculated: all models: 0% (no true positives, so any positives are false)	Calculated: all models: 100% (no true positives, so all negatives are true negatives)
Stenning et al.[16]	Partial (only sample of negative-testing sites visited, although the study design aimed to control for this)	Presence of ≥1 serious (Major or Critical) finding	Real	On-site monitoring	Calculated: primary analysis: 52% (37/(37+34))	Calculated: primary analysis: 62% (8/(8+5))	Reported: primary analysis: 88% (37/(37+5))	Calculated: primary analysis: 19% (8/(8+34))
					Calculated: secondary analysis excluding re-consent findings: 59% (36/(36+25))	Calculated: secondary analysis excluding re-consent findings: 74% (17/(17+6))	Reported: secondary analysis excluding re-consent findings: 86% (36/(36+6))	Calculated: secondary analysis excluding re-consent findings: 40% (17/(17+25))
					Calculated: secondary analysis excluding all consent findings: 60% (29/(29+19))	Calculated: secondary analysis excluding all consent findings: 64% (23/(23+13))	Reported: secondary analysis excluding all consent findings: 69% (29/(29+13))	Calculated: secondary analysis excluding all consent findings: 55% (23/(23+19))
Van den Bor et al.[58]	Partial in paper, but authors confirmed that trial implemented source data verification for all sites (personal communication)	Presence of fabricated data	Real	On-site monitoring	Various situations presented, with different implications for classification ability. Median false positives below 10% for all scenarios, lower with higher m-constant; in various situations (combinations of specific m-constants with specific scenarios), the fraudulent centre is flagged ≥ 3 times (authors’ proposed threshold) 100% of the time. Some scenarios have 100% highlighting of fraudulent centre and very low false positive rate – for example, scenario 1, m = 20, scenario 2, m = 2, scenario 3, m = 20 (all with false positive rate of 2%)
Wu and Carlsson[60]	Partial (15/17 sites have unknown ‘true’ status)	Presence of fabricated data	Real	Auditing	Results presented narratively via a number of scenarios. For ‘angular clustering’, fourth scenario (correlation 0.7, 3 outliers) results in sensitivity, specificity, positive and negative predictive values all ≥98%. For ‘neighbourhood clustering’, specificity in all scenarios is ≥94% and second scenario (variances 0.45, 3 outliers, cluster size 27) results in sensitivity, specificity, positive and negative predictive values all ≥50%

Number of correctly flagged problem sites/(number of correctly flagged problem sites+sites incorrectly not flagged as concerning); thick border used to highlight results more than or equal to 90%.

Number of sites correctly flagged as not concerning/(number of sites correctly flagged as not concerning+sites incorrectly flagged as concerning); thick border used to highlight results more than or equal to 90%.

Number of correctly flagged problem sites/(number of correctly flagged problem sites+sites incorrectly flagged as concerning); thick border used to highlight results more than or equal to 90%.

Number of sites correctly flagged as not concerning/(number of sites correctly flagged as not concerning+sites incorrectly not flagged as concerning); thick border used to highlight results more than or equal to 90%.

One ‘positive’ centre is described as ‘reveal[ing that] RBM was not assessing risk sufficiently to drive monitoring decisions’.

Publication incorrectly rounds this to 86%.

Approximately one-third of sites included from a trial; also some uncertainty about total number of sites (sometimes reported as 21, sometimes 22; used 22 for calculations given here as this is the figure in the ‘Results’ section).

Best reported information on methods’ classification ability, where available or deducible.

Of the eight reports with some available statistics, 1/7 had sensitivity ≥90% in at least one scenario (statistic unavailable in one report), 4/7 had specificity ≥90% in at least one scenario (unavailable in one report), 1/6 had positive predictive value ≥90% in at least one scenario (unavailable in two reports) and 5/6 had negative predictive value ≥90% in at least one scenario (unavailable in two reports). 
Four reports contained at least one scenario where more than one of these statistics was ≥90%, and in one case all four statistics were over 80%.[42] All four of these reports had limitations in terms of either lack of clarity around how the ‘true’ site status was ascertained,[42] unclear outcome measure definition,[48] or low or unavailable results for the other classification statistics.[51,52] The four reports all described central statistical monitoring methods (as opposed to triggered monitoring), and had used a variety of statistical techniques, including both ‘supervised’ and ‘unsupervised’ analyses.[63]

Some papers reported on actual or theoretical cost savings from reduced on-site monitoring,[36,41,44] and others commented on the risk of incurring costs if their proposed central monitoring method identifies sites that do not in fact have meaningful problems (i.e. false positives).[26,58] However, no papers gave detailed costings for the development, implementation and maintenance of the central monitoring methods themselves.
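
As an illustration only, the four classification statistics defined in the table notes can be recomputed from site counts. The sketch below is not taken from any of the included reports: the names `SiteCounts`, `classification_stats` and `as_percent` are hypothetical, and the counts are those shown in the table for the Stenning et al. primary analysis and for Pogue et al. model 1. A statistic with a zero denominator is returned as `None`, matching the table's 'N/a' entry where there were no true positives.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SiteCounts:
    """Site-level counts against the reference test (hypothetical container)."""
    tp: int  # problem sites correctly flagged
    fn: int  # problem sites incorrectly not flagged
    tn: int  # non-problem sites correctly not flagged
    fp: int  # non-problem sites incorrectly flagged


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # A zero denominator means the statistic is undefined ("N/a" in the table).
    return numerator / denominator if denominator else None


def classification_stats(c: SiteCounts) -> Dict[str, Optional[float]]:
    """Return the four statistics as defined in the table notes."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


def as_percent(value: Optional[float]) -> str:
    """Format a proportion as a whole-number percentage, or 'N/a' if undefined."""
    return "N/a" if value is None else f"{round(value * 100)}%"


# Counts read from the table rows (primary analysis; model 1).
stenning_primary = SiteCounts(tp=37, fn=34, tn=8, fp=5)
pogue_model_1 = SiteCounts(tp=0, fn=0, tn=178, fp=2)

for label, counts in [("Stenning et al., primary", stenning_primary),
                      ("Pogue et al., model 1", pogue_model_1)]:
    stats = classification_stats(counts)
    print(label, {name: as_percent(v) for name, v in stats.items()})
```

With these counts the sketch gives 52%, 62%, 88% and 19% for the Stenning et al. primary analysis, in line with the table row, and an undefined sensitivity for Pogue et al., where no fabricated data were found.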

Discussion

We conducted a scoping review to identify and characterise published methods for assessing the risk of not taking corrective action at trial sites at a given time. Although our search looked for reports from any time after 1995, over half of our results are from after 2013, highlighting the recent growth of risk-based monitoring concepts. Where information on host trials was available, they were almost always trials of investigational medicinal products, emphasising the interest in applying risk-based methods – and accessing the potential associated efficiency benefits – in this setting. Around a third of our results were not full, peer-reviewed reports, reflecting a wider problem with availability of evidence supporting trial conduct methods.[64]

Identified methods were mainly in two broad categories. Most were about central statistical monitoring, which uses statistical testing of all or a subset of trial data items to compare sites and identify atypical trial centres. A minority described triggered monitoring techniques, whereby sites are assessed against pre-specified site metric threshold rules (usually binary), with sites meeting the greatest number of ‘triggers’ being considered the most concerning. Several authors note that central statistical monitoring needs sufficient overall and per-site sample sizes for adequate statistical power[24,26,58] (although some described methods were shown to detect problem sites during interim analysis or other early timepoints).[26,50,58] Triggered monitoring, however, can be used at any stage of a trial’s recruitment (especially with trigger rules based on single instances of a given protocol violation, for example). We therefore suggest that the two techniques can, at least in theory, be used in combination.

In line with our review’s aims, our focus in characterising our results was on looking for evidence supporting the use of each proposed monitoring method. It was therefore beyond our scope to compare and contrast the different central statistical monitoring methods proposed in these reports. Several previous papers have reviewed these methods in more detail.[24,52,63,65,66]

Nearly half of the central statistical monitoring reports had a stated focus on identification of fraud or data fabrication. The possibility of fraud is a serious concern to trialists and a threat to wider trust in science.[67] It was possibly an important factor in establishing 100% on-site verification of trial data as a common monitoring approach.[68,69] This may help explain the prevalence of reports about fraud detection, as some may see the priority in risk-based monitoring to be establishing its fraud detection ability compared with 100% source data verification (SDV). However, although the incidence of data fraud is difficult to quantify, cases of extensive data fabrication appear rare enough to have individual notoriety.[70] Furthermore, methods to detect fraud are necessarily rather selective, and therefore may not alone be suitable for trialists looking to detect more common, lower level data integrity issues such as poor equipment calibration or inadequately trained trial staff, which central statistical monitoring methods may also be well-suited to detect.

We collected data on how the proposed methods we identified had been evaluated. A number of reports only presented proposed, untested methods, or only selected case studies to demonstrate the methods’ performance. Of those that presented more detailed evaluation, a common limitation was that the ‘true’ status both of identified problem sites and sites apparently not of concern was often not available, or only partially available. It was therefore difficult to know if the ‘concern’ status of sites in central monitoring results represented meaningful problems or not. 
In addition, a number of studies use simulation to create ‘true’ sites of concern; these raise the additional question of whether these simulations reflect real-life issues, though the involvement of clinicians (i.e. those who would provide real-life trial data) in the simulation process of some reports[26,51] is reassuring.

Of the few reports with available classification statistics, the best results were often in methods’ specificity or negative predictive value. The latter finding in particular could be encouraging for those with concerns that if risk-based monitoring means reduced or omitted monitoring activity, it might fail to detect serious errors. Some of the methods also showed good classification ability in more than one classification statistic. However, this was not without caveats of opaque reporting, other classification statistics being poor or unavailable, or the potential limitations of simulation mentioned above.

It is important to recognise the limitations of the available ‘gold-standards’ in the classification of sites. When methods are tested using simulated or real-but-adjusted data, it may be difficult to know how accurately these recreate real-life situations. When central monitoring methods are tested in real, ongoing trials, on-site monitoring may be an imperfect reference test, in that it may not be able to identify all problems. By contrast, it is clear that central monitoring, with its enhanced inter- and intra-site review, can identify issues that a single team at one site for a limited time might not.[66]

It could be argued that at least some of the methods we have identified do not need extensive evaluation because they prove their own worth. For instance, they help identify outliers that in some cases are self-evidently meaningful problems to resolve. We acknowledge that some central monitoring activities identify ‘known’ problems (e.g. identifying weekend visit dates, which are unlikely to be correct) and are valuable for data cleaning purposes. However, we were specifically interested in the more nuanced use of these methods to identify sites of ‘concern’, at which monitoring activity may be targeted, and consequently sites ‘not of concern’, monitoring of which may be reduced or omitted. In light of the limitations we have described here, we do not believe any methods have yet demonstrated sufficiently reliable classification ability to justify more widespread adoption.

Aside from some comments on the potential cost of investigating false positive central monitoring results,[26,58] the reports we identified contained limited information on the cost of developing and implementing their methods. As well as uncertainty about how to develop relevant methods, uncertainty or concern about costs involved is a substantial barrier to adoption of risk-based monitoring.[71]

Further work is needed to fully demonstrate the effectiveness of these dynamic site risk assessment methods which, alongside pre-trial risk assessments, form the core of risk-based monitoring. We therefore recommend the following:

Coordinate research efforts. From the scoping review and contact with report authors, it was clear that various small research projects relevant to this topic were ongoing, but mostly in isolation. Researchers in this area should take stock of existing research, and set clear priorities to ensure research time is well-spent.

Standardise monitoring studies. Core outcome sets[72] or other mechanisms to standardise studies about monitoring would improve study quality and may facilitate cross-study evidence synthesis.

Share evidence. Time, commercial sensitivity and perceived reputational risk could all be barriers to publishing evidence about monitoring practices. However, additional, publicly available evidence to support the best monitoring practice will allow trialists in all settings to adopt new methods with confidence.

Publish full papers. Conference abstracts and posters can disseminate basic information about new ideas, but rarely have enough detail to allow replication or robustly demonstrate effectiveness. As this emerging field cannot be built on abstracts alone, we encourage researchers to publish full, peer-reviewed papers about their monitoring methods.

Combine complementary methods. Although work has been done on a number of distinct risk-based monitoring methods, an optimal monitoring plan might involve a combination of these, including both central statistical monitoring and triggered monitoring. A collaborative approach to combining existing methods could help develop and test such an idea.

We acknowledge several limitations. Our database searches identified relevant material from disparate locations, including abstracts in conferences in unrelated research fields. It is possible that other abstract collections include relevant material, but it was not feasible to find all of these. Although the Internet searches made little contribution to the final list of included reports, they may have been limited by known reproducibility problems.[73] Scoping review methodology advises that relevant experts in a field are surveyed to help identify other relevant work.[74] We have not formally done this. We have, however, contacted most authors of included reports for clarifications, and this has not highlighted any additional relevant reports. Some search results were of borderline relevance to our aims, and took discussion to ultimately include or exclude. It is possible that other researchers repeating the same review might arrive at a slightly different list, but we believe this might only affect the ‘method-only’ papers, which are not critical to our conclusions. 
The comprehensive nature of our search strategy gives us confidence that our report is a sound overview of the state of the evidence in this research area.

We have not performed a formal quality assessment of reports we found; however, this is considered by some to be unnecessary in scoping review methodology.[29] There is also no validated way to review the quality of risk-based monitoring studies, although we used the QUADAS-2 tool to inform our data collection template.

Finally, we acknowledge that some time has passed since we first conducted our search for relevant evidence. Conscious of this, we repeated the main database search in 2018 (albeit with only one author conducting title and abstract screening) and added three relevant reports. We are not aware of any research published since then that might change our overall conclusions. If evidence is now available that addresses the limitations we have highlighted in the existing literature, we would certainly consider this a positive development.

Our scoping review highlighted some promising evidence for risk-based monitoring in ongoing trials. However, currently published methods may not yet have demonstrated their efficacy or cost-effectiveness well enough for trialists to implement them with confidence as a means to target or omit on-site visits. A more coordinated, collaborative and transparent approach to developing and sharing evidence in this field, including industry and academic partners, could help it grow beyond its current nascent state, and could contribute to risk-based monitoring more quickly entering routine practice.

Supplemental material, sj-pdf-1-ctj-10.1177_1740774520976561 for Dynamic methods for ongoing assessment of site-level risk in risk-based monitoring of clinical trials: A scoping review by William J Cragg, Caroline Hurley, Victoria Yorke-Edwards and Sally P Stenning in Clinical Trials

Supplemental material, sj-pdf-2-ctj-10.1177_1740774520976561 for Dynamic methods for ongoing assessment of site-level risk in risk-based monitoring of clinical trials: A scoping review by William J Cragg, Caroline Hurley, Victoria Yorke-Edwards and Sally P Stenning in Clinical Trials

49 in total

1. Detecting data fabrication in clinical trials from cluster analysis perspective.

Authors: Xiaoru Wu; Martin Carlsson
Journal: Pharm Stat Date: 2010-10-08 Impact factor: 1.894

Review 2. Statistical challenges for central monitoring in clinical trials: a review.

Authors: Koji Oba
Journal: Int J Clin Oncol Date: 2015-10-23 Impact factor: 3.402

3. The trials of Dr. Bernard Fisher: a European perspective on an American episode.

Authors: R Peto; R Collins; D Sackett; J Darbyshire; A Babiker; M Buyse; H Stewart; M Baum; A Goldhirsch; G Bonadonna; P Valagussa; L Rutqvist; D Elbourne; C Davies; O Dalesio; M Parmar; C Hill; M Clarke; R Gray; R Doll
Journal: Control Clin Trials Date: 1997-02

4. A computationally simple central monitoring procedure, effectively applied to empirical trial data with known fraud.

Authors: Rutger M van den Bor; Petrus W J Vaessen; Bas J Oosterman; Nicolaas P A Zuithoff; Diederick E Grobbee; Kit C B Roes
Journal: J Clin Epidemiol Date: 2017-04-12 Impact factor: 6.437

5. Central statistical monitoring: detecting fraud in clinical trials.

Authors: Janice M Pogue; P J Devereaux; Kristian Thorlund; Salim Yusuf
Journal: Clin Trials Date: 2013-01-02 Impact factor: 2.486

Review 6. A systematic search for reports of site monitoring technique comparisons in clinical trials.

Authors: Julie Bakobaki; Nicola Joffe; Sarah Burdett; Jayne Tierney; Sarah Meredith; Sally Stenning
Journal: Clin Trials Date: 2012-10-11 Impact factor: 2.486

Review 7. Risk based monitoring (RBM) tools for clinical trials: A systematic review.

Authors: Caroline Hurley; Frances Shiely; Jessica Power; Mike Clarke; Joseph A Eustace; Evelyn Flanagan; Patricia M Kearney
Journal: Contemp Clin Trials Date: 2016-09-15 Impact factor: 2.226

Review 8. A scoping review of scoping reviews: advancing the approach and enhancing the consistency.

Authors: Mai T Pham; Andrijana Rajić; Judy D Greig; Jan M Sargeant; Andrew Papadopoulos; Scott A McEwen
Journal: Res Synth Methods Date: 2014-07-24 Impact factor: 5.273

9. Risk-adapted monitoring is not inferior to extensive on-site monitoring: Results of the ADAMON cluster-randomised study.

Authors: Oana Brosteanu; Gabriele Schwarz; Peggy Houben; Ursula Paulus; Anke Strenge-Hesse; Ulrike Zettelmeyer; Anja Schneider; Dirk Hasenclever
Journal: Clin Trials Date: 2017-08-08 Impact factor: 2.486