Literature DB >> 29956486

Prioritising references for systematic reviews with RobotAnalyst: A user study.

Piotr Przybyła1, Austin J Brockmeier1, Georgios Kontonatsios1, Marie-Annick Le Pogam2, John McNaught1, Erik von Elm2, Kay Nolan3, Sophia Ananiadou1.   

Abstract

Screening references is a time-consuming step necessary for systematic reviews and guideline development. Previous studies have shown that human effort can be reduced by using machine learning software to prioritise large reference collections such that most of the relevant references are identified before screening is completed. We describe and evaluate RobotAnalyst, a Web-based software system that combines text-mining and machine learning algorithms for organising references by their content and actively prioritising them based on a relevancy classification model trained and updated throughout the process. We report an evaluation over 22 reference collections (most are related to public health topics) screened using RobotAnalyst with a total of 43 610 abstract-level decisions. The number of references that needed to be screened to identify 95% of the abstract-level inclusions for the evidence review was reduced on 19 of the 22 collections. Significant gains over random sampling were achieved for all reviews conducted with active prioritisation, as compared with only two of five when prioritisation was not used. RobotAnalyst's descriptive clustering and topic modelling functionalities were also evaluated by public health analysts. Descriptive clustering provided more coherent organisation than topic modelling, and the content of the clusters was apparent to the users across a varying number of clusters. This is the first large-scale study using technology-assisted screening to perform new reviews, and the positive results provide empirical evidence that RobotAnalyst can accelerate the identification of relevant studies. The results also highlight the issue of user complacency and the need for a stopping criterion to realise the work savings.
© 2018 The Authors. Research Synthesis Methods Published by John Wiley & Sons Ltd.

Year:  2018        PMID: 29956486      PMCID: PMC6175382          DOI: 10.1002/jrsm.1311

Source DB:  PubMed          Journal:  Res Synth Methods        ISSN: 1759-2879            Impact factor:   5.273


INTRODUCTION

Systematic reviews seek to answer specific research questions and form unbiased, evidence-based conclusions by combining information from all relevant studies. They are used to compare treatments, diagnostic tests, health service organisations, prevention strategies, etc., and to develop evidence-based guidelines on health and social policy and clinical practice.1, 2, 3 Coarsely, a systematic review involves at least five stages: formulation of the research question and inclusion criteria, literature search to identify a set of possibly relevant references, initial relevancy screening based on reference and abstract, further screening based on the full text, and evidence extraction and synthesis of findings. While all stages are resource-intensive, designing literature database searches and performing abstract-level screening are tasks that are increasingly time-consuming due to the ever-growing corpus of published literature.4 Information specialists must design search strategies that are sufficiently sensitive to gather all relevant studies while specific enough to limit the result size. This is especially challenging in public health, because the research questions are often broad and the inclusion criteria, expressed in terms of the traditional PICO framework (population, problem, patient; intervention; comparison, control, comparator; and outcome), involve broad definitions for populations and complex interventions. Furthermore, the definitions of interventions lack consistency across studies. As a result, literature searches for public health evidence often have low specificity and return large volumes of references to be screened at the title and abstract level.

Screening references is a lengthy process, and double screening with two reviewers is recommended to avoid missing references.1, 5 Given that the estimated screening time per reference (title and abstract) is between 30 seconds6 and 1 minute7 and the estimated time to discuss and resolve an inclusion disagreement between reviewers is approximately 5 minutes per reference, screening 5000 references will take between 83 and 125 hours per reviewer.7 The estimated costs are also considerable, amounting to £13 000 for a single review.7 This burden motivates the use of computational tools to assist manual screening and reduce the workload in cases of low specificity.8 For example, search tools can be used to select a small initial subset of the references sharing a certain characteristic, eg, the presence of a keyword or its synonyms. Furthermore, machine learning algorithms that learn from a human screener's decisions can be harnessed to prioritise the remaining unseen references by their predicted relevancy.9 With prioritisation, it may be possible to perform a partial screening and still identify the vast majority, for example, 95%, of the relevant references.9 Prioritisation may also enable different screening or review paradigms, such as living systematic reviews10, 11 or updates,12, 13 since more relevant references are found earlier than with manual screening.14

RobotAnalyst is a Web-based screening system that leverages algorithms from information retrieval, text mining, natural language processing, and machine learning to assist reviewers in prioritising references and exploring a reference collection using automatic terminology extraction, topic modelling, and descriptive clustering.
While RobotAnalyst (http://nactem.ac.uk/robotanalyst/) has been purposefully designed to handle the challenges of terminological variation and low specificity in the large collections encountered with public health reviews, these challenges are ubiquitous, and it can be applied to any screening task at the title and abstract level. The potential benefit of incorporating machine learning into the systematic review toolkit has been showcased by numerous retrospective studies (simulations of the screening process using previously screened collections9, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31) and some prospective studies32, 33 that have explored different machine learning approaches. Yet, before semi-automated tools with functionality like RobotAnalyst become widely adopted within the systematic review community, there is a need for real-world evaluation, user feedback, and discussion to understand the benefits and potential risks. While multiple software solutions33, 34, 35, 36 leverage machine learning prioritisation to assist review screening, published evaluations have mainly been limited to cross-validation of previously completed reviews. There is a lack of studies where the screening, from start to finish, is completed with computer-assisted prioritisation. Real-world evaluations of semi-automated tools' performance by reviewers are necessary to confirm their theoretical benefits and ensure that they actually support review and guideline development.37 In particular, real-world evaluation can be used to track metrics such as decision accuracy and time per decision across the screening process. This is important for assessing additional aspects, such as software interface design and real-time operation.

We have conducted an evaluation of RobotAnalyst for technology-assisted screening at two sites. Multiple reviews were performed for public health guidelines and new surveillance reviews within the National Institute for Health and Care Excellence (NICE), which conducts systematic reviews to identify effective interventions and inform the development of public health guidelines (http://www.nice.org.uk/guidance). Another review was conducted by reviewers from the Cochrane Switzerland group (http://swiss.cochrane.org/), at the Institute of Social and Preventive Medicine (IUMSP), Lausanne University Hospital, to inform patient safety and quality of hospital care. Results from both sites highlight the ability of RobotAnalyst to prioritise relevant references early in the screening process.

In addition to prioritisation, RobotAnalyst offers functionality for the exploration of reference collections via descriptive clustering38, 39, 40, 41 and topic modelling.42, 43, 44, 45 These techniques take text documents, such as reference abstracts, and divide them into a set of clusters or topics, each associated with a subset of references that share similar vocabulary. References with the same topic or cluster may be thematically related even if there are no common keywords they all share. Furthermore, the topic proportions of each reference can be used to find related references, and as a feature representation for machine learning.26, 29, 36 Primarily, clusters and topics, supplemented with automatically generated descriptions,40, 41, 46 allow reviewers to explore the thematic coverage44, 47 and locate relevant references48, 49 within diverse collections, without having to explicitly form keyword queries.
Searching and screening diverse collections is especially useful for supporting the development of public health guidelines that involve multiple complex questions. Analysts from NICE have performed an evaluation of the coherence of RobotAnalyst's clustering and topic modelling and the descriptiveness of the keyword lists. In summary, the key contribution of RobotAnalyst is the combination of active learning prioritisation with content, metadata, and topic- and cluster-based search. Reviewers can use the search capabilities to identify the initial set of inclusions and exclusions before using active learning to prioritise. This initial, search-driven phase is ignored in other evaluations based on cross-validation of previously completed reviews. Furthermore, during screening, the inclusion ranking itself can be filtered using Boolean queries based on specific terms, clusters, or topics. This human-in-the-loop ranking thus augments purely machine-driven prioritisation.

RELATED WORK

The earliest study of machine learning to emulate the inclusion decisions for systematic reviews was the work of Cohen et al.9 Previously, machine learning had been demonstrated to be as effective as hand-tuned Boolean queries52 for retrieving general categories (therapy, diagnosis, aetiology, and prognosis) of high-quality studies for evidence-based medicine.50, 51 Subsequent studies15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 32, 33, 36 have explored different feature spaces (words or multiword patterns, MeSH terms from PubMed metadata, Unified Medical Language System (UMLS) nomenclature, and topical or thematic features) and machine learning models and techniques such as naive Bayes, support vector machines (SVMs), and logistic regression. Others have incorporated out-of-topic inclusions53 and unscreened references.30, 31

Most previous studies of machine learning for systematic reviews have a fixed training and test set of references. In these cases, a user screens a portion of references (either random or based on publication year for an update); the machine learns from the screened portion and predicts the relevant references within the remainder of references (the test set). Finally, only the references predicted as relevant are screened manually by a human reviewer. In this scenario, two reviewers (human and machine) have screened every inclusion. A similar scenario involves two human reviewers who each screen half of the collection to train two independent classification models. Each model provides relevancy predictions on the other half, and discrepancies are resolved by the humans.18

A key issue with these scenarios is the low specificity within the training set. This poses a problem since identifying all or nearly all of the inclusions is essential for systematic reviews, but off-the-shelf machine learning algorithms offer predictions under the assumption that misclassifications, eg, predicting an inclusion instead of an exclusion or vice versa, are equally unwelcome. This is not the case in systematic reviews, where unnecessary inclusions during abstract-level screening (by being overly inclusive) can later be discarded, while missing relevant references violates the purpose of systematic reviews.1 Without adjustment, the classification model may perform poorly with imbalanced samples. To overcome this, principled adjustments and various ad hoc techniques, such as subsampling or reweighting, have been explored.

Active learning54, 55, 56 is the process of using a classification model's predictions to iteratively select training data. It provides an alternative scenario for prioritising the screening process from the beginning to the end, which naturally ameliorates the imbalanced sample problem. After training with a small set of references screened by a human, active learning proceeds by prioritising references based on their predicted relevancy and the confidence of this prediction. One objective of active learning is to select training examples that improve the model as quickly as possible such that it can eventually be applied to the remaining references; in this case, the references with the lowest confidence in their model predictions are screened first.22, 23, 26, 30, 31, 54 Another approach is relevancy-based prioritisation, where the references with the highest probability of being relevant are screened first24, 26, 29, 31, 33 (a process known as relevancy feedback57 or certainty-based screening26).
Essentially, active learning uses new screening decisions made by the user to improve the prioritisation throughout the process. Furthermore, active learning naturally handles the imbalanced sample problem by including references with a substantial chance of being relevant. In the rest of this section, we review screening systems currently used in applications for systematic reviewing. Prioritisation performance for some of these systems has been measured, but the nature of the evaluation settings varies.

EPPI-reviewer34 (https://eppi.ioe.ac.uk/cms/er4/) is a tool for reference screening available through a Web-based interface for a subscription fee. It contains automatic term recognition using several methods, including Termine from the National Centre for Text Mining,58 which, as described in the EPPI-reviewer user manual (https://eppi.ioe.ac.uk/CMS/Portals/35/Manuals/ER4.7.0%20user%20manual.pdf), could be used to find relevant references based on terms found in previous inclusions. References can also be clustered using Lingo3G software (https://carrotsearch.com/lingo3g/). Reference prioritisation is not generally available to all users, but it has already been tested for scoping reviews,59 which differ from systematic reviews by taking into account much larger sets of possibly eligible references and having eligibility criteria developed iteratively during the process. EPPI-reviewer was used in two scoping reviews, containing over 800 000 and 1 million references, and provided substantial workload reduction (around 90%). One should note, though, that because of the collection sizes, not all references were manually screened, so recall was estimated using random samples from the whole reference set.

Abstrackr33 (http://abstrackr.cebm.brown.edu/), specifically designed to facilitate screening based on active learning, is a free online open-source tool that uses the dual supervision paradigm, where the classification rules are not only automatically learned from screening decisions but also provided explicitly by users as lists of words whose occurrence in text is indicative of reference inclusion. Another interesting extension is collaborative screening, which takes into account the different levels of experience and costs of reviewers working on the same study in an active learning scenario.60, 61 The underlying classifier is an SVM over n-grams (word sequences). A prospective evaluation using relevancy-based prioritisation was performed by an assistant, who used decisions by a domain expert to resolve dubious cases. An independent evaluation62 was performed on four previous reviews (containing 517, 1042, 1415, and 1735 references). In this case, only the inclusions were evaluated by a reviewer, while exclusions were judged by verifying whether they were present in the published reviews (ie, the references were included after full-text screening). The reported work saved was 40%, 9%, 56%, and 57%, respectively.

SWIFT-Active Screener (https://www.sciome.com/swift-activescreener/) is a Web-based interface for systematic reviews with active learning prioritisation.36 Similar to RobotAnalyst, it uses bag-of-words features (the counts of distinct words within the title and abstract) and the topic distributions estimated by latent Dirichlet allocation,43, 63 and prioritises references using a logistic regression model. The differences are SWIFT's inclusion of MeSH terms and RobotAnalyst's use of a linear SVM for the classification model rather than logistic regression.
SWIFT-Active Screener implements a separate model to predict the number of inclusions remaining to be screened, which can be used as a signal for the reviewer to stop screening. The system is interoperable with a related desktop application, SWIFT-Review (https://www.sciome.com/swift-review/), which is freely available. A cross-validation evaluation36 across 20 previously completed reviews, including 15 from Cohen et al,9 has shown consistent work saved over sampling.

Rayyan35 (https://rayyan.qcri.org/) is a free Web application for systematic review screening. The machine learning model28 is an SVM-trained classifier that uses unigrams, bigrams, and MeSH terms and suggests relevancy using a 5-star system. It was evaluated on data from Cohen et al,9 and a pilot user study on two previously completed Cochrane reviews (273 and 1030 references) was undertaken for qualitative evaluation. The interface provides a simple tool for noting exclusion reasons and supports visualisation of a similarity graph of references.

Another Web-based system for screening references is Colandr (http://www.colandrcommunity.com/). The system has an open-source code base and uses a linear model that is applied to vector representations of references based on word vectors.64, 65

Besides screening prioritisation, there are other text-mining tools to assist study selection for systematic reviews.66 For some systematic reviews, the inclusion criteria dictate that the reference describes a randomised controlled study or that the study uses certain methodologies (eg, double-blinding) to ensure quality. Tools to automatically recognise these67, 68, 69, 70, 71 can be used to generate tags to filter references. Study selection can also benefit from fine-grained information extraction from full article text, eg, to find sentences corresponding to PICO criteria elements,72 or from efforts to automatically summarise included studies.66

In summary, numerous studies have evaluated automatic classification for systematic reviews; some of these approaches have been implemented within end-user systems, but their evaluations have been limited to either simulations involving previously completed reviews or partial reviews that have not been verified by complete manual screening. To the best of our knowledge, our work is the first large-scale user-based evaluation that performs new screening tasks from start to finish.

METHODS

In this section, we firstly describe RobotAnalyst's core functionality and implementation. An overview of the user interfaces is presented in Appendix A. Secondly, we describe the evaluation methodology.

System functionality

To support screening, RobotAnalyst's interface allows a user to combine searches based on content, clusters, or topics with active learning prioritisation. As shown in Figure 1, the input to the screening process is a collection of references, and the output is the subset of relevant references included for further review or synthesis. The user's screening decisions inform the classification model, which in turn affects the prioritisation of references in the active learning loop.
Figure 1

An information flow diagram of the information processing and user interaction available in RobotAnalyst

An initial batch of references (selected at random or via a focused search) with manual screening decisions is used to train an initial classification model, which recommends further references for screening. If a user chooses to screen them, the decisions, in turn, are used to train a better classification model, which will suggest further references. In every iteration of this active learning loop, the classifier is improved by having more training data and, as a result, can provide better predictions and suggestions. Although a reviewer can use the system's relevance scores in many ways, relevancy-based prioritisation,26 where the references the system deems the most likely to be relevant are screened first, is the suggested approach. Prioritising the relevant references earlier in the screening process31 is considered beneficial for gaining an understanding of the area14 or in cases when the reviewer does not plan to screen the whole collection.9 As soon as new screening decisions have been made, the user can trigger the retraining process to update the system's predictions and the inclusion confidence for each reference by using the new decisions to train the classification model. Updating the model frequently may provide more accurate predictions for prioritising the relevant references. However, to avoid excessive computation and to provide a stable user experience, the training is manually triggered by the user, with the system issuing a reminder after 25 decisions without an update. An important feature of RobotAnalyst is its flexibility. A user can choose to select references for screening using the search functionality, the system's inclusion confidence for prioritisation, or a combination of both. Since model retraining is triggered manually, users can focus on assessing a large portion of the collection before rebuilding the model or choose to make frequent updates to get the most accurate predictions. The system does not attempt to substitute for the reviewer but rather to aid the reviewer throughout the screening process.
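To make the active learning loop concrete, the following is a minimal sketch, not RobotAnalyst's implementation: it assumes scikit-learn, uses a plain TF-IDF representation (no topic features), and all function names and the toy data are illustrative.

```python
# Minimal sketch of relevancy-based active learning prioritisation.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def prioritise(texts, decisions):
    """Rank unscreened references by predicted relevance (most likely inclusions first)."""
    X = TfidfVectorizer(stop_words="english").fit_transform(texts)
    labelled = sorted(decisions)                     # indices that already have a decision
    y = np.array([decisions[i] for i in labelled])   # True = include, False = exclude
    clf = LinearSVC(C=1.0).fit(X[labelled], y)       # needs at least one inclusion and one exclusion
    unscreened = [i for i in range(len(texts)) if i not in decisions]
    scores = clf.decision_function(X[unscreened])
    return [unscreened[j] for j in np.argsort(-scores)]

# One iteration of the loop: screen the top-ranked reference, record the decision,
# and retrain when the user triggers an update.
texts = [
    "randomised trial of a smoking cessation intervention in adults",
    "protein folding dynamics in molecular simulations",
    "exercise programme for weight management: a controlled study",
    "deep survey of galaxy clusters",
]
decisions = {0: True, 3: False}           # initial screened batch
ranking = prioritise(texts, decisions)
decisions[ranking[0]] = True              # reviewer screens the top suggestion
```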

System implementation

RobotAnalyst is implemented as a server-side Web application, which means that all screening process data are stored and processed on a remote server while the interface is accessible as an interactive Web page using a standard Web browser. When a user uploads a reference collection (in RIS file format,73 which describes each reference by its title, abstract, and metadata), each reference is converted into two text documents corresponding to the title and abstract. These documents are subsequently processed by text-mining and automatic term detection; a topic model is generated based on the contents of the abstracts to find references that discuss similar topics while allowing variations in vocabulary; a feature representation is created for descriptive clustering to divide the collection into groups of similar references that can be browsed via lists of associated keywords; and feature representations are built for the active-learning classification model used for predicting the relevant references and prioritising them. The original content and metadata along with the extracted information (topics, clusters, and terms) are indexed for collection-specific search using the Apache Solr search server (http://lucene.apache.org/solr/). The Solr database also stores the user's screening decisions and user-supplied notes along with the machine-generated classification predictions.

The text-mining pipeline begins by passing the title and abstract through the GENIA tagger,74 which records the part of speech of each word and its lemma (the word's base form before affixing plurals, tense markers, etc.). To identify multiword terms for searching, we use the C-value method implemented in Termine to identify candidate terms in each reference based on multiword noun phrases. For topic modelling, we use the latent Dirichlet allocation (LDA) model,43 a standard model that assumes each text is a mixture of topics with the proportion of topics varying between the texts. We use the MALLET63 toolkit to create an LDA model with 300 topics based on the text from the titles and abstracts (the prior parameter β is set to 0.01, and the priors are optimised every 50 iterations). The LDA model has multiple uses: it supports a visually aided search interface for selecting references according to topics, provides a similarity measure to compare references via the cosine distance between their topic vectors, and supplies additional input features for the classifier. For the topic-based search, each topic is described by the set of the 5 most frequent words and the set of 45 references most associated with it. The topic proportions are used as features that smoothly consolidate the occurrences of words from the same topic. In practice, the optimal number of topics depends on the collection and screening task. If too few topics are used, the topics would not be sufficiently specific to improve the prioritisation. If too many are used, they would not provide any advantage over the word occurrences themselves. To arrive at 300 topics, we followed previous work26, 29 and the MALLET user guide, which suggests between 200 and 400 topics.
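As a rough analogue of the feature construction described above (not the actual GENIA/Termine/MALLET pipeline), the sketch below uses scikit-learn in their place; the function name and parameters are illustrative only.

```python
# Sketch: L2-normalised TF-IDF vectors for title and abstract plus LDA topic proportions.
import scipy.sparse as sp
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_features(titles, abstracts, n_topics=300):
    """Concatenate title TF-IDF, abstract TF-IDF, and topic-proportion features."""
    X_title = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(titles)
    X_abstract = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(abstracts)

    # Topic proportions estimated from the combined title + abstract text.
    combined = [t + " " + a for t, a in zip(titles, abstracts)]
    counts = CountVectorizer(stop_words="english").fit_transform(combined)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(counts)        # one topic-proportion row per reference

    # Stack the three feature blocks into a single sparse matrix, one row per reference.
    return sp.hstack([X_title, X_abstract, sp.csr_matrix(theta)]).tocsr()
```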
For descriptive clustering, we use spectral clustering of the documents (title and abstract combined) to form the clusters and a statistical selection process to determine a set of words and multiword terms that succinctly describe each cluster.41 Spectral clustering75 operates on the cosine similarities between the bag-of-words vector representations of the abstracts and titles using term frequency–inverse document frequency (TF–IDF) weighting.76 The vocabulary is limited to words that occur at least five times in the collection and are not present in the stop-words list. The spectral clustering algorithm relies on the truncated eigendecomposition of an N-by-N matrix, where N is the number of references in the collection. To scale to large collections, this similarity matrix is never explicitly created; instead, the matrix-vector multiplications required for the eigendecomposition are computed as a series of sparse matrix multiplications. The spectral representation is clustered using spherical k-means with 10 replicates, a maximum of 100 iterations, and the scalable k-means++ oversampling initialisation algorithm77 with the parameters l = 2k and r = 5, where k is the number of clusters. The keywords (both lemmatised words and multiword terms identified using Termine) used to describe each cluster are selected as the most informative features for the cluster. Specifically, the algorithm firstly selects keywords positively correlated with the cluster and then uses the conditional mutual information maximisation criterion78 for greedy forward selection with redundancy reduction. The number of keywords used for each cluster is selected by statistical model order selection using the Bayesian information criterion79 after fitting a model to predict the cluster membership based on the presence of the keywords.41 The keywords for each cluster are sorted based on their coefficient weights. A user can select the number of clusters from the multiples of five up to 100 clusters.

The system's inclusion confidence is provided by a binary classification model that can be updated after each screening decision using all prior decisions as training data. As input to the classification model, each reference is represented as a vector of features corresponding to the counts of words occurring in the title and abstract and also the topic model proportions. Using the inferred topics as features has been previously shown to improve accuracy in screening prioritisation.26 Specifically, references are represented by three sets of features: an L2-normalised bag-of-words representation of the title based on TF-IDF scores of all lemmatised words not present in the stop-words list (L2 normalisation ensures that the sum of the squared values of the feature vector equals 1), an analogous bag-of-words representation for the abstract, and the topic-proportion vector estimated by Gibbs sampling from the LDA model. Past screening decisions (references labelled for either inclusion or exclusion) provide the training examples for a linear model fit with an L2-regularised L2-loss function using the dual formulation of the support vector classifier implemented in LIBLINEAR80 with the default parameter values: constraint violation cost parameter C = 1 and stopping criterion ε = 0.1. By design, a support vector classifier can handle cases when the number of features is greater than the number of references, which is typically the case with a bag-of-words representation.
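The descriptive clustering step at the start of this passage can be illustrated with the small sketch below. It is a simplification, not the authors' implementation: it builds the cosine similarity matrix explicitly (acceptable only for small collections, whereas the system above uses sparse matrix-vector products), uses standard k-means on the unit-normalised spectral embedding in place of spherical k-means with scalable k-means++ initialisation, and omits the rare-word filtering and keyword selection.

```python
# Illustrative spectral clustering over TF-IDF cosine similarities.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

def spectral_clusters(docs, k):
    """Cluster title+abstract strings into k groups via a spectral embedding.

    Assumes every document keeps at least one non-stop word after vectorisation.
    """
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    X = normalize(X)                           # unit-length rows, so dot product = cosine
    S = (X @ X.T).toarray()                    # cosine similarity matrix (dense for brevity)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]   # symmetrically normalised affinity
    _, vecs = np.linalg.eigh(L)                # eigenvectors in ascending eigenvalue order
    embedding = normalize(vecs[:, -k:])        # top-k eigenvectors projected onto the unit sphere
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```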
The classification model is applied to the entire set of references and the output values are converted to inclusion confidence values (between 0 and 1) by applying the logistic function. Finally, each confidence value is converted to either an inclusion or exclusion prediction by applying a user‐chosen threshold. The time necessary to update a model depends on the total number of labelled references. It is shorter (a few seconds) at the beginning of screening and longer towards the end of a screening process. For example, training on 5000 references can be completed in 1  minute. Furthermore, a user can continue to screen references using the current prioritisation after triggering the model update process, avoiding any lost time.
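As a small illustration of the confidence step above, assuming a fitted linear classifier `clf` and a feature matrix `X` such as those in the earlier sketches (scikit-learn's LinearSVC wraps LIBLINEAR), the raw decision values can be mapped to confidences and thresholded predictions as follows; the function name and default threshold are illustrative.

```python
import numpy as np

def inclusion_confidence(clf, X, threshold=0.5):
    """Map raw SVM decision values to (0, 1) confidences and thresholded predictions."""
    scores = clf.decision_function(X)            # signed distance from the separating hyperplane
    confidence = 1.0 / (1.0 + np.exp(-scores))   # logistic function
    return confidence, confidence >= threshold   # user-chosen inclusion threshold
```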

Evaluation

To assess the usefulness of RobotAnalyst for accelerating screening, we performed a real-world evaluation by asking reviewers at two institutions to use the system to perform screening tasks. During the evaluation period (reviews were started on or after September 2015 and completed by August 2017), 22 collections were screened completely, ie, a reviewer made a relevance decision for every reference. All collections were used with no post hoc selection. The collections screened vary in many aspects, as detailed in Table 1. Most of the screening tasks were performed at NICE; the remaining two were conducted at IUMSP to inform patient safety and quality of hospital care for older patients in Switzerland. The collections differed greatly in size, ranging from a small collection of 86 references to a large one of almost 5000. The percentage of relevant references varied from 0.28% (just 6 out of 2148) to almost 30%. An extremely low relevancy rate represents a challenge for machine learning classification, since the prioritisation can only be used once at least one relevant reference has been screened.
Table 1

Details of reference collections used in the evaluation experiments, including their topical areas (surv denotes surveillance reviews), origin (with relevant NICE guideline if applicable), overall size, and percentage of relevant references (averaged in case of parallel reviews)

Collection | Topic | Origin | Size | Specificity, %
TUB | Tuberculosis | NICE (a) | 4678 | 2.42
BC | Behaviour change: individual approaches | NICE (b) | 1502 | 13.72
BC-S | Behaviour change: individual approaches (surv) | NICE (b) | 937 | 21.66
BC-C | Choice architecture in behaviour change (surv) | NICE (b) | 959 | 15.33
WC-D | Walking and cycling (surv, database search) | NICE (c) | 304 | 27.30
WC-C | Walking and cycling (surv, citation search) | NICE (c) | 468 | 12.18
WC-F | Walking and cycling (surv, focused search) | NICE (c) | 86 | 9.30
PAP | Physical activity and pregnancy | NICE | 320 | 11.88
WGP | Weight gain and pregnancy | NICE | 110 | 11.82
PW-S | Preventing excess weight gain (surv, self-weighing) | NICE (d) | 157 | 8.28
PW-E | Preventing excess weight gain (surv, eating patterns) | NICE (d) | 719 | 5.15
WM | Weight management (surv) | NICE (e) | 665 | 29.62
SH | Sexual health | NICE (f) | 3760 | 1.36
QSH | Quality and safety in hospitals | IUMSP | 4964 | 18.63
LD | Learning difficulties | NICE | 2148 | 0.28
OCM | Osteoarthritis: care and management (surv) | NICE (g) | 2986 | 15.00
HB | Hepatitis B: diagnosis and management (surv) | NICE (h) | 1523 | 3.81

(a) Guideline: https://www.nice.org.uk/guidance/ng33
(b) Guideline: https://www.nice.org.uk/guidance/ph49
(c) Guideline: https://www.nice.org.uk/guidance/ph41
(d) Guideline: https://www.nice.org.uk/guidance/ng7
(e) Guideline: https://www.nice.org.uk/guidance/ph47
(f) Guideline: https://www.nice.org.uk/guidance/ng68
(g) Guideline: https://www.nice.org.uk/guidance/cg177
(h) Guideline: https://www.nice.org.uk/guidance/cg165


Screening tasks

Two experiments were conducted: controlled and unconstrained. In the controlled experiment, two collections, Tuberculosis and Behaviour change, were each screened by three independent reviewers, each following a procedure that used a defined subset of the system's functionality: screen all references using relevancy-based active learning (AL); choose topics whose descriptive keywords match a user's keyword list defined a priori, screen all of these topics' references, and then continue with the rest of the collection in random order (topics); or a combination of the above procedures, ie, firstly screen references from relevant topics and then continue according to relevancy-based active learning (topics + AL). The aim of these experiments was to compare the performance of these techniques in prioritising relevant references. In the unconstrained experiments, reviewers, having been familiarised with the system's capabilities, used the system to perform various new abstract screening tasks. The reviewers were free to use any features of the system they found helpful in finding relevant references (descriptive clustering was not enabled). A total of 16 screenings were performed in this manner without controlling how the reviewer performed the screening. By examining the database records, we could determine which screenings were AL-based, ie, when reference selection was driven primarily by relevancy-based prioritisation, with other search functions used either very infrequently or not at all. Even in an AL-based screening, at the beginning of the task the reviewer had to use other capabilities or the default sorting order, which was alphabetical by author name. For the unconstrained experiments, the baseline is screening in random order. One of the collections (Quality and safety in hospitals) was screened four times. Two reviewers screened without any prioritisation using Covidence (https://www.covidence.org/) as the systematic review software. Then two other reviewers (one junior and one senior) used RobotAnalyst. The conflict-resolved decisions from the first two reviewers serve as a baseline decision set, allowing us to assess the performance of the reviewers using prioritisation.

Performance measures

The main objective is to prioritise the references such that the relevant references are identified first. To quantify the performance of the prioritisation, we use two measures: work saved over sampling at 95% (WSS@95%) and the area under the recall curve (AUR). Both measures are based on recall, which is the proportion of all relevant references identified,

\[ \mathrm{recall}(i) = \frac{TP(i)}{TP(i) + FN(i)}, \]

where TP(i) is the number of relevant references found after screening i references and FN(i) is the number of relevant references that have not yet been screened. During screening, FN(i) is unknown; consequently, recall can only be computed once the collection has been fully screened. Specifically, WSS@95% measures the percentage of the collection that does not need to be screened if the reviewer were to stop screening upon achieving 95% recall, compared with screening in random order:9

\[ \mathrm{WSS@95\%} = \frac{N - TP(i_{R95}) - FP(i_{R95})}{N} - 0.05, \]

where N denotes the number of references, TP(i) and FP(i) are the number of relevant and irrelevant references, respectively, found after screening i references (TP(i) + FP(i) = i), and i_{R95} denotes the number of references screened when 95% recall is first achieved. Consequently, WSS@95% can only be computed after the entire collection is screened, and in practice, it is not possible for a reviewer to know exactly when the desired recall has been achieved.

To test the significance of the prioritisation, we use an exact test based on the distribution of WSS@95% under the assumption of a random ordering of the references. The probability that 95% recall is achieved after screening i references (ie, TP(i) = r = ⌈0.95R⌉ inclusions, where R is the total number of relevant references) is given as

\[ P(i_{R95} = i) = \frac{\binom{R}{r-1}\binom{N-R}{i-r}}{\binom{N}{i-1}} \cdot \frac{R - r + 1}{N - i + 1}. \]

The first term in the expression corresponds to the probability, given by the hypergeometric distribution, of a sample of i − 1 references (taken without replacement) having r − 1 inclusions, and the second term is the probability of observing the rth inclusion as the ith reference. The P value is the probability of achieving 95% recall with i_{R95} or fewer references under the null hypothesis, ie, the sum of P(i_{R95} = i) over all i up to the observed i_{R95}.

WSS@95% focuses on the workload at the fixed moment when 95% of relevant references are found, but depending on the review type, eg, a scoping review or an update, the workload required at different recall levels may be more important. For example, a reviewer may be interested in how quickly the assisted screening can find 10% or 99% of the relevant references. This motivates the AUR metric, which averages the performance across all recall levels:81

\[ \mathrm{AUR} = \frac{\sum_{i=1}^{N} TP(i)}{R \left( N - \frac{R-1}{2} \right)}. \]

Because of the normalisation factor R(N − (R − 1)/2), the optimal value of AUR equals 1 for perfect prioritisation, ie, when all relevant references are screened before any exclusions.
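All three quantities can be computed directly from the sequence of screening decisions once a collection has been fully screened. The following is a minimal sketch in Python (not the authors' implementation; function names and the toy example are illustrative), where the input is the list of decisions in screening order with True marking a relevant reference.

```python
from math import ceil, comb

def wss_at_95(decisions):
    """Work saved over sampling at 95% recall, plus the point where it was reached."""
    N, R = len(decisions), sum(decisions)
    r = ceil(0.95 * R)
    tp = 0
    for i, included in enumerate(decisions, start=1):
        tp += included
        if tp >= r:
            return (N - i) / N - 0.05, i
    raise ValueError("95% recall never reached")

def exact_p_value(N, R, i_r95):
    """P(reaching 95% recall within i_r95 screened references) under a random ordering."""
    r = ceil(0.95 * R)
    p = 0.0
    for i in range(r, i_r95 + 1):
        hyper = comb(R, r - 1) * comb(N - R, i - r) / comb(N, i - 1)
        p += hyper * (R - r + 1) / (N - i + 1)
    return p

def aur(decisions):
    """Normalised area under the cumulative recall curve (1.0 = perfect prioritisation)."""
    N, R = len(decisions), sum(decisions)
    tp, total = 0, 0
    for included in decisions:
        tp += included
        total += tp
    return total / (R * (N - (R - 1) / 2))

# Toy example: 3 relevant references among 10, found early by the prioritisation.
order = [True, False, True, True, False, False, False, False, False, False]
wss, i_r95 = wss_at_95(order)
print(wss, exact_p_value(len(order), sum(order), i_r95), aur(order))
```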

RESULTS

Tables 2 and 3 show the evaluation results of the controlled and unconstrained experiments, respectively. In the cases when reviewers used the system's suggestions (AL-based), the WSS@95% is significantly better than random ordering (exact test, significance level of 0.01). WSS@95% varies greatly between collections (from 6.89% to 70.74%), but the average gain is substantial (42.97%). We can see that the largest gains are achieved for tasks with very low specificity, ie, 1% to 3%, while tasks with more relevant references, ie, 30%, have lower values for work saved. For example, Figure 2 shows the recall and decision time across the whole process for a systematic review on Quality and safety in hospitals (senior reviewer) at IUMSP and a NICE guideline on Sexual health.
Table 2

Results of the controlled experiments performed on two reference collections, each screened using three procedures in parallel, with performance measured using WSS@95% and AUR metrics

Collection | WSS@95% | AUR | Strategy
TUB | 70.74%* | 0.9078 | AL only
TUB | 69.67%* | 0.9196 | topics + AL
TUB | 11.65%* | 0.7699 | topics only
BC | 29.89%* | 0.7983 | AL only
BC | 46.53%* | 0.8040 | topics + AL
BC | -1.80% | 0.4729 | topics only

Values of WSS@95%, which were significantly greater than expected by random sampling (exact test, significance level of 0.01), are starred.

Table 3

Results of the unconstrained experiments performed, each involving screening a collection by a junior or senior reviewer using all the features of the system, with performance measured using WSS@95% and AUR metrics, grouped by whether relevancy‐based screening (AL‐based) prioritisation was used throughout  

Collection | Reviewer | WSS@95% | AUR | AL-based
BC-C | Senior | 6.89%* | 0.7276 | Yes
WC-D | Senior | 29.54%* | 0.8477 | Yes
WC-C | Senior | 22.35%* | 0.7904 | Yes
PAP | Senior | 40.63%* | 0.8398 | Yes
WGP | Senior | 36.82%* | 0.7893 | Yes
PW-S | Senior | 63.15%* | 0.8285 | Yes
PW-E | Senior | 38.81%* | 0.8369 | Yes
WM | Senior | 23.72%* | 0.8374 | Yes
SH | Senior | 66.17%* | 0.8858 | Yes
QSH | Junior | 39.84%* | 0.8914 | Yes
QSH | Senior | 31.32%* | 0.8818 | Yes
LD | Senior | 50.45%* | 0.9058 | Yes
OCM | Senior | 63.99%* | 0.9377 | Yes
BC-S | Senior | 9.41%* | 0.6519 | No
WC-F | Senior | 8.95% | 0.5244 | No
HB | Junior | -3.62% | 0.7347 | No

Values of WSS@95% which were significantly greater than expected by random sampling (exact test, significance level of 0.01) are starred.

Figure 2

Cumulative recall curves and median decision times for three screening tasks. The times are smoothed by using the medians within a sliding window of 51 interdecision intervals. A graphical depiction of WSS@95% is shown as the difference between the recall curve and recall expected under a randomly sampled ordering of the references

In both cases, recall increases more rapidly than would be expected with random sampling, and the decision time decreases as the number of relevant references diminishes in later stages of the process. Gains in both measures are more prominent when the relevant references are rarer, such as Sexual health, with 1.36% specificity, as compared with Quality and safety in hospitals with 18.63% specificity. In fact, in the former case, 100% recall is achieved after screening just 29.84% of the collection. In contrast, when reviewers did not use AL-based prioritisation, eg, Behaviour change: individual approaches (surv), as also shown in Figure 2, the gains measured by WSS@95% are lower or nonexistent. When reviewers relied on manual keyword or topic-based search (not AL-based), WSS@95% does not show gains, but the AUR values are positive (values above 0.5). This is because these techniques enabled a user to find more relevant references at the beginning, within the search results that matched a topic or keyword, than with random sampling, which increases the AUR. Subsequently, after these returned results were screened, reviewers defaulted to random sampling, and the early positive effect is less apparent with WSS@95%. We can rely on the controlled experiments to verify this: for example, Figure 3 compares recall curves for two screening processes of the same collection (Tuberculosis): one was based on active learning, and the other relied on browsing discovered topics. At the beginning, while the reviewer was able to choose references belonging to topics that seemed relevant (solid line), the increase in recall is similar to the one resulting from active learning-based screening. However, when those topics were depleted (there are only 45 references per topic, and some may belong to multiple topics), the remainder of the collection was screened with no prioritisation (dotted line), which resulted in a slower return of relevant references. On the same collection, the third reviewer started by screening topics and then proceeded to active learning, achieving essentially the same performance; this combination outperformed strictly active learning for the Behaviour change collection.
Figure 3

Cumulative recall curves for the Tuberculosis collection when using active learning versus topic‐based screening at the beginning followed by random sampling for the remainder

The controlled experiment results also show that topic modelling can be effectively used in the initial stages of screening (topics + AL) without degrading the performance versus strictly AL. Using topic-based screening to kick-start the classifier had either a neutral (Tuberculosis) or positive (Behaviour change) effect. Although not yet evaluated for its prioritisation performance, descriptive clustering provides a topical organisation of all references within the collection enabling a reviewer to screen initially by clusters instead of topics. Furthermore, once a model has been trained, it can be used to prioritise the clusters by relevance and also prioritise the references within a cluster. An evaluation of the quality of descriptive clustering is given in Appendix B. Finally, we examine the performance of screening with prioritisation versus baseline screening. Figure 4 shows the screening performance across the Quality and safety in hospitals collection by a senior and junior reviewer in terms of running recall, which is the recall with respect to the baseline decision set measured in a moving window of 50 relevant references. For both reviewers, the recall is lower (more relevant references in the baseline are marked for exclusion) in the later stages of the review. In the case of the senior reviewer, the decrease occurs later and the recall is not as low, resulting in higher overall agreement with the baseline.
Figure 4

Running recall curves for the Hospital care quality collection, computed by comparing the decisions made by the senior and junior reviewer at a given stage of the process to a baseline decision set (see explanation in text)


DISCUSSION

Based on the evaluation results, the potential for saving work by machine learning-powered prioritisation, as predicted by previous simulations, has been confirmed in this real-world user study. Like the previously reported results, we have observed that the gains depend on the particular reference collection: from 7% to 71% of the screening effort could be saved. The prioritisation succeeded in suggesting relevant references across a variety of reviews from the public health domain, which has less clearly defined and more complex inclusion criteria than, for example, drug effectiveness reviews.9 Most importantly, the screening was conducted by the systematic reviewers from start to finish, without any intervention by the algorithm designers or modification to the software. This was possible because of the intuitive and user-friendly Web interface for searching and organising references. However, despite this confirmation of the maturity of the tool, there are still open questions on best practices for using RobotAnalyst to realise its full potential.

The functionality provided by RobotAnalyst can be used in multiple ways beyond the start-to-finish screening explored in the user study. For example, the ability to prioritise and screen by clusters is an interesting scheme that may be useful for scoping reviews, or as a way to organise and assess a preliminary literature search strategy. In the case of a rapid review,82 it may not be essential to finish screening once enough relevant evidence has been identified. Other use cases include performing review updates or deciding if a review needs to be updated.83 For these cases, a user can upload screened references (inclusions and exclusions) from a previous review along with new references retrieved by literature search. Training and applying the classification model can prioritise the new references, and only the predicted inclusions need to be screened.

A major challenge to partial screening is the lack of a reliable threshold to stop screening. This point is ignored when computing an evaluation measure like WSS@95%, since the measure is based on the assumption that a user readily knows when sufficient recall has been reached; however, in reality, this is not the case. The reported WSS@95% serves as an upper bound on possible work savings with a stopping criterion. In practice, more screening will be necessary to achieve the same level of recall, as any stopping criterion would need to see a series consisting mainly of exclusions before signalling to stop. Deciding when to stop can be left to the end user, eg, if the prioritised references are consistently irrelevant, or heuristics84 or statistical approaches (based on an unbiased sample of the remaining references) can inform this decision. We intend to investigate this problem in the future, working in close cooperation with systematic reviewers to ensure the solution enables them to stop confidently and improve efficiency.

Another limitation of WSS@95% is that it measures savings in work in terms of the number of screened references, while in practice, a reviewer's time is the resource of interest. The two are not equivalent since time per decision is not constant, as we can observe in Figure 2. In prioritised (AL-based) screening tasks, the median decision time is substantially lower in the late stages of the process. On one hand, this could be explained by the fact that the late decisions are mostly exclusions, which are easier to make.
On the other hand, the prioritisation can have an impact as well: having seen most of the relevant documents, a reviewer could have a better understanding of the task, which can lead to faster decisions. Community feedback is necessary to form best practices for technology-assisted reviews. RobotAnalyst could be incorporated without changing the screening workflow, alleviating some of the burden of screening large reference collections that would otherwise be screened without prioritisation. The results of prioritisation can also be useful for distributing references across a screening team.14 Additionally, qualitative feedback from the evaluation indicates that users enjoy the ability to identify relevant references earlier, since they find it less mentally taxing to screen the remaining references having recognised the vast majority of the relevant ones. However, the running recall measurement shown in Figure 4 emphasises a possible danger with machine-assisted prioritisation: depending on their experience, users may become too complacent with the machine predictions, such that references that appear later in the screening are perceived as less relevant by default, although they may meet inclusion criteria that the machine failed to recognise because of a lack of any prior examples. Thus, the user must remain vigilant throughout the screening process as, presumably, it is easier to miss a relevant reference when it is surrounded by mostly irrelevant ones. Nonetheless, the system offers functionality for ad hoc quality assurance via random sampling to mitigate this risk. In this way, unbiased samples can be generated for checking batches of references. Alternatively, these random samples could be used to gauge the level of specificity in the collection and estimate recall. The functionality could also be extended to track the time spent screening each reference, which is essential to assess the cost-effectiveness of systematic reviews.7

Our evaluation has been conducted primarily on public health review questions. Public health questions are complex, involving behaviour, culture, and organisations, and often need to be described using abstract, fuzzy terminology. Screening in this setting is arguably more challenging than with clinical research questions, which may have more well-defined populations, interventions, comparators, and outcomes. Techniques like query expansion, topic modelling, and descriptive clustering can help explore terminological variation, and machine learning can handle diverse vocabularies. Our evaluation of descriptive clustering and topic modelling reported in Appendix B focuses on the coherence of these organisation techniques and whether the descriptions were meaningful. Our hypothesis is that a user could use the clustering to find relevant references when beginning screening, an assumption that has been confirmed for topic modelling but needs evaluation for descriptive clustering. Alternatively, RobotAnalyst's search capabilities allow a user to perform a focused search via keywords, clusters, or topics, to find relevant references that may have distinct vocabulary and are not being prioritised by the model. That is, focused and topical search capabilities may be crucial for initialisation and for ensuring complete coverage when used in conjunction with the automatic prioritisation. Future work should consider a controlled study of the impact of the initial choice of references on the active learning performance.
This would require the same review to be initiated several times by independent users, each using randomly assigned search strategies (keyword, cluster, or topic based) within a collection. While this work used a single model to prioritise references, it may be useful to explore the case of multiple models for different PICO elements or inclusion criteria. Even in public health, user feedback from NICE reviewers indicated that while the system was found to perform very well for single-PICO (ie, singular review question) screening, there was room for improvement with collections covering multiple review questions. Evaluating the performance of RobotAnalyst for focused clinical reviews is another direction of future work. For reviews with clear PICO-based criteria, it may be necessary to use features with more clinical specificity, such as MeSH terms or automatically recognised entities such as drug names, proteins, and genes. Using information extraction techniques for targeting such entities in full article text would support inclusion criteria that cannot be verified based only on the abstract. Using full text provides more content for the classification model, clustering, and topic modelling. However, computational processing and storage requirements would be significantly higher for full text. Furthermore, processing full text is challenging because of access costs, copyright limitations on third-party processing and storage, and the technical challenges of reliably extracting text from PDF documents. To achieve uniform screening, these issues would have to be overcome for every reference in a collection.

In summary, this study confirms that references can be reliably prioritised in public health and points to the potential benefit of incorporating tools such as search within a collection, topic modelling, and descriptive clustering to aid initial screening or to ensure coverage of a collection. The study has a number of limitations, including the following:

- The potential work savings is an upper bound on what is achievable in reality. Reducing the number of screened references requires a stopping criterion and carries the associated risk of missed relevant references, which may be unacceptable in certain cases.
- In terms of the time spent screening, the savings may be higher or lower than what would be indicated by the work saved in terms of the number of references, since screening times vary per decision and throughout the process.
- Issues with complacency may be inflating the work savings estimates, if truly relevant references were missed late in the screening process.
- Further study is needed to compare using descriptive clustering, topic modelling, and keyword search to find the initial set of references before active learning prioritisation.
- The evaluation primarily centred on public health reviews. It would be worthwhile to evaluate the work savings in other review domains.

Further research and evaluation studies can address several of these points. To be efficient, the studies should be performed on prospective reviews that need multiple screenings to facilitate paired comparisons. Machine-assisted prioritisation for systematic reviews is a paradigm shift away from the traditional, manually labour-intensive approach. The potential time savings of using prioritisation are considerable for large collections with low specificity. The systematic review community needs to embrace the new technology to improve efficiency4 and support further innovation through participation and community feedback.

CONCLUSION

We have presented a description and evaluation of RobotAnalyst as a tool to screen and organise references for systematic reviews on public health and health services research topics. The evaluation was the first of its kind in terms of multiple new reviews completed from start to finish within RobotAnalyst. The results indicate that substantial gains can be made by using machine learning to actively prioritise relevant references. The promising results for descriptive clustering highlight another avenue for exploring large reference collections. Currently, RobotAnalyst provides functionality for searching within a collection and browsing subsets of the results using semantic similarity based on terminology or topics. These new interfaces may help reviewers screen large collections of disparate references arising from complex review questions. While it is possible to extend and enhance the functionality, there is ample evidence to suggest that machine learning techniques for prioritising references in systematic reviews have matured with multiple systems available to end‐users. More prospective evaluations and open discussions are needed to spur the community to adopt tools like RobotAnalyst as the default, rather than the exception.
Table A1

Descriptive clustering outlier detection accuracy of six reviewers split between two collections (the average accuracy per collection in parentheses)

                        Behaviour change                              Tuberculosis
                        Reviewer 1  Reviewer 2  Reviewer 3  (Mean)    Reviewer 4  Reviewer 5  Reviewer 6  (Mean)
Spectral clustering     75%         69%         92%         (78.67%)  49%         83%         69%         (67%)
Topic modelling         63%         15%         75%         (51%)     30%         58%         38%         (42%)

Reviewers used any apparent coherence of the references and the description to choose an outlier reference. Accuracy is computed for 100 tasks with a chance rate of 20%.
