Max Wolf1, Jens Krause2, Patricia A Carney3, Andy Bogart4, Ralf H J M Kurvers1. 1. Leibniz Institute of Freshwater Ecology and Inland Fisheries, Müggelseedamm 310, 12587, Berlin, Germany. 2. Leibniz Institute of Freshwater Ecology and Inland Fisheries, Müggelseedamm 310, 12587, Berlin, Germany; Faculty of Life Sciences, Humboldt-University of Berlin, Berlin, Germany. 3. Departments of Family Medicine and Public Health & Preventive Medicine, Knight Cancer Institute, Oregon Health & Science University, 3181 S.W. Sam Jackson Park Road, Portland, Oregon, United States of America. 4. RAND Corporation, 1776 Main Street, Santa Monica, CA, 90407-2138, United States of America.
Abstract
While collective intelligence (CI) is a powerful approach to increase decision accuracy, few attempts have been made to unlock its potential in medical decision-making. Here we investigated the performance of three well-known collective intelligence rules ("majority", "quorum", and "weighted quorum") when applied to mammography screening. For any particular mammogram, these rules aggregate the independent assessments of multiple radiologists into a single decision (recall the patient for additional workup or not). We found that, compared to single radiologists, any of these CI-rules both increases true positives (i.e., recalls of patients with cancer) and decreases false positives (i.e., recalls of patients without cancer), thereby overcoming one of the fundamental limitations to decision accuracy that individual radiologists face. Importantly, we find that all CI-rules systematically outperform even the best-performing individual radiologist in the respective group. Our findings demonstrate that CI can be employed to improve mammography screening; similarly, CI may have the potential to improve medical decision-making in a much wider range of contexts, including many areas of diagnostic imaging and, more generally, diagnostic decisions that are based on the subjective interpretation of evidence.
Beliefs in individual experts and genius are deeply ingrained in western societies, yet research on collective intelligence has shown that groups can often outperform individuals when solving cognitive problems [1-4]. One of the earliest examples is provided by Galton [5], who showed that the weight of an ox can be determined almost perfectly by pooling a large number of individual guesses. Over the last decades, impressive feats of collective intelligence (CI) have been described in a wide range of animal species including microbes, insects, fish, birds and humans [6-12]. In the human domain, a key challenge lies in identifying those contexts where decisions can be improved with CI, and in designing decision-making systems that unlock this potential [13-15].

To date, relatively few studies have applied CI to the field of medical decision-making (but see [16-18]). Here we investigated the scope for CI among radiologists independently interpreting mammograms. Breast cancer is the most prevalent cancer type in females and currently accounts for 29% of all new cancer cases in the U.S., with relatively consistent incidence rates since 2003 [19]. Mammography is the prime diagnostic tool for early detection of breast cancer and also the most commonly used radiological screening method. When interpreting screening mammograms, physicians face a trade-off between achieving a high true positive rate (i.e., the proportion of cancer cases correctly recalled for additional workup, or sensitivity) and a low false positive rate (i.e., the proportion of non-cancer cases incorrectly recalled for additional workup, or 1−specificity) [20,21]. The most common evaluation methods are interpretation by a single radiologist, as done in the U.S., and independent double reading by two radiologists combined with consensus discussion in cases of discordant opinions, as done in Europe [22].
We stress that, despite substantial improvements in mammography screening, considerable scope for CI remains. According to the current Mammography Factsheet of the National Cancer Institute (U.S. Department of Health and Human Services), for example, "screening mammograms miss about 20 percent of breast cancers that are present at the time of screening" [23].
Materials and Methods
In order to assess the potential for CI in mammography screening, we investigated the performance of three well-known CI-rules (Table 1). For any given mammogram, these rules integrate the independent assessments of multiple radiologists into a final decision (i.e., recall the patient for additional workup or not). The three rules differ in how they aggregate the individual assessments (‘recall’ or ‘no recall’) into a final decision and how much prior knowledge is required for their implementation (Table 1). Specifically, these rules prescribe that a patient is recalled whenever (i) a majority of the independent individual assessments is ‘recall’ (‘majority’), (ii) the frequency of independent individual assessments for ‘recall’ is higher than a pre-established quorum threshold (‘quorum’) and (iii) the frequency of the weighted independent individual assessments for ‘recall’ is higher than a pre-established quorum threshold (‘weighted quorum’). Importantly, all three CI-rules are predicted to increase the decision accuracy of groups compared to single decision makers [24-26].
Table 1
Three CI-rules
Decision rule
Promotes collective intelligence whenever
Information requirement
Majority
Recall patient whenever a majority of the assessments is ‘recall’.
Each individual decision maker has an accuracy above 50%.
None.
Quorum
Recall patient whenever the fraction of the ‘recall’ assessments is higher than the pre-established quorum threshold.
The quorum threshold is set between the average true and false positive rate of decision makers.
The average true and false positive rate of decision makers.
Weighted quorum
As ‘quorum’, but the votes of individual decision makers are weighted according to their individual performance.
As ‘quorum’.
The accuracy of individual decision makers.
1 Note that these CI-rules are a sequence of increasingly complex rules: the majority rule is a special case of the quorum rule with the quorum threshold set to 0.5, and the quorum rule is a special case of the weighted quorum rule with the individual weights set to 1.0.
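The nesting described in the footnote (majority ⊂ quorum ⊂ weighted quorum) can be made concrete in code. The following is a minimal sketch under our own conventions (function names and the 0/1 vote encoding are illustrative, not from the study):

```python
def weighted_quorum_recall(votes, weights, threshold):
    """Recall the patient if the weighted fraction of 'recall' votes
    meets or exceeds the quorum threshold.
    votes: list of 0/1 individual assessments (1 = 'recall').
    weights: one non-negative weight per radiologist.
    threshold: quorum threshold in [0, 1]."""
    total = sum(weights)
    weighted_fraction = sum(w for v, w in zip(votes, weights) if v) / total
    return weighted_fraction >= threshold

def quorum_recall(votes, threshold):
    # Quorum rule: the weighted quorum rule with all weights set to 1.
    return weighted_quorum_recall(votes, [1.0] * len(votes), threshold)

def majority_recall(votes):
    # Majority rule: the quorum rule with the threshold fixed at 0.5
    # (with odd group sizes, ties cannot occur).
    return quorum_recall(votes, 0.5)
```

Expressing the simpler rules as special cases of the most general one mirrors the footnote's point that the three rules form a sequence of increasing complexity.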
To investigate the performance of these CI-rules, we used one of the largest available mammography datasets, consisting of 16,813 interpretations by 101 radiologists of 182 mammograms in a test set study setting [27,28]. All cases included in the test set were randomly selected from screening examinations performed on women aged 40 to 69 between 2000 and 2003 from six U.S. mammography registries participating in the Breast Cancer Surveillance Consortium (S1 Text). Each screening examination included both current and previous views for comparison, consisting of craniocaudal (CC) and mediolateral oblique (MLO) views of each breast (4 views per woman for each of the screening and comparison examinations), which is standard practice in the U.S. [27]. As this dataset contains the independent assessments by multiple radiologists of the same mammogram (mean number of independent readings per mammogram = 92), and the true status of each mammogram (S1 Text), it allows us to investigate the performance of the above CI-rules. We stress that, while the above dataset has recently been used to investigate the performance of individual radiologists [29-31], up to now, its potential to investigate CI in mammography screening has not yet been harnessed.

Throughout, we composed groups of size n (range: 1 to 15) by randomly drawing n radiologists from the total pool of radiologists. To assess the performance of the majority rule (Table 1) and how this depends on group size, we determined, for each mammogram, whether the majority of the n radiologists classified the mammogram as ‘recall’ or ‘no recall’. We only used odd group sizes to avoid the need for a tie-breaker rule.
After classifying all mammograms in this way (i.e., following the majority), we used the known cancer status of each mammogram to calculate the average true and false positive rate and the overall accuracy (i.e., the proportion of mammograms that are correctly classified) achieved when employing this procedure. To assess the performance of groups that employ the quorum rule (Table 1), we randomly assigned half of the mammograms to a training set and the other half to a validation set. The training set was used to calibrate the quorum threshold which, in order to achieve gains in both true and false positives, has to be set below the average true positive rate and above the average false positive rate of the individual radiologists [26]. We thus calculated the average true and false positive rate of the n radiologists in the training set and set the quorum threshold halfway between these values (see below for alternative ways of setting the quorum threshold). We then evaluated the performance of the quorum rule in the validation set. For each mammogram in the validation set we determined the fraction of the n radiologists that classified the mammogram as ‘recall’. If this fraction was higher than or equal to the quorum threshold, the mammogram was classified as ‘recall’; if not, as ‘no recall’. After classifying all mammograms in the validation set, we used the known cancer status of each mammogram to calculate the average true and false positive rate and the overall accuracy of the quorum rule. Our weighted quorum rule (Table 1) resembles the quorum rule but additionally prescribes that the assessment of each radiologist is weighted according to her previous performance (S1 Text). In particular, we employed a commonly used heuristic decision rule for binary choice scenarios, which we have investigated previously [24,32,33]: if the performance of radiologist i is p_i, then her vote is weighted by log(p_i / (1 − p_i)). We thus followed the same procedure as described for the quorum rule, but used the training set to estimate the performance of each radiologist and used these estimates to set the weights in the validation set (S1 Text). For each CI-rule and each group size n, we repeated this procedure 2,500 times (including new and independent assignments of the mammograms to the training set and to the validation set) and then calculated the mean (± SEM). We compared this to the average performance and the performance of the best radiologist in each group (S1 Text).
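One replicate of the train/validation procedure for the quorum rule can be sketched as follows. The data layout (a 0/1 vote matrix and a 0/1 truth vector) is our own illustrative convention, not the original dataset's format:

```python
import random

def quorum_replicate(votes, truth, rng):
    """One replicate of the train/validation procedure for the quorum rule.
    votes: votes[m][r] is the 0/1 assessment of mammogram m by radiologist r
           (1 = 'recall'); truth: truth[m] is the known cancer status.
    Returns (TPR, FPR) of the quorum rule on the validation set."""
    cases = list(range(len(votes)))
    rng.shuffle(cases)
    train, valid = cases[: len(cases) // 2], cases[len(cases) // 2 :]

    # Average true and false positive rate of the radiologists on the
    # training set; the quorum threshold is set halfway between them.
    n_readers = len(votes[0])
    n_pos = sum(truth[m] for m in train)
    n_neg = len(train) - n_pos
    tpr = sum(votes[m][r] for m in train if truth[m]
              for r in range(n_readers)) / (n_readers * max(1, n_pos))
    fpr = sum(votes[m][r] for m in train if not truth[m]
              for r in range(n_readers)) / (n_readers * max(1, n_neg))
    threshold = (tpr + fpr) / 2

    # Apply the calibrated quorum rule to the validation set.
    tp = fp = pos = neg = 0
    for m in valid:
        recall = sum(votes[m]) / n_readers >= threshold
        if truth[m]:
            pos += 1
            tp += recall
        else:
            neg += 1
            fp += recall
    return tp / max(1, pos), fp / max(1, neg)
```

In the study this procedure is repeated 2,500 times per group size, with a fresh random split each time, and the results averaged.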
Ethics Statement
Data used for this research were collected during the course of a mammographic test set study designed to evaluate an intervention designed to improve mammographic accuracy among community radiologists [27,31]. The study involved radiologists from U.S. breast cancer registries including Carolina Mammography Registry, San Francisco Mammography Registry, New Hampshire Mammography Network, New Mexico Mammography Project, Vermont Breast Cancer Surveillance System, and Group Health Cooperative in western Washington, all affiliated with the Breast Cancer Surveillance Consortium (BCSC). The data were assembled at the BCSC Statistical Coordinating Center (SCC) in Seattle and analysed at the Leibniz-Institute of Freshwater Ecology and Inland Fisheries (IGB) in Berlin, Germany. All registries as well as the SCC and IGB received institutional review board approval for either active or passive consenting processes or a waiver of consent to enroll participants, pool data and perform statistical analysis. All procedures are in accordance with the Health Insurance Portability and Accountability Act and all data were anonymized to protect the identities of women, radiologists and facilities.
Results and Discussion
Fig 1 shows the results from these analyses. We find that, as group size increases, all three CI-rules achieve both increases in true positives (Fig 1A) and decreases in false positives (Fig 1B). As a consequence, larger groups make more accurate decisions (Fig 1C). The simultaneous improvements in true and false positives is remarkable, as the trade-off between true and false positives is one of the fundamental constraints of decision accuracy that single radiologists face [20,21]. Our results show that each of the three CI-rules alleviates this constraint. Interestingly, gains achieved from larger group sizes level off around a group size of nine, after which adding more radiologists only has a marginal effect (Fig 1A to 1C). We stress that even relatively small group sizes can achieve substantial performance improvements (Fig 1). As expected, the performance of the highest-performing radiologist (green dots in Fig 1A to 1C) increases with increasing group size. This is because larger groups have a higher likelihood of harbouring high performers. Despite this, we find that groups employing any of the CI rules outperform the best-performing radiologist in that group, achieving more true positives (Fig 1A), fewer false positives (Fig 1B) and thus higher overall accuracy (Fig 1C).
Fig 1
All three CI-rules outperform the best radiologist.
All three CI-rules overcome the trade-off between true and false positives that single radiologists face, and outperform the best radiologist within each group. Shown are mean (± SEM) true positives (A), false positives (B) and accuracy (C) of the three CI-rules, as a function of group size n. The dashed line corresponds to the average individual performance of radiologists (i.e., group size of 1), the green dots correspond to the highest-performing radiologist for a given group size n.
When comparing the overall decision accuracy of the different rules, we find that the weighted quorum rule slightly outperforms the quorum rule and that the quorum rule outperforms the majority rule (Fig 1C). This was to be expected, as these three rules can be seen as a series of increasingly complex rules in which the less complex rules are special cases of the more complex rules: the quorum rule reduces to the majority rule when the quorum threshold is set to 0.5, and the weighted quorum rule reduces to the quorum rule when the individual weights are set to 1. The same performance ranking is found for the false positives (Fig 1B). The fact that the majority rule achieves a higher true positive rate than either quorum rule (Fig 1A) can be explained by the particular way we set the quorum in the above analyses (i.e., halfway between the average true and false positive rate of radiologists), which favours performance gains in false positives (see below).

The majority rule is the simplest and most parsimonious of the three rules, as its implementation does not require any information about the performance of decision makers. In order to implement the quorum rules, estimates of either the average true and false positive rate of decision makers (‘quorum’) or the individual performances (‘weighted quorum’) are needed. However, the quorum rules are more broadly applicable and more flexible than the majority rule.
The majority rule is predicted to promote CI [24,25] only when individual decision makers have a decision accuracy above 50% (as is the case in the present data set; average true positive rate = 0.762, average false positive rate = 0.336). The quorum rules are not constrained by this condition and are predicted to promote CI whenever the quorum is set between the average true and false positive rates of the individuals involved in the decision [26,34]. Moreover, in contrast to the majority rule, the quorum rules can be fine-tuned to put more weight on gains in either true positives, false positives or overall accuracy. This fine-tuning is achieved by adjusting the quorum threshold: lower thresholds increase the true positive rate at the cost of smaller gains in false positives, as fewer radiologists are required to assess a mammogram as ‘recall’ in order to recall a patient. Analogously, higher thresholds improve (i.e., reduce) the false positive rate at the cost of smaller gains in true positives. This basic dependency of the true and false positive rates on the quorum threshold is illustrated in Fig 2, which shows the true and false positive rates of groups of size 15 employing one out of a broad range of fixed quorum thresholds (range: 0 to 1, with increments of 0.05), illustrating the trade-off between the true and false positive rate at the collective level.
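The threshold sweep behind a figure of this kind can be sketched as follows (the data layout is our own illustrative convention; the study's ROC curve in Fig 2 is additionally averaged over 2,500 random group compositions):

```python
def roc_points(votes, truth, step=0.05):
    """Sweep fixed quorum thresholds from 0 to 1 and return the collective
    (FPR, TPR) point for each, tracing the trade-off at the group level.
    votes: votes[m] is the list of 0/1 assessments of mammogram m
    (1 = 'recall'); truth: truth[m] is the known cancer status."""
    points = []
    pos = sum(truth)
    neg = len(truth) - pos
    for i in range(round(1 / step) + 1):
        q = round(i * step, 10)  # avoid float drift in the threshold grid
        tp = fp = 0
        for m, assessments in enumerate(votes):
            recall = sum(assessments) / len(assessments) >= q
            if truth[m]:
                tp += recall
            else:
                fp += recall
        points.append((fp / max(1, neg), tp / max(1, pos)))
    return points
```

A threshold of 0 recalls every patient (upper-right corner of the ROC plane), a threshold of 1 recalls only unanimously flagged cases, and 0.5 reproduces the majority rule.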
Fig 2
ROC curve for the quorum rule and the weighted quorum rule.
Each dot is obtained by setting a different fixed quorum threshold, starting at 0 with increments of 0.05 up to 1. Data are based on a group size of 15 and 2,500 simulations (see main text). The majority rule corresponds to a fixed quorum threshold of 0.5. Note that, while we here consider the consequences of fixed quorum thresholds, the analyses in Fig 1 and Fig 3 are based on flexible quorum thresholds that are estimated from a training set (see main text).
Fig 3
The quorum rule can be fine-tuned to put more weight on improving true positives, false positives or overall accuracy.
The three panels correspond to the three illustrative scenarios where the goal was to: (A) maximize gains in true positives while keeping the false positive rate (FPR) close to the average false positive rate of 0.336 in the data set; (B) minimize the false positive rate while keeping the true positive rate (TPR) at the average true positive rate of 0.762 in the data set; and (C) maximize overall accuracy. As can be seen, on top of the gains achieved by a quorum threshold set halfway between the average true and false positive rate (black dots, corresponding to the values from Fig 1), further improvements in true positives (A), false positives (B) or overall accuracy (C) can be achieved by fine-tuning the quorum threshold appropriately. Shown are mean (± SEM).
To further illustrate the flexibility of the quorum rules, we considered three different scenarios where the goal is to: (i) maximize gains in true positives while keeping the false positive rate close to the average false positive rate of 0.336 in the data set (Fig 3A); (ii) minimize the false positive rate while keeping the true positive rate close to the average true positive rate of 0.762 in the data set (Fig 3B); and (iii) maximize overall accuracy (Fig 3C). For each scenario, we randomly selected groups of n radiologists (range: 1 to 15, odd numbers only) and used half of the mammograms as a training set and half as a validation set. In the training set we determined the quorum thresholds which resulted in (i) a false positive rate of 0.336, (ii) a true positive rate of 0.762 and (iii) maximum overall accuracy (S1 Text). We then investigated the performance of these thresholds in the validation set. Repeating this procedure 2,500 times (for each of the three scenarios, and each group size n) we find that, on top of the gains achieved by a quorum threshold set halfway between the average true and false positive rate (as in Fig 1), further improvements in true positives (Fig 3A), false positives (Fig 3B) or overall accuracy (Fig 3C) can be achieved by fine-tuning the quorum threshold appropriately. For example, while single radiologists on average achieve a true positive rate of 0.762, a group of five radiologists can achieve a true positive rate of 0.861 with a similar false positive rate as single individuals (Fig 3A). Alternatively, while single radiologists on average achieve a false positive rate of 0.336, a group of five radiologists can achieve a false positive rate of 0.226 with a similar true positive rate as single individuals (Fig 3B).
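For scenario (iii), the training-set tuning step amounts to picking, from a grid of candidate thresholds, the one with the highest overall accuracy. A minimal sketch, again using our own illustrative data layout:

```python
def tune_threshold_for_accuracy(votes, truth, candidates):
    """Pick the quorum threshold that maximizes overall accuracy on a
    training set (scenario (iii) above).
    votes: votes[m] is the list of 0/1 assessments of mammogram m
    (1 = 'recall'); truth: truth[m] is the known cancer status;
    candidates: iterable of candidate thresholds to evaluate."""
    def accuracy(q):
        correct = 0
        for assessments, status in zip(votes, truth):
            recall = sum(assessments) / len(assessments) >= q
            correct += (recall == bool(status))
        return correct / len(truth)
    # max() keeps the first candidate in case of ties
    return max(candidates, key=accuracy)

```

Scenarios (i) and (ii) work analogously, except that the threshold is chosen to hit a target false or true positive rate on the training set rather than to maximize accuracy.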
A common practice in the U.S. is to employ single interpretation of mammograms in combination with computer-aided detection (CAD). Compared to single reading without CAD, this practice generally increases true positives while also increasing false positives [35,36]. In contrast, our findings suggest that any of the three CI-rules can increase true positives and decrease false positives simultaneously. A common practice in Europe is independent double reading of mammograms in combination with consensus discussion in cases of discordant opinions. Generally, this practice achieves a higher true positive rate than single reading, and either increases or decreases the false positive rate depending on the recall policy [37-40]. While our data set does not allow us to directly compare the CI-rules to such consensus decisions, we stress that the proposed CI-rules have two key advantages. First, the dynamics of consensus discussions are inherently complex, and prone to well-known performance-reducing effects like groupthink [41,42]. In contrast, the above CI-rules provide highly transparent and algorithmic collective decision rules.
Moreover, they exclude direct interactions between radiologists, thereby avoiding the negative consequences of groupthink and maintaining opinion diversity, a well-known condition for CI [1-4]. Second, consensus decisions often require that specialists meet and discuss, whereas the above mechanisms only require independent assessments, thereby saving valuable time.
Conclusion
Our findings suggest that simple and highly transparent CI-rules can be employed to improve the accuracy of mammography screening. Most likely, at least two factors contribute to the observed CI-effect. First, whenever errors (i.e., false positives and false negatives) are relatively rare and not perfectly correlated between radiologists, the CI-rules can effectively filter out these errors. Second, radiologists typically differ in their experience or ability with particular types of cases, and the CI-rules can exploit this diversity. Of course, the viewing time of specialists is costly and has to be taken into account. In fact, a substantial proportion of mammograms may be unambiguous and may thus not require more than two independent assessments. In such cases, one may envisage a decision tree in which a mammogram is first assessed independently by two radiologists, and only in cases of disagreement is it evaluated using the above CI-rules.

While we have focused here on mammography screening, our findings suggest that CI may have the potential to improve medical decision-making in a much wider range of contexts, including many areas of diagnostic imaging and, more generally, diagnostic decisions that are based on the subjective interpretation of evidence. Intriguingly, next to improving accuracy, CI may also pave the way to shared medical decision-making, thereby relieving doctors of the sole responsibility for single cases.
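The two-stage decision tree suggested above can be sketched as follows (a hypothetical interface of our own, not a procedure evaluated in the study):

```python
def staged_recall(first_two, panel_votes, threshold=0.5):
    """Two-stage screening sketch: two independent reads first; only on
    disagreement is the case escalated to a larger panel decided by a
    quorum rule.
    first_two: pair of 0/1 assessments from the initial readers.
    panel_votes: 0/1 assessments of the escalation panel (used only on
    disagreement); threshold: quorum threshold for the panel."""
    a, b = first_two
    if a == b:  # unanimous initial reads: decide directly, no escalation
        return bool(a)
    # Disagreement: fall back to the collective quorum decision.
    return sum(panel_votes) / len(panel_votes) >= threshold
```

Such a design would reserve the (costly) viewing time of a larger group for the ambiguous minority of cases while leaving clear-cut cases to two readers.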