
The Success of Linear Bootstrapping Models: Decision Domain-, Expertise-, and Criterion-Specific Meta-Analysis.

Esther Kaufmann, Werner W Wittmann.

Abstract

The success of bootstrapping or replacing a human judge with a model (e.g., an equation) has been demonstrated in Paul Meehl's (1954) seminal work and bolstered by the results of several meta-analyses. To date, however, analyses considering different types of meta-analyses as well as the potential dependence of bootstrapping success on the decision domain, the level of expertise of the human judge, and the criterion for what constitutes an accurate decision have been missing from the literature. In this study, we addressed these research gaps by conducting a meta-analysis of lens model studies. We compared the results of a traditional (bare-bones) meta-analysis with findings of a meta-analysis of the success of bootstrap models corrected for various methodological artifacts. In line with previous studies, we found that bootstrapping was more successful than human judgment. Furthermore, bootstrapping was more successful in studies with an objective decision criterion than in studies with subjective or test score criteria. We did not find clear evidence that the success of bootstrapping depended on the decision domain (e.g., education or medicine) or on the judge's level of expertise (novice or expert). Correction of methodological artifacts increased the estimated success of bootstrapping, suggesting that previous analyses without artifact correction (i.e., traditional meta-analyses) may have underestimated the value of bootstrapping models.

Year:  2016        PMID: 27327085      PMCID: PMC4915695          DOI: 10.1371/journal.pone.0157914

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Across a variety of settings, human judges are often replaced or ‘bootstrapped’ by decision-making models (e.g., equations) in order to increase the accuracy of important—and often ambiguous—decisions, such as reaching a medical diagnosis or choosing a candidate for a particular job (see [1]). Before we outline our work on the success of bootstrapping models, it should be noted that the term bootstrapping is applied in a variety of different contexts, for instance for a statistical method of resampling (see [2]). Here we use the term bootstrapping in the same way that it is used in the research on judgment and decision making (see [3]). However, we would like to make the reader aware of its different uses in different contexts. In the judgment and decision-making research on bootstrapping, existing reviews and meta-analyses have suggested that models tend to be more accurate than human judges [4-10]. However, results of previous analyses have also pointed to a wide heterogeneity in the success of bootstrapping [8]. In a previous study [11], we suggested that the success of bootstrapping might depend on the decision domain (e.g., medical or business) as well as on the level of expertise of the decision makers. To date, however, no meta-analysis has systematically evaluated the success of bootstrapping models across different decision domains or based on the expertise of the human decision maker. Furthermore, to date no review has compared the success of bootstrapping models as a function of the type of evaluation criterion for what constitutes an ‘accurate’ decision. We therefore do not know if bootstrapping is more successful if the evaluation criterion is, for instance, objective, subjective, or a test score (e.g., a student’s test score versus a teacher’s judgment of student performance). 
Finally, as previous meta-analyses did not correct for measurement error or other methodological artifacts [9], the extent of possible bias in the results of these analyses is currently unknown. In this study, we conduct a meta-analysis of the success of bootstrapping using the lens model framework. We investigate whether the success of bootstrapping varies across decision domains (e.g., medical or business), the expertise of the human decision maker (expert or novice), or the criterion for a ‘successful decision’ (objective, subjective, or based on a test score). We then compare the results of traditional, ‘bare-bones’ meta-analysis (i.e., only corrected for sampling error, see [12] p. 94) with the results of psychometric meta-analysis in which we were able to correct for a number of potential methodological artifacts [12]. We applied psychometric corrections in a previous paper [11], and we use those psychometrically corrected indices here for a more comprehensive evaluation of bootstrapping models. The psychometric analysis in the present paper therefore builds directly on the results of that earlier analysis. The two papers nevertheless differ in scope and in their criteria for including studies (e.g., our first paper focused on the evaluation of single lens model indexes, whereas our present paper focuses on a combination of lens model indexes). This study also covers issues not considered in our first paper, such as expertise level within domains and evaluation criteria. Hence, this paper extends and supplements the first one. 
The link between the two papers is the second database in this paper (see ‘study identification’ and ‘second database’ below), which we reused from our first paper. Our analytical strategy for the second sample therefore also depends on the analysis presented in our first paper. By adding a further database, an alternative analytical strategy, and a comparison of the results, we scrutinize the validity of our conclusion that bootstrapping is actually successful. This additional check also yields more detailed insights into the success of bootstrapping models. Importantly, the studies in both papers represent exclusively decision-making tasks that mirror actual, real-life decision-making conditions most closely, thus providing the most appropriate evaluation of bootstrapping [13].

Success of bootstrapping: Previous research

The success of bootstrapping models has been evaluated in several reviews, beginning with Meehl’s seminal evaluation in his book, Clinical Versus Statistical Prediction [14]. In this first systematic review of the success of bootstrapping, Meehl summarized 20 studies and concluded that models led to better decisions than decisions made by humans, jumpstarting the “man versus model of man” debate. Since then, several meta-analyses have evaluated the success of bootstrapping, following either a traditional or a lens model approach, as outlined below.

Traditional approaches

Reviews taking a traditional approach have generally concluded that models lead to more accurate decisions than human judgment does, although the results have also pointed to heterogeneity in the success of bootstrapping. For instance, based on the results of a meta-analysis of 136 studies, Grove et al. [7] concluded that model prediction was typically as accurate as or more accurate than human prediction, but they noted that there were also some instances in which human prediction was as good as or even better than model prediction. Notably, the results of Grove et al. [7] were specific to medical and psychological decisions and do not necessarily generalize to other decision domains (e.g., nonhuman outcomes such as horse races, weather, or stock market prices). Tetlock [10] and Aegisdottir et al. [4] reached similar conclusions based on their respective reviews of political predictions and counseling tasks. Finally, focusing on potential domain differences in the success of bootstrap models across psychological, educational, financial, marketing, and personnel decision-making tasks, Armstrong [5] concluded that bootstrapping led to more accurate decisions in eight tasks, less accurate decisions in one task, and equally accurate decisions in two tasks.

Lens model approach

Relative to other approaches, one advantage of using the lens model framework to evaluate the success of bootstrapping is that one can take into account different human judgment and decision-making strategies. Different kinds of models can be used to bootstrap decision processes. Ecological or actual models are based on the past observed relationship between any number of pieces of information (cues) and a particular outcome. An example of an ecological model is when a linear multiple regression equation based on the past observed relationship between a number of cues (e.g., breast tumor, family history) and actual breast cancer disease is used to make a breast cancer diagnosis [9]. Whereas ecological models ignore human judgment and decision-making strategies, bootstrapping models in the lens model approach take into account the different ways in which decision makers integrate different pieces of information to reach a decision (i.e., non-linear vs. linear). With a non-linear decision-making strategy, the decision maker (e.g., physician) uses a single piece of information, such as whether or not a breast tumor is present. The fast-and-frugal heuristic is a well-known non-linear model (e.g., [15]). Although such non-linear models are generally considered to be particularly user friendly (see e.g., [16]), research has predominantly focused on linear bootstrap models that include multiple cues. Hence, in addition to the presence or absence of a breast tumor, a physician might also consider additional information, such as whether there is a family history of breast cancer. Taking into account such a linear decision-making strategy is also possible within the lens model framework [17, 18]. Thus, using the lens model framework to analyze the success of linear bootstrap models offers the best way to evaluate the success of bootstrapping.

The success of bootstrapping by lens model indices

Within the lens model framework, the success or ‘judgment achievement’ of a decision-making process is expressed by the lens model equation, which is a precise, mathematical identity that describes judgment achievement ($r_a$) as the product of knowledge ($G$), environmental validity ($R_e$), and consistency ($R_s$) plus an unmodeled component ($C$) (see Eq 1):

$$r_a = G R_e R_s + C \sqrt{1 - R_e^2} \sqrt{1 - R_s^2} \qquad (1)$$

where $r_a$ = the achievement index (i.e., the correlation between a person’s judgment and a specific criterion), $R_e$ = the environmental validity index (i.e., the multiple correlation of the cues with the criterion), $R_s$ = consistency (i.e., the multiple correlation of the cues with the person’s estimates), $G$ = a knowledge index, which is error-free achievement (i.e., the correlation between the predicted levels of the criterion and the predicted judgments), and $C$ = an unmodeled knowledge component, which is the correlation between the variance not captured by the environmental predictability component or the consistency component (i.e., the correlation between the residuals from the above achievement index). According to Camerer [6] and Goldberg [19], the product of the components knowledge ($G$) and environmental validity ($R_e$) captures the validity of the bootstrapping model. By including the knowledge component ($G$) in the evaluation of the bootstrapping model, we assume that the human judge uses a linear judgment and decision-making strategy, that is, that the judge integrates at least two pieces of information. The degree to which replacing a human judge with a decision-making model improves the success of the decision-making process can be quantified by subtracting judgment achievement from the product term $G R_e$ (see [6] p. 413, see Eq 2):

$$\Delta = G R_e - r_a \qquad (2)$$

Reviews using the lens model framework and the lens model equation have included ecological models (see [9]) as well as models considering the judgment and decision-making strategy (i.e., linear vs. non-linear). 
The classic review by Camerer [6] on the success of linear bootstrap models supported the conclusion that bootstrapping with linear models works well across different types of judgment tasks. However, it should be noted that Camerer [6] included laboratory tasks in his review, in violation of the demand for ecological validity applying to studies in the lens model tradition. The results of the more recent analysis by Karelaia and Hogarth [8] were in line with Camerer [6], although the authors pointed out the high heterogeneity of the success of bootstrapping across tasks and highlighted the need to identify the task and judge characteristics that favor bootstrapping. Previous reviews on lens model indices indicated wide heterogeneity (see [20]) and implied domain differences in lens model statistics (see [21, 11]), suggesting that judgment achievement is different in different decision domains (e.g., medicine, business, education, psychology) and in turn implying that the success of bootstrapping models is also domain-dependent. Indeed, these preliminary results suggesting that the success of bootstrapping was domain-dependent highlight the need for more detailed analysis. Hence, this paper extends our previous paper (see [11]).
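The two quantities described above can be illustrated with a minimal Python sketch. It assumes the standard form of the lens model equation, r_a = G·R_e·R_s + C·sqrt(1 − R_e²)·sqrt(1 − R_s²), and the success index Δ = G·R_e − r_a. The function names and the index values are hypothetical and are not taken from any study in Tables 1 and 2.

```python
import math


def judgment_achievement(G, R_e, R_s, C):
    """Lens model equation: achievement r_a from knowledge (G),
    environmental validity (R_e), consistency (R_s), and the
    unmodeled component (C)."""
    return G * R_e * R_s + C * math.sqrt(1 - R_e**2) * math.sqrt(1 - R_s**2)


def bootstrapping_success(G, R_e, r_a):
    """Delta = G * R_e - r_a; positive values favor the model over the judge."""
    return G * R_e - r_a


# Hypothetical index values chosen only for illustration.
G, R_e, R_s, C = 0.9, 0.8, 0.7, 0.0
r_a = judgment_achievement(G, R_e, R_s, C)
delta = bootstrapping_success(G, R_e, r_a)
print(f"r_a = {r_a:.3f}, model validity = {G * R_e:.3f}, delta = {delta:.3f}")
```

With C set to zero, as in this example, Δ reduces to G·R_e·(1 − R_s), so the model's advantage comes entirely from removing the judge's inconsistency. A nonzero C can shrink or reverse that advantage.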

The present study

In this study, we conduct a meta-analysis of lens model studies to evaluate the success of linear bootstrapping models. Our meta-analysis is unique and extends our previous paper by focusing specifically on differences in the success of bootstrapping based on the decision domain, the expertise of the human decision maker (expert or novice), and the criterion for an accurate decision (objective, subjective, or test score). An analysis of this kind is needed to identify specific contexts in which bootstrapping is likely to be more successful. In addition, in a second evaluation, we use psychometrically corrected lens model values to construct the bootstrapping model. Previous reviews have not corrected for potential artifacts (e.g., measurement error), which potentially leads to biased estimations [9]. We are therefore the first to evaluate the success of psychometrically-corrected bootstrapping models in detail.

Methods

Before describing our study identification strategy and databases in detail, we describe the two different analytical strategies used in this study. As different conditions are required for each analytical strategy, we had two different databases. Hence, we report the process of study identification and give detailed study descriptions for the two databases separately.

Study identification

First database

To identify lens model studies to be included in the meta-analysis, we checked the database by Kaufmann et al. [11] as well as the studies included in Camerer [6] and Kuncel et al. [9] (Fig 1). Please note that Kaufmann et al. [11] focused on artifact correction of the lens model components as opposed to the success of bootstrapping models as in this study; hence, they excluded some of the studies included in Camerer [6] from their database. We excluded all studies with feedback or learning opportunities (e.g., [22]; for details we refer to [11]). We argue that excluding studies in which decision makers received feedback on the accuracy of their decisions is more appropriate for evaluating the success of human judgment accuracy relative to bootstrapping in real-life conditions, in which human decision makers rarely receive such feedback.
Fig 1

The process of identifying relevant studies for the meta-analysis.

Tables 1 and 2 show the lens model studies identified through the search procedure, organized by decision domain and decision-maker expertise (expert versus novice). In sum, 35 studies met the inclusion criteria for the first meta-analysis.
Table 1

Studies included in the meta-analyses by decision domain and decision-maker expertise.

Study | Judges | Number of judgments | Number of cues | Judgment task | Criterion | Task results
a) Medical science, experts:
1) Nystedt & Magnusson [23] | 4 clinical psychologists | 38 | 3 | Judge patients based on patient protocols: I: intelligence; II: ability to establish contact; III: control of affect and impulses | Rating on three psychological tests (■) | I: Δ1 = .11; II: Δ2 = .03; III: Δ3 = .12 (*, +, s)
2) Levi [24] | 9 nuclear medicine physicians | 280 (60 replications) | 5 | Assess probability of significant coronary artery disease based on patient profiles | Coronary angiography | Δ4 = .07 (*, s)
3) LaDuca, Engel, & Chovan [25] | 13 physicians | 30 | 5 | Judge the degree of severity (congestive heart failure) based on patient profiles | A single physician’s judgment (▲) | Δ5 = .08 (*, s)
4) Smith, Gilhooly, & Walker [26] | 40 general practitioners | 20 | 8 | Decision to prescribe an antidepressant based on patient profile | Guideline expert (▲) | Δ6 = -.05 (s)
5a) Einhorn [27], second study (this publication contains two studies) | 3 pathologists | III: 193 | 9 | Evaluate the severity of Hodgkin’s disease based on biopsy slides | Actual number of months of survival | III: Δ7 = -.01 (s)
6a) Grebstein [28] | 10 clinical experts (varying in amounts of clinical experience) | 30 profiles | 10 | Judge Wechsler-Bellevue IQ scores from Rorschach psychograms | IQ test scores (■) | Δ8 = -.17; Δ9 = -.14
5b) Einhorn [27], first study (this publication contains two studies) | 29 clinicians | I: 77 MMPI profiles; II: 181 MMPI profiles | 11 | Judge the degree of neuroticism-psychoticism | Actual diagnosis (■) | Δ10 = .02; Δ11 = -.05 (*, +, s)
7) Todd (1955, see [29]), Note 3 | 10 clinical judges | 78 | 19 | Estimate patient IQ from the Rorschach test | IQ test scores (■) | Δ12 = .05
8) Speroff, Connors, & Dawson [30] | 123 physicians: 105 house staff, 15 fellows, 3 attending physicians | 440 | 32 | Judge intensive care unit patients’ hemodynamic status (physicians’ estimation) | Patients’ actual hemodynamic status | Δ13 = .05 (s)
Novices:
6b) Grebstein [28] | 5 students | 30 | 10 | Judge Wechsler-Bellevue IQ scores from Rorschach psychograms based on paper profiles | IQ test scores (■) | Δ14 = -.19
b) Business science, experts:
9) Ashton [31] | 13 executives, managers, sales personnel | 42 | 5 | Predict advertising sales for Time magazine based on case descriptions | Actual advertising pages sold | Δ15 = .07 (*, +, s)
10) Roose & Doherty [32] | 16 agency managers | 200 / 160 | 64 / 5 | Predict the success of life insurance salesmen based on paper profiles | One-year criterion for success | Δ16 = -.08 (*, +, s)
11) Goldberg [33] | 43 bank loan officers | 60 | 5 | Predict bankruptcy experience based on large corporation profiles | Actual bankruptcy experience | Δ17 = .03
12) Kim, Chung, & Paradice [34] | 3 experienced loan officers | 119 (I: 60 big firms; II: 59 small firms) | 7 | Judge whether a firm would be able to repay the loan requested based on financial profiles | Actual financial data | I: Δ18 = .09; II: Δ19 = .02 (*, +, s)
13) Mear & Firth [35] | 38 professional security analysts | 30 | 10 | Predict security returns based on financial profiles | Actual security returns | Δ20 = .03 (s)
14) Ebert & Kruse [36] | 5 securities analysts | 35 | 22 | Estimate future returns of common stocks | Actual returns | Δ21 = .06
15) Wright [37] | 47 students | 50 | 4 | Predict price changes for stocks from 1970 until 1971 based on paper profiles of securities | Actual stock prices | Δ22 = .06 (*, +, s)
16) Harvey & Harries [38] (1. experiment) | 24 psychology students | 40 | Not known | Forecast sales outcomes based on paper profiles | Actual sales outcome | Δ23 = -.07 (s)
17) Singh, 1990 [39] | 52 business students | 35 | Not known | Estimate of the stock price of a company based on paper profiles | Actual stock prices | Δ24 = .02 (s)
c) Educational science, experts:
18) Dawes [40] | 1 admission committee | 111 | 4 | Admission decision for graduate school based on paper profiles | Faculty ratings of performance in graduate school (▲) | Δ25 = .06
19) Cooksey, Freebody, & Davidson [41] | 20 teachers | 118 | 5 | Judge I: reading comprehension and II: word knowledge of kindergarten children based on paper profiles | I-II: End-of-year test scores (■) | I: Δ26 = .04; II: Δ27 = .04 (*, +, s)
Novices:
20) Wiggins & Kohen [42] | 98 psychology graduate students | 110 | 10 | Forecast first-year-graduate grade point averages based on paper profiles | Actual first-year-graduate grade point averages | Δ28 = .17 (s)
21) Wiggins, Gregory, & Diller, see Dawes and Corrigan [43], repl. Wiggins and Kohen [42] | 41 psychology students | 90 | 10 | Forecast first-year-graduate grade point averages based on paper profiles | Actual first-year-graduate grade point averages | Δ29 = .06
22) Athanasou & Cooksey [44] | 18 technical and further education students | 120 | 20 | Judge whether students are interested in learning based on paper profile | Actual level of students’ interest | Δ30 = .07 (*, +, s)
d) Psychological science, experts:
23) Szucko & Kleinmuntz [45] | 6 experienced polygraph interpreters | 30 | 3–4 | Judge truthful / untruthful response based on polygraph protocols | Actual theft | Δ31 = -.06 (*, +, s)
24) Cooper & Werner [46] | 18 (9 psychologists, 9 case managers) | 33 | 17 | Forecast violent behavior during the first six months of incarceration based on inmates’ data forms | Actual violent behavior during the first six months of imprisonment | Δ32 = .00 (s)
25) Werner, Rose, Murdach, & Yesavage [47] | 5 social workers | 40 | 19 | Predict imminent violence of psychiatric inpatients in the first 7 days following admission based on admission data | Actual violent acts in the first 7 days following admission | Δ33 = .03 (*, +, s)
26) Werner, Rose, & Yesavage [48] | 30 (15 psychologists, 15 psychiatrists) | 40 | 19 | Predict male patients’ violent behavior during the first 7 days following admission based on case material | Actual violence during the first 7 days following admission | Δ34 = .06 (s)
Novices:
27) Gorman, Clover, & Doherty [49] | 8 students | 75: I, III: 50; II, IV: 25 | I, III: 12; II, IV: 6 | Predict students’ scores on an attitude scale (I, II) and a psychology examination (III, IV) based on interviews (I, III) and paper profiles (II, IV) | Actual data: I, II: Attitude scale; III, IV: Examination scale (■) | I: Δ35 = .73; II: Δ36 = .67; III: Δ37 = .01; IV: Δ38 = .29 (*, s) (.08), see Camerer [6]
28) Lehman [50] | 14 students | 40 | 19 | Assess imminent violence of male patients in the first 7 days following admission based on case material | Actual violent acts in the first 7 days following admission | Δ39 = -.01 (*, +, s)
admission based on case materialadmission

▲ = subjective criterion;

■ = test criterion;

(*) = idiographic approach (cumulating across individuals);

(*, +) = both research approaches are considered;

Δ = the success of bootstrapping models (see Eq 2); s = sub-sample of tasks for the second evaluation (psychometric corrected bootstrapping models).

Table 2

Miscellaneous studies included in the meta-analysis.

Study | Judges | Number of judgments | Number of cues | Judgment task | Criterion | Domain | Task results
e) Miscellaneous domains, experts:
29) Stewart [51] | 7 meteorologists | 75 (25) | 6 | Assess probability of hail or severe hail based on radar volume scans | Observed event | Meteorology | Δ40 = -.01 (*, s)
Both experts and novices:
30) Stewart, Roebber, & Bosart [52] | 4 (2 students, 2 experts) | I: 169; II: 178; III: 149; IV: 150 | I: 12; II: 13; III: 24; IV: 24 | Forecast 24-h maximum temperature, 12-h minimum temperature, 12-h precipitation, and 24-h precipitation for each day | I, II: Actual temperature; III, IV: Actual precipitation | Meteorology | I: Δ41 = .00; II: Δ42 = .00; III: Δ43 = .00; IV: Δ44 = .00 (*, +, s)
Novices:
31) Steinmann & Doherty [53] | 22 students | 192 (2 sessions with 96 judgments) | 2 | Decide from which of two randomly chosen bags a sequence of chips had been drawn | A hypothetical “judge” (▲) | Other | Δ45 = .15 (*, s)
32) MacGregor & Slovic [54] | I: 25 students; II: 25 students; III: 26 students; IV: 27 students | I-IV: 40 | 4 | Estimate the time to complete a marathon based on runner profiles | Actual time to complete the marathon | Sport | I: Δ46 = .19; II: Δ47 = .16; III: Δ48 = .23; IV: Δ49 = .24 (s)
33) McClellan, Bernstein, & Garbin [55] | 26 psychology students | 128 | 5 | Estimate magnitude of fins-in and fins-out Mueller Lyer stimuli | Actual magnitude of fins-in and fins-out Mueller Lyer stimuli | Perception | Δ50 = .12 (s)
34) Trailer & Morgan [56] | 75 students | 50 | 11 | Predict the motion of objects based on situations in a questionnaire | Actual motion | Intuitive physics | Δ51 = .10 (*, +, s)
35)Camerer [57]2118Δ52 = .00

▲ = subjective criterion;

(*) = idiographic approach (cumulating across individuals);

(*, +) = both research approaches are considered;

Δ = the success of bootstrapping models (see Eq 2); s = subsample of tasks for the second evaluation (psychometric corrected bootstrapping models).


Second database

A subset of 31 studies in the database described above met the inclusion criteria for the evaluation of artifact-corrected bootstrapping models (see [11]). In Tables 1 and 2, this subset of studies is labeled with an ‘s’ for subsample in the last column. In contrast to the first database, the second database is the same as in Kaufmann et al. [11]. Further details on the construction of our databases, such as our search protocol, are available in Kaufmann [58].

Study descriptions

We identified studies within five decision domains: medical science (8 studies), business science (9 studies), educational science (5 studies), psychological science (6 studies), and miscellaneous (7 studies). Most judgments were based on paper profiles, i.e., written descriptions (see [59]). Overall, the number of cues ranged from two [53] to 64 [32]. The number of decision makers in the studies ranged from three [27, 34] to 123 [30]. The majority of the studies included novice decision makers (predominantly students). The number of decisions ranged from 25 [26] to 440 [30]. The meta-analysis included evaluation of 52 different decision tasks. Tables 1 and 2 also describe the criterion in each study. Notably, some studies included an objective criterion, such as the actual weather temperature (see [52]), and other studies included a subjective criterion, such as a physician’s judgment (see [25]). Subjective criteria are indicated by black triangles in Tables 1 and 2, and test score criteria (e.g., [23]) are indicated by a square. Criteria not specifically labeled are objective criteria.

As Table 1 shows, we identified eight studies within medical science, which included 241 experts (e.g., clinical psychologists) and five novices and 14 different tasks. The studies within the medical science domain included the studies with both the overall lowest and the overall highest number of judgments. In the first study by Einhorn [27], the three pathologists were the only decision makers who based their judgments on real biopsy slides, which represented a more natural situation than the commonly used paper patient profiles.

We identified nine studies within business science, including 40 bootstrapping models by 241 persons for 10 different tasks. Please note that the study by Wright [37] analyzed only the five most accurate judgments made by the 47 persons included at the idiographic level. Studies within business science had the widest range of number of cues (4 to 64). All judgments were based on paper profiles.

We identified five studies within educational science: two studies with expert decision makers and three with novice decision makers. In the two studies with experts, 41 bootstrapping models in three tasks were considered. Cooksey, Freebody, and Davidson [41] included a multivariate lens model design, supplemented with two single lens model designs. In the present analysis, we used the two single lens model designs as two different tasks.

We identified six studies within psychological science, in which 105 bootstrapping models of 81 individuals (including 59 experts) for nine different tasks were available.

Finally, we identified seven studies that did not fit into any of the other domain categories (e.g., studies on the accuracy of weather forecasts). The studies in the miscellaneous domain included data from 258 individuals (9 experts vs. 249 novices) for 13 different tasks and 270 bootstrapping models. Please note that only Stewart, Roebber, and Bosart [52] directly compared novices and experts across four meteorological tasks. It is also the only study within the miscellaneous domain to have analyzed judgment accuracy retrospectively.

In sum, in our meta-analysis we analyzed the results of 35 studies with 1,110 bootstrapping models, 532 experts, and 578 novices judging 52 tasks across five decision domains. This sample also includes 365 bootstrapping procedures at the individual level (idiographic approach) across 28 different tasks. The subset of 31 studies (the second database) with sufficient information for evaluating the success of bootstrapping with psychometrically-corrected lens model indices included 1,007 bootstrapping models, covering 44 tasks across five decision domains (see [11], for more information).

Analytic strategy

Based on our preliminary analysis of the success of individual bootstrapping procedures at the individual level, we now outline our two analytical strategies. Please keep in mind that in each of these analytical strategies, a different sample was included, as described above. Moreover, in line with previous work (see [8, 11]) the analytical level was that of tasks, not studies. The included effect sizes for the success of the model for each task in our meta-analysis can be found in the last column in Table 1.

The success of individual bootstrapping procedures

In meta-analysis, an ecological fallacy may arise because associations between two variables at the group level (or ecological level) may differ from associations between analogous variables measured at the individual level (see [60]). For this reason, we plotted the success of individual bootstrapping procedures first before analyzing the aggregated estimation of success of bootstrapping calculated through meta-analysis (see the next step in the analysis).
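The ecological fallacy mentioned above can be made concrete with a small, purely illustrative Python sketch. The three groups and their values are invented for the example and the helper `pearson` is a hypothetical name; the point is only that a correlation computed on group means can differ in sign from the correlation within every group.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Invented data: within each group, y falls as x rises.
groups = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([4, 5, 6], [6, 5, 4]),
    "C": ([7, 8, 9], [9, 8, 7]),
}

# Individual-level association, computed separately inside each group.
within = {name: pearson(x, y) for name, (x, y) in groups.items()}

# Group-level (ecological) association, computed on the group means only.
mean_x = [sum(x) / len(x) for x, _ in groups.values()]
mean_y = [sum(y) / len(y) for _, y in groups.values()]
between = pearson(mean_x, mean_y)

print(within, between)
```

Here every within-group correlation is negative while the correlation of the group means is positive, which is why inspecting individual-level results before aggregating is a reasonable safeguard.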

Bare-bones meta-analysis

We used the lens model equation to calculate the success of bootstrapping (see the final results column of Table 1 for the indices of the success of bootstrapping models). Our bare-bones meta-analysis strategy was in line with the approach used by Karelaia and Hogarth [8] in their meta-analysis. Moreover, in line with the reviews by Camerer [6] and Karelaia and Hogarth [8], we included the linear knowledge component in our estimation of bootstrapping success. Hence, we underestimated general success relative to Kuncel et al. [9], who excluded the knowledge component (G) from their evaluation of bootstrapping success: because the knowledge component was smaller than 1, including it decreased the model component. In return, we gained more information about the human judgment and decision-making strategy than was possible in Kuncel et al. [9].

We followed the Hunter-Schmidt approach to meta-analysis [12]. This approach estimates the population effect size by correcting the observed effect size for bias due to various artifacts, including sampling and measurement error (see [12], p. 41). Specifically, we corrected for possible sampling bias introduced by the different numbers of judges in the individual studies, a procedure referred to as bare-bones meta-analysis. We used forest plots to graphically analyze the results of the bare-bones meta-analysis. We were specifically interested in whether the success of bootstrapping depended on the decision domain, the level of expertise of the human judge, or the type of criterion. Hence, for this moderator analysis, we reran the meta-analysis with subsamples of studies.

In addition to the overall success of models (see the third column in Tables 3 and 4), we also report the confidence and credibility intervals (see the fourth and fifth columns of Tables 3 and 4). In contrast to confidence intervals, credibility intervals are calculated with standard deviations after removing artifacts and correcting for sampling bias. If the credibility interval includes zero or is sufficiently large, there is a higher potential for moderator variables than when the credibility interval is small and excludes zero (for further information, see [61]). We considered further estimates of heterogeneity in addition to the Q-test. If the Q-test is significant, moderator variables are indicated (see the sixth and seventh columns of Tables 3 and 4). The I2 ([62], see the eighth column of Tables 3 and 4) represents the between-task heterogeneity not explained by sampling error; values above 25% indicate variation. Moreover, τ is an additional index of between-task heterogeneity (see the second to last column of Table 3): If τ is zero, this implies homogeneity. Finally, we used the 75% rule as an indication of moderator variables (see the last column of Tables 3 and 4). That is, moderators were expected whenever artifacts explained less than 75% of the observed variability.
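As an illustration of the bare-bones procedure described above, the following minimal sketch computes the sample-size-weighted mean effect, the residual standard deviation after subtracting the expected sampling error variance, an 80% credibility interval, and the percentage of observed variance attributable to sampling error (the basis of the 75% rule). It assumes that the effect sizes can be treated like correlations, so that the usual Hunter-Schmidt sampling error formula applies. The function name and the input values are hypothetical and are not taken from the study.

```python
from math import sqrt


def bare_bones(effects, sizes):
    """Bare-bones summary for correlation-type effects (illustrative sketch)."""
    k = len(effects)
    total_n = sum(sizes)
    # Sample-size-weighted mean effect.
    mean_r = sum(n * r for r, n in zip(effects, sizes)) / total_n
    # Sample-size-weighted observed variance of the effects.
    var_obs = sum(n * (r - mean_r) ** 2 for r, n in zip(effects, sizes)) / total_n
    # Expected sampling error variance, based on the average sample size.
    mean_n = total_n / k
    var_err = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
    # Residual variance cannot be negative, so it is truncated at zero.
    sd_res = sqrt(max(var_obs - var_err, 0.0))
    # Share of observed variance explained by sampling error (75% rule).
    pct_explained = float("inf") if var_obs == 0 else 100 * var_err / var_obs
    # 80% credibility interval: lower and upper 10% cut-offs.
    cred_80 = (mean_r - 1.28 * sd_res, mean_r + 1.28 * sd_res)
    return {"mean": mean_r, "sd_res": sd_res,
            "cred_80": cred_80, "pct_explained": pct_explained}


# Hypothetical input: three tasks with small samples.
result = bare_bones([0.1, 0.2, 0.3], [10, 20, 30])
print(result)
```

With small samples such as these, the expected sampling error variance exceeds the observed variance. The residual standard deviation is then zero, the credibility interval collapses to a single point, and the explained percentage exceeds 100%. This would be one way to read table rows that report an SD of .00 together with a very large value in the 75% column.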
Table 3

Results of the bare-bones meta-analysis organized by decision domain and decision maker’s expertise.

Domains (expertise)kNΔSDΔ95% CI80% CIQI2(%)τ275%
Medical14293.00.00-.10 - .12.00 - .001.3 n.s.0.000.001,171
Publ. bias+3324.03.00-.02 - .04.03 - .0339.15**59.10.00667
Expert13288.01.00-.10 - .12.01 - .011.19 n.s.0.000.001,262
Publ. bias+2305.02.00-.02 - .04.02 - .0336.59***61.70.00895
Novice
Business10244.02.00-.10 - .14.02 - .02.49 n.s.0.000.002,338
Expert7121.02.00-.15 - .20.02 - .02.22n.s.0.000.003,791
Novice3123.00.00-.15 - .19.02 - .02.26 n.s.0.000.001,146
Publ. bias+1125.02.00-.01 - .09.02 - .0215.38***80.50.0011,686
Education6198.11.00-.02 - .25.11 - .11.680.000.00> 10,000
Publ. bias+3208.12.00.11 - .21.12 - .1267.14***88.10.003> 10,000
Expert341.04.00-.26 - .34.00 - .00.00 n.s.0.000.00> 10,000
Novice3157.13.00-.03 - .28.13 - .13.42 n.s.0.000.00707
Publ. bias+2162.13.00.11 - .22.13 - .1347.16***91.50.0031,214
Psychology9105.14.00-.05-.33.14-.146.5 n.s.0.000.00> 10,000
Expert459.03.00-.22 - .28.03 - .03.01 n.s.0.000.004,971
Publ. bias+262.03.00.01 - .10.03-.033.31 n.s.0.000.00> 10,000
Novice546.29.00.00 - .58.29 - .294.59 n.s.0.000.00102
Publ. bias+147.30.00-.08 - .49.3 - .367.15***92.60.11> 10,000
Miscellaneous13270.13.00.01 - .25.13 - .131.54 n.s.0.000.00929
Expert515.00.00-.51 - .50.00 - .00.00 n.s.0.000.00> 10,000
Publ. bias+327-.01.00-.23 - .21-.01 -.01.00 n.s.0.000.00> 10,000
Novice12255.14.00.02 - .26.14 - .141.25 n.s.0.000.001,269
Overall Experts32532.03.00-.07 - .10.03 - .031.56 n.s.0.000.00> 10,000
Publ. bias+5820.04.00.01 - .05.04 - .0453.33**32.50.006> 10,000
Overall Novices20578.12.00.03 - .20.12-.129.65 n.s.0.000.00> 10,000
Overall521,110.07.00.01 - .13.07 - .0714.21n.s.0.000.00> 10,000
Publ. bias+ 121,365.10.00.73 - .12.10 - .10398***84.20.005> 10,000

k = number of judgment tasks;

N = number of success indices;

Δ = the success of bootstrapping models (see Eq 2); SD = standard deviation of true score correlation; 95% CI = confidence interval; 80% CI = 80% credibility interval including lower 10% of the true score and the upper 10% of the true score; 75% = percent variance in observed correlation attributable to all artifacts; Publ. bias = publication bias corrected estimation by the trim-and-fill method (see [63]);

+ = the number of missing tasks indicated by the trim-and-fill method.
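The relation between the Q and I2 columns can be sketched with the Higgins and Thompson formula [62]. The example below is only a plausibility check, not the authors' code. It assumes that the degrees of freedom equal the number of tasks (including tasks imputed by trim-and-fill) minus one, and it uses Q values as read from the publication bias-corrected rows of Table 3.

```python
def i_squared(q, df):
    """I2 in percent: share of variation beyond sampling error, floored at zero."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Medical domain, publication bias-corrected: 14 + 3 tasks, so df = 16 (assumed).
print(round(i_squared(39.15, 16), 1))  # 59.1
# Overall, publication bias-corrected: 52 + 12 tasks, so df = 63 (assumed).
print(round(i_squared(398, 63), 1))    # 84.2
# Medical domain, uncorrected: Q is smaller than df, so I2 is truncated to zero.
print(i_squared(1.3, 13))              # 0.0
```

Under these assumptions the computed values agree with the I2 entries reported in the corresponding rows.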

Table 4

Results of the bare-bones meta-analysis of the success of bootstrapping organized by type of evaluation criterion.

Evaluation criteriakNΔSDΔ95% CI80% CIQI2(%)τ275%
 Subjective476.03.00-.19 - .25.03 - .03.60 n.s.0.000.00520
Publ. bias+281.02.00-.16 - .06.02 - .0244.41***88.70.01> 10,000
 Objective33857.08.00.01 - .14.08 - .084.78 n.s.0.000.01778
Publ. bias+91,020.10.00.06 - .12.10 - .10216***81.10.00639
 Test15177.07.00-.08 - .21.07 - .078.68n.s.0.000.00197
Publ. bias+3330-.01.01-.12 - .09-.14 - .11149.33***88.60.0386.14

k = number of judgment tasks;

N = number of success indices;

Δ = the success of bootstrapping (see Eq 2);

SD = standard deviation of true score correlation; 95% CI = confidence interval; 80% CI = 80% credibility interval including lower 10% of the true score and the upper 10% of the true score; 75% = percent variance in observed correlation attributable to all artifacts; Publ. bias = publication bias-corrected estimation by the trim-and-fill method (see [63]); + = the number of missing tasks indicated by the trim-and-fill method.

As mentioned above, for our moderator analysis we reran the analysis for each decision domain, for experts and for novices overall, and for each level of expertise within each domain. We also reran the analysis for each type of evaluation criterion (objective, subjective, or test score) separately. We then checked our results with a sensitivity analysis. First, we checked for possible publication bias using the trim-and-fill method (see [63]). This approach estimates the effect sizes of potentially missing studies and includes them in a new meta-analytic estimation. Second, we used the leave-one-out approach to check whether the results were influenced by any individual task. In this approach, the first task is excluded in an initial meta-analysis. Then, in a subsequent analysis, only the second task is excluded, and so on. Hence, for example, for our overall meta-analysis with 52 tasks, 52 separate meta-analyses, each including 51 tasks, were conducted and the results were compared.
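The leave-one-out check can be sketched as follows. This is a simplified illustration that recomputes only the sample-size-weighted mean effect with one task removed at a time; the function names and the input values are hypothetical and are not taken from the study.

```python
def weighted_mean(effects, sizes):
    """Sample-size-weighted mean of the effect sizes."""
    return sum(n * d for d, n in zip(effects, sizes)) / sum(sizes)


def leave_one_out(effects, sizes):
    """Return one weighted mean per task, each computed without that task."""
    means = []
    for i in range(len(effects)):
        rest_effects = effects[:i] + effects[i + 1:]
        rest_sizes = sizes[:i] + sizes[i + 1:]
        means.append(weighted_mean(rest_effects, rest_sizes))
    return means


# Hypothetical input: three tasks, so three reduced meta-analyses are run.
effects = [0.1, 0.2, 0.3]
sizes = [10, 20, 30]
for i, m in enumerate(leave_one_out(effects, sizes)):
    print(f"without task {i + 1}: mean = {m:.3f}")
```

A task whose removal shifts the mean markedly relative to the full-sample estimate would be flagged as influential.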

Artifact-corrected lens model indices

To check the robustness of the results of the bare-bones meta-analysis, we used the subset of k = 31 tasks with sufficient information to evaluate the success of artifact-corrected bootstrap models using the psychometrically-corrected lens model components from Kaufmann et al. [11]. In the same way, we also used these databases with lens model indices corrected by a bare-bones meta-analysis to check the differences between the two approaches directly. This procedure was also applied in Kaufmann et al. [11]. It should be noted that here, we used meta-analysis-corrected indices, in contrast to the previously described analytical strategy, in which the indices were not corrected before building the bootstrapping models. In our presentation of this second analytical strategy, we consider the five domains and judge expertise.

Results

The success of individual bootstrapping procedures

Fig 2 displays a scatter plot of the success of 365 individual bootstrapping procedures (see Eq 1), organized by domain (marked by color) and decision maker expertise (triangles for experts, circles for novices). A value of zero indicates that the model was as accurate as the human judge; positive values indicate that the model was more accurate than the human judge. The scatter plot displays the wide variability in the success of the bootstrapping models.
Fig 2

Scatter plot of the success of 365 bootstrapping procedures across 28 different tasks organized by decision domain and decision maker expertise.
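To make the sign convention of the success index concrete, the sketch below computes it from lens model components. It assumes Tucker's form of the lens model equation, in which achievement is ra = G·Re·Rs + C·sqrt(1 − Re²)·sqrt(1 − Rs²), and it assumes that the success index is the validity of the linear model, G·Re, minus ra, in line with the statement in the Methods that the knowledge component G was included. The exact form of Eq 1 is not reproduced in this section, so the functions and the input values are illustrative only.

```python
from math import sqrt


def achievement(g, r_e, r_s, c=0.0):
    """Judgment achievement ra according to the lens model equation (assumed form)."""
    return g * r_e * r_s + c * sqrt(1 - r_e ** 2) * sqrt(1 - r_s ** 2)


def bootstrapping_success(g, r_e, r_s, c=0.0):
    """Model validity (G * Re) minus human achievement (ra); assumed definition."""
    return g * r_e - achievement(g, r_e, r_s, c)


# Hypothetical components: knowledge G, environmental predictability Re,
# judge consistency Rs, and no unmodeled knowledge (C = 0).
delta = bootstrapping_success(g=0.8, r_e=0.7, r_s=0.75)
print(round(delta, 2))  # 0.14
```

Under these assumptions, a perfectly consistent judge (Rs = 1) with C = 0 gives a value of zero, and any inconsistency in the judge makes the index positive.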

Bare-bones meta-analytic results

Fig 3 shows the forest plots. More than 80% of the tasks (42 of the 52 tasks) were associated with a positive value, indicating that the bootstrapping models were more accurate than the human judges. Particularly noteworthy is that bootstrapping was more accurate than human judgment across all of the tasks within education sciences.
Fig 3

Forest plots of the success of bootstrapping models organized by decision domain and decision maker expertise.

Positive values indicate that bootstrapping resulted in more accurate judgments than human judgment.


Across all tasks, the results of the bare-bones meta-analysis demonstrated that models were generally more accurate than human judges (Δ = .071 across all tasks, see Table 3). There was no indication of moderator variables according to the several heterogeneity indices. In contrast, our publication bias estimation revealed that 12 tasks may have been missed. The resulting publication bias-corrected overall estimation of the success of bootstrapping models indicated the possibility of moderator variables. Although not all heterogeneity indices confirmed the possibility of moderator variables, we undertook the moderator analysis to check our results. As Table 3 shows, if we focus on the expertise level, our analysis revealed overall that the success of bootstrapping models was greater within the novice category than within the expert category (.12 vs. .03). Within the different domains, models were generally more successful relative to novice judgment than relative to expert judgment, with the exception of business decisions. Within the business decision domain, bootstrapping was more successful for expert judges than for novice judges. There was no indication of moderator variables across the different heterogeneity indices. The results were confirmed by our publication bias estimation within the different fields, except in the medical and educational fields, where our results may be underestimated (see also the associated confidence intervals) and additional moderator variables may be indicated. On the other hand, it should be noted that our leave-one-out check revealed that within the educational field, there was a decrease in the success of bootstrapping models if the paper by Wiggins et al. (see [43]) was excluded (Δ = .5, -14–24). 
If we now assumed some moderator variables within the different fields and focused on the expertise level (expert vs. novice) within the different fields again, possible publication bias seemed to be associated with an increased success of bootstrapping except in the ‘miscellaneous’ field category. Additionally, the publication bias-corrected estimation within this miscellaneous field category and within the psychology expert category revealed, contrary to the other publication bias-corrected estimations, no additional moderator variables. To summarize, all the different analyses (with and without publication bias corrections, and the leave-one-out approach) revealed a positive value of the success of bootstrapping models. The only exception was the publication bias-corrected estimation within the miscellaneous category considering experts. However, we highlight here that the positive value of the success of bootstrapping models was not completely confirmed by the 95% confidence intervals but was confirmed by our 80% credibility interval estimations, which we discuss below. In addition to the bare-bones meta-analysis reported above, Table 4 displays the results of the bare-bones meta-analysis separated by evaluation criterion (objective, subjective, or test score). As Table 4 shows, at first glance bootstrapping was more successful when there was an objective criterion and less successful when a subjective or test score criterion was used. If we consider the 95% confidence intervals, negative success values were revealed within the subjective and the test categories. Our analysis of evaluation criteria indicated no possible moderator variables across the different heterogeneity indices. Additionally, in each evaluation criterion category, a publication bias was indicated by the trim-and-fill approach. 
Our reanalysis considering a possible publication bias affecting the success of bootstrapping suggested that the success of models was underestimated in the objective evaluation criteria category and overestimated in the subjective evaluation criteria category. Within all evaluation criteria categories, the publication bias-corrected estimations now indicated possible moderator variables.

Artifact-corrected results

Table 5 displays the results of the success of bootstrap models with psychometrically-corrected lens model indices (k = 31). These results suggest that the success of bootstrapping was in fact clearly greater than the results of the bare-bones meta-analysis suggested (.23 vs. .07). If we compared the results with our previously presented bare-bones meta-analysis (see Table 2), our conclusion was confirmed. Importantly, it should be noted that the artifact-corrected results were based on only a subset of the studies included in the bare-bones meta-analysis, as outlined above. Thus, the results of the previous bare-bones meta-analysis and the artifact-corrected results were not directly comparable. Nevertheless, both sets of results partly indicated that models were more successful than human judges across all decision domains. Notably, in comparison with the results of our previously presented bare-bones meta-analysis, the psychometrically-corrected models indicated a different pattern of results on the success of bootstrapping across levels of expertise and decision domains (see Table 4).
Table 5

The success of bootstrapping according to bare-bones (in brackets) and psychometrically-corrected lens model indices.

Domains | k | N | Δ overall b | Δ experts | Δ novices
Medical science | 10 | 258 | .35 (.01) | .35 (-.01) | .35 (-.01)
Business | 9 | 239 | .018a (-.03) | .05a (-.01) | .09a (-.02)
Education | 4 | 156 | .21 (.12) | .18 (.15) | .14 (.04)
Psychology | 9 | 105 | .08 (.04) | .23a (.15) | .04 (.04)
Miscellaneous | 12 | 249 | .26 (.16) | .27a (.16) | .01 (-.02)
Overall | 44 | 1,007 | .23 (.07) | .22 (.13) | .17 (.02)

k = number of judgment tasks; N = number of success indices; Δ = estimated success of bootstrapping (see Eq 2).

a = no correction of the Re component, because this component includes only objective criteria.

b = this column is the same as in Kaufmann et al. [11], Table 7, columns 5 and 6.


Discussion

Like previous reviews [4, 7, 8], we first used a bare-bones meta-analytic procedure [12] to evaluate the success of bootstrapping. Unique to the present study was our additional use of psychometrically-corrected bootstrap models, which are based on a previous meta-analysis (see [11]). These results allowed us to check for various methodological artifacts that may have biased the results of previous meta-analyses. The major finding of this study is that models lead to more accurate judgments than individual human judges make across quite diverse domains (Δ = .07). The results of the present meta-analysis are in line with previous meta-analyses of the overall success of bootstrapping [6, 8]. Notably, there were 10 tasks in which models were not superior to human judges. We argue that the results of meta-analysis of the success of bootstrap models with artifact-corrected lens model indices represent a more accurate estimation of the success of bootstrapping. Comparison of the results of the success of bootstrap models with artifact-corrected lens model indices with the results of the bare-bones meta-analysis in the present study suggests that previous meta-analyses may have underestimated the success of bootstrapping [4, 7, 8, 9]. Although the estimated success of bootstrapping is only slightly higher according to the results of the meta-analysis examining the success of bootstrap models with artifact-corrected lens model indices relative to the bare-bones meta-analysis, the higher (and more accurate) success estimates are meaningful particularly in high-risk decision-making domains like medical science, in which even a small increase in decision accuracy could lead to many saved lives. In sum, our results support the conclusion that formal models to guide and support decisions should be developed especially in decision domains where the cost of inaccurate decisions is high. 
It should be noted, however, that we used a slightly reduced subset of tasks in the estimation of the success of bootstrap models with artifact-corrected lens model indices (the same database as [11]) as compared to the bare-bones meta-analysis, so that the two estimates of the success of bootstrapping are not directly comparable. Moreover, we found that there were no systematic differences in the estimated success of bootstrapping depending on the decision domain. However, we highlight that the success of bootstrapping was particularly high in the psychological decision domain. Based on the success of bootstrapping within psychology in the present study, it seems suitable to apply bootstrapping more widely in psychological decision-making tasks in order to overcome the low judgment achievement of psychological experts (see [21, 11]). The present analyses also considered the potential role of judge expertise in the success of bootstrapping. The results indicate that not only novices but also experts may profit from bootstrapping (see also [10]). The results of the bare-bones meta-analysis suggest that mainly novices profit from bootstrapping, whereas the results of the psychometrically-corrected lens model indices suggest that mainly experts profit from bootstrapping. We note once again that the samples of studies included in the two analyses differed slightly, and hence, the results are not directly comparable. In light of the inconsistent results on the relationship between bootstrapping success and level of judge expertise, we recommend that future studies also consider expertise as a potential moderator of bootstrapping success. We emphasize that only the study by Stewart, Roebber, and Bosart [52] compared novices and experts across the same four meteorological tasks, and we therefore urge researchers to conduct more studies directly comparing novice and expert judges. 
Finally, in the present analysis, we considered the type of evaluation criterion as a potential moderator of the success of bootstrapping. Namely, we analyzed the success of bootstrapping separately for studies in which the accuracy of a decision was based on an objective, subjective, or test criterion. We believe that future evaluations of bootstrapping success should likewise consider the type of decision criterion (see also [7] with regard to human and non-human decision domains). In the present study, we found that bootstrapping was especially successful when there was an objective criterion for an accurate decision (e.g., [54]). The higher success of bootstrapping in tasks with an objective criterion is unexpected, since human judges are thought to receive faster and more definite feedback regarding the accuracy of their decisions when there is an objective criterion than when there is a subjective criterion [64]. The results of our analysis also imply that the meta-analyses by Grove et al. [7] and Aegisdottir et al. [4] may underestimate the success of bootstrapping, since both of those meta-analyses excluded studies with tasks predicting nonhuman outcomes (e.g., weather forecasts). Hence it is primarily with objective criteria that bootstrapping appears, based on the present results, to be particularly successful. Our publication bias-corrected estimation supports our assumption. However, we note that the sample of studies including subjective criteria is quite small, which may limit the generalizability of our results. Taken together, our review confirms previous meta-analyses in the field and contributes new knowledge on differences in the success of bootstrapping across different decision domains, different levels of expertise of the human judge, and different types of evaluation criteria. However, a potential point of criticism in our study is that our conclusions are not confirmed by our interpretation of the confidence intervals. 
We argue that the confidence interval estimations did not consider any sampling bias, which is considered in the credibility interval estimations also reported in our work (see Table 4, 5, [12], p. 228). If we focus on the sampling bias-corrected credibility intervals, our results are clearly supported, except in two cases. These two cases are the publication bias-corrected estimation of the success of models in the miscellaneous expert category and the publication bias-corrected estimation in the test evaluation criterion category. Hence, we argue that especially within these two categories, the success of models may not be confirmed. We also emphasize the need for caution in interpreting our publication bias-corrected estimations, as these estimations are based on a database without any corrections for artifacts such as measurement error. Hence, the heterogeneity of our databases may be overestimated due to measurement error (see [11]), leading to an overestimation of a possible publication bias. Moreover, it is important to note that the scope of the present meta-analysis was limited to the success of linear bootstrap models, which represent only one type of formal decision-making model. Our analysis of only linear models may overestimate the potential success of bootstrapping in general (see [9]), since linear models have the problem of overfitting, in contrast to the fast and frugal non-linear models [65]. Non-linear models are also considered to be more user-friendly, which may increase their application in real-life settings [16]. Notably, as an evaluation of the success of artifact-corrected linear models relative to non-linear models has not yet been conducted, it offers an interesting and important avenue for future research. 
In addition, we see the need to evaluate how the success of bootstrapping may be affected by the number of cues provided in decision-making tasks (i.e., to examine whether bootstrapping is more successful when human judges are provided with more or less information). Further, we feel that future evaluations of bootstrapping success should consider Brunswik’s symmetry concept (see [66]). Judgment achievement increases if both the judgment and the criterion are measured at the same level of aggregation (i.e., if they are ‘symmetrical’). For example, if a physician is asked to judge whether cancer is present and the criterion is whether a cancer tumor is detected, then the judgment is not symmetrical, as cancer can exist without a detectable tumor. In contrast, if a physician is asked to judge whether there is cancer only when a cancer tumor has been detected, then the judgment and the criterion are said to be symmetrical. We did not control for symmetry in the present analysis, which may have led to an underestimation of the lens model components. Future research on whether the symmetry concept moderates the estimated success of bootstrapping would be highly useful in providing a more thorough understanding of the contexts in which models make better judges than humans do, leading to improved judgment accuracy within different domains.
  17 in total

1.  Cue definition and residual judgment.

Authors:  H J Einhorn
Journal:  Organ Behav Hum Perform       Date:  1974-08

2.  Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis.

Authors:  S Duval; R Tweedie
Journal:  Biometrics       Date:  2000-06       Impact factor: 2.571

3.  Integration of information in a clinical judgment task, an empirical comparison of six models.

Authors:  L Nystedt; D Magnusson
Journal:  Percept Mot Skills       Date:  1975-04

4.  Quantifying heterogeneity in a meta-analysis.

Authors:  Julian P T Higgins; Simon G Thompson
Journal:  Stat Med       Date:  2002-06-15       Impact factor: 2.373

5.  A SUGGESTED ALTERNATIVE FORMULATION IN THE DEVELOPMENTS BY HURSCH, HAMMOND, AND HURSCH, AND BY HAMMOND, HURSCH, AND TODD.

Authors:  L R TUCKER
Journal:  Psychol Rev       Date:  1964-11       Impact factor: 8.934

6.  Representative design and probabilistic theory in a functional psychology.

Authors:  E BRUNSWIK
Journal:  Psychol Rev       Date:  1955-05       Impact factor: 8.934

7.  Relative accuracy of actuarial prediction, experienced clinicians, and graduate students in a clinical judgment task.

Authors:  L C GREBSTEIN
Journal:  J Consult Psychol       Date:  1963-04

8.  Probabilistic functioning and the clinical method.

Authors:  K R HAMMOND
Journal:  Psychol Rev       Date:  1955-07       Impact factor: 8.934

Review 9.  Determinants of linear judgment: a meta-analysis of lens model studies.

Authors:  Natalia Karelaia; Robin M Hogarth
Journal:  Psychol Bull       Date:  2008-05       Impact factor: 17.737

10.  A critical meta-analysis of lens model studies in human judgment and decision-making.

Authors:  Esther Kaufmann; Ulf-Dietrich Reips; Werner W Wittmann
Journal:  PLoS One       Date:  2013-12-31       Impact factor: 3.240
