Geoffrey Stewart Morrison1,2. 1. Forensic Data Science Laboratory & Forensic Speech Science Laboratory, Computer Science Department & Aston Institute for Forensic Linguistics, Aston University, Birmingham, UK. 2. Forensic Evaluation Ltd, Birmingham, UK.
Abstract
Forensic-evaluation systems should output likelihood-ratio values that are well calibrated. If they do not, their output will be misleading. Unless a forensic-evaluation system is intrinsically well-calibrated, it should be calibrated using a parsimonious parametric model that is trained using calibration data. The system should then be tested using validation data. Metrics of degree of calibration that are based on the pool-adjacent-violators (PAV) algorithm recalibrate the likelihood-ratio values calculated from the validation data. The PAV algorithm overfits on the validation data because it is both trained and tested on the validation data, and because it is a non-parametric model with weak constraints. For already-calibrated systems, PAV-based ostensive metrics of degree of calibration do not actually measure degree of calibration; they measure sampling variability between the calibration data and the validation data, and overfitting on the validation data. Monte Carlo simulations are used to demonstrate that this is the case. We therefore argue that, in the context of casework, PAV-based metrics are not meaningful metrics of degree of calibration; however, we also argue that, in the context of casework, a metric of degree of calibration is not required.
Forensic-evaluation systems should output well-calibrated likelihood-ratio values
Forensic-evaluation systems should output likelihood-ratio values that are well calibrated [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]]. If they do not, their output will be misleading. For a well-calibrated system, the likelihood ratios of the likelihood-ratio values that it outputs will be the same as the likelihood-ratio values that it outputs (Birdsall [13] §1.2). In practice, unless one were to train and test on the same data, because of sampling variability, if one were to re-calibrate an already well-calibrated system one would expect the likelihood ratios of the likelihood-ratio values only to be approximately the same as the original likelihood-ratio values.1
Causes of poorly-calibrated likelihood-ratio output
If a forensic-evaluation system makes use of feature vectors (i.e., sets of measurements made on the objects of interest) that have a small number of dimensions and that have distributions that do not violate the assumptions of a parsimonious parametric statistical model, and the number of data points available for model training is large compared to the number of parameter values to be estimated, then the output of the model will be intrinsically well calibrated. In real forensic settings, however, it is common for the feature vectors to have a large number of dimensions, for the fitted models to be complex, and for the number of data points available for training to be small, thus requiring a large number of parameter values to be estimated from a limited amount of data. Classic examples of high-dimensional data and complex models can be found in forensic voice comparison [6], but, when data are limited, even moderate numbers of dimensions and relatively parsimonious models can lead to miscalibrated output; see, for example, [14,15], and the commentary of [8] on the latter.
How to calibrate forensic-evaluation systems
A practical solution to the problem described in the previous section is to treat the output of the model as uncalibrated likelihood ratios, and then use a second model to calibrate the output of the first model [[3], [16], [17], [18]]; see Fig. 1. For simplicity, we will henceforth refer to the uncalibrated likelihood ratios output by the first-stage model as “scores”, and refer to the calibrated likelihood ratios output by the second-stage model as “likelihood ratios”. Also for simplicity, we will assume that the forensic problem at hand is source attribution.
Fig. 1
Schematic of a forensic-evaluation system consisting of a feature-to-score model (a complex multidimensional model that outputs uncalibrated likelihood ratios) followed by a score-to-log-likelihood-ratio model (a parsimonious unidimensional calibration model).
The second-stage module is trained using a separate dataset from that previously used to train the first-stage model. We will henceforth refer to the second dataset as the “calibration data”. Same-source and different-source pairs are constructed from the calibration data. Those pairs are input to the first-stage model which then outputs a set of same-source scores and a set of different-source scores. The second-stage calibration model is trained using those same-source and different-source scores. The scores are univariate, and a parsimonious parametric model is used as the calibration model. Hence, even with a moderate amount of calibration data, there are a relatively large number of data points available to estimate a small number of parameter values. This results in well-calibrated output.

A point to note is that the calibration model is applied to scores that are uncalibrated log likelihood ratios – the calculation of the scores has taken account of both the similarity between the members of each pair, and their typicality with respect to the relevant population. Using scores that only take account of similarity will not result in meaningful likelihood-ratio values [[19], [20], [21]].

Another point to note is that the calibration data must be representative of the relevant population for the case and must reflect the conditions of the questioned-source specimen and known-source sample in the case ([1,10]).
If there is a mismatch between the conditions of the questioned-source specimen and known-source sample, then one member of each pair in the calibration data must reflect the conditions of the questioned-source specimen and the other member of the pair must reflect the conditions of the known-source sample. If the calibration data do not represent the relevant population for the case and do not reflect the conditions for the case, then the resulting model will miscalibrate the output. The decision as to whether the calibration data are sufficiently representative of the relevant population for the case and sufficiently reflective of the conditions for the case will be a subjective judgement made by the forensic practitioner, but this should be made transparent so that the decision can be reviewed by an independent practitioner and potentially be debated before the court ([10,22,23]).
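As a concrete illustration of the second-stage model described above, the following is a minimal sketch, assuming a logistic-regression calibration model fitted by batch gradient descent and equal numbers of same-source and different-source calibration scores. The function names and the synthetic Gaussian scores are illustrative, not taken from any particular study.

```python
import math
import random

def fit_logreg_calibration(ss_scores, ds_scores, iters=10000, step=0.2):
    """Fit slope b1 and intercept b0 of a logistic-regression
    score-to-posterior model by batch gradient descent."""
    xs = list(ss_scores) + list(ds_scores)
    ys = [1.0] * len(ss_scores) + [0.0] * len(ds_scores)
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iters):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b1 * x + b0)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= step * g0 / n
        b1 -= step * g1 / n
    return b1, b0

def score_to_llr(s, b1, b0, n_ss, n_ds):
    """Calibrated natural-log likelihood ratio: subtract the log prior
    odds implied by the training-set proportions."""
    return b1 * s + b0 - math.log(n_ss / n_ds)

# Illustrative synthetic calibration scores (not data from the paper):
random.seed(0)
cal_ss = [random.gauss(6.0, 1.0) for _ in range(50)]
cal_ds = [random.gauss(3.0, 1.0) for _ in range(50)]
b1, b0 = fit_logreg_calibration(cal_ss, cal_ds)
```

Because only a slope and an intercept are estimated, even a moderate amount of calibration data constrains the model well; a same-source-like score maps to a positive log likelihood ratio and a different-source-like score to a negative one.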
Metrics of degree of calibration
Introduction
Several metrics have been proposed for measuring the degree of calibration of the output of a forensic-evaluation system.2 Vergeer et al. [11] explored the performance of different metrics using simulated data for which the true distributions were known. Metrics based on the expected value of different-source likelihood-ratio values and the expected value of the inverse of same-source likelihood-ratio values (after Good [24]) did not perform as desired, nor did metrics based on the proportion of different-source likelihood ratios above 2 and the proportion of same-source likelihood ratios below 0.5 (after Royall [25]). We will not discuss these metrics further here. Instead, we will focus on metrics that make use of the pool-adjacent-violators (PAV) algorithm ([16,26,27]).3
The more established of the PAV-based metrics is Cllr^cal = Cllr − Cllr^min (Brümmer & du Preez [16]), where Cllr is the log-likelihood-ratio cost, calculated as in Eq. (1),4 and Cllr^min is Cllr calculated after the log-likelihood-ratio values resulting from the validation data have been transformed using PAV. PAV is a non-parametric algorithm that, subject only to the constraint of monotonicity, shifts the log-likelihood-ratio values so as to minimize Cllr. The same same-source and different-source log-likelihood-ratio values that are used for training PAV are themselves transformed and used to calculate Cllr^min.

$C_{\mathrm{llr}} = \frac{1}{2}\left(\frac{1}{N_s}\sum_{i=1}^{N_s}\log_2\left(1+\frac{1}{\Lambda_{s_i}}\right)+\frac{1}{N_d}\sum_{j=1}^{N_d}\log_2\left(1+\Lambda_{d_j}\right)\right)$ (1)

In Eq. (1), Λ_s and Λ_d are respectively the same-source and different-source likelihood-ratio values output by the system in response to the validation data, and N_s and N_d are respectively the number of same-source and different-source likelihood-ratio values.5 In order for the results to be meaningful in the context of the case, the validation data must be representative of the relevant population for the case and must reflect the conditions of the questioned-source specimen and known-source sample in the case, including any mismatch between them ([1,10]). The validation data must also be separate from the calibration data (and from any other data used for training the system).6
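Eq. (1) and the PAV-based Cllr^min can be sketched in Python as follows. This is a simplified pooling implementation of PAV, not the optimized algorithm of [16], and the well-calibrated Gaussian log-likelihood-ratio values used at the end are illustrative.

```python
import math
import random

def cllr(ss_llrs, ds_llrs):
    """Log-likelihood-ratio cost, Eq. (1); inputs are natural-log LRs."""
    ss = sum(math.log2(1.0 + math.exp(-l)) for l in ss_llrs) / len(ss_llrs)
    ds = sum(math.log2(1.0 + math.exp(l)) for l in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss + ds)

def pav(labels):
    """Pool adjacent violators: non-decreasing least-squares fit to 0/1
    labels that have been sorted by the corresponding score."""
    blocks = []  # each block is [sum_of_labels, count]
    for y in labels:
        blocks.append([float(y), 1])
        # merge while the previous block's mean >= the new block's mean
        while len(blocks) > 1 and \
                blocks[-2][0] * blocks[-1][1] >= blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

def cllr_min(ss_llrs, ds_llrs):
    """Cllr after PAV recalibration, training and testing on the same data."""
    ns, nd = len(ss_llrs), len(ds_llrs)
    data = sorted([(v, 1) for v in ss_llrs] + [(v, 0) for v in ds_llrs])
    post = pav([y for _, y in data])  # monotone posterior P(same-source)
    ss = sum(math.log2(1.0 + ((1.0 - p) / p) * (ns / nd))
             for (_, y), p in zip(data, post) if y == 1) / ns
    ds = sum(math.log2(1.0 + (p / (1.0 - p)) * (nd / ns))
             for (_, y), p in zip(data, post) if y == 0) / nd
    return 0.5 * (ss + ds)

# Illustrative well-calibrated validation LLRs (means ±4.5, sd 3):
random.seed(1)
val_ss = [random.gauss(4.5, 3.0) for _ in range(50)]
val_ds = [random.gauss(-4.5, 3.0) for _ in range(50)]
c, c_min = cllr(val_ss, val_ds), cllr_min(val_ss, val_ds)
```

Note that even for these well-calibrated values, c_min comes out below c because PAV is trained and tested on the same data – the overfitting taken up in the Argument section.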
devPAV
A novel PAV-based metric, devPAV, was introduced in Vergeer et al. [11]. For a graphical explanation of the calculation of devPAV, see Ref. [11] Fig. 2. Log likelihood ratios are calculated using the validation data. The PAV algorithm is applied to the resulting log likelihood ratios. The PAV-based log-likelihood-ratio to recalibrated-log-likelihood-ratio mapping function is plotted, with log-likelihood-ratio values on the x axis and recalibrated-log-likelihood-ratio values on the y axis. The line y = x is plotted on the same axes. If the log-likelihood-ratio values output by the system were perfectly calibrated, then recalibrating them would (theoretically) result in the same values, i.e., y = x. Within the range from the smallest same-source log likelihood ratio to the largest different-source log likelihood ratio (the range within which the recalibrated-log-likelihood-ratio values will be finite), the area between y = x and the log-likelihood-ratio to recalibrated-log-likelihood-ratio mapping function is calculated. This is achieved by stepping through adjacent pairs of log-likelihood-ratio and recalibrated-log-likelihood-ratio values, and calculating the areas of rectangles and triangles that piecewise make up the total area. The total area is then divided by the length of the range from the smallest same-source log likelihood ratio to the largest different-source log likelihood ratio. For the calculation of devPAV, both log-likelihood-ratio values and recalibrated-log-likelihood-ratio values are scaled as base-ten logarithms.
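The area computation can be sketched as follows. This is a simplified reading of the procedure in [11]: it assumes the PAV mapping has already been evaluated at the observed points, interpolates linearly between them, and integrates the absolute deviation from y = x (the rectangles and triangles of the original description); the exact handling of the step function may differ in [11]. The example points are invented for illustration.

```python
def dev_area(points, lo, hi):
    """Mean absolute deviation between y = x and a piecewise-linear
    mapping through `points` ((x, y) pairs sorted by x, base-ten logs),
    averaged over the range [lo, hi]."""
    total = 0.0
    for (xa, ya), (xb, yb) in zip(points, points[1:]):
        if xb == xa:
            continue  # points assumed strictly increasing in x
        a, b = max(xa, lo), min(xb, hi)
        if a >= b:
            continue  # segment lies outside the integration range
        slope = (yb - ya) / (xb - xa)
        ga = ya + slope * (a - xa) - a  # deviation from y = x at a
        gb = ya + slope * (b - xa) - b  # deviation from y = x at b
        if ga * gb >= 0.0:
            total += 0.5 * (abs(ga) + abs(gb)) * (b - a)  # trapezium
        else:
            # the mapping crosses y = x within the segment: two triangles
            x0 = a + (b - a) * abs(ga) / (abs(ga) + abs(gb))
            total += 0.5 * (abs(ga) * (x0 - a) + abs(gb) * (b - x0))
    return total / (hi - lo)

# A mapping that sits 0.5 above y = x everywhere has area 0.5:
pts = [(-2.0, -1.5), (0.0, 0.5), (2.0, 2.5)]
area = dev_area(pts, -2.0, 2.0)  # → 0.5
```

For a perfectly calibrated mapping (points exactly on y = x) the function returns 0, consistent with the theoretical expectation described above.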
Argument
Introduction
The purpose of the present paper is to present the argument that, in the context of conducting casework, once a forensic-evaluation system has been appropriately calibrated, subsequently calculated PAV-based ostensive metrics of degree of calibration do not in fact provide information about degree of calibration. Instead, PAV-based ostensive metrics of degree of calibration provide information about sampling variability between the calibration and validation data and about overfitting on the validation data. Once a forensic-evaluation system has been appropriately calibrated, PAV-based metrics are not meaningful metrics of degree of calibration. The fact that they are not meaningful metrics of degree of calibration, however, is not of concern because, in the context of deciding whether a system is sufficiently well calibrated to be used for a case, a metric of degree of calibration is not required.

Note that the argument presented here relates to attempted measurement of degree of calibration of already-calibrated systems, not to measurement of degree of calibration of uncalibrated systems.

Note, also, that the argument presented here relates to the use of metrics of degree of calibration in the context of using a forensic-evaluation system in a case and presenting the results to a court (or to some other decision maker in the judicial process). It does not relate to the use of metrics of degree of calibration in the context of research and development of forensic-evaluation systems nor to selection of which of multiple systems to use. The present paper is written from the perspective of best practice for a forensic practitioner who is conducting a forensic evaluation or who is independently reviewing a report on a forensic evaluation conducted by another forensic practitioner. The present paper should be read in the context of the Consensus on validation of forensic voice comparison [10].
Metrics of degree of calibration are not required
An astute reader of the Consensus on validation of forensic voice comparison [10] may have noticed that, although it recommended that forensic-evaluation systems be well calibrated, it did not recommend that practitioners calculate and present to a court a metric of degree of calibration. In the context of a case, discussion regarding calibration should not centre around ostensive metrics of degree of calibration. Instead, it should centre around the following questions:

Has the system been calibrated using an appropriate calibration model?7
Has the calibration model been trained using appropriate data?

In order for these questions to be answerable, the forensic practitioner must describe the calibration model and the calibration data so that their appropriateness can potentially be reviewed by an independent practitioner and can potentially be debated before the court.

An example of lack of appropriate calibration in the context of a forensic-voice-comparison case is described in Morrison [28]: The questioned-speaker recording was a recording of a mobile telephone call in which the speaker of interest was distant from the telephone, and the known-speaker recording was a recording of a landline telephone call in which the speaker of interest was in a highly reverberant environment. In contrast, the forensic-voice-comparison system was trained on high-quality audio recordings, and it did not include an explicit calibration stage.

An example of a calibration model that would be inappropriate for evidential casework is described in Jessen et al. [29]: The calibration model included shifting the scores so that 10% of the different-source scores had values greater than 0. This may be appropriate in an investigative context in which one requires a 10% false-alarm rate, but, in the context of assessing strength of evidence for presentation in court, unless this accidentally corresponds to the shift that minimizes Cllr (and for the conditions tested in Ref.
[29], it did not), this procedure deliberately miscalibrates the output of the system.

An example of use of inappropriate calibration data in the context of a forensic-voice-comparison case is described in Morrison [23]: The speakers of interest on the questioned- and known-speaker recordings had West Yorkshire accents, and the questioned-speaker recordings were covert recordings made in a car. These were poor-quality recordings that included engine and traffic noise. In contrast, the calibration data were high-quality audio recordings of speakers with “standard southern British English” accents, and consisted of only one recording of each speaker (different parts of the same recordings were used to create same-speaker pairs). For another example of use of inappropriate calibration data, see Morrison & Thompson [22] §7.8

These are examples in which the appropriateness of the calibration model and calibration data could be (or actually were) debated before a court. In none of these examples would a metric of degree of calibration have been of assistance.

A metric of degree of calibration might be of assistance in demonstrating that a calibration model is inappropriate, but a Cllr value greater than 1 or a graphical representation (such as a Tippett plot or probability-density plot) would probably be sufficient to convey gross miscalibration to a court. If an appropriate calibration model and appropriate calibration and validation data have been used, then one would not expect a Cllr value greater than approximately 1.
If the Cllr value is less than 1 the system is providing useful information; therefore, ceteris paribus, it would be better to use that system than to use no system.

In the context of providing a critique of a forensic-evaluation report, the practitioner who is critiquing the report is unlikely to have access to the evaluation software (including the calibration model) or the calibration and validation data used by the practitioner who conducted the evaluation (assuming they exist – all too often a critique points out that there was no calibration or validation). Hence a practitioner who is independently critiquing the report will usually not be able to generate graphics or metrics indicative of the degree of calibration of the forensic-evaluation system that was actually used. In such circumstances, all they can do is discuss whether the calibration model and the calibration and validation data were appropriate from a theoretical perspective.

A metric of degree of calibration would not be of assistance in deciding on the appropriateness of the calibration data (nor would a graphical representation of results): If it were decided that the calibration model and the calibration data were appropriate, and validation data were selected using the same criteria as were used to select the calibration data, but in reality the calibration and validation data were not appropriate, then no amount of testing using the validation data would reveal that mistake. All resulting performance metrics, including metrics of degree of calibration, would be misleading, but there would be no way of knowing that this was the case. The decision as to whether the calibration and validation data are appropriate is a pre-empirical decision.
PAV-based ostensive metrics of degree of calibration actually measure sampling variability and overfitting
Assume we have a two-stage system including a feature-to-score model then a score-to-log-likelihood-ratio model. The latter is the calibration model. The calibration model is a parsimonious model trained on a set of calibration data, and the performance of the system is tested using a set of validation data. Both the calibration data and the validation data are selected using the same criteria to decide whether they are sufficiently representative of the relevant population for the case and sufficiently reflective of the conditions for the case. In fact, it would be usual to obtain a single data set and then split it into a calibration set and a validation set, either as two completely separate sets or via cross-validation.

Either directly or indirectly, an appropriate calibration model will, subject to the constraints of the model, minimize Cllr for the calibration data.9 The Cllr value calculated for the calibration data, however, will be the result of training and testing on the same data, and will therefore be overfitted on the calibration data and will tend to be overly optimistic with respect to the expected performance of the system when applied to previously-unseen data. Importantly, previously-unseen data include the questioned-source specimen and known-source sample in the case. Results intended to be representative of the expected performance of the system when generalized to previously-unseen data are obtained using validation data. Same-source and different-source pairs are constructed from the validation data. Those pairs are input to the first-stage model which then outputs a set of same-source scores and a set of different-source scores. These validation scores are input to the calibration model that was already trained on the calibration data. The resulting calibrated log-likelihood-ratio values derived from the validation data are used to calculate a Cllr value.
The latter Cllr value represents the expected performance of the system when applied to previously-unseen data, such as the questioned-source specimen and known-source sample in the case.

If one were to take the log-likelihood-ratio values resulting from the validation data, use them to train a new calibration model and then recalibrate them using that model, one would be both training and testing on the validation data, would overfit on the validation data, and would tend to obtain overly optimistic results. If the same type of model were used for calibration and recalibration, a metric based on the difference between calibrated and recalibrated results would therefore simply capture the difference due to sampling variability between the calibration and validation data, and due to overfitting because of both training and testing on the validation data. If the recalibration model were PAV and it was both trained and tested on the validation data, then the results would be doubly overfitted. They would be doubly overfitted not just because of training and testing on the same data, but also because of the weak constraints of the non-parametric PAV algorithm.

In the description of the devPAV metric in §1.4.3 above, we wrote: “If the log-likelihood-ratio values output by the system were perfectly calibrated, then recalibrating them would (theoretically) result in the same values, i.e., y = x.” We included the parenthetical “theoretically” because, in practice, even if the log-likelihood-ratio values were perfectly calibrated, the overfitting of PAV to real data would result in differences between the PAV-transformed log-likelihood-ratio values (the y values) and the original pre-PAV log-likelihood-ratio values (the x values).

The Cllr value based on the calibration data and the Cllr value based on the validation data will differ because of sampling variability, but this is not a problem.
These two values are not compared with each other; only the latter is presented as a metric of accuracy. The same would be true for other metrics of accuracy such as false-alarm rate and miss rate in a classification framework. The problem lies in both training and testing on the validation data, and overtraining on the validation data, then comparing a measure of the accuracy of the resulting system (Cllr^min) with a measure of the accuracy of the system that will actually be used in the case (the Cllr value based on the calibrated system and validation data). Cllr^min characterizes the performance of a system that included PAV calibration trained and tested on the validation data. Since this is not the system that will actually be used to compare the questioned-source specimen and known-source sample in the case, Cllr^min is not informative about the performance of the system that will actually be used in the case.

One would not usually use the non-parametric PAV as the actual calibration model because it would overfit its training data and tend not to generalize well to new data. One would usually deliberately choose a parsimonious parametric model that would be a less good fit for its training data but tend to generalize better to new data. Linear discriminant analysis (LDA) and logistic regression (LogReg) are examples of parsimonious models that could be used – they both result in a linear mapping between scores and log likelihood ratios. A linear mapping requires the estimation of only two parameter values. LogReg is usually preferred over LDA because it does not depend on as strong assumptions – it is more robust when the data deviate from being Gaussians with the same variance. One could potentially use non-linear, but still monotonic, models that would require estimating only a few more parameter values.

The effects of sampling variability and of overfitting would be reduced for very large data sets, but in forensic practice the amount of case-relevant data available is usually relatively small.
Conclusion
Forensic-evaluation systems should be calibrated using a parsimonious parametric calibration model trained using calibration data, and should then be tested using validation data. The calculation of PAV-based ostensive metrics of degree of calibration involves both training and testing on the validation data. Therefore, for a system that has already been calibrated using a parsimonious calibration model, what the PAV-based metrics are measuring is not degree of calibration. What they are measuring is sampling variability between the calibration and validation data, and the difference in fit between a parsimonious parametric model and an overfitted non-parametric model.

In the next section we support the theoretical argument made in the present section by presenting demonstrations based on Monte Carlo simulations.
Demonstrations
Following Vergeer et al. [11], we present demonstrations based on simulated data. By specifying the population distributions, we can compare empirical results with expected results. By generating multiple Monte Carlo samples, we can explore effects due to sampling variability.
Perfectly calibrated systems
Assume a Gaussian population distribution for the same-source scores and a Gaussian population distribution for the different-source scores, and assume that the two Gaussians have the same variance.10 Fitting an LDA model would result in a score-to-log-likelihood-ratio mapping function that is linear, i.e., has the equation λ = b₁s + b₀, in which s is a score value (which has the form of a log-likelihood-ratio value), λ is the corresponding calibrated natural-log-likelihood-ratio value,11 and b₀ and b₁ are the intercept and slope. As shown in Eqs. (2), (3), (4), the value of the slope (b₁) depends on the separation of the same-source mean and the different-source mean (μ_s and μ_d) relative to their shared variance (σ²), and the intercept (b₀) depends on the location of the midpoint between the same-source mean and the different-source mean.

$\lambda = b_1 s + b_0$ (2)

$b_1 = \frac{\mu_s - \mu_d}{\sigma^2}$ (3)

$b_0 = -b_1 \frac{\mu_s + \mu_d}{2}$ (4)

If, for example, for the Monte Carlo population distributions we specify μ_d = 3, μ_s = 6, and σ = 1 (Fig. 2(a)), the score-to-log-likelihood-ratio mapping function will be λ = 3s − 13.5 (Fig. 2(b)). Hence, the parameter values for the transformed distributions will be μ_λd = −4.5, μ_λs = 4.5, and σ_λ = 3 (Fig. 2(c)). If we recalibrate these values, the recalibration mapping function (the log-likelihood-ratio-to-log-likelihood-ratio function) will be λ′ = λ (Fig. 2(d)), i.e., the log likelihood ratios of the calibrated log likelihood ratios equal the calibrated log likelihood ratios.
Fig. 2
(a): Monte Carlo population distributions, μ_d = 3, μ_s = 6, σ = 1. (b): Score-to-log-likelihood-ratio mapping function corresponding to (a). (c): Distributions of (a) after transformation using the mapping function in (b), μ_λd = −4.5, μ_λs = 4.5, σ_λ = 3. (d): Log-likelihood-ratio-to-log-likelihood-ratio mapping function corresponding to (c).
In general, once one has ascertained σ² (the common variance for the same-source and different-source scores) and μ_s and μ_d (the means of the same-source and different-source scores) one knows everything about the distributions of the log-likelihood-ratios of the calibrated system: The calibrated standard deviation will be σ_λ = (μ_s − μ_d)/σ,12 and the calibrated means will be located symmetrically about 0 with a separation of σ_λ², i.e., μ_λs = +σ_λ²/2 and μ_λd = −σ_λ²/2 (Peterson et al. [30] §4.9; Good [24]; van Leeuwen & Brümmer [18]). Fig. 3 shows examples of distributions for perfectly calibrated systems with different values for σ_λ.
Fig. 3
Examples of distributions for perfectly calibrated systems with different values for σ_λ.
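The relationships in Eqs. (2)–(4), and the symmetry of the calibrated means about 0, can be checked numerically. The sketch below uses the worked values from Fig. 2 (different-source mean 3, same-source mean 6, common standard deviation 1); the function name is illustrative.

```python
def lda_calibration(mu_s, mu_d, sigma):
    """Slope and intercept of the linear score-to-log-LR mapping implied
    by equal-variance Gaussian score distributions (Eqs. (2)-(4))."""
    b1 = (mu_s - mu_d) / sigma**2
    b0 = -b1 * (mu_s + mu_d) / 2.0
    return b1, b0

mu_d, mu_s, sigma = 3.0, 6.0, 1.0
b1, b0 = lda_calibration(mu_s, mu_d, sigma)  # → (3.0, -13.5)

sigma_llr = b1 * sigma         # = (mu_s - mu_d)/sigma = 3
mu_llr_s = b1 * mu_s + b0      # = +4.5
mu_llr_d = b1 * mu_d + b0      # = -4.5
# Calibrated means sit symmetrically at +/- sigma_llr**2 / 2:
assert mu_llr_s == sigma_llr**2 / 2.0
assert mu_llr_d == -sigma_llr**2 / 2.0
```

The assertions confirm that the transformed distributions satisfy the perfect-calibration condition: means at ±σ_λ²/2.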
Monte Carlo simulations
Assume a situation in which the feature data for each of the calibration set and the validation set allow us to generate 50 same-source scores and 1225 different-source scores (the latter number being the number of cells in the upper triangle, excluding the diagonal, of a 50 × 50 matrix). Further assume that the Monte Carlo populations consist of Gaussians with means μ_d = 3 and μ_s = 6, both with the same standard deviation σ = 1 (Fig. 2(a)). We draw Monte Carlo samples consisting of 50 same-source scores and 1225 different-source scores. We draw one sample as a calibration set and one sample as a validation set. We use the calibration set to train a calibration model, apply the calibration model to the validation set, then calculate Cllr for the resulting calibrated log-likelihood-ratio values. We recalibrate the calibrated log-likelihood-ratio values, both training and testing the recalibration model on the calibrated log-likelihood-ratio values, then calculate Cllr for the recalibrated log-likelihood-ratio values. Hereinafter, we refer to the latter as Cllr^recal, which equals Cllr^min if the recalibration model is PAV.13

We repeat this process 10,000 times, and each time:

We compare Cllr for the calibrated log-likelihood-ratio values with the expected value given the Monte Carlo population parameters, i.e., we calculate Cllr − Cllr^pop.

We compare Cllr for the calibrated log-likelihood-ratio values with the recalibrated value, i.e., we calculate Cllr − Cllr^recal. If the recalibration model is PAV, Cllr − Cllr^recal = Cllr − Cllr^min = Cllr^cal.

We calculate devPAV, and, for comparison purposes, devLDA and devLogReg. The latter were calculated in the same way as devPAV, but, rather than using the PAV-derived log-likelihood-ratio-to-recalibrated-log-likelihood-ratio mapping function, the LDA- or LogReg-derived log-likelihood-ratio-to-recalibrated-log-likelihood-ratio mapping function was used instead.14

Cllr − Cllr^pop is the perfect metric of degree of calibration, but it is not a practical metric: It can only be calculated when one has oracle knowledge of the population distributions, i.e., in the context of Monte Carlo simulations.
Rather than an analytical solution for Cllr^pop (which, depending on the population distributions, may not exist), we obtain a Monte Carlo approximation by drawing a sample of 500,000 same-source score values and 500,000 different-source score values, then for each score value we calculate the corresponding log-likelihood-ratio value given the Monte Carlo population models, and finally we calculate Cllr for those log-likelihood-ratio values.15

We compared the following four combinations of calibration and recalibration models: LDA-LDA, LDA-PAV, LogReg-LogReg, and LogReg-PAV.16

We repeated the entire process using different Monte Carlo population distributions: Reflecting a pattern seen for empirical score distributions (for examples, see §5 of Morrison & Poh [31]), we used a same-source distribution that has a negative skew (a heavy left tail), see Fig. 4. We generated this based on a Gumbel distribution, see Eq. (5), in which f_Gum is the probability density function for a Gumbel distribution, r_Gum is a function that generates random numbers based on a Gumbel distribution with the specified parameter values, and μ_s and σ have the same values as previously used for the Gaussian same-source distribution.
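The Monte Carlo approximation of Cllr^pop described above can be sketched as follows for the all-Gaussian configuration (μ_d = 3, μ_s = 6, σ = 1), for which the population score-to-log-likelihood-ratio mapping is λ = 3s − 13.5; the cllr() helper re-implements Eq. (1).

```python
import math
import random

def cllr(ss_llrs, ds_llrs):
    """Log-likelihood-ratio cost, Eq. (1); inputs are natural-log LRs."""
    ss = sum(math.log2(1.0 + math.exp(-l)) for l in ss_llrs) / len(ss_llrs)
    ds = sum(math.log2(1.0 + math.exp(l)) for l in ds_llrs) / len(ds_llrs)
    return 0.5 * (ss + ds)

def pop_llr(s):
    """Population score-to-log-LR mapping for mu_d=3, mu_s=6, sigma=1."""
    return 3.0 * s - 13.5

random.seed(42)
N = 500_000
ss = [pop_llr(random.gauss(6.0, 1.0)) for _ in range(N)]
ds = [pop_llr(random.gauss(3.0, 1.0)) for _ in range(N)]
est = cllr(ss, ds)
print(round(est, 3))  # ≈ 0.240, cf. the Results section
```

Because the mapping uses the true population parameters rather than sample estimates, the resulting value approximates Cllr^pop, free of calibration error and (for large N) of sampling variability.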
Fig. 4
Monte Carlo population distributions: a Gaussian distribution for different-source scores and a skewed distribution for same-source scores.
The Matlab code used to run these simulations is available at http://geoff-morrison.net/#no_cal_metric. The code can be modified to explore other settings, including changing the separation between the same-source and different-source distributions and changing the sample size.
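For concreteness, the calibrate-on-one-sample, apply-to-another step can be sketched as below, using an equal-variance Gaussian (LDA-equivalent) score-to-log-likelihood-ratio mapping. This is a hedged illustration: the parameter values and names are ours, not taken from the paper's Matlab code.

```python
import numpy as np

rng = np.random.default_rng(7)

def lda_calibrate(cal_ss, cal_ds):
    """Train a linear score-to-log-LR mapping under the assumption that
    same-source and different-source scores are Gaussian with equal variance
    (pooled-variance estimate). Returns a function: score -> natural-log LR."""
    m1, m0 = cal_ss.mean(), cal_ds.mean()
    v = (np.sum((cal_ss - m1) ** 2) + np.sum((cal_ds - m0) ** 2)) \
        / (len(cal_ss) + len(cal_ds) - 2)
    a = (m1 - m0) / v
    b = -a * (m1 + m0) / 2.0
    return lambda s: a * s + b

# Draw a calibration sample and a separate validation sample
# (illustrative populations: N(2, 1) same-source, N(0, 1) different-source).
cal_ss, cal_ds = rng.normal(2, 1, 50), rng.normal(0, 1, 1225)
val_ss, val_ds = rng.normal(2, 1, 50), rng.normal(0, 1, 1225)

to_llr = lda_calibrate(cal_ss, cal_ds)
llr_ss, llr_ds = to_llr(val_ss), to_llr(val_ds)

# Cllr of the calibrated validation log likelihood ratios.
cllr_val = 0.5 * (np.mean(np.log2(1 + np.exp(-llr_ss)))
                  + np.mean(np.log2(1 + np.exp(llr_ds))))
print(round(cllr_val, 3))
```

Because the calibration model is trained on one sample and tested on another, cllr_val varies from draw to draw; that variation is the sampling variability discussed in the results below.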
Results
For the system with Gaussian distributions for both different-source and same-source scores, the value of Cllr^pop was 0.240. For the system with a Gaussian distribution for different-source scores and a skewed distribution for same-source scores, the value of Cllr^pop was 0.461.

Fig. 5, Fig. 6, Fig. 7 summarize the Cllr − Cllr^pop, Cllr − Cllr^recal, and devPAV/devLDA/devLogReg values resulting from the Monte Carlo simulations. The figures have been formatted such that all violin plots within a figure have the same area.
Fig. 5
Violin plots for Cllr − Cllr^pop given: (a) Gaussian distributions for both different-source and same-source scores; (b) Gaussian distribution for different-source scores and a skewed distribution for same-source scores.
Fig. 6
Violin plots for Cllr − Cllr^recal given: (a) Gaussian distributions for both different-source and same-source scores; (b) Gaussian distribution for different-source scores and a skewed distribution for same-source scores.
Fig. 7
Violin plots for devPAV given: (a) Gaussian distributions for both different-source and same-source scores; (b) Gaussian distribution for different-source scores and a skewed distribution for same-source scores.
Discussion
Cllr − Cllr^pop and Cllr − Cllr^recal results
As shown in Fig. 5(a), when both the same-source and different-source Monte Carlo population distributions were Gaussian, for both LDA- and LogReg-calibration models, Cllr − Cllr^pop values were centred around 0 with a slight positive skew. LDA, a parametric model whose assumptions were met by the population distributions, had a slightly tighter distribution than did LogReg, which was fitted using an iterative algorithm. The variation of the Cllr − Cllr^pop values about 0 reflects sampling variability of both the calibration data and the validation data.

Sampling variability also accounts for the spread of the values when the recalibration model was PAV, but those values were not centred around 0: their distribution was substantially lower. The PAV-recalibrated Cllr values, i.e., Cllr^min values, tended to be lower than Cllr^pop. The reason for this is overfitting as a result of using a minimally-constrained non-parametric model that was both trained and tested on the validation data. In this example, the LDA-calibrated and LogReg-calibrated Cllr values were on average closer to Cllr^pop, i.e., the LDA-calibrated and LogReg-calibrated log-likelihood-ratio values were on average closer to the “true” log-likelihood-ratio values obtained using oracle knowledge of the Monte Carlo population distributions, i.e., the LDA-calibrated and LogReg-calibrated log-likelihood-ratio values were better calibrated than the PAV-recalibrated log-likelihood-ratio values.

As shown in Fig. 5(b), when the different-source Monte Carlo population distribution was Gaussian but the same-source Monte Carlo population distribution was skewed, for both LDA- and LogReg-calibration models, Cllr − Cllr^pop values were usually higher than 0. The results were not as well calibrated as when the models’ assumptions were met by the population distributions.
LogReg, which is not as sensitive as LDA to deviations from Gaussian distributions with the same variance, had a somewhat tighter distribution than did LDA, i.e., LogReg-calibrated log-likelihood-ratio values were somewhat better calibrated than LDA-calibrated log-likelihood-ratio values. A model with a few more parameters, fitting a non-linear (but still monotonic) mapping function, would potentially lead to a better degree of calibration. Any potential improvement in degree of calibration would have to be traded off against the danger of overfitting on the calibration data and then not generalizing well. As before, PAV overfitted the validation data, and the PAV-recalibrated Cllr values, i.e., Cllr^min values, tended to be lower than Cllr^pop.

The results shown in Fig. 5 indicate that the values of the Cllr^cal metric will tend to be larger than the corresponding values of the perfect metric of degree of calibration, Cllr − Cllr^pop, i.e., the value of Cllr^pop − Cllr^min will tend to be positive. This is demonstrated in the rightmost violin plots of Fig. 5, for which the values of Cllr^min − Cllr^pop tended to be negative. Cllr^cal would therefore tend to indicate a poorer degree of calibration than is actually the case.

As shown in Fig. 6, when calibrating and recalibrating using the same type of model, LDA-LDA and LogReg-LogReg, there was a spread in the distribution of the Cllr − Cllr^recal values. This was due to sampling variability between the calibration data and the validation data. The spread was narrower in Fig. 6(a), for which the Monte Carlo population distributions met the assumptions of the LDA-LDA models (both the same-source and different-source distributions were Gaussians with the same variance). In Fig. 6(a), the distributions for LDA-LDA were slightly narrower than for LogReg-LogReg. In Fig. 6(b), for which the different-source Monte Carlo population distribution was Gaussian but the same-source Monte Carlo population distribution was skewed, the spreads in the distributions of the Cllr − Cllr^recal values for both LDA-LDA and LogReg-LogReg were wider than in Fig. 6(a). In Fig. 6(b), the spread for LogReg-LogReg was less than for LDA-LDA, LogReg being more robust to violations of the assumption of Gaussians with the same variance.

In Fig. 6, the distributions of Cllr − Cllr^recal for LDA-LDA were close to symmetrical about 0, with only a very slight positive skew. It appears that both training and testing the LDA recalibration model on the validation data did not lead to substantial overfitting. This is an advantage of a parsimonious parametric model. For LogReg-LogReg, there was a positive skew to the distributions. This reflects some overfitting due to both training and testing the LogReg recalibration model on the validation data.

For the LDA-LDA and LogReg-LogReg models, is Cllr − Cllr^recal an indicator of degree of calibration? If it were, would we not expect to see a shift in the Cllr − Cllr^recal values in Fig. 6(b) similar to the shift in the Cllr − Cllr^pop values in Fig. 5(b)? We argue that what Cllr − Cllr^recal reflects is not degree of calibration but sampling variability between the calibration and validation data, and, more so for LogReg-LogReg, some overfitting on the validation data.

For LDA-PAV and LogReg-PAV, for which the recalibration models were PAV, Cllr − Cllr^recal = Cllr^cal. The distributions of the Cllr^cal values were substantially greater than 0, and substantially greater than the Cllr − Cllr^recal distributions for LDA-LDA and LogReg-LogReg. As previously discussed in the context of the Cllr − Cllr^pop results, the substantially larger values of Cllr^cal were due to the minimally-constrained non-parametric PAV model being both trained and tested on the validation data and overfitting the validation data. Due to this overfitting, Cllr^cal tends to indicate a poorer degree of calibration than is actually the case. Due to this overfitting, we argue that Cllr^cal is not a meaningful metric of degree of calibration for systems that have already been calibrated using a parsimonious parametric model.

Ferrer et al. [33] similarly observed that PAV overfitted on small data sets.
Rather than calculating Cllr^cal using PAV, they calculated an alternative version using LogReg, i.e., the same as our Cllr − Cllr^recal for the LogReg-LogReg models. Data sets that are considered “small” in the automatic-speaker-recognition literature (to which Ferrer et al. [33] belongs) may be larger than the case-relevant data sets typically available in forensic casework.
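The overfitting effect described above is easy to reproduce: train and test a PAV (isotonic-regression) recalibration on the same set of already-calibrated log-likelihood-ratio values drawn from a perfectly calibrated system, and the recalibrated cost comes out below the original cost, so the difference is positive even though nothing is miscalibrated. A Python sketch with illustrative parameters (equal numbers of same-source and different-source values, so that the PAV posterior odds equal the likelihood-ratio odds; the paper's own code is Matlab):

```python
import numpy as np

rng = np.random.default_rng(3)

def pav(y):
    """Pool-adjacent-violators: non-decreasing fit to the 0/1 labels y,
    which must already be ordered by ascending score."""
    blocks = []  # each block: [mean, count]
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m, c = blocks.pop()
            pm, pc = blocks[-1]
            blocks[-1] = [(pm * pc + m * c) / (pc + c), pc + c]
    return np.concatenate([np.full(c, m) for m, c in blocks])

# A perfectly calibrated Gaussian log-LR system (natural logs, illustrative
# separation): LLR | same-source ~ N(+1, 2), LLR | different-source ~ N(-1, 2).
n = 200
llr_ss = rng.normal(1.0, np.sqrt(2.0), n)
llr_ds = rng.normal(-1.0, np.sqrt(2.0), n)

cllr = 0.5 * (np.mean(np.log2(1 + np.exp(-llr_ss)))
              + np.mean(np.log2(1 + np.exp(llr_ds))))

# Train AND test PAV on the same (validation) data.
scores = np.concatenate([llr_ss, llr_ds])
labels = np.concatenate([np.ones(n), np.zeros(n)])
order = np.argsort(scores)
lab = labels[order]
p = pav(lab)  # isotonic posterior for each item
# With equal class counts, the log2-loss of the posteriors gives the
# PAV-recalibrated cost directly.
cllr_min = 0.5 * (np.mean(-np.log2(p[lab == 1]))
                  + np.mean(-np.log2(1 - p[lab == 0])))
cllr_cal = cllr - cllr_min
print(round(cllr, 3), round(cllr_min, 3), round(cllr_cal, 3))
```

Because PAV minimizes the cost over all monotonic transforms of the scores, the recalibrated value can never exceed the original value, so this difference is non-negative by construction and, with continuous scores, almost always strictly positive.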
devPAV results
By design, devPAV values are greater than or equal to zero.17 Otherwise, the results in Fig. 7 show the same relative pattern as the results in Fig. 6. In particular, paralleling Cllr^cal compared to Cllr − Cllr^recal for LDA-LDA and LogReg-LogReg, the devPAV distributions (for LDA-PAV and LogReg-PAV) had larger values than the devLDA and devLogReg distributions (for LDA-LDA and LogReg-LogReg). The larger values for devPAV were due to the minimally-constrained non-parametric PAV model being both trained and tested on the validation data, and its overfitting the validation data to a greater extent than did the parsimonious parametric models. Due to this overfitting, we argue that devPAV is not a meaningful metric of degree of calibration for systems that have already been calibrated using a parsimonious parametric model.
Selected Vergeer et al. (2021) results
Vergeer et al. [11] included a comparison of the behaviour of different calibration metrics given perfectly calibrated Monte Carlo population distributions consisting of Gaussians with the same variance. The values used for the separation between the means of the same-source and different-source log-likelihood-ratio distributions were 2, 12, 22, and 34, and the same-source sample sizes used were 50, 100, and 300.18 Vergeer et al. [11] did not present results for the full factorial of these combinations. We replicated this portion of Vergeer et al. [11], and our Fig. 8, Fig. 9, Fig. 10 show the full factorial of results for the Cllr − Cllr^pop, Cllr^cal, and devPAV distributions respectively. The arguments we make below could have been based on results already presented in Vergeer et al. [11], but examining the full factorial makes the pattern of results more obvious.
Fig. 8
Violin plots for Cllr − Cllr^pop given a range of separations between the means of the same-source and different-source log-likelihood-ratio distributions, and sample sizes of (a) 50, (b) 100, and (c) 300.
Fig. 9
Violin plots for Cllr^cal given a range of separations between the means of the same-source and different-source log-likelihood-ratio distributions, and sample sizes of (a) 50, (b) 100, and (c) 300.
Fig. 10
Violin plots for devPAV given a range of separations between the means of the same-source and different-source log-likelihood-ratio distributions, and sample sizes of (a) 50, (b) 100, and (c) 300.
The Cllr^pop values for separations of 2, 12, 22, and 34 were 0.710, 0.155, 0.038, and 0.007 respectively. For the perfect metric of degree of calibration, Cllr − Cllr^pop, the distributions shown in Fig. 8 were centred around 0. The spread of the distributions is due to sampling variability. As the size of the samples increased, from panel (a) through panel (c), the spread of the distributions decreased. This is the expected effect of increasing sample size on sampling variability. As the separation between the same-source and different-source log-likelihood-ratio values increased, from left to right, the spread of the distributions also decreased. This is because, as the separation increased, both the Cllr values and the Cllr^pop values decreased, and thus the magnitude of the difference between them decreased.

The distributions of Cllr^cal values shown in Fig. 9 exhibited the same pattern of spread as the Cllr − Cllr^pop values, but, in addition, absolute values decreased as sample size increased and as the separation between the same-source and different-source log-likelihood-ratio values increased.

The distributions of devPAV values shown in Fig. 10 exhibited a different pattern: As the size of the samples increased, from panel (a) through panel (c), the spread of the distributions decreased, but as the separation between the same-source and different-source log-likelihood-ratio values increased, from left to right, the spread of the distributions increased rather than decreased. Also in contrast to the Cllr^cal values, as the separation increased, the average devPAV values increased rather than decreased.
The increase in spread as the separation between the same-source and different-source log-likelihood-ratio values increased may be due to the fact that devPAV is only calculated for log-likelihood-ratio values in the range from the smallest same-source value to the largest different-source value. As the separation increases, this range will decrease, making the amount of data on which devPAV is calculated smaller, and thus making devPAV more sensitive to sampling variability. The increase in average devPAV values as the separation increased may also be related to the decrease in the range over which devPAV is calculated: since devPAV only takes positive values, an increase in the spread of those values will be correlated with an increase in their average value.

Given that all the Monte Carlo population distributions were perfectly calibrated, a good metric of degree of calibration should have had the same average value for all the different population distributions. Because this was not the case for either Cllr^cal or devPAV (across the different population distributions the median varied 27-fold for Cllr^cal and 5-fold for devPAV), we argue that neither is a good metric of degree of calibration.19
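The non-invariance argument can be checked numerically: for perfectly calibrated Gaussian log-likelihood-ratio populations, the median of Cllr − Cllr^min (with PAV trained and tested on the sample itself) depends on the separation between the populations, even though every population is perfectly calibrated. A small-scale Python sketch (far fewer replications and illustrative separations, so only the qualitative pattern is meaningful; the paper's own code is Matlab):

```python
import numpy as np

rng = np.random.default_rng(0)

def pav(y):
    """Pool-adjacent-violators: non-decreasing fit to 0/1 labels ordered by score."""
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m, c = blocks.pop()
            pm, pc = blocks[-1]
            blocks[-1] = [(pm * pc + m * c) / (pc + c), pc + c]
    return np.concatenate([np.full(c, m) for m, c in blocks])

def cllr_cal_once(delta, n):
    """One replication: draw n calibrated same-source and n different-source
    log LRs (LLR | ss ~ N(+delta/2, delta), LLR | ds ~ N(-delta/2, delta)),
    return Cllr - Cllr_min with PAV trained and tested on the same sample."""
    llr_ss = rng.normal(delta / 2, np.sqrt(delta), n)
    llr_ds = rng.normal(-delta / 2, np.sqrt(delta), n)
    cllr = 0.5 * (np.mean(np.log2(1 + np.exp(-llr_ss)))
                  + np.mean(np.log2(1 + np.exp(llr_ds))))
    scores = np.concatenate([llr_ss, llr_ds])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    lab = labels[np.argsort(scores)]
    p = pav(lab)
    cllr_min = 0.5 * (np.mean(-np.log2(p[lab == 1]))
                      + np.mean(-np.log2(1 - p[lab == 0])))
    return cllr - cllr_min

# Median over replications, at a small and a large separation.
meds = {d: np.median([cllr_cal_once(d, 50) for _ in range(200)])
        for d in (2.0, 16.0)}
print(meds)
```

Both populations are perfectly calibrated by construction, so a metric that actually measured degree of calibration would give the same average value for both; the medians differ nonetheless.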
Conclusion
All forensic-evaluation systems used in casework should be calibrated. If they are not intrinsically well calibrated, they should include an explicit calibration model.

We have presented an argument that, in the context of casework, PAV-based ostensive metrics of degree of calibration (Cllr^cal and devPAV) are not meaningful metrics of degree of calibration for systems that have already been calibrated using a parsimonious calibration model. We have argued that, in this context, rather than measuring degree of calibration, PAV-based metrics reflect sampling variability between the calibration and validation data and overfitting on the validation data.

The fact that PAV-based ostensive metrics of degree of calibration are not meaningful metrics of degree of calibration in the context of casework is not of concern, however, because a metric of degree of calibration is not required: A decision as to whether a calibration model is appropriate in the context of a case does not require the use of a metric of degree of calibration. A Cllr value greater than 1 and graphical representations would be sufficient to indicate gross miscalibration, which will not occur if an appropriate calibration model has been used. Whether a calibration model is appropriate may (and often can only) be argued on theoretical grounds, and a decision as to whether the calibration data are appropriate is a pre-empirical decision that cannot be informed by a metric of degree of calibration.
Disclaimer
All opinions expressed in the present paper are those of the author, and, unless explicitly stated otherwise, should not be construed as representing the policies or positions of any organizations with which the author is associated.
Declaration of competing interest
The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.