Efficient construction of match strength distributions for uncertain multi-locus genotypes.

Abstract

Natural variation in biological evidence leads to uncertain genotypes. Forensic comparison of a probabilistic genotype with a person's reference gives a numerical strength of DNA association. The distribution of match strength for all possible references usefully represents a genotype's potential information. But testing more genetic loci exponentially increases the number of multi-locus possibilities, making direct computation infeasible. At each locus, Bayesian probability can quickly assemble a match strength random variable. Multi-locus match strength is the sum of these independent variables. A multi-locus genotype's match strength distribution is efficiently constructed by convolving together the separate locus distributions. This convolution construction can accurately collate all trillion trillion reference outcomes in a fraction of a second. This paper shows how to rapidly construct multi-locus match strength distributions by convolution. Function convergence demonstrates that distribution accuracy increases with numerical resolution. Convolution construction has quadratic computational complexity, relative to the exponential number of reference genotypes. A suitably defined random variable reduces high-dimensional computational cost to fast real-line arithmetic. Match strength distributions are used in forensic validation studies. They provide error rates for match results. The convolution construction applies to discrete or continuous variables in the forensic, natural and social sciences. Computer-derived match strength distributions elicit the information inherent in DNA evidence, often overlooked by human analysis.

Keywords: Bioinformatics; Biotechnology; Computational biology; Genetics; Mathematical biosciences; Molecular biology

Year: 2018 PMID: 30310872 PMCID: PMC6176789 DOI: 10.1016/j.heliyon.2018.e00824

Source DB: PubMed Journal: Heliyon ISSN: 2405-8440

Introduction

Forensic identification is the science of match. When two objects have identical features, their match statistic increases with the rarity of those features. Feature uncertainty or dissimilarity reduces the match statistic. This balance between numerator similarity and denominator surprise is the likelihood ratio (LR), first used forensically for glass evidence [1]. Match strength puts the LR on a logarithmic scale, enabling the addition of independent evidence factors [2]. Deoxyribonucleic acid (DNA) testing of biological evidence generates laboratory data from multiple genetic loci. Simple locus data from an unambiguous DNA reference can give a definite genotype having one value. But DNA evidence usually produces complex data. Such data leads to an uncertain locus genotype that assigns probability to a hundred possible values. Statistical comparison of an inferred evidence genotype with a definite reference genotype calculates LR match strength. This LR numerically divides evidence genotype probability by population probability, both evaluated at the reference genotype. Adding together independent locus log(LR) values yields the total match strength. An evidence genotype's match strength is mathematically determined at every reference point before any comparison is made. The distribution of match strength values gives insight into genotype uncertainty. A definite genotype concentrates all its probability at maximal strength for the one matching reference. An entirely uninformative genotype collapses to zero match strength. Most genotypes fall in between these two extremes, often showing a bell-shaped distribution of match strength along the real line. For references unlikely to have contributed to the evidence, the uncertain genotype's match strength distribution is centered left of zero (Fig. 1). For references likely to have contributed to evidence, the contributor distribution is mainly positive (Fig. 2).

Fig. 1

Non-contributor distribution. (Cumulative) An uncertain genotype's cdf for non-contributor random variable X shows cumulative probability (y-axis) relative to logarithmic match strength (x-axis). (Probability) At bin resolution ε ban, the corresponding pmf gives the probability in each bin.

Fig. 2

Contributor distribution. (Cumulative) An uncertain genotype's cdf for contributor random variable Y shows cumulative probability (y-axis) relative to logarithmic match strength (x-axis). (Probability) At bin resolution ε ban, the corresponding pmf gives the probability in each bin.

Match strength distributions have broad application in forensic science. Non-contributor distributions have been graphed as Tippett plots [3] to assess data quality and compare interpretation methods [4]. Distribution curves predict DNA database search specificity [5] and kinship identification power [6]. The distributions provide LR error bounds and tail probabilities [7]. In validation studies, match strength distributions summarize the sensitivity and specificity of statistical methods for interpreting DNA mixtures [8, 9].

There are exponentially many multi-locus genotypes. Listing all combinations, with one value from each locus, forms the multi-locus possibilities. A dozen loci generate a trillion trillion possible genotype outcomes. Brute force LR comparison of an uncertain genotype with all these reference possibilities is not feasible. Instead, Monte Carlo simulation samples representative match strengths [4, 10]. Branch-and-bound [11] and importance sampling [5] algorithms can improve simulation performance in some applications. But the genotype space grows exponentially with additional locus tests, and sampling is inexact.

There is an analogous combinatorial explosion in probability theory. When tossing a coin n times, there are $2^n$ possible outcomes of head (H) and tail (T) sequences. One additional toss doubles the number of H-T sequences.
But the interesting information concerns the number of heads, not where in the sequence these heads occur. With n tosses, there are just n+1 counting results: 0, 1, 2, …, or n heads. A random variable (RV) [12] summarizes the exponential $2^n$ number of experiment outcomes as a linear n+1 number of informative results. The binomial probability Binom(k; n, ½) of getting k heads in n tosses of a fair coin is $\binom{n}{k} / 2^n$. The binomial distribution is constructed as a sum of independent coin tosses [13]. Convolving the distribution of n tosses with another toss forms the expanded binomial distribution Binom(k; n+1, ½) for n+1 tosses [14]. Convolution shifts and adds lists of numbers, producing (for example) the binomial coefficients of Pascal's triangle [15].

In this paper, the concepts of RV and convolution are used to efficiently construct match strength distributions for multi-locus evidence genotypes. The Methods introduce evidence genotypes and their uncertainty. The match strength RV arises from genotype probability functions, and is efficiently constructed by convolution. The function convergence of binned distributions helps demonstrate their accuracy. Mixing genotype distributions to form composites accelerates validation studies.

The Results lend empirical support using an uncertain genotype derived from a 10% DNA mixture component (Materials). The genotype's match strength distribution is constructed for one locus, and then convolved across many loci. Distribution accuracy is assessed by function convergence at increasingly fine bin resolutions. Efficiency is measured by timing different stages of distribution construction. Genotype sample space size is compared with the number of bin intervals. Bin event occupancy explains why the convolution construction works efficiently. Composite distributions can speed up validation. Match strength error rates are instantly calculated from (single or composite) genotype distributions.
The Conclusions discuss the general applicability of these match strength convolution methods for handling genotype uncertainty.
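
The coin-toss analogy can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names `convolve` and `binomial_pmf` are invented here, and a pmf is held as a plain list indexed by the number of heads.

```python
def convolve(f, g):
    """Discrete convolution of two lists of numbers (shift, multiply and add)."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h


def binomial_pmf(n):
    """pmf of the number of heads in n fair tosses, built one toss at a time."""
    toss = [0.5, 0.5]   # one toss: Pr{0 heads}, Pr{1 head}
    pmf = [1.0]         # zero tosses: certainly 0 heads
    for _ in range(n):
        pmf = convolve(pmf, toss)
    return pmf


# Four tosses give five counting results, not 2**4 = 16 sequences.
print(binomial_pmf(4))
```

Each added toss lengthens the list by one entry, while the number of underlying H-T sequences doubles.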

Methods

Multi-locus genotype

DNA is the linear information molecule that encodes cellular function in a four-letter nucleic acid alphabet [16]. The three billion-letter genome sequence differs between people, with greater genetic similarity in more closely related individuals. Two complete genome copies, maternal and paternal, reside in the nucleus of most human cells. When people deposit their biological material, they can be identified through their DNA.

A short tandem repeat (STR) locus is a highly polymorphic marker that accentuates DNA differences at a particular chromosomal location [17]. STR alleles vary by length, based on the number of tandemly repeated short DNA words. A typical STR locus used in human identification has about 15 different length variants. A genotype at a locus is a pair of (maternal and paternal) alleles. With n = 15 alleles, there are n(n+1)/2 = 120, or roughly 100, unordered locus allele pairs. These $10^2$ allele pairs form the possible locus genotype values.

Forensic scientists sample from L = 10 to 25 autosomal STR loci from genetically independent locations across 22 chromosomes [18]. There are roughly $(10^2)^L = 10^{2L}$ possible multi-locus genotype values. Even a dozen (L = 12) loci provide a trillion trillion ($10^{2 \cdot 12} = 10^{24}$) possible genotypes, far more genetic bar codes than the seven billion ($< 10^{10}$) people on earth, and thus useful for forensic identification. The number of genotype values $10^{2L}$ grows exponentially with the number of tested loci L.

The human population is a small ($10^{10}$) sampling of multi-locus genotype values from the full ($10^{24}$) set of genotype possibilities. These genotype values follow a non-uniform population probability distribution based on locus allele frequencies [19]. This population distribution corresponds to the prior probability of a genotype, before observing phenotypic STR data.
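
The counting argument above can be checked with a short sketch. The function names are hypothetical, and the round figure of 100 genotypes per locus is the text's approximation, not an exact count.

```python
def locus_genotype_count(n_alleles):
    """Number of unordered allele pairs, homozygotes included: n(n+1)/2."""
    return n_alleles * (n_alleles + 1) // 2


def multilocus_genotype_count(per_locus, n_loci):
    """Number of multi-locus genotypes when every locus has `per_locus` values."""
    return per_locus ** n_loci


print(locus_genotype_count(15))   # exact pair count for 15 alleles
for n_loci in (1, 6, 12, 25):
    # Using the rounded figure of 100 genotypes per locus.
    print(n_loci, multilocus_genotype_count(100, n_loci))
```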

Genotype uncertainty

Multiplex STR data can be generated for L loci in a single tube from one biological specimen. A molecular biology laboratory extracts DNA molecules from the specimen, amplifies the STR alleles using polymerase chain reaction, and detects the relative amount of fluorescently-labeled alleles by DNA size separation [20]. With abundant intact DNA from one person (e.g., a reference sample), the observed allele events directly correspond to the person's genotype. With most DNA evidence, however, the STR data can support multiple genotype explanations. Having different explanations leads to genotype uncertainty. The uncertainty can arise from mixtures of two or more contributors to a DNA specimen [21], damaged or small amounts of DNA [22], or reconstructed genotypes from relatives [23]. Bayesian probability [24] can model STR mixture data as a weighted linear combination of contributor genotypes [25]. A robust probability model [26] accounts for variance and nuisance parameters from the laboratory experiment (e.g., stutter, imbalance, decay). Markov chain Monte Carlo (MCMC) numerically solves high dimensional probability models through statistical sampling [27]. The result is genotype separation [28], producing a posterior genotype probability distribution for each person who contributed DNA to the biological specimen.

Genotype probability functions

Genotype uncertainty can be expressed in the standard mathematical language of probability, RVs and their distributions [10, 29]. Let $\omega_l$ be a genotype allele pair value for one person at one locus l. Then $\omega = (\omega_1, \ldots, \omega_L)$ is a person's multi-locus genotype value composed of allele pairs at all L loci. Sample space Ω is the set of all genotype outcomes ω for one contributor to DNA evidence.

There are natural probability measures on Ω. Prior probability p(ω) is the chance of observing genotype ω before examining evidence, based on population probability. Function p maps Ω into the unit interval $I = [0,1]$, a subset of the real numbers $\mathbb{R}$. STR data introduces a likelihood function λ from Ω into $\mathbb{R}$, where λ(ω) is the conditional probability of observing the data, given a genotype ω ∈ Ω.

Posterior probability q(ω) is the chance that a contributor has genotype ω, after observing the STR data. Function q maps genotypes Ω into interval $I$. This probability mass function (pmf) is calculated from Bayes theorem as q(ω) ∝ λ(ω) · p(ω), the normalized product of likelihood and prior.

The LR of genotype ω ∈ Ω is the posterior-to-prior probability ratio q(ω)/p(ω) [30]. Bayes theorem can re-express q(ω)/p(ω) as a ratio of two likelihoods: the chance λ(ω) of observing the data assuming genotype ω, versus the total data probability when the genotype is unknown [31]. The logarithm of the LR, or the “weight of evidence” [2], measures the strength of association between the evidence genotype and reference ω, relative to coincidence. This match strength is a real-valued function s from genotypes Ω into numbers $\mathbb{R}$, where s(ω) is $\log_{10}[q(\omega)/p(\omega)]$. The number resides on an additive scale in base ten “ban” units.
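
A minimal sketch of these probability functions follows. The three-genotype locus and all of its numbers are made up for illustration, and `posterior` and `match_strength` are hypothetical helper names, not part of any described software.

```python
import math


def posterior(prior, likelihood):
    """Bayes theorem: q(w) is the normalized product likelihood(w) * prior(w)."""
    joint = {w: likelihood[w] * prior[w] for w in prior}
    total = sum(joint.values())   # total data probability, genotype unknown
    return {w: joint[w] / total for w in joint}


def match_strength(prior, post):
    """s(w) = log10(q(w) / p(w)) in ban, where both probabilities are positive."""
    return {w: math.log10(post[w] / prior[w])
            for w in prior if prior[w] > 0 and post[w] > 0}


# Hypothetical locus with three genotype values (illustrative numbers only).
prior = {"10,10": 0.5, "10,11": 0.3, "11,11": 0.2}
likelihood = {"10,10": 0.1, "10,11": 0.6, "11,11": 0.3}

post = posterior(prior, likelihood)
strength = match_strength(prior, post)
for w in prior:
    print(w, round(post[w], 4), round(strength[w], 4))
```

In this toy case the LR q(ω)/p(ω) equals λ(ω) divided by the total data probability, which is the likelihood-ratio form described in the text.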

Match strength distributions

The power set $\mathcal{P}(\Omega)$ contains all subsets of the finite genotype set Ω. $\mathcal{P}(\Omega)$ is a sigma field [12], closed under set union, intersection and complement. When assigning prior measure p, the triple $(\Omega, \mathcal{P}(\Omega), p)$ [29] forms a prior probability space for Ω. The non-contributor RV X is a function from Ω to $\mathbb{R}$, where genotype ω is mapped into match strength s(ω). The non-contributor distribution function $F_X(x)$ is the set probability Pr{X < x} [12]. This cumulative distribution function (cdf) gives the prior probability p of the genotype subset {ω ∈ Ω | X(ω) < x} having match strength s(ω) less than x ban. A typical non-contributor cdf $F_X$ is shown in Fig. 1.

A partition of the real interval [a, b] is a finite sequence of real numbers $a = x_0 < x_1 < \cdots < x_N = b$ [32]. For any bin resolution ε > 0, the bin set $B_\varepsilon$ is a set of subintervals $[x_n, x_n + \varepsilon)$ of equal length ε covering [a, b], where endpoints $x_n = a + n \cdot \varepsilon$ form a regular partition. For convenience, let a, b and 1/ε be integers. The bin function $\beta: \mathbb{R} \to B_\varepsilon$ maps a real number x into a subinterval β(x) denoted by the bin's left endpoint $x_n$. Alternatively, β can form ε-sized subintervals $[x_n - \varepsilon/2, x_n + \varepsilon/2)$ centered at $x_n$ midpoints.

The probability mass function (pmf) $f_X^{\varepsilon}$ is a discrete density function with bin resolution ε. Evaluated at bin $x_n$, $f_X^{\varepsilon}$ has value $F_X(x_{n+1}) - F_X(x_n)$, the chance $\Pr\{x_n \le X < x_{n+1}\}$ that match strength X falls in bin $x_n$. A typical non-contributor pmf is shown in Fig. 1.

The posterior probability space $(\Omega, \mathcal{P}(\Omega), q)$ assigns posterior measure q to genotype set Ω. The contributor RV Y maps this genotype probability space Ω into $\mathbb{R}$ via the match strength function s. The cdf $F_Y(x) = \Pr\{Y < x\}$ is the contributor distribution function. The discrete pmf $f_Y^{\varepsilon}$ maps a match strength bin $x_n$ of $B_\varepsilon$ into its genotype set probability $\Pr\{x_n \le Y < x_{n+1}\}$. Fig. 2 shows a typical contributor cdf $F_Y$ and its pmf $f_Y^{\varepsilon}$.
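
The two bin conventions and the pmf-to-cdf relationship can be sketched as follows. The function names are illustrative assumptions; bins are identified by their integer index n, so that bin n has endpoint (or center) a + n·ε.

```python
import math


def left_bin_index(x, a, eps):
    """Index n of the bin [x_n, x_n + eps) containing x, with x_n = a + n*eps."""
    return math.floor((x - a) / eps)


def centered_bin_index(x, a, eps):
    """Index n of the bin [x_n - eps/2, x_n + eps/2) containing x."""
    return math.floor((x - a) / eps + 0.5)


def cdf_from_pmf(pmf):
    """Given pmf[n] = Pr{x_n <= X < x_(n+1)}, return cdf[n] = Pr{X < x_n}.

    The result has one more entry than the pmf, so that
    pmf[n] == cdf[n + 1] - cdf[n].
    """
    cdf = [0.0]
    for mass in pmf:
        cdf.append(cdf[-1] + mass)
    return cdf


a, eps = -2, 0.25
x = -1.6215
print(a + left_bin_index(x, a, eps) * eps)       # left endpoint of x's bin
print(a + centered_bin_index(x, a, eps) * eps)   # nearest bin center to x
print(cdf_from_pmf([0.25, 0.5, 0.25]))
```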

Convolution construction

At one locus l, constructing the locus pmf $f_{X_l}^{\varepsilon}$ for the non-contributor RV $X_l$ is straightforward. For each genotype $\omega_l$ in the small finite set $\Omega_l$, match strength $s(\omega_l)$ is calculated. Function s is defined for $\omega_l$ whenever $p(\omega_l) > 0$ and $q(\omega_l) > 0$. The $s(\omega_l)$ number resides in bin $x_n = \beta(s(\omega_l))$, for some integer n. Genotype $\omega_l$'s prior probability amount $p(\omega_l)$ is added to bin $x_n$. Binning the $(\beta(s(\omega_l)), p(\omega_l))$ pairs for all genotypes $\omega_l \in \Omega_l$ forms non-contributor locus pmf $f_{X_l}^{\varepsilon}$. Fig. 3 shows a locus pmf construction for the genotype rows of Table 1.

Fig. 3

Table 1

Constructing non-contributor pmf at locus CSF1PO with bin resolution ε = 1/4 ban. For each locus genotype ω row, the columns show the numbered allele pair, prior p(ω) and posterior q(ω) genotype probabilities, LR q(ω)/p(ω), logarithmic match strength s(ω), and the rounded bin's center point β(s(ω)). Rows are ordered by ascending bin.

Genotype	Allele pair	Prior	Posterior	LR	Strength	Bin
1	Many	0.1561	0.0000	0.0000	−2.0000	−2.00
2	8 12	0.0209	0.0002	0.0096	−2.0000	−2.00
3	9 12	0.0188	0.0004	0.0239	−1.6215	−1.50
4	12 12	0.0894	0.0028	0.0313	−1.5042	−1.50
5	11 13	0.0295	0.0018	0.0627	−1.2028	−1.25
6	12 13	0.0332	0.0015	0.0451	−1.3460	−1.25
7	11 11	0.0703	0.0080	0.1138	−0.9439	−1.00
8	7 10	0.0171	0.0029	0.1718	−0.7651	−0.75
9	11 12	0.1586	0.0236	0.1485	−0.8282	−0.75
10	10 13	0.0299	0.0114	0.3805	−0.4197	−0.50
11	10 10	0.0725	0.0946	1.3045	0.1154	0.00
12	10 12	0.1610	0.2743	1.7037	0.2314	0.25
13	10 11	0.1428	0.5785	4.0517	0.6076	0.50

Locus construction. Constructing non-contributor pmf $f_{X_l}^{\varepsilon}$ for locus CSF1PO at bin resolution ε = 1/4 ban. Each numbered colored bar represents the (β(s(ω)), p(ω)) strength-probability pair for the corresponding locus genotype ω in Table 1. For a genotype match strength s(ω), prior probability amount p(ω) is added to match strength bin β(s(ω)). Bar colors show the first (blue) and second (green) bin events. Accumulating binned probability values over all genotypes builds the locus pmf.

The total match strength $X = X_1 + \cdots + X_L$ is the sum of the independent locus match strengths $X_l$, since independent LR factors multiply and their logarithms add. From elementary probability theory [13], the pmf of a sum of L independent RVs is the L-fold convolution of their individual pmfs, $f_X^{\varepsilon} = f_{X_1}^{\varepsilon} * \cdots * f_{X_L}^{\varepsilon}$. Convolution is a fast way of smoothing one function f with another function g to form a new function h = f * g [33], as shown in Fig. 4.

Fig. 4

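Function convolution. Convolving a jagged function f (top, dark blue) with a blurring function g (middle, green) to form a smooth function h (bottom, light blue). Here f is a partial pmf convolution of four loci, g is a binomial distribution (n = 10, p = 0.5), and h = f * g is their convolution. The bin resolution is ε ban.

Sequential convolution constructs pmf $f_X^{\varepsilon}$ by adding one locus at a time. Write $X_{1:K} = X_1 + \cdots + X_K$ for the match strength of the first K loci. The first locus has pmf $f_{X_1}^{\varepsilon}$. After constructing K loci, $f_{X_{1:K}}^{\varepsilon}$ is extended by convolution with locus pmf $f_{X_{K+1}}^{\varepsilon}$ to form the multi-locus $f_{X_{1:K+1}}^{\varepsilon}$. That is,

$$f_{X_{1:K+1}}^{\varepsilon} = f_{X_{1:K}}^{\varepsilon} * f_{X_{K+1}}^{\varepsilon}.$$

Convolving all L loci constructs $f_X^{\varepsilon}$. The cumulative sum of $f_X^{\varepsilon}$ is cdf $F_X^{\varepsilon}$.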
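
The sequential construction can be sketched as below. This is a toy illustration, not the implementation described in the Materials: pmfs are held as dictionaries keyed by integer bin index n (match strength n·ε), the two locus pmfs are invented numbers, and all function names are hypothetical.

```python
def convolve(f, g):
    """Convolve two pmfs held as {bin_index: probability} dictionaries."""
    h = {}
    for i, fi in f.items():
        for j, gj in g.items():
            # Match strengths (bin indices) add; probabilities multiply.
            h[i + j] = h.get(i + j, 0.0) + fi * gj
    return h


def multilocus_pmf(locus_pmfs):
    """Sequentially convolve locus pmfs, adding one locus at a time."""
    total = {0: 1.0}   # zero loci: all probability at match strength 0
    for locus in locus_pmfs:
        total = convolve(total, locus)
    return total


def cumulative(pmf):
    """Running probability of all occupied bins up to and including bin n."""
    running, out = 0.0, {}
    for n in sorted(pmf):
        running += pmf[n]
        out[n] = running
    return out


# Two hypothetical locus pmfs (illustrative numbers only).
locus_a = {-1: 0.5, 1: 0.5}
locus_b = {-2: 0.25, 0: 0.75}
joint = multilocus_pmf([locus_a, locus_b])
print(sorted(joint.items()))
print(cumulative(joint))
```

Only occupied bins are stored, so the work at each step depends on the number of bins rather than on the number of multi-locus genotypes.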

Distribution convergence

Distribution function $F_X^{\varepsilon}$ becomes more exact with smaller ε (Fig. 5). Genotype set Ω is finite, so there is a smallest match strength distance d between genotypes. At resolution ε = d/2 ban, $F_X^{\varepsilon}$ has at most one genotype event ω in each bin interval. Thus the binned and exact distributions both fully resolve the events, assuring eventual convergence of $F_X^{\varepsilon}$ to the limit $F_X$.

Fig. 5

Bin resolution. Locus CSF1PO cdf is shown for increasingly fine bin resolution values. For illustration, the resolutions are set at $\varepsilon = 2^{-k}$ ban for k = 0, 1, 2, 3.

The largest vertical difference $\sup_x |F(x) - G(x)|$ between two functions F and G is the standard supremum norm [32]. As ε decreases, this function distance between binned cdfs at successive resolutions measures the Cauchy convergence [32] of $F_X^{\varepsilon}$ to $F_X$.
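
The supremum distance can be illustrated on a toy example. Everything here is an assumption made for the sketch: the four events are invented, their strengths are multiples of 1/8 ban so that ε = 1/8 bins land exactly on them, bins use left endpoints, and the paper's own procedure for measuring function distance is not reproduced.

```python
import math


def binned_events(events, eps):
    """Move each (strength, probability) event to the left endpoint of its eps-bin."""
    return [(math.floor(s / eps) * eps, p) for s, p in events]


def cdf_at(events, x):
    """Pr{X < x} for a finite list of (strength, probability) events."""
    return sum(p for s, p in events if s < x)


def sup_distance(events_f, events_g):
    """Largest vertical gap between two left-continuous step cdfs.

    Both cdfs are constant between jump points, so it is enough to compare
    them at every jump point of either function, plus one point to the right.
    """
    points = {s for s, _ in events_f} | {s for s, _ in events_g} | {math.inf}
    return max(abs(cdf_at(events_f, x) - cdf_at(events_g, x)) for x in points)


# Hypothetical locus events: (match strength in ban, prior probability).
events = [(-1.625, 0.2), (-0.75, 0.3), (0.125, 0.1), (0.5, 0.4)]

distances = [sup_distance(binned_events(events, 2.0 ** -k), events)
             for k in range(5)]
for k, d in enumerate(distances):
    print(f"eps = 2^-{k}: distance = {d:.3f}")
```

In this toy case the distance shrinks as ε is halved and reaches zero once every event sits on a bin endpoint.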

Composite distribution

Combining a set of genotype probability distributions produces a new aggregate distribution. A composite mixture distribution $F_X$ averages together N individual evidence distributions [34]. Suppose $F_{X_n}$ is the distribution function of the n-th individual genotype RV $X_n$. Equally weighting individual genotype components, the composite RV X has the mixture cdf $F_X(x) = \frac{1}{N} \sum_{n=1}^{N} F_{X_n}(x)$. A composite mixture pmf is similarly formed as $f_X^{\varepsilon} = \frac{1}{N} \sum_{n=1}^{N} f_{X_n}^{\varepsilon}$ from individual genotype pmfs $f_{X_n}^{\varepsilon}$.
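
An equally weighted composite can be sketched by averaging pmfs bin by bin. The dictionary representation, the function name `composite_pmf` and the two input pmfs are assumptions made for illustration.

```python
def composite_pmf(pmfs):
    """Equal-weight mixture of pmfs stored as {bin_index: probability}."""
    n = len(pmfs)
    mix = {}
    for pmf in pmfs:
        for b, mass in pmf.items():
            mix[b] = mix.get(b, 0.0) + mass / n
    return mix


# Two hypothetical genotype pmfs on a shared bin grid.
definite = {0: 1.0}
uncertain = {0: 0.5, 1: 0.5}
print(composite_pmf([definite, uncertain]))
```

Because each input sums to one, the equally weighted average also sums to one.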

Materials

Statistical software

The fully Bayesian TrueAllele® Casework system (Cybergenetics, Pittsburgh, PA) separates STR mixture data to produce a genotype for each DNA contributor [25]. Genotype uncertainty is represented as prior and posterior probability. The computer constructs non-contributor X and contributor Y match strength distributions by convolution, draws their pmf curves, and calculates tail probabilities. It summarizes genotype information, providing a Kullback-Leibler (KL) statistic E[Y], the expected contributor match strength, that predicts LR values [35]. TrueAllele can compare a separated evidence genotype with a reference genotype, relative to a population, to calculate LR match strength. The match module accounts for population co-ancestry via its coefficient θ. The match calculation can substitute one population prior for another by Bayesian rearrangement. Locus log(LR) values are bounded below by –2 ban, based on validation studies [36].

STR mixture data

Two-person STR mixture data was available from a previous study [21]. The samples were amplified using PowerPlex® 16 (Promega, Madison, WI), a multiplex kit containing 15 independent STR locus tests. Readout from an ABI Prism® 310 (Applied Biosystems, Foster City, CA) capillary sequencer produced .fsa electronic data files [37]. The population frequencies used were from the FBI's expanded Caucasian allele database [38]. This study used data from the ten 250 pg samples. The mixture ratios were 1:9 (B3, F3, I3, M3), 3:7 (C3, E3, J3, L3), and 5:5 (D3, K3). The results here focus on the minor M3 12.67% component. It contained 30 pg of DNA (12% of 250 pg), amounting to 5 cells (6 pg DNA per cell). The non-overlapping minor data peak heights were all under 50 relative fluorescent units (RFU). The minor genotype had a KL of 7.8364 ban. Comparison with the known reference gave log(LR) values of 5.7291 (θ = 0) and 5.4989 (θ = 0.01) ban.

Results

Locus binning construction

Fig. 3 steps through the single locus construction of non-contributor pmf $f_{X_l}^{\varepsilon}$. Table 1 lists 13 genotype rows at locus CSF1PO for mixture sample M3's minor genotype. Each row shows the genotype variable's prior and posterior probabilities, and the posterior-to-prior LR, with its base ten match strength logarithm. The interval partition uses centered bins, rounding log(q/p) match strengths to the nearest $x_n$ point at resolution ε = 1/4.

The first table row represents genotypes having zero posterior probability, putting a total 0.1561 prior probability dose into bin –2 (Fig. 3, blue bar 1). The second row for allele pair 8,12 adds more probability p(8,12) = 0.0209 to the same bin –2 (green bar 2). For the third genotype 9,12, the LR is 0.0239, which has log strength –1.6215, corresponding to the centered bin –1.5 representing subinterval [–1.5 – ε/2, –1.5 + ε/2). This genotype deposits a prior probability of p(9,12) = 0.0188 into bin –1.5 (blue bar 3). Genotype 12,12's log(LR) is match strength s(12,12) = –1.5042, and so its prior probability p(12,12) = 0.0894 is added to bin –1.5 (green bar 4). Genotype binning of prior probability into match strength bins continues until all 13 values have been added to form pmf $f_{X_l}^{\varepsilon}$ (all bars).
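
The binning walk-through can be replayed directly from the Strength and Prior columns of Table 1. This is a sketch of the described procedure with centered bins at ε = 1/4 ban; the names `TABLE_1` and `locus_pmf` are made up for the example.

```python
import math

EPS = 0.25  # bin resolution in ban, as in Table 1

# (match strength s, prior probability p) for the 13 rows of Table 1
TABLE_1 = [
    (-2.0000, 0.1561), (-2.0000, 0.0209), (-1.6215, 0.0188),
    (-1.5042, 0.0894), (-1.2028, 0.0295), (-1.3460, 0.0332),
    (-0.9439, 0.0703), (-0.7651, 0.0171), (-0.8282, 0.1586),
    (-0.4197, 0.0299), (0.1154, 0.0725), (0.2314, 0.1610),
    (0.6076, 0.1428),
]


def locus_pmf(rows, eps):
    """Add each genotype's prior probability to its nearest bin center."""
    pmf = {}
    for strength, prior in rows:
        center = math.floor(strength / eps + 0.5) * eps
        pmf[center] = pmf.get(center, 0.0) + prior
    return pmf


pmf = locus_pmf(TABLE_1, EPS)
for center in sorted(pmf):
    print(f"{center:6.2f}  {pmf[center]:.4f}")
```

The computed bin centers agree with the Bin column of Table 1, and the thirteen rows collapse into nine occupied bins.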

Multi-locus convolution

Fig. 6 shows the sequential convolution of individual locus pmfs $f_{X_l}^{\varepsilon}$ to form the multi-locus pmf $f_X^{\varepsilon}$. The first row is for locus D2S1338 of M3's minor genotype. Locus pmf $f_{X_1}^{\varepsilon}$ is shown on a focused [–2, 2] ban locus-level scale (left), and also on a broader [–10, 5] ban multi-locus scale (right). The bin resolution is 1/4 ban.

Fig. 6

Sequential convolution. Sequential convolution builds pmf $f_X^{\varepsilon}$ at four loci. The left column shows individual locus pmf $f_{X_K}^{\varepsilon}$ bar charts for locus K = 1, 2, 3, 4. The right column shows K-fold partially convolved pmf $f_{X_{1:K}}^{\varepsilon}$ bar charts for K = 1, 2, 3, 4, where $X_{1:K} = X_1 + \cdots + X_K$. The convolution process combines partial convolution $f_{X_{1:K}}^{\varepsilon}$ (right column, row K), with locus pmf $f_{X_{K+1}}^{\varepsilon}$ (left, row K+1), to extend (green arrows) the multi-locus convolution to $f_{X_{1:K+1}}^{\varepsilon}$ (right, row K+1). The bin resolution is ε = 1/4 ban.

The second row adds a second locus TPOX. On the left is locus pmf $f_{X_2}^{\varepsilon}$. The right plot convolves $f_{X_1}^{\varepsilon}$ (above) with $f_{X_2}^{\varepsilon}$ (left) to form (light green arrows) the multi-locus pmf $f_{X_{1:2}}^{\varepsilon}$ (right). The two locus combination shows more locus pair genotype events (as bars) for $f_{X_{1:2}}^{\varepsilon}$ than for either of the single genotype locus pmfs $f_{X_1}^{\varepsilon}$ or $f_{X_2}^{\varepsilon}$.

The third row shows pmf $f_{X_3}^{\varepsilon}$ for locus D3S1358 on the left. Combining with $f_{X_{1:2}}^{\varepsilon}$ (above right) forms (green arrows) the convolution $f_{X_{1:3}}^{\varepsilon}$ (right). The triple locus combined pmf is jagged, but now developing shape.

The fourth row combines the FGA locus pmf $f_{X_4}^{\varepsilon}$ (left) with $f_{X_{1:3}}^{\varepsilon}$ (above right) to form (dark green arrows) the quadruple convolution $f_{X_{1:4}}^{\varepsilon}$ (right). Convolving more loci has made this pmf smoother than its multi-locus precursors (right column). Each non-contributor locus $X_K$ adds exclusionary power, pushing $f_{X_{1:K}}^{\varepsilon}$ further to the left (right column).

Adding more loci to match strength continues these trends (Fig. 7). With five loci, $f_{X_{1:5}}^{\varepsilon}$ has a unimodal shape (green). At ten loci, a smooth bell-shaped curve emerges for $f_{X_{1:10}}^{\varepsilon}$, further shifted to the left (blue). Combining all fifteen loci, $f_X^{\varepsilon}$ shows the distribution of match strength for non-contributor multi-locus genotypes (black). Increasing convolution with more loci smooths the curve, pushing the pmf leftward toward greater exclusionary power.

Fig. 7

Further convolution. Sequential convolution incrementally constructs the multi-locus pmf $f_{X_{1:K}}^{\varepsilon}$ as K increases from 5 (green) to 10 (blue) to 15 (black) loci. The bin resolution is ε ban.

Cumulative distribution convergence

Locus cdf $F_{X_l}^{\varepsilon}$ is the cumulative sum of locus pmf $f_{X_l}^{\varepsilon}$. As ε decreases, cdf $F_{X_l}^{\varepsilon}$ converges to $F_{X_l}$. Fig. 5 shows this convergence for the minor M3 genotype at locus CSF1PO. Setting $\varepsilon = 2^{-k}$ ban, increasingly fine resolutions discretize cdf $F_{X_l}^{\varepsilon}$ for k = 0, 1, 2, 3. Moving from ε = 1 to ε = 1/2 ban refines the partitioning of function $F_{X_l}^{\varepsilon}$ on the interval [–2, 1] (Fig. 5, k = 0, 1). Further cdf step function refinement continues with ε = 1/4 ban, corresponding to the pmf histogram binning shown in Fig. 3. Resolutions beyond ε = 1/8 ban do not change $F_{X_l}^{\varepsilon}$, so $F_{X_l}^{\varepsilon}$ has converged to the limit distribution $F_{X_l}$.

Multi-locus cdf $F_X^{\varepsilon}$ combines the individual locus distributions through the match strength sum $X = X_1 + \cdots + X_L$. Fig. 8 shows $F_X^{\varepsilon}$ for a series of increasingly fine bin resolutions $\varepsilon = 10^{-k}$ ban, as k progresses from 0 to 3. At k = 0, ε is 1 ban, and the step function has clear one ban increments. At k = 1, ε is 1/10 ban, and the steps of $F_X^{\varepsilon}$ are still visible. Once k = 2, the ε = 1/100 bin resolution is no longer visible for $F_X^{\varepsilon}$. Beyond that resolution, as shown for ε = 1/1000 ban at k = 3, $F_X^{\varepsilon}$ looks the same as $F_X$.

Fig. 8

Distribution resolution. Joint cdf $F_X^{\varepsilon}$ is shown at increasingly fine bin resolutions. For illustration, the resolutions are set at $\varepsilon = 10^{-k}$ ban for k = 0, 1, 2 and 3.

Binned distribution accuracy

The convergence of binned functions $F_X^{\varepsilon}$, as resolution k increases, measures their accuracy. Since there are finitely many genotypes, $F_X^{\varepsilon}$ must eventually reach the distribution limit $F_X$. The goal is a bin resolution ε that provides sufficient accuracy in reasonable time. The maximum probability difference between $F_X^{\varepsilon}$ at bin resolution $\varepsilon = 10^{-k}$ ban, and the cdf at $\varepsilon = 10^{-6}$ ban, was measured for the minor M3 genotype. These function distances are listed in Table 2 for non-contributor X and contributor Y match strengths. As bin resolution k becomes finer, the cdf differences get smaller.

Table 2

	X		Y
Resolution k	Difference	Time (sec)	Difference	Time (sec)
0	4.0542E-02	2.2753E-03	1.3910E-01	2.2050E-03
1	1.8200E-03	3.4260E-03	1.6871E-02	3.3593E-03
2	8.2153E-04	5.2983E-03	3.5183E-03	5.2863E-03
3	8.5332E-05	9.4160E-03	1.9434E-04	9.0350E-03
4	2.8206E-06	1.1621E-01	1.8599E-05	1.0425E-01
5	1.1386E-06	1.6800E+00	3.2775E-06	1.6754E+00
6		3.4960E+01		3.1200E+01

Accuracy and efficiency at different bin resolutions $\varepsilon = 10^{-k}$ ban. The accuracy of non-contributor cdf $F_X^{\varepsilon}$ is measured by its maximum probability difference from the micro-ban resolution ($\varepsilon = 10^{-6}$ ban) cdf. Efficiency is measured by the computer time (sec) needed to construct $F_X^{\varepsilon}$. Accuracy and efficiency are also shown for contributor distribution $F_Y^{\varepsilon}$.

Fig. 9 plots the non-contributor cdf differences on a logarithmic scale (blue cross line). The negative linear slope indicates exponential improvement with increasing k. At $\varepsilon = 10^{-2}$ ban, the maximum cdf difference is $8.215 \times 10^{-4}$, or under one in a thousand (Table 2). With k = 3 and $\varepsilon = 10^{-3}$ ban, the difference is $8.533 \times 10^{-5}$, or under one in ten thousand. At either resolution, k = 2 or k = 3, the probability error is negligible.

Fig. 9

Accuracy vs. efficiency. Assessing joint cdf $F_X^{\varepsilon}$ accuracy and efficiency for bin resolutions $\varepsilon = 10^{-k}$ ban, where k = 0, 1, …, 6. Accuracy is logarithmically plotted (blue cross) as the maximum cdf difference between $F_X^{\varepsilon}$ and the $\varepsilon = 10^{-6}$ ban cdf for increasing resolution k. Efficiency is logarithmically plotted (red plus) as the time (sec) computing $F_X^{\varepsilon}$ for increasing k.

Computer calculation time is shown in Table 2, for both X and Y. The non-contributor times for constructing distribution $F_X^{\varepsilon}$ are plotted on a logarithmic scale in Fig. 9 (red plus line) as k varies. For k = 0, 1, 2 and 3, the time is under 1/100 sec (Table 2). That time increases to over 1/10 sec for k ≥ 4. A practical choice of bin resolution is thus at k = 3 for $\varepsilon = 10^{-3}$ ban, used in the remainder of this paper, where the probability function deviation is under $10^{-4}$ and the computer time is under $10^{-2}$ sec.

Computational complexity analysis

The divide-and-conquer convolution algorithm for computing $F_X^{\varepsilon}$ has quadratic computational complexity $O(L^2)$ in the number of tested loci L. There are three main algorithmic steps.

(a) Constructing each locus pmf function $f_{X_l}^{\varepsilon}$ uses a fixed bin resolution ε for a relatively constant number of locus genotypes $\Omega_l$. So each locus function incurs a constant O(1) construction cost. Across L loci, the cost adds up to O(L).

(b) The pmf of the first K < L loci is sequentially convolved with the (K+1)-th locus pmf to form the pmf of the first K+1 loci. This pairwise convolution combines O(K) bins from the first K loci, with O(1) bins from the next locus, to augment the number of bins to O(K+1). The O(K) stepwise cost is bounded above by O(L). Iterating over L loci, the cost tallies to $O(L^2)$.

(c) Cumulative summation of pmf $f_X^{\varepsilon}$ to form cdf $F_X^{\varepsilon}$ visits all bins. After convolving L loci, there are O(L) bins. So there is an O(L) summation cost.

Since step (b) dominates the formation cost, the process has quadratic cost $O(L^2)$.
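
The three step costs can be tallied in one line. The constants $c_a$, $c_b$ and $c_c$ are hypothetical per-step costs introduced only for this sketch; they stand for the O(1) build cost per locus, the per-bin convolution cost, and the per-bin summation cost.

```latex
\[
T(L) \;=\; \underbrace{\sum_{l=1}^{L} c_a}_{\text{(a) build}}
\;+\; \underbrace{\sum_{K=1}^{L-1} c_b \, K}_{\text{(b) convolve}}
\;+\; \underbrace{c_c \, L}_{\text{(c) sum}}
\;=\; c_a L \;+\; c_b \, \frac{L(L-1)}{2} \;+\; c_c L
\;=\; O(L^2).
\]
```

The arithmetic series in the convolution term is what makes the total quadratic rather than linear.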

Empirical efficiency measurements

Empirical timings on the minor M3 genotype concur with the algorithmic complexity analysis. The computing times for building the locus pmfs $f_{X_l}^{\varepsilon}$, and convolving them to form the multi-locus mass functions $f_{X_{1:K}}^{\varepsilon}$ of the first K loci, are listed in Table 3 for both non-contributor X and contributor Y. The locus breakdown gives the incremental costs of sequentially constructing $f_X^{\varepsilon}$.

Table 3

	X		Y
Locus	Build (sec)	Convolve (sec)	Build (sec)	Convolve (sec)
1	2.705E-04	2.229E-04	1.979E-04	1.206E-05
2	3.478E-04	4.609E-04	4.347E-04	1.402E-04
3	2.838E-04	6.287E-04	9.809E-04	2.394E-04
4	2.937E-04	8.052E-04	3.696E-04	3.753E-04
5	1.249E-04	9.678E-04	1.449E-04	4.980E-04
6	1.551E-04	1.294E-03	1.921E-04	6.309E-04
7	2.306E-04	1.604E-03	3.400E-04	8.046E-04
8	2.629E-04	1.939E-03	3.378E-04	1.004E-03
9	1.527E-04	2.225E-03	2.997E-04	1.186E-03
10	2.455E-04	2.564E-03	2.604E-04	1.404E-03
11	2.734E-04	2.996E-03	4.023E-04	1.692E-03
12	4.453E-04	3.596E-03	9.856E-04	2.053E-03
13	3.719E-04	4.118E-03	6.317E-04	2.401E-03
14	1.326E-04	4.544E-03	2.488E-04	2.691E-03
15	4.421E-04	5.128E-03	8.595E-04	3.098E-03

Locus efficiency breakdown. At each incremental locus step, there is a time cost (sec) for building the locus pmf, and a cost for convolving with the preceding convolved loci. These timings are shown for the non-contributor X and contributor Y distributions.

Fig. 10 plots the computing times for building the locus pmfs (blue cross) and convolving the multi-locus pmfs (red plus). The locus build time is relatively constant (blue cross line & Table 3). The multi-locus convolution time increases linearly with each additional locus K (red plus line & Table 3). A constant locus build time across L loci has O(L) cost, while a linearly increasing multi-locus convolution time for L loci has $O(L^2)$ cost.

Fig. 10

Building vs. convolving. Computing time (sec) for the pmf at each locus l, as locus number increases. The timings are shown in two parts, building (blue cross line) a locus pmf $f_{X_l}^{\varepsilon}$, and convolving (red plus line) a partial joint pmf of the preceding loci with a new locus to form the augmented pmf. The bin resolution is ε ban.

Construction method comparison

The quadratic number of convolved bin events is far less than the exponential number of multi-locus genotypes. Table 4 lists the numbers of single genotypes #Ω and multi-locus genotypes #Ω×…×Ω considered at each locus K for M3's minor genotype. The straight line (red plus) plotted on Fig. 11's logarithmic scale shows the exponential rise of multi-locus genotype counts. At 15 loci, there are 1.355 × 10²³ genotypes. Constructing a match strength distribution directly from the genotype sample space Ω can be exponentially expensive.

Table 4

Locus	Genotypes (#Ω)	Product (#Ω×…×Ω)	Bins (locus)	X (occupied bins)	Y (occupied bins)
1	11	1.100E+01	8	8	8
2	45	4.950E+02	40	310	303
3	39	1.931E+04	37	5,104	4,985
4	56	1.081E+06	49	10,323	9,851
5	13	1.405E+07	12	13,474	12,864
6	23	3.232E+08	22	17,081	16,105
7	40	1.293E+10	37	20,595	19,516
8	46	5.948E+11	42	24,036	22,231
9	25	1.487E+13	21	27,189	25,279
10	36	5.353E+14	36	30,653	28,099
11	55	2.944E+16	46	36,453	33,304
12	83	2.444E+18	77	41,675	38,446
13	60	1.466E+20	56	45,347	42,030
14	12	1.759E+21	11	48,550	45,233
15	77	1.355E+23	69	52,406	48,941

Fig. 11

Bins vs. genotypes. Counting computation size with increasing locus number. Shown is the number (blue cross line) of ε-bins in bin set B for the real interval [a, b] after processing K loci. Also shown is the number (red plus line) of genotype K-tuples in the locus product after processing K loci. The bin resolution is ban.

Counting genotypes and numeric bins. Shown at each incremental locus step K are the number of locus genotypes #Ω and the number of partial product genotypes #Ω×…×Ω. Also shown are the number of locus bins, and the number of occupied partial multi-locus bins for the non-contributor X and contributor Y distributions.

The convolution method instead used discrete bins on a bounded interval [−35, 35], with resolution ε = 10⁻³ ban. Bins are filled by genotype events at each locus (Table 4, Bins column), and expanded with each multi-locus convolution step (X & Y columns). The multi-locus bin growth is linear (Fig. 11, blue cross), as seen by the logarithmic curve plotted on a logarithmic scale. At 15 loci, 52,406 of the total 70,000 bins were occupied. Clearly, 7 × 10⁴ numeric bins are far fewer than 1.355 × 10²³ multi-locus genotypes.
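The two counts being compared can be reproduced with a few lines of arithmetic. This sketch takes the per-locus genotype counts from Table 4 and the interval and resolution from the text; the variable names are illustrative assumptions.

```python
# Number of genotypes considered at each locus (Table 4, Genotypes column).
genotypes_per_locus = [11, 45, 39, 56, 13, 23, 40, 46, 25, 36, 55, 83, 60, 12, 77]

# Bins on the bounded interval [a, b] at milliban resolution (from the text).
a, b = -35, 35
bins_per_ban = 1000  # eps = 10^-3 ban
total_bins = (b - a) * bins_per_ban

product = 1
for k, count in enumerate(genotypes_per_locus, start=1):
    product *= count  # exact integer arithmetic, no overflow in Python
    print(f"{k:2d} loci: {product:.3e} genotype tuples")

print(f"total bins on [{a}, {b}]: {total_bins:,}")
```

The genotype product multiplies by a new factor at every locus, whereas the number of bins is fixed by the interval width and the resolution.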

Match strength bin occupancy

Numeric convolution operates on one-dimensional bins, not multi-dimensional genotypes. For the M3 minor genotype mixture data, 10²³ genotypes were represented in 10⁵ bins. This space efficiency stems from using an RV X that reduces the exponential genotype space Ω to a bounded interval on the real line. The convolution operation layers quantitative real-valued locus pmfs atop one another, shifting bins and adding probabilities. Match strength bins are efficiently reused, as measured by bin occupancy (Table 5). On reaching the STR kit's 15 loci, non-contributor RV X had 75% bin occupancy, while contributor Y's bins were 70% occupied.
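The shift-and-add operation can be sketched with sparse bins held in a dictionary. This is a minimal illustration, not the paper's implementation: the locus outcomes are hypothetical, the same toy locus is reused ten times, and the helper names are assumptions.

```python
from collections import defaultdict

EPS = 0.001  # bin width in ban (milliban), as in the text


def to_bins(outcomes):
    """Collect (match_strength_in_ban, probability) pairs into integer bins."""
    pmf = defaultdict(float)
    for strength, prob in outcomes:
        pmf[round(strength / EPS)] += prob
    return dict(pmf)


def convolve(p, q):
    """Shift-and-add: each bin of q shifts p by its index and scales it."""
    out = defaultdict(float)
    for i, pi in p.items():
        for j, qj in q.items():
            out[i + j] += pi * qj
    return dict(out)


# Hypothetical locus data: three genotype outcomes per locus.
locus_outcomes = [(-0.500, 0.7), (0.250, 0.2), (1.000, 0.1)]
locus_pmf = to_bins(locus_outcomes)

joint = {0: 1.0}  # point mass at zero ban before any locus is added
n_loci = 10
for _ in range(n_loci):
    joint = convolve(joint, locus_pmf)

n_tuples = len(locus_outcomes) ** n_loci  # multi-locus genotype tuples
n_occupied = len(joint)                   # occupied one-dimensional bins
print(f"{n_tuples:,} genotype tuples fall into {n_occupied} occupied bins")
```

In this toy case many different genotype tuples sum to the same match strength, so they share a bin and their probabilities simply add.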

Table 5

Loci	Total	X Occupied	X Rate	Y Occupied	Y Rate
1	4,000	8	0.20%	8	0.20%
2	8,000	310	3.88%	303	3.79%
3	14,000	5,104	36.46%	4,985	35.61%
4	18,000	10,323	57.35%	9,851	54.73%
5	22,000	13,474	61.25%	12,864	58.47%
6	26,000	17,081	65.70%	16,105	61.94%
7	30,000	20,595	68.65%	19,516	65.05%
8	34,000	24,036	70.69%	22,231	65.39%
9	38,000	27,189	71.55%	25,279	66.52%
10	42,000	30,653	72.98%	28,099	66.90%
11	50,000	36,453	72.91%	33,304	66.61%
12	58,000	41,675	71.85%	38,446	66.29%
13	62,000	45,347	73.14%	42,030	67.79%
14	66,000	48,550	73.56%	45,233	68.53%
15	70,000	52,406	74.87%	48,941	69.92%

Bin occupancy rate. As loci are incrementally convolved, the total number of numeric bins increases. The number of bins occupied by a genotype event, and the occupancy rate (fraction of total occupied), are shown for the non-contributor X and contributor Y distributions.

Re-convolving the loci a second time shows the bin behavior out to 30 loci (Fig. 12). The respective X (blue cross) and Y (cyan plus) bin occupancy rates remain level beyond 15 loci. On average, each ε = 10⁻³ ban (milliban) bin for X numerically collects 1.936 × 10¹⁸ multi-locus genotypes (i.e., 1.355 × 10²³ genotypes / 7 × 10⁴ bins).

Fig. 12

Bin occupancy. The percentage of occupied bins when increasing from 1 to 30 loci for non-contributor pmf (blue cross line), and contributor pmf (cyan plus line). The bin resolution is ban.

Composite genotype distribution

A composite mixture distribution aggregates multiple genotypes into one combined distribution. For example, validation studies can examine a system's specificity as a histogram of evidence genotype match strength [8, 9]. A costly Monte Carlo approach compares each uncertain genotype against thousands of reference genotypes to calculate their match statistics, and then bins the results and collates a composite histogram. Numerical aggregation of match strength distributions can be far more efficient. TrueAllele separated the 250 pg two-person samples (B3–M3) into their component genotypes. The ten minor contributor genotypes were combined into a composite distribution. Each genotype's match strength cdf was computed at high resolution (ε = 10⁻³ ban), and the cdfs were then averaged into an aggregate distribution. Taking one-ban cdf differences constructed a pmf histogram. Fig. 13 shows the resulting non-contributor specificity (red left) and contributor sensitivity (blue right) histograms. The average time to construct all genotype cdfs, form a composite mixture, and construct a histogram was 0.703 sec.
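The aggregation steps (average the fine-resolution cdfs, then difference at one-ban intervals) can be sketched as follows. The two pmfs are hypothetical, the grid uses 4 bins per ban rather than 1,000 to keep the example readable, and equal weighting of genotypes is an assumption.

```python
def pmf_to_cdf(pmf):
    """Running sum of a pmf given as a list over consecutive fine bins."""
    cdf, total = [], 0.0
    for p in pmf:
        total += p
        cdf.append(total)
    return cdf


def composite_cdf(cdfs):
    """Equal-weight average of several cdfs defined on the same bin grid."""
    n = len(cdfs)
    return [sum(values) / n for values in zip(*cdfs)]


def coarse_histogram(cdf, bins_per_ban):
    """Difference the cdf at one-ban steps to obtain a coarse pmf histogram."""
    edges = cdf[bins_per_ban - 1::bins_per_ban]  # cdf at the end of each ban
    return [hi - lo for lo, hi in zip([0.0] + edges[:-1], edges)]


# Hypothetical data: two genotype pmfs on a shared grid of 3 ban, 4 bins per ban.
BINS_PER_BAN = 4
pmf_a = [0.10, 0.10, 0.20, 0.10, 0.20, 0.10, 0.10, 0.05, 0.05, 0.0, 0.0, 0.0]
pmf_b = [0.0, 0.0, 0.05, 0.05, 0.10, 0.10, 0.20, 0.10, 0.20, 0.10, 0.05, 0.05]

mixture = composite_cdf([pmf_to_cdf(pmf_a), pmf_to_cdf(pmf_b)])
histogram = coarse_histogram(mixture, BINS_PER_BAN)
print([round(h, 4) for h in histogram])
```

Averaging cdfs and then differencing gives the same histogram as averaging the per-genotype histograms, but needs only one pass over each fine-resolution distribution.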

Fig. 13

Composite frequency. Validation plots computed as composite pmfs for non-contributor X specificity (red left histogram), and contributor Y sensitivity (blue right histogram) genotype mixture distributions. The histogram bin resolution shown is ban.

Match strength error frequency

The cumulative distributions of X and Y, and their associated histograms, offer a frequency perspective on match strength. For any evidence genotype, the distributions reveal how frequently a match event would occur at that magnitude, relative to all possible reference genotypes. One application is determining the false positive error rate: how often a non-contributor would adventitiously match as strongly as the defendant [39]. Statisticians call false positive LR error the probability of misleading evidence (PME) [7]. When comparing an evidence genotype with a reference ω yields match strength x = s(ω) ban, the PME is Pr{X ≥ x}, or 1 − Pr{X < x}, which is one minus the non-contributor cdf at x. This value is the tail probability of the non-contributor pmf beyond x.

Comparing M3's minor genotype with its known contributor ω gives an LR of 536 thousand, for a log(LR) of 5.7291. Evaluating the non-contributor cumulative distribution (Fig. 1, Cumulative) at this match strength gives a PME of 1.3662 × 10⁻⁷. Thus, based on the evidence genotype, the chance that a non-contributor matches the DNA evidence as strongly as does the reference is one in 7.32 million.

A validation study's composite distribution can estimate an ensemble PME based on a set of representative genotypes. One may ask how often a suspect's 5.7291 ban match would occur in a 250 pg two-person minor mixture. Binning at 1 ban resolution, the composite non-contributor histogram (Fig. 13, red left) provides an answer. Evaluating the composite cdf at bin 5, or equivalently summing the pmf's right-tail bins, gives an ensemble PME frequency of 1.0367 × 10⁻⁷. This is a one in 9.65 million validation probability estimate that the evidence would match a non-contributor as strongly as it does the suspect.
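The tail-probability calculation can be sketched on a binned pmf. The pmf below is a hypothetical toy distribution, and the function name and binning convention are assumptions. The last two lines only take reciprocals of the PME values reported in the text.

```python
def pme(pmf, eps, a, x):
    """Tail probability Pr{X >= x} for a pmf binned at width eps starting at a.

    pmf[i] is the probability that X falls in bin [a + i*eps, a + (i+1)*eps).
    The tail is summed from the bin containing x, so the result is exact
    only to within one bin width.
    """
    start = int((x - a) // eps)
    start = max(0, min(start, len(pmf)))
    return sum(pmf[start:])


# Hypothetical non-contributor pmf on [-2, 2) ban with 1-ban bins.
toy_pmf = [0.60, 0.30, 0.09, 0.01]
toy_pme = pme(toy_pmf, eps=1.0, a=-2.0, x=1.0)  # mass in the last bin only

# Reciprocals of the two PME values reported in the text.
one_in_single = 1 / 1.3662e-7    # single evidence genotype
one_in_ensemble = 1 / 1.0367e-7  # composite validation histogram

print(f"toy PME: {toy_pme}")
print(f"one in {one_in_single / 1e6:.2f} million (single genotype)")
print(f"one in {one_in_ensemble / 1e6:.2f} million (ensemble)")
```

Because the tail is read directly from an already constructed distribution, the error rate for any observed match strength costs only a partial sum over bins.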

Conclusions

Forensic interpretation is an information science [40]. The computer can organize a sample space of possible outcomes, describe the prior probability of each outcome, and compute the outcome's posterior probability from available evidence data. Constructing probability spaces and random variables from these elements provides a detailed match strength analysis of an evidence item. All reference outcomes are accounted for, so no comparison reference is needed at this pre-match stage. The match strength distributions are useful for quantifying potential identification information, preparing database searches, assessing data or methods, performing validation studies, and calculating LR error.

All forensic information should be extracted from evidence data. In the DNA mixture example, the minor contributor contained five cells, with all non-overlapping peaks having low heights under 50 RFU. Crime labs typically discard such DNA data as uninterpretable, inconclusive, too low, or too complex. They do not use the evidence for database searches or match comparisons. Yet the match strength RV distributions were informative. Eventual comparison with the true contributor reference gave a match statistic of 536 thousand. That level of DNA association could help convict or acquit a defendant.

The mathematical probability framework led to efficient algorithms for constructing match strength distributions and calculating LR error. The independence of additive locus RVs permitted rapid and accurate joint RV construction by convolution. The sample space contained exponentially many multi-locus genotypes. The match strength RV mapped these multi-dimensional outcomes into uni-dimensional numbers, collecting and preserving match information in quadratic time.

The convolution approach numerically constructs an exact match strength distribution at a given bin resolution. Independently, a Monte Carlo construction randomly samples genotypes from the underlying probability distribution.
A statistical test for distribution equality can compare a convolved cdf with a sampled empirical cdf. For M3's minor genotype, comparing convolved and sampled (N = 10,000) non-contributor distributions gave a Kolmogorov-Smirnov statistic [41] of 0.0090 (p = 0.3863), supporting cdf equality. Distribution comparison jointly assesses convolution and sampling methods, mutually validating their accuracy.

The match strength RV approach is quite general. While genotypes have a discrete representation, other variables (e.g., glass index of refraction) are continuous. The RV distribution approach extends to continuous variables and any dimensionality, with integration replacing summation [12]. LR associations are used in fields beyond forensic science, for example, in artificial intelligence [42], medical diagnosis [43] and legal reasoning [44]. Rapid construction of match strength distributions may offer insights, applications and efficiencies in handling uncertainty in such areas.

Uncertainty is prevalent in the natural and social sciences. Bayesian probability modeling helps extract information from real world data to harness that uncertainty [45]. Advance knowledge of the full range of possible outcomes aids decision-making, whether in forensic biology or diagnostic medicine. This paper showed how probabilistic RV analysis can preserve and use identification information, even when the evidence data are thought to be uninterpretable and no reference is available for comparison.
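The convolved-versus-sampled comparison mentioned in the Conclusions can be sketched with a hand-computed Kolmogorov-Smirnov statistic. The distribution here is a hypothetical stand-in for a convolved pmf, the seed and names are assumptions, and the p-value step is omitted.

```python
import bisect
import random


def ks_statistic(support, exact_cdf, sample):
    """Largest gap between an exact discrete cdf and a sample's empirical cdf.

    Both cdfs are right-continuous step functions that jump only at support
    points, so comparing them at each support point finds the supremum.
    """
    ordered = sorted(sample)
    n = len(ordered)
    return max(
        abs(bisect.bisect_right(ordered, x) / n - fx)
        for x, fx in zip(support, exact_cdf)
    )


# Hypothetical discrete distribution standing in for a convolved pmf.
support = [-3, -2, -1, 0, 1]
pmf = [0.40, 0.30, 0.15, 0.10, 0.05]
exact_cdf = [sum(pmf[: i + 1]) for i in range(len(pmf))]

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = rng.choices(support, weights=pmf, k=10_000)  # Monte Carlo draw

d = ks_statistic(support, exact_cdf, sample)
print(f"KS statistic D = {d:.4f}")
```

A small statistic indicates that the sampled empirical cdf tracks the exact cdf closely. For binned (discrete) data the classical Kolmogorov p-value is only approximate, so it is left out of this sketch.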

Declarations

Author contribution statement

Mark Perlin: Conceived and designed the experiments; Performed the experiments; Analyzed and interpreted the data; Contributed reagents, materials, analysis tools or data; Wrote the paper.

Funding statement

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing interest statement

The authors declare the following conflict of interests: Mark Perlin is an owner and officer of Cybergenetics, a company that develops forensic DNA software.

Additional information

Data associated with this study has been deposited at Mendeley Data under the accession number https://doi.org/10.17632/b6frbgmm69.1.
