Nicholas C Henderson1, Michael A Newton2. 1. Department of Statistics, University of Wisconsin, Madison, USA. 2. Departments of Statistics and of Biostatistics and Medical Informatics, University of Wisconsin, Madison, USA.
Abstract
Identifying leading measurement units from a large collection is a common inference task in various domains of large-scale inference. Testing approaches, which measure evidence against a null hypothesis rather than effect magnitude, tend to overpopulate lists of leading units with those associated with low measurement error. By contrast, local maximum likelihood (ML) approaches tend to favor units with high measurement error. Available Bayesian and empirical Bayesian approaches rely on specialized loss functions that result in similar deficiencies. We describe and evaluate a generic empirical Bayesian ranking procedure that populates the list of top units in a way that maximizes the expected overlap between the true and reported top lists for all list sizes. The procedure relates unit-specific posterior upper tail probabilities with their empirical distribution to yield a ranking variable. It discounts high-variance units less than popular non-ML methods and thus achieves improved operating characteristics in the models considered.
Identifying leading measurement units from a large collection is a common inference task in various domains of large-scale inference. Testing approaches, which measure evidence against a null hypothesis rather than effect magnitude, tend to overpopulate lists of leading units with those associated with low measurement error. By contrast, local maximum likelihood (ML) approaches tend to favor units with high measurement error. Available Bayesian and empirical Bayesian approaches rely on specialized loss functions that result in similar deficiencies. We describe and evaluate a generic empirical Bayesian ranking procedure that populates the list of top units in a way that maximizes the expected overlap between the true and reported top lists for all list sizes. The procedure relates unit-specific posterior upper tail probabilities with their empirical distribution to yield a ranking variable. It discounts high-variance units less than popular non-ML methods and thus achieves improved operating characteristics in the models considered.
In all sorts of applications, data from a large number of measurement or inference units are processed to identify the most important units by some measure. This is certainly true in statistical genomics, where units might be genes, gene sets or single‐nucleotide polymorphisms (SNPs), depending on the particular application, but it is also true more broadly. In agriculture investigators rank animals or plants by their breeding value (e.g. de los Campos et al. (2013)); performance evaluations in health and social sciences are common (e.g. Paddock and Louis (2011)). Typically, units are associated with unobserved real‐valued parameters, and the importance of each unit is linked to the value of its parameter. A case that we consider is a genomewide association study examining risk factors for type 2 diabetes, in which the inference unit is the SNP, and the parameter of interest is a log‐odds ratio measuring the effect on disease probability of SNP genotype (Morris et al., 2012). A second case involves gene set enrichment among human genes that have been determined via ribonucleic acid (RNA) interference experiments to affect influenza virus replication (Hao et al., 2013). Units here are sets of genes annotated to particular biological functions and parameters measure levels of enrichment. We develop two further examples to exercise the statistical issues: one from sports statistics (units are basketball players), and one from gene expression analysis (units are genes). If there had been no measurement error we would summarize each case by ranking units according to values of their parameters, focusing on the top of this list for further study. We consider here the inference task to perform such ranking and selection from data. Whereas the emphasis of large‐scale inference has been testing in relatively sparse settings (e.g. Efron (2010)), the present work addresses the inference task to rank order non‐null parameters when the signal is relatively non‐sparse.A natural ranking is obtained by separately estimating unit‐specific parameters, for instance by maximum likelihood applied locally to each unit. Since sampling fluctuations more easily put high variance units into the tails, units that are associated with relatively high standard error are overrepresented among the top units by this maximum likelihood estimate (MLE) ranking. Another commonly used procedure comes from large‐scale hypothesis testing, where units are ranked by their p‐value relative to a reference null hypothesis. Units that are associated with relatively low standard error are overrepresented among the top units by this ranking since both effect size and standard error affect testing power. Standard error in the type 2 diabetes case is affected by various factors including SNP allele frequency; set size affects standard error in the RNA interference case. When there is little variation between unit‐specific standard errors, the different approaches give essentially the same assessment of the most important units. However, in many cases there is substantial variation in these standard errors, and quite different rankings can emerge.For contemporary large‐scale applications, the classical theory of ranking and selection leaves much to be desired. It addresses sampling probabilities like, ‘under such‐and‐such a configuration of parameters and for sufficient amounts of data per unit the probability exceeds such‐and‐such that the true top j units are among the observed top k units’ (e.g. Gibbons et al. (1979)). Although relevant to some tasks, these probabilities are difficult to work with and the resulting procedures are not often used in applied statistics. Theory is available on the sampling characteristics of empirical rankings (e.g. Xie et al. (2009) and Hall and Miller (2010)). Arguably, the thrust of methodological development for ranking and selection involves hierarchical modelling coupled with Bayes or empirical Bayes inference. Seminal contributions by Berger and Deeley (1988) and Laird and Louis (1989) helped to establish a framework that covers many contemporary applications and that has been elaborated in important ways (e.g. Shen and Louis (1998), Gelman and Price (1999), Wright et al. (2003), Lin et al. (2006), Brijs et al. (2007) and Noma et al. (2010)). We further elaborate this framework in an effort to provide a more effective generic method for large‐scale inference, especially when large parameter units are in focus, when there are many units and when there is substantial variation in unit‐specific standard errors.Sampling artefacts of MLE and p‐value ranking procedures, which were noted above, are well documented, but other approaches are also deficient. The insightful analysis of Gelman and Price (1999) illustrates the difficulties and confirms that the common practice of ranking by posterior expected value suffers from the same artefact as the p‐value ranking, namely that units that are associated with small posterior standard deviation are overrepresented on lists of the top units. We find similar behaviour with the posterior expected rank method (Laird and Louis, 1989; Lin et al., 2006) as well as available testing schemes. We introduce and investigate a procedure that aims to rank units in a way to maximize the expected overlap between the reported and the true top lists of units. Although not eliminating the sampling artefacts, the new method reduces their effects compared with other schemes. Our development starts in a special case wherein ranking procedures are formulated in terms of certain threshold functions (Section 2.1); using this formulation we characterize thresholds that maximize the expected overlap between the true and reported top lists (Sections 2.2 and 2.3), and we derive the associated ranking variable in terms of local posterior tail probabilities. The proposed r‐value is generalized in Section 2.4 and investigated in relation to other procedures in Section 3. Computational issues are reviewed in Section 4, sampling performance is investigated in Section 5 and a short discussion follows. Examples are used throughout for demonstration, and proofs are postponed until Appendix A. The methodology proposed and several data sets are deployed in the R package rvalues, which is available through the Comprehensive R Archive Network ().
Threshold functions and ranking variables
Continuous model
A variety of data structures are amenable to our proposed ranking or selection scheme, but the following structure has guided its initial development. Measurement or inference units are indexed by i=1,2,…,n; data on unit i include the real‐valued measurement and information about its sampling variation. We assume that the sampling distribution of has a known form that is indexed by an unknown real‐valued parameter of interest together with a second quantity affecting variance. In this section we assume that is known for each unit. Basically, the inference task is to report units having large values of , while accounting for the fact that variances may fluctuate substantially between inference units. We adopt an empirical Bayes perspective and treat as draws from a population of parameters, and we are motivated by data analysis considerations to suppose initially that and are independent in this population, say with densities f(θ) and . The independence assumption is helpful for understanding artefacts of various ranking methods, but it is not essential to the methodology. The empirical Bayesian uses the full data set to estimate the prior distributions f(θ) and . Initially we ignore the estimation error at this level and focus on ranking units within the estimated population, though we take up the issue in Section 5 via simulation and asymptotic analysis.Relative to a single unit i, might be the maximum likelihood estimator of , and that estimator's standard error. The independence assumption may be reasonable if care has been taken in this local analysis, for example, by variance stabilizing transformation. Typically, the variance is estimated rather than known exactly; we study this and extensions to other data structures in Section 2.4. We consider first a continuous model, involving prior distributions and sampling distributions all having densities with respect to Lebesgue measure. The canonical sampling model within this class has .We make some headway by associating each ranking or selection procedure with a family of threshold functions . Each is a function having the interpretation that unit i is reported to be in the top α fraction of units if and only if . This interpretation is supported by the size constraint, namely that, marginally to all parameters and data,Table 1 reports threshold functions associated with a variety of ranking methods in the normal observation model, and under the extra condition that the prior f(θ) is . Table 1 encodes the special case ; the general thresholds are derived from this case by the transformation . Note that each threshold function involves an α‐specific value which guarantees the size constraint; these values are different for different ranking methods (rows of Table 1). Fig. 1 illustrates four of these families in the type 2 diabetes case‐study. Notionally, the linear ranking of units is obtained by sweeping through the family , beginning with the smallest α at the top of the graph. Clearly, distinct families of threshold functions can produce distinct rankings of the units, with the family's shape revealing how it trades off observed signal with measurement variance to prioritize the leading units.
Table 1
Threshold functions associated with various ranking criteria, normal–normal model
Criterion
Ranking variable
Threshold functiontα(σ2)
MLE
Xi
uα
p‐value H0:θi=0
Xi/σi
uασ
p‐value H0:θi=c
(Xi−c)/σi
c+uασ
PM
Xi/(σi2+1)
uα(σ2+1)
PER
P(θi⩽θ|Xi,σi2)
uα√{(σ2+1)(2σ2+1)}
Bayes factor
1(Xi>0)P(Xi|σi2,θi≠0)P(Xi|σi2,θi=0)
√(σ2(σ2+1)[uα+log{(σ2+1)/σ2}])
Maximal agreement
r‐value
θα(σ2+1)−uα√{σ2(σ2+1)}
Figure 1
Threshold functions (a) MLE, (b) p‐value, (c) PM and (d) maximal agreement, type 2 diabetes example: axes are common to all panels, with the vertical axis the log‐odds ratio for association between SNPs () and type 2 diabetes and with the horizontal axis the standard error estimates, with further details in Fig. S1 in the on‐line supplementary material; calculations use an inverse gamma model for ; 42 threshold functions are shown ranging in α‐values from a small positive value () just including the first data point up to α=0.10 () (most SNPs are truncated by the plot; also the grid is uniform on the scale of ); units associated with a smaller α (i.e. more red) are ranked more highly by the given ranking method; two units landing on the same curve would be ranked in the same position
Threshold functions associated with various ranking criteria, normal–normal modelThreshold functions (a) MLE, (b) p‐value, (c) PM and (d) maximal agreement, type 2 diabetes example: axes are common to all panels, with the vertical axis the log‐odds ratio for association between SNPs () and type 2 diabetes and with the horizontal axis the standard error estimates, with further details in Fig. S1 in the on‐line supplementary material; calculations use an inverse gamma model for ; 42 threshold functions are shown ranging in α‐values from a small positive value () just including the first data point up to α=0.10 () (most SNPs are truncated by the plot; also the grid is uniform on the scale of ); units associated with a smaller α (i.e. more red) are ranked more highly by the given ranking method; two units landing on the same curve would be ranked in the same positionSome comments on the threshold functions in Table 1 are warranted (see also the on‐line supplementary material document). Under squared error loss, the Bayes estimate of the rank of parameter among those in play is the conditional expected rank given the data (Laird and Louis, 1989; Lin et al., 2006). This posterior expected rank is usually expressed as a sum, involving indicator comparisons between and the other parameters, and it becomes when normalized and considered in the limit for increasing numbers of units (ranking from the top). Here θ is the independently drawn parameter of a generic additional unit, which emerges in the large‐scale limit to replace the collection of all other s with which is compared. In the normal–normal model, ranking by posterior expected rank is qualitatively similar to ranking by posterior mean ; both favour small variance units. Several hypothesis‐testing‐based methods are also shown in Table 1. Testing against some benchmark null (rather than the no‐effect null) hypothesis has some benefits in practice (e.g. McCarthy and Smyth (2009)). As we emphasize large positive , we report p‐values that are associated with one‐sided tests. Finally, the Bayes factor BF entry aims to mimic the ranking (from the top) method that is associated with Bayes factors for the test of
versus
(e.g. Kass and Raftery (1995)). The mapping of a ranking method to a family of threshold functions is useful for comparative analyses, as we investigate next.
Thresholds via direct optimization
Table 1 and Fig. 1 introduce a family that is optimal in the continuous model in the sense that for all α ∈ (0,1)for any other family which also satisfies the size constraint (1). Here is the α upper quantile of the prior, i.e. . In other words, maximizes agreement: the joint probability that unit i is placed in the top α fraction and its driving parameter is in the top α fraction of the population, for all α. We emphasize that the probabilities in inequality (2) cover the joint distribution of , which respects both the sampling distribution of data local to unit i and the fluctuations of unit‐specific parameters. A calculus‐of‐variations argument provides direct optimization of the joint probability in inequality (2), subject to the size constraint, model regularity and smoothness of the threshold functions.In the continuous model, a necessary condition for the function to be optimal as in inequality (2), within the class of continuously differentiable threshold functions, is that it satisfiesThus, all observations coincident with the graph of a given optimal threshold curve have a common posterior probability that their unit‐specific parameters exceed the quantile that is associated with that curve. In the normal model for and the normal prior f(θ), the optimal threshold function (Fig. 1(d)) is readily extracted from expression (3). Working on a standardized scale without loss of generality (μ=0 and ), the local posterior for is normal with mean and variance . Thus,where and is determined by the size constraint (1). Indeed is affected by the distribution , since it is defined implicitly bywhere Φ is the standard normal cumulative distribution function.Curiously, the optimal thresholds kick up as approaches 0. The resolution and range of Fig. 1 do not reveal this phenomenon so clearly in the type 2 diabetes example, but it is apparent from equation (4) that the derivative of with respect to becomes increasingly negative as approaches 0 (when ). Neither the p‐value thresholds nor those based on posterior mean or posterior expected rank have this characteristic; indeed, by kicking up for small the maximal agreement thresholds are less prone to the overranking of small variance units.Fig. 2 illustrates sampling properties of top‐listed units obtained by various threshold schemes, including the optimal threshold (4), using the normal–normal model, α=0.1, a sequence of gamma distributions g for the variance and independence between and . The difference between different methods becomes more pronounced as we increase variation in the distribution of the variances; in this simulation all cases involve , but the shape parameters vary to increase the coefficient of variation. The degree to which the conditional distribution of given placement on the top α list (the coloured bars) differs from the marginal distribution of in the system (grey) measures the extent of sampling artefacts by that method. The example recapitulates sampling artefacts of the local MLE, the p‐value and the posterior mean. For example, the top lists by MLE are enriched for high variance units. Fig. 2 also shows that this artefact is substantially reduced when we select by equation (4).
Figure 2
Conditional distribution (median, interquartile range) of unit‐specific variance given selection of the unit in the top α=0.1 fraction by various methods (coloured bands) compared with the marginal gamma distribution (black or grey) for different amounts of variation in (; coefficient of variation on the horizontal axis; based on simulation using units per case): (a) MLE; (b) p‐value; (c) PM; (d) maximal agreement
Conditional distribution (median, interquartile range) of unit‐specific variance given selection of the unit in the top α=0.1 fraction by various methods (coloured bands) compared with the marginal gamma distribution (black or grey) for different amounts of variation in (; coefficient of variation on the horizontal axis; based on simulation using units per case): (a) MLE; (b) p‐value; (c) PM; (d) maximal agreement
Posterior tail probabilities and ranking variables
Except in stylized models we cannot solve equation (3) to identify optimal thresholds for ranking. Insight into their structure comes by further examining their relationship to local posterior tail probabilities: .Suppose that for α ∈ (0,1) there exists such thatand furthermore that is right continuous and non‐decreasing in x for fixed α and . Then the family of thresholds satisfies the size constraint (1) and is optimal in the sense of inequality (2).These conditions concern and its distribution. They are satisfied in the normal–normal model; there,and , for is as in equation (5). The conditions are also satisfied in other instances of the continuous model of Section 2.1, as well as in other settings. For example, if is an estimated variance, then a Student t sampling model might replace the normal sampling model conditional on and . See the on‐line supplementary material for this and other examples. Note that the optimal threshold in theorem 2 simplifies further if is continuous and strictly increasing in x for each α and . Then , with the inverse referring to the first (i.e. x) argument.A family of threshold functions is a device to think about converting observations into rankings (i.e. by sweeping through the family). Indeed, the index α that is associated with the threshold curve on which data point lands is a ranking variable; its computation amounts to solving the inversion for α. Exact inversion is possible as long as the threshold curves for different α‐values do not cross, i.e. if there are no values , for which . Approximate inversion is always possible via .Suppose that threshold functions are differentiable in α for each . No functions in the family cross as long as for every α ∈ (0,1). Further, the optimal thresholds in the normal–normal model do not cross.This confirms more generally what we see empirically for a few cases in Fig. 1 and Table 1: the optimal thresholds do not cross under the conditions of theorem 3, and they conform to our intuition about how ranking procedures might be constructed from threshold functions.We introduce a special ranking variable that inverts the optimal threshold. For the ith unit, we define the r‐valueEssentially, unit i is placed by its r‐value at position α (a relative rank, measured from the top) if, when ranking the units by , it also happens to land at position α. Further, the top α fraction of units by r‐value has higher overlap with the true top α fraction of units than could be obtained by any other ranking procedure, in the sense of inequality (2).It is worth recognizing that these findings go beyond what has been reported about the use of the conditional tail probability to rank units. Classical theory on optimal selection establishes the role of this conditional tail probability in maximizing an exceedance probability within the selected sample (e.g. Lehmann (1986), pages 117–118). Also, the conditional tail probability has been used for ranking (e.g. Normand et al. (1997) and Niemi (2010)) and is closely related to a Bayes optimal ranking under a certain loss function (Lin et al., 2006). A critical difference with the ranking proposed is in the role of the index α. Conceptually, we imagine ranking the units by separately for all possible indices α (not just a prespecified index); then the r‐value for unit i is the smallest index α such that unit i is placed in the top α fraction by that ranking. By aiming to maximize agreement at all list sizes, the method proposed does not require a prespecified exceedance level to generate its ranking.
More generality
The r‐value construction makes sense in various elaborations of the model from Section 2.1. We retain univariate parameters of interest varying according to a distribution F, but we allow data on each unit to take more general forms than the pair structure. We also retain the assumption of mutual independence between units, though extensions could be developed in cases where posterior computation is feasible. In seeking units with largest , the critical quantity is the local exceedance probability, , for α ∈ (0,1) and for upper quantiles of the marginal distribution F, i.e. . Induced by the marginal distribution of , the tail probability has cumulative distribution function , and from it we obtain the upper quantile: . Then, by analogy with definition (7), the r‐value is defined: .Fig. 3 compares r‐value rankings with three other methods in the RNA interference example. Here, holds binomial information (set size and number of genes in set i that were identified by RNA interference). The target parameters are treated as draws from a beta(a,b) distribution, with shape parameters estimated by marginal maximum likelihood, and the conditional tail probability becomes the probability that a variable exceeds . r‐value computation (see Section 4) requires the sampling distribution of these tail probabilities, which we approximated by using the data from all 5719 sets under study. The methods compared in Fig. 3 agree to some extent on the ranking of the most interesting sets, but systematic differences are apparent. Ranking by overranks small sets; ranking by p‐value overranks large sets; and ranking by posterior mean also overranks large sets, though to a lesser degree, all compared with the r‐value ranking.
Figure 3
Ranking via various methods compared with r‐value ranking, RNA interference example (the data and axes are common to all panels, with further details in Fig. S2 in the on‐line supplementary material; briefly the horizontal axis is set size (on a log‐scale) and the vertical axis is gene set enrichment; each set (dot) is coloured by (X−R)/(X+R) where X is the rank (from the top) of the set by the method being compared, and R is the rank by r‐value): (a) MLE; (b) p‐value; (c) PM
Ranking via various methods compared with r‐value ranking, RNA interference example (the data and axes are common to all panels, with further details in Fig. S2 in the on‐line supplementary material; briefly the horizontal axis is set size (on a log‐scale) and the vertical axis is gene set enrichment; each set (dot) is coloured by (X−R)/(X+R) where X is the rank (from the top) of the set by the method being compared, and R is the rank by r‐value): (a) MLE; (b) p‐value; (c) PMSports enthusiasts routinely rank players. To explore r‐value ranking in this context, we deploy the same beta–binomial model as used in the RNA interference example and use it to describe free‐throw statistics of professional basketball players (e.g. Richey and Zorn (2005)). During the 2013–2014 regular season of the National Basketball Association (NBA), 461 players attempted at least one free throw (Entertainment and Sports Programming Network, 2014). In total these players attempted 58029 free throws and were successful 43870 times, for a marginal free‐throw percentage of 75.6%. A basic problem in rating players by individual free‐throw percentage is that the numbers of free‐throw attempts vary substantially between players; in retaining all active players, those with highest are among those with smallest . For instance, 13 of the 461 NBA players had perfect free‐throw records in 2013–2014; they had a median number of four attempts, compared with the league median of 82 attempts. Various threshold schemes have been adopted by rating agencies; these restrict ranking to players reaching a minimum number of attempts or a minimum number of makes. At the Entertainment and Sports Programming Network, a qualified player this last season needed . Thresholding rules have a practical appeal but they can suppress athletic performances that otherwise are exceptional and worth reporting. For instance, Ray Allen's 105 makes in 116 attempts is exceptionally good by many standards (Table 2). The context provided by the NBA example offers further insights. For one thing, there is broad agreement between PM‐ranking and r‐value ranking, though where there is disagreement PM favours players having more attempts . Related to this is the fact that, though it discounts players with very small , the r‐value shrinks less than PM and is more in accordance with the FTP ranking; for example, the r‐value ranks the qualified players in Table 2 the same as FTP, in contrast with PM.
Table 2
Leading free‐throw shooters, 2013–2014 regular season of the NBAa
Player i
yi
mi
FTP
PM
RV
QR
MLER
PMR
RVR
Brian Roberts
125
133
0.940
0.913
0.002
1
17
1
1
Ryan Anderson
59
62
0.952
0.898
0.003
15
2
2
Danny Granger
63
67
0.940
0.893
0.005
16
3
3
Kyle Korver
87
94
0.926
0.892
0.008
19
4
4
Mike Harris
26
27
0.963
0.866
0.010
14
15
5
J. J. Redick
97
106
0.915
0.886
0.011
22
6
6
Ray Allen
105
116
0.905
0.880
0.016
25
8
7
Mike Muscala
14
14
1.000
0.844
0.017
7
34
8
Dirk Nowitzki
338
376
0.899
0.891
0.018
2
30
5
9
Trey Burke
102
113
0.903
0.877
0.018
28
9
10
Reggie Jackson
158
177
0.893
0.877
0.024
3
32
11
11
Kevin Martin
303
340
0.891
0.882
0.025
4
33
7
12
Gary Neal
94
105
0.895
0.869
0.025
31
14
13
D. J. Augustin
201
227
0.885
0.873
0.031
5
38
12
14
Stephen Curry
308
348
0.885
0.877
0.031
6
39
10
15
Patty Mills
73
82
0.890
0.860
0.032
34
19
16
Courtney Lee
99
112
0.884
0.861
0.035
40
18
17
Steve Nash
22
24
0.917
0.834
0.039
20.5
44
18
Greivis Vasquez
95
108
0.880
0.857
0.040
41
22
19
Robbie Hummel
15
16
0.938
0.825
0.043
18
55
20
Mo Williams
78
89
0.876
0.850
0.046
42
24
21
Kevin Durant
703
805
0.873
0.870
0.048
7
45
13
22
Aaron Brooks
83
95
0.874
0.850
0.049
44
26
23
Damian Lillard
371
426
0.871
0.865
0.050
8
47
16
24
Nando de Colo
31
35
0.886
0.831
0.057
37
48
25
From n=461 players who attempted at least one free throw, shown are the top 25 players as inferred by r‐value. Data on player i include the number of made free throws and the number of attempts . Other columns indicate the free‐throw percentage FTP , which is the MLE of the underlying ability ; posterior mean , r‐value ; qualified rank QR, which is the rank of FTP among players for whom ; and ranks associated with the MLE, posterior mean and r‐value.
Leading free‐throw shooters, 2013–2014 regular season of the NBAaFrom n=461 players who attempted at least one free throw, shown are the top 25 players as inferred by r‐value. Data on player i include the number of made free throws and the number of attempts . Other columns indicate the free‐throw percentage FTP , which is the MLE of the underlying ability ; posterior mean , r‐value ; qualified rank QR, which is the rank of FTP among players for whom ; and ranks associated with the MLE, posterior mean and r‐value.As an empirical validation of the r‐value ranking we applied it to mid‐season NBA data (up to the end of December 2013) and then measured its performance conditionally on complete‐season data. Comparing Table S2 (in the on‐line supplementary material) with Table 2, we see some interesting features. For example, Brian Roberts, who finished the season with the highest FTP among qualified players, did not miss in 2013; the r‐value placed him second mid‐season, even though he had only attempts, whereas PM ranked him 12th. Investigating more fully, we repeatedly simulated ‐vectors conditionally on end‐of‐season data and averaged a similarity score:finding improvements over FTP and PM in assessing the best free‐throw shooters (Fig. S3 in the supplementary material). Here is the player's estimated rank according to mid‐season data and is his unknown true rank.r‐values may be computed in all sorts of hierarchical modelling efforts, including semiparametric models and cases where Markov chain Monte Carlo sampling is used to approximate the marginal posterior distribution of each given available data. Fig. S9 (in the supplementary material) compares the r‐value ranking with other rankings in an example from gene expression analysis, where evidence suggested that the expression of a large fraction of the human genome was associated with the status of a certain viral infection (Pyeon et al., 2007). A multilevel model involving both null and non‐null genes as well as t‐distributed non‐null effects exhibited good fit to the data but did not admit a closed form for . r‐values, computed by using Markov chain Monte Carlo output, again reveal systematic ranking differences from other approaches.Multilevel models drive statistical inference and software in a variety of genomic domains, e.g. limma (Smyth, 2004), EBarrays (Kendziorski et al., 2003) and EBSeq (Leng et al., 2013), among others. Since these models happen to specify distributional forms for parameters of interest, the associated code could be augmented to compute posterior tail probabilities and thus r‐values for ranking. The limma system utilizes a conjugate normal, inverse gamma model, and so involves the tail probability of a non‐central t‐distribution. The EBSeq system entails a conjugate beta, negative binomial model, and so for differential expression involves tail probabilities in a certain ratio distribution (Coelho and Mexia, 2007). One expects the benefits of r‐value computation to show especially in cases involving many non‐null units and relatively high variation between units in their variance parameters (e.g. sequence read depth). The data structure that is envisioned for r‐value computation involves many exchangeable units, with real‐valued parameters driving the conditional distribution of data on each unit. Other structures, such as from large‐scale regression, may be amenable to the proposed ranking method if marginal posterior distributions for each regression coefficient could be derived.
Connections
Connection to Bayes rule
The proposed r‐values are not Bayes rules in the usual sense; however, there is a connection to Bayesian inference if we allow both a continuum of loss functions and a distributional constraint on the reported unit‐specific (relative) ranks. To see this connection, we introduce a collection of loss functionswhere action a is a relative rank value in (0,1), α ∈ (0,1) indexes the collection and again is a quantile in the population of interest. Specifically, no α‐loss occurs if the inferred relative rank a and the actual relative rank both are less than α. The marginal (preposterior) Bayes risk of rule iswhich is 1 minus the agreement (2). In the absence of other considerations, the Bayes rule for loss degenerates to . Degeneration is avoided if we enforce on the reported rank the additional structure that it shares with the true relative rank the property of being uniformly distributed over the population of units. Such a constrained Bayes rule then minimizes the modified objective function: , where is chosen to enforce the (marginal) size constraint .The constrained Bayes rule is computed conditionally, per observed , by minimizing the constraint‐modified posterior expected loss PEL:where is the upper posterior probability appearing in Section 2.Curiously, a rule minimizing is not uniquely determined at a single α, since minimization in equation (9) requires only thatHowever, taking all losses together does fix a procedure. To see this, let , and further assume that g is continuous in α. If has only one root in (0,1), then the procedure is a Bayes rule for any choice of , even though does not depend on any specific choice of α. This is because for all α such that , and for all α such that . If does contain multiple roots (at least over a range of that has positive probability), there will not be a procedure (i.e. a procedure which does not depend on α) which is a Bayes rule for any choice of . This is because it will not be possible to construct a rule δ that satisfies requirement (10) for all values of α ∈ (0,1). The thresholds in expression (10) are determined by the uniformity constraint, and we have , where is the marginal distribution of , counting all sources of variation, and so from the previous section. In other words, the procedure that is obtained by this constrained, multiloss Bayes calculation is equivalent to the r‐value that was introduced in Section 2.Among the more popular loss‐based ranking procedures is one via posterior expected rank PER (e.g. Laird and Louis (1989) and Noma et al. (2010)). Unit i's value becomes after normalizing by the number of units and taking the large‐scale limit. We find in numerical experiments that PER‐ranking is relatively close to the ranking by PM, and in these experiments we use , which can be established readily by using a transformation‐of‐variables argument.
Beyond ps and qs
In testing a single hypothesis , the sample space may be structured as a nested sequence of subsets, , say, such that rejection of a size α test is equivalent to data D landing in set (i.e. rejection region) . Then, the p‐value of the test is . Storey (2003) extended this idea to multiple testing and the positive false discovery rate with the introduction of the q‐value. Specifically, with another nested sequence indexed such that , the q‐value is . Where p‐values refer to the distribution of D on , and q‐values the conditional probability of given sample information, the proposed r‐values refer to marginal probability over both unit‐specific data and unit‐specific parameters. The size constraint (1) corresponds to another sequence of subsets, , say, for which the marginal constraint holds: . Analogously, the r‐value is . In principle an r‐value could be defined for any indexed ranking method, though we have reserved the definition for that method which maximizes agreement (2). Other connections to hypothesis testing are discussed in the on‐line supplementary material.
Computation
In Section 2.2 we focused on the model involving normality for both the measurement and the latent parameter . The r‐value is obtained by inverting equation (4) to solve for r:where , defined through the size constraint (5), is readily computed numerically.Alternatively, a generic approach to computing r‐values starts with a finite grid in (0,1), at which we compute the posterior tail probabilities for all units i (or approximations, e.g. by Markov chain Monte Carlo sampling). The grid need not be uniform; we enrich coverage near 0 in our implementation. The jth column of the matrix holds a sample from the marginal distribution for which is the ‐quantile. Marching through j allows us to assemble a discrete (in α), empirical (over units) quantile function, which we convert to a function first by possibly smoothing to mitigate sampling effects and then by interpolating to α‐values beyond the initial grid. Then for each unit i we solve numerically in α to obtain that unit's r‐value. Fig. 4 illustrates the computation for two units in the NBA example. Pseudocode for the algorithm and elements of the R package implementation are given in the on‐line supplementary material document.
Figure 4
Computational details, NBA example: , ; , two examples; , empirical quantile; , ; , r‐value
Computational details, NBA example: , ; , two examples; , empirical quantile; , ; , r‐valueThe grey curves in Fig. 4 show, for each of 461 NBA players who attempted at least one free throw in the entire 2013–2014 regular season, the tail probability function ; two are highlighted in magenta. Recall that is such that ; in this case a conjugate beta(a,b) model was fitted to obtain these marginal quantiles (). At each value of α(j) on a grid, the empirical distribution of was computed and reduced to a quantile such that the empirical frequency exceeding the quantile is α(j) (the red dots). We smoothed these to obtain the quantile fuction (the blue curve). Two r‐values are shown (the broken vertical lines, at r‐values 0.016 and 0.488), obtained by solving in α equality of the unit‐specific and the systemwide . Scaling by logarithms (horizontal) and square root (vertical) was done to aid visualization.
Sampling performance
The r‐value is defined using the joint distribution of data and the target parameter , but it is computed empirically from an estimate of that joint distribution. Accurate distributional estimation may be possible from large‐scale data sets, but it is nonetheless useful to investigate how the optimality that is guaranteed by theorems 1 and 2 deteriorates in finite sample situations. Simulations of the normal–normal model show that computed r‐values retain their performance benefits compared with other ranking procedures, and thus some uncertainty in the quantile function or in the distribution of does not clearly disable the procedure. For example, Fig. 5 shows simulation‐based estimates of agreement, , for both the computed r‐values and for other ranking methods. We adapt the notation to include the sample size n and the hat mark to emphasize that the computed r‐values involve estimation of the marginal distribution function F of and the quantile function . r‐value performance is not adversely affected by low sample sizes in this case. Other simulations demonstrate that this superiority is not sensitive to the distribution of variances or to the extent of smoothing that is used to compute quantiles (see the on‐line supplementary material, Figs S4 and S5).
Figure 5
Finite sample performance of the r‐value (), PM (), PER () and MLE () in the normal–normal model (the simulation‐based agreement compares the true top α list with the estimated top α list for various methods and for 1/n⩽α⩽ 0.1 (common horizontal axis), when the marginal distribution of and the quantile are both estimated from available data (no smoothing); the common vertical axis is agreement/α; , and results from 1000 simulated data sets were averaged for each panel): (a) ; (b) n=1000; (c) n=200; (d) n=50
Finite sample performance of the r‐value (), PM (), PER () and MLE () in the normal–normal model (the simulation‐based agreement compares the true top α list with the estimated top α list for various methods and for 1/n⩽α⩽ 0.1 (common horizontal axis), when the marginal distribution of and the quantile are both estimated from available data (no smoothing); the common vertical axis is agreement/α; , and results from 1000 simulated data sets were averaged for each panel): (a) ; (b) n=1000; (c) n=200; (d) n=50A more general consistency property holds for models that are sufficiently regular that the following four conditions are satisfied.Triples , for i=1,2,…,n, are independent and identically distributed from a joint distribution for which and are independent and have positive densities f and g with respect to Lebesgue measure on and respectively.From data , we have an estimator of F, where , that is invariant under permutations of the observations. The sequence of distributions converges weakly, , almost surely as n→∞.The estimator could be parametric or non‐parametric (see Lindsay (1995)). For each α, the marginal quantile is estimated by , and the posterior tail probability, , given a potential data point , is estimated byHere is the local sampling density, which we consider to have a known form.The local sampling density satisfiesis continuous in ,there is a continuous function such that for all arguments and,for any and , is increasing in θ.Let , and .There are no values of and such that .The normal–normal model satisfies condition 1 by design, condition 3 by inspection and condition 4 by theorem 3, and it will satisfy condition 2 for typical parametric or non‐parametric estimates of F. Indeed condition 3 is readily verified in many settings, but condition 4 is more difficult because it involves the marginal distribution of local posterior probabilities, which is often analytically intractable. We have confirmed conditions 1, 2, 3, 4 in a gamma–inverse gamma model (see the on‐line supplementary material).The ideal r‐value is not computable when the underlying distributions are unknown, though model regularity assures that is the unique root (in α) of the equation . Approximating we have the empirical distribution function, , and the unsmoothed quantile . A natural estimate of is . To analyse estimation error, it is helpful to define the related quantity for , and the sample version, . It happens that when both reside in [δ,1−δ]; we think of δ as an arbitrarily small value that ameliorates boundary effects in the estimated quantile function .If the model satisfies conditions 1, 2, 3, 4 and n→∞, then for , and all α ∈ [δ,1−δ], Furthermore,The quantity is the optimal agreement, as in theorem 2.Essentially, computed r‐values are uniformly distributed and achieve the maximal agreement in large samples as long as the generative distributions are sufficiently regular and consistently estimated.Model uncertainty can have a bigger effect than system parameter uncertainty on the r‐value performance. Fig. 6 shows some reduced performance of r‐value in case F is misspecified as normal when it is fact heavier tailed. Other misspecifications may have less effect, such as when the true F is a finite mixture of normal distributions or when there are unmodelled dependences between and . Examples are provided in the on‐line supplementary material, Figs S6–S8. Without pursuing a comparative analysis, we note finally that an alternative estimator of may be obtained by working out, perhaps via simulation, the induced distribution of for bootstrap data drawn from the fitted model.
Figure 6
Effects on agreement of model misspecification for the r‐value (), PM (), MLE () and PER () (the r‐value performance deteriorates when the true distribution of effects is much heavier tailed (Student's t on df degrees of freedom) than is used to construct the r‐value (normal); the case shown involves n=2000; the axes are as in Fig. 5): (a) df=6; (b) df=4; (c) df=3; (d) df=2
Effects on agreement of model misspecification for the r‐value (), PM (), MLE () and PER () (the r‐value performance deteriorates when the true distribution of effects is much heavier tailed (Student's t on df degrees of freedom) than is used to construct the r‐value (normal); the case shown involves n=2000; the axes are as in Fig. 5): (a) df=6; (b) df=4; (c) df=3; (d) df=2
Discussion
For examples touched on here as well as for many others within the domain of large‐scale inference, a basic statistical problem is to rank units and to select the top units by some measure. Precisely how the output of such inference is to be used depends very much on the context; admittedly we have not focused on these operational issues. For example, the output might trigger follow‐up experiments in a genomic study (e.g. Pyeon et al. (2007)), it might affect resource allocation in some performance evaluation (e.g. Paddock and Louis (2011)), or it might spark a debate about who really is the best free‐throw shooter. Our emphasis on a statistical framework for large‐scale ranking and selection responds to evident weaknesses of available methodologies and the potential utility of the proposed r‐value scheme, especially when there is great variation in the amount of information per unit. Also, where an emphasis of large‐scale inference has been on testing and sparsity assumptions, the r‐value computation addresses a practical problem to organize large numbers of non‐null units.By casting the problem via empirical Bayes methods, we express agreement between true and reported top lists as a certain joint probability that is subject to explicit optimization, taking advantage of an equivalence between ranking and threshold functions (Section 2). Roughly speaking, an r‐value is a Bayes rule for the binary loss which indicates failure to place the unit correctly in the top α‐fraction of units, though to formalize this one requires multiple loss functions and a distributional constraint (Section 3.1). In spite of this connection to Bayesian inference, the r‐value method seems not to have been previously identified by that reasoning. Theoretical support for the method has been developed here for a measurement model (2.1, 2.2, 2.3). Establishing that r‐values maximize agreement in the more general cases that were considered in Section 2.4 remains to be investigated. Where the analysis in Section 2 treats the joint distribution of data and unit level parameters as known, this model must be estimated from systemwide data in each application. We report sufficient conditions for first‐order asymptotic correctness (theorem 4). Within‐model simulations show good r‐value performance under a range of conditions (Section 5). Performance deteriorates when the model is misspecified, and we recommend that standard model diagnostics accompany the r‐value computation. Further investigation is warranted for non‐parametric or semiparametric models, as the basic r‐value statistic does not require a parametric formulation.‘Supplementary material’.Click here for additional data file.
Authors: Ning Leng; John A Dawson; James A Thomson; Victor Ruotti; Anna I Rissman; Bart M G Smits; Jill D Haag; Michael N Gould; Ron M Stewart; Christina Kendziorski Journal: Bioinformatics Date: 2013-02-21 Impact factor: 6.937
Authors: Dohun Pyeon; Michael A Newton; Paul F Lambert; Johan A den Boon; Srikumar Sengupta; Carmen J Marsit; Craig D Woodworth; Joseph P Connor; Thomas H Haugen; Elaine M Smith; Karl T Kelsey; Lubomir P Turek; Paul Ahlquist Journal: Cancer Res Date: 2007-05-15 Impact factor: 12.701
Authors: Gustavo de Los Campos; John M Hickey; Ricardo Pong-Wong; Hans D Daetwyler; Mario P L Calus Journal: Genetics Date: 2012-06-28 Impact factor: 4.562
Authors: Aaron B Wagner; Elaine L Hill; Sean E Ryan; Ziteng Sun; Grace Deng; Sourbh Bhadane; Victor Hernandez Martinez; Peter Wu; Dongmei Li; Ajay Anand; Jayadev Acharya; David S Matteson Journal: Stat (Int Stat Inst) Date: 2020-10-05