Literature DB >> 35125502

Statistical properties of sketching algorithms.

Abstract

Sketching is a probabilistic data compression technique that has been largely developed by the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson-Woodruff sketches and their use in single-pass sketching algorithms for linear regression with huge samples. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketching estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean squared error is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two datasets.

Entities: Chemical

Keywords: Computational efficiency; Random projection; Randomized numerical linear algebra; Sketching

Year: 2020 PMID： 35125502 PMCID： PMC7612324 DOI： 10.1093/biomet/asaa062

Source DB: PubMed Journal: Biometrika ISSN： 0006-3444 Impact factor: 2.445

Introduction

Sketching is a general probabilistic data compression technique involving random projections (Cormode, 2011). Even routine calculations can be prohibitively computationally expensive if performed on massive datasets. Computational time can be reduced to an acceptable level by allowing some approximation error in the results. Sketching algorithms simplify the computational task by generating a compressed version of the original dataset that then serves as a surrogate for calculations. The compressed dataset is referred to as a sketch, because it acts as a compact representation of the full dataset. Sketching algorithms use a randomized compression stage, which makes them interesting from a statistical viewpoint. Sketching algorithms for linear regression have attracted significant attention in the numerical linear algebra and theoretical computer science communities (Mahoney, 2011; Woodruff, 2014). To describe sketched regression in more detail, we first assume that the data consist of a length-n response vector y and an n × p matrix of covariates, X, which is of full rank. It is assumed throughout that n > p. The objective is to find the least squares coefficients. Given sufficient computational resources, these can be computed exactly as where the subscript F indicates the connection to the full dataset. Only two quantities are needed to determine β F: the Gram matrix X T X and the marginal associations X T y. Calculation of X T X requires O(np 2) operations, while computation of X T y needs only O (np) calculations. There are two broad methods for sketched regression, namely complete sketching and partial sketching. Complete sketching is based on approximating both X T X and X T y, whereas partial sketching approximates only the Gram matrix. Drineas et al. (2006) established many important results for complete sketching, and Dhillon et al. (2013) and Pilanci & Wainwright (2016) derived foundational results for partial sketching. Sketching algorithms use random linear mappings to reduce the size of the dataset from n to k observations. The random linear mapping can be represented as a k × n sketching matrix S. Complete sketching generates a length-k sketched response vector ỹ and a k × p matrix of sketched predictors . The sketched data are computed through the linear mappings ỹ = Sy and . Assuming that is of rank p, the complete sketching estimator β S is defined to be the set of least squares coefficients using the sketched responses and predictors, The partial sketching estimator, β P, is defined as The key difference between (1) and (2) is that the partial sketching estimator β P is constructed using the exact marginal associations X T y. Given the sketched data, computation of β S or β P requires only O(kp 2) operations, compared with the O(np 2) operations required for β F. There is a large literature concerned with designing appropriate distributions for the random sketching matrix S. Our focus is on data-oblivious random projections, such that the distribution of the sketching matrix is not a function of the source data (y, X). An example is the Gaussian sketch, where each element is independently distributed as an N(0, 1/k) variate. We also consider the Hadamard sketch and the Clarkson–Woodruff sketch, random projections that exploit structure and sparsity for computational efficiency.A motivation for this work is that there are no clear ties between data-oblivious random projections and classical subsampling techniques. Most existing results on the accuracy of sketching are universal worst-case bounds (Woodruff, 2014; Mahoney & Drineas, 2016). This is typical for randomized algorithms; however, a more detailed error analysis can provide important insights (Halko et al., 2011). We investigate the statistical properties of β P and β S when data-oblivious sketches are used. An important finding is that the signal-to-noise ratio in the source dataset strongly influences the relative efficiency of complete to partial sketching. The statistical analysis also allows the construction of exact confidence intervals for the Gaussian sketch and asymptotic confidence intervals for other random projections, paving the way for their wider use in the statistical community. At its core, sketched regression is a randomized algorithm for approximate computation of β F. Repeated application of the sketching algorithm to the same dataset will produce different results. The first stage in our analysis is to establish the distributional properties of the sketching estimators with the source dataset held fixed. An important result is a conditional central limit theorem for the sketched dataset that connects the Hadamard and Clarkson–Woodruff projections to the Gaussian sketch. The conditional analysis of the randomized algorithms is then extended to cover situations where sketching is used for approximate statistical inference. Given a statistical model for the response y = X β 0 + ϵ, with population parameter β 0 and error term ϵ, distributional properties of β P and β S can be determined by integrating over the conditional distributions of the sketching estimators that take y and X as fixed.

Background and related work

Preliminaries

We define a number of quantities related to the full dataset before moving on. The total, residual and model sum of squares are given by TSSF = y T y, and , respectively, with TSSF = MSSF + RSSF. The proportion of variance explained by the model is . These values will be important in characterizing the behaviour of β S and β P. The source data are generically represented by the n × d matrix A = (y, X). There are two general categories of distributions for the random matrix S: data-aware random projections and data-oblivious random projections. A data-aware random projection uses information in the source data (y, X) to generate S. In contrast, a data-oblivious random projection can be sampled without knowledge of y or X. Data-aware random projections are closely connected to finite population sampling methods in the statistics literature (Ma & Sun, 2015). Our main focus is on data-oblivious random projections, as their mechanism for data compression is not obviously tied to subsampling. Data-oblivious random projections generate a dataset of k pseudo-observations using the source dataset as a component in the generative process.

Data-oblivious sketches

The Gaussian sketch was one of the first projections proposed for sketched regression (Sarlos, 2006). Recall that a Gaussian sketch is formed by independently sampling each element of S from an N(0, 1/k) distribution.A drawback of the Gaussian sketch is that computation of the sketched data is quite demanding, taking O(ndk) operations. Therefore, work has been done on designing more computationally efficient random projections. Woodruff (2014) gives an excellent survey of work in this area. The Hadamard sketch is a structured random matrix (Ailon & Chazelle, 2009). The sketching matrix is formed as S = Φ HD / √k, where Φ is a k × n matrix and H and D are both n × n matrices. The fixed matrix H is a Hadamard matrix of order n. A Hadamard matrix is a square matrix with elements that are either +1 or −1 and orthogonal rows. Although Hadamard matrices do not exist for all integers n, the source dataset can be padded with zeros so that a conformable Hadamard matrix is available. The matrix D is a diagonal matrix whose n diagonal entries are independent Rademacher random variables. The random matrix Φ subsamples k rows of H with replacement. The structure of the Hadamard sketch allows for fast matrix multiplication, reducing calculation of the sketched dataset to O(nd log k) operations. The Clarkson–Woodruff sketch is a sparse random matrix (Clarkson & Woodruff, 2013). The projection can be represented as the product of two independent random matrices, S = ΓD, where Γ is a random k × n matrix and D is a random n × n matrix. The matrix Γ is initialized as a matrix of zeros. Independently in each column, one element is selected and set to +1. The matrix D is a diagonal matrix whose n diagonal entries are independent Rademacher random variables. The sparsity of the Clarkson–Woodruff sketch speeds up matrix multiplication, decreasing the complexity of generating the sketched dataset to O(nd).

Gaussian sketching

Complete sketching

The Gaussian sketch is mathematically tractable, and it is possible to establish a number of exact finite-sample results regarding the performance of the sketching estimators. In this section we derive the distribution of β S in the case where a Gaussian sketch is used. As mentioned previously, all results treat y and X as fixed. The variability in β S is solely due to the use of the random sketching matrix S. Let refer to the jth row of the sketched data matrix . Similarly, let denote the jth row of the sketching matrix S. The sketched dataset consists of k random units . The jth sketched response is given by and the jth sketched predictor is calculated as . The k sketched instances are independently distributed, because rows of the sketching matrix are independent. It can be shown that the joint distribution of the sketched data, , has the structure of a hierarchical Gaussian linear model. The sketched dataset has a multivariate normal distribution, conditional on the source dataset. This is because the sketched dataset can be expressed as a linear combination of Gaussian random variables. Specifically, row j in the sketched dataset is . Given the source dataset A = (y,X), A is a linear combination of independent Gaussians as Sj ~ N(0, I/k), and so must be jointly normally distributed, conditional on the source data A = (y,X). It is easily shownthat the conditional joint distribution of the sketched responses and predictors is then From standard results on the multivariate normal distribution, it follows that the conditional distribution of ỹ given is also normal with conditional mean . The subscript S is used with the expectation operator to emphasize that the only random quantity is the sketching matrix. The conditional distribution of ỹ given the sketched predictors and the source dataset (y, X) is This is the exact form of a standard Gaussian linear model, where the regression coefficient is β F and the conditional variance is RSSF/k. The distribution is easily obtained as the marginal distribution of is also multivariate normal, A Gaussian sketch effectively simulates a series of observations from a Gaussian linear model parameterized in terms of β F and RSSF, where the design matrix has a matrix normal distribution. The distribution of β S conditional on the sketched predictors follows immediately from standard results on linear models (Searle, 1997, Ch. 3). To obtain the marginal distribution of β S,it is necessary to integrate over the random sketched design matrix . Using properties of the normal distribution (Eaton, 2007), it is possible to show that . Hence, where iw denotes the inverse Wishart distribution. The marginal distribution of β S can then be described using the normal inverse Wishart distribution (Gelman et al., 2014, p. 73). The following theorem characterizes the distribution of β S under the Gaussian sketch.

Theorem 1

Suppose β S is computed using a Gaussian sketch and that k ⩾ p. Then: the conditional distribution of β S is the marginal distribution of β S is For a proof see the Supplementary Material. An immediate consequence of (i) is the ability to generate exact confidence intervals for the elements of β S, anapproach that does not seem to have been considered in the existing literature. The variance of β S, is not dependent on the compression ratio k/n. Although RSSF can be expected to grow linearly with n, this will generally be counterbalanced by (X T X)−1 decreasing linearly with n.

Partial sketching

Partial sketching was first proposed by Dhillon et al. (2013) using uniform subsampling, and was later studied for general sketches by Pilanci & Wainwright (2016). Existing results on partial sketching highlight that the model sum of squares influences the approximation error of the partial sketching estimator β P. It is easy to see that the variance of the partial sketching estimator will not be a function of the residual sum of squares. From the normal equations it follows that X T y = X T X β F. Using this property, we see that conditional on y and X, the variance of the random linear combination βP = (X T S T SX)−1 X T y = (X T S T SX)−1 X T X β F will be a function of the covariates X and the fitted values X β F. The residual vector has no influence on the variance of the partial sketching estimator, and as such the variance of β P will not be related to the residual sum of squares. This suggests that when the noise level is high, partial sketching may become preferable to complete sketching (Dhillon et al., 2013; Becker et al., 2015). The hierarchical model for complete sketching provides an intuitive statistical perspective on the mechanics of the algorithm. Partial sketching seems to lack a similar conceptual device. The least squares coefficients can be represented as the solution to the linear system of equations X T Xb = X T y. Partial sketching simply returns the solution, b, to the approximate linear system . Lacking a convenient representation for the estimator, we must proceed in a more pedestrian manner. The mean squared error of the estimator β P can be determined using only mean and variance information, and this will be the goal for now. The key observation is that . Conditional on y and X, the estimator is a linear combination of the elements of an inverse Wishart random matrix. However, this is a nonstandard distribution, and it is difficult to express directly the distribution function of β P. Despite this obstacle, it is straightforward to determine the mean and variance of β P. From properties of the inverse Wishart distribution, it can be seen that the partial sketching estimator is biased, with mean where it is assumed that k > p + 3. This motivates an alternative unbiased estimator Determining the variance of β P and the unbiased is a more lengthy computation, which is given in the Supplementary Material. The variance of the unbiased estimator is By making a connection with method-of-moments estimation it is possible to establish asymptotic normality of both β P and as k tends to infinity. This motivates the construction of approximate confidence intervals. As the exact variance is unknown, we propose the following estimator of using the sketched model sum of squares MSSS:

Relative efficiency

The relative efficiencies of complete and partial sketching are also of interest. As the plug-in estimator β P has a greater mean squared error than , it will not be considered in this subsection. The performance of the complete sketching estimator β S and the unbiased partial sketching estimator will be compared in terms of mean squared error. As both β S and are unbiased, the mean squared errors can be computed using and . Comparing (3) and (4), it can be seen that the variance of is dependent on MSSF whereas the variance of β S is dependent on RSSF. This suggests that the signal-to-noise ratio in the source dataset will be an influential factor in determining which estimator is more efficient. In the Supplementary Material it is shown that for k > p + 3 the relative efficiency can be bounded in terms of the signal-to-noise ratio When is close to 1, complete sketching can be orders of magnitude more efficient than partial sketching; and when is close to 0, partial sketching can be orders of magnitude more efficient than complete sketching.

Combined estimator

So far we have assumed that an analyst must choose between one of the two methods; but obtaining both and β S from a single sketch is computationally cheap and may be an attractive strategy. The most demanding operation with the sketched data is calculating . Given this quantity, it is economical to compute both β S and . Becker et al. (2015) mentioned that they were investigating such a strategy, but did not give any details. Our development of a combined estimator is motivated by the fact that, even when using a single sketch , the two estimators are uncorrelated, i.e., . This is established in the Supplementary Material by taking iterated expectations and using the hierarchical model from § 3.1. Asimple strategy is then to take a weighted combination of β S and . A combined estimator βC can be defined as for some 0 ⩽ φ ⩽ 1. The value of φ that minimizes the mean squared error is . Use of the weighted estimator is expected to be most beneficial when the signal-to-noise ratio is moderate, i.e., . When the signal-to-noise ratio is either very high or very low, there is little advantage in using the weighted estimator, as either the complete or the partial estimator will dominate.

One-step correction

As noted by a referee, the combined estimator is related to another strategy in the sketching literature for improving β S. Dhillon et al. (2013) and Pilanci & Wainwright (2016) proposed a refinement procedure that uses gradient information from the source dataset. The one-step corrected estimator is defined as Now the least squares solution β F satisfies X T(y – X β F) = 0, so Subtracting (6) from (5) gives the following expression for the error: The one-step estimator can be interpreted as a single step of the iterative Hessian sketch proposed by Pilanci & Wainwright (2016), initialized at β. Setting it follows from (7) and Theorem 1(i) that The key terms in (8) are the random matrices and . As , it is possible to evaluate the expectation in (8) using the first, second and third moments of the inverse Wishart distribution. The exact expression for (8) is lengthy and is given in the Supplementary Material. The main conclusions are that the one-step estimator β H can have a larger mean squared error than β S when the ratio k/p of sketch size to number of variables is close to 1. As k/p increases, the one-step estimator becomes more efficient than both β S and β C with the optimal weight φ opt. The relative efficiency of β C to β S is at most 2. The relative efficiency of β H to β S can be much higher, provided that k/p is sufficiently large.

Asymptotics

Finite-sample distributions of random projection estimators can be mathematically intractable, and thus asymptotic analysis can be a powerful tool (Diaconis & Freedman, 1984; Li et al., 2006). It is very difficult to establish meaningful finite-sample results for the Hadamard and Clarkson– Woodruff sketches, as they are discrete distributions over an enormous combinatorial space. Instead, it is useful to study the large-n distribution of the estimators β S and β P to obtain an interpretable expression. As β F is the estimand in sketching algorithms, conditioning on the source data is required in the asymptotic analysis. To elaborate, let A ( = (y (, X () represent the n×d source data matrix of full column rank.Any source data matrix A ( has a set of associated least squares coefficients, which will be denoted by here. The overall goal is to determine the asymptotic form of the distributions and for some arbitrary large dataset A(n). To take limits, we employ a fixed sequence of n × d datasets, all of rank d. Some related work has been done by Ma et al. (2015), who developed Taylor series approximations for the bias and variance of data-aware sketched regression estimators, where the asymptotic expansion is taken in the sketch size k. In independent work, Dobriban & Liu (2019) examined the behaviour of data-oblivious sketching algorithms in the asymptotic regime where k, d →∞, using elements of random matrix theory. Our work is novel, as we study data-oblivious random projections in the regime where k and d are fixed, while taking limits in the number n of source observations.

Sketching central limit theorem

A central limit theorem for sparse sketching matrices with independent entries is given in Li et al. (2006). The Clarkson–Woodruff sketch and the Hadamard sketch have dependent entries, so we use a different method of proof. Under some regularity conditions, the Hadamard and Clarkson–Woodruff sketches produce sketched data that asymptotically have the same matrix normal distribution as under the Gaussian sketch. The k × d random matrix Ã is the output of a stochastic process governed by the fixed n × d source dataset A ( and the distribution of the random k × n sketching matrix S. Eachcolumnof the sketched dataset is a linear combination of random vectors, the number of which increases with n. Under an assumption on the limiting leverage scores of the source data matrix, we can establish a central limit theorem for the sketched dataset. The leverage scores of the observations in the source data matrix have been identified as an important structural property of sketching algorithms (Mahoney & Drineas, 2016). Assumption 1 highlights their role in establishing asymptotic normality of the sketched data matrix.

Assumption 1

Let be the singular value decomposition of the n × d source dataset, and let be the ith row in U(n). The maximum leverage score tends to zero, that is, Theorem 2 is the sketching central limit theorem. Its proof is given in the Supplementary Material.

Theorem 2

Consider a sequence of arbitrary n × d data matrices A(n), where d is fixed. Let (n), and let S be a k × n Hadamard or Clarkson–Woodruff sketching matrix where k is also fixed. Suppose that : where MN denotes the matrix normal distribution.

Sketching estimators

The central limit theorem for the sketched data suggests that the results on β S and β P for the Gaussian sketch will also hold approximately for the Hadamard and Clarkson–Woodruff sketches for large n. To establish convergence of the estimators it helps to make an extra assumption on the sequence of source datasets.

Assumption 2

We have that for some positive-definite matrix Q. The limiting matrix Q allows one to avoid specifying a probability model for the source dataset, without overcomplicating the mathematical analysis. Under Assumptions 1 and 2, it is possible to establish an asymptotic result for β S and β P.

Theorem 3

Suppose that Assumptions 1 and 2 hold, k ⩾ p, and β S is computed using a Hadamard or Clarkson–Woodruff sketch. Let denote the Moore–Penrose pseudo-inverse of . Let Then, as n → ∞, the following convergence results hold in distribution: The proof is given in the Supplementary Material. For large n we expect β S to be approximately distributed as per Theorem 1 for both the Hadamard and the Clarkson–Woodruff sketches. It is harder to establish a comparable limit theorem for , because of the nonstandard distribution of when a Gaussian sketch is used. Instead, we wish to show that the partial sketching estimators under the Hadamard and Clarkson–Woodruff sketches have similar mean and variance properties to the Gaussian partial sketching estimator. Convergence in moments can be established given a stability condition on the singular values of the sketched data matrix.

Assumption 3

The sequence of source datasets is such that is finite for large enough n, where σmin(·) denotes the minimum singular value of a matrix. This additional regularity condition enables a formal limit theorem regarding the moments of to be established.

Theorem 4

Suppose that Assumptions 1–3 hold, k > p + 3, and is computed using a Hadamard or Clarkson–Woodruff sketch. Let Then, as n → ∞: The proof is given in the Supplementary Material. This theorem suggests that the conditional bias and variance of under the Clarkson–Woodruff and Hadamard sketches should be approximately equal to those under the Gaussian sketch. The results here are meant to provide useful heuristics for assessing the uncertainty associated with the output of the randomized approximation algorithm. There is a need to quantify the approximation error of sketching algorithms and communicate it to end users (Lopes et al., 2018), for which the asymptotic results developed in this section may be helpful.

Unconditional results

The previous analysis treated the source dataset as fixed to isolate the approximation error introduced by the random projection. When sketching is used for statistical inference, the hierarchical model of § 3.1 can be extended to include a source of variation at the population level. We take the design matrix X to be fixed and treat the response y as random. The assumed data-generating process is y = X β 0 + ε, where ε is a vector of n independent and identically distributed random variables with mean zero and variance σ 2. Let γ 2 represent the average mean function sum of squares, so . As shown in Searle (1997), at the population level the ordinary least squares estimator satisfies and . Taking iterated expectations, it can be seen that the Gaussian sketch gives an unbiased estimator of the population parameter . The same argument shows that . In the Supplementary Material, we use the law of total variance to determine the unconditional variances For large n, the most significant term in the unconditional variance of β S is nσ2(X T X)−1. The dominating term in the unconditional variance of is nγ 2(X T X)−1, a function of the average model sum of squares γ 2. We reach conclusions similar to those of the conditional analysis in § 3.3, in that β S is expected to be more efficient when the signal-to-noise ratio is high, while is expected to be more efficient when the signal-to-noise ratio is low. Under Assumptions 1–3, the variance expressions give asymptotic approximations for the Hadamard and Clarkson–Woodruff projections. These results can be extended to account for more complicated error models on ϵ if it is still possible to determine and . Raskutti & Mahoney (2016) provides further results on the performance of sketching estimators from an inferential perspective.

Data application

Human leukocyte antigen locus dataset

We compare the performance of the sketching estimators on a genetic dataset from the UK Biobank database.We use a small extract of the data in Astle et al. (2016). The selected response variable is mean red cell volume, taken from the full blood count assay and with adjustments for various technical and environmental covariates. Genome-wide imputed genotype data in expected allele dose format were available on n = 132 353 study subjects (Howie et al., 2009; Bycroft etal., 2018). We consider 1000 genetic variants in the human leukocyte antigen, HLA, region of chromosome 6, selected so that no pair of variants had squared Pearson correlation of posterior expected allele doses greater than 0.8. We chose to focus on this region because many associations have been discovered in a genome-wide scan using univariable models; these associations were with variants having different allele frequencies, which suggests multiple distinct causal variants in the region. The aim is to perform a multivariable regression analysis to obtain variant effect size estimates that are conditional on the other variants in the region. An early theoretical finding was that the partial sketching estimator β P is biased. One thousand sketches were taken to estimate the bias with k = 1500. We also computed the bias-corrected estimator in each replication. Figure 1 plots the average value of the estimators against the true value of the least squares coefficient using the full dataset. The top row shows results for β P, and the bottom row shows results for . The left, middle and right columns display results for the Gaussian, Hadamard and Clarkson–Woodruff sketches, respectively. The solid line in each panel is the identity line. The dashed line in the top row represents the theoretical bias, with slope k/(k – p – 1).

Fig. 1

Bias of partial sketching estimators on the HLA dataset: panels (a)–(c) show results for β P and panels (d)– (f) results for the bias-corrected estimator ; mean estimates are plotted against the true values. In this scenario n = 132 353, p = 1000 and k = 1500. The solid line in each panel is the identity line, and the dashed line in panels (a)–(c) represents the theoretical bias factor.

The results in panels (a)–(c) show that β P is biased for each of the random projections. The bias closely matches the theoretical factor. Panels (d)–(f) show that the adjusted estimator appears to be unbiased, with the mean values falling close to the identity line. We also compared the complete and partial sketching estimators in terms of mean squared error and coverage of confidence intervals at k = 1500 and k = 10 000. Moreover, we compared the data-oblivious sketches to simple uniform subsampling with replacement. Table 1 reports the mean squared error for each of the estimators. The signal-to-noise ratio is quite low for this dataset, with . We expect that partial sketching will be much more efficient than complete sketching on this dataset given the low signal-to-noise ratio. The simulation results support this prediction, with having a mean squared error roughly 60 times smaller than β S at both values of k. The results are very similar for each of the random projections, suggesting that the asymptotic approximations are reasonable for this dataset. For k = 1500, the mean squared error of β P is approximately 10 times that of .For k = 10 000 there is less of a difference, as the ratio k/(k−p−1) is closer to 1.

Table 1

Mean squared errors of sketching estimators on the HLA dataset

	k = 1500			k = 10 000
	β _S	β _P	βP∗	β _S	β _P	βP∗
Gaussian	238 (3)	39 (0.7)	3.8 (0.08)	13.3 (0.17)	0.28 (0.004)	0.21 (0.002)
Hadamard	238 (4)	39 (0.7)	3.8 (0.07)	12.5 (0.16)	0.26 (0.003)	0.20 (0.002)
Clarkson–Woodruff	241 (3)	38 (0.8)	4.0 (0.05)	13.2 (0.16)	0.28 (0.004)	0.21 (0.002)
Uniform	375 (15)	105 (7.6)	10.7 (0.55)	13.8 (0.20)	0.38 (0.007)	0.29 (0.005)

Table 2 summarizes the coverage of 95% confidence intervals forthe sketching estimators. We report the overall proportion of intervals containing the true value of the least squares estimate β F over the 250 sketches and p = 1000 coefficients. The observed coverage is close to the nominal level of 0.95 at both levels of k. The different random projections give very similar results, suggesting that the use of asymptotic approximations is again reasonable for this dataset. The intervals for the Hadamard sketch appear to be slightly conservative at k = 10 000.

Table 2

Coverage of confidence intervals; the largest standard error is 0.004

	HLA		HLA		Flights
	k = 1500		k = 10 000		k = 1500
	β _S	βP∗	β _S	βP∗	β _S	βP∗
Gaussian	0.950	0.953	0.950	0.951	0.948	0.951
Hadamard	0.949	0.949	0.954	0.954	0.950	0.948
Clarkson–Woodruff	0.947	0.952	0.951	0.950	0.948	0.947

Table 3 reports the average sketching times for the data-oblivious sketches. We computed 10 sketches using each projection. The Gaussian sketch is an order of magnitude slower than the Hadamard projection and two orders of magnitude slower than the Clarkson–Woodruff sketch.

Table 3

Timings for sketching: average times to compute the sketched dataset Ã = SA, in seconds

	HLA		Flights
	k = 1500	k = 10 000	k = 5000
Gaussian	522	3479	404
Hadamard	57	65	5.8
Clarkson–Woodruff	5.3	5.4	0.2

New York flights dataset

We also evaluated the sketching algorithms on the New York flights dataset available in the R (R Development Core Team, 2021) package nycflights13 (Wickham, 2014). Arrival delay was taken as the response, and departure delay, distance, departure time, origin, and month and day were chosen to be the covariates. Rows of the dataset with missing data were omitted, sothat we were left with n = 327 346 and d = 47. The goal is to compare the accuracy of the various sketches on real data rather than to build a statistical model for the flights dataset. We compare the mean squared error of the estimators and the coverage of confidence intervals for k = 5000. In contrast to the HLA dataset, the flights dataset has a very high value of 0.99. We took 500 sketches to compare complete and partial sketching. See Table 4 for details.

Table 4

Mean squared errors of sketching estimators (with standard errors in parentheses) on the flights dataset with k = 5000

	β _S	β _P	βP∗
Gaussian	60 (2)	14900 (400)	14900 (400)
Hadamard	63 (2)	14800 (500)	13900 (400)
Clarkson–Woodruff	66 (2)	15000 (500)	13800 (400)
Uniform	64 (2)	14600 (500)	14600 (400)

Discussion

In recent years work has been done to adapt sketching methods for statistical inference in large datasets, building upon the worst-case bounds developed in the computer science literature. Geppert et al. (2017) and Bardenet & Maillard (2015) investigated sketching algorithms for Bayesian regression, and derived bounds on the difference between the sketched posterior distribution and the full-data posterior distribution. Only complete sketching was considered in those works. The results on the advantages of partial sketching in this paper could motivate adaptations that make use of the exact marginal associations X T y. Sketching ideas have been used to develop methods for approximate nonlinear regression (Banerjee et al., 2013; Avron et al., 2014). The goodness of fit ofthe model may also influence the relative efficiency of different sketching algorithms in more complex regression tasks. A related branch of work uses random projections to reduce the number of predictors inregressionandclassificationproblems(Shah & Meinshausen, 2018; Guhaniyogi & Dunson, 2015; Cannings & Samworth, 2017).

Supplementary Material

Supplementary material available at Biometrika online includes proofs of all the theorems.

4 in total

1. Efficient Gaussian process regression for large datasets.

Authors: Anjishnu Banerjee; David B Dunson; Surya T Tokdar
Journal: Biometrika Date: 2013-03 Impact factor: 2.445

2. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease.

Authors: William J Astle; Heather Elding; Tao Jiang; Dave Allen; Dace Ruklisa; Alice L Mann; Daniel Mead; Heleen Bouman; Fernando Riveros-Mckay; Myrto A Kostadima; John J Lambourne; Suthesh Sivapalaratnam; Kate Downes; Kousik Kundu; Lorenzo Bomba; Kim Berentsen; John R Bradley; Louise C Daugherty; Olivier Delaneau; Kathleen Freson; Stephen F Garner; Luigi Grassi; Jose Guerrero; Matthias Haimel; Eva M Janssen-Megens; Anita Kaan; Mihir Kamat; Bowon Kim; Amit Mandoli; Jonathan Marchini; Joost H A Martens; Stuart Meacham; Karyn Megy; Jared O'Connell; Romina Petersen; Nilofar Sharifi; Simon M Sheard; James R Staley; Salih Tuna; Martijn van der Ent; Klaudia Walter; Shuang-Yin Wang; Eleanor Wheeler; Steven P Wilder; Valentina Iotchkova; Carmel Moore; Jennifer Sambrook; Hendrik G Stunnenberg; Emanuele Di Angelantonio; Stephen Kaptoge; Taco W Kuijpers; Enrique Carrillo-de-Santa-Pau; David Juan; Daniel Rico; Alfonso Valencia; Lu Chen; Bing Ge; Louella Vasquez; Tony Kwan; Diego Garrido-Martín; Stephen Watt; Ying Yang; Roderic Guigo; Stephan Beck; Dirk S Paul; Tomi Pastinen; David Bujold; Guillaume Bourque; Mattia Frontini; John Danesh; David J Roberts; Willem H Ouwehand; Adam S Butterworth; Nicole Soranzo
Journal: Cell Date: 2016-11-17 Impact factor: 41.582

3. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Authors: Bryan N Howie; Peter Donnelly; Jonathan Marchini
Journal: PLoS Genet Date: 2009-06-19 Impact factor: 5.917

4. The UK Biobank resource with deep phenotyping and genomic data.

Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini
Journal: Nature Date: 2018-10-10 Impact factor: 49.962

4 in total