
Generalized least squares can overcome the critical threshold in respondent-driven sampling.

Abstract

To sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like $O(n^{-1})$, where $n$ is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is $O(n^{-1})$. We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from a random walk sample of the nodes. These theoretical results point the way to entirely different classes of estimators that account for the network structure beyond node degree. Diagnostic plots help to identify situations where feasible GLS estimators are more appropriate. The computational experiments show the potential benefits and also indicate that there is room to further develop these estimators in practical settings.


Keywords: link-tracing sampling; snowball sampling; spectral gap

Year: 2018 PMID: 30254152 PMCID: PMC6187121 DOI: 10.1073/pnas.1706699115

Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205

Respondent-driven sampling (RDS) is a popular network-based approach to sample marginalized and/or hard-to-reach populations (1). RDS has become particularly popular in HIV research because the populations most at risk for HIV (e.g., people who inject drugs, female sex workers, and men who have sex with men) cannot be sampled using conventional techniques. Several domestic and international institutions use RDS to quantify the prevalence of HIV in at-risk populations, including the Centers for Disease Control (CDC), the World Health Organization (WHO), and the Joint United Nations Program on HIV/AIDS (UNAIDS) (2). The most recent review of the literature in 2015 counted over 460 different RDS studies, in 69 different countries (3).

Because RDS collects samples by link-tracing the relationships in a social network, adjacent samples are dependent. In a simulation study, ref. 4 showed how this can lead to highly variable estimates. Under independent sampling, the variance of standard estimators decays like $O(1/n)$, where $n$ is the sample size. This implies that a sample of size $4n$ will have a 50% smaller SE than a sample of size $n$. However, this does not necessarily hold for RDS. Under a Markov model, ref. 5 showed how the dependence induced by RDS can drastically inflate the variance of traditional estimators, making it decay at a rate slower than $O(1/n)$. This implies that reducing the sampling error by 50% can require far more than 4 times as many samples, so confidence intervals are much wider than under independent sampling.

Using the covariance function derived in ref. 5, this paper studies the generalized least squares (GLS) estimator for RDS. Our theoretical analysis establishes that the variance of the GLS estimator is $O(1/n)$. We then derive a feasible GLS (fGLS) estimator based upon the Degree Corrected Stochastic Blockmodel (DC-SBM). Two alternative estimators are also derived. These estimators first construct estimates of the spectral properties of the population social graph, which might be of independent interest. Our fGLS estimators easily accommodate any preliminary reweighting of the data to adjust for the sampling biases that occur in RDS (e.g., refs. 6 and 7). We study these estimators with simulations and propose a simple diagnostic plot to compare the different fGLS estimators.

A Simple Motivating Example

Fig. 1 uses a model studied in ref. 8. In this example, the population that we wish to sample is equally divided into two groups: HIV+ and HIV–. The seed participant is selected uniformly at random. Starting with the seed participant, every participant refers two additional participants (as in a complete binary tree). The participant refers a person that matches his or her own HIV status with probability $p$ and refers a person with the opposite status with probability $1 - p$. Each referral is independent. Using this sample, we wish to estimate the proportion of the population that is HIV+ (i.e., 50%). Fig. 1 compares two estimators: (i) the sample proportion and (ii) the GLS estimator proposed in this paper.
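The referral process in this example can be simulated in a few lines. The sketch below is illustrative only: the function name, the `depth` parameter, the breadth-first node labeling, and the symbol `p` for the same-status referral probability are assumptions made for this example and are not taken from the paper.

```python
import random


def simulate_binary_tree_referrals(depth, p, rng=None):
    """Simulate HIV status (1 = HIV+, 0 = HIV-) on a complete binary referral tree.

    Nodes are stored in breadth-first order, so node k > 0 has parent (k - 1) // 2.
    `depth` and `p` are hypothetical parameter names chosen for this sketch.
    """
    if rng is None:
        rng = random.Random()
    n = 2 ** (depth + 1) - 1           # number of nodes in a complete binary tree
    status = [0] * n
    status[0] = rng.randint(0, 1)      # seed drawn uniformly from the two groups
    for k in range(1, n):
        parent = (k - 1) // 2
        same = rng.random() < p        # referral matches the recruiter w.p. p
        status[k] = status[parent] if same else 1 - status[parent]
    return status


sample = simulate_binary_tree_referrals(depth=9, p=0.9, rng=random.Random(0))
sample_proportion = sum(sample) / len(sample)
print(len(sample), sample_proportion)
```

Repeating the simulation over many seeds and comparing the spread of `sample_proportion` for different values of `p` gives an empirical feel for how strongly the referral dependence inflates the variance.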

Fig. 1.

In this experiment, GLS provides dramatic improvements when the sample is large and the correlation between samples (i.e., $p$) is high. Both axes are on the log scale.

Under this sampling-with-replacement model, the variances of both the sample proportion and the GLS estimator have closed-form solutions (see ref. 8 and the proof of Theorem 2 in the supporting information). Fig. 1 gives the ratio of these formulas as a function of the sample size $n$. There are three lines, corresponding to three values of the same-status referral probability $p$. In all cases, the lines are less than 1, indicating that the GLS estimator has a smaller variance than the sample proportion. Under this simulation model, if $p$ is large enough to exceed the critical threshold, then the variance of the sample proportion decays slower than $O(1/n)$ (5, 8). As Theorems 1 and 2 below show, the variance of the GLS estimator converges to 0 like $O(1/n)$. So, as $n$ increases, the bottom line converges to 0. The other two lines, on the other hand, do not converge to 0.

Preliminaries

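The Markov model used in this paper is a straightforward combination of the Markov models developed in the RDS literature (e.g., refs. 1, 6, and 8). The social network, $G = (V, E)$, consists of the node set $V = \{1, \dots, N\}$ and the edge set $E = \{(i, j) : i \text{ and } j \text{ can refer each other}\}$. To simplify the notation, $i \in G$ is used synonymously with $i \in V$. Unless otherwise noted, everything below also applies to weighted graphs. Let $w_{ij}$ be the weight of edge $(i, j) \in E$, which models preferential recruitment as described in the supporting information. If $(i, j) \notin E$, define $w_{ij} = 0$. If the graph is unweighted, then let $w_{ij} = 1$ for all $(i, j) \in E$. Throughout this paper, the graph is undirected; that is, $w_{ij} = w_{ji}$ for all pairs $i, j$. Define the degree of node $i$ as $\deg(i) = \sum_j w_{ij}$. For each node $i \in G$, let $y(i)$ denote some characteristic of this node (e.g., the indicator of HIV status). We wish to estimate the population average:

$$\mu_{\text{true}} = \frac{1}{N} \sum_{i \in G} y(i).$$

We assume that the nodes are sampled with a Markov process that is indexed by a rooted tree $\mathbb{T}$ (i.e., a connected graph with $n$ nodes, no cycles, and a vertex 0). The seed participant is vertex 0 in $\mathbb{T}$. To simplify the notation, $\tau \in \mathbb{T}$ is used synonymously with $\tau$ belonging to the vertex set of $\mathbb{T}$. For any node $\tau \neq 0$ in the tree $\mathbb{T}$, denote $\tau'$ as the parent of $\tau$ (the node one step closer to the root). Define the matrix $P$ as

$$P_{ij} = \frac{w_{ij}}{\deg(i)}.$$

Because the graph is undirected, $P$ is a reversible Markov transition matrix with a stationary distribution $\pi$, where $\pi_i = \deg(i) / \sum_j \deg(j)$. Our sample is the set of random nodes

$$\{X_\tau \in G : \tau \in \mathbb{T}\},$$

where $X_0$ is initialized with $\pi$ and each transition $X_{\tau'} \to X_\tau$ is independent with

$$\Pr(X_\tau = j \mid X_{\tau'} = i) = P_{ij}.$$

Observe that $\mathbb{T}$ and $G$ are distinct graphs: The nodes in $\mathbb{T}$ index the Markov process, while the nodes in $G$ are its state space. Following ref. 9, we refer to this stochastic process as a $(\mathbb{T}, P)$-walk on $G$. When the $X_\tau$s sample the target population in $G$, we observe

$$Y_\tau = y(X_\tau).$$

Under the stationary $(\mathbb{T}, P)$-walk on $G$, the sample average of the $Y_\tau$s is an estimate of

$$\mu = E_\pi[y(X)] = \sum_{i \in G} \pi_i\, y(i).$$

In general, $\mu \neq \mu_{\text{true}}$ (where $\mu_{\text{true}}$ is the population average defined above), and the sample average must be adjusted with sampling weights to obtain an unbiased estimator of $\mu_{\text{true}}$. Define

$$y^{\pi}(i) = \frac{y(i)}{N \pi_i}, \qquad Y^{\pi}_\tau = y^{\pi}(X_\tau).$$

The sample average of the $Y^{\pi}_\tau$s is the inverse probability weighted (IPW) estimator; it is an unbiased estimator of $\mu_{\text{true}}$ (10).

However, the weights $\pi_i$ are unknown and must be estimated with additional information, as we describe next. Under the $(\mathbb{T}, P)$-walk on $G$,

$$N \pi_i = \frac{\deg(i)}{\bar d},$$

where $\bar d = \frac{1}{N} \sum_j \deg(j)$ is the average degree. The popular Volz–Heckathorn (VH) estimator replaces $\bar d$ with the harmonic mean of the sampled degrees (6). Recall that $\mathbb{T}$ has $n$ nodes and define

$$\hat H = \left( \frac{1}{n} \sum_{\tau \in \mathbb{T}} \frac{1}{\deg(X_\tau)} \right)^{-1}, \qquad y^{\text{VH}}(i) = \frac{y(i)\, \hat H}{\deg(i)},$$

and $Y^{\text{VH}}_\tau = y^{\text{VH}}(X_\tau)$. The VH estimator is the sample average of the $Y^{\text{VH}}_\tau$s, and it is an asymptotically unbiased estimator of $\mu_{\text{true}}$ under the $(\mathbb{T}, P)$-walk on $G$. [In practice, $\deg(X_\tau)$ is estimated by asking participants how many contacts they have. Recall that $\deg(i) = \sum_j w_{ij}$. If the graph is weighted, then the $(\mathbb{T}, P)$-walk exhibits preferential recruitment (as discussed in the supporting information) and the number of contacts will not necessarily align with $\deg(i)$, making the estimator biased.]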
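As a concrete illustration of the VH reweighting, the sketch below computes the harmonic mean of the reported degrees and then averages the reweighted outcomes. It is a minimal example under the assumption that each participant's outcome and self-reported degree are available as plain lists; the function and argument names are hypothetical.

```python
def vh_estimate(y, degrees):
    """Volz-Heckathorn style estimate: an inverse-degree weighted mean of y.

    `y` holds the observed outcomes and `degrees` the self-reported number of
    contacts for each sampled participant (both names are illustrative).
    """
    if len(y) != len(degrees) or len(y) == 0:
        raise ValueError("y and degrees must be non-empty and of equal length")
    inverse_degrees = [1.0 / d for d in degrees]
    # Harmonic mean of the sampled degrees.
    harmonic_mean = len(inverse_degrees) / sum(inverse_degrees)
    # Reweight each outcome by (harmonic mean) / (own degree), then average.
    reweighted = [yi * harmonic_mean / d for yi, d in zip(y, degrees)]
    return sum(reweighted) / len(reweighted)


# High-degree participants are over-sampled by the walk, so they are down-weighted.
print(vh_estimate([1, 0, 1, 0], [10, 2, 5, 1]))
```

When every participant reports the same degree, the reweighting has no effect and the estimate reduces to the ordinary sample mean.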

Remark

The next section will drop the superscripts $\pi$ and VH in $Y^{\pi}_\tau$ and $Y^{\text{VH}}_\tau$. Using the $Y^{\pi}_\tau$s to construct the GLS estimator will lead to an unbiased estimator of $\mu_{\text{true}}$. In practice, before doing any of the GLS computations, one could replace the $Y_\tau$s with $Y^{\pi}_\tau$ or $Y^{\text{VH}}_\tau$ to estimate $\mu_{\text{true}}$. The simulations in this paper use a reweighting that is similar to $Y^{\text{VH}}_\tau$ but replaces the harmonic mean of the sampled degrees with a GLS estimate of the same quantity. In ref. 7, sampling weights are estimated under an alternative, non-Markovian model. These weights could also be used before doing GLS computations.

GLS for RDS

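The GLS estimator is the weighted average of the $Y_\tau$s with smallest variance (11); that is, it is the solution to

$$\min_{w \in \mathbb{R}^n} \operatorname{Var}\Big( \sum_{\tau \in \mathbb{T}} w_\tau Y_\tau \Big) \quad \text{subject to} \quad \sum_{\tau \in \mathbb{T}} w_\tau = 1.$$

Because of the constraint that the weights sum to 1, the linearity of expectation, and the fact that the walk is stationary, the resulting estimator is an unbiased estimate of the common mean $E[Y_\tau]$. Define the covariance matrix as

$$\Sigma_{\sigma \tau} = \operatorname{Cov}(Y_\sigma, Y_\tau), \qquad \sigma, \tau \in \mathbb{T},$$

which is assumed to be nonsingular. It can be seen that the solution to this minimization depends upon solving a system of equations involving the covariance matrix; namely, $w = x / (\mathbf{1}^T x)$, where $\Sigma x = \mathbf{1}$. (Throughout, we use the notation $\mathbf{1}_n$ for the all-one vector of length $n$. We drop the length when clear from context.) If $Y$ is the vector of $Y_\tau$s, then the GLS estimator can be expressed as

$$\hat\mu_{\text{GLS}} = \frac{\mathbf{1}^T \Sigma^{-1} Y}{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}.$$

The rest of this section contains our main theoretical results, which study how

$$\operatorname{Var}(\hat\mu_{\text{GLS}}) = \frac{1}{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}$$

decays with the sample size.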
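The GLS computation amounts to one linear solve against the covariance matrix. The following sketch assumes the covariance matrix is known and nonsingular and uses NumPy; the function name and return values are illustrative choices, not part of the paper.

```python
import numpy as np


def gls_estimate(y, sigma):
    """Minimum-variance weighted average of y given its covariance matrix.

    Returns the estimate, the weights (which sum to 1), and the variance of
    the estimator, 1 / (1' Sigma^{-1} 1).
    """
    y = np.asarray(y, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ones = np.ones(len(y))
    x = np.linalg.solve(sigma, ones)   # x = Sigma^{-1} 1, without forming the inverse
    weights = x / x.sum()              # normalize so the weights sum to 1
    estimate = float(weights @ y)
    variance = 1.0 / x.sum()
    return estimate, weights, variance


# Two positively correlated observations: by symmetry they receive equal weight.
est, w, var = gls_estimate([0.0, 1.0], [[1.0, 0.5], [0.5, 1.0]])
print(est, w, var)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the usual numerically safer choice when only the product of the inverse with a vector is needed.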

Main Result.

In our main result, we assume that $\mathbb{T}$ is a complete binary tree with $n$ nodes, but we expect the result to hold for more general tree topologies.

Theorem 1 (Main Result).

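Let $\{X_\tau\}$ be sampled from the $(\mathbb{T}, P)$-walk on $G$ for a fixed transition matrix $P$ that is irreducible and reversible with respect to a stationary distribution $\pi$. If $\mathbb{T}$ is a complete binary tree with $n$ nodes, then the variance of the GLS estimator defined above decays like $O(1/n)$ as $n \to \infty$.

The proof, which is contained in the supporting information, does not directly compute the variance of the GLS estimator. Instead, it proceeds by constructing an explicit linear estimator and relies on the variational characterization of the GLS estimator as the minimum-variance weighted average. We emphasize that computing the GLS estimator requires the covariance matrix $\Sigma$, which is typically unknown. The next section proposes a technique to estimate $\Sigma$ that is based upon the SBM. We also point out that the result in Theorem 1 is asymptotic and, as such, is only meaningful for $n$ large enough.

Before moving on to practical estimators, we give a more precise result on the constant in the $O(1/n)$ by making further assumptions on the spectral properties of $P$ or of the features $y$. The eigenvectors of the reversible transition matrix $P$, denoted $f_1, \dots, f_N$, are real-valued functions of the nodes that are orthonormal with respect to the inner product

$$\langle f, g \rangle_\pi = \sum_{i \in G} f(i)\, g(i)\, \pi_i.$$

(See, e.g., lemma 12.2 of ref. 12.) We take the eigenfunction corresponding to the eigenvalue 1 to be the constant vector $f_1 = \mathbf{1}$. Define $\langle y, f_\ell \rangle_\pi$ for $\ell = 1, \dots, N$ and note that $\langle y, f_1 \rangle_\pi = \mu$. Let $\lambda_\ell$ be the eigenvalues of $P$ corresponding to $f_\ell$. For each node $i$, $y(i)$ decomposes as follows:

$$y(i) = \sum_{\ell = 1}^{N} \langle y, f_\ell \rangle_\pi\, f_\ell(i).$$

Under the $(\mathbb{T}, P)$-walk on $G$, the covariance is stationary with autocovariance function

$$\gamma(d) = \sum_{\ell = 2}^{N} \langle y, f_\ell \rangle_\pi^2\, \lambda_\ell^{d}.$$

That is, the covariance matrix has the form $\Sigma_{\sigma \tau} = \gamma(d(\sigma, \tau))$, where $d(\sigma, \tau)$ is the graph distance (i.e., minimum path length) between $\sigma$ and $\tau$ in $\mathbb{T}$ (5). When the autocovariance further simplifies to

$$\gamma(d) = \gamma(0)\, \lambda^{d}$$

for some $\lambda$, then we call the $(\mathbb{T}, P)$-walk on $G$ with feature $y$ a rank-two model. For instance, if $P$ has rank two and $y$ has no component outside the span of $f_1$ and $f_2$, then as the name suggests, we have a rank-two model. In particular, all of the results in ref. 8 are for such transition matrices. Fig. 1 also studies such a rank-two model on two groups of people. There are other sufficient conditions for this rank-two form. For instance, if $y(i) = \mu + c\, f_2(i)$ for all nodes $i$ and some constant $c$, then we have a rank-two model because $\langle y, f_\ell \rangle_\pi = 0$ for $\ell > 2$.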
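To make the covariance structure concrete, the sketch below builds a covariance matrix whose entries depend only on the tree distance between two sampled nodes and compares the variance of the GLS estimator with that of the plain sample mean. It assumes an autocovariance proportional to `lam ** distance` with unit variance, a breadth-first labeling of a complete binary tree, and NumPy; the symbol `lam` and all function names are hypothetical choices for this example. In the two-group motivating example, the lag-one correlation of a symmetric two-state chain with same-status probability `p` is `2 * p - 1`, so `p = 0.9` corresponds to `lam = 0.8`.

```python
import numpy as np


def binary_tree_distances(n):
    """Pairwise graph distances in a binary tree with breadth-first labels 0..n-1."""
    depth = [0] * n
    for k in range(1, n):
        depth[k] = depth[(k - 1) // 2] + 1
    dist = np.zeros((n, n), dtype=int)
    for a in range(n):
        for b in range(a + 1, n):
            u, v, steps = a, b, 0
            while u != v:                  # walk the deeper node up until they meet
                if depth[u] >= depth[v]:
                    u = (u - 1) // 2
                else:
                    v = (v - 1) // 2
                steps += 1
            dist[a, b] = dist[b, a] = steps
    return dist


def rank_two_variances(n, lam):
    """Variance of the GLS estimator and of the sample mean when Cov = lam ** distance."""
    sigma = lam ** binary_tree_distances(n)    # unit variance on the diagonal
    ones = np.ones(n)
    var_gls = 1.0 / (ones @ np.linalg.solve(sigma, ones))
    var_mean = (ones @ sigma @ ones) / n ** 2
    return var_gls, var_mean


for n in (15, 63, 255):
    var_gls, var_mean = rank_two_variances(n, lam=0.8)
    print(n, var_gls / var_mean)
```

The printed ratio plays the role of the quantity plotted in Fig. 1: values below 1 indicate that GLS has the smaller variance under the assumed covariance.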

Theorem 2.

Under a rank-two model,This proof follows from the fact that under a rank-two model, has a closed form expression (see ).

Using RDS to Estimate the Spectral Properties of the Graph for fGLS

The fGLS estimator depends upon an estimated covariance matrix (e.g., see ref. 13):With this notation, observe that . In our setting, estimating is equivalent to estimating . We propose and compare several estimators for . An estimator based upon the DC-SBM is derived in this section. Two additional estimators based upon the rank-two assumption are derived in . The first rank-two estimator, , relies upon a plug-in estimator for the correlation between and (i.e., the autocorrelation at lag 1). The second rank-two estimator, , relies upon plug-in estimators for the first and second differences, and .

Estimating the Spectral Properties of a SBM from an RDS.

The DC-SBM is a generalization of the SBM (14, 15). Both are models for a random network with community structure. As the name suggests, the degree-corrected model allows for degree heterogeneity within the blocks.

Definition (DC-SBM).

Partition the nodes into blocks with : , and assign each node a value such that the s sum to 1 within each block—that is,The block membership of node is , and the parameter controls the degree heterogeneity within each block. Let be a symmetric matrix such that for all . Under the DC-SBM,for all pairs and each possible edge is independent. In much of the previous literature on the DC-SBM, the full network is observed, and we wish to estimate the partition . In this paper, we presume that is observed on the sampled nodes in the , and we wish to estimate the spectral properties of . This is reasonable in RDS because each participant takes a survey that records several salient demographic variables (e.g., gender, race, neighborhood, etc.). In practice, the block labels should be chosen such that they are highly autocorrelated from one referral to the next. Many RDS papers already report such statistics. For example, the original RDS paper (1) presents four empirical transition matrices on four different demographic partitions (i.e., race, gender, drug preference, and location). The derivations below condition on the block labels ; only the graph is random. Let be the (random) adjacency matrix; if and only if . Define such that . Define as a diagonal matrix with -th element . Define as a population version of . The inspiration for the following estimators is based on a population version of the chain and relies on three results. Define the matrix such that for any two blocks , Proposition 1 below shows that is an estimator of under a . Then, Proposition 2 shows that a normalized version of has spectral properties that match the spectral properties of . Finally, under the DC-SBM, if the smallest expected degree is growing fast enough, then converges to in spectral norm (e.g., see ref. 16). So estimates of the spectral properties of are similar to the spectral properties of . 
With these facts in mind, we propose estimating the spectral properties of with the spectral properties of a normalized version of . We let be such that if and only if .

Proposition 1.

If is constructed from the DC-SBM and if is computed via a sample from the , thenwhere .
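The block-level matrix referred to above is built from the block labels of recruiters and their recruits. Because the defining formula did not survive in this copy of the text, the sketch below is only one plausible reading, stated as an assumption: it tabulates the proportion of referrals that go from block `u` to block `v`. The function and argument names are hypothetical.

```python
def block_transition_proportions(parent_blocks, child_blocks, num_blocks):
    """Proportion of referrals going from block u (recruiter) to block v (recruit).

    `parent_blocks[t]` and `child_blocks[t]` are the integer block labels
    (0 .. num_blocks - 1) of the recruiter and recruit in the t-th referral.
    This is an assumed form of the block-level estimator, not the paper's formula.
    """
    if len(parent_blocks) != len(child_blocks) or len(parent_blocks) == 0:
        raise ValueError("need one recruiter label per recruit label")
    counts = [[0] * num_blocks for _ in range(num_blocks)]
    for u, v in zip(parent_blocks, child_blocks):
        counts[u][v] += 1
    total = len(parent_blocks)
    return [[c / total for c in row] for row in counts]


q_hat = block_transition_proportions([0, 0, 1, 1], [0, 1, 1, 1], num_blocks=2)
print(q_hat)
```

A matrix with most of its mass on the diagonal indicates that referrals tend to stay within a block, which is the situation in which referral bottlenecks inflate the variance of unadjusted estimators.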

Proposition 2.

Define to be a diagonal matrix that contains the row sums of —that is, —and define . Define and via the eigendecomposition, . Define , where is the stationary distribution of . Then, (i) the nonzero eigenvalues of are identical to the nonzero eigenvalues of ; (ii) the columns ofare eigenvectors of ; and (iii) if is sampled from , thenThe proofs of the propositions are given in . We now introduce our estimator of and .

SBM-fGLS.

Using as an observed partition of the nodes (e.g., by demographic characteristics), the SBM estimator of is computed with the following steps. Each step uses a plug-in estimator using the previously derived formulas. After the statement of the algorithm, the steps are matched to the motivating equation. For notational convenience, denote , , and as and for each sampled individual . Moreover, suppose a one-to-one mapping between the node set of and :where provides for Tikhonov regularization in . Compute via Eq. using the block memberships . Define . This symmetrization ensures the eigenvalues are real-valued. Row and column normalize , as where . Take an eigendecomposition of Compute , where contains if . For , compute , where is the element of . Compute an estimate of the autocovariance function as Define to be the sample variance of the . For , Define to solve the system of equations . Estimate with . Step i comes from Proposition 1. Steps iv and v come from Eqs. and in Proposition 2. Step vi comes from Eq. . In all of the plug-in formulas, it is unnecessary to estimate because we must only specify up to a constant of proportionality; this constant appears in both the numerator and denominator of in step ix.
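The symmetrization, normalization, and eigendecomposition steps can be sketched as follows. This is a hedged illustration rather than the paper's exact algorithm: it assumes the block-level matrix is nonnegative with positive row sums, symmetrizes it by averaging with its transpose, normalizes by the inverse square roots of the row sums, and omits the Tikhonov regularization mentioned in the text because its exact form is not recoverable here. The function name is hypothetical and NumPy is assumed.

```python
import numpy as np


def normalized_block_spectrum(q_hat):
    """Eigenvalues and eigenvectors of a symmetrized, degree-normalized block matrix."""
    q = np.asarray(q_hat, dtype=float)
    q_sym = (q + q.T) / 2.0                    # symmetrize so eigenvalues are real
    row_sums = q_sym.sum(axis=1)
    scale = 1.0 / np.sqrt(row_sums)
    # Row and column normalization: D^{-1/2} Q_sym D^{-1/2}.
    normalized = scale[:, None] * q_sym * scale[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(normalized)
    order = np.argsort(eigenvalues)[::-1]      # largest eigenvalue first
    return eigenvalues[order], eigenvectors[:, order]


values, vectors = normalized_block_spectrum([[0.4, 0.1], [0.1, 0.4]])
print(values)
```

The normalized matrix shares its eigenvalues with the row-stochastic matrix obtained by dividing each row of the symmetrized matrix by its row sum, so the leading eigenvalue is 1 and the remaining eigenvalues describe how slowly block membership mixes along the referral chain.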

Simulations

This section compares the SBM-fGLS estimator to the VH estimator via simulation. Each simulated sample is collected by tracing contacts in social graphs collected in the National Longitudinal Study of Adolescent Health (Add Health). In the 1994–95 school year, the Add Health study collected a nationally representative sample of adolescents in grades 7–12. The sample covers 84 pairs of middle and high schools in which students nominated up to five male and five female friends in their middle/high school network (17). In this analysis, all graphs are restricted to the largest connected component. The supporting information reports a similar simulation on the Colorado Springs Project 90 network (18). These networks were previously studied in refs. 4 and 19. The simulation was performed without replacement on the directed edges; both of these settings are different from the model used in the theoretical results. Details of the simulation settings are given in the supporting information.

Fig. 2 shows the RMSE for the fGLS and VH estimators, where RMSE = $\sqrt{E[(\hat\mu - \mu_{\text{true}})^2]}$. Overall, the SBM-fGLS estimators have a smaller RMSE. Each panel in Fig. 2 has one line with an asterisk. These lines correspond to the same school, which has both (i) a referral bottleneck between the white and black populations and (ii) a referral bottleneck between the high school and middle school. None of the fGLS estimators model both bottlenecks, yet they perform well.

Fig. 2.

Reduction in RMSE for SBM-fGLS vs. VH estimators. These figures present the root mean squared error (RMSE) for the SBM-fGLS estimator and the VH estimator. Each panel corresponds to a different outcome $y$. In each panel, the horizontal axis corresponds to RMSE, and the vertical axis corresponds to different schools, ordered by RMSE of the VH estimator. Each line connects the RMSE for the SBM-fGLS to the RMSE of the VH estimator. If the line is red, then SBM-fGLS has a smaller RMSE.

The most difficult quantity to estimate, high school, also has one of the largest absolute reductions in RMSE. This is consistent with the broader pattern of the experiments and the theory: The VH estimator can have excessive error on quantities that are aligned with the community structures in the network that create referral bottlenecks, and our results suggest that fGLS can reduce the error in such cases. However, we must be careful not to over-extrapolate either the frequency or the magnitude of the fGLS improvement. These are highly empirical quantities. Moreover, the social networks that are available to perform simulation experiments (Add Health and P90) are not necessarily representative of the typical RDS population. More discussion of the idiosyncrasies of these networks is given in the supporting information.

Fig. 3 presents a diagnostic plot to evaluate the fGLS estimators using only data that are observed in a single sample. This diagnostic plot was created from the first simulated sample taken on the school that has the asterisk in Fig. 2.

Fig. 3.

Diagnostic plots. Each of these diagnostic plots is created from a single sample on the school with the asterisk in Fig. 2. We should prefer the fGLS estimators that have a smaller ratio of standard errors (RSE) as defined in the text and displayed on the vertical axis. The y corresponds to the SBM-fGLS estimator that constructs the blocks from the outcome variable of interest. For the race and ethnicity outcomes, z corresponds to the SBM-fGLS estimator that constructs the blocks with all races and ethnicities observed in the sample. In each plot, there are -many s because SBM-fGLS estimates eigenvalues; each of these points has the same value on the vertical axis. For completeness, this plot includes the rank-two estimators and that are developed in . Under the rank-two model, the RSE is completely determined by the estimated eigenvalue; this is the gray line. The horizontal axis in Fig. 3 gives eigenvalue(s) of estimated by the fGLS technique. The vertical axis gives the plug-in estimate for the RSE:We should prefer the fGLS estimators that have a smaller ratio. As is justified in more detail in , estimators with smaller RSE make reductions in the variance by taking advantage of the dependencies. Notice how the fGLS estimators have smaller ratios for the outcomes of black, white, and high school. For these outcomes, fGLS significantly reduces the RMSE in Fig. 2. It fails to identify the reduction in RMSE for the outcome male. For Asian, Hispanic, and male, the ratio of SEs is closer to 1.

Summary

This paper derives and studies GLS and fGLS estimators that account for the covariance between samples in an RDS. Under the Markov model where the covariance between samples is known, Theorems 1 and 2 show that the variance of the GLS estimator decays like $O(1/n)$. To estimate the covariance between samples, we use the fact that the covariance between adjacent samples can be exactly specified in terms of the spectral properties of the Markov transition matrix (5, 20–24). These essential spectral properties of the network can be estimated from the observed data under the DC-SBM and the rank-two model.

Simulations on the Add Health networks show that the fGLS estimates typically have smaller RMSEs than VH estimates. This simulation is performed under a more realistic model than the models used in the technical results (Theorems 1 and 2 and Propositions 1 and 2). First, the RDS is simulated on social graphs that were recorded in the Add Health study (neither rank-two nor simulated from the DC-SBM). Second, the sampling is without replacement. Third, the edges have not been symmetrized. Despite these departures from the reversible Markov model in the technical results, the estimators appear to still perform well. This finding is empirical, and given that these networks are not necessarily representative of the typical RDS population, we must be careful in extrapolating this intuition to other scenarios.

The diagnostic plots in Fig. 3 help to determine whether the outcome of interest is correlated in the observed sample. For quantities that are correlated (e.g., race, ethnicity, and school), Fig. 2 shows that fGLS estimates significantly reduce the RMSE. The supporting information presents two additional simulations to investigate the role of (i) sample size, (ii) referral rates, (iii) alignment of the outcome with the blocks, and (iv) preferential recruitment. In those simulations, when the outcome of interest correlates or aligns with the underlying structure of the graph and the referral rate is larger than the critical threshold identified in ref. 5, fGLS estimators can appreciably reduce the variability over previous estimators. In some simulations, the fGLS estimators have a smaller RMSE with 500 samples than the VH estimators have with 1,000 samples.

While the fGLS estimators are derived under a Markov model, all simulations were performed under a without-replacement (i.e., non-Markovian) model. Under the Markov model in Theorem 1 and under the simulations on the networks to which we have access, our results suggest effective ways (i) to diagnose strong dependence between samples and (ii) to alleviate such dependence. However, we must be careful in extrapolating specific values from the simulations (e.g., the amount that fGLS reduces the RMSE). The Add Health and P90 networks that are available to perform simulation experiments are not necessarily representative of the typical RDS population. The RMSE of the VH estimator and the magnitude of the reduction in RMSE from fGLS are two highly empirical quantities that change between networks and outcomes.

6 in total

1. Assessing respondent-driven sampling.

Authors: Sharad Goel; Matthew J Salganik
Journal: Proc Natl Acad Sci U S A Date: 2010-03-29 Impact factor: 11.205

2. Stochastic blockmodels and community structure in networks.

Authors: Brian Karrer; M E J Newman
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2011-01-21

3. Estimating uncertainty in respondent-driven sampling using a tree bootstrap method.

Authors: Aaron J Baraff; Tyler H McCormick; Adrian E Raftery
Journal: Proc Natl Acad Sci U S A Date: 2016-12-07 Impact factor: 11.205

4. Respondent-driven sampling as Markov chain Monte Carlo.

Authors: Sharad Goel; Matthew J Salganik
Journal: Stat Med Date: 2009-07-30 Impact factor: 2.373

5. Social networks and infectious disease: the Colorado Springs Study.

Authors: A S Klovdahl; J J Potterat; D E Woodhouse; J B Muth; S Q Muth; W W Darrow
Journal: Soc Sci Med Date: 1994-01 Impact factor: 4.634

Review 6. Strengthening the Reporting of Observational Studies in Epidemiology for respondent-driven sampling studies: "STROBE-RDS" statement.

Authors: Richard G White; Avi J Hakim; Matthew J Salganik; Michael W Spiller; Lisa G Johnston; Ligia Kerr; Carl Kendall; Amy Drake; David Wilson; Kate Orroth; Matthias Egger; Wolfgang Hladik
Journal: J Clin Epidemiol Date: 2015-05-01 Impact factor: 6.437
