Literature DB >> 36149894

MicrobiomeCensus estimates human population sizes from wastewater samples based on inter-individual variability in gut microbiomes.

Lin Zhang¹, Likai Chen², Xiaoqian Annie Yu³, Claire Duvallet^4,5, Siavash Isazadeh^4,5, Chengzhen Dai⁶, Shinkyu Park⁶, Katya Frois-Moniz^4,5, Fabio Duarte⁶, Carlo Ratti⁶, Eric J Alm^4,5,7, Fangqiong Ling^1,8,9,10.

Abstract

The metagenome embedded in urban sewage is an attractive new data source to understand urban ecology and assess human health status at scales beyond a single host. Analyzing the viral fraction of wastewater in the ongoing COVID-19 pandemic has shown the potential of wastewater as aggregated samples for early detection, prevalence monitoring, and variant identification of human diseases in large populations. However, using census-based population size instead of real-time population estimates can mislead the interpretation of data acquired from sewage, hindering assessment of representativeness, inference of prevalence, or comparisons of taxa across sites. Here, we show that taxon abundance and sub-species diversisty in gut-associated microbiomes are new feature space to utilize for human population estimation. Using a population-scale human gut microbiome sample of over 1,100 people, we found that taxon-abundance distributions of gut-associated multi-person microbiomes exhibited generalizable relationships with respect to human population size. Here and throughout this paper, the human population size is essentially the sample size from the wastewater sample. We present a new algorithm, MicrobiomeCensus, for estimating human population size from sewage samples. MicrobiomeCensus harnesses the inter-individual variability in human gut microbiomes and performs maximum likelihood estimation based on simultaneous deviation of multiple taxa's relative abundances from their population means. MicrobiomeCensus outperformed generic algorithms in data-driven simulation benchmarks and detected population size differences in field data. New theorems are provided to justify our approach. This research provides a mathematical framework for inferring population sizes in real time from sewage samples, paving the way for more accurate ecological and public health studies utilizing the sewage metagenome.

Entities: Chemical

Mesh：

Substances：
Sewage
Waste Water

Year: 2022 PMID： 36149894 PMCID： PMC9534451 DOI： 10.1371/journal.pcbi.1010472

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.779

Introduction

The metagenome embedded in urban sewage is an attractive new data source to understand urban ecology and assess human health status at scales beyond a single host [1-3]. Sewage microbiomes are found to share a variety of taxa with human gut microbiomes, where the baseline communities are characterized by a dominance of human-associated commensal organisms from the Bacteroidetes and Firmicutes phyla [1, 3, 4]. Human viruses like SARS-CoV-2 and polioviruses were detected in sewage samples during the pandemic and silent spreads, respectively, and found to correlate to reported cases, suggesting that sewage samples could be useful for understanding the dynamics in the human-associated symbionts at a population level [5, 6]. Sewage has several advantages as samples of the population’s collective symbionts. For instance, sewage samples are naturally aggregated, wastewater infrastructures are highly accessible, and data on human symbionts can be collected without visits to clinics, thus utilizing sewage samples can reduce costs and avoid biases associated with stigma and accessibility [2, 7]. Consequently, SARS-CoV-2 surveillance utilizing sewage samples are underway globally and incorporated into the U.S. Centers for Disease Control and Prevention surveillance framework [8]. A pressing challenge in utilizing sewage for ecological and public health studies is the lack of methods to directly estimate human population size from sewage. Specifically, virus monitoring at finer spatial granularity, e.g., single university dorms and nursing homes, are informative for guiding contact tracing and protecting populations at higher risk, but real-time population size estimations at such fine granularity are not yet available. For a given area, the census population (de jure population) can be larger than the number of people who contributed feces to sewage at a given time (de facto population)[9]. Conversely, the de jure population can also be smaller than the de facto population due to the presence of undocumented individuals [10]. Population proxies that are currently used for monitoring at wastewater-treatment plants, such as the loading of pepper mild mottle viruses, likely have high error at the neighborhood level because of their large variability in human fecal viromes (106-109 virions per gram of dry weight fecal matter)[11]. Consequently, it is difficult to assess the representativeness of a sewage sample, infer the taxon abundance differences across time and space, or interpret errors. Lack of population size information could lead to false negatives in assessing virus eradication, because an absence of biomarkers might be caused by a sewage sample that under-represents the population size. Despite its importance, few studies have explicitly explored ways to estimate real-time human population size from sewage samples independent from census estimates [12]. Macroecological theories of biodiversity may offer clues to decipher and even enumerate the sources of a sewage microbiome. While we are only beginning to view sewage as samples of human symbionts beyond one person, generating multi-host microbiomes resembles a fundamental random additive process. Sizling et al. showed that lognormal species abundance distributions (SADs) can be generated solely from summing the abundances from multiple non-overlapping sub-assemblages to form new assemblages [13]. Likewise, adding multiple sub-assemblages can also give rise to common Species-Area Relationships [13]. For microbial ecosystems, Shoemaker et al. examined the abilities of widely known and successful models of SADs in predicting microbial SADs and found that Poisson Lognormal distributions outperformed other distributions across environmental, engineered, and host-associated microbial communities, highlighting the underpinning role of lognormal processes in shaping microbial diversity [14]. In this study, we conceptualize a sewage microbiome as a multi-person microbiome, where the number of human contributors can vary. We hypothesize that the species abundance distribution in the multi-person microbiome will vary as a function of the human population size, which would arise from summing taxon abundances from multiple hosts analogous to the Central Limit Theorem. We use human gut microbiome data comprising over a thousand human subjects and machine learning algorithms to explore these relationships. Upon discovering a generalizable relationship, we develop MicrobiomeCensus, a nonparametric model that utilizes relative taxon abundances in the microbiome to predict the number of people contributing to a sewage sample. MicrobiomeCensus utilizes a multivariate T statistic to capture the simultaneous deviation of multiple taxa’s abundances from their means in a human population and performs maximum likelihood estimation. We provide proof on the validity of our approach. Next, we examine model performance through a simulation benchmark using human microbiome data. Last, we apply our model to data derived from real-world sewage. Our nonparametric method does not assume any underlying distributions of microbial abundances and can make inferences with just the computational power of a laptop computer.

Results

Species abundance distributions of multi-person microbiomes vary by population size

We consider the fraction of microorganisms observed in sewage that are human-associated anaerobes as an “average gut microbiome” sampled from residents of a catchment area. Hence, our task becomes to find the underlying relationship between the number of contributors and the observed microbiome profiles in sewage samples. We define an “ideal sewage mixture” scenario to illustrate our case, where the sewage sample consists only of gut-associated microorganisms and is an even mix of n different individuals’ feces (Fig 1). We denote the gut microbiome profile of an individual as X = (X, X, …, X)⊤, where each X represents the relative abundance of operational taxonomic unit (OTU) j from individual i. Hence our ideal sewage mixture can be represented as where vectors are microbiome profiles from individuals . Under the ideal sewage mixture scenario, if we can quantitatively capture the departure of the sewage microbiome profile from the population mean of the human gut microbiomes of people constituting the catchment area, we will be able to estimate the population size.

Fig 1

An ideal sewage mixture simulation shows the potential of microbiome taxon abundance profiles as population census information sources.

An ideal sewage mixture simulation shows the potential of microbiome taxon abundance profiles as population census information sources.

(A) We generated an “ideal sewage mixture” consisting of gut microbiomes from different numbers of people. (B) Ranked abundance curves for gut microbiomes of one person and mixtures of multiple people exhibit different levels of dominance and diversity. Blue lines show the rank abundance curves in stool samples (one person), red lines show 10-person mixtures, and saffron lines show 100-person mixtures. In each scenario, ten examples are shown. All samples were rarefied to the same sequencing depths (4,000 seqs/sample). (C) The probability density function of the relative abundance of one taxon for different population sizes. OTU-2379, a Bifidobacterium taxon, was used as an example. Maroon dashed lines indicate the sample means. (D) Multiple taxa’s abundance variances in one-person samples and 100-person samples. The dominant taxa are shown (top100) and are sorted by their ranks in variance. (E) The ratios of the variances of one-person samples and 100-person samples across dominant gut microbial taxa. Using a dataset comprised of 1,100 individuals’ gut microbiome taxonomic profiles [15], we created synthetic mixture samples of different numbers of contributors through bootstrapping (Fig 1A). First, examined from an ecological perspective, the shape of the ranked abundance curves of the gut microbiomes differed when the means of multiple individuals were examined: when the number of contributors increased, a normal distribution appeared(Fig 1B). For the single-person microbiomes, log-series and lognormal distributions explained 94% and 93% of the variations in the SADs, respectively, compared with 89% for Poisson lognormal, 87% for Zipf multinomial and 80% for the broken-stick model. Multi-person microbiomes were best predicted by log-series or lognormal models, but as the population increased to over a hundred, the multi-person SADs were best described by only lognormal SADs (S1 Table). We explored the distributions of the relative abundances of gut bacteria as a function of population size. As expected, the distribution of a taxon’s relative abundance changes with population size (Fig 1C). For instance, for OTU-2397, a Bifidobacterium taxon, the relative abundance distribution was approximately log-normal when the relative abundance in single-host samples was considered, yet converged to a Normal distribution when mixtures of multiple hosts were considered. Although the means of the distributions of the same taxon under different population sizes were close, the variation in the data changed. A smaller variance was observed when the number of contributors increased (Fig 1D). Notably, different taxa varied in the rates at which their variances decreased with population size (Fig 1E), suggesting that a model that considers multiple features would be useful in predicting the number of contributors.

Classifiers utilizing microbial taxon abundance features alone detects single-person and multi-person microbiomes

Inspired by the distinct shapes of SADs in multi-person gut-microbiomes from those of single-person microbiomes, we set up a classification task using the taxon relative abundances to separate synthetic communities constituting one, ten, and a hundred people. With algorithms of varying complexity, namely Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF) classifier, classification accuracies of 29.6%, 97.2%, and 100% were achieved (Fig 2). Between RF and SVM, RF showed higher sensitivity and specificity in classifying all population groups (S2 Table). This experiment suggests the usefulness of microbiome features in predicting human population counts from mixture samples.

Fig 2

Classifier performance of models utilizing gut microbiome taxon abundances.

MicrobiomeCensus is a statistical model that estimates population size from microbial taxon abundances

While the classification tasks described above demonstrated the usefulness of taxa’s relative abundances in predicting the population size, a complex model like RF provided little explanatory power. We then ask, since the variance in the relative abundance of a given taxon decreases with population size, can we devise a statistic that captures the simultaneous deviation of several taxa’s abundances from their means, and estimate population size utilizing the statistic? Further, will this new method perform well despite inter-personal variation in gut microbiomes? Our new method, MicrobiomeCensus, involves a statistic T to capture the simultaneous deviation of multiple taxa’s abundances from their means in relation to the variance of those taxa in the population (Fig 3A). We denote Σ0 = (σ)1 ≤ as the covariance matrix for the individual microbiome profile and let Λ0 be a diagonal matrix with Then the statistic takes form where denotes the observed microbiome profile in ideal sewage, μ represents the population mean for the catchment area and is an estimator, is an estimator of Λ0 and for any vector This statistic is enlightened by the classical Hotelling T2 statistic [16] , where is the sample covariance matrix, an estimator of Σ0. Actually if we assume the covariance matrix is diagonal (no correlations between different taxa), then they are essentially the same statistic in view of . The reason we replace covariance matrix Σ0 by its diagonal Λ0 is because for high dimensional situations, it would be very difficult to estimate the covariance matrix. In cases when p > n, the sample covariance matrix is singular and thus is not even well defined. Studies accommodating the Hotelling T2 type statistic into the high-dimensional situation can be found, for example, in Bai and Saranadasa [17], Chen and Qi [18], Xu et al [19], etc. Our proposed statistic can handle the high dimensional cases as well, since the diagonal entities Λ0 can be well estimated even when p is large. And we extend its application beyond the problem of the significance of the multivariate means.

Fig 3

MicrobiomeCensus statistic definition, model training, validation, and application.

MicrobiomeCensus statistic definition, model training, validation, and application.

(A) Example of computing the T statistic. (B) Simulation results for T with different population sizes. Grey points are simulation results. Red bars are means of 10,000 repeats performed for each population size. (C) Model training and tuning. We built the MicrobiomeCensus model using our T statistic and a maximum likelihood procedure. The training set consisted of 10,000 samples for population sizes ranging from 1–300, and 50% of the data were used to train and validate the model. Training and validation errors from different feature subsets are shown. Training errors are shown as blue lines, and validation errors are shown as red lines. (D) Model performance on simulation benchmark. After training and validation, the model utilized the top 120 abundant features. Model performance was tested on synthetic data generated from 550 different subjects not previously seen by the model. The training set consisted of 10,000 samples with population sizes from 1–300, and the testing set consisted of 10,000 repeats at the evaluated population sizes. The training error, testing error, and the error of the final model are shown. (E) Model performance evaluated using a testing set. Black solid dots indicate the means of the predicted values, and error bars indicate the standard deviations of the predicted values. (F) Application of the microbiome population model in sewage. Seventy-six composite samples (blue) were taken from three manholes on the MIT campus, and each sample was taken over 3 hours during the morning peak water usage hours. Twenty-five snapshot samples (grey) were taken using a peristaltic pump for 5 minutes at 1-hour intervals throughout a day. In developing this new method, we utilize the variance change by population, but without any priori assumption about the gut bacterium species taxon abundance distributions and the covariance between species. Our analysis showed that the statistic T changed monotonically with increasing population size, indicating the promise of a population estimation model (Fig 3B). Leveraging our statistic T, we constructed an asymptotic maximum likelihood estimator to estimate size of the sample without the information of each individual, that is, we do not observe X1, …, X but only their mixture . Here, the parameter of interest is the population size n, the test statistic is T, and a point estimate is made by maximizing the estimated likelihood of T with respect to n. We performed training and validation using 50% of the human microbiome data and held out the rest of the data for testing. Our model achieved a training error as low as 13% (mean absolute percentage error, MAPE) when up to 250 features are included. The model’s training performance increased when more features were included, yet the validation error did not profoundly change with an increasing number of features (Fig 3C). Cross-validation on top 40, 80, 120, 150, 180, 210, and 240 abundant OTUs led to validation errors at 32.4%, 32.3%, 31.4%, 32.3%, 30.9%, 32.3%, and 33.1%, respectively. Upon training and validation, we chose the top 120 OTUs and tested the performance of the tuned model on a test set held out during training/validation. The model’s MAPE was 21% (Fig 3D and 3E, testing errors at each population size evaluated are provided in S3 Table), indicating that our model generalized well across different hosts. We then used all data and tuned hyperparameter to acquire a final model. When applying the final model on the same testing data, our model achieved a testing error of 16.2% (Fig 3D). It is worth noting that in this algorithm, for each size n, we only need to estimate the sampling distribution of the statistic T once. Hence it is not time-consuming regardless of the true population size. We also note that an RF regression model could not be trained in a reasonable time on the same dataset, even with high-performance computing (Methods). Our model performed remarkably better than a ten-fold cross-validated RF regression model utilizing a reduced dataset, which gave an MAPE of 32%, while the training time for our model was only a fraction of that of the RF regression model (S1 Fig).

MicrobiomeCensus detects human population size differences in sewage samples

With the newly developed population model, we set out to apply our model to sewage samples. Ideally, we would like to apply the model to samples generated from a fully controlled experiment with known human hosts contributing at a given time, yet such an experiment presents logistic challenges. Instead, we applied our model to sewage samples taken using one of two methods, either a snapshot (grab sample) sample taken from the sewage stream over 5 minutes, or an accumulative (composite sample) taken at a constant rate over 3 hours during morning peak human defecation [20] (S2 Fig). We hypothesized that the composite samples would represent more people than snapshot samples. Taking grab samples, we sampled at 1-hr intervals at one manhole (n = 25); using the accumulative method, we sampled at three campus buildings (classroom, dormitory, and family housing) multiple times over three months (n = 76). To remove sequences possibly contributed by the water, we applied a taxonomic filter to retain families associated with the gut microbiome and normalized the species abundance by the retained sequencing reads (Methods, S4 Table). We applied our final model to the sewage data set. Our model estimated 1–9 people’s waste was captured by the snapshot samples (mean = 3, s.d.=3), and 3–27 people were represented by the composite samples (mean = 9, s.d.=7), where the composite samples represented significantly more people (p < 0.0001) (Fig 3F). The hypothesis that composite samples represent more people is well supported by our model results.

Sub-species diversity in sewage samples reflects adding microbiomes from multiple people

Independent from our MicrobiomeCensus model, we found that certain human gut-associated species were frequently detected in sewage samples by using shotgun metagenomics, e.g., Bacteroides vulgatus, Provotella copri, and Eubacterium rectale. Further, their sub-species diversity, as indicated by nucleotide diversity and the number of polymorphic sites in housekeeping genes, was dramatically higher in sewage samples than in the gut microbiomes of individual human subjects (Fig 4A–4F and Text A in S1 Text).

Fig 4

Sub-species diversity in gut-associated bacterial species as a potential marker for human population size.

Sub-species diversity in gut-associated bacterial species as a potential marker for human population size.

(A-F) Comparison of sub-species diversity of gut-associated bacteria in human gut microbiome samples (LifelinesDeep) and MIT sewage samples. Nucleotide diversity and numbers of polymorphic sites were computed from ten phylogenetic marker genes. (G) and (H) Simulation results showing intra-species diversity in response to increasing population size, as represented by the number of polymorphic sites (G) and nucleotide diversity (H). To examine the effect of increasing population size on sub-species genetic variation in representative gut-associated microbial species, we simulated aggregate human gut samples using a sample without replacement procedure and computed the nucleotide diversity and numbers of polymorphic sites for the aggregate samples at different population sizes. This resulted in single nucleotide variant (SNV) profiles from 64 species. Our simulation showed increases in both nucleotide diversity and the number of polymorphic sites as more human gut samples were aggregated (Fig 4G and 4H). For instance, the nucleotide diversity and number of polymorphic sites in Eubacterium rectale increased from 0.029 (s.d. 0.026) to 0.149 (s.d. 0.002) and 64 (s.d. 54.33) to 1274 (s.d. 18.41), respectively, when the population size increased from 1 to 300. Further, the number of polymorphic sites strongly correlated with the population size (Pearson correlation coefficient >0.8) in 49 of the 64 species (S5 Table), suggesting the potential that the SNV profiles of a wide range of gut species could be developed into feature space for population size estimation. Our simulation further shows that the number of polymorphic sites increased with population size more slowly than nucleotide diversity, indicating its potential to reflect more subtle changes in population size (Fig 4G and 4H). Despite the need for further model developments, the analysis here shows the potential of the sub-species diversity of gut anaerobes as a feature space to be developed into a population size estimation model, independent from the taxon abundance-based model described here.

Discussion

The MicrobiomeCensus method we present here can, in theory, estimate the population size contributing to a sewage sample from the taxon abundance of multiple human gut microbiome taxa, using our T statistic and associated maximum likelihood estimation and application procedures. While the model is trained to perform accurate population estimation on a neighborhood scale, we expect the population range it can estimate to expand with increasing training gut microbiome data availability. We propose the MicrobiomeCensus model as a tool to drive further developments in quantitative sewage-based epidemiology. We have provided mathematical proof of the validity of our approach. MicrobiomeCensus showed excellent performance in our simulation benchmark. In particular, the study subjects that we utilized in the training and testing sets are random samples out of 1,100 men and women across a wide range of age without any stratification, hence the model’s testing performance indicates its generalizability. Our study is founded on the observations that healthy gut microbiomes are resilient, with inter-individual variability outweighing variability within individuals over time [21-23]. There are caveats to our approach; potentially, diets and regional effects on human microbiome composition could introduce noises to the prediction [24, 25]. In applications to sewage, future studies on water matrix effects should be performed to understand and further account for noises from the sewage collection network. It should be mentioned that while our model is trained on microbiome data, it is not limited to microbiome features because we did not impose any assumptions on the distribution of features. Other features, e.g., crAssphage titers, may also be incorporated once individual level data at a large population size become available to allow model validation. Utilizing sewage to understand population-level dynamics of human symbionts presents a new scenario of sampling meta-communities. The gut microbiomes of humans can be viewed as local communities, and gut microbiomes of people living in a neighborhood could be viewed as a kind of regional meta-communities, because these communities are linked by dispersal that can take place among people connected by social networks and through a shared built environment. The meta-community framework is considered to provide useful new conceptual tools to understand the largely unexplained inter-personal variability in gut microbiomes, with expansions of the theory to consider biotic interactions suggested by Miller, Svanbäck, and Bohannan [26]. In considering a sample of meta-communities, Leibold and Chase asked provocatively “what is a community?” and observed that the definition of a community is usually “user-defined and could be context-dependent” –“one community ecologist might explore the patterns of coexistence and species interactions among species within a delimited area, the other might ask the same question but define a community that encompasses more area and thus types of species, as well as different degrees of movements and heterogeneity patterns” [27]. The ambiguity between samples of meta-communities and local communities is particularly challenging for samples of microbial communities, because dispersal boundaries are difficult to delineate. Despite the conceptual importance, empirical methods that explicitly test whether a microbiome sample is a sample of a meta-community or a local community have not been available. MicrobiomeCensus directly distinguishes samples of meta-communities and local communities by enumerating the number of hosts contributing to a microbiome. While MicrobiomeCensus is trained on gut microbiome data, the procedure may have wide applications in other microbial ecosystems. There are several limitations of this study. First, our approach requires a sizable set of individual gut microbiome data to generate the empirical distributions employed in maximum likelihood estimation. Second, accurate estimations of population sizes will depend on good estimations of the population means and variances of relative abundances in individual gut microbiomes. Future studies generating data of individual-level gut microbiome of residents in specific cities, as well as model training based on those data, will help bridge the applications to sewage samples at the corresponding areas. Third, our current application procedures require minimal decays of gut-associated bacteria, thus are suitable for applications at catchments near residential buildings (e.g., sentinel monitoring). Better understandings of the growth/death dynamics of human gut-associated microbial taxa in the sewer environment and feature selection leveraging that information will help bridge the applications to larger catchments. In response to the COVID-19 pandemic now affecting the human population globally, sewage-based virus monitoring is underway [28]. Our analysis calls for attention to the denominator used in normalizing the biomarker measurements. While in practice, loading-based population proxies such as the copy numbers of pepper mild mottle viruses are used to normalize data generated from sewage, such proxies would likely have high error at the neighborhood level because of their variability in human fecal viromes (106-109 virions per gram of dry weight in fecal matters)[11], while they likely have reasonable performance when the population size is sufficiently large and the means of biomarker loadings converge under the Central Limit Theorem. Thus, the relationships between sewage measurements and true viral prevalence in small populations are hard to establish despite the need for sentinel population studies. Our model has immediate application in detecting false negatives, because it alerts us to the possibility that an absence of biomarkers might be caused by a sewage sample that under-represents a population. With further developments incorporating local training data, the model can potentially generate a denominator that can help turn biomarker measurements into estimates of prevalence and enable the application of epidemiology models at finer spatio-temporal resolutions.

Methods

Rationales

If the distribution of X is known, then the distribution T is known and one can easily use maximum likelihood estimator (MLE) to estimate the human population size. Here the human population size is essentially the sample size n from the wastewater sample. However for generality, we do not want to impose any specific distribution assumptions on taxon abundance distributions, thus, we need to rely on asymptotic results to estimate the distribution of our statistic. Unlike the univariate case where the asymptotic distribution of the statistic T can be simply derived by central limit theorem, we are dealing with a much more challenging situation due to high dimensionality. In the following, we shall firstly introduce some notations and assumptions that will be needed for the theorems. Then, in Theorem 1, we derive the Gaussian approximation for the test statistic T, which implies that we can use simulated Gaussian vectors to approximate the distribution of our statistic. To apply this approximation, one needs to further estimate the covariance matrix which is highly non-trivial due to high dimensionality. To get around this difficulty, we further propose a sub-sampling approach. Theorem 2 provides the theoretical foundation for this sub-sampling scheme.

Notation

Recall that vectors X1, X2, …, X are microbiome profiles from individuals 1, 2, …, n. Assume that they are independent and identically distributed (i.i.d) random vectors in with mean and variance . Denote Σ0 = (σ)1 ≤ and . Let the diagonal matrix Λ0 = diag(σ, 1 ≤ i ≤ p). To construct the Gaussian approximation, we shall firstly work with cases when both Λ0 and μ are given, that is, consider the statistic We will later extend all the results to cases when those parameters are unknown. For notation’s simplicity, consider the normalized version of X: Then where and the covariance matrix Σ of Y is the correlation matrix of X, with expression We need the following condition on Y for the main theorem. Assumption 1 Let s > 2. Assume where ‖ ⋅ ‖ is the matrix Frobenius norm. Remark 1 Above conditions naturally hold if Y1, 1 ≤ i ≤ p, are independent and Actually under this setting, Σ = I and thus ‖Σ‖ = p1/2. By Berkholder [29]’s inequality, where c1 > 0 is independent of p. This justifies K part in condition (3). Again by Berkholder [29]’s inequality, where c2 > 0 is independent of p. And thus D part in condition (3) holds. Now we are ready to introduce the first asymptotic result. The following theorem essentially states that under certain regularity conditions, the distribution of our test statistic can be approximated by the distribution of some function of a Gaussian vector. Theorem 1 (Theorem 2.2 in Xu et al. [19]) Assume Assumption 1 holds with some s > 2, also assume Then for Z ∼ N(0, Σ), we have If K and D in (3) are bounded, then we can get a crude bound O(n−) for (5). It is worth noticing that under settings in Remark 1, condition (4) holds. Based on the above theorem, if we can estimate the covariance matrix Σ, then can be generated and used for approximation of our statistic. However the estimation of Σ is difficult due to high dimensionality unless some additional conditions are imposed on the covariance matrix. To get around this difficulty, we shall consider a bootstrap approach. The main idea is that for each n, we randomly generate n samples from the reference data X, 1 ≤ i ≤ n0, and construct some generated statistic. Using the empirical distribution of the generated statistic to approximate the distribution of our statistic. In the following, we shall provide the theoretical justification for this procedure. For some integer J > 0, let A1, A2, …, A be i.i.d uniformly sampled from the class . Assume the sampling process are independent from our data (X). Then for each 1 ≤ k ≤ J, the set {X, i ∈ A} is of size n and can be used to construct one test statistic where After repeating this procedure J times, we would have J realizations of our test statistic which can be used to construct the empirical distribution : We shall later show that this empirical distribution can be adopted for approximating the distribution of the target statistic Following result can be derived based on Theorem 3.5 in Xu et al.[19] and Theorem 1. Theorem 2 Assume conditions in Theorem 1 hold, and moreover assume that n → ∞, n = o(n0) and (4) holds. Then for J → ∞, we have Theorem 2 implies that we can use the empirical distribution generated from our sub-samples to estimate the distribution of our target. As mentioned previously, so far, we are assuming that we know the value of Λ0 which is not realistic in applications. Therefore we need to further estimate Λ0, that is, we need to estimate the standard deviation for each coordinate. This can be easily accomplished by considering the estimator where X represents the jth coordinate of X, and Λ0, is the jth diagonal entity of Λ0 and n0 is the size of the reference data. Also one can use the average of the reference data to replace μ. If X has heavy tail, we can also consider a robust m-estimator for Σ0 and μ, see for example, Catoni [30].

Bootstrap procedure

Below we describe the bootstrap procedure we use to approximate the distribution of T for different census counts. Recall that X1, …, X represent arrays of taxon relative abundances in the gut microbiome of human subject 1, …, n, and T is defined in (Eq 2). Step 1. Estimate the population mean , and the diagonal matrix , from a reference sample human gut microbiome data. Step 2. For each census count n, generate from the reference data to compute . Step 3. Repeat Step 2 B times (herein 10,000 times) to get . Step 4. Obtain the density function of T based on using a Gaussian kernel. Step 5. Repeat Steps 2–4 for all the census counts n = 1, 2, …, N considered, herein N = 300. It should be noted that per Theorem 2 we require bootstrap sample size n much smaller than total reference sample size n0, thus up to 300-person samples were simulated here because the gut microbiome reference dataset we utilized consisted of a total of 1,100 people. The range can be expanded if a larger dataset is available.

Maximum likelihood estimation

We use a maximum likelihood estimation (MLE) procedure to achieve point estimates of the population size from a new mixture sample, W. The MLE procedure firstly computes T0 by replacing the sample average by W, that is And then computes the likelihoods that T0 was drawn from population sizes from 1 to N, respectively, using the estimated distributions generated from the bootstrap procedures described above. Next, the population size that yields the highest likelihood is chosen.

Confidence interval

Due to Theorem 1, the asymptotic distribution of nT is the same as and is therefore independent of the parameter n. Hence for any confidence level 1 − α, we can firstly estimate the 1 − α quantile of nT based on the simulated data described above, denoted as Then for any new mixture W and the corresponding T0 as in (8), our 1 − α confidence interval is

Model training, validation, and testing

We synthesized a mixed data set from a gut microbiome dataset of 1,135 healthy human hosts from the Lifelines Deep study [15], which was the largest single-center study of population-level human microbiome variations from a single sequencing center at the time of this study. The data set consisted of 661 women and 474 men. We considered OTUs defined by 99% similarity of partial ribosomal RNA gene sequences (Methods of OTU clustering are described in detail in Text B in S1 Text). After quality filtering, we retained 1,100 samples that had more than 4,000 sequencing reads/sample. We split the entire dataset approximately in half, using 550 subjects to generate the training/validation set and the other 550 subjects to generate the test set. We then used the aforementioned ideal sewage mixture approach to generate synthetic populations of up to 300 individuals, which is the relevant range for population estimation in upstream sewage. The training error was computed using the entire training data set. Five repeated holdout validations using a 50–50 split in the training set were performed to tune the hyperparameter for feature selection. The training and cross-validation errors were evaluated at integers from 1 to 100, using the error definition: and the model’s performance across all the population sizes was characterized by the mean absolute percentage error (MAPE): We chose to use MAPE because for a problem of population estimation, the error relative to the true value is important to consider for performance evaluation. After training and validation, the hyperparameter (in this case, the top k abundant OTUs) that yielded the best performance in the validation step was used in the model. The tuned model was then tested on the test set. Our synthetic sewage microbiome approach captured the actual microbiome variation among individual hosts and demonstrated the model’s generalizability.

Human gut microbiome 16S rRNA amplicon data source

The single-person and multi-person microbiome data were drawn from a gut microbiome dataset of 1,135 healthy human hosts from the Lifelines Deep study [15], which was the largest study of population-level human microbiome variations from a single sequencing center at the time of this study. The data set consisted of 661 women and 474 men. We considered operational taxonomic units defined by 99% similarity of partial ribosomal RNA gene sequences. After quality filtering, we retained 1,100 samples that had more than 4,000 sequencing reads/sample. The rarefaction depth was chosen to balance sample size and sequencing depth.

16S rRNA gene amplicon sequencing data analysis

Operational taxonomic units defined at 99% sequencing similarity were generated from the combined dataset by first denoising the samples with DADA2 [31], and then clustering the outputted exact sequence variants with the q2-vsearch plugin of QIIME2 [32]. Taxonomic assignments were performed using a multinomial naïve Bayes classifier against SILVA 132 [33, 34]. All 16S rRNA gene amplicon analyses were performed in the QIIME2 platform (QIIME 2019.10) [35].

Species abundance distribution

We examined the relationships between the performances of several widely used SAD models and the number of contributors (population size) to a multi-person microbiome. Multi-person microbiomes were generated by sampling N individuals from the quality-filtered gut microbiome 16S rRNA dataset and summing the abundances of the same taxa. At each population size, 10,000 repeats were performed. The repeats were chosen according to the constraints of computational efficiency. The SADs evaluated included the Lognormal, Poisson Lognormal, Broken-stick, Log series and the Zipf model, which were shown to have varied successes in predicting microbial SADs [14]. We examined the fit using a rank-by-rank approach as previously described by Shoemaker et al. [14]. First, maximum-likelihood coefficients for each of the SADs described above were estimated using the R package sads [36]. Next, SADs were predicted using each model, and tabulated as RADs. Then, we used a least-squares regression to assess the relationship between the performance of the predicted SADs against the observations and recorded the coefficient of determination (R-squared). Last, R-squared values from model fits of each SAD model were summarized as the means, and the models that resulted in the highest R-squared values for each simulated community were recorded.

Field data

We conducted a field sampling campaign, collecting sewage samples daily at manholes near three buildings (two dormitory buildings and one office building) on the campus of Massachusetts Institute of Technology. Seventy-six sewage samples were collected through a continuous peristaltic pump sampler operated at the morning peak (7–10 a.m. near the dormitory buildings and 8–11 a.m. near the office building) at 4 mL/min for 3 hours. This composite sampling approach was intended to capture the morning water usage peak. Wastewater was filtered through sterile 0.22-µm mixed cellulose filters to collect microbial biomass. Environmental DNA was extracted with a Qiagen PowerSoil DNA extraction kit according to the manufacturer’s protocol. The DNA was amplified for the V4 region of the 16S rRNA gene and sequenced in a Miseq paired-end format at the MIT BioMicro Center, according to a previously published protocol [37]. Included as a comparison are a set of snapshot sewage samples taken using a peristaltic pump sampler at 100 mL/min for 5 minutes over a day (10 a.m. on Wednesday April 8, 2015, to 9 a.m. on Thursday April 9, 2015). The sampling methods for snapshot samples are described in detail by Matus et al. [4]

Application to sewage data

The 16S rRNA gene amplicon sequencing data from the field sewage samples were trimmed to the same region, 16S V4 (534–786) with the LifeLines Deep data using Cutadapt 1.12 [38]. Forward reads were trimmed to 175bps, and reverse reads were first trimmed to 175bps and then further trimmed to 155bps during quality screening. We created a taxonomic filter based on the composition of the gut microbiome data set, which consisted of the abundant family-level taxa that accounted for 99% of the sequencing reads in the human gut microbiome data set, and excluded those that might have an ecological niche in tap water (Enterobacteriaceae and Burkholderiaceae). This exclusion resulted in 25 bacterial families and one archaeal family in our taxonomic filter, including Lachnospiraceae, Ruminococcaceae, Bifidobacteriaceae, Erysipelotrichaceae, Bacteroidaceae, and others (S4 Table). We applied our taxonomic filter to the sewage sequencing data, which retained 73.9% of the sequencing reads. This retention rate is consistent with our previous report of the human microbiome fraction in residential sewage samples [4]. We then normalized the relative abundance of taxa against the remaining sequencing reads in each sample. Welch’s two-sample t-tests were performed to retain the OTUs whose means did not differ significantly from the human microbiome data set (p > 0.05).

Deployment of generic machine learning models

Logistic regression, support vector machine, and random forest classifiers were employed to perform the classification task for population sizes of 1, 10, and 100. Model training, cross-validation, and testing were performed using the R Caret platform with the default setting [39]. For the support vector machine, the radial basis function kernel was employed. Ten-fold cross-validation and five repeats were performed for all the models considered. Model performance was evaluated using accuracy, sensitivity, and specificity. Based on the classifier performance, the RF regression model was used for comparison with our new model’s performance. Initially, we trained the model using the same training data set used in training our maximum likelihood model, however, the computation was infeasible, even with a 36-thread, 3TB-memory computing cluster. We then introduced gaps in the population size range, using populations from the vector (1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 150, 180, 240, 300)⊤ while maintaining the same sample size at each population size (10,000 samples). The training was performed in R Caret, using 10-fold cross-validation. Ten variables were randomly sampled as candidates at each split, mtry = 10. The performance was evaluated using the same testing set that was used to evaluate the maximum likelihood model.

Performance of a cross-validated Random Forest regression model utilizing microbiome features to predict population size.

Black solid dots indicate the means of the predicted values, and error bars indicate the standard deviations of the predicted values. (TIFF) Click here for additional data file.

Sewage sampling schematics.

Sewage samples were collected from a manhole in upstream areas close to the sources. Snapshot samples were collected at a high pump speed for a short sampling time. Composite samples were collected at a low pump speed and for a long sampling time. Sewage flow is simplified to illustrate the sampling. Color bars indicate different sections of the flow, coming from different contributors. (TIFF) Click here for additional data file.

Bacteroides vulgatus sub-species diversity in human gut microbiome samples (LifelinesDeep) and MIT sewage samples, as revealed by ten phylogenetic marker genes.

(TIFF) Click here for additional data file.

Eubacterium rectale sub-species diversity in human gut microbiome samples (LifelinesDeep) and MIT sewage samples, as revealed by ten phylogenetic marker genes.

(TIFF) Click here for additional data file.

Prevotella copri sub-species diversity in human gut microbiome samples (LifelinesDeep) and MIT sewage samples, as revealed by ten phylogenetic marker genes.

(TIFF) Click here for additional data file.

Supplementary Texts.

(DOCX) Click here for additional data file.

Comparison of the performance of SAD models for single-person and multi-person gut microbial communities.