| Literature DB >> 19360128 |
James Robert White1, Niranjan Nagarajan, Mihai Pop.
Abstract
Numerous studies are currently underway to characterize the microbial communities inhabiting our world. These studies aim to dramatically expand our understanding of the microbial biosphere and, more importantly, hope to reveal the secrets of the complex symbiotic relationship between us and our commensal bacterial microflora. An important prerequisite for such discoveries are computational tools that are able to rapidly and accurately compare large datasets generated from complex bacterial communities to identify features that distinguish them.We present a statistical method for comparing clinical metagenomic samples from two treatment populations on the basis of count data (e.g. as obtained through sequencing) to detect differentially abundant features. Our method, Metastats, employs the false discovery rate to improve specificity in high-complexity environments, and separately handles sparsely-sampled features using Fisher's exact test. Under a variety of simulations, we show that Metastats performs well compared to previously used methods, and significantly outperforms other methods for features with sparse counts. We demonstrate the utility of our method on several datasets including a 16S rRNA survey of obese and lean human gut microbiomes, COG functional profiles of infant and mature gut microbiomes, and bacterial and viral metabolic subsystem data inferred from random sequencing of 85 metagenomes. The application of our method to the obesity dataset reveals differences between obese and lean subjects not reported in the original study. For the COG and subsystem datasets, we provide the first statistically rigorous assessment of the differences between these populations. The methods described in this paper are the first to address clinical metagenomic datasets comprising samples from multiple subjects. Our methods are robust across datasets of varied complexity and sampling level. While designed for metagenomic applications, our software can also be applied to digital gene expression studies (e.g. SAGE). A web server implementation of our methods and freely available source code can be found at http://metastats.cbcb.umd.edu/.Entities:
Mesh:
Substances:
Year: 2009 PMID: 19360128 PMCID: PMC2661018 DOI: 10.1371/journal.pcbi.1000352
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1Format of the feature abundance matrix.
Each row represents a specific taxon, while each column represents a subject or replicate. The frequency of the i th feature in the j th subject (c(i,j)) is recorded in the corresponding cell of the matrix. If there are g subjects in the first population, they are represented by the first g columns of the matrix, while the remaining columns represent subjects from the second population.
Figure 2Detecting differential abundance for sparse features.
A 2×2 contingency table is used in Fisher's exact test for differential abundance between rare features. f is the number of observations of feature i in all individuals from treatment 1. f is the number of observations that are not feature i in all individuals from treatment 1. f and f are similarly defined for treatment 2.
Figure 3Dispersion estimates (φ) for three metagenomic datasets used in this study.
These plots compare dispersion values between (A) obese and lean human gut taxonomic data, (B) infant and mature human gut COG assignments, and (C) microbial and viral subsystem annotations. We find a wide range of possible dispersions in this data and significant differences in dispersions between two populations.
Figure 4ROC curves comparing statistical methods in a simulation study.
Sequences were selected from a beta-binomial distribution with variable dispersions and group mean proportions p and p. For each set of parameters, we simulated 1000 trials, 500 of which are generated under the null hypothesis (p = p), and the remainder are differentially abundant where a*p = p. For example, p = 0.2 and a = 2 indicates features comprising 20% of the population that differ two-fold in abundance between two populations of interest. Parameter values for p and a are shown above each plot.
Figure 5ROC curves comparing statistical methods in a simulation study for extreme sparse sampling.
Sequences were selected from a beta-binomial distribution with variable dispersions and group mean proportions p and p. For each set of parameters, we simulated 1000 trials, 500 of which are generated under the null hypothesis (p = p), and the remainder are differentially abundant where a*p = p. For example, p = 0.2 and a = 2 indicates features comprising 20% of the population that differ two-fold in abundance between two populations of interest. Parameter values for p and a are shown above each plot.
Comparison of false positives found by different methodologies. Using real metagenomic data, we simulated features with no differential abundance by randomly dividing subjects from a single population into two subpopulations.
| P≤ | Number of False Positives | |||
| Metastats | Student-t | Log-t | NB | |
| 0.001 | 7 | 4 | 4 | 109 |
| 0.005 | 25 | 25 | 24 | 121 |
| 0.01 | 51 | 52 | 43 | 133 |
We found that for a stringent p-value threshold of 0.001, the negative binomial model (NB) resulted in a false positive rate 20 times higher than the other methodologies. The Log-t of Lu et al. resulted in the lowest false positive rate among the methods tested while Student's test and Metastats performed equally well.
Differentially abundant taxa between lean and obese human gut microflora.
| Taxon | Obese abundance(%) | Lean abundance(%) | Metastats p-value | Student-t p-value | Log-t p-value | NB p-value |
|
| ||||||
| Bacteroidetes | 2.902±1.067 | 25.652±4.576 | 0.0002 | 0.0000 | 0.0004 | 0.0000 |
| Firmicutes | 89.318±2.223 | 72.833±4.812 | 0.0028 | 0.0025 | 0.0030 | 0.8701 |
| Actinobacteria | 4.490±1.345 | 0.447±0.179 | 0.0037 | 0.0371 | 0.0004 | 0.0773 |
|
| ||||||
| Bacteroidetes (class) | 2.722±1.065 | 25.652±4.576 | 0.0001 | 0.0000 | 0.0005 | 0.0001 |
| Actinobacteria (class) | 4.490±1.345 | 0.447±0.179 | 0.0024 | 0.0371 | 0.0004 | 0.1858 |
| Clostridia | 84.633±2.388 | 66.907±5.799 | 0.0036 | 0.0042 | 0.0052 | 0.9797 |
|
| ||||||
|
| 2.380±0.383 | 0.666±0.337 | 0.0014 | 0.0077 | 0.0067 | 0.4860 |
|
| 0.000±0 | 0.115±0.115 | 0.0016 | 0.1986 | 0.9963 | 0.0166 |
|
| 26.276±4.454 | 10.707±2.094 | 0.0023 | 0.0207 | 0.0039 | 0.6639 |
|
| 0.010±0.010 | 0.138±0.138 | 0.0024 | 0.2353 | 0.0467 | 0.0011 |
|
| 3.565±1.187 | 0.154±0.154 | 0.0052 | 0.0451 | 0.0046 | 0.6545 |
|
| 1.841±0.963 | 14.623±4.444 | 0.0056 | 0.0023 | 0.0105 | 0.0012 |
|
| 0.000±0 | 0.093±0.069 | 0.0059 | 0.0896 | 0.9963 | 0.0000 |
|
| 0.461±0.051 | 0.151±0.102 | 0.0065 | 0.0072 | 0.0304 | 0.0487 |
|
| 0.031±0.031 | 0.145±0.145 | 0.0073 | 0.3390 | 0.2315 | 0.0156 |
For the phylum, class, and genus levels (mean percentage±s.e., p-value≤0.01) we successfully re-established the major result of Ley et al., and uncovered a new difference within Actinobacteria. Both Firmicutes and Actinobacteria have significantly higher relative abundances in obese people, while Bacteroidetes make up a higher proportion of the gut microbiota in the lean population. Results reveal Clostridia as the primary component of the differential abundance observed within Firmicutes, and Bacteroidetes and Actinobacteria classes explain the differential abundance observed within the corresponding phyla. Using this p-value threshold, we expect less than one false positive among these results. The last four columns display the computed p-values for different statistical methods, including Metastats and the overdispersion methods of Lu et al. (Log-t) and Robinson and Smyth (NB). These results reveal NB and Student's t-test to be overly-conservative.
COGs determined to be differentially abundant between infant and mature human gut microbiomes using a q-value threshold of 0.05.
| COG id | Description | Mature | Infant | Metastat q-value | ||
| Mean abundance (%) | stderr | Mean abundance (%) | stderr | |||
| COG0205 | 6-phosphofructokinase | 0.0017 | 0.0001 | 0.0006 | 0.0002 | 0.0313 |
| COG0358 | DNA primase (bacterial type) | 0.0024 | 0.0001 | 0.0008 | 0.0001 | 0.0072 |
| COG0507 | ATP-dependent exoDNAse (exonuclease V), alpha subunit - helicase superfamily I member | 0.0016 | 0.0001 | 0.0008 | 0.0001 | 0.0349 |
| COG0526 | Thiol-disulfide isomerase and thioredoxins | 0.0028 | 0.0002 | 0.0014 | 0.0002 | 0.0371 |
| COG0621 | 2-methylthioadenine synthetase | 0.0017 | 0.0001 | 0.0008 | 0.0002 | 0.0450 |
| COG0642 | Signal transduction histidine kinase | 0.0132 | 0.0009 | 0.0070 | 0.0004 | 0.0270 |
| COG0667 | Predicted oxidoreductases (related to aryl-alcohol dehydrogenases) | 0.0012 | 0.0001 | 0.0021 | 0.0001 | 0.0282 |
| COG0739 | Membrane proteins related to metalloendopeptidases | 0.0024 | 0.0001 | 0.0006 | 0.0001 | 0.0072 |
| COG0745 | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain | 0.0076 | 0.0003 | 0.0051 | 0.0004 | 0.0352 |
| COG0747 | ABC-type dipeptide transport system, periplasmic component | 0.0011 | 0.0001 | 0.0027 | 0.0003 | 0.0352 |
| COG1113 | Gamma-aminobutyrate permease and related permeases | 0.0002 | 0.0001 | 0.0018 | 0.0003 | 0.0349 |
| COG1129 | ABC-type sugar transport system, ATPase component | 0.0013 | 0.0001 | 0.0028 | 0.0003 | 0.0492 |
| COG1145 | Ferredoxin | 0.0017 | 0.0001 | 0.0005 | 0.0002 | 0.0217 |
| COG1196 | Chromosome segregation ATPases | 0.0017 | 0.0001 | 0.0007 | 0.0001 | 0.0108 |
| COG1249 | Pyruvate/2-oxoglutarate dehydrogenase complex, dihydrolipoamide dehydrogenase (E3) component, and related enzymes | 0.0006 | 0.0001 | 0.0011 | 0.0001 | 0.0349 |
| COG1263 | Phosphotransferase system IIC components, glucose/maltose/N-acetylglucosamine-specific | 0.0012 | 0.0001 | 0.0031 | 0.0003 | 0.0313 |
| COG1475 | Predicted transcriptional regulators | 0.0025 | 0.0002 | 0.0014 | 0.0002 | 0.0435 |
| COG1595 | DNA-directed RNA polymerase specialized sigma subunit, sigma24 homolog | 0.0053 | 0.0004 | 0.0013 | 0.0003 | 0.0206 |
| COG1609 | Transcriptional regulators | 0.0030 | 0.0002 | 0.0092 | 0.0013 | 0.0424 |
| COG1629 | Outer membrane receptor proteins, mostly Fe transport | 0.0120 | 0.0016 | 0.0013 | 0.0007 | 0.0313 |
| COG1762 | Phosphotransferase system mannitol/fructose-specific IIA domain (Ntr-type) | 0.0004 | 0.0001 | 0.0017 | 0.0002 | 0.0293 |
| COG1961 | Site-specific recombinases, DNA invertase Pin homologs | 0.0059 | 0.0004 | 0.0018 | 0.0006 | 0.0345 |
| COG2204 | Response regulator containing CheY-like receiver, AAA-type ATPase, and DNA-binding domains | 0.0019 | 0.0002 | 0.0005 | 0.0002 | 0.0421 |
| COG2244 | Membrane protein involved in the export of O-antigen and teichoic acid | 0.0019 | 0.0001 | 0.0009 | 0.0001 | 0.0229 |
| COG2376 | Dihydroxyacetone kinase | 0.0002 | 0 | 0.0009 | 0.0001 | 0.0278 |
| COG2440 | Ferredoxin-like protein | 0 | 0 | 0.0002 | 0 | 0.0394 |
| COG2893 | Phosphotransferase system, mannose/fructose-specific component IIA | 0.0003 | 0.0001 | 0.0011 | 0.0001 | 0.0352 |
| COG3250 | Beta-galactosidase/beta-glucuronidase | 0.0056 | 0.0004 | 0.0023 | 0.0006 | 0.0435 |
| COG3451 | Type IV secretory pathway, VirB4 components | 0.0033 | 0.0001 | 0.0009 | 0.0003 | 0.0157 |
| COG3505 | Type IV secretory pathway, VirD4 components | 0.0029 | 0.0001 | 0.0010 | 0.0003 | 0.0278 |
| COG3525 | N-acetyl-beta-hexosaminidase | 0.0016 | 0.0002 | 0.0004 | 0.0001 | 0.0352 |
| COG3537 | Putative alpha-1,2-mannosidase | 0.0020 | 0.0003 | 0.0002 | 0.0002 | 0.0352 |
| COG3711 | Transcriptional antiterminator | 0.0004 | 0.0001 | 0.0020 | 0.0003 | 0.0339 |
| COG3712 | Fe2+-dicitrate sensor, membrane component | 0.0023 | 0.0003 | 0 | 0 | 0.0280 |
| COG4206 | Outer membrane cobalamin receptor protein | 0.0021 | 0.0003 | 0.0003 | 0.0001 | 0.0313 |
| COG4771 | Outer membrane receptor for ferrienterochelin and colicins | 0.0039 | 0.0005 | 0.0006 | 0.0003 | 0.0366 |
Using this threshold we expect less than 10 false positives in this data-set. The table presents the 25 most abundant COGs from the mature and infant microbiomes, sorted by their abundance level in the mature population. Full results are available as supplementary material.
Differentially abundant metabolic subsystems between microbial and viral metagenomes (mean percentage±s.e., p-values≤0.05).
| Subsystem | microbial | viral | Metastats p value |
| Carbohydrates | 17.01±0.77 | 12.87±0.82 | 0.001 |
| Amino Acids and Derivatives | 9.29±0.46 | 7.58±0.55 | 0.019 |
| Respiration | 8.24±1.34 | 3.89±0.46 | 0.001 |
| Photosynthesis | 7.13±2.38 | 1.16±0.36 | 0.017 |
| Cofactors, Vitamins, and Pigments | 5.54±0.27 | 6.44±0.26 | 0.022 |
| Experimental Subsystems | 4.88±0.31 | 5.80±0.36 | 0.050 |
| DNA Metabolism | 3.99±0.24 | 9.18±1.06 | 0.001 |
| Cell Wall and Capsule | 3.73±0.27 | 5.64±0.71 | 0.009 |
| RNA Metabolism | 3.65±0.21 | 5.23±0.71 | 0.033 |
| Nucleosides and Nucleotides | 3.38±0.18 | 7.72±0.74 | 0.001 |
| Membrane Transport | 2.04±0.11 | 1.30±0.15 | 0.001 |
| Nitrogen Metabolism | 1.47±0.08 | 0.93±0.10 | 0.001 |
| Fatty Acids and Lipids | 1.46±0.07 | 1.05±0.11 | 0.004 |
Using this threshold we expect less than one false positive in the data-set. We find that viral metagenomes are significantly enriched for nucleotides and nucleosides and DNA metabolism, consistent with the viruses' need for self-sufficiency. Processes for respiration, photosynthesis, and carbohydrates are overrepresented in microbial metagenomes.