Li C Xia, Jacob A Cram, Ting Chen, Jed A Fuhrman, Fengzhu Sun.
Abstract
Accurate estimation of microbial community composition based on metagenomic sequencing data is fundamental for subsequent metagenomics analysis. Prevalent estimation methods mainly summarize alignment results directly, or use variants of that approach, and often yield biased and/or unstable estimates. We have developed a unified probabilistic framework (named GRAMMy) by explicitly modeling read assignment ambiguities, genome size biases and read distributions along the genomes. A maximum likelihood method is employed to compute the Genome Relative Abundance of microbial communities using Mixture Model theory (GRAMMy). GRAMMy has been demonstrated to give estimates that are accurate and robust across both simulated and real read benchmark datasets. We applied GRAMMy to a collection of 34 metagenomic read sets from four metagenomics projects and identified 99 frequent species (minimally 0.5% abundant in at least 50% of the datasets) in the human gut samples. By providing a new reference-based strategy for metagenomic sample comparisons, our results show substantial improvements over previous studies, such as correcting the over-estimated abundance of Bacteroides species in human gut samples. GRAMMy can be used flexibly with many read assignment tools (mapping, alignment or composition-based), even with low-sensitivity mapping results from huge short-read datasets. It will be increasingly useful as an accurate and robust tool for abundance estimation with the growing size of read sets and the expanding database of reference genomes.
Year: 2011 PMID: 22162995 PMCID: PMC3232206 DOI: 10.1371/journal.pone.0027992
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1: The GRAMMy model.
A schematic diagram of the finite mixture model that underlies the GRAMMy framework for shotgun metagenomics. In the figure, ‘iid’ stands for “independent identically distributed”.
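The mixture-model idea can be illustrated with a small expectation-maximization (EM) sketch. It is not the GRAMMy implementation: it assumes reads are drawn uniformly along each genome, so that a read hitting genome j has probability 1/length_j under that genome, and it takes read-to-genome hits as given. The function name `estimate_gra` and its arguments are hypothetical.

```python
def estimate_gra(read_hits, genome_lengths, n_iter=200, tol=1e-10):
    """Toy EM for genome relative abundance (illustrative assumption, not GRAMMy's code).

    read_hits: list of sets; each set holds the genome indices a read maps to.
    genome_lengths: list of genome lengths in bp.
    """
    n_genomes = len(genome_lengths)
    reads = [hits for hits in read_hits if hits]  # unmapped reads are ignored
    if not reads:
        raise ValueError("no mapped reads")

    # Mixing proportions: fraction of reads generated by each genome.
    pi = [1.0 / n_genomes] * n_genomes
    for _ in range(n_iter):
        counts = [0.0] * n_genomes
        # E-step: split each read among its candidate genomes.
        for hits in reads:
            weights = {j: pi[j] / genome_lengths[j] for j in hits}
            total = sum(weights.values())
            for j, w in weights.items():
                counts[j] += w / total
        # M-step: update mixing proportions from expected read counts.
        new_pi = [c / len(reads) for c in counts]
        delta = max(abs(a - b) for a, b in zip(pi, new_pi))
        pi = new_pi
        if delta < tol:
            break

    # Correct for genome size: longer genomes yield more reads per cell.
    per_length = [p / l for p, l in zip(pi, genome_lengths)]
    norm = sum(per_length)
    return [x / norm for x in per_length]
```

In this sketch, ambiguous reads are shared among candidate genomes in proportion to the current estimates, and the final division by genome length converts read proportions into relative abundances of genomes.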
Figure 2: The GRAMMy flowchart.
A typical GRAMMy analysis pipeline, employing ‘map’ and ‘k-mer’ assignment.
Comparison of estimation accuracy.
| Method | simLC RRMSE | simLC AVGRE | simMC RRMSE | simMC AVGRE | simHC RRMSE | simHC AVGRE |
| GRAMMy | 20.0% | 14.0% | 25.6% | 19.7% | 21.6% | 14.7% |
| MEGAN | 48.6% | 39.3% | 50.0% | 40.6% | 50.2% | 40.8% |
| GAAS | 433.8% | 152.5% | 171.4% | 111.6% | 507.9% | 165.8% |
Table 1: Comparison of estimation accuracy. A summary of Relative Root Mean Square Error (RRMSE) and Average Relative Error (AVGRE) measured from MEGAN-based, GAAS and GRAMMy (‘map’) estimates of simLC, simMC and simHC subsets of the FAMeS data. GRAMMy (‘map’) has the lowest error rate for both error measures across all the subsets.
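The two error measures in Table 1 can be sketched as follows. The formulas are assumptions based on the names of the measures: RRMSE is taken as the root mean square of per-genome relative errors, and AVGRE as the mean of their absolute values. The function names are hypothetical.

```python
import math


def relative_errors(estimated, true):
    """Per-genome relative error (estimate - truth) / truth."""
    return [(e - t) / t for e, t in zip(estimated, true)]


def rrmse(estimated, true):
    """Relative root mean square error (assumed definition)."""
    errs = relative_errors(estimated, true)
    return math.sqrt(sum(x * x for x in errs) / len(errs))


def avgre(estimated, true):
    """Average relative error (assumed definition)."""
    errs = relative_errors(estimated, true)
    return sum(abs(x) for x in errs) / len(errs)
```

Because RRMSE squares each relative error, a few badly estimated genomes inflate it more than they inflate AVGRE, which is consistent with RRMSE exceeding AVGRE in every column of the table.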
Summary statistics for the metagenomic datasets.
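| Data (# sets) | Mapped rate (%) Med. | Mapped rate (%) Min. | Mapped rate (%) Max. | Ambiguity rate (%) Med. | Ambiguity rate (%) Min. | Ambiguity rate (%) Max. | Average genome length (bp) Med. | Average genome length (bp) Min. | Average genome length (bp) Max. |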
| hg_HGS(2) | 46.65 | 43.15 | 50.15 | 31.65 | 30.32 | 32.98 | 2890092 | 2660792 | 3119393 |
| jhg_HGS(13) | 59.61 | 35.99 | 76.92 | 45.11 | 22.53 | 65.71 | 3745629 | 2268438 | 5657331 |
| uhg_HGS(18) | 52.35 | 37.49 | 72.51 | 35.90 | 21.65 | 59.81 | 3619072 | 3047940 | 4752910 |
| amd_AMD(1) | 45.64 | 45.64 | 45.64 | 1.48 | 1.48 | 1.48 | 2163584 | 2163584 | 2163584 |
Table 2: Summary statistics for the metagenomic datasets. Median (Med.), minimum (Min.) and maximum (Max.) of mapped rate, ambiguity rate and estimated average genome length are shown for the samples: two from U.S. adult human gut (‘hg’), 13 from Japanese human gut (‘jhg’), 18 from U.S. twin families human gut (‘uhg’) and one from acid mine drainage (‘amd’). Two reference genome sets, ‘HGS’ and ‘AMD’, were used for the human gut samples (‘hg’, ‘jhg’, ‘uhg’) and the acid mine drainage sample (‘amd’), respectively.
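A minimal sketch of how the three per-sample statistics in Table 2 might be computed. The definitions are assumptions, not taken from the source: mapped rate is the share of reads with at least one hit, ambiguity rate is the share of mapped reads hitting more than one genome, and average genome length is the abundance-weighted mean of reference genome lengths. `summarize_mapping` is a hypothetical helper.

```python
def summarize_mapping(read_hits, abundances, genome_lengths):
    """Return (mapped rate %, ambiguity rate %, average genome length in bp).

    read_hits: list of sets of genome indices hit by each read (empty set = unmapped).
    abundances: estimated genome relative abundances, summing to 1.
    genome_lengths: reference genome lengths in bp.
    """
    n_reads = len(read_hits)
    mapped = [hits for hits in read_hits if len(hits) >= 1]
    ambiguous = [hits for hits in mapped if len(hits) > 1]

    mapped_rate = 100.0 * len(mapped) / n_reads
    # Assumed denominator: mapped reads only, not all reads.
    ambiguity_rate = 100.0 * len(ambiguous) / len(mapped) if mapped else 0.0
    # Abundance-weighted mean genome length.
    avg_length = sum(a * l for a, l in zip(abundances, genome_lengths))
    return mapped_rate, ambiguity_rate, avg_length
```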
Figure 3: Frequent species for human gut metagenomes.
The 99 species occurring in at least 50% of the 33 human gut samples with a minimum relative abundance of 0.05% were selected. ‘gut_HGS_90’ indicates that the human gut (‘gut’) read sets were mapped to the reference genome set (‘HGS’) with an identity rate cut-off at 90% (‘90’).
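The selection rule in this caption amounts to a simple prevalence filter. The sketch below assumes the abundances are held in a dictionary keyed by species, with one value per sample. The threshold is left as a parameter, and the species names in the usage example are placeholders rather than data from the study.

```python
def frequent_species(abundance, min_abundance, min_fraction=0.5):
    """Select species reaching min_abundance in at least min_fraction of samples.

    abundance: dict mapping species name -> list of relative abundances,
               one entry per sample (hypothetical layout).
    """
    selected = []
    for species, values in abundance.items():
        n_present = sum(1 for v in values if v >= min_abundance)
        if n_present >= min_fraction * len(values):
            selected.append(species)
    return sorted(selected)


# Usage with made-up numbers: four samples, threshold given as a fraction.
example = {
    "species_A": [0.10, 0.20, 0.00, 0.05],
    "species_B": [0.00, 0.00, 0.30, 0.00],
}
print(frequent_species(example, min_abundance=0.01))
```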
Figure 4: Heatmap biclustering of human gut metagenomes.
‘gut_HGS_90’ indicates that the human gut (‘gut’) read sets were mapped to the reference genome set (‘HGS’) with an identity rate cut-off at 90% (‘90’). The bottom labels indicate human gut samples. The top right legend shows the color-coding for columns indicating the sample age category and dataset origin. The bottom right legend shows color-coding for rows indicating the top 4 most abundant phyla in human gut. The relative abundance for each sample is normalized by a rank transformation.
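The rank transformation mentioned in this caption could look like the sketch below. It assumes ranks are assigned within one sample across species, in ascending order of abundance, with tied values sharing their average rank. The source does not specify tie handling, so that choice and the name `rank_transform` are assumptions.

```python
def rank_transform(values):
    """Replace each value by its 1-based ascending rank; ties get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the current run of tied values.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks
```

Ranking removes the effect of a few dominant species on the color scale, so that low-abundance species remain visible in a heatmap.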
Figure 5: GRAMMy estimates of GRAs for the acid mine drainage data.
Estimated relative abundance for each strain is shown as a percentage. The first two strains dominate the sample.
Figure 6: Running time comparison.
GRAMMy has the shortest processing time in all cases compared with MEGAN and GAAS. The BLAT mapping time is excluded for all compared tools.