Martin S Lindner, Bernhard Y Renard.
Abstract
One goal of sequencing-based metagenomic community analysis is the quantitative taxonomic assessment of microbial community compositions. In particular, relative quantification of taxa is highly relevant for metagenomic diagnostics or microbial community comparison. However, the majority of existing approaches quantify at low resolution (e.g. at phylum level), rely on the existence of special genes (e.g. 16S), or have severe problems discerning species with highly similar genome sequences. Yet problems such as metagenomic diagnostics require accurate quantification at the species level. We developed Genome Abundance Similarity Correction (GASiC), a method to estimate true genome abundances via read alignment by considering reference genome similarities in a non-negative LASSO approach. We demonstrate GASiC's superior performance over existing methods on simulated benchmark data as well as on real data. In addition, we present applications to datasets of both bacterial DNA and viral RNA origin. We further discuss our approach as an alternative to PCR-based DNA quantification.
Year: 2012 PMID: 22941661 PMCID: PMC3592424 DOI: 10.1093/nar/gks803
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. GASiC workflow. Metagenomic reads are first aligned to the reference genomes and matching reads are counted for each genome (observed abundances). GASiC then uses the reference genomes to construct a similarity matrix encoding the genome similarities while considering influences of the applied sequencing technology. The similarity matrix and the observed abundances are used in a linear system of equations to model the influence of reference genome similarities on read alignment. GASiC solves the system of equations using a constrained optimization routine to calculate the estimated true abundances of the reference genomes in the dataset. Bootstrapping from the reads delivers stable abundance estimates and allows GASiC to test for the presence of each species in the dataset.
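The correction step in the workflow can be illustrated with a minimal sketch. It assumes the linear model r = A c, where r holds the observed abundances, c the true abundances, and A[i][j] is the fraction of reads from genome j that also align to genome i. The function name `correct_abundances`, the `penalty` parameter, the example similarity values, and the projected-gradient solver are all assumptions made for illustration; this is not GASiC's own optimization routine.

```python
def correct_abundances(similarity, observed, penalty=0.0, iterations=5000):
    """Minimize 0.5*||A c - r||^2 + penalty*sum(c) subject to c >= 0.

    Hypothetical helper: a plain projected-gradient solver standing in
    for the constrained optimization routine mentioned in the caption.
    """
    n = len(observed)
    # Safe step size: 1/L, with L bounded by the squared Frobenius norm of A.
    step = 1.0 / sum(a * a for row in similarity for a in row)
    c = [0.0] * n
    for _ in range(iterations):
        # Residual of the linear model: A c - r.
        residual = [
            sum(similarity[i][j] * c[j] for j in range(n)) - observed[i]
            for i in range(n)
        ]
        # Gradient A^T (A c - r) plus the constant L1 term (valid because c >= 0).
        grad = [
            sum(similarity[i][j] * residual[i] for i in range(n)) + penalty
            for j in range(n)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        c = [max(0.0, c[j] - step * grad[j]) for j in range(n)]
    return c


# Made-up similarity matrix for three closely related genomes.
similarity = [
    [1.0, 0.8, 0.7],
    [0.8, 1.0, 0.6],
    [0.7, 0.6, 1.0],
]
# Simulated truth: the third genome is absent from the sample.
true_abundances = [0.3, 0.7, 0.0]
# Observed abundances are inflated by reads shared between similar genomes.
observed = [
    sum(similarity[i][j] * true_abundances[j] for j in range(3))
    for i in range(3)
]
estimated = correct_abundances(similarity, observed)
print([round(x, 4) for x in estimated])
```

In this toy case the absent genome still receives an observed abundance of 0.63 from shared reads, and solving the constrained system brings its estimate back to approximately zero. Setting `penalty` above zero adds the LASSO-style shrinkage; repeating the procedure on resampled reads would correspond to the bootstrapping step.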
Benchmark comparison. In addition to the MEGAN-based, GAAS and GRAMMy abundance estimates (8), we calculated abundance estimates with GASiC for all reference genomes in the FAMeS datasets simLC, simMC and simHC.
| Tool | simLC (low complexity) RRMSE (%) | simLC (low complexity) AVGRE (%) | simMC (medium complexity) RRMSE (%) | simMC (medium complexity) AVGRE (%) | simHC (high complexity) RRMSE (%) | simHC (high complexity) AVGRE (%) |
|---|---|---|---|---|---|---|
| MEGAN | 48.6 | 39.3 | 50.0 | 40.6 | 50.2 | 40.8 |
| GAAS | 433.8 | 152.5 | 171.4 | 111.6 | 507.9 | 165.8 |
| GRAMMy | 20.0 | 14.0 | 25.6 | 19.7 | 21.6 | 14.7 |
| GASiC | | | | | | |
The four tools are compared by their relative error (RRMSE and AVGRE, see Methods section). The lowest error rates are shown in bold font. GASiC reduces the relative error on all datasets and improves on GRAMMy, the best existing tool, by up to 60%. Best results are achieved on the high complexity dataset simHC, indicating that GASiC provides a particularly large benefit for complex mixtures where more corrections are necessary and low concentrations exist which are more difficult to estimate.
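The two error measures can be sketched as follows. The Methods section is not reproduced here, so the definitions are assumptions: RRMSE is taken as the root mean square of the per-genome relative errors and AVGRE as the mean absolute relative error. The function name `relative_errors` and the example abundances are hypothetical.

```python
import math


def relative_errors(estimated, truth):
    """Return (RRMSE, AVGRE) under the assumed definitions above."""
    # Per-genome relative error; every true abundance must be non-zero.
    rel = [(e - t) / t for e, t in zip(estimated, truth)]
    rrmse = math.sqrt(sum(x * x for x in rel) / len(rel))
    avgre = sum(abs(x) for x in rel) / len(rel)
    return rrmse, avgre


# Made-up abundances: each estimate is off by 10% of its true value.
truth = [0.5, 0.3, 0.2]
estimate = [0.55, 0.27, 0.18]
rrmse, avgre = relative_errors(estimate, truth)
print(f"RRMSE = {100 * rrmse:.1f}%, AVGRE = {100 * avgre:.1f}%")
```

Because the squared term weights large deviations more heavily, RRMSE is never smaller than AVGRE under these definitions, which is consistent with every row of the table.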
Figure 2. Comparison of GASiC and GRAMMy on synthetic datasets with varying concentrations of real E. coli and EHEC reads. Both algorithms estimated the relative abundances of the highly similar bacteria E. coli, EHEC, and Shigella in all datasets, and GASiC tested (P-value) for the absence of each bacterium. GRAMMy was challenged by the similarity of the bacteria and deviated strongly from the expected relative concentrations. For Shigella, which was not present in the sample, GRAMMy incorrectly estimated abundances of up to 10%. GASiC provided more stable abundance estimates at all concentrations and also correctly identified Shigella as not present in the dataset, assigning it high P-values accordingly.
Figure 3. Estimation of viral abundances based on NGS and qRT-PCR. GASiC estimated the abundances of the highly similar bee viruses DWV, VDV-1, VDV-1DVD and VDV-1VVD in the viral RNA dataset acquired by Moore et al. (17). The abundances are displayed in relation to the total number of reads. GASiC’s estimates coincide with the qRT-PCR quantification in the original paper: VDV-1DVD was estimated as the most abundant virus and VDV-1 was correctly identified as not present in the dataset. The displayed relative qRT-PCR levels were calculated as described in Supplementary Methods. Interestingly, only considering the unique reads would have yielded misleading estimates (DWV as most abundant) in this experiment due to the high reference similarities.