Martina Fischer, Benjamin Strauch, Bernhard Y Renard.
Abstract
MOTIVATION: Current metagenomics approaches allow the composition of microbial communities to be analyzed at high resolution. Important changes in composition are known to occur even at the strain level and to go hand in hand with changes in disease or ecological state. However, strain-level analysis poses specific challenges because the genome sequences involved are highly similar. Only a limited number of tools estimate taxa abundance beyond the species level, and there is a strong need for dedicated tools for strain resolution and differential abundance testing.
Year: 2017 PMID: 28881972 PMCID: PMC5870649 DOI: 10.1093/bioinformatics/btx237
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Workflow of DiTASiC. It consists of three main parts: (i) mapping, (ii) taxa abundance estimation and (iii) differential abundance assessment. (i) We rely on prior pre-filtering of species by external profiling tools such as Kraken or Mash. Reads are mapped to the given reference genome sequences and the number of matching reads per reference is counted (mapping abundance). A similarity matrix reflecting the genome similarities is constructed. (ii) Subsequently, a GLM is built to resolve read count ambiguities, resulting in corrected abundance estimates along with standard errors. (iii) For the comparison of metagenomes, abundances are formulated as distributions and their divergence reflects differential events. A final list of tested taxa with fold changes and adjusted P-values is reported.
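The idea behind step (ii) can be illustrated with a small numerical sketch. This is not the GLM that DiTASiC fits: it swaps in an exact linear solve (Gaussian elimination) on a made-up 3×3 similarity matrix, only to show how shared reads inflate mapping abundances and how a similarity matrix allows them to be corrected. The matrix values, the read counts, and the function names `solve_linear_system` and `correct_abundances` are all assumptions for illustration, and no standard errors are produced.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("similarity matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def correct_abundances(similarity, mapping_counts):
    """Estimate true read counts c from similarity @ c = mapping_counts.

    similarity[i][j] is the assumed fraction of reads from genome j
    that also map to reference i (1.0 on the diagonal).
    """
    estimates = solve_linear_system(similarity, mapping_counts)
    # Abundances cannot be negative; clip numerical artefacts to zero.
    return [max(0.0, e) for e in estimates]


# Hypothetical similarity matrix: strains 0 and 1 share 80% of their reads,
# strain 2 is nearly unrelated to both.
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
true_counts = [1000.0, 500.0, 0.0]  # made-up ground truth
# What naive per-reference read counting would report.
mapping_counts = [
    sum(s * c for s, c in zip(row, true_counts)) for row in similarity
]

corrected = correct_abundances(similarity, mapping_counts)
print("mapping abundance: ", mapping_counts)
print("corrected estimate:", [round(c, 1) for c in corrected])
```

In this toy setup the absent strain 2 still collects reads in the mapping abundance, and the correction brings it back to zero. Real data are noisy, which is one reason a statistical model with error estimates is preferable to an exact solve.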
Characteristics of the four data sources: CAMI, FAMeS, Illumina 100 data (i100) and the simulation setups (Sim (1), (2), (3))
| Source | CAMI | FAMeS | Sim (1) | Sim (2) | Sim (3) | i100 |
|---|---|---|---|---|---|---|
| Samples | Set 1-2 | LC, MC, HC | Set 1-3 | Set 4-9 | Set 10-11 | |
| References | 225 | 122 | 35 | 35 | 55 | 100 |
| Genera | 128 | 81 | 12 | 12 | 12 | 63 |
| Species | 199 | 108 | 22 | 22 | 26 | 85 |
| Reads (M) | ∼150 | ∼1.0 | 0.75ᵃ | 0.75ᵃ | 0.75ᵃ | 53.3 |
| Length (bp) | 100 | 110 | 100 | 100 | 100 | 75 |
| Abundance range | 0.0009–8% | 2–20% | 1–30% | 0.1–15% | 0.1–2% | 0.8–2.2% |
Note: Each reference set is defined by the union of references of the underlying samples. All read profiles follow Illumina characteristics. ᵃReads are simulated by Mason.
Accuracy of taxa abundance estimates by DiTASiC, kallisto and GASiC
| DiTASiC | kallisto | GASiC | ||
|---|---|---|---|---|
| CAMI | Set 1 | 1.05 e-01 | n.a. | |
| Set 2 | 5.69 e-02 | n.a. | ||
| i100 | i100 | 5.62 e-05 | 9.32 e-04 | |
| FAMeS | LC | 6.87 e-06 | 3.18 e-04 | |
| MC | 3.07 e-08 | 4.17 e-04 | ||
| HC | 8.34 e-08 | 7.79 e-05 | ||
| Simulation group (1) | Set 1 | 8.38 e-07 | 6.92 e-03 | |
| Set 2 | 9.61 e-07 | 1.13 e-02 | ||
| Set 3 | 4.37 e-07 | 9.73 e-03 | ||
| Simulation group (2) | Set 4 | 4.09 e-05 | 6.10 e-03 | |
| Set 5 | 5.94 e-05 | 8.54 e-03 | ||
| Set 6 | 3.46 e-05 | 2.22 e-03 | ||
| Set 7 | 2.84 e-04 | 6.55 e-03 | ||
| Set 8 | 2.99 e-04 | 2.27 e-03 | ||
| Set 9 | 5.37 e-05 | 1.63 e-03 | ||
| Simulation group (3) | Set 10 | 5.43 e-05 | 1.84 e-02 | |
| Set 11 | 5.07 e-04 | 7.29 e-03 |
Note: Accuracy is measured as the sum of squared errors (SSE) between the estimates and the available ground truth; lower values indicate higher accuracy. DiTASiC shows a marked error reduction compared with GASiC and performance comparable to kallisto (the highest accuracy is printed in bold). GASiC was not run on the CAMI data due to computational limitations.
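The SSE metric itself is simple to compute. The following minimal sketch uses made-up relative abundances for three taxa; the numbers, the choice of relative fractions as the abundance scale, and the function name `sum_squared_errors` are assumptions for illustration and are not taken from the table.

```python
def sum_squared_errors(estimates, truth):
    """Return the sum of squared differences between estimates and truth."""
    if len(estimates) != len(truth):
        raise ValueError("estimates and truth must have the same length")
    return sum((e - t) ** 2 for e, t in zip(estimates, truth))


# Hypothetical relative abundances for three taxa.
truth = [0.50, 0.30, 0.20]
estimates = [0.48, 0.33, 0.19]

sse = sum_squared_errors(estimates, truth)
print(f"SSE = {sse:.2e}")
```

Because the errors are squared before summing, a single badly estimated taxon dominates the score, so a low SSE requires every taxon to be estimated well.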
Evaluation of differential taxa abundance by DiTASiC and STAMP based on sample comparisons within the simulation data and the CAMI data set
| Data source | Samples compared | No. of non-differential events | No. of differential events | False positives (FPs) and False negatives (FNs) | FDR | Sensitivity | Specificity | Accuracy | ||||||
| FP | FN | FP | FN | ||||||||||
| Samples S1 versus S2 | |||||||||||||
| 15 | 15 | 0 | 0 | 15 | 0 | 0 | 0.50 | 1 | 1 | 1 | 0.5 | 1 | 0.5 | ||
| set 4 versus set 5 | 35 | 0 | 0 | 0 | 0 | 0 | n.a. | n.a. | n.a. | 1 | n.a. | 1 | 1 | 1 | |
| set 5 versus set 9 | 28 | 7 | 0 | 0 | 12 | 0 | 0 | 0.63 | 1 | 1 | 1 | 0.7 | 1 | 0.66 | |
| set 5 versus set 6 | 18 | 17 | 0 | 1 | 18 | 2 | 0 | 0.51 | 0.94 | 1 | 0.89 | 0.5 | 0.97 | 0.43 | |
| set 6 versus set 7 | 17 | 18 | 0 | 0 | 16 | 0 | 0 | 0.47 | 1 | 1 | 1 | 0.51 | 1 | 0.54 | |
| set 7 versus set 8 | 10 | 25 | 0 | 0 | 7 | 0 | 0 | 0.22 | 1 | 1 | 1 | 0.59 | 1 | 0.8 | |
| set 6 versus set 8 | 6 | 29 | 0 | 0 | 4 | 0 | 0 | 0.12 | 1 | 1 | 1 | 0.6 | 1 | 0.89 | |
| set 4 versus set 7 | 5 | 30 | 0 | 1 | 5 | 0 | 0 | 0.14 | 0.97 | 1 | 1 | 0.5 | 0.97 | 0.86 | |
| set 4 versus set 8 | 5 | 30 | 0 | 1 | 5 | 0 | 0 | 0.14 | 0.97 | 1 | 1 | 0.5 | 0.97 | 0.86 | |
Note: A P-value cutoff of 0.05 is used to define differentially abundant taxa. In most scenarios, DiTASiC detects all events exactly, with an FDR of zero and an overall accuracy above 97%. The reduced accuracy of STAMP, which uses mapping abundances, confirms the significant impact of read ambiguities and abundance estimate uncertainties. When there are no differential events, FDR and sensitivity cannot be computed (n.a.).
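For readers who want to recompute the evaluation metrics from event counts, here is a minimal sketch. It assumes the standard confusion-matrix definitions (FDR = FP / (FP + TP), sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the source does not spell out its formulas, and the function name `detection_metrics` is hypothetical. The first call uses illustrative counts of 18 non-differential and 17 differential taxa with 0 false positives and 1 false negative; the second shows the case with no differential events, where FDR and sensitivity are undefined.

```python
def detection_metrics(n_non_differential, n_differential, fp, fn):
    """Compute FDR, sensitivity, specificity and accuracy from event counts.

    Undefined ratios (zero denominator) are returned as None,
    corresponding to "n.a." in a results table.
    """
    tp = n_differential - fn          # differential taxa correctly detected
    tn = n_non_differential - fp      # unchanged taxa correctly left alone
    total = n_non_differential + n_differential
    called = tp + fp                  # all taxa reported as differential
    return {
        "FDR": fp / called if called else None,
        "sensitivity": tp / n_differential if n_differential else None,
        "specificity": tn / n_non_differential if n_non_differential else None,
        "accuracy": (tp + tn) / total,
    }


metrics = detection_metrics(18, 17, fp=0, fn=1)
no_events = detection_metrics(35, 0, fp=0, fn=0)

for name, value in metrics.items():
    print(name, "n.a." if value is None else round(value, 2))
print("no differential events:", no_events)
```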