Jeffrey R Long, Vanessa Pittet, Brett Trost, Qingxiang Yan, David Vickers, Monique Haakensen, Anthony Kusalik.
Abstract
BACKGROUND: UniFrac is a well-known tool for comparing microbial communities and assessing statistically significant differences between communities. In this paper we identify a discrepancy in the UniFrac methodology that causes semantically equivalent inputs to produce different outputs in tests of statistical significance.
Year: 2014 PMID: 25124232 PMCID: PMC4141948 DOI: 10.1186/1471-2105-15-278
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Simple input tree for UniFrac. Simple example of UniFrac input that does not use abundance counts.
Figure 2. Input tree for UniFrac with abundance counts. A tree input into UniFrac where abundance counts are included at each leaf node in compact form.
Figure 3. Expanded input tree for UniFrac. An expanded tree that is equivalent to that shown in Figure 2, but which does not use abundance counts.
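The relationship between the compact tree of Figure 2 and the expanded tree of Figure 3 can be illustrated with a minimal Python sketch. The `Node` class, the `expand` and `leaves` helpers, and the toy tree are hypothetical, and attaching each copy to the original leaf by a zero-length branch is an assumption made for illustration, not a detail taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A tree node; leaves carry per-sample abundance counts (compact form)."""
    name: str
    length: float = 0.0  # branch length to the parent
    children: List["Node"] = field(default_factory=list)
    counts: Dict[str, int] = field(default_factory=dict)  # sample -> count


def expand(node: Node) -> Node:
    """Return a copy in which a leaf with total count n becomes n unit leaves."""
    if node.children:
        return Node(node.name, node.length, [expand(c) for c in node.children])
    # Assumption: every copy hangs off the original leaf by a zero-length branch.
    copies = [
        Node(f"{node.name}_{sample}_{i}", 0.0, [], {sample: 1})
        for sample, n in sorted(node.counts.items())
        for i in range(1, n + 1)
    ]
    return Node(node.name, node.length, copies)


def leaves(node: Node) -> List[Node]:
    """Collect the leaves of a tree from left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# Hypothetical compact tree: two sequences observed in samples "A" and "B".
compact = Node("root", 0.0, [
    Node("seq1", 0.2, counts={"A": 3}),
    Node("seq2", 0.5, counts={"A": 1, "B": 2}),
])
expanded = expand(compact)
print([leaf.name for leaf in leaves(expanded)])
```

Both trees describe the same six observations; only the encoding differs, which is why the two forms are expected to be semantically equivalent inputs.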
Table 1. Normalized pairwise UniFrac distances on sample lake sediment data
| Sample | A1 | A2 | B1 | B2 | C1 | C2 |
|---|---|---|---|---|---|---|
| A1 | | 0.09 | 0.33 | 0.35 | 0.31 | 0.33 |
| A2 | | | 0.34 | 0.36 | 0.31 | 0.32 |
| B1 | | | | 0.10 | 0.15 | 0.19 |
| B2 | | | | | 0.17 | 0.19 |
| C1 | | | | | | 0.10 |
Normalized, pairwise UniFrac distances between six samples of metagenomic lake sediment data. Samples A1 and A2 represent data from two different techniques used to extract DNA from a lake downstream of an industrial facility; samples B1 and B2, and C1 and C2, represent data from two upstream lakes.
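As a quick check on how the distances above should be read, the following Python sketch compares within-lake and between-lake pairs. The values are transcribed from the table; treating the first letter of each sample name as its lake is an assumption based on the caption, and the variable names are illustrative.

```python
from itertools import combinations

samples = ["A1", "A2", "B1", "B2", "C1", "C2"]

# Upper-triangular distances, row by row, transcribed from the table.
distances = [
    0.09, 0.33, 0.35, 0.31, 0.33,  # A1 vs A2, B1, B2, C1, C2
    0.34, 0.36, 0.31, 0.32,        # A2 vs B1, B2, C1, C2
    0.10, 0.15, 0.19,              # B1 vs B2, C1, C2
    0.17, 0.19,                    # B2 vs C1, C2
    0.10,                          # C1 vs C2
]
dist = dict(zip(combinations(samples, 2), distances))

# Assumption: the first letter of a sample name identifies its lake.
within = [d for (a, b), d in dist.items() if a[0] == b[0]]
between = [d for (a, b), d in dist.items() if a[0] != b[0]]

print(f"largest within-lake distance:   {max(within):.2f}")
print(f"smallest between-lake distance: {min(between):.2f}")
```

Every within-lake pair is closer than every between-lake pair, which is the pattern the significance tests would be expected to reflect.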
Table 2. Weighted UniFrac p-values on compact form of lake sediment data
| Sample | A1 | A2 | B1 | B2 | C1 | C2 |
|---|---|---|---|---|---|---|
| A1 | | 0.03 | 0.05 | 0.02 | 0.13 | 0.22 |
| A2 | | | 0.03 | 0.05 | 0.15 | 0.26 |
| B1 | | | | 0.06 | 0.49 | 0.44 |
| B2 | | | | | 0.32 | 0.55 |
| C1 | | | | | | 0.48 |
Weighted UniFrac uncorrected p-values between six samples of metagenomic lake sediment data, using the compact form as input. As samples A1 and A2 are from the same lake, we would expect them to be similar, yet their low p-value suggests that they are more likely to be significantly different than, say, A1 and C2.
Table 3. Weighted UniFrac p-values on expanded form of lake sediment data
| Sample | A1 | A2 | B1 | B2 | C1 | C2 |
|---|---|---|---|---|---|---|
| A1 | | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 |
| A2 | | | 0.00 | 0.00 | 0.00 | 0.00 |
| B1 | | | | 0.00 | 0.00 | 0.00 |
| B2 | | | | | 0.00 | 0.15 |
| C1 | | | | | | 0.74 |
Weighted UniFrac uncorrected p-values between six samples of metagenomic lake sediment data, using the expanded form as input. These p-values are much more consistent with the most important aspects of our expectations, and with other forms of analysis, than the values in Table 2. For instance, the two downstream samples (A1 and A2) may be similar and are definitely different from all the upstream samples. Although the chance that B2 is similar to C2 is larger than we might expect, the p-value for C1 and C2 is much higher still, which was not the case in Table 2.
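To make the discrepancy between the compact-form and expanded-form results concrete, the Python sketch below lists the sample pairs whose significance call changes between the two inputs. The p-values are transcribed from the two tables; the 0.05 threshold applied to uncorrected p-values is an assumption chosen for illustration, not a cutoff stated in this record.

```python
from itertools import combinations

ALPHA = 0.05  # assumed significance threshold (uncorrected p-values)

samples = ["A1", "A2", "B1", "B2", "C1", "C2"]
pairs = list(combinations(samples, 2))  # row-major upper-triangular order

# p-values transcribed row by row from the compact-form table.
compact_p = [0.03, 0.05, 0.02, 0.13, 0.22,
             0.03, 0.05, 0.15, 0.26,
             0.06, 0.49, 0.44,
             0.32, 0.55,
             0.48]
# p-values transcribed row by row from the expanded-form table.
expanded_p = [0.07, 0.00, 0.00, 0.00, 0.00,
              0.00, 0.00, 0.00, 0.00,
              0.00, 0.00, 0.00,
              0.00, 0.15,
              0.74]

# A pair "disagrees" when only one of the two inputs calls it significant.
disagree = [
    pair
    for pair, pc, pe in zip(pairs, compact_p, expanded_p)
    if (pc < ALPHA) != (pe < ALPHA)
]

for a, b in disagree:
    print(f"{a} vs {b}: significance call differs between the two forms")
print(f"{len(disagree)} of {len(pairs)} pairs change")
```

Under this assumed threshold, most pairs receive a different call depending only on how the same data are encoded, including the A1 and A2 pair discussed in the captions.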