Davide Albanese, Carlotta De Filippo, Duccio Cavalieri, Claudio Donati.
Abstract
Metagenomics is revolutionizing our understanding of microbial communities, showing that their structure and composition have profound effects on the ecosystem and in a variety of health and disease conditions. Despite the flourishing of new analysis methods, current approaches based on statistical comparisons between high-level taxonomic classes often fail to identify the microbial taxa that are differentially distributed between sets of samples, since in many cases the taxonomic schema do not allow an adequate description of the structure of the microbiota. This constitutes a severe limitation to the use of metagenomic data in therapeutic and diagnostic applications. To provide a more robust statistical framework, we introduce a class of feature-weighting algorithms that discriminate the taxa responsible for the classification of metagenomic samples. The method unambiguously groups the relevant taxa into clades without relying on pre-defined taxonomic categories, so the analysis also includes those sequences for which a taxonomic classification is difficult. The phylogenetic clades are weighted and ranked according to their abundance, measuring their contribution to the differentiation of the classes of samples, and a criterion is provided to define a reduced set of most relevant clades. Applying the method to public datasets, we show that the data-driven definition of relevant phylogenetic clades accomplished by our ranking strategy identifies features in the samples that are lost if phylogenetic relationships are not considered, improving our ability to mine metagenomic datasets. Comparison with supervised classification methods currently used in metagenomic data analysis highlights the advantages of using phylogenetic information.
Year: 2015 PMID: 25815895 PMCID: PMC4376673 DOI: 10.1371/journal.pcbi.1004186
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Schema of the method.
A) Preliminary analysis. The PhyloRelief algorithm relies on a set of preprocessing steps of the metagenomic datasets that must be performed using standard algorithms. From the sequences of the marker genomic loci selected by the experimental design, an OTU table and a phylogenetic tree of the representative sequences of the OTUs are computed. B) Next, the matrix of the distances between the samples must be computed using a phylogenetic measure of β-diversity, such as weighted or unweighted UniFrac. C) The PhyloRelief strategy. Once one sample S has been randomly selected, the nearest hit H, i.e. the nearest sample of the same class, and the nearest miss M, i.e. the nearest sample of a different class according to the distance matrix DS, are identified. D) The update function. For each subtree Ti the weight wi is updated by summing the value d(T,S,H)/m and subtracting d(T,S,M)/m. The function d(T,A,B)/m is computed from the UniFrac distance between the samples A and B restricted to the subtree T, where m is the number of samples. E) Correlation of the weights and definition of the clades. The weight of each clade propagates to its parent, where it is either reinforced if coalescing with a clade sharing a similar imbalance between the classes, or diluted if coalescing with a clade with no or contrasting imbalance. This allows an iterative procedure leading to the unambiguous identification of a set of uncorrelated clades. F) Output. The algorithm provides a list of clades of the phylogenetic tree ranked according to their contribution to the separation of the classes of samples.
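Step C of the schema can be illustrated with a minimal sketch. It assumes a precomputed symmetric distance matrix (standing in for a UniFrac matrix) and one class label per sample; the function name `nearest_hit_and_miss` and the toy values are hypothetical and are not taken from the PhyloRelief implementation.

```python
import random


def nearest_hit_and_miss(dist, labels, s):
    """Return (hit, miss) for sample index s.

    hit  = index of the closest other sample with the same label as s
    miss = index of the closest sample with a different label
    Either value is None if no such sample exists.
    """
    hit = miss = None
    for j in range(len(labels)):
        if j == s:
            continue  # a sample is never its own neighbour
        if labels[j] == labels[s]:
            if hit is None or dist[s][j] < dist[s][hit]:
                hit = j
        else:
            if miss is None or dist[s][j] < dist[s][miss]:
                miss = j
    return hit, miss


# Toy symmetric distance matrix (hypothetical values) for four samples.
dist = [
    [0.0, 0.2, 0.7, 0.9],
    [0.2, 0.0, 0.6, 0.8],
    [0.7, 0.6, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
labels = ["A", "A", "B", "B"]

# Pick one sample at random, as in step C, then look up its neighbours.
s = random.Random(0).randrange(len(labels))
hit, miss = nearest_hit_and_miss(dist, labels, s)
```

In a Relief-style loop this lookup would be repeated for each randomly drawn sample, and the hit and miss would then feed the per-subtree update described in step D.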
Fig 2. Variability of the gut microbiome with geography.
A) PCoA of the weighted UniFrac distances stratified by geographical origin. B) Phylogenetic tree of the OTUs. The 30 clades most relevant for the differentiation of the USA and Italian samples from the Burkina Faso, Malawi and Venezuelan samples are highlighted. Colors distinguish those more prevalent in the USA and Italian samples (green) from those more prevalent in the Burkina Faso, Malawi and Venezuelan samples (red). C) PCoA of the weighted UniFrac distances using only the OTUs included in the most relevant clades. D) Heatmap of the Log10 of the relative abundances of the 30 most relevant clades (rows) identified by PhyloRelief. The age and origin of the individuals (columns) are indicated.
Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between Western (USA and Italy) and non-Western (Malawi, Burkina Faso and Venezuela) individuals.
| N clades | PERMANOVA F | PERMANOVA R2 | PERMANOVA p-value | ANOSIM R | ANOSIM p-value |
|---|---|---|---|---|---|
| 10 | 49.00 | 0.13 | 0.001 | 0.40 | 0.001 |
| 20 | 414.44 | 0.44 | 0.001 | 0.68 | 0.001 |
| 30 | 374.68 | 0.41 | 0.001 | 0.61 | 0.001 |
| 40 | 139.46 | 0.20 | 0.001 | 0.29 | 0.001 |
| 50 | 121.25 | 0.18 | 0.001 | 0.27 | 0.001 |
| 60 | 83.88 | 0.13 | 0.001 | 0.36 | 0.001 |
| 70 | 82.01 | 0.13 | 0.001 | 0.35 | 0.001 |
| 80 | 82.24 | 0.13 | 0.001 | 0.35 | 0.001 |
| 90 | 92.72 | 0.14 | 0.001 | 0.30 | 0.001 |
| 100 | 86.89 | 0.14 | 0.001 | 0.32 | 0.001 |
| 200 | 98.98 | 0.15 | 0.001 | 0.36 | 0.001 |
| 300 | 101.74 | 0.15 | 0.001 | 0.35 | 0.001 |
| 400 | 109.94 | 0.17 | 0.001 | 0.37 | 0.001 |
| 500 | 109.73 | 0.17 | 0.001 | 0.37 | 0.001 |
| 600 | 106.61 | 0.16 | 0.001 | 0.38 | 0.001 |
| 700 | 94.92 | 0.15 | 0.001 | 0.35 | 0.001 |
| 800 | 94.90 | 0.15 | 0.001 | 0.35 | 0.001 |
| 900 | 96.79 | 0.15 | 0.001 | 0.36 | 0.001 |
| 1000 | 96.51 | 0.15 | 0.001 | 0.36 | 0.001 |
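For readers unfamiliar with the PERMANOVA columns, the following is a minimal sketch of how a pseudo-F statistic, R2 and a permutation p-value can be obtained from a distance matrix, using the standard sums-of-squares partition. It is not the authors' code; the function names and the toy data are hypothetical, and the default of 999 permutations is an assumption (with that setting the smallest attainable p-value is 0.001).

```python
import random


def permanova(dist, groups):
    """Return (pseudo_F, R2) for a symmetric distance matrix and group labels."""
    n = len(groups)
    levels = sorted(set(groups))
    a = len(levels)
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares, one term per group.
    ss_within = 0.0
    for g in levels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(
            dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:]
        ) / len(idx)
    ss_among = ss_total - ss_within
    pseudo_f = (ss_among / (a - 1)) / (ss_within / (n - a))
    return pseudo_f, ss_among / ss_total


def permutation_p_value(dist, groups, n_perm=999, seed=0):
    """Fraction of label permutations whose pseudo-F reaches the observed one."""
    rng = random.Random(seed)
    observed, _r2 = permanova(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        f_perm, _r2 = permanova(dist, shuffled)
        if f_perm >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy example: four samples on a line at positions 0, 1, 10, 11.
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in positions] for x in positions]
groups = ["A", "A", "B", "B"]

pseudo_f, r2 = permanova(dist, groups)
p_value = permutation_p_value(dist, groups)
```

On real data the distance matrix would be the weighted UniFrac matrix recomputed from the selected clades, and the grouping would be the Western versus non-Western labels.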
Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between young (below two years of age) and older (above two years of age) Western individuals (USA and Italy).
| N clades | PERMANOVA F | PERMANOVA R2 | PERMANOVA p-value | ANOSIM R | ANOSIM p-value |
|---|---|---|---|---|---|
| | 138.89 | 0.31 | 0.001 | 0.75 | 0.001 |
| | 50.38 | 0.14 | 0.001 | 0.58 | 0.001 |
| | 92.73 | 0.22 | 0.001 | 0.73 | 0.001 |
| | 76.12 | 0.19 | 0.001 | 0.71 | 0.001 |
| | 72.53 | 0.18 | 0.001 | 0.69 | 0.001 |
| | 69.68 | 0.18 | 0.001 | 0.68 | 0.001 |
| | 72.43 | 0.18 | 0.001 | 0.71 | 0.001 |
| | 78.38 | 0.20 | 0.001 | 0.74 | 0.001 |
| | 124.58 | 0.28 | 0.001 | 0.81 | 0.001 |
| | 121.48 | 0.27 | 0.001 | 0.80 | 0.001 |
| | 117.01 | 0.27 | 0.001 | 0.80 | 0.001 |
| | 110.17 | 0.26 | 0.001 | 0.78 | 0.001 |
| | 110.45 | 0.26 | 0.001 | 0.78 | 0.001 |
| | 110.62 | 0.26 | 0.001 | 0.78 | 0.001 |
| | 111.04 | 0.26 | 0.001 | 0.78 | 0.001 |
| | 111.04 | 0.26 | 0.001 | 0.78 | 0.001 |
| | 107.45 | 0.25 | 0.001 | 0.78 | 0.001 |
| | 107.89 | 0.25 | 0.001 | 0.77 | 0.001 |
| | 103.66 | 0.24 | 0.001 | 0.76 | 0.001 |
| | 101.49 | 0.24 | 0.001 | 0.75 | 0.001 |
| | 101.49 | 0.24 | 0.001 | 0.75 | 0.001 |
| | 88.90 | 0.22 | 0.001 | 0.74 | 0.001 |
| | 85.75 | 0.21 | 0.001 | 0.72 | 0.001 |
| | 88.47 | 0.22 | 0.001 | 0.72 | 0.001 |
| | 88.89 | 0.22 | 0.001 | 0.73 | 0.001 |
| | 89.97 | 0.22 | 0.001 | 0.73 | 0.001 |
| | 89.82 | 0.22 | 0.001 | 0.73 | 0.001 |
| | 89.79 | 0.22 | 0.001 | 0.73 | 0.001 |
| | 89.77 | 0.22 | 0.001 | 0.73 | 0.001 |
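The ANOSIM R column can be read with the help of a short sketch. It assumes the usual definition of the statistic, the difference between the mean rank of between-group and within-group distances divided by M/2, where M is the number of sample pairs. The helper names and the toy data are hypothetical and this is not the code used to produce the table.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """ANOSIM R for a symmetric distance matrix and group labels."""
    n = len(groups)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (len(pairs) / 2)


# Toy example: four samples on a line at positions 0, 1, 10, 11.
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in positions] for x in positions]

r_separated = anosim_r(dist, ["A", "A", "B", "B"])  # groups match the two clusters
r_mixed = anosim_r(dist, ["A", "B", "A", "B"])      # groups cut across the clusters
```

Values near 1 indicate that between-group distances rank consistently above within-group distances, while values near 0 or below indicate no separation.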
Permutational ANOVA and ANOSIM tests on the effect of the number of clades used in the calculation of the weighted UniFrac distance between young (below two years of age) and older (above two years of age) non-Western individuals (Malawi, Burkina Faso and Venezuela).
| N clades | PERMANOVA F | PERMANOVA R2 | PERMANOVA p-value | ANOSIM R | ANOSIM p-value |
|---|---|---|---|---|---|
| | 17.54 | 0.09 | 0.001 | 0.38 | 0.001 |
| | 18.92 | 0.10 | 0.001 | 0.42 | 0.001 |
| | 173.33 | 0.47 | 0.001 | 0.73 | 0.001 |
| | 154.84 | 0.45 | 0.001 | 0.72 | 0.001 |
| | 161.55 | 0.46 | 0.001 | 0.72 | 0.001 |
| | 144.83 | 0.43 | 0.001 | 0.68 | 0.001 |
| | 125.26 | 0.39 | 0.001 | 0.66 | 0.001 |
| | 129.25 | 0.40 | 0.001 | 0.66 | 0.001 |
| | 111.63 | 0.37 | 0.001 | 0.63 | 0.001 |
| | 111.37 | 0.37 | 0.001 | 0.60 | 0.001 |
| | 111.52 | 0.37 | 0.001 | 0.60 | 0.001 |
| | 103.80 | 0.35 | 0.001 | 0.59 | 0.001 |
| | 103.00 | 0.35 | 0.001 | 0.58 | 0.001 |
| | 103.52 | 0.35 | 0.001 | 0.59 | 0.001 |
| | 94.73 | 0.33 | 0.001 | 0.53 | 0.001 |
| | 93.17 | 0.33 | 0.001 | 0.53 | 0.001 |
| | 92.03 | 0.32 | 0.001 | 0.53 | 0.001 |
| | 91.97 | 0.32 | 0.001 | 0.53 | 0.001 |
| | 91.39 | 0.32 | 0.001 | 0.53 | 0.001 |
| | 91.20 | 0.32 | 0.001 | 0.54 | 0.001 |
| | 91.20 | 0.32 | 0.001 | 0.54 | 0.001 |
| | 85.45 | 0.31 | 0.001 | 0.52 | 0.001 |
| | 78.85 | 0.29 | 0.001 | 0.51 | 0.001 |
| | 73.88 | 0.28 | 0.001 | 0.46 | 0.001 |
| | 69.12 | 0.26 | 0.001 | 0.46 | 0.001 |
| | 69.20 | 0.26 | 0.001 | 0.46 | 0.001 |
| | 69.12 | 0.26 | 0.001 | 0.46 | 0.001 |
| | 68.97 | 0.26 | 0.001 | 0.46 | 0.001 |
| | 68.92 | 0.26 | 0.001 | 0.46 | 0.001 |
Fig 3. Variability of the gut microbiome with age.
A) Italy and USA; Heatmap of the Log10 of the relative abundances of the 90 clades (rows) identified by PhyloRelief that differentiate the samples from individuals below 2 years of age from the older individuals (columns). B) Italy and USA; Phylogenetic tree of the OTUs. The 90 most relevant clades are highlighted. Colors distinguish those more prevalent in the younger samples (green) from those more prevalent in the older samples (red). C) Burkina Faso, Malawi and Venezuela; Heatmap of the Log10 of the relative abundances of the 30 clades (rows) identified by PhyloRelief that differentiate the samples from individuals below 2 years of age from the older individuals (columns). D) Burkina Faso, Malawi and Venezuela; Phylogenetic tree of the OTUs. The 30 most relevant clades are highlighted. Colors distinguish those more prevalent in the younger samples (green) from those more prevalent in the older samples (red).
Classification accuracy in terms of average K-category correlation coefficient (KCCC) using weighted and unweighted PhyloRelief, LEfSe using OTUs and classified taxa, RF and MetaPhyl.
| | | FH vs. EN (CBH) | VF vs. PF (CBH) | IBD | FS subject (C = 3) | FS subject/hand (C = 6) |
|---|---|---|---|---|---|---|
| PhyloRelief (weighted) | k = 2 | 0.214 ± 0.103 (4) | 0.655 ± 0.045 (800) | -0.011 ± 0.060 (40) | | 0.678 ± 0.028 (900) |
| | k = 3 | 0.158 ± 0.060 (4) | 0.718 ± 0.033 (800) | 0.079 ± 0.090 (40) | | 0.666 ± 0.027 (800) |
| | k = 4 | | 0.685 ± 0.065 (800) | 0.074 ± 0.067 (40) | | |
| PhyloRelief (unweighted) | k = 2 | -0.042 ± 0.087 (4) | 0.565 ± 0.077 (800) | 0.165 ± 0.057 (40) | | 0.655 ± 0.024 (900) |
| | k = 3 | 0.112 ± 0.095 (4) | 0.539 ± 0.080 (800) | 0.213 ± 0.074 (40) | 0.994 ± 0.006 (700) | 0.640 ± 0.020 (800) |
| | k = 4 | 0.066 ± 0.089 (4) | 0.599 ± 0.050 (800) | 0.121 ± 0.078 (40) | 0.994 ± 0.006 (700) | 0.653 ± 0.017 (900) |
| LEfSe | OTU | -0.039 ± 0.061 (19) | | 0.083 ± 0.057 (81) | | 0.628 ± 0.022 (59) |
| | Taxa | 0.044 ± 0.059 (4) | 0.833 ± 0.035 (50) | | 0.983 ± 0.008 (85) | 0.517 ± 0.034 (101) |
| RF | FS | 0.108 ± 0.099 (1) | 0.784 ± 0.074 (40) | 0.142 ± 0.059 (7) | 1.0 ± 0.0 (200) | 0.670 ± 0.026 (30) |
| | No FS | -0.021 ± 0.021 (-) | 0.659 ± 0.060 (-) | 0.0 ± 0.0 (-) | 1.0 ± 0.0 (-) | 0.667 ± 0.026 (-) |
| MetaPhyl | No FS | 0.170 ± 0.106 (-) | 0.831 ± 0.048 (-) | 0.229 ± 0.085 (-) | 0.950 ± 0.022 (-) | 0.672 ± 0.036 (-) |
For PhyloRelief, three values of k (k = 2,3,4) are shown. When feature selection was performed using PhyloRelief, LEfSe and RF, the RF classifier was used. For each of these algorithms we report the cross-validation accuracy in terms of average KCCC, the standard error and the number of features selected in the final model using the complete dataset (in parentheses). For PhyloRelief and RF the number of features was selected by a nested cross-validation loop. For each dataset, the maximum KCCC value is marked in bold.
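As an aid to reading the KCCC values, the sketch below computes the K-category correlation coefficient from a confusion matrix, using the standard multiclass generalization of the Matthews correlation coefficient. The function name `kccc` is hypothetical, and returning 0.0 when the denominator vanishes (for example when every sample is assigned to one class) is an assumed convention, not something stated in the source.

```python
from math import sqrt


def kccc(confusion):
    """K-category correlation coefficient of a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    s = sum(sum(row) for row in confusion)            # total number of samples
    c = sum(confusion[i][i] for i in range(k))        # correctly classified samples
    t = [sum(confusion[i]) for i in range(k)]         # samples per true class
    p = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # per predicted class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    # Assumed convention: a degenerate prediction scores 0.
    return numerator / denominator if denominator else 0.0


perfect = kccc([[5, 0], [0, 5]])
single_class = kccc([[5, 0], [5, 0]])
inverted = kccc([[0, 5], [5, 0]])
three_class = kccc([[2, 0, 0], [0, 2, 0], [0, 0, 2]])
```

The coefficient ranges up to 1 for perfect classification, and values around 0 correspond to predictions no better than chance, which is how the near-zero and slightly negative entries in the table can be read.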