George Armstrong1,2,3, Cameron Martino1,2,3, Gibraan Rahman1,3, Antonio Gonzalez1, Yoshiki Vázquez-Baeza2, Gal Mishne4,5, Rob Knight1,5,6.
Abstract
Microbiome data are sparse and high dimensional, so effective visualization of these data requires dimensionality reduction. To date, the most commonly used method for dimensionality reduction in the microbiome is calculation of between-sample microbial differences (beta diversity), followed by principal-coordinate analysis (PCoA). Uniform Manifold Approximation and Projection (UMAP) is an alternative method that can reduce the dimensionality of beta diversity distance matrices. Here, we demonstrate the benefits and limitations of using UMAP for dimensionality reduction on microbiome data. Using real data, we demonstrate that UMAP can improve the representation of clusters, especially when the clusters are composed of multiple subgroups. Additionally, we show that UMAP provides improved correlation of biological variation along a gradient with a reduced number of coordinates of the resulting embedding. Finally, we provide parameter recommendations that emphasize the preservation of global geometry. We therefore conclude that UMAP should be routinely used as a complementary visualization method for microbiome beta diversity studies.
IMPORTANCE UMAP provides an additional method to visualize microbiome data. The method is extensible to any beta diversity metric used with PCoA, and our results demonstrate that UMAP can indeed improve visualization quality and correspondence with biological and technical variables of interest. The software to perform this analysis is available under an open-source license and can be obtained at https://github.com/knightlab-analyses/umap-microbiome-benchmarking; additionally, we have provided a QIIME 2 plugin for UMAP at https://github.com/biocore/q2-umap.
Keywords: beta diversity; dimensionality reduction
Year: 2021 PMID: 34609167 PMCID: PMC8547469 DOI: 10.1128/mSystems.00691-21
Source DB: PubMed Journal: mSystems ISSN: 2379-5077 Impact factor: 7.324
FIG 1. Comparison of PCoA and UMAP visualizations of cluster and gradient patterns on real data. The keyboard data set contains samples from three different subjects’ keyboards (surface) and their hands (skin). (a) PCoA on Aitchison distances (pseudocount = 1) demonstrates a strong separation between M2 and the other subjects, as well as separation between subjects M3 and M9. (b) A UMAP (n_neighbors = 15, min_dist = 1) visualization demonstrates stronger clustering by subject, with a different relative positioning of the clusters by subject. The plot also emphasizes clustering by sample type. (c) UMAP with an increased n_neighbors parameter (n_neighbors = 80, min_dist = 1) reflects the same relative positioning of clusters as PCoA. It also demonstrates the improved localization by sample type within subjects. (d) On the “88 soils” data, PCoA on the Aitchison distances demonstrates a horseshoe pattern with pH distributed along the horseshoe. (e) Soil moisture deficit is also distributed along the horseshoe, and (f) there is not a strong association between mean annual temperature and position on the PCoA. (g) In the UMAP (n_neighbors = 80, min_dist = 1), followed by centering/rotation with PCA, using the same distances, pH appears correlated with the first coordinate, (h) soil moisture deficit appears correlated with a sloped line across the pH gradient, and (i) there is a correlation between mean annual temperature and the second coordinate.
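The Aitchison distance with a pseudocount of 1 mentioned in this caption can be sketched as follows: add the pseudocount to every count, apply the centered log-ratio (CLR) transform, and take the Euclidean distance between the transformed samples. This is a minimal illustration using only the Python standard library, not the authors' implementation; the function names `clr`, `aitchison_distance`, and `distance_matrix` are hypothetical, and the input is assumed to be a list of per-sample count vectors.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's feature counts."""
    # The pseudocount keeps zero counts finite under the logarithm.
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y, pseudocount=1.0):
    """Euclidean distance between the CLR-transformed samples x and y."""
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


def distance_matrix(table, pseudocount=1.0):
    """Symmetric matrix of pairwise Aitchison distances (rows are samples)."""
    n = len(table)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = aitchison_distance(table[i], table[j], pseudocount)
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical toy table: three samples, three features.
toy_table = [[10, 0, 5], [10, 0, 5], [0, 20, 1]]
toy_distances = distance_matrix(toy_table)
```

A square matrix of this kind is the input that PCoA operates on; the umap-learn package can also consume it directly when `metric="precomputed"` is passed to `umap.UMAP`, together with `n_neighbors` and `min_dist` values such as those listed in the caption.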
FIG 2. PCoA and UMAP comparison on 8,280 samples from the Human Microbiome Project (HMP). In the HMP data, when samples prepared with different primers are analyzed jointly, (a) there appears to be no separation between primers in the first two coordinates of PCoA and (b) mild separation by body site. In the same number of dimensions, UMAP is able to both (c) emphasize the differences between samples prepared with different variable regions and (d) improve clustering by body site. Both methods use the unweighted UniFrac distances on the HMP data rarefied to 1,000 sequences per sample.
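Rarefying to 1,000 sequences per sample, as mentioned in this caption, amounts to subsampling each sample's reads without replacement down to a fixed depth. The sketch below illustrates the idea with the Python standard library; it is not the procedure the authors ran, the function name `rarefy` is hypothetical, and returning `None` for samples below the target depth (so they can be dropped) is an assumption about how shallow samples are handled.

```python
import random


def rarefy(counts, depth=1000, seed=None):
    """Subsample one sample's reads without replacement to a fixed depth.

    Returns a new count vector summing to `depth`, or None when the sample
    has fewer than `depth` reads (assumed here to mean "drop the sample").
    """
    if sum(counts) < depth:
        return None
    rng = random.Random(seed)
    # Expand the count vector into one feature index per read.
    reads = [idx for idx, c in enumerate(counts) for _ in range(c)]
    picked = rng.sample(reads, depth)  # sampling without replacement
    rarefied = [0] * len(counts)
    for idx in picked:
        rarefied[idx] += 1
    return rarefied


# Hypothetical sample with 2,000 reads spread over four features.
original = [1500, 300, 0, 200]
subsampled = rarefy(original, depth=1000, seed=0)
too_shallow = rarefy([10, 5], depth=1000)
```

Because the draw is random, a fixed `seed` is needed for reproducible tables; presence/absence metrics such as unweighted UniFrac are sensitive to sequencing depth, which is the usual reason for equalizing depth before computing them.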